Friday, June 6, 2014
For my work on the alignment of coiled-coil proteins, I needed an alignment viewer that could highlight coiled-coil domains, since they contain less phylogenetic signal than other parts of the protein. Adding this to Jalview seemed very complicated, and not possible at all in other web-based viewers like MView. I therefore created amview (for "annotated multiple alignment viewer"), which looks like this in practice:
Shown is the MSA for spd-5 (and here is the entry in my coiled-coil orthologs database). Amino acids are colored according to the ClustalW rules, coiled-coil residues are shown in a lighter color (and the "a" register of the heptad repeat is underlined). Below, two interaction domains are shown. At the very bottom, a small chart shows the degree of conservation across the whole alignment, which can be used to quickly scroll to the conserved regions. Behind the cog at the top are two options: you can hide columns with too many gaps, and hide proteins that seem to be fragments.
I've only tested this thoroughly in Google Chrome, as I found other browsers to be too slow. Still, it's better than loading a Java applet, and it even runs on iPhones and iPads! The implementation relies on Django on the server side and JavaScript/jQuery in the browser.
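For illustration, here is a minimal sketch of one way to compute the conservation track, assuming a simple majority-fraction score per column (the function name and the scoring rule are my illustration, not necessarily what amview uses internally):

# Hypothetical sketch: per-column conservation for the overview chart.
# The majority-fraction score is an assumption, not amview's actual metric.
def column_conservation(msa):
    """msa: list of equally long aligned sequences (strings)."""
    scores = []
    for col in zip(*msa):
        residues = [c for c in col if c != '-']  # ignore gap characters
        if not residues:
            scores.append(0.0)
            continue
        most_common = max(set(residues), key=residues.count)
        scores.append(residues.count(most_common) / len(col))
    return scores

print(column_conservation(["MKL-V", "MKI-V", "MRLAV"]))
# [1.0, 0.67, 0.67, 0.33, 1.0] (approximately)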
Monday, September 23, 2013
Introducing parallelRandomForest: faster, leaner, parallelized
Together with other members of Andreas Beyer's research group, I participated in the DREAM 8 toxicogenetics challenge. While the jury is still out on the results, I want to introduce my improvement of the R randomForest package, namely parallelRandomForest.
To cut to the chase, here is a benchmark made with genotype data from the DREAM challenge, using about 1.3 million genomic markers for 620 cell lines in the training set to predict toxicity for one drug (100 trees, mtry=1/3, node size=20):
- randomForest (1 CPU): 3:50 hours (230 minutes), 24 GB RAM max.
- parallelRandomForest (1 CPU): 37 minutes, 1.4 GB RAM max.
- parallelRandomForest (8 CPUs): 5 minutes, 1.5 GB RAM max.
As you can see, parallelRandomForest is 6 times faster even when not running in parallel, and the memory consumption is about 16 times lower. Importantly, the algorithm is unchanged, i.e. parallelRandomForest produces the same output as randomForest.
For our submission, we wanted to try the simultaneous prediction of drug toxicity for all individuals and drugs. Our hope was that the larger training set would enable the Random Forest (RF) to identify, for example, drugs with similar structures that are influenced by similar genotype variations.
It quickly became clear that the standard RF package was ill-suited for this task. The RAM needed by this implementation is several times the size of the actual feature matrix, and there is no built-in support for parallelization. I therefore made several changes and optimizations, leading to reduced memory footprint, reduced run-time and efficient parallelization.
In particular, the major changes are:
- not modifying the feature matrix in any way (by avoiding transformations and extra copies)
- no unnecessary copying of columns
- growing trees in parallel using forked processes, so that the feature matrix is stored only once in RAM regardless of the number of threads
- using a single byte (values 0 to 255) per feature, instead of a double-precision floating point number (eight bytes)
- instead of sorting the items in a column when deciding where to split, the new implementation scans the column multiple times, each time collecting the items that equal the tested cut-off value (see the sketch below)
The latter two optimizations are especially adapted to the use of RFs on genotype matrices, which usually contain only the values 0, 1, and 2 (homozygous for the major allele, heterozygous, homozygous for the minor allele). The 16-fold reduction in memory consumption seen above is mainly caused by switching to bytes (8-fold reduction) and avoiding extra copies (2-fold reduction).
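To illustrate the scanning strategy from the last bullet, here is a toy sketch in Python (the package's core is C, and all names here are mine): for byte-coded genotypes there are only a handful of candidate cut-offs, so each one can be evaluated with a single linear scan, using the standard variance-reduction criterion for regression trees.

# Toy sketch: regression split search on a byte-coded column (values 0..2)
# by scanning instead of sorting. Names and structure are illustrative only.
def best_split(column, y):
    """column: list of small ints (genotypes); y: list of responses."""
    total_sum, n = sum(y), len(y)
    best = (float('-inf'), None)
    for cutoff in sorted(set(column))[:-1]:   # candidate cut-off values
        # one scan per cut-off: gather the left partition's statistics
        left_sum = left_n = 0
        for x, yi in zip(column, y):
            if x <= cutoff:
                left_sum += yi
                left_n += 1
        right_sum, right_n = total_sum - left_sum, n - left_n
        # maximizing this term minimizes the summed squared error
        score = left_sum**2 / left_n + right_sum**2 / right_n
        if score > best[0]:
            best = (score, cutoff)
    return best[1]

genotypes = [0, 1, 2, 0, 2, 1]
toxicity  = [1.0, 1.2, 2.1, 0.9, 2.3, 1.1]
print(best_split(genotypes, toxicity))   # best cut is between 1 and 2: prints 1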
For our simultaneous mapping strategy, the combined training matrix contains about 120,000 columns and 52,700 rows. Total RAM consumption (including the test set and various accessory objects) was only 18 GB. It took about 2.5 hours to train 100 trees on 12 CPUs. In terms of both time and RAM, training with the standard RF package would have been impossible.
The optimizations currently only encompass the functions we needed, namely regression RF. Classification is handled by a separate C/Fortran implementation that I didn't touch. I would estimate that with two weeks of dedicated effort, all functions of the RF package could be overhauled, and some restrictions (such as forcing the use of bytes) could be loosened (by switching implementations on the fly). However, I do not have the time for this. My question to the community is thus: how should we proceed? Leave the package on BitBucket? Ask the maintainers of the standard RF package to back-port my changes? Would someone volunteer for the coding task?
Monday, January 23, 2012
A publicly mandated medical terminology with a restrictive license
Update: Good News! In the meantime, I've been contacted by MedDRA (most probably unrelated to this blog post) and after a fruitful discussion it seems to be possible for me to base SIDER on MedDRA.
The world's regulatory agencies are increasingly adopting and mandating a new medical terminology scheme called MedDRA to capture side effects (adverse drug reactions) during the regulatory process. (For example, it is used in CVAR, AERS and other instances at the FDA [pdf], which in turn have been used in recent papers.) Sounds great, right? The only problem: the dataset is under a restrictive license (pdf): MedDRA data can only be shared among MedDRA subscribers (source [pdf]). I've clarified this via email with the help desk: one can only share text examples with fewer than 100 terms, and no numeric codes.
This means: it is not possible to create a public dataset, or supplementary material for a paper, that contains a useful amount of data based on MedDRA.
Two years ago, I created the SIDER database of drug–side effect relations (published in MSB). By relying only on publicly available drug labels and dictionaries like COSTART (with UMLS License Category 0), we were able to create a dataset that can be shared with everyone. (Disclaimer: we chose the license CC-BY-NC-SA.) If I were to base SIDER on MedDRA, the license would prevent me from making a machine-readable database available for download and further research. Thus, the next version of SIDER cannot be based on the dictionary of medical terms that regulatory agencies use at the moment.
What is especially sad about this is that the license fees themselves are not especially high: companies with an annual revenue below $1 million have to pay only $190, and I doubt that there are hundreds of subscribers who earn more than $5 billion and thus pay the maximum fee of $62,850. So it would take relatively little financial effort to declare MedDRA an open-access database.
IANAL, so it may be possible that a database like SIDER, which essentially contains the following records: side effect identifier, side effect name, drug identifier, drug name, is derived enough not to fall under the MedDRA license. I remain doubtful, however, especially after reading the restrictions on UMLS License Category 3, under which MedDRA falls, like: "incorporation of material [...] in any publicly accessible computer-based information system [...] including the Internet; [...] creating derivative works from material from these copyrighted sources".
Information on public health, like drugs and their side effects, should be openly available for research, second only to privacy concerns. I'm not sure how to begin to change this (beyond writing this), but ideas are very welcome.
(Small fnord detail: MSSO, which manages MedDRA, is a subsidiary of the military contractor Northrop Grumman.)
Friday, January 6, 2012
Embarrassingly parallel BLAST search
A quick note: To blast a file with many proteins against a database, you can use a recent version of GNU Parallel to fill up all CPUs (which the -num_threads option of BLAST doesn't do, as it only parallelizes some steps of the search):
cat query.fasta | parallel --block 100k --recstart '>' --pipe \
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv
This will split the FASTA file into smaller chunks of about 100 kilobytes, while making sure that the records stay valid (i.e. start with a ">").
Wednesday, February 23, 2011
A bookmarklet for shortDOI.org
shortDOI is a URL shortening service that takes DOIs and converts them to short URLs such as http://doi.org/bb6, which is nice for emails and Twitter. You can add the bookmarklet by dragging this link to your bookmarks: shortDOI. It will try to find the DOI in the current page and direct you to shortDOI.
A shortDOI URL is probably more persistent than, say, bit.ly, as it's backed by the organization that maintains the DOI infrastructure. However, if doi.org ever went down, you could still find a paper via a search engine if you have the original DOI, whereas the shortDOI URL would be worthless.
Update 24.02.2011: Use some majority voting to find the right DOI.
Update 01.03.2011: Expand the list of allowed characters. Does anybody know which characters can be part of the DOI?
Update 29.07.2015: Be strict about having a prefix and suffix. First check for the "citation_doi" meta tag before looking in the rest of the document.
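For illustration, here is a rough Python transcription of the heuristic sketched in the updates: prefer the "citation_doi" meta tag, otherwise collect all DOI-shaped strings in the page and take the most frequent one. The regular expressions are my approximation; the actual bookmarklet is JavaScript and may differ in detail.

# Sketch of the bookmarklet's heuristic, transcribed to Python.
# The DOI pattern (explicit "10." prefix plus a suffix) is approximate,
# since the set of characters allowed in DOIs is not strictly limited.
import re
from collections import Counter

DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')
META_RE = re.compile(r'<meta[^>]+name="citation_doi"[^>]+content="([^"]+)"')

def find_doi(html):
    meta = META_RE.search(html)          # the meta tag is authoritative
    if meta:
        return meta.group(1)
    candidates = DOI_RE.findall(html)    # otherwise: majority voting
    if candidates:
        return Counter(candidates).most_common(1)[0][0]
    return None

print(find_doi('see doi:10.1038/msb.2009.98 and 10.1038/msb.2009.98'))
# 10.1038/msb.2009.98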
Thursday, August 19, 2010
Evolvability and the rate-limiting step
Do complex cellular processes (like the cell cycle) have a single rate-limiting step? (I.e., do they follow the Arrhenius equation?) Intuitively, this seems to be the case, as there should be one reaction that is the slowest (and thus rate-limiting).
Now, from the point of view of fitness, it would make sense to have all reactions occur at the same speed (e.g. by tuning enzyme levels). This would mean that there is not a single rate-limiting step, but rather a series of equally fast steps. Assuming that mutations are more likely to influence protein expression than to increase the catalytic activity of enzymes, this "balanced state" could actually occur.
However, once all reactions are tuned to be equally fast, how do you evolve? If there's only one slow reaction, clearly you have selection pressure on this reaction. If all reactions are equally fast, making one of them even faster (by changing the enzyme's activity) would have no influence, no?
Perhaps a way out is that after the enzyme's catalytic activity has increased, lower levels of this enzyme suffice, thus increasing fitness.
Wednesday, February 17, 2010
While we predict drug targets, pharma already knows them
All the excitement about predicting drug targets from remote chemical structure similarities (Nature) and drug side effects (Science) suddenly seems strangely unfounded when you realize that "they" already have the answer:
This figure is a tiny section of the data used for Preclinical Safety Profiling at the Novartis Institutes for BioMedical Research. They are not alone: BioPrint from Cerep is a repository of 2500 compounds assayed against 159 targets. Paolini et al. mention 600,000 binding data points at Pfizer.
Things are beginning to change, thanks to academic curation efforts like BindingDB, the PDSP Ki database, DrugBank etc., and not least to EMBL-EBI's acquisition of ChEMBL. Still, one can only imagine how different current academic research questions would be if the all-against-all drug–target assay information were public. In essence, academia is working on predicting additional data points from a sparse matrix, while pharma companies already have the full matrix in hand, but want to add more rows/columns to the matrix in silico.
Update: As always, good discussion on FriendFeed.
Thursday, February 4, 2010
Human Protein Atlas data for download
As I just learned in our lab's journal club, the data from the Human Protein Atlas is available for download, thanks to their recent paper in MSB. Curiously enough, the HPA help page still states that, as a matter of "general policy", they do not make data available for download.
Friday, January 15, 2010
A Newick parser for Python, supporting internal node labels
I just pushed a fork of Thomas Mailund's nice Newick parser for Python to bitbucket. I added support for labeled internal nodes, but probably partially broke support for bootstrap values.
>>> from newick import parse_tree
>>> t = parse_tree("((Human,Chimp)Primate,(Mouse,Rat)Rodent)Supraprimates;")
>>> print t
(('Human', 'Chimp')Primate, ('Mouse', 'Rat')Rodent)Supraprimates
>>> print t.identifier
Supraprimates
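Independent of the package, here is a toy recursive parser that shows where internal node labels sit in the Newick grammar (this is my illustration, not the fork's implementation, and it ignores branch lengths and bootstrap values):

# Toy illustration: after a closing ")", any bare word is the label of
# the internal node that the parentheses just closed.
import re

TOKENS = re.compile(r'\(|\)|,|;|[^(),;]+')

def parse(newick):
    tokens = TOKENS.findall(newick.replace(' ', ''))
    pos = 0
    def node():
        nonlocal pos
        children = []
        if tokens[pos] == '(':
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ',':
                pos += 1
                children.append(node())
            pos += 1                      # consume ")"
        label = ''
        if pos < len(tokens) and tokens[pos] not in '(),;':
            label = tokens[pos]           # leaf name or internal node label
            pos += 1
        return (label, children)
    return node()

tree = parse("((Human,Chimp)Primate,(Mouse,Rat)Rodent)Supraprimates;")
print(tree[0])   # Supraprimates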
Monday, July 27, 2009
One step towards writing papers in Google Wave
Google Wave's underlying technology will not only enable collaboration with other people, it will also make it possible for bots to interact with what you've written. I think this is going to change the way we work. E.g., all applications which require a significant amount of typing will benefit from the statistical auto-correction provided by the Wave app Spelly. In effect, Spelly goes over the text as you're typing it and corrects the obvious mistakes, just as you would do a bit later.
In a similar vein, the proof-of-concept bot Igor watches out for inserted references and automagically converts them to a citation and a reference list. When writing papers, I usually insert reminders: "REF Imming review", "REF PMID 16007907". If I adjust this convention a bit and provide a bit more detail, Igor can figure out by itself which paper is meant and fetch the citation. Google Wave and Igor save me the tiresome back-and-forth between a reference manager and the editor to insert all the citations, and they remove distractions from the process of writing and editing the paper.
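As an aside, resolving such a PMID reminder to a citation is mechanically simple; here is a minimal sketch using NCBI's E-utilities (my illustration of the general idea, not how Igor actually works):

# Sketch: resolve "REF PMID 16007907" to a one-line citation via
# NCBI's E-utilities. Illustrates the idea only, not Igor's mechanism.
import json, re, urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def cite(reminder):
    pmid = re.search(r'PMID (\d+)', reminder).group(1)
    url = f"{EUTILS}?db=pubmed&id={pmid}&retmode=json"
    with urllib.request.urlopen(url) as response:
        record = json.load(response)["result"][pmid]
    first_author = record["authors"][0]["name"]
    return f"{first_author} et al., {record['source']} ({record['pubdate'][:4]})"

print(cite("REF PMID 16007907"))   # prints a short citation for that PMID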
Of course, this is a proof of concept, so the citation style can't yet be customized. I further think it would be helpful to quickly see "what's inside" a particular citation. I don't know if Google Wave supports this, but it would be nice to click on a citation ("[23]") and be presented with a pop-up window showing not only info about the article, but also links to PubMed / a DOI resolver.
Thursday, May 21, 2009
How good is Wikipedia's coverage of chemical compounds?
Wikipedia has an excellent coverage of chemical compounds, featuring more than 20,000 articles whose names match those of PubChem compounds. After finding a few important chemicals not featured in Wikipedia, I wanted to quantify Wikipedia's coverage and point out the gaps that should be filled.
I wondered how much coverage Wikipedia actually has for "important" chemicals. Here, I define importance as "number of hits in PubMed", since that is something I can easily measure (and, in fact, had already determined as part of working on STITCH and Reflect).
Missing chemicals
For each bin of 100 chemicals, the number of PubMed hits for all synonyms of each chemical is plotted against the fraction of chemicals in the bin that have a Wikipedia article under any of their synonyms. (I exclude three-letter names, as they are often ambiguous.) So, for compounds that occur more than 1000 times in PubMed, Wikipedia's coverage is above 80%. Here is the list of articles that should be added.
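A sketch of the binning behind such a plot, assuming a hypothetical mapping from each chemical to its PubMed hit count and a flag for whether any of its synonyms has a Wikipedia article (all names are illustrative):

# Hypothetical sketch of the coverage-by-importance binning.
# `chemicals` maps name -> (pubmed_hits, has_wikipedia_article).
def coverage_by_bin(chemicals, bin_size=100):
    ranked = sorted(chemicals.values(), key=lambda c: -c[0])  # most cited first
    bins = []
    for i in range(0, len(ranked), bin_size):
        chunk = ranked[i:i + bin_size]
        mean_hits = sum(hits for hits, _ in chunk) / len(chunk)
        fraction = sum(1 for _, covered in chunk if covered) / len(chunk)
        bins.append((mean_hits, fraction))
    return bins

chemicals = {"aspirin": (40000, True), "obscurine": (12, False)}
for mean_hits, fraction in coverage_by_bin(chemicals, bin_size=1):
    print(f"{mean_hits:>8.0f} PubMed hits: {fraction:.0%} covered")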
So, if you know something about one of the missing compounds, go right ahead and create an article! :)
Missing synonyms
The second question is whether Wikipedia is missing important redirects, i.e. whether there are widely used names for chemicals that don't occur in Wikipedia even though an article exists for the chemical itself (just under another name). For very common names, the coverage is slightly lower; however, the abstracts in PubMed often contain chemical notation that people probably won't use when searching Wikipedia, e.g. "Ca(2+)" is the top hit on the list of redirects that could be added.
Wednesday, April 22, 2009
Announcing SIDER: a database of side effects
After using side effects to predict drug targets, we have now created a public database of side effects, with a total of 62,269 side effects for 888 drugs. The database was created by text-mining drug labels from various public sources like the FDA. Furthermore, I developed rules to extract frequency information from the labels; this worked for about one third of the drug–side effect pairs.
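To give a flavor of what such extraction rules can look like, here is a deliberately tiny sketch; both the pattern and the label text are invented for illustration and are much simpler than the real rules:

# Invented mini-example of frequency extraction from label text;
# the actual SIDER rules are considerably more involved.
import re

FREQ_RE = re.compile(r'(?P<effect>\w[\w ]*?)\s*\((?P<freq>\d+(?:\.\d+)?)\s*%\)')

label_text = "Adverse reactions: headache (11%), nausea (5.4%), dizziness (3%)."
for match in FREQ_RE.finditer(label_text):
    print(match.group('effect'), '->', match.group('freq') + '%')
# headache -> 11%
# nausea -> 5.4%
# dizziness -> 3%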
We think that this database will make quite a bit of interesting research possible.
Monday, November 10, 2008
4th German Conference on Chemoinformatics: ChEMBL
The talks of the first full day of the 4th German Conference on Chemoinformatics are over. Most interesting for me was Christoph Steinbeck's talk about the recently announced data acquired by the EBI. The database will be called "ChEMBL". There will be a monthly update cycle, so the acquisition does not only capture the current state, but the database is going to be extended. There are three parts (although they'll be combined eventually):
- "DrugStore": interactions for 1500 drugs. Christoph says that he doesn't expect this to go much beyond what's already publicly available in DrugBank et al. today.
- "CandiStore": 15,000 clinical leads
- "StARLite": 500,000 medicinal chemistry leads. This is where most of the novelty (in terms of public data) lies. For this part, there are >5500 annotated targets, >3500 of which are proteins (the rest are e.g. tissues), and 2 million experimental bioactivities. The database contains bidirectional links to the literature on synthetic routes and assays for the ligands, and descriptions of the targets.
Two URLs of interest that I didn't know before: the ChEMBL blog and John Overington's lab homepage.
Other remarks about today will follow when I have a real internet connection (not just 6 kB/s via Bluetooth/GPRS at 9 ct/min) to do some more background research.
Wednesday, August 20, 2008
Web 2.0 killer app: FriendFeed for scientific papers
This post is inspired by Eva's thoughts on getting scientists to adopt Web 2.0 and Cameron's post on making Connotea a killer app for scientists.
Many people have added their CiteULike or Connotea libraries to FriendFeed, so during the day you can see various new papers flow by. Similarly, journals' TOC updates and saved searches on PubMed create a regular stream of possibly interesting papers. Lastly, after a few weeks or months, papers are processed by ISI Web of Science and can be tracked by citation alerts. In the end, you might see the same paper flow by a couple of times.
This situation is far from ideal. You see echoes of the same paper, and papers arrive via multiple channels: RSS, email, web sites. There are far too many potentially interesting papers, so you have to focus your various alerts in order not to be overwhelmed.
My proposal for the killer app is a central place which tracks all of the above items (i.e., friends' libraries, PubMed searches, journal TOCs and citation alerts) and integrates with your personal library. Just like in FriendFeed, there should be a way to rate/like a paper ("Faculty of 1,000,000"?), to prioritize the new papers, and to save papers to your library. The most important and difficult feature would be to merge equivalent entries, i.e. a Connotea link to PubMed needs to be merged with the journal TOC alert etc. So when you have already identified something as interesting and filed it, you won't be alerted again if it comes in via another channel.
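A minimal sketch of that merging step, under the assumption that every incoming item can be reduced to a canonical identifier such as a DOI or PMID (all names and identifiers below are invented):

# Invented sketch: alert only once per paper, across all channels,
# assuming each channel's items can be reduced to a canonical ID.
def canonical_id(item):
    """item: dict with optional 'doi' and 'pmid' fields."""
    if item.get('doi'):
        return 'doi:' + item['doi'].lower()
    if item.get('pmid'):
        return 'pmid:' + item['pmid']
    return 'title:' + item['title'].lower().strip()

def merge_streams(streams, library):
    seen = {canonical_id(paper) for paper in library}
    for channel, item in streams:          # e.g. ('PubMed search', {...})
        key = canonical_id(item)
        if key not in seen:
            seen.add(key)
            yield channel, item            # suppress later duplicates

library = [{'doi': '10.1000/xyz123', 'title': 'Already filed'}]
streams = [('TOC alert', {'doi': '10.1000/XYZ123', 'title': 'Already filed'}),
           ('CiteULike', {'pmid': '12345678', 'title': 'New paper'})]
for channel, item in merge_streams(streams, library):
    print(channel, item['title'])          # only the CiteULike item prints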
Of course, there should be a non-mandatory way to tag papers, to have groups, and to recommend papers to specific users (like the "for:" tag in delicious.com).
Bonus points: keep track of comments and blog posts of the paper, plus all the extended literature analysis that Cameron proposed.
Friday, August 15, 2008
Mendeley = Mekentosj Papers + Web 2.0 ?
Via Ricardo Vidal: Mendeley seems to be a Windows (plus Mac/Linux) equivalent of Mekentosj Papers (which is Mac OS X only, and has been described as "iTunes for your papers"). In addition to handling your PDFs, it has an online component that allows sharing your papers and other Web 2.0 features (billing itself as "Last.fm for papers").
Here, I'm reviewing the Mac beta version (0.5.6). I am focusing mostly on the desktop side and comparing it to Papers, because I have a working solution in place and would only switch to Mendeley if the experience is as good as with Papers. (I.e., my main problem is off-line papers management; Web 2.0 features are icing on the cake.)
By Mac standards, the app is quite ugly. Both Mendeley and Papers allow full-text PDF searches, which is important if you want to avoid tagging/categorizing all your papers. Papers can show PDFs in the main window, copy the reference of a paper and email papers. Mendeley in principle can also copy the reference, but special characters are transformed to gibberish in this beta version. Papers allows you to match papers against PubMed, Web of Science etc., while Mendeley only offers to auto-extract often incomplete metadata. This matching feature is extremely useful, as you get all the authoritative data from the source, and most often Papers can use the DOI in the PDF to immediately give you the correct reference. Update: Mendeley also uses DOIs to retrieve the correct metadata, if available. (Thanks, Victor, for your comment.)
The beta version is quite rough; I just had to kill it because I found no way to close the "about" window. Extraction of metadata and references doesn't always work, but this might be more of a problem of the information that's stored in the PDFs.
Of course, once there's a critical mass of people using Mendeley, there'll be all the Web 2.0 features that Papers doesn't have. Judging from the talk I think they might be trying to do too much: Connotea/CiteULike plus Dopplr plus LinkedIn. For me, a simple way to export new references from Papers to Connotea/CiteULike would be enough. More modularity is better, because it allows you to choose the best tool in each layer.
More info by the Mendeley folks: short demo, a little longer talk.
CiteWeb: Following citations made easy
One good way to keep up with the literature in a field is to track which new papers are citing seminal papers of the field. Each Friday, I get lots of citation alerts from ISI Web of Science, but often enough I see the same paper again and again (citing different papers that are on my watch list). So I set out to write an app that takes ISI's RSS feeds, coalesces them, and gives them back to you. For example, in the screenshot, one review paper cites five of my tracked papers.
If you're using citation alerts from Web of Science, then give CiteWeb a try at citeweb.embl.de. If you find a bug, you can either comment here, or grab the source code and fix it. :-)
I started working on this to try out whether Google App Engine would be useful. It turned out that downloading many items from a remote host leads to time-outs on App Engine, so I ported the app to Django. The source code is released under the MIT License.
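The coalescing step itself is simple in principle; here is a sketch using the feedparser library (an illustration of the idea, not the actual CiteWeb source):

# Sketch of the coalescing idea (not the CiteWeb code): each Web of
# Science alert feed lists citing papers; group citing papers across feeds.
from collections import defaultdict
import feedparser

def coalesce(feed_urls):
    citers = defaultdict(set)   # citing paper title -> tracked papers it cites
    for url in feed_urls:
        feed = feedparser.parse(url)
        tracked = feed.feed.get('title', url)   # the watched paper's feed title
        for entry in feed.entries:
            citers[entry.title].add(tracked)
    # most interesting first: papers that cite many of the tracked papers
    return sorted(citers.items(), key=lambda kv: -len(kv[1]))

for citing, cited in coalesce(['https://example.org/alert1.rss',
                               'https://example.org/alert2.rss']):
    print(f"{citing} cites {len(cited)} tracked paper(s)")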
Tuesday, August 12, 2008
Google integrates Scholar into main page
I don't know if it's just me (sitting inside a research institution), but when I search for something that returns a paper, I get info from Google Scholar:
(See also the complete screenshot with notes on Flickr.) However, the order of the results is different: Google Scholar seems to weight by citations, Google by PageRank.
Saturday, July 19, 2008
ISMB 2008: "Career Paths in Bioinformatics and Computational Biology"
A panel discussion about "Career Paths in Bioinformatics and Computational Biology" was part of Friday's Student Council Symposium during ISMB. The four panelists were from academia: Philip E. Bourne (group leader at University of California San Diego), Alfonso Valencia (group leader at the Spanish National Research Council), Jong Bhak (director of the Korean BioInformation Center) and Richard Wintle (Assistant Director at The Centre for Applied Genomics). (Only RW had spent a longer time [6 years] in the industry at start-up companies.)
[All quotes are paraphrases based on the notes I took.]
Perhaps unsurprisingly, they couldn't offer really comforting answers to the questions of young researchers: "Isn't there a high chance that at the age of 40 you'll be highly trained and specialized, but without a job?" – "There's no job security in academia anyway; I'm not sure if academia is more competitive than industry" (AV). "After the initial boom in bioinformatics positions, will the fraction of grad students who become PIs approach that of biology, i.e. 5 to 10%?" – "Biology will morph into bioinformatics, so there will be more jobs." (JB)
However, I could take away some positive advice. In short: follow your heart, be passionate about something, don't do what everybody else is doing, start your own sub-field if you have to. From my perspective, this is both reasonable and encouraging. As I enter the last phase of being a PhD student, I begin to wonder how I can combine working in science and caring for my family. I hope that by staying motivated and by being effective in what I do, I can have a chance to grow in my career and be there for my family. (I think this is a great advantage of computational biology: you can't make gels run faster, so to say, but you can be effective in programming and analyzing data.)
Another good insight was that a lot of basic, technological advances will come from industry in the future. Dr. Bhak cited the example of CPU development: the huge increases in processing power we see today are being implemented at Intel and AMD (although I cannot judge how much they rely on basic research by academia). My addition to this might be that part of bioinformatics will become more of an engineering discipline. So, for people interested in this, there will be a big job market in the future.
Similarly, the panelists expected that every biology lab will have embedded computational biologists in the future. I agree, but I think these will be mostly post-doc (i.e. non-permanent) positions.
Some of the questions and answers in more detail:
What will be the big opportunities in the next five years? The current generation of students will lead the bioinformatics industry, like the previous generation is currently leading in academia (JB). There will be many more embedded bioinformaticians (see above, AV). Hybrid skills (wet-lab and computational biology) will become more important (RW). The greatest opportunities are cross-disciplinary approaches that tackle as much complexity as possible (PB).
To stay motivated, and find out what you want to do: Always follow your heart in career decisions; create your own sub-division if you have to (PB). Don't do something just because it's trendy; what you like to do might change over time (at one point, industry might be appealing, later academia) (RW).
To find your spot in academia: Find influential people (RW). Diversify: try to do something that not everyone else is doing (AV).
Friday, July 18, 2008
Micro-blogging ISMB
As Pablo announced, several people including me are micro-blogging about ISMB on FriendFeed and to a lesser extent on Twitter.
Tuesday, March 11, 2008
Blogging for search engines
Related to my last post about the failings of Web 2.0 in biology, I want to ask the meta-question: why do we blog? David Crotty proposes four reasons: communication with other science bloggers, with non-specialists, with journalists and finally with search engine users. Unless you are a fairly well-known person, your regular audience will consist of your colleagues, collaborators and a random grad student or two. A journalist might only come by if you managed to get a press release about a Nature/Science/... paper out. But Googlebot won't fail you, and it will read all your posts!
Insightful blog posts won't stay without an audience. For one, the small circle of followers of your blog will spread the news if you write something worth sharing. Far more important are search engines. How do you survey a research area of interest? Most of us will query PubMed, but also do a Google search in the hope that some meaningful analysis is somewhere on a course website, in the text of a paper or maybe even in a blog.
Biologists use Google to query for their proteins of interest. STRING is a fairly successful database, and lots of people google for it by name. However, almost one quarter of all visitors from Google have actually searched for a protein name (random example) and found STRING. If you follow Lars J. Jensen's lead and publish your research observations and dead ends online, someone might serendipitously find them and use them for their own research. These will be the next steps towards open science (with open data and open notebooks, which we might never reach): "publishing" small findings, data and back stories about papers on your blog, enabling others to gain insight.