bioCS: 2008

Saturday, December 6, 2008

Want: Auto-adjust laptop fan speed with motion sensor

Someone should really integrate Fan Control with the MacBook's motion sensor. If the computer hasn't moved for a long time, it's probably sitting on a desk, so it can run hotter but stay silent. If the computer is being moved around, I probably have it on my lap, so turn the fan up and keep the case cool.
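A minimal sketch of that decision rule in Python. It does not talk to Fan Control or to the motion sensor; the function name, the profile names, and the ten-minute threshold are all assumptions made up for illustration.

```python
def choose_fan_profile(seconds_since_last_motion, idle_threshold=600):
    """Pick a fan profile from how long the laptop has been still.

    Hypothetical helper: the threshold (in seconds) and the profile
    names "quiet" and "cool" are illustrative assumptions.
    """
    if seconds_since_last_motion >= idle_threshold:
        # No movement for a while: probably on a desk, so favor silence.
        return "quiet"
    # Recently moved: probably on a lap, so favor a cool case.
    return "cool"
```

A real tool would also need some hysteresis, so that a single bump of the desk doesn't spin the fan up.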

Monday, November 17, 2008

Meta-blogging to commemorate the creation of a tumblelog

I am posting a meta-blog entry because I just created a mini-blog at Tumblr to supplement my micro-blog at Twitter which in turn was meant to supplement this (hypo?)blog. So, now my online blogging presence fragments into:

this blog
delicious bookmarks
Twitter
tumblelog for things too long to tweet
shared items from Google reader

...only to be pulled back together at FriendFeed, which is actually where the discussion happens. Interesting. (Oh, and this is so true. Now, do I re-tweet it or post the quote on Tumblr?)

Monday, November 10, 2008

4th German Conference on Chemoinformatics: ChEMBL

The talks of the first full day of the 4th German Conference on Chemoinformatics are over. Most interesting for me was Christoph Steinbeck's talk about the recently announced data acquired by the EBI. The database will be called "ChEMBL". There will be a monthly update cycle, so the acquisition does not only capture the current state, but the database is going to be extended. There are three parts (although they'll be combined eventually):

"DrugStore": interactions for 1500 drugs. Christoph says that he doesn't expect this to go much beyond what's already publicly available in DrugBank et al. today.
"CandiStore": 15,000 clinical leads
"StARLite": 500,000 medicinal chemistry leads. This is where most of the novelty (in terms of public data) lies. For this part, there are >5500 annotated targets, >3500 of which are proteins (the rest are e.g. tissues), and 2 million experimental bioactivities. The database contains bidirectional links to the literature on synthetic routes and assays for the ligands and descriptions of the targets.

The data will first be made available as database dumps; more user-friendly interfaces will be added later.

Two URLs of interest that I didn't know before: The ChEMBL blog and John Overington's lab homepage.

~~Other remarks about today will follow when I have a real internet connection (not just 6 kB/s via Bluetooth/GPRS for 9 ct/min) to do some more background research.~~

Thursday, September 25, 2008

Three presentations down, one to go

I just passed my third annual review, which means that thanks to doing my PhD in Europe I should be done soon (caveat). Everything went fine. Now I'll wait for some results to come back from the lab and then I'll write my thesis, luckily as a cumulative thesis.

Lastly, here's the Wordle digest of my report. I guess the main topic of my research clearly shines through.

Wednesday, August 20, 2008

Web 2.0 killer app: FriendFeed for scientific papers

This post is inspired by Eva's thoughts on getting scientists to adopt Web 2.0 and Cameron's post on making Connotea a killer app for scientists.

Many people have added their CiteULike or Connotea libraries to FriendFeed, so during the day you can see various new papers flow by. Similarly, journals' TOC updates and saved searches on PubMed create a regular stream of possibly interesting papers. Lastly, after a few weeks or months, papers are processed by ISI Web of Science and can be tracked by citation alerts. In the end, you might see the same paper flow by a couple of times.

This situation is far from ideal. You see echoes of the same paper, and papers arrive via multiple channels: RSS, email, web sites. There are far too many potentially interesting papers, so you have to focus your various alerts in order not to be overwhelmed.

My proposal for the killer app is a central place which tracks all of the above items (i.e., friends' libraries, PubMed searches, journal TOCs and citation alerts) and integrates with your personal library. Just like in FriendFeed, there should be a way to rate/like a paper ("Faculty of 1,000,000"?), to prioritize the new papers, and to save papers to your library. The most important and difficult feature would be to merge equivalent entries, i.e. a Connotea link to PubMed needs to be merged with the journal TOC alert etc. So once you have identified something as interesting and filed it, you won't be alerted again if it comes in via another channel.
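To make the merging step concrete, here is a rough Python sketch. It assumes each incoming alert has already been parsed into a dict with a "source" and, where available, a "doi", "pmid" or "title"; these field names and both function names are made up for illustration. It only covers the easy case where two channels report the same identifier. Matching a PubMed ID against a DOI would need an extra lookup that is not shown.

```python
import re


def canonical_key(entry):
    """Return a key that should be identical for alerts about the same paper."""
    doi = entry.get("doi")
    if doi:
        # DOIs are case-insensitive; drop a "doi:" or resolver-URL prefix.
        doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip(), flags=re.I)
        return ("doi", doi.lower())
    pmid = entry.get("pmid")
    if pmid:
        return ("pmid", str(pmid).strip())
    # Last resort: a normalized title (lowercase, punctuation collapsed).
    title = re.sub(r"[^a-z0-9]+", " ", entry.get("title", "").lower()).strip()
    return ("title", title)


def merge_entries(entries):
    """Group alerts by canonical key and remember which channels reported each paper."""
    merged = {}
    for entry in entries:
        record = merged.setdefault(
            canonical_key(entry),
            {"title": entry.get("title", ""), "sources": []},
        )
        record["sources"].append(entry["source"])
    return merged
```

With something like this in place, a paper you have already filed can be recognized by its key and suppressed when it shows up again via another channel.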

Of course, there should be a non-mandatory way to tag papers, to have groups, and to recommend papers to specific users (like the "for:" tag in delicious.com).

Bonus points: keep track of comments and blog posts of the paper, plus all the extended literature analysis that Cameron proposed.

Friday, August 15, 2008

Mendeley = Mekentosj Papers + Web 2.0 ?

Via Ricardo Vidal: Mendeley seems to be a Windows (plus Mac/Linux) equivalent of Mekentosj Papers (which is Mac OS X only, and has been described as "iTunes for your papers"). In addition to handling your PDFs, it has an online component that allows sharing your papers and other Web 2.0 features (billing itself as "Last.fm for papers").

Here, I'm reviewing the Mac beta version (0.5.6). I am focusing mostly on the desktop side and comparing it to Papers, because I have a working solution in place and I would only switch to Mendeley if the experience is as good as with Papers. (I.e., my main problem is off-line paper management; Web 2.0 features are icing on the cake.)

By Mac standards, the app is quite ugly. Both Mendeley and Papers allow full-text PDF searches, which is important if you want to avoid tagging/categorizing all your papers. Papers can show PDFs in the main window, copy the reference of the paper and email papers. Mendeley in principle can also copy the reference, but special characters are transformed to gibberish in this beta version. Papers allows you to match papers against PubMed, Web of Science etc., while Mendeley only offers to auto-extract often incomplete meta-data. This matching feature is extremely useful as you get all the authoritative data from the source, and most often Papers can use the DOI in the PDF to immediately give you the correct reference. Update: Mendeley also uses DOIs to retrieve the correct metadata, if available. (Thanks, Victor, for your comment.)

The beta version is quite rough: I just had to kill it because I found no way to close the "about" window. Extraction of meta-data and references doesn't always work, but this might be more of a problem with the information that's stored in the PDFs.

Of course, once there's a critical mass of people using Mendeley, there'll be all the Web 2.0 features that Papers doesn't have. Judging from the talk I think they might be trying to do too much: Connotea/CiteULike plus Dopplr plus LinkedIn. For me, a simple way to export new references from Papers to Connotea/CiteULike would be enough. More modularity is better, because it allows you to choose the best tool in each layer.

More info from the Mendeley folks: Short demo, a little longer talk.

CiteWeb: Following citations made easy

One good way to keep up with the literature in a field is to track which new papers are citing seminal papers of the field. Each Friday, I get lots of citation alerts from ISI Web of Science, but often enough I see the same paper again and again (citing different papers that are on my watch list). So I set out to write an app that would take ISI's RSS feeds, coalesce them, and give them back to you. For example, in the screenshot one review paper is citing five of my tracked papers:

If you're using citation alerts from Web of Science, then give CiteWeb a try at citeweb.embl.de. If you find a bug, you can either comment here, or grab the source code and fix it. :-)

I started working on this to try out if Google App Engine was useful. It turned out that downloading many items from a remote host leads to time-outs from App Engine, so I ported the app to Django. The source code is released under the MIT License.
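The coalescing step itself is simple. The following Python sketch is not the CiteWeb source code, just an illustration of the idea. It assumes the RSS feeds have already been parsed into (tracked paper, citing paper) pairs, and the function name is made up.

```python
from collections import defaultdict


def coalesce_alerts(alerts):
    """Turn (tracked_paper, citing_paper) pairs into {citing_paper: [tracked_papers]}.

    Each citing paper then shows up once, together with all the
    tracked papers it cites, instead of once per tracked paper.
    """
    by_citing = defaultdict(list)
    for tracked, citing in alerts:
        # Skip exact repeats, e.g. the same item seen in two feed downloads.
        if tracked not in by_citing[citing]:
            by_citing[citing].append(tracked)
    return dict(by_citing)
```

A review that cites five tracked papers thus becomes a single entry with five tracked papers listed under it.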

Tuesday, August 12, 2008

Google integrates Scholar into main page

I don't know if it's just me (sitting inside a research institution), but when I search for something that returns a paper, I get info from Google Scholar:

(See also the complete screenshot with notes on Flickr.) However, the order of the results is different: Google Scholar seems to weight by citations, Google by page rank.

Saturday, July 19, 2008

ISMB 2008: "Career Paths in Bioinformatics and Computational Biology"

A panel discussion about "Career Paths in Bioinformatics and Computational Biology" was part of Friday's Student Council Symposium during ISMB. The four panelists were from academia: Philip E. Bourne (group leader at University of California San Diego), Alfonso Valencia (group leader at the Spanish National Research Council), Jong Bhak (director of the Korean BioInformation Center) and Richard Wintle (Assistant Director at The Centre for Applied Genomics). (Only RW had spent a longer time [6 years] in the industry at start-up companies.)

[All quotes are paraphrases based on the notes I took.]

Perhaps unsurprisingly, they couldn't offer really comforting answers to the questions of young researchers: "Isn't there a high chance that at the age of 40 you'll be highly trained and specialized, but without a job?" – "There's no job security in academia anyway; I'm not sure if academia is more competitive than industry" (AV). "After the initial boom in bioinformatics positions, will the fraction of grad students who become PIs approach that in biology, 5 to 10%?" – "Biology will morph into bioinformatics, so there will be more jobs." (JB)

However, I could take away some positive advice. In short: follow your heart, be passionate about something, don't do what everybody else is doing, start your own sub-field if you have to. From my perspective, this is both reasonable and encouraging. As I enter the last phase of being a PhD student, I begin to wonder how I can combine working in science and caring for my family. I hope that by staying motivated and being effective in what I do, I have a chance to grow in my career and be there for my family. (I think this is a great advantage of computational biology: You can't make gels run faster, so to say, but you can be effective in programming and analyzing data.)

Another good insight was that a lot of basic, technological advances will come from industry in the future. Dr. Bhak cited the example of CPU development: The huge increases in processing power we see today are being implemented at Intel and AMD (although I cannot judge how much they rely on basic research by academia). My addition to this might be that part of bioinformatics will become more of an engineering discipline. So, for people interested in this, there will be a big job market in the future.

Similarly, the panelists expected that every biology lab will have embedded computational biologists in the future. I agree, but I think these will be mostly post-doc (i.e. non-permanent) positions.

Some of the questions and answers in more detail:

What will be the big opportunities in the next five years? The current generation of students will lead the bioinformatics industry, like the previous generation is currently leading in academia (JB). There will be many more embedded bioinformaticians (see above, AV). Hybrid skills (wet-lab and computational biology) will become more important (RW). The greatest opportunities are cross-disciplinary approaches that tackle as much complexity as possible (PB).

To stay motivated, and find out what you want to do: Always follow your heart in career decisions; create your own sub-division if you have to (PB). Don't do something just because it's trendy; what you like to do might change over time (at one point, industry might be appealing, later academia) (RW).

To find your spot in academia: Find influential people (RW). Diversify: try to do something that not everyone else is doing (AV).

Friday, July 18, 2008

Micro-blogging ISMB

As Pablo announced, several people including me are micro-blogging about ISMB on FriendFeed and to a lesser extent on Twitter.

Wednesday, May 21, 2008

Invoking TextMate from a remote machine

I edit all my script files via TextMate, but they live on remote drives (mounted via sshfs). I'm always logged in to the remote machines to run scripts etc. Usually I'll open remote files via Quicksilver, but when I create a new file this isn't an option yet and so it's tedious. Anyway, I've cooked up a quick script (with some help) to open TextMate from a remote machine.

Try it, YMMV:

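#!/bin/bash
# Run on the remote machine; $1 is the file to open in TextMate on the Mac.

# Resolve the file's directory to an absolute path (handles relative and absolute arguments).
cd "$(dirname "$1")" || exit 1
DIR=$(/bin/pwd)
BASENAME=$(basename "$1")

# Ask the Mac to open the file via its sshfs mount point; discard all output and don't wait.
ssh pasadena "osascript -e \
'tell app \"Terminal\" to do \
shell script \"/usr/local/bin/mate /Volumes/emil$DIR/$BASENAME\"'" >& /dev/null &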

pasadena = name of my Mac OS X machine
/Volumes/emil = sshfs mount point

Update: I removed the "-w" option (I didn't want TextMate to wait for the file to be closed) and redirected all the garbage to /dev/null.

Update 2: Support arbitrary (i.e. also absolute) paths.

Update 3: this is broken on Snow Leopard :(

Update 4: Here's a workaround, although it's a bit ugly as you need to invoke sudo.

Thursday, April 24, 2008

Unison is great for syncing files

I've already tagged Unison, but I find it so very useful that I just have to write this short blog post to praise it. If you work on two computers and want to keep a set of files in sync, you should try Unison. It's open-source, cross-platform, and the newest version works great on Mac OS X (fink's version was broken when I tried it).

In contrast to other syncing or backup programs, Unison is bidirectional. For example, I just edited some slides for a talk I'm giving tomorrow on my laptop, while I made some changes in the report on my main computer. Then I ran Unison, and it sent the files in the appropriate direction.

Minor gripe: I've tried to sync a huge folder with it and it failed (20 GB / 8600 files+folders), but it works fine and is fast for my normal syncing (350 MB / 1200 files+folders). However, in the first case I really wanted to make a backup, so I just used SuperDuper for this one folder.

Friday, March 14, 2008

Using Makefiles for jobs that run on a cluster

Makefiles are great. While you work on a project, they make it convenient to run the necessary scripts. When you come back to the project half a year later, you don't have to dig in your brain how the scripts fit together—it's all there. (More on make, and related advice.)

However, often in bioinformatics computational tasks are too big for a single CPU, so jobs are submitted to a cluster. Then, the Makefile doesn't help you much: It can't detect that jobs are running on the cluster. There is qmake, but it only works if all your parallel jobs are specified in the Makefile. I usually write my parallel scripts in a way that they can submit as many instances as necessary of themselves to the cluster via qsub.

Therefore, I went ahead and wrote a small Python wrapper script that runs the job submission script and sniffs the job ids from the output of qsub. It then waits and monitors these jobs until they are all done. Then, the execution of the Makefile can continue.
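The core of such a wrapper can be sketched in a few lines of Python. This is not the downloadable script, only an illustration. It assumes that qsub reports submissions as 'Your job <id> ("<name>") has been submitted' (or 'Your job-array <id>.<tasks> ...') and that a plain qstat call lists the user's active jobs with the job id in the first column. Both function names are made up.

```python
import re
import subprocess
import time

# Assumed SGE message format: 'Your job 4711 ("name") has been submitted'
# or 'Your job-array 4712.1-10:1 ("name") has been submitted'.
JOB_ID_RE = re.compile(r"Your job(?:-array)? (\d+)")


def parse_job_ids(qsub_output):
    """Extract all job ids from the text that the submission script printed."""
    return JOB_ID_RE.findall(qsub_output)


def wait_for_jobs(job_ids, poll_seconds=30):
    """Block until none of the given job ids is listed by qstat any more."""
    pending = set(job_ids)
    while pending:
        listing = subprocess.run(["qstat"], capture_output=True, text=True).stdout
        # The first column of each qstat line is the job id; header lines never match an id.
        active = {line.split()[0] for line in listing.splitlines() if line.strip()}
        pending &= active
        if pending:
            time.sleep(poll_seconds)
```

Note that this only detects when jobs have left the queue; it cannot tell whether they succeeded.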

Here's an example of how to invoke the wrapper script from the Makefile:

pubchem_compound_inchi.tsv.gz:
	~/src/misc/qwrap.py ${SRC_DIR}/inchikeys.py
	cat ../inchikey/* | gzip > pubchem_compound_inchi.tsv.gz

You can download the code (released under a BSD License, adapted to SGE). I hope it's useful!

Addendum: Hunting around in the SGE documentation I found the "-sync" option, which, together with job arrays, probably provides the same functionality but also checks the exit status of the jobs.

Wednesday, March 12, 2008

InChIKeys for PubChem

An InChIKey is a sort of checksum for chemical structures. It consists of two parts: The first captures the scaffold of the compound, the second is computed based on the stereochemistry, proton position etc. This makes the InChIKey ideal for STITCH, because we want to merge tautomers and stereoisomers.
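As a small illustration of why the two-part layout is convenient, the following Python sketch groups compounds by the first part of their key. It assumes the two parts are separated by a hyphen. The function names are made up, and this is not the actual STITCH code.

```python
from collections import defaultdict


def first_block(inchikey):
    """Return the part of the key before the first hyphen."""
    return inchikey.split("-", 1)[0]


def group_by_first_block(cid_to_key):
    """Map each first block to the list of compound ids that share it."""
    groups = defaultdict(list)
    for cid, key in cid_to_key.items():
        groups[first_block(key)].append(cid)
    return dict(groups)
```

Compounds that differ only in the second part of the key end up in the same group and can then be merged into one entry.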

PubChem doesn't provide an InChIKey yet in the SDF files that you can download. However, you can quickly generate a tab-delimited file with the help of the InChI toolkit (which you have to download and compile):

zcat SDF/Compound_00000001_00025000.sdf.gz | \
./cInChI-1 -STDIO -key -AuxNone -SDF:PUBCHEM_COMPOUND_CID | \
sed 's/Structure.*=//' | sed ':a; $\!N;s/\nInChI/\tInChI/;ta;P;D' > result

(The sed command is from a FAQ.)

Tuesday, March 11, 2008

Blogging for search engines

Related to my last post about the failings of Web 2.0 in biology, I want to ask the meta-question: Why do we blog? David Crotty proposes four reasons: Communication with other science bloggers, with non-specialists, with journalists and finally with search engine users. Unless you are a fairly well-known person, your regular audience will consist of your colleagues, collaborators and a random grad student or two. A journalist might only come by if you managed to get a press release about a Nature/Science/... paper out. But Googlebot won't fail you and will read all your posts!

Insightful blog posts won't stay without an audience. For one, the small circle of followers to your blog will spread the news if you write something worth sharing. Far more important are search engines. How do you survey a research area of interest? Most of us will query PubMed, but also do a Google search in the hope that some meaningful analysis is somewhere on a course website, in the text of a paper or maybe even in a blog.

Biologists use Google to query for their proteins of interest. STRING is a fairly successful database, and lots of people google for it by name. However, almost one quarter of all visitors from Google have actually searched for a protein name (random example) and found STRING. If you follow Lars J. Jensen's lead and publish your research observations and dead ends online, someone might serendipitously find them and use them for their own research. This will be the next step towards open science (with open data and open notebooks, which we might never reach): "Publishing" small findings, data and back stories about papers on your blog, enabling others to gain insight.

Web 2.0, CiteULike and Mekentosj Papers

Roland Krause bookmarked a great post: "Why Web 2.0 is failing in Biology" by David Crotty. That I got to know about this post just by subscribing to his links in del.icio.us is a success of Web 2.0. I'm just not sure if the same successes are already in reach in the context of science. I especially agree with David Crotty's observations about entry barriers: Unless new tools/communities make it very easy to use them and provide great benefit, the rate of adoption will be low.

From my personal experience, I can share this: Almost two years ago, I participated in giving a series of talks about Web 2.0 and how it might impact biology. Looking back, I'm not sure many things have changed. I have been using CiteULike for the past three years or so, but I think I will now switch to Papers. CiteULike allows me to bookmark and tag my papers, but when I search my library I mostly use a custom Google search for the specific author.

Papers lets you easily create a collection of all PDFs you ever read. Thanks to Spotlight, you can perform full text searches on the articles and quickly retrieve the paper you have in mind. This avoids the overhead of applying tags to papers that you actually don't end up using. (GMail is another case in point: a quick search function eliminates the need for an intricate folder structure.)

I can't remember a specific case where the "Web 2.0" functions of CiteULike ever worked for me. Peeking in the bibliographies of other people can be interesting if you have some bookmarked papers in common, but the signal-to-noise ratio is very low. So, unless you know specific people or groups to follow, you'll mostly use CiteULike only in "Web 1.0 mode". And then we come back to the initial observation: If a web tool is more complicated or less featured than the desktop (or even paper) version, it won't be used much.

Update: Mendeley might be the Windows equivalent some of you have been looking for (15/08/08, more in this post).

Friday, February 22, 2008

STITCH and STRING blog

As an outlet for additional information about various things going on with STRING and STITCH I've created a blog. In particular, I spent the last week in Japan at the BioHackathon 2008 in Tokyo. Besides enjoying the different culture, I got to work on an API for the servers. I guess we'll see if it actually gets used (one of the first uses could be in the Reflect text-mining / highlighting tool, which already uses STITCH to get the pop-ups).

Tuesday, February 5, 2008

Max Planck Society signs agreement with Springer

In October, I reported that the German Max Planck Society failed to reach a new license agreement with Springer. Now, via heise.de, I learn that they have signed an agreement on January 29, 2008. Here's the press release (there's also a German version).

The details are very sparse; presumably Springer had to come down on the price, but they won't state that. However, the press release devotes a lot of space to Open Access, saying that the license agreement "also includes Open Choice™". Open Choice is Springer's author-pays-for-OA program. Now, what does this mean? It doesn't make sense to assume that the agreement talks about access to Open Choice articles, so I guess it must mean that all MPG articles are now going to be published under the Open Choice model. Querying PubMed a bit, I find that the MPG accounted for 6% of the total German research output, so this is certainly an interesting development.