bioCS: 2011

Thursday, August 11, 2011

ggplot2: Determining the order in which lines are drawn

In a time series, I want to plot the values of an interesting cluster versus the background. However, if I'm not careful, ggplot will draw the items in an order determined by their name, so background items will obscure the interesting cluster:


Correct: Interesting lines in front of background

Wrong: Background lines obscure interesting lines

One way to solve this is to combine the label and name columns into one column that is used to group the individual lines. In this toy example, the line belonging to group 1 should overlay the other two lines:
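The ordering idea itself does not depend on R. As a minimal sketch in Python (this is not ggplot2 code; the column names `label` and `name` follow the text, while the sample values and the helper `draw_order` are made up for illustration), sorting by name alone puts the interesting item first, so it would be drawn first and covered, whereas sorting by a combined label-plus-name key puts it last, i.e. on top:

```python
# Hypothetical toy data as (name, label) pairs; the values are made up.
series = [
    ("a", "interesting"),
    ("b", "background"),
    ("c", "background"),
]


def draw_order(items):
    """Return combined 'label_name' keys in sorted order (first drawn first)."""
    return sorted(f"{label}_{name}" for name, label in items)


# Ordering by name alone: "a" (the interesting line) comes first
# and would be covered by the lines drawn after it.
order_by_name = sorted(name for name, _ in series)

# Ordering by the combined key: background items sort before
# interesting ones, so the interesting line is drawn last.
order_by_key = draw_order(series)

print(order_by_name)
print(order_by_key)
```

This only works because "background" happens to sort before "interesting"; with other labels, a prefix that forces the desired order would be needed.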

Friday, July 15, 2011

SSH to same directory on other server

In my work environment, home directories are shared across servers. I often realize that I need to run a script on another server, e.g. due to RAM requirements or other jobs running on the current server. I finally figured out how to log in to the other server while staying in the same directory:

ssh -t server cd `pwd` "&& $SHELL"

(Of course, you can define aliases for different servers.)
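One caveat: the command above passes the directory unquoted, so a path containing spaces would probably break on the remote side. A minimal sketch in Python of how the same command could be assembled with quoting; `ssh_same_dir_command` is a hypothetical helper, not part of any library, and it assumes the remote login shell understands POSIX quoting:

```python
import os
import shlex


def ssh_same_dir_command(server, directory=None, shell=None):
    """Build an argument list equivalent to: ssh -t server cd DIR "&& SHELL".

    Hypothetical helper. The directory and shell are quoted for the remote
    shell, so paths containing spaces survive.
    """
    if directory is None:
        directory = os.getcwd()
    if shell is None:
        # Like $SHELL in the original command, this is the *local* shell.
        shell = os.environ.get("SHELL", "/bin/sh")
    remote = f"cd {shlex.quote(directory)} && {shlex.quote(shell)}"
    return ["ssh", "-t", server, remote]


print(ssh_same_dir_command("server", "/home/me/my project", "/bin/bash"))
```

The resulting list can be handed to `subprocess.call` to actually open the session.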

Tuesday, June 21, 2011

Mekentosj Papers and Dropbox: get a bit more space

I use Dropbox to keep an automatic backup of my Papers library. However, I was scraping close to my space allowance, and discovered that Papers adds some temporary files to the "Library.papers2" folder that eat up precious cloud space. Here is how to remove them from Dropbox:

Quit Papers 2.0. Make a backup of your Papers library.
Go to the "Library.papers2" folder inside your Papers library.
Copy the folders "Thumbnails" and "Spotlight" somewhere (e.g. your Desktop). The other folders don't need much space in my case.
Click the Dropbox icon in the menu bar, navigate to Preferences, Advanced, Selective Sync and de-select the "Thumbnails" and "Spotlight" folders. (Dropbox will now delete the folders.)
Move the folders you copied before back into the "Library.papers2" folder.
Via Dropbox's web interface, delete the online version of the two folders.
Voila: more space (200 MB, in my case).

Thursday, March 10, 2011

Comparing two-dimensional data sets in R; take II

David commented on yesterday's post and suggested putting the continuous fitted distribution in the background and the discrete, empirical distribution in the foreground. This looks quite nice, although there's a slight optical illusion that makes the circles look as if they were filled with a gradient, even though they're uniformly colored:

Not-so-good fit

Better fit

Wednesday, March 9, 2011

Comparing two-dimensional data sets in R

I wanted to fit a continuous function to a discrete 2D distribution in R. I managed to do this by using nls, and wanted to display the data. I discovered a nice way to compare the actual data and the fit using ggplot2, where the background is the real data and the circles are the fitted data (the legend is not optimal, but for a slide/figure it's probably easier to fix it in Illustrator):

A not-so-good fit

A better fit

My data frame includes these columns: x, y, enrichment (the real data), pred (my fitted values).

Tuesday, March 1, 2011

From FriendFeed to TwitterFeed: Losing the filters

These are the things I want to see in my feed: articles and papers you found and wanted to share, your opinions, your blog posts; to a lesser extent: your bookmarks.

However, I don't care about these things in my feed: GitHub activity, BioStar comments, shared YouTube videos, songs you liked, Wikipedia activity, etc.

On FriendFeed, it was easy to hide sources I didn't want to see, as each stream of content was clearly separate. It was even possible to hide an item until someone in my network liked it or commented on it. On Twitter, this seems almost impossible. First, everyone is setting up TwitterFeed as they see fit, so even finding a way to identify streams of items I would like to hide is not easy. Second, how would I hide them? Twitter doesn't even filter blocked users properly; they still show up in search. My current client of choice, Twitterrific, doesn't support any kind of filtering.

We all know this: "It's Not Information Overload. It's Filter Failure." Yes, the members of my network can be (active) filters. But I also need passive filters to reduce the amount of information that reaches me. Otherwise, my social network becomes much less useful to me. My fear is that, if all of us pipe everything into Twitter, it becomes a useless mess.

Wednesday, February 23, 2011

A bookmarklet for shortDOI.org

shortDOI is a URL shortening service that takes DOIs and converts them to short URLs such as http://doi.org/bb6, which is nice for emails and Twitter. You can add the bookmarklet by dragging this link to your bookmarks: shortDOI. It will try to find the DOI in the current page and direct you to shortDOI.

A shortDOI URL is probably more persistent than, say, a bit.ly link, as it's backed by the organization that maintains the DOI infrastructure. However, if doi.org went down, the shortDOI URL would be worthless, whereas with the original DOI you could still find the paper using a search engine.

Update 24.02.2011: Use some majority voting to find the right DOI.
Update 01.03.2011: Expand the list of allowed characters. Does anybody know which characters can be part of the DOI?
Update 29.07.2015: Be strict about having a prefix and suffix. First check for the "citation_doi" meta tag before looking in the rest of the document.
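The lookup order described in these updates can be sketched roughly as follows. This is a Python illustration of the logic only, not the bookmarklet's actual JavaScript; the regular expressions are assumptions (in particular, the set of characters allowed in a DOI suffix is wider than what is matched here), and `find_doi` is a hypothetical name:

```python
import re
from collections import Counter

# Assumed pattern: "10.", a numeric prefix, "/", then a non-empty suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")

# Assumes the name attribute comes before content in the meta tag.
META_RE = re.compile(
    r"<meta\s+name=[\"']citation_doi[\"']\s+content=[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)


def find_doi(html):
    """Return the most likely DOI in an HTML string, or None."""
    # 1. Prefer the "citation_doi" meta tag if the page has one.
    meta = META_RE.search(html)
    if meta:
        return meta.group(1).strip()
    # 2. Otherwise collect every DOI-like string, trimming trailing
    #    punctuation that usually belongs to the sentence, not the DOI.
    candidates = [m.rstrip(".,;)") for m in DOI_RE.findall(html)]
    if not candidates:
        return None
    # 3. Majority vote: the candidate seen most often wins.
    return Counter(candidates).most_common(1)[0][0]


page = (
    '<p>See 10.1000/abc and <a href="https://doi.org/10.1000/abc">link</a>, '
    "which cites 10.1000/xyz.</p>"
)
print(find_doi(page))
```

The majority vote helps on article pages where the reference list contains many other DOIs but the article's own DOI appears several times.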

Monday, February 21, 2011

Why should we apply Moore's Law to DNA sequencing?

In this chart of cost per megabase of DNA sequence, an extrapolation based on Moore's Law has been added. What's wrong with this? It starts in 2001, the year of the human genome.

In 2001, only a few animal genomes had been published (starting with worm in 1998). If I had to compare the human genome to a computer, I'd pick ENIAC. Moore's Law, however, was stated in 1965, some 20 years after the first "real" (i.e. Turing-complete) computers like the Zuse Z3 or ENIAC. When you go back, Moore's Law doesn't hold anymore:

Source: Hans Moravec

The overall rate of progress in the pre-transistor era is lower than the rate of the transistor era, which is perhaps no wonder, as Moore's Law was originally stated in terms of the number of transistors per chip.

At what rate will DNA sequencing progress? Perhaps the sharp decrease in sequencing costs between 2008 and 2010 is comparable to the transition from vacuum tubes to transistors, and Moore's Law will be followed from now on (extrapolating from three data points...). But perhaps we'll see more sharp decreases, and should overcome the desire to extrapolate using Moore's Law from arbitrary starting points.

(HT Deepak.)

Wednesday, January 19, 2011

New repo: local NCBI taxonomy database

I added some new functionality to the taxonomy repository at Google Code, creating a fork at BitBucket. The existing Python package already makes it possible to create a local database containing the content of the NCBI taxonomy, which can then be queried for names, ranks, and lineages. I added functionality to create a Newick tree from a list of NCBI taxonomy identifiers.
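The idea behind the Newick export can be sketched in a few lines of Python. This is not the code from the repository; it assumes the lineages have already been looked up in the local database and are given as lists of taxonomy identifiers running from a single shared root to each leaf. The function names and the heavily abbreviated lineages are for illustration only:

```python
def build_tree(lineages):
    """Merge root-to-leaf lineages into a nested dict keyed by taxonomy ID."""
    tree = {}
    for lineage in lineages:
        node = tree
        for taxid in lineage:
            node = node.setdefault(taxid, {})
    return tree


def to_newick(tree):
    """Serialize the nested dict as Newick, labeling internal nodes too.

    Assumes the tree has a single root.
    """
    def fmt(taxid, children):
        if not children:
            return str(taxid)
        inner = ",".join(fmt(t, c) for t, c in sorted(children.items()))
        return f"({inner}){taxid}"

    return ",".join(fmt(t, c) for t, c in sorted(tree.items())) + ";"


# Abbreviated lineages (real ones contain many more ranks):
# 1 = root, 33208 = Metazoa, 9606 = human, 10090 = mouse, 4751 = Fungi.
lineages = [
    [1, 33208, 9606],
    [1, 33208, 10090],
    [1, 4751],
]
print(to_newick(build_tree(lineages)))  # (4751,(9606,10090)33208)1;
```

Internal nodes with a single child are kept as they are here; collapsing them would be a natural extension if only the branching structure matters.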