Monday, January 23, 2012

A publicly mandated medical terminology with a restrictive license

Update: Good News! In the meantime, I've been contacted by MedDRA (most probably unrelated to this blog post) and after a fruitful discussion it seems to be possible for me to base SIDER on MedDRA.

The world's regulatory agencies are increasingly adopting and mandating a new medical terminology scheme called MedDRA to capture side effects (adverse drug reactions) during the regulatory process. (For example, it is used in CVAR, AERS and other instances at the FDA [pdf], which in turn have been used in recent papers). Sounds great, right? The only problem: The dataset is under an restrictive license (pdf): MedDRA data can only be shared among MedDRA subscribers (source [pdf]). I've clarified this via email with the help desk: one can only share text examples with less than 100 terms as examples, and no numeric codes.

This means: it is not possible to create a public dataset, or supplementary material for a paper, that contains a useful amount of data based on MedDRA. 

Two years ago, I created the SIDER database of drug–side effect relations (published in MSB). By relying only on publicly available drug labels and dictionaries like COSTART (with UMLS License Category 0), we were able to create a dataset that can be shared with everyone. (Disclaimer: we chose the license CC-BY-NC-SA.) If I were to base SIDER on MedDRA, the license would prevent me from making a machine-readable database available for download and further research. Thus, the next version of SIDER cannot be based on the dictionary of medical terms that regulatory agencies use at the moment.

What is especially sad about this is that the license fees themselves are not especially high, companies with an annual revenue less than $1 million have to pay only $190 USD, and I doubt that there are hundreds of subscribers who earn more than $5 billion and thus pay the maximum fee of $62,850 USD. So it would take relatively little financial effort to declare MedDRA a open access database.

IANAL, so it may be possible that a database like SIDER, which essential contains the following records: side effect identifier, side effect name, drug identifier, drug name, is derived enough to not fall under the MedDRA license. I remain doubtful, however, especially after reading the restrictions on UMLS License Category 3, under which MedDRA falls, like: "incorporation of material [...] in any publicly accessible computer-based information system [...] including the Internet; [...] creating derivative works from material from these copyrighted sources".

Information on public health, like drugs and their side effects, should be openly available for research, second only to privacy concerns. I'm not sure how to begin to change this (beyond writing this), but ideas are very welcome.

(Small fnord detail: MSSO, which manages MedDRA, is a subsidiary of the military contractor Northrop Grumman.)

Friday, January 6, 2012

Embarrassingly parallel BLAST search

A quick note: To blast a file with many proteins against a database, you can use recent version of GNU Parallel to fill up all CPUs (which the -num_threads option of BLAST doesn't do, as it only parallelizes some steps of the search):

cat query.fasta | parallel --block 100k --recstart '>' --pipe \ 
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv

This will split the FASTA file into smaller chunks of about 100 kilobyte, while making sure that the records are valid (i.e. start with an ">").