cat query.fasta | parallel --block 100k --recstart '>' --pipe \
    blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result.tsv
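This splits the FASTA file into chunks of about 100 kilobytes each and pipes every chunk to its own blastp process on standard input, while making sure that each chunk contains only complete records (i.e. it is cut only where a record starts with a ">").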
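To make the chunking rule concrete, here is a rough sketch in plain awk that groups FASTA records into blocks of at most a given size and reports what each block would contain. The helper name chunk_fasta_counts is hypothetical and the logic only approximates what --block and --recstart do; it is not how GNU parallel is implemented.

```shell
#!/bin/sh
# Illustrative sketch only: mimic record-aligned chunking of a FASTA stream.
# chunk_fasta_counts is a hypothetical helper, not part of GNU parallel.
# Usage: chunk_fasta_counts BLOCK_BYTES < input.fasta
# Prints one line per chunk: "<chunk number> <records> <bytes>"
chunk_fasta_counts() {
    awk -v block="$1" '
        # Report the current chunk (if it holds any records) and reset it.
        function flush_chunk() {
            if (recs > 0) { n++; print n, recs, bytes }
            recs = 0; bytes = 0
        }
        # Move the record collected so far into the current chunk.
        function add_record() {
            if (rec_bytes == 0) return
            # Start a new chunk if this record would overflow the block.
            if (recs > 0 && bytes + rec_bytes > block) flush_chunk()
            recs++; bytes += rec_bytes; rec_bytes = 0
        }
        /^>/ { add_record() }            # a header line closes the previous record
        { rec_bytes += length($0) + 1 }  # +1 for the newline
        END { add_record(); flush_chunk() }
    '
}

# Three tiny records (7, 8 and 6 bytes) with a 16-byte block size.
printf '>a\nMKV\n>b\nMKVL\n>c\nMK\n' | chunk_fasta_counts 16
# prints:
# 1 2 15
# 2 1 6
```

With a block size larger than the whole input, the sketch reports a single chunk, which is also why a small query file ends up in just one blastp job. To see the real chunk count, swap blastp for wc -l in the parallel command above; it prints one line count per chunk.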
3 comments:
Nice feature, but I can't get this syntax to occupy more than 100% of a CPU (looking at "top"). Normally I see gnu parallel occupy up to 1200% of a CPU (12 cores).
Perhaps it has something to do with the size of the FASTA file? The command line I posted takes 100k chunks of the file, so if you have a smaller file, it won't split it. You could try something like "wc" instead of blastp to see how many calls are made.
This has been super useful. Thanks!
I spent a long time working out how to gnu-parallelise UCSC's blat and most tricks to specify the query file didn't work (e.g. "-" "</dev/stdin" etc), so am posting what did work for me:
https://gist.github.com/sujaikumar/8932968
(pasted gist because code not allowed in comments)