Friday, March 14, 2008

Using Makefiles for jobs that run on a cluster

Makefiles are great. While you work on a project, they make it convenient to run the necessary scripts. When you come back to the project half a year later, you don't have to rack your brain to recall how the scripts fit together: it's all there. (More on make, and related advice.)

However, computational tasks in bioinformatics are often too big for a single CPU, so jobs are submitted to a cluster. Then the Makefile doesn't help you much: it can't detect that jobs are running on the cluster. There is qmake, but it only works if all your parallel jobs are specified in the Makefile. I usually write my parallel scripts so that they can submit as many instances of themselves as necessary to the cluster via qsub.

Therefore, I went ahead and wrote a small Python wrapper script that runs the job submission script and sniffs the job ids from the output of qsub. It then monitors these jobs until they are all done, at which point execution of the Makefile can continue.
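To illustrate the idea, here is a condensed sketch of how such a wrapper can work. This is a simplified, hypothetical version: it assumes the submission script passes qsub's "Your job ..." messages through on stdout and that plain qstat output lists one job id per line. The actual script, linked below, is more robust.

#!/usr/bin/env python
# Sketch of a qsub wrapper: run the submission script, collect the job
# ids that qsub reports, then poll qstat until all of them are gone.
import re
import subprocess
import sys
import time

def get_output(cmd):
    # Run a command and return its standard output as a string.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    return p.communicate()[0]

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: qwrap.py submission-script [args]")
    # qsub prints lines like 'Your job 123456 ("name") has been submitted'
    # (or "Your job-array ..." for array jobs).
    output = get_output(sys.argv[1:])
    print(output)
    job_ids = set(re.findall(r"Your job(?:-array)? (\d+)", output))
    if not job_ids:
        sys.exit("qwrap: no job ids found in the submission output")
    # Wait until none of the submitted jobs appear in qstat anymore.
    while job_ids:
        time.sleep(30)
        qstat = get_output(["qstat"])
        # The first column of each qstat job line is the job id.
        still_queued = set(re.findall(r"^\s*(\d+)", qstat, re.MULTILINE))
        job_ids &= still_queued

if __name__ == "__main__":
    main()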

Here's an example of how to invoke the wrapper script from the Makefile:
pubchem_compound_inchi.tsv.gz:
	~/src/misc/qwrap.py ${SRC_DIR}/inchikeys.py
	cat ../inchikey/* | gzip > pubchem_compound_inchi.tsv.gz
You can download the code (released under a BSD License, adapted to SGE). I hope it's useful!

Addendum: Hunting around in the SGE documentation, I found the "-sync" option, which, together with job arrays, probably provides the same functionality while also checking the exit status of the jobs.
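For example, a rule along these lines should block until all tasks of the array have finished (a sketch only: the task count is arbitrary, and inchikeys.sh is a hypothetical per-task script that would pick its slice of the work based on $SGE_TASK_ID):

pubchem_compound_inchi.tsv.gz:
	qsub -sync y -t 1-10 ${SRC_DIR}/inchikeys.sh
	cat ../inchikey/* | gzip > $@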

2 comments:

dalloliogm said...

hi!
Thank you very much for the script.
However, I don't quite understand what you said about qmake.

Why is it not good?
Let's say I have a makefile like:
all: 1 2 3 4 5
%:
	echo $*

won't qmake run this in parallel, on 5 different processes?

Michael Kuhn said...

Yes, this will work. But I usually don't have all the jobs that should run in parallel in a Makefile. This is just the way my scripts evolved from single-processor to cluster jobs.

At first, I write "./script.py". Then I notice that I want to run this script on a cluster, and implement "./script.py submit", which in turn submits the jobs "./script.py 0 10", "./script.py 1 10", up to "./script.py 9 10".

This has the advantage that I can easily change the number of jobs directly in the script, and I don't need an extra Makefile for each script.
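
A condensed sketch of this pattern (the input file and the workload here are hypothetical placeholders):

#!/usr/bin/env python
# "./script.py submit" enqueues N_JOBS copies of this script;
# "./script.py I N" then processes every N-th item, starting at I.
import os
import sys

N_JOBS = 10  # change the degree of parallelism in one place

def work(chunk, n_chunks):
    # hypothetical workload: handle every n_chunks-th line of the input
    for i, line in enumerate(open("items.txt")):
        if i % n_chunks == chunk:
            print(line.strip())

if __name__ == "__main__":
    if len(sys.argv) == 1:
        sys.exit("usage: ./script.py submit | ./script.py CHUNK N_CHUNKS")
    if sys.argv[1] == "submit":
        for i in range(N_JOBS):
            # -b y lets qsub run the command directly
            os.system("qsub -b y %s %d %d" % (os.path.abspath(sys.argv[0]), i, N_JOBS))
    else:
        work(int(sys.argv[1]), int(sys.argv[2]))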