Wednesday, March 12, 2008

InChIKeys for PubChem

An InChIKey is a sort of checksum for chemical structures: a fixed-length hash of the InChI. It consists of two parts. The first captures the scaffold of the compound (its connectivity); the second is computed from the stereochemistry, proton positions, etc. This makes the InChIKey ideal for STITCH: because we want to merge tautomers and stereoisomers, we can group compounds whose keys share the same first part.
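
As a small illustration of the two-part layout, the sketch below splits a key at its first hyphen using plain POSIX parameter expansion. The helper names and the key shown are made-up placeholders for illustration, not real compound data or part of any toolkit.

```shell
# Print the part of an InChIKey before the first hyphen (the scaffold part).
inchikey_first_part() {
    printf '%s\n' "${1%%-*}"
}

# Print everything after the first hyphen (stereochemistry, protons, etc.).
inchikey_second_part() {
    printf '%s\n' "${1#*-}"
}

# Placeholder key, not a real compound.
key="AAAAAAAAAAAAAA-BBBBBBBBBB"
inchikey_first_part "$key"
inchikey_second_part "$key"
```

Two compounds would then be merged whenever inchikey_first_part gives the same string for both.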

PubChem doesn't yet provide InChIKeys in the SDF files that you can download. However, you can quickly generate a tab-delimited file with the help of the InChI toolkit (which you have to download and compile):
zcat SDF/Compound_00000001_00025000.sdf.gz | \
./cInChI-1 -STDIO -key -AuxNone -SDF:PUBCHEM_COMPOUND_CID | \
sed 's/Structure.*=//' | sed ':a; $!N;s/\nInChI/\tInChI/;ta;P;D' > result
(The sed command is from a FAQ. In csh or tcsh, write \! instead of ! to keep the shell from treating it as a history expansion.)
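
Once the file exists, merging compounds that share a scaffold could look roughly like the sketch below. It assumes each line of the file holds the compound ID in the first column and, somewhere after it, one tab-separated field of the form InChIKey=first-second; check that your toolkit version really produces this layout. The function name group_by_first_block is hypothetical.

```shell
# Group compound IDs by the first part of their InChIKey.
# Assumed input: tab-delimited lines with the CID in column 1 and
# one later field that starts with "InChIKey=".
group_by_first_block() {
    awk -F'\t' '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^InChIKey=/) {
                    key = substr($i, 10)      # drop the "InChIKey=" prefix
                    split(key, parts, "-")    # parts[1] is the first part
                    first = parts[1]
                    if (first in cids) {
                        cids[first] = cids[first] "," $1
                    } else {
                        cids[first] = $1
                        order[++n] = first    # remember first-seen order
                    }
                }
            }
        }
        END {
            # One line per first part: the part, a tab, then the CIDs.
            for (j = 1; j <= n; j++) print order[j] "\t" cids[order[j]]
        }
    ' "$@"
}
```

Running group_by_first_block result > merged would then write one line per scaffold, listing all compound IDs that share it.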
