Preprocessing SMILES in pySIML

The Short Version

Just use cSMILEStoMatrices(), and don’t look back.

The Long Version

As explained in pySIML Concepts, two things must be done to SMILES strings before they can be used for LINGO similarity comparison:

  • Certain transformations must be performed, such as changing ring closure digits and stripping names.
  • They must be converted to the SIML internal numerical representation.

The pysiml.compiler section provides details on how to do this conversion.

The pysiml.compiler module provides both a pure-Python converter, SMILEStoMatrices(), and one built around a C extension, cSMILEStoMatrices(). It is important to note that these two DO NOT produce the same output. The Python converter transforms all digits in the SMILES string to zeroes, which also (incorrectly) alters charge numbers, isotope indicators, and hydrogen counts. The C converter performs the following changes:

  • Change all digits to zero, except for digits following a ‘+’ or ‘-’ (charge counts), an ‘H’ (hydrogen counts), or a ‘[’ (isotope indicators).
  • Reduce multiple-digit ring-closure indicators (e.g., ‘%13’) to one digit (‘%0’) to normalize ring formatting. This currently works only for molecules with fewer than 100 rings, due to ambiguities in the SMILES specification.
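
The difference between the two digit-handling rules can be illustrated with a short pure-Python sketch. This is not the pySIML source: the function names zero_all_digits and zero_ring_digits are hypothetical, and the second function simply follows the C rules as worded above (including treating every ‘-’ as a charge sign, although ‘-’ can also denote a single bond in SMILES).

```python
import re


def zero_all_digits(smiles):
    """Replace every digit with '0' (the pure-Python rule described above)."""
    return re.sub(r"[0-9]", "0", smiles)


def zero_ring_digits(smiles):
    """Zero ring-closure digits only (a literal reading of the C rules above)."""
    out = []
    i, n = 0, len(smiles)
    while i < n:
        ch = smiles[i]
        two = smiles[i + 1:i + 3]
        if ch == "%" and len(two) == 2 and two.isdigit():
            # Two-digit ring closure such as '%13' collapses to '%0'.
            out.append("%0")
            i += 3
        elif ch.isdigit():
            # Find the end of this run of digits.
            j = i
            while j < n and smiles[j].isdigit():
                j += 1
            if i > 0 and smiles[i - 1] in "+-H[":
                # Charge count, hydrogen count, or isotope: keep unchanged.
                out.append(smiles[i:j])
            else:
                # Ring-closure digits: each one becomes '0'.
                out.append("0" * (j - i))
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(zero_all_digits("c1ccccc1[NH3+]"))   # c0ccccc0[NH0+]
print(zero_ring_digits("c1ccccc1[NH3+]"))  # c0ccccc0[NH3+]
print(zero_ring_digits("C%13CCCC%13"))     # C%0CCCC%0
```

Under the first rule the hydrogen count in [NH3+] is lost, which is the kind of error the C converter is described as avoiding.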

Both the C and Python modules handle stripping of names and newlines from SMILES strings.

In general, there is almost no reason to use the Python compiler; cSMILEStoMatrices() is nearly 100 times faster and adheres more closely to the LINGO flavor outlined in [Grant06]. The Python compiler SMILEStoMatrices() is included only in case a pure-Python replacement is needed, or when you need SMILES strings transformed in exactly the same way as the SIML compiler transforms them (e.g., to pass into a different LINGO package for comparison).

pysiml.compiler - Transforming SMILES strings into SIML internal representations

This module provides “compilers” to convert SMILES strings into the sparse-vector representation required for SIML. C and pure-Python implementations are provided.

pysiml.compiler.SMILEStoMatrices(smileslist)

Convert the sequence of SMILES strings smileslist into a SIML SMILES set and a list of molecule names (if present in the SMILES strings). Uses a pure-Python implementation. See the pySIML preprocessing documentation for details on the transformations performed on the SMILES strings. Note that this does NOT perform the same transformations as the C version, cSMILEStoMatrices().

Returns a tuple of 5 values: a Lingo matrix, a count matrix, a length vector, a magnitude vector, and a list of molecule names (all but the molecule names make up the “SMILES set”).

pysiml.compiler.SMILEStoMultiset(smiles)

Returns Lingo and count vectors for a single SMILES string smiles, as would correspond to a row in the Lingo or count matrices from cSMILEStoMatrices() or SMILEStoMatrices(). Performs no transformations on smiles prior to conversion.

Note that in general, the results of this function will not be the same as those obtained from SMILEStoMatrices() or cSMILEStoMatrices() because this function does not preprocess the input strings.
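
To see why skipping preprocessing changes the result, it helps to look at what a Lingo multiset is. The sketch below assumes the usual LINGO definition of overlapping 4-character substrings of the SMILES string, and returns them as plain strings; SIML's actual vectors hold an internal numerical encoding that is not reproduced here. The name lingo_multiset and the parameter q are hypothetical.

```python
from collections import Counter


def lingo_multiset(smiles, q=4):
    """Return (lingos, counts): sorted distinct q-character substrings
    of smiles and how many times each one occurs."""
    counts = Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))
    lingos = sorted(counts)
    return lingos, [counts[lingo] for lingo in lingos]


# The same ring written with different closure digits yields different
# Lingos unless the digits are normalized first.
print(lingo_multiset("c1ccccc1") == lingo_multiset("c2ccccc2"))  # False
print(lingo_multiset("c0ccccc0"))
# (['0ccc', 'c0cc', 'ccc0', 'cccc'], [1, 1, 1, 2])
```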

pysiml.compiler.cSMILEStoMatrices(smileslist)

Convert the sequence of SMILES strings smileslist into a SIML SMILES set and a list of molecule names (if present in the SMILES strings). Uses the SIML compiler C extension. See the pySIML preprocessing documentation for details on the transformations performed on the SMILES strings. Note that this does NOT perform the same transformations as the pure-Python version, SMILEStoMatrices().

Returns a tuple of 5 values: a Lingo matrix, a count matrix, a length vector, a magnitude vector, and a list of molecule names (all but the molecule names make up the “SMILES set”).
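
Since both compilers return the same 5-tuple layout, the return value can be unpacked in one place. The helper below is a minimal sketch, not part of pySIML: compile_smiles_set is a hypothetical name, and the compiler is passed in as an argument so that either cSMILEStoMatrices or SMILEStoMatrices can be supplied.

```python
def compile_smiles_set(smileslist, compiler):
    """Run a pysiml.compiler function and separate the SMILES set
    (the four numerical arrays) from the list of molecule names."""
    lingos, counts, lengths, magnitudes, names = compiler(smileslist)
    smiles_set = (lingos, counts, lengths, magnitudes)
    return smiles_set, names


# Intended usage, assuming pysiml is installed:
#   from pysiml.compiler import cSMILEStoMatrices
#   smiles_set, names = compile_smiles_set(["c1ccccc1", "CCO"], cSMILEStoMatrices)
```

Keeping the four arrays together as one tuple mirrors the description of the “SMILES set” above and leaves the names list separate.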

pysiml.compiler.preprocessNumbers(smi, xtable=None)

Given a SMILES string, return a copy of the string with the same translations performed on it as would be done by the pure-Python preprocessor SMILEStoMatrices().

This function is primarily useful for comparing the results of SIML Tanimoto calculation functions with those from other LINGO calculation packages, since it ensures that identical SMILES strings are given to each.

xtable is an internal parameter and should always be set to None when called from user code.