Similarity calculations with pySIML

Generic API for computing LINGOs with pySIML

All of the currently supported back ends for LINGO computations with pySIML (CPU, CUDA GPU, and OpenCL) share the same general data flow and methods, to make it easy to switch between CPUs and GPUs as needed. The overall structure of a Tanimoto computation with pySIML is as follows:

  • Read SMILES (from a file, database, generator, etc.)
  • Preprocess SMILES for reference and query sets (see section Preprocessing SMILES in pySIML) into a pair of ‘SMILES sets’, each consisting of a Lingo matrix, count matrix, length vector, and magnitude vector.
  • Create a LINGO comparator object (a CPULingo, GPULingo, or OCLLingo object)
  • Initialize the comparator with the reference and query SMILES sets using the set_refsmiles and set_qsmiles functions.
  • Request a single row from the Tanimoto matrix using the getTanimotoRow or getTanimotoRow_async methods, or a contiguous range of rows using getMultipleRows or getMultipleRows_async.
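
To make the contents of the Tanimoto matrix concrete, the following is a minimal pure-Python sketch of the quantity that each entry represents. It does not use pySIML. It assumes that Lingos are overlapping 4-character substrings and that similarity is the multiset Tanimoto (shared Lingo count divided by combined Lingo count); it omits any SMILES normalization that the pySIML preprocessor may perform. The names lingo_counts, lingo_tanimoto, and tanimoto_row are illustrative only and are not part of pySIML.

```python
from collections import Counter


def lingo_counts(smiles, q=4):
    """Return the multiset of all length-q substrings (Lingos) of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(a, b, q=4):
    """Multiset Tanimoto between two SMILES strings (assumed definition)."""
    ca, cb = lingo_counts(a, q), lingo_counts(b, q)
    # Shared Lingos: for each Lingo, the smaller of the two counts.
    intersection = sum(min(n, cb[lingo]) for lingo, n in ca.items())
    # Union size follows from the two multiset sizes and the intersection.
    union = sum(ca.values()) + sum(cb.values()) - intersection
    return intersection / union if union else 0.0


def tanimoto_row(ref, queries):
    """One matrix row: a single reference string against every query string."""
    return [lingo_tanimoto(ref, query) for query in queries]
```

Here tanimoto_row plays the role of a single row request: one reference string compared against the whole query set.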

Example of computing similarities with pySIML

The following is a simple demonstration of how to calculate a full N x N similarity matrix on a set of compounds read in from a file. Note that it lacks niceties such as error checking; a more detailed example is available in the examples directory:

import sys
import numpy
from pysiml.compiler import cSMILEStoMatrices
from pysiml.CPULingo import CPULingo

# Read one SMILES record per line; the file is closed automatically
with open(sys.argv[1], "r") as f:
    smiles = f.readlines()

numMols = len(smiles)

# We use cSMILEStoMatrices because it is almost 100x as fast as
# SMILEStoMatrices, and more correct to boot.
#
# The SMILES compiler also returns the molecule name associated
# with each row of the output matrices
(lingos,counts,lengths,mags,names) = cSMILEStoMatrices(smiles)

# Construct a similarity object. This could also be a GPULingo
comparator = CPULingo()

# Initialize the comparator with our SMILES sets. Since this
# computation is a self-similarity matrix, the reference and
# query sets are the same
comparator.set_refsmiles(lingos,counts,lengths,mags)
comparator.set_qsmiles(lingos,counts,lengths,mags)

# Create an empty storage place to put the result
similarityMatrix = numpy.empty((numMols,numMols))

# CPULingo-specific: see if we can run the computation in parallel
numProcs = 1
if comparator.supportsParallel():
    # If we can do a row-parallel computation (OpenMP supported), choose the
    # number of processors here
    numProcs = 4
similarityMatrix[:,:] = comparator.getMultipleRows(0,numMols,nprocs=numProcs)

print(similarityMatrix)

The following sections explain details of the CPULingo and GPULingo APIs and differences in their respective behavior.

pysiml.CPULingo - Computing LINGO similarities on a CPU

This module exposes the API for computing LINGO similarities on a CPU. Calculations of multiple rows can be parallelized across multiple CPUs, if the library has been built with OpenMP support. There is currently no support for parallelizing the computation of a single row across multiple CPUs.

The CPULingo object is the interface to compute LINGOs on a CPU. Unlike with GPULingo, creating multiple CPULingo objects will not parallelize computations across multiple CPUs; the only parallelism currently exposed is across rows, using OpenMP.

For interface consistency, CPULingo exposes asynchronous operation methods (getTanimotoRow_async() and getMultipleRows_async()); however, these methods as currently implemented are not actually asynchronous operations.

CPULingo object documentation

class pysiml.CPULingo.CPULingo

Object to handle computation of LINGO similarities on the CPU

asyncOperationsDone()

Return True if all asynchronous operations on this object have completed.

In the current implementation, this always returns True.

getMultipleRows(rowbase, rowlimit, nprocs=1)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set, and returns this block of rows.

If pySIML has been built with OpenMP enabled, nprocs may be set higher than 1 to parallelize computations over multiple CPUs (each CPU will handle a disjoint set of rows). If called with nprocs larger than 1 on a non-OpenMP build of pySIML, the method prints a warning to stderr and computes with one CPU.
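
Because row ranges are half-open (rowlimit itself is excluded), a large matrix can be computed as a sequence of row blocks. The following is a small sketch of a generator that produces such ranges; row_blocks is a hypothetical helper, not part of pySIML, and the block size is an arbitrary choice left to the caller.

```python
def row_blocks(num_rows, block_size):
    """Yield half-open (rowbase, rowlimit) pairs covering rows 0..num_rows-1."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    for rowbase in range(0, num_rows, block_size):
        # The final block is clipped so that rowlimit never exceeds num_rows.
        yield rowbase, min(rowbase + block_size, num_rows)
```

Each (rowbase, rowlimit) pair produced this way can be passed directly to getMultipleRows, with the returned block stored in rows rowbase:rowlimit of the result matrix.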

getMultipleRows_async(rowbase, rowlimit, nprocs=1)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set, and stores this block of rows internally as the last asynchronous result value.

If pySIML has been built with OpenMP enabled, nprocs may be set higher than 1 to parallelize computations over multiple CPUs (each CPU will handle a disjoint set of rows). If called with nprocs larger than 1 on a non-OpenMP build of pySIML, the method prints a warning to stderr and computes with one CPU.

To retrieve the result block, call retrieveAsyncResult().

Note that this function is actually synchronous, due to the limitations of running on the CPU; it will not return until the block has been completely calculated.

getTanimotoRow(row)

Returns the single Tanimoto row row corresponding to comparing every SMILES string in the query set with the single reference SMILES string having index row in the reference set.

getTanimotoRow_async(row)

Computes the single Tanimoto row row corresponding to comparing every SMILES string in the query set with the single reference SMILES string having index row in the reference set, and stores this row internally as the last asynchronous result value.

To retrieve the result row, call retrieveAsyncResult().

Note that this function is actually synchronous, due to the limitations of running on the CPU; it will not return until the row has been completely calculated.

retrieveAsyncResult()

Returns result from last asynchronous computation (getTanimotoRow_async() or getMultipleRows_async()).

Note that this result is only guaranteed to be valid if no operations have been run on this CPULingo object since the asynchronous call, except for asyncOperationsDone() and retrieveAsyncResult().

If no asynchronous operations have been invoked on this object, result is undefined.

set_qsmiles(qsmilesmat, qcountsmat, querylengths, querymags=None)

Sets the query SMILES set to use Lingo matrix qsmilesmat, count matrix qcountsmat, and length vector querylengths. If querymags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed from the count matrix.

set_refsmiles(refsmilesmat, refcountsmat, reflengths, refmags=None)

Sets the reference SMILES set to use Lingo matrix refsmilesmat, count matrix refcountsmat, and length vector reflengths. If refmags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed from the count matrix.

supportsParallel()

Return True if this installation of pySIML was built with OpenMP support for parallel calculations.

Note that even if this function returns False, the getMultipleRows() and getMultipleRows_async() methods can still be called with nprocs > 1, but only one processor will actually be used.

pysiml.GPULingo - Computing LINGO similarities on a CUDA-capable GPU

This module exposes the API for computing LINGO similarities on a CUDA-capable GPU. It uses the pycuda library to interface with the GPU; in particular, due to bugs related to context management in pycuda 0.93 and before, pycuda 0.94 or greater is required.

The GPULingo object is the interface to compute LINGOs on a single GPU. To do similarity calculations on multiple GPUs, create multiple GPULingo objects, passing a different CUDA device ID to each one’s constructor:

from pysiml.GPULingo import GPULingo

gpu0 = GPULingo(0)
gpu1 = GPULingo(1)

By using the asynchronous operations on each object (getTanimotoRow_async() and getMultipleRows_async()), similarity calculations can be carried out simultaneously on multiple GPUs using only one host thread:

# gpu0 and gpu1 have been initialized with reference and query SMILES sets

# Carry out simultaneous computation of rows 0 through 9 of each set on both
# GPUs (rowlimit is exclusive)
gpu0.getMultipleRows_async(0,10)
gpu1.getMultipleRows_async(0,10)

# The busy waits could be replaced by a sleep, or any other work
while not gpu0.asyncOperationsDone():
    pass
gpu0result = gpu0.retrieveAsyncResult()

while not gpu1.asyncOperationsDone():
    pass
gpu1result = gpu1.retrieveAsyncResult()

After an asynchronous computation has been requested on a GPULingo object, check asyncOperationsDone() to see when the job is complete. Once the job is done, retrieveAsyncResult() can be called to retrieve the result. Note that the retrieved result is guaranteed to be valid only if no methods were called on the GPULingo object after the asynchronous request, except for asyncOperationsDone() and retrieveAsyncResult().
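
The busy waits in the example above can be replaced by a polling loop that sleeps between checks. The following sketch assumes only the two methods described here, asyncOperationsDone() and retrieveAsyncResult(); the helper collect_async_results and the stand-in class FakeComparator are hypothetical and exist only so that the snippet runs without a GPU.

```python
import time


def collect_async_results(comparators, poll_interval=0.01):
    """Poll every comparator until it is done; return results in input order."""
    results = [None] * len(comparators)
    pending = set(range(len(comparators)))
    while pending:
        for i in sorted(pending):
            if comparators[i].asyncOperationsDone():
                # Retrieve immediately so that no other method is called
                # on this object between completion and retrieval.
                results[i] = comparators[i].retrieveAsyncResult()
                pending.discard(i)
        if pending:
            time.sleep(poll_interval)
    return results


class FakeComparator:
    """Hypothetical stand-in for a GPULingo object (not part of pySIML)."""

    def __init__(self, polls_until_done, result):
        self._polls_left = polls_until_done
        self._result = result

    def asyncOperationsDone(self):
        # Reports completion only after polls_until_done unsuccessful polls.
        self._polls_left -= 1
        return self._polls_left < 0

    def retrieveAsyncResult(self):
        return self._result
```

With real objects, the call would follow the asynchronous requests, for example collect_async_results([gpu0, gpu1]) after both getMultipleRows_async calls have been issued.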

GPULingo object documentation

class pysiml.GPULingo.GPULingo(deviceID=0)

Object to handle computation of LINGO similarities on the GPU with CUDA device ID deviceID

asyncOperationsDone()

Return True if all asynchronous operations on this object have completed.

getMultipleRows(rowbase, rowlimit)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set, and returns this block of rows.

This method is synchronous (it will not return until the block has been completely computed).

getMultipleRows_async(rowbase, rowlimit)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set, and stores this block as the most recent asynchronous result.

This method is asynchronous (it will return before the block has been completely computed). After calling this method, check asyncOperationsDone(); once that method returns True, the result may be retrieved by calling retrieveAsyncResult().

getTanimotoRow(row)

Returns the single Tanimoto row row corresponding to comparing every SMILES string in the query set with the single reference SMILES string having index row in the reference set.

This method is synchronous (it will not return until the entire row has been computed and brought back from the GPU).

getTanimotoRow_async(row)

Compute the single Tanimoto row row corresponding to comparing every SMILES string in the query set with the single reference SMILES string having index row in the reference set, and store it as the most recent asynchronous result.

This method is asynchronous (it will return before the row has been completely computed). After calling this method, check asyncOperationsDone(); once that method returns True, the result may be retrieved by calling retrieveAsyncResult().

getMultipleHistogrammedRows(rowbase, rowlimit)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set. Histograms each row into its own histogram of 101 bins with boundaries 0, 0.01, 0.02, ..., 0.99, 1.0, 1.01. Returns this block of row-wise histograms.

This method is synchronous (it will not return until the histograms have been completely computed).
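
As an illustration of the binning described above, the following pure-Python sketch histograms one row of similarities into 101 bins with boundaries 0, 0.01, ..., 1.0, 1.01. It is not the GPU implementation; in particular, it assumes that each bin is half-open (lower boundary included, upper boundary excluded), which the text does not state. The name histogram_row is illustrative.

```python
from bisect import bisect_right

# 102 boundaries delimit 101 bins: 0, 0.01, ..., 1.0, 1.01
BOUNDARIES = [i / 100 for i in range(102)]


def histogram_row(similarities):
    """Count similarities per bin, assuming half-open bins [lower, upper)."""
    bins = [0] * 101
    for value in similarities:
        # Index of the last boundary that is <= value.
        index = bisect_right(BOUNDARIES, value) - 1
        if 0 <= index < 101:
            bins[index] += 1
    return bins
```

Under this assumption, a similarity of exactly 1.0 falls into the last bin (index 100), which is why the boundaries extend to 1.01.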

getMultipleHistogrammedRows_async(rowbase, rowlimit)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set. Histograms each row into its own histogram of 101 bins with boundaries 0, 0.01, 0.02, ..., 0.99, 1.0, 1.01. Stores this block of row-wise histograms as the most recent asynchronous result.

This method is asynchronous (it will return before the block has been completely computed). After calling this method, check asyncOperationsDone(); once that method returns True, the result may be retrieved by calling retrieveAsyncResult().

getNeighbors(rowbase, rowlimit, lowerbound, upperbound=1.1, maxneighbors=None)

For each reference SMILES string with index in rowbase:rowlimit (i.e., strings with index rowbase, rowbase+1, ..., rowlimit-1), finds all SMILES in the query set that have LINGO similarity >= lowerbound and < upperbound (“neighbors”), up to a maximum of maxneighbors (by default, the size of the query set).

Result is a tuple of (matrix,vector). The vector contains, for each reference string in rowbase:rowlimit, the number of neighbors found. The matrix is of size (rowlimit-rowbase, maxNeighborsFound), where maxNeighborsFound is the maximum value in the returned vector. Each row of the matrix (corresponding to one reference SMILES string) has as its elements the query indices of neighbors. In row i, only the first vector[i] elements are valid (that is, the elements of the matrix beyond the number of neighbors found for a given row are undefined).

This method is synchronous (it will not return until the neighbors have been completely computed). It returns the tuple (neighborMatrix,neighborCountVector).
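
Since only the first vector[i] entries of row i are meaningful, it is often convenient to trim the padded matrix into per-reference lists. The following sketch shows one way to do this; neighbor_lists is a hypothetical helper, not part of pySIML, and the sample data stand in for a real getNeighbors result (the -1 values represent undefined padding).

```python
def neighbor_lists(neighbor_matrix, neighbor_counts):
    """Trim each matrix row to its valid prefix of query indices."""
    return [
        list(row[:count])
        for row, count in zip(neighbor_matrix, neighbor_counts)
    ]


# Stand-in for a (matrix, vector) result covering three reference strings.
sample_matrix = [[3, 7, -1], [2, -1, -1], [0, 4, 9]]
sample_counts = [2, 1, 3]
trimmed = neighbor_lists(sample_matrix, sample_counts)
```

Entry i of the returned list holds the query indices of the neighbors of reference string rowbase+i.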

getNeighbors_async(rowbase, rowlimit, lowerbound, upperbound=1.1, maxneighbors=None)

For each reference SMILES string with index in rowbase:rowlimit (i.e., strings with index rowbase, rowbase+1, ..., rowlimit-1), finds all SMILES in the query set that have LINGO similarity >= lowerbound and < upperbound (“neighbors”), up to a maximum of maxneighbors (by default, the size of the query set).

Result is a tuple of (matrix,vector). The vector contains, for each reference string in rowbase:rowlimit, the number of neighbors found. The matrix is of size (rowlimit-rowbase, maxNeighborsFound), where maxNeighborsFound is the maximum value in the returned vector. Each row of the matrix (corresponding to one reference SMILES string) has as its elements the query indices of neighbors. In row i, only the first vector[i] elements are valid (that is, the elements of the matrix beyond the number of neighbors found for a given row are undefined).

This method is asynchronous (it will return before the block has been completely computed). After calling this method, check asyncOperationsDone(); once that method returns True, the result pair may be retrieved by calling retrieveAsyncResult().

retrieveAsyncResult()

Returns result from last asynchronous computation (getTanimotoRow_async(), getMultipleRows_async(), getMultipleHistogrammedRows_async(), or getNeighbors_async()).

Note that this result is only guaranteed to be valid if no operations have been run on this object since the asynchronous call, except for asyncOperationsDone() and retrieveAsyncResult().

If no asynchronous operations have been invoked on this object, result is undefined. If an asynchronous operation is still pending, this method will block until completion.

set_qsmiles(qsmilesmat, qcountsmat, qlengths[, qmags])

Sets the query SMILES set to use Lingo matrix qsmilesmat, count matrix qcountsmat, and length vector qlengths. If qmags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed (on the GPU) from the count matrix.

Because of hardware limitations, the query matrices (qsmilesmat and qcountsmat) must have no more than 65,536 rows (molecules) and 32,768 columns (Lingos). Larger computations must be performed in tiles.

set_refsmiles(refsmilesmat, refcountsmat, reflengths[, refmags])

Sets the reference SMILES set to use Lingo matrix refsmilesmat, count matrix refcountsmat, and length vector reflengths. If refmags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed (on the GPU) from the count matrix.

Because of hardware limitations, the reference matrices (refsmilesmat and refcountsmat) must have no more than 32,768 rows (molecules) and 65,536 columns (Lingos). Larger computations must be performed in tiles.
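
The tiling mentioned above can be organized by splitting the reference and query index ranges into chunks that respect the row limits. The following sketch generates such tiles using the row limits quoted in this section as defaults; it does not address the column (Lingo) limits, and the tiles function and its constants are hypothetical names, not part of pySIML.

```python
MAX_REF_ROWS = 32768    # reference row limit quoted above
MAX_QUERY_ROWS = 65536  # query row limit quoted above


def tiles(num_ref, num_query, max_ref=MAX_REF_ROWS, max_query=MAX_QUERY_ROWS):
    """Yield (ref_start, ref_stop, query_start, query_stop) half-open tiles."""
    for r0 in range(0, num_ref, max_ref):
        r1 = min(r0 + max_ref, num_ref)
        for q0 in range(0, num_query, max_query):
            yield (r0, r1, q0, min(q0 + max_query, num_query))
```

For each tile, the corresponding row slices of the reference and query matrices and vectors would be passed to set_refsmiles and set_qsmiles, and the block obtained from getMultipleRows(0, ref_stop - ref_start) would fill rows ref_start:ref_stop and columns query_start:query_stop of the full similarity matrix.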

pysiml.OCLLingo - Computing LINGO similarities on an OpenCL-capable GPU or CPU

Very Beta - only getMultipleRows currently supported

This module exposes the API for computing LINGO similarities on an OpenCL-capable GPU, CPU, or other accelerator device. It uses the pyopencl library to interface with OpenCL.

The OCLLingo object is the interface to compute LINGOs on a single OpenCL device. Multiple OCLLingo objects can be used (on the same device or multiple devices); in particular, similarity calculations may be parallelized across multiple GPUs by creating multiple OCLLingo objects, one per device. To build an OCLLingo object, an OpenCL device (obtained from an OpenCL Platform using pyopencl) must be passed to the constructor:

import pyopencl as cl
from pysiml.OCLLingo import OCLLingo

platform = cl.get_platforms()[0] # Use first platform
dev0 = platform.get_devices()[0]
dev1 = platform.get_devices()[1]
gpu0 = OCLLingo(dev0)
gpu1 = OCLLingo(dev1)

By using the asynchronous operations on each object (getTanimotoRow_async() and getMultipleRows_async()), similarity calculations can be carried out simultaneously on multiple GPUs using only one host thread:

# gpu0 and gpu1 have been initialized with reference and query SMILES sets

# Carry out simultaneous computation of rows 0 through 9 of each set on both
# GPUs (rowlimit is exclusive)
gpu0.getMultipleRows_async(0,10)
gpu1.getMultipleRows_async(0,10)

# The busy waits could be replaced by a sleep, or any other work
while not gpu0.asyncOperationsDone():
    pass
gpu0result = gpu0.retrieveAsyncResult()

while not gpu1.asyncOperationsDone():
    pass
gpu1result = gpu1.retrieveAsyncResult()

After an asynchronous computation has been requested on an OCLLingo object, check asyncOperationsDone() to see when the job is complete. Once the job is done, retrieveAsyncResult() can be called to retrieve the result. Note that the retrieved result is guaranteed to be valid only if no methods were called on the OCLLingo object after the asynchronous request, except for asyncOperationsDone() and retrieveAsyncResult().

OCLLingo object documentation

class pysiml.OCLLingo.OCLLingo(device)

Object to handle computation of LINGO similarities on the OpenCL device device

asyncOperationsDone()

Return True if all asynchronous operations on this object have completed.

getMultipleRows(rowbase, rowlimit)

Computes multiple Tanimoto rows rowbase:rowlimit corresponding to comparing every SMILES string in the query set with the reference SMILES strings having index rowbase, rowbase+1, ..., rowlimit-1 in the reference set, and returns this block of rows.

This method is synchronous (it will not return until the block has been completely computed).

set_qsmiles(qsmilesmat, qcountsmat, qlengths[, qmags])

Sets the query SMILES set to use Lingo matrix qsmilesmat, count matrix qcountsmat, and length vector qlengths. If qmags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed (on the device) from the count matrix.

Because of hardware limitations, the query matrices (qsmilesmat and qcountsmat) must have no more than 65,536 rows (molecules) and 32,768 columns (Lingos). Larger computations must be performed in tiles.

set_refsmiles(refsmilesmat, refcountsmat, reflengths[, refmags])

Sets the reference SMILES set to use Lingo matrix refsmilesmat, count matrix refcountsmat, and length vector reflengths. If refmags is provided, it will be used as the magnitude vector; else, the magnitude vector will be computed (on the GPU) from the count matrix.

Because of hardware limitations, the reference matrices (refsmilesmat and refcountsmat) must have no more than 32,768 rows (molecules) and 65,536 columns (Lingos). Larger computations must be performed in tiles.