<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">'''
To USE this code, from a client perspective, all you want to do is

&gt;&gt; from metrics import RMSD, Dihedral, Contact

Nothing else in this modules' namespace will be useful to you as a client.

and then for example

&gt;&gt; rmsdtraj1 = RMSD.prepare_trajectory(traj, atomindices)
&gt;&gt; RMSD.one_to_all(rmsdtraj1, rmsdtraj1, 0)
&gt;&gt; dihedraltraj1 = Dihedral.prepare_trajectory(traj)
&gt;&gt; Dihedral.one_to_all(dihedraltraj1, dihedraltraj1, 0)

this would compute the distances from frame 0 to all other frames under both
the rmsd metric and dihedral metric. There are a lot more options and ways you can
calcuate distances (euclidean distance vs. cityblock vs pnorm, etc etc.) and select
the frames you care about (one_to_all(), one_to_many(), many_to_many(), all_to_all(), etc).

NOTE: Because the code leverages inheritance, if you just casually browse the code
for Dihedral for example, you ARE NOT going to see all methods that the class
actually implements. I would browsing the docstrings in ipython.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-=-=-=

To DEVELOP on top of this code, just implement some classes that inherit from
either AbstractDistanceMetric or Vectorized. In particular, if you inherit
from Vectorized, you get a fully functional DistanceMetric class almost for
free. All you have to do is define a prepare_trajectory() fuction.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This should be an (almost) drop in replacement for
msmbuilder's DistanceMetic.py, but I think it is much
easier to extend.

Changes (combared to msmbuilder.DistanceMetric):
(1) Moved all the RMSD specific code in DistanceMetric.py
    into the RMSD class. (TheoData stuff). I also
    eliminated the TheoData getters and setters because this is python and
    everything is public / we're all consenting adults.
(2) All distance metrics now inherit from the abstract base
    class AbstractDistanceMetric, and define (at minimum) the
    methods prepare_trajectory(), one_to_many() and one_to_all()
(3) Dihedral and Contact implement the same interface
    as RMSD
(4) Dihedral and Contact share code by subclassing
    Vectorized, which lets you use any distance metric on vectors
    (euclidean, manhattan, chebychev, pnorm, etc) that you like. All
    that Dihedral and Contact have to actually implement
    themselves is their prepare_trajectory function which processes the
    trajectory into a bunch of vectors.
(5) So, anyone can really easily make a new Vectorized. All you have to
    do is put the code that prepares the vectors into a function called
    prepare_trajectory and put that into a class that subclasses Vectorized.
    Furthermore, if you would like to set the default metric used to not be
    euclidean (for instance in Contact since the prepared_trajectories are
    boolean vectors, you dont want euclidean to be the default), you set the
    the class variable 'default_scipy_metric' to be whatever you like.
    Thats all there is to it!
'''


import abc
import re
from euclid import _rmsdcalc # this is a forked version of the msmbuilder rmsdcalc
# with a new method that does the one_to_many efficiently inside the c code
# since python array slicing can be really slow. Big speedup using this
# for the drift calculation
import numpy as np
import itertools
import scipy.spatial.distance
import warnings
from collections import defaultdict, namedtuple
from numbers import Number
from euclid import _distance_wrap
from euclid.geometry import dihedral as _dihedralcalc
from euclid.geometry import contact as _contactcalc
from euclid.geometry import rg as _rgcalc
#######################################################
# Toggle to use fast (no typechecking, so unsafe too)
# version of cdist. Also parallelized with OMP pragmas
#######################################################
USE_FAST_CDIST = True
#USE_FAST_CDIST = False
#######################################################

def fast_cdist(XA, XB, metric='euclidean', p=2):
    '''The same code as scipy.spatial.distance.cdist, except with less typechecking
    so this might throw a segfault, but oh well
    '''
    
    mA = XA.shape[0]
    mB = XB.shape[0]
    
    if mB &gt; mA:
        raise Exception('The parallelism is the other way. switch them around.')
    #if not ((XA.dtype == np.double and XB.dtype == np.double) or (XA.dtype == np.bool and XB.dtype == np.bool)):
    #    print XA.dtype == np.bool
    #    print XB.dtype == np.bool
    #    raise Exception('Types %s, %s' % (XA.dtype, XB.dtype))
    #if not XA.flags.contiguous and XB.flags.contiguous:
    #    raise Exception('Contiguous')
    
    
    dm = np.empty((mA, mB), dtype=np.double)
    #dm = np.zeros((mA, mB), dtype=np.double)
        
    if metric == 'euclidean':
        _distance_wrap.cdist_euclidean_wrap(XA, XB, dm)
    elif metric == 'cityblock':
        _distance_wrap.cdist_city_block_wrap(XA, XB, dm)
    elif metric == 'sqeuclidean':
        _distance_wrap.cdist_euclidean_wrap(XA, XB, dm)
        dm **= 2.0
    elif metric == 'hamming':
        if XA.dtype == np.bool:
            _distance_wrap.cdist_hamming_bool_wrap(XA, XB, dm)
        else:
            _distance_wrap.cdist_hamming_wrap(XA, XB, dm)
    elif metric == 'chebychev':
        _distance_wrap.cdist_chebyshev_wrap(XA, XB, dm)
    elif metric == 'minkowski':
        _distance_wrap.cdist_minkowski_wrap(XA, XB, dm, p)
    elif metric == 'wminkowski':
        _distance_wrap.cdist_weighted_minkowski_wrap(XA, XB, dm)
    elif metric == 'cosine':
        normsA = np.sqrt(np.sum(XA * XA, axis=1))
        normsB = np.sqrt(np.sum(XB * XB, axis=1))
        _distance_wrap.cdist_cosine_wrap(XA, XB, dm, normsA, normsB)
    elif metric == 'braycurtis':
        _distance_wrap.cdist_bray_curtis_wrap(XA, XB, dm)
    elif metric == 'canberra':
        _distance_wrap.cdist_canberra_wrap(XA, XB, dm)
    elif metric == 'dice':
        _distance_wrap.cdist_dice_bool_wrap(XA, XB, dm)
    elif metric == 'kulsinki':
        _distance_wrap.cdist_kulsinski_bool_wrap(XA, XB, dm)
    elif metric == 'matching':
        _distance_wrap.cdist_matching_bool_wrap(XA, XB, dm)
    elif metric == 'rogerstanimoto':
        _distance_wrap.cdist_rogerstanimoto_bool_wrap(XA, XB, dm)
    elif metric == 'russellrao':
        _distance_wrap.cdist_russellrao_bool_wrap(XA, XB, dm)
    elif metric == 'sokalmichener':
        _distance_wrap.cdist_sokalmichener_bool_wrap(XA, XB, dm)
    elif metric == 'sokalsneath':
        _distance_wrap.cdist_sokalsneath_bool_wrap(XA, XB, dm)
    elif metric == 'yule':
        _distance_wrap.cdist_yule_bool_wrap(XA, XB, dm)
    elif metric == 'correlation':
        return scipy.spatial.distance.cdist(XA, XB, 'correlation')
    else:
        raise ValueError('Unknown Distance Metric: %s' % metric)
    
    return dm
    
if USE_FAST_CDIST:
    cdist = fast_cdist
else:
    import scipy.spatial.distance
    cdist = scipy.spatial.distance.cdist
#print 'in metrics library, USE_FAST_CDIST is set to', USE_FAST_CDIST


class AbstractDistanceMetric(object):
    '''Abstract base class for distance metrics. All distance metrics should
    inherit from this abstract class. If you're not familier with abstract
    superclasses from C++ or java (pretty similar to java interfaces too), it's
    basically a class that's "abstract" in the sense that you can't initialize
    an instance of AbstractDistanceMetric, and when you subclass it you're
    REQUIRED to implement/overriding the abstract methods yourself.
    
    If you want, your subclass can also implement more operations (see 
    Vectorized for example)'''
    
    __metaclass__ = abc.ABCMeta
    
    @abc.abstractmethod
    def prepare_trajectory(self, trajectory):
        '''Prepare trajectory on a format that is more conventient to take
        distances on. For RMSD, this is going to mean making word-aligned padded
        arrays (TheoData) suitable for faste calculation, for dihedral-space
        distances means computing the dihedral angles, etc.'''
        
        return
        
    
    @abc.abstractmethod
    def one_to_all(self, prepared_traj1, prepared_traj2, index1):
        '''Calculate the vector of distances from the index1th frame of
        prepared_traj1 to all of the frames in prepared_traj2.
        
        Note: Although this might seem to be a special case of one_to_many(), it
        can often be implemented in a much more optimized way because it doesn't
        require construction of the indices2 array and array slicing in python
        is kindof slow.
        
        Should return a vector of distances of length len(prepared_traj2)'''
        
        return
        
    
    def one_to_many(self, prepared_traj1, prepared_traj2, index1, indices2):
        '''Calculate the a vector of distances from the index1th frame of
        prepared_traj1 to all of the indices2 frames of prepared_traj2. Should
        return a vector of distances of length len(indices2)
        
        Subclass can often provide a more efficient implementation that the
        niave one below.
        '''
        return self.one_to_all(prepared_traj1, prepared_traj2, index)[indices2]
        
    
    def all_pairwise(self, prepared_traj):
        '''Calculate condensed distance metric of all pairwise distances'''
        
        traj_length = len(prepared_traj)
        output = -1 * np.ones(traj_length * (traj_length - 1) / 2)
        p = 0
        for i in range(traj_length):
            cmp_indices = np.arange(i + 1, traj_length)
            output[p: p + len(cmp_indices)] = self.one_to_many(prepared_traj, prepared_traj, i, cmp_indices)
            p += len(cmp_indices)
        return output


class RMSD(AbstractDistanceMetric):
    '''Concrete implementation of the AbstractDistanceMetric abstract base class
    for calculation of RMSD calculation.
    
    This is a slightly refactored version of the code msmbuilder DistanceMetric.py
    code, and uses a slightly modifed version of the rmsdcalc C extension code
    to support the one_to_many() method without using the slow python array slice
    operation.
    
    Note: I removed the getters and setters from the TheoData class. See the
    comments for details.'''
    
    class TheoData(object):
        """Stores temporary data required during Theobald RMSD calculation.
        
        Notes:
        Storing temporary data allows us to avoid re-calculating the G-Values
        repeatedly. Also avoids re-centering the coordinates."""
        
        Theoslice = namedtuple('TheoSlice', ('xyz', 'G'))
        def __init__(self, XYZData, NumAtoms=None, G=None):
            """Create a container for intermediate values during RMSD Calculation.
            
            Notes:
            1.  We remove center of mass.
            2.  We pre-calculate matrix magnitudes (ConfG)"""
            
            if NumAtoms is None or G is None:
                NumConfs=len(XYZData)
                NumAtoms=XYZData.shape[1]
            
                self.centerConformations(XYZData)
            
                NumAtomsWithPadding=4+NumAtoms-NumAtoms%4
            
                # Load data and generators into aligned arrays
                XYZData2 = np.zeros((NumConfs, 3, NumAtomsWithPadding), dtype=np.float32)
                for i in range(NumConfs):
                    XYZData2[i,0:3,0:NumAtoms] = XYZData[i].transpose()
                
                #Precalculate matrix magnitudes
                ConfG = np.zeros((NumConfs,),dtype=np.float32)
                for i in xrange(NumConfs):
                    ConfG[i] = self.calcGvalue(XYZData[i,:,:])
                
                self.XYZData=XYZData2
                self.G=ConfG
                self.NumAtoms=NumAtoms
                self.NumAtomsWithPadding=NumAtomsWithPadding
                self.CheckCentered()
            else:
                self.XYZData = XYZData
                self.G = G
                self.NumAtoms = NumAtoms
                self.NumAtomsWithPadding = XYZData.shape[2]
        
        
        # IMO, getters and setters are unpythonic and should be avoided. Given
        # the python maxim that "we're all consenting adults", and the public-ness
        # of the variables anyways, these are just bloat. With that in mind, if
        # you're going to modify the self.G and self.XYZData instance variables
        # directly, make sure you know what you're doing. If you modify them
        # in an unexpected way, the code will segfault. It will not throw an
        # informative python exception, so be careful.
        
        #def GetData(self):
        #    """Returns the XYZ coordinate data stored."""
        #    return(self.XYZData)
        #    
        #def GetG(self):
        #    """Return the matrix magnitudes stored."""
        #    return(self.G)
        #    
        #def SetData(self, XYZData, G):
        #    """Modify the data in self.
        #    
        #    Notes:
        #    For performance, this is done WITHOUT error checking.
        #    Only swap in data that is compatible (in shape) with previous data.
        #    """
        #    self.XYZData=XYZData
        #    self.G=G
        
        def __getitem__(self, key):
            # to keep the dimensions right, we make everything a slice
            if isinstance(key, int):
                key = slice(key, key+1)
            return RMSD.TheoData(self.XYZData[key], NumAtoms=self.NumAtoms, G=self.G[key])
        
        def __setitem__(self, key, value):
            self.XYZData[key] = value.XYZData
            self.G[key] = value.G
            
        
        def CheckCentered(self, Epsilon=1E-5):
            """Raise an exception if XYZAtomMajor has nonnzero center of mass(CM)."""
            
            XYZ=self.XYZData.transpose(0,2,1)
            x=np.array([max(abs(XYZ[i].mean(0))) for i in xrange(len(XYZ))]).max()
            if x&gt;Epsilon:
                raise Exception("The coordinate data does not appear to have been centered correctly.")
        
        @staticmethod
        def centerConformations(XYZList):
            """Remove the center of mass from conformations.  Inplace to minimize mem. use."""
            
            for ci in xrange(XYZList.shape[0]):
                X=XYZList[ci].astype('float64')#To improve the accuracy of RMSD, it can help to do certain calculations in double precision.  This is _not_ one of those operations IMHO.
                X-=X.mean(0)
                XYZList[ci]=X.astype('float32')
            return
        
        @staticmethod
        def calcGvalue(XYZ):
            """Calculate the sum of squares of the key matrix G.  A necessary component of Theobold RMSD algorithm."""
            
            conf=XYZ.astype('float64')#Doing this operation in double significantly improves numerical precision of RMSD
            G = 0
            G += np.dot(conf[:,0],conf[:,0])
            G += np.dot(conf[:,1],conf[:,1])
            G += np.dot(conf[:,2],conf[:,2])
            return G
            
        def __len__(self):
            return len(self.XYZData)
    
    
    def __init__(self, atomindices, omp_parallel=True):
        '''Use the following atomindices'''
        self.atomindices = atomindices
        self.omp_parallel = omp_parallel
    
    def __repr__(self):
        return 'metrics.RMSD(atom_indices=%s, omp_parallel=%s)' % (repr(list(self.atomindices)), self.omp_parallel)
    
    def prepare_trajectory(self, trajectory):
        '''Prepare the trajectory for RMSD calculation.
        inputs: msmbuilder "Trajectory" object and an array of the atomindices
        that you care about.
        
        Returns: a TheoData object'''
        
        return self.TheoData(trajectory['XYZList'][:,self.atomindices])
    
    
    def one_to_many(self, prepared_traj1, prepared_traj2, index1, indices2):
        '''Calculate a vector of distances from the index1th frame of prepared_traj1
        to the frames in prepared_traj2 with indices "indices2". If the omp_parallel
        optional argument is True, we use shared-memory parallelization in C to do
        this faster. Using omp_parallel = False is advised if indices2 is a short
        list and you are paralellizing your algorithm (say via mpi) at a different
        level.
        
        Returns: a vector of distances of length len(indices2)'''
        if isinstance(indices2, list):
            indices2 = np.array(indices2)
        if not isinstance(prepared_traj1, RMSD.TheoData):
            raise TypeError('Theodata required')
        if not isinstance(prepared_traj2, RMSD.TheoData):
            raise TypeError('Theodata required')
        
        if self.omp_parallel:
            return _rmsdcalc.getMultipleRMSDs_aligned_T_g_at_indices(
                      prepared_traj1.NumAtoms, prepared_traj1.NumAtomsWithPadding,
                      prepared_traj1.NumAtomsWithPadding, prepared_traj2.XYZData,
                      prepared_traj1.XYZData[index1], prepared_traj2.G,
                      prepared_traj1.G[index1], indices2)
        else:
            return _rmsdcalc.getMultipleRMSDs_aligned_T_g_at_indices_serial(
                      prepared_traj1.NumAtoms, prepared_traj1.NumAtomsWithPadding,
                      prepared_traj1.NumAtomsWithPadding, prepared_traj2.XYZData,
                      prepared_traj1.XYZData[index1], prepared_traj2.G,
                      prepared_traj1.G[index1], indices2)
        
    
    
    def one_to_all(self, prepared_traj1, prepared_traj2, index1):
        '''Calculate a vector of distances from the index1th frame of prepared_traj1
        to all the frames in prepared_traj2. This always uses OMP parallelization.
        
        If you really don't want OMP paralellization (why not?), then you can modify
        the C code yourself.
        
        Returns: a vector of distances of length len(indices2)'''
        
        return _rmsdcalc.getMultipleRMSDs_aligned_T_g(
            prepared_traj1.NumAtoms, prepared_traj1.NumAtomsWithPadding,
            prepared_traj1.NumAtomsWithPadding, prepared_traj2.XYZData,
            prepared_traj1.XYZData[index1], prepared_traj2.G,
            prepared_traj1.G[index1])
    
    
    def _square_all_pairwise(self, prepared_traj):
        '''Reference implementation of all_pairwise so that we can check the code
        above'''
        traj_length = prepared_traj.XYZData.shape[0]
        output = np.empty((traj_length, traj_length))
        for i in xrange(traj_length):
            output[i] = self.one_to_all(prepared_traj, prepared_traj, i)
        return output


class Vectorized(AbstractDistanceMetric):
    '''This is a subclass of AbstractDistanceMetric for computing distances in
    an arbitrary vector space under a WIDE VARIETY of vector distance metrics
    (using the scipy.spatial.distance library). You don't want to (and won't
    even be able to) instantiate a Vectorized instance directly, instead you
    want to sublcass it.
    
    The trick is in order to be a full featured DistanceMetric, a subclass of
    Vectorized only has to implement the prepared_trajectory() method. You
    get the one_to_many(), one_to_all() (and more) for free.
    
    When you subclass this metric, you should probably override the class variables
    allowable_scipy_metrics, default_scipy_metric and default_scipy_p as well.
    
    allowable_scipy_metrics gives the list of metrics which your client
    can use. If the vector space that you're projecting your trajectory onto is 
    just a space of boolean vectors, then you probably don't want to allow eulcidean
    distance for instances.
    
    default_scipy_metric is the metric that will be used by your default metric
    if the user leaves the "metric" field blank/unspecified.
    
    default_scipy_p is the default value of "p" that will be used if left 
    unspecified. the value "p" is ONLY used for the minkowski (pnorm) metric, so
    otherwise the scipy.spatial.distance code ignores it anyways.
    
    See http://docs.scipy.org/doc/scipy/reference/spatial.distance.html for a
    description of all the distance metrics and how they work.
    '''
    
    allowable_scipy_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
                               'correlation', 'cosine', 'euclidean', 'minkowski',
                               'sqeuclidean','dice', 'kulsinki', 'matching',
                               'rogerstanimoto', 'russellrao', 'sokalmichener',
                               'sokalsneath', 'yule']
    
    def __init__(self, metric='euclidean', p=2):
        self._validate_scipy_metric(metric)
        self.metric = metric
        self.p = p
        
    
    def _validate_scipy_metric(self, metric):
        '''Ensure that "metric" is an "allowable" metric (in allowable_scipy_metrics)'''
        if not metric in self.allowable_scipy_metrics:
            raise TypeError('%s is an  unrecognize metric. "metric" must be one of %s' % (metric, str(self.allowable_scipy_metrics)))
            
    
    def one_to_many(self, prepared_traj1, prepared_traj2, index1, indices2):
        """Calculate a vector of distances from the index1th frame of prepared_traj1
        to the frames in prepared_traj2 with indices "indices2", using supplied
        metric and value of p. The field p is unused unless metric="minkowski".
        
        metric can be any of the metrics in the class variable allowable_scipy_metric
        
        Returns: a vector of distances of length len(indices2)
        """
        if not isinstance(index1, int):
            raise TypeError('index1 must be of type int.')
        out = cdist(prepared_traj2[indices2], prepared_traj1[[index1]],
                    metric=self.metric, p=self.p)
                    
        return out[:,0]
    
    def one_to_all(self, prepared_traj1, prepared_traj2, index1):
        """Calculate a vector of distances from the index1th frame of prepared_traj1
        to the frames in prepared_traj2 with indices "indices2", using supplied
        metric and value of p. The field p is unused unless metric="minkowski".
        
        metric can be any of the metrics in the class variable allowable_scipy_metric
        
        Returns: a vector of distances of length len(prepared_traj2)"""
        
        if not isinstance(index1, int):
            raise TypeError('index1 must be of type int.')
        out2 = cdist(prepared_traj2, prepared_traj1[[index1]], metric=self.metric, p=self.p)
        return out2[:,0]
        
    
    def many_to_many(self, prepared_traj1, prepared_traj2, indices1, indices2):
        """Calculate a MATRIX of distances from the frames in prepared_traj1 with
        indices "indices1" to the frames in prepared_traj2 with indices "indices2",
        using supplied metric and value of p. The field p is unused unless
        metric="minkowski".
        
        metric can be any of the metrics in the class variable allowable_scipy_metric
        
        Returns: a 2D array of shape len(prepared_traj1) * len(prepared_traj2)"""
        
        
        out = cdist(prepared_traj1[indices1], prepared_traj2[indices2], metric=self.metric, p=self.p)
        return out
        
    def all_to_all(self, prepared_traj1, prepared_traj2):
        """Calculate a MATRIX of distances from the frames in prepared_traj1 to 
        all the frames in prepared_traj2 using supplied metric and value of p.
        The field p is unused unless metric="minkowski".
        
        metric can be any of the metrics in the class variable allowable_scipy_metric
        
        Returns: a 2D array of shape len(prepared_traj1) * len(prepared_traj2)"""
        
        if prepared_traj1 is prepared_traj2:
            warnings.warn('runtime', re.sub("\s+", " ", """it's not recommended to
            use this method to calculate the full pairwise distance matrix for
            one trajectory to itself (as you're doing). Use all_pairwise, which
            will be more efficient if you reall need the results as a 2D matrix
            (why?) then you can always use scipy.spatial.distance.squareform()
            on the output of all_pairwise()""".replace('\n', ' ')))
            
        out = cdist(prepared_traj1, prepared_traj2, metric=self.metric, p=self.p)
        return out                                        
        
    def all_pairwise(self, prepared_traj):
        '''Calculate a "condensed" distance matrix of all the pairwise distances
        between each frame with each other frame in prepared_traj using supplied
        metric and value of p. The field p is unused unless metric="minkowski". 
        
        metric can be any of the metrics in the class variable allowable_scipy_metric
        
        Returns: a 1D vector of length len(pairwise_traj) choose 2 where the i*jth
        entry contains the distance between prepared_traj[i] and prepared_traj[j].
        See the documentation on scipy.spatial.distance.pdist for more details.'''
        
        out = scipy.spatial.distance.pdist(prepared_traj, metric=self.metric, p=self.p)
        return out


class Dihedral(Vectorized, AbstractDistanceMetric):
    '''This is a concerete implementation of AbstractDistanceMetric by way of
    Vectorized, for calculating distances between frames based on their
    projection in dihedral space.
    
    Note: If you're browsing the code, notice that most of the methods on this
    class are actually implemented in the Vectorized superclass'''
    
    allowable_scipy_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
                               'correlation', 'cosine', 'euclidean', 'minkowski',
                               'sqeuclidean']
    
    def __init__(self, metric='euclidean', p=2, angles='phi/psi'):
        super(Dihedral, self).__init__(metric, p)
        self.angles = angles
    
    def __repr__(self):
        return 'metrics.Dihedral(metric=%s, p=%s, angles=%s)' % (self.metric, self.p, self.angles)
    
    def prepare_trajectory(self, trajectory):
        """Prepare the dihedral angle representation of a trajectory, suitable
        for distance calculations.
        
        Returns: a 2D array of dimension len(trajectory) x (2*number of dihedral
        angles per frame), such that in each row, the first half of the entries
        contain the cosine of the dihedral angles and the later dihedral angles
        contain the sine of the dihedral angles. This transform is necessary so
        that distance calculations preserve the periodic symmetry.
        """
        
        traj_length = len(trajectory['XYZList'])
        indices = _dihedralcalc.get_indices(trajectory, self.angles)
        dihedrals = _dihedralcalc.compute_dihedrals(trajectory, indices, degrees=False)
        
        # these dihedrals go between -pi and pi but obviously because of the
        # periodicity, when we take distances we want the distance between -179
        # and +179 to be very close, so we need to do a little transform
        
        num_dihedrals = dihedrals.shape[1]
        transformed = np.empty((traj_length, 2 * num_dihedrals))
        transformed[:, 0:num_dihedrals] = np.cos(dihedrals)
        transformed[:, num_dihedrals:2*num_dihedrals] = np.sin(dihedrals)
        
        return np.double(transformed)
        
    


class ContinuousContact(Vectorized, AbstractDistanceMetric):
    '''This is a concerete implementation of AbstractDistanceMetric by way of
    Vectorized, for calculating distances between frames based on their
    contact maps.
    
    Here each frame is represented as a vector of the distances between pairs
    of residues.
    
    Note: If you're browsing the code, notice that most of the methods on this
    class are actually implemented in the Vectorized superclass'''
    
    allowable_scipy_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
                               'correlation', 'cosine', 'euclidean', 'minkowski',
                               'sqeuclidean']
    
    def __init__(self, metric='euclidean', p=2, contacts='all', scheme='closest-heavy'):
        '''Create a distance calculator based on the distances between pairs of atoms
        in a sturcture -- like the contact map except without casting to boolean.
        
        contacts can be an n by 2 array, where each row is a pair of
        integers giving the indices of 2 residues whose distance you care about.
        
        Alternatively, contacts can be the string 'all'. This is a shortcut for
        supplying a contacts list that includes all (N * N-1) / 2 pairs of the n
        residues.
        
        scheme can be 'CA', 'closest', or 'closest-heavy' and gives
        the sense in which the "distance between two residues" is computed. If
        scheme is 'CA', then we'll use the cartesian distance between the residues'
        C-alpha atoms as their distance for the purpose of calculating whether or
        not they have exceeded the cutoff. If scheme is 'closest', we'll use the
        distance between the closest pair of atoms where one
        belongs to residue i and to residue j. If scheme is 'closest-heavy', we'll
        use the distance between the closest pair of non-hydrogen atoms where one
        belongs to reside i and one to residue j.
        
        'metric' should be the mathematic metric used to compare the vectors of distances.
        it can either be 'braycurtis', 'canberra', 'chebyshev', 'cityblock',
        'correlation', 'cosine', 'euclidean', 'minkowski', 'sqeuclidean'.
        
        If you use 'minkowski', then you have to supply a value for p, the degree.
        otherwise, the value p is ignored.
        '''
        
        super(ContinuousContact, self).__init__(metric, p)
        self.contacts = contacts
        
        scheme = scheme.lower()
        if not scheme in ['ca', 'center', 'closest-heavy']:
            raise ValueError('Unrecognized scheme')
        
        self.scheme = scheme
        
    
    def __repr__(self):
        try:
            contacts_repr = repr(self.contacts.tolist())
        except:
            contacts_repr = repr(self.contacts)
        return 'metrics.ContinuousContact(metric=%s, p=%s, contacts=%s, scheme=%s)' % (self.metric, self.p, contacts_repr, self.scheme)
        
    
    def prepare_trajectory(self, trajectory):
        '''Prepare a trajectory for distance calculations based on the contact map.
        Each frame in the trajectory will be represented by a vector where
        each entries represents the distance between two residues in the structure.
        Depending on what contacts you pick to use, this can be a "native biased" 
        picture or not.
        '''
        
        xyzlist = trajectory['XYZList']        
        traj_length = len(xyzlist)
        num_residues = trajectory.GetNumberOfResidues()
        num_atoms = trajectory.GetNumberOfAtoms()
        
        if self.contacts == 'all':
            contacts = np.empty(((num_residues-2) * (num_residues - 3) / 2, 2), dtype=np.int32)
            p = 0
            for (a,b) in itertools.combinations(range(num_residues),2):
                if max(a,b) &gt; min(a,b) + 2:
                    contacts[p,:] = [a,b]
                    print contacts[p, :]
                    p += 1
            assert p == len(contacts), 'Something went wrong generating "all"'
            
        else:
            num, width = self.contacts.shape
            contacts = self.contacts
            if not width == 2:
                raise ValueError('contacts must be width 2')
            if not (0 &lt; len(np.unique(contacts)) &lt;= num_residues):
                raise ValueError('contacts refers to residues, whos numbering starts at 1')
        
        if self.scheme == 'ca':
            # not all residues have a CA
            #alpha_indices = np.where(trajectory['AtomNames'] == 'CA')[0]
            atom_contacts = np.zeros_like(contacts)
            residue_to_alpha = np.zeros(num_residues) # zero based indexing
            for i in range(num_atoms):
                if trajectory['AtomNames'][i] == 'CA':
                    residue = trajectory['ResidueID'][i] - 1
                    residue_to_alpha[residue] = i
            #print 'contacts (residues)', contacts
            #print 'residue_to_alpja', residue_to_alpha.shape
            #print 'residue_to_alpja', residue_to_alpha
            atom_contacts = residue_to_alpha[contacts]
            #print 'atom_contacts', atom_contacts
            output = _contactcalc.atom_distances(xyzlist, atom_contacts)
        
        elif self.scheme in ['closest', 'closest-heavy']:
            if self.scheme == 'closest':
                residue_membership = [None for i in range(num_residues)]
                for i in range(num_residues):
                    residue_membership[i] = np.where(trajectory['ResidueID'] == i + 1)[0]
            elif self.scheme == 'closest-heavy':
                 residue_membership = [[] for i in range(num_residues)]
                 for i in range(num_atoms):
                     residue = trajectory['ResidueID'][i] - 1
                     if not trajectory['AtomNames'][i].lstrip('0123456789').startswith('H'):
                         residue_membership[residue].append(i)
            
            print 'Residue Membership'
            print residue_membership
            for row in residue_membership:
                for col in row:
                    print "%s-%s" % (trajectory['AtomNames'][col], trajectory['ResidueID'][col]),
                print
            output = _contactcalc.residue_distances(xyzlist, residue_membership, contacts)
        else:
            raise ValueError('This is not supposed to happend!')
            
        return np.double(output)
            
    def DEPRECIATEDprepare_trajectory(self, trajectory):
        '''Prepare a trajectory for distance calculations based on the contact map.
        Each frame in the trajectory will be represented by a vector where
        each entries represents the distance between two residues in the structure.
        Depending on what contacts you pick to use, this can be a "native biased" 
        picture or not.
        
        THIS VERSION IS PURE PYTHON, DEATHLY SLOW, BUT IS USEFUL AS A REFERENCE
        IMPLEMENTATION
        
        Also note that the "all" contacts option here uses all n choose 2 possible
        contacts, and does not ingnore contacts between residues i and i+1 or i
        and i+2
        '''
        
        traj_length = len(trajectory['XYZList'])
        num_residues = trajectory.GetNumberOfResidues()
        
        
        # get all pairs of residues if you supply none
        if self.contacts == 'all':
            contacts = np.empty((num_residues * (num_residues - 1) / 2, 2), dtype='int')
            p = 0
            for i in range(num_residues):
                for j in range(i+1, num_residues):
                    # THE RESIDUEIDs start at 1
                    contacts[p] = [i+1,j+1]
                    p += 1
            del p
        else:
            num, width = self.contacts.shape
            contacts = self.contacts
            if not width == 2:
                raise ValueError('contacts must be width 2')
            if not (0 &lt;= len(np.unique(contacts)) &lt; num_residues):
                raise ValueError('contacts refers to residues')
       
        # done with checks
        
        prepared_traj = np.zeros((traj_length, len(contacts)), dtype='float')
        
        # mappings that go from residue indices to certain atom indices
        involved_residues = np.unique(contacts)
        #print 'involved_residues', involved_residues
        residue_to_inds = defaultdict(lambda : [])
        residue_to_heavy_inds = defaultdict(lambda : [])
        residue_to_CA = {}
        for i in xrange(trajectory.GetNumberOfAtoms()):
            #print 'i', i, trajectory.GetNumberOfAtoms()
            #print trajectory['ResidueID'][i]
            if trajectory['ResidueID'][i] in involved_residues:
                residue_to_inds[trajectory['ResidueID'][i]].append(i)
                
                if trajectory['AtomNames'][i] == 'CA':
                    residue_to_CA[trajectory['ResidueID'][i]] = i
                
                #print 'traj[ResidueId][%d] = %s' % (i, trajectory['ResidueID'][i])
                
                if not trajectory['AtomNames'][i].lstrip('0123456789').startswith('H'):
                    residue_to_heavy_inds[trajectory['ResidueID'][i]].append(i)
        
        if not set(residue_to_heavy_inds.keys()) == set(involved_residues):
            raise Exception("One of the residues doesnt have any heavy atoms? Sometimes this happens if the residue numbering starts at 0 -- it's supposed to start at 1")
     
        
        # do each scheme
        if self.scheme == 'ca':
            for i in xrange(traj_length):
                for j in range(len(contacts)):
                    ind1 = residue_to_CA[contacts[j, 0]]
                    ind2 = residue_to_CA[contacts[j, 1]]
                    prepared_traj[i, j] = scipy.spatial.distance.euclidean(trajectory['XYZList'][i, ind1],
                                                                   trajectory['XYZList'][i, ind2])
                                                                           
        elif self.scheme == 'center':
            for i in xrange(traj_length):
                centers = {} #np.zeros((len(involved_residues), 3), dtype='float')
                for j in involved_residues:
                    centers[j] = np.mean(trajectory['XYZList'][i, residue_to_inds[j]], axis=0)
                
                for j in range(len(contacts)):
                    #print '%s to %s' % (str(centers[contacts[j, 0]]), str(centers[contacts[j, 1 ]]))
                    prepared_traj[i, j] = scipy.spatial.distance.euclidean(centers[contacts[j, 0]], centers[contacts[j, 1]])
                
        elif self.scheme == 'closest':
            for i in xrange(traj_length):
                for j in range(len(contacts)):
                    inds1 = residue_to_inds[contacts[j, 0]]
                    inds2 = residue_to_inds[contacts[j, 1]]
                    all_dists = scipy.spatial.distance.cdist(trajectory['XYZList'][i, inds1], trajectory['XYZList'][i, inds2])
                    prepared_traj[i,j] = np.min(all_dists)
            
        elif self.scheme == 'closest-heavy':
            print 'Residue to heavy inds'
            for row in residue_to_heavy_inds.values():
                print row   
                for col in row:
                    print "%s-%s" % (trajectory['AtomNames'][col], trajectory['ResidueID'][col]),
                print
            for i in xrange(traj_length):
                for j in range(len(contacts)):
                    inds1 = residue_to_heavy_inds[contacts[j, 0]]
                    inds2 = residue_to_heavy_inds[contacts[j, 1]]
                    all_dists = scipy.spatial.distance.cdist(trajectory['XYZList'][i, inds1], trajectory['XYZList'][i, inds2])
                    print "frame %d" % i
                    print "contact %d-%d" % (contacts[j,0], contacts[j,1])
                    print "inds1, inds2: (%s) (%s)" % (str(inds1), str(inds2))
                    prepared_traj[i,j] = np.min(all_dists)
            
        else:
            raise Exception('This is not supposed to happen')
        
        return np.array(prepared_traj)
    

class BooleanContact(Vectorized, AbstractDistanceMetric):
    '''This is a concerete implementation of AbstractDistanceMetric by way of
    Vectorized, for calculating distances between frames based on their
    contact maps.
    
    Here each frame is represented as a vector of booleans representing whether
    the distance between pairs of residues is less than a cutoff.
    
    Note: If you're browsing the code, notice that most of the methods on this
    class are actually implemented in the Vectorized superclass'''
    
    allowable_scipy_metrics = ['dice', 'kulsinki', 'matching', 'rogerstanimoto',
                               'russellrao', 'sokalmichener', 'sokalsneath',
                               'yule']
    
    def __init__(self, metric='matching', contacts='all', cutoff=0.5, scheme='closest-heavy'):
        ''' Create a distance metric that will measure the distance between frames
        based on differences in their contact maps.
        
        metric should be one of 'dice', 'kulsinki', 'matching', 'rogerstanimoto',
        russellrao', 'sokalmichener', 'sokalsneath', 'yule'. You should probably
        use matching. Then the distance between two frames is just the number of
        elements in their contact map that are the same. See the scipy.spatial.distance
        documentation for details.
        
        contacts can be an n by 2 array, where each row is a pair of
        integers giving the indices of 2 residues which form a native contact.
        Each conformation is then represnted by a vector of booleans representing
        whether or not that contact is present in the conformation. The distance
        metric acts on two conformations and compares their vectors of booleans.
        
        Alternatively, contacts can be the string 'all'. This is a shortcut for
        supplying a contacts list that includes all (N * N-1) / 2 pairs of the n
        residues.
        
        cutoff can be either a positive float representing the cutoff distance between two
        residues which constitues them being "in contact" vs "not in contact". It
        is measured in the same distance units that your trajectory's XYZ data is in
        (probably nanometers).
        
        Alternatively, cutoff can be an array of length equal to the number of rows in the
        contacts array, specifying a different cutoff for each contact. That is, cutoff[i]
        should contain the cutoff for the contact in contact[i].
        
        scheme can be 'CA', 'closest', or 'closest-heavy' and gives
        the sense in which the "distance between two residues" is computed. If
        scheme is 'CA', then we'll use the cartesian distance between the residues'
        C-alpha atoms as their distance for the purpose of calculating whether or
        not they have exceeded the cutoff. If scheme is 'closest', we'll use the
        distance between the closest pair of atoms where one belongs to residue i
        and to residue j. If scheme is 'closest-heavy', we'll use the distance
        between the closest pair of non-hydrogen atoms where one belongs to reside
        i and one to residue j.'''
        
        super(BooleanContact, self).__init__(metric)
        self.contacts = contacts
        self.cutoff = cutoff
          
        scheme = scheme.lower()
        if not scheme in ['ca', 'closest', 'closest-heavy']:
            raise ValueError('Unrecognized scheme')
              
        self.scheme = scheme
        
    
    def __repr__(self):
        try:
            contacts_repr = repr(self.contacts.tolist())
        except:
            contacts_repr = repr(self.contacts)
        
        try:
            cutoff_repr = repr(self.cutoff.tolist())
        except:
            cutoff_repr = repr(self.cutoff)
        
        return 'metrics.BooleanContact(metric=%s, p=%s, contacts=%s, cutoff=%s, scheme=%s)' % (self.metric, self.p, contacts_repr, cutoff_repr, self.scheme)
    
    
    def prepare_trajectory(self, trajectory):
        '''Prepare a trajectory for distance calculations based on the contact map.
        This will use the optins that you selected in the call to init.
        '''
        
        ccm = ContinuousContact(contacts=self.contacts, scheme=self.scheme)
        contact_d = ccm.prepare_trajectory(trajectory)
        if not isinstance(self.cutoff, Number):
            if not len(self.cutoff) == len(contact_d):
                raise ValueError('cutoff must be a number or match the length of contacts')
    
        contact = np.zeros_like(contact_d, dtype='bool')
        for i in xrange(contact_d.shape[1]):
            contact[:, i] = contact_d[:, i] &lt; self.cutoff
        return contact


class AtomPairs(Vectorized, AbstractDistanceMetric):
    '''Concrete implementation of Vectorized that monitors the distance
    between certain pairs of atoms (as opposed to certain pairs of residues)
    as ContinuousContact does'''
    
    allowable_scipy_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
                               'correlation', 'cosine', 'euclidean', 'minkowski',
                               'sqeuclidean']
                               
    def __init__(self, metric='cityblock', p=1, atom_pairs=None):
        ''' Atom pairs should be a N x 2 array of the N pairs of atoms
        whose distance you want to monitor'''
        super(AtomPairs, self).__init__(metric, p)
        try:
            atom_pairs = np.array(atom_pairs, dtype=int)
            n, m = atom_pairs.shape
            if not m == 2:
                raise ValueError()
        except ValueError:
            raise ValueError('Atom pairs must be an n x 2 array of pairs of atoms')
        self.atom_pairs = np.int32(atom_pairs)
        
    def prepare_trajectory(self, trajectory):
        length = len(trajectory['XYZList'])
        ptraj = _contactcalc.atomic_distances(trajectory['XYZList'], self.atom_pairs)
        return np.double(ptraj)
        

class Rg(Vectorized, AbstractDistanceMetric):
    allowable_scipy_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock',
                               'correlation', 'cosine', 'euclidean', 'minkowski',
                               'sqeuclidean']
                               
    def __init__(self, metric='euclidean', p=2):
        super(Rg, self).__init__(metric, p)
    
    def prepare_trajectory(self, trajectory):
        return _rgcalc.Rg(trajectory['XYZList'])
    

class Hybrid(AbstractDistanceMetric):
    def __init__(self, base_metrics, weights):
        self.base_metrics = base_metrics
        self.weights = weights
        self.num = len(self.base_metrics)
        
        if not len(self.weights) == self.num:
            raise ValueError()
        
    def prepare_trajectory(self, trajectory):
        return [self.base_metrics[i].prepare_trajectory(trajectory) for i in range(self.num)]
    
    def one_to_many(self, prepared_traj1, prepared_traj2, index1, indices2):
        distances = None
        for i in range(self.num):
            d = self.base_metrics[i].one_to_many(prepared_traj1[i], prepared_traj2[i],
                                                            index1, indices2)
            if distances is None:
                distances = d
            else:
                distances += d
        return distances
    
    def one_to_all(self, prepared_traj1, prepared_traj2, index1):
        distances = None
        for i in range(self.num):
            d = self.base_metrics[i].one_to_all(prepared_traj1[i], prepared_traj2[i], index1)
            if distances is None:
                distances = d
            else:
                distances += d
        return distances
        
    def all_pairwise(self, prepared_traj):
        '''Calculate condensed distance metric of all pairwise distances'''
        
        traj_length = len(prepared_traj[0])
        output = np.empty(traj_length * (traj_length - 1) / 2)
        for i in range(traj_length):
            cmp_indices = np.arange(i + 1, traj_length)
            output[i: i + len(cmp_indices)] = self.one_to_many(prepared_traj, prepared_traj, i, cmp_indices)
        return output

</pre></body></html>