
This project shows that the K-center clustering algorithm is particularly well suited to clustering conformation space. It also provides source code
implementing a simple 2-approximation algorithm.
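As a rough illustration, the standard 2-approximation for K-center is Gonzalez's farthest-first traversal: repeatedly pick the point farthest from the centers chosen so far. The sketch below is a minimal, self-contained version for points in Euclidean space (the function name and interface are illustrative, not the project's actual API; the released source code and the RMSD metric used for conformations may differ).

```python
import math
import random

def kcenter_greedy(points, k, seed=0):
    """Farthest-first traversal: a 2-approximation for K-center.

    Uses O(n*k) distance computations, which is why this approach
    scales to very large numbers of conformations.
    Returns (center indices, per-point label, covering radius).
    """
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    # dist[i] = distance from point i to its nearest chosen center
    dist = [math.dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        # next center: the point farthest from all current centers
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(far)
        for i, p in enumerate(points):
            d = math.dist(p, points[far])
            if d < dist[i]:
                dist[i] = d
    # assign each point to its nearest center
    labels = [min(centers, key=lambda c: math.dist(points[i], points[c]))
              for i in range(len(points))]
    radius = max(dist)  # covering radius of the clustering
    return centers, labels, radius
```

Because every point lies within the covering radius of some center, all clusters have roughly the same spatial extent, which is the property exploited below.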

This project considers the problem of clustering conformation space. Clustering is a standard approach to studying conformation space: it not only provides a concise representation of the free energy landscape but is also a necessary preprocessing step for building more sophisticated representations such as Markov State Models. However, most current clustering methods overlook an important property of conformation space: it is sampled according to the Boltzmann distribution, which is exponential in the free energy, so large numbers of sampled conformations concentrate in the free energy basins. Commonly used clustering algorithms such as K-means, which measure clustering quality in terms of variance, tend to split
the densely sampled free energy basins into many small clusters and to lump low-density regions into clusters containing quite different structures.

To avoid over-splitting the energy basins, in this paper we approach clustering from a geometric point of view and measure the quality
of clusters in terms of their geometric properties. We introduce a fast, efficient clustering algorithm: an approximate K-center clustering algorithm. This algorithm has two major advantages: (1) it is fast, generating thousands of clusters from millions of conformations within a few hours on a single PC, which is orders of magnitude faster than the K-means clustering
algorithm; (2) the output clusters have roughly the same radius, so the population of each cluster reflects its relative density as determined by the underlying free energy landscape. Moreover, because it reduces the complexity of the system efficiently and faithfully, our experiments show it is very powerful in combination with more deliberate schemes that would otherwise be
inapplicable due to the massive size of the data.