Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intensive and work with very large datasets, so sequential algorithms cannot provide the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures.
|CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms; Bioinformatics; Kohlhoff KJ, Sosnick MH, Hsu WT, Pande VS and Altman RB (2011)|
|K-Means for Parallel Architectures Using All-Prefix-Sum Sorting and Updating Steps; IEEE Trans Parallel and Distributed Systems; Kohlhoff KJ, Pande VS and Altman RB (2012)|
We present an implementation of parallel K-means clustering, called Kps-means, that achieves near-optimal performance without imposing limits on the number of dimensions and data points permitted as input, thus combining flexibility with high degrees of parallelism and efficiency. As a key element to performance improvement, we introduce parallel sorting as data pre-processing and updating steps. Our final implementation for Nvidia GPUs achieves speed-ups of up to 200-fold over CPU reference code and of up to three orders of magnitude when compared with popular numerical software packages.
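The role of sorting and prefix sums in the update step can be illustrated with a small sequential sketch. This is not the Kps-means implementation or any CAMPAIGN API; the function names (`assign`, `update_sorted`, `kmeans`) are hypothetical, and the code runs on the CPU in plain Python. It only shows the general idea: once points are sorted by cluster label, each cluster occupies one contiguous segment, and an exclusive prefix sum over the cluster sizes gives the offset at which each segment starts.

```python
from itertools import accumulate


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update_sorted(points, labels, centroids):
    """Recompute centroids from points grouped into contiguous segments."""
    k = len(centroids)
    # Sort point indices by label so that each cluster forms one contiguous run.
    order = sorted(range(len(points)), key=lambda i: labels[i])
    counts = [0] * k
    for lab in labels:
        counts[lab] += 1
    # Exclusive prefix sum of the counts: start offset of each cluster's segment.
    starts = [0] + list(accumulate(counts))[:-1]
    new_centroids = []
    for c in range(k):
        if counts[c] == 0:
            # Assumption for this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(centroids[c])
            continue
        segment = [points[i] for i in order[starts[c]:starts[c] + counts[c]]]
        dim = len(segment[0])
        new_centroids.append(
            tuple(sum(p[d] for p in segment) / counts[c] for d in range(dim))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and sorted update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_sorted(points, labels, centroids)
    return centroids, labels


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

On a GPU, the sort and the prefix sum would each be replaced by a parallel primitive, and the per-segment averages by a segmented reduction; the sketch keeps them sequential so the data layout is easy to follow.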