I recently benchmarked the new MSMBuilder2 code on a much larger dataset: the entirety of HP35 projects 3036 and 3037, for a total of 10,360,011 conformations.
Before the benchmark, I converted both FAH projects in their entirety to the lossy HDF5 file format, as that gives the best compression and read speeds.
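To illustrate what "lossy" can mean for coordinate storage, here is a minimal sketch of fixed-point rounding: coordinates are scaled, rounded, and kept as 16-bit integers instead of floats. The precision constant and the function names are hypothetical, and whether the MSMBuilder2 format works exactly this way is an assumption, not something stated above.

```python
from array import array

# Hypothetical precision: store coordinates in units of 1/1000 nm.
PRECISION = 1000

def quantize(coords_nm):
    """Round float coordinates (in nm) to signed 16-bit integers."""
    return array("h", (int(round(x * PRECISION)) for x in coords_nm))

def dequantize(stored):
    """Recover approximate float coordinates (in nm) from the stored integers."""
    return [v / PRECISION for v in stored]

coords = [0.12345, -1.98765, 3.14159]
stored = quantize(coords)
restored = dequantize(stored)

# The rounding error is bounded by half of one storage unit.
max_error = max(abs(a - b) for a, b in zip(coords, restored))
print(list(stored), max_error)
```

With this choice, each value takes 2 bytes rather than 4 or 8, at the price of a bounded rounding error and a limited coordinate range (about ±32.7 nm for 16-bit integers at this precision).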
I extracted data subsampled every 5 ns, resulting in 100,000 conformations. I clustered those into 30,000 states using KCenters, then attempted 120,000 KMedoid "generator swaps". After the KMedoid swaps, the mean RMSD from each data point to its assigned generator was 1.45 angstroms. The entire clustering phase took about a day, with 90% of the time spent in the KMedoid iterations.
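The two clustering steps can be sketched roughly as follows. This is a generic illustration, not the MSMBuilder2 implementation: Euclidean distance on toy 2D points stands in for RMSD between conformations, the swap proposal rule (a random generator slot and a random candidate point) is an assumption, and all function names are hypothetical.

```python
import random

def dist(a, b):
    # Euclidean distance; a stand-in for RMSD between two conformations.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kcenters(points, k):
    """Greedy k-centers: repeatedly promote the point farthest from every generator."""
    generators = [0]  # start from an arbitrary point
    nearest = [dist(p, points[0]) for p in points]
    while len(generators) < k:
        new = max(range(len(points)), key=nearest.__getitem__)
        generators.append(new)
        nearest = [min(d, dist(p, points[new])) for d, p in zip(nearest, points)]
    return generators

def mean_distance(points, generators):
    """Mean distance from each point to its closest generator."""
    return sum(min(dist(p, points[g]) for g in generators) for p in points) / len(points)

def kmedoid_swaps(points, generators, n_swaps, rng):
    """Attempt random generator swaps, keeping only those that lower the mean distance."""
    generators = list(generators)
    best = mean_distance(points, generators)
    for _ in range(n_swaps):
        slot = rng.randrange(len(generators))
        candidate = rng.randrange(len(points))
        if candidate in generators:
            continue  # already a generator; nothing to swap
        trial = generators[:slot] + [candidate] + generators[slot + 1:]
        cost = mean_distance(points, trial)
        if cost < best:
            generators, best = trial, cost
    return generators, best

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
start = kcenters(points, 10)
before = mean_distance(points, start)
refined, after = kmedoid_swaps(points, start, 300, rng)
print(before, after)
```

In this naive version every attempted swap rescans all points, so the swap phase costs far more than the initial KCenters pass. That is consistent with the swap iterations dominating the runtime, although a real implementation may well evaluate swaps more cheaply.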
After this, I assigned every conformation in the full dataset to those states on Biox2. There are 556 trajectories, so I chose to use 556 cores, one per trajectory. The entire job, including time waiting in the queue, took approximately 49 minutes. Because the trajectories differ in length, most of this time was actually spent waiting for the longest few trajectories to complete. The cost of such a job should be approximately $3.
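The effect of uneven trajectory lengths on a one-core-per-trajectory job can be sketched as below. The frame counts are made up for illustration, the function name is hypothetical, and the sketch assumes assignment time is proportional to the number of frames.

```python
def one_core_per_trajectory(lengths):
    """Return (wall time, core utilization) when each trajectory gets its own core.

    Times are in units of "frames processed", assuming cost proportional to length.
    """
    wall = max(lengths)   # the job ends when the longest trajectory finishes
    busy = sum(lengths)   # total useful work across all cores
    utilization = busy / (wall * len(lengths))
    return wall, utilization

# Hypothetical frame counts: three short trajectories and one long one.
lengths = [100, 120, 150, 900]
wall, utilization = one_core_per_trajectory(lengths)
print(wall, utilization)
```

Here the wall time is set entirely by the longest trajectory while the other cores sit idle for most of the run. Splitting long trajectories into chunks, or packing several short trajectories onto one core, would raise utilization at the cost of a more involved job setup.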
In summary, our new methods should allow us to get answers, at least for the assignment phase, in close to real time.