Instructions for using the hierarchical clustering:

You need the following:

- An atom index file named atom_indices containing the indices of the atoms you want to use for computing RMSD. The format is: first the total number of atoms, then the list of indices, indexed from 1. Only the atoms needed for RMSD computations are actually kept in memory, so the number of atoms you use largely determines the amount of memory the algorithm needs. In particular, memory usage should be approximately 4*3*natoms*nconfigurations bytes (three 4-byte coordinates per stored atom per configuration).
- A trajectory list named trajlist. The format is: first the number of trajectories, then the list of trajectory names.
- A directory named trajectories containing, for each trajectory in trajlist, a file with that trajectory's name listing, in order, the xtc files that make up the trajectory.
- Directories named data, generators, and assignments, in which the results will be placed.

Usage instructions:

There are three executables: hierchyclusterballance, cleanup, and assign.

- First run hierchyclusterballance. This is the part you want to run in parallel. It creates a clustering of all the trajectories in trajlist, writing the results to files in data named results.(0...num processes - 1), one per process.
- Then run the cleanup script, giving the number of processes you had running for clustering as a command line argument. This creates a file in the assignments directory for each trajectory, containing the state assignment of each snapshot in that trajectory. It also creates files called index and tree in the data directory, which are needed by the assign script.
- The assign script is used to add new trajectories to an existing state assignment.
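The input layout above can be sketched as follows. This is illustrative only: the trajectory names (traj1, traj2), the xtc file names, and the one-entry-per-line layout are assumptions, not guaranteed by the program.

```shell
# Result directories expected by the clustering run.
mkdir -p trajectories data generators assignments

# atom_indices: atom count first, then the 1-indexed atom indices.
cat > atom_indices <<'EOF'
3
5
17
42
EOF

# trajlist: trajectory count first, then one trajectory name per line.
cat > trajlist <<'EOF'
2
traj1
traj2
EOF

# trajectories/<name>: the ordered xtc files making up that trajectory.
cat > trajectories/traj1 <<'EOF'
traj1_part1.xtc
traj1_part2.xtc
EOF
cat > trajectories/traj2 <<'EOF'
traj2_part1.xtc
EOF

# Illustrative run sequence (the MPI launcher and process count are
# assumptions; use whatever parallel launcher your cluster provides):
# mpirun -np 4 ./hierchyclusterballance
# ./cleanup 4
# ./assign new_trajlist
```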
Running assign, giving a file containing a list of trajectories as a command line argument, will assign these trajectories to your existing clustering, creating a file containing the assignments for each trajectory in the assignments directory.

GRB 7/29/08 - Added assign_macro, which requires a file named macro_states.dat in the root directory listing the macrostate corresponding to each microstate. As an argument, give the name of a file listing the trajectories to assign. The output is an assignment file whose first column gives the microstate and whose second column gives the macrostate.

Brief description of the algorithm:

We want to create a conformational clustering of large data sets efficiently and in parallel. The idea is to form a hierarchical clustering as follows. We start with all of our configurations in a single state. Then we recursively split each state into many states until all of the states are sufficiently small. To split a state, we do k-medoid clustering on a possibly small subset of its data in order to determine a set of generators for the new states, and then we assign each of the configurations in the state being split to the generator it is closest to. To do the k-medoid clustering, we first choose a random set of generators, then iteratively assign all of the configurations to the generators and attempt to stochastically update the generator of each state, by randomly selecting potential updates and determining whether they reduce the variance of the state. Parameters controlling the behavior of this splitting can be found in parameters.h.

We also want this algorithm to work efficiently in parallel. As we recursively go down the tree of states, it is unlikely that our data will remain equally distributed across all of the processors; as a result, some will finish their tasks before others and time will be wasted. We can remedy this problem in two ways. First, we can try to redistribute the data across all the nodes evenly.
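The stochastic k-medoid step described above can be sketched in miniature. This is an illustration only, using 1-D points with squared distance standing in for RMSD between configurations; the function names and parameters are assumptions, not the program's actual API.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Sum over all points of the squared distance to the nearest generator:
// the "variance" that the stochastic updates try to reduce.
double variance(const std::vector<double>& pts,
                const std::vector<double>& gens) {
    double total = 0.0;
    for (double p : pts) {
        double best = (p - gens[0]) * (p - gens[0]);
        for (double g : gens) {
            double d = (p - g) * (p - g);
            if (d < best) best = d;
        }
        total += best;
    }
    return total;
}

// Split one state into k states: pick k random configurations as the
// initial generators, then repeatedly propose replacing one generator
// with a randomly chosen configuration, keeping only proposals that
// lower the variance. Returns the final generators, sorted.
std::vector<double> split_state(const std::vector<double>& pts,
                                int k, int proposals, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
    std::vector<double> gens;
    for (int i = 0; i < k; ++i) gens.push_back(pts[pick(rng)]);
    double cur = variance(pts, gens);
    for (int i = 0; i < proposals; ++i) {
        int which = i % k;                // cycle over the generators
        double saved = gens[which];
        gens[which] = pts[pick(rng)];     // stochastic update proposal
        double trial = variance(pts, gens);
        if (trial < cur) cur = trial;     // accept: variance decreased
        else gens[which] = saved;         // reject: restore old generator
    }
    std::sort(gens.begin(), gens.end());
    return gens;
}
```

On two well-separated 1-D clusters, the accepted proposals quickly drive one generator into each cluster, which is the behavior the real code relies on when splitting a state.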
This is implemented in a very simple-minded manner by the ballance routine, which tries to correct imbalances with a series of cyclic passes of data. Alternatively, once all the states are sufficiently small, we can redistribute the data so that each state is entirely contained on a single node, and then each node can work independently. This gives a great speedup, because nodes no longer have to wait for each other to finish anything, and there is no communication overhead.

Xhuang NOTE: 1. cleanup.cpp is isolated from the main program; it reads from disk the output of hirechadical.cpp.
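The idea of cyclic balancing passes can be sketched as follows. This is a minimal illustration of the concept, not the actual ballance routine: here each node holding more than the mean number of configurations ships its surplus to the next node in the ring, and passes repeat until nothing moves.

```cpp
#include <vector>

// Toy cyclic balancing: load[i] is the number of configurations on
// node i. Repeated passes send each node's surplus over the mean to
// node (i + 1) % n. Assumes the total divides evenly for simplicity.
std::vector<int> balance(std::vector<int> load) {
    int n = static_cast<int>(load.size());
    long total = 0;
    for (int x : load) total += x;
    int mean = static_cast<int>(total / n);
    for (int pass = 0; pass < n; ++pass) {
        bool moved = false;
        for (int i = 0; i < n; ++i) {
            int surplus = load[i] - mean;
            if (surplus > 0) {
                load[i] = mean;
                load[(i + 1) % n] += surplus;  // cyclic pass of data
                moved = true;
            }
        }
        if (!moved) break;  // already balanced
    }
    return load;
}
```

Because surplus only moves one way around the ring, badly skewed initial distributions can need several passes, which matches the description of a series of cyclic passes rather than a single one.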