A brief how-to for building Markov State Models (MSMs) for Folding@home (FAH) projects.
Gregory R. Bowman
gbowman@stanford.edu

First you need to set up your data in the appropriate format for building an MSM.  For a general FAH project (i.e. one where you only care about conformations and their temporal ordering) follow the example in testFAH.py.  Running this will create a directory containing all your data, organized for an initial clustering to find microstates.  This is done using the FAHProjClustering class.  To save space, soft links are made to your original xtc files rather than new copies (sorry, only gromacs is supported for now, though it would be straightforward to write setup code for other simulation packages).  You may also use testST.py as a template for Simulated Tempering (ST) projects (i.e. where you care about both the conformation and the temperature).  This code uses the SetupSTProjClustering class.  You may also use the rsyncToRemoteMachine() method in FAHProjClustering/SetupSTProjClustering to copy your data to a remote machine and do the clustering there.  This is useful if your server doesn't have the computational power to run FAH and do clustering simultaneously.
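The soft-linking idea can be sketched in a few lines of Python.  This is only an illustration, not the code in FAHProjClustering: the helper name link_trajectories, the flattened link names, and the directory names in the usage note are all assumptions made for the example.

```python
import os


def link_trajectories(src_dir, dest_dir, ext=".xtc"):
    """Soft-link every file ending in `ext` under src_dir into dest_dir.

    Hypothetical helper: the flat naming scheme used here is an assumption,
    not the layout that FAHProjClustering actually produces.
    """
    os.makedirs(dest_dir, exist_ok=True)
    src_root = os.path.abspath(src_dir)
    linked = []
    for root, _dirs, files in os.walk(src_root):
        for name in sorted(files):
            if not name.endswith(ext):
                continue
            src = os.path.join(root, name)
            # Encode the relative path in the link name to avoid collisions
            # between files that share a name in different directories.
            rel = os.path.relpath(src, src_root)
            dest = os.path.join(dest_dir, rel.replace(os.sep, "_"))
            if not os.path.lexists(dest):
                os.symlink(src, dest)  # a link, not a copy, so no extra disk use
            linked.append(dest)
    return linked
```

For example, link_trajectories("PROJ1234", "clustering/trajectories") would leave the original files untouched and fill the second directory with links (both directory names are made up).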

Now you need to divide your data into microstates: clusters based on structural similarity that are small enough that structural similarity implies kinetic connectivity.  Right now we provide hierarchical K-medoids clustering (see gb_kcenters/src/README for instructions), though other algorithms may be used.
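To give a feel for what structural clustering does, here is a minimal sketch of greedy k-centers (farthest-point) clustering.  Note that this is a simpler relative of, and not the same as, the hierarchical K-medoids code in gb_kcenters.  It also assumes toy points with a Euclidean distance, where a real project would use a structural metric such as RMSD between conformations.

```python
import math


def k_centers(points, k, dist=math.dist):
    """Greedy k-centers clustering.

    Repeatedly promotes the point farthest from all current centers to a new
    center.  Returns (indices of the centers, cluster index for each point).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [0]  # arbitrary choice: start from the first point
    # Distance from each point to its nearest center so far.
    nearest = [dist(p, points[0]) for p in points]
    assignments = [0] * len(points)
    while len(centers) < k:
        new_center = max(range(len(points)), key=nearest.__getitem__)
        centers.append(new_center)
        for i, p in enumerate(points):
            d = dist(p, points[new_center])
            if d < nearest[i]:
                nearest[i] = d
                assignments[i] = len(centers) - 1
    return centers, assignments
```

The value of max(nearest) after clustering is the largest distance from any point to its center, which is one way to judge whether the clusters are small enough.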

Once you have microstates you can use BuildMicroMSMsVaryLagTime.py to determine the number of macrostates to lump them into.  Use the -h option to read about the parameters.  This code will output files listing the top N implied timescales as a function of the lag time.  If the model is Markovian then the timescales should level off as the lag time increases.  Look for an obvious gap in the timescales: the number of macrostates is 1 plus the number of timescales above the gap.  See Frank Noe and Stefan Fischer, Curr. Opin. Struct. Biol. 2008 for details on choosing the appropriate number of macrostates.
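The relationship between eigenvalues, implied timescales, and the gap can be sketched as follows.  An implied timescale is t_i = -tau / ln(lambda_i), where tau is the lag time and lambda_i is an eigenvalue of the transition matrix.  The function names are hypothetical, and picking the gap as the largest ratio between successive timescales is an assumption made for this example; in practice the gap is usually judged by eye from the plots.

```python
import math


def implied_timescales(eigenvalues, lag_time):
    """Convert transition-matrix eigenvalues to implied timescales.

    Uses t_i = -lag_time / ln(lambda_i).  The largest eigenvalue (the
    stationary one, equal to 1) is dropped, as are eigenvalues outside
    (0, 1), which have no well-defined timescale.
    """
    ordered = sorted(eigenvalues, reverse=True)[1:]
    return [-lag_time / math.log(lam) for lam in ordered if 0.0 < lam < 1.0]


def n_macrostates_from_gap(timescales):
    """Return 1 + the number of timescales above the largest ratio gap."""
    ts = sorted(timescales, reverse=True)
    if len(ts) < 2:
        raise ValueError("need at least two timescales to locate a gap")
    ratios = [ts[i] / ts[i + 1] for i in range(len(ts) - 1)]
    n_above_gap = ratios.index(max(ratios)) + 1
    return n_above_gap + 1
```

With made-up eigenvalues 1.0, 0.99, 0.98, 0.5, 0.4 at a lag time of 10, the two slowest timescales are far above the other two, so the sketch suggests 3 macrostates.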

Now build a macrostate MSM using BuildMacroMSM.py.  Again, run the code with the -h option for an explanation of the parameters.  You may also use BuildMacroMSMsVaryLagTime.py to get the implied timescales as a function of the lag time.  This is useful for checking that your MSM is Markovian.

If you are using K-medoids hierarchical clustering you can get populations of macrostates by copying the PCCASA.mapMicroToMacro.dat file output by BuildMacroMSM.py to a file named macro_states.dat in your project's root directory (the one containing the assignments, trajectories, and other directories created during the initial setup) and running the assign_macro program.  You can then get state populations with error bars using either GetMacroMSMPopStats.py or GetNoeMacroMSMPopStats.py.  GetMacroMSMPopStats.py uses bootstrapping to get averages and standard deviations, whereas GetNoeMacroMSMPopStats.py uses the Noe transition matrix sampling algorithm (see NoeSampling.py and the reference listed therein for details).  GetNoeMacroMSMPopStats.py is the best option for standard FAH projects but requires that you determine the Markov time first (e.g. using BuildMacroMSMsVaryLagTime.py).  GetMacroMSMPopStats.py is more appropriate for getting populations from certain temperature ranges in ST projects because you're unlikely to have many transitions at the temperature of interest.
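The bootstrapping idea can be sketched as follows: resample whole trajectories with replacement, recompute the macrostate populations each time, and report the mean and standard deviation over the resamples.  This is a sketch under assumptions, not the code in GetMacroMSMPopStats.py.  In particular, it estimates populations from raw frame counts (the real script may derive them from the MSM), and the function names and the list-based micro-to-macro mapping are hypothetical.

```python
import random
import statistics


def macro_populations(trajs, micro_to_macro, n_macro):
    """Fraction of frames in each macrostate.

    trajs holds microstate indices; micro_to_macro[i] is the macrostate
    that microstate i was lumped into.
    """
    counts = [0] * n_macro
    for traj in trajs:
        for micro in traj:
            counts[micro_to_macro[micro]] += 1
    total = sum(counts)
    return [c / total for c in counts]


def bootstrap_populations(trajs, micro_to_macro, n_macro, n_boot=200, seed=0):
    """Mean and standard deviation of populations over bootstrap resamples."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    samples = []
    for _ in range(n_boot):
        # Draw as many trajectories as we have, with replacement.
        resampled = [rng.choice(trajs) for _ in trajs]
        samples.append(macro_populations(resampled, micro_to_macro, n_macro))
    means = [statistics.mean([s[k] for s in samples]) for k in range(n_macro)]
    stds = [statistics.stdev([s[k] for s in samples]) for k in range(n_macro)]
    return means, stds
```

Resampling whole trajectories rather than single frames is a deliberate choice here, since frames within one trajectory are correlated in time.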

GetRandomConfsFromEachMacroState.py allows you to choose some number of random conformations from each macrostate.  This is useful for starting new simulations from each state or visualizing the states.
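The selection step can be sketched as below.  This is an illustration and not the code in GetRandomConfsFromEachMacroState.py: the function name is hypothetical, and it assumes the macrostate assignments are already loaded as one list of state indices per trajectory.  It returns (trajectory index, frame index) pairs and leaves extracting the actual coordinates to other tools.

```python
import random


def sample_conformations(assignments, n_per_state, seed=None):
    """Pick up to n_per_state random frames from each macrostate.

    assignments[t][f] is the macrostate of frame f in trajectory t.
    Returns {state: [(trajectory index, frame index), ...]}.
    """
    rng = random.Random(seed)
    frames_by_state = {}
    for t, traj in enumerate(assignments):
        for f, state in enumerate(traj):
            frames_by_state.setdefault(state, []).append((t, f))
    # Sample without replacement; small states contribute all their frames.
    return {
        state: rng.sample(frames, min(n_per_state, len(frames)))
        for state, frames in sorted(frames_by_state.items())
    }
```

Passing a fixed seed makes the selection reproducible, which is handy if the chosen frames are used to seed new simulations.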