How do I build an MSM for a FAH project?

************************************************************************

1.  Copy the FAH dataset to Certainty for clustering.  You may want to subsample and strip waters before transferring.  You may also have to repair damaged trajectories or proteins broken by periodic boundary conditions (trjconv -pbc mol).
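As a rough sketch of the PBC repair and subsampling mentioned above, the helper below only assembles a trjconv command line.  The file names and the helper itself are hypothetical; the flags -f, -s, -o, -pbc and -skip are standard trjconv options.  Newer GROMACS releases invoke the tool as "gmx trjconv", which is not handled here.

```python
def build_trjconv_command(xtc_in, tpr, xtc_out, skip=1, executable="trjconv"):
    """Return the argument list for a trjconv call that makes molecules whole.

    skip > 1 keeps only every skip-th frame (subsampling).
    All file names passed in are placeholders chosen by the caller.
    """
    cmd = [executable, "-f", xtc_in, "-s", tpr, "-o", xtc_out, "-pbc", "mol"]
    if skip > 1:
        cmd += ["-skip", str(skip)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_trjconv_command("frame0.xtc", "frame0.tpr", "frame0_whole.xtc", skip=10)
print(" ".join(cmd))
```

trjconv asks on standard input which group to write; picking a protein-only group at that prompt is one way to strip waters in the same pass.  The list can be handed to subprocess.run() once GROMACS is on your path.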

************************************************************************

2a.  Convert the FAH project to MSMBuilder format:

We want 1 file per trajectory.  

It's usually best to throw out solvent and vsites.

Why should we throw out solvent?  Because _most_ analyses don't involve water, and seeking through 10000 extra atoms costs us time.

I like to save in Lossy HDF5 format, as this format seems to offer good compression and excellent read speed (both random access and sequential access).

We want a directory containing the files trj0.lh5, ..., trjN.lh5.

I have some functions that help the process of migrating FAH data:

CreateMergedTrajectoriesFromFAH.CreateMergedTrajectoriesFromFAH()

An example using this function is in Examples/CreateTrajsFromFAH.py

Note that we need to specify:

a PDB file (so that we can save the atom and residue names)
the path to the data
the number of Runs (here, 1)
the number of Clones (here, 150)
the output file type (".xtc", ".lh5", or ".h5").  kyleb prefers ".lh5" for the fastest file IO; fast IO means you can use more nodes during assignment.

After calling this function, we should have a directory Trajectories that contains 1 lh5 file for each trajectory.
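A quick sanity check on the output directory can be sketched as follows.  This is not part of MSMBuilder; the helper name is made up here, and it only assumes the trj<i>.lh5 naming described above.  The demo runs against a temporary directory with empty placeholder files rather than real trajectories.

```python
import os
import re
import tempfile


def missing_trajectory_indices(directory, extension=".lh5"):
    """Return (number of trajectory files found, sorted list of missing indices).

    Files are expected to be named trj0<ext>, trj1<ext>, ..., with no gaps.
    """
    pattern = re.compile(r"^trj(\d+)" + re.escape(extension) + "$")
    found = set()
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return 0, []
    missing = [i for i in range(max(found) + 1) if i not in found]
    return len(found), missing


# Demo on a throwaway directory with empty placeholder files (trj2.lh5 is absent).
with tempfile.TemporaryDirectory() as demo_dir:
    for i in (0, 1, 3):
        open(os.path.join(demo_dir, "trj%d.lh5" % i), "w").close()
    n_found, missing = missing_trajectory_indices(demo_dir)

print(n_found, missing)
```

On a real project you would call missing_trajectory_indices("Trajectories") and expect an empty list of missing indices.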

Note that we get back lists of which runs and clones correspond to which trajectories.

2b.  Check that you have a ProjectInfo.h5 file.  CreateMergedTrajectoriesFromFAH() should already have created it.

************************************************************************

3.  Cluster the project.  See BuildMSMFromScratch.py for an example of how to do this.  Basically, we just need to load the project file, then call the ClusterProject routine.  Don't forget to save your generators as a .lh5 file.

************************************************************************

4a.  Assign the project on Biox2:

python ~/opt/msmbuilder/kyleb/Scripts/GenerateBioxScript.py -p ProjectInfo.h5 -g Data/Gens.lh5 -x AtomIndices.dat -o Assignments.h5 -n 188

This generates the file RunBiox2.sh.  You will need to edit this script to fix the path to your MSMBuilder2 installation.  Running the script will generate a bunch of files of the form Assignments.h5.XXX.

OR


4b.  Assign the project on Certainty.  You will run a command like the following:

python /home/kyleb/opt/msmbuilder/Scripts/SubmitCertainty.py  -p ProjectInfo.h5 -g Data/Gens.lh5 -x AtomIndices.dat -n 10

After this, you need to merge the partial results into a single set of assignments (DataFile):

python /home/kyleb/opt/msmbuilder/kyleb/Scripts/MergeAssign.py

This results in 3 files: Assignments.h5.WhichTrajs, Assignments.h5.RMSD, and Assignments.h5.

Note that the index file AtomIndices.dat should be a flat text file containing the atom indices used for clustering, where the first atom in the protein has index 0.  Such a file is easy to create with the Conformation class and can be saved using numpy.savetxt() (older SciPy releases exposed the same function as scipy.savetxt()).
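To make the file format concrete, here is a minimal standard-library sketch that picks zero-based atom indices out of PDB text and writes them one per line.  It does not use the Conformation class; the helper name, the choice of alpha carbons, and the tiny PDB fragment are all assumptions made for illustration.  The same file can be written with numpy.savetxt(path, indices, fmt="%d").

```python
import os
import tempfile


def select_atom_indices(pdb_lines, atom_names=("CA",)):
    """Return zero-based indices of atoms whose name is in atom_names.

    Atoms are counted in file order over ATOM/HETATM records, so index 0 is the
    first atom in the file (the first protein atom once solvent is stripped).
    """
    indices = []
    atom_counter = 0
    for line in pdb_lines:
        if line[:6] in ("ATOM  ", "HETATM"):
            if line[12:16].strip() in atom_names:  # PDB columns 13-16: atom name
                indices.append(atom_counter)
            atom_counter += 1
    return indices


# Hypothetical two-residue fragment (coordinates omitted; only names are used).
pdb_lines = [
    "ATOM      1  N   ALA A   1",
    "ATOM      2  CA  ALA A   1",
    "ATOM      3  C   ALA A   1",
    "ATOM      4  N   GLY A   2",
    "ATOM      5  CA  GLY A   2",
]
indices = select_atom_indices(pdb_lines)

# Write one integer per line, then read the file back to confirm the format.
with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "AtomIndices.dat")
    with open(path, "w") as handle:
        for index in indices:
            handle.write("%d\n" % index)
    with open(path) as handle:
        read_back = [int(text) for text in handle.read().split()]

print(indices, read_back)
```

Which atoms to cluster on depends on the system; alpha carbons are used here only to keep the example short.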


************************************************************************