KinFold v1.0
Alain Laederach
Department of Genetics, Stanford University
February 2006

The software and data examples that accompany this documentation may be downloaded from http://simtk.org/home/KinFold

Introduction

This package contains a series of Matlab scripts to analyze and model time-progress curves measured with local probes of structure. Local probes of macromolecular structure are measurements that are sensitive to the environment of a relatively small region within a macromolecule. These include, but are not limited to, NMR deuterium exchange and shift perturbation analysis, Fluorescence Resonance Energy Transfer (FRET), and RNA/DNA protein footprinting. The separate transitions reported by individual probes yield unique insight into folding intermediates. While simultaneous acquisition of many unique local transitions provides a cornucopia of information, creating an accurate global description of folding that remains faithful to local details is very challenging.

The package requires Matlab, which can be obtained from the Mathworks (http://www.mathworks.com), although some of the scripts will also work with Octave (http://www.octave.org). This is not necessarily intended as "easy to use" software; rather, it provides the basic tools to carry out an analysis similar to the one reported in Laederach et al., "Local kinetic measures of macromolecular structure reveal partitioning among multiple parallel pathways from the earliest steps in the folding of a large RNA molecule," JMB 2006. That manuscript details the algorithms implemented in the accompanying code. Any questions and bug reports may be sent to Alain Laederach (alain@helix.stanford.edu).

Methods:

Below are function definitions and usage examples for the main scripts to carry out a complete analysis of a data set. By entering in Matlab the commands shown after >>, you can work through the example data provided here. (The steps that require a supercomputer are noted.)

Data entry:

function [time_bins,interp_data_ave,res_labels]=readxlsfootprintdata(filename);

This function reads in and plots data from an Excel spreadsheet. The data is binned and vectorized for k-means clustering.

Example:

>> [time_bins,interp_data_ave,res_labels]=readxlsfootprintdata('dataeg.xls');

Note the format of the data in the Excel file dataeg.xls; it is important that this format is respected if you are inputting your own data. The result will be (in this case) 21 figures showing the binned and extrapolated data.

Gap Statistic:

The data can now be analyzed using the gap statistic as follows:

>> khat=compute_Gap(time_bins,interp_data_ave,[2 6],10,100)

This will take a couple of minutes to run, but will return a value of khat=3. This means that the gap statistic has estimated that the data contain three clusters, and this value should be used in the clustering, which is the next step. Note that for this and the next step you need to have the Statistics Toolbox installed. To check, type:

>> kmeans
??? Error using ==> kmeans
At least two input arguments required.

If that is the error you get, then the toolbox is installed; if you get a different error message, it will have instructions on how to install the toolbox. If you do not want to install this toolbox or are using Octave, you will need to skip the clustering step; I have stored the results of the clustering, so for this demonstration you can continue.
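For reference, the gap statistic (as introduced by Tibshirani and co-workers, 2001) compares the within-cluster dispersion of the real data to that of data drawn uniformly over the same range, and picks the smallest number of clusters for which the gap stops improving significantly. The following is a minimal sketch of that idea, not the packaged compute_Gap implementation; the parameter choices (kmax, B, number of replicates) are illustrative only.

% Minimal sketch of the gap statistic; X has one observation per row
% (here, one row per footprinting probe, as in the clustering step below).
X = interp_data_ave'; [n,p] = size(X);
kmax = 6; B = 10;                        % illustrative choices
logW = zeros(kmax,1); logWref = zeros(kmax,B);
lo = min(X); hi = max(X);                % bounding box for the reference data
for k = 1:kmax
    [idx,C,sumd] = kmeans(X, k, 'Replicates', 10);
    logW(k) = log(sum(sumd));            % within-cluster dispersion, real data
    for b = 1:B                          % uniform reference data sets
        Xref = repmat(lo,n,1) + rand(n,p).*repmat(hi-lo,n,1);
        [idx,C,sumd] = kmeans(Xref, k, 'Replicates', 10);
        logWref(k,b) = log(sum(sumd));
    end
end
gap = mean(logWref,2) - logW;            % Gap(k)
sk = std(logWref,0,2)*sqrt(1 + 1/B);     % simulation error
khat = find(gap(1:end-1) >= gap(2:end) - sk(2:end), 1)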
Clustering:

Now we will cluster the data using k-means clustering:

>> [IDX, C, SUMD, D] = kmeans(interp_data_ave', 3,'Distance','cityblock','Replicates',100,'EmptyAction','singleton')

Notice how the number of clusters is specified: it is the second argument of the function. You can see the results of the clustering using the following two commands:

>> load visualization.mat
>> display_clusts(IDX,res_labels,C,exp(time_bins),1,interp_data_ave,imagex,offset,residue_locations,1,2)

Note that this is specific to the L-21 Tetrahymena ribozyme, as this function displays the data on the secondary structure.

Kinetic Model Fitting:

This step requires a supercomputer, but for the purposes of this example it is possible to run the code like this:

>> search_kspace('17')

After a while you should get output like:

Iteration Func-count f(x) step optimality CG-iterations
0 13 39.4335 2.54

This tells the code to test model number 17 (out of 28, given that k=3 and that the number of intermediates is 2). The code is set up to optimize the model for approximately 15 hours of CPU time, and this would need to be repeated for all models. This particular numerical solution is also the most computationally efficient; less efficient code that can handle stiff systems is available upon request from Alain Laederach (alain@helix.stanford.edu). Stiffness is common if there are large differences (greater than 3 orders of magnitude) between the slowest and fastest forming sites. (A sketch of how a model of this form can be integrated numerically is given at the end of this document.)

Optimization visualization:

The results of a supercomputing run are stored in the analysis_all.mat file and can be visualized by typing the command:

>> make_plots_interest(1,1,4)

Best fits to the data are shown, as well as the time evolution of the different intermediates. Note that the output values (mean_K) are the rate constants of the best-fitting model, with K(1,2) corresponding to the rate from U->I1, K(1,3) to the rate from U->I2, etc.

Flux analysis:

Flux analysis also requires a significant amount of time to compute accurately, but for the purposes of this demonstration the following command can be issued:

>> get_fluxes_t(1)

This command should be repeated at least 100 times to get better sampling, each time incrementing the argument by one (about 100 CPU hours; see the loop sketched at the end of this document).

Pathway Analysis:

This command analyzes the results of all the pathway data stored in the pathways folder:

>> pathways=analyze_state_pathways;

Pathway Visualization:

The most common pathways can be visualized by issuing the following command:

>> [clust_pathway,diff_pathway_vect]=pathway_analysis(pathways);

This will show a histogram of the fluxes through the major pathways.

The analysis presented above can be repeated with a different data set but will require the use of a supercomputer (between 200 and 300 nodes) for the computationally expensive steps. The scripts for generating massively parallel runs are not included in this distribution, as they are specific to the setup of the machine. For help with distributed computing setup, feel free to contact Alain Laederach (alain@helix.stanford.edu).
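The two sketches referenced above follow. First, the models explored by search_kspace are networks of first-order reactions; a minimal example of integrating one such network in Matlab is shown below. The four-species topology (U, I1, I2, N), the rate values, and the time span are illustrative assumptions only; this is not the packaged fitting code, which optimizes the rate constants against the footprinting data. ode15s is used because it copes with the stiff systems mentioned above.

% Minimal sketch: integrating a first-order kinetic network U -> I1, I2 -> N.
% K(i,j) is the rate constant for the transition from species i to species j,
% matching the K(1,2) = U->I1 convention described above.
K = zeros(4);           % illustrative values, not fitted rate constants
K(1,2) = 2.0;           % U  -> I1
K(1,3) = 0.5;           % U  -> I2
K(2,4) = 0.1;           % I1 -> N
K(3,4) = 0.05;          % I2 -> N
A = K';                                  % inflow into each species
A = A - diag(sum(K,2));                  % outflow on the diagonal
odefun = @(t,p) A*p;                     % dp/dt = A*p
p0 = [1; 0; 0; 0];                       % all molecules start unfolded
[t,p] = ode15s(odefun, [1e-3 100], p0);  % stiff solver, cf. note above
semilogx(t, p); legend('U','I1','I2','N');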
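Second, the repeated flux sampling described under Flux analysis can be automated with a simple loop, assuming each call to get_fluxes_t stores its own results in the pathways folder:

% Repeat the flux calculation at least 100 times for adequate sampling,
% incrementing the run index each time (about 100 CPU hours in total).
for i = 1:100
    get_fluxes_t(i);
end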