KinFold v1.0
Alain Laederach
Department of Genetics, Stanford University
February 2006

The software and data examples that accompany this documentation may be downloaded from http://simtk.org/home/KinFold

Introduction

This package contains a series of Matlab scripts to analyze and model time-progress curves measured with local probes of structure. Local probes of macromolecular structure are measurements that are sensitive to the environment of a relatively small region within a macromolecule. These include, but are not limited to, NMR deuterium exchange and shift perturbation analysis, Fluorescence Resonance Energy Transfer (FRET), and RNA/DNA protein footprinting. The separate transitions reported by individual probes yield unique insight into folding intermediates. While simultaneous acquisition of many unique local transitions provides a cornucopia of information, creating an accurate global description of folding that remains faithful to local details is very challenging.

The package requires Matlab, which can be obtained from the Mathworks (http://www.mathworks.com), although some of the scripts will also work with Octave (http://www.octave.org). This is not necessarily intended as "easy to use" software; rather, it provides the basic tools to carry out an analysis similar to the one reported in Laederach et al., "Local kinetic measures of macromolecular structure reveal partitioning among multiple parallel pathways from the earliest steps in the folding of a large RNA molecule," JMB 2006. That manuscript details the algorithms implemented in the accompanying code. Any questions and bug reports may be sent to Alain Laederach (alain@helix.stanford.edu).

Methods:

Below are function definitions and usage examples for the main scripts to carry out a complete analysis of a data set. By entering in Matlab the commands shown after >>, you can work through the example data provided here. (The steps that require a supercomputer are noted.)

Data entry:

function [time_bins,interp_data_ave,res_labels]=readxlsfootprintdata(filename);

This function reads in and plots data from an Excel spreadsheet. The data is binned and vectorized for k-means clustering.

Example:

>> [time_bins,interp_data_ave,res_labels]=readxlsfootprintdata('dataeg.xls');

Note the format of the data in the Excel file dataeg.xls; it is important that this format is respected if you are inputting your own data. The result will be (in this case) 21 figures showing the binned and extrapolated data.

Gap Statistic:

The data can now be analyzed using the gap statistic as follows:

>> khat=compute_Gap(time_bins,interp_data_ave,[2 6],10,100)

This will take a couple of minutes to run, but will return a value of khat=3. This means that the gap statistic has estimated that the data contain three clusters, and this value should be used in the clustering, which is the next step. Note that for this and the next step you need to have the Statistics Toolbox installed. To check, type:

>> kmeans
??? Error using ==> kmeans
At least two input arguments required.

If that is the error you get, then the toolbox is installed; if you get a different error message, it will have instructions on how to install the toolbox. If you do not want to install this toolbox or are using Octave, you will need to skip the clustering step; I have stored the results of the clustering, so for this demonstration you can continue.
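For reference, the gap statistic (as introduced by Tibshirani and co-workers, 2001) compares the within-cluster dispersion of the real data to that of data drawn uniformly over the same range, and picks the smallest number of clusters for which the gap stops improving significantly. The following is a minimal sketch of that idea, not the packaged compute_Gap implementation; the parameter choices (kmax, B, number of replicates) are illustrative only.

% Minimal sketch of the gap statistic; X has one observation per row
% (here, one row per footprinting probe, as in the clustering step below).
X = interp_data_ave'; [n,p] = size(X);
kmax = 6; B = 10;                        % illustrative choices
logW = zeros(kmax,1); logWref = zeros(kmax,B);
lo = min(X); hi = max(X);                % bounding box for the reference data
for k = 1:kmax
    [idx,C,sumd] = kmeans(X, k, 'Replicates', 10);
    logW(k) = log(sum(sumd));            % within-cluster dispersion, real data
    for b = 1:B                          % uniform reference data sets
        Xref = repmat(lo,n,1) + rand(n,p).*repmat(hi-lo,n,1);
        [idx,C,sumd] = kmeans(Xref, k, 'Replicates', 10);
        logWref(k,b) = log(sum(sumd));
    end
end
gap = mean(logWref,2) - logW;            % Gap(k)
sk = std(logWref,0,2)*sqrt(1 + 1/B);     % simulation error
khat = find(gap(1:end-1) >= gap(2:end) - sk(2:end), 1)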
Clustering:

Now we will cluster the data using k-means clustering:

>> [IDX, C, SUMD, D] = kmeans(interp_data_ave', 3,'Distance','cityblock','Replicates',100,'EmptyAction','singleton')

Notice how the number of clusters is specified: it is the second argument of the function. You can see the results of the clustering using the following two commands:

>> load visualization.mat
>> display_clusts(IDX,res_labels,C,exp(time_bins),1,interp_data_ave,imagex,offset,residue_locations,1,2)

Note that this is specific to the L-21 Tetrahymena ribozyme, as this function displays the data on the secondary structure.

Kinetic Model Fitting:

This step requires a supercomputer, but for the purposes of this example it is possible to run the code like this:

>> search_kspace('17')

After a while you should get output like:

Iteration Func-count f(x) step optimality CG-iterations
0 13 39.4335 2.54

This tells the code to test model number 17 (out of 28, given that k=3 and that the number of intermediates is 2). The code is set up to optimize the model for approximately 15 hours of CPU time, and this would need to be repeated for all models. This particular numerical solution is also the most computationally efficient; less efficient code that can handle stiff systems is available upon request from Alain Laederach (alain@helix.stanford.edu). Stiffness is common if there are large differences (greater than 3 orders of magnitude) between the slowest and fastest forming sites. (A sketch of how a model of this form can be integrated numerically is given at the end of this document.)

Optimization visualization:

The results of a supercomputing run are stored in the analysis_all.mat file and can be visualized by typing the command:

>> make_plots_interest(1,1,4)

Best fits to the data are shown, as well as the time evolution of the different intermediates. Note that the output values (mean_K) are the rate constants of the best-fitting model, with K(1,2) corresponding to the rate from U->I1, K(1,3) to the rate from U->I2, etc.

Flux analysis:

Flux analysis also requires a significant amount of time to compute accurately, but for the purposes of this demonstration the following command can be issued:

>> get_fluxes_t(1)

This command should be repeated at least 100 times to get better sampling, each time incrementing the argument by one (about 100 CPU hours; see the loop sketched at the end of this document).

Pathway Analysis:

This command analyzes the results of all the pathway data stored in the pathways folder:

>> pathways=analyze_state_pathways;

Pathway Visualization:

The most common pathways can be visualized by issuing the following command:

>> [clust_pathway,diff_pathway_vect]=pathway_analysis(pathways);

This will show a histogram of the fluxes through the major pathways.

The analysis presented above can be repeated with a different data set but will require the use of a supercomputer (between 200 and 300 nodes) for the computationally expensive steps. The scripts for generating massively parallel runs are not included in this distribution, as they are specific to the setup of the machine. For help with distributed computing setup, feel free to contact Alain Laederach (alain@helix.stanford.edu).
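The two sketches referenced above follow. First, the models explored by search_kspace are networks of first-order reactions; a minimal example of integrating one such network in Matlab is shown below. The four-species topology (U, I1, I2, N), the rate values, and the time span are illustrative assumptions only; this is not the packaged fitting code, which optimizes the rate constants against the footprinting data. ode15s is used because it copes with the stiff systems mentioned above.

% Minimal sketch: integrating a first-order kinetic network U -> I1, I2 -> N.
% K(i,j) is the rate constant for the transition from species i to species j,
% matching the K(1,2) = U->I1 convention described above.
K = zeros(4);           % illustrative values, not fitted rate constants
K(1,2) = 2.0;           % U  -> I1
K(1,3) = 0.5;           % U  -> I2
K(2,4) = 0.1;           % I1 -> N
K(3,4) = 0.05;          % I2 -> N
A = K';                                  % inflow into each species
A = A - diag(sum(K,2));                  % outflow on the diagonal
odefun = @(t,p) A*p;                     % dp/dt = A*p
p0 = [1; 0; 0; 0];                       % all molecules start unfolded
[t,p] = ode15s(odefun, [1e-3 100], p0);  % stiff solver, cf. note above
semilogx(t, p); legend('U','I1','I2','N');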
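Second, the repeated flux sampling described under Flux analysis can be automated with a simple loop, assuming each call to get_fluxes_t stores its own results in the pathways folder:

% Repeat the flux calculation at least 100 times for adequate sampling,
% incrementing the run index each time (about 100 CPU hours in total).
for i = 1:100
    get_fluxes_t(i);
end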