The latest version of this document is available at http://bruns.homeip.net/~bruns/sequoia.html
December 19, 2005: Version 0.9.9a binary created for Windows systems. See download section.
December 2005: Join the new sequoia mailing list! Keep informed about the latest sequoia news, and experiences from other users.
This program is intended to assist in the alignment of homologous protein sequences. It uses a conventional dynamic programming algorithm after the work of Needleman and Wunsch to find the globally optimal alignment given a particular residue comparison matrix and gapping model. I created this after being frustrated that I was unable to find a program which would align together two files of sequences, each of which contains pre-aligned sequences. Often in protein crystallography, we have structural information about how to align two protein sequences, which we wish to keep intact. Many of the available multiple alignment programs require a series of individual sequence files as input, and return a multiple sequence alignment as output, with very little opportunity for manual intervention along the way. In my experience, the results of these programs are almost always at odds with the structural alignments (i.e. wrong). I wrote this in an attempt to provide an alignment tool which the thoughtful modern protein biochemist might be able to use in conjunction with a variety of ancillary information. The sequence and matrix files can be modified easily with your favorite text editor.
Accolades, kudos, and other diplomatically worded bug reports should be sent to cmbruns@comcast.net.
Understanding the organization of sequence information is crucial to the effective use of SEQUOIA. Most of the operations in this program involve aligning two collections of things together. The "things" in this context may be sequences, groups of aligned sequences, or three-dimensional protein structures. There are a total of six registers, or containers, for these items. The two primary ones, SEQ1 and SEQ2, contain sequences or alignments. A third register, ALIGN, usually contains the combined result of aligning SEQ1 with SEQ2. The remaining three, STRUCT1, STRUCT2, and OVERLAY, are analogous to the first three, except that they contain tertiary structures rather than simple sequences. To summarize:
In the following examples, text following the "SEQUOIA>" prompt represent commands typed by the user. Other lines represent the output of the program. Commands may be typed in either upper or lower case letters.
SEQUOIA multiple sequence alignment tool version 0.9.7 copyright (c) 1995, 1996, 1997, 1998, 1999, 2000 by Chris Bruns, Ph.D. The Scripps Research Institute, La Jolla, CA All rights reserved SEQUOIA> read SEQ1 test1.seq Opening file ``test1.seq'' Read in 1 sequence(s) SEQUOIA> read SEQ2 test2.seq Opening file ``test2.seq'' Read in 1 sequence(s) SEQUOIA> align Running TABULATE command... Mean residue alignment score = 1.82649e-07 with a standard deviation of 2.08357 Aligning sequences... 0 gaps added to make alignment Alignment score is 44.6169 SEQUOIA> print ALIGN 1) (w=1) test1 2) (w=1) test2 . . . . .50 1) MQFKHFKLATLAAALAFSANSFADIT- 26 2) -MKTSIRYALLAAALTAATPALADITV 26 * ***** **** SEQUOIA> set SEQ1 ALIGN copied ALIGN to SEQ1 SEQUOIA> read SEQ2 test3.seq Opening file ``test3.seq'' Read in 1 sequence(s) SEQUOIA> align Running TABULATE command... Mean residue alignment score = 9e-07 with a standard deviation of 3.5 Aligning sequences... 0 gaps added to make alignment Alignment score is 30 SEQUOIA> print ALIGN 1) (w=1) test1 2) (w=1) test2 3) (w=1) test3 . . . . .50 1) ---MQFKHFKLATLAAALAFSANSFADIT- 26 2) ----MKTSIRYALLAAALTAATPALADITV 26 3) MKLRISSLGPVALLASSMMLAFGAQAA--- 27 * ** * SEQUOIA> write ALIGN test.align Opening file ``test.align'' Wrote 3 sequence(s) SEQUOIA> print ALIGN test.pretty Opening file ``test.pretty'' SEQUOIA> print id ALIGN Percent sequence identities: 1 2 3 1) 100 40 16 2) 40 100 30 3) 16 30 100 SEQUOIA> quit Program terminated by user
The PRINT ID command displays the level of sequence identity among the aligned sequences. The PRINT commands are not required for the alignment process, but they show that nothing ridiculous is going on. Only the READ, ALIGN, SET, and WRITE commands in the above example are required to generate the alignment. Additional sequences can be added to the alignment by repeatedly SETing the previous SEQ1 to ALIGNment, READing the new sequence into SEQ2, and ALIGNing. The best strategy is to ALIGN the most similar sequences first. Don't forget to WRITE the final alignment when you are done.
The whole process could be done in batch mode by making a script file called test.inp, containing the lines:
read seq1 test1.seq read seq2 test2.seq align set SEQ1 ALIGN read seq2 test3.seq align write test.align quitThen you could invoke this file from within SEQUOIA by typing
@test.inpor, from a shell in UNIX or MSDOS:
sequoia < test.inp
SEQUOIA> read STRUCT1 test1.pdb SEQUOIA> read STRUCT2 test2.pdb
SEQUOIA> align
SEQUOIA> overlay
If it is, go on to step 6. The OVERLAY step often fails for difficult problems, so this is one point where you may have to experiment to get things working. OVERLAY is a semi-automatic command which uses the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE.
Try increasing DCUTOFF to, say, 10. If this works, try slowly lowering DCUTOFF toward a more reasonable value (4.0 is good).
SEQUOIA> align SEQUOIA> set dcutoff 10 SEQUOIA> overlay SEQUOIA> set dcutoff 8 SEQUOIA> overlay SEQUOIA> set dcutoff 6 SEQUOIA> overlay SEQUOIA> set dcutoff 4 SEQUOIA> overlay
If this sort of thing fails, try using a molecular graphics program to place the structures into approximate alignment. Then try something like:
SEQUOIA> read STRUCT1 test1.pdb SEQUOIA> read STRUCT2 test2.pdb SEQUOIA> set dcutoff 10 SEQUOIA> align SEQUOIA> salign SEQUOIA> equivalence SEQUOIA> overlayetc...
Some parameters you should consider adjusting to change the performance of the overlay procedure are GPEN, EPEN, USEANGLE, ACUTOFF, DCUTOFF, and RUNLENGTH. You may also wish to consider manually using the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY command.
SEQUOIA> overlay SEQUOIA> overlay SEQUOIA> overlayetc...
Eventually the structure based alignment will probably stabilize. The RMS deviations will remain about the same, and the number of aligned residues will remain the same. Examine the alignment to ensure that everything is the way you want it. If things have gone well you will have a reasonable least-squares superposition of the structures, and you will have a good starting point for a structural alignment.
DO NOT BLINDLY TRUST THIS ALIGNMENT. In my experience most features will be correct, but it is necessary for an intelligent and knowledgeable human to compare the structures and the alignment to make final adjustments.
Some parameters you should consider adjusting to change the performance of the overlay procedure are GPEN, EPEN, USEANGLE, ACUTOFF, DCUTOFF, and RUNLENGTH. You may also wish to consider manually using the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY command.
SEQUOIA> write OVERLAY test2rot.pdb
SEQUOIA> write ALIGN test1_2struct.align
Type "HELP" at the SEQUOIA prompt for on-line help
weight(i) = 1 / (SUM(over j) Proximity(i,j))For now the Proximity between sequences i and j is defined as the sequence identity. The weights are then normalized so that the smallest weight in each sequence is unity. This weighting will ordinarily improve the significance of alignments between large groups of distantly related sequences. If you ever find a case where it hurts, please let me know.
Most modern computational sequence alignment methods can be placed into one of two categories: local and global alignment methods. SEQUOIA uses a global alignment algorithm.
Both classes of techniques use some sort of comparison matrix to determine which amino acid residues are "similar". Local methods find continuous segments of similar residues between two protein sequences. The equivalent segments will pair up residues that have high comparison scores, so that the each segment has a high total sum.
Global alignment methods align two entire sequences together so as to maximize the total sum of the comparison scores of all aligned residues, not just those in the most similar continuous segments. To accomplish this, gaps are ordinarily inserted into the sequences to account for insertions and deletions in the sequences during the course of evolution.
How do you calculate which is the very highest scoring sequence alignment between two sequences? One possibility would be to explicitly generate every possible alignment, calculate each score, and save the highest scoring one. Unfortunately, the total number of alignments is quite large. If n is the number of residues in the sequences, the total number of alignments is proportional to 2 raised to the n power, or n factorial, or something like that. Whichever it is, it turns out to be greater than the number of particles in the universe for reasonable sized cases. I don't care how fast your computer is, it cannot do this.
Fortunately the global alignment problem can be broken down into a series of subproblems, so that the algorithmic technique of dynamic programming can be used. This means that in this case the optimal solution can be found in time proportional to n squared, which is accessible even on modest computers (Perhaps n cubed for more complicated gapping schemes -- more complicated than anything currently in SEQUOIA). It is important to realize that both memory and time requirements are proportional to the square of the sequence length. That is, if you change to sequences that are a bit more than three times as long, the time and memory requirements will increase by a factor of ten. But it is still far superior to that "number of particles in the universe" monkey business.
Proteins tend to accumulate insertions and deletions of stretches of amino acids during the course of evolution. This means that two homologous proteins which have diverged significantly will have regions of insertion and deletion relative to one another. This means that the "true" alignment between the two protein sequences will have to account for these insertion and deletion events. Each such event is usually represented by a gap in one of the aligned sequences.
I have attempted to keep the file format simple. What follows is a canonical sequence file in a format readable by SEQUOIA.
> A glycine rich sequence GGDFHINMVGRGEGLPWGTGQASGDGPF RGHTGLPGDV*The title of the sequence goes on its own line with a ">" character in the first position. The sequence information then follows on subsequent lines in free format until the "*" character, or the next ">" character, indicating a new sequence. In the output files, the entire actual sequence is shoved onto one line, for ease of editing the alignment. If the first character of the sequence is a '#', then it should be followed by digits, to indicate the number of the first residue. Thus a sequence which begins with residue number 17 would look like:
> partial sequence #17gvcsnflqwertyiop
version 0.9.9a December 6, 2005 - removed time limit license
version 0.9.9a December 6, 2005 - created new binaries for Mac OS X Tiger (10.4) and Linux
version 0.9.9 December 13, 2002 - compiled new version for IRIX 6.5
version 0.9.9 December 12, 2002 - compiled new version for MSDOS/windows
version 0.9.9 June 28, 2002 - set alignment score to zero when one structure contains no protein
version 0.9.7 May 08, 2000 - replaced broken RANDOM command. Significance is now reported with respect to extreme-value distribution.
version 0.9.6 - removed output buffering. gpen is updated before every alignment
version 0.9.5 May 04, 2000 - restored EQUIVALENCE command to former functionality. SUN version works again.
version 0.9.4 December 1999 - updated expiration date. Partially implemeted equivalence command is broken.
version 0.9.3b. New features include command line history and editing, plus a bug fix for cases where a D- amino acid could cause structural alignment to fail ungracefully.
The latest version of this document is available at http://bruns.homeip.net/~bruns/sequoia.html
Last modified