> Assigning gap opening and gap extension penalties
Gap penalties are an essential part of protein sequence alignment when using dynamic
programming. Gaps in an alignment correspond to an insertion/deletion event in evolution,
which can turn out to be more significant than, for example, a single mutation. Without
proper gap penalties, dynamic programming will result in scores that are not ideal and
alignments that will not show the insertions or deletions that are likely to have occurred
due to evolutionary pressures. The most widely used scheme to penalise the alignment score
for insertion/deletion events, and the one used in PRALINE, is that of affine gap penalties:
P = Po + Px * l, where Po is the gap opening penalty (paid once for each gap), Px
the gap extension penalty (paid for each gap position) and
l the number of gap positions (the length of the gap).
This penalty is assigned when a gap is inserted and is subtracted from the total score,
causing the score of the amino acid pair being aligned to drop. The higher the gap penalties,
the stricter the insertion of gaps into the alignment and, as a result, the fewer gaps are inserted.
Values that are too high will force a gap-less alignment, while values that are too low will lead to
alignments with very many gaps that allow (near-)identical amino acids to be aligned.
In both cases the alignment will be inaccurate. The way the gap penalties
affect the alignment depends directly on the residue exchange matrix you are using.
Unfortunately, there is not yet a theory as to how the gap penalties should be chosen for
a particular residue exchange matrix. Therefore, gap penalties are set empirically.
For example, penalties of 12 and 1 (default) might work well with BLOSUM62, but the
suggested values for PAM250 are 10 and 1.
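The affine formula above can be illustrated with a short Python sketch. The function name and its parameter names are hypothetical and not part of PRALINE; the defaults of 12 and 1 are simply the default opening and extension penalties mentioned above.

```python
def affine_gap_penalty(length, gap_open=12.0, gap_extend=1.0):
    """Return P = Po + Px * l for a single gap of `length` positions.

    gap_open (Po) is paid once per gap, gap_extend (Px) once per gap position.
    The name and signature are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, so nothing is subtracted from the score
    return gap_open + gap_extend * length


# One gap of five positions versus five separate gaps of one position each.
one_long_gap = affine_gap_penalty(5)         # 12 + 1 * 5
five_short_gaps = 5 * affine_gap_penalty(1)  # 5 * (12 + 1 * 1)
```

Because the opening penalty is paid once per gap, a single long gap is penalised far less than the same number of gap positions scattered over several gaps, which is the behaviour the affine scheme is meant to produce.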
Further reading suggestions:
- Heringa J., Frishman D. and Argos P. (1997) Computational Methods Relating Protein
Sequences and Structure, in Protein: A Comprehensive
Treatise, Vol. 1, Pages 165-268, JAI Press.
> Choosing between residue exchange matrices
The alignment or matching of two protein sequences is thought to be an attempt to simulate
evolution. In some cases the evolutionary traces are available, but in most cases they are
unclear, ambiguous, or not there at all. Therefore, when two sequences are tested for their
similarity it is important to consider the evolutionary changes that have occurred for the
one sequence to arrive at the form of the second. This is done by taking the minimum number
of mutations that may have occurred during the evolution of the two sequences. Dayhoff et
al (1978, 1983) performed a comparison of 72 protein families (1300 sequences) and recorded
the frequencies of residue substitution. These scores were tabulated and converted into
mutational probabilities for one amino acid mutation in 100 (1%). This matrix of
probabilities was named PAM-1 and is converted through a series of 250 self-multiplications
to give the PAM-250 matrix. The most commonly used PAM-250 matrix is the PAM-250 log-odds
matrix. Since then researchers have constructed new PAM-250 matrices based on larger databases
(Jones, 1992: 23,000 sequences; Gonnet et al, 1992: 1.7 million residues).
In 1992, Henikoff and Henikoff constructed a new residue substitution
matrix called BLOSUM62, based on 2000 multiple sequence alignments with pairwise sequence
identity of 62% or less, that they constructed from the PROSITE (Bairoch, 1993) database.
So, these matrices hold the estimated evolutionary information of amino acid mutations and in
conjunction with gap penalties are used by dynamic programming to give the optimal alignment
of two sequences.
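The step from PAM-1 to PAM-250 can be sketched in Python with a toy example. Everything in it is an assumption made for illustration: the 3-letter alphabet, the transition probabilities, the uniform background frequencies, and the reading of "PAM-n" as the PAM-1 matrix raised to the n-th power. The log-odds scaling of 10 * log10 is a common convention, not a value taken from this page, and the real matrices cover 20 amino acids.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m raised to the n-th power (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1": each row holds the probabilities that one residue type is
# replaced by each type in one evolutionary step (rows sum to 1).
# The values are invented for illustration.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform residue frequencies

toy_pam250 = mat_power(TOY_PAM1, 250)

# Log-odds: observed substitution probability versus chance expectation.
toy_log_odds = [[10 * math.log10(toy_pam250[i][j] / BACKGROUND[j])
                 for j in range(3)] for i in range(3)]
```

In the resulting toy log-odds matrix, positive entries mark substitutions that are seen more often than expected by chance and negative entries mark those seen less often, which is how the scores are used during dynamic programming.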
PRALINE offers a small selection of the available residue exchange matrices for scoring
the alignments, including Dayhoff's PAM250 log-odds matrix and BLOSUM62 (default). We cannot
suggest any strong preference, apart from the fact that the high-number PAM and low-number
BLOSUM matrices (e.g. BLOSUM45 and PAM250) have been designed for divergent sequences, while
the high-number BLOSUM and low-number PAM matrices (e.g. BLOSUM80 and PAM1) are applicable
to more closely related sequences. The default BLOSUM62 has performed well in the cases tested
so far.
Further reading suggestions:
- Heringa J., Frishman D. and Argos P. (1997) Computational Methods Relating Protein
Sequences and Structure, in Protein: A Comprehensive
Treatise, Vol. 1, Pages 165-268, JAI Press.
> What is a global progressive alignment strategy and why use it?
A global alignment strategy aligns sequences over their whole length. This means that
highly conserved segments or motifs are not considered as an inseparable unit, unless the
composition of the alignments is such that these regions have no large intervening insertions
or deletions. Therefore, a global alignment strategy is optimal for sequences of high to
medium sequence similarity. The cases of medium to low sequence similarity are best aligned
using a local alignment algorithm such as Dialign2. However, local alignment algorithms keep
conserved segments in place and align the remaining sequence between them, which makes the
alignment less accurate than global strategies in cases of point mutations.
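As a rough illustration of what "aligning over the whole length" means, here is a minimal Python sketch of global pairwise alignment scoring by dynamic programming. It is a simplification and not PRALINE's implementation: it assumes a plain match/mismatch score instead of a residue exchange matrix and a linear gap penalty instead of affine penalties, and the function name and default values are hypothetical.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of seq_a and seq_b.

    Simplified scoring: identity-based pair scores and a linear gap
    penalty (the same cost for every gap position).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Leading gaps are penalised too: both sequences must be covered
    # from their first residue to their last.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # The bottom-right cell covers both sequences completely.
    return score[-1][-1]
```

The score is read from the last cell of the table, so every residue of both sequences contributes to it. A local method would instead take the best-scoring cell anywhere in the table.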
> How to use pre-profile processing to enhance alignments
Pre-profile processing is an optimisation technique used to minimise the incorporation
of erroneous information during progressive alignment. The difference between this strategy
and the standard global strategy is that the sequences to be aligned are represented by
pre-profiles instead of single sequences.
First, every sequence pair is given a score, according to their pair-wise alignment score:
- In the Global pre-profile processing strategy, the score
is the global alignment score of the pairs.
- In the PSI-BLAST pre-profile processing strategy, we use the E-value score of each
hit from the database.
A user-defined minimum score can then be used as a threshold to incorporate or exclude
sequences from the pre-profiles, building more consistent and useful pre-profiles for the
alignment. A threshold higher than the maximum score means no sequences are included in
the pre-profiles, so each pre-profile contains only a single sequence. On the other hand,
if the threshold is lower than the minimum score, the pre-profiles include all sequences.
In addition, if a negative number threshold is used, then the assessment of the
pre-processing scores is done taking sequence length into consideration. In the PSI-BLAST
approach, we rely on PSI-BLAST to only include database hits that score higher than the
set E-value cut-off.
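The effect of the threshold in the Global strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the data layout (one score per unordered sequence pair) and the choice of "score >= threshold" as the inclusion test are hypothetical, and the length-aware handling of negative thresholds is not modelled.

```python
def build_preprofiles(names, pair_scores, threshold):
    """Return, for each sequence, the list of sequences in its pre-profile.

    names       -- sequence identifiers
    pair_scores -- {(name_a, name_b): pairwise alignment score}, each
                   unordered pair listed once (assumed layout)
    threshold   -- minimum score for a partner sequence to be included
    """
    # Every pre-profile always contains its own master sequence.
    preprofiles = {name: [name] for name in names}
    for (a, b), score in pair_scores.items():
        if score >= threshold:  # assumed inclusion rule
            preprofiles[a].append(b)
            preprofiles[b].append(a)
    return preprofiles


# Invented scores for three sequences.
example_names = ["s1", "s2", "s3"]
example_scores = {("s1", "s2"): 120, ("s1", "s3"): 40, ("s2", "s3"): 75}

only_self = build_preprofiles(example_names, example_scores, 200)
everything = build_preprofiles(example_names, example_scores, 0)
partial = build_preprofiles(example_names, example_scores, 70)
```

With a threshold above the highest score every pre-profile holds only its own sequence, and with a threshold below the lowest score every pre-profile holds all sequences, matching the two extremes described above.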
> Using DSSP or predicted secondary structure information
The PRALINE software currently allows the incorporation
of DSSP-defined secondary structure information to guide the inclusion of gaps outside
segment regions. If no DSSP is available, a choice of 4 prediction methods is available
to determine the secondary structure for those sequences that do not have a PDB structure.
The choice of prediction method to use for the alignment is left to the user's discretion.
For help, we have compiled the table below where we provide the publication reference and
links to the EVA Server
evaluation figures for each method, if available.
Please note: DSSP information is looked up based on your (FASTA) sequence definition line.
The definition line should consist only of a PDB identifier. For example, these definition lines are fine: ">102L_A", ">102L|A" and ">102LA".
For any other definition line, DSSP is not extracted. No description may follow the sequence identifier.
Thus ">pdb|102L|A" and ">gi|157829524|pdb|102L|A", but also ">102L_A " (note the trailing space), are skipped.
Method | Publication | EVA assessment link
PSIPRED | Jones, 1999 | Go to EVA
SSPRO4 | Pollastri et al, 2001 | Go to EVA
PORTER | Pollastri and McLysaght, 2005 | Go to EVA
YASPIN | Lin et al, 2005 | Go to EVA
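To check definition lines before submitting, the rule on PDB identifiers described earlier in this section can be approximated with a regular expression. The pattern below is an assumption inferred only from the examples given (a four-character PDB code starting with a digit, an optional "_" or "|" separator, and a one-character chain identifier); it is not PRALINE's actual parser, and the function name is hypothetical.

```python
import re

# '>' + 4-character PDB code + optional '_' or '|' + 1-character chain ID.
# Inferred from the examples in the text; the real rule may differ.
PDB_DEFLINE = re.compile(r">[0-9][A-Za-z0-9]{3}[_|]?[A-Za-z0-9]")


def looks_like_pdb_defline(line):
    """Return True if the definition line consists only of a PDB identifier."""
    # Remove only the line terminator; a trailing space must stay so that
    # it causes a rejection, as described in the text.
    return PDB_DEFLINE.fullmatch(line.rstrip("\r\n")) is not None
```

Using `fullmatch` means that anything before or after the identifier, including a single trailing space, makes the line fail the check.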
> Using predicted transmembrane structure information
The PRALINE software also allows the incorporation of
predicted transmembrane structure information. Currently the user can choose
from 3 prediction methods to guide the creation of transmembrane-aware multiple
alignments. For help, we have compiled the table below where we provide the
publication reference.
Method | Publication
PHOBIUS | Kall et al, 2005
TMHMM2.0 | Krogh et al, 2001
HMMTOP2.1 | Tusnady and Simon, 2001
> Trees in protein sequence alignment
PRALINE trees are not evolutionary dendrograms, but simply show the relationship
between the sequences as determined by the alignment.
> How to customize colours in PRALINE
If you have selected the Custom Colour Scheme option,
you will see this table before the job is submitted. Select the colour you would like
each amino acid to appear in from the options below:
When you have checked that you have selected all desired
colours for each amino acid, click on the "PRALINE Run" button to submit the
alignment job.
> PRALINE output formats
The MSF format
MSF is the multiple sequence alignment format of the GCG
sequence analysis package. A file in MSF format has certain compulsory features that
define it. Keywords "MSF:", "Type:" and "Check:" need to be in a line that ends with
two periods (dots) (see below). The following abbreviations are used: MSF: alignment
length (length of longest sequence), Type: P for protein sequences, N for nucleotide
sequences and Check: gives a checksum made up of the ASCII values of the sequence
characters. This value can be used to check whether an alignment has been edited since
it was created.
After the periods, and preceding the alignment, there is the alignment description
part consisting of: Name: sequence name (identifier), Len: the sequence length, Check:
the checksum for the sequence and Weight: the sequence weight. Every sequence name has
to be unique (different names for any sequence pair). No blank is accepted within a
sequence name. E.g. 'Id seq a' will be interpreted as: sequence name = Id, and amino
acids 1-4 = s, e, q, a. A sequence name such as 'Id_seq_0' can be at most 13 characters long. The
fields "Len:", "Check:" and "Weight:" are not used by all software but are a compulsory
part of the MSF format. If the software one uses does not use this
information, any number can be inserted.
After the alignment description part there is an
essential double forward slash: "//" that acts as the termination of the header list.
After this the alignment is expected to begin. The rest of the file is interpreted as
alignment. Any line not starting with a sequence identifier (as given in the header!)
is ignored. If a line starts with a correct identifier, say Id_seq_n, everything
following the first word of this line is appended to the sequence Id_seq_n.
Pile Up
MSF: 46 Type: P Check: 5859 ..
Name: Sequence_1 Len: 46 Check: 750 Weight: 1.00
Name: Sequence_2 Len: 46 Check: 3980 Weight: 1.00
//
Sequence_1 RQLVHVVKWA KALPGFRNLH VDDQMAVIQY SWMGLMVFAM GWRSFT
Sequence_2 .QLLSVVKWS KSLPGFRNLH IDDQITLIQY SWMSLMVFGL GWRSYK
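The reading rules described above (names taken from the "Name:" lines, "//" ending the header, and alignment lines appended to the sequence whose identifier starts the line) can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name, not a complete MSF reader: it ignores the "Len:", "Check:" and "Weight:" fields and does not verify checksums.

```python
def parse_msf(text):
    """Return {sequence name: aligned sequence} from MSF-formatted text."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped == "//":
                in_alignment = True          # header list is finished
            elif stripped.startswith("Name:"):
                name = stripped.split()[1]   # second word is the identifier
                sequences[name] = ""
            continue
        parts = stripped.split()
        # Lines that do not start with a known identifier are ignored.
        if parts and parts[0] in sequences:
            # Everything after the first word belongs to that sequence.
            sequences[parts[0]] += "".join(parts[1:])
    return sequences


EXAMPLE_MSF = """Pile Up
MSF: 46 Type: P Check: 5859 ..
Name: Sequence_1 Len: 46 Check: 750 Weight: 1.00
Name: Sequence_2 Len: 46 Check: 3980 Weight: 1.00
//
Sequence_1 RQLVHVVKWA KALPGFRNLH VDDQMAVIQY SWMGLMVFAM GWRSFT
Sequence_2 .QLLSVVKWS KSLPGFRNLH IDDQITLIQY SWMSLMVFGL GWRSYK
"""

example_alignment = parse_msf(EXAMPLE_MSF)
```

For the example file above, both parsed sequences come out with the 46 positions announced in the header, with the blanks between the blocks of ten residues removed.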
The FASTA format
The FASTA format is defined by the '>'
symbol at the beginning of the first line, followed on the same line by an identifier for
the sequence; the sequence itself starts on the next line. Additional sequences each have the '>' symbol at the start of their
identifier's line:
>SEQUENCE ID
Amino acid sequence (with or without gaps)
>SEQUENCE ID2
Amino acid sequence (with or without gaps)
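A minimal Python sketch of reading this layout is given below. The function name is hypothetical, and the sketch assumes that identifiers are unique and that a sequence may be spread over several lines.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith(">"):
            current = line[1:]            # identifier line, kept verbatim
            records[current] = ""
        elif current is not None:
            # Sequence lines (with or without gaps) are concatenated.
            records[current] += "".join(line.split())
    return records


example_records = parse_fasta(">seq1\nMKT-LV\nAAG\n>seq2\nMKTALV\n")
```

The identifier is kept exactly as written after the '>' symbol, so any trailing characters on that line remain part of it.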