IBIVU Server - PRALINE FAQs and Information


References and FAQs

PRALINE is a multiple sequence alignment program with many options to optimise the information for each of the input sequences; e.g. homology-extended alignment, predicted secondary structure and/or transmembrane structure information and iteration capabilities.

When using PRALINE please cite:

• Secondary structure integration:
Pirovano W., Simossis V.A. and Heringa J. (2009) submitted

• Transmembrane structure integration:
Pirovano W., Feenstra K.A. and Heringa J. (2008) Bioinformatics 24(4), 492-497. Open Access

• Homology-extended alignment strategy:
Simossis V.A., Kleinjung J. and Heringa J. (2005) Nucleic Acids Res. 33(3):816-824. Open Access

• Iterated profile scoring scheme:
Heringa J. (2002) Comput. Chem. 26(5), 459-77. Abstract
Heringa J. (2000) Curr Protein Pept Sci. 1(3), 273-301. Review. Abstract

• Original alignment method:
Heringa J. (1999) Computers and Chemistry, 23, 341-364. Abstract

• Server usage:
Simossis V.A. and Heringa J. (2005) Nucleic Acids Res. 33, W289-W294. Open Access
Simossis V.A. and Heringa J. (2003) Comp. Biol. Chem. 27(4-5), 511-519. Abstract

> Assigning gap opening and gap extension penalties

Gap penalties are an essential part of protein sequence alignment by dynamic programming. A gap in an alignment corresponds to an insertion/deletion event in evolution, which can be more significant than, for example, a single mutation. Without proper gap penalties, dynamic programming produces suboptimal scores and alignments that fail to show the insertions or deletions likely to have occurred under evolutionary pressure. The most widely used scheme for penalising insertion/deletion events, and the one used in PRALINE, is that of affine gap penalties: P = P_o + P_x * l, where P_o is the gap opening penalty (paid once for each gap), P_x the gap extension penalty (paid for each gap position) and l the number of gap positions (the length of the gap). The penalty is applied whenever a gap is inserted and is subtracted from the total score, so that the score at the position being aligned drops. The higher the gap penalties, the more restrictive the insertion of gaps into the alignment and, as a result, the fewer gaps are inserted. Penalties that are too high force a gap-less alignment, while penalties that are too low lead to alignments with very many gaps, inserted so that (near-)identical amino acids can be aligned.

In both cases the alignment will be inaccurate. The way the gap penalties affect the alignment depends directly on the residue exchange matrix you are using. Unfortunately, there is not yet a theory for how gap penalties should be chosen for a given residue exchange matrix. Gap penalties are therefore set empirically: for example, penalties of 12 and 1 (the default) might work well with BLOSUM62, whereas the suggested values for PAM250 are 10 and 1.
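The affine scheme P = P_o + P_x * l can be illustrated with a few lines of Python. This is only a sketch of the formula, not PRALINE's own code; the function name is hypothetical, and it assumes that the quoted pairs "12 and 1" and "10 and 1" are given in the order opening, extension.

```python
def affine_gap_penalty(gap_length, gap_open=12.0, gap_extend=1.0):
    """Return P = P_o + P_x * l for a single gap of length l.

    Defaults assume 12 = opening and 1 = extension penalty (an assumption
    about the order of the values quoted for BLOSUM62).
    """
    if gap_length <= 0:
        return 0.0  # no gap, nothing to pay
    return gap_open + gap_extend * gap_length


# One gap of three positions versus three separate gaps of one position:
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Because the opening penalty is paid once per gap, a single gap of three positions costs less than three isolated one-position gaps, which is why affine penalties favour a small number of longer gaps.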

Further reading suggestions:

Heringa J., Frishman D. and Argos P. (1997) Computational Methods Relating Protein Sequences and Structure, in Protein: A Comprehensive Treatise, Vol. 1, Pages 165-268, JAI Press.

> Choosing between residue exchange matrices

Aligning or matching two protein sequences can be thought of as an attempt to simulate evolution. In some cases the evolutionary traces are available, but in most cases they are unclear, ambiguous, or absent altogether. Therefore, when two sequences are tested for their similarity it is important to consider the evolutionary changes that must have occurred for one sequence to arrive at the form of the other. This is done by taking the minimum number of mutations that may have occurred during the evolution of the two sequences.

Dayhoff et al. (1978, 1983) compared 72 protein families (1300 sequences) and recorded the frequencies of residue substitution. These frequencies were tabulated and converted into mutation probabilities corresponding to one amino acid mutation per 100 residues (1%). This matrix of probabilities was named PAM-1, and it is converted into the PAM-250 matrix through repeated self-multiplication (250 steps). The most commonly used form of PAM-250 is the PAM-250 log-odds matrix. Since then researchers have constructed new PAM-250 matrices based on larger databases (Jones, 1992: 23,000 sequences; Gonnet et al., 1992: 1.7 million residues). In 1992, Henikoff and Henikoff constructed a new residue substitution matrix called BLOSUM62, based on 2000 multiple sequence alignments with pairwise sequence identity of 62% or less, which they derived from the PROSITE (Bairoch, 1993) database. These matrices thus hold the estimated evolutionary information of amino acid mutations and, in conjunction with gap penalties, are used by dynamic programming to give the optimal alignment of two sequences.
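The step from PAM-1 to PAM-250 by self-multiplication can be sketched as follows. The example uses a made-up three-letter alphabet with invented probabilities, not Dayhoff's data, and the helper names are hypothetical; it only shows the matrix arithmetic.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply a matrix by itself: start from the identity, apply `power` steps."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM-1": each row gives the probabilities of a residue staying the same
# or mutating into one of the others in a single step (illustrative values).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result is still a probability distribution, but the diagonal entries are much smaller than in the one-step matrix, reflecting the larger evolutionary distance. A log-odds matrix would then be obtained by comparing these probabilities with the background residue frequencies, a step this sketch omits.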

PRALINE offers a small selection of the available residue exchange matrices for scoring the alignments, including Dayhoff's PAM250 log-odds matrix and BLOSUM62 (default). We cannot recommend one matrix over another in general, apart from noting that the high-number PAM and low-number BLOSUM matrices (e.g. PAM250 and BLOSUM45) have been designed for divergent sequences, while the high-number BLOSUM and low-number PAM matrices (e.g. BLOSUM80 and PAM1) are applicable to more closely related sequences. The default BLOSUM62 has performed well in the cases tested so far.

Further reading suggestions:

Heringa J., Frishman D. and Argos P. (1997) Computational Methods Relating Protein Sequences and Structure, in Protein: A Comprehensive Treatise, Vol. 1, Pages 165-268, JAI Press.

> What is a global progressive alignment strategy, and why use it?

A global alignment strategy aligns sequences over their whole length. This means that highly conserved segments or motifs are not considered as an inseparable unit, unless the composition of the alignments is such that these regions have no large intervening insertions or deletions. Therefore, a global alignment strategy is optimal for sequences of high to medium sequence similarity. The cases of medium to low sequence similarity are best aligned using a local alignment algorithm such as Dialign2. However, local alignment algorithms keep conserved segments in place and align the remaining sequence between them, which makes the alignment less accurate than global strategies in cases of point mutations.
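To make "aligning sequences over their whole length" concrete, here is a minimal sketch of pairwise global alignment scoring by dynamic programming (Needleman-Wunsch). It is a simplification and not PRALINE's algorithm: it handles only two sequences, uses a plain match/mismatch score instead of a residue exchange matrix, and charges a linear gap cost rather than affine penalties. The function name and the scoring values are illustrative assumptions.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of seq_a and seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue,
    # so leading gaps are penalised (this is what makes it global).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Every residue of both sequences contributes to the final score, which is the defining property of a global strategy; a local method would instead be free to ignore poorly matching ends.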

> How to use pre-profile preprocessing to enhance alignments

Pre-profile processing is an optimisation technique used to minimise the incorporation of erroneous information during progressive alignment. The difference between this strategy and the standard global strategy is that the sequences to be aligned are represented by pre-profiles instead of single sequences.
First, every sequence pair is given a score based on its pairwise alignment:

• In the Global pre-profile processing strategy, the score is the global alignment score of the pair.
• In the PSI-BLAST pre-profile processing strategy, the score is the E-value of each hit from the database.

A user-defined minimum score can then be used as a threshold to include sequences in, or exclude them from, the pre-profiles, which makes the pre-profiles more consistent and more useful for the alignment. A threshold higher than the maximum score means that no other sequences are included, so each pre-profile contains only a single sequence. Conversely, if the threshold is lower than the minimum score, the pre-profiles include all sequences. In addition, if a negative threshold is used, the pre-processing scores are assessed taking sequence length into consideration. In the PSI-BLAST approach, we rely on PSI-BLAST to include only those database hits that pass the set E-value cut-off.
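The thresholding step for the Global strategy can be sketched as below. This is an illustration under stated assumptions, not PRALINE's implementation: the function name and data layout are hypothetical, a score equal to the threshold is assumed to pass, and the length-aware behaviour of negative thresholds is not modelled.

```python
def build_preprofiles(sequence_ids, pair_scores, threshold):
    """Return, for each sequence, the sequences that make up its pre-profile.

    pair_scores maps frozenset({id_a, id_b}) to the pairwise alignment score.
    A pre-profile always contains its own sequence; other sequences are added
    only if their pairwise score reaches the threshold (assumed inclusive).
    """
    preprofiles = {}
    for seq_id in sequence_ids:
        members = [seq_id]
        for other in sequence_ids:
            if other == seq_id:
                continue
            if pair_scores[frozenset((seq_id, other))] >= threshold:
                members.append(other)
        preprofiles[seq_id] = members
    return preprofiles


# Made-up scores for three sequences.
ids = ["A", "B", "C"]
scores = {
    frozenset(("A", "B")): 50,
    frozenset(("A", "C")): 10,
    frozenset(("B", "C")): 30,
}

print(build_preprofiles(ids, scores, threshold=25))
print(build_preprofiles(ids, scores, threshold=100))
```

With the threshold above the maximum score, every pre-profile reduces to its own sequence; with a threshold below the minimum score, every pre-profile contains all sequences.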

> Using DSSP or predicted secondary structure information

The PRALINE software currently allows the incorporation of DSSP-defined secondary structure information to guide the insertion of gaps outside secondary structure segments. For sequences that do not have a PDB structure, and hence no DSSP information, a choice of four prediction methods is available to determine the secondary structure. The choice of prediction method is left to the user's discretion. As a guide, the table below gives the publication reference for each method and a link to its EVA Server evaluation figures, where available.

Please note: DSSP information is looked up based on your (FASTA) sequence definition line, which must consist of a PDB identifier only. For example, these definition lines are fine: ">102L_A", ">102L|A" and ">102LA". For any other definition line, DSSP information is not extracted, and no description may follow the sequence identifier. Thus ">pdb|102L|A" and ">gi|157829524|pdb|102L|A", but also ">102L_A " (note the trailing space), are skipped.
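A quick way to check definition lines before submitting is a small pattern test like the one below. The regular expression is an assumption inferred from the examples in the note above (a four-character PDB code starting with a digit, an optional "_" or "|" separator, and a single chain character); the rule the server actually applies may differ.

```python
import re

# Assumed pattern: ">" + 4-character PDB code + optional "_" or "|" + chain.
PDB_DEFINITION_LINE = re.compile(r">[0-9][A-Za-z0-9]{3}[_|]?[A-Za-z0-9]")


def is_pdb_definition_line(line):
    """True if the whole line is a bare PDB identifier (no trailing text)."""
    return PDB_DEFINITION_LINE.fullmatch(line) is not None


for header in [">102L_A", ">102L|A", ">102LA",
               ">pdb|102L|A", ">gi|157829524|pdb|102L|A", ">102L_A "]:
    print(repr(header), is_pdb_definition_line(header))
```

`fullmatch` is used so that anything after the identifier, including a single trailing space, causes the line to be rejected.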

Method     Publication                      EVA assessment link
PSIPRED    Jones, 1999                      Go to EVA
SSPRO4     Pollastri et al, 2001            Go to EVA
PORTER     Pollastri and McLysaght, 2005    Go to EVA
YASPIN     Lin et al, 2005                  Go to EVA

> Using predicted transmembrane structure information

The PRALINE software also allows the incorporation of predicted transmembrane structure information. Currently the user can choose from three prediction methods to guide the creation of transmembrane-aware multiple alignments. As a guide, the table below gives the publication reference for each method.

Method       Publication
PHOBIUS      Kall et al, 2005
TMHMM2.0     Krogh et al, 2001
HMMTOP2.1    Tusnady and Simon, 2001

> Trees in protein sequence alignment

PRALINE trees are not evolutionary dendrograms, but simply show the relationship between the sequences as determined by the alignment.

> How to customize colours in PRALINE

If you have selected the Custom Colour Scheme option, you will see this table before the job is submitted. Select the colour you would like each amino acid to appear in from the options below:

Colour options (one per amino acid): Gray, Blue, Yellow, Orange, Red, Green, Pink, No colour

Amino acid (A/A)
Alanine [A]
Arginine [R]
Aspartic Acid [D]
Asparagine [N]
Cysteine [C]
Glutamic Acid [E]
Glutamine [Q]
Glycine [G]
Histidine [H]
Isoleucine [I]
Leucine [L]
Lysine [K]
Methionine [M]
Phenylalanine [F]
Proline [P]
Serine [S]
Threonine [T]
Tryptophan [W]
Tyrosine [Y]
Valine [V]

When you have checked that you have selected all desired colours for each amino acid, click on the "PRALINE Run" button to submit the alignment job.

> PRALINE output formats

The MSF format

MSF is the multiple sequence alignment format of the GCG sequence analysis package. A file in MSF format has certain compulsory features. The keywords "MSF:", "Type:" and "Check:" must appear on a line that ends with two periods (dots) (see below). They have the following meanings: "MSF:" gives the alignment length (the length of the longest sequence); "Type:" is P for protein sequences or N for nucleotide sequences; and "Check:" gives a checksum computed from the ASCII values of the sequence characters. This value can be used to check whether an alignment has been edited since it was created.

After the periods, and preceding the alignment, comes the alignment description part. Each entry consists of "Name:" (the sequence name or identifier), "Len:" (the sequence length), "Check:" (the checksum for the sequence) and "Weight:" (the sequence weight). Every sequence name has to be unique, and no blank is accepted within a sequence name. For example, 'Id seq a' will be interpreted as sequence name = Id and amino acids 1-4 = s, e, q, a. A sequence name such as 'Id_seq_0' can be at most 13 characters long. The fields "Len:", "Check:" and "Weight:" are not used by all software but are a compulsory part of the MSF format. If the software you use ignores this information, any number can be inserted.

After the alignment description part there is an essential double forward slash, "//", which terminates the header. The rest of the file is interpreted as the alignment. Any line that does not start with a sequence identifier (as given in the header!) is ignored. If a line starts with a correct identifier, say Id_seq_n, everything following the first word of that line is appended to the sequence Id_seq_n.

Pile Up

 MSF: 46  Type: P  Check: 5859  ..

 Name: Sequence_1  Len: 46  Check: 750   Weight: 1.00
 Name: Sequence_2  Len: 46  Check: 3980  Weight: 1.00

//

Sequence_1  RQLVHVVKWA KALPGFRNLH VDDQMAVIQY SWMGLMVFAM GWRSFT
Sequence_2  .QLLSVVKWS KSLPGFRNLH IDDQITLIQY SWMSLMVFGL GWRSYK
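The reading rules described above (collect names from the header, stop the header at "//", then append everything after a known identifier to that sequence) can be sketched as a small parser. This is a minimal illustration with a hypothetical function name, not a complete MSF reader: it does not verify checksums, lengths or the "MSF: ... .." line.

```python
def parse_msf(text):
    """Return {sequence name: aligned sequence} from MSF-formatted text."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        words = line.split()
        if not in_alignment:
            if words and words[0].startswith("//"):
                in_alignment = True          # "//" terminates the header
            elif words and words[0] == "Name:" and len(words) > 1:
                sequences[words[1]] = ""     # register the identifier
            continue
        # Alignment part: lines not starting with a known identifier are ignored.
        if words and words[0] in sequences:
            sequences[words[0]] += "".join(words[1:])
    return sequences


SAMPLE_MSF = """Pile Up

 MSF: 46  Type: P  Check: 5859  ..

 Name: Sequence_1  Len: 46  Check: 750   Weight: 1.00
 Name: Sequence_2  Len: 46  Check: 3980  Weight: 1.00

//

Sequence_1  RQLVHVVKWA KALPGFRNLH VDDQMAVIQY SWMGLMVFAM GWRSFT
Sequence_2  .QLLSVVKWS KSLPGFRNLH IDDQITLIQY SWMSLMVFGL GWRSYK
"""

alignment = parse_msf(SAMPLE_MSF)
for name, seq in alignment.items():
    print(name, len(seq), seq)
```

Splitting each line on whitespace also removes the blanks that separate the blocks of ten residues, so both parsed sequences come out at the 46 positions stated in the header.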

The FASTA format

The FASTA format is defined by the '>' symbol at the beginning of the first line, followed immediately by an identifier for the sequence; the sequence itself starts on the next line. Each additional sequence likewise has the '>' symbol at the start of its identifier line:

>SEQUENCE ID
Amino acid sequence (with or without gaps)
>SEQUENCE ID2
Amino acid sequence (with or without gaps)
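Reading this layout takes only a few lines of code. The sketch below uses a hypothetical function name and made-up sequences, and assumes that a sequence may be wrapped over several lines and that gap characters are kept as they are.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith(">"):
            current_id = line[1:]         # everything after '>' is the identifier
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.strip()
    return records


SAMPLE_FASTA = """>seq1
MKT-AY
IAK
>seq2
MKTQAYIAK
"""

records = parse_fasta(SAMPLE_FASTA)
print(records)
```

The identifier line is stored exactly as written, without trimming trailing blanks.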