Algorithm
Finding functional specific sites is
a major problem in protein sequence analysis. Multi-RELIEF is a feature weighting
algorithm for multiclass classification problems and can be therefore used to
identify functional specific sites in a multiple sequence alignment.
Method
Based on RELIEF1,
a well known machine learning technique for feature weighting in two-class
problems, multi-RELIEF can be used also in multiclass problems such as multiple
sequence alignments. The principle behind the algorithm is to measure the
quality of an attribute, e.g. a site, in merging neighboured sequences within
the same class and separating neighboured sequences within different classes.
The neighbour of a sequence is defined as any other sequence with minimal
distance from it according to the 1-norm:
Example:
Sequence 1 |
A |
A |
B |
C |
Sequence 2 |
A |
B |
C |
D |
Sequence 3 |
A |
A |
B |
D |
Sequence 4 |
A |
B |
B |
B |
In the
example, the nearest neighbour of Sequence 1 is Sequence 3 with distance 1.
Sequences 2 and 4 are having distance 3 and 2 respectively. Using this
definition RELIEF performs the following steps:
For each iteration of the algorithm an instance R is
randomly selected and the two neighbours, hit(R) and miss(R), are calculated.
hit(R) is the neighbour of R having the same class as R. miss(R) is the
neighbour of R having the opposite class. The next step is to update the weight
vector initialized to zero according to this formula,
,
where diff(a,b) is defined as 0 if a equals b and 1
otherwise. Normalization the differences with m ensures that the resulting
weights range from -1 to 1 after m iterations of the algorithm. The multi-RELIEF
algorithm extends this approach to multiclass problems by sub-sampling from
pairs of classes. For every iteration of multi-RELIEF two classes are randomly
selected and the RELIEF algorithm performed on them. The final weights for each
attribute are calculated the following way:
If there had been positive weight assignments for attribute
a:
compute average of all positive weight assignments
If there had been negative
weight assignments for attribute a (no single positive weight assignment):
compute average of all negative weight assignments
If there had been just zero weight assignments set the weight
of attribute a to zero:
Boosting multi-RELIEF's estimates
with 3d contacts
In order to
possibly improve the results of multi-RELIEF, a heuristic to boost the weights
of a site using information about its neighbours has been developed. The idea is
fairly simple but nevertheless effective:
Every sited gets updated by the
weights of its neighbouring sites. Thus, sites with high ranked neighbours are
boosted.
Performance of
multi-RELIEF
The mulit-RELIEF
algorithm shows good overal performance in comparison with other recent methods
for selecting functional specific residues.
|