Multiple Sequence Alignment (MSA)
for Linguistics


What is Multiple Sequence Alignment (MSA)?

MSA is a data mining technique for discovering patterns in a set of comparable sequences.


What kind of information is it sensible to apply MSA to?

Any kind of information can undergo MSA, as long as it can be modelled as a sequence of symbols of a finite alphabet. The best-known example of such modelling are DNA sequences, whose own physical constitution can be immediately translated to a sequence of letters. Applying MSA techniques to these sequences has resulted in the complete description of the human genome. However, MSA is not limited to DNA sequences. Other sequences that can be successfully modelled are: proteins, timelines, many ki nds of linguistic sequences.

Since the purpose of aligning sequences is to discover patterns, it only makes sense to align those kinds of information that can be partitioned in different, comparable sequences, and where recurrent patterns can be found.


What is the outcome of MSA?

In general terms, an MSA process results in a set of aligned sequences, usally with a calculation of the relative similarity among them, and a model of the alignment, usually with some score of its reliability. This model conveys the recu rrencies found in the set of sequences, and can be expressed in many forms: as a sequence profile that synthesizes the major commonalities between all sequences of the set, as a lattice that accounts for the possible compositions of sequence s, etc.


How does it work?

MSA is usually implemented as an extension of pairwise alignment. The input to pairwise alignment are two sequences and a similarity criterion or scoring function that describes the similarity between the different symbols that constitute the se quences. The alignment algorithm determines the highest-scoring way to perform insertions, deletions and changes to transform one of the sequences into the other. This results in a new sequence containing the initial pair.

Scorings are determined with respect to the similarity criterion, so that more similar sequences are assigned a higher score. More similar sequences are those that need less insertions, deletions or changes to become equal, that is to say, those with a more comparable number of symbols, and whose symbols are more similar.

For MSA, the input is a set of sequences and also a similarity criterion. The most similar pair of sequences according to the given similarity criterion is found in the starting set, and they are aligned. The aligned sequences are removed from the set, and are replaced by the sequence resulting from the alignment. The process is repeated iteratively until a single sequence is found that contains all the sequences in the original set.


What do I need to carry out MSA?

To carry out MSA, the following is needed:

And, of course, a program that can perform MSA! Usually, such programs have been oriented to aligning DNA sequences, so they allow very little parametrisation: the input alphabet is limited to the four symbols that form DNA sequences (A,C,G,T), and the similarity criterion usually admits little changes. However, for an adequate modelling of other kinds of sequences, a more flexible tool is needed.

Alphamalig is a MSA tool that supports a configurable alphabet and allows determining a similarity criterion. It is accessible via web (http://www.lsi.upc.es/~gralggen /recerca/alialfb/alphamalig.html) and provides different possibilites for the visualization of the results. In addition, detailed instructions on usage are available, with various examples on the effects of similarity criteria.


An illustrative example


Some further work on MSA applied to linguistics

The best known application of MSA is to discover recurrent patterns in DNA. However, the parallelism of biological sequences to other kinds of sequences has been long attested, for example, in the forums about String Processing (SPIRE). More concretely about linguistics, there has recently been some work on MSA applied to the discovery of patterns at sentence level.

Geert M. Jan Kruijff aims to discover linearization rules by MSA of annotated sentences. He exploits the information available in the treebanks for Czech, Dutch, German and English. His target is to obtain a model of how words are realized in su rface structures in each of these languages, and to make comparisons between each other.

His invited talk at FG'02 [3] provides a very clear, interesting view about preliminary experiments on applying MSA techniques to annotated sentence structures from four different tree-banks. Kruijff adresses fundamental questions ab out applying MSA to linguistics, like comparability or significance.

Regina Barzilay and Lilian Lee apply MSA to acquire knowledge for paraphrasing, a subfield of Natural Language Generation recently addressed by statistical techniques. In a first experiment [1], they aligned parallel corpora with semantic annotation to acquire phrase-level semantic equivalents. A second experiment [2] was aimed to obtain sentence level paraphrases by MSA of comparable corpora (news agency stories about the same event) with no annotation at all. This last experiment clearly shows the portability and the potential of this technique for data mining of linguistic sequences.



References


[1] Regina Barzilay, Lillian Lee, "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment", in Proceedings of NAACL-HLT, 2003.
(available from personal web page, http://www.cs.cornell.edu/~regina/)

[2] Regina Barzilay, Lilian Lee, "Bootstrapping Lexical Choice via Multiple-Sequence Alignment" in EMNLP'02
(available from personal web page,
http://www.cs.cornell.edu/~regina/)

[3] Geert-Jan M. Kruijff, "Learning linearization rules from treebanks". Invited talk at the Formal Grammar'02/COLOGNET-ELSNET Symposium "Combining logical and data-oriented approaches in NLP"
(available from personal web page,
http://www.coli.uni-sb.de/~gj/)