Mathematical Support for Molecular Biology: Papers Related to the Special Year in Mathematical Support for Molecular Biology 1994-1998 [47] 0821808265, 9780821808269



English Pages 312 Year 1999


Table of contents :
An introduction to molecular biology for mathematicians and computer programmers by W. M. Fitch
Sequence alignment and phylogeny construction by M. Vingron
A new look at tree models for multiple sequence alignment by D. Durand
Sequence alignment in molecular biology by A. Apostolico and R. Giancarlo
Formal language theory and biological macromolecules by D. B. Searls
Global optimization approaches in protein folding and peptide docking by C. A. Floudas, J. L. Klepeis, and P. M. Pardalos
The topologically driven strand separation transition in DNA---methods of analysis and biological significance by C. J. Benham
Parallel strategies for DNA manipulation and analysis by C. L. Smith, T. Sano, N. E. Broude, and C. R. Cantor
A column-generation based branch-and-bound algorithm for sorting by reversals by A. Caprara, G. Lancia, and S.-K. Ng
Visualizing measures of genetic distance by E. M. Jordan
Fragment assembly system for DNA sequencing projects by L. Milanesi, M. Marsilli, G. Mauri, C. Rolfi, and L. Uboldi
Performance of the CAP2 sequence assembly program by X. Huang
A simple toolkit for DNA fragment assembly by J. Meidanis

DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Volume 47

Mathematical Support for Molecular Biology Papers Related to the Special Year in Mathematical Support for Molecular Biology 1994-1998

Martin Farach-Colton Fred S. Roberts Martin Vingron Michael Waterman Editors

American Mathematical Society

Digitized by the Internet Archive in 2019 with funding from Kahle/Austin Foundation

https://archive.org/details/mathematicalsupp0047unse

Selected Titles in This Series

47

Martin Farach-Colton, Fred S. Roberts, Martin Vingron, and Michael Waterman, Editors, Mathematical Support for Molecular Biology

46

Peng-Jun Wan, Ding-Zhu Du, and Panos M. Pardalos, Editors, Multichannel Optical Networks: Theory and Practice

45

Marios Mavronicolas, Michael Merritt, and Nir Shavit, Editors, Networks in Distributed Computing

44

Laura F. Landweber and Eric B. Baum, Editors, DNA Based Computers II

43

Panos Pardalos, Sanguthevar Rajasekaran, and Jose Rolim, Editors, Randomization Methods in Algorithm Design

42

Ding-Zhu Du and Frank K. Hwang, Editors, Advances in Switching Networks

41

David Aldous and James Propp, Editors, Microsurveys in Discrete Probability

40

Panos M. Pardalos and Dingzhu Du, Editors, Network Design: Connectivity and Facilities Location

39

Paul W. Beame and Samuel R Buss, Editors, Proof Complexity and Feasible Arithmetics

38

Rebecca N. Wright and Peter G. Neumann, Editors, Network Threats

37

Boris Mirkin, F. R. McMorris, Fred S. Roberts, and Andrey Rzhetsky, Editors, Mathematical Hierarchies and Biology

36

Joseph G. Rosenstein, Deborah S. Franzblau, and Fred S. Roberts, Editors, Discrete Mathematics in the Schools

35

Dingzhu Du, Jun Gu, and Panos M. Pardalos, Editors, Satisfiability Problem: Theory and Applications

34

Nathaniel Dean, Editor, African Americans in Mathematics

33

Ravi B. Boppana and James F. Lynch, Editors, Logic and Random Structures

32

Jean-Charles Gregoire, Gerard J. Holzmann, and Doron A. Peled, Editors, The Spin Verification System

31

Neil Immerman and Phokion G. Kolaitis, Editors, Descriptive Complexity and Finite Models

30

Sandeep N. Bhatt, Editor, Parallel Algorithms: Third DIMACS Implementation Challenge

29

Doron A. Peled, Vaughan R. Pratt, and Gerard J. Holzmann, Editors, Partial Order Methods in Verification

28

Larry Finkelstein and William M. Kantor, Editors, Groups and Computation II

27

Richard J. Lipton and Eric B. Baum, Editors, DNA Based Computers

26

David S. Johnson and Michael A. Trick, Editors, Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge

25

Gilbert Baumslag, David Epstein, Robert Gilman, Hamish Short, and Charles Sims, Editors, Geometric and Computational Perspectives on Infinite Groups

24

Louis J. Billera, Curtis Greene, Rodica Simion, and Richard P. Stanley, Editors, Formal Power Series and Algebraic Combinatorics/Series Formelles et Combinatoire Algebrique, 1994

23

Panos M. Pardalos, David I. Shalloway, and Guoliang Xue, Editors, Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding

22

Panos M. Pardalos, Mauricio G. C. Resende, and K. G. Ramakrishnan, Editors, Parallel Processing of Discrete Optimization Problems

21

D. Frank Hsu, Arnold L. Rosenberg, and Dominique Sotteau, Editors, Interconnection Networks and Mapping and Scheduling Parallel Computations

20

William Cook, Laszlo Lovasz, and Paul Seymour, Editors, Combinatorial Optimization

19

Ingemar J. Cox, Pierre Hansen, and Bela Julesz, Editors, Partitioning Data Sets

(See the AMS catalog for earlier titles)


NSF Science and Technology Center in Discrete Mathematics and Theoretical Computer Science A consortium of Rutgers University, Princeton University, AT&T Labs, Bell Labs, Bellcore, and NEC Research Institute


Theorem 2.4. For any $r > 1$, define $e(r)$ to be the expected number of lifted alignments that must be chosen at random before the smallest cost among them is within a factor of $2 + 1/(r-1)$ of the cost of the optimal phylogenetic alignment. Then $e(r) \le r$.

For example, $e(r)$ is at most two for an error bound of 3, and $e(r)$ is at most ten for a bound of 2.1112. Note that $e(r)$ is independent of $n$ and of the lengths of the strings. Also note that the above theorem holds when restated to apply only to uniform lifted alignments.

Another way to state Theorem 2.4 is: Let $k(\epsilon)$, for any $\epsilon > 0$, be the expected number of lifted

DAN GUSFIELD AND LUSHENG WANG


alignments to draw at random to find one of cost at most $(2 + \epsilon)C(T^*)$. Then $k(\epsilon) \le 1 + 1/\epsilon$.

Proof. For $r = 2$, the theorem says that at least half of all the lifted alignments must have cost less than or equal to $3C(T^*)$. This follows immediately from the fact that the average cost is at most $2C(T^*)$ and the minimum cost is $C(T^*)$. Generalizing, at least $1/r$ of all the lifted alignments must have cost less than or equal to $(2r-1)C(T^*)/(r-1)$, which again follows from the fact that the minimum possible cost is $C(T^*)$ and the mean is at most $2C(T^*)$.

Stating this result in terms of probabilities rather than expectations, we have the following.

Theorem 2.5. Picking $p$ lifted alignments at random, the minimum cost phylogenetic alignment of those $p$ alignments will have cost within a factor of $2 + 1/(r-1)$ of the optimal phylogenetic alignment, with probability at least $1 - [(r-1)/r]^p$.

Theorem 2.4 and Theorem 2.5 are analogous (and proven in related ways) to results from [1] about random multiple alignments under a different objective function (the sum-of-pairs objective function). It is worth noting that the analysis presented above is very loose, and therefore the results are overly pessimistic.
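The guarantees of Theorems 2.4 and 2.5 are easy to tabulate. The following is a minimal Python sketch, not code from the paper; the function names are hypothetical and chosen only for illustration. It evaluates the error factor $2 + 1/(r-1)$, the probability bound $1 - [(r-1)/r]^p$, and the smallest number of random draws whose bound reaches a desired confidence.

```python
def error_factor(r: float) -> float:
    """Approximation factor 2 + 1/(r - 1) from Theorem 2.4 (requires r > 1)."""
    return 2 + 1 / (r - 1)


def success_probability(r: float, p: int) -> float:
    """Theorem 2.5 bound: chance that the best of p random lifted
    alignments is within error_factor(r) of the optimal cost."""
    return 1 - ((r - 1) / r) ** p


def draws_for_confidence(r: float, target: float) -> int:
    """Smallest p whose Theorem 2.5 bound reaches `target` (0 < target < 1)."""
    p = 1
    while success_probability(r, p) < target:
        p += 1
    return p


if __name__ == "__main__":
    for r in (2, 10):
        print(r, round(error_factor(r), 4), draws_for_confidence(r, 0.95))
```

Because the bound depends only on $r$ and $p$, the number of draws is independent of $n$ and of the string lengths, as noted for Theorem 2.4.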

As an example of this looseness, in Theorem 2.4 the case of $r = 2$ was proven by observing that the median can be at most $3C(T^*)$. The conclusion was that at least half of all random lifted alignments must have an error bound at most three times the optimal. But if the median were actually $3C(T^*)$, then exactly half of the random lifted alignments must have cost $C(T^*)$, i.e., they must be optimal phylogenetic alignments.

3. Using uniform lifted alignments to compute lower bounds

The optimal (uniform or not) lifted alignment is guaranteed to have cost less than twice that of the optimal phylogenetic alignment on any problem instance, but one expects that these, and other, methods will find closer to optimal solutions for most problem instances. Therefore it is desirable to have efficient methods to compute lower bounds on the cost of the optimal phylogenetic alignment, given any problem instance. These lower bounds can be applied to gauge the accuracy not only of the result computed by the approximation method, but of the result computed by any method. Often one can modify a bounded-error approximation method, or exploit the ideas behind its error analysis, to obtain an efficient method to compute non-trivial lower bounds given a problem instance. When combined with better (but unanalyzed) heuristic methods, this can be a valuable and practical use of bounded-error methodology.

In this section we show how to use ideas from lifted and uniform lifted alignments to compute a non-obvious lower bound on the cost of the optimal phylogenetic alignment. For a fixed layout of $T$, let $A_u$ denote the average cost of a uniform lifted alignment for that layout. Since $A_u < 2C(T^*)$ (shown in [8]), $A_u/2$

NEW USES FOR UNIFORM LIFTED ALIGNMENTS


is a lower bound on $C(T^*)$. Moreover, it is a particularly appealing lower bound because it holds for every layout, and there are $2^{n-1}$ distinct layouts. Therefore, an attractive strategy is to randomly pick several layouts of $T$ and compute $A_u$ for each one. The maximum $A_u$ obtained this way, divided by two, is then a lower bound for $C(T^*)$. We will later improve this approach, to obtain even higher lower bounds, but first we show that $A_u$ can be computed in $O(n^2)$ time for any layout.
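The sampling strategy just described can be sketched in Python as follows. This is an illustration only, under stated assumptions: a binary tree is encoded as nested 2-tuples with non-tuple leaf labels (a hypothetical encoding), a layout is obtained by independently swapping the children of each internal node, and `average_cost` is a caller-supplied function that returns $A_u$ for a given layout. None of these names come from the paper.

```python
import random


def random_layout(tree, rng):
    """Return a copy of `tree` in which the two children of every internal
    node are swapped with probability 1/2. Leaves are non-tuple labels."""
    if not isinstance(tree, tuple):
        return tree
    left, right = (random_layout(child, rng) for child in tree)
    return (right, left) if rng.random() < 0.5 else (left, right)


def sampled_lower_bound(tree, average_cost, trials=10, seed=0):
    """Largest average_cost(layout) / 2 over the given layout and
    `trials` randomly sampled layouts."""
    rng = random.Random(seed)
    layouts = [tree] + [random_layout(tree, rng) for _ in range(trials)]
    return max(average_cost(layout) for layout in layouts) / 2
```

Since every layout yields a valid bound, taking the maximum over more samples can only help; the seed is fixed here purely to make runs repeatable.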

3.1. Computing the cost of the average uniform lifted alignment.

We continue to assume that $T$ is binary. We also assume that the distance between each pair of leaf sequences is already known. Recall that there are $2^d$ uniform lifted alignments for a fixed layout of $T$. When $T$ is a full binary tree, then $2^d = n$, the number of leaves, and we can trivially generate the $n$ uniform lifted alignments and compute the cost for each one. This direct approach takes $O(n^2)$ time. However, when $T$ is very unbalanced, $d$ can be as large as $n$, so a full enumeration would take exponential time as a function of $n$. Nonetheless, it is possible to compute the average of the $2^d$ uniform lifted alignments in $O(n^2)$ time.

For a fixed layout of $T$, we define an ordered pair of leaf sequences $(S_i, S_j)$ to be legal for an edge $(u, v)$ (where $u$ is the parent of $v$) if there is a uniform lifted alignment in which $S_i$ is assigned to $u$ and $S_j$ is assigned to $v$. Below we will see how to find all legal pairs, and their associated edges, in $O(n^2)$ time. In fact, when a pair $(S_i, S_j)$ is found to be legal for an edge $(u, v)$, the algorithm will determine the exact number of uniform lifted alignments of the fixed layout in which $S_i$ is assigned to $u$ and $S_j$ is assigned to $v$.

It is simple to test if an ordered pair $(S_i, S_j)$ is legal for some edge. Consider the two paths up to the root node from the leaves labeled $S_i$ and $S_j$. Those two paths join at the least common ancestor of $S_i$ and $S_j$, say at level $l$. Assume, w.l.o.g., that $S_i$ labels a leaf at a level $k$ at or above the leaf labeled by $S_j$. Then $(S_i, S_j)$ is legal for some edge if and only if the two paths are “parallel” from level $k$ to level $l$. That is, at each level between $k$ and $l - 1$, the nodes on the two paths must both be the left children of their respective parents, or they must both be the right children of their parents. If the paths have the required property, then the ordered pair $(S_i, S_j)$ is legal for one edge out of their least common ancestor (denoted $u$). Moreover, the ordered pair $(S_j, S_i)$ is legal for the other edge out of $u$.
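The parallel-path test can be sketched in a few lines of Python. The encoding is an assumption made for illustration: each leaf is identified by its root-to-leaf path written as a string of `'L'`/`'R'` moves, and the function name is hypothetical. The two children of the least common ancestor necessarily lie on different sides, so the comparison starts one step below them and runs down to the level of the higher leaf.

```python
def is_legal_pair(path_i: str, path_j: str) -> bool:
    """Return True if the two leaves form a legal pair for some edge.

    Each argument is the root-to-leaf path of a leaf, written as a string
    of 'L'/'R' moves (an encoding assumed here for illustration).
    """
    if path_i == path_j:
        return False  # a leaf does not form a pair with itself
    shortest = min(len(path_i), len(path_j))
    # Skip the shared prefix; it ends at the least common ancestor.
    a = 0
    while a < shortest and path_i[a] == path_j[a]:
        a += 1
    if a == shortest:
        raise ValueError("one path is a prefix of the other; not two leaves")
    below_i, below_j = path_i[a:], path_j[a:]
    # The first moves differ by definition of the least common ancestor.
    # After that, the moves must agree down to the level of the higher leaf.
    shared = min(len(below_i), len(below_j))
    return below_i[1:shared] == below_j[1:shared]
```

Each call takes time proportional to the path length, which matches the $O(d)$ cost per pair of the direct test.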

Any pair of leaves can be tested in $O(d)$ time, yielding an $O(dn^2)$ time method. The time can be reduced to $O(n^2)$ by reversing the direction of the walks and organizing them with depth-first traversals. In detail, to find all legal pairs, repeat the following algorithm for each edge $(u, v)$ of $T$, where, w.l.o.g., $v$ is the right child of $u$, i.e., $v = r(u)$. Execute a parallel depth-first traversal from $v$ and $l(u)$ until one of the searches reaches a leaf (see Figure 5). In a parallel depth-first traversal, the two traversals alternate single edge moves, and the first one moves from any level to its right (left) child if and only if the second one next moves from the same level to its right (left) child. When one of the traversals (the first, say) reaches a leaf labeled, say, with $S_i$ and found on level $k$, the second traversal continues below level $k$ in a normal depth-first fashion, until it returns to level $k$. At that point the two traversals begin alternating again in parallel fashion. Let $S_j$ be a leaf sequence that the second traversal encounters before the two traversals return to parallel mode. That is, $S_j$ is a leaf sequence encountered before the second traversal backs up from level $k$. Then $(S_i, S_j)$ is a legal pair for edge $(u, v)$ because the path from $l(u)$ to $S_i$ parallels the part of the path from $v$ to $S_j$ down to level $k$.

Let $N(i,j)$ denote the number of uniform lifted alignments of the layout where $S_i$ is assigned to $u$ and $S_j$ is assigned to $v$. Then the ordered pair $(S_i, S_j)$ contributes exactly $N(i,j)D(S_i, S_j)$ to the sum of the costs of all the uniform lifted alignments of that layout. (The ordered pair $(S_j, S_i)$ will also contribute the same amount to the sum.) But what is $N(i,j)$? Suppose $u$ is on level $l$ and $S_j$ is on level $l' < k$. Then $N(i,j) = 2^{d-l+l'}$. The reason is that for every level from $l - 1$ down to $l'$, the choice of which child (left or right) assigns its sequence to its parent is fixed by the requirement that $S_i$ be successively assigned (lifted) up to node $u$ and $S_j$ be lifted up to node $v$. However, the choices at the other levels can be made in every possible way, and are independent at each level. In general, if $(S_i, S_j)$ is legal for $(u, v)$, then the choices are fixed for all levels between node $v$ and the lower of the two leaves labeled by $S_i$ and $S_j$.

Let $L(T)$ be the set of legal (ordered) pairs for the fixed layout of $T$. In summary, the total cost of all the uniform lifted alignments of $T$ (for the fixed layout) is $\sum_{(S_i, S_j) \in L(T)} N(i,j) D(S_i, S_j)$. The average cost is that sum divided by $2^d$. Assuming the distance between each pair of leaf sequences is known, the average cost of all uniform lifted alignments can be found in $O(n^2)$ time.

We should note that the average cost of all lifted alignments can also be computed in $O(n^2)$ time, and half that average is again a lower bound on $C(T^*)$. But it is not desirable to compute only that single lower bound, since one gets a lower bound from the uniform lifted alignments of each fixed layout. The average of all those $2^{n-1}$ lower bounds is the lower bound obtained from the set of all lifted alignments, so some of those lower bounds will be better and some worse than the single one obtained from all lifted alignments. Therefore, one should compute the average cost of the uniform lifted alignments for several randomly selected layouts (and also the average over all lifted alignments), and then take the maximum of those bounds. That approach exploits any variance there may be between the average costs for different fixed layouts.

We should also note that half the cost of the optimal lifted alignment is a lower bound on $C(T^*)$. Similarly, for a fixed layout, half the cost of the optimal uniform lifted alignment is a lower bound on $C(T^*)$. But both of these bounds are inferior to the bounds derived from the respective averages.
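The computation of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is encoded as nested 2-tuples with integer leaf labels, `D` is a symmetric distance matrix indexed by those labels, and all function names are hypothetical. The recursion mirrors the parallel traversal: two subtrees are descended in lockstep on the same side, and once one side reaches a leaf, every leaf under the other side forms a legal pair with it. Writing $r$ for the number of edges from the least common ancestor down to the lower leaf (that is, $l - l'$ in the notation above), each ordered legal pair carries weight $N(i,j)/2^d = 2^{-r}$ in the average.

```python
def leaves_with_depth(tree, depth=0):
    """Yield (leaf, depth below the root of `tree`) for every leaf."""
    if not isinstance(tree, tuple):
        yield tree, depth
    else:
        for child in tree:
            yield from leaves_with_depth(child, depth + 1)


def legal_pairs(a, b, r=1):
    """Yield (i, j, r) for leaves i under a and j under b on parallel paths.

    a and b start as the two children of a common parent u; r counts the
    edges from u down to the lower of the two leaves.
    """
    if not isinstance(a, tuple):          # first walk hit a leaf
        for j, t in leaves_with_depth(b):
            yield a, j, r + t
    elif not isinstance(b, tuple):        # second walk hit a leaf
        for i, t in leaves_with_depth(a):
            yield i, b, r + t
    else:                                 # keep moving in parallel
        yield from legal_pairs(a[0], b[0], r + 1)
        yield from legal_pairs(a[1], b[1], r + 1)


def average_uniform_cost(tree, D):
    """Average cost of the uniform lifted alignments of this layout."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    # Factor 2: (S_i, S_j) and (S_j, S_i) contribute the same amount.
    here = sum(2 * D[i][j] / 2 ** r for i, j, r in legal_pairs(left, right))
    return here + average_uniform_cost(left, D) + average_uniform_cost(right, D)


if __name__ == "__main__":
    D = [[abs(i - j) for j in range(4)] for i in range(4)]
    print(average_uniform_cost(((0, 1), (2, 3)), D) / 2)  # a lower bound on C(T*)
```

The sketch favors clarity over the careful bookkeeping needed to guarantee the $O(n^2)$ bound; on small trees its output can be checked against a direct enumeration of the $2^d$ uniform lifted alignments.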

Figure 5. Suppose the parallel depth-first traversal initially walks from $l(u)$ to leaf $a$, and from $v$ to node $A$. Then a standard depth-first traversal from $A$ will find that the sequence at leaf $a$ forms a legal pair for $(u, v)$ with each of the sequences at leaves $b$, $c$, $d$, $e$ and $f$. Then the traversal backs up from $a$ and $A$, and next determines that $(g, h)$ and $(g, i)$ form legal pairs for edge $(u, v)$.

3.2. Improved lower bounds.

The lower bounds discussed above are based on the fact that the average lifted (or uniform lifted) alignment has cost less than $2C(T^*)$. However, as shown in the proof of Theorem 1.1, the cost of Tu is at most $2C(T^*)$ minus twice the cost of some full path to the root of $T^*$. In the case of the average of all lifted (or uniform lifted) alignments, a similar savings occurs due to legal pairs whose least common ancestor is the root of the tree. This leads to the following improvements in the lower bounds.

Theorem 3.1. Given a rooted tree $T$, let $L_a$ be the set of (ordered) legal pairs whose lca is not the root of $T$, and let $L_b$ be the set of legal pairs whose lca is the root. Let $r(i,j)$ be the number of edges on the path

between the lca of leaf $i$ and leaf $j$, and the lowest of the leaves $i$ or $j$. Then

$C(T^*) \ge \Big( \sum_{(i,j) \in L_a} 2^{d-r(i,j)} D(i,j) + \sum_{(i,j) \in L_b} 2^{d - [\ldots]} D(i,j) \Big) / 2^{d+1}$.


We can get a similar improvement based on all the lifted alignments. Let $m$ be the number of non-leaf nodes in $T$, and for any pair of leaves $(i,j)$, let $m(i,j)$ be the number of non-leaf nodes on the path between $i$ and $j$.

Theorem 3.2. Given a rooted tree $T$, let $A$ be the set of ordered leaf pairs whose least common ancestor (lca) is not the root of $T$, and let $B$ be the set of ordered leaf pairs whose lca is the root. Then $C(T^*) \ge$ [...]

Clearly, both of these lower bounds can again be computed in $O(n^2)$ time, assuming that the distances are known between each pair of leaf sequences.

3.3. Experimental Work.

We experimented with the above ideas on a well-known, sixteen-node tree (which we will call Sankoff’s tree), which Sankoff, Cedergren and Lapalme [6] used to study phylogenetic alignment. In that paper, they produced a phylogenetic alignment (which we will call Sankoff’s alignment) that is not known to be optimal for the data, but is believed to be close to optimal. We wanted to use the above ideas to establish lower bounds for the problem instance and for Sankoff’s alignment. We also wanted to see how close the best uniform lifted alignment is to the best lifted alignment.

Sankoff’s tree $T$ is shown in Figure 6. Each leaf is assigned an RNA sequence of length around 120 nucleotides. We use the same scoring scheme as in [6], i.e., $D(A,C) = 1.75$, $D(A,G) = 1.0$, $D(A,U) = 1.75$, $D(C,G) = 1.75$, $D(C,U) = 1.0$, $D(G,U) = 1.75$, and 2.25 for insertion/deletion. This scoring scheme clearly satisfies the triangle inequality.

Before describing the experimental results, we need to mention one additional modification to the lower bound methods. Sankoff’s tree $T$ used in [6] is unrooted, so in order to apply Theorems 3.1 and 3.2 we select an edge of $T$ and create a new node on it, making that new node the root of the tree. It is easy to see that this does not change the cost of the optimal alignment, but the optimal lifted and optimal uniform lifted alignments do depend on the root choice. Further, since Theorems 3.1 and 3.2 treat leaf pairs whose lca is the root differently than leaf pairs whose lca is not the root, the positioning of the root might change the lower bounds based on those theorems. Every choice leads to a correct lower bound, but some choices might give higher bounds than others. In fact, this happens when using Theorem 3.1, but not when using Theorem 3.2, as we will show next.

Theorem 3.3. Let $T$ be an unrooted tree. No matter what edge is chosen for the new root, the lower bound on $C(T^*)$ based on Theorem 3.2 is [...], where $m(i,j)$ is the number of non-leaf nodes on the path from leaf $i$ to leaf $j$ in $T$ before the addition of the root node.

Proof. The lower bound is obtained by applying Theorem 3.2 to $T$ after the addition of the root node. Let $m' = m + 1$, and let $m'(i,j)$ be the number of non-leaf nodes on the path from $i$ to $j$ after the root is added. Then $m'(i,j) = m(i,j)$ if the lca of $i$ and $j$ is not the root, and $m'(i,j) = m(i,j) + 1$ if the lca is the root. Applying Theorem 3.2 with $m'$ and $m'(i,j)$ and then simplifying the summation completes the proof.

Hence, in the case of an unrooted tree, computing the lower bound based on Theorem 3.2 is particularly simple, and can easily be done by hand.

3.3.1. Experimental Results.

The results from these limited experiments are that the optimal uniform lifted alignments (taken over all choices for root, but only a single layout of $T$ for each choice) have small variance, and have costs fairly close to the optimal lifted alignment. The average costs have somewhat greater variance. The highest lower bound computed in this way establishes that Sankoff’s alignment has cost at most 21.68% greater than the optimal phylogenetic alignment.

In more detail, the cost of Sankoff’s alignment is 295.5. The lowest cost of an optimal lifted alignment (over varying choices for the root) was 364.0, while the highest cost of an optimal lifted alignment was 393.5. The lowest cost of an optimal uniform lifted alignment (over varying choices for the root, but a fixed layout for each choice) was 371.5, while the highest cost of an optimal uniform lifted alignment was 396.5. The highest average cost of a lifted alignment was 460.3, and the lowest average cost was 396.4. The highest average cost of a uniform lifted alignment was 461.6, and the lowest average cost was 398.2.

The highest lower bound based on Theorem 3.1 (over varying choices for the root, but a fixed layout for each choice) was 242.191, which establishes that Sankoff’s alignment has cost at most 22.02% greater than the optimal. The highest lower bound based on Theorem 3.3 was 242.86, which establishes a deviation from optimal of no more than 21.68%. It is interesting that the simpler method, based on Theorem 3.3, gave a higher lower bound than the one based on Theorem 3.1.

In general, in this small experiment the results obtained from using only uniform lifted alignments were not much different than the results based on using all lifted alignments. We expect that the distinction will be greater for larger trees (or trees of greater degree), and that bounds based on Theorem 3.1 will generally be higher than those based on Theorem 3.3.
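The percentages quoted above come from comparing the cost of a known alignment with a lower bound on the optimum. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the two numbers are the ones reported in the text for Sankoff’s alignment and the Theorem 3.3 bound.

```python
def max_excess_percent(cost: float, lower_bound: float) -> float:
    """Largest possible excess of `cost` over the optimum, in percent,
    given that the optimum is at least `lower_bound`."""
    return (cost / lower_bound - 1) * 100


SANKOFF_COST = 295.5       # cost of Sankoff's alignment, from the text
THEOREM_3_3_BOUND = 242.86  # highest Theorem 3.3 lower bound, from the text

if __name__ == "__main__":
    print(f"{max_excess_percent(SANKOFF_COST, THEOREM_3_3_BOUND):.2f}%")
```

A higher lower bound shrinks the ratio, which is why the text reports the highest bound found over all root choices.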

3.4. Comparison to other lower bounds.

We examined three other lower bounds that have been suggested for the phylogenetic alignment problem. One is based on computing a minimum spanning tree, one is based on non-bipartite matching, and one is based on a linear programming relaxation of the phylogenetic alignment problem.

For the minimum spanning tree bound, compute the distance between each pair of leaf sequences; form a complete graph on $n$ nodes (one node $v_i$ for each leaf sequence $S_i$); set the cost of the edge between any pair of nodes $v_i$ and $v_j$ as $D(S_i, S_j)$; finally, compute a minimum spanning tree on this graph. Let $M_T$ denote the cost of the minimum spanning tree. Then $M_T/2$ is a lower bound on $C(T^*)$. To see that, consider the optimal phylogenetic alignment $T^*$ on $T$. A depth-first traversal of $T^*$ specifies a way to connect the leaves which has cost less than $2C(T^*)$. By definition of MST, this spanning connection of the leaves has cost no less than $M_T$, so $C(T^*) > M_T/2$.

However, the same argument holds for any phylogenetic alignment, optimal or not. Therefore if $T_L$ is any uniform lifted alignment, then $C(T_L) > M_T/2$, so the average cost of the uniform lifted alignments is always a better (higher) lower bound than $M_T/2$.
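The minimum spanning tree bound is straightforward to compute from the pairwise distances. The sketch below is an illustration, not code from the paper: it uses Prim's algorithm on a dense, symmetric distance matrix `D` (a list of lists indexed by leaf number), and both function names are hypothetical.

```python
def mst_cost(D):
    """Cost of a minimum spanning tree of the complete graph whose edge
    (i, j) has weight D[i][j]. Prim's algorithm, O(n^2) time."""
    n = len(D)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge into the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the cheapest node not yet in the tree.
        v = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[v] = True
        total += best[v]
        for w in range(n):
            if not in_tree[w] and D[v][w] < best[w]:
                best[w] = D[v][w]
    return total


def mst_lower_bound(D):
    """The bound M_T / 2 on the cost of the optimal phylogenetic alignment."""
    return mst_cost(D) / 2
```

For example, three sequences with pairwise distances 1, 2 and 3 have a spanning tree of cost 3, so the bound is 1.5.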

In the experiment we ran on Sankoff’s tree, the cost of the minimum spanning tree for the nine sequences is 366. Hence it gives a lower bound of 183 compared to the (higher) lower bound of 242.86 obtained from Theorem 3.3. The MST lower bound establishes only that Sankoff’s alignment has cost at most 61.48% greater than the optimal.

The lower bound based on non-bipartite matching is the following: Given a tree with $l$ leaves, form the complete graph on $l$ nodes and assign weight $D(i,j)$ to the edge between nodes $i$ and $j$. Then, the weight of the minimum weight complete matching in this graph is a lower bound on the cost of the optimal phylogenetic alignment. Applied to the data from Sankoff’s tree, the resulting minimum weight complete matching has weight 143.75, establishing only that Sankoff’s alignment has cost at most 105.6% greater than the optimal.

J. Kececioglu and R. Ravi suggested a lower bound based on the dual of the following linear program. For each edge $e$ in $T$, create a variable associated with $e$. Then for each leaf pair $(i,j)$, create the constraint that the variables associated with edges on the path from $i$ to $j$ must sum to at least $D(i,j)$. Subject to those constraints, the minimum sum of all the variables is a lower bound on $C(T^*)$. This follows from the assumed triangle inequality condition, and the fact that the edge distances from any phylogenetic alignment provide a feasible solution to this LP. We ran this LP on Sankoff’s tree, and obtained a value of 253.5. This establishes that Sankoff’s alignment deviates from the optimal by at most 16.57%.

We have also been able to prove that the lower bound using Theorem 3.1 is never greater than the LP bound. However, under certain conditions, the bounds are equal, and we don’t know how far apart the two bounds will typically be. We are currently studying this question.
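Written out, the linear program described above takes the following form. The symbols are notation assumed here for illustration and do not appear in the text: $x_e$ is the variable for edge $e$, $E(T)$ is the edge set of $T$, and $P(i,j)$ is the set of edges on the path between leaves $i$ and $j$.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{e \in E(T)} x_e \\
\text{subject to}\quad & \sum_{e \in P(i,j)} x_e \;\ge\; D(i,j)
                         \qquad \text{for every pair of leaves } (i,j).
\end{align*}
```

Setting each $x_e$ to the distance across edge $e$ in the optimal alignment $T^*$ satisfies every constraint by the triangle inequality and gives objective value $C(T^*)$, so the minimum of the program cannot exceed $C(T^*)$.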

3.5. Discussion.

The idea of using guaranteed error (upper bound) methodology to compute lower bounds on the cost of an optimal solution has been met with some serious disbelief since it was first proposed in an earlier draft of this paper. The main objection is the claim that bounded-error approximation methods rarely perform as badly as their guaranteed error bounds allow, hence dividing the cost of the solution by the guarantee will give a poor lower bound. Without arguing how closely bounded approximation methods generally come to optimal, our response is twofold.

First, bad compared to what else? Efficient, alternative lower bounds are not always available. In the case of phylogenetic alignment, the bound based on Theorem 3.3 is much easier to compute than the non-bipartite matching bound, and was dramatically better in the experiment we ran; it is simpler to compute than the MST bound, and was substantially better; it is much simpler to compute than the LP bound, and was not much smaller than the LP bound. Other methods have been suggested that will get better lower bounds than the LP bound, but they are substantially more expensive to run. In our experiment, we initially, and trivially, computed the bound based on Theorem 3.3 by hand. It was much harder to set up and input the LP, and then check its result.

Figure 6. The tree $T$ used in [6].

Second, in the case of the phylogenetic alignment, the best (highest) lower bound we obtained using lifted alignment methodology is not obtained by halving the cost of the best (lowest cost) lifted alignment we obtained. The optimal lifted alignment (with cost 364.0 in our experiment) has cost no greater (and generally less) than the optimal uniform lifted alignment (cost 371.5), which has cost no greater (and generally less) than the average uniform lifted alignment (cost 461.6), which has cost less than the actual number (485.72) we compute (using Theorem 3.3) and divide by two, in order to get the lower bound (242.86). Generally, we expect a sizable gap between the cost of the optimal lifted alignment and that number (364 vs. 485.72). In our experiment, Theorem 3.3 establishes that Sankoff’s alignment is at most 21.68% more than the optimal, but the lower bound obtained by halving the cost of the best lifted alignment establishes only that Sankoff’s alignment is at most 62.37% above the optimal.

Moreover, different lower bounds can be calculated for every layout of the tree, and there are an exponential number of layouts. Additional variation comes with the choice of root in cases when the tree is unrooted. The final lower bound is the highest one obtained by the various trials.
Hence even if the optimal lifted alignment (the bounded-error method with the lowest upper bound) has a cost close to $C(T^*)$ (and it certainly doesn’t in the case of Sankoff’s tree), it does not follow that the approach suggested in this paper must give bounds close to $C(T^*)/2$.

Many bounded-error approximation methods can be made to allow choices, and have the property that any of a large number of executions result in a solution within the proven error bound. Other methods (for minimization problems) have the property that the error bound on the solution is established by showing that the algorithm gives a solution whose value is less than some number (computable for any specific instance) which is within the error bound. Prior work on approximation methods has either ignored the first property, or used it only to direct the algorithm to find solutions with lower costs in practice. The second property has been of no interest in approximation methods, since it goes in the wrong direction. The main point in this paper is to suggest that these two properties can be exploited to produce informative lower bounds in practice, and to illustrate this idea in the case of the phylogenetic alignment problem. The ultimate utility of the ideas presented here is still an open question. But the results are promising enough (especially considering their computational efficiency compared to other suggested methods) that one cannot dismiss those ideas based only on the claim that guaranteed error bounds are usually excessively pessimistic.

As an example of additional applications of these ideas, we have begun examining another approximation method for phylogenetic alignment due to R. Ravi and J. Kececioglu [3]. They studied the phylogenetic alignment problem on trees where each internal node has exactly $k$ children. They gave a method to find a phylogenetic alignment whose deviation from the optimal is bounded by the multiplicative factor $(k+1)/(k-1)$.

For any

fixed fc, the method runs in polynomial time. Although they do not connect their method to lifted alignments, their algorithm essentially modifies lifted alignments and finds the lifted alignment whose modification gives the best phylogenetic alignment. In those terms, their paper establishes that over all lifted alignments, the average cost of these modified lifted alignments is at most (k + l)/(k — 1) times the cost of the optimal phylogenetic alignment. One can directly obtain a lower bound from that best modified lifted align¬ ment, but (as we did above) a higher lower bound results from using the average cost. It is not difficult to show how to compute that average cost as efficiently as computing the best modified lifted alignment. Moreover, es¬ sentially the same analysis given in [3] establishes that the average modified uniform lifted alignment has cost no more than (k + l)/(fc — 1) times the op¬ timal phylogenetic alignment. Even better, the time needed to compute the average (or best) cost of a modified uniform lifted alignment is dramatically below the time needed to compute the best modified lifted alignment. The change of focus from lifted alignments to uniform lifted alignments seems particularly promising in this problem because the uniformity requirement becomes much more constraining as k increases. For a full k-ary tree with l leaves, there are roughly lk lifted alignments, but only l uniform lifted alignments. Therefore, in comparison to the best modified lifted alignment, as k increases, the average modified uniform lifted alignment should be in¬ creasingly bad as a phylogenetic alignment. Hence, as k increases, the cost


of that average should be increasingly good for deriving a lower bound. The results of this, and related research, will be the subject of a future paper.

4. Acknowledgement

We would like to thank the anonymous referee for suggestions on improving the exposition, and on pushing us to improve the initial empirical results.

References

[1] D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. of Math. Biology, 55 (1993), 141-154.

[2] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.

[3] R. Ravi and J. D. Kececioglu, Approximation methods for sequence alignment under a fixed evolutionary tree, Proc. 6’th Symp. on Combinatorial Pattern Matching, Springer LNCS 937, (1995), pages 330-339.

[4] D. Sankoff, Minimal mutation trees of sequences, SIAM J. on Applied Math., 28 (1975), 35-42.

[5] D. Sankoff and R. Cedergren, Simultaneous comparisons of three or more sequences related by a tree, In D. Sankoff and J. Kruskal, editors, Time warps, string edits, and macromolecules: The theory and practice of sequence comparison, Addison Wesley, 1983, pages 253-264.

[6] D. Sankoff, R. J. Cedergren, and G. Lapalme, Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA, J. Mol. Evol., 7 (1976), 133-149.

[7] David Sankoff, Minimal mutation trees of sequences, SIAM J. on Applied Math., 28 (1975), 35-42.

[8] L. Wang and D. Gusfield, Improved approximation algorithms for tree alignment, Proc. 7’th Symp. on Combinatorial Pattern Matching, Springer LNCS 1075, 1996, pages 220-233.

[9] L. Wang and T. Jiang, On the complexity of multiple sequence alignment, J. of Comp. Biology, 1 (1994), 337-348.

[10] L. Wang, T. Jiang, and D. Gusfield, A more efficient approximation scheme for tree alignment, In Proc. of RECOMB 97: The first international conference on computational molecular biology, ACM Press, 1997, pages 310-319.

[11] L. Wang, T. Jiang, and E.L. Lawler, Approximation algorithms for tree alignment with a given phylogeny, Algorithmica, 16 (1996), 302-315.

Department of Computer Science, University of California, Davis, CA 95616
E-mail address: gusfield@cs.ucdavis.edu, lwang@cs.cityu.hk



DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 47, 1999

Sequence Alignment and Phylogeny Construction

Martin Vingron

Abstract. In this paper we review concepts and key ideas that link multiple sequence alignment and phylogeny reconstruction from molecular sequence data. The two main approaches are character based methods and hierarchical, profile-based methods. While methods of the latter class have found wide acceptance in biology, the first group has been studied more intensely by computer scientists. We attempt to show some common features between the two ways of thinking about phylogenetic trees and multiple alignment.

1991 Mathematics Subject Classification. Primary 92B10, 92D20; Secondary 92-02, 05C05.

The author gratefully acknowledges DIMACS for supporting an extended visit during the special year on mathematical support for molecular biology. Furthermore, enlightening discussions with Drs. John Kececioglu, Dan Gusfield and Martin Farach are gratefully acknowledged. This work was supported by Deutsche Forschungsgemeinschaft under grant Vi 160/1.

(c) 1999 American Mathematical Society

1. Introduction

Multiple alignment of biological sequences and the reconstruction of phylogenetic trees from sequences are two prominent problems in computational molecular biology. The task in multiple sequence alignment is to find out which positions in different though similar sequences correspond to each other. This is a generalization of pairwise sequence alignment, which was introduced to quantify similarity or distance between two sequences (for review see [40] or [41]). Pairwise alignments, however, are simple not only because they are computationally simpler but also because there are no sophisticated relationships between two sequences. A whole group of homologous sequences, though, is usually assumed to have evolved from a common ancestor and their relationship can thus be depicted in a tree. Finding this tree is the task in reconstructing a phylogenetic tree.

There is a vast literature on the reconstruction of phylogenetic trees from sequence data and there is quite an impressive list of papers on multiple sequence alignment. For reviews on trees see [32, 22, 41] and for reviews on multiple alignment see [3] and [41]. In the seventies several authors very clearly saw a connection between the two problems. However, mostly for reasons of computational feasibility, they are in most cases treated separately. This is manifested first of all in the way the biological problems are formalized.

A word on the use of the terms “problem” and “algorithm” is in order when speaking about algorithms in molecular biology. In a biological environment a “problem” may be a fairly vague question or task where for some cases a true result is known. To make the biological problem into a computer science or mathematics problem it is formalized. The formalization is a key point since it needs to capture the biological essence and transform it into something computationally manageable. In a nice case there will be more to say about the resulting algorithmic problem than that it is NP-hard. More misunderstandings between the cultures lurk behind the term “algorithm”. It may refer to a procedure that emulates what an expert would do by hand in analyzing a dataset. This, of course, has little to do with a CS algorithm. However, studying it may give a first clue towards a reasonable formalization. For example, speaking about the problem of phylogeny construction implicitly means the biological problem. The biologist knows that when he constructs a phylogenetic tree from sequences of man, monkey and mouse he wants the monkey to cluster with man instead of mouse. The formal problem would be, e.g., to find a tree which minimizes the number of changes in the sequences along the tree edges.

In the cases of multiple sequence alignment and phylogeny reconstruction the way these problems are formalized determines if they will be linked. There is a formalization for multiple sequence alignment which ignores evolutionary relationships between sequences altogether (the sum-of-pairs alignment [2]) and there are formulations of phylogeny reconstruction which ignore the alignment issue. This paper will deal with the formalizations that try to link the two aspects.

There are two reasons for attempting to connect the two. First, it is considered a fact that sequences have evolved from common ancestors and that this has influenced what we observe today. Thus, when we compare contemporary sequences it makes sense to put them into their historical context. The second reason is more pragmatic.
Most multiple sequence alignment programs that are being used rely on a phylogenetic tree in order to speed up the computation. To generate a good initial tree, however, one would need to know the multiple alignment ahead of time. In this situation one calculates the initial tree based only on pairwise distances. Pairwise alignments, however, tend to overestimate pairwise distances as compared to a consistent assignment of distances which could be obtained from a multiple alignment. Therefore the pairwise distances will lead to a not very reliable tree which is nevertheless used to aid in the multiple alignment. Finally, having the multiple alignment, it is in turn used to calculate a better tree. It has been shown though [24] that errors introduced initially propagate and influence this more refined tree. In fact, Hogeweg and Hespers [21] already suggested iterating the process by calculating a new tree based on the multiple alignment and continuing with this tree as basis for a new alignment.

One of the earliest formalizations of multiple sequence alignment essentially defines it as the problem of reconstructing ancient predecessors to contemporary sequences when the topology (i.e. the branching pattern) of a tree describing the evolution of the sequences is given [28]. The edit distances between the sequences at the nodes of this tree define the length of the edges. The problem is to choose the ancient sequences such that the overall length of the tree is minimized. This formulation is called “tree-alignment” [2]. It is one of two commonly used formulations of multiple sequence alignment. The other one is the so-called sum-of-pairs alignment, where a multiple alignment has to be chosen in such a way that the sum over all induced pairwise distances between sequences is minimized [2]. This formulation does not refer to evolution and does not require the reconstruction of ancient sequences. Both these formulations of multiple sequence alignment have been shown to be NP-hard [38].

Formalizations for the phylogeny problem can be roughly described as either character based, distance based or as maximum likelihood methods [32]. Parsimony, to be described below, is a character based method. Its underlying philosophy allows for a natural combination with alignment. Distance based methods approximate the pairwise distances between sequences with distances measured by the path metric in a tree. There are schemes conceivable to merge this approach with alignment, too. Although a maximum likelihood formulation for a combined tree/alignment problem seems trivial to formulate, it is hard to imagine a reasonable algorithm. Very few attempts in this or related directions have been made [35, 1, 25].
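As an illustration of the two alignment formulations discussed in this introduction, the following Python sketch evaluates both objective functions for given inputs; it does not solve either optimization problem. It assumes unit costs for mismatches and gaps, a gap character "-", and a tree given as a list of edges between named nodes. The function names (edit_distance, tree_length, sum_of_pairs) are hypothetical and chosen only for this example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance between two unaligned sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def tree_length(edges, seqs):
    """Tree-alignment objective: sum of edit distances over all tree edges.

    edges: list of (node, node) pairs; seqs: dict node -> sequence,
    including the (assumed) ancestral sequences at the inner nodes.
    """
    return sum(edit_distance(seqs[u], seqs[v]) for u, v in edges)


def induced_distance(row_a, row_b):
    """Distance induced on two rows of a multiple alignment.

    Columns where both rows carry a gap compare equal and cost nothing.
    """
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def sum_of_pairs(rows):
    """Sum-of-pairs objective: sum of all induced pairwise distances."""
    return sum(induced_distance(a, b) for a, b in combinations(rows, 2))


# A star tree with one hypothetical ancestor "x" and three leaves.
edges = [("x", "a"), ("x", "b"), ("x", "c")]
seqs = {"a": "ACGT", "b": "AGT", "c": "ACT", "x": "ACT"}
print(tree_length(edges, seqs))
print(sum_of_pairs(["ACGT", "A-GT", "AC-T"]))
```

Tree alignment asks for the ancestral sequences that minimize tree_length, whereas sum-of-pairs alignment asks for the rows that minimize sum_of_pairs; the sketch only scores one candidate for each.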

2. The Steiner tree formalism

Parsimony and tree alignment. Phylogeny construction is a prominent application of the notion of a minimal Steiner tree [22, 10]. One of the first formal versions of phylogeny construction interpreted the ancestral sequences as Steiner points in a 4-dimensional hypercube, where each dimension represents the occurrence of each of the four nucleotides A, C, G, and T. This was derived from a famous algorithm to assess the quality of a topology by calculating the number of changes that would at least be necessary to explain a given set of sequences under this topology. More formally, let a set of aligned sequences and a tree topology where the leaves are labeled with the sequences be given. For any assignment of sequences to the inner nodes the length of an edge is defined as the number of mismatches between the sequences at the nodes flanking the edge. A parsimonious assignment of sequences to the inner nodes is one that minimizes the sum of the lengths of the edges. Finding this assignment has become known as the parsimony problem. An algorithm for its solution that is linear in the number of species and in the length of the sequences has been given by Fitch [8] and its correctness proved by Hartigan [17].

In applying this formal framework to the construction of a phylogeny one has to find the tree topology which gives rise to the most parsimonious tree. The problem of finding a most parsimonious tree can be put in graph theoretic terms. The graph to be studied has all possible sequences of a given length as its nodes. Edges are introduced between nodes whenever the corresponding sequences differ by exactly one mismatch. The most parsimonious tree is the optimal Steiner tree linking the nodes corresponding to the given sequences. Based on this view of the problem, finding the most parsimonious tree was shown to be NP-hard [4]. Practically, though, branch-and-bound algorithms perform fairly well [19].
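The parsimony problem for a fixed topology can be sketched in a few lines of Python. The sketch below follows the usual bottom-up pass of Fitch's method, column by column, and returns only the minimal number of changes, not the ancestral sequences themselves. The tree encoding (a leaf is an aligned sequence, an inner node is a pair of subtrees) and the function names are assumptions made for this example, and a rooted binary tree is assumed.

```python
def fitch_column(node, col):
    """Bottom-up pass for one alignment column.

    Returns (set of candidate states at this node, number of changes below).
    A leaf is an aligned sequence (str); an inner node is a pair (left, right).
    """
    if isinstance(node, str):
        return {node[col]}, 0
    left_set, left_cost = fitch_column(node[0], col)
    right_set, right_cost = fitch_column(node[1], col)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed.
        return common, left_cost + right_cost
    # Otherwise one change is unavoidable on one of the two edges.
    return left_set | right_set, left_cost + right_cost + 1


def first_leaf(node):
    """Return some leaf sequence, used to read off the alignment length."""
    while not isinstance(node, str):
        node = node[0]
    return node


def parsimony_length(tree):
    """Minimal total number of mismatches along the edges of the tree."""
    n_cols = len(first_leaf(tree))
    return sum(fitch_column(tree, col)[1] for col in range(n_cols))


tree = (("ACGT", "ACGA"), ("TCGA", "TCGT"))
print(parsimony_length(tree))
```

Each column is processed independently and each node is visited once per column, which is where the linear running time in the number of species and the sequence length comes from.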
Solving the parsimony problem by the Fitch-Hartigan algorithm, i.e. given tree topology and alignment, finding the assignments for the inner nodes that make the tree as short as possible, is easy. The corresponding problem without a given alignment was introduced by Sankoff [28] in one of the earliest papers on multiple sequence alignment. He assumes that a tree topology and a set of unaligned sequences are given. The problem is to assign sequences to the inner nodes such that the sum of the implied edge lengths is minimal. Note that given a tree with all nodes annotated with sequences (of possibly different length), it is straightforward to construct a multiple alignment for the sequences at the leaves (cf. [16], Appendix).


Therefore the assignment of sequences to the inner nodes of a tree already implies a multiple alignment. Sankoff’s formulation has become known as tree alignment [2] and has been shown to be NP-hard [23]. Sankoff [28] gave a dynamic programming algorithm for tree alignment. He merges the high dimensional version of the dynamic programming algorithm for pairwise sequence alignment with the Fitch-Hartigan algorithm. The latter is used to calculate the score of an edge in the high dimensional edit matrix by optimizing the tree length. Jiang et al. [23] present an approximation algorithm which “lifts” sequences from the leaves into the tree. They prove a performance ratio of 2 and proceed to develop a polynomial time approximation scheme.

Generalized tree alignment. The parsimony formulation of phylogeny reconstruction leads to a fast algorithm when the tree topology is given and to an NP-hard problem otherwise. Without both sequence alignment and tree topology one obtains the following link between the Steiner tree formalization of phylogeny and multiple sequence alignment. Consider a graph whose nodes are all sequences with length smaller than some large constant. An edge is inserted whenever two sequences have an edit distance of 1 from each other. If now one could find a minimal Steiner tree to link the given sequences one would at the same time obtain a phylogenetic tree and a multiple sequence alignment. This formalization of the problem is called generalized tree alignment [23] and has been shown to be MAX SNP-hard [23]. This means that it cannot be approximated to arbitrary precision in polynomial time (unless P = NP). Nevertheless, this formalization appears to be a perfect synthesis of phylogeny reconstruction and multiple alignment.

The interpretation of phylogeny construction as finding a minimal Steiner tree immediately suggests the application of approximation algorithms for minimal Steiner trees [22] to the phylogeny and the phylogeny/alignment problems. Gusfield [16] suggested using the minimum spanning tree (MST) heuristic. The obvious drawback in applying the MST heuristic is that the tree thus constructed will not necessarily have species as labels. Formally one can remedy the situation by introducing additional edges of length 0. However, the resulting tree remains biologically unreasonable since ancestral sequences are assumed identical to contemporary sequences.

Heuristics. In the context of finding the most parsimonious tree the minimum spanning tree heuristic has already been applied by Foulds et al. [11]. There, Prim’s algorithm for obtaining the minimum spanning tree is used. Additionally, in [11] a procedure called “coalescement” is introduced. It is intended to remedy the situation where in a spanning tree given sequences end up as inner nodes in the tree. When a sequence is attached to a node by the spanning tree algorithm, Foulds et al. instead look for a sequence that lies “in between” the two species and introduce a new node there.

In the world of phylogeny/alignment the construction that is closest to coalescement is the use of sequence graphs as introduced by Hein [18]. A sequence graph is a compact structure representing all optimal alignments between two sequences or between two sequence graphs. Formally, a sequence graph is a network (a DAG) with edges annotated by sets of letters or gaps. The sequence graph is constructed in such a way that each path from the source to the sink of the DAG corresponds to an optimal alignment. At the same time it can be used to represent Steiner sequences. Hein introduces the alignment of sequence graphs in a way analogous to the alignment of networks given in [30]. In his multiple sequence alignment program Hein has a tree given (more accurately, he calculates one based on the pairwise distances using clustering). As he aligns along the tree he represents clusters by sequence graphs. When assigning sequences to inner nodes based on the sequence graph he achieves an effect that may be termed coalescement. In [31] sequence graphs are applied for the design of an approximation algorithm for generalized tree alignment. This is based on a variant of the MST heuristic where the assignment of ancestral sequences is done by selecting from a sequence graph after the topology of the tree was derived.

Coalescement may alternatively be effected by three-way alignment. In a sense, this has been introduced by Sankoff et al. The dynamic programming algorithm for tree alignment [28], when specialized to three sequences, calculates a Steiner sequence for this set. In [29] a heuristic method is used to produce practical software. They start with some reasonable tree alignment and then refine the sequences at the inner nodes by repeatedly computing three-way alignments of adjacent nodes. The resulting Steiner sequence replaces what was at the central node before. This procedure can only shorten the tree length and since it will remain positive it has to converge.

An approximation algorithm for tree alignment that is modeled after the Fitch-Hartigan algorithm is given by Ravi and Kececioglu [26]. They first assign lists of candidate sequences to the inner nodes of the given tree. Then a dynamic programming procedure similar to the Fitch-Hartigan algorithm is applied. The main difference is that without the alignment this algorithm cannot proceed column by column, as it usually does, but instead uses distances across tree edges.
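The idea of running a Fitch-Hartigan-like dynamic program over candidate lists, using distances across tree edges instead of alignment columns, can be sketched as follows. This is a minimal Python illustration and not the algorithm of [26]: how the candidate lists are chosen is left entirely to the caller, unit-cost edit distance is assumed, and the tree encoding and all names (subtree_costs, min_tree_length, the node labels) are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two unaligned sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def subtree_costs(node, candidates):
    """Return {sequence: minimal length of the subtree if node gets it}.

    A leaf is a sequence (str) and is fixed; an inner node is a tuple
    (name, child, child, ...), and candidates[name] lists the sequences
    that may be assigned to it.
    """
    if isinstance(node, str):
        return {node: 0}
    name, *children = node
    child_tables = [subtree_costs(child, candidates) for child in children]
    table = {}
    for s in candidates[name]:
        # For each child pick the label that minimizes edge length plus
        # the cost already accumulated below that child.
        table[s] = sum(min(edit_distance(s, t) + cost
                           for t, cost in child_table.items())
                       for child_table in child_tables)
    return table


def min_tree_length(root, candidates):
    """Length of the shortest tree over all candidate assignments."""
    return min(subtree_costs(root, candidates).values())


leaves = ["ACGT", "AGT", "ACT", "ACCT"]
tree = ("r", ("u", "ACGT", "AGT"), ("v", "ACT", "ACCT"))
# Assumption for the example: every inner node may take any leaf sequence.
candidates = {"r": leaves, "u": leaves, "v": leaves}
print(min_tree_length(tree, candidates))
```

Restricting the candidates to the given sequences keeps the tables small; the quality of the resulting tree alignment then depends on how well the candidate lists cover good ancestral sequences.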

3. Hierarchical alignment and profiles

Profiles and clustering. The computational problems in generalizing the dynamic programming algorithm for pairwise alignment to multiple sequences have led to the development of many heuristics for multiple sequence alignment. The most widely used class of heuristics hierarchically clusters the sequences, which then allows the set to be aligned by iterating pairwise alignments. Given a set of unaligned sequences and the clustering, the sequences in two-element clusters are aligned to each other. These alignments are subsequently treated as single entities and their alignment will remain fixed throughout the remainder of the procedure. Doolittle and Feng [7] coined the phrase “once a gap always a gap”. Two groups, each of which has already been aligned, are then aligned with each other either based on, e.g., reference sequences or by describing the pre-aligned groups as profiles (to be explained below). When biological criteria are applied to judge the results, sophisticated versions of this method perform amazingly well [34]. Together with the speed of such a computation this has made them widely accepted among biologists. At the same time these methods suffer from the fact that the initial clustering influences the outcome [24]. From a more formal point of view it is unclear whether such a method optimizes any scoring function or whether its outcome fulfills any stringent condition. In the following paragraphs the notion of a “profile” will be explained and some analogies between hierarchical alignment, clustering and the computation of minimum spanning trees will be introduced.

Profiles have been introduced in [39] and [15]. To obtain a profile from a multiple alignment each column of the multiple alignment is converted to a distribution vector of the letters (possibly including a gap character) at that position. The profile is a matrix with these vectors as columns. The advantage lies in the fact that the profile does not get more complex to handle as the number of sequences increases and that similarity/distance scores between letters carry over naturally to similarity/distance scores between profile columns [37]. This has led many authors to use these similarity values in the dynamic programming procedure for aligning two sequences. This results in an alignment of the two profiles which can be carried over to the two pre-aligned groups. What is usually ignored is that gap treatment is significantly more complicated in profile alignment than in regular sequence alignment [14].

When introducing the circularity in tree-construction and alignment it was already mentioned that most practical alignment algorithms rely on an initial hierarchical clustering. A hierarchical clustering in this context can be viewed as a rooted binary tree whose leaves are labeled with the given sequences. Typically, this clustering is calculated by an agglomerative clustering algorithm [5, 32]. These procedures successively collapse more and more sequences into clusters. The decision which distinguishes different clustering algorithms is how to define the distance between a newly formed cluster and the other clusters or sequences. To be more precise, let a matrix of distances between objects be given. Select the smallest distance, say between objects i and j. Collapse the two objects into a cluster. To treat this cluster as a new object, a distance between it and the other objects needs to be defined. Single linkage clustering at that point chooses for each object not in the cluster the minimal distance to either of the two objects in the newly formed cluster. This is equivalent to choosing the minimum distance to any of the elements in a cluster. The new cluster from then on plays the same role as any other object.
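The agglomerative procedure just described can be sketched in Python as follows. The sketch recomputes the distance between two clusters from their members rather than updating a matrix, which for the single linkage rule gives the same result as the update described above; the rule is passed in as a function so that other update rules can be tried. The names (agglomerative, linkage) and the small distance table are hypothetical.

```python
from itertools import combinations


def agglomerative(labels, dist, linkage=min):
    """Hierarchical clustering returned as a rooted binary tree.

    labels:  the objects to cluster
    dist:    function giving the distance between two objects
    linkage: how distances to a cluster are derived from its members
             (min corresponds to single linkage)
    """
    # Each cluster is stored as (subtree, list of member objects).
    clusters = [(x, [x]) for x in labels]

    def cluster_distance(c1, c2):
        return linkage(dist(a, b) for a in c1[1] for b in c2[1])

    while len(clusters) > 1:
        # Select the pair of clusters at minimum distance and collapse it.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]],
                                                  clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise distances between four sequences a, b, c, d.
table = {("a", "b"): 1, ("a", "c"): 4, ("a", "d"): 5,
         ("b", "c"): 3, ("b", "d"): 6, ("c", "d"): 2}


def dist(x, y):
    return table[tuple(sorted((x, y)))]


print(agglomerative(["a", "b", "c", "d"], dist))
```

The nested tuples returned by the function are exactly the rooted binary tree that a hierarchical alignment program would follow when iterating pairwise alignments.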
A new pair of objects at minimum distance is then identified and the process is repeated. Maximum linkage clustering differs in the definition of the updated distances between a newly formed cluster and the others. In the maximum linkage procedure the distance between an object and a newly formed cluster is defined as the maximum over the distances between the object and the elements of the cluster. Nevertheless, the objects to be clustered together are still those at minimum distance from each other. There are other variants of the procedure which instead of using minimum or maximum calculate averages.

It is interesting to compare this procedure to Kruskal’s algorithm to compute a minimal spanning tree (cf. [5], sections 6.10 and 6.11). Kruskal’s algorithm proceeds by constructing a forest. For a tree in the forest, the distance to each other node or tree is the minimum distance between the nodes involved. Thus, the nodes in a subtree that is constructed at some point during the computation of the minimal spanning tree also form a cluster in the hierarchical clustering. However, when calculating the tree the individual “attachment points” in the subtree are important and thus the information where the minimum distance is assumed needs to be retained. In single linkage clustering the information where the minimum was obtained is not retained.

More successful clustering methods have used averaging procedures to set the distances to new clusters [32, 27]. Interestingly, profiles can be thought of as midpoints between sequences [39, 37] and the distance from one sequence to a profile can be seen as an average distance between the sequences represented in the profile and the other sequence [37]. In sequence alignment this is a great advantage over calculating the average of pairwise alignment distances. Individual pairwise distances may be based on contradictory alignments among the sequences that make up the profile. Using profile distance instead guarantees that all distances that go into the average can actually be realized.

This suggests using profile distances in the clustering procedure. Practically this means that instead of calculating a clustering and then iterating profile alignments in the order determined by the clustering one can do without the initial clustering. One starts by collapsing the closest pair into a profile and then recalculates the distances of the remaining sequences to the profile. The shortest one is chosen again and a new profile is calculated, and so on. This elegant method has been suggested by Taylor [33] and later by Wong et al. [43]. However, the approach has remained almost unnoticed both in the biology and in the CS community.

Sequence weighting. Another precaution needs to be taken in the application of profiles. Since distances to a profile are average distances to the sequences involved in the profile, the same caveats apply as to any averaging procedure. When the profile is heavily biased towards a certain group of similar sequences, more distant sequences may well be present but will hardly influence the computation any more. Therefore schemes were developed to attach weights to sequences in profiles. This results in profile distance becoming a weighted average distance [37]. There are numerous rationales for the design of such weighting schemes; for review see [37] and [20]. Recently sequence weighting has found its way into profile-based multiple alignment packages [34, 13].

Prim-like heuristics. In other work that is being pursued at the moment [36] the idea of using 3-way alignments [29] is applied within the framework of profiles and alignments. It relies on a widely used heuristic to compute phylogenetic trees which in the case of tree-like data reconstructs the underlying tree [42]. It is used, e.g., by Felsenstein to compute Maximum Likelihood trees [6]. Let a scoring function for a tree topology be given. The heuristic takes the first two sequences and builds a (very simple) tree. Now, assume a tree on k species has been built already. Sequence k + 1 may now branch off any of the edges of that tree. Each of these possibilities results in a new topology for which a score can be calculated. The topology that results in the best score wins and decides where the new sequence is inserted. Then this tree becomes the starting point for insertion of the next sequence. Of course, this is not guaranteed to find an optimal solution. Still, together with some rearrangements and optimization in the end it is frequently used in practical implementations.

This heuristic bears a vague resemblance to Prim’s algorithm for computing a minimum spanning tree. First, as stated above, the heuristic always inserts the next edge based on the given ordering of input data. The obvious way of making this independent of input order is to start with the shortest distance. Then the sequence is chosen which after insertion into the tree creates the shortest tree. Having three sequences linked by the tree already, one finds the fourth one by testing the other sequences to see which one, after insertion into the best edge of the tree, leads to the best tree, etc. After eliminating the order dependence in this way, the analogy to Prim’s algorithm becomes visible. Prim’s algorithm extends a growing tree by adding in each step the shortest edge incident to the tree. In a graph this is equivalent to extending the tree as little as possible whenever a new edge is added. This is exactly the principle in the insertion heuristic.
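The insertion heuristic in its simple, order-dependent form can be sketched in Python. The text leaves the scoring function open; as an assumption for this example the sketch scores a topology by its parsimony length on aligned sequences, computed with a Fitch-style bottom-up pass. Trees are encoded as nested pairs, and all function names are hypothetical.

```python
def fitch_length(tree):
    """Parsimony length of a rooted binary tree of aligned sequences."""
    def column(node, col):
        if isinstance(node, str):
            return {node[col]}, 0
        (ls, lc), (rs, rc) = column(node[0], col), column(node[1], col)
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)

    leaf = tree
    while not isinstance(leaf, str):
        leaf = leaf[0]
    return sum(column(tree, col)[1] for col in range(len(leaf)))


def insertions(tree, new_leaf):
    """Yield every tree obtained by letting new_leaf branch off one edge."""
    yield (tree, new_leaf)  # the edge leading into this subtree
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, new_leaf):
            yield (t, right)
        for t in insertions(right, new_leaf):
            yield (left, t)


def stepwise_insertion(seqs, score=fitch_length):
    """Insert the sequences one by one, each at its best-scoring edge."""
    tree = (seqs[0], seqs[1])
    for s in seqs[2:]:
        # Lower score is better here; ties keep the first topology found.
        tree = min(insertions(tree, s), key=score)
    return tree


seqs = ["AACC", "TTGG", "AACG", "TTGC"]
tree = stepwise_insertion(seqs)
print(tree, fitch_length(tree))
```

In the rooted encoding some insertion points describe the same unrooted topology; since the parsimony length does not depend on the position of the root, this redundancy costs time but does not change the result. Making the procedure independent of input order would add an outer minimization over the sequences not yet inserted.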
Note, however, that in constructing a spanning tree, vertices are attached to other vertices. In the insertion heuristic, sequences are generally inserted into edges, not attached to vertices. As in the character based methods, this can be interpreted as a coalescement, thus improving the tree length by moving from a spanning tree to a Steiner tree. Using profiles to represent pre-aligned groups and 3-way alignments between profiles, this scheme can be applied in the following way [36].

At any step we deal with a pair made up of a tree and an alignment which go together. Let such a pair for k sequences be given. Sequence k + 1 needs to be inserted. Again, each edge of the existing tree is inspected. An edge implies a split of the already aligned sequences into two groups. Think of each of these groups as one profile. To insert a new sequence one first calculates the 3-way alignment between those two profiles and the new sequence. The score of the new tree is its length with edge lengths chosen to approximate the new distances. In [36] this is done in a greedy fashion without recomputing the lengths in the tree to the left and right of the edge where the insertion takes place. A new sequence is finally added to the edge where it fits best, i.e. where the tree length grows minimally. Along the lines of the modification in the insertion heuristic, the order dependence can be eliminated by also optimizing over the sequences to choose the one which leads to the smallest growth in tree length.

Profiles and linear programs. A disadvantage of profile methods like the hierarchical alignment methods or the 3-way alignment based insertion heuristic is that it is not clear which scoring function they should optimize or, to put it mildly, for which objective function they constitute a reasonable heuristic. Nevertheless, the fact that hierarchical alignment is biologically extremely successful should be reason enough to think harder about the question whether these methods have a sound basis. In fact, one of the early formalizations of phylogenetic tree construction might shed some light on this.

To see the connections it is helpful to state the problem of tree approximation in general terms. Let a distance matrix D on the species be given. A tree with the path metric between its leaves implies another distance matrix D_T. A distance matrix D is called additive if there exists a tree whose associated matrix D_T equals D. In practice, the distance matrix between molecular sequences will not be additive. A tree T is searched whose D_T approximates D. Although the true difficulty lies in finding the best topology, one may formulate two alternative formalizations of the approximation task when the topology is given. In an abuse of notation think of the matrices D and D_T as column vectors listing all pairwise distances. Let A be the path-edge incidence matrix of the given tree topology, where by a path a leaf-to-leaf path is meant.

Each row of matrix A corresponds to a.path between

two species and each column corrsponds to an edge. If an edge lies on a certain path the corresponding matrix entry is 1 and otherwise it is 0. The vector of edgelengths of an edge-weighted tree shall be denoted by a column-vector e. Then for an additive matrix Dt we would have Dt = Ae. Since few matrices are additive the general equation D = Ae for an arbitrary distance matrix D and for a given tree topology A will rarely be solvable. The first reaction would be to approximate the solution in a least squares sense, i.e. to choose edge-lengths e such that the euclidian distance between D and the tree-like distances Ae is minimized [9]. Alternatively, Waterman et al. [42] have formulated a linear program. The objective function of the linear program is similar in spirit to the Steiner tree setup: the overall length of

SEQUENCE ALIGNMENT AND PHYLOGENY CONSTRUCTION

61

the tree is minimized. One constraint is that the edges should have positive length. The other constraint is derived from the fact that alignment distance between two sequences underestimates the number of changes that have occurred between the two sequences during their development. This results in the inequality Dt > D, elementwise, which is the other constraint for the linear program. Solving this linear program then yields an assignment for edge lengths under a given topology. Finding the best topology still requires checking them all. Interestingly, one can formulate another linear program which then promises to be compatible with profile alignment. Note first that when aligning two profiles, one at the same time minimizes the average distance between the sequences contained in the two pre-aligned groups [37, 14, 36]. It thus becomes conceivable to formulate the following alternative constraints for the linear program described above: The inequality Dt > D may be changed to compare average distances between profiles. Let G and G' be two groups and denote by d(G, G') the average distance measure in the matrix D between members of G and G'. dr{G,G') is the average distance measured in the tree [36]. To be more specific, Ge and Gj might be the two groups of sequences defined by an edge e in a tree. For each edge one obtains a constraint for the linear program by demanding that the average distance in the tree should be larger than the average distance in the distance matrix: dT(Ge,Ge') > d(Ge,Ge') for all edges e in a tree. This is a weaker constraint than the original one which demanded that each pairwise distance should be smaller than the corresponding distance in the tree. This scheme now can be linked to alignment. Let only a set of unaligned sequences, and thus no distance matrix, be given. From any alignment A of the sequences one can derive a distance matrix D(A). 
An optimal pair of tree and alignment can now be defined as an alignment for which D(A) gives rise to a tree via the above linear program and this tree is of minimal length. Of course, there is no danger in conjecturing that a formal problem of identifying such a pair of tree and alignment is NP-hard. However, profile alignment may be a reasonable element in designing a heuristic since it minimizes the average distance between two groups on the level of sequences. Thus, a profile alignment yields the appropriate basis for subsequent minimization of tree length under the above weak constraints. For example, for any starting alignment and tree one could align the profiles on either side of any edge and thus improve the score. With the additional use of sequence weighting this is also the basis for a heuristic applied by Gotoh [12].
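The relation D_T = Ae and the two kinds of constraints discussed above can be illustrated with a small sketch. It is not the linear program itself: it only builds the path/edge matrix A for a given topology, computes the tree distances, and checks whether a candidate vector of edge lengths is feasible. The names `path_edge_matrix`, `tree_distances`, `satisfies_pairwise` and `satisfies_average` are hypothetical, each edge is represented (as an assumption of this sketch) by the set of leaves on one of its sides, and the inequalities are checked in the non-strict form D_T >= D.

```python
from itertools import combinations

def path_edge_matrix(leaves, splits):
    """Rows are unordered leaf pairs, columns are edges.

    Each edge is given as the set of leaves on one side of it (an assumption
    of this sketch). An edge lies on the path between i and j exactly when
    it separates them, in which case the entry is 1, otherwise 0.
    """
    pairs = list(combinations(leaves, 2))
    A = [[1 if (i in s) != (j in s) else 0 for s in splits] for i, j in pairs]
    return pairs, A

def tree_distances(leaves, splits, lengths):
    """Compute D_T = A e as a dict keyed by leaf pair."""
    pairs, A = path_edge_matrix(leaves, splits)
    return {p: sum(a * e for a, e in zip(row, lengths)) for p, row in zip(pairs, A)}

def satisfies_pairwise(D, DT):
    """Original constraint: every tree distance is at least the observed one."""
    return all(DT[p] >= D[p] for p in D)

def satisfies_average(leaves, splits, D, DT):
    """Weaker constraint: for every edge, the average tree distance between
    the two groups it defines is at least the average observed distance.
    Sums are compared, which is equivalent because both averages run over
    the same set of pairs. Leaves must be listed in sorted order so that
    the sorted pair keys match those produced by combinations()."""
    for s in splits:
        other = [x for x in leaves if x not in s]
        cross = [tuple(sorted((i, j))) for i in s for j in other]
        if sum(DT[p] for p in cross) < sum(D[p] for p in cross):
            return False
    return True

# Quartet tree ((a,b),(c,d)): four leaf edges and one inner edge.
leaves = ["a", "b", "c", "d"]
splits = [{"a"}, {"b"}, {"c"}, {"d"}, {"a", "b"}]
lengths = [1, 2, 3, 4, 5]
DT = tree_distances(leaves, splits, lengths)
tree_length = sum(lengths)  # the quantity the linear program would minimize
```

A distance matrix can violate the pairwise constraint for a given set of edge lengths and still satisfy the averaged one, which is the sense in which the second formulation is weaker.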

4. Conclusion

There clearly are two very different ways of thinking about multiple alignment and phylogeny reconstruction. On one hand there are the character based methods, which can be briefly described as Steiner tree methods. On the other hand many practical algorithms are distance based and apply profiles for the purpose of averaging over sequences. Interestingly, when one believes the interpretation as linear programs, even the second class seems to have the aim of minimizing the length of some underlying tree. It thus applies a minimum mutation criterion. The real contrast is between the averaging procedure implicit in the use of profiles as opposed to the assignment of specific sequences to the inner nodes. Most of the practical algorithms are judged based on biological criteria like the agreement between the computed sequence alignment and a structure based


sequence alignment. It is not surprising that, e.g., for the study of structural aspects of proteins other formalizations of multiple sequence alignment are needed than for the study of evolution. Thus, it should be the application that one has in mind that determines the formalization. The scheme that formalizes phylogeny and alignment using linear programs and average distances is still speculative. Nevertheless, it leads to a formulation for which there are at least some directions as to how the problem can be attacked. Additionally, the framework of profile alignment and clustering procedures to link tree and alignment might lead to a theoretical underpinning for the hierarchical clustering algorithms which are so successful in practice.

References

1. L. Allison and C. S. Wallace, The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments, J. Mol. Evol. 39 (1994), no. 4, 418-430.
2. Stephen F. Altschul and David J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM Journal of Applied Mathematics 49 (1989), 197-209.
3. S. C. Chan, A. K. C. Wong, and D. K. Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol. 54 (1992), no. 4, 563-598.
4. William H. E. Day, Computational complexity of inferring phylogenies from dissimilarity matrices, Bulletin of Mathematical Biology 49 (1987), 461-467.
5. Richard O. Duda and Peter E. Hart, Pattern classification and scene analysis, John Wiley & Sons, 1973.
6. J. Felsenstein, Evolutionary trees from DNA sequences: a maximum likelihood approach, Journal of Molecular Evolution 17 (1981), 368-376.
7. D.-F. Feng and R. F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25 (1987), 351-360.
8. Walter M. Fitch, Toward defining the course of evolution: Minimum change for specific tree topology, Systematic Zoology 20 (1971), 406-416.
9. Walter M. Fitch and Emanuel Margoliash, Construction of phylogenetic trees, Science 155 (1967), 279-284.
10. L. R. Foulds, Combinatorial optimization for undergraduates, Springer Verlag, New York, 1984.
11. L. R. Foulds, M. D. Hendy, and D. Penny, A graph theoretic approach to the development of minimal phylogenetic trees, Journal of Molecular Evolution 13 (1979), 127-149.
12. O. Gotoh, A weighting system and algorithm for aligning many phylogenetically related sequences, Comp. Appl. Biosci. 11 (1995), 443-551.
13. O. Gotoh, Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, J. Mol. Biol. 264 (1996), 823-838.
14. Osamu Gotoh, Optimal alignment between groups of sequences and its application to multiple alignment, Computer Applications in the Biosciences 9 (1993), 361-370.
15. M. Gribskov, M. McLachlan, and D. Eisenberg, Profile analysis: detection of distantly related proteins, Proc. Natl. Acad. Sci. USA 84 (1987), 4355-4358.
16. D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. 55 (1993), no. 1, 141-154.
17. J. A. Hartigan, Minimum mutation fits to a given tree, Biometrics 29 (1973), 53-65.
18. Jotun Hein, A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given, Molecular Biology and Evolution 6 (1989), 649-668.
19. M. D. Hendy and D. Penny, Branch and bound algorithms to determine minimal evolutionary trees, Mathematical Biosciences 59 (1982), 277-290.
20. S. Henikoff and J. G. Henikoff, Position-based sequence weights, J. Mol. Biol. 4 (1994), 574-578.
21. P. Hogeweg and B. Hesper, The alignment of sets of sequences and the construction of phylogenetic trees, an integrated method, J. Mol. Evol. 20 (1984), 175-186.
22. F. K. Hwang, D. S. Richards, and P. Winter, The Steiner tree problem, North-Holland, 1992.
23. T. Jiang, E. L. Lawler, and L. Wang, Aligning sequences via an evolutionary tree: complexity and approximation, Symposium on Theory of Computing, 1994 to appear, pp. ?-?
24. J. A. Lake, The order of sequence alignment can bias the selection of tree topology, Molecular Biology and Evolution 8 (1991), 378-385.
25. G. Mitchison and R. Durbin, Tree-based maximal likelihood substitution matrices and hidden Markov models, J. Mol. Evol. 41 (1995), 1139-1151.
26. R. Ravi and John D. Kececioglu, Approximation algorithms for multiple sequence alignment, Combinatorial Pattern Matching (Zvi Galil and Esko Ukkonen, eds.), 1995, pp. 330-339.
27. N. Saitou and M. Nei, The neighbor-joining method: A new method for reconstructing phylogenetic trees, Molecular Biology and Evolution 4 (1987), 406-425.
28. D. Sankoff, Minimal mutation trees of sequences, SIAM J. Appl. Math. 28 (1975), 35-42.
29. D. Sankoff, R. J. Cedergren, and G. Lapalme, Frequency of insertion-deletion, transversion, and transition in evolution of 5S ribosomal RNA, J. Mol. Evol. 7 (1976), 133-149.
30. D. Sankoff and J. B. Kruskal, Time warps, string edits and macromolecules: the theory and practice of sequence comparison, Addison Wesley, 1983.
31. B. Schwikowski and Martin Vingron, The deferred path heuristic for the generalized tree alignment problem, RECOMB Proceedings, ACM Press, New York, 1997, pp. 257-266.
32. David L. Swofford and Gary J. Olsen, Phylogeny Reconstruction, Molecular Systematics (D. M. Hillis and C. Moritz, eds.), Sinauer Ass., Inc., Sunderland, Massachusetts, USA, 1990, pp. 411-501.
33. Willie R. Taylor, A flexible method to align large numbers of biological sequences, J. Mol. Evol. 28 (1988), 161-169.
34. Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research 22 (1994), 4673-4680.
35. Jeffrey L. Thorne and Hirohisa Kishino, Freeing phylogenies from artifacts of alignment, Mol. Biol. Evol. 9 (1992), 1148-1162.
36. M. Vingron and A. von Haeseler, Towards integration of multiple sequence alignment and phylogenetic tree construction, Journal of Computational Biology 4 (1997), 23-34.
37. Martin Vingron and Peter R. Sibbald, Weighting in sequence space: A comparison of methods in terms of generalized sequences, Proc. Natl. Acad. Sci. USA 90 (1993), 8777-8781.
38. L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology 1 (1994), 337-348.
39. M. S. Waterman and M. D. Perlwitz, Line geometries for sequence comparisons, Bull. Math. Biol. 46 (1984), no. 4, 567-577.
40. Michael S. Waterman, General methods of sequence comparison, Bulletin of Mathematical Biology 46 (1984), 473-500.
41. Michael S. Waterman, Introduction to computational biology, Chapman & Hall, 1995.
42. Michael S. Waterman, Temple F. Smith, M. Singh, and W. A. Beyer, Additive evolutionary trees, J. Theor. Biol. 64 (1977), 199-213.
43. A. K. C. Wong, S. C. Chan, and D. K. Y. Chiu, A multiple sequence comparison method, Bull. Math. Biol. 55 (1993), 465-486.

Deutsches Krebsforschungszentrum (DKFZ), Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany
E-mail address: [email protected]

DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 47, 1999

A New Look at Tree Models for Multiple Sequence Alignment

Dannie Durand

Abstract. Evolutionary trees are frequently used as the underlying model in the design of algorithms, optimization criteria and software packages for multiple sequence alignment (MSA). In this paper, we reexamine the suitability of trees as a universal model for MSA in light of the broad range of biological questions that MSA’s are used to address. A tree model consists of a tree topology and a model of accepted mutations along the branches. After surveying the major applications of MSA, examples from the molecular biology literature are used to illustrate situations in which this tree model fails. This occurs when the relationship between residues in a column cannot be described by a tree; for example, in some structural and functional applications of MSA. It also occurs in situations, such as lateral gene transfer, where an entire gene cannot be modeled by a unique tree. In cases of nonparsimonious data or convergent evolution, it may be difficult to find a consistent mutational model. We hope that this survey will promote dialogue between biologists and computer scientists, leading to more biologically realistic research on MSA.

1991 Mathematics Subject Classification. Primary 92-02; Secondary 92D15, 92D20, 92-08, 68-02.
Supported by NSF Grants BIR-94-13215 A01 and BIR-94-12594.

(c) 1999 American Mathematical Society

1. Introduction

Multiple sequence alignment (MSA) is important in functional, structural and evolutionary studies of sequence data. Much research has focussed on the formal study of MSA as an optimization problem and several optimization criteria have been discussed at length in the literature [9, 36, 43, 45, 59, 66, 76, 86]. In addition, many software tools for constructing MSA’s are available, mostly based on heuristics although some use exact or branch-and-bound techniques (see [18, 55] for surveys).

The concept of an evolutionary tree is a widely used model for MSA, where the tree encodes the historical relationships between the modern sequences in the alignment. Tree models have been used to construct column scoring functions for optimization criteria. Trees have also been used as implicit or explicit structures in the design of algorithms and heuristics.

The design of MSA algorithms is based on assumptions about the application and the nature of the sequence data. While these assumptions must, of necessity, be abstractions of reality, a better understanding of the true nature of MSA applications and data will lead to better algorithm design. As the amount and diversity of sequence data grows, MSA is being applied to a wide variety of biological questions. Although the tree model is frequently viewed as a universal model for MSA, for some applications it is not compelling. In addition, some data sets cannot be modeled by a unique tree.

In this paper, the biological applications of MSA are reviewed. Examples from the biology literature are used to illustrate cases where the tree model is not appropriate for multiple alignments. First, multiple sequence alignment is introduced as an optimization problem in Section 2 and optimization criteria and complexity results are reviewed. Next, in Section 3, the major applications of MSA are surveyed. A discussion of data sets where the tree model fails is presented in Section 4.

2. Multiple Sequence Alignment

First we provide a brief introduction to the multiple sequence alignment problem and its complexity, the optimization criteria used to evaluate alignments and algorithmic approaches to solving it. This is not a comprehensive survey of multiple sequence alignment algorithms. Surveys of multiple sequence alignment algorithms can be found in [18] and in the introduction to [59]. An experimental comparison of multiple sequence alignment algorithms is described in [55].

Multiple sequence alignment involves lining up a group of sequences to reveal similarities shared across the group. A sequence is a string of symbols chosen from an alphabet, Σ, where Σ = {A,C,G,T} for nucleic acid sequences and Σ contains the twenty amino acids for protein sequences. In a pairwise alignment of two sequences, S_i and S_j, the sequences are lined up one on top of the other so that each symbol in S_i is paired with a symbol in S_j in a series of columns of height two. Blanks may be inserted into either sequence or at the ends of either sequence, so that a symbol may be paired with another symbol or with a blank. These blanks represent mutations in the form of insertions or deletions. Since it is impossible to tell whether a symbol was inserted in one sequence or deleted from the other, blanks are also referred to as “indels”. Each column in the alignment contains a match, a mismatch or an indel. Metrics for associating a cost or similarity score with two paired symbols are discussed below. We will designate the cost of two symbols, x and y, as δ(x,y), where x, y ∈ Σ ∪ {_} and _ is the blank character.

A global alignment seeks to line up the two sequences so that the similarity in the alignment as a whole is maximized. In a local alignment, we seek a subsequence from each sequence such that when the two are aligned they yield the highest scoring region. For local alignments the average score must be negative; otherwise the longer regions would score above shorter regions by virtue of their length alone, regardless of the degree of similarity between the residues [4].

Multiple sequence alignment is an extension of pairwise alignment to more than two sequences. In this case, k sequences are lined up, inserting indels as necessary into any of the k sequences, to obtain a sequence of columns of k symbols. Again, either a global alignment, maximizing the similarity over all columns together, or a local alignment, k aligned subsequences, one from each sequence, that yield the lowest cost region, may be sought.
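The pairwise global alignment described above can be sketched with the standard dynamic program over prefixes of the two sequences. This is a minimal illustration in the cost (rather than similarity) formulation, assuming a toy cost δ of 0 for a match and 1 for a mismatch or for a symbol paired with a blank; the names `delta` and `global_alignment_cost` are hypothetical and the toy costs are not taken from the text.

```python
GAP = "_"

def delta(x, y):
    """Toy symbol cost (an assumption): 0 for a match, 1 otherwise."""
    return 0 if x == y else 1

def global_alignment_cost(s, t):
    """Minimum total column cost over all global alignments of s and t."""
    n, m = len(s), len(t)
    # cost[i][j] is the cheapest alignment of the prefixes s[:i] and t[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),  # match or mismatch
                cost[i - 1][j] + delta(s[i - 1], GAP),           # blank in t
                cost[i][j - 1] + delta(GAP, t[j - 1]),           # blank in s
            )
    return cost[n][m]
```

With these unit costs the result is the ordinary edit distance; replacing `delta` by a substitution matrix lookup and a gap penalty gives the more realistic variants used in practice.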


Each column in a multiple sequence alignment captures a shared relationship between the residues in the column. The relationship sought depends on the application of alignment. For example, if the multiple alignment is used to illustrate evolutionary relationships, then the residues in each column are assumed to have a shared evolutionary history. If the multiple alignment is used to determine structure or function, then the residues in each column should have a shared structural or functional role. Trying to find the best alignment is equivalent to finding the alignment that correctly captures this relationship in each column.

We attempt to express this by associating a score with each column that expresses how well matched the residues in the column are. We then seek the multiple alignment with the best score, where the score of the MSA is the sum of the scores of its columns. Formally, we can express multiple sequence alignment as an optimization problem. We are given sequences S_1, ..., S_k of lengths N_1, ..., N_k, where S_i = s_i1 ... s_iN_i and each s_ih ∈ Σ. We seek a matrix A = {a_ij}, where a_ij ∈ Σ ∪ {_} and eliminating the indels from any row A_i gives the sequence S_i, such that the cost of the alignment

\[ D(A) = \sum_{j} d(a_{1j} \ldots a_{kj}) \]

is minimum. The choice of the column scoring function, d, should reflect the application of the alignment. Below, we review commonly used scoring functions and consider for which applications each is most appropriate.

Column Scores: Three metrics commonly used to evaluate each column are sum-of-pairs (SP), tree alignment (TA) and star alignment (SA). The sum-of-pairs cost [8, 57] of a column a_1j ... a_kj is the sum of the costs of all unordered pairs in the column

\[ d_{SP}(a_{1j} \ldots a_{kj}) = \sum_{p<q} \delta(a_{pj}, a_{qj}), \]

where [...] is a substitution matrix such as the Dayhoff [21, 69] or BLOSUM [41] matrices. In addition to a row for each residue, profiles generally have two rows for gaps, one for gap initiation and one for gap extension. This allows the fact that some columns may have many gaps to be reflected in the profile. Statistical representations of patterns, such as position weight matrices and profiles, have been used to represent characteristic patterns found in biopolymers and as input to programs that search for instances of sequences containing these patterns. Some examples of the use of these representations in searching protein data bases include [3, 35, 80, 81].
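As a rough illustration of the sum-of-pairs column cost and of the alignment cost D(A) as a sum over columns, both defined earlier in this section, the following sketch scores a small alignment. It assumes a toy δ (0 for a match or for two blanks, 1 for a mismatch, 2 for a residue paired with a blank); a real application would instead look δ up in a substitution matrix such as those mentioned above. The function names are hypothetical.

```python
from itertools import combinations

GAP = "_"

def delta(x, y):
    """Toy pair cost (an assumption, not a published matrix)."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return 2
    return 0 if x == y else 1

def sp_column_cost(column):
    """Sum of delta over all unordered pairs of symbols in one column."""
    return sum(delta(x, y) for x, y in combinations(column, 2))

def alignment_cost(rows):
    """D(A): the sum of the column costs of an alignment given as equal-length rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(sp_column_cost(col) for col in zip(*rows))

rows = ["AC_T",
        "AGGT",
        "A_GT"]
total = alignment_cost(rows)  # 0 + 5 + 4 + 0 under the toy costs
```

Because every pair of rows contributes to every column, the cost of evaluating one column grows quadratically with the number of sequences k.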

Statistical characterizations of sequence and structural motifs include Helix-Turn-Helix [50], the calcium-binding EF-hand [48] and libraries of motifs such as [10, 41, 71]. Statistical representations of patterns in nucleic acids have been used to characterize and search for regulatory regions in DNA (see, for example, [19, 28, 30, 73]).
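To make the idea of a profile with separate gap rows concrete, here is a minimal sketch that summarizes each column of an alignment. It is only an assumption-laden illustration: real profiles usually store position-specific scores or penalties, whereas this version simply records residue frequencies together with the fraction of sequences that open a gap and the fraction that extend one in each column. The name `build_profile` and the dictionary layout are hypothetical.

```python
from collections import Counter

GAP = "_"

def build_profile(rows, alphabet="ACGT"):
    """Return one dict per alignment column.

    Each dict holds the frequency of every residue in `alphabet`, plus
    'gap_open' (a gap whose left neighbour in the same row is not a gap)
    and 'gap_extend' (a gap that continues a gap in the same row).
    """
    k = len(rows)
    profile = []
    for j, column in enumerate(zip(*rows)):
        counts = Counter(c for c in column if c != GAP)
        entry = {a: counts[a] / k for a in alphabet}
        opened = sum(1 for r in rows if r[j] == GAP and (j == 0 or r[j - 1] != GAP))
        extended = sum(1 for r in rows if r[j] == GAP and j > 0 and r[j - 1] == GAP)
        entry["gap_open"] = opened / k
        entry["gap_extend"] = extended / k
        profile.append(entry)
    return profile

rows = ["AC__T",
        "ACG_T",
        "A_GGT",
        "ACGGT"]
profile = build_profile(rows)
```

Keeping the two gap rows separate lets a column that merely continues a long gap be treated differently from one where gaps frequently begin.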


MSA’s are also used to identify subsequences for laboratory techniques, such as PCR and library screening, that isolate DNA fragments containing a particular pattern. In these methods, a short piece of DNA containing the desired pattern (the “probe” or “primer”) is used to hybridize with fragments containing that pattern in the target DNA. A minimum degree of similarity, which depends on the stringency of the laboratory conditions, between the probe and target fragments is needed for hybridization to take place. For example, when new members of a known gene family are sought, an MSA is used to find a region that is conserved in all known members of the family. From the MSA, a consensus sequence is constructed that is close enough to all known members to hybridize with any of them. The hope is that this consensus sequence will also hybridize with the as yet undiscovered new members of the family. This is a use of MSA where homology is not of interest. Only similarity is important.
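A consensus sequence of the kind described above can be sketched as a simple majority vote per column. This is only an illustration: the threshold `min_fraction` is a hypothetical stand-in for the similarity required under given laboratory conditions, and the use of "N" for columns without a sufficiently dominant residue is an assumption of this sketch.

```python
from collections import Counter

GAP = "_"

def consensus(rows, min_fraction=0.5):
    """Majority-rule consensus of an alignment given as equal-length rows.

    A column contributes its most frequent symbol if that symbol is a
    residue (not a blank) and occurs in at least `min_fraction` of the
    rows; otherwise the column is reported as 'N'.
    """
    out = []
    for column in zip(*rows):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol != GAP and count / len(column) >= min_fraction:
            out.append(symbol)
        else:
            out.append("N")
    return "".join(out)

rows = ["ACGTA",
        "ACGTT",
        "ACCTA",
        "A_GTA"]
probe_candidate = consensus(rows)
```

Raising `min_fraction` restricts the consensus to strongly conserved columns, which corresponds to looking for a region conserved in all known members of the family.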

3.2. Phylogeny. A phylogeny describes the evolutionary history of taxa. Originally, a taxon referred to a species or group of species but now can also describe sub-organismal entities such as genes. Such histories are frequently expressed as evolutionary trees, branching processes that describe the inheritance relationships between the taxa. Traditionally, these relationships were determined by making morphological comparisons. Now that large quantities of molecular data are available, inheritance relationships can also be inferred from sequence information.

Traditionally, phylogenetic trees have been used to determine ancestral relationships between species. However, with the advent of molecular data, it is now possible to ask questions about the relationships between molecular entities including genes, proteins and regulatory elements. Because of gene duplications, exon shuffling and lateral displacements, the history of a gene may not be the same as the history of the species carrying it. With this in mind, both designers and users of MSA algorithms should focus carefully on what question they are trying to answer and which algorithm and data are most appropriate to answer it.

In an evolutionary tree, each leaf is associated with a modern day taxon. Branch points indicate points where taxa separated. These internal nodes are associated with ancestral taxa. If there is information available concerning the amount of evolutionary time that has passed between branching points, then the tree can be rooted and lengths (e.g. times) associated with the branches.

A multiple sequence alignment captures the inheritance information needed to reconstruct the tree. In this case, the sequences in the MSA are the leaves of the tree and it is assumed that there is some underlying tree that generated this alignment. However, the inheritance information in the alignment is incomplete (since only modern day sequences are available), the data may be noisy and the MSA may be incorrect.

With these limitations in mind, and given an assumption about how sequences evolve, we need an algorithm to reconstruct this tree. Most such algorithms fall into three categories: distance-based methods, character-based methods and maximum likelihood. A survey of tree reconstruction methods is given in Swofford and Olsen [77, 78].

Multiple sequence alignments are required as input for all of these methods. Biologists frequently use only a robust subset of the columns as input. In such cases, MSA’s are used to determine which columns represent unambiguous homology.


Distance-based phylogeny: Distance-based methods work by computing pairwise distances between taxa and fitting those distances to a tree. A distance matrix, M, is derived from an MSA by computing the distance between every pair of taxa in the alignment (not the minimum pairwise distance!). If these distances fit a tree, M is said to be additive and the tree is unique and can be reconstructed in O(k^2) time [87]. In general, however, M is not additive. In that case, a tree is sought that “best approximates” the observed distances. Mathematically, we seek a tree T such that

\[ \| M - D^{T} \|_{l} \tag{3.1} \]

is minimized, where D^T = (d^T_ij), d^T_ij is the distance between taxa i and j in T, and the norm, l, is usually 1, 2 or ∞. It has been shown that finding the optimal T is NP-hard for L_1, L_2 [20] and L_∞ [2]. In addition, approximation algorithms for this problem have been developed [2]. However, it is not clear that a tree that minimizes Equation 3.1 is a good approximation to the true tree. There are many reasons why distance matrices are not additive including noisy sequence data, poor multiple sequence alignments and the possibility that the underlying process that generated the data was not a tree to begin with.
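The two steps described above, deriving a distance matrix from the columns of an MSA and asking whether it is additive, can be sketched as follows. The distance used here is a plain count of differing columns, which is an assumption made for illustration rather than a measure taken from the text. Additivity is tested with the classical four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The function names are hypothetical, and the input is assumed to be a symmetric matrix that already satisfies the triangle inequality.

```python
from itertools import combinations

def mismatch_distance_matrix(rows):
    """Pairwise distances read off the MSA columns (number of differing columns)."""
    k = len(rows)
    M = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        M[i][j] = M[j][i] = sum(1 for x, y in zip(rows[i], rows[j]) if x != y)
    return M

def is_additive(M, tol=1e-9):
    """Four-point condition on a symmetric distance matrix M.

    For every quartet a, b, c, d, the two largest of
    M[a][b]+M[c][d], M[a][c]+M[b][d], M[a][d]+M[b][c] must coincide.
    """
    for a, b, c, d in combinations(range(len(M)), 4):
        sums = sorted([M[a][b] + M[c][d], M[a][c] + M[b][d], M[a][d] + M[b][c]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances induced by a quartet tree with leaf edges 1, 2, 3, 4 and inner edge 5.
tree_like = [[0, 3, 9, 10],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [10, 11, 7, 0]]
```

This check enumerates all quartets and is therefore much slower than the quadratic-time reconstruction cited above; it is meant only to make the notion of additivity tangible.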

Character-based phylogeny: In character-based approaches, a data set is treated as a set of characters. Each character can be in one of several states. Each species is specified by a state vector. Traditionally, characters were derived from morphological, behavioral or chemical data so states might include things like “Has wings?”, “Breeds in water?” or the strength of an immunological reaction. With molecular data, each column in an MSA is a character. The possible states are the four bases for DNA data and the twenty amino acids for protein data. Thus, a species is characterized by the sequence elements found in each column in the alignment.

A tree can be associated with each character. The state associated with each species is assigned to a leaf in the tree. Ancestral character states are inferred from leaf nodes and associated with internal nodes. By associating a cost with every state change, we can associate a cost with the tree by summing, over all branches, the cost of the state changes along each branch. This per character cost can be summed over all characters to obtain a character-based cost for the tree.

Within this framework, it is possible to find the best estimate of the true tree given a model of evolutionary change. The most common model is maximum parsimony, in which it is assumed that the true tree required a minimum number of evolutionary steps. The most parsimonious tree is the lowest cost tree over all possible tree topologies for k species and all possible inferred ancestral sequences for a given topology. One problem with maximum parsimony is that it does not take convergent or parallel evolution, multiple state changes or reversals in character state into account. Thus it is not a suitable model for all data sets.
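For a fixed topology and unit cost per state change, the per-character cost described above can be computed with the well-known bottom-up procedure usually attributed to Fitch. The sketch below is a minimal illustration under those assumptions (rooted binary tree, every state change costs 1); the tree encoding as nested pairs and the names `fitch` and `parsimony_cost` are hypothetical. It does not search over topologies, which is the hard part of the problem.

```python
def fitch(tree):
    """Return (candidate_states, changes) for one character.

    `tree` is either a leaf state (a string) or a pair (left, right)
    of subtrees. If the children's candidate sets intersect, the parent
    takes the intersection at no cost; otherwise it takes the union and
    one state change is counted.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_cost(topology, rows):
    """Sum the per-column costs; `topology` is a nested pair structure
    whose leaves are indices into the alignment rows."""
    def label(t, column):
        if isinstance(t, int):
            return column[t]
        return tuple(label(child, column) for child in t)
    return sum(fitch(label(topology, col))[1] for col in zip(*rows))

rows = ["AAG", "AAG", "CAG", "CAT"]
cost_grouped = parsimony_cost(((0, 1), (2, 3)), rows)
cost_mixed = parsimony_cost(((0, 2), (1, 3)), rows)
```

Comparing `cost_grouped` with `cost_mixed` shows how the same alignment favours one topology over another under the parsimony criterion.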

Maximum Likelihood: In the maximum likelihood method, a commonly used statistical technique (see, for example, [25]) is used to find the tree most likely to have generated the current data. This requires an underlying model of sequence evolution specifying both residue frequencies and rates of evolution, such as the Jukes-Cantor [44] or Kimura two-parameter [46] models in the case of DNA data or the PAM matrices in the case of protein data. The likelihood of seeing a transition at a given sequence position across a single branch of length d is the probability that a transition from residue i to residue j occurred during time d according to the model of sequence evolution chosen. If the data at each position are independent, then the likelihood of the branch over all characters is the product of the likelihoods for each position.

The likelihood for the entire tree can be computed by observing that the likelihood of seeing a residue at a given internal node is the product of the likelihoods that each daughter tree of that node gave rise to this nucleotide. As in parsimony, all possible tree topologies for k sequences must be considered.
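The recursive computation outlined above can be sketched for a single alignment column under the Jukes-Cantor model. This is an illustrative sketch, not a full method: it assumes equal base frequencies of 1/4, branch lengths measured in expected substitutions per site, and a fixed rooted tree encoded as nested (child, branch_length) pairs. The names `jc_prob`, `conditional` and `site_likelihood` are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, d):
    """Jukes-Cantor probability that base i becomes base j along a branch of length d."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional(node):
    """Return {base: P(leaves below node | node carries that base)}.

    A leaf is a single-character string; an internal node is a tuple of
    (child, branch_length) pairs. For each base at the node, the result is
    the product over the daughter subtrees of the probability that the
    subtree arose from that base.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        below = conditional(child)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, length) * below[c] for c in BASES)
    return result

def site_likelihood(root):
    """Likelihood of one column, averaging over the unknown base at the root."""
    cond = conditional(root)
    return sum(0.25 * cond[b] for b in BASES)

# One column with leaves A, A, G on the tree ((A:0.1, A:0.1):0.2, G:0.3).
example = site_likelihood((((("A", 0.1), ("A", 0.1)), 0.2), ("G", 0.3)))
```

Under the independence assumption stated above, the likelihood of a whole alignment for this tree is the product of `site_likelihood` over its columns, usually accumulated as a sum of logarithms.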

3.3. Structure Prediction. Since the structure of biopolymers reveals much about their function, there has been great interest in determining the structures of proteins and RNA molecules. Protein structures can be determined experimentally using X-ray crystallography and nuclear magnetic resonance (NMR) techniques. However, these methods are difficult and time consuming, often requiring three to five years to determine a single structure experimentally. X-ray crystallography requires that a crystalline form of the protein be obtained. Reliable methods for protein crystallization do not exist. Furthermore, proteins in vivo are not in crystal form, so that protein structures determined through X-ray crystallography may not be the same as those which occur in the body. NMR does not require that proteins be crystalline but so far NMR has only been effective in determining the structure of small proteins.

Because of the difficulty in determining protein structures experimentally, there has been great interest in computational methods to predict a protein’s structure from its sequence. Since proteins need to maintain their structure in order to function properly, regions of structural importance are usually highly conserved under selective pressure. Multiple sequence alignment can be used to identify these regions and, hence, underlies many structure prediction methods. Below we briefly review the use of multiple sequence alignment in determining structure. A detailed survey of protein structure prediction methods can be found in [26].

To facilitate reasoning about proteins, protein structure has been described hierarchically. The sequence itself is referred to as the primary structure. Secondary structure refers to α-helices and β-sheets, regular structures that are encoded by short subsequences and combine in myriad ways to form more complex structures. Regions encoding secondary structures are constrained by selection and therefore exhibit a high degree of conservation in related proteins. Variable length regions connecting α-helices and β-sheets, referred to as random coils, are much less constrained. Combinations of several secondary structures that are found in many unrelated proteins are referred to as motifs (see [13] for an example). The complete structure of the folded protein is called the tertiary structure.

Prediction methods for tertiary structure range from the highly detailed energy minimization approaches, in which the physical interactions of all amino acids are modeled, to abstract approaches, such as lattice-based models and models where only the hydrophobic/hydrophilic properties of each amino acid are considered. In cases where the structures of several members of a protein family have been determined experimentally, this information can also be used to predict how other proteins in the family will fold. Comparative modeling methods use this approach by constructing a multiple sequence alignment and then aligning the new protein to this MSA. From that alignment, an initial estimate of the structure of the new protein is obtained from the structure of the known protein. This estimate is refined by perturbing the structure to accommodate the physical properties of the substituted amino acid in the new protein (see [33] for an example).

Secondary structure prediction methods have generally been learning-based or statistical approaches. Using a data base of known structures, these methods associate similar structures with patterns found in the primary sequence. In recent years, these methods have been substantially enhanced by the use of multiple sequence alignments. Multiple sequence alignments provide a statistical characterization of the patterns that encode the structure. An MSA is more informative than statistics of sequences considered individually since the information is grouped by position. Researchers have reported improvements in the accuracy of secondary structure prediction methods from roughly 60% without the use of MSA’s to 70% with the use of MSA’s [56, 64, 65]. Statistical approaches to multiple sequence alignment have also been used to characterize motifs such as the EF-hand [48] and the Helix-Loop-Helix [50, 51] motifs. However, Krogh et al. [48] point out that while these methods find patterns associated with these motifs in the primary sequence, it has not been verified whether all of the patterns found actually encode functional motifs.

Multiple sequence alignments are also used in RNA structure prediction. Unlike DNA, in which two strands attract to form the well-known, regular helical structure, the single strand of an RNA molecule may be attracted to itself. Bonds form between different parts of the RNA strand causing the molecule to fold up into a three-dimensional structure. As with proteins, the tertiary structure of RNA is composed of a set of regular secondary structures. In RNA, these secondary structures include helical regions resulting from Watson-Crick base pairing (that is, the same bonding that unites the double helix of DNA molecules) and a variety of loops and bulges. A detailed survey of RNA structure is given in [38].

In order to maintain the structure of the RNA molecule, distant residues that bind to form a structural interaction are constrained to mutate together.

Thus covariance in a multiple sequence alignment of related RNA molecules is a major source of predictive information. This approach to RNA structure prediction is often called comparative analysis. These methods search for pairs of residues in the MSA that are linked: a change in one column is (almost) always accompanied by a complementary change in the other.
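The search for linked columns can be illustrated with a deliberately strict sketch: it reports pairs of columns in which every sequence shows a Watson-Crick pair and at least two different pairs occur, so that a change in one column is matched by a complementary change in the other. Real comparative analysis tolerates exceptions and non-canonical pairs and typically uses statistical measures; the strict rule, the parameter `min_pair_types` and the function name are assumptions of this sketch.

```python
from itertools import combinations

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def covarying_columns(rows, min_pair_types=2):
    """Return column pairs (i, j), i < j, that look like compensating changes.

    A pair of columns qualifies if the symbols found in the two columns
    form a Watson-Crick pair in every row and at least `min_pair_types`
    distinct pairs are observed (a fully conserved pair carries no
    evidence of covariation).
    """
    columns = list(zip(*rows))
    hits = []
    for i, j in combinations(range(len(columns)), 2):
        observed = set(zip(columns[i], columns[j]))
        if observed <= WATSON_CRICK and len(observed) >= min_pair_types:
            hits.append((i, j))
    return hits

rows = ["GAAAC",
        "CAAAG",
        "AAAAU",
        "UAAAA"]
linked = covarying_columns(rows)
```

In this toy alignment only the first and last columns change together in a complementary way, which is the signature of a base-paired position.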

This approach is also used to identify canonical structural building blocks. Some examples of RNA structure prediction algorithms that exploit evidence of correlation in multiple sequence alignments include [17, 39, 40].

3.4. Function. As discussed above, residues that contribute to the structure of a protein are constrained by selective pressure since a protein must hold its shape in order to function properly.

Residues that are involved the biochemical

function of the protein are also highly conserved. Casari et al. [16] argue that the evolutionary constraints on functional residues are even greater than on residues of structural importance and cite the Ser-His-Asp triplet in protein kinases as an example of functional conservation. This triplet has been shown to have a crucial role in catalysis.

Such functional residues may be close in the 3-D structure but

distant in the sequence. In a protein superfamily, residues that contribute to the specificity of function of the protein tend to be conserved within each subfamily and to vary from one subfamily to another.

Multiple sequence alignments can

be used to identify residues of functional importance and to discriminate between

A NEW LOOK AT TREE MODELS FOR MULTIPLE SEQUENCE ALIGNMENT

77

subfamilies. For example, Casari et al. [16] developed an interactive program that takes an MSA as input and, using principal component analysis, identifies the most highly conserved residues that characterize the family as a whole and as well as those residues that distinguish one subfamily from another. 4. Evaluating Tree Alignment as a Model for MSA There is an unstated assumption that tree alignment is the gold standard, the ideal cost function, for multiple sequence alignment because the residues in each column are thought to be related by an evolutionary tree. The implication is that tree alignment is not used only because it is intractable. In addition, many MSA programs assume that a tree is the underlying model. Although they do not use a tree to score the aligment, a tree is used to construct the alignment (e.g., by guiding the order in which pairwise alignments are merged.) In this section, we present some examples that demonstrate that a tree is not always the appropriate model for multiple sequence alignment. In considering whether tree alignment is appropriate for a given set of sequences, two issues must be addressed: • Is a tree the correct model for describing the relationship between residues in each column? A tree may not be a suitable model because the relationship between residues is functional or structural rather than historical.

Even if the re¬

lationship is historical, for some data sets, no single tree will describe all columns in the alignment. • What is the correct mutational model for scoring the branches of the tree? Tree alignment has historically been based exclusively upon the parsi¬ mony criterion. Data that does not happen to be parsimonious can favor the wrong tree model. In addition, column-oriented optimization approaches to MSA usually assume that sequence positions are independent and identi¬ cally distributed. These assumptions do not, in general, hold for biological sequence data.

The residues in a column do not share a common ancestor: When alignment is used to study function or structure, residues in a column do not always share a common ancestor. The goal is to align residues that share the same role. Although functional or structural residues usually share an evolutionary history, sometimes functional or structural roles can migrate to neighboring residues.

A possible example of shifting function occurs in the dihydrofolate reductase (DHFR) gene, an important chemotherapeutic target in treating cancer and various infectious diseases. In their studies on protozoan parasites, Roos and colleagues have sought to design drugs that inhibit a metabolic protein in the parasite without affecting the infected host [63, 22, 23]. This requires identifying regions of structural or functional importance that differ substantially between protozoan and human versions of the DHFR protein.

Early sequence alignments placed the malaria parasite DHFR residue Phe223 downstream of structurally conserved regions of the protein (within the linker region which joins DHFR to thymidylate synthase, forming a bifunctional protein in protozoa and plants) [42, 79]. This result was puzzling because mutational studies had suggested that Phe223 plays an important role in drug sensitivity. Realignment using sequences from the related parasite Toxoplasma gondii indicated that Phe223 is more likely homologous to a portion of the β-sheet which comprises the enzyme backbone [63]. The T. gondii sequence thus provided additional information that suggested an alternative alignment. Other protozoan sequences in the alignment have substantial, and different, nucleotide biases¹. The T. gondii gene, which has a relatively equal nucleotide distribution, links the other protozoan sequences, facilitating alignment. This is another example of the observation, also made by McClure et al. [55], that MSAs are very sensitive to sequence choice.

Phe223 is thought to play an indirect role in enzyme activity, interacting with His34, within the active site [60]. The residue that plays this stabilizing role may have changed over time: a residue in a different position may provide this stabilizing effect in certain taxa (e.g., kinetoplastid parasites such as Leishmania and Trypanosoma) [62]. In this scenario, a random mutation allows a previously inactive residue to take on the functional role played by Phe223. In Roos’ alignment, the residues currently thought to provide this stabilizing effect appear in the same column, but the residues in this column may not all share a common ancestor.

The tree is not unique: Another case where a tree is not an appropriate model occurs when the residues in any particular column share a common ancestor but the columns themselves have different evolutionary histories. A tree may describe any given column, but the columns taken together cannot be modeled by a single tree.

One situation where this occurs is exon shuffling. Most vertebrate genes consist of coding regions (exons) separated by DNA segments that are not translated into proteins (introns). The discovery of this intron/exon structure led to the theory of exon shuffling, first proposed by Gilbert in 1978 [31]. This theory posits that exons represent functionally and/or structurally important subunits of proteins, that introns occur at the boundaries of these modules and that proteins share and reuse the modules that exons encode.

The first evidence that the same exons appear in more than one gene was found in the human low-density lipoprotein (LDL) receptor gene [75, 74]. The LDL receptor gene was shown to share exons with genes for epidermal growth factor, blood clotting factor IX and complement factor C9. Since then, many such “mosaic” genes have been discovered (see [24] for a survey). In aligning mosaic genes, a different tree may be needed for each exon. If the exon boundaries are known, each exon could be aligned separately. However, for many sequences, the splice sites have not been determined. Detecting such boundaries would require that the alignment already be known.

Residues with different evolutionary histories within a single gene can also occur due to horizontal gene transfer, the transfer of genetic material between species. For example, sequences similar to the Fn3 module in fibronectin have been found in bacterial proteins [12]. Fibronectin is a protein found in animals. Since Fn3 is found in both bacterial and animal proteins, one would expect to find Fn3 modules throughout the tree of life. However, Fn3 sequences have not been found in simpler eukaryotes, plants or fungi [12], suggesting a direct transfer of genetic material between bacteria and animals. Thus, in an alignment of genes containing the Fn3 sequence, one would not expect the residues in the Fn3 module to share an evolutionary history with other residues in the alignment. More than one tree is needed to model the sequence and, as in the case of exon shuffling, it may not be possible to know where the module boundaries occur. Other examples of mixing of genetic material possibly requiring more than one tree include transposition and gene conversion.

¹ The statistical profile of the primary sequence of genes can vary substantially, resulting in variations in, for example, the percentage of GC nucleotides. Such differences tend to obscure similarities between related sequences.

The tree is not parsimonious. Sequence data are not generally parsimonious, especially between distantly related sequences. Multiple substitution (e.g., A → C → T), coincidental substitution (e.g., A → C vs. A → G), parallel substitution (e.g., A → C vs. A → C) and back substitution (e.g., A → C → A) can all obscure the evolutionary history of a sequence.

Convergent evolution results in a situation where the parsimony criterion will lead to the wrong tree model. Residues that are similar under selective pressure because they perform the same function can appear to be closely related. An example of this occurs in cows and colobine monkeys, species that independently evolved foregut fermentation [72]. In order to maximize the nutritional benefits of foregut fermentation, the enzyme lysozyme had to evolve in these species to function in the acidic, pepsin-rich environment of the stomach. In other species, lysozyme is only needed in the intestines. As a result, lysozymes in cows and colobine monkeys exhibit amino acid substitutions not found in other species, suggesting, wrongly, that the two species are closely related. This would result in the wrong tree for those residues.
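The effect of convergence on the parsimony criterion can be illustrated with a small sketch of Fitch's small-parsimony count for a single column on a fixed rooted binary tree. The taxon names, the two topologies and the residues below are hypothetical and chosen only to mimic a convergent substitution; they are not taken from the lysozyme data in [72].

```python
def fitch_score(tree, leaf_states):
    """Minimum number of substitutions for one alignment column.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf names;
    `leaf_states` maps each leaf name to the residue observed in the column.
    """
    def visit(node):
        if isinstance(node, str):                  # leaf
            return {leaf_states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        shared = left_set & right_set
        if shared:                                 # children agree: no new change
            return shared, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical column: the two foregut fermenters share residue K by
# convergence, while their respective close relatives keep residue E.
column_states = {
    "cow": "K",
    "cow_relative": "E",
    "colobine": "K",
    "colobine_relative": "E",
}

# Assumed true history: each fermenter groups with its own relative.
true_tree = (("cow", "cow_relative"), ("colobine", "colobine_relative"))
# Alternative topology that groups the two convergent lineages together.
convergent_tree = (("cow", "colobine"), ("cow_relative", "colobine_relative"))

print("true tree:      ", fitch_score(true_tree, column_states))
print("convergent tree:", fitch_score(convergent_tree, column_states))
```

For this column the topology that groups the two convergent lineages needs fewer substitutions than the assumed true history, so a method that minimizes the parsimony score would prefer the wrong tree for these residues.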

Sequence level convergence has also been observed in viral strains grown in the laboratory [14]. Phage φX strains were adapted to high temperature conditions on different bacterial hosts. Several separate viral populations were adapted for growth on each host. Phylogenetic analysis of these populations showed convergent nucleotide substitutions at specific sites in the viral lineages in response to environmental factors (host and temperature). Likelihood ratio tests showed a significantly better fit of the sequence data to trees grouped by host than to the true (known) phylogeny of the lineages.

Tetraloops in rRNA provide another example of convergent evolution [38, 88]. Tetraloops are strings of six bases that form loops at the ends of helices in rRNA structures. The two end bases bond, allowing the internal four bases to form a loop. Although there are 256 possible inner loop sequences, only a small number actually occur in nature. Tetraloop sequences will tend to appear to be closely related, even when they are not.
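The multiple and back substitutions listed at the start of this discussion mean that the observed fraction of differing sites undercounts the number of changes that actually occurred. As a minimal sketch, the standard Jukes-Cantor correction for nucleotide sequences, d = -(3/4) ln(1 - 4p/3), shows the size of the gap under the simplifying assumption that all substitutions are equally likely. The function names and the example sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed fraction of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned fragments differing at 3 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACCTACGAAT"

p = p_distance(seq_a, seq_b)
d = jukes_cantor_distance(p)
print(f"observed differences per site: {p:.3f}")
print(f"estimated substitutions per site: {d:.3f}")
```

The corrected estimate always exceeds the observed fraction for p > 0, and the gap widens as sequences diverge, which is one reason distantly related sequences are the least parsimonious.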

Sequence data is not i.i.d. Structural constraints prevent sequence data from being independent. Structural integrity depends on interactions between nonadjacent residues in the sequence. For example, α-helices are characterized by a heptad repeat, so that there are chemical interactions between every seventh residue in helical regions. Similarly, in order to maintain the structure of the RNA molecule, distant residues bind to form a structural interaction. Compensatory mutations between distant residues that form structural bonds are selectively favored.

Sequence data are not identically distributed either. Structural constraints on protein sequences result in variations in selective pressure at different positions, depending on whether they are located in α-helices, β-sheets or random coils and whether they have a role in tertiary structure or biochemical function. This fact has been recognized and exploited by researchers who developed structure-specific substitution matrices for recognizing specific secondary structures and motifs [54, 56, 58].
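Position-dependent selective pressure shows up in an alignment as column-to-column differences in variability. A minimal sketch, assuming Shannon entropy as the variability measure and a hypothetical toy protein alignment:

```python
import math
from collections import Counter


def column_entropy(col):
    """Shannon entropy (in bits) of the residues observed in one column."""
    n = len(col)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(col).values())


# Hypothetical toy alignment: some positions are invariant, others vary freely.
toy_msa = [
    "MKVLA",
    "MRVIA",
    "MKVLG",
    "MQVFA",
]

# zip(*toy_msa) iterates over the alignment column by column.
entropies = [column_entropy(col) for col in zip(*toy_msa)]
for position, entropy in enumerate(entropies):
    print(f"column {position}: {entropy:.2f} bits")
```

If the positions were identically distributed, these per-column values would differ only by sampling noise; in real alignments the spread between conserved and variable columns is typically far larger than that.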


Additional variations in selective pressure occur at the DNA level. In protein coding regions, substitutions can be nonsynonymous (resulting in an amino acid substitution in the protein coded for) or synonymous (resulting in a different codon for the same amino acid). Due to differences in selective pressure, synonymous changes are seen more frequently than nonsynonymous changes. Originally, it was thought that sequence positions could be classified as replacement sites, synonymous sites and noncoding sites and that mutation rates within each class would be relatively constant. More recently, evidence has emerged that suggests that selective pressure can vary within each class, even within a single gene or intron [47, 68].

5. Conclusion

Evolutionary trees have been used as an abstract model for multiple sequence alignment in designing algorithms and optimization criteria and in building software tools. Although frequently viewed as the fundamental model of MSA, the appropriateness of the tree model depends both on the application and the data set. After surveying the major biological applications, we illustrated problems with the tree model using examples from the biological literature. If the alignment is constructed to reveal functional or structural similarities, a tree may not correctly describe the relationship between the residues. Lateral transfers of DNA fragments between genes may result in situations where no single tree can model the entire gene. Nonparsimonious data and convergent evolution can lead to the wrong tree model.

This paper is intended to give computer scientists a better understanding of the biological uses of multiple sequence alignment and how real data sets differ from the abstract assumptions made about sequence data. We hope that these ideas will provide a basis for dialogue between biologists and computer scientists and mathematicians, leading to better algorithm design and software development for multiple sequence alignment.

6. Acknowledgements

The author wishes to thank Sergei Agulnik, Maja Bucan, Martin Farach-Colton, Dan Gusfield, John Kececioglu, Laura Landweber, Jason Miller, Eugene Myers, Gary Olsen, Chris Overton, R. Ravi, David Roos, Ilya Ruvinsky, Chris Sander, Mona Singh, Temple Smith and E. Trifonov for helpful discussions and Harold Stone for invaluable editorial advice.

References



1. A. Aevarsson, Structure-based sequence alignment of elongation factors TU and G with related GTPases involved in translation, Journal of Molecular Evolution 41 (1995), 1096-1104.
2. R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup, On the Approximability of Numerical Taxonomy (fitting distances by tree metrics), Proceedings of the Symposium on Discrete Algorithms, 1996.
3. S. F. Altschul, T. L. Madden, A. A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped blast and psi-blast: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997), 3389-3402.
4. S.F. Altschul, Amino acid substitution matrices from an information theoretic perspective, Journal of Molecular Biology 219 (1991), 555-565.
5. S.F. Altschul, R.J. Carroll, and D.J. Lipman, Weights for data related by a tree, Journal of Molecular Biology 207 (1989), 647-653.


6. S.F. Altschul and D.J. Lipman, Trees, Stars and Multiple Sequence Alignment, Journal of Applied Mathematics 49 (1989), no. 1, 197-209.
7. P. Argos, A sensitive procedure to compare amino acid sequences, Journal of Molecular Biology 193 (1987), 385-396.
8. D. J. Bacon and W. F. Anderson, Multiple sequence alignment, Journal of Molecular Biology 191 (1986), 153-161.
9. V. Bafna, E. L. Lawler, and P. Pevzner, Approximation Algorithms for Multiple Sequence Alignment, 5th Ann. Symp. On Pattern Combinatorial Matching, vol. 807, 1994, pp. 43-53.
10. A. Bairoch and P. Bucher, PROSITE: recent developments, Nucleic Acids Res. 22 (1994), no. 5, 3583-3589.
11. G.J. Barton and M.J.E. Sternberg, Evaluation and improvements in the automatic alignment of protein sequences, Protein Engineering 1 (1987), 89-94.
12. P. Bork and R. F. Doolittle, Proposed acquisition of an animal protein domain by bacteria, Proc.Natl.Acad.Sci. USA 89 (1992), 8990-8994.
13. C.-I. Branden, The TIM barrel - the most frequently occurring folding motif in proteins, Current Opinion in Structural Biology 1 (1991), 978-983.
14. J. J. Bull, M. R. Badgett, H. A. Wichman, J. P. Huelsenbeck, D. M. Hillis, A. Gulati, C. Ho, and I. J. Molineux, Exceptional convergent evolution in a virus, Genetics 147 (1997), 1497-1507.
15. H. Carillo and D.J. Lipman, The Multiple Sequence Alignment problem in biology, Journal of Applied Mathematics 48 (1988), 1073-1082.
16. G. Casari, C. Sander, and A. Valencia, A method to predict functional residues in proteins, Structural Biology 2 (1995), no. 2, 171-178.
17. L. Chan, M. Zuker, and A. B. Jacobson, A computer method for finding common base paired helices in aligned sequences: application to the analysis of random sequences, Nucleic Acids Res. 19 (1991), no. 2, 353-358.
18. S.C. Chan, A.K.C. Wong, and D.K.Y. Chiu, A Survey of Multiple Sequence Comparison Methods, Bulletin of Mathematical Biology 54 (1992), 563-598.
19. Q. K. Chen, G. Z. Hertz, and G. D. Stormo, MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices, Computer Applications in the Applied Sciences 11 (1995), 563-66.
20. W. H. E. Day, Computational complexity of inferring phylogenies from dissimilarity matrices, Bulletin of Mathematical Biology 49 (1987), no. 4, 461-467.
21. M. O. Dayhoff, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC, 1978.
22. R. G. Donald and D. S. Roos, Stable molecular transformation of toxoplasma gondii: a selectable dihydrofolate reductase-thymidylate synthase marker based on drug-resistance mutations in malaria, Proc.Natl.Acad.Sci. USA 90 (1993), 11703-11707.
23. R. G. K. Donald and D. S. Roos, Homologous recombination and gene replacement at the dihydrofolate reductase-thymidylate synthase locus in toxoplasma gondii, Molec. Biochem. Parasitol. 63 (1994), 243-253.
24. R. L. Dorit and W. Gilbert, The limited universe of exons, Curr Opin Genet Dev 1 (1991), 464-469.
25. David Durand, Stable Chaos, an introduction to statistical control, General Learning Press, Morristown, NJ, 1971.
26. F. Eisenhaber, B. Persson, and P. Argos, Protein structure prediction: Recognition of primary, secondary, and tertiary structural features from amino acid sequence, Critical Rev in Biochem and Mol Bio 30 (1995), no. 1, 1-94.
27. D.F. Feng, M.S. Johnson, and R.F. Doolittle, Aligning amino acid sequences: comparison of commonly used methods, Journal of Molecular Evolution 21 (1985), 112-125.
28. J. W. Fickett, Quantitative discrimination of MEF2 sites, Molecular and Cellular Biology 16 (1996), no. 1, 437-441.
29. W.M. Fitch and T.F. Smith, Optimal sequence alignments, PNAS 80 (1983), 1382-1386.
30. M. S. Gelfand, Prediction of function in DNA sequence analysis, Journal of Computational Biology 2 (1995), no. 1, 87-115.
31. W. Gilbert, Why genes in pieces?, Nature 271 (1978), 501.
32. A. Godzik, The structural alignment between two proteins: Is there a unique answer?, Protein Science 5 (1996), 1325-1338.


33. J. Greer, Comparative modeling methods: Application to the family of the mammalian serine proteases, PROTEINS: Structure, Function and Genetics 7 (1990), 317-334.
34. M. Gribskov and J. Devereux, Sequence analysis primer, Stockton Press, New York, NY, 1991.
35. M. Gribskov, R. Luthy, and D. Eisenberg, Profile analysis, Methods in Enzymology, vol. 183, Academic Press, 1990, pp. 146-159.
36. D. Gusfield, Efficient Methods for Multiple Sequence Alignment with Guaranteed Error Bounds, Bulletin of Mathematical Biology 55 (1993), 141-154.
37. D. Gusfield, K. Balasubramanian, and D. Naor, Parametric Optimization of Sequence Alignment, Proceedings of the Symposium on Discrete Algorithms, 1992, pp. 432-439.
38. R.E. Gutell, N. Larsen, and C.R. Woese, Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective, Microbiological Reviews 58 (1994), no. 1, 10-26.
39. R.E. Gutell, A. Power, G.Z. Hertz, E.J. Putz, and G.D. Stormo, Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods, Nucleic Acids Res. 20 (1992), no. 21, 5785-5795.
40. K. Han and H-J Kim, Prediction of common folding structures of homologous RNAs, Nucleic Acids Res. 21 (1993), no. 5, 1251-1257.
41. S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proc.Natl.Acad.Sci. USA 89 (1992), 10915-9.
42. J.E. Hyde, The dihydrofolate reductase-thymidylate synthetase gene in the drug resistance of malaria parasites, Pharmacol Ther 48 (1990), no. 1, 45-59.
43. T. Jiang, E.L. Lawler, and L. Wang, Aligning sequences via an evolutionary tree: Complexity and approximation, Proceedings of the Symposium on the Theoretical Aspects of Computer Science, 1994, pp. 760-769.
44. T. H. Jukes and C. R. Cantor, Evolution of Protein Molecules, Mammalian Protein Metabolism (H. N. Munro, ed.), Academic Press, New York, 1969, pp. 21-132.
45. J. Kececioglu, The maximum weight trace problem in multiple sequence alignment, 4th Ann. Symp. On Pattern Combinatorial Matching, Springer Verlag Lecture notes in Computer Science, vol. 684, 1993, pp. 106-119.
46. M. Kimura, A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences, Journal of Molecular Evolution 16 (1980), 111-120.
47. M. Kreitman and R. R. Hudson, Inferring the evolutionary histories of the Adh and Adh-dup loci in drosophila melanogaster from patterns of polymorphism and divergence, Genetics 127 (1991), 565-582.
48. A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, Hidden markov models in computational biology: Applications to protein modeling, Tech. Report UCSC-CRL-93-32, University of California, 1993.
49. Y. Kubota, S. Takahashi, K. Nishikawa, and T. Ooi, Homology in protein sequences expressed by correlation coefficients, Journal of Theoretical Biology 91 (1981), 347-361.
50. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment, Science 262 (1993), 208-214.
51. C. E. Lawrence, S. F. Altschul, J. C. Wootton, M. S. Boguski, A. F. Neuwald, and J. S. Liu, A Gibbs Sampler for the Detection of Subtle Motifs in Multiple Sequences, Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, 1994, pp. 245-254.
52. A.M. Lesk and C. Chothia, How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins, Journal of Molecular Biology 136 (1980), 225-270.
53. A.M. Lesk, M. Levitt, and C. Chothia, Alignment of the amino acid sequences of distantly related proteins using variable gap penalties, Protein Engineering 1 (1986), 77-78.
54. R. Luthy, A. D. McLachlan, and D. Eisenberg, Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities, Proteins 10 (1991), 229-239.
55. M.A. McClure, T.K. Vasi, and W.M. Fitch, Comparative analysis of multiple protein-sequence alignment methods, Mol. Biol. Evol. 11 (1994), 571-592.


56. P.K. Mehta, J. Heringa, and P. Argos, A fast and simple approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%, Protein Science 4 (1995), 2517-2525.
57. M. Murata, J. S. Richardson, and J. L. Sussman, Simultaneous comparison of three protein sequences, Proc.Natl.Acad.Sci. USA 82 (1985), 3073-3077.
58. J. Overington, D. Donnelly, M. S. Johnson, A. Sali, and T. L. Blundell, Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds, Protein Science 1 (1992), 216-226.
59. P.A. Pevzner, Multiple alignment, communication cost and graph matching, Journal of Applied Mathematics 52 (1992), 1763-1779.
60. M. Reynolds, D. Carter, M. Schumacher, and D. S. Roos, Personal communication.
61. J. L. Risler, M. O. Delorme, H. Delacroix, and A. Henaut, Amino acid substitutions in structurally related proteins. A pattern recognition approach. Determination of a new and efficient scoring matrix, JMB 204 (1988), 1019-1029.
62. D. S. Roos, Personal communication.
63. D.S. Roos, Primary structure of the dihydrofolate reductase-thymidylate synthase gene from toxoplasma gondii, J. Biol. Chem. 268 (1993), 6269-6280.
64. B. Rost and C. Sander, Prediction of protein secondary structure at better than 70% accuracy, JMB 232 (1993), 584-599.
65. A. A. Salamov and V. V. Solovyev, Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments, Journal of Molecular Biology 247 (1995), 11-15.
66. D. Sankoff, Minimal mutation trees of sequences, Journal of Applied Mathematics 28 (1975), 443-453.
67. D. Sankoff and R. J. Cedergren, Simultaneous Comparison of Three or More Sequences Related by a Tree, Timewarps, Edits and Macromolecules: The Theory and Practise of Sequence Comparison, Addison-Wesley, Reading, MA, 1983, pp. 253-258.
68. S. W. Schaeffer and C. F. Aquadro, Nucleotide sequence of the Adh gene region of drosophila pseudoobscura: evolutionary change and evidence for an ancient gene duplication, Genetics 117 (1987), 61-73.
69. R. M. Schwartz and M. O. Dayhoff, Matrices for detecting distant relationships, Atlas of Protein Sequence and Structure (M. Dayhoff, ed.), vol. 5, National Biomedical Research Foundation, Washington, DC, 1979, pp. 353-358.
70. P.R. Sibbald and P. Argos, Weighting aligned protein or nucleic acid sequences to correct for unequal representation, Journal of Molecular Biology 216 (1990), 813-818.
71. E. L. Sonnhammer and D. Kahn, Modular arrangement of proteins as inferred from analysis of homology, Protein Science 3 (1994), 482-492.
72. C.-B. Stewart and A. C. Wilson, Sequence convergence and functional adaptation of stomach lysozymes from foregut fermenters, Cold Spring Harbor Symposium on Quantitative Biology 52 (1987), 891-899.
73. G. D. Stormo, Consensus patterns in DNA, Methods in Enzymology, vol. 183, Academic Press, 1990, pp. 211-221.
74. T. C. Sudhof, J. L. Goldstein, M. S. Brown, and D. W. Russell, The LDL receptor gene: a mosaic of exons shared with different proteins, Science 228 (1985), 815-822.
75. T. C. Sudhof, D. W. Russell, J. L. Goldstein, M. S. Brown, R. Sanchez-Pescador, and G. I. Bell, Cassette of eight exons shared by genes for LDL receptor and EGF precursor, Science 228 (1985), 893-895.
76. E. Sweedyk and T. Warnow, Manuscript, 1992.
77. D. L. Swofford and G. J. Olsen, Phylogeny Reconstruction, Molecular Systematics, Sinauer Associates, Inc., Sunderland, MA, 1990, pp. 411-501.
78. D. L. Swofford, G. J. Olsen, Waddell, and D. M. Hillis, Phylogenetic Inference, Molecular Systematics, Sinauer Associates, Inc., Sunderland, MA, 1996.
79. M. Tanaka, H. M. Gu, D. J. Bzik, W. B. Li, and J. W. Inselburg, Dihydrofolate reductase mutations and chromosomal changes associated with pyrimethamine resistance of plasmodium falciparum, Mol Biochem Parasitol 39 (1990), 127-134.
80. R. L. Tatusov, S.F. Altschul, and E.V. Koonin, Detection of Conserved Segments in Proteins: Iterative Scanning of Sequence Databases with Alignment Blocks, Proceedings of the National Academy of Sciences, USA 91 (1994), 12091-12095.


81. J. D. Thompson, D. G. Higgins, and T. J. Gibson, Improved sensitivity of profile searches through the use of sequence weights and gap excision, CAB 10 (1994), 19-29.
82. A. Valencia, M. Kjeldgaard, E. F. Pai, and C. Sander, GTPase domains of ras p21 oncogene protein and elongation factor Tu: analysis of three-dimensional structures, sequence families, and functional sites, Proc.Natl.Acad.Sci. USA 88 (1991), 5443-5447.
83. M. Vingron and P.R. Sibbald, Weighting in sequence space: a comparison of methods in terms of generalized sequences, Proc.Natl.Acad.Sci. USA 90 (1993), 8777-8781.
84. M. Vingron and M. S. Waterman, Sequence alignment and penalty choice. Review of concepts, case studies and implications, Journal of Molecular Biology 235 (1994), 1-12.
85. G. Vriend and C. Sander, Detection of common three-dimensional substructures in proteins, Proteins 11 (1991), no. 1, 52-58.
86. L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology 1 (1994), no. 4, 337-348.
87. M. S. Waterman, T. F. Smith, M. Singh, and W. A. Beyer, Additive Evolutionary Trees, Journal of Theoretical Biology 64 (1977), 199-213.
88. C. R. Woese, S. Winker, and R. R. Gutell, Architecture of ribosomal RNA: constraints on the sequence of “tetra-loops”, Proc.Natl.Acad.Sci. USA 87 (1990), 8467-8471.

Computational Biology Group, University of Pennsylvania, 3401 Walnut Street, Suite 300C, Philadelphia, PA 19104-6228 and DIMACS, Rutgers University, Piscataway, NJ 08855-1179
Currently at the Department of Molecular Biology, Princeton University, Princeton, NJ 08544
E-mail address: [email protected]

DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 47, 1999

Sequence Alignment in Molecular Biology

Alberto Apostolico and Raffaele Giancarlo

Abstract. Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In this context, tools that effectively identify, store, compare and analyze large and growing numbers of biosequences are of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while searching for similar entries in a database. A considerable body of knowledge has accumulated on sequence alignment during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and literature given at the Workshop on Sequence Alignment held at Princeton, N.J., in November 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology.

1991 Mathematics Subject Classification. Primary 68Q25; Secondary 92C40, 92D20.

This article was adapted from an article appearing in the Journal of Computational Biology, Volume 5, Number 2, 173-196, Mary Ann Liebert, Inc., Publishers, NY, 1998.

First author partially supported by NSF grant CCR-92-01078, by NATO grant CRG 900293, by the National Research Council of Italy, and by the ESPRIT III Basic Research Programme of the EC under contract No. 9072 (Project GEPPCOM). Second author partially supported by MURST Grant “Algoritmi, Strutture di Calcolo e Sistemi Informativi” and by the National Research Council of Italy; part of this work was done while the author was visiting AT&T Bell Labs, Murray Hill, N.J., U.S.A.

(c) 1999 American Mathematical Society

ALBERTO APOSTOLICO AND RAFFAELE GIANCARLO

1. Introduction

Classical taxonomy is based on the assumption that conspicuous morphological and functional similarities in species reveal close common ancestry. Likewise, modern molecular taxonomy pursues phylogeny and classification of living species based on the conformation and structure of their respective genetic codes. It is assumed that the DNA code presides over the reproduction, development and sustenance of living organisms, part of it (RNA) being employed directly in various biological functions, part serving as a template or blueprint for proteins. In proteins, following somewhat the dual of the principle of modern architecture, function follows form. The apparent functional, structural and sequence resemblance of proteins found in organisms of completely different morphology has further stimulated interest in a taxonomy based on sequence homology.

Two biomolecules are said to be homologous if their sequences are likely to be offspring of a common ancestor sequence. However, ancestor sequences are not available, and homology is deduced from the similarity of existing sequences: the more similar two such sequences are, the more likely their homology is assumed to be.

Early studies on methods and tools for the automated analysis and comparison of biosequences were stimulated by the identification, in the 1960s, of the amino acid sequences of a number of proteins. Activity in this area has been growing ever since, and has culminated in the considerable outburst of computational biology [117] in recent years, as more and more sequences accumulate from the genome sequencing of humans and other species.

Despite some key conceptual acquisitions and substantial practical progress, we still do not have consolidated methods for every sequence alignment task. This should not come as a surprise, since the complexity of biosequence alignment is rooted in some of the subtlest and most delicate aspects of scientific inference.

In an attempt to capture genetically significant relationships among sequences, several criteria have been proposed as measures of sequence similarity. While they all need to relate in one way or another to the basic mechanisms of evolution, there is no simple way to test and validate any one of them. At the empirical level, such a validation is made impossible by the uniqueness of the evolutionary experiment. At the epistemological level, the notion of a class of similar objects does not rest on very firm grounds. For instance, a claim known as the “Theorem of the Ugly Duckling” [116] states that as long as all of the predicates characterizing the objects to be classified are given the same importance or “weight”, a swan will be found to be just as similar to a duck as to another swan.

The reason for this apparent paradox is that classification as we experience it on an empirical basis is only possible to the extent that the various predicates characterizing objects are given nonuniform weights. Applied to the study of biomolecules, this means that we shall be able to build better and better alignment algorithms as we learn more and more about precisely those homologies that our algorithms seek to detect.

SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY

DNA molecules appear naturally to be repositories of “information” that is propagated through evolutionary history and used in replication and transcription mechanisms central to the life-cycle of cells and organisms. Quantitative definitions and measures of such information seem to be at the heart of sequence comparison methods, and have been attempted in various ways over the years. On the one hand, this information seems to be relatable to Shannon’s notion of uncertainty and may be measured as such (see, e.g., [33]). On the other, this information is used at the cellular level for the synthesis of structure (for function), hence for departure from chaos. While Shannon’s classical measure applies to the information in transmission, structural information possibly stored in biosequences seems related instead to the notion of redundancy. Measuring structure in finite objects, however, presupposes the accomplishment of the “intuitive program” of “analyzing randomness as far as it is possible within the region of finite sequences” [91], a goal cultivated already by Von Mises [112] and other statisticians at the turn of the century. One of the deepest acquisitions in this domain is linked to Kolmogorov’s innovative approach to the definition of information [67] (see also [77]). In this approach, which seems reminiscent in a seductive way of the very mechanism of molecular evolution, information (alternatively, conditional information) is measured by the length of the recorded sequence of zeroes and ones that constitute a program by which a universal machine produces one string from scratch (alternatively, from another string). The sobering conclusion, however, is that under this scheme there is no such thing as a finite random sequence: while a great many sequences of sufficiently large length tend to behave randomly in the limit, any short sequence exhibits some kind of regularity. It appears thus that we can measure and study randomness (whence also structure) in finite objects (see, in particular, [71]) only to the extent that we can legitimately privilege (i.e., assign a high weight to) certain regularities and neglect others, a principle that seems to be the counterpart, in syntactic pattern recognition, of the statistical paradox of the ugly duckling. In conclusion, the difficulties inherent to sequence alignment lie in the process of forming the necessary educated biases, in the dynamical and interactive process of extracting the feature weights.

Along the way, the practice of sequence alignment shall constantly oscillate between the risk of overlooking important structure and that of discovering any arbitrarily defined kind of structure everywhere.

There are a number of biological motivations and contexts for computer alignment of molecular sequences, which results in a variety of methods and tools. There is global alignment of pairs of sequences that are globally related by common ancestry, local alignment of related sequence segments, and multiple alignment of members of protein families. Some special techniques of self-alignment have also emerged recently as useful measures of the “autocorrelation” inherent to molecular sequences. Finally, and not entirely disjointly from the above, ancillary alignments are performed in database searches, typically in order to detect homology of proteins. The corresponding available computational methods have developed considerably since the pioneering work by M. O. Dayhoff [25] and others.

Since 1982, Nucleic Acids Research has routinely devoted special issues of 600 pages or more to the latest developments in this area. By 1983, enough original material had accumulated on sequence comparison techniques to warrant dedication of a full volume [97]. Part of these developments occurred in parallel with algorithmics on strings, thus contributing some remarkable challenges to algorithmic design. In general, however, the two endeavors conserved rather distinctive individual flavors. On the one hand, the computational biologist would typically be driven by the need to obtain some significant speed-up in the processing of a limited class of sequences on some specific machine or class of (commercial) machines. On the other, the algorist would be concerned with unveiling the combinatorial structure of a rigidly defined problem, in order to achieve a higher asymptotic efficiency in computations performed by some abstract paradigm of physical computers.

Historically, efficient algorithmics on strings and efficient computational methods for biosequence manipulations have found separate places in the literature, and in fact mostly addressed separate audiences. In general, molecular biologists are dismayed to find that formalization and computation almost invariably involve a number of abstractions and simplifications that denature the original problem, at least in part. On the other hand, mathematicians and theoretical computer scientists understandably incline towards an attitude that a problem that does not lend itself to a crisp formulation, or that manifests itself as either trivial or intractable, is no longer a problem.

While some of the mutual isolation of the past is successfully being removed, the challenges that lie ahead of sequence alignment are still numerous and staggering. Some of the focal contemporary issues are addressed in this paper, the basic purpose of which is a somewhat deeper understanding of what biologists expect from sequence alignment, how much of such an expectation has been fulfilled and how one could go about defining future tasks. Along the way, the paper implicitly reviews some past history, and tries to pinpoint the main biological findings and formal results that guided designers of sequence alignment methods in their pursuit. Next, some of the bio-mathematical theory of sequence alignment is addressed, as various notions and models of similarity between sequences are recalled and compared, and the mechanisms that guide the choices of parameters (e.g., weighting functions) are discussed. This leads in turn to the study of the statistical tools that are used to establish the significance of alignments, and how well they model the underlying biological knowledge.

As already mentioned, when models are translated into computational tools, some of their subtleties are inevitably lost. We shall thus try to understand what the trade-offs are between the sophistication of the theoretical models and the difficulties associated with their implementation. This is of particular interest in the case of multiple sequence alignment, perhaps the single most prominent instance of a computationally intensive alignment task where approximations, computational trade-offs and significance are all subtly intertwined.

2. Some Background and a Synopsis

The earliest reference to a problem involving sequence alignment dates back to 1879. It consists of a puzzle due to Lewis Carroll [20], that works as follows: two English words of equal length are given, and one is asked to transform one into the other by going through a series of intermediate (English) words where each word differs from the next in only one letter. Insertion and deletion of letters are not allowed. Thus, for example, “head” can be transformed into “tail” using the intermediate words “heal”, “teal”, “tell” and “tall”. Intuitively, one can say that the two given words are at a “distance” of at most 5, since we found a way to go from one to the other through that number of substitutions. Decades later, in the development of coding theory, this simple notion of distance would be more formally introduced by Hamming [42] and used extensively for the design of good codes for data transmission [32].

A richer notion of distance between sequences was defined by Levenshtein in 1965 [73].

It can be described as follows. Given two strings, transform one into the other by using a sequence of the following operations: insertion of a symbol, deletion of a symbol, and substitution of a symbol with another one. Each such operation carries a cost or “weight”, and one would like to perform the transformation at the cheapest possible overall cost. The most commonly used weights for DNA are those of [31]. For amino acids, many weight systems have been proposed (see for instance [30, 78, 92, 93, 103]). However, for a long time those of [26] have been the standard and, very recently, those of [45] seem to be superior for the alignment of distantly related proteins [46].

A very important aspect of sequence alignment is to establish how “meaningful” a given alignment is. This basic question applies both to global and to local alignments. Informally, it can be stated as follows: is a given alignment of two sequences due to chance? Answers to this question involve the investigation of many statistical aspects of sequence alignment, such as the expected length of an alignment or the expected distribution of scores of alignments. The mathematical tractability of those statistical questions seems to depend on two main factors: (1) the particular type of alignment being considered, and (2) the weights that we assign to the edit operations. Early results in this area are reported in [23, 60, 105]. A closely related issue is a quantification of the “sensitivity” and “selectivity” of sequence alignment methods. In loose terms, sensitivity is the ability to detect distant evolutionary relationships between two strings, while selectivity is the ability to avoid the assignment of high scores to unrelated strings. In order to evaluate an alignment method with respect to those two parameters, one needs criteria to establish whether a meaningful alignment has been found or missed.

Obviously, the time required to compute an alignment is also very important.

As with the statistical significance of an alignment, the efficiency seems to depend on the particular alignment method and on the properties satisfied by the weights. Over the years, as the strings to be aligned got longer, the objective of alignment shifted from that of the basic Needleman and Wunsch method [84] (which computes an optimal global alignment) to those of methods that trade the optimality of the computed alignment for speed. Examples are the algorithms reported in [3, 75, 124, 125]. For the interested reader, an overview of some algorithmic issues related to such a speed-optimality trade-off, as well as to the impact of weights on the computational speed, can be found in [34].

As mentioned, alignment methods are also routinely used in database searches. That is, we are given a database of proteins and we want to establish whether a query sequence is biologically related to strings in the database. Usually, that involves the computation of a local similarity between the query sequence and each sequence in the database. The strings having a similarity score exceeding some threshold are then reported and further screened to assess the biological meaningfulness of the similarity criterion used. This stage is typically based on statistical knowledge about score distributions or empirical knowledge about biological function.

As knowledge about biological sequences grows, more complex questions are found to revolve around some notion of similarity among sequences. One such type of question concerns the problem of finding approximate repeated patterns within each string. Those approximate repeated patterns seem to be associated with several important properties of DNA and proteins. For instance, parts of chromosomes are made of repeated patterns.

A fixed number of those repeated patterns is lost when the cell divides. So, it would appear that those repeated patterns represent some kind of clock “ticking” the number of divisions that the cell can still undergo. The first algorithmic studies in this area are due to Miller [79], Kannan and Myers [37] and Landau and Schmidt [69]. We will see an example of this kind of algorithm when we discuss the problem of locating tandem repeats in a string [99].

One more important and very active area of sequence alignment is the one that deals with the alignment of multiple sequences. There are many ways in which this problem can be stated, but at this point we will only highlight a few of them. For instance, one obvious multiple sequence alignment problem is the one that asks for the longest common subsequence of a set of strings (see, e.g., [49]). Perhaps more significant from a biological perspective is the following problem, originally stated by Sankoff and now referred to as multiple alignment under an evolutionary tree [95]. We are given a set of strings (say, proteins) labeling the leaves of a given tree. We want to find ancestral strings to label the internal nodes of the tree such that a given score function is minimized. Intuitively, one is trying to establish whether a set of strings have common ancestors and what those ancestors look like. Sankoff also provided algorithms for this problem [95].

In multiple sequence alignment problems one faces the same difficulties and issues already mentioned for the alignment of two sequences, i.e., choice of weights, statistical significance, etc., except that here computations become overwhelmingly demanding. In fact, many multiple sequence alignment problems are NP-hard (see for instance [115]), or too time consuming (see [52]). Also in the case of multiple sequence alignment, there is a trend to develop fast algorithms that trade optimality of the required solution for speed (see for instance [13]). However, in some cases, even the computation of approximate solutions is hard (see for instance [115]).

The notion of distance between strings that uses symbol-based edit operations models evolution in terms of very local changes to the genomes. Although such a notion of distance is appropriate to describe evolution of small portions of a genome (for instance, a segment of a sequence describing a chromosome), it shows limitations in quantifying the evolutionary distance between two entire genomes or two entire chromosomes within two genomes. Indeed, evolution at the genomic level seems to involve also non-local, large scale operations, which can rearrange a whole segment of a chromosome in one evolutionary event.

Such non-local changes are hard to detect using a distance based on local edit operations (see, for instance, discussions in [38, 61, 85, 96, 98, 123]), and require taking into account more global structure and information. One approach is as follows. Assume that we can write both genomes we want to compare as two strings of genes. Now, molecular biology suggests that if the two genomes are related, one should be obtainable from the other by a suitable rearrangement of its genes. This approach (gene rearrangement) was pioneered by Palmer and Herbon [85], even though the earliest application of gene rearrangement to genome comparison seems to date back to 1938 [27]. From the combinatorial point of view, gene rearrangement requires the definition of new distances between the DNA sequences that encode genomes. Moreover, it gives rise to a set of very interesting and challenging combinatorial problems. In this context, the first formulation of a new distance and of an algorithm computing it seems to be due to [123]. Very recently, there has been quite a bit of work, initiated by Kececioglu and Sankoff [66], on the formal investigation of the computational complexity of various distances between genomes. In the last part of this paper, we will briefly review the state of the art in this domain.

3. Pairwise Alignments

In this section, we discuss two basic families of methods for detecting similarities between two bio-molecular sequences, regarded as two strings x[1 : n] and y[1 : m] over some suitable alphabet of symbols. (For instance, proteins are strings over the alphabet of amino acids.) The families we are about to discuss are reminiscent, respectively, of the Hamming and Levenshtein distances introduced earlier, so that the first class of methods is in fact a restricted version of the second. In the practice of computational molecular biology, global resemblance between two biosequences is significant in a variety of cases, but most often the interest of the biologist concentrates on unveiling segments bearing high mutual resemblance within otherwise feebly related sequences. In particular, Hamming-like methods are applied almost exclusively as a filter to detect local similarities.


MSP. It computes an extremely simple notion of alignment in which no insertions and deletions of symbols are allowed. Consider two substrings x[i : i + k] and y[j : j + k] of x and y, respectively. We can align those two substrings by pairing x[i] with y[j], x[i + 1] with y[j + 1], and so on. We can define a score for that alignment. Indeed, let S(a, b) be the weight of “pairing” a with b, where a and b are letters over the alphabet Σ of amino acids. S can be thought of as a matrix, which we refer to as the substitution matrix. The score of aligning two substrings (of equal length) of x and y is given by the sum of the weights of the letters paired together. The underlying alignment is locally optimal if and only if its score cannot be improved either by extending both substrings or by shortening both of them. A maximal segment pair (MSP) is a best locally optimal alignment and it is taken as the (local) alignment of the two strings. In some cases, one may be interested in all locally optimal alignments. This simple method is the basis of the BLAST program, one of the fastest programs for local sequence alignment [3]. Other methods of this kind have also been proposed (see, for instance, [11, 12, 62, 102]).

Beginning in the late sixties, measures of similarity based on Levenshtein distances were independently proposed in such diverse fields as speech processing, text editing and, of course, molecular biology. Early solutions were also supplied at that time, based mostly on dynamic programming. In molecular biology, distances and related computations were proposed by Ulam, Needleman and Wunsch, Sellers, Sankoff and others [35, 110, 109, 100, 84, 94].
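To make the MSP definition given earlier in this section concrete, the following Python sketch may help. It scans every diagonal of the comparison grid and keeps the best-scoring run of paired symbols. It is a plain quadratic-time illustration, not the heuristic used by BLAST; the function name `maximal_segment_pair`, the `score` callback (standing in for the substitution matrix S) and the toy match/mismatch weights are all assumptions made for the example.

```python
def maximal_segment_pair(x, y, score):
    """Return (best_score, i, j, length) for a highest-scoring ungapped segment pair.

    x, y  : strings to compare
    score : function (a, b) -> number, a hypothetical stand-in for S(a, b)
    i, j  : 0-based start positions of the segment in x and in y
    """
    best = (0, 0, 0, 0)
    n, m = len(x), len(y)
    # Every ungapped alignment lies on one diagonal d = j - i of the n x m grid.
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run = 0        # score of the best segment ending at the current cell
        start = i      # start (in x) of that segment
        while i < n and j < m:
            if run <= 0:               # a non-positive prefix never helps: restart
                run, start = 0, i
            run += score(x[i], y[j])
            if run > best[0]:
                best = (run, start, start + d, i - start + 1)
            i += 1
            j += 1
    return best


# Toy weights (an assumption): +1 for a match, -1 for a mismatch.
match_mismatch = lambda a, b: 1 if a == b else -1
print(maximal_segment_pair("XXHEADXX", "YHEADY", match_mismatch))  # (4, 2, 1, 4)
```

With real data one would replace `match_mismatch` by a lookup in a substitution matrix of the kind discussed in Section 4.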

In computer science, the problem was dubbed “the string to string correction problem” by Wagner and Fischer [113], who gave it the following formalization. Given two strings x and y, the problem is to transform x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. Let $D(a)$ denote the cost of deleting from x an occurrence of symbol a, $I(a)$ the cost of inserting some symbol a between two consecutive positions of x, and $S(a,b)$ the cost of substituting some occurrence of a in x with an occurrence of b. The corresponding computation by dynamic programming is then readily stated. For this, let $C(i,j)$ ($0 \le i \le |x|$, $0 \le j \le |y|$) be the minimum cost of transforming the prefix of x of length i into the prefix of y of length j. Let $s_k$ denote the kth symbol of string s. Then $C(0,0) = 0$, $C(i,0) = C(i-1,0) + D(x_i)$ ($i = 1, 2, \ldots, |x|$), $C(0,j) = C(0,j-1) + I(y_j)$ ($j = 1, 2, \ldots, |y|$), and

$$C(i,j) = \min\{\, C(i-1,j-1) + S(x_i,y_j),\; C(i-1,j) + D(x_i),\; C(i,j-1) + I(y_j) \,\} \tag{1}$$

for all i, j ($1 \le i \le |x|$; $1 \le j \le |y|$).

Observe that, of all entries of the C-matrix, only the three entries $C(i-1,j-1)$, $C(i-1,j)$, and $C(i,j-1)$ are involved in the computation of the final value of $C(i,j)$. Hence $C(i,j)$ can be evaluated row-by-row or column-by-column in $\Theta(|x||y|)$ time, i.e., $O(mn)$ for strings of lengths m and n. An optimal edit script can be retrieved at the end by backtracking through the local decisions that were made by the algorithm.

A few special cases of the recurrence above are of interest in various applications. Among them, the longest (or heaviest) common subsequence also appears among the earliest attempts at defining similarity among biosequences [84].

A subsequence of x is any string w that can be obtained from x by deleting zero or more (not necessarily consecutive) symbols. The longest common subsequence (LCS) problem for input strings $x = x_1 x_2 \cdots x_m$ and $y = y_1 y_2 \cdots y_n$ ($m \le n$) consists of finding a third string $w = w_1 w_2 \cdots w_l$ such that w is a subsequence of x and also a subsequence of y, and w is of maximum possible length. The relation of this problem to string editing is understood by observing that the effect of a given substitution can always be achieved, alternatively, through an appropriate sequence consisting of one deletion and one insertion. When the cost of substituting a symbol with a different one is higher than the global cost of one deletion followed by one insertion, then an optimum edit script will always avoid substitutions and produce instead y from x solely by insertions and deletions of overall minimum cost.

Assume now that insertions and deletions have unit costs, and that the substitution of a symbol with a different one has a cost higher than 2. Then, the pairs of matching symbols preserved in an optimal edit script constitute a longest common subsequence of x and y. It is not difficult to see that the cost e of such an optimal edit script, the length l of an LCS and the lengths of the input strings obey the simple relationship $e = n + m - 2l$.

Several LCS algorithms that do not require quadratic time on all inputs have been proposed over the years. Most fall in one of two basic paradigms due to Hirschberg [47] and Hunt and Szymanski [50], respectively, and improve on them in various ways (see, e.g., [9]). Other approaches, e.g. in [81, 83], are especially suitable for similar strings.

Space saving techniques are often crucial in molecular sequence alignment.
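As a small illustration of both points, the Python sketch below computes the length of an LCS while keeping only two rows of the dynamic programming table, and checks the relationship e = n + m − 2l on one example. Keeping two rows yields only the length (or score), not the subsequence itself; recovering an optimal alignment in linear space requires a more careful divide-and-conquer scheme. The function names `lcs_length` and `indel_distance` and the sample strings are assumptions made for the example.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence, using two rows of the table."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)           # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop a symbol of x or of y
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """Minimum number of unit-cost insertions and deletions turning x into y
    (substitutions are not allowed)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


x, y = "AGGTAB", "GXTXAYB"
l = lcs_length(x, y)         # 4, e.g. "GTAB"
e = indel_distance(x, y)     # 5
print(l, e, e == len(x) + len(y) - 2 * l)
```

Both functions use space proportional to the length of y rather than to the product of the two lengths.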

For some time, the early $O(nm)$ algorithm by Hirschberg [48] was the only one to require linear space. Solutions where linear space was not necessarily accompanied by $O(nm)$ time began with [8]. Further work in this area is found in [7, 22, 81, 82] and others.

As mentioned, not all of the genetic material seems equally crucial to the functionality of organisms, so that the process of evolution tends to preserve some segments of DNA while accepting changes to others more liberally. Notable among the most preserved regions of DNA are those special segments that translate into sequences that fold up as proteins. Often in the practice of biological sequence comparison, sequences are compared that are known to have poor global similarity, but are also suspected to contain similar segments. In these cases, one resorts to methods that detect such local resemblances, such as the one below, formulated by Smith and Waterman.


SW-Local ([56, 101, 104]). It computes local alignments of two substrings of x and y, but, unlike in the MSP, this time insertion and deletion of symbols are allowed. For each pair of prefixes x[1 : i] and y[1 : j] we are interested in finding the best suffixes of those two prefixes that can be aligned with each other using the full set of edit operations. That can be done by means of the following dynamic programming recurrence [104]:

$$LA(i,j) = \max\{\, LA(i-1,j-1) + S(x_i,y_j),\; LA(i-1,j) + \delta,\; LA(i,j-1) + \delta,\; 0 \,\} \tag{2}$$

where $1 \le i \le n$ and $1 \le j \le m$, and $\delta$, the weight of deleting or inserting a symbol, is chosen to be a negative number. The weight of a substitution is also assumed to be negative on average. The initial conditions of the recurrence are given by $LA(i,0) = 0$ and $LA(0,j) = 0$.

This set of conditions, and the 0 appearing under the max operator, is where Recurrence 2 departs from Recurrence 1. Clearly, the scores assigned in this way to the matrix LA reflect the best local alignments ending at each point of that matrix. Among these locally optimal scores, we can choose a best one, i.e., one corresponding to $\max\{LA(i,j) : 1 \le i \le n,\ 1 \le j \le m\}$, and correspondingly identify two substrings of the input that achieve the best pairwise alignment. Conversely, we can change (2) and its initial conditions to get back a method for the computation of the global similarity between two strings.
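The local recurrence can be sketched in Python as follows. This is a minimal score-only illustration with a single linear gap weight, as in Recurrence 2; it does not include affine gap penalties or traceback. The function name `sw_local_score`, the callback `sub` (standing for S), the parameter `gap` (standing for the negative indel weight) and the toy weights in the usage lines are assumptions.

```python
def sw_local_score(x, y, sub, gap):
    """Return (best, i, j): the largest LA(i, j) of Recurrence 2 and the
    1-based prefix lengths i (in x) and j (in y) at which it is attained.

    sub(a, b) : substitution weight, a hypothetical stand-in for S
    gap       : weight of one insertion or deletion (should be negative)
    """
    n, m = len(x), len(y)
    LA = [[0] * (m + 1) for _ in range(n + 1)]   # LA(i, 0) = LA(0, j) = 0
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            LA[i][j] = max(
                LA[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                LA[i - 1][j] + gap,
                LA[i][j - 1] + gap,
                0,                               # start a fresh local alignment
            )
            if LA[i][j] > best[0]:
                best = (LA[i][j], i, j)
    return best


# Toy weights (an assumption): +2 match, -3 mismatch, -1 per gap symbol.
toy_sub = lambda a, b: 2 if a == b else -3
print(sw_local_score("AAAGGG", "AAATGGG", toy_sub, -1))  # (11, 6, 7)
```

In the usage line the best local alignment pairs all six symbols of the first string and skips the extra T of the second, giving 6 × 2 − 1 = 11.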

SW-local (with modifications by Gotoh [37]) is the basis of the SSEARCH program for local alignment.

4. The Structure and Choice of Weights

In the two alignment methods we have outlined in the previous section, the values assigned to the weights or costs of insertions, deletions and substitutions are extremely important in obtaining alignments that are biologically meaningful.

There are two natural questions that one can ask about weights. What is the structure of weights, i.e., is there any mathematical methodology that can help us to obtain numeric values for weights that turn out to be useful for molecular biology? How do we actually compute those weights? The current state of the art provides partial answers to those questions, and many of the results depend on whether we want to use weights for local or global alignment and on which method we are using. In what follows, we will first present results for the MSP method. So, we will discuss mainly substitution matrices in the local alignment context and, unless otherwise stated, an alignment of two strings is the one obtained by MSP. Then, we will consider the problem of how to choose gap and substitution weights for SW-local, since it is a good representative of the class of methods allowing gaps.

4.1. The Structure of Substitution Matrices. When one looks for optimal local alignments of two strings, one is trying to discover whether parts of those two strings are related to each other through evolution. The scores (and therefore the weights of substitutions) give a quantitative measure of that relationship. So, the weights should be designed to discriminate biologically meaningful alignments from ones due to chance. In very loose and intuitive terms, two strings can always be obtained one from the other through some amount of “evolutionary change”. So, when we align those two strings, it seems reasonable to use the weights that best capture that amount of evolutionary change. Unfortunately, that quantity is not known to us. However, in order to investigate the “structure of weights”, let us assume that it is known. Moreover, to simplify notation, we assume that all letters of the alphabet are integers in [1, |Σ|]. Let $p_i$ be the probability that symbol i appears in a randomly chosen string from Σ*. We refer to the $p_i$'s as the background probability distribution of the symbols of the alphabet.

Let $q_{i,j}$ be the probability that, for the amount of evolutionary change we have fixed, symbol i is substituted by symbol j. We refer to the $q_{i,j}$'s as the target probability distribution of two symbols of the alphabet substituting each other. In [58], it is argued that the best weights one can use to capture the given amount of evolutionary change are of the form

$$S(i,j) = \lambda \log \frac{q_{i,j}}{p_i\, p_j} \tag{3}$$

where $\lambda$ is a normalization constant. A few remarks are in order. As already stated in Section 2, many substitution matrices have been proposed, even before (3) was discovered. Most of them are log-likelihood matrices and therefore have the form prescribed by (3). This is appealing to intuition, since S(i, j) gives a “measure” of how much the probability of the event “symbol i substitutes symbol j” differs from chance. Equation (3) is important because it reinforces the view that, at least for MSP alignments, that intuition is indeed correct.

There are other requirements that those substitution weights must satisfy. We will state and justify them. First, there must be at least two letters in Σ for which the weight is positive, since we would like to assign positive scores to locally optimal alignments. Second, the expected value of the weight of two randomly chosen symbols must be negative. If the expected weight of two randomly chosen letters were positive, extending two substrings in an alignment as far as possible would tend to increase the score of that alignment (so the definition of locally optimal alignment would be meaningless).

4.2. How to Compute Substitution Matrices. We now turn to the problem of how to actually determine the numeric values for substitution matrices.

The hard part is the computation of the target probabilities in (3). We will briefly discuss two methods, one due to Dayhoff et al. [26] and the other due to S. and J. Henikoff [45].
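Once target probabilities are available, the remaining computation is mechanical. The Python sketch below assumes that (3) has the usual log-odds form, S(i, j) = λ log(q_{i,j} / (p_i p_j)), and also checks the two requirements stated in Section 4.1 on a made-up two-letter alphabet. The function names, the choice λ = 1 and the numbers in `p` and `q` are all assumptions for illustration; they are not real amino acid data.

```python
import math


def log_odds_matrix(q, p, lam=1.0):
    """S[i][j] = lam * log(q[i][j] / (p[i] * p[j])).

    q   : symmetric matrix of target probabilities (ordered pairs sum to 1)
    p   : background probabilities
    lam : normalization constant; the value 1.0 is an arbitrary choice
    """
    k = len(p)
    return [[lam * math.log(q[i][j] / (p[i] * p[j])) for j in range(k)]
            for i in range(k)]


def expected_score(S, p):
    """Expected weight of two symbols drawn independently from the background."""
    k = len(p)
    return sum(p[i] * p[j] * S[i][j] for i in range(k) for j in range(k))


# Toy two-letter example with made-up numbers.
p = [0.5, 0.5]
q = [[0.4, 0.1],
     [0.1, 0.4]]
S = log_odds_matrix(q, p)

has_positive_entry = any(v > 0 for row in S for v in row)
print(has_positive_entry, expected_score(S, p) < 0)
```

When the entries of q sum to 1, the expected score computed this way is the negative of a relative entropy, so it cannot be positive; it is strictly negative whenever q differs from the product of the background probabilities.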


The PAM Matrices. Dayhoff et al. [26] use a stochastic process to model the evolution of proteins: at each discrete time instant, a symbol (an amino acid) has a certain fixed probability of substituting another symbol. They also define a unit of evolutionary change, the Point Accepted Mutation (PAM for short): one PAM unit is the amount of evolutionary change required for a sequence of length n to have endured n/100 point mutations. Note that the same position might mutate more than once, or even mutate back to its original character, so that two sequences that have diverged by one PAM are expected to differ in less than 1% of their positions. Likewise, two sequences that are 250 PAM apart are still expected to agree on about 20% of their symbols.

Let $M_1$ be the stochastic matrix giving the target probability distribution at one PAM. That is, $M_1[i,j] = q_{i,j}$ at one PAM. $M_1$ is a matrix that has been experimentally determined by studying a group of closely related proteins [26]. $M_n = (M_1)^n$ is the target probability distribution at n PAMs. Now, for a given evolutionary distance (expressed in PAMs), we can use the corresponding M matrix to compute the substitution matrix as given by (3). Actually, one has a family of matrices from which to obtain substitution matrices. Those target distribution matrices are simply referred to by the PAM distance corresponding to them. So, PAM 250 is the matrix corresponding to an evolutionary distance of 250 PAMs.
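The extrapolation from one PAM to n PAMs is just a matrix power, which the following Python sketch illustrates. Since the empirical 1-PAM matrix is not reproduced here, the sketch uses a hypothetical uniform model in which every symbol changes with probability 0.01 and is equally likely to become any other symbol; rows are treated as conditional probabilities, which is an assumption of this example. The names `mat_mul`, `mat_pow` and `toy_pam1` are likewise invented for the illustration.

```python
def mat_mul(A, B):
    """Plain-Python product of two square matrices given as lists of rows."""
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]


def mat_pow(M, n):
    """M raised to the n-th power (n >= 0) by repeated squaring."""
    k = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def toy_pam1(k, rate=0.01):
    """Hypothetical 1-PAM matrix: a symbol changes with probability `rate`,
    uniformly to any of the other k - 1 symbols. Real PAM1 entries are empirical."""
    off = rate / (k - 1)
    return [[1.0 - rate if i == j else off for j in range(k)] for i in range(k)]


M1 = toy_pam1(20)
M250 = mat_pow(M1, 250)
# Each row is still a probability distribution, and the diagonal has shrunk.
print(abs(sum(M250[0]) - 1.0) < 1e-9, M250[0][0] < M1[0][0])
```

Under this toy model the diagonal of `M250` is the expected fraction of positions that still show their original symbol. It need not match the figure of about 20% quoted above for real proteins, because real mutation probabilities are far from uniform.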

One drawback of the method just described to compute weights is that, for distantly related strings, one infers target probabilities from the target probabilities of closely related strings. That may lead to inaccuracies since we are not estimating the target probabilities directly from samples of distantly related strings. Henikoff and Henikoff [45] have recently proposed a different method to obtain target probabilities for distantly related strings. The idea is to infer those probabilities from a large set of representative families of proteins. Here we outline the main ideas of the method.

The BLOSUM Matrices. Since $|\Sigma| = 20$ for the alphabet of amino acids, there are 210 distinct pairs of symbols $(i,j)$, each of which corresponds to a substitution of symbol $i$ with $j$. (Here $(j,i)$ is considered to be the same as $(i,j)$.) The target probabilities are a normalized estimate of the frequencies with which one can find a substitution $(i,j)$. We estimate those frequencies, and therefore the corresponding probabilities, using a database of proteins. That is done as follows. The proteins in the database are divided into blocks, following the criteria specified in [45]. A block is a set of $s$ strings, each of length $w$. So, we can think of a block as an $s \times w$ matrix. For each column of that matrix, we count how many times symbols $i$ and $j$ appear in that column and we increase the frequency estimate of that pair of symbols accordingly. (Each such occurrence corresponds to a substitution of a symbol of a string into another symbol in a different string in the block.) The process is repeated for all blocks.
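The column-wise counting can be illustrated with a minimal sketch. It assumes that each unordered pair of rows in a column contributes one count to the corresponding symbol pair, and that the counts are normalized by their total; the function name and the toy block are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_frequencies(blocks):
    """Estimate target frequencies from a list of blocks.

    Each block is a list of s equal-length strings. For every column,
    every unordered pair of rows contributes one count to the
    (unordered) pair of symbols found in that column.
    """
    counts = Counter()
    for block in blocks:
        width = len(block[0])
        for col in range(width):
            column = [row[col] for row in block]
            for a, b in combinations(column, 2):
                counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    # Normalize so that the estimates sum to one.
    return {pair: c / total for pair, c in counts.items()}

# A single toy block with s = 3 strings of length w = 2.
q = pair_frequencies([["AC", "AC", "GC"]])
```

For this block the first column yields one (A, A) pair and two (A, G) pairs, and the second column yields three (C, C) pairs, so the estimates are 1/6, 2/6 and 3/6.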


An important feature of this method is that one can cluster closely related proteins that are in the same block. Such clustering prevents closely related proteins from making a large contribution to the target frequencies (we want to obtain target frequencies that “capture” similarity between distantly related proteins). The idea is the following. Let us assume that we want to reduce the contributions to the target frequencies of proteins that are at least 80% similar. Then, for each block, we cluster together all proteins that are at least 80% similar and each of those clusters is now one string in the block. We use those new blocks to estimate target frequencies. We remark that the similarity of proteins within blocks is obtained by other methods [45].

The above approach gives rise to a family of target probability distribution matrices, the BLOSUM family. Each BLOSUM matrix also has a number attached to it that refers to the percentage that one has used for clustering. For instance, BLOSUM 62 is the matrix obtained by clustering together, in each block, proteins that are at least 62% similar. From each BLOSUM matrix, we can obtain substitution matrices again using (3).
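The clustering step can be sketched as follows. This is a rough illustration under two assumptions that the text does not make: similarity is measured as the fraction of identical positions between two rows of a block, and clusters are formed by single linkage (two rows end up together whenever a chain of sufficiently similar rows connects them). The method of [45] may differ on both points.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_block(block, threshold):
    """Group the rows of a block by single linkage at the given threshold."""
    parent = list(range(len(block)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if identity(block[i], block[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, row in enumerate(block):
        clusters.setdefault(find(i), []).append(row)
    return list(clusters.values())

# Toy block: the first two rows are 80% identical, the third is unrelated.
groups = cluster_block(["AAAAA", "AAAAC", "GGGGG"], 0.8)
```

Each resulting group would then be treated as a single string of the block when pairs are counted, so that near-duplicates do not dominate the frequency estimates.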

4.3. How to Choose Substitution Matrices. When we want to align two strings using MSP, we need to choose a substitution matrix. From what we have said above, we should use the substitution matrix that best captures the evolutionary change transforming one string into the other. Unfortunately, that distance is not known to us, even though we may have some idea about it. Until recently, one would resort to the PAM 250 matrix, since experimentally it seemed the best suited to align distantly related strings. Recently, Altschul [1] has proposed an interpretation of substitution matrices in information theoretic terms. That interpretation is useful both in choosing a substitution matrix from a family of matrices (like the PAM or BLOSUM families) and in comparing the performance of matrices from different families, i.e., the ability of substitution matrices to yield biologically meaningful alignments.

Let $H = \sum_{i,j} q_{ij} \log_2 (q_{ij}/p_i p_j)$ be the relative entropy of the target and background distributions. We associate the quantity $H$ with a substitution matrix with target probabilities $q_{ij}$. Now, fix two strings for which we want to compute an MSP alignment and assume that $q'_{ij}$ is the target probability distribution for the alignment of those two strings. Let $H'$ be the same as $H$ but with the new target probability distribution.

Notice that $H'$ is the average score per symbol substitution in that alignment. Since $H'$ is also a measure of information, we can interpret it as the average information (in bits) per position that is needed to distinguish that alignment from chance. In other words, it gives, for each position of the alignment, the amount of information that we need to discriminate between the target and the background probability distributions. Now, to align the two chosen strings, we should choose a substitution matrix with relative


entropy close to $H'$. Since we do not know $H'$, we have to approximate it. That is done as follows. Altschul [1] has shown that, in order for an MSP alignment to be meaningful, its score must be at least $\log N$ bits, where $N$ is the product of the lengths of the two strings being aligned. Now assume that, for the two strings we want to align, we expect the length of the best local alignment to be $\ell$ (this value is usually determined by resorting to heuristic knowledge). So, the number of bits each position must contribute to the score of that alignment is $(\log N)/\ell$. Therefore, we choose a substitution matrix with relative entropy close to $(\log N)/\ell$. Based on this criterion, Altschul [1] provides a set of guidelines on which PAM matrices to use in which contexts. Further studies are reported in [2, 106].

The relative entropy of a substitution matrix is also very important in establishing a criterion on which to base the comparison of the performance (in terms of parameters such as sensitivity and selectivity, to be discussed below) of two different substitution matrices. For instance, the PAM family of matrices has been obtained in a totally different way from the BLOSUM family and there seems to be no clear relationship between members of the two families. The question is how to compare matrices in those two families (e.g., should the performance of the PAM 250 matrix be compared with BLOSUM 62 or with another matrix?). Using relative entropy, it seems reasonable to compare the performance of two matrices that have the same entropy. Using this criterion, S. and J. Henikoff [45] have compared those two families of matrices and they showed that the BLOSUM matrices seem to perform much better than the PAM matrices. Additional extensive studies are reported in [46]. For completeness, we point out that those experimental studies have been carried out for local alignment methods that allow gaps.
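The selection criterion based on relative entropy can be sketched as follows. The sketch assumes base-2 logarithms (so that all quantities are in bits); the toy distributions, the matrix names and their entropy values are hypothetical placeholders, not published figures.

```python
import math

def relative_entropy(q, p):
    """H = sum over i, j of q_ij * log2(q_ij / (p_i * p_j)), in bits."""
    n = len(p)
    return sum(q[i][j] * math.log2(q[i][j] / (p[i] * p[j]))
               for i in range(n) for j in range(n) if q[i][j] > 0)

def bits_per_position(len_x, len_y, expected_len):
    """Target relative entropy (log N) / l, with N = len_x * len_y."""
    return math.log2(len_x * len_y) / expected_len

def pick_matrix(entropies, target):
    """Name of the matrix whose relative entropy is closest to target."""
    return min(entropies, key=lambda name: abs(entropies[name] - target))

# Toy target and background distributions over a two-letter alphabet.
Q = [[0.4, 0.1],
     [0.1, 0.4]]
P = [0.5, 0.5]
H = relative_entropy(Q, P)

# Two strings of length 1024 and an expected local alignment of length 40.
target = bits_per_position(1024, 1024, 40)
choice = pick_matrix({"matrix_a": 1.0, "matrix_b": 0.5, "matrix_c": 0.25},
                     target)
```

Here $N = 2^{20}$, so a meaningful alignment needs about 20 bits; spread over 40 positions this is 0.5 bits per position, and the hypothetical `matrix_b` is selected.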
The problem of choosing weights for global or local alignment methods when both gaps and substitutions are allowed is considered next.

4.4. Weights for Alignments with Gaps. When gaps are allowed, the weights consist of a substitution matrix and a gap penalty r.

Let $Score(x[j:r], x[i:t])$ be the similarity score between those two strings, i.e., the sum of the weights of the operations in an edit script transforming $x[j:r]$ into $x[i:t]$. Such an edit script is associated with an alignment of $x[j:r]$ with $x[i:t]$. We refer to that alignment as a repeat. A repeat is a tandem repeat if $i = r+1$. To keep things simple, let us assume that, among all possible repeats, we are interested in finding the best one, that is, the one of maximum score. This problem has been proposed by Miller [79], who also gave a practical algorithm solving it. The heart of the algorithm is the dynamic programming recurrence (2) that is used for local alignment. However, the computation of that recurrence needs to be modified to account for the facts that we are aligning a string with itself and that we are interested in non-overlapping regions of similarity. Kannan and Myers [57] subsequently devised an algorithm with $O(n^2 \log^2 n)$ time performance, by combining some of the ideas in [79] and [6].

More recently, Schmidt [99] generalized the notion of locally optimal alignment for strings (the one due to Sellers [101]) to locally optimal (non-overlapping) repeats and to locally optimal tandem repeats. She also devised an $O(n^2 \log n)$ time algorithm that computes all such locally optimal repeats. The data structures she devised may be useful in other sequence alignment contexts. Intuitively, such data structures support fast answers to queries on shortest paths in the grid graphs associated with the dynamic programming recurrence (2). For completeness, we mention that a simplified version of the problem of finding all locally optimal tandem repeats had been considered in [69].

Algorithms dealing with similar “incremental” versions of string editing and longest common subsequences were produced in the mid 80's by Myers [80] and, in the distant context of searching for cliques on some special classes of graphs, by Apostolico et al. [5]. A recent paper by Landau, Myers and Schmidt [68] combines the incremental methods for edit distance without mismatches of [80] with the incremental methods for the Levenshtein edit distance and discusses a number of biological applications where this is useful.

7. Multiple Sequence Alignment

Given a set of strings $x_1, x_2, \ldots, x_k$ over an alphabet $\Sigma$, a multiple alignment for those strings is a two-dimensional matrix $A$ of $k$ rows satisfying the following conditions. The entries of $A$ are either symbols from the input alphabet or the empty symbol $\lambda$ (the identity under concatenation), and concatenating the entries in row $i$ of $A$ yields the string $x_i$. Given a cost function $d$ on $\Sigma^k$, an optimal alignment is one that minimizes the sum of the costs on all columns of a multiple alignment matrix $A$. Formally, we want a multiple alignment matrix $A = (a_{ij})$

$A \mid \epsilon$
$A \to bAb \mid AB \mid bb$
$B \to bAb \mid bb$

These results are summarized in the following lemma, where we consider “secondary structure” to be the pairings of bases that appear together in a single step of the derivation:

Lemma 3.2 (Structural Ambiguity). For the grammars and language of orthodox strings defined above, $L(G_o) = L(G_{od}) = L(G_{on}) = L_o$. For any $w \in L_o$, there is exactly one leftmost derivation from $G_{od}$. For any secondary structure of any $w \in L_o$, there is exactly one leftmost derivation from $G_{on}$.

Proof: By various inductions (e.g. see [18]).

Perhaps the most significant result to emerge from this line of research has been that DNA is beyond context-free. This follows from several observations:

• Direct repeats are common in DNA, and copy languages, of which $\{uu \mid u \in \Sigma^*\}$ is the archetype, are known not to be context-free. Of course, the mere presence of repeats is not a proof (a fact often overlooked), since in fact repeats are required in context-free languages by the pumping lemma. Thus, it is necessary to establish functional roles for specific repeats in order to make a formal argument; a number of possibilities have been proposed [18]. For example, the attenuator languages $L_a$ and $L_{an}$ introduced above are non-context-free by virtue of repeats with one such functional role, founded in the need for alternative secondary structure in certain control mechanisms.

• More generally, crossing dependencies are observed in parallel (as opposed to anti-parallel) interactions between strands, seen commonly in proteins and less commonly in phenomena such as triple- and quadruple-stranded DNA.


• Pseudoknots are a form of secondary structure in which two stem-and-loop structures overlap, such that one of the loops contains half of the other stem. The ideal version of the pseudoknot language, $\{u v u^R v^R \mid u, v \in \Sigma^*_{DNA}\}$, contains no repeats but, while each of the stems is conventionally base-paired with nested dependencies, in combination the dependencies are forced to cross. Pseudoknots, and non-orthodox secondary structure in general, have created challenges for algorithms dealing with secondary structure prediction and pattern recognition.

Thus the language of DNA appears to be relatively complex in a formal linguistic sense, and the question arises as to how such complexity arises. That is, we might presume that the first DNA (or, more likely, RNA) molecules were random strings, and thus regular, and ask by what series of operations such strings were manipulated to create the complex languages which have evidently been selected by evolution. The mathematical way of asking this rather philosophical question might be in terms of closure properties under the various domain-specific operations that are observed on strings of DNA. Along these lines, we have observed the following:

• Under the operation of replication,

(11) $REP(L) = \{w, w^R \mid w * w\}$.

The cut language union, $\bigcup L(G) = \{ u \in \Sigma^* \mid S \Rightarrow^* w \text{ and } u \in w \}$, will also be important. A string will be in this language if and only if it is a substring of some string derived from the grammar in the ordinary way, such that the substring is bracketed by

$bSb \mid \delta S \mid S\delta \mid \delta$

We can also require a minimum “overhang” to create what biologists call “sticky ends”:

(22) $S \to bSb \mid w\delta Sw^R \mid wS\delta w^R \mid \delta$

for each $w \in \Sigma^n_{DNA}$ where $n$ is the desired length (or for particular $w$'s to model restriction enzyme sites). Thus, cut grammars can be


used to describe hybridization of populations of strings, e.g. cut language elements as sets of hybridizable oligonucleotides. Note that this formalizes the strategy recently used by Adleman to “compute” a well-known intractable problem [1].

This is easily generalized to branched hybridization, again by analogy with the development in the previous section. The language of generalized hybridization networks would derive from

(23) $S \to bSb \mid SS \mid \delta \mid \epsilon$

since nicks can thus arise at the end of any stem or, via the doubled $SS$ rule, in-line on either side of a stem.

Still other generalizations suggest themselves. Note, for example, that using the start symbol to leave one end of the double-stranded molecule open, and a $\delta$ to cut open the other end, seems arbitrary given the symmetry. In order to “close off” the start of a derivation tree, we can do the following:

Definition 4.2 (Circular Cut Languages). For any $u = u_1 \delta u_2 \delta u_3 \delta \cdots \delta u_n$ where $n > 1$ (i.e. at least one $\delta$ appears in $u$) and $u_i \in \Sigma^*$ for $1 \le i \le n$, we define a circular cut function that maps $u$ to $\{u_n u_1, u_2, u_3, \ldots, u_{n-1}\}$.
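The circular cut function of Definition 4.2 is easy to state operationally. The following minimal sketch represents the cut symbol by the character `δ` and strings over $\Sigma$ by ordinary Python strings; the function name is hypothetical.

```python
DELTA = "δ"  # the cut symbol

def circular_cut(u):
    """Circular cut of u = u1 δ u2 δ ... δ un (n > 1).

    The last and first pieces are joined, as if the string were closed
    into a circle before being cut: {un u1, u2, ..., u(n-1)}.
    """
    pieces = u.split(DELTA)
    if len(pieces) < 2:
        raise ValueError("u must contain at least one cut symbol")
    return {pieces[-1] + pieces[0], *pieces[1:-1]}

wheel = circular_cut("ab" + DELTA + "cd" + DELTA + "ef")
```

For the string `abδcdδef` the result is the set `{"efab", "cd"}`; with a single cut, as in `abδcd`, the only element is `"cdab"`.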

A circular cut language is defined as before, but using the circular cut function. Then, for

(24) $G_{ds}: S \to bSb \mid \delta$,

we have ordinary stems, and the circular cut language of $G_{ds}$ equals $L(G_{ds}) = L_h$, which is to say, the set of stems open at the start is the same as the set open at the terminus. For any $G$ we can form a $G'$ by adding a new start symbol $S'$ and rule $S' \to S\delta$, to get $L(G') = L(G)$.

More generally, circular cut languages will allow us to begin derivations at interior points in structures. The following grammar forms a hybridization “wheel” with an arbitrary number of spokes radiating from a central S, as depicted in Figure 4:

(25) v ’

A —> ^ ffk A bAb i| 8A

In order to be able to distinguish between “ligateable” cuts (which is to say, nicks) and unligateable gaps, we introduce the following:

Definition 4.3 (Ligation). A ligation grammar is a cut grammar with an additional new symbol $\gamma$. For any $u = u_1 \gamma u_2 \gamma \cdots \gamma u_n$ where $u_i \in (\Sigma \cup \{\delta\})^*$ for $1 \le i \le n$, we define a ligate function that maps $u$ to $\{u_1, u_2, \ldots, u_n\}$. For a ligation grammar $G$ with start symbol $S$ we define the ligated language $L(G) = \{ u \in 2^{\Sigma^*} \mid S \Rightarrow^* u \}$.


Figure 4. A hybridization “wheel”.

This allows us to propose a model of generalized linear hybridization, as embodied in the grammar below:

(26) $G_{lh}$:
$S \to bSb \mid A\delta \mid \delta B \mid \gamma$
$A \to bA\gamma \mid bSb \mid \gamma$
$B \to \gamma Bb \mid bSb \mid \gamma$

It can be seen that any cut appearing at the end of a stem is an unligateable gap ($\gamma$), as are cuts opposite unpaired bases, in the $A$ and $B$ rules. Cuts appearing between paired bases, via the $S$ rule, are ligateable nicks ($\delta$). Such a grammar can produce parse trees representing plausible linear hybridization products such as that shown in Figure 5.

Figure 5. A linear hybridization.

Given such a model, the problem of determining ways in which a set of oligonucleotides can anneal can be cast as a parsing task, albeit a non-traditional one. Unfortunately, even for regular cut grammars this task appears to be intractable. The following proof is due to Michael Niv:

Lemma 4.1 (Cut Language Recognition). Given a regular cut grammar $G$ and a set $V \subseteq \Sigma^*$, determining whether $V \in L(G)$ is NP-hard.

Proof: By reduction from Directed Hamiltonian Path (DHP). Given a graph $(V, E)$, where $|V| = n$, the start vertex is $v_1$, and the end vertex is $v_n$, if we define a cut grammar with the vertex labels as its alphabet ($\Sigma = V$), nonterminals $\{A^i_j \mid 1 \le i, j \le n\} \cup \{S\}$, and rules

(27) $S \to v_1 A^1_1$, $A^i_j \to \delta v_k A^{i+1}_k$ for all $(j,k) \in E$, $1 \le i < n$

then $V \in L(G)$ iff there is a path of length $n$ that passes through every vertex in $V$ exactly once (DHP).
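The size of the grammar produced by this reduction can be seen by enumerating its productions. The sketch below is illustrative only: it reads the nonterminal $A^i_j$ as “$i$ vertices have been emitted and the last one was $v_j$”, which is an interpretation rather than something stated in the proof, and it lists only the rules given in (27). The function name and the tuple encoding of nonterminals are hypothetical.

```python
def dhp_cut_grammar(n, edges):
    """Productions of the cut grammar built from a directed graph.

    Vertices are numbered 1..n and edges is a set of pairs (j, k).
    A nonterminal A^i_j is encoded as the tuple ("A", i, j); terminals
    are the strings "v1".."vn" and the cut symbol "δ".
    """
    rules = [("S", ["v1", ("A", 1, 1)])]
    for i in range(1, n):                 # 1 <= i < n
        for (j, k) in sorted(edges):      # one rule per edge and level
            rules.append((("A", i, j), ["δ", f"v{k}", ("A", i + 1, k)]))
    return rules

# A path graph on three vertices: 1 -> 2 -> 3.
rules = dhp_cut_grammar(3, {(1, 2), (2, 3)})
```

The grammar has $1 + (n-1)|E|$ productions of this form, so the construction is polynomial in the size of the graph, as a reduction requires.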

5. Evolutionary Structure

Just as conventional grammars fall short in describing interactions between strings in a set, they are also inadequate to capture evolutionary relationships between strings. Formal language theory has been notably absent in one of the most widely-practiced computational activities surrounding macromolecules, that of string comparison. The determination of optimal alignments and putative evolutionary distances between strings has heretofore been confined to the realm of algorithmics.

The close relationship between grammars and automata is an important aspect of formal language theory. We have recently explored the use of a brand of automaton called a finite transducer in modelling relationships between strings in such a way as to provide a connection to the efficient algorithms commonly used in this field [24].

Finite transducers are simply finite-state machines for which transitions have both input and output. Figure 6a shows how a finite transducer can be used to model a single mutation occurring anywhere in a string, where the mutation could be a single-base substitution, deletion, or insertion (using transitions labelled by their input and output, separated by a forward slash, with x and y standing for any non-identical nucleotides).

By simply merging the start and final states of the “mutation machine”, we can produce an “edit machine” as shown in Figure 6b, that will make any number of non-overlapping mutations in a string. We have also introduced weights on the transitions, to be added to a running total with each move of the transducer. What is conventionally defined as the minimal edit distance between two strings is simply the

minimal computation of this automaton.

Figure 6. Finite transducers for single mutation (left) and edit (right).
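The minimal computation of the edit machine can be found with the familiar dynamic programming scheme, since every run of the single-state transducer is a sequence of copy, substitute, delete and insert moves. The sketch below assumes unit weights for the three kinds of mutation and weight zero for copying a symbol; these weights and the function name are illustrative choices, not values taken from the figure.

```python
def edit_distance(x, y, sub=1, ins=1, dele=1):
    """Minimum total weight of a run of the edit machine turning x into y.

    Moves: x/x (copy, weight 0), x/y (substitute), x/ε (delete),
    ε/y (insert). Only two rows of the table are kept at a time.
    """
    prev = [j * ins for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * dele]
        for j in range(1, len(y) + 1):
            step = 0 if x[i - 1] == y[j - 1] else sub
            cur.append(min(prev[j - 1] + step,   # copy or substitute
                           prev[j] + dele,       # delete x[i-1]
                           cur[j - 1] + ins))    # insert y[j-1]
        prev = cur
    return prev[-1]

d = edit_distance("ACGT", "AGT")
```

With unit weights this is the Levenshtein distance; for `ACGT` and `AGT` a single deletion suffices, so the distance is 1.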

This affords the following formulation of the notion of edit distance:

Consider a weighted finite transducer $(Q, \Sigma, \delta, s, F)$ with states $Q$, input/output alphabet $\Sigma$, start state $s \in Q$, final states $F \subseteq Q$, and transitions