Computational Molecular Biology [1 ed.] 0471872512, 9780471872511

Recently molecular biology has undergone unprecedented development generating vast quantities of data needing sophistica


English Pages 307 Year 2000


Page i

Computational Molecular Biology An Introduction

Page ii

WILEY SERIES IN MATHEMATICAL AND COMPUTATIONAL BIOLOGY

Editor-in-Chief
Simon Levin, Department of Ecology and Evolutionary Biology, Princeton University, USA

Associate Editors
Zvia Agur, Tel-Aviv University, Israel
Odo Diekmann, University of Utrecht, The Netherlands
Marcus Feldman, Stanford University, USA
Bryan Grenfell, Cambridge University, UK
Philip Maini, Oxford University, UK
Martin Nowak, Oxford University, UK
Karl Sigmund, University of Vienna, Austria

CHAPLAIN/SINGH/MCLACHLAN - On Growth and Form: Spatio-temporal Pattern Formation in Biology
CHRISTIANSEN - Population Genetics of Multiple Loci
CLOTE/BACKOFEN - Computational Molecular Biology: An Introduction
DIEKMANN/HEESTERBEEK - Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation

Reflecting the rapidly growing interest and research in the field of mathematical biology, this outstanding new book series examines the integration of mathematical and computational methods into biological work. It also encourages the advancement of theoretical and quantitative approaches to biology, and the development of biological organisation and function.

The scope of the series is broad, ranging from molecular structure and processes to the dynamics of ecosystems and the biosphere, but unified through evolutionary and physical principles, and the interplay of processes across scales of biological organisation.

Topics to be covered in the series include:
• Cell and molecular biology
• Functional morphology and physiology
• Neurobiology and higher function
• Immunology
• Epidemiology
• Ecological and evolutionary dynamics of interacting populations

A fundamental research tool, the Wiley Series in Mathematical and Computational Biology provides essential and invaluable reading for biomathematicians and development biologists, as well as graduate students and researchers in mathematical biology and epidemiology.

Page iii

Computational Molecular Biology
An Introduction

Peter Clote
Department of Computer Science and Department of Biology, Boston College, USA
Formerly Ludwig-Maximilians-Universität München, Germany

Rolf Backofen
Ludwig-Maximilians-Universität München, Germany

Page iv

Copyright ©2000 John Wiley & Sons Ltd
Baffins Lane, Chichester, West Sussex, PO19 1UD, England
National 01243 779777
International (+44) 1243 779777
e-mail (for orders and customer service enquiries): [email protected]
Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court Road, London W1P 9HE, UK, without the permission in writing of the Publisher and the copyright owner, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for the exclusive use by the purchaser of the publication.

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158–0012, USA
Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany
Jacaranda Wiley Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Rexdale, Ontario, M9W 1L1, Canada

Library of Congress Cataloging-in-Publication Data
Clote, Peter.
Computational biology : a self contained approach to bioinformatics / Peter Clote, Rolf Backofen
p. cm – (Wiley series in mathematical and computational biology)
Includes bibliographical references (p.)
ISBN 0-471-87251-2 (alk. paper) – ISBN 0-471-87252-0 (pbk.: alk. paper)
1. Genetics--Mathematical Models. 2. Molecular biology--Mathematical models. I. Backofen, Rolf. II. Title. III. Series.
QH438.4.M3 C565 2000
572.8'01'51 187-dc21
00-038169

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-471-87251-2
ISBN 0-471-87252-0

Some content in the original version of this book is not available for inclusion in this electronic edition.

Produced from PostScript files supplied by the authors.
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

Page v

To my wife, Marie, and to my son, Nicolas. (P.C.)
To my wife, Doris, and my children, Ina and Lara. (R.B.)

Page vii

Contents

Series Preface xi
Preface xiii

1 Molecular Biology 1
  1.1 Some Organic Chemistry 3
  1.2 Small Molecules 4
  1.3 Sugars 6
  1.4 Nucleic Acids 6
    1.4.1 Nucleotides 6
    1.4.2 DNA 8
    1.4.3 RNA 13
  1.5 Proteins 14
    1.5.1 Amino Acids 14
    1.5.2 Protein Structure 15
  1.6 From DNA to Proteins 17
    1.6.1 Amino Acids and Proteins 17
    1.6.2 Transcription and Translation 19
  1.7 Exercises 21
  Acknowledgements and References 22

2 Math Primer 23
  2.1 Probability 23
    2.1.1 Random Variables 25
    2.1.2 Some Important Probability Distributions 27
    2.1.3 Markov Chains 38
    2.1.4 Metropolis–Hastings Algorithm 43
    2.1.5 Markov Random Fields and Gibbs Sampler 47
    2.1.6 Maximum Likelihood 52
  2.2 Combinatorial Optimization 53
    2.2.1 Lagrange Multipliers 53
    2.2.2 Gradient Descent 54
    2.2.3 Heuristics Related to Simulated Annealing 54
    2.2.4 Applications of Monte Carlo 55
    2.2.5 Genetic Algorithms 60
  2.3 Entropy and Applications to Molecular Biology 61
    2.3.1 Information Theoretic Entropy 62
    2.3.2 Shannon Implies Boltzmann 63

Page viii

    2.3.3 Simple Statistical Genomic Analysis 66
    2.3.4 Genomic Segmentation Algorithm 69
  2.4 Exercises 72
  2.5 Appendix: Modification of Bezout's Lemma 77
  Acknowledgements and References 79

3 Sequence Alignment 81
  3.1 Motivating Example 83
  3.2 Scoring Matrices 84
  3.3 Global Pairwise Sequence Alignment 88
    3.3.1 Distance Methods 88
    3.3.2 Alignment with Tandem Duplication 99
    3.3.3 Similarity Methods 110
  3.4 Multiple Sequence Alignment 111
    3.4.1 Dynamic Programming 112
    3.4.2 Gibbs Sampler 112
    3.4.3 Maximum-Weight Trace 114
    3.4.4 Hidden Markov Models 117
    3.4.5 Steiner Sequences 117
  3.5 Genomic Rearrangements 118
  3.6 Locating Cryptogenes and Guide RNA 120
    3.6.1 Anchor and Periodicity Rules 122
    3.6.2 Search for Cryptogenes 122
  3.7 Expected Length of gRNA in Trypanosomes 123
  3.8 Exercises 128
  3.9 Appendix: Maximum-Likelihood Estimation for Pair Probabilities 132
  Acknowledgements and References 133

4 All about Eve 135
  4.1 Introduction 135
  4.2 Rate of Evolutionary Change 137
    4.2.1 Amino Acid Sequences 137
    4.2.2 Nucleotide Sequences 139
  4.3 Clustering Methods 144
    4.3.1 Ultrametric Trees 147
    4.3.2 Additive Metric 152
    4.3.3 Estimating Branch Lengths 156
  4.4 Maximum Likelihood 157
    4.4.1 Likelihood of a Tree 159
    4.4.2 Recursive Definition for the Likelihood 160
    4.4.3 Optimal Branch Lengths for Fixed Topology 162
    4.4.4 Determining the Topology 166
  4.5 Quartet Puzzling 166
    4.5.1 Quartet Puzzling Step 169
    4.5.2 Majority Consensus Tree 170
  4.6 Exercises 171

Page ix

  Acknowledgements and References 173

5 Hidden Markov Models 175
  5.1 Likelihood and Scoring a Model 177
  5.2 Re-estimation of Parameters 180
    5.2.1 Baum–Welch Method 181
    5.2.2 EM and Justification of the Baum–Welch Method 184
    5.2.3 Baldi–Chauvin Gradient Descent 187
    5.2.4 Mamitsuka's MA Algorithm 191
  5.3 Applications 193
    5.3.1 Multiple Sequence Alignment 193
    5.3.2 Protein Motifs 194
    5.3.3 Eukaryotic DNA Promotor Regions 195
  5.4 Exercises 197
  Acknowledgements and References 198

6 Structure Prediction 201
  6.1 RNA Secondary Structure 202
  6.2 DNA Strand Separation 213
  6.3 Amino Acid Pair Potentials 223
  6.4 Lattice Models of Proteins 228
    6.4.1 Monte Carlo and the Heteropolymer Protein Model 231
    6.4.2 Genetic Algorithm for Folding in the HP Model 233
  6.5 Hart and Istrail's Approximation Algorithm 234
    6.5.1 Performance 234
    6.5.2 Lower Bound 236
    6.5.3 Block Structure, Folding Point, and Balanced Cut 239
  6.6 Constraint-Based Structure Prediction 243
  6.7 Protein Threading 246
    6.7.1 Definition 246
    6.7.2 A Branch-and-Bound Algorithm 249
    6.7.3 NP-hardness 258
  6.8 Exercises 259
  Acknowledgements and References 261

Appendix A Mathematical Background 263
  A.1 Asymptotic complexity 263
  A.2 Units of Measurement 263
  A.3 Lagrange Multipliers 264

Appendix B Resources 265
  B.1 Web Sites 265
  B.2 The PDB Format 266

References 269

Index 281

Page 281

Index

A

absolute performance 236
addition law 24
additive metric 152
additive tree metric 147
additivity of alignments 90, 95
adenine 8
adenosine triphosphate (ATP) 4–6
alcohol 5
alignment 88, 90–5, 97–110
  additivity 90, 95
  distance 88, 90, 100
alleles 11–12
amine 5
amino acids 6, 14–15, 17–19
  codes 18
  pair potentials 35, 223–8
  pair probabilities 86–8
  sequences 137–9
  substitution matrix methods 139
amino group 4, 5, 14
aminoacyl-tRNA synthetase 20, 21
Amoeba dubia 12
anchor region rules 122
antiparallel β sheets 16
Archaea 1
archaebacteria 1
asymptotic complexity 263
asymptotic performance 236
Australopithecus 135
Avogadro's number 38

B

back mutations 137, 138
backtrack 210
balanced cut 239–42
balanced state 44
Baldi–Chauvin gradient descent 187–91
Baldi–Chauvin updates 188–91
bases, chemical forms 8
basic U-folds 239
Baum–Welch method 180–8
Baum–Welch parameter 184
Baum–Welch score 178, 179
Bayes' rule 25
Bender's theorem 205
Bernoulli random variable 27, 28
Bernoulli trial 27
β sheet 16
Bezout's Lemma 77–9
binary phylogenetic trees 145–7
binary trees 144, 145, 166
binomial coefficients 24
binomial distribution 27–8
bioinformatics 2
block-respecting codes 56, 57, 59
block structure 239–42
block-structured code 56
BLOSUM matrices 88
Boltzmann distribution 35–8, 45, 46, 181, 221
Boltzmann probability 45, 46
Boltzmann probability distribution 63, 64
Boltzmann's constant 38, 66
Boltzmann's law 63
boolean cellular automaton 74
Box–Muller algorithm 32
branch-and-bound algorithm 249–58
branch lengths 156–7, 162–5
Brookhaven Protein Database (PDB) 266

C

Cantor–Bendixson derivative 207
carbohydrates 4
carboxyl group 4, 5, 14
carboxylic acid 5

Page 282

Catalan numbers 204
CATH database 266
Cavalli-Sforza–Edwards theorem 145–7
central limit theorem 31
chaperones 21
chloroplast DNA (cpDNA) 140
chromosomal duplication 119
chromosomal rearrangement 119
chromosomes 9, 12, 60, 119, 233–4
clustering methods 144–57
codons 17
combinatorial optimization 53–61
  exercises 73–5
complete maximum-weight trace (CMWT) formalization 114
computational biology 2
conditional likelihood 161–2
conditional probability 25, 49–50, 181
connected neighbors 229
constraint-based structure prediction 243–6
core model 247–8, 259
covalent bond 3
Cro Magnon 135–6
crossover 61
cryptogenes 120–3
cyanobacteria 1
cytosine 8

D

Dempster et al. theorem 186
deoxyribose 7
dinucleotide entropy 67–8
directed graph 144
discrete Markov model 175
distance matrix 94, 154
disulfide bonds 17
divergence 67
DNA 2, 8–12
DNA replication 21
DNA strand separation 213–23
Drosophila 197
duplication 119
dynamic programming 112
dynamic programming algorithm 107

E

edit distance 88–90
edit operation 89
energy functions 213
energy matrix computation 210
enthalpy 66
entropy 61–72
  exercises 75–6
  information theoretic 62–3
equilibrium distribution 42, 45, 46
ergodic state 44
error distance 192
Escherichia coli 1
ester 5
Eukarya 2
eukaryotes 1, 20
eukaryotic DNA 214
  promotor regions 195–7
  promotor sequence 196
evolution rates 135–74
  change rate 137–44
  exercises 171–3
expectation maximization 180
expectation maximization algorithm 184–7
expected number of transitions 180, 182
exponential distribution 30, 33–4
extrachromosomal element (ECE) 9

F

Farris transformed distance method 154
fatty acids 4
Feller theorem 34
fibrinopeptides 140
fission 119
Fitch–Margoliash method 156–7
foldicity 231
folding 233–4
  hydrophobic force 235
folding point 239–42
forward method 178
forward variable, definition 178–9
fusion 119

G

gap function 111
gap penalty 94–5, 111
Gaussian distribution 30
Geman–Geman theorem 51
gene 11
GENEMARK 47
genetic algorithms 60–1, 233–4

Page 283

genetic code 18, 19
  fault tolerant 55–60
  optimality 55–60
genome 11
genomic analysis 66–8
genomic rearrangements 118–20
genomic segmentation algorithm 69–72
genomic signature 68
geometric distribution 28–9
Gibbs distribution 47–9, 51
Gibbs free energy 38
Gibbs sampler 47–52, 112
global pairwise sequence alignment 88–111
Gotoh algorithm 82, 100–2
Gotoh theorem 96
gradient descent method 54, 180
GU base pairs 205, 209
guanine 8
guide RNA (gRNA) 13, 20, 120–3, 123–8

H

Haemophilus influenzae 67, 68
Hamming distance 205
Hart–Istrail approximation algorithm 234–42
heteropolymer protein model 231
hidden Markov models (HMM) 117, 175–99
  applications 193–7
  exercises 197–8
  urn model 176
Homo erectus 135
Homo habilis 135
homologous modeling 201
homologous proteins 83–4
homology testing 81
hydrocarbon molecule 4
hydrogen bonds 3, 9, 17
hydrophilic amino acid 229
hydrophilic molecules 3
hydrophobic amino acid 229
hydrophobic force 4, 17
hydrophobic molecules 4
hydroxyl group 4, 5
hypergeometric distribution 32

I

information (entropy) 62
information flow 2
information theoretic entropy 62–3
interaction graph 248–9
inter-chromosomal events 119
internal energy 66
intra-chromosomal events 119
inversion 119

J

Jaccard's index 76
Jensen–Shannon divergence 69–70

K

Kececioglu, Li, Tromp algorithm 118
Kececioglu theorem 116
Kronecker δ-function 144, 158

L

L. tarentolae 121
Lagrange multipliers 53–4, 59, 63, 64, 132, 219, 264
lattice connectivity constant 236
lattice models of proteins 228–34
Lawrence, Altschul, Boguski, Liu, Neuwald, Wootton algorithm 113
least common ancestor 154
likelihood 177–80
  recursive definition 160–2
linking number 214
local alignments 111
local move set 231–2
log odds ratios 86

M

majority consensus tree 170–1
Mamitsuka's MA algorithm 191–3
Mamitsuka's updates 192–3
Markov chain 38–43, 127, 140, 141, 220
  definition 176
  irreducible 39
  reversible 42
  stationary 39, 42
Markov chain Monte Carlo algorithm 43
Markov matrix 141
Markov model 125
  definition 177
  order 176
Markov process 140
Markov property 140–1, 176
Markov random fields 47–51
mathematical concepts 23–79
mathematical models 23
maximal entropy probability distribution 65

Page 284

maximum entropy 66
maximum likelihood estimation 52–3, 117, 157–66, 184
maximum-likelihood estimation, pair probabilities 132–3
maximum-weight trace 114–17
mean square difference 56
meiosis 12, 21
messenger RNA (mRNA) 13, 20, 120
Methanococcus jannaschii 1, 2, 9, 67–70, 266
methionine 21
metric 147
  definition 90
Metropolis et al. theorem 46
Metropolis–Hastings algorithm 35, 37, 43–7
mitochondrial DNA (mtDNA) 136, 140
mitosis 12, 21
molecular biology
  exercises 21–2
  overview 1–22
molecular fossils 13
Monte Carlo algorithm 43, 220
Monte Carlo applications 55–60
Moore automaton 125, 127
motifs 16
multiloops 207
multinomial coefficients 24
multinomial distribution 28
multiple sequence alignment 111–18, 193
multiregional model 135
multivariate function 186–7
mutations 137, 138
Mycoplasma genitalium 68

N

Needleman–Wunsch algorithm 107
Needleman–Wunsch edit distance 91–4
neighbor relation 166
neighborhood system 44
net pairwise potential 225
neutral networks 203, 205
neutral substitutions 139
non-covalent bond 3
normal distribution 30–1
normalized specific amino acid distance frequency 225
NP-hardness 258
nuclear magnetic resonance (NMR) studies 226
nucleic acids 6–13
nucleotide entropy 66–8
nucleotide sequences 66, 139–44
nucleotides 4–8
  forms 8
Nussinov–Jacobson matrix 208

O

odds ratio 86
oligonucleotides 6
open reading frame (ORF) 12
operational taxonomic unit (OTU) 137
ordering constraints 248
organic chemistry 3
overlay matrices 100

P

pair group method (PGM) 148
pair probabilities, maximum-likelihood estimation 132–3
PAM matrices 86–8, 139, 140
parallel β sheets 16
parallel mutations 137, 138
partition function 43, 48, 65
PDB format 266
peptide bond 14
percent minimization 59
performance, definition 234–6
periodicity rules 122
persistence, definition 39
phosphodiester bond 8
phylogenetic trees 136, 145, 148
pivot moves 232
Poisson distribution 29–30, 34
Poisson process 138
polar requirement 17
polarity index 58
polymer, definition 4
polysaccharides 4
positive transition matrix 42
potential energy function 48
primary structure 17, 202
principle of insufficient reason 63
probability density function 25
probability distributions 27–38
probability function 24

Page 285

probability theory 23–53
  exercises 72–3
prokaryotes 1, 19, 20
protein 2
protein data bank (PDB) 266
protein folding problem 201
  see also folding
protein motifs 194–5
protein structure 15–17
  prediction 201–62
protein threading 202, 246–59
  definition 246–9
proteins 14–19
Protokarya 2
Pulley Principle 162
purines 8
pyrimidines 8

Q

quaternary structure 17
quartet puzzling step 166–70
quartet trees 166–8

R

Ramachandran plot 15
random boolean cellular automaton 74
random sequence 118
random variables 25–6, 31
reciprocal translocation 119
record-to-record Travel algorithm (RRT) 55
recursion equation 92, 95, 104–7
re-estimation of parameters 180
reference amino acid distance frequency 224
relative threading 253
restriction enzymes 81–2
reverse transcriptases 83–4
reversible Markov process 158
ribose 7
ribosomal RNA (rRNA) 13, 21
ribosomes 21
RNA 2, 13
RNA polymerase 19, 195
RNA secondary structure 202–13
root mean square deviation (RMSD) 156
roulette wheel technique 61

S

Saccharomyces cerevisiae 266
saddlepoint 52
salt bridges 17
SCOP database 266
scoring a model 177–80
scoring function 249, 259
scoring matrices 84–6
scoring subsequence 111
secondary structure 17, 202
  elements 16
segment algorithm 71
segmentation algorithm 32
selenocysteine 56
sequence alignment 81–134
  example 83–4
  exercises 128–32
sequence space 205
Shannon entropy function 64
Shannon's formula 62
shape space 205
shuffle algorithm 61
shuffled-codon codes 56, 58
similarity methods 110–11
simulated annealing 43–4, 46, 220
  heuristics related to 54–5
Sinclair theorem 43
single-molecule DNA sequencing 117
small molecules 4–6
small nuclear RNA (snRNA) 13
Smith–Waterman local sequence alignment 120
spacing constraints 248
specific amino acid distance frequencies 225
standard deviation 26
standard error 31
statistical model 175
statistical significance 69
StatSignificance algorithm 71
Steiner sequences 117–18
Stirling's approximation 146
Stirling's formula 24–5, 62
stochastic matrix 38
Strimmer, von Haeseler algorithm 168
structure prediction 201–62
  constraint-based 243–6
  exercises 259–62
sugar molecule 4
sugar transport proteins 195
sugars 6

Page 286

sum-of-pairs multiple sequence alignment problem 114
supercoiled DNA 218, 220
supersecondary structures 16
SWISS-PROT 266
synonymous substitutions 139
syntenic distance 119, 120
synteny 119

T

tandem duplication 99–110
TATA box 12, 19, 195–6
taxon 137
Taylor expansion 29, 143
tertiary structure 17, 201, 202
thermal luminescence 135
threading sets 253
threshold accepting (TA) algorithm 54–5
thymine 8
topological neighbors 229
total free energy 220
total probability formula 25
trace matrix 93, 98
traceback 93, 94, 98, 107, 179, 180
transcription 19–21
transfer RNA (tRNA) 13, 20–1
transition probability functions 141
transitional mutations 140
transitions 110, 127–8
translation 19–21
transposition 119
transversion 110
transversional mutations 140
tree 145
  likelihood 159–60
  topology 166
Trypanosoma brucei 1
trypanosomes 123–8

U

ultrametric trees 147–52
Unger–Moult hybrid genetic algorithm 233
unit evolutionary time 138
units of measurement 263–4
UPGMA 148–9, 152, 154–5, 157
uracil 8

V

variance 26
Viterbi algorithm 180
Viterbi score of a model 179

W

WAC matrix 139
water molecule 4
Waterman, Smith and Beyer theorem 95–6
Watson–Crick base pairs 121, 124, 205, 209
Watson–Crick model 8
Watson–Crick rules 8
web sites 266
WPGMA 151, 156
Wraparound Dynamic Programming 107, 108
wraparound step 101, 102