English Pages 206 [232] Year 2013
Basic Bioinformatics Second Edition
S. Ignacimuthu, s.j.
Alpha Science International Ltd. Oxford, U.K.
Basic Bioinformatics Second Edition 242 pgs. | 54 figs. | 16 tbls.
S. Ignacimuthu, s.j. Director, Entomology Research Institute, Loyola College, Chennai
Copyright © 2013 ALPHA SCIENCE INTERNATIONAL LTD. 7200 The Quorum, Oxford Business Park North, Garsington Road, Oxford OX4 2JZ, U.K.
www.alphasci.com All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the publisher. ISBN 978-1-84265-804-8 E-ISBN 978-1-84265-978-6 Printed in India
Dedicated to Rev. Fr. Adolfo Nicolas, S.J., the Superior General of the Society of Jesus
Preface to the Second Edition
As I thank the readers for their tremendous support for my book 'Basic Bioinformatics', I am happy to bring out the second edition for their benefit. In recent years bioinformatics has been gaining importance. Being an interface between modern biology and informatics, it involves the discovery, development and use of computational algorithms and software tools that facilitate an understanding of biological processes, with the goal of serving healthcare and other sectors of human endeavour. From the time Paulien Hogeweg and Ben Hesper coined the word bioinformatics in 1978 to refer to the study of information processes in biotic systems, rapid developments have taken place in mapping and analyzing DNA and protein sequences, developing new databases, aligning and comparing different sequences, viewing 3-D models of protein structures, studying molecular interactions and carrying out drug discovery analyses.
I am immensely happy to present the revised edition, which includes up-to-date basic information relating to different areas of bioinformatics along with some procedures that offer hands-on experience. I am sure that students and teachers will greatly benefit from this book.
S. Ignacimuthu, s.j.
Preface to the First Edition
Bioinformatics is an interdisciplinary subject. It is the science of using information to understand biology. In bioinformatics, biology, computer science and mathematics merge into a single discipline. Strictly speaking, bioinformatics is a large subset of computational biology: the application of information technology to the management of biological data. Biological data are being produced at a phenomenal rate, as seen in the genomic repositories of nucleic acid and protein sequences. The three-fold aim of bioinformatics includes the organization and preservation of data, the development of tools and resources, and the analysis of data and interpretation of results using those tools. Thus it is the science of storing, extracting, organizing, analyzing, interpreting and utilizing biological information.
Since the beginning of the 1990s, many laboratories have been analyzing the full genomes of several species such as bacteria, yeast, mice and humans. Owing to these collaborative efforts, an enormous amount of data has been collected and stored in databases, most of which are publicly accessible. These data, chiefly nucleotide and amino acid sequences, have to be analysed in order to understand their relevance. Mining these immense storehouses of data to secure vital information for research and product development is one of the activities of bioinformatics. Bioinformatics not only provides the theoretical background and practical tools for scientists to analyze proteins and DNA but also helps in sequence homology analysis and drug design.
Two principal approaches underpin all studies in bioinformatics. The first is that of comparing and grouping the data according to biologically meaningful similarities; the second is that of analyzing one type of data to infer and understand the observations for another type of data. The types of analysis that are carried out are: alignment, multiple alignment, database searches, detection of signals, patterns or maps in DNA or protein sequences, open reading frame identification and secondary structure prediction.
Keeping in view the wide applicability of bioinformatics in different areas, it is important to prepare well-trained human resources to face the challenges of the post-genomic era. This book is intended to give the basics of biological concepts, biological databases and internet-based bioinformatics tools. We are hopeful that this book will cater to the immediate needs of students, researchers, faculty members and the pharmaceutical industry.
S. Ignacimuthu, s.j.
Acknowledgements
I am thankful to the many friends who constantly encouraged me to write this book. I am grateful to Dr. C. Muthu for typesetting the manuscript and getting it ready for publication. Let me also thank Mr. R. Mahimairaj for preparing the illustrations and Mr. A. Stalin for helping to verify the web addresses. I am indebted to various publishers and authors for permitting me to use some of the illustrations and explanations from their books. Let me congratulate the publishers on their good work.
S. Ignacimuthu, s.j.
Contents
Preface to the Second Edition
Preface to the First Edition
Acknowledgements
1. History, Scope and Importance
   1.1 Important Contributions
   1.2 Sequencing Development
   1.3 Aims and Tasks of Bioinformatics
   1.4 Application of Bioinformatics
   1.5 Challenges and Opportunities
   Study Questions
2. Computers, Internet, World Wide Web and NCBI
   2.1 Computers and Programs
   2.2 Internet
   2.3 World Wide Web
   2.4 Browsers and Search Engines
   2.5 EMBnet and SRS
   2.6 NCBI
   Study Questions
3. DNA, RNA and Proteins
   3.1 Background
   3.2 DNA
   3.3 RNA
   3.4 Transcription and Translation
   3.5 Proteins and Amino Acids
   Study Questions
4. DNA and Protein Sequencing and Analysis
   4.1 Genomics and Proteomics
   4.2 Genome Mapping
   4.3 DNA Sequencing Method
   4.4 Open Reading Frame (ORF)
   4.5 Determining Sequence of a Clone
   4.6 Expressed Sequence Tags
   4.7 Protein Sequencing
   4.8 Gene and Protein Expression Analysis
   4.9 Human Genome Project
   Study Questions
5. Databases, Tools and their Uses
   5.1 Importance of Databases
   5.2 Nucleic Acid Sequence Databases
   5.3 Protein Sequence Database
   5.4 Structure Databases
   5.5 Bibliographic Databases and Virtual Library
   5.6 Specialized Analysis Packages
   5.7 Use of Databases
   Study Questions
6. Sequence Alignment
   6.1 Algorithm
   6.2 Goals and Types of Alignment
   6.3 Study of Similarities
   6.4 Scoring Mutations, Deletions and Substitutions
   6.5 Sequence Alignment Methods
   6.6 Pairwise Alignment
   6.7 Multiple Sequence Alignment
   6.8 Algorithms for Identifying Domains within a Protein Structure
   6.9 Algorithms for Structural Comparison
   6.10 Carrying Out a Sequence Search
   Study Questions
7. DNA and Protein Sequences
   7.1 Gene Prediction Strategies
   7.2 Protein Prediction Strategies
   7.3 Protein Prediction Programs
   7.4 Molecular Visualization
   Study Questions
8. Homology, Phylogeny and Evolutionary Trees
   8.1 Homology and Similarity
   8.2 Phylogeny and Relationships
   8.3 Molecular Approaches to Phylogeny
   8.4 Phylogenetic Analysis Databases
   Study Questions
9. Drug Discovery and Pharmainformatics
   9.1 Discovering a Drug
   9.2 Pharmainformatics
   9.3 Search Programs
   Study Questions
Appendix: List of Important Websites and Web Addresses
Glossary
References
Index
CHAPTER 1
History, Scope and Importance
In its broadest sense, the term bioinformatics can be considered to mean information technology applied to the management and analysis of biological data. From 1950 onwards, large amounts of sequence data related to various living organisms have been collected and stored in databases. Since it is not practical to compare sequences of several hundred nucleotides or amino acids by hand, several computational techniques were developed. Where data can be amassed faster than they can be analyzed and utilized, there is a great need for professionals who can use software to digest this ever-growing mass of information.
Definitions
Bioinformatics is defined in various ways. Some of the definitions are as follows:
(i) Bioinformatics is the use of computers in solving information problems in the life sciences; mainly it involves the creation of extensive electronic databases on genomes and protein sequences. Secondarily it involves techniques such as the three-dimensional modeling of biomolecules and biological systems.
(ii) Bioinformatics is the computational management of all kinds of biological information, including genes and their products, whole organisms or even ecological systems.
(iii) Bioinformatics is an integration of mathematical, statistical and computational methods to analyse biological, biochemical and biophysical data. It deals with methods of storing, retrieving and analyzing biological data, such as nucleic acid and protein sequences, structures, functions, pathways and genetic interactions.
(iv) Bioinformatics is the storage, manipulation and analysis of biological information via computer science. Bioinformatics is an essential infrastructure underpinning biological research.
(v) Bioinformatics is the application of computational techniques and technologies to analyse and maintain biological data.
1.1 IMPORTANT CONTRIBUTIONS
The following chronological list gives the developments that contributed to the emergence of bioinformatics.
1866 Gregor Mendel published the results of his investigations of the inheritance of 'factors' in pea plants.
1869 F. Miescher discovered DNA (published in 1871); he also suggested that genetic information may exist in the form of a molecular text.
1928 Erwin Schrödinger proposed that this factor is of 1000 angstroms.
1933 Tiselius introduced a new technique known as electrophoresis for separating proteins in solution.
1938 Astbury and Bell suggested that the bases form the long scroll of DNA on which is written the pattern of life.
1944 Avery et al. established the genetic role of DNA.
1947 The first sequencing of a pentapeptide, gramicidin S, was done by Consden et al.
1949 The A=T and G=C rule was discovered by Chargaff et al.
1951 Pauling and Corey proposed the structures of the alpha helix and beta sheet of the polypeptide chain of proteins.
• Reconstruction of a partial 30-residue sequence of insulin by Sanger and Tuppy.
1952 Rosalind Franklin and Maurice Wilkins used X-ray crystallography to reveal the repeating structure of DNA.
1953 Watson and Crick proposed the double helix model for DNA.
1954 Perutz's group developed heavy atom methods to solve the phase problem in protein crystallography.
1955 F. Sanger announced the sequence of bovine insulin.
1957 Arthur Kornberg produced DNA in a test tube.
1958 The first integrated circuit was constructed by Jack Kilby at Texas Instruments.
• The Advanced Research Projects Agency (ARPA) was formed in the USA.
1962 Zuckerkandl and Pauling initiated studies on the variability of sequences and evolution.
1963 The Ramachandran plot or Ramachandran diagram was developed by G.N. Ramachandran, C. Ramakrishnan and V. Sasisekharan. They also discovered the triple helical structure of collagen.
1965 M. Dayhoff observed that many amino acids were replaced in evolution not in a random way but with specific preferences.
1968 Werner Arber, Hamilton Smith and Daniel Nathans described uses of restriction enzymes.
• Packet-switching network protocols were presented to ARPA.
1969 Linking computers at Stanford and UCLA created the ARPANET.
1970 The details of the Needleman-Wunsch algorithm for sequence comparison were published.
• A.J. Gibbs and G.A. McIntyre described a new method for comparing two amino acid and nucleotide sequences using a dot matrix.
1971 Ray Tomlinson (BBN) invented the email program.
1972 Gatlin offered the first information-theoretical treatment of the sequence.
• Wireframe models of biological molecules were presented by Levinthal and Katz.
• Paul Berg made the first recombinant DNA molecule using the ligase enzyme.
• Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism.
1973 Joseph Sambrook and his team refined the DNA electrophoresis technique using agarose gel.
• Stanley Cohen cloned DNA.
• The Brookhaven Protein Data Bank was announced.
• Robert Metcalfe described Ethernet in his Ph.D. thesis.
1974 Charles Goldfarb invented SGML (Standardized General Markup Language).
• Vint Cerf and Robert Kahn developed the concept of connecting networks of computers into an 'internet' and developed the Transmission Control Protocol (TCP).
1975 P.H. O'Farrell announced two-dimensional SDS polyacrylamide gel electrophoresis.
• E.M. Southern published experimental details for Southern blot analysis.
• Bill Gates and Paul Allen founded Microsoft Corporation.
1976 Prosite database was reported by Bairoch et al.
• The Unix-To-Unix Copy Protocol (UUCP) was developed at Bell Labs.
1977 Frederick Sanger, Allan Maxam and Walter Gilbert pioneered DNA sequencing.
• The full description of the Brookhaven PDB was published by F.C. Bernstein et al.
1978 The first Usenet connection was established between Duke and the University of North Carolina at Chapel Hill by Tom Truscott, Jim Ellis and Steve Bellovin.
1980 Mark Skolnick, Ray White, David Botstein and Ronald Davis created an RFLP marker map of the human genome.
• The first complete gene sequence for an organism (ΦX174) was published.
• Wüthrich et al. published a paper detailing the use of multidimensional NMR for protein structure determination.
• IntelliGenetics Inc. was founded in California. Their primary product was the IntelliGenetics Suite of programs for DNA and protein sequence analysis.
• The Smith-Waterman algorithm for sequence alignment was published.
• The US Supreme Court held that genetically modified bacteria are patentable.
1981 IBM introduced its personal computer to the market.
• Human mitochondrial DNA was sequenced.
• D. Benson, D. Lipman and colleagues developed a menu-driven program called GENINFO to access sequence databases.
• Maizel and Lenk developed various filtering and color display schemes that greatly increased the usefulness of the dot matrix method.
1982 The first recombinant DNA-based drug was marketed.
• Genetics Computer Group (GCG) was created as a part of the University of Wisconsin at the Wisconsin Biotechnology Center.
1983 The Compact Disk (CD) was launched.
• Name servers were developed at the University of Wisconsin.
1984 Jon Postel's Domain Name System (DNS) was placed on-line.
• Apple Computer announced the Macintosh.
1985 Kary Mullis invented PCR.
• The FASTP algorithm was published.
• Robert Sinsheimer made the first proposal for the Human Genome Project.
1986 Thomas Roderick coined the term Genomics to describe the scientific discipline of mapping, sequencing and analyzing genes.
• Amoco Technology Corporation acquired IntelliGenetics.
• The SWISS-PROT database was created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL).
• Leroy Hood and Lloyd Smith automated DNA sequencing.
• Charles DeLisi convened a meeting to discuss the possibility of determining the nucleotide sequence of the human genome.
• NSFnet debuted.
1987 The United States Department of Energy (US DoE) officially began the human genome project.
• The physical map of E. coli was published by Y. Kohara et al.
• The use of yeast artificial chromosomes (YAC) was described by David T. Burke et al.
1988 Pearson and Lipman published the FASTA algorithm.
• The National Center for Biotechnology Information (NCBI) was established at the National Library of Medicine in the US.
• PERL (Practical Extraction and Report Language) was released by Larry Wall.
• The United States National Institutes of Health (US NIH) took over the genome project with James Watson at the helm.
• The Human Genome Initiative was started.
• Des Higgins and Paul Sharp announced the development of CLUSTAL.
• A new program, an internet computer virus designed by a student, infected 6000 military computers in the USA.
1989 NIH established the National Center for Human Genome Research.
• The Genetics Computer Group became a private company.
• Oxford Molecular Group Ltd (OMG), founded in Oxford, UK, created products such as Anaconda, Asp, Cameleon and other molecular modeling, drug design and protein design products.
1990 The BLAST programme to align DNA sequences was developed by Altschul et al.
• Michael Levitt and Chris Lee founded Molecular Applications Group in California.
• InforMax was founded in Bethesda, MD.
• The HTTP 1.0 specification was published.
• Tim Berners-Lee published the first HTML document.
1991 CERN, Geneva announced the creation of the protocols which make up the World Wide Web.
• Craig Venter invented expressed sequence tag (EST) technology.
• Incyte Pharmaceuticals, a genomics company, was formed in California.
• Myriad Genetics Inc. was founded in Utah with the goal of discovering major common disease genes and their related pathways.
• Linus Torvalds announced a Unix-like operating system which later became Linux.
1992 Human Genome Sciences, Maryland, was formed by William Haseltine.
• Craig Venter established The Institute for Genomic Research (TIGR).
• Mel Simon and coworkers (Caltech) invented BACs, crucial for clone-by-clone gene assembly.
• The Wellcome Trust joined the human genome project.
1993 Francis Collins took over the Human Genome Project. The Sanger Centre was opened in the UK. Other nations joined in the effort. 2005 was projected as the completion year.
• CuraGen Corporation was formed in New Haven, CT.
1994 Netscape Communications Corporation was founded and released Navigator.
• Attwood and Beck published the PRINTS database of protein motifs.
• Gene Logic was formed in Maryland.
1995 Researchers at The Institute for Genomic Research published the first genome sequence of a free-living organism: Haemophilus influenzae.
• Patrick Brown and Stanford University colleagues invented DNA microarray technology.
• Microsoft released version 1.0 of Internet Explorer.
• Sun released version 1.0 of Java and Netscape released version 1.0 of JavaScript; version 1.0 of Apache was released.
• The Mycoplasma genitalium genome was sequenced.
1996 The genome of Saccharomyces cerevisiae was sequenced.
• The International Human Genome Project consortium established the 'Bermuda rules' for public data release.
• Prosite database was reported by Bairoch et al.
• Affymetrix produced the first commercial DNA chips.
• The working draft for XML was released by W3C.
• Structural Bioinformatics, Inc. was founded in San Diego, USA.
1997 The genome of E. coli was published.
• Oxford Molecular Group acquired the Genetics Computer Group.
• LION bioscience AG was founded.
• Paradigm Genetics Inc. was founded in North Carolina, USA.
• deCODE genetics mapped the gene linked to pre-eclampsia.
1998 The genomes of Caenorhabditis elegans and baker's yeast were published.
• Craig Venter formed Celera in Maryland.
• Inpharmatica, a new genomics and bioinformatics company, was established by University College, London.
• GeneFormatics, a company dedicated to the analysis and prediction of protein structure and function, was formed in San Diego.
• The Swiss Institute of Bioinformatics was established as a nonprofit foundation.
• NIH began the SNP project to reveal human genetic variation.
• Celera Genomics proposed to sequence the human genome faster and more cheaply than the consortium.
1999 The Wellcome Trust formed the SNP consortium.
• The first human chromosome sequence was published.
2000 The genomes of Pseudomonas aeruginosa, Arabidopsis thaliana and Drosophila melanogaster were sequenced.
• Pharmacopeia acquired Oxford Molecular Group.
2001 Science and Nature published annotations and analyses of the human genome by mid-February.
2002 More genome sequences of other organisms were published.
• Structural Bioinformatics and GeneFormatics merged.
• The full genome sequence of the common house mouse was published.
2004 The Rat Genome Sequencing Project Consortium completed the genome sequence of the Brown Norway laboratory rat.
2005 420,000 VariantSEQr human resequencing sequences were published on the new NCBI Probe database.
2007 A set of 12 closely related Drosophilidae were sequenced.
• Craig Venter published the full diploid genome sequence.
2008 Leiden University Medical Center deciphered the complete DNA sequence of a woman.
• G.P.S. Raghava from IMTECH, India developed software and databases for protein structure prediction, genome annotation and functional annotation of proteins.
All the above-mentioned developments have contributed significantly to the growth of bioinformatics in one way or another.
1.2 SEQUENCING DEVELOPMENT
Before 1945, there was not even a single quantitative analytical method available for any one protein. However, significant progress with chromatographic and labeling techniques over the next decade eventually led to the elucidation of the first complete sequence, that of the peptide hormone insulin. The sequence of the first enzyme, ribonuclease, was complete by 1960. By 1965, around 20 proteins with more than 100 residues had been sequenced, and by 1980 the number was estimated to be around 1500. Today more than 400,000 sequences are available.
Initial Attempts
Initially a majority of protein sequences were obtained by the manual process of sequential Edman degradation-dansylation. A very important step towards the rapid increase in the number of sequenced proteins was the development of automated sequencers which, by 1980, offered a 10^4-fold increase in sensitivity compared to the procedure implemented by Edman and Begg in 1967. The first complete protein sequence assignment using mass spectrometry was achieved in 1979. This technique played a vital role in the discovery of the amino acid γ-carboxyglutamic acid and its location in the N-terminal region of prothrombin.
During the 1960s and 1970s scientists found it difficult to develop methods to sequence nucleic acids. The first techniques to emerge were applicable only to RNA (ribonucleic acid), especially transfer RNAs (tRNAs). tRNAs were ideal materials for this early work because they were short (typically 74-95 nucleotides in length) and because it was possible to purify individual molecules.
Advanced Techniques
DNA (deoxyribonucleic acid) consists of thousands of nucleotides, and assembling the complete nucleotide sequence of an entire chromosomal DNA molecule is a very big task. With the advent of gene cloning and PCR, it became possible to purify defined fragments of chromosomal DNA. This paved the way for the development of fast and efficient DNA sequencing techniques. By 1977, two sequencing methods had emerged, using chain termination and chemical degradation approaches. These techniques, with some minor modifications, laid the foundation for the sequence revolution of the 1980s and 1990s and the subsequent birth of bioinformatics.
The polymerase chain reaction (PCR), owing to its sensitivity, specificity and potential for automation, is considered the front-line analytical method for analyzing genomic DNA samples and constructing genetic maps. Over the years, incremental improvements in basic PCR technology have enhanced the power and practice of the technique. Since the introduction of the first semi-automated sequencer in 1987, coupled with the development of PCR in 1990 and fluorescent labeling of DNA fragments generated by the Sanger dideoxy chain termination method, there have been large-scale sequencing efforts which have contributed greatly.
Technologies for capturing sequence information have also advanced over time. In the early 1980s, researchers could use digitizer pens to manually read DNA sequences from gels. Then came image-capture devices, which were cameras that digitized the information on gels. In 1987 Steven Krawetz helped to develop the first DNA sequencing software for automated film readers.
In the early 1990s, J. Craig Venter and his colleagues devised a new method to find genes. Rather than working directly with chromosomal DNA, Venter's group isolated messenger RNA (mRNA) molecules, copied these mRNA molecules into DNA molecules and then sequenced a part of each DNA molecule to create expressed sequence tags, or ESTs. These ESTs could be used as handles to isolate the entire gene. The EST approach has also generated enormous databases of nucleotide sequences. The development of the EST technique is considered to have demonstrated the feasibility of high-throughput gene discovery and to have provided a key impetus for the growth of the genomics industry.
Sequence Deposits
By the start of 1998, more than 300,000 protein sequences had been deposited in publicly available non-redundant databases, and the number of partial sequences in public and proprietary Expressed Sequence Tag (EST) databases was expected to run into millions. By contrast, the number of 3D structures in the Protein Data Bank (PDB) was still less than 20,000.
The United States Department of Energy (DoE) initiated a number of projects in the 1980s to construct detailed genetic and physical maps of the human genome. Their aim was to determine the complete nucleotide sequence of the human genome and to localize the estimated 30,000 genes. Work of such a great dimension required the development of new computational methods for analyzing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analyzing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project (HGP).
Useful Websites
• A very useful guide can be found at: http://www.genome.gov/Education/
• An overview of the role, history and achievements of the US Department of Energy in the HGP can be found at: http://genomics.energy.gov/
• The Genome Annotation Consortium (GAC) provides comprehensive sequence-based views of a variety of genomes in the form of an illustrated guide, with progress charts, etc.: http://www.geneontology.org/GO.refgenome.shtml
• Mapping and sequencing of the genomes of a variety of organisms have been taken up, as described at: http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/prim2.html
1.3 AIMS AND TASKS OF BIOINFORMATICS
The underlying principle of bioinformatics is that biological polymers such as nucleic acid molecules and proteins can be transformed into sequences of digital symbols. Moreover, only a small alphabet of symbols is required to represent the nucleotide and amino acid monomers. This ability to analyze biomolecules with the help of a limited alphabet has allowed bioinformatics to flourish. The growth and performance of bioinformatics rely on developments in computer hardware and software. The simplest tasks in bioinformatics concern the creation and maintenance of databases of biological information. Essentially bioinformatics has three components: (i) the creation of databases allowing the storage and management of large biological data sets, (ii) the development of algorithms and statistics to determine relationships among members of large data sets and (iii) the use of these tools for the analysis and interpretation of various types of biological data, including DNA, RNA and protein sequences, protein structures, gene expression profiles and biochemical pathways.
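As a minimal sketch of this principle (not taken from the book), the Python fragment below treats a DNA molecule as a string over the four-letter alphabet A, C, G, T and a protein as a string over the twenty one-letter amino acid codes. The function names and the 2-bit numeric codes for the bases are assumptions made only for illustration.

```python
# Illustrative alphabets: one-letter codes for the four DNA bases
# and the twenty standard amino acids.
DNA_ALPHABET = frozenset("ACGT")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical numeric code for each base (an assumption of this sketch).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def is_valid_sequence(sequence, alphabet):
    """Return True if every symbol of the sequence belongs to the alphabet."""
    return all(symbol in alphabet for symbol in sequence.upper())


def encode_dna(sequence):
    """Convert a DNA string into a list of small integers, one per base."""
    if not is_valid_sequence(sequence, DNA_ALPHABET):
        raise ValueError("sequence contains symbols outside A, C, G, T")
    return [BASE_TO_CODE[base] for base in sequence.upper()]


if __name__ == "__main__":
    print(is_valid_sequence("GATTACA", DNA_ALPHABET))
    print(encode_dna("GATTACA"))
```

Once a molecule is held in this form, storing it in a database or comparing it with another sequence becomes an ordinary string-processing problem.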
Aims
The aims of bioinformatics are as follows:
(i) To organize data in a way that allows researchers to access existing information and to submit new entries as they are produced.
(ii) To develop tools and resources that aid in the analysis of data.
(iii) To use these tools to analyze the data and interpret the results in a biologically meaningful manner.
Tasks
The tasks in bioinformatics involve the analysis of sequence information. This process involves:
• Identifying the genes in the DNA sequences from various organisms.
• Developing methods to study the structure and/or function of newly identified sequences and corresponding structural RNA sequences.
• Identifying families of related sequences and developing models.
• Aligning similar sequences and generating phylogenetic trees to examine evolutionary relationships.
Besides these, one of the important dimensions of bioinformatics is identifying drug targets and pointing out lead compounds.
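The first task above, locating candidate genes, can be illustrated in a very simplified form by scanning a DNA sequence for open reading frames (ORFs): stretches that begin with the start codon ATG and run in the same reading frame to a stop codon. The sketch below is an assumption-laden illustration, not a gene finder: it looks only at the forward strand, ignores introns and regulatory signals, and the function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here starts at an ATG and ends just after the first in-frame
    stop codon; `end` is exclusive, so dna[start:end] includes the stop.
    `min_codons` is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATGGTAGCC"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

Real gene prediction programs combine this kind of scan with statistical models of coding regions, so the output here should be read only as a list of candidates.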
Areas
Bioinformatics deals with the following areas:
(i) Handling and management of biological data, including its organization, control, linkages, analysis and so on.
(ii) Communication among people, projects and institutions engaged in biological research and applications. The communication may include e-mail, file transfer, remote login, computer conferencing, electronic bulletin boards, or the establishment of web-based information resources.
(iii) Organization, access, search and retrieval of biological information, documents and literature.
(iv) Analysis and interpretation of biological data through computational approaches including visualization, mathematical modeling and the development of algorithms for highly parallel processing of complex biological structures.
1.4 APPLICATION OF BIOINFORMATICS
Biocomputing has found application in many areas. Apart from providing the theoretical background and practical tools for scientists to explore proteins and DNA, it also helps in many other ways. In understanding the meaning of sequences, two distinct analytical themes have emerged: (i) in the first approach, pattern recognition techniques are used to detect similarity between sequences and hence to infer related structures and functions and (ii) in the second, ab initio prediction methods are used to deduce 3D structures, and ultimately to infer function, directly from the linear sequence. The direct prediction of protein three-dimensional structure from the linear amino acid sequence is a major objective of bioinformatics.
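The pattern recognition theme can be illustrated with a short motif search. The sketch below, which is an illustration and not a method described in the book, looks for the widely cited N-glycosylation consensus pattern, written N-{P}-[ST]-{P} in PROSITE notation (asparagine, any residue except proline, serine or threonine, any residue except proline). The function name is hypothetical, and real pattern databases use richer descriptions than a single regular expression.

```python
import re

# N-{P}-[ST]-{P} expressed as a regular expression. The lookahead (?=...)
# lets overlapping occurrences be reported as separate hits.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return the 0-based start positions where the pattern occurs."""
    return [match.start() for match in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # The second candidate site (N-V-S-P) is rejected because of the proline.
    print(find_motif("MNGTANVSP"))
```

A hit of this kind only suggests a possible functional site; it still has to be confirmed experimentally or by further analysis.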
1.4.1 Sequence Homology Analysis
One of the driving forces behind bioinformatics is the search for similarities between different biomolecules. Apart from enabling the systematic organization of data, the identification of protein homologues has some direct practical uses. Theoretical models of proteins are usually based on experimentally solved structures of close homologues. Wherever biochemical or structural data are lacking, studies can be carried out in lower organisms such as yeast and the results applied to homologues in higher organisms such as humans. This also simplifies the problem of understanding complex genomes by analyzing simple organisms first and then applying the same principles to more complicated ones. It can further help in identifying potential drug targets by checking for homologues of essential microbial proteins.
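A simple way to quantify the similarity mentioned above is the percentage of identical positions between two sequences that have already been aligned. The sketch below assumes that the two input strings are the rows of a pairwise alignment of equal length, with '-' marking gaps; the function name and the choice to skip gapped columns are assumptions of this illustration, and other conventions for the denominator exist.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue                # ignore columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTGA"))
```

High identity between two proteins is commonly taken as evidence that they may be homologues, although identity alone does not prove common ancestry.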
1.4.2 Drug Design
The adoption of a bioinformatics-based approach to drug discovery provides an important advantage. With bioinformatics, genotypes associated with pathophysiologic conditions can be defined, which might lead to the identification of potential molecular targets. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and on the basis of sequence similarity the structure of the specific protein can be modeled on experimentally characterized structures. Finally, docking algorithms can be used to design molecules that could bind to the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.
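The translation step mentioned above can be sketched in a few lines. This is an illustrative example rather than any particular translation package: it assumes the standard genetic code, reads a single forward frame starting at the first base, and stops at the first stop codon. The function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA (or RNA) sequence in frame 1 up to the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are reported as 'X'.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAGTAA"))  # prints MAK
```

In practice all six reading frames are usually translated, because the correct frame of a newly determined nucleotide sequence is not known in advance.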
1.4.3 Predictive Functions
Through large-scale screening of data, one can address a number of evolutionary, biochemical and biophysical questions. We can identify (a) specific protein folds associated with certain phylogenetic groups, (b) commonality between different folds within particular organisms, (c) the degree of folds shared between related organisms, (d) the extent of relatedness derived from traditional evolutionary trees, and (e) the diversity of metabolic pathways in different organisms. One can also integrate data on protein functions, given the fact that particular protein folds are often related to specific biochemical functions. Combining expression information with structural and functional classifications of proteins, one can predict the occurrence of a protein fold in a genome, which is indicative of high expression levels. In conjunction with structural data, one can compile a map of all protein-protein interactions in an organism.
1.4.4 Medical Areas
Applications in medical sciences have centered on gene expression analysis. This usually involves compiling expression data for cells affected by different diseases and comparing the measurements against normal expression levels. Identification of genes that are expressed differently in affected cells provides a basis for explaining the causes of illness and highlights potential drug targets. With this information, one could design compounds that bind to the expressed protein. Given a lead compound, microarray experiments can be used to evaluate responses to pharmacological intervention; they can also provide early tests to detect or predict the toxicity of trial drugs. If bioinformatics is combined with experimental genomics, many advances could be made that would revolutionize future healthcare programs. These include postnatal genotyping to assess susceptibility or immunity to specific diseases and pathogens; prescription of a unique combination of vaccines; minimizing the healthcare costs of unnecessary treatments; and anticipating the onset of diseases later in life, which could lead to guidance on nutrition and early detection of illness. In addition, drug-based treatments could be tailored specifically to the patient and the disease, thus providing the most effective course of medication with minimal side effects. The Human Genome Project will benefit forensic science and the pharmaceutical industry, aid the discovery of beneficial and harmful genes, and contribute to a better understanding of human evolution, the diagnosis of disease and disease risk, the genetics of response to therapy and customized treatment, the identification of drug targets, and gene therapy.
1.4.5 Intellectual Property Rights
Intellectual Property Rights (IPR) are an essential part of today's business. IPRs are the means to protect any intangible asset. Examples of IPR are patents, copyrights, trademarks, geographical indications and trade secrets. A patent is an exclusive monopoly granted by the Government to an inventor over his invention for a limited period of time. The major areas of bioinformatics that need intellectual property protection are (a) analytical and information management tools (e.g. modeling techniques, databases, algorithms, software, etc.), (b) genomics and proteomics and (c) drug discovery/design.
Innovations
The majority of bioinformatics innovations involve applications of computer-implemented protocols or software in collecting and/or processing biological data. These inventions fall within the general category of computer-related inventions, that is, inventions implemented in a computer and inventions employing computer-readable media. They have two aspects: (a) software and (b) hardware. For example, a computer-based system for identifying new nucleotide sequence clusters from a given set of nucleotide sequences on the basis of sequence similarity may comprise an input device, a memory and a processor as the hardware components of the system, and a data set or a set of operating instructions stored in the memory and operable by the processor as the software of the system. Patent protection would be invaluable in protecting methods that use computational power, such as sequence alignments, homology searches and metabolic pathway modeling.
Genomics and Proteomics
Genomics involves the isolation and characterization of a gene and the assignment of a function or use to the gene sequence, i.e., either expression of a particular protein or identification of the gene as a marker for a particular disease. This work involves a great deal of laboratory experimentation as well as computational techniques. These techniques can also be protected under IPR. Proteomics involves the purification and characterization of proteins using technologies such as 2D-electrophoresis, multidimensional chromatography and mass spectrometry. Applying these techniques to characterize a protein and to relate it to a particular disease (i.e. as a marker) is challenging and time-consuming, and needs heavy investment.
Drug design by modeling, which involves computers and computation, can also be protected under IPR. Table 1.1 gives some examples of patents in bioinformatics.

Table 1.1. Some examples of patents in bioinformatics
No. | Code number | Specific title
1. | US 6,355,423 | Methods and devices for measuring differential gene expression
2. | US 6,334,099 | Methods for normalization of experimental data
3. | US 5,579,250 | Method of rational drug design based on ab initio computer simulation of conformational features of peptides
4. | WO 98/15652 | DNA sequencing and RNA sequencing using sequencing enzyme
5. | EP 1108779 | Spatial structures of at least one polypeptide
6. | EP 0807687 | Recombinant protease purification and computer program for use in drug design

1.5
CHALLENGES AND OPPORTUNITIES
There are numerous challenges: (i) We must be able to deal with increasingly complex data and to integrate data sources into a single system. (ii) Diverse types of data must be handled simultaneously to provide a better understanding of what genes do. (iii) Data have to be annotated, filtered and visualized better. (iv) Genomics and gene expression data have to be integrated more effectively. (v) Better methods have to be developed to predict the structures of proteins from their sequences. (vi) Better methods have to be designed to identify drug candidates. There are numerous opportunities as well: (i) Trained and skilled bioinformaticists are needed by many bioinformatics and drug companies. (ii) Research and academic institutions are looking for trained people. (iii) Trained people will be useful in the identification of useful genes leading to the development of new gene products. (iv) Skilled bioinformaticists will contribute greatly to genomics and proteomics research. (v) Bioinformaticists will help in revolutionizing drug development and gene therapy. (vi) Bioinformaticists will be able to analyze the patterns of gene expression with computer algorithms. (vii) Bioinformaticists will help to understand toxic responses and to predict toxicity.
STUDY QUESTIONS
1. What is bioinformatics?
2. What is the contribution of Rosalind Franklin and Maurice Wilkins?
3. Who produced the first recombinant DNA organism?
4. Who invented the e-mail program?
5. When was the Compact Disk (CD) launched?
6. Who developed the BLAST program?
7. Who published the PRINTS database?
8. In which year were the human genome annotations published?
9. Write a short history of sequencing.
10. What are the aims of bioinformatics?
11. What are the tasks in bioinformatics?
12. What are the various applications of bioinformatics?
13. What is a patent?
14. Give some examples of patents in bioinformatics.
CHAPTER 2
Computers, Internet, World Wide Web and NCBI
Computers are now an integral part of the biological world, and without them advances in biology and medicine would undoubtedly be greatly hindered. Computers are essential for the management of ever-growing biological data. The Internet is a communication revolution, and the Web has been instrumental in making the Internet a success. It allows the user to move freely anywhere on this single largest information highway. Computers handle large quantities of data and help in probing the complex dynamics observed in nature. The data can be organized in flat files and spreadsheets, and they can be stored in hierarchical files and relational files.
2.1
COMPUTERS AND PROGRAMS
A computer is an electronic machine that stores information and processes it in binary form. It can perform mathematical operations and symbol processing. Its circuits are made up of transistors, capacitors and resistors. Bioinformatics would not be possible without advances in computing hardware and software. Fast and high-capacity storage media are essential to store information. Information retrieval and analysis require programs. Software is a collective term for the various programs that can run on computers. Hardware refers to physical devices such as the processor, disk drives and monitor. Software is divided into two categories: system software and application software. System software comprises the computer's operating system and any other programs required to run applications, while application software is installed by the user for specific purposes. Computer programs are written in a variety of programming languages: machine code, assembly languages and higher-level languages. Programs written in assembly or higher-level programming languages must be converted into machine code by assembly or compilation, respectively.
In Windows, files in machine code are known as executable files, and in UNIX systems they are known as executable images. These are run by the computer's processor. Scripts are files executed by another program. Microsoft Visual Basic, JavaScript and PERL are scripting languages.
Programming Languages
There are many programming, scripting and markup languages that are popular with bioinformaticists. HTML is a language used to specify the appearance of a hypertext document, including the positions of hyperlinks; it is a markup language, not a programming language. JavaScript is a popular scripting language that adds to the functionality of hypertext documents, allowing web pages to include such features as pop-up windows, animations and objects that change in appearance when the mouse cursor moves over them. Java is a versatile and portable programming language designed to generate applications that can run on all hardware platforms. The syntax of Java is derived from C and C++. Java is different from JavaScript. Java applets can be embedded in hypertext documents. PERL (Practical Extraction and Report Language) is a versatile scripting language which is widely used in the analysis of sequence data. XML (Extensible Markup Language) allows files to be described in terms of the type of data they contain. PERL and PYTHON are among the most suitable languages for bioinformatics work because of their efficiency and their ability to meet the diverse functional requirements of the field. PERL was developed by Larry Wall, drawing on languages such as sed, awk, the UNIX shell and C. PERL is excellent at pattern matching, has a flexible syntax and requires relatively little code to accomplish a task. It is good at string processing, i.e. at tasks such as sequence analysis and database management. It handles memory allocation automatically. It integrates smoothly with UNIX-based systems. It is freely available on the Internet to download and compile, and can be obtained from its home page: http://www.perl.org/. PYTHON is a complete object-oriented scripting language developed by Guido van Rossum and first released in 1991. It has tools for quick and easy generation of graphical user interfaces, a library of functions for structural biology and a mature library for numerical methods.
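The kind of pattern matching for which scripting languages are valued in sequence analysis can be illustrated with a short example. The sketch below uses Python's standard re module; the function name and the test sequence are made up for illustration, and GAATTC is used simply as a familiar motif (the recognition site of the restriction enzyme EcoRI).

```python
import re
from typing import List


def find_motif(sequence: str, pattern: str) -> List[int]:
    """Return the 1-based start position of every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets the regex engine report overlaps.
    regex = re.compile("(?=(" + pattern + "))", re.IGNORECASE)
    return [match.start() + 1 for match in regex.finditer(sequence)]


if __name__ == "__main__":
    dna = "ttGAATTCaagaattcGG"           # hypothetical sequence
    print(find_motif(dna, "GAATTC"))     # prints [3, 11]
    print(find_motif(dna, "GA[AT]TTC"))  # a character class allows A or T at position 3
```

Because the pattern is an ordinary regular expression, degenerate motifs can be written directly with character classes, which is one reason string-oriented languages suit this kind of work.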
Bioinformatic Sequence Markup Language (BSML) graphically describes genetic sequences and methods for storing and transmitting encoded sequence and graphic information. Biopolymer Markup Language (BIOML) is a data type definition for the annotation of molecular biopolymer sequence information and structure data.
Operating Systems
The operating system is a master program that manages all peripheral hardware and allows other software applications to run. BIOS (Basic Input/Output System) is a low-level operating system which is largely or entirely in firmware (i.e. software stored in read-only memory).
BIOS handles activities such as deciding what to do when the computer is switched on after a cold start, reading from and writing to disks, responding to input, displaying readable characters on the monitor and producing diagnostics. The higher-level operating system then takes over, and the computer presents a typical graphical user interface (GUI) such as Windows. Files that contain instructions for the operating system are called batch files in Windows and shell scripts in UNIX systems. Windows, developed by Microsoft Corporation, is the most familiar operating system on home and office PCs. Most commercial workstations and servers run under variations of an operating system called UNIX. GNU and LINUX largely conform to UNIX standards. The operating system allows one to access the available files and programs. UNIX is a powerful operating system for multi-user computing environments. The software that powers the web was invented on UNIX. UNIX is rich in commands and possibilities, which include everything from networking software to word processing software and from e-mail to newsreaders. It also provides free access to downloading of programs installed on UNIX systems. UNIX has many varieties and versions. LINUX is regarded as an open source version of UNIX, as it can be downloaded and installed free of cost. Under LINUX, PCs prove to be highly flexible and useful workstations. LINUX also comes with important packages for computational biology. IBION is a recent, complete and self-contained bioinformatics system. It is a ground-breaking server, an appliance for bioinformatics that combines the Apache web server, a PostgreSQL relational database and the R statistical language on an Intel-based hardware system with preinstalled LINUX and a comprehensive suite of bioinformatics tools and databases. Usually computer software is obtained on floppy disks or compact disks (CDs). A file is downloaded when it is copied from a remote source onto a local computer.
A file is uploaded when it is copied from a computer’s hard drive to a remote source. Downloading from the internet is achieved in the following three ways: (i) directly from a hypertext document, (ii) from an FTP server or (iii) by e-mail.
2.2
INTERNET
The interplay between the Internet, the World Wide Web, and the global network of biological information and service providers has made the bioinformatics revolution possible. The Internet is a global network of computers and computer networks that links government, academic and business institutions. This allows computers to talk to each other in their own electronic languages. Biological information is stored on many different computers around the world. The easiest way to access this information is to join all those computers in a network.
Computers are connected in a variety of ways, most commonly by telephone cables and satellite links, thus allowing data to be exchanged between remote users. In order to function effectively, the networks share a communication protocol called Transmission Control Protocol/Internet Protocol, better known as TCP/IP. TCP determines how data are broken into packets and reassembled. IP determines how the packets of information are addressed and routed over the network. Such a shared pattern of communication means that different types of machines are able to speak to each other in a common way. Computers within the network are referred to as nodes, and these communicate with each other by transferring data packets. For transfer, data are first broken into small packets (units of information), which are sent independently and reassembled when they arrive at their destination. Packets do not necessarily travel directly from one machine to another; they may pass through several computers en route to their final destination. Even if some of the nodes on the way are down, the network protocols are designed to find an alternative route, since many different routes are available.
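The idea of breaking data into packets that travel independently and are reassembled on arrival can be illustrated with a toy model. This is only a sketch of the principle, not of real TCP, which numbers bytes rather than packets and also handles acknowledgement, retransmission and error checking. The function names and the packet size are arbitrary choices made for the example.

```python
import random
from typing import List, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


def split_into_packets(data: bytes, size: int) -> List[Packet]:
    """Cut the data into numbered chunks of at most `size` bytes."""
    return [(number, data[start:start + size])
            for number, start in enumerate(range(0, len(data), size))]


def reassemble(packets: List[Packet]) -> bytes:
    """Put the packets back in order by sequence number and join the payloads."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)


if __name__ == "__main__":
    message = b">seq1\nATGGCCAAGTAA\n"
    packets = split_into_packets(message, size=4)
    random.shuffle(packets)  # packets may arrive in any order
    print(reassemble(packets) == message)  # prints True
```

The sequence number carried by each packet is what allows the receiver to restore the original order regardless of the route each packet took.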
Access
The Internet provides a means to distribute software and enables researchers to perform sophisticated analyses on remote servers. Until the late 1980s, there were three main ways of accessing databases over the Internet: electronic mail servers, File Transfer Protocol (FTP) and TELNET servers. E-mail serves as a means of communicating text messages from one's computer to some other computer. FTP is a means of transferring computer files, such as programs, from remote machines. TELNET is an Internet protocol that allows users to connect to computers at remote locations and use these computers as if they were physically operating the remote hardware. Electronic mail services allow researchers to send an e-mail query to the mail server's Internet address. The query is then processed by the server, and the result is sent back to the sender's mailbox. However, this approach had its disadvantages: querying was limited and error-prone, and it took too much time. With File Transfer Protocol, the researcher could download entire databases and search them locally. The drawback was that the researcher had to download every database again after each update. TELNET allows a user to log on to a remote computer and access its facilities. This method is useful for occasional queries. Its disadvantages are the extensive management of user identifications that it requires and the overloading of the remote computer's processing power.
Origin
The true origins of the Internet lie in a research project on networking at the Advanced Research Projects Agency (ARPA) of the US Department of Defense in 1969, named ARPAnet. The original ARPAnet connected, for the first time, four nodes at different sites in the western United States, with the immediate goal of rapid exchange of scientific data on defense-related research between laboratories. In 1981, BITnet (Because It's Time) was introduced, providing point-to-point connections between universities for the transfer of electronic mail and files. In 1982, ARPA introduced TCP/IP, allowing different networks to be connected to, and to communicate with, one another.
Address
Once the machines on a network have been connected to one another, there must be an unambiguous way to specify a single computer so that messages and files actually find their intended recipient. To facilitate communication between nodes, each computer on the Internet is given a unique identifying number (its IP address), which identifies only that machine. It is written in dotted decimal format. For example, one node on the Internet might have the IP address 130.14.25.1. These numbers represent the particular machine, the site where the machine is located, and the domain (and subdomain) to which the site belongs. They help computers in directing data. An alternative, hierarchical domain-name system has also been implemented, which makes Internet addresses easier to decipher. For example, ncbi.nlm.nih.gov represents the above numbers, meaning the National Center for Biotechnology Information (NCBI), at the National Library of Medicine (NLM), at the National Institutes of Health (NIH), at a government site (gov). A complete list of domain suffixes, including country codes, can be found at http://www.christcenteredstore.com/international_domain_extensions_and_suffixes.htm and http://iwantmyname.com/domains/domain-name-registration-list-of-extensions.
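The two forms of address can be taken apart with a few lines of code. The sketch below uses Python's standard ipaddress module to show that a dotted decimal address is simply a convenient way of writing one 32-bit number, and a small helper (a hypothetical name chosen for this example) to list the levels of a domain name from the most general to the most specific. No network lookup is performed, so nothing here checks that the name and the number actually belong to the same machine.

```python
import ipaddress
from typing import List


def octets(address: str) -> List[int]:
    """Return the four decimal fields of an IPv4 address as integers."""
    # ip_address() raises ValueError if the text is not a valid address.
    return list(ipaddress.ip_address(address).packed)


def domain_levels(hostname: str) -> List[str]:
    """List the labels of a host name from the top-level domain down to the machine."""
    return hostname.lower().rstrip(".").split(".")[::-1]


if __name__ == "__main__":
    print(octets("130.14.25.1"))                      # prints [130, 14, 25, 1]
    print(int(ipaddress.ip_address("130.14.25.1")))   # the same address as one integer
    print(domain_levels("ncbi.nlm.nih.gov"))          # prints ['gov', 'nih', 'nlm', 'ncbi']
```

Reading the domain name from right to left gives the hierarchy described in the text: government site, NIH, NLM, and finally NCBI.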
Connectivity
Normally we connect to the Internet through a modem, which transmits data over the existing twisted-pair copper cables that carry telephone signals. Data transfer rates using a modem are relatively slow (28.8 to 56 kilobits per second [kbps]). A number of newer technologies are available for faster transfer of data. Integrated Services Digital Network (ISDN) is one such technology, but it is costly. A more cost-effective alternative uses the capacity of television coaxial cables that is not needed for television signals and is hence free to transmit data at high speed (4.0 megabits per second [Mbps]). Later, digital subscriber line (DSL), with speeds of up to 7 Mbps, and asymmetric DSL (ADSL) became available. Some of the newer technologies involve wireless and satellite connections to the Internet.
Most people use the Internet for electronic mail (e-mail), newsgroups, file transfer and remote computing. E-mail and newsgroups deal with communication between individuals and groups; file transfer and remote computing involve the use, for example, of the File Transfer Protocol (FTP) to transfer files between machines, and the Telnet protocol, by which users may connect to computers at different sites and use the machines as if physically present at the remote location. The most exciting use of the Internet is communication between users in real time. This includes the UNIX talk protocol (or VMS phone), which is analogous to holding a telephone conversation, except that users speak to each other by typing into a shared screen. An extension of this concept is conferencing, whereby groups of people meet and 'talk' to each other, again by typing into a shared interface.
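The practical meaning of the transfer rates quoted in this section is easiest to see with a small calculation. The sketch below estimates how long a file takes to transfer at each rate. It assumes that 1 kbps is 1,000 bits per second and 1 Mbps is 1,000,000 bits per second, ignores protocol overhead and network congestion, and uses a hypothetical file size of 3 million bytes.

```python
def transfer_seconds(size_bytes: int, rate_bits_per_second: float) -> float:
    """Idealized transfer time: total bits divided by the line rate."""
    return size_bytes * 8 / rate_bits_per_second


# Rates taken from the text; decimal prefixes are assumed.
RATES = {
    "modem, 56 kbps": 56_000,
    "cable, 4 Mbps": 4_000_000,
    "DSL, 7 Mbps": 7_000_000,
}

if __name__ == "__main__":
    file_size = 3_000_000  # hypothetical file of 3 million bytes
    for name, rate in RATES.items():
        print(f"{name}: {transfer_seconds(file_size, rate):.1f} s")
```

Under these assumptions the file needs about seven minutes over a 56 kbps modem but only 6 seconds over a 4 Mbps cable connection.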
2.3
WORLD WIDE WEB
The World Wide Web (WWW) is a way of exchanging information over the Internet using a program called a browser. The WWW was conceived and developed in 1989 at CERN, the European laboratory for particle physics, to allow information sharing between internationally dispersed groups in the high-energy physics community. This led to a medium through which text, images, sounds and videos could be delivered to users on demand, anywhere in the world. The concept of information sharing between remote locations, with its ramifications for rapid data dissemination and communication, found immediate applications in numerous other areas. As a result, the web spread quickly and is now making a profound impact in the field of bioinformatics. Today, the WWW is the most advanced information system deployed on the Internet. The web is a hypermedia-based information system. It has become so popular and powerful that it is almost synonymous with the Internet itself. The WWW is a collection of web pages from all over the world. The introduction of GOPHER and WAIS (Wide Area Information Server) in the early 1990s widened the choice of methods for accessing databases. The World Wide Web, invented by Tim Berners-Lee at CERN in 1990, superseded both of these protocols. The WWW greatly enhanced the power of cross-referencing by providing active integration of databases over the Internet, thus eliminating the need to download and maintain local copies of databases. A researcher could now easily navigate across database entries through active hypertext cross-references, with the guarantee of retrieving the latest information. The first molecular biology web server to be set up was ExPASy (Expert Protein Analysis System), established in 1993 by the Geneva University Hospital and the University of Geneva.
Web Pages and Websites
Web pages are the documents that appear in the web browser window when we surf the WWW. Each document displayed on the web is called a web page, and all of the related web pages on a particular server are collectively called a website. Their content is similar to that of plain text documents, except that they are much more flexible, as they may contain links to other pages and files around the world. A website is thus a collection of related web pages stored on one computer. Each website on the Internet has a unique address. The most important feature of a web page is its links. A link in a web page allows one to jump to another page anywhere in the current website, or to a page on another website anywhere in the world. The greatest asset of the WWW is its simplicity: it provides access to static pages with highlighted text that, at a click of the mouse, allows users to traverse related pages of widely dispersed information.
Object Web
The object web is designed to support highly functional and interactive systems. It is a multi-tier architecture that contains two objects and a communication layer. One object may represent the user interface, and the other may provide some computation. To communicate between the two objects, it is necessary to define the messages they might receive. The messages between two or more objects are mediated by a special piece of code, an Object Request Broker (ORB), on each machine, which is capable of understanding the message definitions and able to translate them into the specific language of each object. With the object web, a system can be broken down into its constituent components, written in different languages and running on different hardware systems. The Common Object Request Broker Architecture (CORBA) provides the standards that make this communication possible. It provides a language to define the structure of the messages, the Interface Definition Language (IDL), and the architecture for the mediators, the ORBs. ORBs transparently hide all of the communication between distributed objects, and form the backbone (the wiring) of the object web.
2.4
BROWSERS AND SEARCH ENGINES
The full potential of the Internet was realized only with the advent of browsers, which for the first time allowed easy access to information at different sites. Browsers are clients that communicate with servers, using a set of standard protocols and conventions. The first point of contact between a browser and a server is the home page. Once the browser has loaded its initial page, it provides an easy-to-use interface with which to retrieve documents, access files, search databases, and so on. Some of the most commonly used browsers, past and present, are Firefox, Chrome, Safari, Opera, Lynx, Mosaic, Netscape Navigator and Internet Explorer. Search engines are programs that find web pages matching a user's query. Many general-purpose search engines, such as Google, Yahoo and Microsoft's Bing, are very useful.
Lynx and Mosaic
Lynx was developed in the Academic Computing Services at the University of Kansas, USA, as part of an effort to construct a campus-wide information system. It runs on UNIX or VMS operating systems, providing a text-only interface via low-cost dumb terminals, such as the ubiquitous VT 100 terminal (or emulator). Mosaic was developed in 1993 at the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, USA. As a hypermedia system designed for the X Window System, Apple Mac and Microsoft Windows platforms, it provided a single, user-friendly interface to the diverse protocols, data formats and information servers available throughout the Internet.
Netscape Navigator and Internet Explorer
Netscape Navigator was developed in 1994 by Netscape Communications Corporation, Mountain View, California, USA. It was prepared as an alternative to Mosaic, and it soon became the most popular package of its time for browsing information on the Internet. Later versions of the software included facilities such as Internet e-mail, frames, real-time communication, audio-video support and technology to support the creation of visually exciting, fully interactive pages (e.g. with Java applets). Internet Explorer was developed in 1995 by Microsoft Corporation, Redmond, USA. It was based on NCSA Mosaic and is designed to work with PC-based operating systems. It offers the familiar functionality of other hypermedia browsers, including support for frames, Java and ActiveX. Users can navigate by clicking on specific text, buttons, or pictures. These clickable items are collectively known as hyperlinks.
Hyperlinks
Hyperlinks are usually highlighted in some way, either by using a different color from the main body of the text or by being boxed. Selecting a highlighted link calls up the linked document, regardless of its location, whether on the same server or on a server in a different country. Communication between hyperlinked documents is transparent to the user. Each hypertext document has a unique address known as a uniform resource locator (URL). URLs take the format http://restofaddress. Here http stands for Hypertext Transfer Protocol, the communication protocol used by web servers, and the rest of the address provides a location for the hypertext document on the Internet.
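The parts of a URL can be separated programmatically. The sketch below uses urlparse from Python's standard urllib.parse module on a URL that appears later in this chapter; the helper function is a hypothetical name introduced only for the example.

```python
from typing import Dict
from urllib.parse import urlparse


def describe_url(url: str) -> Dict[str, str]:
    """Split a URL into the protocol, the server name and the document path."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,  # e.g. http
        "server": parts.netloc,    # the machine that holds the document
        "path": parts.path,        # location of the document on that machine
    }


if __name__ == "__main__":
    print(describe_url("http://www.embnet.org/about/members"))
```

For this URL the protocol is http, the server is www.embnet.org and the path is /about/members, which together make up the "rest of address" described above.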
HTML
Hypertext documents are written in a standard markup language known as Hypertext Markup Language or HTML. HTML code is strictly text-based, and any associated graphics or sounds for that document exist as separate files in a common format. Markup instructions (tags) permit the web author to render text in bold type (the <B> tag), to insert horizontal rules (<HR>), images (<IMG>), and so on; modes such as bold type are switched off with the corresponding closing tag (e.g. </B>). Another technology that supports the creation of a functional genetic data warehouse is XML. XML stands for Extensible Markup Language. Like HTML, XML can be used to build web pages. XML tags data in a way that any application can use. It provides a general language for representing data in a standard format. It allows files to be described in terms of the types of data they contain. XML is more flexible and robust than HTML. It provides a method for defining the meaning, or semantics, of a document. It has the advantage of controlling not only how data are displayed on a WWW page, but also how the data are processed by another program or by a database management system (DBMS).
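The difference between marking up appearance and marking up meaning can be seen in a small XML example. The sketch below uses Python's standard xml.etree.ElementTree module to write and read back a sequence record. The tag and attribute names (sequence, organism, residues, id, type) are invented for this illustration and are not taken from BSML, BIOML or any other published schema.

```python
import xml.etree.ElementTree as ET


def build_record(seq_id: str, organism: str, residues: str) -> str:
    """Return an XML document in which each piece of data is labeled by its type."""
    record = ET.Element("sequence", attrib={"id": seq_id, "type": "dna"})
    ET.SubElement(record, "organism").text = organism
    ET.SubElement(record, "residues").text = residues
    return ET.tostring(record, encoding="unicode")


def read_residues(xml_text: str) -> str:
    """Recover the sequence itself, using the tag name rather than its position."""
    return ET.fromstring(xml_text).findtext("residues")


if __name__ == "__main__":
    xml_text = build_record("example1", "Homo sapiens", "ATGGCCAAGTAA")
    print(xml_text)
    print(read_residues(xml_text))
```

Because the tags say what each item is, a second program can extract the residues without knowing anything about how the record would be displayed.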
2.5
EMBNET AND SRS
Computers store sequence information as simple rows of sequence characters called strings. Each character is stored in binary code in the smallest addressable unit of memory, called a byte. Each byte comprises 8 bits, with each bit having a possible value of 0 or 1, producing 256 possible combinations. A DNA sequence is usually stored and read in the computer as a series of 8-bit words in this binary format. A protein sequence appears as a series of 8-bit words comprising the corresponding binary form of the amino acid letters. Normally DNA and protein sequences are presented in standard ASCII files and in FASTA format.
A network was established in 1988 to link European laboratories that used biocomputing and bioinformatics in molecular biology research. The network, known as EMBnet, was developed to provide information, services and training to users in dispersed European laboratories, via designated nodes operating in their local languages. Its establishment removed the necessity for individual institutions to keep up-to-date copies of a range of biological databases, to install search tools, to buy expensive commercial software packages, etc.
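The storage of sequence characters as 8-bit words, described at the start of this section, can be made concrete with a short example. The sketch below shows the ASCII bit pattern of each base and reads a small FASTA-formatted text into a dictionary. The function names and the example records are hypothetical, and the parser handles only the basic layout of a header line beginning with ">" followed by sequence lines.

```python
from typing import Dict, List


def to_bits(sequence: str) -> List[str]:
    """Return the 8-bit ASCII pattern of each character in the sequence."""
    return [format(byte, "08b") for byte in sequence.encode("ascii")]


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each record name (first word after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed_%d" % (len(records) + 1)
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


if __name__ == "__main__":
    print(2 ** 8)           # number of distinct values of one byte: 256
    print(to_bits("ACGT"))  # ['01000001', '01000011', '01000111', '01010100']
    print(parse_fasta(">seq1 hypothetical example\nATGGCC\nAAGTAA\n>seq2\nMKVL\n"))
```

Since only four letters are needed for DNA, each base could in principle be packed into 2 bits; one full byte per character is simply the convention of plain text files.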
Nodes and Sites
EMBnet now operates 34 nodes. Of these, 20 are designated National Nodes, which have a mandate from their respective nations to provide databases, software and online services (including sequence analysis, protein modeling, genetic mapping, etc.), to offer user support and training, and to undertake research and development. Eight EMBnet nodes are Specialist Nodes. These are academic, industrial or research centers that are considered to have particular knowledge of specific areas of bioinformatics. They are largely responsible for the maintenance of biological databases and software.
A further six sites have been accepted within EMBnet as Associate Nodes. These are biocomputing centers from non-European countries that serve their user communities with the same kinds of service as a typical National Node. Most of these offer up-to-date access to sequence databases and analysis software, together with a variety of tools for molecular modeling, genome management, genetic mapping and so on. Table 2.1 gives a list of EMBnet Associate Nodes.
Table 2.1: EMBnet Associate Nodes
Abbreviation            Country                                    Site
MIPS/GSF                Germany                                    http://mips.gsf.de/
CPGR                    South Africa                               http://www.cpgr.org.za
All other EMBnet nodes  National, Specialist, and Associate Nodes  http://www.embnet.org/about/members
Sequence Retrieval System
The Sequence Retrieval System (SRS) is a network browser for databases in molecular biology. It was developed to help EMBnet users. SRS allows any flat-file database to be indexed to any other. Its advantage is that the derived indices may be searched rapidly, allowing users to retrieve, link and access entries from all the interconnected resources. It can be readily customized to use any defined set of databanks. The system links nucleic acid, EST, protein sequence, protein pattern, protein structure, specialist/boutique and/or bibliographic databases. SRS is thus a very powerful tool, allowing users to formulate queries across a range of different database types via a single interface, without having to worry about underlying data structures, query languages and so on.
SRS is an integrated system for retrieving information from many different sequence databases, and for feeding the retrieved sequences into analytical tools such as sequence comparison and alignment programs. SRS can search a total of 141 databases of protein and nucleotide sequences, metabolic pathways, 3D structures and functions, genomes, and disease and phenotype information. These include many small databases such as the Prosite and Blocks databases of protein structural motifs, transcription factor databases, and databases specialized to certain pathogens.
In addition to the number and variety of databases to which it offers access, SRS offers tight links among the databases and fluency in launching applications. A search in a single database component can be extended to a search in the complete network, i.e., entries in all databases pertaining to a given protein can be found easily. Similarity searches and alignments can be launched directly without saving the responses in an intermediate file. The parent URL of SRS is: http://srs.ebi.ac.uk/
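The idea of indexing one flat-file databank to another can be illustrated with a toy example. This is only a sketch under stated assumptions and not the way SRS itself is implemented: the three databanks, their entry identifiers and the xrefs field are all hypothetical.

```python
from collections import defaultdict

# Toy flat-file entries; identifiers and cross-references are hypothetical.
PROTEIN_DB = {
    "P_EX1": {"name": "example kinase", "xrefs": ["N_EX1", "S_EX1"]},
    "P_EX2": {"name": "example protease", "xrefs": ["N_EX2"]},
}
NUCLEOTIDE_DB = {
    "N_EX1": {"name": "example kinase mRNA", "xrefs": ["P_EX1"]},
    "N_EX2": {"name": "example protease gene", "xrefs": ["P_EX2"]},
}
STRUCTURE_DB = {
    "S_EX1": {"name": "example kinase structure", "xrefs": ["P_EX1"]},
}
DATABANKS = {
    "protein": PROTEIN_DB,
    "nucleotide": NUCLEOTIDE_DB,
    "structure": STRUCTURE_DB,
}

def build_link_index(databanks):
    """Map every entry ID to the set of (databank, ID) pairs linked to it."""
    location = {eid: name for name, db in databanks.items() for eid in db}
    index = defaultdict(set)
    for name, db in databanks.items():
        for eid, entry in db.items():
            for ref in entry["xrefs"]:
                if ref in location:
                    index[eid].add((location[ref], ref))
                    index[ref].add((name, eid))  # links work in both directions
    return index

def linked_entries(index, entry_id, databank):
    """Return the IDs in one databank that are linked to entry_id."""
    return sorted(ref for db, ref in index[entry_id] if db == databank)

index = build_link_index(DATABANKS)
print(linked_entries(index, "P_EX1", "structure"))
print(linked_entries(index, "N_EX2", "protein"))
```

Once the index has been built, following a link is a dictionary lookup rather than a scan of the flat files, which is why derived indices can be searched rapidly.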
2.6 NCBI
The National Center for Biotechnology Information (NCBI) was established in 1988 in the USA as a division of the National Library of Medicine and is located on the campus of the National Institutes of Health in Bethesda, Maryland. The role of the NCBI is to develop new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease. Its specific aims include the creation of automated systems for storing and analyzing biological information, the development of advanced methods of computer-based information processing, the facilitation of user access to databases and software, and the coordination of efforts to gather biotechnology information worldwide. NCBI also maintains GenBank, the NIH DNA sequence database. Groups of annotators create sequence data records from the scientific literature and from information acquired directly from authors, and data are exchanged with the international nucleotide databases, EMBL and DDBJ. All resources are available from the NCBI home page, www.ncbi.nlm.nih.gov.
Entrez
Entrez is the integrated, text-based search and retrieval system of the NCBI. Just as SRS was developed for EMBnet, Entrez was developed at NCBI to allow retrieval of molecular biology data and bibliographic citations from NCBI’s integrated databases. Entrez permits related articles in different databases to be linked to each other, whether or not they are cross-referenced directly. Entrez provides access to DNA sequences (from GenBank, EMBL and DDBJ), protein sequences (from SWISS-PROT, PIR, PRF SEQDB, PDB and translated protein sequences from the DNA sequence databases), genome and chromosome mapping data, 3D protein structures from PDB, and the PubMed bibliographic database. Links between the various databases are a strong point of NCBI’s system.
Entrez is the starting point for the retrieval of sequences and structures. It is a web-based data retrieval system that integrates the information held in all NCBI databases. It is the common front-end to all the databases maintained by the NCBI and it is extremely easy to use. In total, Entrez links to 11 databases (Table 2.2). Entrez can be accessed via the NCBI web site at the following URL: http://www.ncbi.nlm.nih.gov/Entrez/
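Entrez can also be queried by programs rather than through the browser. The sketch below assumes NCBI's E-utilities web interface, which is not described in this chapter; it only builds the request URLs and does not contact the server. The helper names and the example search term are hypothetical, and the current E-utilities documentation should be consulted before relying on any parameter.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption: not given in the text).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL asking one Entrez database for matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"

def build_efetch_url(db, record_id, rettype="fasta"):
    """Build an EFetch URL that retrieves one record, here as FASTA text."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"

# Hypothetical query: human insulin entries in the protein database.
search_url = build_esearch_url("protein", "insulin AND human[Organism]", retmax=5)
print(search_url)
```

Keeping the URL construction separate from the network call makes the query easy to inspect, and the same two helpers work for any Entrez database name passed as db.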
Data Model
The NCBI introduced the use of a data model for sequence-related information. This made possible the rapid development of software and the integration of the databases that underlie the popular Entrez retrieval system and on which the GenBank database is built. The advantage of the model is the ability to move effortlessly from the published literature to DNA sequences, to the proteins they encode, to chromosome maps of the genes, and to the three-dimensional structures of the proteins.
Table 2.2: The databases covered by Entrez, listed by category.
Category                     Databases
1. Nucleic acid sequences    Entrez nucleotides: sequences obtained from GenBank, RefSeq and PDB
2. Protein sequences         Entrez protein: sequences obtained from SWISS-PROT, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
3. 3D structures             Entrez Molecular Modeling Database (MMDB)
4. Genomes                   Complete genome assemblies from many sources
5. PopSet                    From GenBank, sets of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
6. OMIM                      Online Mendelian Inheritance in Man
7. Taxonomy                  NCBI Taxonomy Database
8. Books                     Bookshelf
9. ProbeSet                  Gene Expression Omnibus (GEO)
10. 3D domains               Domains from the Entrez Molecular Modeling Database (MMDB)
11. Literature               PubMed
The NCBI data model deals directly with a DNA sequence and a protein sequence. The translation process is represented as a link between the two sequences rather than as an annotation on one with respect to the other. Protein-related annotations, such as peptide cleavage products, are represented as features annotated directly on the protein sequence. In this way, it becomes very natural to analyze the protein sequences derived from translations of CDS features by BLAST or any other sequence search tool without losing the precise linkage back to the gene. A collection of a DNA sequence and its translation products is called a Nuc-prot set. The NCBI data model also defines a sequence type called a segmented sequence. GenBank, EMBL and DDBJ represent constructed assemblies of segmented sequences as contigs. Entrez shows such an assembly as a line connecting all its component sequences.
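The link between a DNA sequence and its translation product can be sketched with a few small classes. This is an illustrative simplification and not the NCBI specification: the class names, field names, identifiers and sequences below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Bioseq:
    seq_id: str      # identifier of the sequence
    molecule: str    # "dna" or "protein"
    residues: str    # the sequence itself

@dataclass
class CdsFeature:
    """A link between a DNA interval and the protein it encodes."""
    dna_id: str
    start: int       # 0-based start on the DNA sequence
    stop: int        # end of the interval (exclusive)
    protein_id: str

@dataclass
class NucProtSet:
    dna: Bioseq
    proteins: list
    cds_links: list

    def coding_dna_for(self, protein_id):
        """Follow the CDS link from a protein back to its coding DNA."""
        for cds in self.cds_links:
            if cds.protein_id == protein_id:
                return self.dna.residues[cds.start:cds.stop]
        raise KeyError(protein_id)

# Hypothetical sequences and identifiers, invented for illustration.
dna = Bioseq("dna_1", "dna", "GGATGAAATTTTAGCC")
protein = Bioseq("prot_1", "protein", "MKF")
nuc_prot = NucProtSet(dna, [protein], [CdsFeature("dna_1", 2, 14, "prot_1")])
print(nuc_prot.coding_dna_for("prot_1"))
```

Because the protein is a sequence in its own right, it can be passed to a search tool on its own, while the CDS link still records exactly which stretch of DNA it came from.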
Retrieval and Application
There are two main reasons for putting data on a computer: retrieval and discovery. Retrieval is the ability to get back what was put in. Amassing sequence information without providing a way to retrieve it makes the information useless. It is more valuable still to get back from the system more knowledge than was put in, because this helps in making biological discoveries. Scientists can make such discoveries by discerning connections between two pieces of information that were not known when the pieces were entered separately into the database, or by performing computations on the data that offer new insight into the records.
In the NCBI data model, the emphasis is on facilitating discovery; that means the data must be defined in a way that is amenable to both linkage and computation. NCBI uses four core data elements: bibliographic citations, DNA sequences, protein sequences and three-dimensional structures. In 1992, NCBI began assigning GenInfo Identifiers (gi) to all sequences processed into Entrez, including nucleotide sequences from DDBJ/EMBL/GenBank, protein sequences from the translated CDS features, and protein sequences from SWISS-PROT, PIR, PRF, PDB, patents and others. The gi is assigned in addition to the accession number provided by the source database. The gi is simply an integer, sometimes referred to as a GI number. It identifies one particular sequence only, and it is stable and retrievable.
Bioseq
The Bioseq, or biological sequence, is a central element in the NCBI data model. It comprises a single, continuous molecule of nucleic acid or protein, thereby defining a linear, integer coordinate system for the sequence. A Seq-annot is a self-contained package of sequence annotations or information that refers to specific locations on specific Bioseqs. Sequence alignments describe the relationships between biological sequences by designating portions of sequences that correspond to each other. This correspondence can reflect evolutionary conservation, structural similarity, functional similarity or a random event.
ExPASy
The ExPASy (Expert Protein Analysis System) World Wide Web server (http://www.expasy.ch) is a service that has been provided by a team at the Swiss Institute of Bioinformatics (SIB) since 1993. It contains databases and analytical tools related to proteins and proteomics. The databases include SWISS-PROT, TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and SWISS-MODEL. The analytical tools include similarity searches, pattern and profile searches, post-translational modification prediction, topology prediction, primary, secondary and tertiary structure analysis, and sequence alignment.
Procedure
1. Open the internet browser and type the URL address: http://www.expasy.ch.
2. Pull down the drop-down menu at the search option.
3. Select Swiss-Prot/TrEMBL.
4. Type the name of the protein in the TEXT box.
5. Note down the details from the query page, which will show the name of the sequence, the taxonomic classification, the description of the protein, the literature regarding the sequence, etc.
Mirrors and Intranet
Different servers providing the same service are called mirrors. To access a particular website, it is necessary to type the URL in the address bar of the browser. Many academic institutions have an intranet, that is, a local network that can be accessed only from computers within the institution. What makes the web so powerful is its network. Table 2.3 gives a few gateway sites which are comprehensive.
Table 2.3: Some basic sites for beginners of bioinformatics on the www
1. http://www.ncbi.nlm.nih.gov/
2. http://www.ebi.ac.uk/
3. http://www.expasy.ch/
4. http://www.embl.de/
5. http://www.izb.fraunhofer.de/en.html
6. http://themecraft.net/www/bmn.com
Apart from these, there are a great number of specialist sites with biological data which can be accessed. General-purpose search engines such as Google, Yahoo, Bing, AltaVista and HotBot, and reference sites such as Wikipedia, are helpful in finding them.
STUDY QUESTIONS
1. What is a computer?
2. What is software?
3. Give the names of some languages used in computer programs.
4. What are the advantages of PERL?
5. What is the Internet?
6. How does the Internet work?
7. What is the World Wide Web?
8. What are browsers? Give some examples.
9. How does Netscape Navigator work?
10. Give details about EMBnet.
11. How is the Sequence Retrieval System useful in bioinformatics?
12. What is the role of NCBI in maintaining sequence databases?
13. What is the use of Entrez?
14. Explain Bioseq and ExPASy.
CHAPTER 3
DNA, RNA and Proteins
The properties that characterize a living organism (species) are based on its fundamental set of genetic information, its genome. A genome is composed of one or more DNA molecules (RNA in some viruses), each organized as a chromosome. The DNA has encoded in it all the information necessary for the functions of the cell. The DNA sequence determines the protein sequence, the protein sequence determines the protein structure, and the protein structure determines the protein function. Hence it is important to understand the fundamental aspects of DNA, RNA and protein and their interaction.
3.1 BACKGROUND
As early as 1866, Gregor Mendel suggested that factors of inheritance exist in pea plants. At the beginning of the twentieth century, it became clear that Mendel’s factors were related to parts of the cell called chromosomes. Chromosomes are thread-like strands of chemical material located in the cell nucleus. During this time, geneticists also began using the terms ‘inheritance unit’ and ‘genetic particle’ to describe the factors occurring on the chromosomes of Mendel’s pea plants. By the 1920s, these terms had been discarded and the word gene was used, following the suggestion of Wilhelm Johannsen. Scientists viewed the gene as a specific and separate entity located on the cell’s chromosome.
Initial Studies
In 1869, Friedrich Miescher isolated nucleic acid from cell nuclei and named this substance nuclein. Later Phoebus Levene and his coworkers studied the components of nuclein and gave it a more descriptive and technical name, deoxyribonucleic acid (DNA). They also identified ribonucleic acid (RNA) in some organisms. Their analysis revealed that both nucleic acids contain three basic components: (i) a five-carbon sugar, which could be either ribose (in RNA) or deoxyribose (in DNA), (ii) a series of phosphate groups, that is, chemical groups derived from phosphoric acid molecules, and (iii) four different compounds containing nitrogen and having the chemical properties of bases. In DNA the four bases are adenine, thymine, guanine and cytosine; in RNA, they are adenine, uracil, guanine and cytosine. Adenine and guanine are double-ring molecules known as purines; cytosine, thymine and uracil are single-ring molecules called pyrimidines [Fig. 3.1].
Fig. 3.1 The components of nucleic acid. The first component is a phosphate group, a derivative of phosphoric acid composed of phosphorus, oxygen, and hydrogen atoms. The second component is a five-carbon sugar, either deoxyribose (in DNA) or ribose (in RNA). The third is a series of the five nitrogenous bases adenine, guanine, cytosine, thymine, and uracil. Note the presence of nitrogen. The first two bases are known as purines; the last three are pyrimidines.
Advanced Studies
In 1949, Erwin Chargaff reported that in DNA the amount of adenine is always equal to the amount of thymine, regardless of the source of the DNA, and that the amount of cytosine is consistently equal to the amount of guanine. Chargaff’s observations played an important role in the double helix model of DNA proposed by James D. Watson and Francis H.C. Crick, as did the experimental data of Maurice H.F. Wilkins and Rosalind Franklin, which suggested that the DNA molecule was a helix. (In 1962, Watson, Crick and Wilkins were awarded the Nobel Prize in Physiology or Medicine. Franklin had died of cancer in 1958, and because the Nobel Prize is not awarded posthumously, she did not share in the award.)
In 1902, Archibald Garrod postulated that a genetic disease is caused by a change in the ancestor’s genetic material. He also suggested that the disease alkaptonuria occurs due to the lack of an enzyme that breaks down alkapton. (Patients with this disease expel urine that rapidly turns black on exposure to air. The color change takes place because the urine contains alkapton, a substance that darkens on exposure to oxygen. In normal individuals, alkapton [known chemically as homogentisic acid] is broken down to simpler substances in the body, but in persons with alkaptonuria, the body cannot make this transformation, and alkapton is excreted.) In the 1940s, Beadle and Tatum postulated the ‘one gene, one enzyme’ hypothesis, which suggested that the genes of a cell influence the production of cellular enzymes. (An enzyme is a protein that catalyses a chemical reaction of metabolism while itself remaining unchanged.)
Contribution from Biochemists
In the 1940s, biochemists reported that cells undergoing protein synthesis possess an unusually large amount of RNA. They theorized that RNA synthesis could occur in the nucleus and that the RNA could then travel to the cytoplasm, where it would determine the amino acid sequence in the protein. In 1961, F.H.C. Crick and his colleagues reasoned that the genetic code of DNA probably consists of a series of blocks of chemical information, each block corresponding to an amino acid in the protein. They further hypothesized that within a single block a sequence of three nitrogenous bases specifies an amino acid, and they supported this with experiments. For their work on the nature of the genetic code, Marshall Nirenberg and Har Gobind Khorana were awarded the 1968 Nobel Prize in Physiology or Medicine. In the ensuing years, biochemists demonstrated that the genetic code is nearly universal: the same three-base codes specify the same amino acids regardless of whether the organism is a bacterium, a bee or a plant. The essential difference among species of organisms is not the nature of the nitrogenous bases but the sequence in which they occur in the DNA molecule.
Central Dogma
Different sequences of bases in DNA specify different sequences of bases in RNA, and the sequence of bases in RNA specifies the sequence of amino acids in proteins (Fig. 3.2). This is the so-called central dogma of protein synthesis. As the nucleic acids and proteins vary, so does the species of an organism [Fig. 3.3].
[Figure 3.2(a): a DNA double helix is transcribed into a single strand of RNA, which is translated into the polypeptide chain of a protein; for example, the codon A-A-G translates into lysine (Lys).]
[Figure 3.2(b):]
DNA Triplet    RNA Triplet    Amino Acid Specified
TAC            AUG            "Start"
ATC            UAG            "Stop"
AAA            UUU            Phenylalanine
AGG            UCC            Serine
ACA            UGU            Cysteine
GGG            CCC            Proline
GAA            CUU            Leucine
GCG            CGC            Arginine
TTC            AAG            Lysine
TGC            ACG            Threonine
CCG            GGC            Glycine
CTA            GAU            Aspartic acid
Fig. 3.2 Gene expression and protein synthesis. (a) The base code in DNA is used to formulate a base code in RNA by the process of transcription. The RNA molecule is then used in translation to encode an amino acid sequence in a protein. (b) Some selected triplet codes in DNA and RNA and the amino acid specified in the protein. Note that the RNA code (known as a codon) is the complement of the DNA code and that certain codons are "start" or "stop" signals.
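The two steps of the central dogma can be sketched in a short Python example. It is a simplification under stated assumptions: the codon table is only a subset of the standard genetic code, the strands are read left to right without tracking their 5' and 3' ends, and the template sequence is invented for illustration.

```python
# Subset of the standard genetic code (RNA codons); not the full 64-codon table.
CODON_TABLE = {
    "AUG": "Met", "UAG": "Stop", "UUU": "Phe", "UCC": "Ser",
    "UGU": "Cys", "CCC": "Pro", "CUU": "Leu", "CGC": "Arg",
    "AAG": "Lys", "GGC": "Gly", "GAU": "Asp",
}

# Base-pairing rules used when RNA is copied from a DNA template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_dna):
    """Return the RNA that is complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)

def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide

# Hypothetical template strand: TAC AAA AGG TTC ATC
template = "TACAAAAGGTTCATC"
mrna = transcribe(template)
print(mrna)
print(translate(mrna))
```

The start codon AUG also specifies the amino acid methionine, which is why the translated chain begins with Met.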
[Figure 3.3: genomic DNA → (transcription) → mRNA → (translation) → protein.]
Fig. 3.3 The central dogma states that DNA is transcribed into RNA, which is then translated into protein.
3.2 DNA
DNA is a linear, double-helical structure (Fig. 3.4). The double helix is composed of two intertwined chains made up of building blocks called nucleotides (Fig. 3.5). Each nucleotide consists of a phosphate group, a deoxyribose sugar molecule and one of four different nitrogenous bases: adenine, guanine, cytosine or thymine. Each of the four nucleotides is usually designated by the first letter of the base it contains: A, G, C or T.
[Figure 3.4: the DNA double helix, labelled with the wide groove, the narrow groove, a spacing of 0.34 nm between successive nucleotides, 3.4 nm per turn, a strand width of 1.0 nm and a diameter of 2 nm.]
Fig. 3.4 What the X-ray diffraction photographs revealed about DNA. Watson and Crick postulated that DNA is composed of two ribbon-like "backbones" composed of alternating deoxyribose and phosphate molecules. They surmised that nucleotides extend out from the backbone chains and that the 0.34 nm distance represents the space between successive nucleotides. The data showed a distance of 3.4 nm between turns, so they guessed that ten nucleotides exist per turn. One strand of DNA would only encompass 1 nm in width, so they postulated that DNA is composed of two strands to conform to the 2 nm diameter observed in the X-ray diffraction photographs.
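The arithmetic behind the estimate of ten nucleotides per turn can be written out as a small sketch. The dimensions come from Fig. 3.4; the function names are hypothetical, and lengths are held in picometres (1 nm = 1000 pm) so that the division is exact.

```python
RISE_PER_BASE_PAIR_PM = 340   # 0.34 nm between successive nucleotides
PITCH_PM = 3400               # 3.4 nm for one complete turn of the helix

def base_pairs_per_turn():
    """Number of nucleotide pairs that fit into one turn of the helix."""
    return PITCH_PM // RISE_PER_BASE_PAIR_PM

def helix_length_nm(n_base_pairs):
    """Length in nm of an idealized double helix with n_base_pairs pairs."""
    return n_base_pairs * RISE_PER_BASE_PAIR_PM / 1000

print(base_pairs_per_turn())
print(helix_length_nm(1000))
```

The same rise per base pair gives a rough length for any stretch of idealized double helix; a segment of 1000 base pairs, for instance, is about 340 nm long.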
[Figure 3.5: a chain of four nucleotides carrying the bases thymine, adenine, cytosine and guanine, joined through phosphate groups between the 5' and 3' carbon atoms of successive sugars, with a free hydroxyl group (OH) at the 3' end.]
Fig. 3.5 The binding of nucleotides to form a nucleic acid. The phosphate group forms a bridge between the 5' carbon atom of one nucleotide and the 3' carbon atom of the next nucleotide. A water molecule (H2O) results from the union of the hydroxyl group (-OH) formerly at the 3' carbon atom and a hydrogen atom (-H) formerly in the phosphate group. The linkage between nucleotides is a "3'-5' linkage"; the bond is called a phosphodiester bond. Note that the 3' carbon of the lowest nucleotide is available for linking to another nucleotide (this is called the 3' end of the molecule) and that the phosphate group of the uppermost nucleotide can link to still another nucleotide (this is the 5' end).
Each nucleotide chain is held together by bonds between the sugar and phosphate groups that form the backbone of the chain. The two intertwined chains are held together by weak bonds between bases of opposite chains. There is a lock-and-key fit between the bases of the opposite strands, such that adenine pairs only with thymine and guanine pairs only with cytosine. The bases that form base pairs are said to be complementary. DNA is replicated by the unwinding of the two strands of the double helix and the building up of a new complementary strand on each of the separated strands of the original double helix (Fig. 3.6).
[Figure 3.6: a parent DNA molecule unwinds into two old strands; each old strand is paired with a new strand to give two daughter molecules.]
Fig. 3.6 The general plan of DNA replication. The double helix unwinds, and the two 'old' strands serve as templates for the synthesis of 'new' strands having complementary bases.
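Base pairing and replication can be sketched as follows. The example assumes that the two strands run in opposite directions (antiparallel), so the partner strand is written reversed; the example sequence and the function names are hypothetical. The last lines check Chargaff's observation that A equals T and G equals C in double-stranded DNA.

```python
from collections import Counter

PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand):
    """Return the complementary strand, read in the opposite direction."""
    return "".join(PAIRS[base] for base in reversed(strand))

def replicate(strand_a, strand_b):
    """Each old strand serves as template for a new complementary strand."""
    return [
        (strand_a, complement_strand(strand_a)),
        (strand_b, complement_strand(strand_b)),
    ]

old = "GCCGGATAGT"                 # arbitrary example sequence
partner = complement_strand(old)
daughters = replicate(old, partner)
counts = Counter(old + partner)    # base composition of the double helix

print(partner)
print(daughters)
print(counts["A"] == counts["T"], counts["G"] == counts["C"])
```

Both daughter molecules contain the same pair of sequences as the parent, one strand old and one strand new in each.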
An organism’s basic complement of DNA is called its genome. The somatic cells of most plants and animals contain two copies of their genomes; these organisms are diploid. The cells of most fungi, algae, and bacteria contain just one copy of the genome; these organisms are haploid. The genome itself is made up of chromosomes, which contain DNA.
Chromosome
Chromosome literally means colored body. A chromosome is a thread-like structure of chemical material located in the cell nucleus. Genes are encoded in the DNA molecule, which in turn is organized into chromosomes. Based on the organization of their chromosomes, living organisms are classified broadly into prokaryotes and eukaryotes. The prokaryotic chromosome is very simple in organization: it is a single, normally circular, double helix of DNA, and the nuclear material does not have a distinct nuclear membrane. The eukaryotic chromosome is a linear double helix of DNA. The nuclear material has a distinct nuclear membrane and is highly coiled. In diploid cells, each chromosome and its component genes are present twice. For example, human somatic cells contain two sets of 23 chromosomes, for a total of 46 chromosomes. Two chromosomes with the same gene array are said to be homologous. In eukaryotes, chromosomes occur in pairs.
Centromere
Each chromosome has a constriction called the centromere. Depending on the position of the centromere, four types of chromosome are seen. If the centromere is found in the middle of the chromosome, it is the metacentric type. If the centromere is slightly away from the middle, it is the submetacentric type. If the centromere is found at the very end of the chromosome, it is the telocentric type. If the centromere is very close to the tip, it is the acrocentric type. The centromeres are the sites of attachment of the spindle fibres that are formed during cell division. In many species a separate pair of chromosomes is present for sex determination, and these are referred to as sex chromosomes. All the other chromosomes are referred to as autosomes. When the chromosomes of a cytological preparation are photographed and then cut out and arranged according to size, the result is called a karyotype. The presentation of the complete diploid set of chromosomes in a diagrammatic manner is called an ideogram. The end portions of chromosomes, where short multiple repeat sequences of DNA are arranged, are called telomeres.
All living beings contain genetic information in the form of DNA within their cells. A characteristic of all living organisms is that DNA is reproduced and passed on to the next generation. DNA contains instructions for making proteins.
Gene
A gene is a sequence of chromosomal DNA that is required for the production of a functional product: a polypeptide or a functional RNA molecule. A gene includes not only the actual coding sequences but also the adjacent nucleotide sequences required for the proper expression of the gene.
3.3 RNA
RNA is the other major nucleic acid, and it is single-stranded, unlike DNA, which is double-stranded. It contains ribose instead of deoxyribose in its sugar-phosphate backbone, and uracil (U) instead of thymine (T). There are three types of RNA in the cell for use in protein synthesis: messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA). mRNA acts as a template for protein synthesis; rRNA and tRNA form a part of the protein-synthesizing machinery. In eukaryotes, mRNA is produced inside the nucleus by the transcription of protein-coding genes by RNA polymerase II.
In eukaryotic systems the coding sequence in a gene is not continuous as it is in prokaryotes (Fig. 3.7). There are a number of noncoding sequences known as introns interspersed with the coding sequences called exons, the parts of the gene expressed as protein. Introns do not contain information for a functional gene product such as a protein, but they contain switches for genes.
[Figure 3.7: a prokaryote gene (regulatory region for transcription initiation, coding region, transcription termination signals) compared with a eukaryote gene (regulatory region for transcription initiation, coding region of exons interrupted by introns, transcription termination signals).]
Fig. 3.7 Generalized gene structure in prokaryotes and eukaryotes. The coding region is the region that contains the information for the structure of the gene product (usually a protein). The adjacent regulatory regions (light line) contain sequences that are recognized and bound by proteins that make the gene's RNA and by proteins that influence the amount of RNA made. Note that in a eukaryotic gene the coding region is often split into segments (exons) by one or more noncoding introns. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
Pre-mRNA
When RNA polymerase sweeps down a DNA template containing introns and exons, a preliminary mRNA molecule (pre-mRNA) is formed. Processing of the pre-mRNA is therefore required to remove the non-coding introns from it. The introns are removed biochemically, and the exons are spliced together to form the functional mRNA molecule. Splicing makes the coding sequence continuous, and the mRNA emerges as an accurate template for building up the protein (Fig. 3.8).
Fig. 3.8 The formation of mRNA. A gene consists of exons, the parts of the gene expressed as protein, and introns, the intervening sequences between the exons. In the formation of mRNA, the gene is transcribed to a preliminary mRNA molecule. Then the introns are removed biochemically and the exons are spliced together. This activity results in the functional mRNA molecule, which is then ready for translation. This type of processing does not occur in mRNA production in prokaryotic cells such as bacterial cells; it occurs only in eukaryotic cells such as plant, animal, and human cells.
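The removal of introns can be sketched as a simple string operation. The pre-mRNA sequence and the exon coordinates below are hypothetical; in a real gene the exon boundaries are taken from the annotation of the sequence, and the cell uses the splicing machinery rather than fixed coordinates.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based start, exclusive end) in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def introns_between(exons):
    """Return the intervals that lie between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

# Hypothetical pre-mRNA: exon 1 + intron (begins GU, ends AG) + exon 2
PRE_MRNA = "AUGGCC" + "GUAAGUACAG" + "UUUUAA"
EXONS = [(0, 6), (16, 22)]

mature = splice(PRE_MRNA, EXONS)
introns = introns_between(EXONS)
intron_seq = PRE_MRNA[introns[0][0]:introns[0][1]]

print(mature)
print(intron_seq)
```

After splicing, the coding sequence is continuous and can be read codon by codon from the start codon to the stop codon.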
The processing of pre-mRNA also includes modification of the 5' end nucleotide, which is called capping. The 3' end is modified by the addition of a long stretch of about 250 adenine nucleotides. This process is called polyadenylation, and the long tail is called the poly(A) tail (Fig. 3.9). Inside the nucleus, a number of species of mRNA are produced by the action of RNA polymerase II. The mRNA populations inside the nucleus vary in length and in their stage of processing. Such a population is called heterogeneous nuclear RNA (hnRNA).
[Figure 3.9: top line, the gene with its promoter, transcription start site, translation initiation site, 5' UTR, exon 1, intron 1 (GU ... A ... AG), exon 2, intron 2 (GU ... A ... AG), exon 3, translation termination site, polyadenylation signal (AAUAAA), 3' UTR and transcription termination site; below, the primary RNA transcript and the steps that convert it into mature mRNA: addition of the cap, 3' cleavage, addition of the poly(A) tail, and splicing of exons 1, 2 and 3.]
Fig. 3.9 Transcriptional and translational landmarks in a eukaryotic gene with two introns (top line), and the processing of its transcript to make mRNA. Note that since the landmarks shown are relevant to RNA, U is given in the gene sequence instead of T. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
Splicing
Splicing is carried out inside the nucleus by a group of molecules that have a catalytic function similar to that of enzymes. This machinery is composed of small RNA molecules rich in uracil, called U-RNAs or small nuclear RNAs (snRNAs), which together with proteins form small nuclear ribonucleoproteins (snRNPs). There are many snRNPs, such as U1, U2, U4, U5 and U6, that are involved in splicing reactions. The exon-intron junction has a specific nucleotide sequence, which is called the signature sequence. This signature sequence is identified by the snRNPs. The RNA portion of the snRNP interacts with the splice junction nucleotides and base pairs with them. In vertebrate animals a branch point sequence is present. The U1 snRNP binds to the 5' splice site and the U2 snRNP binds to the branch point sequence. The remaining snRNPs, U5 and U4/U6, form a complex with U1 and U2, causing the intron to loop so that the exons come together. The combination of the intron and the snRNPs is called the spliceosome. The spliceosome curls the intron, brings the exon junctions together and joins the exon ends (Fig. 3.10). In some unicellular organisms, instead of snRNPs, the RNA itself carries out the splicing, acting as a ribozyme.
[Figure 3.10: a pre-mRNA with exon 1, an intron (GU ... A ... AG) and exon 2; the spliceosome, composed of five different snRNPs, attaches to the pre-mRNA; splicing reactions 1 and 2 produce the spliced exons and the excised intron as a lariat.]
Fig. 3.10 The structure and function of a spliceosome. The spliceosome is composed of several snRNPs that attach sequentially to the RNA, taking up positions roughly as shown. Alignment of the snRNPs results from hydrogen bonding of their snRNA molecules to the complementary sequences of the intron. In this way the reactants are properly aligned and the splicing reactions (1 and 2) can occur. The P-shaped loop, or lariat structure, formed by the excised intron is joined through the central adenine nucleotide. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
Capping
Capping is a process by which the 5' end of mRNA is protected from exonuclease enzymes. Typically a prokaryotic mRNA remains stable for only a few minutes. In eukaryotes the half-life of mRNA is around 6 h. A nucleotide may be deleted, added or substituted by RNA editing.
tRNA
tRNAs are small, adapter-like linking molecules. The function of a tRNA is to fetch the correct amino acid to the mRNA molecule and add it to the growing polypeptide chain during protein synthesis. Every amino acid has its own tRNA. A tRNA has two ends. One end has the anticodon, which base pairs with the codon of the mRNA. The other end acts as a socket to which the amino acid is attached. According to the sequence of codons in the mRNA, the amino acids are brought in by tRNAs and a specific polypeptide sequence is thus built. tRNA molecules have between 74 and 95 nucleotides. tRNAs are produced in a precursor form called pre-tRNA. Several tRNA genes are transcribed together without interruption by the enzyme RNA polymerase III. A ribonuclease enzyme then cleaves the precursor molecule into individual tRNAs.
Ribosomes Ribosomes are macromolecules composed of both RNA and several polypeptides. They provide a firm platform for protein synthesis. Each ribosome is composed of a large and a small subunit (Fig. 3.11).
Fig. 3.11. Ribosomes contain a large and a small subunit. Each subunit contains rRNA of varying lengths and a set of proteins. There are two principal rRNA molecules in all ribosomes (shown in the column on the left). Ribosomes from prokaryotes also contain one 120-base-long rRNA that sediments at 5S, whereas eukaryotic ribosomes have two small rRNAs: a 5S RNA molecule similar to the prokaryotic 5S, and a 5.8S molecule 160 bases long. The proteins of the large subunit are named L1, L2, etc., and those of the small subunit S1, S2, etc. (Source: Lodish et al., Molecular Cell Biology, Scientific American Books, Inc., 1995).
rRNA Prokaryotic ribosomes are of the 70S type, with subunits of 50S and 30S (S stands for the Svedberg unit, a measure of sedimentation rate). The 50S subunit has two rRNAs and 31 polypeptides. The 30S subunit has a single rRNA and 21 polypeptides. In eukaryotes the ribosomes are of the 80S type, with subunits of 60S and 40S. The 60S subunit has three rRNAs and about 49 polypeptides. The 40S subunit has one rRNA and about 33 polypeptides. RNA polymerase I transcribes the rRNA genes.
In prokaryotes such as E. coli, there are 7 copies of the rRNA genes scattered throughout the genome. Each gene contains one copy each of the 16S, 23S and 5S rRNA sequences arranged consecutively. The gene is transcribed as a single pre-rRNA (30S) molecule, which is processed to produce the individual rRNAs. The pre-rRNA folds into a number of stem-loop structures to which ribosomal proteins bind. During this time some of the nucleotides of the rRNA are methylated. Finally, the ribonuclease RNase III cleaves the precursor and releases the 5S, 23S and 16S RNAs. Mature rRNAs are formed by further trimming at the 5’ and 3’ ends by ribonucleases M5, M16 and M23. In eukaryotes, the sequences of the 28S, 18S and 5.8S rRNAs are present in a single gene. This gene exists in multiple copies separated by short nontranscribed regions. In humans, there are about 200 gene copies occurring in 5 clusters on separate chromosomes. RNA polymerase I transcribes these genes. Transcription takes place in the nucleolus inside the nucleus. In humans the pre-rRNA is 45S in size. As in prokaryotes, it is cleaved by ribonucleases to yield the mature 28S, 18S and 5.8S rRNAs. Small cytoplasmic RNAs (scRNAs) direct protein traffic within the eukaryotic cell.
3.4 TRANSCRIPTION AND TRANSLATION
The biological role of most genes is to carry information specifying the chemical composition of proteins and the regulatory signals that govern their production by the cell. Protein synthesis is initiated by an uncoiling of the DNA double helix and an uncoupling of the two strands of DNA. A functional region of DNA, the gene, is thereby exposed.
Transcription The first step taken by the cell to make a protein is to copy, or transcribe, the nucleotide sequence of one strand of the gene into a complementary single-stranded molecule of ribonucleic acid (RNA) (Fig. 3.12). Free nucleotides present in the region are used for the synthesis, and an enzyme called RNA polymerase joins the nucleotides together to form the RNA molecule.

DNA nontemplate strand  5’ CTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGATCTGCGCTGC 3’
DNA template strand     3’ GACGGTAACAGTCTGTACATATGGGGCATGCAGAAGGGCTCGCTTTTGCTAGACGCGACG 5’
mRNA                    5’ CUGCCAUUGUCAGACAUGUAUACCCCGUACGUCUUCCCGAGCGAAAACGAUCUGCGCUGC 3’
Fig. 3.12. The mRNA sequence is complementary to the DNA template strand from which it is synthesized and therefore matches the sequence of the nontemplate strand (except that RNA has U where DNA has T). The sequence shown here is from the gene for the enzyme β-galactosidase, which is involved in lactose metabolism. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
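The relationship described in Fig. 3.12 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two short strings are the first twelve bases of the strands in the figure.

```python
# Minimal sketch of transcription as a string operation (illustrative only).
# Pairing rule used by RNA polymerase: each template base selects its complement,
# with U in RNA wherever A appears in the template.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') made from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


def nontemplate_to_mrna(nontemplate_5_to_3: str) -> str:
    """The mRNA equals the nontemplate strand with T replaced by U."""
    return nontemplate_5_to_3.upper().replace("T", "U")


template = "GACGGTAACAGT"      # template strand, written 3'->5'
nontemplate = "CTGCCATTGTCA"   # nontemplate strand, written 5'->3'

print(transcribe(template))              # CUGCCAUUGUCA
print(nontemplate_to_mrna(nontemplate))  # CUGCCAUUGUCA
```

Both routes give the same mRNA, which is the point made in the figure caption.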
The production of RNA is called transcription, a word coined by Crick in 1956. The fragments so constructed are known as RNA transcripts. These RNA molecules, together with ribosomal proteins and enzymes, constitute a system that carries out the task of reading the genetic message and producing the protein that the message specifies. The transcription process, which occurs in the cell nucleus, is very similar to the replication of DNA because the DNA strand serves as the template for making the RNA copy. The RNA transcript (which in many species undergoes some structural modifications) becomes a working copy of the information in the gene, a kind of message molecule called messenger RNA (mRNA). The mRNA then enters the cytoplasm, where it is used by the cellular machinery to direct the manufacture of a protein.
Translation The process of producing a chain of amino acids based on the sequence of nucleotides in the mRNA is called translation. The nucleotide sequence of an mRNA molecule is read from one end to the other in groups of three successive bases. These groups of three are called codons (e.g. AUU, CCG, UAC). Because there are four different nucleotides, there are 4 × 4 × 4 = 64 possible codons, each one coding either for an amino acid or for a signal to terminate translation (Table 3.1).

Table 3.1: The genetic code. Notice that an amino acid can be coded by several different codons. A stop codon does not code for an amino acid, but instead signals to the ribosome that this is the end of the protein and that translation should cease.

First                      Second letter                       Third
letter   U            C            A            G              letter
U        UUU Phe      UCU Ser      UAU Tyr      UGU Cys        U
         UUC Phe      UCC Ser      UAC Tyr      UGC Cys        C
         UUA Leu      UCA Ser      UAA Stop     UGA Stop       A
         UUG Leu      UCG Ser      UAG Stop     UGG Trp        G
C        CUU Leu      CCU Pro      CAU His      CGU Arg        U
         CUC Leu      CCC Pro      CAC His      CGC Arg        C
         CUA Leu      CCA Pro      CAA Gln      CGA Arg        A
         CUG Leu      CCG Pro      CAG Gln      CGG Arg        G
A        AUU Ile      ACU Thr      AAU Asn      AGU Ser        U
         AUC Ile      ACC Thr      AAC Asn      AGC Ser        C
         AUA Ile      ACA Thr      AAA Lys      AGA Arg        A
         AUG Met      ACG Thr      AAG Lys      AGG Arg        G
G        GUU Val      GCU Ala      GAU Asp      GGU Gly        U
         GUC Val      GCC Ala      GAC Asp      GGC Gly        C
         GUA Val      GCA Ala      GAA Glu      GGA Gly        A
         GUG Val      GCG Ala      GAG Glu      GGG Gly        G
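The counting argument behind Table 3.1 (4 × 4 × 4 = 64 codons shared among 20 amino acids and the stop signal) can be checked with a short Python sketch. The compact string of one-letter codes is simply the standard genetic code written in the order UUU, UUC, UUA, UUG, UCU, ... GGG, with "*" marking a stop codon; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order produced by
# product("UCAG", repeat=3): UUU, UUC, UUA, UUG, UCU, ... GGG. "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons specify each amino acid (or the stop signal)?
codons_per_residue = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))          # 64 codons in total
print(codons_per_residue["*"])   # 3 stop codons
print(codons_per_residue["L"])   # 6 codons for leucine
print(CODON_TABLE["AUG"])        # M (methionine)
```

Because 61 codons are shared among 20 amino acids, most amino acids are specified by more than one codon.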
Because only 20 kinds of amino acids are used in the polypeptides that make up proteins, more than one codon may correspond to the same amino acid. For example, AUU, AUC and AUA all code for isoleucine, and UUU and UUC both code for phenylalanine. The mRNA molecule consists of a series of codons formed as RNA polymerase moves along the DNA template. In a eukaryotic cell, the mRNA molecule then moves through a pore in the nuclear membrane into the cytoplasm, where it combines with one or more ribosomes. Meanwhile, the different amino acids join with their specific tRNA molecules in the cytoplasm. Once bound together, the tRNA molecules attach to the ribosome on which the mRNA is positioned. One portion of the mRNA molecule attaches to the small ribosomal subunit (30S in prokaryotes), and a tRNA molecule with its amino acid attaches to the large subunit (50S in prokaryotes). During this step, the codon of the mRNA pairs with the complementary anticodon on the tRNA. The codon-anticodon matching brings a specified amino acid into position and thus determines that amino acid’s location in the protein chain. At this precise moment, the genetic code of DNA is expressed as the location of an amino acid in a protein chain. After pairing with the mRNA, the tRNA with its amino acid is held in a viselike grip on the ribosome’s larger subunit. The ribosome then moves along the mRNA to a new location. Here a second tRNA with its amino acid approaches the ribosome and pairs its anticodon with the second codon on the mRNA molecule. Thus, two tRNA molecules and their amino acids stand next to one another on the mRNA. Within milliseconds, an enzymatic activity of the large subunit joins the amino acids together to form a dipeptide (two amino acids in a chain). The first tRNA is now free of its amino acid and moves back to the cytoplasm, leaving its amino acid behind, joined to the second amino acid. Now the ribosome moves to a third location at the third codon of the mRNA.
A new tRNA with its amino acid enters the picture and the process continues, forming a long chain of amino acids called a polypeptide (Figs. 3.13a and 3.13b). Each peptide bond is formed by the removal of water between amino acids (Fig. 3.14). The final one or two codons of the mRNA are chain terminator or ‘stop’ signals. When these codons (UAA, UAG or UGA) are reached, no complementary tRNA molecules exist and no amino acids are added to the chain. Instead, the stop signals activate release factors to discharge the polypeptide chain from the ribosome. The polypeptide then coils to yield the functional protein.
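The codon-by-codon reading described above can be imitated with a small Python function. It is a sketch under simplifying assumptions that are not part of the text: reading starts at the first AUG found in the message, the standard genetic code is used, and the example mRNA is invented for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the order UUU, UUC, UUA, UUG, UCU, ... GGG; "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the message.

    Assumption for this sketch: translation begins at the first AUG.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon, nothing is made
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":             # UAA, UAG or UGA: release the chain
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical message: GG | AUG UUU AUU CCG UAA | GC
print(translate("GGAUGUUUAUUCCGUAAGC"))   # MFIP
```

The output lists the amino acids in one-letter code (Met-Phe-Ile-Pro); the stop codon UAA ends the chain without adding a residue.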
The Nature of Chemical Bonds By definition, elements are substances that cannot be broken down further by chemical reactions. Elements are made of individual atoms, which in turn are made of smaller subatomic particles. These can be separated only by physical processes. Only three subatomic particles (the neutron, the proton and the electron) are stable. The number of protons in the nucleus of an atom determines which element it is. Generally, for every proton in an atomic nucleus there is an electron in orbit around it to balance the electrical charges.
Fig. 3.13a The addition of a single amino acid (aa6), carried by the tRNA at the A site, to the growing polypeptide chain, tethered by the tRNA at the P site, during translation of mRNA. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
[Fig. 3.13b labels: polypeptide chains (aa1 to aa8) growing on two ribosomes; tRNA; mRNA with codons numbered 1 to 11.]
Fig. 3.13b The addition of an amino acid (aa) to a growing polypeptide chain in the translation of mRNA. Multiple copies of the polypeptide are produced by a train of ribosomes following each other along the mRNA; two such ribosomes are shown. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
[Fig. 3.14 labels: (a) amino acids aa1, aa2 and aa3, with side chains R1, R2 and R3, joined between the amino end and the carboxyl end by peptide bonds, with the release of 2 H2O; (b) the peptide group, with bond distances 1.24 (C=O), 1.32 (C-N), 1.46 (N-Cα) and 1.51 (Cα-C).]
Fig. 3.14 The peptide bond (a) A polypeptide is formed by the removal of water between amino acids to form peptide bonds. Each aa indicates an amino acid. R1, R2 and R3 represent R groups (side chains) that differentiate the amino acids. R can be anything from a hydrogen atom (as in glycine) to a complex ring (as in tryptophan). (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002). (b) The peptide group is a rigid planar unit with the R groups projecting out from the C-N backbone. Standard bond distances are shown in angstroms (Source: Stryer, L., Biochemistry, W.H. Freeman and Company, 1995).
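The condensation shown in Fig. 3.14(a) can be written as a balanced equation. The notation below is a schematic sketch for two generic amino acids with side chains R1 and R2, not a reproduction of the figure.

```latex
\mathrm{H_2N{-}CHR_1{-}COOH} \;+\; \mathrm{H_2N{-}CHR_2{-}COOH}
\;\longrightarrow\;
\mathrm{H_2N{-}CHR_1{-}CO{-}NH{-}CHR_2{-}COOH} \;+\; \mathrm{H_2O}
```

Joining n amino acids in a linear chain therefore forms n - 1 peptide bonds and releases n - 1 molecules of water, which is why the three amino acids in the figure release two.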
The higher an atom’s affinity for electrons, the higher its electronegativity. The slight separation of charges within a molecule contributes to hydrogen bonding. Chemicals can be placed in two categories based on their affinity or lack of affinity for water: hydrophilic (literally ‘water friendly’) or hydrophobic (literally ‘afraid of water’).
3.5 PROTEINS AND AMINO ACIDS
Proteins are the molecular machinery that regulates and executes nearly every biological function. Proteins are made up of amino acids. Each amino acid has a backbone consisting of an amine (-NH2) group, an alpha carbon, and a carboxylic acid or carboxylate (-COOH) group. To the alpha carbon, a side chain is attached. The side chains vary with each amino acid and confer unique stereochemical properties on it. The amino acids are often grouped into three categories. (i) The hydrophobic amino acids, which have side chains composed mostly or entirely of carbon and hydrogen, are unlikely to form hydrogen bonds with water molecules. (ii) The polar amino acids, which often contain oxygen and/or nitrogen in their side chains, form hydrogen bonds with water much more readily. (iii) The charged amino acids carry a positive or negative charge at biological pH. The order of the amino acids in a protein’s primary sequence plays an important role in determining its secondary structure, its tertiary structure, its physical and chemical properties and ultimately its biological function. A chain of several amino acids is referred to as a peptide. Longer chains are polypeptides. When two amino acids are covalently joined, one of the amino acids loses a hydrogen (H+) from its amine group, while the other loses an oxygen and a hydrogen (OH-) from its carboxyl group, forming a carbonyl (C=O) group (and water, H2O). The result is a dipeptide (two amino acids joined by a peptide bond) and a single water molecule. In a polypeptide, the amino acids are sometimes referred to as amino acid residues, because some atoms of the original amino acid are lost as water in the formation of the peptide bonds. Polypeptides have a specific directionality. The amino terminus (or N terminus) of the polypeptide has an unbonded amine group, while the carboxy terminus (or C terminus) ends in a carboxylic acid group instead of a carbonyl.
Protein sequences are usually considered to start at the N terminus and progress towards the C terminus. The non-side-chain atoms of the amino acids (the constant region of each amino acid) in a polypeptide chain form the protein backbone. The chemistry of a protein backbone forces most of the backbone to remain planar. The only movable segments of the protein backbone are the bond from the nitrogen to the alpha carbon (the carbon atom to which the side chain is attached) and the bond between the alpha carbon and the carbonyl carbon (the carbon with a double bond to an oxygen atom). These two chemical bonds allow rotation, described by dihedral angles that are called phi (Φ) and psi (Ψ), respectively. Thus a protein consisting of 300 amino acids will have 300 phi and 300 psi angles, often numbered Φ1, Ψ1 up to Φ300, Ψ300. All of the various conformations attainable by the protein come from rotations about these 300 pairs of bonds.
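A dihedral angle such as Φ or Ψ is defined by four consecutive atoms. As a hedged illustration (the function name and the test coordinates are invented here), the following Python sketch computes the dihedral angle from four points using the standard vector formula. For Φ of residue i the four atoms are conventionally C of residue i-1, N, Cα and C of residue i; for Ψ they are N, Cα and C of residue i followed by N of residue i+1.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) about the p1-p2 bond for four 3-D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane of p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Invented coordinates: the fourth point is rotated about the central bond.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))    # 0.0 (cis)
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))    # 90.0
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))   # 180.0 (trans)
```

Applied to the backbone atoms of each residue in a coordinate file, a function of this kind would yield the list of Φ and Ψ values that describes a conformation.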
Only twenty different amino acids are used to produce the countless combinations found in the proteins of cells (Table 3.2). The polypeptide chain folds into a curve in space, and the course of the chain defines its folding pattern. Proteins show a great variety of folding patterns. Folding may be thought of as a kind of intramolecular condensation or crystallization.

Table 3.2: The four naturally occurring nucleotides in DNA and RNA and the 20 naturally occurring amino acids in proteins

The four naturally occurring nucleotides in DNA and RNA
a - adenine      g - guanine      c - cytosine      t - thymine (u - uracil)

The twenty naturally occurring amino acids in proteins
Non-polar amino acids
G - glycine      A - alanine      P - proline          V - valine
I - isoleucine   L - leucine      F - phenylalanine    M - methionine
Polar amino acids
S - serine       C - cysteine     T - threonine        N - asparagine
Q - glutamine    H - histidine    Y - tyrosine         W - tryptophan
Charged amino acids
D - aspartic acid    E - glutamic acid    K - lysine    R - arginine
Other classifications of amino acids can also be useful. For instance, histidine, phenylalanine, tyrosine and tryptophan are aromatic, and are observed to play special structural roles in membrane proteins. Amino acid names are frequently abbreviated to their first three letters, for instance Gly for glycine, except for isoleucine, asparagine, glutamine and tryptophan, which are abbreviated to Ile, Asn, Gln and Trp, respectively. The rare amino acid selenocysteine has the three-letter abbreviation Sec and the one-letter code U. It is conventional to write nucleotides in lower case and amino acids in upper case. Thus atg = adenine-thymine-guanine and ATG = alanine-threonine-glycine.
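The case convention can be made concrete with a small Python sketch. The dictionaries follow the one-letter and three-letter codes of the 20 standard amino acids; the function name is hypothetical, and treating any all-lower-case string as nucleotides is an assumption made only for this example.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "G": "Gly", "A": "Ala", "P": "Pro", "V": "Val", "I": "Ile",
    "L": "Leu", "F": "Phe", "M": "Met", "S": "Ser", "C": "Cys",
    "T": "Thr", "N": "Asn", "Q": "Gln", "H": "His", "Y": "Tyr",
    "W": "Trp", "D": "Asp", "E": "Glu", "K": "Lys", "R": "Arg",
}
NUCLEOTIDES = {
    "a": "adenine", "g": "guanine", "c": "cytosine",
    "t": "thymine", "u": "uracil",
}


def spell_out(sequence: str) -> str:
    """Lower case is read as nucleotides, upper case as amino acids."""
    if sequence.islower():
        return "-".join(NUCLEOTIDES[ch] for ch in sequence)
    return "-".join(THREE_LETTER[ch] for ch in sequence)


print(spell_out("atg"))   # adenine-thymine-guanine
print(spell_out("ATG"))   # Ala-Thr-Gly
```

The same three letters thus name a start codon in one case and a tripeptide in the other, which is why the convention matters when sequences are exchanged between programs.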
Structure The linear sequence of amino acids in a protein molecule is its primary structure. Regions of local regularity within a protein fold (e.g. α-helices, β-turns, β-strands) constitute secondary structure. Proteins show recurrent patterns of interaction between helices and sheets close together in the sequence. These arrangements of α-helices and/or β-strands into discrete folding units (e.g. β-barrels, β-α-β units, Greek keys, etc.) are called supersecondary structures (Fig. 3.15). The overall fold of a protein sequence, formed by the packing of its secondary and/or supersecondary structure elements, is its tertiary structure. The arrangement of separate protein chains in a protein molecule with more than one subunit is its quaternary structure. The arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions, is referred to as quinternary structure.
Fig. 3.15 Common supersecondary structures (a) α-helix hairpin, (b) β-hairpin, (c) β-α-β unit. The chevrons indicate the direction of the chain. (Source: Lesk, A.M., Introduction to Bioinformatics, Oxford University Press).
Domains Many proteins contain compact units within the folding pattern of a single chain that look as if they should have independent stability. These are called domains. In the hierarchy, domains fall between supersecondary structures and the tertiary structure of a complete monomer. Modular proteins are multidomain proteins which often contain many copies of closely related domains. The most general classification of families of protein structures is based on the secondary and tertiary structures of proteins (Table 3.3).
Motif The active site of an enzyme, which takes part in catalytic function, occupies only a small portion of the protein molecule. If the protein is stretched out into a polypeptide chain, the active site region may be found distributed as discrete patches on the primary structure. Such small conserved regions, which confer a characteristic local shape on the protein, are called motifs. In nucleic acids, motifs are short strings of bases characteristic of sites regulating particular events in gene expression or chromosome replication, such as 5’ splice sites or origins of replication.

Table 3.3: Class and characteristics of protein structures

Class          Characteristics
α-helical      Secondary structure exclusively or almost exclusively α-helical
β-sheet        Secondary structure exclusively or almost exclusively β-sheet
α+β            α-helices and β-sheets separated in different parts of the molecule; absence of β-α-β supersecondary structure
α/β            Helices and sheets assembled from β-α-β units
α-β-linear     Line through centers of strands of sheet roughly linear
α-β-barrels    Line through centers of strands of sheet roughly circular
Little or no secondary structure
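Because a motif is a short, conserved pattern, it can be searched for with a regular expression. The Python sketch below is illustrative only: the pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline) is the commonly cited N-glycosylation site pattern and is not taken from this chapter, and the protein sequence is invented.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
# The lookahead (?=...) lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str) -> list[int]:
    """Return the 0-based start positions of every match of the motif."""
    return [match.start() for match in MOTIF.finditer(protein.upper())]


# Invented sequence with two matching sites (NGTA and NYSA) and one
# near miss (NPSQ, rejected because of the proline).
print(find_motif("MNGTANPSQNYSA"))   # [1, 9]
```

Motif databases store many such patterns, and scanning a new sequence against them is one simple way of suggesting possible functional sites.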
Folding Patterns Within these broad categories, protein structures show a variety of folding patterns. Among proteins with similar folding patterns, there are families that share enough features of structure, sequence and function to suggest an evolutionary relationship. Classification of protein structures occupies a key position in bioinformatics, as a bridge between sequence and function. The amino acid sequence of a protein dictates its three-dimensional structure. When placed in a medium of suitable solvent and temperature conditions, like the one provided by a cell interior, proteins fold spontaneously to their native active states. If amino acid sequences contain sufficient information to specify the three-dimensional structures of proteins, it should be possible to devise an algorithm to predict protein structure from amino acid sequence. But this has proved difficult. Hence scientists have turned to more limited approaches: secondary structure prediction, fold recognition and homology modeling.
Biochemical Nature Biochemically, proteins play a variety of roles in life processes: there are structural proteins (e.g. viral coat proteins, the horny outer layer of human and animal skin, and proteins of the cytoskeleton); proteins that catalyse chemical reactions (the enzymes); transport and storage proteins (e.g. hemoglobin); regulatory proteins, including hormones and receptors; signal transduction proteins; proteins that control genetic transcription; and proteins involved in recognition, including cell adhesion molecules, antibodies and other proteins of the immune system. Proteins are large molecules. In many cases only a small part of the structure, the active site, is functional, the rest existing only to create and fix the spatial relationship among the active site residues.
Chemical Nature Chemically, protein molecules are long polymers typically containing several thousand atoms composed of a uniform repetitive backbone (or main chain) with a particular side chain attached to each residue. The polypeptide chains of proteins have a main chain of constant structure and side chains that vary in sequence. The side chains may be chosen, independently, from the set of 20 standard amino acids. It is the sequence of the side chains that gives each protein its individual structural and functional characteristics.
Chaperones Some proteins require chaperones to fold, but these catalyze the process rather than direct it. Molecular chaperones are helper proteins that ensure that growing protein chains fold correctly. Chaperones are thought to block incorrect folding pathways that would lead to inactive products, by preventing incorrect aggregation and precipitation of unassembled subunits. They probably bind temporarily to interactive surfaces that are exposed only during the early stages of protein assembly.
Functions Proteins serve several vital functions: (i) as catalysts of various biochemical reactions (e.g. enzymes), (ii) as messengers (e.g. neurotransmitters), (iii) as control elements that regulate cell reproduction, (iv) in the growth and development of various tissues (e.g. trophic factors), (v) in oxygen transport in the blood (e.g. hemoglobin), (vi) in defense against diseases (e.g. antibodies), etc. The function of a protein is determined by its shape.
STUDY QUESTIONS
1. Who coined the word gene?
2. Who isolated nucleic acid first?
3. Who gave the name DNA?
4. What is the contribution of Erwin Chargaff?
5. Who proposed the DNA double helix model?
6. Who proposed the one gene-one enzyme hypothesis?
7. What is a chromosome?
8. What is a centromere? Name the different types of centromere.
9. What are the different kinds of RNAs?
10. What is polyadenylation?
11. What is transcription?
12. What is translation?
13. What are the different structures of protein?
14. What is the function of chaperones?
CHAPTER 4
DNA and Protein Sequencing and Analysis

Contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. With the advent of cloning technology it has become easier to insert foreign DNA sequences into many systems. Rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, has also become possible through this technology. Oligonucleotide synthesis technology has given researchers the ability to construct short fragments of DNA with defined sequences. These oligonucleotides can then be used to probe vast libraries of cDNA in order to extract genes containing that sequence. Alternatively, these DNA fragments can be used in polymerase chain reactions (PCR) to amplify existing DNA sequences or to modify them. Two common goals in sequence analysis are to identify sequences that encode proteins, which determine all cellular metabolism, and to discover sequences that regulate the expression of genes or other cellular processes. Some important laboratory techniques which are useful to decipher the information content of genomes are given below:

Restriction enzymes isolated from bacteria digest a double-stranded DNA molecule at specific base sequences. They throw some light on the specific organization and sequence of a DNA molecule. When DNA is digested it yields many DNA fragments.

Gel electrophoresis is used to separate these fragments from each other: an electric current passed through a matrix (gel) of agarose or acrylamide carries the fragments, which are loaded at the upper end of the gel, through the matrix.

Blotting of the gel, followed by hybridization of the nitrocellulose paper that carries the DNA fragments after blotting, is done to find the gene fragment by using specific probes.

Cloning is done to generate a sufficient quantity and quality of a specific gene, by inserting it into chromosome-like carriers called vectors that allow its replication in living cells. The cloned genes can then be purified and used for analysis.

Polymerase chain reaction (PCR) can be used to obtain large quantities of particular gene regions from very small quantities. This is a powerful alternative to cloning.
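The first two techniques, restriction digestion and gel electrophoresis, lend themselves to a simple computational sketch. The Python code below is a simplified illustration, not a laboratory protocol: it treats DNA as a single strand, uses the EcoRI recognition site GAATTC (cut between G and A) as the default enzyme, and works on an invented sequence. The function names are hypothetical.

```python
def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the position of the cut within the site; 1 corresponds
    to G^AATTC. Only one strand is modelled, so sticky ends are ignored.
    """
    dna = dna.upper()
    fragments, start = [], 0
    position = dna.find(site)
    while position != -1:
        fragments.append(dna[start:position + cut_offset])
        start = position + cut_offset
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])
    return fragments


def gel_order(fragments: list[str]) -> list[int]:
    """Fragment lengths from the top of the gel (longest) to the bottom.

    Shorter fragments migrate farther through the gel matrix.
    """
    return sorted((len(fragment) for fragment in fragments), reverse=True)


sequence = "TTGAATTCAGGCTAGAATTCC"   # invented sequence with two sites
pieces = digest(sequence)
print(pieces)              # ['TTG', 'AATTCAGGCTAG', 'AATTCC']
print(gel_order(pieces))   # [12, 6, 3]
```

A digest with n sites in a linear molecule gives n + 1 fragments, and the pattern of fragment lengths is what is read from the gel.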
4.1 GENOMICS AND PROTEOMICS
Genomics is the development and application of molecular mapping, sequencing, characterization, computation and analysis to the entire genomes of organisms and their whole sets of gene products. Genome refers to the entire complement of genetic material in a chromosome set. The analysis of whole genomes gives us new insights into the global organization, expression, regulation and evolution of the hereditary material (Fig. 4.1).
Structural, Functional and Comparative Genomics Genomics has three distinct subfields: structural genomics, functional genomics and comparative genomics. Structural genomics is the genetic mapping, physical mapping and sequencing of genomes. Genetic maps provide molecular landmarks for building the higher-resolution physical and sequence maps and also provide molecular entry points for researchers interested in cloning genes. Physical maps provide a view of how the clones from genomic clone libraries are distributed throughout the genome. They provide a clone resource for positional cloning. Genome DNA sequences are helpful in describing the functions of all genes, including gene expression and control. Functional genomics is the global study of the structure, expression patterns, interactions and regulation of the RNAs and proteins encoded by the genome. It is the comprehensive analysis of the functions of genes and nongene sequences in entire genomes. Comparative genomics allows the comparison of entire genomes of different species with the goal of enhancing our understanding of the functions of each genome, including evolutionary relationships.

[Fig. 4.1 labels: Genomics; whole genomic mapping; high-resolution genetic maps; physical maps; sequence maps; transcript maps; polypeptide maps; comparative genomics; chromosome evolution; gene conservation and evolution; functional genomics; transcript expression; interaction maps; protein expression.]
Fig. 4.1 Genomic analysis: A hierarchical view of genomic analysis (Source: A.J.F. Griffiths et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman & Company, New York, 2002).
Approaches to Genome Sequencing Determination of the complete genomic DNA sequence of an organism allows attempts to be made to identify all of an organism’s genes and therefore define its genotype. Special experimental techniques have been devised to carry out the difficult task of manipulating and characterizing large numbers of genes and large amounts of DNA. One approach to genome sequencing is first to generate high resolution genetic and physical maps of the genome to define segments of increasing resolution and then to sequence the segments in an orderly manner. Another approach, the direct shotgun approach, is to break up the genome into random, overlapping fragments, then to sequence the fragments and assemble the sequences using computer algorithms. Analysis of genomic sequences reveals that each organism has an array of genes required for basic metabolic processes and genes whose products determine the specialized function of the organism. Complete genome sequencing therefore provides a knowledge base on which to build information about gene and protein expression, but is not sufficient on its own to define the entire protein component of the organism.
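The shotgun idea, sequencing random overlapping fragments and letting a computer reassemble them, can be illustrated with a toy greedy assembler in Python. This is a minimal sketch under strong assumptions that real data do not satisfy (error-free reads, a single strand, no repeats); practical assemblers use far more elaborate algorithms. The function names, the minimum overlap of three bases and the reads are all invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))        # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:                         # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


# Three invented reads covering the 16-base sequence ATGGCGTACCTTGAGA.
reads = ["CCTTGAGA", "ATGGCGTA", "CGTACCTT"]
print(greedy_assemble(reads))   # ['ATGGCGTACCTTGAGA']
```

The order in which the reads are supplied does not matter here, since the reconstruction relies only on the overlaps between them.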
Proteomics Proteomics is the cataloging and analysis of proteins to determine when a protein is expressed, how much is made, and with what other proteins it can interact. The term proteomics indicates proteins expressed by a genome. It is the systematic analysis of protein profiles of tissues. The word proteome refers to all proteins produced by a species at a particular time. The proteome varies with time and is defined as “the proteins present in one sample (tissue, organism, cell culture) at a certain point in time”. Proteomics represents the genome at work and it is a dynamic process. Proteomics can be divided into expression proteomics (the study of global changes in protein expression) and cell-map proteomics (the systematic study of protein-protein interactions through the isolation of protein complexes). There is an increasing interest in proteomics because DNA sequence information provides only a static snapshot of the various ways in which the cell might use its proteins, whereas the life of the cell is a dynamic process. Proteins expressed by an organism change during growth, disease and the death of cells and tissues. Proteomics attempts to catalog and characterize these proteins, compare variations in their expression levels in healthy and diseased tissues, study their interactions and identify their functional roles using leading-edge technology. Proteomics begins with the functionally modified protein and works back to the gene responsible for its production.
Goals The goals of proteomics are: (i) to identify every protein in the proteome, (ii) to determine the sequence of each protein and enter the data into databases, and (iii) to analyse protein levels globally in different cell types and at different stages of development.
Structural and Functional Proteomics Proteomics research can be categorized as structural proteomics and functional proteomics. Structural proteomics or protein expression measures the number and types of proteins present in normal and diseased cells. This approach is useful in defining the structure of proteins in a cell. Some of these proteins may be targets for drug discovery. Functional proteomics is the study of proteins’ biological activities. An important function of proteins is the transmission of signals using intricate pathways populated by proteins, which interact with one another. There are three main steps in proteome research: (i) Separation of individual proteins by 2D PAGE (ii) Identification by mass spectrometry or N-terminal sequencing of individual proteins recovered from the gel (iii) Storage, manipulation and comparison of the data using bioinformatics tools.
Uses Proteomics will contribute greatly to our understanding of gene function in the post genomic era. Differential display proteomics for comparison of protein levels has potential application in a wide range of diseases. Because it is often difficult to predict the function of a protein based on homology to other proteins or even their three-dimensional structure, determination of components of a protein complex or of a cellular structure is central in functional analysis. Proteomics will also play an important role for drug discovery and development by characterizing the disease process directly by finding sets of proteins (pathways or clusters) that together participate in causing the disease. Proteomics can be seen as a mass-screening approach to molecular biology, which aims to document the overall distribution of proteins in cells, identify and characterize individual proteins of interest, and ultimately elucidate their relationships and functional roles. Such direct protein-level analysis has become necessary because the study of genes, by genomics, cannot adequately predict the structure or dynamics of proteins, since it is at the protein level that most regulatory processes take place, where disease processes primarily occur and where most drug targets are to be found.
4.2
GENOME MAPPING
Before the advent of genomic analysis, knowledge of the genetic basis of an organism usually consisted of relatively low-resolution chromosomal maps and physical maps of genes producing known mutant phenotypes. Starting with
these genetic linkage maps, whole genome molecular mapping generally proceeds through several steps of increasing resolution (Fig. 4.2). A genetic map is a representation of the genetic distance separating genes, derived from the frequency of genetic recombination between them. Genetic mapping is the process of locating genes on chromosomes and assigning their relative genetic distances from other known genes. Genetic maps of genomes are constructed using genetic crosses and, for humans, pedigree analysis. Genetic crosses are used to establish the location of genetic markers (any allele that can be used to mark a location on a chromosome or a gene) on chromosomes and to determine the genetic distance between them. Historically, genes have been used as markers in genetic mapping experiments. Now another type of genetic marker, the DNA marker, is used to develop the genetic map. DNA markers are genetic markers that are detected using molecular tools that focus on the DNA itself rather than on the gene product or associated phenotype. Four types of DNA markers are used in human genomic mapping: (i) restriction fragment length polymorphisms (RFLPs); (ii) variable number of tandem repeats (VNTRs), also called minisatellites; (iii) short tandem repeats (STRs), also called microsatellites; and (iv) single nucleotide polymorphisms (SNPs), hundreds of which can be typed simultaneously using DNA microarrays.
Fig. 4.2 Overview of the general approaches of whole genome mapping. General scheme for making a genome map by using analyses at increasing levels of resolution: cytogenetic mapping, genetic high-resolution mapping (genes and molecular markers 1 to 3), physical mapping (cloned fragments) and DNA sequencing (for example TTAGCTTAACGTACTGGTACCGTACCGTGGCTTAT). (Source: A.J.F. Griffiths et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman & Company, New York, 2002).
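The relation between recombination frequency and genetic distance can be illustrated with a short sketch. It assumes the usual convention that 1% recombination corresponds to one map unit (one centimorgan), which holds only for closely linked markers; the function names and offspring counts are hypothetical.

```python
def recombination_frequency(recombinant, parental):
    """Fraction of scored offspring that carry a recombinant marker combination."""
    total = recombinant + parental
    if total == 0:
        raise ValueError("no offspring scored")
    return recombinant / total


def map_distance_cm(recombinant, parental):
    """Approximate genetic distance in centimorgans (1% recombination = 1 cM).

    Assumption: the markers are close enough that double crossovers are rare,
    so the distance is simply the recombination percentage.
    """
    total = recombinant + parental
    if total == 0:
        raise ValueError("no offspring scored")
    return 100.0 * recombinant / total


if __name__ == "__main__":
    # Hypothetical cross: 17 recombinant and 83 parental-type offspring.
    print(map_distance_cm(17, 83), "cM")
```

For markers that lie far apart, the observed recombination frequency underestimates the true distance and never exceeds 50%, so real mapping software applies a mapping function rather than this linear rule.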
4.3
DNA SEQUENCING METHOD
Methods are available to determine the order of nucleotides in DNA. One of these is called chain termination sequencing, dideoxy sequencing or, after its inventor, the Sanger method. The basic sequencing reaction consists of a single-stranded DNA template, a primer to initiate the nascent chain, four deoxyribonucleoside triphosphates (dATP, dCTP, dGTP and dTTP) and the enzyme DNA polymerase, which inserts the complementary nucleotides in the nascent DNA strand using the template as a guide. Normally four DNA polymerase reactions are set up, each containing a small amount of one of the four dideoxyribonucleoside triphosphates (ddATP, ddCTP, ddGTP and ddTTP). These act as chain-terminating competitive inhibitors of the reaction. Each of the four reaction mixtures generates a nested set of DNA fragments, each terminating at a specific base (Fig. 4.3).
Fig. 4.3 Principle of DNA sequencing. (a) Four sequencing reactions are set up, each containing a limiting amount of one of the four dideoxynucleotides. Each reaction generates a nested set of fragments terminating with a specific base. For the template GGATTCTGCTACGGA, the reaction including ddATP yields ddATGCCT, ddACGATGCCT, ddAGACGATGCCT and ddAAGACGATGCCT; the reaction including ddCTP yields ddCT, ddCCT, ddCGATGCCT, ddCTAAGACGATGCCT and ddCCTAAGACGATGCCT; the reaction including ddGTP yields ddGCCT, ddGATGCCT and ddGACGATGCCT; and the reaction including ddTTP yields ddT, ddTGCCT and ddTAAGACGATGCCT. (b) A polyacrylamide gel is shown with each reaction (A, C, G, T) running in a separate lane for clarity; the sequence read from the gel is CCTAAGACGATGCCT. In a typical automated reaction, all reactions would be pooled prior to electrophoresis and the terminal nucleotide determined by scanning for a specific fluorescent tag. (Source: Twyman, R.M., Advanced Molecular Biology, © BIOS Scientific Publishers Ltd., 1998).
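The logic of chain termination can be sketched in a few lines of code. The sketch below is an idealized model, not a simulation of the chemistry: it assumes that termination occurs once at every position, and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template_5to3):
    """Strand built by the polymerase on the template, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template_5to3))


def nested_fragments(new_strand):
    """Sort every possible terminated fragment into the reaction (ddA, ddC,
    ddG or ddT) whose dideoxynucleotide would have ended it."""
    reactions = {"A": [], "C": [], "G": [], "T": []}
    for length in range(1, len(new_strand) + 1):
        fragment = new_strand[:length]
        reactions[fragment[-1]].append(fragment)
    return reactions


def read_gel(reactions):
    """Read the sequence by ordering fragments from shortest to longest,
    as one reads a gel from the bottom upwards."""
    lane_by_length = {
        len(fragment): base
        for base, fragments in reactions.items()
        for fragment in fragments
    }
    return "".join(lane_by_length[n] for n in sorted(lane_by_length))


if __name__ == "__main__":
    strand = synthesized_strand("AAGC")
    lanes = nested_fragments(strand)
    print(lanes)
    print(read_gel(lanes))
```

Because each fragment length occurs in exactly one lane, reading the lanes in order of fragment size recovers the sequence of the newly synthesized strand.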
Automated Methods
Most DNA sequencing reactions are now automated. Each reaction mixture is labeled with a different fluorescent tag (either on the primer or on one of the nucleotide substrates), which allows the terminal base of each fragment to be identified by a scanner. All four reaction mixtures are then pooled and the DNA fragments are separated by polyacrylamide gel electrophoresis (PAGE). Smaller DNA fragments travel faster than larger ones, so the nested DNA fragments are separated according to size. The resolution of PAGE allows polynucleotides differing in length by only one residue to be separated. Near the bottom of the gel, the scanner detects the fluorescent tag as each DNA fragment moves past, and this signal is converted into trace data, displayed as a graph of colored peaks corresponding to each base (Fig. 4.4).
Fig. 4.4 A sample of a high quality sequence trace (base calls A C C A G C G G C T C T), where all peaks are easily called. Peaks are typically printed in different colors (shown here as different line styles) to aid visual interpretation. Software such as Phred is used to read the peaks and assign quality values (A = dark line; C = lighter line; G = dotted line; T = dark line with breaks). (Source: Westhead, D.R. et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003)
DNA sequences are stored in databases. Genomic DNA sequences, copy DNA (cDNA) sequences and recombinant DNA sequences are available in databases. Genome sequencing is done using shotgun sequencing or clone contig strategies. Many different programs, such as Phred, Vector-clip, CrossMatch, RepeatMasker, Phrap and Staden Gap4, have been used in quality control of sequences. The arrival of high-throughput automated fluorescent DNA sequencing technology has led to the rapid accumulation of sequence information; it provides the basis for abundant computationally derived protein sequence data. Analysis of DNA sequence underpins a number of aspects of research; these include, for example, detection of phylogenetic relationships; genetic engineering using restriction site mapping; determination of gene structure through intron/exon prediction; and inference of protein coding sequence through open reading frame (ORF) analysis.
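The quality values assigned by Phred are defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below converts between the two and masks unreliable calls; the function names and the threshold of Q20 are assumptions made for illustration.

```python
import math


def phred_quality(p_error):
    """Phred quality score for a given probability of a wrong base call."""
    return -10.0 * math.log10(p_error)


def error_probability(quality):
    """Inverse of phred_quality: probability that the call is wrong."""
    return 10.0 ** (-quality / 10.0)


def mask_low_quality(bases, qualities, min_quality=20):
    """Replace base calls below min_quality with 'N' (threshold is an assumption)."""
    return "".join(
        base if quality >= min_quality else "N"
        for base, quality in zip(bases, qualities)
    )


if __name__ == "__main__":
    print(phred_quality(0.001))       # one error in a thousand calls
    print(error_probability(20))
    print(mask_low_quality("ACCAG", [40, 35, 10, 30, 5]))
```

On this scale a score of 20 corresponds to one expected error in 100 calls and a score of 30 to one in 1000.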
Exons, Introns and CDS
The central dogma states that DNA is transcribed into RNA, which is then translated into protein. In eukaryotic systems, exons form a part of the final coding sequence (CDS), whereas introns, though transcribed, are edited out by the cellular machinery before the mRNA assumes its final form (Fig. 4.5). DNA sequence databases typically contain genomic sequence, which includes information at the level of the untranslated sequence, introns and exons, mRNA, cDNA and translations. Untranslated regions (UTRs) occur both in DNA and RNA; they are portions of the sequence flanking the CDS that are not translated into protein. Untranslated sequence, particularly at the 3' end, is highly specific both to the gene and to the species from which the sequence is derived.
Fig. 4.5 In eukaryotic systems exons form a part of the final coding sequence (CDS), whereas introns are transcribed, but are then edited out by the cellular machinery before the mRNA assumes its final form. The figure shows the sense strand of genomic DNA (5' UTR, exons and introns, 3' UTR), its transcription into mRNA (5' UTR, CDS, 3' UTR) and the translation of the mRNA into protein. Here, the gene is made up of three exons and two introns. Exons, unlike coding sequences, are not simply terminated by stop codons, but rather by intron-exon boundaries; the untranslated regions (UTRs) occur at either end of the gene; if transcription begins at the 5' end of the sequence, then the 5' UTR contains promoter sites (such as the TATA box), and the 3' UTR follows the stop codon. (Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to Bioinformatics, Pearson Education Ltd., 2001)
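The relationship between genomic sequence, exons and CDS can be made concrete with a small sketch. It assumes that the exon coordinates are already known (0-based, end-exclusive) and that the CDS runs from the first ATG of the spliced transcript to the first in-frame stop codon; the function names and the toy gene are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice_exons(genomic, exons):
    """Join the exon intervals in order, discarding the introns between them."""
    return "".join(genomic[start:end] for start, end in sorted(exons)).upper()


def cds_from_mrna(mrna):
    """Return the stretch from the first ATG to the first in-frame stop codon.

    Whatever precedes it is treated as 5' UTR and whatever follows as 3' UTR.
    An empty string is returned if no complete CDS is found.
    """
    start = mrna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return ""


if __name__ == "__main__":
    # Toy gene: two exons (upper case) separated by one intron (lower case).
    gene = "GGATGAAA" + "gtaagtcccag" + "TTTTGACC"
    mrna = splice_exons(gene, [(0, 8), (19, 27)])
    print(mrna)
    print(cds_from_mrna(mrna))
```

Note that the stop codon TGA of the toy gene only appears in frame after the intron has been removed, which is why ORF analysis of unspliced eukaryotic sequence is unreliable.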
Primer Design
The location of the primers on a DNA source is determined relative to the start and stop codons of the gene. The default option finds the 'forward' primer of a given length that lies within the first 35 base pairs upstream of the coding sequence, and the 'reverse' primer that lies within the 35 base pairs immediately following the coding sequence. We can alter either of these endpoints by changing the number in the 'Distance from the Start' and 'Distance from the Stop' fields. We can also define the exact 5' endpoints of the primers by selecting the button marked 'YES' on the line which asks about the exact endpoints.
Procedure
Open the Internet browser and type the URL address: http://frodo.wi.mit.edu.cgi.bin/primers3/primer3_www.cgi. Paste the sequence in the text box. Choose the primer by selecting the left and right primer options. Press the 'Pick Primer' button; the result will be displayed in a new page.
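The placement rule described under Primer Design can be sketched as follows. This is only an illustration of the geometry, not the algorithm used by the web tool: it simply takes the outermost candidate in each 35-base window, whereas real primer design programs also score candidates on properties such as melting temperature and GC content. The function names, parameters and toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pick_flanking_primers(seq, cds_start, cds_end, length=20, window=35):
    """Pick one forward and one reverse primer flanking a coding sequence.

    cds_start: index of the first base of the start codon.
    cds_end:   index just past the last base of the stop codon.
    The forward primer lies within `window` bases upstream of the CDS; the
    reverse primer is the reverse complement of a stretch lying within
    `window` bases downstream of it.
    """
    if length > window:
        raise ValueError("primer is longer than the search window")
    if cds_start < length or len(seq) - cds_end < length:
        raise ValueError("not enough flanking sequence for the primers")
    seq = seq.upper()
    up_start = max(0, cds_start - window)
    forward = seq[up_start:up_start + length]
    down_end = min(len(seq), cds_end + window)
    reverse = reverse_complement(seq[down_end - length:down_end])
    return forward, reverse


if __name__ == "__main__":
    toy = "ACGTACGTAC" + "ATGAAATAG" + "GGGTTTCCCA"
    print(pick_flanking_primers(toy, 10, 19, length=5, window=8))
```

The reverse primer is reported as a reverse complement because it must anneal to the sense strand and be extended back towards the coding sequence.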
4.4
OPEN READING FRAME (ORF)
ORFs are stretches of DNA sequence uninterrupted by codons which would cause protein synthesis to fail, and which are bounded by appropriate start and stop signals. An ORF is a nucleotide sequence without a 'stop' signal that encodes some minimal number of amino acids (about 100). In prokaryotes, identifying ORFs is fairly straightforward. In eukaryotes, because of introns and exons, the assignment of ORFs is complicated. Which is the correct reading frame for translation? The longest frame uninterrupted by a stop codon (TGA, TAA or TAG) is normally supposed to be the correct reading frame. Such a frame is known as an open reading frame (ORF). Finding the end of an ORF is easier than finding its beginning. We may use several features as indicators of potential protein coding regions in DNA. One of these is sufficient ORF length. Recognition of flanking Kozak sequences may also be helpful in pinpointing the start of the CDS (Fig. 4.6). Patterns of codon usage differ in coding and non-coding regions.
Fig. 4.6 When constructing a library, complementary DNA (cDNA) is run off from the mRNA stage, using reverse transcriptase. ESTs are then generated using a single read of each clone on an automated sequencing system. In the mRNA, the start codon may be flanked by a Kozak sequence, which gives additional confidence to the prediction of the start of the CDS. The figure shows a cDNA (5' to 3') with its EST, CDS and UTR regions marked. (Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to Bioinformatics, Pearson Education Ltd., 2001)
Specifically, the use of codons for particular amino acids varies according to species, and codon-use rules break down in regions of sequence that are not destined to be translated. Thus, codon-usage statistics can be used to infer both 5' and 3' untranslated regions and to assist the detection of mistranslations, because there is an uncharacteristically high representation of rarely used codons in these regions. Table 4.1 illustrates the considerable variability in the codons that different organisms select for a particular amino acid. In addition to their characteristic pattern of codon usage, many organisms show a general preference for G or C over A or T in the third base (wobble) position of a codon. The consequent bias towards G/C in this base can further contribute to the diagnosis of ORFs.
Table 4.1: Percentage use of codons for serine in a variety of model organisms. There are six possible codons for serine, which in principle could be used with equal frequency whenever serine is specified in a CDS. In practice, however, organisms are highly selective in the particular codons they use. The characteristic differences in usage reflected here can be used to help diagnose regions of DNA that may code for protein.
Organism | AGT | AGC | TCG | TCA | TCT | TCC
E. coli | 3 | 20 | 4 | 2 | 34 | 37
D. melanogaster | 1 | 23 | 17 | 2 | 9 | 42
H. sapiens | 10 | 34 | 9 | 5 | 13 | 28
Z. mays | 4 | 30 | 22 | 4 | 4 | 37
S. cerevisiae | 5 | 4 | 1 | 6 | 52 | 33
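Statistics of the kind shown in Table 4.1 are straightforward to compute. The following sketch tabulates the percentage use of the six serine codons in a coding sequence and the fraction of codons with G or C in the third position; the function names and the short example sequence are hypothetical.

```python
from collections import Counter

SERINE_CODONS = ("AGT", "AGC", "TCG", "TCA", "TCT", "TCC")


def codons(cds):
    """Split a coding sequence into complete codons, ignoring a partial last one."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def serine_codon_usage(cds):
    """Percentage use of each serine codon among all serine codons in the CDS."""
    counts = Counter(c for c in codons(cds) if c in SERINE_CODONS)
    total = sum(counts.values())
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in SERINE_CODONS}


def gc3_fraction(cds):
    """Fraction of codons whose third (wobble) base is G or C."""
    triplets = codons(cds)
    if not triplets:
        return 0.0
    return sum(c[2] in "GC" for c in triplets) / len(triplets)


if __name__ == "__main__":
    example = "ATGTCTTCTAGCTCC"
    print(serine_codon_usage(example))
    print(gc3_fraction(example))
```

Comparing such figures for a candidate region with the known usage of the organism gives one line of evidence for or against its being coding sequence; a real analysis would need far longer sequences than this example.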
In the region upstream of the start codon of prokaryotic genes, detection of ribosome binding sites, which help to direct ribosomes to the correct translation start positions, is considered to be a powerful ORF indicator. One consequence of the presence of exons and introns in eukaryotic genes is that potential gene products can be of different lengths, because not all exons may be represented in the final transcribed mRNA (although the order of exons that are included is preserved). When the mRNA editing process results in different translated polypeptides, the resulting proteins are known as splice variants or alternatively spliced forms. Thus, results of database searches with cDNA or mRNA (transcription level information) that appear to indicate substantial deletions in matches to the query sequence could, in fact, be the result of alternative splicing.
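The longest-frame rule described in this section can be sketched as a simple scan of the three forward reading frames. The sketch assumes that an ORF starts at ATG and ends at the first in-frame stop codon; it ignores the reverse strand, introns and any minimum length, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna, start_codon="ATG"):
    """Return (frame, start, end, sequence) of the longest ORF on this strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    (None, 0, 0, "") is returned if no complete ORF is found.
    """
    dna = dna.upper()
    best = (None, 0, 0, "")
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == start_codon:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]
                if len(orf) > len(best[3]):
                    best = (frame, start, i + 3, orf)
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))
```

A fuller implementation would also scan the reverse complement, giving six frames in all, and would discard ORFs shorter than a chosen threshold such as the 100 codons mentioned above.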
4.5
DETERMINING SEQUENCE OF A CLONE
A clone is a copied fragment of DNA maintained in circular form identical to the template from which it is derived. The process of determining the nucleotide sequence of a clone also helps in the analysis of DNA sequences. In
an experiment to clone a specific gene whose sequence is already known, it is necessary to check that the cloned sequence is indeed identical to the published one. A cDNA clone is synthesized using mRNA as a template. The clone is then sequenced by designing primers to known oligonucleotides present in the cloning vector flanking the inserted DNA. When the primers hybridize to the corresponding sequences, they are extended in a chain synthesis reaction using the inserted sequence as template (Fig. 4.7). The reaction is terminated by the incorporation of a dideoxynucleotide (ddATP, ddTTP, ddGTP, or ddCTP). Not all the chains terminate at the same base, since normal bases (dATP, dTTP, dGTP or dCTP) are also present in the reaction mixture. The result is a series of fragments for each primer, all of different lengths because they have been terminated at different base positions. The generated fragments are run on standard radioactive sequencing gels, or fluorescent sequencing machines, as appropriate, to determine the order of bases in a sequence. The assembler program builds a consensus sequence for the clone, according to a weighting given to each nucleotide position in the sequence.
Fig. 4.7 Template DNA sequencing: (a) chain synthesis and termination by incorporation of ddGTP; (b) the family of chains terminated at different positions by ddGTP. Since G pairs with C, the template sequence contains C at each of these positions.
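How an assembler might weight each nucleotide position when building a consensus can be sketched as a quality-weighted vote. This is an illustrative scheme under stated assumptions, not the method of any particular assembler: the reads are assumed to be already aligned to the clone by a known offset, each base call carries a numerical quality, and the function name is hypothetical.

```python
from collections import defaultdict


def weighted_consensus(aligned_reads):
    """Build a consensus from reads given as (offset, bases, qualities).

    At every position each read votes for its base with a weight equal to
    the quality of that call; the base with the highest total wins.
    Positions covered only by ambiguous calls ('N') remain 'N'.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for offset, bases, qualities in aligned_reads:
        for i, (base, quality) in enumerate(zip(bases, qualities)):
            if base != "N":
                votes[offset + i][base] += quality
    if not votes:
        return ""
    length = max(votes) + 1
    return "".join(
        max(votes[pos], key=votes[pos].get) if pos in votes else "N"
        for pos in range(length)
    )


if __name__ == "__main__":
    reads = [
        (0, "ACGT", [30, 30, 10, 30]),   # low-quality G at position 2
        (2, "TTGA", [40, 30, 30, 30]),   # high-quality T at position 2
    ]
    print(weighted_consensus(reads))
```

In the example the two reads disagree at position 2, and the call with the higher quality decides the consensus base.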
Whole genome shotgun sequencing assembly (Fig. 4.8) is also used to sequence clones from the physical map of a genome. In whole genome shotgun sequencing, the portions of the inserts adjacent to the junction points with vector sequences are sequenced from a great many random clones throughout the genome, and the overlapping sequence information is used to assemble the sequence of the entire genome and to reconstruct the physical map of the clones. The rapid accumulation of DNA sequence data has been expedited by the introduction of fluorescent sequencing technology. A larger number of sequencing reactions can be carried out, and the protocols are more readily adapted to automation. When the reactions are run on a fluorescent sequencing gel, computers are used to interpret the laser-activated fluorescence and convert it into a digital form suitable for further analysis.
Fig. 4.8 Whole genome shotgun sequencing assembly. First, the unique sequence overlaps between sequence reads are used to build contigs. Paired-end reads are then used to span gaps and to order and orient the contigs into larger units called scaffolds. In the figure, a scaffold consists of sequenced contigs 1, 2 and 3 separated by gaps. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
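The first step of the assembly, building a contig from overlapping reads, can be sketched with a greedy merge. This is a toy illustration only: it assumes error-free reads from one strand and no repeats, and real assemblers such as Phrap use far more elaborate methods. The function names are hypothetical; the reads in the example are taken from the sequence shown in Fig. 4.2.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlap of at least
    min_len bases remains.
    """
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["TTAGCTTAACG", "TAACGTACTGG", "ACTGGTACCGT"]
    print(greedy_assemble(reads))
```

Reads that share no sufficient overlap are left as separate contigs, which is where paired-end information is needed to order and orient them into scaffolds.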
Typically 36 lanes are run on a gel at once. The output consists of a series of (colour-coded) peaks, beneath which is a string of base symbols. Sometimes the software that interprets the chromatogram is unable to determine which base should be called at a specific position, and a '-' appears. Such ambiguous positions are replaced by 'N' in the resulting sequence file.
4.6
EXPRESSED SEQUENCE TAGS
An expressed sequence tag (EST) is a partial sequence of a clone, randomly selected from a cDNA library and used to identify genes expressed in a particular tissue. We do not always have full-length DNA sequences; a large part of the currently available DNA data is made up of partial sequences, the majority of which are ESTs. In analyzing ESTs some points should be kept in mind: (i) the EST alphabet is five characters, ACGTN; (ii) there may be phantom INDELs resulting in translation frame shifts; (iii) the EST will often be a sub-sequence of any other sequence in the databases; (iv) the EST may not represent part of the CDS of any gene. How an EST is sequenced is shown in Fig. 4.9. A cDNA library is constructed from a tissue or cell line of interest. The mRNA is isolated from the tissue or cells and is then reverse-transcribed into cDNA, usually with an oligo(dT) primer, so that one end of the cDNA insert derives from the polyA tail at the end of the mRNA. The other end of the cDNA is normally within the coding sequence but may be in the 5' untranslated region if the coding sequence is short. The resulting cDNA is cloned into a vector.
Individual clones are picked from the library, and one sequence is generated from each end of the cDNA insert. Thus, each clone normally has a 5' and a 3' EST associated with it. Because ESTs are short, they generally represent only fragments of genes and not complete coding sequences. A typical EST is between 200 and 500 bases in length. The EST production process is normally highly automated and typically involves the use of a fluorescent laser system that reads the sequencing gels. The resulting sequences are downloaded to a computer system for further analysis. Does this EST represent a new gene? To answer this question, a DNA database search is usually performed. If the result shows a significant similarity to a database sequence, the normal procedure for classifying the hit will determine whether a novel gene has been found. If, however, the result shows no significant similarity, we cannot immediately assume that a new gene has been discovered; it may be that the EST represents non-coding sequence, for a known gene, that simply is not in the database. Many mRNAs (especially human mRNAs) have long untranslated regions at the 5' and 3' ends of the CDS. It is possible for an EST to be entirely from one of these non-coding regions. If we are lucky, the section of untranslated (non-coding) sequence will already be in the database. If it is, a direct match will be found, as untranslated regions are highly conserved and specific to their coding gene.
Fig. 4.9 Overview of how ESTs are constructed. Starting from a cell or tissue, mRNA is isolated and reverse transcribed into cDNA; the cDNA is cloned into a vector to make a cDNA library; individual clones are picked; the 5' and 3' ends of the cDNA insert are sequenced to give the 5' EST and 3' EST; and the EST sequences are deposited in dbEST. (Source: Wolfsberg, T.G. and Landsman, D., Expressed Sequence Tags (ESTs), in Bioinformatics – a practical guide to the analysis of genes and proteins (eds) Baxevanis, A.D. and Francis Ouellette, B.F., John Wiley & Sons, Inc, 2002)
If we are unlucky, no match will be found, indicating one of two possibilities: either (i) the EST represents a CDS for which there is no similar sequence in the database (still a distinct possibility), or (ii) it represents a non-coding sequence that is not in the database. It is critical to the interpretation of EST analysis that a distinction is made between these two situations (Fig. 4.10).
Fig. 4.10 The alignment of fully sequenced cDNAs and ESTs with genomic DNA (exons 1 to 4). The solid lines indicate regions of alignment; for the cDNA, these are the exons of the gene. The dots between segments of cDNA or ESTs indicate regions in the genomic DNA that do not align with cDNA or EST sequences; these are the locations of the introns. The numbers above the cDNA line (240/241, 528/529, 696/697 and 816) indicate the base coordinates of the cDNA sequence, where base 1 is the 5'-most base and base 816 is the 3'-most base of the cDNA. For the ESTs, only a short sequence read from either the 5' or 3' end of the corresponding cDNA is obtained. This establishes the boundaries of the transcription unit, but it is not informative about the internal structure of the transcript unless the EST sequences cross an intron (as is true for the 3' EST depicted here). (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
4.7
PROTEIN SEQUENCING
Direct RNA sequencing involves the chemical characterization of modified nucleotides. The most sensitive comparisons between sequences are done at the protein level; detection of distantly related sequences is easier in protein translation, because the redundancy of the genetic code of 64 codons is reduced to 20 distinct amino acids, the functional building blocks of proteins. Because proteins are a functional abstraction of genetic events that occur in DNA, the loss of degeneracy at this level is accompanied by a loss of information that relates more directly to the evolutionary process. In the past, direct protein sequencing was carried out using a process called Edman degradation, in which the terminal residue of a protein is labeled, removed and then identified using a series of chemical tests. Current methods of protein sequencing rely on mass spectrometry (MS), a technique in which the mass/charge ratio (m/e or m/z) of ions in a vacuum is accurately determined, allowing molecular masses to be calculated.
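How a molecular mass follows from a measured mass/charge ratio can be shown with a short calculation. The sketch assumes positive-mode ionization in which a molecule of mass M carrying z extra protons is observed at m/z = (M + z × 1.007276)/z; other ionization modes or adducts would need a different correction, and the function names are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz_for(mass, charge):
    """Expected m/z of a molecule of the given neutral mass carrying `charge` protons."""
    return mass / charge + PROTON_MASS


def neutral_mass(mz, charge):
    """Neutral molecular mass recovered from an observed m/z and charge state."""
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # A hypothetical peptide of mass 1000 Da observed as a doubly charged ion.
    observed = mz_for(1000.0, 2)
    print(observed)
    print(neutral_mass(observed, 2))
```

The same peptide therefore appears at different m/z values depending on its charge state, and the charge must be known before the mass can be assigned.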
Determination of Structure
Protein structures can be determined using X-ray crystallography and nuclear magnetic resonance spectroscopy (NMR). X-ray crystallography involves the reconstruction of atomic positions based on the diffraction pattern of X-rays through a precisely orientated protein crystal. Scattered X-rays cause positive and negative interference, generating an ordered pattern of signals called reflections.
Structural determination depends on three variables: the amplitude and phase of the scattering (which depend on the number of electrons in each atom), and the wavelength of the incident X-rays. The basis of NMR spectroscopy is that some atoms, including natural isotopes of nitrogen, phosphorus and hydrogen, behave as tiny magnets and can switch between magnetic spin states in an applied magnetic field. This is achieved by the absorbance of radio-frequency electromagnetic radiation, generating NMR spectra. Other methods such as magic angle spinning NMR and circular dichroism spectroscopy are also used.
Prediction
There are three main approaches to secondary structure prediction: (i) empirical statistical methods that use parameters derived from known 3D structures; (ii) methods based on physicochemical criteria (e.g., fold compactness, hydrophobicity, charge, hydrogen bonding potential, etc.) and (iii) prediction algorithms that use known structures of homologous proteins to assign secondary structure. One of the standard empirical statistical methods is that of Chou and Fasman, which is based on observed amino acid conformational preferences in non-homologous proteins. But in spite of being a 'standard' approach, like all other methods, its reliability to derive the conformational potentials of the amino acids has been inadequate. By contrast, for prediction algorithms, the use of multiple sequence data can improve matters and may yield enhancements of several percent. Tertiary structure prediction (especially by methods that build on secondary structure predictions) remains even further out of reach.
4.8
GENE AND PROTEIN EXPRESSION ANALYSIS
The activity of a gene, in which the gene is used as a blueprint to produce a specific protein, is called gene expression. The patterns in which a gene is expressed provide clues to its biological role. All functions of cells, tissues and organs are controlled by differential gene expression. Gene expression is used for studying gene function. Knowledge of which genes are expressed in healthy and diseased tissues would allow us to identify both the proteins required for normal function and the abnormalities causing disease. This information will help in the development of new diagnostic tests for various illnesses as well as new drugs to alter the activity of the affected genes or proteins. Traditionally, gene expression has been studied at either the RNA or the protein level on a gene-by-gene basis using Northern and Western blot techniques. Now global expression analysis methods are available which study all genes simultaneously. A simple but expensive technique for analysis at the RNA level is direct sequence sampling from RNA populations or cDNA libraries or even from sequence databases.
In a more sophisticated technique called serial analysis of gene expression (SAGE), very short sequence tags (usually 8-15 nt) are generated from each cDNA and hundreds of these are joined together to form a concatemer prior to sequencing. In one sequencing reaction, information on the abundance of hundreds of mRNAs can be gathered. Each SAGE tag uniquely identifies a particular gene, and by counting the tags, the relative expression levels of each gene can be determined (Fig. 4.11).
Fig. 4.11 Simplified outline method for serial analysis of gene expression. Poly A RNA is primed with biotinylated oligo dT for cDNA synthesis; the cDNA is cut with NlaIII (CATG overhang) and captured on streptavidin-coated magnetic beads; the sample is divided into 'Pool A' and 'Pool B' for linker ligation; FokI releases 13 bp tags; the tags are ligated and PCR amplified; the products are digested with NlaIII, and the purified ditags are concatenated, cloned and sequenced. In the example concatemer CATGCCTAGTCAGGCGACTTCACATGCCAAAGTGCTTTCGAGACATGGAAGTCCTACGATCATGGCATG, tags 1 and 2 form ditag A, tags 3 and 4 form ditag B, and tags 5 and 6 form ditag C. NlaIII is a frequent-cutting restriction enzyme used initially to generate the 3' cDNA fragments and provide the overhang for linker ligation, and later to remove the linkers prior to concatemerization of the ditags. FokI is a type IIS restriction enzyme with a recognition site in the linker that generates the SAGE tags by cutting the DNA a few bases downstream. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd. 2003)
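Counting SAGE tags in a sequenced concatemer can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the protocol description above: ditags are separated by the NlaIII site CATG, every tag has the same fixed length, the second tag of each ditag is read as a reverse complement, and no tag itself contains CATG. The function names, the tag length and the example tags are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # NlaIII recognition site separating ditags
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer, tag_len=10):
    """Split a concatemer into ditags and each ditag into its two tags."""
    tags = []
    for ditag in concatemer.upper().split(ANCHOR):
        if len(ditag) < 2 * tag_len:
            continue  # empty pieces at the ends, or truncated ditags
        tags.append(ditag[:tag_len])
        tags.append(reverse_complement(ditag[-tag_len:]))
    return tags


def relative_expression(tags):
    """Fraction of all counted tags contributed by each distinct tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


if __name__ == "__main__":
    tag_a, tag_b = "AAACCCGGGT", "TTTGGGAACC"
    ditag_1 = tag_a + reverse_complement(tag_b)
    ditag_2 = tag_a + reverse_complement(tag_a)
    concatemer = ANCHOR + ditag_1 + ANCHOR + ditag_2 + ANCHOR
    print(relative_expression(tags_from_concatemer(concatemer)))
```

In practice each tag would then be looked up in a sequence database to identify the gene it came from.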
4.8.1 DNA Microarrays
At present, DNA arrays (DNA chips) are widely used. A DNA microarray or DNA chip is a dense grid of DNA elements (often called features or cells) arranged on a miniature support, such as a nylon filter or glass slide. Each feature represents a different gene. (The specificity of nucleic acid hybridization is such that a particular DNA or RNA molecule can be labeled, with a radioactive or fluorescent tag, to generate a probe, and can be used to isolate a complementary molecule from a very complex mixture, such as whole DNA or whole cellular RNA.) The array is usually hybridized with a complex RNA probe, i.e. a probe generated by labeling a complex mixture of RNA molecules derived from a particular cell type. The composition of such a probe reflects the levels of individual RNA molecules in its source. If non-saturating hybridization is carried out, the intensity of the signal for each feature on the microarray represents the level of the corresponding RNA in the probe, thus allowing the relative expression levels of thousands of genes to be visualized simultaneously. The most widely used method involves the robotic spotting of individual DNA clones onto a coated glass slide. Such spotted DNA arrays can have a density of up to 5000 features per square cm. The features comprise double-stranded DNA molecules (genomic clones or cDNAs) up to 400 bp in length and must be denatured prior to hybridization (Fig. 4.12).
Fig. 4.12 The process of differential expression measurement using a DNA microarray. DNA clones are first amplified by PCR, purified and printed robotically to form a microarray. Test and reference RNA samples are then reverse transcribed and labeled with different fluor dyes (Cy5 and Cy3), which fluoresce in different (red, green) wavelength bands. These are hybridized to the microarray. After excitation by two lasers, the fluorescence of each dye is measured for each sample, and the relative expression levels are analyzed by computer. (Source: Duggan D.J. et al., Expression profiling using cDNA microarrays. Nature Genet. 21 (suppl 2): pp 10-14, 1999).
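For a two-colour array of the kind shown in Fig. 4.12, the relative expression of a gene is commonly summarized as the logarithm of the ratio of the two dye signals. The sketch below assumes a simple local background subtraction and a small floor value to avoid taking the logarithm of zero or of a negative number; the function name, parameters and intensities are hypothetical.

```python
import math


def log2_ratio(test_signal, ref_signal, test_bg=0.0, ref_bg=0.0, floor=1.0):
    """log2 of the background-corrected test/reference intensity ratio.

    Positive values mean higher expression in the test sample, negative
    values higher expression in the reference, and 0 means no difference.
    """
    test = max(test_signal - test_bg, floor)
    ref = max(ref_signal - ref_bg, floor)
    return math.log2(test / ref)


if __name__ == "__main__":
    # Hypothetical spot: Cy5 (test) = 900, Cy3 (reference) = 300, background 100 each.
    print(log2_ratio(900, 300, test_bg=100, ref_bg=100))
```

Using the logarithm makes a two-fold increase and a two-fold decrease symmetrical (+1 and -1), which is convenient for the clustering methods applied later.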
Genechips
Another method is on-chip photolithographic synthesis, in which short oligonucleotides are synthesized in situ during chip manufacture. These arrays are known as Genechips. They have a density of up to 1,000,000 features per square cm, each feature comprising up to 10^9 single-stranded oligonucleotides 25 nt in length. Each gene on a Genechip is represented by 20 features (20 overlapping oligos), and 20 mismatching controls are included to normalize for nonspecific hybridization. Fluorescent probes are used for spotted DNA arrays, since different fluorophores can be used to label different RNA populations. These can be simultaneously hybridized to the same array, allowing differential gene expression to be monitored directly. In Genechips, hybridization is carried out with separate probes on two identical chips and the signal intensities are measured and compared by the accompanying analysis software.
Data Analysis
The raw data from microarray experiments consist of images from hybridized arrays. The exact nature of the image depends on the array platform (the type of array used). DNA arrays may contain many thousands of features. Therefore, data acquisition and analysis must be automated. The software for initial image processing is normally provided with the scanner. This allows the boundaries of individual spots to be determined and the total signal intensity to be measured over the whole spot (signal volume). The signal intensity should be corrected for background, and control measures should be included to measure nonspecific hybridization and variable hybridization across arrays. The aim of data processing is to convert the hybridization signals into numbers which can be used to build a gene expression matrix. The interpretation of a microarray experiment is carried out by grouping the data according to similar expression profiles. Clustering is a way of simplifying large data sets by partitioning similar data into specific groups. Many software applications are available for implementing microarray data analysis methods (Table 4.2).
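Grouping genes by similar expression profiles can be illustrated with a very small example. The sketch below represents the gene expression matrix as a dictionary of profiles and groups genes whose Pearson correlation with the first member of a group exceeds a threshold. This deliberately simple single-pass grouping is a stand-in chosen for brevity, not the hierarchical clustering offered by the tools in Table 4.2; the function names, threshold and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cluster_by_correlation(matrix, threshold=0.9):
    """Assign each gene to the first group whose founding gene it correlates with."""
    clusters = []
    for gene, profile in matrix.items():
        for cluster in clusters:
            founder = matrix[cluster[0]]
            if pearson(profile, founder) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar group found: start a new one
    return clusters


if __name__ == "__main__":
    # Rows are genes, columns are conditions (hypothetical values).
    expression = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(cluster_by_correlation(expression))
```

Correlation is used here rather than absolute difference so that genes whose expression rises and falls together are grouped even when their overall levels differ.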
Applications
DNA microarrays have the following applications:
(i) Investigating cellular states and processes: Patterns of expression that change with cellular state can give clues to the mechanisms of processes such as sporulation, or the change from aerobic to anaerobic metabolism.
(ii) Diagnosis of disease: Testing for the presence of mutations can confirm the diagnosis of a suspected genetic disease, including detection of late-onset conditions such as Huntington disease, and can determine whether prospective parents are carriers of a gene that could threaten their children.
Table 4.2: Internet resources for microarray expression analysis. The first two sites are very comprehensive and contain hundreds of links to databases, software and other resources. Two web-based suites of analysis programs are also listed, as well as some databases that store microarray and other gene expression data.

URL | Product(s) | Comments

Sites with extensive links to microarray analysis software and resources
http://smd.stanford.edu | Cluster, Xcluster, SAM, Scanalyze, many others | Extensive list of software resources from Stanford University and other sources, both downloadable and www-based.
http://smd.stanford.edu/resources/databases.shtml | Cluster, Cleaver, GeneSpring, Genesis, many others | Comprehensive list of downloadable and www-based software for microarray analysis and data mining, plus links to gene expression databases.

www-based microarray data analysis
http://www.ebi.ac.uk/expressionprofiler/ | Expression profiler | Very powerful suite of programs from the EBI for analysis and clustering of expression data.
http://bioinfo.cnio.es/dnarray/analysis/ and http://www.cbs.dtu.dk/biotools/DNAarraytools.php | DNA arrays analysis tools | A suite of programs from the National Spanish Cancer Centre (CNIO) including two-sample correlation plot, hierarchical clustering, SOM, neural network and tree viewers.

Microarray databases
http://www.ncbi.nlm.nih.gov/geo/ | GEO (Gene Expression Omnibus), National Centre for Biotechnology Information (NCBI) | GEO is a gene expression and hybridization array database, which can be searched by accession number, through the contents page or through the Entrez ProbeSet search interface.
http://www.ebi.ac.uk/arrayexpress/ | ArrayExpress | EBI microarray gene expression database. Developed by MGED and supports MIAME.
http://www.ncgr.org/genex/, http://genex.gene-quantification.info/ and http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml | GeneX | The GeneX gene expression database is an integrated tool set for the analysis and comparison of microarray data.
(iii) Genetic warning signs: Some diseases are not determined entirely and irrevocably by genotype, but the probability of their development is correlated with genes or their expression patterns. A person aware of an enhanced risk of developing a condition can in some cases improve his or her prospects by adjustments in lifestyle.
(iv) Drug selection: Detection of genetic factors, or patterns of gene expression, that govern responses to drugs. (v) Classification of disease: Different types of leukemia can be identified by different patterns of gene expression. Knowing the exact type of disease is important in selecting optimal treatments. (vi) Target selection for drug design: Proteins showing enhanced transcription in particular disease states might be candidates for attempts at pharmacological intervention (provided that it can be demonstrated, by other evidence, that enhanced transcription contributes to or is essential for the maintenance of the disease state). (vii) Pathogen resistance: Comparisons of genotypes or expression patterns between bacterial strains susceptible and resistant to an antibiotic point to the proteins involved in the mechanism of resistance.
4.8.2 Protein Expression Analysis
Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) is a well-established biochemical technique in which proteins are separated on the basis of two separate properties: their isoelectric point (pI) (charge) and their molecular mass. Separation in the first dimension is carried out by isoelectric focusing in an immobilized pH gradient. The pH gradient is generated by a series of buffers, and an immobilized pH gradient is produced by covalently linking the buffering groups to the gel, thus preventing migration of the buffer itself during electrophoresis.
Isoelectric focusing Isoelectric focusing means allowing proteins to migrate in an electric field until the pH of the buffer is the same as the pI of the protein. The pI of the protein is the pH at which it carries no net charge and therefore does not move in the applied electric field. Next the gel is equilibrated in the detergent sodium dodecylsulphate (SDS), which binds uniformly to all proteins and confers a net negative charge. Therefore, separation in the second dimension can be carried out on the basis of molecular mass. After the second dimension separation, the protein gel is stained with a universal dye to reveal the position of all protein spots. Reproducible separations can then be carried out with similar samples to allow comparison of protein expression levels. It provides a diagnostic protein fingerprint of any particular sample (Fig. 4.13). The stained protein gel is scanned to obtain a digital image. Individual protein spots are then detected and quantified, and the intensity of the signal for each spot is corrected for local background. Several algorithms are available based on Gaussian fitting or Laplacian of Gaussian spot detection. Spots whose morphology deviates from a single Gaussian shape can be interpreted using a model of overlapping shapes.
Fig. 4.13 A section from a 2D protein gel. The sample has been separated on the basis of isoelectric pH (horizontal dimension) and molecular mass (vertical dimension). Each spot should correspond to a single protein.
Other Methods
A simpler approach is line and chain analysis, in which columns of pixels from the digital image are scanned for peaks in signal density. This process is repeated for adjacent pixel columns, allowing the algorithm to identify the centers of spots and their overall signal intensity. Another method is known as watershed transformation. In this method, pixel intensities are viewed as a topographical map so that hills and valleys can be identified. This is useful for separating clusters, chains and small spots overlapping with larger ones (shouldered spots) and also for merging regions of a single spot. The output of each method is a spot list. Differential protein expression can also be analysed using 2D-PAGE. This can be used to look for proteins that are induced or repressed by particular treatments or drugs, to look for proteins associated with disease states, or to look at changes in protein expression during development. Once protein expression data have been recorded, they are built into a protein expression matrix. The results from 2D-PAGE experiments are generally stored in 2D-PAGE databases. They can be found at:
http://www.ucl.ac.uk/ich/services/labservices/mass_spectrometry/proteomics/technologies/2d_page
http://world-2dpage.expasy.org/swiss-2dpage/
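The line and chain idea described in this section can be illustrated with a short Python sketch. It is a simplified reading of the description, not the algorithm of any specific gel analysis package: the image is a small invented grid of pixel intensities, the threshold and the permitted row shift between adjacent columns are assumed parameters, and the reported intensity is the sum of the peak pixels along a chain rather than the full spot volume.

```python
def column_peaks(column, threshold):
    """Row indices of local maxima in one pixel column that reach the threshold."""
    peaks = []
    for r, value in enumerate(column):
        if value < threshold:
            continue
        above = column[r - 1] if r > 0 else -1
        below = column[r + 1] if r < len(column) - 1 else -1
        if value > above and value >= below:
            peaks.append(r)
    return peaks

def summarize(image, chain):
    """Describe one chain of peaks: brightest peak as center, summed peak signal."""
    center = max(chain, key=lambda rc: image[rc[0]][rc[1]])
    intensity = sum(image[r][c] for r, c in chain)
    return {"center": center, "intensity": intensity}

def find_spots(image, threshold, max_shift=1):
    """Scan columns left to right and chain peaks in adjacent columns into spots."""
    n_rows, n_cols = len(image), len(image[0])
    open_chains, finished = [], []
    for c in range(n_cols):
        column = [image[r][c] for r in range(n_rows)]
        still_open = []
        for r in column_peaks(column, threshold):
            for chain in open_chains:
                if abs(chain[-1][0] - r) <= max_shift:
                    open_chains.remove(chain)   # continue an existing chain
                    chain.append((r, c))
                    break
            else:
                chain = [(r, c)]                # start a new chain
            still_open.append(chain)
        finished.extend(open_chains)            # chains not continued in this column
        open_chains = still_open
    finished.extend(open_chains)
    return [summarize(image, chain) for chain in finished]

# Invented 5 x 7 image with one large spot and one small spot.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 2, 5, 2, 0, 0, 0],
    [0, 5, 9, 5, 0, 0, 0],
    [0, 2, 5, 2, 0, 3, 0],
    [0, 0, 0, 0, 0, 7, 0],
]

spot_list = find_spots(image, threshold=3)
print(spot_list)
```

The result is a spot list in the sense used above: one entry per detected spot, with a center position (row, column) and a signal value that can later be corrected for background.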
4.8.3 Gene Discovery
Lately, substantial financial resources have been spent in the search for genes that may be linked to particular types of diseases. The objective is to develop new therapies with which to combat a wide variety of prevalent disorders, such as cancer, tuberculosis, asthma, etc. There are two main strategies for discovering proteins that may represent suitable molecular targets, whether for small-molecule drug discovery or for gene therapy.
Approaches One approach for discovering disease-related genes is the technique of positional cloning. Here the chromosome linked to the disease in question is found out by analyzing a population of people some of whom exhibit the disease. Once a link to a chromosomal region is established, a large part of the chromosome in the vicinity of the region (locus) is sequenced, yielding several megabases of DNA. Such a locus can contain many genes, only one of which is likely to be involved in some way in the disease process. Sequence searching and gene prediction techniques can be used to increase the efficiency of gene identification in the locus, but ultimately several genes will need to be expressed, and further experimentation (or validation) will be required to confirm which gene is actually involved in the disease. Although genes discovered in this way can be very illuminating from an academic point of view, they do not necessarily represent good drug targets (or points of therapeutic intervention). Another approach to gene discovery, requiring much less sequencing effort and relying more heavily on the powerful search capabilities of current computer systems, examines the genes that are actually expressed in healthy and diseased tissues. This allows a comparison to be performed between the two states, and a process of reasoning applied to arrive at a potential drug target in a more direct way. This process analyses the mRNAs, which are used by the cellular machinery as a template for the construction of the proteins themselves.
Gene Finding
In gene finding, elements such as splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosome binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites and various transcription factor binding sites are generally included. Local sites like these are called ‘signals’ and are detected by ‘signal sensors’. In contrast, extended and variable-length sequences such as exons and introns are called ‘contents’ and are detected by ‘content sensors’. The most sophisticated signal sensors in use are neural nets. The most commonly used content sensor is the one which predicts coding regions. Several systems that combine signal and content sensors have been developed in an attempt to identify complete gene structure. Such systems are capable of handling more complex interdependencies between gene features. Genelaug is one of the earliest integrated gene finders, which uses dynamic programming to combine scored regions and sites into a complete gene prediction with a maximal total score.
Dynamic programming approaches of this kind can be formulated as models that include a latent or hidden variable associated with each nucleotide, representing the functional role or position of that nucleotide. These models are called hidden Markov models (HMMs). The most popular statistical methods used for gene finding are Markov models, as used in the GeneMark program. Some of the important gene finding HMMs include Ecoparse, Expound, etc. A list of computational gene finding databases is given in Table 4.3. In prokaryotes, it is still common to locate genes by simply looking for an open reading frame (ORF). This is certainly not adequate for higher eukaryotes. To distinguish between coding and noncoding regions in higher eukaryotes, exon content sensors are used; these rely on statistical models of the nucleotide frequencies and dependencies that are present in codon structure.

Table 4.3: Computational gene finding databases and genefinders

Datasets and genefinders | Accession sites

1. Genefinding datasets | http://www.cbcb.umd.edu/research/genefinding.shtml
a) Single genes | ftp://www-hgc.ilb.gov/pub/genesets/
b) Annotated contigs | http://igs-server.cors-mrs.fr/banbury/index/hyml

c) HMM-based gene finders
Genie | http://www.fruitfly.org/seq_tools/genie.html
Genscan | http://genes.mit.edu/GENSCANinfo.html, http://genes.mit.edu/GENSCAN.html
HMMgene | http://www.cbs.dtu.dk/services/HMMgene/
GeneMark | http://opal.biology.gatech.edu/GeneMark/
Pirate | http://www.cbcb.umd.edu/software/pirate/

d) Other gene finders
AAT | http://aatpackage.sourceforge.net/
FGENEH | http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind
GENEID | http://genome.crg.es/geneid.html
GeneParser | http://beagle.colorado.edu/~eesnyder/geneparser.html
Glimmer | http://www.cbcb.umd.edu/software/glimmer/
Grail | http://grail.lsd.ornl.gov/grailexp/
Procrustes | http://www-hto.usc.edu/software/procrusters
GENE FINDING | http://www.molquest.com/molquest.phtml?group=index&topic=gfind, http://www.biologie.uni-hamburg.de/b-online/library/genomeweb/GenomeWeb/nuc-geneid.html
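The simple ORF search mentioned above for prokaryotes can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any of the gene finders in Table 4.3: it scans only the three forward reading frames, treats ATG as the only start codon, and uses an arbitrary minimum length. The function name and the test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the A in ATG; end is the index just past the
    stop codon; min_codons counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs

sequence = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(sequence)
for start, end, frame in orfs:
    print(frame, sequence[start:end])
```

A fuller search would also scan the reverse complement and allow alternative start codons. As the text notes, this kind of scan is not adequate for higher eukaryotes, where introns interrupt the coding sequence.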
Levels of Gene Expression
The human genome is complex, consisting of about 3 billion base pairs (bp) of DNA. Yet only 3% of the DNA is coding sequence (i.e. that part of the genome that is transcribed and translated into protein). The rest of the genome consists of areas necessary for compact storage of the chromosomes, replication at cell division, the control of transcription, and so on. A large part of the work of sequence analysis is centered on analyzing the products of the transcription/translation machinery of the cell, i.e. protein sequences and structures. Recently much industrial emphasis has been placed on the study of mRNA; this is partly because a conceptual translation into protein sequence can be generated readily, but the main reason is that mRNA molecules represent the part of the genome that is expressed in a particular cell type at a specific stage in its development. Thus, in simple terms, we have three levels of genomic information: (i) the chromosomal genome (genome) – the genetic information common to every cell in the organism, (ii) the expressed genome (transcriptome) – the part of the genome that is expressed in a cell at a specific stage in its development and (iii) the proteome – the protein molecules that interact to give the cell its individual character. For each level, different analytical tools and interpretative skills are required. Cells express a different range of genes at various stages during their development and functioning. This characteristic range of gene expression is the expression profile of the cell. By capturing the cell’s expression profiles we can build up a picture of what levels of gene expression may be normal or abnormal and what the relative expression levels are between different genes within the same cell. This process also provides a rapid approach to gene discovery that complements full-blown genome sequencing projects.
Capturing Expression Profile The procedure for capturing an expression profile is as follows: First a sample of cells is obtained; then RNA is extracted from the cells and is stabilized by using reverse transcriptase to run off cDNA from the RNA template. The cDNA is transformed into a library (a cDNA library) suitable for use in rapid sequencing experiments. A sample of clones is selected from the library at random – e.g. 10000 from a library with a complexity of 2 million clones. A substantial automated sequencing operation is required to produce 10,000 sequencing reactions, and then to run these on automated sequencers. The resulting data are downloaded to computers for further analysis. The ideal result is a set of 10000 sequences each between 200 and 400 bases in length, representing part of the sequence of each of the 10000 clones. In reality, some sequencing runs will fail altogether, some will fail to produce
sufficient sequence data and some will fail to produce data of appropriate quality. The sequences that emerge successfully from this process are called Expressed Sequence Tags (ESTs). ESTs are submitted to GenBank, EMBL and DDBJ. ESTs can be accessed through all these databases. The same ESTs are available from NCBI’s dbEST.
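The screening step described above, in which some sequencing runs fail, some are too short and some are of poor quality, can be sketched as a simple filter. The cutoffs here are assumptions chosen for illustration (a minimum length of 200 bases, taken from the ideal range mentioned above, and a hypothetical limit of 5% ambiguous 'N' base calls as a stand-in for a real quality score); the function names and the reads are invented.

```python
def classify_read(seq, min_length=200, max_n_fraction=0.05):
    """Label one sequencing read as failed, too_short, low_quality or est."""
    if not seq:
        return "failed"                        # the run produced no sequence
    if len(seq) < min_length:
        return "too_short"                     # insufficient sequence data
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return "low_quality"                   # too many ambiguous base calls
    return "est"

def select_ests(reads, **cutoffs):
    """Split reads (clone id -> sequence) into accepted ESTs and a tally."""
    counts = {"failed": 0, "too_short": 0, "low_quality": 0, "est": 0}
    ests = {}
    for clone_id, seq in reads.items():
        label = classify_read(seq, **cutoffs)
        counts[label] += 1
        if label == "est":
            ests[clone_id] = seq
    return ests, counts

# Invented reads standing in for the output of automated sequencers.
reads = {
    "clone1": "",
    "clone2": "ACGT" * 30,                     # 120 bases
    "clone3": "ACGT" * 50 + "N" * 50,          # 250 bases, 20% N
    "clone4": "ACGT" * 75,                     # 300 bases
}

ests, counts = select_ests(reads)
print(sorted(ests), counts)
```

In practice, quality is usually judged from per-base quality scores produced by the base-calling software rather than from the count of ambiguous bases used in this sketch.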
4.9 HUMAN GENOME PROJECT
A genome is the entire DNA in an organism. Robert Sinsheimer, a molecular biologist by training, made the first proposal for the Human Genome Project (HGP) in 1985. While he was the chancellor of the University of California, he organized a scientific meeting to discuss the possibility of the project. Charles DeLisi, Head, Division of Health and Environmental Research, Department of Energy (DOE), came to know about the HGP proposal and became an avid supporter of the project. In 1986, DeLisi convened a meeting of scientists engaged in DNA research at the Livermore and Los Alamos laboratories in the USA and suggested that they carry out the project with the primary goal of determining the nucleotide sequence of the human genome. Owing to legal problems, the National Academy of Sciences appointed a committee, which suggested that both DOE and the National Institutes of Health (NIH) should be involved, with a common advisory board. In 1987, under the leadership of James Wyngaarden, NIH secured $17.4 million for the project. James D. Watson became the first director of the new ‘Office of Human Genome Research’ (OHGR). The OHGR appointed Norton Zinder as chairman of the Program Advisory Committee on the Human Genome. In 1990, the office became a ‘center’ and was called the National Center for Human Genome Research (NCHGR). In 1998, the NCHGR programme became a national project, and the HGP has been the largest and most complex international collaboration, with funding from governments and many charitable societies across the world. The project goals are to:
• identify all of the approximately 30,000 genes in human DNA
• determine the sequence of the 3 billion nucleotide base pairs of human DNA
• store information in databases
• develop tools for data analysis
• transfer related technologies to the private sector, and
• address the ethical, legal and social issues that may arise from the project.
The first working draft of the entire human nuclear genome was published in February 2001 issues of the Journals Nature and Science. Due to rapid technological advancement, the project was completed by April 2003 itself (even though 2005 was the projected year of completion) and the complete high quality reference sequence was made available to researchers worldwide for practical applications.
Salient Features
A number of genes and their association with human diseases have also been established. The content and some of the salient genetic features of the human genome (Figure 4.14) are highlighted below:
• The human genome contains 3.2 billion nucleotide bases (A, C, T and G).
• The sizes of the genes vary greatly. The average gene consists of 3000 bases. The largest known human gene is dystrophin (2.4 million bases).
• The functions are unknown for more than 50% of discovered genes.
• About 99.9% of the human genome sequence is the same in all people.
• About 2% of the genome encodes instructions for the synthesis of proteins.
Fig. 4.14 Content of the human genome (Based on IHGSC April 2003)
Human nuclear genome: 3200 million bases
    Genes and related sequences: 1200 Mb
        Genes: 48 Mb
        Related sequences: 1152 Mb (pseudogenes; gene fragments; introns, untranslated regions)
    Intergenic DNA (junk DNA): 2000 Mb
        Interspersed repeats: 1400 Mb
            Long interspersed nuclear elements: 640 Mb
            Short interspersed nuclear elements: 420 Mb
            Long terminal repeats: 250 Mb
            Mobile DNA: 90 Mb
        Other intergenic regions: 600 Mb
            Short tandem repeats: 90 Mb
            Other repeats: 510 Mb
• Repeat sequences (those which do not code for proteins) make up about 50% of the genome. (Repeat sequences are thought to maintain chromosome structure and dynamics. By rearrangement, they create entirely new genes or modify and reshuffle existing genes.)
• About 40% of the human proteins show similarity with fruit-fly or worm proteins.
• Genes appear to be spread randomly throughout the genome, with vast expanses of noncoding DNA in between.
• Chromosome 1 (the largest human chromosome) has 2968 genes and the Y chromosome (the smallest human chromosome) has 231 genes.
• Candidate genes have been identified for numerous diseases and disorders, including breast cancer, muscle disease, deafness and blindness.
• Single nucleotide polymorphisms can occur at 3 million locations.
• Every 2 kb contains a microsatellite (short tandem repeat).
Anderson et al. have decoded the entire sequence of the human mitochondrial genome. The circular, double-stranded genome contains 16,569 base pairs and 37 genes. Among them, thirteen genes code for respiratory complex proteins and the other 24 genes specify RNA molecules needed for the expression of the mitochondrial genome.
The ‘Periodic Table of Life’ developed from HGP will be beneficial to everyone in many ways. James Watson and the joint NIH-DOE genome advisory panel were against patenting the genes. They were of the view that the public was paying for deciphering the genome and must therefore decide what to do with the information. Also, scientists should have access to all available gene data for the advancement of genome research programs. In 1997, NIH made the sequence information in GenBank accessible to everyone through the Internet. This encouraged many to refrain from taking out patents on raw sequence data.
Benefits of Genome Research
The findings of various genome research programs will be beneficial in the following areas:

Molecular Medicine
• to develop better disease diagnosis
• to detect genetic predispositions to diseases
• to design drugs based on molecular information and individual genetic profiles
• to improve gene therapy

Microbial Genomics
• to detect and treat pathogens speedily
• to develop new biofuels
• to protect citizens from biological and chemical warfare
• to clean up toxic waste safely and efficiently

Risk Assessment
• to evaluate the level of health risk in individuals who are exposed to radiation or mutagens
• to detect pollutants and monitor environments

Anthropology and Evolution
• to study evolution due to germline mutations
• to study migration of different population groups
• to study mutations on the Y chromosome to trace lineage and migration of males

DNA Identification
• to identify criminals whose DNA may match evidence left at crime scenes
• to exonerate persons wrongly accused of crimes
• to establish paternity and other family relationships
• to identify endangered and protected species
• to detect bacteria and other organisms that may pollute the environment
• to match donors with recipients in organ transplant programs
• to determine pedigree for seed or livestock breeds

Agriculture and Animal Science
• to grow crops with disease and drought resistance
• to achieve high productivity
• to breed farm animals
• to develop biopesticides
STUDY QUESTIONS
1. What does the basic DNA sequencing reaction consist of?
2. Describe how DNA sequencing is done.
3. What is the role of an open reading frame?
4. How do you determine the sequence of a clone?
5. What are expressed sequence tags?
6. How is an expressed sequence tag sequenced?
7. What are the methods of protein sequencing?
8. What is a DNA microarray?
9. How does a DNA microarray work?
10. Name some of the URLs that are used as internet resources for microarray expression analysis.
11. How is protein expression analysis performed?
12. What are the approaches used for gene discovery?
13. Name several organisms whose genomes have been successfully sequenced.
14. How will the human genome project be of benefit to various researchers and human beings?
15. What are the contents of the human nuclear genome?
16. What are the outcomes of the human genome project?
17. Mention the various goals of the human genome project.
18. Why is it important to know about the human genome?
CHAPTER 5
Databases, Tools and their Uses

Today biological data are gathered and stored all over the world. In order to interpret these data in a biologically meaningful way, we need special tools and techniques. Databases and programs allow us to access the existing information and to compare these data to find similarities and differences. The various Internet-based molecular biology databases have their own unique navigation tools and data storage formats. Given a sequence, or a fragment of a sequence, how do we find sequences in the database that are similar to it? Given a protein structure, or a fragment, how do we find protein structures in the database that are similar to it? Given the sequence of a protein of unknown structure, how do we find structures in the database that adopt similar 3D structures? Given a protein structure, how do we find sequences in the database that correspond to similar structures? Different data retrieval tools help to solve these problems.
5.1 IMPORTANCE OF DATABASES
A database is a logically coherent collection of related data with inherent meaning built for certain application. It is composed of entries – discrete coherent parcels of information. It is a general repository of information and contains records to be processed by a program. Its contents can easily be accessed, managed, and updated. Databases can be searched or cross-referenced either over the Internet or using downloaded versions on local computers or computer networks by multiple users. The databases are electronic filing cabinets, a convenient and efficient method of storing vast amount of information. They are assemblages of analyzed biological information into central and shareable resources. Databases are needed to collect and preserve data, to make data easy to find and search, to standardize data representation and to organize data into knowledge. The primary goals of databases are, (i) minimizing data redundancy and (ii) achieving data independence.
Information available in these databases can be searched, compared, retrieved and analyzed. Databases are essential for managing similar kinds of data and developing a network to access them across the globe. A large amount of biological information is available all over the world through the www, but the data are widely distributed and it is therefore necessary for scientists to have efficient mechanisms for data retrieval. If we are to derive maximum benefit from the deluge of sequence information that is available today, we must establish, maintain and disseminate databases, provide easy-to-use software to access the information they contain, and design state-of-the-art analysis tools to visualize and interpret the structural and functional clues hidden in the data. Databases of nucleic acid and protein sequences maintain facilities for a very wide variety of information retrieval and analysis operations, such as retrieval of sequences from the database, sequence comparison, translation of DNA sequences to protein sequences, simple types of structure analysis and prediction, pattern recognition and molecular graphics. Some examples of such databases are Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) and OMIM. ExPASy is an information retrieval and analysis system (http://www.expasy.ch).
Types of Databases
There are many different database types, depending both on the nature of the information being stored and on the manner of data storage. Databases are broadly classified into two types, namely, generalized databases and specialized databases. Examples of generalized databases are DNA, protein, carbohydrate or similar databases. Examples of specialized databases are expressed sequence tags (EST), genome survey sequences (GSS), single nucleotide polymorphisms (SNP), sequence tagged sites (STS), or similar databases. Other specialized databases include Kabat for immunology proteins and Ligand for enzyme reaction ligands. Generalized databases are again broadly classified into sequence databases and structure databases. Sequence databases contain the individual sequence records of either nucleotides or amino acids or proteins. Structure databases contain the individual records of biochemically solved structures of macromolecules (e.g. protein 3D structures). Two principal types of databases are: (i) relational and (ii) object-oriented. The relational database orders the data into tables made up of rows giving specific items in the database and columns giving the features as attributes of those items. The object-oriented database includes objects such as genetic maps, genes, or proteins, which have an associated set of utilities for analysis that help in identifying the relationships among these objects.
Classification More specifically databases can be classified into three types based on the complexity of the data stored: (i) Primary database, (ii) secondary database and (iii) composite database.
Primary database contains data in its original form, taken as such from the source. e.g. GenBank for genome sequences and SWISS-PROT for protein sequences. They are also known as archival databanks. Secondary database is a value added database which contains some specific annotated and derived information from the primary database, e.g. SCOP, CATH, PROSITE. These are the derived databanks that contain information collected from the archival databanks after analysis of their contents. Composite database amalgamates a variety of different primary database structures into one. A redundant database is a database where more than one copy of each sequence may be found. Databases constructed by using subsets of the original database for reducing sampling bias are often referred to as nonredundant databases. Some databases that form specialized resources are called boutique databases. They either have a species specific sequence data or contain sequences obtained through a particular technique (e.g. Saccharomyces genome database (SGD), Drosophila genome database, etc). In addition to these, Bibliographic Databanks and the databanks of websites are also available on the net.
Database Entries
Database entries comprise new experimental results, and supplementary information or annotations. Annotations include information about the source of data and the methods used to determine them. They identify the investigators responsible for the discovery and cite relevant publications. They provide links to connected information in other databanks. Curators in databanks base their annotations on the analysis of the sequence by computer programs. To make sure that all the fundamental data related to DNA and RNA are freely available, scientific journals require deposition of new nucleotide sequences in the database as a condition for publication of an article. Similar conditions apply to amino acid sequences, and to nucleic acid and protein structures. EMBL (European Molecular Biology Laboratory) nucleotide sequence database submission procedures are available at http://www.ebi.ac.uk/embl/submission.
Sequence Formats Many databases and software applications are designed to work with sequence data, and this requires a standard format for inputting nucleic acid and protein sequence information. Three of the most common sequence formats are NBRF/PIR (National Biomedical Research Foundation/ Protein Information Resource), FASTA and GDE. Each of these formats has facilities not only for representing the sequence itself, but also for inserting a unique code to identify the sequence and for making comments which may include for example the name of the sequence, the species from which it was derived, and an accession number for GenBank or another appropriate database.
NBRF/PIR format begins with either >P1; for protein or >N1; for nucleic acid. FASTA format begins with only ‘>’, and the GDE format begins with ‘%’. A feature table (lines beginning FT) is a component of the annotation of an entry that reports properties of specific regions, for instance coding sequences (CDS). The feature table may indicate regions that perform or affect function, that interact with other molecules, that affect replication, that are involved in recombination, that are a repeated unit, that have secondary or tertiary structure and that are revised or corrected.
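The leading characters described above are enough to tell the three formats apart, as the following Python sketch shows. It is a minimal illustration and not part of any database toolkit: the function names are hypothetical, the example sequences are invented, and the FASTA reader handles only the basic layout of a header line followed by sequence lines.

```python
def detect_format(text):
    """Guess the sequence format from the first non-blank line."""
    lines = text.strip().splitlines()
    first = lines[0] if lines else ""
    if first.startswith(">P1;"):
        return "NBRF/PIR (protein)"
    if first.startswith(">N1;"):
        return "NBRF/PIR (nucleic acid)"
    if first.startswith(">"):
        return "FASTA"
    if first.startswith("%"):
        return "GDE"
    return "unknown"

def parse_fasta(text):
    """Read FASTA text into a dict mapping identifier -> sequence.

    The identifier is taken to be the first word after '>'; the rest of
    the header line (the comment) is ignored in this sketch.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

fasta_text = ">seq1 invented example\nACGT\nTTGA\n>seq2\nGGCC\n"
print(detect_format(fasta_text))
print(parse_fasta(fasta_text))
```

Because NBRF/PIR headers also begin with ‘>’, the more specific prefixes are tested before the plain FASTA case.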
Database Record A typical database record contains three sections: (i) The header includes description of the sequence, its organism of origin, allied literature references and cross links to related sequences in other databases. Locus field contains a unique identifier summarizing the function of the sequence in abbreviation and is followed by an accession number in the Accession field. The organism field contains the binomial of the organism and its full taxonomic classification. (ii) The feature table contains a description of the features in the record like coding sequences, exons, repeats, promoters, etc., for the nucleotide sequences and domains, structure elements, binding sites, etc., for protein sequences. If the feature table includes a coding DNA sequence (CDS), links to the translated protein sequences are also mentioned in the feature description. (iii) The sequence (per se) is often more easily analyzed by the computer.
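The three sections of a record can be separated with a short Python sketch. The record below is invented for illustration and only loosely follows the layout of a GenBank flat file (header keywords such as LOCUS and ACCESSION, a FEATURES block, the sequence after ORIGIN, and a closing // line); the accession number, locus name and sequence are placeholders, and the parser ignores many details of real records, such as continuation lines in the header.

```python
RECORD = """\
LOCUS       EXAMPLE01     24 bp    DNA     linear   UNK 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   X00000
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
FEATURES             Location/Qualifiers
     CDS             1..24
                     /product="hypothetical protein"
ORIGIN
        1 atgaaatttg ggcccaaatt ttaa
//
"""

def split_record(text):
    """Split a flat-file record into header fields, feature lines and sequence."""
    header, features, residues = {}, [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        if line.startswith("FEATURES"):
            section = "features"
            continue
        if line.startswith("ORIGIN"):
            section = "sequence"
            continue
        if section == "header":
            parts = line.split(None, 1)        # keyword, then the rest of the line
            if len(parts) == 2:
                header[parts[0]] = parts[1].strip()
        elif section == "features":
            features.append(line.strip())
        else:
            # keep letters only, dropping position numbers and spaces
            residues.extend(ch for ch in line if ch.isalpha())
    return header, features, "".join(residues).upper()

header, features, sequence = split_record(RECORD)
print(header["ACCESSION"], header["ORGANISM"])
print(features)
print(sequence, len(sequence))
```

Keeping the annotation (header and feature table) apart from the raw sequence reflects the point made above: the sequence itself is the part most easily handed to analysis programs.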
Database Management System A database management system (DBMS) is software that allows databases to be defined, constructed and manipulated. It is a set of programs that manages any number of databases. The DBMS consists of a user interface through which on-line users and application developers communicate with the system, a database engine to manage the storage and access of physical data on disk, and a data dictionary to record all information about the database, including schemas, index details and access rights. The DBMS is responsible for (i) accessing data, (ii) inserting, updating and deleting data, (iii) security, (iv) integrity, (v) logging, (vi) locking, (vii) supporting batch and online programs, (viii) facilitating backups and recoveries, (ix) optimizing performance, (x) maximizing availability, (xi) maintaining the catalog and directory of database objects, (xii) managing the buffer pools, and (xiii) acting as an interface to other systems’ programs. The DBMS provides data independence, data sharing, non-redundancy, consistency, security and integrity.
Types There are three traditional types of database management systems: hierarchical, relational and network. Hierarchical and network models are based on traversing data links to process a database. The data are represented by a hierarchical structure, and connections are defined and implemented by physical address pointers within the records. They are typically used for large mainframe systems.
Relational Database Management System The relational database management system (RDBMS) has become popular largely because of its simple data model. Data are presented as a collection of relations. Each relation is depicted as a table. A row corresponds to a record and a column corresponds to a field. Each table contains only one type of record. Each record in a table has the same number of fields. The order of the records within a table has no significance. Columns of the tables are attributes. Each row of a table is uniquely identified by the data values (entities) from one or more columns. The column that uniquely identifies each row is the primary key. Microsoft Access and Oracle are well-known RDBMSs. Microsoft Access provides a graphical user interface that makes it very easy to define and manipulate databases. Access allows one to work with various tab options like tables, queries, forms and reports separately. Another RDBMS, PostgreSQL, is widely used on Linux systems. The RDBMS rests on a mathematical foundation: database operations are based on set theory. The relational algebra provides a collection of operations to manipulate relations. It supports the notion of a query, or request to retrieve information from a database, in a set-theoretic fashion. The relational calculus is a formal query language. Instead of having to write a sequence of relational algebra operations, we simply write a single declarative expression describing the results we want. The expressive power is similar to that of relational algebra. Many commercial query languages in use today are based on the relational calculus; the best known is the Structured Query Language (SQL).
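The set-theoretic character of relational algebra can be illustrated without any database software. The sketch below, written in Python, models each relation as a set of rows and implements three operations (selection, projection and a join on a shared attribute). The relation names, the attribute names and the accession values are all invented for illustration, and the helper functions are assumptions of this sketch rather than part of any RDBMS.

```python
from collections import namedtuple

# Each relation is a set of rows; row order has no significance.
Sequence = namedtuple("Sequence", ["accession", "organism", "length"])
Organism = namedtuple("Organism", ["organism", "kingdom"])

sequences = {
    Sequence("ACC001", "yeast", 77),      # accession acts as the primary key
    Sequence("ACC002", "human", 1200),
    Sequence("ACC003", "yeast", 540),
}
organisms = {
    Organism("yeast", "Fungi"),
    Organism("human", "Animalia"),
}


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return {row for row in relation if predicate(row)}


def project(relation, *fields):
    """Projection: keep only the named columns (duplicates collapse)."""
    return {tuple(getattr(row, f) for f in fields) for row in relation}


def join_on(left, right, key):
    """Join: pair rows from both relations that agree on the key column."""
    return {
        (l, r) for l in left for r in right
        if getattr(l, key) == getattr(r, key)
    }


yeast = select(sequences, lambda row: row.organism == "yeast")
accessions = project(yeast, "accession")
joined = join_on(sequences, organisms, "organism")

print(sorted(accessions))
print(len(joined))
```

Because the relations are Python sets, duplicate rows cannot occur and the results come back in no particular order, which mirrors the properties of a relation described above.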
Structured Query Language Structured Query Language (SQL) is a set of commands that gives access to a database. SQL is a tool for organizing and retrieving data stored in a computer database. SQL is a non-procedural language. This means that when using SQL we have to specify what is to be done and not how to do it. It is a high-level language in which one can get, modify and manipulate information from the database using common English words and phrases like select, create, drop, update, insert, etc. There are different types of commands:
(i) Data definition language (DDL): These commands create, delete and modify database objects such as tables, views and indexes.
(ii) Data manipulation language (DML): These commands are used to insert, delete and modify data.
(iii) Data query language (DQL): These are SELECT statements used for retrieving data, which can be tested with DML commands.
(iv) Transaction control language (TCL): These commands are used to maintain data integrity while modifying data.
(v) Data control language (DCL): These commands are used for creating and maintaining databases and partitions and for assigning users to tables and other database objects.
(vi) Data retrieval language (DRL): These commands are used to retrieve data from a table or from more than one table.
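The command categories listed above can be tried out with the sqlite3 module from the Python standard library. This is a minimal sketch under stated assumptions: the table name, the column names and the accession values are made up, and SQLite is used only because it needs no server. SQLite has no user accounts, so data control commands are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throw-away in-memory database
cur = conn.cursor()

# DDL: define a table; accession is the primary key.
cur.execute(
    "CREATE TABLE entry ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT NOT NULL,"
    " length INTEGER NOT NULL)"
)

# DML: insert rows (toy values, invented for illustration).
cur.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [("ACC001", "yeast", 77), ("ACC002", "human", 1200), ("ACC003", "yeast", 540)],
)
conn.commit()  # TCL: make the inserts permanent

# DML followed by TCL: the update is undone by the rollback.
cur.execute("UPDATE entry SET length = 9999 WHERE accession = 'ACC001'")
conn.rollback()

# DQL: state what is wanted, not how to find it.
rows = cur.execute(
    "SELECT accession, length FROM entry WHERE organism = ? ORDER BY accession",
    ("yeast",),
).fetchall()
print(rows)

# The primary key rejects a second row with the same accession.
try:
    cur.execute("INSERT INTO entry VALUES ('ACC001', 'mouse', 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)

conn.close()
```

The SELECT statement names only the desired result; the choice of access path is left to the database engine, which is what is meant by a non-procedural language.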
Data Mining and Knowledge Discovery Biological databases continue to grow rapidly. A huge volume of data is available for the extraction of high-level information, including the development of new concepts, concept interrelationships and interesting patterns hidden in the databases. Data mining is the application of specific tools for pattern discovery and extraction. Knowledge discovery is concerned with the theoretical and practical issues of extracting high-level information (knowledge) from volumes of low-level data. It combines techniques from databases, statistics and artificial intelligence. Knowledge discovery comprises several data preprocessing steps as well as data mining and knowledge interpretation steps. The goals of knowledge discovery are verification, prediction and description (explanation).
5.2 NUCLEIC ACID SEQUENCE DATABASES
The nucleic acid sequence databases are collections of entries. Each entry has the form of a text file. A text file contains text that can be read by human beings as well as by a computer. Each text file contains data and annotations for a single contiguous sequence. Many entries are assembled from several published papers reporting overlapping fragments of a complete sequence. Each entry is divided into fields. Fields are used to create indices for relational databases. Each field is essentially a table and the field values are indices. Unique accession numbers are allotted. The first nucleic acid sequence, that of a yeast tRNA with 77 bases, was announced around 1964. There are three premier institutes in the world, which constitute the International Nucleotide Sequence Database Collaboration. These are (i) the National Center for Biotechnology Information (NCBI), (ii) the European Molecular Biology Laboratory (EMBL), and (iii) the DNA Data Bank of Japan (DDBJ). Data are stored and exchanged daily. The databases contain not only sequences but also extensive annotations.
EMBL The EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl) is available at the EMBL European Bioinformatics Institute, UK. It contains a large and freely accessible collection of nucleotide sequences and accompanying annotations. Webin is the preferred tool for submission.
EMBL contains sequences from direct author submissions and genome sequencing groups, and from the scientific literature and patent applications. The database is produced in collaboration with DDBJ and GenBank; each of the participating groups collects a portion of the total sequence data reported worldwide, and all new and updated entries are then exchanged between the groups. The growth of the DNA databases has followed an exponential trend, with a doubling time now estimated to be about 9-12 months. The format of EMBL entries is consistent with the SWISS-PROT format. Information can be retrieved from EMBL using the SRS (Sequence Retrieval System); this links the principal DNA and protein sequence databases with motif, structure, mapping and other specialist databases and includes links to the MEDLINE facility. EMBL may be searched with query sequences via EMBL’s web interfaces to the BLAST and FASTA programs.
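To give a feel for what a doubling time of 9-12 months implies, exponential growth can be written as N(t) = N0 × 2^(t/T), where T is the doubling time. The short Python sketch below evaluates this expression; the function name and the three-year horizon are arbitrary choices made for illustration, and the result is a projection under the assumption of steady exponential growth, not a measured database size.

```python
def projected_size(current_size: float, months: float, doubling_time_months: float) -> float:
    """Size after `months`, assuming steady doubling every `doubling_time_months`."""
    return current_size * 2 ** (months / doubling_time_months)


# Relative growth over three years for the two ends of the quoted range.
for doubling_time in (9, 12):
    factor = projected_size(1.0, 36, doubling_time)
    print(f"doubling every {doubling_time} months: x{factor:g} in 36 months")
```

Under these assumptions a database grows 16-fold in three years if it doubles every 9 months, and 8-fold if it doubles every 12 months.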
DDBJ The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) contains expressed sequence tags (EST) and genome sequence data.
Procedure Open the internet browser and type the URL: www.ddbj.nig.ac.jp. Pull down the drop-down menu at the search option. Select protein or nucleotide. Type the query in the text box. Note down the details from the results page, which will show the accession number, description of the query, total number of base pairs, etc. The DDBJ database is produced, maintained and distributed at the National Institute of Genetics; sequences may be submitted to it from all corners of the world by means of a web-based data-submission tool. The web is also used to provide standard search tools such as FASTA and BLAST.
GenBank GenBank from NCBI incorporates sequences from publicly available sources, primarily from direct author submissions and large-scale sequencing projects. Information can be retrieved from GenBank using the Entrez integrated retrieval system. GenBank may be searched with user query sequences by means of the NCBI’s Web interface to the BLAST suite of programs. The increasing size of the database coupled with the diversity of the data sources available, have necessitated splitting GenBank database into 17 smaller discrete divisions with a 3 letter code each (Table 5.1).

Table 5.1: The 17 subdivisions of GenBank database

Number   Subdivision   Sequence subset
1.       BCT           Bacterial
2.       PLN           Plant, fungal, algal
3.       INV           Invertebrate
4.       PRI           Primate
5.       ROD           Rodent
6.       MAM           Other mammalian
7.       VRT           Other vertebrate
8.       PHG           Bacteriophage
9.       VRL           Virus
10.      RNA           Structural RNA
11.      SYN           Synthetic
12.      UNA           Unannotated
13.      EST           Expressed Sequence Tags
14.      STS           Sequence Tagged Sites
15.      GSS           Genome Survey Sequences
16.      HTG           High-throughput Genomic Sequences
17.      PAT           Patent
A GenBank entry consists of a number of keywords, relevant associated sub-keywords, and an optional Feature Table; its end is indicated by a // terminator. The positioning of these elements on any given line is important: keywords begin in column 1; sub-keywords begin in column 3; a code defining part of the Feature Table begins in column 5. Any line beginning with a blank character is considered a continuation from the keyword or sub-keyword above. Keywords include LOCUS, DEFINITION, NID, SOURCE, REFERENCE, FEATURES, BASE COUNT and ORIGIN. Most submissions are made using the web-based BankIt or the standalone Sequin programs. The main purpose of the GenBank database is to give the scientific community access to the most up-to-date and comprehensive DNA sequence information and to encourage its use.
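The column rules described above make a GenBank-style entry easy to split into fields. The following Python sketch is a simplified illustration, not a full parser: it treats every line that starts with a blank as a continuation of the keyword above (so sub-keywords are folded into their parent keyword), it stops at the // terminator, and it does not handle two-word keywords such as BASE COUNT. The entry itself is a made-up toy record, not a real GenBank entry.

```python
def parse_flat_entry(text: str) -> dict[str, str]:
    """Group the lines of one entry under the keyword that starts in column 1."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):       # end-of-entry terminator
            break
        if not line.strip():
            continue
        if line[0] != " ":              # a keyword begins in column 1
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:       # leading blank: continuation line
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


# Toy entry, invented for illustration only.
toy_entry = "\n".join([
    "LOCUS       TOY0001      12 bp    DNA",
    "DEFINITION  Made-up example entry for illustration",
    "            only.",
    "SOURCE      synthetic construct",
    "  ORGANISM  synthetic construct",
    "ORIGIN",
    "        1 atgcatgcat gc",
    "//",
    "LOCUS       IGNORED      lines after the terminator are not read",
])

fields = parse_flat_entry(toy_entry)
# The sequence lines carry position numbers and spaces; keep letters only.
sequence = "".join(ch for ch in fields["ORIGIN"] if ch.isalpha())

print(fields["DEFINITION"])
print(sequence, len(sequence))
```

For real work an established parser is preferable; the point here is only that the fixed column positions carry the structure of the record.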
GSDB The Genome Sequence Data Base (GSDB) is produced by the National Centre for Genome Resources at Santa Fe, New Mexico. GSDB creates, maintains, and distributes a complete collection of DNA sequences and related information to meet the needs of major genome sequencing laboratories. The format of GSDB entries is consistent with that of GenBank. The database is accessible either via the web, or using relational database client-server facilities. The main sequence databases have a number of subsidiaries for the storage of particular types of sequence data. dbEST is a division of GenBank which is used to store expressed sequence tags (ESTs). dbGSS is used to store single-pass genomic survey sequences (GSSs); dbSTS is used to store sequence tagged sites (STSs) and HTG (high-throughput genomic) is used to store unfinished genomic sequence data. OMIM (Online Mendelian Inheritance in Man) is a comprehensive database of human genes and genetic disorders maintained by NCBI.
Ensembl Ensembl (http://asia.ensembl.org/index.html) is intended to be the universal information source for the human genome. The goals are to collect and annotate all available information about human DNA sequences, link it to the master genome sequence and make it accessible to the many scientists who will approach the data with different points of view and requirements. To achieve this, in addition to collecting and organizing the information, very serious effort has gone into developing computational infrastructure. The program used to generate this resource, eMOTIF, is based on the generation of consensus expressions from conserved regions of sequence alignments. Ensembl is a joint project of the European Bioinformatics Institute and the Sanger Centre. It is organized as an open project; it encourages outside contributions. Data collected in Ensembl include genes, SNPs, repeats and homologies. Genes may either be known experimentally or deduced from the sequence. Because the experimental support for annotation of the human genome is so variable, Ensembl presents the supporting evidence for the identification of every gene. Very extensive linking to other databases containing related information, such as OMIM or expression databases, is also possible.
Specialized Genomic Resources In addition to the comprehensive DNA sequence databases, a variety of more specialized genomic resources also exists. The purpose of these specialized resources is to bring a focus (a) to species-specific genomics, and (b) to particular sequencing techniques. The Saccharomyces Genome Database (SGD), the TDB (TIGR) database, AceDB database are some examples. Here is a list of web addresses for nucleotide sequence databases.

EMBL     : http://www.ebi.ac.uk/embl/index.html
DDBJ     : http://www.ddbj.nig.ac.jp/
GenBank  : http://www.ncbi.nlm.nih.gov/genbank/
dbEST    : http://www.ncbi.nlm.nih.gov/dbEST/
GSDB     : http://www.ncgr.org/quick-jump/sequencing
SGD      : http://www.yeastgenome.org/
UniGene  : http://www.ncbi.nlm.nih.gov/unigene/
AceDB    : http://www.sanger.ac.uk/software/Acedb/
Webace   : http://www.acedb.org/Databases/
OMIM     : http://www.ncbi.nlm.nih.gov/omim

5.3 PROTEIN SEQUENCE DATABASE
Most amino acid sequence data arise by translation of nucleic acid sequences. The primary structure of a protein is its amino acid sequence; these are stored in primary databases as linear alphabets that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, which in sequence alignments are often apparent as well-conserved motifs; these are stored in secondary databases as patterns (e.g. regular expressions, fingerprints, blocks, profiles, etc.). The tertiary structure of a protein, arising from the packing of its secondary structure elements, may form discrete domains within a fold, or may give rise to autonomous folding units or modules; these are stored in structure databases as sets of atomic coordinates.
The first protein to be sequenced was insulin, in 1956, and its sequence consisted of 51 residues. From the beginning of the 1980s, sequence information started to become more abundant in the scientific literature. Hence, several laboratories started to harvest and store these sequences in central repositories. Many primary database centers evolved in different parts of the world. The Protein Sequence Database was developed at the National Biomedical Research Foundation at Georgetown University in the early 1960s by Margaret Dayhoff as a collection of sequences for investigating evolutionary relationships among proteins. Since 1988, the Protein Sequence Database has been maintained collaboratively by PIR-International, an association of macromolecular sequence data collection centers consisting of the Protein Information Resource (PIR) at the NBRF, the Japan International Protein Information Database (JIPID), and the Martinsried Institute for Protein Sequences (MIPS). The MIPS collects and processes sequence data for PIR-International.
PIR Databases The PIR is an effective combination of a carefully curated database, information retrieval access software and a workbench for investigations of sequences. The PIR also produces the Integrated Environment for Sequence Analysis (IESA). Its functionality includes browsing, searching and similarity analysis and links to other databases. The PIR maintains several databases about proteins:
(a) PIR-PSD: the main protein sequence database.
(b) iProClass: classification of proteins according to structure and function.
(c) ASDB: annotation and similarity database; each entry is linked to a list of similar sequences.
(d) PIR-NREF: a comprehensive non-redundant collection of over 800,000 protein sequences merged from all available sources.
(e) NRL3D: a database of sequences and annotations of proteins of known structure deposited in the Protein Data Bank.
(f) ALN: a database of protein sequence alignments.
(g) RESID: a database of covalent protein structure modifications.
The PIR database is split into four distinct sections, designated PIR1, PIR2, PIR3 and PIR4. They differ in terms of the quality of data and the level of annotation provided: PIR1 includes fully classified and annotated entries; PIR2 contains preliminary entries, which have not been fully reviewed and which may contain redundancy; PIR3 includes unverified entries, which have not been reviewed; and PIR4 entries fall into one of the following four categories: (i) conceptual translations of artefactual sequences, (ii) conceptual translations of sequences that are not transcribed or translated, (iii) protein sequences or conceptual translations that are extensively genetically engineered, and (iv) sequences that are not genetically encoded and not produced on ribosomes. Programs are provided for data retrieval and sequence searching via the NBRF-PIR database Web page.
SWISS-PROT The Swiss Institute of Bioinformatics (SIB) collaborates with the EMBL Data Library to provide an annotated database of amino acid sequences called SWISS-PROT. SWISS-PROT is a curated protein sequence database which strives to provide high-level annotations, including descriptions of the function of the protein, the structure of its domains, its post-translational modifications, variants and so on, with a minimal level of redundancy and a high level of integration with other databases. SWISS-PROT is interlinked to many other resources. The structure of the database and the quality of its annotations set SWISS-PROT apart from other protein sequence resources and have made it the database of choice for most research purposes. Entries start with an identification (ID) line and finish with a // terminator. ID codes in SWISS-PROT are designed to be informative and people-friendly; they take the form PROTEIN_SOURCE, where the PROTEIN part of the code is an acronym that denotes the type of protein, and SOURCE indicates the organism name. Since ID codes can sometimes change, an additional identifier, the accession number, is also provided; it remains static between database releases. The accession number is provided on the AC line, which is computer readable. If several numbers appear on the same AC line, the first, or primary, accession number is the most current. The DT lines provide information about the date of entry of the sequence into the database, and details of when it was last modified. The DE (description) line informs us of the name by which the protein is known. The following lines give the gene name (GN), the organism species (OS) and the organism classification (OC) within the biological kingdom. The next section of the entry provides a list of supporting references; these can be from the literature, unpublished information submitted directly from sequencing projects, data from structural or mutagenesis studies and so on.
Following the references, the comment (CC) lines are found. These are divided into themes, which tell us about the function of the protein, its posttranslational modifications, its tissue specificity, cellular location and so on. The CC lines also point out any known similarity or relationship to particular protein families. Database cross-reference (DR) lines follow the comment field. These provide links to other biomolecular databases, including primary sources, secondary databases, specialist databases, etc.
Immediately after the DR lines a list of relevant keywords (KW) is given, and then a number of FT lines can be found. The FT lines highlight regions of interest in the sequence, including local secondary structure (such as transmembrane domains), ligand binding sites, post-translational modifications and so on. Each line includes a key, the location of the feature in the sequence, and a comment, which might, for example, indicate the level of confidence of a particular annotation. The final section of the database entry contains the sequence itself on the SQ lines. Only the single-letter amino acid code is used. The structure of SWISS-PROT makes computational access to the different information fields both straightforward and efficient.
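Because every line of an entry starts with a two-letter code, the fields described above can be collected with a few lines of code. The Python sketch below is a simplified illustration under stated assumptions: the line content is taken to start at the sixth character, sequence lines are taken to be those with a blank line code, and the entry is a made-up toy record whose ID, accession numbers and sequence are invented.

```python
def parse_entry(text: str):
    """Collect the lines of one entry by their two-letter line code."""
    lines_by_code: dict[str, list[str]] = {}
    sequence_parts: list[str] = []
    for line in text.splitlines():
        if line.startswith("//"):           # end of entry
            break
        code, content = line[:2].strip(), line[5:].strip()
        if not code:                        # sequence data follow the SQ line
            sequence_parts.append(content.replace(" ", ""))
        else:
            lines_by_code.setdefault(code, []).append(content)
    return lines_by_code, "".join(sequence_parts)


# Toy entry, invented for illustration only.
toy_entry = "\n".join([
    "ID   TOY_HUMAN",
    "AC   X00001; X00002;",
    "DE   Made-up toy protein.",
    "OS   Homo sapiens.",
    "KW   Example.",
    "FT   DOMAIN 1 5 Made-up region.",
    "SQ   SEQUENCE 10 AA;",
    "     MKTAY IAKQR",
    "//",
])

entry, sequence = parse_entry(toy_entry)

# The first number on the AC line is the primary accession number.
primary_accession = entry["AC"][0].split(";")[0].strip()
# ID codes take the form PROTEIN_SOURCE.
protein, source = entry["ID"][0].split()[0].split("_")

print(primary_accession, protein, source, sequence)
```

The same loop works for any line code (DT, GN, OC, CC, DR and so on), which is one reason this layout lends itself to computational access.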
TrEMBL TrEMBL (translated EMBL) was designed in 1996 as a computer-annotated supplement to SWISS-PROT. The database benefits from the SWISS-PROT format, and contains translations of all coding sequences in EMBL. TrEMBL has two main sections, designated SP-TrEMBL and REM-TrEMBL. SP-TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be incorporated into SWISS-PROT, but that have not yet been manually annotated. REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT; these include immunoglobulins and T-cell receptors, fragments of fewer than eight amino acids, synthetic sequences, patented sequences, and codon translations that do not encode real proteins. TrEMBL was designed to allow very rapid access to sequence data from the genome projects, without having to compromise the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation. PIR is the most comprehensive resource, but the quality of its annotations is still relatively poor. SWISS-PROT is a highly structured database that provides excellent annotations, but its sequence coverage is poor compared to PIR.
NRL-3D The NRL-3D database is produced by PIR from sequences extracted from the Protein Data Bank (PDB). The titles and biological sources of the entries conform to the nomenclature standards used in the PIR. Bibliographic references and MEDLINE cross references are included, together with secondary structure, active site, binding site and modified site annotations, and details of experimental methods, resolution, R-factor, etc. Keywords are also provided. NRL-3D is a valuable resource, as it makes the sequence information in the PDB available both for keyword interrogation and for similarity searches. The database may be searched using the ATLAS retrieval system, a multidatabase information retrieval program specifically designed to access macromolecular sequence databases.
5.4 STRUCTURE DATABASES
Structure databases archive, annotate and distribute sets of atomic coordinates. They store a collection of three-dimensional biological macromolecular structures of proteins and nucleic acids. The longest-established database for protein structures is the Protein Data Bank (PDB). The website is http://www.rcsb.org/pdb/home/home.do. This is the single world-wide repository of structural data and is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers University, New Jersey, USA. (The associated Nucleic Acid Database (NDB) is also maintained here.) An equivalent European database is the Macromolecular Structure Database (MSD) maintained by the European Bioinformatics Institute. The website for MSD is http://www.ebi.ac.uk/Databases/structure.html. The RCSB and MSD databases contain the same data. A PDB entry normally contains the following information: the name of the protein, the species it comes from, who solved the structure, references to publications describing the structure determination, experimental details about the structure determination, the amino acid sequence, any additional molecules and the atomic coordinates. MSD includes a search tool called OCA, which is a browser database for protein structure and function, integrating information from numerous databanks. Another useful information source available at the EBI is the database of Probable Quaternary Structures (PQS) of biologically active forms of proteins.
Structural Classifications Many proteins share structural similarities, reflecting, in some cases, common evolutionary origins. The evolutionary process involves substitutions, insertions and deletions in amino acid sequences. For distantly related proteins, such changes can be extensive, yielding folds in which the numbers and orientations of secondary structures vary considerably. However, where, for example, the functions of proteins are conserved, the structural environments of critical active site residues are also conserved. With a view to better understanding sequence-structure relationships, structure classification schemes have been developed. Several websites offer hierarchical classifications of the entire PDB according to the folding patterns of the proteins.
(i) SCOP: Structural Classification of Proteins.
(ii) CATH: Class/Architecture/Topology/Homology.
(iii) DALI: based on extraction of similar structures from distance matrices.
(iv) CE: a database of structural alignments.
SCOP Database The SCOP database describes structural and evolutionary relationships between proteins of known structure. Since current automatic structure comparison tools cannot reliably identify all such relationships, SCOP has been designed using a combination of manual inspection and automated methods. Proteins are classified in a hierarchical fashion to reflect their structural and evolutionary relatedness. Within the hierarchy there are many levels, but principally these describe the family, superfamily and fold. Proteins are clustered into families with clear evolutionary relationships if they have sequence identities of more than 30%. Proteins are placed in superfamilies when, in spite of low sequence identity, their structural and functional characteristics suggest a common evolutionary origin. Proteins are considered to have a common fold if they have the same major secondary structures in the same arrangement and with the same topology, whether or not they have a common evolutionary origin. SCOP is accessible for keyword interrogation via the MRC Laboratory Web server.
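The 30% criterion mentioned above refers to sequence identity between aligned proteins. The Python sketch below shows one common way to compute percent identity from a pairwise alignment; it is only an illustration. The choice of denominator (aligned positions without gaps) is an assumption of this sketch, since several definitions are in use, and the two aligned sequences are made up. Real family assignment also relies on structural and functional evidence, which no such calculation captures.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over the gap-free aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":    # skip positions with a gap in either row
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def exceeds_threshold(identity: float, threshold: float = 30.0) -> bool:
    """True if the identity is above the quoted threshold (30% by default)."""
    return identity > threshold


# Two made-up aligned sequences; '-' marks a gap.
seq_a = "MKTAYIAKQR"
seq_b = "MKSAY-AKHR"

identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity, above 30%: {exceeds_threshold(identity)}")
```

In this toy alignment 7 of the 9 gap-free positions are identical, which is about 77.8%.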
CATH Database The CATH (class, architecture, topology, homology and sequence) database is largely derived using automatic methods, but manual inspection is necessary where automatic methods fail. Different categories within the classification are identified by means of both unique numbers and descriptive names. There are five levels (class, architecture, topology, homology and sequence) within the hierarchy. Class is derived from gross secondary structure content and packing. Architecture describes the gross arrangement of secondary structures. Topology gives a description that encompasses both the overall shape and the connectivity of secondary structures. Homology groups domains that share more than 35% sequence identity and are thought to share a common ancestor. Sequence provides the final level within the hierarchy, whereby structures within homology groups are further clustered on the basis of sequence identity. CATH is accessible for keyword interrogation via UCL’s Biomolecular Structure and Modelling Unit Web server. The CATH database is a protein structure database residing at University College London. Proteins are classified first into hierarchical levels by class, similar to the SCOP classification except that α/β and α + β proteins are considered to be in one class. Instead of a fourth class for α + β proteins, the fourth class of CATH comprises proteins with few secondary structures. Following class, proteins are classified by architecture, fold, superfamily and family.
Composite Databases A composite database is a database that amalgamates a variety of different primary sources. Composite databases render sequence searching much more efficient, because they obviate the need to interrogate multiple resources. The interrogation process is streamlined still further if the composite has been designed to be non-redundant, as this means that the same sequence need not be searched more than once.
Different strategies can be used to create composite resources. The final product depends on the chosen data sources and the criteria used to merge them. The choice of different sources and the application of different redundancy criteria have led to the emergence of different composites, each of which has its own particular format. The main composite databases are NRDB, OWL, MIPSX and SWISS-PROT + TrEMBL. NRDB (Non-Redundant Database) is comprehensive and contains up-to-date information. OWL is a non-redundant protein database that assigns priorities to its sources according to their level of annotation and sequence validation. The MIPSX database contains only unique copies of sequences. SWISS-PROT + TrEMBL provides a resource that is both comprehensive and minimally redundant.
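The idea of merging several primary sources under a redundancy criterion can be sketched in a few lines of Python. In the example below the redundancy criterion is exact sequence identity (ignoring letter case), and the sources are visited in a fixed priority order so that the first source to supply a sequence provides the primary record. Both choices are assumptions made for illustration; the source names, identifiers and sequences are invented, and real composites apply their own, more elaborate criteria.

```python
def build_composite(sources: dict[str, dict[str, str]], priority: list[str]) -> dict[str, dict]:
    """Merge sources into one non-redundant collection keyed by sequence."""
    composite: dict[str, dict] = {}
    for name in priority:                       # higher-priority sources first
        for identifier, sequence in sources[name].items():
            key = sequence.upper()              # redundancy test: exact identity
            label = f"{name}:{identifier}"
            if key in composite:
                composite[key]["also_in"].append(label)
            else:
                composite[key] = {"primary": label, "also_in": []}
    return composite


# Toy sources, invented for illustration only.
sources = {
    "source_a": {"A1": "MKTAYIAKQR", "A2": "MSLLTEVETY"},
    "source_b": {"B7": "mktayiakqr", "B8": "MGSSHHHHHH"},
}

composite = build_composite(sources, priority=["source_a", "source_b"])

print(len(composite))
print(composite["MKTAYIAKQR"])
```

Four input records collapse into three composite entries, so a similarity search against the composite compares the query with each distinct sequence only once.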
NDB Database The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu/) assembles and distributes structural information about nucleic acids. In addition to information regarding nucleic acids, it maintains a DNA-binding protein database. Available information includes coordinates and structure factors, an archive of nucleic acid standards and an atlas of nucleic acid-containing structures that highlights special aspects of each structure in the NDB. It also maintains information regarding intrinsic correlations between structural parameters.
CSD Database The Cambridge Structural Database (CSD) contains comprehensive structural data for organic and organometallic compounds studied by X-ray and neutron diffraction. It contains 3D atomic coordinate information as well as associated bibliographic, chemical and crystallographic data. It is equipped with graphical, search, retrieval, data manipulation and visualization software.
BMRB Database BioMagResBank (BMRB) contains data from NMR studies of proteins, peptides and nucleic acids (www.bmrb.wisc.edu). It is used to deposit the data from which the NMR restraints and the coordinates deposited in the PDB are derived. It contains NMR parameters that are measures of flexibility and dynamics. It also contains data on measured NMR parameters such as chemical shifts, coupling constants, dipolar couplings, T1 values, T2 values, heteronuclear NOE values, S² (order parameters), hydrogen exchange rates and hydrogen exchange protection factors.
3Dee and FSSP databases 3Dee is a database of protein domain definitions. FSSP (fold classification based on structure-structure alignment of proteins) database is based on automatic all-against-all 3D structure comparisons of all the entries of the PDB.
The FSSP database contains a set of representative folds for all the structures in the PDB. The representative folds are subjected to a hierarchical clustering algorithm to construct a fold tree based on structural similarities. The FSSP database is based on structure alignment of all pair-wise combinations of the proteins in the Brookhaven structural database by the structural alignment program DALI.
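How a tree can be built from pair-wise structural comparisons is illustrated by the Python sketch below. It uses single-linkage agglomerative clustering, which is one of several hierarchical clustering schemes and is chosen here only for its simplicity; it is not claimed to be the procedure used by FSSP. The four structure names and their pair-wise distances are invented.

```python
from itertools import combinations


def single_linkage_tree(names, dist):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    # Each cluster is a pair: (subtree built so far, list of member names).
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        best = None
        for (i, (_, members_i)), (j, (_, members_j)) in combinations(enumerate(clusters), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist[frozenset((a, b))] for a in members_i for b in members_j)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        tree_i, members_i = clusters[i]
        tree_j, members_j = clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Made-up pair-wise structural distances (smaller means more similar).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("C", "D")): 1.5,
    frozenset(("A", "C")): 4.0,
    frozenset(("A", "D")): 5.0,
    frozenset(("B", "C")): 4.5,
    frozenset(("B", "D")): 5.5,
}

tree = single_linkage_tree(names, dist)
print(tree)
```

With these toy distances A and B are joined first, then C and D, and finally the two pairs are joined at the root of the tree.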
Other Databases The Molecular Modeling Database (MMDB) is a database containing experimentally determined structures extracted from the PDB. Its organization is based on the concept of neighbors: each entry is linked to its sequence and structural neighbors. MMDB categorizes proteins of known structure in the Brookhaven PDB into structurally related groups by the VAST (Vector Alignment Search Tool) structural alignment program. VAST aligns three-dimensional structures based on a search for similar arrangements of secondary structural elements. MMDB provides a method for rapidly identifying PDB structures that are statistically out of the ordinary. The Conserved Domain Database (CDD) is a database of conserved domain alignments with links to three-dimensional structures of domains. The Chemicophysical AMino acidic Parameter databank (CHAMP) is an amino acid parameter data bank containing 32 different series of physico-chemical parameters of amino acids. It is integrated with FAST. The Enzyme-Reaction Database links a chemical structure to the amino acid sequences of enzymes that recognize the chemical structure as their ligand. The chemical structures and chemical names are registered in the chemical-structure database on the MACCS system. The enzymes are registered in the database with NBRF-PIR entry codes. The enzymes’ sequences in the database are divided into clusters and a conserved sequence is extracted from each cluster using multiple sequence alignment. These conserved sequences are used to construct motifs. The Thermodynamic Database for Proteins and Mutants (ProTherm) is a collection of numerical data for studying the relationship between structure, stability and function. It contains thermodynamic parameters such as unfolding Gibbs free energy change, enthalpy change, heat capacity change, transition temperature, etc.
It also contains information about activity, secondary structure, surface accessibility, measuring methods and experimental conditions such as pH, temperature, and buffer ion and protein concentration. ProTherm is linked with PIR and SWISS-PROT, PDB, PMD and PubMed. The SARF (spatial arrangement of backbone fragments) database also provides a protein database categorized on the basis of structural similarity.
Secondary Databases Primary database search tools are effective for identifying sequence similarities, but analysis of output is sometimes difficult and cannot always
answer some of the more sophisticated questions of sequence analysis. Hence secondary database search tools are used. Depending on the type of analysis method using secondary databases, relationships may be elucidated in considerable detail, including superfamily, family, subfamily, and species-specific sequence levels. The principle behind the development of secondary databases is that within multiple alignments, there are many conserved motifs that reflect shared structural or functional characteristics of the constituent sequences. The simplest approach to pattern recognition is to characterize a family by means of a single conserved motif, and to reduce the sequence data within the motif to a consensus or regular expression pattern. Regular expressions are the basis of the PROSITE database. Many secondary databases, which contain the fruits of analysis of the sequences in the primary sources, are also available. Many secondary databases such as PROSITE, Profiles, PRINTS, Pfam, BLOCKS and IDENTIFY use SWISS-PROT as their primary source. PROSITE stores regular expressions (patterns); Profiles stores weighted matrices (profiles); PRINTS stores aligned motifs (fingerprints); Pfam stores hidden Markov models (HMMs); BLOCKS stores aligned motifs (blocks); and IDENTIFY stores fuzzy regular expressions (patterns). The type of information stored in each of the secondary databases is different. Yet these resources have arisen from a common principle; namely, that homologous sequences may be gathered together in multiple alignments, within which are conserved regions that show little or no variation between the constituent sequences. These conserved regions, or motifs, usually reflect some vital biological role (i.e. they are somehow crucial to the structure or function of the protein).
One of the aims of sequence analysis is to design computational methods that help to assign functional and structural information to uncharacterized sequences; this is achieved by means of primary database searches, the goal of which is to identify relationships with already known sequences. Within a database, the challenge is to establish which sequences are related (true positives) and which are unrelated (true negatives). To improve diagnostic performance one has to capture most of the true-positive family members and to include no or few false positives.
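The idea of diagnostic performance can be made concrete with a small calculation. The sketch below is only an illustration: the function name, the accession codes and the choice of sensitivity and precision as summary measures are assumptions, not something prescribed by any of the databases discussed here.

```python
def diagnostic_performance(hits, family_members):
    """Summarise how well a set of search hits recovers a known family."""
    hits = set(hits)
    family_members = set(family_members)
    true_positives = hits & family_members    # related and found
    false_positives = hits - family_members   # unrelated but found
    false_negatives = family_members - hits   # related but missed
    # Fraction of the family that was captured by the search.
    sensitivity = len(true_positives) / len(family_members) if family_members else 0.0
    # Fraction of the reported hits that really belong to the family.
    precision = len(true_positives) / len(hits) if hits else 0.0
    return {
        "true_positives": len(true_positives),
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Hypothetical accession codes, used only for illustration.
result = diagnostic_performance(
    hits=["P1", "P2", "P3", "X9"],
    family_members=["P1", "P2", "P3", "P4"],
)
print(result)
```

A good discriminator pushes both figures towards 1: it captures most family members while reporting few unrelated sequences.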
PROSITE Database PROSITE was the first secondary database to be developed. The rationale behind its development was that protein families could simply and effectively be characterized by the single most conserved motif observable in a multiple alignment of known homologues, such motifs usually encoding key biological functions (e.g. enzyme active sites, ligand or metal binding sites, etc.). Searching such a database should, in principle, help to determine to which family of proteins a new sequence might belong, or which domain or functional site it might contain.
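As a rough illustration of how a single-motif pattern can be used for searching, the sketch below translates a pattern written in PROSITE-style syntax into a Python regular expression. It assumes a simplified subset of the syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" as terminal anchors); the function name and the query sequence are invented for the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count,
        # e.g. "x(4)" -> ("x", "4", None) and "x(2,4)" -> ("x", "2", "4").
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means any residue except P
        # "[ST]" and single letters are already valid regular expressions.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(n_glyc)
print(p_loop)

# Made-up query sequence, for illustration only.
match = re.search(p_loop, "AAGESGAGKSTT")
print(match.start(), match.group())
```

Because a regular expression either matches or does not, a single substitution at a constrained position is enough to lose a true family member, which is the weakness that multi-motif and profile methods try to address.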
PRINTS Database Most protein families are characterized not by one, but by several conserved motifs. It therefore makes sense to use many, or all, of these to build diagnostic signatures of family membership. This is the principle behind the development of the PRINTS fingerprint database. Fingerprints inherently offer improved diagnostic reliability over single-motif methods by virtue of the mutual context provided by motif neighbours; in other words, if a query sequence fails to match all the motifs in a given fingerprint, the pattern of matches formed by the remaining motifs still allows the user to make a reasonably confident diagnosis.
BLOCKS Database A multiple-motif database, called BLOCKS, was created by automatically detecting the most highly conserved regions of each protein family. The limitations of regular expression in identifying distant homologues led to the creation of a compendium of profiles. The variable regions between conserved motifs also contain valuable sequence information. Here the complete sequence alignment effectively becomes the discriminator.
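The difference between a regular expression and a profile (weight matrix) can be seen in a small sketch. The code below assumes an ungapped block of equal-length motif sequences, a uniform background frequency for the 20 amino acids and a simple pseudocount; real profile methods use more refined weighting, so this is an illustration of the principle only. The block and the query sequence are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Build log-odds scores per column from an ungapped aligned block."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(len(block[0])):
        column = [seq[col] for seq in block]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def best_window(profile, sequence):
    """Slide the profile along the sequence; return (score, start, window).

    The sequence is assumed to contain only the 20 standard amino acids.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(profile[i][aa] for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


# Invented four-residue block and query, for illustration only.
block = ["GKST", "GKSS", "GKTT", "AKST"]
profile = build_profile(block)
score, start, window = best_window(profile, "MLLGKSTPP")
print(round(score, 2), start, window)
```

Unlike a pattern, the profile assigns a graded score to every window, so a sequence that deviates at one position is penalized rather than rejected outright.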
HMMs An alternative to the use of profiles is to encode alignments in the form of Hidden Markov Models (HMMs). These are statistically based mathematical treatments, consisting of linear chains of match, delete or insert states that attempt to encode the sequence conservation within aligned families. A collection of HMMs for a range of protein domains is provided by the Pfam database.
IDENTIFY, KEGG and MEDLINE Databases Another automatically derived tertiary resource, derived from BLOCKS and PRINTS, is IDENTIFY. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of metabolic pathways. It collects individual genomes, gene products and their functions with biochemical and genetic information. MEDLINE integrates the medical literature, including very many papers dealing with molecular biology. It is included in PubMed, a bibliographic database offering abstracts of scientific articles.
Web addresses:
GenBank : http://www.ncbi.nlm.nih.gov/genbank/
EMBL : http://www.ebi.ac.uk/embl/index.html
DDBJ : http://www.ddbj.nig.ac.jp/
PIR : http://www.pir.georgetown.edu/
MIPS : http://www.mips.biochem.mpg.de/
SWISS-PROT : http://pir.georgetown.edu/pirwww/dlinfo/nr13d.h
OWL : http://www.bioinf.man.ac.uk/dbbrowser/OWL/
PROSITE : http://www.expasy.ch/prosite/
PRINTS : http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
BLOCKS : http://www.blocks.fhcrc.org/
Profiles : http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
Pfam : http://www.sanger.ac.uk/software/Pfam/
IDENTIFY : http://dna.stanford.EDU/identify/
Proweb : http://www.proweb.org/kinetin/ProWeb.html
SCOP : http://scop.mrc-lmb.cam.ac.uk/scop/
CATH : http://www.biochem.ucl.ac.uk/bsm/cath/
5.5 BIBLIOGRAPHIC DATABASES AND VIRTUAL LIBRARY Publication is at the core of every scientific endeavor. It is the common process whereby scientific information is reviewed, evaluated, distributed and entered into the permanent record of scientific progress. Bibliographic databases (also known as literature databases or knowledge databases) contain published articles, abstracts and selected free full-text papers with links to individual records. Though there are a number of literature databases, PubMed and AGRICOLA are extensively used by scientists as they provide updated information from different links.
PubMed PubMed is maintained by the National Library of Medicine (US) and includes the bibliographic database MEDLINE as well as links to selected full-text articles on sites maintained by journal publishers. It offers abstracts of scientific articles and is integrated with other information retrieval tools of the National Center for Biotechnology Information. Scientific journals place their tables of contents and, in some cases, entire issues on web sites. PubMed records are relational in nature and query results include links to GenBank, PDB, etc. PubMed databases can be searched at the following websites: http://www.ncbi.nlm.nih.gov/PubMed/ http://www.pubmedcentral.nih.gov
AGRICOLA AGRICOLA stands for Agricultural online access. It is a bibliographic database of citations to the agricultural literature created by the National Agricultural Library and its cooperators. It includes publications and resources from all the disciplines related to agriculture, such as, veterinary science, plant science, forestry, aquaculture and fisheries, food and human nutrition, earth and environmental science. The database can be searched at the following website: http://www.nal.usda.gov/ag98/
Virtual Library Virtual library on the net provides access to web sites that are a storehouse of information. It contains a collection of links to various online journals and
bibliographic databases. A virtual library can be classified into various groups with links to various online journals, bibliographic databases, institute library access, forums and associations, tutorial sites, educational sites, grants and funding resources, government and regulatory bodies, etc. The most famous virtual library site on the web is: http://www.vlib.org There are also further collections of virtual libraries on various topics such as microbiology, biochemistry, etc. Many publishers have their own online journals available on sites (e.g. Nature: www.nature.com). These sites provide free access to the table of contents and abstracts.
5.6 SPECIALIZED ANALYSIS PACKAGES
Homology searching is only one aspect of the analysis process. Numerous other research tools are also available, including hydropathy profiles for the detection of possible transmembrane domains and/or hydrophobic protein cores; helical wheels to identify putative amphipathic helices; sequence alignment and phylogenetic tree tools for charting evolutionary relationships; secondary structure prediction plots for locating α-helices and β-strands; and so on. Because of the need to employ a range of techniques for effective sequence analysis, software packages have been developed to bring a variety of these methods together under a single umbrella, obviating the need to use different tools with different interfaces, different input requirements and different output formats. Major releases of DNA and protein sequence databases occur every three to four months. In the meantime, newly determined sequences are added to daily update files. To keep an in-house database up-to-date, synchronized FTP scripts are used (e.g. using scheduling software such as cron under UNIX). With such a system, it is relatively simple to track individual databases, but it becomes unwieldy when several databases (e.g. GenBank, EMBL, SWISS-PROT, PIR) have to be monitored and merged with proprietary information. Further, if new databases evolve, it is considered advantageous also to bring them in-house; hence existing scripts must be updated to incorporate the new resources. There are a number of well-known packages that offer a fairly complete set of tools for both DNA and protein sequence analysis. These suites have evolved and grown to be fairly comprehensive over a period of years.
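Of the tools listed above, the hydropathy profile is the simplest to sketch. The example below assumes the Kyte-Doolittle hydropathy scale and a plain sliding-window average; the window size and the test sequence are arbitrary choices made for illustration, and the packages described in this section may use different scales or smoothing.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Invented sequence: a charged stretch followed by a hydrophobic stretch.
sequence = "KKDEKRDEKLLIVALLVIAL"
profile = hydropathy_profile(sequence, window=9)
peak = max(range(len(profile)), key=profile.__getitem__)
print(len(profile), peak, round(profile[peak], 2))
```

Plotting the returned values against window position gives the familiar hydropathy plot; sustained positive stretches are the candidates for membrane-spanning or buried regions.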
GCG Package The most widely known, commercially available sequence analysis software is GCG (Oxford Molecular Group). It was developed by the Genetics Computer Group at Wisconsin (575 Science Drive, Madison, Wisconsin, USA 53711) primarily as a set of analysis tools for nucleic acid sequences, but in time it came to include additional facilities for protein sequence analysis.
Within GCG, many of the frequently used sequence databases can be accessed (e.g. GenBank, EMBL, PIR and SWISS-PROT), as can a number of motif and specialist databases (such as PROSITE; TFD, the transcription factor database; and REBASE, the restriction enzyme database). A particular strength of the system is that it can also be relatively easily customized to accept additional, user-specific databases. Within the suite, EMBL and GenBank are split into different sections, allowing users to minimize search time by directing queries only to relevant parts of the databases. Thus, for example, sequences in GenBank and EMBL may be searched either collectively or separately or by defined taxonomic categories (e.g. viral, bacterial, rodent, etc.). The sequence databases have their own distinct formats, so these must be converted to the GCG format for use with its programs. Likewise, all data files imported to the suite for analysis must adhere to the GCG format. The facilities include tools for pairwise similarity searching, multiple sequence alignment, evolutionary analysis, motif and profile searching, RNA secondary structure prediction, hydropathy and antigenicity plots, translation, sequence assembly, restriction site mapping and so on.
EGCG Package EGCG, or Extended GCG, started at EMBL in Heidelberg as a collection of programs to support EMBL's research activities. There are more than 70 programs in EGCG, covering themes such as fragment assembly, mapping, database searching, multiple sequence analysis, pattern recognition, nucleotide and protein sequence analysis, evolutionary analysis, and so on.
Staden Package The Staden Package is a set of tools for DNA and protein sequence analysis. It does not provide databases, but the software works with the EMBL database and other databases in a similar format. The package has a windowing interface for UNIX workstations. Amongst its range of options, the suite provides utilities to define and to search for patterns of motifs in proteins and nucleic acids (for example, specific individual routines allow searching for mRNA splice junctions, E. coli promoters, tRNA genes, etc., and users may define equally complex patterns of their own). A particular strength of the Staden Package lies in its support for DNA sequence assembly. It provides methods for all the pre-processing required for data from fluorescence-based sequencing instruments, including trace viewing (TREV), quality clipping (PREGAP4) and vector removal (PREGAP4, VECTOR_CLIP); a range of assembly engines; and powerful contig editing and finishing algorithms (GAP4). A new method for detecting point mutations is also included (TRACE_DIFF, GAP4). For analysis of finished DNA sequences, the package includes NIP4, and for comparing DNA or protein sequences, SIP4; these routines also provide an interface to the sequence libraries. The new interactive programs TREV, PREGAP4, GAP4, NIP4 and SIP4 have graphical user interfaces, but the package also contains a large number of older, but still useful, programs that are text-based.
Lasergene Package Lasergene is a PC-based package that provides facilities for coding analysis, pattern and site matching, and RNA/DNA structure and composition analysis; restriction site analysis; PCR primer and probe design; sequence editing; sequence assembly and contig management; multiple and pairwise sequence alignment (including dotplots); protein secondary structure prediction and hydropathy analysis; helical wheel and net creation; and database searching. Lasergene is available for Windows or Macintosh, for single users or for networked-PC environments. There are numerous other packages available, which tend to concentrate on particular areas of sequence analysis of DNA. For example:
Sequencher Package Sequencher is a sequence assembly package for the Macintosh, used by many laboratories engaged in large-scale sequencing efforts. The package takes raw chromatogram data and converts it into contig assemblies; other functions include restriction site or ORF analysis, heterozygote analysis for mutation studies, vector and transposon screening, motif analysis, silent mutation tools, sequence quality estimation, and visual marking of edits to ensure data integrity.
Vector NTI Package Vector NTI for Windows 3.1, supported by the American Type Culture Collection (ATCC) and InforMax, Inc., is a knowledge-based package designed to expedite cloning applications. It can automatically optimize the design of new DNA constructs and recommend cloning steps. The user can specify preferences for processes such as fragment isolation, modification of termini and ligation. The system incorporates about 3000 rules for genetic engineering.
MacVector Package MacVector is a molecular biology system that exploits the Macintosh user interface to create an easy-to-use environment for manipulation and analysis of DNA and protein sequence data. The package implements the five BLAST search functions, and includes ClustalW for sequence alignment and an icon-managed sequence editor that is integrated with the program's molecular biology functions (e.g. translation, restriction analysis, primer and probe analysis, protein structure prediction, and motif analysis). Facilities are also provided to compute predicted sequence-based melting curves for DNA and RNA structures. Intranet packages: The future for commercial solutions lies in providers understanding the key issues facing the large industrial user. Most companies now have intranets and support the use of HTTP and Internet Inter-ORB Protocol (IIOP). Bioinformatics solutions must fit as seamlessly as possible into this environment. Most companies need to implement integration throughout the research operation. Most industrial bioinformatics teams
devote some resources to development and maintenance of internal web servers that replicate the services available at public bioinformatics sites. Two companies, NetGenics Inc. and Pangea Systems Inc., provide bioinformatics systems that offer the prospect of service integration via the intranet.
SYNERGY SYNERGY, developed by NetGenics Inc., Cleveland, Ohio, is an object-oriented approach using Java, CORBA, and an object-oriented database to implement a flexible environment for managing bioinformatics projects. SYNERGY integrates standard tools into its portfolio through the use of CORBA 'wrappers', which present a streamlined interface between the tool and the SYNERGY system. In this way, the developers are able to incorporate a number of standard programs very rapidly, and users of the system are able to incorporate their own tools by implementing CORBA wrappers in-house.
Pangea Systems GeneMill, GeneWorld and GeneThesaurus are the developments of Pangea Systems Inc., Oakland, California. These are web-based tools that are back-ended by a relational database. The overall system is aimed at high-throughput sequencing projects and other large-scale industrial genomics projects. It comprises GeneMill, a sequencing workflow database system for managing sequencing projects; GeneWorld, a tool for analysis of DNA and protein sequences; and GeneThesaurus, a sequence and annotation data subscription service, allowing access to public data and integration with proprietary data. The system is modular and allows interfaces to in-house software to be built easily, using an open programming interface, PULSE (Pangea's Unified Life Science Environment).
EMBOSS Package The European Molecular Biology Open Software Suite (EMBOSS) is an integrated set of packages and tools for sequence analysis being specifically developed for the needs of the Sanger Centre and the EMBnet user communities. Applications of the package include EST clustering, rapid database searching with sequence patterns, nucleotide sequence pattern analysis, codon usage analysis, gene identification tools and protein motif identification.
Alfresco Package Alfresco is a visualization tool that is being developed for comparative genome analysis, using ACEDB for data storage and retrieval. The program compares multiple sequences from similar regions in different species, and allows visualization of results from existing analysis programs, including those for gene prediction, similarity searching, regulatory sequence prediction, etc.
DALI Program The DALI (Distance matrix Alignment) program is used to identify proteins with folding patterns similar to that of a query structure. L. Holm and C. Sander wrote this program. It runs fast enough to carry out routine screens of the entire Protein Data Bank for structures similar to a newly determined structure, and even to perform a classification of protein domain structures from an all-against-all comparison. To meet the need for effective software techniques for data analysis, many software packages have been developed. These packages are highly specific in their approach and can be easily loaded as per the requirements of the user (Table 5.2).
Table 5.2: Some well known packages with a set of tools for DNA and protein sequence analysis
Package : Scope
Staden : Analyses of DNA and protein sequences. It has a window interface for UNIX workstations.
GeneMill, GeneWorld, GeneThesaurus : GeneMill is a workflow system for managing sequencing projects. GeneWorld analyses DNA and protein sequences. GeneThesaurus allows access to public data and integration with proprietary data.
Lasergene : Coding analysis, pattern and site matching, structure and composition analysis of RNA/DNA, restriction site analysis, PCR primer and probe design, sequence editing, sequence assembly, multiple and pairwise sequence alignment, helical wheel and net creation, and database searching.
Synergy : An object-oriented package that uses Java, CORBA and an object-oriented database to implement a flexible environment for managing bioinformatics projects.
CINEMA : A Colour INteractive Editor for Multiple Alignments, an internet package written in Java that provides facilities for motif identification, database searching (using BLAST), 3D structure visualization, generation of dotplots and hydropathy profiles, and six-frame translation.
EMBOSS : The European Molecular Biology Open Software Suite, specifically developed for easy integration of other public domain packages and for applications such as EST clustering, nucleotide sequence pattern analysis, codon usage analysis, gene identification tools, protein motif identification and rapid database searching with sequence patterns.
EGCG : Developed by Genetics Computer Group, Wisconsin, an extended version of GCG; has more than 70 programs covering fragment assembly, mapping, database searching, multiple sequence analysis, pattern recognition, nucleotide and protein sequence analysis, evolutionary analysis, etc.
ExPASy : The SIB Bioinformatics Resource Portal, which provides access to scientific databases and software tools (i.e., resources) in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics, etc.
KEGG : A database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
5.7 USE OF DATABASES
The available information on the biological function of particular sequences in model organisms may be exploited to predict the function of similar genes in other organisms. The sequence of the gene of interest is compared to every sequence in a sequence database, and the similar ones are identified. If a query sequence can be readily aligned to a database sequence of known function, structure or biochemical activity, the query sequence is predicted to have the same function, structure or biochemical activity. As a rough rule, if more than one-half of the amino acid sequence of the query and database proteins is identical in the sequence alignment, the prediction is very strong. A common reason for performing a database search with a query sequence is to find a related gene in another organism. For a query sequence of unknown function, a matched gene may provide a clue to the function. Alternatively, a query sequence of known function may be used to search through sequences of a particular organism to identify a gene that may have the same function.
Web addresses:
GCG : http://www.gcg.com/
EGCG : http://www.sanger.ac.uk/software.EGCG/
Staden : http://www.mrc-lmb.cam.ac.uk/pubseq/
NetGenics : http://www.netgenics.com/
Pangea Systems : http://www.pangeasystems.com/
CINEMA : http://www.bioinchem.ucl.ac.uk/bsm/dbbrowser/CENEMA2.1
EMBOSS : http://www.sanger.ac.uk/Software/EMBOSS/
Alfresco : http://www.sanger.ac.uk/Users/nic/alfresco.html
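The rough one-half identity rule stated in this section can be checked on any pairwise alignment with a few lines of code. The sketch below assumes the two sequences are already aligned, with "-" marking gaps, and it takes the number of ungapped columns as the denominator; other definitions of percent identity exist, so this choice is an assumption. The sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Invented aligned sequences, for illustration only.
identity = percent_identity("MKTAYIAK-QR", "MKSAYLAKDQR")
print(round(identity, 1), identity > 50.0)
```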
STUDY QUESTIONS
1. What are databases?
2. What are the types of databases?
3. What are the functions of databases?
4. What are the nucleic acid sequence databases? Give some examples.
5. What are protein sequence databases? Give some examples.
6. What are protein sequence databases about protein maintained by PIR?
7. What are structure databases? Give some examples.
8. What is a bibliographic database? Give some examples.
9. What is a virtual library?
10. Give some names of specialized analysis packages and their uses.
11. What is a database management system?
12. What are the types of database management system?
13. What is data mining?
14. What are the goals of Ensembl?
CHAPTER 6
Sequence Alignment
The method used to analyze the similarities and differences at the level of individual bases or amino acids, with the aim of inferring structural, functional and evolutionary relationships among the sequences, is called sequence alignment. In simple words it is the identification of residue-residue correspondence; any assignment of correspondence that preserves the order of the residues within the sequences is an alignment. The sequences of biological macromolecules are the products of molecular evolution. When sequences share a common ancestral sequence, they tend to exhibit similarity in their sequences, structures and biological functions. When a new sequence is found whose function is not known, but similar sequences for which functional or structural information is available can be found in the databases, this information can be used as a basis for predicting the function or structure of the new sequence. Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. Two sequences are aligned by placing them in two rows. Identical or similar characters are placed in the same column. Nonidentical or dissimilar characters are either placed in the same column as a mismatch, or placed opposite a gap in the other sequence. The advent of high-throughput automated fluorescent DNA sequencing technology has led to the rapid accumulation of sequence information and provides the basis for abundant computationally derived protein sequence data. Analysis of DNA sequences can throw light on phylogenetic relationships, restriction sites, intron/exon prediction and gene structure, and protein coding sequence through open reading frame analysis.
6.1 ALGORITHM An algorithm is a logical sequence of steps by which a task can be performed. It is a set of rules for calculating or solving a problem, which normally is carried out by a computer program. A program is the implementation of an algorithm. Thus an algorithm is a complete and precise specification of a method for solving a problem. Five important features of an algorithm are:
(i) An algorithm must stop after a finite number of steps.
(ii) All steps of an algorithm must be precisely defined.
(iii) Input to the algorithm must be specified.
(iv) Output of the algorithm must be specified.
(v) It must be effective (each operation of the algorithm must be basic).
Genetic Algorithm The genetic algorithm is a general type of machine-learning algorithm developed by computer scientists which has no direct relationship to biology. It produces alignments by attempted simulation of the evolutionary changes in sequences.
6.2 GOALS AND TYPES OF ALIGNMENT
One goal of sequence alignment is to enable us to determine whether two sequences display sufficient similarity such that an inference of homology is justified. As genetic information is passed on from one generation to the next, the information gets altered slightly during the process of copying. The changes that occur during divergence from the common ancestor can be categorized as substitutions, insertions and deletions. These changes can accumulate as the generations pass by. After several thousand generations, a considerable amount of divergence may have set in. Comparison of two supposedly homologous sequences will show how much evolutionary change has taken place between them.
Global vs. Local alignment There are two types of alignment: global alignment and local alignment (Fig. 6.1). In global alignment an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences.
Fig. 6.1 Distinction between global and local alignments of two sequences
Sequences that are quite similar and approximately the same length are suitable candidates for global alignments. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain. In Fig. 6.1 the global alignment is stretched over the entire sequence length to include as many matching amino acids as possible up to and including the ends of the sequences. Vertical bars between the sequences indicate the presence of identical matches. In the local alignment, alignment stops at the ends of regions of identity or strong similarity. Priority is given to finding these local regions. To summarize, global alignment considers the similarity across the full extent of the sequence, whereas local alignment focuses on regions of similarity in parts of the sequence only. A search for local similarity may produce more biologically meaningful and sensitive results than a search attempting to optimize alignment over the entire sequence length, because usually the functional sites are localized to relatively short regions, which are conserved irrespective of deletions or mutations in intervening parts of the sequence.
Optimal Alignment An optimal alignment is an alignment which maximizes the score, that is, one which exhibits the most correspondences and the fewest differences. A suboptimal alignment is an alignment whose score is below the optimum level. In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Optimal alignments provide useful information to biologists concerning sequence relationships by giving the best possible information as to which characters in a sequence should be in the same column in an alignment and which are insertions in one of the sequences (or deletions in the other). This information is important for making functional, structural and evolutionary predictions on the basis of sequence alignment.
Parametric Sequence Comparison and Bayesian Statistical Method Parametric sequence comparison refers to computer methods that are used to find a range of possible alignments in response to varying the scoring system used for matches, mismatches, and gaps. There is also an effort to use scores such that global and local types of sequence alignment give consistent results. Some of the programs are Xparal and Bayes block aligner. Bayesian statistical methods are also used to produce alignments between pairs of sequences and to calculate distances between sequences.
6.3 STUDY OF SIMILARITIES
Sequence similarity searches of a database enable us to extract sequences that are similar to a query sequence. The extracted sequences, for which functional and structural information is available, will help us to predict the structure and function of the query sequence. Generally a database scan is done to find homologs. The speed and sensitivity of a search depend not only on the program used, but also on the computer hardware, the database being scanned and the length of the target sequence. In a typical database scan, the sequence under investigation is aligned against each database entry. The aligning of two sequences is termed pairwise alignment. Sequence similarity searches employ a query sequence (called the probe) and the subject sequence. The relationship between the two can be quantified and their similarity assessed. To identify an evolutionary relationship between a newly determined sequence and a known gene family, the extent of shared similarity is assessed. If the degree of similarity is low, the relationship is putative. Computer-assisted dynamic programming algorithms are similarity searching methods which involve matching of the query sequence to the sequences deposited in the database. A similarity score is calculated by measuring the closeness between the residues (closeness is the number of nucleotide bases or amino acid residues that are similar between the compared sequences). The Needleman-Wunsch algorithm is used in global alignment to find similarity between sequences across their entire length. This is a matrix-based approach. The Smith-Waterman algorithm is used in local alignment to find similarity between sequences across only a small part of each sequence. This is also a matrix-based approach. It is often quoted as a benchmark when comparing different alignment techniques. Pairwise comparison is a fundamental process in sequence analysis. A sequence consists of letters selected from an alphabet.
The complexity of the alphabet is 4 for DNA and 20 for proteins. Sometimes additional characters are used in an alphabet to indicate a degree of ambiguity in the identity of a particular residue or base. A simple approach to determine similarity between two sequences is to line up the sequences against each other and insert additional characters to bring the two strings into vertical alignment (Fig. 6.2).

Sequence a   OPQRSTUVW
             |||| ||||
Sequence b   OPQR-TUVW

Fig. 6.2 Alignment of two sequences with vertical bars and gaps. Vertical bar (|) denotes identical matches and horizontal bar (-) denotes gap.
Gaps and Mismatches
We could score the alignment by counting the positions at which the two sequences match identically. The quality of an alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining. A comprehensive alignment must account fully for the positions of all residues in both sequences. This means that many residues may have to be placed at positions that are not strictly identical, and the gaps needed to achieve this become numerous and their positioning more complex. If gaps are inserted freely, the algorithms produce alignments containing very large proportions of matching letters and large numbers of gaps. Although such an alignment achieves the optimum score and is mathematically meaningful, it would be biologically meaningless because insertion and deletion of monomers is a relatively slow evolutionary process. Dynamic programming algorithms therefore use gap penalties to maximize the biological meaning. A simple score contains a positive additive contribution of 1 for every matching pair of letters in the alignment, and a gap penalty is subtracted for each gap that has been introduced. Different kinds of gap penalty exist, such as the constant penalty, the proportional penalty and the affine gap penalty, which combines a gap opening penalty with a gap extension penalty. The total alignment score is then a function of the identity between aligned residues and the gap penalties incurred.
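The simple scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment and the penalty values (2 for opening a gap, 1 for each extension) are assumptions chosen for the example, not values prescribed by any particular program. Mismatches contribute nothing, and a gap run is treated the same way whichever sequence it falls in.

```python
def score_alignment(row_a, row_b, match=1, gap_open=2, gap_extend=1):
    """Score two already-aligned rows of equal length; '-' marks a gap.

    Each identical pair adds `match`. The first column of a gap run costs
    `gap_open`, and every further column of the same run costs `gap_extend`
    (an affine gap penalty). Mismatched pairs add nothing.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            if x == y:
                score += match
    return score


# The alignment of Fig. 6.2: eight matches and one gap of length one.
print(score_alignment("OPQRSTUVW", "OPQR-TUVW"))
```

Setting gap_open equal to gap_extend gives the proportional penalty mentioned in the text, and setting gap_extend to 0 gives the constant penalty.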
Levenshtein Distance (Edit Distance) and Hamming Distance
Distance treats sequences as points in a metric space. A distance measure is a function that, like a similarity measure, associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric. In most cases, distance and similarity measures are interchangeable in the sense that a small distance means high similarity, and vice versa. Two measures are in common use: the Levenshtein distance, which allows for gaps, and the Hamming distance, which counts only mismatches. The Levenshtein distance (edit distance) is the minimal number of edit operations required to change one string (sequence) into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence. The Hamming distance between two sequences of equal length is the number of positions with mismatched characters. It is desirable to assign variable weights to different edit operations, since certain changes are more likely to occur naturally than others. The term string is used for text (a sequence) in the Perl programming language. Strings are usually surrounded by single or double quotation marks (e.g. ‘I am a string’). Given two character strings, the distance between them is measured by the Hamming distance or the Levenshtein (edit) distance. A given sequence of edit operations induces a unique alignment, but not vice versa.
Example

agtc
cgta      Hamming distance = 2

ag-tcc
cgctca    Levenshtein distance = 3

Hamming and Levenshtein distances measure the dissimilarity of two sequences: similar sequences give small distances and dissimilar sequences give large distances.
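The two distances can be computed with the short Python sketch below. The function names are illustrative, and every edit operation is given the same weight of 1, which is an assumption of this sketch; weighted operations, as recommended in the text, would need a cost table.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    # prev[j] = distance between the letters of a read so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))          # 2
print(levenshtein("agtcc", "cgctca"))   # 3
```

The two calls reproduce the values of the example above; the gap in ag-tcc corresponds to one insertion.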
High Scoring and Low Scoring Matches
Amino acid substitutions tend to be conservative: the replacement of one amino acid by another with similar size or physicochemical properties is more likely to occur than its replacement by an amino acid with very different properties. Therefore algorithms use different distance measures to compute and score alignments. Similar sequences give high scores and dissimilar sequences give low scores. An algorithm for optimal alignment can seek either to minimize a dissimilarity measure (such as the Levenshtein distance or the Hamming distance) or to maximize a scoring function. Sequence comparison generally involves full-length sequences, and a comprehensive alignment requires that many residues be placed at positions that are not strictly identical. For a biologically meaningful comparison, the positioning of gaps and the maximizing of identical matches have to be balanced. To achieve this, penalties are introduced to minimize the number of gaps, and extension penalties are added when a gap is extended. One of the important tasks of sequence analysis is to distinguish between high-scoring matches that have only mathematical significance and lower-scoring matches that are biologically meaningful.
Uses
Sequence alignment is useful for discovering functional, structural and evolutionary information in biological sequences. It is important to obtain the best possible, or optimal, alignment to discover this information. Sequences that are very similar probably have the same function, be it a regulatory role in the case of similar DNA molecules, or a similar biochemical function and three-dimensional structure in the case of proteins. Additionally, if two sequences from different organisms are similar, there may have been a common ancestor sequence, and the sequences are then said to be homologous. The alignment indicates the changes that could have occurred between two homologous sequences and a common ancestor sequence during evolution. Database similarity searching allows us to determine which of the hundreds of thousands of sequences present in the database are potentially related to a particular sequence of interest. One of the first such discoveries was made in 1983, when Doolittle and Waterfield found that the viral oncogene v-sis is a modified form of the normal cellular gene that encodes platelet-derived growth factor. Dynamic programming algorithms find the best alignment of two sequences for given substitution matrices and gap penalties. This process is often very slow.
6.4
SCORING MUTATIONS, DELETIONS AND SUBSTITUTIONS
Due to random mutations, nucleotides may be replaced, deleted or inserted. Most of the mutations that persist result in the exchange of one amino acid for another with very similar physicochemical properties, so that the protein is not affected functionally. Loss of the function of a protein is usually a disadvantage to the organism. Hence a change will survive only if it does not have a deleterious effect on the structure and function of the protein. If the change is very deleterious to the organism, the mutation will stop spreading in the population since the organism cannot survive. Therefore, most of the substitution mutations that are observed are well tolerated in the protein. A substitution that does not affect the protein’s properties or function is called a conservative substitution. Usually protein-coding genes evolve much more slowly than most other parts of a genome, because of the need to maintain protein structure and function. When evolutionary changes do occur in a protein sequence, they tend to involve substitutions between amino acids with similar properties, because such changes are less likely to affect the structure and function of the protein. Protein sequences from within the same evolutionary family usually show substitutions between amino acids with similar physicochemical properties. A substitution score matrix is used to show scores for amino acid substitutions. When comparing proteins, we can increase sensitivity to weak alignments through the use of a substitution matrix.
Amino Acid Substitution Matrix
Scientists discovered that certain amino acid substitutions commonly occur in related proteins from different species. Because the protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function. Often these substitutions are to a chemically similar amino acid, but other changes also occur. Yet other substitutions are relatively rare. Knowing the types of changes that are most and least common in a large number of proteins can assist in predicting alignments for any set of protein sequences. If related protein sequences are quite similar, they are easy to align, and one can readily determine the single-step amino acid changes. If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted. This type of analysis was pioneered by Margaret Dayhoff.
Amino acid substitution matrices, or symbol comparison tables, are used for such purposes. In these matrices amino acids are listed both across the top of the matrix and down the side, and each matrix position is filled with a score that reflects how often one amino acid would have been paired with the other in an alignment of related protein sequences. The probability of changing amino acid A into B is always assumed to be identical to the reverse probability of changing B into A. This assumption is made because, for any two sequences, the ancestor amino acid in the phylogenetic tree is usually not known. Additionally, the likelihood of replacement should depend on the product of the frequencies of occurrence of the two amino acids and on their chemical and physical similarities. A prediction of this model is that amino acid frequencies will not change over evolutionary time. When calculating alignment scores, identical amino acids should be given greater value than substitutions, and among substitutions, conservative substitutions should be given greater value than nonconservative substitutions. Two popular families of matrices, the Dayhoff mutation data (MD) matrices and BLOSUM, have been devised to weight matches between non-identical residues according to observed substitution rates across large evolutionary distances. The MD score is based on the concept of the Point Accepted Mutation (PAM).
Percent Accepted Mutation (PAM) Matrix
This matrix lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution. Each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreased sequence similarity as genes encoding the same protein diverge with increased evolutionary time. Thus, one matrix gives the changes expected in homologous proteins that have diverged only a small amount from each other in a relatively short period of time, so that they are still 50% or more similar. Another matrix gives the changes expected in proteins that have diverged over a much longer period, leaving only 20% similarity. The predicted changes are used to produce optimum alignments between two protein sequences and to score the alignment. The assumption in this evolutionary model is that the amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances.

In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site. Thus, the probability of change of one amino acid to another is the same regardless of the previous changes at that site and regardless of the position of the amino acid in the protein sequence. Amino acid substitutions in a protein sequence are thus viewed as a Markov model, characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state. Use of this model makes it possible to extrapolate amino acid substitutions observed over a relatively short period of evolutionary time to longer periods of evolutionary time.

To prepare the Dayhoff PAM matrices, amino acid substitutions that occur in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar. Because these changes are observed in closely related proteins, they represent amino acid substitutions that do not significantly change the function of the protein. Hence they are called ‘accepted mutations’, defined as amino acid changes ‘accepted’ by natural selection. Similar sequences were first organized into a phylogenetic tree. The number of changes of each amino acid into every other amino acid was then counted. To make these numbers useful for sequence analysis, information on the relative amount of change for each amino acid was needed. Relative mutabilities were evaluated by counting, in each group of related sequences, the number of changes of each amino acid and dividing this number by a factor called the exposure to mutation of the amino acid. This factor is the product of the frequency of occurrence of the amino acid in that group and the number of all amino acid changes that occurred in that group per 100 sites. This factor normalizes the data for variation in amino acid composition, mutation rate and sequence length. The normalized frequencies were then summed for all sequence groups. By these scores, Asn, Ser, Asp and Glu were the most mutable amino acids, and Cys and Trp were the least mutable.

The amino acid exchange counts and mutability values were then used to generate a 20 × 20 mutation probability matrix representing all possible amino acid changes. Because amino acid change was modeled as a Markov process, with the mutation at each site being independent of the previous mutations, the changes predicted for more distantly related proteins that have undergone many (N) mutations can be calculated. By this model, the PAM1 matrix can be multiplied by itself N times to give transition matrices for comparing sequences with lower and lower levels of similarity due to separation over longer periods of evolutionary history. One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed (i.e. one point mutation per 100 residues). This does not mean that after 100 PAMs every amino acid will be different; some positions may change several times, some may even revert to the original amino acid and some may not change at all.

If there were no selection for fitness, the frequency of each possible substitution would be primarily influenced by the overall frequencies of the different amino acids (background frequencies). However, in related proteins, the observed substitution frequencies (target frequencies) are biased toward those that do not seriously disrupt the protein’s function. PAM matrices are usually converted into another form, called log-odds matrices. The odds score represents the ratio of the chance of an amino acid substitution under two different hypotheses: one that the change actually represents an authentic evolutionary variation at that site (the numerator), and the other that the change occurred because of random sequence variation of no biological significance (the denominator). Odds ratios are converted to logarithms to give log-odds scores, so that the odds scores of the amino acid pairs in an alignment can be multiplied conveniently by adding their logarithms. Each PAM matrix is designed to score alignments between sequences that have diverged by a particular degree of evolutionary distance.

Dayhoff and coworkers were the first to use a log-odds approach, in which the substitution scores in the matrix are proportional to the logarithm of the ratio of target frequencies to background frequencies. To estimate the target frequencies, pairs of very closely related sequences are used to collect mutation frequencies corresponding to 1 PAM, and these data are used to extrapolate to a distance of 250 PAMs. (Note that PAM matrices are derived by counting observed evolutionary changes in closely related protein sequences, and then extrapolating the observed transition probabilities to longer evolutionary distances.) It is possible to derive PAM matrices for any evolutionary distance, but in practice the most commonly used matrices are PAM120 and PAM250; of these two, the PAM250 matrix is generally found to produce reasonable alignments.
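The extrapolation from PAM1 to PAM250 and the conversion to log-odds scores can be illustrated with a toy calculation in Python. Everything numerical here is hypothetical: a three-letter alphabet stands in for the 20 amino acids, the PAM1 probabilities are invented so that 1% of positions change per step, the background frequencies are taken as equal, and the scaling of 10 times the base-10 logarithm is an assumption of this sketch.

```python
import math

# Hypothetical PAM1 for a toy alphabet A, B, C.
# PAM1[i][j] = probability that letter i is replaced by letter j in 1 PAM.
PAM1 = [
    [0.988, 0.010, 0.002],
    [0.010, 0.987, 0.003],
    [0.002, 0.003, 0.995],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed equal background frequencies


def mat_mul(x, y):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times (the Markov extrapolation to n PAMs)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m, background):
    """score[i][j] = 10 * log10(m[i][j] / background[j])."""
    return [[10 * math.log10(m[i][j] / background[j])
             for j in range(len(m))] for i in range(len(m))]


pam250 = mat_power(PAM1, 250)
for row in log_odds(pam250, BACKGROUND):
    print(["%.2f" % value for value in row])
```

A positive score marks a pairing that is more frequent after 250 PAMs than expected by chance, and a negative score one that is less frequent. With real data the same steps would be applied to the 20 × 20 matrix and the observed amino acid frequencies.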
Block Amino Acid Substitution Matrices (BLOSUM)
The BLOSUM62 substitution matrix is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of more than 2000 conserved amino acid patterns, called blocks. These blocks have been found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families. The BLOSUM matrices are based on an entirely different type of sequence analysis and a much larger data set than the Dayhoff PAM matrices.

The PROSITE catalog provides lists of proteins that are in the same family because they have a similar biochemical function. For each family, a pattern of amino acids that is characteristic of that function is provided. Henikoff and Henikoff examined each PROSITE family for the presence of ungapped amino acid patterns (blocks) that could be used to identify members of that family. To locate these patterns, the sequences of each protein family were searched for similar amino acid patterns by the MOTIF program. These initial patterns were organized into larger ungapped patterns (blocks) between 3 and 60 amino acids long by the Henikoffs’ PROTOMAT program (www.blocks.fhcrc.org). Because these blocks were present in all of the sequences in each family, they could be used to identify other members of that family. The blocks that characterized each family provided a type of multiple sequence alignment for that family. The amino acid changes that were observed in each column of the alignment could then be counted. The types of substitutions were then scored for all aligned patterns in the database and used to prepare a scoring matrix, the BLOSUM matrix, indicating the frequency of each type of substitution. BLOSUM matrix values were given as logarithms of odds scores, that is, of the ratio of the observed frequency of an amino acid substitution to the frequency expected by chance.

The procedure of counting all of the amino acid changes in the blocks, however, can lead to an over-representation of amino acid substitutions that occur in the most closely related members of each family. To reduce this dominant contribution from the most alike sequences, these sequences were grouped together into one sequence before scoring the amino acid substitutions in the aligned blocks. The amino acid changes within these clustered sequences were then averaged. Patterns that were 60% identical were grouped together to make one substitution matrix called BLOSUM60, those 80% alike to make another matrix called BLOSUM80, and so on. The BLOSUM matrices are thus based on scoring substitutions found over a range of evolutionary periods.

Like PAM, BLOSUM is based on the principle of target frequencies of mutations. However, BLOSUM makes use of the BLOCKS database for deriving the mutation frequencies, and the numbers attached to BLOSUM matrices do not have the same interpretation as those for PAM matrices. When deriving BLOSUM matrices, any bias potentially introduced by counting multiple contributions from identical residue pairs is removed by clustering sequence segments on the basis of a minimum percentage identity. Effectively, the clusters are treated as single sequences. Blocks contain local multiple alignments of distantly related sequences (as against the closely related sequences used for PAM). BLOSUM does not use an explicit evolutionary model in constructing its matrices, since they are derived from directly observed data rather than from extrapolated values as in PAM.
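The counting step behind a BLOSUM-style matrix can be sketched as follows. This is a simplified illustration, not the Henikoffs' program: the clustering of similar sequences is omitted, the toy block is invented, and the scaling of 2 times the base-2 logarithm (half-bit units) is an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


def log_odds_scores(block):
    """Log-odds score for every residue pair observed in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}
    # Frequency of each residue implied by the observed pairs.
    freq = Counter()
    for (x, y), q in observed.items():
        if x == y:
            freq[x] += q
        else:
            freq[x] += q / 2
            freq[y] += q / 2
    scores = {}
    for (x, y), q in observed.items():
        # Chance frequency of the pair: p*p if identical, 2*p*p otherwise.
        expected = freq[x] * freq[y] if x == y else 2 * freq[x] * freq[y]
        scores[(x, y)] = 2 * math.log2(q / expected)
    return scores


# Hypothetical block of five aligned segments, three residues wide.
toy_block = ["AVL", "AIL", "SVL", "AVM", "AIL"]
for pair, score in sorted(log_odds_scores(toy_block).items()):
    print(pair, round(score, 2))
```

Only pairs that actually occur in the block receive a score; a real matrix would be built from thousands of blocks so that all 210 possible pairs are observed.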
6.5
SEQUENCE ALIGNMENT METHODS
Similarities between sequences can be studied using different methods: the dotplot method, dynamic programming algorithms such as the Needleman-Wunsch and Smith-Waterman algorithms, and word or k-tuple methods such as those used by the FASTA and BLAST programs. Alignment of two sequences (pairwise alignment) is performed using the following methods: (i) dot matrix analysis, (ii) the dynamic programming (DP) algorithm and (iii) word or k-tuple methods such as those used by the FASTA and BLAST programs. Alignment of three or more sequences is done using multiple sequence alignment methods. Some of these methods are: (i) profiles, (ii) blocks, (iii) fingerprints, (iv) PSI-BLAST and (v) Hidden Markov Models (HMMs).
6.6
PAIRWISE ALIGNMENT
When two sequences are aligned one below the other and their similarities are scored, the process is referred to as pairwise alignment. The challenge in pairwise sequence alignment is to find the optimum alignment of two sequences that share some degree of similarity. Various computer programs assist in this.
Dot Matrix
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between the sequences. The method is also used for finding direct or inverted repeats in protein and DNA sequences, and for predicting regions in RNA that are self-complementary and have the potential to form secondary structure. The major advantage of the dot matrix method for finding sequence alignments is that all possible matches of residues between two sequences are found, leaving the investigator the choice of identifying the most significant ones. The sequences of the regions that actually align can then be detected by using other methods of sequence alignment, e.g. dynamic programming. Alignments generated by these programs can be compared to the dot matrix alignment to find out whether the longest regions are being matched and whether insertions and deletions are located in the most reasonable places.

Detection of matching regions may be improved by filtering out random matches in a dot matrix. Filtering is achieved by using a sliding window to compare several adjacent positions of the two sequences at the same time. Identification of sequence alignments by the dot matrix method can be aided by counting the dots in all possible diagonal lines through the matrix to determine statistically which diagonals have the most matches, and by comparing these match scores with the results of random sequence comparisons. Dot matrix analysis can also be used to find direct and inverted repeats within sequences. Repeated regions in whole chromosomes may be detected. Direct repeats may also be found by performing sequence alignments with dynamic programming methods. A dot matrix analysis can also reveal the presence of repeats of the same sequence character. The dot matrix method displays any possible sequence alignments as diagonals on the matrix.
Dot matrix analysis can readily reveal the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods. The dotplot is a simple visual approach to comparing two sequences. It is a table or matrix that gives a quick pictorial statement of the relationship between two sequences. The two sequences to be compared are plotted on the X and Y axes of a graph. Wherever a base or residue on one axis coincides with a base or residue on the other axis, the position is marked with a dot. The plot is characterized by some apparently random dots and a central diagonal line, where a high density of adjacent dots indicates the regions of greatest similarity between the two sequences (Fig. 6.3).
[Dotplot of the sequence MTFRDLLSVSFEGPRPDSSAGGSSAGG (horizontal axis) against MTFRDLLSVSFEGPRPDSSAGG (vertical axis); an X marks each position where the two residues are identical.]
Fig. 6.3 Illustration of the manner of construction of the dotplot matrix, using a simple residue identity matrix to score an ‘X’ where a pair of identical residues is observed. (Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to Bioinformatics, Pearson Education Ltd., 2001)
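The construction of a dotplot, including the sliding-window filter, can be sketched in Python as follows. The function name dotplot and its parameters window and min_matches are illustrative choices for this sketch; with the default window of 1 every identical residue pair is marked, as in Fig. 6.3.

```python
def dotplot(seq_x, seq_y, window=1, min_matches=1):
    """Return the plot as a list of text rows, one per position in seq_y.

    An 'X' is placed at (row i, column j) when at least `min_matches` of
    the `window` residues starting at seq_y[i] and seq_x[j] are identical.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append("X" if matches >= min_matches else " ")
        rows.append("".join(row))
    return rows


# A short peptide compared with itself, as in Fig. 6.3.
seq = "MTFRDLLSVSFEGPRPDSSAGG"
print("  " + seq)
for residue, row in zip(seq, dotplot(seq, seq)):
    print(residue + " " + row)
```

Calling dotplot(seq, seq, window=3, min_matches=3) keeps only runs of three consecutive identities, which removes most of the scattered random dots and leaves the main diagonal and any repeats.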
Dynamic Programming
Dynamic programming is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the very best, or optimal, alignment between sequences. The method compares every pair of characters in the two sequences and generates an alignment. This alignment will include matched and mismatched characters and gaps in the two sequences, positioned so that the number of matches between identical or related characters is the maximum possible. The dynamic programming algorithm provides a reliable computational method for aligning DNA and protein sequences. Both global and local types of alignments may be made by simple changes in the basic dynamic programming algorithm. A global alignment program is based on the Needleman-Wunsch algorithm and a local alignment program is based on the Smith-Waterman algorithm. Another feature of the dynamic programming algorithm is that the alignments obtained depend on the choice of a scoring system for comparing character pairs and of penalty scores for gaps. For protein sequences, the simplest system of comparison is one based on identity: a match in an alignment is scored only if the two aligned amino acids are identical.
The dynamic programming method, first used for global alignment of sequences by Needleman and Wunsch and for local alignment by Smith and Waterman, provides one or more alignments of the sequences. An alignment is generated by starting at the ends of the two sequences, attempting to match all possible pairs of characters between the sequences and following a scoring scheme for matches, mismatches and gaps. This procedure generates a matrix of numbers that represents all possible alignments between the sequences. The highest set of sequential scores in the matrix defines an optimal alignment. The dynamic programming method is guaranteed in a mathematical sense to provide the optimal alignment for a given set of user-defined variables, including the choice of scoring matrix and gap penalties.

In the global alignment of sequences using the Needleman-Wunsch algorithm, the optimal score at each matrix position is calculated by adding the current match score to previously scored positions and subtracting gap penalties. Each matrix position may have a positive or negative score, or 0. The Needleman-Wunsch algorithm will maximize the number of matches between the sequences along their entire length. Gaps may also be present at the ends of sequences, in case there is extra sequence left over after the alignment. These end gaps are often, but not always, given a gap penalty.

A local sequence alignment, which gives the highest-scoring local match between two sequences using the Smith-Waterman algorithm, often gives more meaningful matches than a global alignment. Patterns that are conserved in the sequences are highlighted. A local alignment tends to be shorter and may not include many gaps.

Using a distance scoring scheme, the dynamic programming method can be used to provide an alignment that highlights the evolutionary changes. This method scores alignments based on differences between sequences and sequence characters, i.e., how many changes are required to change one sequence into another. The greater the distance between sequences, the greater the evolutionary time that has elapsed since the sequences diverged from a common ancestor.

The first step in a global alignment dynamic program is to create a matrix with M+1 columns and N+1 rows, where M and N correspond to the sizes of the sequences to be aligned. The next step is to score (matrix fill) and the final step is to align (trace back).
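The three steps (create the matrix, fill it, trace back) can be sketched in Python as below. This is a minimal illustration under stated assumptions: identity scoring with match = 1 and mismatch = -1, a single linear gap penalty of -2 rather than an affine one, and a function name, align, chosen for this example. Setting local=True applies the two changes that turn the global (Needleman-Wunsch) scheme into the local (Smith-Waterman) one: scores are never allowed to fall below 0, and the trace back starts at the highest cell instead of the bottom right corner.

```python
def align(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return (score, aligned_a, aligned_b) for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: create the matrix (one extra row and column for leading gaps).
    score = [[0] * cols for _ in range(rows)]
    if not local:
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 2: matrix fill.
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + pair(i, j),  # align a[i-1], b[j-1]
                       score[i - 1][j] + gap,             # gap in b
                       score[i][j - 1] + gap)             # gap in a
            if local:
                cell = max(cell, 0)
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # Step 3: trace back.
    i, j = best_pos if local else (len(a), len(b))
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    final = best if local else score[len(a)][len(b)]
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("OPQRSTUVW", "OPQRTUVW"))
print(align("XXXACGTYYY", "ZZACGTZZ", local=True))
```

When several alignments share the optimal score, this sketch reports only one of them; the text notes that the method may provide more than one.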
Procedure
Go to the EBI alignment site (www.ebi.ac.uk/align). Once the home page appears, select the method, local or global. Paste the sequence of interest in the text box. Then press the RUN button.
Word or k-Tuple
The word or k-tuple methods are used by the FASTA and BLAST algorithms. They align two sequences very quickly, first by searching for identical short stretches of sequence called words or k-tuples and then by joining these words into an alignment by the dynamic programming method. These methods are fast enough to be suitable for searching an entire database for the sequence that aligns best with an input test sequence. The FASTA and BLAST methods are heuristic, i.e., they use an empirical method of computer programming in which rules of thumb are used to find solutions and feedback is used to improve performance. In database searching, the basic operation is to align the query sequence to each of the subject sequences in the database; because the word methods do this much faster than full dynamic programming, they are preferred for such searches.
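The first stage of a word method, finding identical k-tuples shared by two sequences and noting the diagonal on which each hit lies, can be sketched as follows. The function names are illustrative, and the sketch stops before the joining step; real programs add scoring and statistics on top of this lookup.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each k-letter word of seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count identical k-tuple hits per diagonal.

    A hit between query position i and subject position j lies on
    diagonal j - i; many hits on one diagonal suggest an ungapped
    region of similarity.
    """
    index = word_positions(query, k)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACGTACGT", "TTACGTACGT", 4)
print(hits.most_common(3))
```

Building the word index once and reusing it for every subject sequence is what makes this stage fast compared with filling a full dynamic programming matrix for each database entry.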
FASTA
FASTA is a DNA and protein sequence alignment software package. It was first described by David J. Lipman and William R. Pearson in 1985 as FASTP, which dealt only with protein sequences. In 1988 the ability to search DNA sequences was added.

Procedure: Open the internet browser and type the URL address: http://fasta.adbj.nig.ac.jp/top.e.html. The results can be received at any e-mail address. The following comparisons are available: (i) a nucleotide sequence with a nucleotide sequence database, or an amino acid sequence with an amino acid sequence database; (ii) a nucleotide sequence with an amino acid sequence database, by translating the query sequence in all six possible reading frames; (iii) an amino acid sequence with a nucleotide sequence database, by translating the database sequences in all six possible reading frames; (iv) an amino acid sequence with a nucleotide sequence database, by translating the database sequences in all six possible reading frames and allowing for frame-shift mutations.

We must specify: (i) the database in which homologous sequences are searched; (ii) the division in which homologous sequences are searched; (iii) how many homologous sequences are reported in the list of homology scores (the default value is 100); (iv) how many alignments with homologous sequences are reported (the default value is 100); and (v) the degree of sensitivity (ktup) of the search. The ktup value is usually set at 3-6 for nucleotide sequences and 1-2 for amino acid sequences. The smaller the ktup value, the more sensitive the search. The ktup value determines how many consecutive identities are required for a match to be declared.
The FASTA program achieves a high level of sensitivity for similarity searching at high speed. FASTA uses optimized local alignment and a substitution matrix for its sensitivity. First, FASTA prepares a list of words from the pair of sequences to be matched. A word is simply 3-6 nucleotides or 1 or 2 amino acids. It uses non-overlapping words. It matches the words and counts the matches. In a manner similar to dot matrix plotting and scoring, it builds the word diagonals and finds a high-scoring match. This output is labeled init1. Only if the score is sizable does it proceed to the second level. In the second level, for every best hit of words, it looks for neighboring approximate hits and, if the score is good, it collects the short init1 segments, prepares a larger dot matrix diagonal and scores it after including gap size and gap penalty. The best score from this second-level scoring is called initn. The initn scores are saved for each comparison of a query sequence with a database sequence. After all the database sequences are tested, the sequences that produce the best initn scores are used to produce a local alignment using the Smith-Waterman algorithm, which gives the opt score.

The FASTA format consists of a single header line followed by lines of sequence data. Sequences in FASTA-formatted files are preceded by a line starting with a ‘>’ symbol. The first word on this line is the name of the sequence, and the rest of the line is a description of the sequence. The remaining lines contain the sequence itself. Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols in a sequence. FASTA files containing multiple sequences are just the same, with one sequence listed after the other. This format is accepted by many multiple sequence alignment programs.
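The FASTA file format described above is simple enough to read with a few lines of Python. The parser below is a sketch: the function name parse_fasta and the example records are invented for illustration, and only spaces and the '-' gap symbol are stripped from sequence lines.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            # First word is the name, the rest of the line the description.
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Drop spaces and gap symbols inside the sequence.
            chunks.append(line.replace(" ", "").replace("-", ""))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


example = """>seq1 first toy record
ACGT AC

GT-A
>seq2
MTFR
"""
for record in parse_fasta(example):
    print(record)
```

Each new header line closes the previous record, which is how a file with multiple sequences listed one after the other is handled.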
BLAST
The BLAST (Basic Local Alignment Search Tool) program was developed by Altschul et al. in 1990. It has become very popular because of its efficiency and firm statistical foundation. BLAST works under the assumption that high-scoring alignments are likely to contain short stretches of identical or near-identical letters. These short stretches are called words. The first step in BLAST is to look for words of a certain fixed word length W that score higher than a certain threshold score (T). The value of W is normally 3 for protein sequences or 11 for nucleic acid sequences. BLAST takes a word from the query sequence and extends the match in both directions along the target sequence, totalling the scores for matches, mismatches, gap introduction and gap extension. Each word match is extended until the total score of the alignment falls from its maximum value by a certain amount; segments whose score reaches the cut-off value S are reported as high-scoring segment pairs.
BLAST is a heuristic search algorithm employed by different BLAST programs such as BLASTP, BLASTN, BLASTX, TBLASTN, TBLASTX and PSI-BLAST. BLASTP compares an amino acid query sequence against a protein sequence database. BLASTN compares a nucleotide query sequence against a nucleotide sequence database. BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. PSI-BLAST compares an amino acid query sequence against a protein sequence database. The FASTA and BLAST programs are essentially local similarity search methods that concentrate on finding short identical matches, which contribute to a total match.
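The extension step can be sketched as follows. This is a simplified, ungapped illustration and not the BLAST source code: the function name `extend_hit`, the +1/-1 scoring and the drop-off value `x_drop` are assumptions chosen for the example, whereas real BLAST scores residues with a substitution matrix.

```python
def extend_hit(query, target, qi, ti, w, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit of length w without gaps.

    Returns (best_score, q_start, q_end), where query[q_start:q_end]
    is the query side of the extended segment pair.
    """
    score = best = w * match
    q_end, k = qi + w, 0
    # Extend to the right until the score falls x_drop below its maximum.
    while qi + w + k < len(query) and ti + w + k < len(target):
        score += match if query[qi + w + k] == target[ti + w + k] else mismatch
        k += 1
        if score > best:
            best, q_end = score, qi + w + k
        elif best - score >= x_drop:
            break
    # Extend to the left in the same way, starting from the best score so far.
    score, q_start, k = best, qi, 1
    while qi - k >= 0 and ti - k >= 0:
        score += match if query[qi - k] == target[ti - k] else mismatch
        if score > best:
            best, q_start = score, qi - k
        elif best - score >= x_drop:
            break
        k += 1
    return best, q_start, q_end

# Hypothetical sequences sharing the word ACGT at position 2 in both.
query, target = "GGACGTACGG", "TTACGTACTT"
score, start, end = extend_hit(query, target, 2, 2, 4)
print(score, query[start:end])
```

A segment would then be reported only if its score reached the cut-off S.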
6.7 MULTIPLE SEQUENCE ALIGNMENT
A multiple sequence alignment is an alignment that contains more than two sequences. Analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to make decisions on the relatedness of sequences to a gene family. A multiple alignment is more informative about evolutionary conservation than a pairwise alignment. To be informative, a multiple alignment should contain a distribution of closely and distantly related sequences. In multiple sequence alignment, sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment. Multiple sequence alignment of a set of sequences can provide information about the most similar regions in the set. In proteins such regions may represent conserved functional or structural domains. If the structure of one or more members of the alignment is known, it may be possible to predict which amino acids occupy the same spatial relationship in other proteins in the alignment and which genes occupy sites in nucleic acids. Multiple sequence alignment is also used for the prediction of specific probes for other members of the same group or family of similar sequences in the same or other organisms. Many methods build on multiple sequence alignments, such as Profiles, Blocks and Fingerprints. Profiles use a weight matrix approach to summarize the whole alignment. Blocks seeks out conserved, un-gapped blocks of residues within alignments, which are then converted to position-specific scoring matrices. Fingerprints manually extracts highly specific, relatively short un-gapped motifs from alignments and uses them to generate unweighted scoring matrices.
All these methods use techniques such as aligning all pairs of sequences, aligning sequences in arbitrary order or aligning sequences following the branching order of a phylogenetic tree. The power of multiple sequence analysis lies in the ability to draw together related sequences from various species and express the degree of similarity in a relatively concise format. There are many multiple alignment databases which are accessible via the web. The key steps in building a multiple alignment are:
(i) Find the sequences to align by database searching or by other means.
(ii) Locate the region of each sequence to be included in the alignment.
(iii) Assess the similarities within the set of sequences by comparing them pairwise with randomizations.
(iv) Run the multiple alignment program.
(v) Manually inspect the alignment for problems.
(vi) Remove sequences that appear to disrupt the alignment seriously and then realign the remaining subset.
(vii) After identifying key residues in the set of sequences that are straightforward to align, attempt to add the remaining sequences to the alignment so as to preserve the key features of the family.
Methods
Many methods are available for applying multiple sequence alignments of known proteins to identify related sequences in database searches. Some important methods are: Profiles, Blocks, Fingerprints, PSI-BLAST and Hidden Markov Models (HMMs).
Profiles
Proteins of similar function usually share identical motifs. Therefore, motif prediction is more useful than trying to find similarity over the entire sequence of the protein. Proteins of similar or comparable function are usually descendants of a common ancestral protein. Often they share some amount of similarity in the sequence, particularly in the motifs. A sequence alignment usually supplies us with such families of proteins. Multiple alignments of this kind are often called profiles. A profile expresses the patterns inherent in a multiple sequence alignment of a set of homologous sequences. Such patterns have several applications:
• They permit greater accuracy in alignments of distantly related sequences.
• Sets of residues that are highly conserved are likely to be part of the active site, and give clues to function.
• The conservation patterns facilitate identification of other homologous sequences.
• Patterns from the sequences are useful in classifying subfamilies within a set of homologs.
• Sets of residues that show little conservation, and are subject to insertion and deletion, are likely to lie in surface loops and so to elicit antibodies that will cross-react well with the native structure.
• Most structure-prediction methods are more reliable if based on a multiple sequence alignment than on a single sequence. Homology modeling, for example, depends crucially on correct sequence alignment.
To use profile patterns to identify homologs, the basic idea is to match the query sequences from the database against the sequences in the alignment table, giving higher weight to positions that are conserved than to those that are variable. In the profiles database, the sequence information available within complete alignments is distilled into scoring tables or profiles. Profiles define which residues are allowed at given positions, which positions are highly conserved and which are degenerate, and which positions or regions can tolerate insertions. Once a multiple sequence alignment has been performed, a portion of the alignment that is highly conserved is identified and a type of scoring matrix called a profile is produced. A profile includes scores for amino acid substitutions and gaps (matches, mismatches, insertions, deletions) in each column of the conserved region so that an alignment of the region to a new sequence can be determined.
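The idea of a position-specific scoring table can be sketched in a few lines. This is only an illustration under stated assumptions: the names `build_profile` and `best_window` are hypothetical, background frequencies are taken as uniform, a simple pseudocount is used, and the gap scores that a full profile carries are left out.

```python
import math

def build_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column (gap symbols are skipped)."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r in alphabet]
        total = len(residues) + pseudocount * len(alphabet)
        profile.append({
            a: math.log2(((residues.count(a) + pseudocount) / total) / background)
            for a in alphabet
        })
    return profile

def best_window(profile, sequence):
    """Slide the profile along a sequence; return (best_score, start)."""
    width = len(profile)
    scores = [
        (sum(col.get(res, 0.0) for col, res in zip(profile, sequence[s:s + width])), s)
        for s in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Hypothetical conserved region taken from three aligned sequences.
alignment = ["HGKV", "HGRV", "HGKI"]
profile = build_profile(alignment)
print(best_window(profile, "AAAHGKVAAA"))
```

Conserved columns (here the first two) contribute large positive scores for the conserved residue, so they dominate the match, which is the weighting described above. The sequence passed to `best_window` must be at least as long as the profile.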
BLOCKS
The blocks concept is derived from the motif, the conserved stretch of amino acids that confers a specific function or structure on the protein. If the motifs of a protein family are aligned without introducing gaps in the sequences, we get blocks. In the BLOCKS database, conserved motifs, or blocks, are located by searching for spaced residue triplets, and a block score is calculated using the BLOSUM62 substitution matrix. The validity of blocks found by this method is confirmed by the application of a second motif-finding algorithm, which searches for the highest-scoring set of blocks that occur in the correct order without overlapping. Blocks within a family are converted to position-specific matrices, which are used to make independent database searches. Like profiles, blocks represent conserved regions in the multiple sequence alignment. Blocks differ from profiles in lacking insert and delete positions in the sequences. Every column includes only matches and mismatches (substituted positions without gaps).
Fingerprints
Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many or all of the conserved regions to create a signature or fingerprint, so that in a database search there is a higher chance of identifying a distant relative, whether or not all parts of the signature are matched. Protein fingerprints are groups of motifs that represent the most conserved regions of multiple sequence alignments.
PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) incorporates elements of both pairwise and multiple sequence alignment methods. Following an initial database search, PSI-BLAST allows automatic creation of position-specific profiles from groups of results that match the query above a defined threshold. Running the program several times can further refine the profile and increase search sensitivity.
HMMs
A Hidden Markov Model (HMM) is a statistical model that considers all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences. A localized region of similarity, including insertions and deletions, may also be modeled by an HMM. HMMs are probabilistic models consisting of a number of interconnecting states: they are essentially linear chains of match, delete or insert states, which can be used to encode sequence conservation within alignments. HMMs are the basis of the Pfam database. An HMM is a computational structure for describing the subtle patterns that define families of homologous sequences. HMMs are powerful tools for detecting distant relatives and for predicting protein folding patterns. HMMs include the possibility of introducing gaps into the generated sequence, with position-dependent gap penalties, and they carry out the alignment and the assignment of probabilities together.
Automatic Alignment
Central to sequence analysis is the multiple alignment. Consequently a vital tool for the sequence analyst is an alignment editor. Several automatic alignment programs are available now, either in a stand-alone form (such as ClustalW) or as components of larger packages (such as Pileup in GCG). But automatically calculated alignments almost invariably require some degree of manual editing, whether to remove spurious gaps, to rescue residue windows, or to correct misalignments. This often presents problems, as there is currently no standard format for alignments. Consequently, swapping between alignment programs is almost impossible without the use of ad hoc scripts to convert between disparate input and output formats. The advent of the object-oriented network programming language Java addresses some of these problems. Java-capable browsers may run applets on a variety of platforms. Applets are small applications loaded from a server via HTML pages; the software is loaded on-the-fly from the server and cached for that session by the browser.
CLUSTAL
CLUSTAL performs a global multiple sequence alignment using the following steps:
(i) Perform pairwise alignments of all of the sequences.
(ii) Use the alignment scores to produce a phylogenetic tree.
(iii) Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
The CLUSTAL approach exploits the fact that similar sequences are likely to be evolutionarily related. It aligns sequences in pairs, following the branching order of a family tree. Similar sequences are aligned first and more distantly related sequences are added later. Once pairwise alignment scores for each sequence relative to all others have been calculated, they are used to cluster the sequences into groups, which are then aligned against each other to generate the final multiple alignment. CLUSTAL has been revised many times. CLUSTAL W uses the positioning of gaps in closely related sequences to guide the insertion of gaps into those that are more distant. Similarly, information compiled during the alignment process about the variability of the most similar sequences is used to help vary the gap penalties on a residue-specific and position-specific basis.
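Steps (i) and (ii) can be sketched as follows. This is an illustration of the progressive idea only, not the CLUSTAL code: the names `nw_score` and `guide_order` are hypothetical, the scoring values are arbitrary, and a simple average-linkage clustering on raw alignment scores stands in for the tree-building method that CLUSTAL actually uses.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score by dynamic programming (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def guide_order(seqs):
    """Return the order in which clusters of sequence names would be merged."""
    # Step (i): score every pair of sequences.
    score = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
             for p in combinations(seqs, 2)}
    clusters = [(name,) for name in seqs]
    merges = []
    # Step (ii): repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        def avg(pair):
            x, y = pair
            return sum(score[frozenset((i, j))] for i in x for j in y) / (len(x) * len(y))
        x, y = max(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
        merges.append((x, y))
    return merges

# Hypothetical sequences: a and b are close, c is distant.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
print(guide_order(seqs))
```

The merge order plays the role of the branching order of the tree: in step (iii) the most similar pair would be aligned first and the distant sequence added last.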
CINEMA
CINEMA is a Colour Interactive Editor for Multiple Alignments, written in Java. The program allows creation of sequence alignments by hand, generation of alignments automatically (e.g. using ClustalW), and visualization and manipulation of sequence alignments currently resident at different sites on the Internet. In addition to its special advantage of allowing interactive alignment over the web, CINEMA provides links to the primary data sources, thereby giving ready access to up-to-date data, and a gateway to related information on the Internet. CINEMA is more than just a tool for colour-aided alignment preparation. The program also offers facilities for motif modification; database searching (using BLAST); 3D-structure visualization (where co-ordinates are available), allowing inspection of conserved features of alignments in a 3D context; generation of dotplots and hydropathy profiles; six-frame translation; and so on. The program is embedded in a comprehensive help-file (written in HTML) and is accessible both as a stand-alone tool from the DbBrowser Bioinformatics Web Server, and as an integral part of the PRINTS protein fingerprint database.
READSEQ
READSEQ is a very useful sequence format conversion tool. D.G. Gilbert of the Biology Department of Indiana University, USA, wrote it in 1990 to read formatted sequence files and convert the sequence information in them into another file with a different format. It automatically detects many sequence formats (FASTA/Pearson, Intelligenetics/Stanford, GenBank, NBRF, EMBL, GCG, DNA Strider, Fitch, PHYLIP V3.3, V3.4, PIR or CODATA, MSF, ASN.1 and PAUP NEXUS) and inter-converts them.
Procedure
Open the internet browser and type the URL address: http://www.bimas.cit.nih.gov/molbio.readsec/. Pull down the drop-down menu and select the desired format. Paste the sequence in the text box. Press the SUBMIT or RUN button.
6.8 ALGORITHMS FOR IDENTIFYING DOMAINS WITHIN A PROTEIN STRUCTURE
Zehfus (1994) proposed a method for identification of discontinuous domains based on their 'compactness'. The PUU (Parser for protein Unfolding Units) algorithm attempts to maximize the interactions within each unit (domain) while minimizing interactions between units. If a molecular dynamics simulation is carried out on a molecule, the residues that have the most correlated motion are likely to be part of a domain. Therefore, a harmonic model is used to approximate inter-domain dynamics. Differences in fluctuation times can be used for domain decomposition. However, a chain can cross over several times between units. To solve this problem the residues are grouped by solving an eigenvalue problem for the contact matrix; this reduces the problem to a one-dimensional search for all reasonable trial bisections. Physical criteria are used to identify units that could exist by themselves. The DOMAK algorithm calculates a 'split value' from the number of each type of contact when the protein is divided arbitrarily into two parts. This split value is large when the two parts of the structure are distinct. The DETECTIVE procedure for domain identification is based on the assumption that each domain should contain an identifiable hydrophobic core. However, it is possible that hydrophobic cores from different domains continue through the interface region. In this algorithm core residues are defined as those residues that occur in a regular secondary structure and have buried side chains that form predominantly nonpolar contacts with one another. An algorithm based on dividing the chain to minimize the density of inter-domain contacts has also been proposed. Another algorithm, based on cluster analysis of secondary structure, has also been suggested for identification of domains in protein structures.
A consensus method based on the assignments from three independent algorithms for domain recognition (DETECTIVE, PUU and DOMAK) was found to give better accuracy than any of the individual algorithms that were tested. Strudl (STRUctural domain limits) uses a Kernighan-Lin graph heuristic to partition the protein into residue sets that display minimal interactions between the sets. The graph specifies the connectivity between the nodes and is represented by a matrix. Starting from a reasonable partition, the algorithm minimizes a cost function that is based on interactions between the nodes. This is carried out by swapping pairs of nodes until an optimal partitioning is obtained. The interactions are deduced from the weighted Voronoi diagram.
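The common idea behind several of these algorithms, dividing the chain so that contacts between the parts are as sparse as possible, can be shown with a deliberately simple sketch. It is not any of the published algorithms: the function name `best_cut`, the 8 Å contact cutoff, the restriction to a single cut that yields two continuous domains, and the toy coordinates are all assumptions made for illustration.

```python
import math

def best_cut(coords, cutoff=8.0, min_size=3):
    """Find the chain position that minimizes the density of inter-domain contacts.

    coords: one (x, y, z) tuple per residue, in chain order.
    Returns (contact_density, k), meaning residues [0, k) and [k, n) form the two parts.
    """
    n = len(coords)
    # Two residues are 'in contact' if they lie within the cutoff distance.
    contact = [[math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
               for i in range(n)]
    best = None
    for k in range(min_size, n - min_size + 1):
        between = sum(contact[i][j] for i in range(k) for j in range(k, n))
        density = between / (k * (n - k))
        if best is None or density < best[0]:
            best = (density, k)
    return best

# Hypothetical 'protein': two well-separated clusters of five residues each.
coords = [(i, 0.0, 0.0) for i in range(5)] + [(100.0 + i, 0.0, 0.0) for i in range(5)]
print(best_cut(coords))
```

Methods such as PUU and Strudl address the harder case in which the chain crosses between domains several times, which a single cut point cannot represent.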
6.9 ALGORITHMS FOR STRUCTURAL COMPARISON
A wide variety of methods that make use of graph theory, distance matrices, dynamic programming, Monte Carlo, molecular dynamics, maximum likelihood criteria, etc. have been proposed. The double dynamic programming method for structure alignment requires two matrices. In protein structure comparison by alignment of distance matrices (DALI), the three-dimensional coordinates of each protein are used to calculate residue-residue distance matrices. The distance matrices are decomposed into elementary contact patterns, e.g. hexapeptide-hexapeptide similarities. Similar contact patterns are paired and combined into larger consistent sets of pairs. The alignments are evaluated by defining a similarity score. Unmatched residues do not contribute to the overall score. The primary advantage of this method is that it does not depend on the topological connectivity of the aligned segments. In addition, this algorithm tolerates sequence gaps of any length and chain reversals. It is fully automated and all structural classes can be treated with the same set of parameters.
The Combinatorial Extension of the optimal path (CE) algorithm is based on the concept of an aligned fragment pair (AFP). An aligned fragment pair consists of two structurally similar fragments, one from each structure. The similarity is defined on the basis of local geometry and not on global features such as orientation of secondary structures or topology. If a combination of AFPs represents a continuous alignment path, an attempt is made to extend it further; otherwise it is discarded. By considering different combinations of AFPs in this manner, a single optimal alignment is created.
The Vector Alignment Search Tool (VAST) is used for pairwise structural alignment. A unit of tertiary structural similarity is defined as pairs of secondary structural elements (SSE) that have similar type, relative orientation and connectivity. In comparing two domains, the sum of the superposition scores across these units is calculated.
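The distance matrices on which DALI operates are straightforward to compute, and a sketch shows why they are convenient: they do not change when a structure is translated or rotated, so two structures can be compared without first superposing them. The function names below are hypothetical, and the mean absolute difference used to compare two hexapeptide-hexapeptide sub-matrices is a crude stand-in for the similarity score that DALI actually defines.

```python
import math

def distance_matrix(ca_coords):
    """Residue-residue distance matrix from one (x, y, z) coordinate per residue."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]

def submatrix(dm, i, j, size=6):
    """Contact pattern between the fragments starting at residues i and j."""
    return [row[j:j + size] for row in dm[i:i + size]]

def pattern_difference(dm_a, ia, ja, dm_b, ib, jb, size=6):
    """Mean absolute difference between two contact patterns (0 means identical)."""
    a = submatrix(dm_a, ia, ja, size)
    b = submatrix(dm_b, ib, jb, size)
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / size ** 2

# Hypothetical coordinates: structure B is structure A shifted in space.
coords_a = [(float(i), 0.5 * i, 0.0) for i in range(12)]
coords_b = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in coords_a]
dm_a, dm_b = distance_matrix(coords_a), distance_matrix(coords_b)
print(pattern_difference(dm_a, 0, 6, dm_b, 0, 6))
```

Because only internal distances are compared, the two fragments being matched need not be connected in the same order in both chains, which is the independence from topological connectivity noted above.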
6.10 CARRYING OUT A SEQUENCE SEARCH
One of the important aims of bioinformatics is the prediction of protein function, and ultimately of structure, from the linear amino acid sequence. Given a newly determined sequence, one wants to know: What is this protein? To what family does it belong? What is its function? And how can one explain its function in structural terms? By searching secondary databases, which store abstractions of functional and structural sites characteristic of particular proteins, one can recognize patterns that allow one to infer relationships with previously characterized families. Similarly, by searching fold libraries, which contain templates of known structures, it is possible to recognize a previously characterized fold. Given the size of existing sequence databases, it is likely that searches with new sequences will uncover homologues; and, with the expansion of sequence pattern and structure template databases, the chances of assigning functions and inferring possible fold families are also improving. However, these advances in sequence and fold pattern recognition methods have not yet been matched by similar advances in prediction techniques. So if one cannot predict function or structure directly from sequence, but can identify homologues and recognize sequence and fold patterns that have already been seen, how does one use this information, given the bewildering array of databases to search, to build a sensible search method for novel sequences?
Essentially, one first checks for identical matches and then moves on to search for closely similar sequences in the primary databases. The strategy then involves searching for previously characterized sequence patterns and, where possible, fold patterns in a variety of pattern databases. The final step is the integration of results from all these searches to build a consistent family/functional/structural diagnosis. An interactive WWW tutorial, known as BioActivity, can be found at: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/prefacefrm.html
The first and fastest test to identify an unknown protein sequence fragment is to perform an identity search, preferably of a composite sequence database. OWL is a composite resource that can be queried directly by means of its query language.
Identity searches, which are suitable for peptides up to 30 residues in length, are possible via a web interface; this provides an easy-to-use form that conveniently shields the user from the syntax of the query language. An identity search will reveal in a matter of seconds whether an exact match to the unknown peptide already exists in the database. The following website is useful: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/nucleicfrm.html
If an identity search fails to find a match, the next step is to look for similar sequences, again preferably in a composite database. For best results it is recommended to perform similarity searches on peptides that are longer than 30 residues (the shorter the peptide, the greater the likelihood of finding chance matches that have no biological relevance). In most applications as much sequence information as possible should be used in a BLAST search (although this can lead to complications in interpreting output from searches with multi-domain or modular proteins). There are several important features to note in the BLAST output. First, one is looking for matches that have high scores with correspondingly low probability values. A very low probability indicates that a match is unlikely to have arisen by chance. As the probability values approach unity, the matches are considered more and more likely to be random. The second feature of interest is whether the results show a cluster of high scores (with low probabilities) at the top of the list, indicating a likely relationship between the query and the family of sequences in the cluster.
Heuristic search tools like BLAST do not always give clear-cut answers. Frequently the program will not be able to assign significant scores to any of its retrieved matches, even if a biologically relevant sequence appears in the hit-list. Such search tools do not always have the sensitivity to fish out the right answer from the vast number of sequences in the primary database; rather, they cast a coarse net, and it is then up to the user to pick out the best. Under these circumstances, where no individual high-scoring sequence or cluster of sequences is found, the third feature to consider is whether there are any observable trends in the type of sequences matched, i.e. do the annotations suggest that several of these are from a similar family? If there are possible clues in the annotations, the next step is to try to confirm these possibilities both by reciprocal BLAST searches (do retrieved matches identify the query sequence in a similarity search?) and by comparing results from searches of the secondary databases.
The first secondary database to consider is PROSITE. Within the tutorial, this is accessible for searching via the 'Protein sequence analysis-Secondary database searches' page: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html The database code is simply supplied to the relevant part of the form and the option to exclude patterns with a high probability of occurrence (i.e. rules) is switched on. The next step is to search the ISREC profile library. In addition to the profiles that have already been incorporated into the main body of PROSITE, the web server offers a range of pre-release profiles that have not yet been sufficiently documented for release through PROSITE.
Searching the complete collection of profiles is achieved, once again, by simply supplying the database code to the web form, remembering to change the format button from the default (plain text) to accept a SWISS-PROT ID: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
Another important resource to search is the Pfam collection of Hidden Markov Models. Searching is achieved via a web interface that requires the query sequence to be supplied to a text box: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html The sequence must be in FASTA format, which means that the query must be preceded by a header line consisting of the > symbol and a suitable sequence name.
Another key secondary resource is PRINTS, which provides a bridge between single-motif search methods, such as the one used to compile PROSITE, and domain-alignment/profile methods, such as those embodied in the profile library and Pfam. PRINTS is accessible for searching via the 'Protein sequence analysis – protein fingerprinting' page: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein2frm.html
The output is divided into distinct sections. First, the program offers an intelligent 'guess' based on the occurrence of the highest-scoring complete or partial fingerprint match or matches. It then provides an expanded calculation that shows the top 10 best-scoring matches; clearly, these include the results from the previous analysis, but the additional matches are provided to highlight why the best guess was chosen, and to allow a different choice if the guess is considered either to be wrong or to have missed something. The remaining sections of output provide more of the raw data, again allowing the user to search for anything that might have been missed. A particularly valuable aspect of this software is the facility to visualize individual fingerprint matches by clicking on the graphic box.
The next secondary resources to be searched are the BLOCKS databases, derived from PROSITE and PRINTS. If results matched in PROSITE and/or PRINTS are true-positive, then we would expect these to be confirmed by the BLOCKS search results. The BLOCKS databases are searched by supplying the query sequence to the input box of the relevant web form: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html One must remember in each case to switch to the required database. The accession codes in the Block column indicate the number of motifs; matches to these motifs are ranked according to score. The 'rank' of the best-scoring block, the so-called anchor block, is reported. Where additional blocks support the anchor block by matching with high scores in the correct order, a probability value is calculated, reflecting the likelihood of these matches appearing together in that order. Often results are littered with matches to high-scoring individual blocks. These matches are usually the result of chance, and p-values are not calculated for them. The information content of particular blocks can be visualized by examination of the sequence logo.
A sequence logo is a graphical display of a multiple alignment consisting of colour-coded stacks of letters representing amino acids at successive positions. The height of a given letter increases with increasing frequency of the amino acid, and the height of a stack increases with increasing conservation of the aligned position; hence, letters in stacks with single residues (i.e. representing conserved positions) are taller than those in stacks with multiple residues (i.e. where there is more variation). Within stacks, the most frequently occurring residues are not only taller, but also occupy higher positions in the stack, so that the most prominent residue at the top is the one predicted to be the most likely to occur at that position. To address the problem of sequence redundancy within blocks, which strongly biases residue frequencies, sequence weights are calculated using a position-specific scoring matrix (PSSM). This reduces the tendency for over-represented sequences to dominate stacks, and increases the representation of rare amino acids relative to common ones.
The final resource is IDENTIFY, which is searched by supplying the query sequence to the relevant web form: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
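The letter heights of a sequence logo, as described earlier in this section, can be computed with a short sketch. It assumes the usual information-content formulation, in which the stack height is the maximum possible entropy minus the observed entropy of the column; it omits the small-sample correction and the sequence weighting mentioned in the text, and the function name `logo_heights` is hypothetical.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=20):
    """Return {residue: letter_height_in_bits} for one alignment column.

    Letters are ordered from least to most frequent, i.e. bottom to top of the stack.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    freqs = {r: c / len(residues) for r, c in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Stack height: information content of the column, in bits.
    information = math.log2(alphabet_size) - entropy
    # Each letter takes a share of the stack proportional to its frequency.
    return {r: f * information
            for r, f in sorted(freqs.items(), key=lambda kv: kv[1])}

# Hypothetical columns: one fully conserved, one variable.
print(logo_heights("AAAA"))
print(logo_heights("AAGC"))
```

In the conserved column the single letter takes the full stack height, while in the variable column the stack is shorter and is shared among three letters, with the most frequent residue on top.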
We can find out more about the structure, either by following the links embedded in the PROSITE and PRINTS entries or by supplying a relevant PDB code in the query forms of the structure classification resources (such as SCOP and CATH). SCOP is accessible for searching via the 'protein structure analysis–structure classification resources' page: http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/structurefrm.html The CATH resource is queried by supplying the desired PDB code to the relevant form on the same web page. Clicking on the hyperlinked PDB code in the CATH summary takes the user to the PDBsum resource, a web-based collection of information for all PDB structures. A picture of the overall fold and secondary structure of the molecule is available here. Using this pictorial information, one can begin to rationalize the results of the secondary database searches in terms of structural and functional features of the 3D molecule, essentially by superposing the motifs matched in PROSITE, PRINTS and BLOCKS on to the sequence.
STUDY QUESTIONS
1. What is sequence alignment?
2. What are the goals of sequence alignment?
3. What are the types of sequence alignments?
4. How is dotplot analysis performed?
5. How is pairwise comparison done?
6. How are mutations, deletions and substitutions scored?
7. Which programs are used for pairwise database searching?
8. What is multiple sequence alignment?
9. Enumerate the key steps in building a multiple alignment.
10. Which programs are used in multiple alignment?
11. How can one carry out a sequence search?
12. What is a string?
13. What is Hamming distance?
14. What is Levenshtein (edit) distance?
CHAPTER 7
Predictive Methods using DNA and Protein Sequences
Since sequencing whole genomes can be achieved with greater ease today, deriving biological meaning from the long sequences of nucleotides obtained through sequencing has become a crucial biological research problem. Annotation is a word that is commonly used today to mean 'deriving useful biological information' from raw elements in genomic DNA (structural annotation) and then assigning functions to these sequences (functional annotation). With the advent of whole-genome sequencing projects, there is considerable use for computer programs that scan genomic DNA sequences to find genes, particularly those that encode proteins. Once a new genome sequence has been obtained, the most likely protein-encoding regions are identified and the predicted proteins are then subjected to a database similarity search. Prediction is an important component of bioinformatics. Assignment of structures to gene products is a first step in understanding how organisms implement their genomic information. Prediction helps to understand the structures of the molecules encoded in a genome, their individual activities and interactions, and the organization of these activities and interactions in space and time during the lifetime of the organism.
7.1 GENE PREDICTION STRATEGIES
Because the proteins present in a cell largely determine cell shape, role and physiological properties, one of the first orders of business in genome analysis is to determine the polypeptides encoded by an organism’s genome. To determine the list of polypeptides, the structure of each mRNA encoded by the genome must be deduced. Bioinformatics uses several independent sets of information to predict the most likely sequence for mRNA and polypeptide coding regions. The sets of information are:
cDNA sequences,
docking site sequences marking the start and end points for transcription, pre-mRNA splicing and translation,
sequences of related polypeptides, and
species-specific usage preferences for some codons over others encoding the same amino acid.
Figure 7.1 depicts how different sources of information are combined to create the best possible mRNA predictions. Predictions of mRNA and polypeptide structure from genomic DNA sequence depend on an integration of information from cDNA sequence, docking site predictions, polypeptide similarities and codon bias.
Categories
Gene finding strategies can be grouped into three categories, namely, content-based, site-based and comparative. Content-based methods rely on the overall, bulk properties of a sequence in making a determination. Characteristics considered here include how often particular codons are used, the periodicity of repeats, and the compositional complexity of the sequence. Because different organisms use synonymous codons with different frequencies, such clues can help identify regions that are more likely to be exons.
[Figure 7.1: schematic of a predicted gene showing the 5′ UTR, exons, introns, splice sites, open reading frame (ORF), translation termination site, polyadenylation site and 3′ UTR, annotated with predictions from protein evidence (BLAST similarity, codon bias, sequence motifs), from mRNA and its properties (ESTs, cDNA) and from docking site analysis programs (promoter, splice sites).]
Fig. 7.1 The different forms of gene product evidence (cDNAs, ESTs, BLAST similarity hits, codon bias, and motif hits) are integrated to make gene predictions. Where multiple classes of evidence are found to be associated with a particular genomic DNA sequence, there is greater confidence in the likelihood that a gene prediction is accurate. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
Site-based methods focus their attention on the presence or absence of a specific sequence, pattern, or consensus. These methods are used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, poly A tracts and start and stop codons. Comparative methods make determinations based on sequence homology. Here, translated sequences are subjected to database searches against protein sequences to determine whether a previously characterized coding region corresponds to the region in the query sequence. The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs). An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid.
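The ORF search described above can be illustrated with a short Python sketch. It is a minimal illustration and not one of the published gene finders: it assumes the standard stop codons, requires each ORF to begin with ATG, scans only the three forward reading frames, and uses a hypothetical function name `find_orfs` with an arbitrary minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A of ATG; end is exclusive and includes the
    stop codon. min_codons counts the coding codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A real gene finder would also scan the reverse complement and, for eukaryotic DNA, would have to deal with introns, which a simple ORF scan ignores.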
Prediction of RNA Secondary Structure
Sequence variation in RNA sequences maintains base-pairing patterns that give rise to double-stranded regions (secondary structure) in the molecule. Thus, alignments of two sequences that specify the same RNA molecule will show covariation at interacting base-pair positions. In addition to these covariable positions, sequences of RNA-specifying genes may also have rows of similar sequence characters that reflect the common ancestry of the genes. Computational methods are available for predicting the most likely regions of base-pairing in an RNA molecule. Methods for predicting the structure of RNA molecules include: (i) an analysis of all possible combinations of potential double-stranded regions by energy minimization methods and (ii) identification of base covariation that maintains secondary and tertiary structure of an RNA molecule during evolution. (Covariation analysis led C. Woese to the prediction of three domains of life: the Bacteria, the Eukarya and the Archaea.)
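As a rough illustration of how candidate double-stranded regions can be enumerated, the sketch below uses the Nussinov dynamic-programming scheme, which maximizes the number of base pairs. This is a deliberate simplification swapped in for illustration: real energy minimization programs score stacking and loop free energies rather than counting pairs. The function name, the inclusion of G-U wobble pairs and the minimum hairpin loop of three bases are assumptions.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (assumption for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Nussinov-style count of the maximum number of nested base pairs.

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(rna)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]   # dp[i][j]: best count for rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]        # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving a loop of >= min_loop
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

For the hairpin-like sequence GGGAAAUCC the three G residues can pair with U, C and C around an AAA loop, so the count is 3.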
7.1.1 Gene Prediction Programs
There are many commonly used methods, which are freely available in the public domain. GRAIL1 (Gene Recognition and Analysis Internet Link) makes use of a neural network method to recognize coding potential in fixed-length (100 base) windows considering the sequence itself, without looking for additional features such as splice junctions or start or stop codons. GRAIL2 uses variable-length windows. GRAIL-EXP uses additional information in making the prediction, including a database search of known complete and partial gene messages. FGENEH is a method that predicts internal exons by looking for structural features such as donor and acceptor splice sites, putative coding regions and intronic regions both 5’ and 3’ to the putative exon, using linear discriminant analysis. FGENES, an extension of FGENEH, is used in cases when multiple genes are expected in a given stretch of DNA. MZEF uses quadratic discriminant analysis to predict internal coding exons. GENSCAN predicts complete gene structures. It can identify introns, exons, promoter sites, and polyA signals, and it relies on a probabilistic model. The GenomeScan program assigns a higher score to putative exons that show similarity to known protein sequences. The PROCRUSTES program takes genomic DNA sequences and forces them to fit into a pattern as defined by a related target protein. GeneID finds exons based on measures of coding potential. GeneParser uses a neural network approach to determine whether each subinterval contains a first exon, internal exon, final exon or intron. HMMgene predicts whole genes in any given DNA sequence. Different methods produce different types of results. No one program provides a foolproof key to computational gene identification.

Web Addresses
FGENEX      : http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID      : http://www1.imim.es/geneid.html
GeneParser  : http://beagle.colordo.edu/~eesnyder/GeneParser.html
GENSCAN     : http://genes.mit.edu/GENSCAN.html
GRAIL       : http://compbio.ornl.gov/tools/index.shtml
GRAIL-Exp   : http://compbio.ornl.gov/grailexp/
HMMgene     : http://www.cbs.dtu.dk/services/HMMgene/
MZEF        : http://www.cshl.org/genefinder
PROCRUSTES  : http://www-hto.usc.edu/software/procrustes

7.2 PROTEIN PREDICTION STRATEGIES
One of the major goals of bioinformatics is to understand the relationship between the amino acid sequence and the three-dimensional structure of proteins. If this relationship is known, then the structure of a protein can be reliably predicted from its amino acid sequence. Prediction of these structures from sequence is possible using presently available methods and information. The alphabet of 20 amino acids found in proteins allows for much greater diversity of structure and function than the four-letter alphabet of nucleic acids, primarily because the differences in the chemical makeup of these residues are more pronounced. Each residue can influence the overall physical properties of the protein because these amino acids are either basic or acidic, hydrophobic or hydrophilic, and have straight chains, branched chains or aromatic rings. Thus, each residue has a certain propensity to form structures of different types in the context of a protein domain (sequence-specific conformation). The first step in predicting the three-dimensional shape of a protein is determining which regions of the backbone are likely to form helices, strands, and beta turns, the U-turn-like structures formed when a beta strand reverses direction in an antiparallel beta sheet.
Prediction Methods
Modeling the structure of biological macromolecules allows us to gain a great deal of insight into the molecule’s functional features. Modeling unknown protein structures based on their homologs is known as homology-based structural modeling. In this type of modeling, the experimentally determined structures are generally referred to as the ‘templates’ and the novel sequence homolog that lacks structural coordinates is called the ‘target’ sequence. The homology-based protein modeling approach entails four sequential steps. The first step involves the identification of known structures that are related in sequence to the target sequence using BLAST. In the second step, the potential templates are aligned with the target sequence to identify the most closely related template. In the third step, a model of the target sequence is calculated from the most suitable template identified in step two. The fourth step involves the evaluation of the modeled target structure using different criteria. The knowledge of evolutionarily conserved structural features of similar proteins from other species enables us to gain insight into the structure of the target sequence.
The observation that each protein folds spontaneously into a unique three-dimensional native conformation implies that nature has an algorithm for predicting protein structure from amino acid sequence. Some attempts to understand this algorithm are based solely on general physical principles; others are based on observations of known amino acid sequences and protein structures. A proof of our understanding would be the ability to reproduce the algorithm in a computer program that could predict protein structure from amino acid sequence. Most attempts to predict protein structure from basic physical principles alone try to reproduce the inter-atomic interactions in proteins, to define a computable energy associated with any conformation. Computationally, the problem of protein structure prediction then becomes a task of finding the global minimum of this conformational energy function.
So far this approach has not succeeded, partly because of the inadequacy of the energy function and partly because the minimization algorithms tend to get trapped in local minima. The alternative to a priori methods is the approach based on assembling clues to the structure of a target sequence by finding similarities to known structures. These are empirical or knowledge-based methods.
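The local-minimum problem can be seen even in one dimension. The sketch below is a toy under stated assumptions: the quartic "energy" function is invented for illustration and has nothing to do with a real protein force field, and plain gradient descent stands in for the far more elaborate minimizers used in practice.

```python
def energy(x):
    """Toy one-dimensional 'conformational energy' with two minima (illustrative only)."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    """Analytical derivative of energy()."""
    return 4 * x**3 - 6 * x + 1

def descend(x, step=0.01, iterations=5000):
    """Plain gradient descent: always moves downhill, so it stops at the nearest minimum."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

for start in (1.5, -1.5):
    x_min = descend(start)
    print(f"start {start:+.1f} -> x = {x_min:+.3f}, energy = {energy(x_min):.3f}")
```

Starting from x = 1.5 the search settles in the shallower minimum near x = 1.13, whereas starting from x = -1.5 it reaches the deeper minimum near x = -1.30. A minimizer that only moves downhill therefore returns whichever minimum lies closest to its starting conformation.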
The Ramachandran Plot
The Ramachandran plot (also known as a Ramachandran diagram or a phi (Φ), psi (Ψ) plot) was originally developed in 1963 by G.N. Ramachandran, C. Ramakrishnan and V. Sasisekharan. It is a way to visualize the backbone dihedral (torsional) angles phi against psi of amino acid residues in a protein structure. It shows the possible conformations of phi and psi angles for a polypeptide. The omega (ω) angle at the peptide bond is normally 180° since the partial double-bond character keeps the peptide planar. The backbone conformation of an entire protein can therefore be specified in terms of the phi and psi angles of each amino acid.
The torsional angles of each residue in a peptide define the geometry of its attachment to its two adjacent residues by positioning its planar peptide bond relative to the two adjacent planar peptide bonds; the torsional angles thereby determine the conformation of the residues and of the peptide. In sequence order, phi is the C(i-1), N(i), Cα(i), C(i) torsion angle and psi is the N(i), Cα(i), C(i), N(i+1) torsion angle. Ramachandran plotted phi values on the X-axis and psi values on the Y-axis. Plotting the torsional angles in this way graphically shows which combinations of angles are possible. The Ramachandran plot is used to judge the quality of a model by finding residues that are in unlikely or high-energy conformations (Figure 7.2).
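[Figure 7.2: Ramachandran plot with φ on the horizontal axis and ψ on the vertical axis, each running from –180° to 180°; the left-handed helical region is marked αL and glycine residues are marked G.]
Fig. 7.2 Sasisekharan-Ramakrishnan-Ramachandran plot of acylphosphatase (PDB code 2ACY). Note the clustering of residues in the α and β regions, and that most of the exceptions occur in glycine residues (labelled G).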
A Ramachandran plot can be used in two somewhat different ways. One is to show in theory which values, or conformations, of the phi and psi angles are possible for an amino acid residue in a protein. The second is to show the empirical distribution of data points observed either in a single structure, as used for structure validation, or in a database of many structures.
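Each point on a Ramachandran plot is a pair of torsion angles computed from four consecutive backbone atoms. The sketch below shows one common way to compute a torsion angle from Cartesian coordinates using only the Python standard library. The function name `dihedral` and the toy coordinates are assumptions for illustration; with real data, phi for residue i would be `dihedral(C(i-1), N(i), CA(i), C(i))` and psi would be `dihedral(N(i), CA(i), C(i), N(i+1))`.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (-180 to 180) defined by four points."""
    b0 = _sub(p0, p1)                  # bond pointing back from p1 to p0
    b1 = _sub(p2, p1)                  # central bond p1 -> p2
    b2 = _sub(p3, p2)                  # bond p2 -> p3
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)   # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy coordinates: a planar zigzag (trans) arrangement and a planar cis one.
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))
```

The trans arrangement gives a magnitude of 180° and the cis arrangement gives 0°, matching the usual convention for torsion angles.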
Bridging the Sequence Structure Gap
An understanding of structure leads to an understanding of function and mechanism of action. There is a big gap between the number of known sequences and the number of known structures. This is called the sequence-structure gap, and it is the main motivation for protein structure prediction. Structure prediction means predicting the relative position of every protein atom in three-dimensional space using only information from the protein sequence.
Structure prediction methods fall into categories such as comparative modeling, fold recognition, secondary structure prediction, ab initio prediction and knowledge-based prediction. Knowledge-based methods attempt to predict protein structure using information taken from the database of known structures. If a sequence of unknown structure (the target sequence) can be aligned with one or more sequences of known structure to show at least 25% identity in an alignment of 80 or more residues, then the known structure (the template structure) can be used to predict the structure adopted by the target sequence, using multiple alignment tools. This is comparative modeling (homology modeling). It produces a full-atom model of tertiary structure. When suitably related template structures do not exist for a particular target sequence, secondary structure prediction is an alternative. It provides a prediction of the secondary structure state of each residue: helix, strand (extended), or coil. These predictions are sometimes known as three-state predictions. Fold-recognition (threading) methods detect distant relationships and separate them from chance sequence similarities not associated with a shared fold. They operate by searching through a library of known protein structures and finding the one most compatible with the query sequence whose structure is to be predicted. Once the alignment between the sequence and the distantly related known structure has been obtained, a full three-dimensional model of the query protein can be built. Ab initio methods attempt to predict protein structures from first principles using theories from the physical sciences such as statistical thermodynamics and quantum mechanics. Of all these methods, comparative modeling is the most accurate and comprehensive structure prediction method.
7.2.1 Secondary Structure Prediction
Accurate prediction as to where α-helices, β-strands and other secondary structures will form along the amino acid chain of proteins is one of the greatest challenges in sequence analysis. Methods of structure prediction from amino acid sequence begin with an analysis of a database of known structures. These databases are examined for possible relationships between sequence and structure. The ability to predict secondary structure also depends on identifying types of secondary structural elements in known structures and determining the location and extent of these elements. The main types of secondary structures that are examined for sequence variation are α-helices, β-strands and coils. The basic assumption in all secondary structure prediction is that there should be a correlation between amino acid sequence and secondary structure. The usual assumption is that a given short stretch of sequence is more likely to form one kind of secondary structure than another.
Two early methods based on secondary structure propensity were those of Chou and Fasman and GOR (Garnier-Osgathorpe-Robson). These were based on the local amino acid composition of single sequences. Later, the use of evolutionary information from multiple alignments improved the accuracy of secondary structure prediction methods significantly, since during evolution structure is much more strongly conserved than sequence. Widely used methods of protein secondary structure prediction are: (i) the Chou-Fasman and GOR methods, (ii) neural network models and (iii) nearest-neighbor methods.
Chou-Fasman Method
The Chou-Fasman method is based on the assumption that each amino acid individually influences secondary structure within a window of sequence. It is based on analyzing the frequency of each of the 20 amino acids in α-helices, β-strands and turns. To predict a secondary structure, the following set of rules is used. The sequence is first scanned to find a short sequence of amino acids that has a high probability of starting a nucleation event that could form one type of structure. For α-helices, a prediction is made when four of six amino acids have a high probability (> 1.03) of being in an α-helix. For β-strands, a prediction is made when three of five amino acids have a probability of > 1.00 of being in a β-strand. These nucleated regions are extended along the sequence in each direction until the prediction values for four amino acids drop below 1. If both α-helix and β-strand regions are predicted for the same stretch, the higher probability prediction is used. Turns are predicted a little differently. Turns are modeled as a tetrapeptide, and two probabilities are calculated. First, the average of the probabilities for each of the four amino acids being in a turn is calculated, as for the α-helix and β-strand prediction. Second, the probabilities of amino acid combinations being present at each position in the turn tetrapeptide are determined. These probabilities for the four amino acids in the candidate sequence are multiplied to calculate the probability that the particular tetrapeptide is a turn. A turn is predicted when the first probability value is greater than the probabilities for an α-helix and a β-strand in the region and when the second probability value is greater than 7.5 × 10^-5.
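The helix nucleation rule (four of six residues above 1.03) can be sketched in a few lines of Python. This is only an illustration of that single rule, not the full Chou-Fasman procedure: extension, β-strand and turn prediction are omitted. The helical propensities are the values listed in Table 7.1, keyed here by one-letter amino acid codes, and the function name is hypothetical.

```python
# Helical propensities from Table 7.1, keyed by one-letter amino acid code.
HELIX_PROPENSITY = {
    "E": 1.59, "A": 1.41, "L": 1.34, "M": 1.30, "Q": 1.27,
    "K": 1.23, "R": 1.21, "H": 1.05, "V": 0.90, "I": 1.09,
    "Y": 0.74, "C": 0.66, "W": 1.02, "F": 1.16, "T": 0.76,
    "G": 0.43, "N": 0.76, "P": 0.34, "S": 0.57, "D": 0.99,
}

def helix_nucleation_sites(sequence, window=6, required=4, threshold=1.03):
    """Return start indices of windows in which at least `required` residues
    have a helical propensity above `threshold`."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(HELIX_PROPENSITY[aa] > threshold for aa in segment)
        if formers >= required:
            sites.append(start)
    return sites

print(helix_nucleation_sites("GPGEALMKQGPG"))
```

In the toy sequence GPGEALMKQGPG, the central stretch EALMKQ is rich in helix formers, so the windows starting at positions 1 to 5 (counting from 0) qualify as nucleation sites, while the glycine- and proline-rich ends do not.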
GOR Method
The GOR method is based on the assumption that amino acids flanking the central amino acid residue influence the secondary structure that the central residue is likely to adopt. It uses the principles of information theory to derive predictions. Known secondary structures are scanned for the occurrence of amino acids in each type of structure. The frequency of each type of amino acid at the next 8 amino-terminal and carboxy-terminal positions is also determined, making the total number of positions examined equal to 17, including the central one.
Neural Network Prediction
In the neural network approach, computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in these structures. In theory, these neural network models can extract more information from sequences than the statistical methods described above. PHD and NNPREDICT are two neural network programs. Neural network models are meant to simulate the operation of the brain.
Nearest-neighbor Prediction
Like neural networks, nearest-neighbor methods are a type of machine learning method. They predict the secondary structural conformation of an amino acid in the query sequence by identifying sequences of known structure that are similar to the query sequence. A large list of short sequence fragments is made by sliding a window of a chosen length along a set of approximately 100-400 training sequences of known structure that have minimal sequence similarity to each other, and the secondary structure of the central amino acid in each window is recorded. A window of the same size is selected from the query sequence and compared to each of the above sequence fragments, and the 50 best matching fragments are identified. The frequencies of the known secondary structure of the middle amino acid in each of these matching fragments are then used to predict the secondary structure of the middle amino acid in the query window.
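The procedure just described can be sketched as follows. The sketch rests on several simplifying assumptions: fragments are compared by counting identical residues rather than with a substitution matrix, the prediction is a simple majority vote, and the two "training" sequences with their structure strings (H for helix, E for strand) are invented for illustration. All function names are hypothetical.

```python
from collections import Counter

def build_fragments(training, window):
    """Slide a window along each (sequence, structure) pair and record each
    fragment together with the structure state of its central residue."""
    half = window // 2
    fragments = []
    for seq, ss in training:
        for i in range(len(seq) - window + 1):
            fragments.append((seq[i:i + window], ss[i + half]))
    return fragments

def identity(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))

def predict_center(query_window, fragments, k=50):
    """Predict the state of the central residue from the k best-matching fragments."""
    best = sorted(fragments, key=lambda f: identity(query_window, f[0]),
                  reverse=True)[:k]
    votes = Counter(state for _, state in best)
    return votes.most_common(1)[0][0]

# Invented toy training set: one all-helix and one all-strand sequence.
training = [("AEELLKKA", "HHHHHHHH"), ("VTVTVYVG", "EEEEEEEE")]
fragments = build_fragments(training, window=5)
print(predict_center("EELLK", fragments, k=3))
print(predict_center("VTVYV", fragments, k=3))
```

With this toy data the first query window is assigned H and the second E. Published methods use much larger training sets and more sensitive scoring of the match between query and fragment.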
7.2.2 Propensity for Secondary Structure Formation
A number of attempts have been made to predict the secondary structure by using the amino acid sequence alone. Solution studies of model polypeptides have indicated that amino acids show large variations in their propensity to adopt regular conformations. The earliest attempts at secondary structure prediction were based on parameterization of physical models. These physico-chemical studies on model polypeptides indicated that the propensity of an amino acid to extend a helix could be different from its propensity to nucleate a helix. Chou and Fasman suggested an approach that was based on a statistical model. In this approach, the frequency of occurrence of a particular amino acid in a particular conformation is compared with the average frequency of occurrence of all amino acids in that conformation. The resulting ratio is the propensity of the amino acid to occur in that conformation. These values were used to classify amino acids into different classes and to formulate rules for secondary structure prediction. Both the Chou and Fasman and the GOR methods make use of the idea of secondary structure propensity. The amino acids seem to have preferences for certain secondary structure states, which are shown in Table 7.1. For instance, glutamic acid has a strong preference for the helical secondary structure and valine has a strong preference for the strand structure, whereas residues such as glycine and proline have lower than average propensity for both types of regular secondary structure, reflecting a tendency to be found in loops.

Table 7.1: Helical and strand propensities of the amino acids. A value of 1.0 indicates that the preference of that amino acid for the particular secondary structure is equal to that of the average amino acid; values greater than one indicate a higher propensity than the average; values less than one indicate a lower propensity than the average. (The values are calculated by dividing the frequency with which the particular residue is observed in the relevant secondary structure by the frequency for all residues in that secondary structure.)

Amino acid    Helical (α) propensity    Strand (β) propensity
GLU           1.59                      0.52
ALA           1.41                      0.72
LEU           1.34                      1.22
MET           1.30                      1.14
GLN           1.27                      0.98
LYS           1.23                      0.69
ARG           1.21                      0.84
HIS           1.05                      0.80
VAL           0.90                      1.87
ILE           1.09                      1.67
TYR           0.74                      1.45
CYS           0.66                      1.40
TRP           1.02                      1.35
PHE           1.16                      1.33
THR           0.76                      1.17
GLY           0.43                      0.58
ASN           0.76                      0.48
PRO           0.34                      0.31
SER           0.57                      0.96
ASP           0.99                      0.39
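The propensity ratio can be computed directly from sequences with known secondary structure. The sketch below follows the definition given with Table 7.1: the fraction of a residue type found in a given state is divided by the fraction of all residues found in that state. The function name and the tiny example data set are assumptions made for illustration; real propensity tables are derived from large sets of experimentally determined structures.

```python
from collections import Counter

def propensities(pairs, state):
    """Propensity of each amino acid for `state`, from (sequence, structure) pairs.

    P(aa) = (fraction of aa residues found in state) /
            (fraction of all residues found in state)
    """
    total = Counter()      # occurrences of each amino acid
    in_state = Counter()   # occurrences of each amino acid in the given state
    for seq, ss in pairs:
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state does not occur in the data")
    overall = n_state / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}

# Invented toy data: H = helix, C = coil.
print(propensities([("EEGG", "HHHC")], "H"))
```

In this toy example both glutamate residues are helical but only one of the two glycines is, so glutamate scores above 1 and glycine below 1.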
The accuracy of these early methods based on the local amino acid composition of single sequences was fairly low, with often less than 60% of residues being predicted in the correct secondary structure state.
7.2.3 Intrinsic Tendency of Amino Acids to Form β-turns
Crystal structure data were analyzed to calculate the frequency of occurrence of pairs of amino acids in β-turns. The observed frequencies were Pro-Asn (63%), Pro-Phe (50%), Pro-Gly (38%), Pro-Ser (31%) and Pro-Val (8%). However, a statistical analysis using a different criterion for assigning β-turns found a substantial difference in the order of preference. In the set of protein structures in the database, the order of preference was found to be: Pro-Gly > Pro-Asn > Pro-Ser > Pro-Val > Pro-Phe.
The propensity to form β-turns was evaluated by measuring the standard Gibbs free energy of peptide cyclization in the model tetrapeptides cys-Pro-XPro. The observed order of preference was found to be Pro-Asn > Pro-Gly > Pro-Ser > Pro-Phe > Pro-Val. Measurements of the temperature dependence of the NMR chemical shifts in the model peptides Tyr-Pro-X-Asp-Val provide an indication of the β-turn populations. These NMR data indicated that the β-turn populations were in the order Pro-Gly > Pro-Asn > Pro-Phe > Pro-Ser > Pro-Val. A combined analysis of the thermodynamic, solution NMR and crystal-structure (statistical) data indicates that the order of preference is Pro-Gly, Pro-Asn > Pro-Ser > Pro-Val. Although the relative position of Pro-Phe appears to be highly variable in this series, for the other peptides there is a reasonable correlation between the statistical preferences calculated from the database of protein structures and the preferences based on thermodynamic and NMR measurements on model compounds.
7.2.4 Rotamer Libraries
Rotamers are low-energy conformations of side chains. Pioneering work on side chain conformational preferences has indicated that a few side chain conformers are much more likely than others. This result stimulated a number of studies to characterize the probability that a given side chain will occur in a particular conformation in a given amino acid, and its dependence on the main chain conformation. Using the vastly increased size of the structure database, a number of such rotamer libraries have been developed. Rotamer libraries can be used in molecular modeling to add the most likely side-chain conformation to the backbone.
7.2.5 Three-Dimensional Structure Prediction
Protein structural comparisons have shown that newly found protein structures often have a structural fold or architecture similar to that of an already known structure. Structural comparisons have also revealed that many different amino acid sequences in proteins can adopt the same structural fold. Examination of sequences in structures has also revealed that the same short amino acid patterns may be found in different structural contexts. Structural alignment studies have revealed that there are more than 500 common structural folds found in the domains of the more than 12500 three-dimensional structures that are in the Brookhaven Protein Data Bank. Thus, there are many combinations of amino acids that can fit together into the same three-dimensional conformation, filling the available space and making suitable contacts with neighboring amino acids to adopt a common three-dimensional structure. There is also a reasonable probability that a new sequence will possess an already identified fold. The object of fold recognition is to discover which fold is best matched. Hidden Markov models (discrete state-space models) and threading are used to predict three-dimensional structures.
If two proteins share significant sequence similarity, they should also have similar three-dimensional structures. The similarity may be present throughout the sequence lengths or in one or more localized regions having relatively short patterns that may or may not be interrupted with gaps. When a global sequence alignment is performed, if more than 45% of the amino acid positions are identical, the amino acids should be quite superimposable in the three-dimensional structure of the proteins. Thus, if the structure of one of the aligned proteins is known, the structure of the second protein and the position of the identical amino acids in this structure may be reliably predicted. If less than 45% but more than 25% of the amino acids are identical, the structures are likely to be similar, but with more variation at the corresponding three-dimensional positions as the identity level falls.
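The identity thresholds above can be applied to a pairwise alignment with a few lines of Python. This is a minimal sketch under stated assumptions: the two sequences are already globally aligned with "-" as the gap character, identity is computed over columns in which neither sequence has a gap (other denominators are also in use), and the function names and category labels are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)

def structural_expectation(identity):
    """Map percent identity to the rule-of-thumb categories given in the text."""
    if identity > 45:
        return "structures expected to be closely superimposable"
    if identity > 25:
        return "structures likely to be similar, with more variation"
    return "sequence identity alone is not informative"

pid = percent_identity("ACDE-FGH", "ACDQWFGA")
print(round(pid, 1), "->", structural_expectation(pid))
```

In the example, seven columns are free of gaps and five of them are identical, giving about 71% identity.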
7.2.6 Comparative Modeling
Comparative modeling, commonly referred to as homology modeling, is useful when a 3D structure of a sequence that shares substantial similarity to the protein sequence of interest is available. The two sequences are aligned to identify segments that share sequence similarity. If more than one structure is available, multiple sequence alignment is used. The reliability of structure prediction from the comparative modeling approach increases substantially if more than one structure of a protein with substantial sequence similarity is available. The quality of the alignment substantially affects the accuracy of the subsequent structure prediction. After alignment has been used to identify corresponding residues, the structure of the desired protein is predicted by making use of the structure of the homolog. Several algorithms are available for this step. They can be broadly classified as: (i) rigid body assembly, (ii) segment matching and (iii) satisfaction of spatial restraints. In the rigid body assembly approach, the structure is assembled from rigid bodies that represent the core, loop regions, side chains, etc. These rigid bodies are identified from related structures and added onto a framework that is obtained by averaging the positions of the template atoms in the conserved regions of the fold. In the segment matching procedure, coordinates are calculated from the approximate positions of conserved atoms of the templates. For this, use is made of a database of short segments of protein structure. This may be supplemented by energy or geometry rules. In the spatial restraints approach, the alignment of the sequence of interest with one or more structural templates is used to derive a set of distance constraints. Subsequently, distance geometry, restrained energy minimization or restrained molecular dynamics can be used to obtain the structure.
Steps
Steps in comparative modeling are:
1. Align the amino acid sequences of the target and the protein or proteins of known structure.
2. Determine main chain segments to represent the regions containing insertions or deletions. Stitching these regions into the main chain of the known protein creates a model for the complete main chain of the target protein.
3. Replace the side chains of residues that have been mutated. For residues that have not mutated, retain the side chain conformation.
4. Examine the model (both by eye and by programs) to detect any serious collisions between atoms. Remove these collisions.
5. Refine the model by limited energy minimization.
7.2.7 Threading
Threading is a method for fold recognition. It addresses the question: given a library of known structures and the sequence of a query protein of unknown structure, does the query protein share a folding pattern with any of the known structures? Threading is a technique to match a sequence with a protein shape. It is based on the observation that even proteins that have very low sequence identity often have similar structures. Threading may be used in the absence of any substantial sequence identity to proteins of known structure, whereas comparative modeling requires protein structures that have substantial sequence similarity to the protein sequence of interest. The sequence of interest is matched against a database of known folds and the protein is assumed to have the same fold as the best match. Theoretical considerations indicate that the total number of possible folds for proteins is limited. Hence it is possible to predict the structure of a representative protein for each possible fold. The basic idea of threading is to build many rough models of the query protein, based on each of the known structures and using different possible alignments of the sequences of the known and unknown proteins. Threading approaches may be based on sequence information, structural information or both. The two essential components of threading are: (a) finding an optimal alignment (with gaps) of a sequence onto a structure and (b) scoring different alignments and deciding on the best shape. Scoring may be carried out by (i) mapping the structural information to create a profile for each structural site, or (ii) using a potential based on pairwise interactions. In general, the models based on pairwise interactions have greater discriminatory ability. However, it is more difficult and more computationally expensive to find an optimal alignment using a pairwise interaction potential.
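The profile-style scoring in option (i) can be caricatured in a few lines of Python. Everything in the sketch is an assumption made for illustration: each fold is reduced to a string of site environments (B for buried, E for exposed), the score rewards hydrophobic residues at buried sites and polar residues at exposed sites, the hydrophobic set is a crude choice, and the sequence is placed on the template without gaps. Real threading programs use much richer environment profiles or pairwise potentials and allow gapped alignments.

```python
HYDROPHOBIC = set("AVILMFWC")  # crude hydrophobic class (assumption)

def site_score(residue, environment):
    """+1 if the residue suits the site (hydrophobic in buried, polar in exposed), else -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1

def thread_score(sequence, template):
    """Best score over all ungapped placements of the sequence on the template."""
    best = None
    for offset in range(len(template) - len(sequence) + 1):
        score = sum(site_score(res, template[offset + i])
                    for i, res in enumerate(sequence))
        if best is None or score > best:
            best = score
    return best

def best_fold(sequence, library):
    """Score the sequence against every fold long enough to hold it; return the top fold and all scores."""
    scores = {name: thread_score(sequence, template)
              for name, template in library.items()
              if len(template) >= len(sequence)}
    return max(scores, key=scores.get), scores

# Hypothetical two-fold library: an alternating pattern and a segregated pattern.
library = {"foldA": "BEBEBEBE", "foldB": "EEEEBBBB"}
print(best_fold("LKVEIS", library))
```

The query LKVEIS alternates hydrophobic and polar residues, so it fits the alternating environment pattern of the hypothetical foldA (score 6) better than foldB (score 0).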
7.2.8 Energy-based Prediction of Protein Structure
The essence of energy-based approaches is to compute the conformation-dependent potential energy for different conformations; the conformation with the lowest energy is assumed to be the structure of the molecule under investigation. The form of the potential energy function is based upon the known physics of interacting bodies. The potential energy function contains terms corresponding to well understood interactions such as the coulombic interaction between charged bodies, terms for interaction between polarizable atoms, etc. In the case of force fields that use variable geometry, terms are included for deviations from an assumed ‘ideal geometry’. The ideal geometries for different residues are defined on the basis of high-resolution structures of model compounds. The parameters for the potential energy function may be obtained from ab initio quantum mechanical calculations, from thermodynamic, spectroscopic or crystallographic data, or from a combination of these. Ab initio attempts to locate the global energy minimum have been less successful than the knowledge-based approaches. The reasons for this are: (i) the inaccuracy of existing energy functions and (ii) the computational difficulty of searching for the global minimum. The development of energy-based methods (in particular, force fields based fully on the physics of interacting bodies and capable of recognizing the native structure as the lowest-energy one) would be a major step forward towards understanding the role of particular interactions in the formation of protein structure and the mechanisms of protein folding. For practical reasons, a global minimum search for real-size proteins is unfeasible at the all-atom level; therefore united-residue models of polypeptide chains have received greater attention.
After the global minimum is found at the united-residue level, the structure can be converted to an all-atom representation, followed by a limited exploration of the conformational space in the neighborhood of the converted structure. The whole approach is referred to as the hierarchical approach to protein folding.
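The selection of the lowest-energy conformation can be sketched as follows. This is a minimal illustration, not a real force field: it assumes only two non-bonded terms (a Coulomb term and a Lennard-Jones term), and the parameter values `epsilon` and `sigma` are hypothetical placeholders. The constant 332.06 converts charge products in elementary units and distances in angstroms to kcal/mol.

```python
import math
from itertools import combinations

COULOMB_K = 332.06  # kcal * angstrom / (mol * e^2)


def pair_energy(r, qi, qj, epsilon=0.1, sigma=3.5):
    """Coulomb plus Lennard-Jones energy of one atom pair at distance r."""
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_K * qi * qj / r
    return lennard_jones + coulomb


def potential_energy(coords, charges):
    """Sum the pair energies over all atom pairs of one conformation."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        total += pair_energy(r, charges[i], charges[j])
    return total


def lowest_energy(conformations, charges):
    """Return the name of the conformation with the lowest energy."""
    return min(conformations,
               key=lambda name: potential_energy(conformations[name], charges))


# Two oppositely charged atoms placed at three different separations.
charges = [1.0, -1.0]
conformations = {
    "clash": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "near": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    "far": [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
}
best = lowest_energy(conformations, charges)
```

The example enumerates three candidate conformations explicitly. For a real protein the number of conformations is far too large to enumerate, which is the search difficulty described above.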
7.2.9 Protein Function Prediction
Comparison of protein structures may reveal relationships with distant homologs of known function, and this homology might be used to predict function. If the homologs have a high degree of sequence similarity, then methods based on sequence comparison might be adequate for identifying the relationships. However, when the sequence similarity is low, a comparison of protein structures might reveal relationships that are not evident from methods based only upon analysis of the sequences of the proteins. As proteins evolve they may (i) retain function and specificity, (ii) retain function but alter specificity, (iii) change to a related function or a similar
function in a different metabolic context, and (iv) change to a completely unrelated function. Proteins of similar structure and even of similar sequence can be recruited for very different functions. Very widely diverged proteins may retain similar functions. Moreover just as many different sequences are compatible with the same structure, unrelated proteins with different folds can carry out the same functions. In a series of homologous enzymes the identification of a set of highly conserved residues that are spatially close but are not required for structural stabilization might indicate that they are the active site residues. The nature of the active site residues might provide clues about the function and the mechanism of action of the enzyme.
Domains
Certain proteins contain specific modules that mediate protein-protein interactions. The identification of such domains in a particular protein can provide clues about its interacting partners. For example, the presence of an SH2 domain or a PTB domain in a protein indicates that it will bind to another protein containing a phosphotyrosine residue. The presence of a monomeric PDZ domain indicates that the protein might interact either with another protein that contains a PDZ/LIM domain or with the C-terminal region of membrane proteins. The presence of a Pleckstrin homology domain in a protein indicates that it is likely to be involved in signal transduction and that it might bind to the acid-rich regions of proteins involved in signal transduction or to phosphoinositides.
In X-ray diffraction studies of crystals, the technique of molecular replacement is used to obtain an initial set of phases. If a protein that shares substantial sequence similarity with the protein of interest is available in the database, then its structure may be used for building a model of the protein of interest using comparative modeling. The coordinates of the atoms in this structure can be used for calculating the structure factors. The phases of the resulting structure factors and the measured magnitudes of the structure factors are then used for calculation of a new electron density model. The resulting model can then be subjected to Fourier or least-squares refinement.
7.3 PROTEIN PREDICTION PROGRAMS
A number of computational tools have been developed for identifying unknown proteins on the basis of the chemical and physical properties of each of the 20 amino acids. Many of these tools are available through the ExPASy server at the Swiss Institute of Bioinformatics. AACompIdent uses the amino acid composition of an unknown protein to identify known proteins of the same composition. AACompSim, a variant of AACompIdent, uses the sequence of a SWISS-PROT protein.
PROPSEARCH uses the amino acid composition of a protein to detect weak relationships between proteins and so to discern members of the same protein family. The MOWSE (Molecular Weight Search) algorithm uses information obtained through mass spectrometric techniques. There are a few other tools which help to analyse physical properties based on sequence. Compute pI/MW and ProtParam calculate the isoelectric point and molecular weight of an input sequence. PeptideMass determines the cleavage products of a protein after exposure to a given protease or chemical reagent. TGREASE calculates the hydrophobicity of a protein along its length. The SAPS (Statistical Analysis of Protein Sequences) algorithm provides extensive statistical information for any given query sequence. There are a few other tools used to analyse motifs and patterns. BLAST searches are performed to identify sequences in the public databases that are similar to a query sequence of interest. PSI-BLAST is used to identify new, distantly related members of a protein family. ProfileScan uses a program called pfscan to find similarities between a protein or nucleic acid query sequence and a profile library. The BLOCKS database uses the concept of blocks to identify a family of proteins, rather than relying on the individual sequences themselves. CDD (Conserved Domain Database) is used to identify conserved domains within a protein sequence. There are a few tools which are used to analyse secondary structure and folding classes. The nnpredict algorithm uses a two-layer, feed-forward neural network to assign the predicted type for each residue; the query sequence is supplied in FASTA format. PredictProtein uses SWISS-PROT, MaxHom and PHDsec algorithms to predict secondary structure. The PREDATOR algorithm uses database-derived statistics on residue-type occurrences in different classes of local hydrogen-bonded structures.
PSIPRED uses two feed-forward neural networks to perform the analysis on the profile obtained from PSI-BLAST. SOPMA (Self-Optimized Prediction Method) builds sub-databases of protein sequences of known secondary structure, selected on the basis of sequence similarity. The information from the sub-databases is then used to generate a prediction for the query sequence. SOPMA is a combination of five other methods (the Garnier-Gibrat-Robson (GOR) method, the Levin homolog method, the double-prediction method, the PHD method, and the CNRS method). Jpred integrates six different structure prediction methods and returns a consensus prediction based on a simple majority rule. The Jpred server runs PHD, DSC, NNSSP, PREDATOR, ZPRED and MULPRED. There are some algorithms which are useful for identifying specialized structures or features. The COILS algorithm runs a query sequence against a database of proteins known to have a coiled-coil structure. TMpred and PHDtopology are used to predict transmembrane regions. SignalP is used to detect signal peptides and their cleavage sites. SEG is used to detect nonglobular regions. DALI, SWISS-MODEL and TOPITS are used for tertiary structure prediction.
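The majority-rule consensus described for Jpred can be sketched in a few lines. This is an illustration of the voting idea only and is not Jpred's actual code: it assumes that each method returns a string of per-residue states (H for helix, E for strand, C for coil), and the choice of "-" for positions without a majority is an assumption made here.

```python
from collections import Counter


def consensus(predictions, no_majority="-"):
    """Combine per-residue predictions from several methods by simple majority.

    A state is accepted at a position only if more than half of the methods
    agree on it; otherwise the no_majority symbol is written.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        state, count = Counter(column).most_common(1)[0]
        result.append(state if count > len(predictions) / 2 else no_majority)
    return "".join(result)


# Three hypothetical method outputs for a six-residue query.
methods = ["HHHEEC", "HHCEEC", "HCCEHC"]
combined = consensus(methods)  # "HHCEEC"
```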
ROSETTA is a program that predicts protein structure from amino acid sequence by assimilating information from known structures. It predicts a protein structure by first generating structures of fragments using known structures, and then combining them. LINUS (Local Independently Nucleated Units of Structures) is a program for prediction of protein structure from amino acid sequence. It is a completely a priori procedure, making no explicit reference to any known structure or sequence-structure relationships.
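Among the physical-property tools listed in this section, the calculation of hydrophobicity along the length of a protein is easy to illustrate. The sketch below assumes the Kyte-Doolittle hydropathy scale and a simple sliding-window average; the window length of 7 is an arbitrary choice, and the function name `hydropathy_profile` is hypothetical rather than part of any of the programs named above.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Average hydropathy of every window of the given length.

    Returns one value per window position, or an empty list when the
    sequence is shorter than the window.
    """
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    if window > len(scores):
        return []
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


profile = hydropathy_profile("AAAAARRRRR", window=5)
```

High positive stretches in such a profile suggest hydrophobic segments, for example candidate membrane-spanning regions, whereas strongly negative stretches suggest exposed, polar segments.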
7.4 MOLECULAR VISUALIZATION
Molecular visualization helps scientists to bioengineer protein molecules. A number of software packages, both free and commercial, help in visualizing biomolecules. The most commonly used free packages are RasMol, Chime, MolMol, Protein Explorer and Kinemage. The name RasMol is derived from Raster (the array of pixels on a computer screen) and Molecules. It is a molecular graphics program intended to visualize proteins, nucleic acids and small molecules for which a 3D structure is available. In order to display a molecule, RasMol requires an atomic coordinate file that specifies the position of every atom in the molecule through its 3D Cartesian coordinates. RasMol accepts this coordinate file in a variety of formats, including the PDB format. The visualization provides the user a choice of color schemes and molecular representations. RasMol can be run outside a web browser. The home page is www.umass.edu/microbio/rasmol
RasMol and RasTop
RasMol is the program for molecular visualization. RasTop is the graphical interface to RasMol. Roger Sayle, from the Biocomputing Research Unit at the University of Edinburgh, UK, and the Biomolecular Structure Department, Glaxo Research and Development, Greenford, UK, initially developed RasMol. RasTop helps in viewing and manipulating macromolecules and micromolecules on screen. It is user friendly. Each command in the menu generates its own script, which is transferred to RasMol. RasTop is helpful in adding or subtracting atoms, groups or chains in a selection on screen with a lasso, in going back to the previous selection, in copying and pasting selections, in set operations such as inverse, extraction, summation, subtraction and exclusion, and in saving a work session in a script format called RSM script. RasTop permits opening of several molecules at the same time in the same window, and of several windows at the same time.
Procedure
A. When we want to measure the bond length, the following steps can be used:
1. Open RasMol and load a file of PDB atom coordinates (downloaded from the PDB databank).
2. Use various menu options to get a feel of the molecule.
3. Open RasTop, the molecular visualization tool.
4. From the File menu, open a PDB atom coordinate file.
5. Rotate the molecule.
6. Use the options in the menu and command line.
7. Set the display style to ball and stick.
8. Zoom the molecule to visualize the bonds better (shift + mouse down).
9. Go to the command line window.
10. Type set picking distance and press the Enter key.
11. Go to the display window and select the two atoms participating in the bond formation by clicking on them successively.
12. The bond length will appear in the command line window.
13. Note down the results.
14. If we want to show a bond and measure the bond length between two atoms, we can also use the following after going to the command line window: type set picking monitor in the command line window.
15. Click on the atom again (only once). A bond line will appear.
16. Note down the results from the command line window.
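The value reported by set picking distance is the straight-line distance between the Cartesian coordinates of the two picked atoms. The same coordinates also yield the bond angle and torsion angle measured in procedures B and C. The following sketch shows the geometry in plain Python; the function names are hypothetical, the coordinates are made-up test points rather than atoms from a real PDB file, and the torsion sign is intended to follow the usual convention in which a clockwise rotation, viewed along the central bond, is positive.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_length(p1, p2):
    """Distance between two atoms, in the units of the coordinates."""
    return math.sqrt(dot(sub(p1, p2), sub(p1, p2)))


def bond_angle(p1, p2, p3):
    """Angle in degrees at the middle atom p2."""
    u, v = sub(p1, p2), sub(p3, p2)
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def torsion_angle(p1, p2, p3, p4):
    """Signed torsion angle in degrees about the p2-p3 bond."""
    b0, b1, b2 = sub(p1, p2), sub(p3, p2), sub(p4, p3)
    length = math.sqrt(dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Remove the components of b0 and b2 that lie along the central bond.
    v = sub(b0, tuple(dot(b0, b1) * c for c in b1))
    w = sub(b2, tuple(dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


length = bond_length((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
angle = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
torsion = torsion_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                        (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

For phi and psi, the four points passed to `torsion_angle` would be the backbone atoms in the click order given in procedure C.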
B. When we want to measure the bond angle, the following steps can be used:
1. Set the display style to ‘ball and stick’.
2. Zoom the molecule (shift + mouse down).
3. Go to the command line window.
4. Type set picking angle and press the Enter key.
5. Go to the display window and select the three atoms forming the bond angle by clicking on them successively.
6. The bond angle will appear in the command line window.
7. Note down the results.
C. When we want to measure the torsion angle, the following steps can be used:
1. Set the display style to ‘ball and stick’.
2. Zoom the molecule (shift + mouse down).
3. Go to the command line window.
4. Type set picking torsion and press the Enter key.
5. Go to the display window and select the four atoms forming the phi or psi angle by clicking on them successively. [The click sequence for phi is: carbonyl C of residue (i-1), N of residue i, CA of residue i, and carbonyl C of residue i. The click sequence for psi is: N of residue i, CA of residue i, carbonyl C of residue i, and N of residue (i+1).]
6. After the successive clicks, the torsion angle will appear in the command line window.
7. Note down the results.
RasTop is available on Windows, Linux and Mac platforms. To install it, extract the RasTop folder from the RasTop.zip file into any directory. To start RasTop, double click on the RasTop icon. It will display a single main window with one empty graphic window, the color window and the command line window. To view a molecule, we have to load the correct file after choosing the correct path. Then we can click Molecule to obtain information about the molecule. In the command line, use the option Show to get information about world, atom selection, group selection, chain selection, coordinates, phi, psi, Ramprint, sequence, symmetry, etc.
In the main menu window, click the Atoms button, select Spacefill and display; then click Atoms again, select Labels and display. The RasMol ‘spacefill’ command is used to represent all of the currently selected atoms as solid spheres. This command is used to produce both union-of-spheres and ball-and-stick models of a molecule. [RasMol and RasTop accept the following command-line forms: spacefill {<boolean>}, spacefill temperature, spacefill user, spacefill [-] <value>.] To see the bonds, click Bonds, select Hbonds and display. In the 3D structure, dotted lines represent hydrogen bonds; after viewing, hide the bonds by clicking the Remove button. To display the loaded protein in ribbon form (a smooth solid ribbon surface passing along the backbone of the protein), click Ribbons and select Ribbons; this can be combined with other representations such as Strands, Cartoons, Trace and Backbone. After the display, click the Remove button. We can learn more about RasTop by exploring ‘Help RasTop’.
Chime and Protein Explorer are derivatives of RasMol that allow visualization inside web browsers. Hence, they can be used only online. Chime can be reached at www.Umass.edu/microbio/Chime
MolMol stands for Molecule analysis and Molecule display.
MolMol is a molecular graphics program for display, analysis and manipulation of three-dimensional structures of biological macromolecules, with special emphasis on NMR solution structures of proteins and nucleic acids. MolMol can be reached at www.mol.biol.ethz.ch/wuthrich/software/molmol
Kinemage (kinetic images) allows the user to move two molecules, or parts of a molecular complex, relative to each other. Molscript is a tool for making cartoons of secondary structural elements. Grasp is used for visualization of molecular surfaces. Swiss-PdbViewer produces high-quality images using ray-tracing methods. Insight II is a commercial package that also supports hardware for interactive 3D viewing.
Some websites
Compute pI/MW : http://www.expasy.ch/tools/pi_tool.html
MOWSE : http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
PeptideMass : http://www.expasy.ch/tools/peptide-mass.html
TGREASE : http://ftp.virginia.edu/pub/fasta/
SAPS : http://www.isrec.isb-sib.ch/software/SAPS_form.html
AACompIdent : http://www.expasy.ch/tools.aacomp/
AACompSim : http://www.expasy.ch/tools/aacsim/
PROPSEARCH : http://www.embl-heidelberg.de/prs.html
BLOCKS : http://blocks.fhcrc.org
Pfam : http://www.sanger.ac.uk/software/Pfam/
PRINTS : http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
ProfileScan : http://www.isrec.isb-sib-ch/software/PFSCAN-form.html
nnpredict : http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
PredictProtein : http://www.embl-heidelberg.de/predictprotein/
SOPMA : http://pbil.ibcp.fr/
Jpred : http://jura.ebi.ac.uk:8888/
PSIPRED : http://insulin.brunel.ac.uk/psipred
PREDATOR : http://www.embl-heidelberg.de/predator/predator_ifno.html
COILS : http://www.ch.embnet.org/software/COILS_form.html
PHDtopology : http://www.embl-heidelber.de/predictprotein
SignalP : http://www.cbs.dtu.dk/services/signalP/
TMpred : http://www.isrec.isb-sib.ch/ftp-server/tmpred/www/TMPRED_form.html
DALI : http://wwwz.ebi.ac.uk/dali/
SWISS-MODEL : http://www.expasy.ch/swissmod/SWISS-MODEL.html
TOPITS : http://www.embl-heidelberg.de/predictprotein/
STUDY QUESTIONS
1. What are the uses of prediction?
2. What are the strategies used in gene prediction?
3. How do we predict mRNA structure?
4. Give examples of some of the commonly used methods for gene prediction.
5. What is the necessity to predict protein structures?
6. What is a Ramachandran plot? What are its uses?
7. How do we predict secondary structure?
8. Describe the intrinsic tendency of amino acids to form β-turns.
9. What is a rotamer library?
10. Distinguish between ab initio and knowledge-based methods of prediction.
11. How is comparative modeling done?
12. What are the steps involved in comparative modeling?
13. What is threading?
14. What is energy-based prediction?
15. How is protein function prediction done?
16. Give examples of some protein prediction programs.
17. What is molecular visualization? Give some examples of programs for molecular visualization.
CHAPTER 8
Homology, Phylogeny and Evolutionary Trees
Homology specifically means descent from a common ancestor. Usually descendants of a common ancestor show similarities in several characters. Such characters are called homologous characters. Charles Darwin studied the Galapagos finches in 1835, noting the differences in the shapes of their beaks and the correlation of beak shape with diet. Finches that eat fruits have beaks like those of parrots, and finches that eat insects have narrow, prying beaks. These observations were seminal to the development of Darwin’s ideas on the evolutionary basis for the origin of species.
8.1 HOMOLOGY AND SIMILARITY
The words homology and similarity are often used interchangeably even though they are technically different. Similarity is the measurement of resemblance or difference, and it is independent of the source of the resemblance. Similarity can be observed in data that are collectable at present, and it involves no historical hypothesis. In contrast, assertions of homology require inferences about historical events, which are almost always unobservable. Similarity is quantifiable, but homology is qualitative. Sequences are said to be homologous if they are related by divergence from a common ancestor. When protein folds are similar but the sequences are different, such folds are usually considered to be analogous. The essence of sequence analysis is the detection of homologous sequences by means of routine database searches, usually with unknown or uncharacterized query sequences. Homology is not a measure of similarity, but an absolute statement that sequences have a divergent rather than a convergent relationship. In practice, sequences that share an arbitrary threshold level of similarity, determined by alignment of matching bases, are often termed homologous. Homologous sequences are inherited from a common ancestor that possessed a similar structure, although the structure of the ancestor may be difficult to determine because it has been modified through descent.
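Because similarity is quantifiable, it can be computed directly from an alignment. The sketch below calculates percent identity for two sequences that have already been aligned; excluding columns that contain a gap from the count is one common convention, adopted here as an assumption. The resulting number measures resemblance only and does not by itself establish homology.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


identity = percent_identity("ACGT-A", "ACCTTA")  # 4 matches in 5 columns
```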
Orthologs, Paralogs and Xenologs
Homologs are either orthologs, paralogs or xenologs. Homologous genes that share a common ancestry and function in the absence of any evidence of gene duplication are called orthologs. (When there is evidence for gene duplication, the genes in an evolutionary lineage derived from one of the copies and with the same function are also referred to as orthologs.) Orthologs are produced by speciation. They represent genes derived from a common ancestor that diverged due to divergence of the organisms they are associated with. They tend to have similar functions. Paralogs are produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. The two copies of duplicated genes and their progeny in the evolutionary lineage are referred to as paralogs. They tend to have different functions. Xenologs are produced by horizontal gene transfer between two organisms. In other cases, similar regions in sequences may not have a common ancestor but may have arisen independently by two evolutionary pathways converging on the same function; this is called convergent evolution.
Study of Orthologous and Paralogous Proteins
Among homologous sequences, it is useful to distinguish between proteins that perform the same function in different species (orthologs) and those that perform different but related functions within one organism (paralogs). Sequence comparison of orthologous proteins opens the way to the study of molecular paleontology. In particular cases, construction of phylogenetic trees has revealed relationships, for example, between proteins in bacteria, fungi and mammals, and between animals, insects and plants. Such inferences are unearthed only by investigation at the molecular level. The study of paralogous proteins, on the other hand, has provided deeper insights into the underlying mechanisms of evolution. Paralogous proteins arose from single genes via successive duplication events. The duplicated genes have followed separate evolutionary pathways, and new specificities have evolved through variation and adaptation. The emergence of different specificities and functions following gene duplication events may be detected by protein sequence comparison. For example, different visual receptors (opsins), which diverged from each other early in vertebrate evolution, are stimulated by different wavelengths of light. Human long-wavelength opsins (i.e. those sensitive to red and green light) are more closely related to each other (with around 95% sequence identity) than either sequence is to the short-wavelength blue opsins or to the rhodopsins (the achromatic receptors), with which they share an average of 43% identity. The complexity that arises from the richness of such paralogous and orthologous relationships presents a significant challenge for protein family classification.
Modular Proteins
Much of the challenge of sequence analysis involves the marriage of biological information with sequence data. This process is made more difficult by the problem of orthology versus paralogy. The analytical process is further complicated by the fact that sequence similarity is sometimes confined to only part of an alignment. This scenario is encountered, in particular, when we study modular proteins. Modules may be thought of as a subset of protein domains; they are autonomous folding units that are contiguous in sequence, and are frequently used as protein building blocks. As building components they may be used to confer a variety of different functions on the parent protein, either through multiple combinations of the same module, or via combinations of different modules to form mosaics. In genetic terms, the spread of modules cannot be explained simply by gene duplication and fusion events, but is thought to be the result of genetic shuffling mechanisms. Whatever the actual process, it appears that Nature behaves rather like a tinker, using a patchwork of existing components to produce a new, workable whole. Evolution, it seems, does not produce novelties from scratch, but works with old material, either transmogrifying a system to give it new functions, or combining several systems to produce a more elaborate one.
8.2 PHYLOGENY AND RELATIONSHIPS
Normally living organisms are classified into groups based on observed similarities and differences. If two organisms are very closely related to each other, it is assumed, in principle, that they share a recent common ancestor. Phylogeny is the description of biological relationships, usually expressed as a tree. Similarities and differences between organisms are used to infer phylogeny. The study of the evolutionary relationships among organisms is called phylogenetics. Phylogenetic analysis refers to the act of inferring or estimating these relationships. The evolutionary history inferred from phylogenetic analysis is usually depicted as branching, treelike diagrams that represent an estimated pedigree of the inherited relationships among molecules, organisms or both. A statement of phylogeny among various organisms assumes homology and depends on classification. Phylogeny states a topology (pattern of ancestry) of the relationships, based on classification according to similarity of one or more sets of characters, or on a model of evolutionary processes. In many cases, phylogenetic relationships based on different characters are consistent and support one another.
Evolutionary Tree
The relationships among species, populations, individuals or genes are taken in the literal sense of kinship or genealogy, that is, assignment of a scheme of descendants of a common ancestor. The results are usually presented in the form of an evolutionary tree. Such a tree, showing all descendents of a single original ancestral species, is said to be rooted. Evolutionary trees determined from genetic data are often based on inferences from the patterns of similarity (Fig. 8.1).
[Figure: tree diagram. Leaf labels: Pigeon, Kangaroo, Rabbit, Pig, Donkey, Horse, Dog, Monkey, Humans, Tuna, Rattlesnake, Turtle, Penguin, Chicken, Screwworm fly, Silkworm moth, Bread mold, Baker’s yeast, Candida. Group labels: Mammals, Insects, Vertebrates, Animals, Fungi.]
Fig. 8.1 Evolutionary tree of fungi and animals
Phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution. The evolutionary relationships among the sequences are depicted by placing the sequences as outer branches on a tree. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related. The objective of phylogenetic analysis is to discover all of the branching relationships in the tree and the branch lengths. On the basis of the analysis of nucleic acid or protein sequences, the most closely related sequences can be identified by their position as neighboring branches on a tree. When a gene family is found in an organism or group of organisms, phylogenetic relationships among the genes can help to predict which ones might have an equivalent function. When the sequences of two nucleic acid or protein molecules found in two different organisms are similar, they are likely to have been derived from a common ancestor sequence. A sequence alignment reveals which positions in
the sequences were conserved and which diverged from a common ancestor sequence. When one is quite certain that the two sequences share an evolutionary relationship, the sequences are referred to as being homologous. An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms, or evolutionary relationships among genes from separate organisms. The separate sequences are referred to as taxa, defined as phylogenetically distinct units on the tree. It is important to recognize that each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are isolated reproductively.
8.2.1 Approaches used in Phylogenetic Analyses
Phenetic (or clustering), cladistic and evolutionary systematic approaches are used in the study of phylogenetics.
Phenetic and Cladistic Approaches
In the phenetic approach, species are grouped together based on phenotypic resemblance (similarity), and all characters are taken into account. The phylogenetic relationship obtained through the phenetic approach is usually nonhistorical. In the cladistic approach, species are grouped only with those that share derived characters, that is, characters that were not present in their distant ancestors. The cladistic approach is based on genealogy. This approach is considered to be the best method for phylogenetic analysis because it accepts and employs current evolutionary theory, that is, that speciation occurs by bifurcation (cladogenesis). The cladistic approach considers possible pathways of evolution, infers the features of the ancestor at each node, and chooses an optimal tree according to some model of evolutionary change. The basic point behind cladistics is that members of a group or clade share a common evolutionary history and are more closely related to each other than to members of another group. A given group is recognized by its sharing of some unique features that were not present in distant ancestors. These shared and derived characteristics can be anything that can be observed and described. Usually cladistic analysis is performed on either multiple phenotypic characters or multiple base pairs or amino acids in a sequence. Phenetics is based on similarity; cladistics is based on genealogy. There are three basic assumptions in cladistics: (i) any group of organisms is related by descent from a common ancestor; (ii) there is a bifurcating pattern; (iii) change in characteristics occurs in lineages over time.
Clade, Taxon and Node
A clade is a monophyletic taxon. A clade is a group of organisms or genes that includes the most recent common ancestor of all of its members and all of the descendants of that most recent common ancestor. (Clade is derived from the Greek word ‘klados’, which means branch or twig.) A taxon is any named group of organisms, but not necessarily a clade. A node is a bifurcating branch point. Branch lengths correspond to divergence in some cases (Fig. 8.2).
[Figure: tree diagram. Labels: Node; Man; a clade; Chimpanzee; Rhesus monkey - another clade.]
Fig. 8.2 The relationship between 3 animals shown as a branch of a tree
Methods
Three methods (maximum parsimony, distance and maximum likelihood) are generally used to find the evolutionary tree or trees that best account for the observed variation in a group of sequences.
Maximum Parsimony Method
The maximum parsimony method (minimum evolution method) predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences. A multiple sequence alignment is required to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This analysis is continued for every position in the sequence alignment. Finally, those trees that produce the smallest number of changes overall for all sequence positions are identified. The maximum parsimony method is used to construct trees on the basis of the minimum number of mutations required to convert one sequence into another. The main programs for maximum parsimony analysis in the PHYLIP package are DNAPARS, DNAPENNY, DNACOMP, DNAMOVE and PROTPARS.
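Counting the minimum number of changes that a given tree requires can be sketched with Fitch's small-parsimony procedure. This is a swapped-in textbook algorithm used for illustration and is not claimed to be the code inside the PHYLIP programs. Trees are written here as nested pairs of taxon names, which is an assumed representation, and the four short sequences are made-up data.

```python
def fitch(tree, states):
    """Possible ancestral states and change count for one alignment column.

    tree is a taxon name (leaf) or a pair (left_subtree, right_subtree);
    states maps each taxon name to its residue in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # No state in common: one change is needed on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Total number of changes the tree needs over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "GCT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
scores = {"tree_1": parsimony_score(tree_1, alignment),
          "tree_2": parsimony_score(tree_2, alignment)}
```

Here `tree_1` needs 3 changes and `tree_2` needs 5, so `tree_1` is the more parsimonious of the two. A full analysis would repeat the count for every candidate tree topology.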
Distance Method
In distance matrix methods, all possible pairwise sequence alignments are carried out to determine the most closely related sequences, and phylogenetic trees are constructed on the basis of these distance measurements. The distance method employs the number of changes between each pair in a group of sequences to produce a phylogenetic tree of the group. The sequence pairs that have the smallest number of sequence changes between them are termed ‘neighbors’. On a tree, these sequences share a node or common ancestor position and are each joined to that node by a branch. The goal of distance methods is to identify a tree that positions the neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible. The success of distance methods depends
on the degree to which the distances among a set of sequences can be made additive on a predicted evolutionary tree. The most commonly applied distance-based methods are the unweighted pair group method with arithmetic mean (UPGMA), neighbor-joining (NJ) and methods that optimize the additivity of a distance tree, including the minimum evolution (ME) method. Distance analysis programs in PHYLIP are FITCH, KITSCH and NEIGHBOR.
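A distance-based tree construction can be sketched with UPGMA, the simplest of the methods named above. The sketch assumes pre-aligned sequences of equal length, uses the proportion of differing positions as the distance, and returns only the branching order as nested pairs (branch lengths are omitted for brevity). The sequences are made-up data and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster sequences by UPGMA and return the tree as nested pairs."""
    names = list(seqs)
    trees = dict(enumerate(names))       # cluster id -> subtree
    sizes = {i: 1 for i in trees}        # cluster id -> number of leaves
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)   # the closest pair: the 'neighbors'
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            # Size-weighted average distance from k to the merged cluster.
            dist[(k, next_id)] = ((sizes[i] * d_ik + sizes[j] * d_jk)
                                  / (sizes[i] + sizes[j]))
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


seqs = {"human": "ACGTACGT", "chimp": "ACGTACGA",
        "mouse": "ACTTAGGA", "fly": "TCTTAGCC"}
tree = upgma(seqs)
```

UPGMA assumes a constant rate of change along all branches. Neighbor-joining relaxes that assumption and is usually preferred when rates differ between lineages.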
Maximum Likelihood Method
The maximum likelihood method uses probability calculations to find a tree that accounts best for the variation in a set of sequences. This method is similar to the maximum parsimony method in that the analysis is performed on each column of a multiple sequence alignment. All possible trees are considered. For each tree, the number of sequence changes or mutations that may have occurred to give the sequence variation is considered. Because the rate of appearance of new mutations is very small, the more mutations needed to fit a tree to the data, the less likely that tree. Trees with the fewest changes will be the most likely. The maximum likelihood method incorporates an expected model of sequence changes and weighs the probability of any residue being converted into any other. PHYLIP includes two programs, DNAML and DNAMLK, for maximum likelihood analysis.
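The role of the model of sequence change can be illustrated on the smallest possible case, two sequences joined by a single branch. The sketch assumes the Jukes-Cantor model, in which every substitution is equally probable; the model choice and the function names are assumptions made for illustration and are not taken from DNAML. It evaluates the log-likelihood of a branch length d (expected substitutions per site) given the numbers of identical and differing sites, and compares it with the closed-form maximum.

```python
import math


def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of branch length d (> 0) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # a site shows the same base in both
    p_diff = 0.25 - 0.25 * decay   # a site shows one particular other base
    # The factor 0.25 is the equilibrium frequency of the starting base.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def jc_ml_distance(n_same, n_diff):
    """Branch length that maximizes the Jukes-Cantor likelihood."""
    p = n_diff / (n_same + n_diff)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Ten aligned sites, two of which differ.
d_hat = jc_ml_distance(n_same=8, n_diff=2)
best = jc_log_likelihood(d_hat, 8, 2)
```

For a full tree the same per-column probabilities are multiplied over all branches and summed over the unknown ancestral states, and the tree with the highest total likelihood is chosen.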
Criteria for Phylogenetic Analysis
For phylogenetic analysis, many different criteria can be used, such as morphological characteristics, biochemical properties and data from nucleic acid and protein sequences. Nucleic acid and protein sequence data are very useful for comparison because they provide a large and unbiased data set, which extends across all known organisms, allowing the comparison of both closely related and distantly related taxa. The relatedness between sequences is usually quantified objectively using sequence alignment algorithms. Macromolecules, especially sequences, have surpassed morphological and other organismal characters as the most popular form of data for phylogenetic or cladistic analysis.
Steps in Phylogenetic Analysis
Phylogenetic analysis consists of four steps:
(i) Alignment (both building the data model and extracting a phylogenetic dataset)
(ii) Determining the substitution model
(iii) Tree building
(iv) Tree evaluation
Bootstrap
Bootstrapping is a resampling tree evaluation method that works with distance, parsimony, likelihood and any other tree derivation method. The result of bootstrap analysis is typically a number associated with a particular branch in the phylogenetic tree that gives the proportion of bootstrap replicates that support the monophyly of the clade. Bootstrapping can be considered a two-step process comprising the generation of new data sets from the original set and the computation of a number that gives the proportion of times that a particular branch appeared in the tree. That number is commonly referred to as the bootstrap value. The bootstrap value is considered to be a measure of accuracy. Based on simulation studies it has been suggested that under favourable conditions (roughly equal rates of change, symmetric branches), bootstrap values greater than 70% correspond to a probability of greater than 95% that the true phylogeny has been found. The jackknife is another resampling technique similar to the bootstrap. The parametric bootstrap uses simulated rather than actual replicates. It can be used in conjunction with any tree building method.
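The two steps of bootstrapping can be illustrated with a minimal Python sketch. It assumes that new data sets are generated by sampling alignment columns with replacement, which is the usual non-parametric procedure. The toy alignment, the function names and the stand-in test `a_and_b_closest` (used here in place of real tree building) are all hypothetical.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def bootstrap_support(alignment, clade_found, replicates=100, seed=0):
    """Proportion of replicates in which clade_found(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        bool(clade_found(bootstrap_replicate(alignment, rng)))
        for _ in range(replicates)
    )
    return hits / replicates

def a_and_b_closest(aln):
    """Stand-in for tree building: is seqA/seqB the pair with fewest differences?"""
    def diff(x, y):
        return sum(p != q for p, q in zip(aln[x], aln[y]))
    names = sorted(aln)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    # Ties are broken in favour of the first pair, which slightly inflates support.
    return min(pairs, key=lambda pair: diff(*pair)) == ("seqA", "seqB")

# Hypothetical toy alignment (not taken from the text).
toy = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAT",
    "seqC": "ACGAACCTAT",
    "seqD": "TCGAACCTGT",
}
support = bootstrap_support(toy, a_and_b_closest, replicates=200)
print(f"seqA+seqB grouped in {support:.0%} of replicates")
```

In a real analysis the stand-in test would be replaced by building a tree from each replicate and checking whether the clade of interest appears in it.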
8.2.2 Phylogenetic Trees
Usually phylogenetic relationships are described as trees (dendrograms). The clearest way to visualize the evolutionary relationships among organisms is to use a graph. A graph is a simple diagram (abstract structure) used to show relationships between entities, such as numbers, objects or places. Entities are represented by nodes and relationships between them are shown as links or edges (connecting lines). In phylogenetic trees, nodes represent different organisms and links are used to show lines of descent. In computer science, a tree is a particular kind of graph. A path from one node to another is a consecutive set of edges beginning at one node and ending at the other. A connected graph is a graph containing at least one path between any two nodes. A tree is a connected graph in which there is exactly one path between every two nodes. A particular node may be selected as a root. Abstract trees may be rooted or unrooted. Unrooted trees show the topology of relationship but not the pattern of descent. A rooted tree in which every node has two descendants is called a binary tree. Another special kind of graph is a directed graph, in which each edge is a one-way street. Rooted phylogenetic trees are implicitly directed graphs, the ancestor-descendant relationship implying the direction of each edge (Fig. 8.3).
[Fig. 8.3: (a) rooted tree leading from a primate ancestor to human, chimpanzee and gorilla; (b) rooted tree leading from the LCA to Eukarya, Archaea and Bacteria]
Fig. 8.3 Rooted trees for (a) three great apes with an unspecified primate ancestor, and (b) the three major forms of life on this planet. Archaea were previously called archaebacteria. Bacteria were previously called eubacteria, and by Eukarya we refer to the nuclear-cytoplasmic system in eukaryotes (organelles are ignored). LCA is the last common ancestor of all life on this planet. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003)
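The graph definition of a tree given above (connected, with exactly one path between every two nodes) can be checked mechanically. The sketch below is one way to do so in Python, using the fact that a connected graph with n nodes and n - 1 edges has exactly one path between any two nodes. The node names are taken loosely from Fig. 8.3(a) and the function name is hypothetical.

```python
from collections import defaultdict

def is_tree(nodes, edges):
    """True if the undirected graph is connected with a single path per node pair."""
    nodes = set(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first walk from an arbitrary node; a tree must reach every node.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes

apes = ["primate ancestor", "human", "chimpanzee", "gorilla"]
descent = [
    ("primate ancestor", "human"),
    ("primate ancestor", "chimpanzee"),
    ("primate ancestor", "gorilla"),
]
print(is_tree(apes, descent))
# An extra link creates a second path between human and chimpanzee.
print(is_tree(apes, descent + [("human", "chimpanzee")]))
```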
It may be possible to assign numbers to the edges of a graph to signify, in some sense, a ‘distance’ between the nodes connected by the edges. The graph may then be drawn to scale, with the sizes of the edges proportional to the assigned lengths. The length of a path through the graph is the sum of the edge lengths. In phylogenetic trees, edge lengths signify either some measure of the dissimilarity between two species, or the length of time since their separation (Figure 8.4).
[Fig. 8.4: tree of the deuterostomes with the tips Echinoderms (Starfish), Urochordates (Tunicate worms), Cephalochordates (Amphioxus), Jawless fish (Lamprey, Hagfish), Cartilaginous fish (Shark), Bony fish (Zebrafish), Amphibians (Frog), Mammals (Human), Reptiles (Lizard) and Birds (Chicken)]
Fig. 8.4 Phylogenetic tree of vertebrates and our closest relatives. Chordates, including vertebrates, and echinoderms are all deuterostomes (Source: Lesk, A.M., Introduction to Bioinformatics, Oxford University Press, 2003).
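The statement that the length of a path is the sum of its edge lengths can be made concrete with a short Python sketch. The tree, the internal node names `root` and `x`, and all edge lengths below are hypothetical values chosen for illustration, not measurements.

```python
def path_length(edges, start, end):
    """Sum of edge lengths along the unique path between two nodes of a tree."""
    adjacency = {}
    for (a, b), length in edges.items():
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    # In a tree it is enough to avoid stepping straight back to the parent.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, total = stack.pop()
        if node == end:
            return total
        for nxt, length in adjacency.get(node, []):
            if nxt != parent:
                stack.append((nxt, node, total + length))
    raise ValueError("no path between the two nodes")

# Hypothetical edge lengths (arbitrary units).
edges = {
    ("root", "x"): 1.0,
    ("x", "human"): 0.5,
    ("x", "chimpanzee"): 0.5,
    ("root", "gorilla"): 1.5,
}
print(path_length(edges, "human", "chimpanzee"))
print(path_length(edges, "human", "gorilla"))
```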
Special Features of Trees
Trees have some special features:
(i) Nodes are of two types: ancestral and terminal (leaves, tips). Ancestral nodes may or may not correspond to a known species. Ancestral nodes give rise to branches. They may link to other ancestral nodes, or they may link to terminal nodes which represent known species. Terminal nodes mark the end of the evolutionary pathway.
(ii) Trees may be rooted or unrooted. When the position of the ancestor is indicated, the tree is called a rooted tree. When the position of the ancestor is not indicated, it is called an unrooted tree.
(iii) Each tree is binary. Evolution of species is represented as a series of bifurcations.
(iv) The length of the branches may or may not be significant.
8.2.3 Tree-building Methods
Tree-building methods can be grouped into distance-based and character-based methods.
Distance-based Methods
Distance-based methods compute pairwise distances according to some measure and then discard the actual data, using only the fixed distances to derive trees. Character-based methods, by contrast, derive trees that optimize the distribution of the actual data patterns for each character; there the pairwise distances are not fixed but are determined by the tree topology. Distance-based methods use the amount of dissimilarity (distance) between two aligned sequences to derive trees. A distance method would reconstruct the true tree if all genetic divergence events were accurately recorded in the sequence. However, divergence encounters an upper limit as sequences become mutationally saturated. The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is a clustering or phenetic algorithm. It joins tree branches based on the criterion of greatest similarity among pairs and averages of joined pairs. The Neighbor Joining (NJ) algorithm is commonly applied with distance tree building, regardless of the optimization criterion. The Fitch-Margoliash (FM) method seeks to maximize the fit of the observed pairwise distances to a tree by minimizing the squared deviation of all possible observed distances relative to all possible path lengths on the tree. The Minimum Evolution (ME) method seeks to find the shortest tree that is consistent with the path lengths measured in a manner similar to FM.
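The fitting criterion described for the FM method, the squared deviation between observed distances and path lengths on a tree, can be written down directly. The sketch below is a simplified illustration, not the algorithm of any particular program. The `power` parameter (dividing each term by a power of the observed distance) is an assumed weighting option, the observed distances are those of taxa c, d and e in Fig. 8.5(b), and the two sets of path lengths are hypothetical.

```python
def squared_deviation(observed, tree_paths, power=0):
    """Sum of squared differences between observed distances and tree path lengths.

    power=0 leaves every pair unweighted; power=2 divides each term by the
    squared observed distance (a weighting choice assumed for illustration).
    """
    return sum(
        (d - tree_paths[pair]) ** 2 / d ** power
        for pair, d in observed.items()
    )

# Observed distances for taxa c, d and e from the distance table in Fig. 8.5(b).
observed = {("c", "d"): 2, ("c", "e"): 6, ("d", "e"): 6}

# Hypothetical path lengths implied by two candidate sets of branch lengths.
candidate_1 = {("c", "d"): 2, ("c", "e"): 6, ("d", "e"): 6}
candidate_2 = {("c", "d"): 3, ("c", "e"): 5, ("d", "e"): 6}

fit_1 = squared_deviation(observed, candidate_1)
fit_2 = squared_deviation(observed, candidate_2)
print(fit_1, fit_2)
```

The candidate with the smaller deviation reproduces the original distance data more closely and would be preferred under this criterion.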
Character-based Methods
The character-based methods use character data at all steps in the analysis. This allows the assessment of the reliability of each base position in an
alignment on the basis of all other base positions. The principle of the maximum parsimony (MP) method is to search for a tree that requires the smallest number of changes to explain the differences observed among the taxa under study. The MP method defines an optimal tree as the one that postulates the fewest mutations. The principle of the maximum likelihood (ML) method is to assume that changes between all nucleotides (or amino acids) are equally probable, leading to reconstructions of likelihoods. The ML method assigns quantitative probabilities to mutational events rather than merely counting them. For each possible tree topology, the assumed substitution rates are varied to find the parameters that give the highest likelihood of producing the observed sequences. The optimal tree is the one with the highest likelihood of generating the observed data.
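The parsimony principle, preferring the tree that requires the fewest changes, can be illustrated by counting the minimum number of changes each candidate tree needs. The Python sketch below uses Fitch's small-parsimony counting procedure, which the text does not name and which is introduced here as an assumption. The taxa W to Z, their sequences and the two candidate trees are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree: nested 2-tuples with taxon names as leaves; states: taxon -> state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        if left & right:
            return left & right
        changes += 1          # the two subtrees disagree: one change is needed
        return left | right

    visit(tree)
    return changes

def parsimony_score(tree, alignment):
    """Total number of changes over all aligned sites."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )

# Hypothetical four-taxon alignment and two candidate topologies.
alignment = {"W": "AAGT", "X": "AAGC", "Y": "CTGC", "Z": "CTGT"}
tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

Under the MP criterion the tree with the lower score would be taken as the better explanation of the data; a full analysis would score every possible topology rather than just two.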
Models
Phylogenetic tree-building methods presume particular evolutionary models. Models inherent in phylogenetic methods have some important assumptions:
1. The sequence is correct and originates from a specified source.
2. The sequences are homologous (i.e. all are descended in some way from a shared ancestral sequence).
3. Each position in a sequence alignment is homologous with every other in that alignment.
4. Each of the multiple sequences included in a common analysis has a common phylogenetic history with the others (e.g. there are no mixtures of nuclear and organellar sequences).
5. The sampling of taxa is adequate to resolve the problem of interest.
6. Sequence variation among the samples is representative of the broader group of interest.
7. The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem of interest.
Similarity Table and Distance Table
Phylogenetic trees can be constructed from either similarity tables or distance tables, which show the resemblance among organisms for a given set of characters (Fig. 8.5). Usually the numbers in a similarity table show the percentage of matches. Such data form the basis of Adansonian analysis or numerical taxonomy. The numbers in the distance table show the percentage of differences. Some of the most commonly used methods for tree building in phylogenetic analysis involve agglomerative hierarchical clustering based on distance matrices. The essential basis of this type of algorithm is that the two closest taxa (or clusters of taxa) in the distance table are merged at each step until only one cluster remains. There are other distance matrix algorithms such as single linkage, complete linkage, average linkage and the centroid method.
(a)
       a     b     c     d     e
a    100    65    50    50    50
b     65   100    50    50    50
c     50    50   100    97    65
d     50    50    97   100    65
e     50    50    65    65   100

(b)
       a     b     c     d     e
a      0     6    11    11    11
b      6     0    11    11    11
c     11    11     0     2     6
d     11    11     2     0     6
e     11    11     6     6     0
Fig. 8.5 Hypothetical (a) similarity table and (b) distance table for five organisms, a-e. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003).
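Agglomerative clustering of the distance table in Fig. 8.5(b) can be sketched as follows. The code is a minimal illustration of average-linkage (UPGMA-style) merging, written for clarity rather than speed. The function name and the nested-tuple representation of the tree are choices made here, not part of any published program.

```python
from itertools import combinations

def upgma(labels, dist):
    """Repeatedly merge the two closest clusters, averaging their distances.

    Returns the tree as nested tuples and the distance at which each merge occurred.
    """
    size = {name: 1 for name in labels}
    d = {frozenset(pair): value for pair, value in dist.items()}
    merge_distances = []
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merge_distances.append(d[frozenset((a, b))])
        n_a, n_b = size.pop(a), size.pop(b)
        merged = (a, b)
        for other in size:
            # Size-weighted average: every original taxon counts equally.
            d[frozenset((merged, other))] = (
                n_a * d[frozenset((a, other))] + n_b * d[frozenset((b, other))]
            ) / (n_a + n_b)
        size[merged] = n_a + n_b
    (tree,) = size
    return tree, merge_distances

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 6, 11, 11, 11],
    [6, 0, 11, 11, 11],
    [11, 11, 0, 2, 6],
    [11, 11, 2, 0, 6],
    [11, 11, 6, 6, 0],
]
dist = {
    (labels[i], labels[j]): rows[i][j]
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
}
tree, merge_distances = upgma(labels, dist)
print(tree)
print(merge_distances)
```

For this table, c and d are merged first (distance 2), a joins b and e joins the c-d cluster at distance 6, and the two resulting groups are joined last at distance 11.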
Aligning According to Sequence and Structure
As more genomes are sequenced, we are interested in learning more about protein and gene evolution. Studies of protein and gene evolution involve the comparison of homologs, i.e., sequences that have a common origin but may or may not have a common activity. The simple principle behind the phylogenetic analysis of sequences is that the greater the similarity between two sequences, the fewer mutations are required to convert one sequence into the other, and thus the more recently they shared a common ancestor. Phylogenetic sequence data usually consist of multiple sequence alignments. The individual aligned base positions are commonly referred to as sites. These sites are equivalent to characters in theoretical phylogenetic discussions, and the actual base (or gap) occupying a site is the character state. Aligned sequence positions subjected to phylogenetic analysis represent a priori phylogenetic conclusions because the sites themselves (not the actual bases) are effectively assumed to be genealogically related or homologous. Steps in building the alignment include selection of the alignment procedure and extraction of a phylogenetic data set from the alignment. A typical alignment procedure involves the application of a program such as CLUSTAL W, followed by manual alignment editing and submission to a tree-building program. Aligning according to secondary or tertiary structure is considered phylogenetically more reliable than sequence-based alignment because confidence in homology assessment is greater when comparisons are made to complex structures rather than to simple characters (primary sequence).
Multiple Sequence Alignment and Phylogenetic Tree Construction Using ClustalX
The Clustal series of programs is widely used in molecular biology for the multiple sequence alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The first Clustal program was written by Des Higgins in 1988. It was designed specifically to work efficiently on personal computers. It has now given rise to a number of developments, including ClustalX.
ClustalX is a windows interface for ClustalW. It provides an integrated environment for performing multiple sequence and profile alignments and analyzing the results. The program displays the multiple alignment in a scrollable window and all parameters are available using pull-down menus. Within alignments, conserved columns are highlighted using a customizable color scheme, and quality analysis tools are available to highlight potentially misaligned regions. ClustalX is easy to install and user-friendly. It maintains the portability of the previous generations through the NCBI Vibrant toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options are available, such as the realignment of selected sequences or selected blocks of the alignment and the possibility of building up difficult alignments piecemeal. It includes other features such as NEXUS and FASTA format output, printing range numbers and faster tree calculation. The accuracy of the results, robustness, portability and user-friendliness of the program are attractive features. ClustalX can be downloaded from PCBLAB Bioinformatics links using the following URL address: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html. After it is downloaded, click on the ClustalX package, double click the ClustalX folder, open the blue navigation menu and click the menu ClustalX. It will appear on the window. ClustalX is available for a number of platforms such as SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on DECStations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
Procedure
The following steps can be followed to align sequences and construct the phylogenetic tree using ClustalX:
1. Open ClustalX.
2. Load the sequences saved in FASTA format (Entrez session) using the File menu: click the yellow ClustalX logo, then click File > Load Sequences and press Enter. A dialogue box will appear. Give the correct path, open the sequence file and press Enter.
3. Scroll through the sequences to view the matches before alignment.
4. Go to the Alignment menu and click Do Complete Alignment.
5. Save the alignment files (*.dnd and *.aln).
6. Scroll again and view the matches, noting the symbol codes and the histogram.
7. Go to the Trees menu and select Draw N-J Tree. This creates a tree file with the .ph extension, which opens with NJ Plot.
8. Save the resultant tree file (*.ph).
9. Close ClustalX.
10. Open NJ Plot.
11. Open the tree constructed using ClustalX (*.ph).
12. Observe the phylogenetic relationship between the sequences.
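The tree file saved in step 8 is plain text, so its contents can also be inspected without NJ Plot. The sketch below assumes the .ph file is written in the Newick (PHYLIP) bracket notation with unquoted names; the sample tree, its branch lengths and the function name are hypothetical.

```python
import re

# A leaf label follows "(" or "," and may be followed by ":" and a branch length.
LEAF = re.compile(r"[(,]([^(),:;]+)(?::([-+0-9.eE]+))?")

def read_leaves(newick):
    """Map each leaf label to its branch length (None if lengths are absent)."""
    text = "".join(newick.split())  # tree files are often wrapped over several lines
    return {
        name: float(length) if length else None
        for name, length in LEAF.findall(text)
    }

# Hypothetical tree text laid out over several lines.
sample = """(
human:0.012,
chimp:0.010,
gorilla:0.021);"""

leaves = read_leaves(sample)
for name, length in leaves.items():
    print(name, length)
```

To read a saved file, the text can be obtained with, for example, `open("alignment.ph").read()`, where the file name is a placeholder. Quoted labels are not handled by this simple pattern.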
8.3 MOLECULAR APPROACHES TO PHYLOGENY
Molecular approaches to phylogeny developed against a background of traditional taxonomy. Many molecular properties have been used for phylogenetic studies. In 1967, based on immunological data, V.M. Sarich and A.C. Wilson announced that the divergence of humans from chimpanzees took place 5 million years ago (Fig. 8.6). This was in contrast to paleontologists, who dated the split at 15 million years ago. In 1909, E.T. Reichert and A.P. Brown published a phylogenetic analysis of fishes based on hemoglobin crystals.
[Fig. 8.6: upper tree with human β, horse β and chimp α; lower tree with human β, chimp β, horse β, human α, chimp α and horse α]
Fig. 8.6 Two trees generated from hemoglobin sequences from human, chimpanzee and horse. The lower tree is correct, indicating the correct phylogeny for both α and β hemoglobin chains. The upper tree is confusing because it is formed from human and horse β chains and the chimpanzee α chain, creating the impression that horse is closer to human than chimpanzee is (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003)
Today, DNA sequences provide the best measures of similarities among species for phylogenetic analysis. The data are digital. It is even possible to distinguish selective from non-selective genetic change, using the third position in codons, untranslated regions such as pseudogenes, or the ratio of synonymous to non-synonymous codon substitutions. Many genes are
available for comparison. Given a set of species to be studied, it is necessary to find genes that vary at an appropriate rate. Genes that remain almost constant among the species of interest provide no discrimination of degrees of similarity. Genes that vary too much cannot be aligned. Molecular phylogenies are more informative than those based on traditional or morphological characters because they are wider in scope (it is possible to compare flowering plants and mammals using protein sequences, but not using morphological characters) and because data handling is consistent and objective.
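The distinction between synonymous and non-synonymous codon substitutions can be illustrated by translating aligned codons and comparing the encoded amino acids. The Python sketch below uses the standard genetic code and simply counts differing codons of each kind. It is a crude tally rather than a substitution-rate estimate, it does not handle gaps or ambiguous bases, and the two sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with the third base changing fastest.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def classify_codon_changes(seq_a, seq_b):
    """Count differing codons that keep (synonymous) or change the amino acid."""
    synonymous = nonsynonymous = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous

# Hypothetical aligned coding sequences (Met-Ala-Lys-Gly vs Met-Ala-Arg-Gly).
seq_a = "ATGGCTAAAGGT"
seq_b = "ATGGCCAGAGGA"
print(classify_codon_changes(seq_a, seq_b))
```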
Macromolecular Sequences
Different macromolecular sequences evolve at different rates, even sequences in different regions of the same molecule. Residues in an RNA or protein that have a critical structural or functional role in the molecule can accommodate mutations less easily than those in other regions. The rate at which a particular sequence evolves depends largely on the proportion of residues whose substitution would adversely affect normal structure and function.
Mitochondrial DNA
A useful macromolecular sequence for the study of primates is mitochondrial DNA (mtDNA). As a consequence of respiratory metabolism, there is a higher concentration of active oxygen species (such as superoxide and the hydroxyl radical) in the mitochondria than in the nucleus and consequently a higher chance of oxidative chemical lesions in mitochondrial DNA. Further, the mtDNA polymerase is more error-prone than the nuclear enzyme. Therefore, mtDNA evolves more quickly than nuclear DNA due to an increased intrinsic mutation rate. There is a short noncoding region in primate mtDNA where selective constraints are low, since point mutations tend not to affect mitochondrial function. This particular sequence evolves at a suitable rate to study primate phylogeny. The tree in Fig. 8.3 is consistent with the alignment and clustering of this region, and with such analyses of coding genes in mtDNA.
Ribosomal RNA
Ribosomal RNA (rRNA) is a highly conserved, ubiquitous molecule in all living organisms (animals, plants, fungi, bacteria, parasites, etc.). It has a low tolerance for mutations and evolves very slowly. The abundant secondary structure of rRNA ensures that the rate of evolutionary change is slow, since compensating base changes are required in double helical regions. The tree in Figure 8.7 is consistent with the alignment and clustering of this molecule and the conclusions are compatible with those of other macromolecular studies.
[Fig. 8.7: tree with three domains. Bacteria: Green non-sulphur bacteria, Gram-positive bacteria, Purple bacteria, Cyanobacteria, Flavobacteria, Thermotoga, Aquifex. Archaea: Extreme halophiles, Methanobacterium, Methanococcus, Thermoplasma, Pyrodictium, Thermococcus, Thermoproteus. Eukarya: Animals, Slime molds, Entamoebae, Fungi, Plants, Ciliates, Flagellates, Trichomonads, Diplomonads]
Fig. 8.7 Major divisions of living things, derived by C. Woese on the basis of 15S RNA sequences (Source: Lesk, A.M., Introduction to Bioinformatics, Oxford University Press)
8.4 PHYLOGENETIC ANALYSIS DATABASES
PAUP (Phylogenetic Analysis Using Parsimony) and PHYLIP (Phylogenetic Inference Package) are versatile programs for phylogenetic analysis. PAUP provides a phylogenetic program that includes as many functions (including tree graphics) as possible in a single, platform-independent program with a menu interface. PHYLIP consists of about 30 programs that cover most aspects of phylogenetic analysis. It is a command-line package; it does not have a point-and-click interface, but its interface is straightforward. PALI (Phylogeny and ALIgnment of homologous protein structures) is a database containing 3D structure-based sequence alignments and structure-based phylogenetic trees of homologous protein domains in protein families. Two types of dendrograms are used to represent the relationships: one is based on a structural dissimilarity metric defined for pairwise alignment (structure based) and the other is based on similarity of topologically equivalent residues (sequence based). SUPFAM is a database of potential superfamily relationships derived by comparing sequence-based and structure-based families. PASS2 is a semi-automated database of Protein Alignments organized as Structural Superfamilies.
STUDY QUESTIONS
1. What was the observation of Charles Darwin on the Galapagos finches?
2. How do you distinguish homology and similarity?
3. How do you distinguish ortholog, paralog and xenolog?
4. What are modules?
5. What is phylogeny?
6. What is the phenetic approach?
7. What are the special features of cladistics?
8. What is a node?
9. What is a phylogenetic tree?
10. What are rooted and unrooted trees?
11. What are the special features of a phylogenetic tree?
12. What are the presumptions of phylogenetic tree-building?
13. What are the different methods used in phylogenetics?
14. How is molecular phylogenetics superior to traditional phylogenetics?
15. What are the databases used in phylogenetic analysis?
16. What is bootstrapping?
CHAPTER 9
Drug Discovery and Pharmainformatics
A drug is a molecule that interacts with a target biological molecule in the body and through such interaction triggers a physiological effect. The target molecules are usually proteins. Drugs can be beneficial or harmful depending on their effect. The aim of the pharmaceutical industry is to discover drugs with specific beneficial effects to treat diseases, especially in humans. To qualify as a drug, a chemical compound should have the following characteristics: it should be safe, effective, stable (both chemically and metabolically), deliverable (it should be absorbed and make its way to its site of action), available (by isolation from natural sources or by synthesis) and novel (patentable).
9.1 DISCOVERING A DRUG
A drug can be discovered by two methods: the empirical and the rational. The empirical method is a blind hit-or-miss method; it is also called the black box method. Thousands of chemical compounds are tested on the disease without even knowing the target on which the drug acts or the mechanism of action. Occasionally a serendipitous discovery, like the discovery of penicillin, may come up.
Approaches
In the empirical approach, thousands of chemical compounds are usually tested for drug action. One out of 10,000 may hit the target. In this type of approach, no one knows initially which target the drug attacks or the mechanism involved in the attack. The rational approach starts from clear knowledge of the target as well as the mechanism by which it is to be attacked. Drug discovery involves finding the target and arriving at the lead. Target refers to the causal agent of the disease and lead refers to the active molecule which will interact with the causal agent. When diseases are treated with drugs, the drugs interact with targets that contribute to the disease and try to control their contribution, thus producing positive effects. The disease target may be endogenous (a protein synthesized by the individual to whom the drug is administered) or, in the case of infectious diseases, may be produced by a pathogenic organism. Drugs act either by stimulating or blocking the activity of the target protein.
9.1.1 Target Identification and Validation
Developing a drug is not easy. It is a complex, lengthy and expensive process. Drug development begins with the identification of a potentially suitable disease target. This process is called target identification. One has to study what is known about the disease: its possible causes, its symptoms, its genetics, its epidemiology, its relationship to other diseases (human and animal) and all known treatments. The biology of the disease (cause of illness, the spread of the disease in the population, the development of the disease inside the patient, the biochemical and physiological changes in the patients, etc.) has to be ascertained. In the past, target identification was based largely on medical need. Presently, target identification depends not only on medical need but also on factors such as the success of existing therapies, the activity of competing drug companies and commercial opportunities.
Types of Targets
The targets for drugs are usually biomolecules, such as enzymes, receptors or ion channels. The validity of an enzyme as a target depends upon how important it is for the survival of the pathogen. If it is less significant, then the target has no value. If the drug target is located inside the human system, the fluctuation of the target activity must correspond to the fluctuation of the disease severity. Only when we are able to establish a high level of significance in the regulation of the target for effective disease control will the target have relevance to the disease. Once the target is confirmed, we can identify the modulators of the target. There are positive modulators and negative modulators (Table 9.1).

Table 9.1: List of positive and negative modulators
Biomolecules    Positive modulators    Negative modulators
Enzymes         Activators             Inhibitors
Receptors       Agonists               Antagonists
Ion channels    Openers                Blockers
Validation
Once the target is identified, it has to be validated. This process is called target validation. It involves extensive testing of the target molecule's therapeutic potential. Validation may include the creation of animal disease models, and the analysis of gene and protein expression data. By comparing the levels of gene expression in normal and disease states, novel drug targets can be identified in silico. Microarray techniques can be used for this.
Once a gene which is 'up or down regulated' (expressed at a higher or lower level than in normal tissue) in a disease state is identified, its nature can be determined using bioinformatic tools. Similar genes or proteins can be traced in the sequence databases using BLAST. Similar genes and proteins will help to deduce the function of the up or down regulated gene. If the target belongs to a highly tractable structural class (such as receptors, enzymes or ion channels), drug design will be easier. A valid target must have a high therapeutic index, that is, a significant therapeutic gain must be predicted through the use of such a drug. If a known protein is the target, binding can be measured directly. A potential antibacterial drug can be tested by its effect on growth of the pathogen. Some compounds might be tested for effects on eukaryotic cells grown in tissue culture. If a laboratory animal is susceptible to the disease, compounds can be tested on animals.
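The comparison of expression levels between normal and disease states can be sketched as a simple fold-change calculation. The example below is only an illustration: the gene names and expression values are hypothetical, and the cut-off of a two-fold change (log2 fold change of 1) is an assumed threshold, not one prescribed by the text.

```python
import math

def regulated_genes(normal, disease, threshold=1.0):
    """Label genes as up- or down-regulated by log2 fold change (disease vs normal)."""
    calls = {}
    for gene in normal:
        log2_fc = math.log2(disease[gene] / normal[gene])
        if log2_fc >= threshold:
            calls[gene] = "up"
        elif log2_fc <= -threshold:
            calls[gene] = "down"
    return calls

# Hypothetical expression levels (arbitrary units) for three genes.
normal = {"geneA": 100.0, "geneB": 200.0, "geneC": 150.0}
disease = {"geneA": 400.0, "geneB": 50.0, "geneC": 160.0}
print(regulated_genes(normal, disease))
```

Genes flagged in this way are the candidates whose sequences would then be taken forward for similarity searches and functional annotation.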
Characters
If the target happens to be an enzyme, the following characters are studied: the active site, the amino acids involved in the formation of the active site, the presence or absence of a metal component, the number of hydrogen donors and acceptors present in the active site, the topology of the active site, and the details of the hydrophobic and hydrophilic amino acids present in the active site. If the target happens to be a biochemical substance or a substrate of an enzyme, the following details are collected: size of the molecule, chemical nature, groups that show hydrogen donor or acceptor capacity, its metabolic byproducts and how this compound can be modified chemically.
9.1.2 Identifying the Lead Compound
Once a target has been validated, the search begins for drugs that interact with the target. This process is called lead discovery, and involves the search for lead compounds, that is, substances with some of the desired biological activity of the ideal drug.
Qualities
A lead molecule should have the following desirable qualities: (a) potency (able to modulate the target effectively), (b) solubility (it should be easily soluble in water for quicker action), (c) mild lipophilicity (ability to penetrate the plasma membrane), (d) metabolic stability (it should not get destroyed quickly inside the body; a longer shelf life is desirable), (e) bioavailability (quicker absorption into the body and at the same time retention for a longer time for sustained activity), (f) specific protein binding, (g) low or no toxicity.
Finding Compounds
Lead compounds can be found in some of the following ways:
(i) Serendipity: through chance observations (discovery of penicillin by Alexander Fleming).
(ii) Survey of natural sources: from traditional medicines (quinine from Cinchona bark).
(iii) Study of what is known about substrates, ligands or inhibitors and the mechanism of action of the target protein, and selection of potentially active compounds from these properties.
(iv) Trying drugs effective against similar diseases.
(v) Large-scale screening of related compounds.
(vi) Occasionally from side effects of existing drugs.
(vii) Screening of thousands of compounds.
(viii) Computer screening and ab initio computer design.
9.1.3 Optimization of Lead Compound
Once a lead compound is found, it must be optimized. Lead optimization involves the modification of lead compounds to produce derivatives, called candidate drugs, with better therapeutic profiles. For example, deliverability of a drug to a target within the body requires the capacity to be absorbed and transmitted. It requires metabolic stability. It requires the proper solubility profile: a drug must be sufficiently water-soluble to be absorbed, but not so soluble that it is excreted immediately; it must be sufficiently lipid-soluble to get across membranes, but not so lipid-soluble that it is merely taken up by fat stores. Once this is done, the candidate drugs are assessed for quality, taking into account factors such as the ease of synthesis and formulation. After this, they are registered as investigational new drugs and submitted for clinical trials. This is the lengthiest and most expensive part of the drug development process. Because of this, most projects are abandoned before this stage. Clinical trials are designed to determine safety and tolerance levels in humans, and to discover how the drug is metabolized. Trials are divided into several stages.
Stages
Pre-clinical phase: Studies using animals
Phase I: Normal (healthy) human volunteers
Phase II: Evaluation of safety and efficacy in patients, and selection of dose regimen
Phase III: Large patient number study with placebo or comparator; at this stage regulatory approval is sought and a commercial launch decision is taken
Phase IV: Long-term monitoring for adverse reactions reported by pharmacists and doctors.
Other Inputs
Drug development has benefited much from genomics, proteomics, combinatorial chemistry and high-throughput screening. Genomics and proteomics have revolutionized the way target molecules are identified and validated. Traditionally, drug targets have been characterized on an individual basis and lead compounds have been sought with specific clinical effects. With the advent of genomics, particularly the availability of the entire human genome sequence and its annotations, thousands of potential new targets can now be identified by sequence, structure and function. Bioinformatics is important not only because of its role in the analysis of sequences and structures, but also in the development of algorithms for the modeling of target protein interactions with drug molecules. This allows rational drug design, in which protein structural data are used to predict the type of ligands that will interact with a given target, and thus form the basis of lead discovery. Of late, systematic methods have been used to identify lead compounds. These methods are based on high-throughput screening, in which lead discovery is accelerated through the use of highly parallel assay formats, such as 96-well plates. In turn, this requires the assembly of large chemical libraries for testing. This has been made possible by combinatorial chemistry approaches, in which large numbers of different compounds can be made by pooling and dividing materials between reaction steps.
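The way combinatorial chemistry multiplies a few building blocks into a large library can be shown with a short enumeration. In the sketch below, the building-block labels and the three reaction steps are hypothetical placeholders; the point is only that the library size is the product of the number of choices at each step.

```python
from itertools import product
from math import prod

def enumerate_library(building_blocks):
    """List every compound obtainable by choosing one building block per step."""
    return ["-".join(combo) for combo in product(*building_blocks)]

# Hypothetical building blocks for three successive reaction steps.
steps = [
    ["A1", "A2", "A3"],
    ["B1", "B2", "B3", "B4"],
    ["C1", "C2"],
]
library = enumerate_library(steps)
print(len(library), "compounds, e.g.", library[:3])
# The size equals the product of the choices at each step: 3 x 4 x 2.
print(prod(len(step) for step in steps))
```

With 20 building blocks at each of three steps, the same arithmetic gives 8,000 compounds, which is why such libraries need to be catalogued and tracked computationally.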
9.2 PHARMAINFORMATICS
The term pharmainformatics is often used to describe the mix of biology, chemistry, mathematics and information technology required for data processing and analysis in the pharmaceutical industry. The scope of pharmainformatics is summarized in Table 9.2.

Table 9.2: Areas of biology and chemistry where informatics plays a vital role in the drug discovery pipeline.

| Application | Role of Bioinformatics |
|---|---|
| Biology | |
| Genomics, proteomics (human genome project): characterization of human genes and proteins | Target identification and validation in the human genome; cataloguing single nucleotide polymorphisms and their association with drug response patterns (pharmacogenomics) |
| Genomics, proteomics (human pathogen genome projects): characterization of the genes and proteins of organisms that are pathogenic to humans | Target identification and validation in pathogens |
| Functional genomics (protein structure): analysis of protein structures (human and their pathogens) | Prediction of drug/target interactions; rational drug design |
| Functional genomics (expression profiling): determining gene expression patterns in disease and health | Gene classification based on drug responses; pathway reconstruction |
| Functional genomics (genome-wide mutagenesis): determining the mutant phenotypes for all genes in the genome | Databases of animal models; target identification and validation |
| Functional genomics (protein interactions): determining interactions among all proteins | Characterization of protein interactions; reconstruction of pathways; prediction of binding sites |
| Chemistry | |
| High-throughput screening: highly parallel assay formats for lead identification | Storing, tracking and analyzing data |
| Combinatorial chemistry: synthesis of large numbers of chemical compounds | Cataloguing chemical libraries; assessing library quality and diversity; predicting drug/target interactions |
9.2.1 Chemical Libraries and Search Programs
High-throughput screening in drug discovery depends on the availability of diverse chemical libraries, such as those generated by combinatorial chemistry, since these maximize the chances of finding molecules that interact with a particular target protein. It is not easy to quantify chemical diversity. Attempts have been made to understand it using the concept of 'chemical space'. In essence, chemical space encompasses molecules with all possible chemical properties in all possible molecular positions. A diverse library would have broad coverage of chemical space, leaving no gaps and having no clusters of similar molecules.
Tanimoto Coefficient
Library diversity is usually quantified using measures that compare the properties of different molecules based on descriptors such as atomic position, charge and potential to form different types of chemical bond. We can compare two molecules using the Tanimoto coefficient (Tc), which evaluates the similarity of fragments of each molecule. The coefficient is calculated by the formula Tc = c/(a + b - c), where a is the number of fragment-based descriptors in compound A, b is the number of fragment-based descriptors in compound B, and c is the number of shared fragment-based descriptors. Hence, for identical molecules, Tc = 1, while for molecules with no descriptors in common, Tc = 0. In a chemical library of ideal diversity, most pairwise comparisons would generate a Tanimoto coefficient close to zero.
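As a minimal sketch of the formula Tc = c/(a + b - c), the code below treats each molecule as a set of fragment descriptors. The descriptor strings and the three example molecules are invented for illustration, and the mean pairwise value is just one simple way of summarizing library diversity, not a method prescribed here.

```python
from itertools import combinations

def tanimoto(desc_a, desc_b):
    """Tc = c / (a + b - c) for two sets of fragment descriptors."""
    a, b = len(desc_a), len(desc_b)
    c = len(desc_a & desc_b)          # descriptors shared by both molecules
    if a + b - c == 0:
        return 0.0                    # convention assumed here for two empty sets
    return c / (a + b - c)

def mean_pairwise_tanimoto(library):
    """Average Tc over all pairs; values near zero suggest a diverse library."""
    pairs = list(combinations(library, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(tanimoto(x, y) for x, y in pairs) / len(pairs)

# Hypothetical descriptor sets for three molecules.
mol_a = {"C=O", "OH", "benzene", "NH2"}
mol_b = {"C=O", "OH", "CH3"}
mol_c = {"Cl", "SH"}

print(tanimoto(mol_a, mol_b))   # a = 4, b = 3, c = 2, so Tc = 2 / 5
print(tanimoto(mol_a, mol_a))   # identical molecules
print(tanimoto(mol_a, mol_c))   # no shared descriptors
print(mean_pairwise_tanimoto([mol_a, mol_b, mol_c]))
```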
Pharmacophore
When we do not know much about the binding specificity of the target protein, diverse libraries are useful for lead discovery. When some sequence or structural information is available for the target, it can be used to design focused libraries that concentrate on one region of chemical space. For example, if the sequence of a particular target protein is known, then database homology searching will often find a related protein whose structure has been solved and whose interactions with small molecules have been characterized. In these cases, it is possible to design a chemical library based on a particular molecular scaffold, which preserves a framework of sites present in a known ligand but can be modified with diverse functional groups. Some of these groups may have previously been shown to be important for drug binding. Such sites are known as pharmacophores.
Tools
Many tools and resources are available for the design of combinatorial libraries and the assessment of chemical diversity. A program called Selector, available from Tripos, allows the user to design very diverse libraries or libraries focused on a particular molecular skeleton. Chem-X, developed by the Oxford Molecular Group, allows the chemical diversity in a collection of compounds to be measured and identifies all the pharmacophores. CombiLibMaker, another Tripos program, allows a virtual library to be generated.
9.3 SEARCH PROGRAMS
Before starting laboratory-based screening experiments, it is always better to generate as much information as possible about potential drug/target interactions. The computational screening of chemical databases, using a target molecule of known structure, is one way in which such information can be obtained. Alternatively, the solved structure of a close homologue may be used, or the structure may be predicted using a threading algorithm. If the structure of a target protein is known, algorithms can be used to identify potential interacting ligands based on goodness of fit, thus allowing rational drug design. Many docking algorithms have already been developed which attempt to fit small molecules into binding sites using information on steric constraints and bond energies (Table 9.3).
Table 9.3: Chemical docking software freely available over the internet

| URL | R/F | Description | Availability |
|---|---|---|---|
| http://www.scripts.edu/pub/olson-web/dock.autodock/index.html | F | AutoDock | Download for UNIX/LINUX |
| http://swift.embl-heidelberg.de/lignin/ | R | LIGIN, a robust ligand-protein interaction prediction program, limited to small ligands | Download for UNIX or as a part of the WHATIF package |
| http://www.bmm.icnet.uk/docking/ | R | FTDock and the associated programs RPScore and MultiDock, which can deal with protein-protein interactions. Relies on a Fourier transform library | Download for UNIX/LINUX |
| http://reco3.musc.edu/gramm/ | R | GRAMM (Global Range Molecular Matching), an empirical method based on tables of inter-bond angles. GRAMM has the merit of coping with low-quality structures | Download for UNIX or Windows |
| http://cartan.gmd.de/flexbin/FlexX | F | FlexX, which calculates favorable molecular complexes consisting of the ligand bound to the active site of the protein, and ranks the output | Apply on-line for FlexX Workspace on the server |

Note: R means Rigid; F means Flexible; they indicate whether the program regards the ligand as a rigid or flexible molecule.
Docking Algorithms
One of the most established docking algorithms is AutoDock. Other widely used programs are DOCK and CombiDOCK. In DOCK, the arrangement of atoms at the binding site is converted into a set of spheres called site points. The distances between the spheres are used to calculate the exact dimensions of the binding site, and this is compared to a database of chemical compounds. Matches between the binding site and a potential ligand are given a confidence score, and ligands are then ranked according to their total scores. In CombiDOCK, each potential ligand is considered as a scaffold decorated with functional groups. Only spheres on the scaffold are initially used in the docking prediction, and then individual functional groups are tested using a variety of bond torsions. Finally, the structure is checked for steric clashes ('bumped') before a final score is presented. Chemical databases can be screened not only with a binding site (searching for complementary molecular interactions) but also with another ligand (searching for identical molecular interactions). Several available algorithms can compare two-dimensional or three-dimensional structures and build a profile of similar molecules.
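The idea of comparing binding-site dimensions with candidate ligands and ranking the matches can be illustrated with a toy sketch. This is not the DOCK algorithm itself: the scoring rule (the fraction of site-point distances matched by a ligand interatomic distance within a tolerance), the function names, the coordinates and the tolerance value are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist

def pairwise_distances(points):
    """Sorted list of all distances between pairs of 3D points."""
    return sorted(dist(p, q) for p, q in combinations(points, 2))

def shape_score(site_points, ligand_atoms, tolerance=0.5):
    """Toy score: fraction of site-point distances that can be paired
    with a distinct ligand interatomic distance within the tolerance."""
    site = pairwise_distances(site_points)
    ligand = pairwise_distances(ligand_atoms)
    used = [False] * len(ligand)
    matched = 0
    for d in site:
        for j, e in enumerate(ligand):
            if not used[j] and abs(d - e) <= tolerance:
                used[j] = True
                matched += 1
                break
    return matched / len(site) if site else 0.0

def rank_ligands(site_points, database, tolerance=0.5):
    """Score every ligand in the database and sort from best to worst."""
    scores = {name: shape_score(site_points, atoms, tolerance)
              for name, atoms in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical site points and ligands (coordinates in arbitrary units).
site_points = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
database = {
    "ligand_poor": [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    "ligand_good": [(1, 1, 1), (4, 1, 1), (1, 5, 1)],
}
ranking = rank_ligands(site_points, database)
print(ranking)
```

Real docking programs also account for orientation, chemistry and energetics; a distance comparison of this kind only captures the shape-matching step in its crudest form.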
The three-dimensional (3D) structure of the target, determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, is a prerequisite for designing a compound that can bind to it or act on it. The compound is chosen from an existing chemical compound library by combinatorial structure docking. The lead compounds from the library are docked, that is, tested for complementary fit, onto the active site of the target molecule. This initial in silico fitting reduces the number of compounds that have to be synthesized and tested in vitro, since the databases contain the chemical properties and methods of synthesis of the compounds. In addition, a few commercial docking and molecular modeling software packages are described below:
Schroedinger
Schroedinger Software is a suite of computational tools for computational chemistry, docking, homology modeling, protein X-ray crystallography refinement, bioinformatics, ADME prediction, cheminformatics, enterprise informatics, pharmacophore searching, molecular simulation and quantum mechanics, aimed at solving real-world problems in life science and molecular chemistry research. Maestro is the unified interface for all Schroedinger software. Its rendering capabilities, selection of analysis tools and easy-to-use design make Maestro a versatile modeling environment. It can be used to build, edit, run and analyse molecules. The main components are OPLS-AA, MMFF, the GBSA solvent model, conformational sampling, minimization and MD, together with the Maestro GUI, which provides visualization, molecule building, calculation setup, job launching and monitoring, project-level organization of results and access to a suite of other modeling programs (http://www.schrodinger.com/).
Molsoft
Molsoft is a provider of tools, databases and consulting services in the areas of structure prediction, structural proteomics, bioinformatics, cheminformatics, molecular visualization and animation, and rational drug design. Molsoft offers complete solutions customized for biotechnology or pharmaceutical companies in the areas of computational biology and chemistry. Its main features include a powerful global optimizer working in an arbitrary subset of internal variables, NOEs, protein docking, ligand docking, peptide docking, and EM and density placement (http://www.molsoft.com/).
Discovery Studio
Discovery Studio is a well-known suite of software for simulating small molecule and macromolecule systems. It is developed and distributed by Accelrys, a company that specializes in scientific software products covering computational chemistry, computational biology, cheminformatics, molecular simulations and quantum mechanics. It is typically used in the development of novel therapeutic medicines, including small molecule drugs, therapeutic antibodies, vaccines and synthetic enzymes, and even in areas such as consumer products. It is used regularly by a range of academic and commercial entities, but is most relevant to the pharmaceutical, biotech and consumer goods industries. The product suite has a strong academic collaboration programme supporting scientific research, and makes use of a number of software algorithms developed originally in the scientific community, including CHARMM, MODELLER, DELPHI, ZDOCK, DMol3 and more (http://accelrys.com/products/discovery-studio/).
GOLD - Protein-Ligand Docking
GOLD is a program for calculating the docking modes of small molecules in protein binding sites. It is provided as part of the GOLD Suite, a package of programs for structure visualization and manipulation (Hermes), protein-ligand docking (GOLD), and post-processing (GoldMine) and visualization of docking results. The product of a collaboration between the University of Sheffield, GlaxoSmithKline plc and CCDC, GOLD is highly regarded within the molecular modeling community for its accuracy and reliability. Its main features are: calculation of docking modes of small molecules in protein binding sites; a genetic algorithm for protein-ligand docking; full ligand and partial protein flexibility; energy functions partly based on conformational and non-bonded contact information from the CSD; a choice of scoring functions (GoldScore, ChemScore and a user-defined score); and virtual library screening (http://www.ccdc.cam.ac.uk/products/life_sciences/gold/).
VLifeMDS
VLifeMDS is a comprehensive, integrated software package for computer-aided drug design and molecular drug discovery. The suite provides a complete toolkit with a flexible architecture. VLifeMDS supports both structure-based and ligand-based design approaches, and the integration between its modules allows a hybrid approach for discovery projects. The main functions are active site analysis, homology modeling, pharmacophore identification, conformer generation, combinatorial library design, property visualization, docking, QSAR analysis, database querying and virtual screening (http://www.vlifesciences.com/products/VLifeMDS/Product_VLifeMDS.php).
Active Site Analysis
By studying the active site of the target molecule carefully, the lead compound is built piece by piece using computer software. The surface of the target molecule with which the lead interacts may have various chemical environments, such as hydrophobic regions, hydrogen-bonding regions or catalytic zones. Fragments of a hypothetical compound are placed in these environments. The orientation of the fragments provides a clue about the final form of the lead compound. GRID, GREEN, HISTE, HINT and BUCKTS are some of the software packages used for this kind of active site analysis. Sometimes the entire molecule is fitted into the receptor site or active site. DOCK is a program that uses a 'shape fitting' approach (Fig. 9.1A-9.1D). It searches all possible ways of fitting a ligand into the receptor site. The binding site of the receptor or enzyme molecule contains hydrogen-bonding regions and hydrophobic regions.
Fig. 9.1A. Wire frame view of the docking molecules RmID (Rv3266c) (enzyme) and 11za (ligand) before docking as observed in the Hex window.
Fig. 9.1B. Wire frame view of the very close contact between RmID (Rv3266c) (enzyme) and 11za (ligand) before docking as observed in the Hex window.
Fig. 9.1C. Harmonic surface view of the RmID (Rv3266c) (enzyme) and 11za (ligand) after docking process is completed as observed in the Hex window.
Fig. 9.1D. The cartoon model of the RmID (Rv3266c) (enzyme) and 11za (ligand) complex as observed in the Hex window.
Initially, a prototype molecule is positioned inside the active site so that it satisfies a few of the bonding requirements. Additional building blocks are then fitted in a stepwise manner until all the bonding requirements are satisfied. CLIX is a program that creates the active site points and then searches a chemical structure database for compounds that would satisfy the active site.
QSAR
In drug development, lead compounds are optimized by decorating the molecular skeleton with different functional groups and testing each derivative for its biological activity. If there are several open positions on the lead molecule that can be substituted, the total number of molecules that need to be tested in a comprehensive screen would be very large. The synthesis and screening of all these molecules would be time-consuming and laborious, especially since most would have no useful activity. QSAR can be used to select those molecules most likely to have a useful activity and thus guide chemical synthesis. QSAR stands for Quantitative Structure-Activity Relationship, a mathematical relationship used to determine how the structural features of a molecule are related to biological activity. Here, essentially, the molecules are treated as groups of molecular properties (descriptors), which are arranged in a table. The QSAR mines these data and attempts to find consistent relationships between particular descriptors and biological activities, thus identifying a set of rules that can be used to score new molecules for potential activity. A QSAR is usually expressed in the form of a linear equation:

$$\text{Biological activity} = \text{constant} + \sum_{i=1}^{n} C_i P_i$$
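A linear relationship of this form can be fitted by ordinary least squares. The sketch below is a minimal pure-Python illustration under stated assumptions: the descriptor values and activities are invented (generated from activity = 1.0 + 2.0·P1 - 0.5·P2), and the helper names `fit_qsar` and `predict` are hypothetical. Real QSAR studies use dedicated statistical tools and validated descriptors.

```python
def fit_qsar(descriptors, activities):
    """Least-squares fit of activity = constant + sum(C_i * P_i)."""
    # Design matrix: a leading 1.0 in each row carries the constant term.
    rows = [[1.0] + [float(p) for p in params] for params in descriptors]
    m = len(rows[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
           + [sum(r[i] * y for r, y in zip(rows, activities))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    weights = [row[m] for row in aug]
    return weights[0], weights[1:]      # constant, [C_1, ..., C_n]

def predict(constant, coefficients, parameters):
    """Score a new molecule from its descriptor values."""
    return constant + sum(c * p for c, p in zip(coefficients, parameters))

# Invented training series: two descriptors (P1, P2) per molecule.
descriptors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
activities = [2.0, 4.5, 5.0, 7.5, 8.5]

constant, coefficients = fit_qsar(descriptors, activities)
print(constant, coefficients)
print(predict(constant, coefficients, (6, 2)))
```

Once the coefficients are fitted, untested derivatives can be scored from their descriptors alone, so that only the most promising ones are synthesized.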
P1 to Pn are parameters (molecular properties) established for each molecule in the series, and C1 to Cn are coefficients calculated by fitting variations in the parameters to their biological activities. Once the lead molecules are identified, they have to be optimized for potency, selectivity and pharmacokinetic properties. Four qualities such as the H bond donors