Bioinformatic Approaches for Livestock Genome Analysis 9789384053017, 9384053015



English Pages 312 [334] Year 2015




Bioinformatic Approaches for Livestock Genome Analysis


Editors: Umesh Singh, Sushil Kumar, C. S. Mukhopadhyay, Rajib Deb, Rafeeque R. Alyethodi, Rani Alex, Kuldeep Dhama, Indrajith Ganguly

SATISH SERIAL PUBLISHING HOUSE 403, Express Tower, Commercial Complex, Azadpur, Delhi-110033 (India) Phone : 011-27672852, Fax : 91-11-27672046 E-mail : [email protected], [email protected] Website : www.satishserial.com

Published by: SATISH SERIAL PUBLISHING HOUSE 403, Express Tower, Commercial Complex, Azadpur, Delhi-110033 (INDIA) Phone : 011-27672852 Fax : 91-11-27672046 E-mail : [email protected], [email protected]

© Editors / Publisher

ISBN 978-93-84053-01-7

© 2015. All rights reserved, no part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher/editors/authors. This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the editor(s)/contributors and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The editor(s)/contributors and publisher have attempted to trace and acknowledge the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission and acknowledgements to publish in this form have not been given. If any copyright material has not been acknowledged please write and let us know so we may rectify it. The views expressed in the individual chapters including the contents are sole responsibility of the respective authors.

Typeset at: Harminder Kharb for Laxmi Art Creation Printed at: Salasar Imaging Systems, Delhi-110035

CONTENTS

Preface
List of Editors
1. Basic Bioinformatics Tools for Molecular Data Analysis
   K.N. Raja and I. Ganguly
2. Data Mining in Animal Genetics
   Sachinandan De, Meenu Chopra and Biswajit Brahma
3. Designing and In Silico Quality Checking of PCR Primers
   C.S. Mukhopadhyay
4. Sequence Alignment: Concept and Methods
   C.S. Mukhopadhyay and Sarabjot Singh Osahan
5. Submitting Nucleotide Sequence to Bankit
   C.S. Mukhopadhyay
6. Molecular Phylogeny: Basics, Methods and Applications
   C.S. Mukhopadhyay and Sarabjot Singh Osahan
7. Real-Time PCR for Quantification of mRNA Levels
   Indrajit Ganguly, Anita Ganguly, Sanjeev Singh and Harishankar Singha
8. Next-generation Sequencing Technologies: A Novel Approach for SNP Genotyping Studies
   Soumendu Chakravarti
9. Transcriptome Analysis: Methods and Applications
   Rafeeque R. Alyethodi, Rani Alex, Umesh Singh, Sushil Kumar and Rajib Deb
10. Genome Annotation in Prokaryotes and Eukaryotes
   Sarika, Mir Asif Iquebal, C.S. Mukhopadhyay, Prakash G. Koringa, Anil Rai, Chaitanya G. Joshi and Dinesh Kumar
11. DNA Signature of Agricultural Germplasm: A Computational Approach
   Mir Asif Iquebal, Sarika, Anil Rai and Dinesh Kumar
12. Marker Based Technologies: A Paradigm Shift in Selection Methodologies in Livestock
   Rani Alex, Rafeeque R. Alyethodi, Umesh Singh, Sushil Kumar and Rajib Deb

PREFACE

From an economic point of view, the importance of genomics in modern agricultural practice is likely to increase. The education of producers is therefore a key issue in adapting the results of genomic research to agricultural practice. The application of bioinformatics and in silico techniques has the potential to maintain high standards of genomic research. Bioinformatics is a branch of biological science based on the application of computer technology to the management of biological data. It deals with the collection, storage, analysis and assimilation of biological and genetic information, and also with algorithms, databases, artificial intelligence, soft computing, data mining, assessment of the evolutionary relationships of gene and protein sequences, and the study of biological pathways and their networks. This book covers different aspects of bioinformatics as well as the in silico tools used for functional genomics studies in livestock. It will also be a ready source of information for those who wish to pursue and adopt these techniques.

Editors

LIST OF EDITORS

Dr. Umesh Singh, Ph.D, Principal Scientist
Animal Genetics & Breeding Division, Central Institute for Research on Cattle, Meerut, UP-250001, India

Dr. Sushil Kumar, Ph.D, Principal Scientist
Animal Genetics & Breeding Division, Central Institute for Research on Cattle, Meerut, UP-250001, India

Dr. C.S. Mukhopadhyay, Ph.D, Assistant Scientist
Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana, Punjab-141004, India

Dr. Rajib Deb, Ph.D, Scientist
Animal Genetics & Breeding Division, Central Institute for Research on Cattle, Meerut, UP-250001, India

Dr. Rafeeque R. Alyethodi, Ph.D, Scientist
Animal Genetics & Breeding Division, Central Institute for Research on Cattle, Meerut, UP-250001, India

Dr. Rani Alex, MVSc, Scientist
Animal Genetics & Breeding Division, Central Institute for Research on Cattle, Meerut, UP-250001, India

Dr. Kuldeep Dhama, Ph.D, Principal Scientist
Indian Veterinary Research Institute, Bareilly, UP-243122, India

Dr. Indrajith Ganguly, Ph.D, Senior Scientist
National Bureau of Animal Genetics Resources, Karnal, Haryana-132001, India

Chapter 1
Basic Bioinformatics Tools for Molecular Data Analysis
K.N. Raja and I. Ganguly
National Bureau of Animal Genetic Resources, Karnal, Haryana-132001

"Bioinformatics" is the branch of biological science that deals with the application of computer technology to the management of biological data, particularly the information generated on the genome of any organism. Computer software programmes are used to collect, store, analyze and integrate biological and genetic information. Bioinformatics also deals with algorithms, databases, artificial intelligence and soft computing. Bioinformatics tools are used in data mining, in studying the evolutionary relationships of gene and protein sequences, and in studying the biological pathways and networks that are important parts of systems biology.

Molecular biology deals with the molecular basis of biological activity and overlaps with other areas of biology and chemistry, particularly genetics and biochemistry. It chiefly concerns itself with understanding the interactions between the various systems of a cell, including the interactions between DNA, RNA and protein biosynthesis, as well as learning how these interactions are regulated.

Bioinformatics is the field of science in which biology, computer science and information technology merge into a single discipline, which helps in sequencing and annotating genomes and their observed mutations in genetics and genomics. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.

The amount of biological data is increasing exponentially. Sequence data, both nucleotide and amino acid, have become one of the most important data types in bioinformatics. This has resulted in huge quantities of data in the databases, which need to be analyzed with bioinformatics tools for interpretation and for gaining knowledge about gene function. The work of a molecular biologist runs from generating molecular data through data analysis to interpreting the results that arise from the analysis. Certain software packages are used extensively to carry out this routine molecular biology work.

Biological Database

To analyze any biological data on the DNA, RNA or protein sequences of a gene, we need basic information about the gene of interest. This information is collected and stored in a central place through a computer network programme and made accessible to all end users. Such a store is called a biological database: a large, organized body of persistent data, usually associated with computerized software designed to update, query and retrieve components of the data stored within the system. The database should be easily accessible, and it should be easy to extract information from it through a biological query. The following are some popular databases that provide molecular biology information:

1. GenBank: The Genetic Sequence Databank, which uses a flat-file structure stored as ASCII text that is readable by both humans and computers. In addition to sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification and references to published literature.
2. EMBL: This Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and submitted directly by researchers and sequencing groups.
3. SwissProt: Curated protein sequence database.
4. PROSITE: Database of protein families, domains and functional sites, described as patterns and profiles.
5. EC-ENZYME: Database of characterized enzymes with EC numbers.
6. PDB: Protein Data Bank, holding three-dimensional structures of proteins and other macromolecules determined by methods such as X-ray crystallography.
7. GDB: Human Genome Database.
8. OMIM: Database containing information on Mendelian inheritance in man.
9. PIR-PSD: A comprehensive, expertly annotated protein sequence database. PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD).
10. MEDLINE: The premier bibliographic database of the National Library of Medicine (NLM), covering the fields of medicine, nursing, dentistry, veterinary medicine and the preclinical sciences.
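Because a GenBank entry is plain ASCII text, it can be read with a few lines of code. The following is a minimal sketch only: the record is a hypothetical, heavily abbreviated example that imitates the general keyword layout (LOCUS, DEFINITION, ACCESSION, ORIGIN, and the closing "//"), and the function name parse_flat_file is an assumption made for illustration. Real entries carry many more fields, and dedicated libraries handle them more robustly.

```python
# Hypothetical, abbreviated record imitating the GenBank flat-file layout.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   MAM 01-JAN-2015
DEFINITION  Hypothetical example sequence for illustration only.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccaagc ttggatccga attc
//
"""


def parse_flat_file(text):
    """Return the header fields and the sequence of a single record."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("ORIGIN"):    # sequence lines follow this keyword
            in_sequence = True
            continue
        if in_sequence:
            # Drop the leading position number and the spaces between blocks.
            record["sequence"] += "".join(line.split()[1:]).upper()
        elif line[:1].isalpha():         # keyword lines start in the first column
            keyword, _, value = line.partition(" ")
            record[keyword] = value.strip()
    return record


if __name__ == "__main__":
    entry = parse_flat_file(SAMPLE_RECORD)
    print(entry["ACCESSION"], len(entry["sequence"]), "bp")
```

The same keyword-then-value reading is what makes the format usable by both people and programs.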

Basic Software Packages Applied in Molecular Biology

Bioinformatics tools are software programs designed to extract meaningful information from the mass of data and to carry out the analysis step. Anyone designing a tool for analyzing molecular data must bear in mind that the end user may not be a frequent user of computer technology, and that the tool should be available over the internet, given the global distribution of the scientific research community. Many basic software packages are routinely used for molecular data handling. These include Primer3, Primer Premier, BioEdit, DNASTAR's Lasergene sequence analysis software, ClustalW and ClustalX, CLC Free Workbench, BLAST, FASTA, CHROMA, MEGA, MEGA2, Sequence Manipulation Suite, NEBcutter V2.0 etc.

Classification of Bioinformatics Tools

The bioinformatics tools can be classified as follows:

1. Homology and similarity tools: used for finding the sequence homology of the query of interest with the sequences available in the database.
2. Protein function analysis: tools that compare the protein of interest to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains.
3. Structural analysis: tools meant to compare query structures (nucleic acid or protein) with the known structures in the database.
4. Sequence analysis: a set of tools that allow more detailed analysis of the query sequence, covering evolutionary analysis, identification of mutations, hydropathy regions, CpG islands and compositional biases.

1. Primer3

PCR is an essential tool in genetics and molecular biology and is used for many different goals. PCR can be used efficiently only with sequence-specific primer pairs, which can be designed effectively with bioinformatics software. Primer3 is a free online tool to design and analyze primers for PCR and real-time PCR experiments. Primer3 can also design hybridization probes and sequencing primers. It has many input parameters that control the design and help to produce primers suited to the user's goal. The online tool includes some important features such as primer detection, cloning, sequencing and primer listing.
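Two of the simplest quantities that primer design tools report are GC content and melting temperature. The sketch below is not Primer3's method: it assumes the rough Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a first approximation for short oligonucleotides. The function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule.

    Assumption: 2 degrees per A/T and 4 degrees per G/C. This is a rough
    rule of thumb for short oligos, not a nearest-neighbor calculation.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    example_primer = "ATGGCCAAGCTTGGATCCGA"  # hypothetical 20-mer
    print("GC content (%):", gc_content(example_primer))
    print("Wallace Tm (deg C):", wallace_tm(example_primer))
```

Dedicated tools refine both values with salt and primer concentrations and with thermodynamic tables, so these figures are useful only as a quick sanity check.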

2. Primer Premier

Primer Premier is a comprehensive software package to design and analyze PCR primers. Its search algorithm finds optimal PCR, multiplex and SNP genotyping primers, and calculates melting temperature with the nearest-neighbor thermodynamic algorithm. Primers are screened for secondary structures, dimers, hairpins, homologies and physical properties, and are reported in ranked order. After the gene of interest is loaded from NCBI and a search range is selected, Primer Premier analyzes the sequence and picks the best possible primers. The software is applied in SNP genotyping assays, multiplex primer design, and automatic avoidance of homology and template structure.

Other primer design tools available online:

Tool name: Primer3
Description: Comprehensive PCR primer and hybridization probe design tool; many options but easy to accept defaults at first.
Site name: http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi; http://www.basic.nwu.edu/biotools/Primer3.html; http://www.justbio.com/primer/index.php

Tool name: Gene Fisher
Description: Interactive primer design tool for standard or degenerate primers; will accept unaligned sequences.
Site name: http://bibiserv.techfak.uni-bielefeld.de/genefisher/

Tool name: DoPrimer
Description: Easily design primers for PCR and DNA sequencing.
Site name: http://doprimer.interactiva.de/

Tool name: CODEHOP
Description: Consensus Degenerate Hybrid Oligonucleotide Primers; degenerate PCR primer design; will accept unaligned sequences.
Site name: http://blocks.fhcrc.org/codehop.html

Tool name: Web Primer
Description: Allows alternative design of primers for either PCR or sequencing.
Site name: http://genome-www2.stanford.edu/cgi-bin/SGD/web-prim

Tool name: PCR Designer
Description: For restriction analysis of sequence mutations.
Site name: http://cedar.genetics.soton.ac.uk/public_html/primer.html

Tool name: Primo Pro 3.4
Description: Reduces PCR noise by lowering the probability of random priming.
Site name: http://www.changbioscience.com/primo/primo.html

Tool name: Primo Degenerate 3.4
Description: Designs PCR primers based on a single peptide sequence or multiple alignments of proteins or nucleotides.
Site name: http://www.changbioscience.com/primo/primod.html

Tool name: PCR Primer Design
Description: An application that designs primers for PCR or sequencing purposes.
Site name: http://pga.mgh.harvard.ed/servlet/org.mgh.proteome.Prim

Tool name: The Primer Generator
Description: The program analyzes the original nucleotide sequence and desired amino acid sequence and designs a primer that either has a new restriction enzyme site or is missing an old one.
Site name: http://www.med.jhu.edu/medenter/primer/primer.cgic

Tool name: Primer Quest
Description: A primer design tool.
Site name: http://www.idtdna.com/biotools/primer_quest/primer_que

3. BioEdit

BioEdit is freeware for editing alignments of nucleotide or amino acid sequences. It is available at http://www.mbio.ncsu.edu/BioEdit/bioedit.html. The current version is 7.2.4. The package can be installed through the standard Windows procedure. It encompasses various sequence manipulation and analysis options and links to external analysis programs, providing a working environment in which sequences can be viewed and manipulated with simple point-and-click operations.

4. Reverse Complement

Reverse Complement converts a DNA sequence into its reverse, complement, or reverse-complement counterpart (http://www.bioinformatics.org/sms/rev_comp.html).
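The three conversions are simple string operations, as the following minimal sketch shows. It is an illustration of the idea, not the code behind the web tool; the function names are assumptions, and ambiguity codes other than A, C, G and T are left unchanged.

```python
# Translation table for the four standard bases, preserving letter case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Each base replaced by its Watson-Crick partner."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Opposite strand read in the 5' to 3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    dna = "ATGC"
    print(reverse(dna), complement(dna), reverse_complement(dna))
```

A sequence that equals its own reverse complement, such as GAATTC, is palindromic in the molecular biology sense, which is typical of restriction enzyme recognition sites.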

5. DNASTAR's Lasergene Sequence Analysis Software

Lasergene provides tools that enable users to accomplish each step of sequence analysis, from trimming and assembly of sequence data to gene discovery, annotation, gene product analysis, sequence similarity searches, sequence alignment, phylogenetic analysis, oligonucleotide primer design and cloning strategies. Website: http://www.dnastar.com/t-seqmanpro.aspx.

Lasergene Full Suite

The Lasergene Full Suite consists of all seven Lasergene applications:

SeqBuilder - sequence editing and annotation, automated virtual cloning, and primer design
SeqMan Pro - contig assembly and analysis, including SNP discovery, coverage evaluation, and project annotation
MegAlign - DNA and protein sequence alignments and analysis
GeneQuest - gene discovery and annotation
Protean - protein structure analysis and prediction
PrimerSelect - primer design
EditSeq - importing and editing unusual file types

All the programmes in DNASTAR accept input files in the EditSeq format only.

6. Multiple Sequence Alignment Using ClustalW and ClustalX

The Clustal programs are widely used for carrying out automatic multiple alignments of sets of nucleotide or amino acid sequences. The most familiar version is ClustalW (Thompson et al., 1994), which uses a simple text menu system that is portable to more or less all computer systems. ClustalX (Thompson et al., 1997) features a graphical user interface and some powerful graphical utilities for aiding the interpretation of alignments, and is the preferred version for interactive usage. In the simplest usage, the programs take a set of homologous sequences (all DNA/RNA or all protein) and produce a single multiple alignment. Clustal also has extensive facilities for adding sequences to existing alignments, merging existing alignments, realigning sections of an alignment, detecting and fixing alignment errors, and basic phylogenetic analysis.
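Progressive multiple alignment methods build on comparisons between pairs of sequences. As an illustration of that building block only, and not of Clustal's own implementation, the sketch below scores a global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Assumption: a simple linear gap penalty and a flat match/mismatch
    score; real aligners use substitution matrices and affine gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

In a progressive method, scores of this kind for every pair of sequences are typically used to build a guide tree that decides the order in which sequences are added to the growing alignment.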

7. Clustal Omega

Clustal Omega is the latest multiple sequence alignment program in the Clustal family. It uses seeded guide trees and HMM profile-profile techniques to generate alignments (Sievers et al., 2011). It is available online at http://www.ebi.ac.uk/Tools/msa/clustalo. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. In addition, the quality of its alignments is superior to that of previous versions, as measured by a range of popular benchmarks.

8. MUSCLE

MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW (Edgar, 2004). MUSCLE can align hundreds of sequences in seconds.

9. SALIGN

SALIGN is a web server for the alignment of multiple protein sequences and structures. It provides two methods for constructing multiple alignments from pairwise alignments: 'tree alignment' and 'progressive alignment' (Braberg et al., 2012).

10. EMBOSS

The European Molecular Biology Open Software Suite (EMBOSS) is a software analysis package that can work with data in a range of formats and can retrieve sequence data transparently from the Web (Rice et al., 2000). Extensive libraries are provided with the package, allowing other scientists to release their software as open source. It provides a set of sequence-analysis programs and supports all UNIX platforms. Its applications include sequence alignment, rapid database searching with sequence patterns, protein motif identification including domain analysis, nucleotide sequence pattern analysis, codon usage analysis for small genomes, rapid identification of sequence patterns in large-scale sequence sets, and presentation tools for publication (http://emboss.sourceforge.net/what/#Uses).

11. CLC Free Workbench

The software is freely available online as a trial version and can be used efficiently for analyzing molecular data. CLC Free Workbench creates a software environment that enables users to make basic bioinformatics analyses with smooth data management. Some features of CLC Free Workbench are summarized below:

Easy access to web-based protein and nucleotide searches in GenBank, including download facilities and a full graphical overview of the sequence annotations of the user's choice
User-friendly graphical tools for finding and working with relevant regions of DNA, RNA and protein sequences
Full integration of data input, data management, calculation results and data export, which eliminates time spent on manual data transfers between different programs and databases
Easy printing of reports and graphics
Multiple alignment of DNA, RNA and proteins
Open reading frame determination
Translation from DNA to proteins
Estimation of molecular weight and isoelectric point (for proteins)
Neighbor-joining and UPGMA phylogenies
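Two of the features listed above, translation from DNA to protein and open reading frame determination, can be sketched in a few lines. This is an illustration under stated assumptions, not the workbench's algorithm: it uses the standard genetic code, scans the forward strand only, and treats an ORF as the stretch from the first ATG to the first in-frame stop codon. The function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 1; '*' marks a stop, 'X' an unrecognized codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def first_orf(dna):
    """First forward-strand ORF (ATG through an in-frame stop), or ''."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        for i in range(start, len(dna) - 2, 3):
            if CODON_TABLE.get(dna[i:i + 3]) == "*":
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)  # no stop found; try the next ATG
    return ""


if __name__ == "__main__":
    orf = first_orf("CCATGGCCAAGTAAGG")
    print(orf, translate(orf))
```

A fuller ORF finder would also scan the reverse-complement strand and report every ORF above a minimum length rather than only the first.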

12. NCBI-BLAST (Basic Local Alignment Search Tool)

The National Center for Biotechnology Information (NCBI), funded by the US government, is maintained as a national resource for molecular biology information. BLAST (Basic Local Alignment Search Tool) is a collection of searching programs for biological sequence databases. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold. It is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of different proteins or the nucleotide sequences of DNA. BLAST is one of the most widely used bioinformatics programs because it addresses a fundamental problem and because the algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available. BLAST is faster than FASTA because it searches only for the more significant patterns in the sequences, while retaining comparable sensitivity.

Types of BLAST Programs

BLASTP: compares an amino acid query sequence to a protein sequence database
BLASTN: compares a nucleotide query sequence to a nucleotide sequence database
BLASTX: translates a nucleotide query sequence and compares it to a protein sequence database
TBLASTN: compares an amino acid query sequence to a nucleotide sequence database, where the latter is translated
TBLASTX: compares a nucleotide query sequence to a nucleotide sequence database, with both first being translated

Applications of BLAST

Specialized BLAST tools are available for many purposes; choosing the right BLAST tool and using it properly can help researchers to:

Search for homology, putative function and common ancestry for a protein or nucleotide sequence of interest
Obtain information on the inferred function of a gene or protein
Justify homology between two sequences
Find conserved domains in a target sequence that are common to many sequences
Search for sequence motifs or patterns that are similar to a sequence of interest in a particular region
Compare known sequences from different taxonomic groups
Limit a search to particular segments of the database, such as a particular species' genome
Search for a protein sequence of interest using a nucleotide sequence as the query, and vice versa
Discover suspected cloning vector sequences in a query sequence
Identify a species and/or find homologous species for an unknown DNA sequence
Locate known domains within the protein sequence of interest
Locate common genes in two related species, so that annotations can be mapped from one organism to another
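The choice among the five programs listed above depends only on the type of the query, the type of the database, and whether nucleotide sequences are translated before comparison. The lookup below is a small sketch of that decision; the function name and argument names are hypothetical, and it selects a program name only, without running any search.

```python
# (query type, database type, translated comparison) -> program name,
# following the list of BLAST program types given in the text.
BLAST_PROGRAMS = {
    ("protein", "protein", False): "BLASTP",
    ("nucleotide", "nucleotide", False): "BLASTN",
    ("nucleotide", "protein", True): "BLASTX",
    ("protein", "nucleotide", True): "TBLASTN",
    ("nucleotide", "nucleotide", True): "TBLASTX",
}


def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program for the given query and database types."""
    if query_type != database_type:
        # A mixed comparison always requires translating the nucleotide side.
        translated = True
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError("no BLAST program for combination %r" % (key,))
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
    print(choose_blast_program("nucleotide", "nucleotide", translated=True))
```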

13. FASTA

FASTA is a searching program for biological sequence databases. The program uses the FASTA algorithm to compare a protein sequence query to a protein sequence library or a DNA sequence query to a DNA sequence library.

14. CHROMA

CHROMA is a tool for generating annotated multiple sequence alignments in a convenient format. CHROMA annotates multiple DNA/protein sequence alignments by consensus to produce formatted and coloured text suitable for incorporation into other documents for publication. The package is designed to be flexible and reliable and has a simple-to-use graphical user interface running under Windows.

15. MEGA

MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, inferring ancestral sequences and testing evolutionary hypotheses. MEGA is a multi-threaded Windows application.

16. Molecular Evolutionary Genetics Analysis version 2 (MEGA2)

Molecular Evolutionary Genetics Analysis version 2 (MEGA2) can be applied for exploring and analyzing aligned DNA or protein sequences from an evolutionary perspective (Kumar et al., 2001). MEGA2 vastly extends the capabilities of MEGA version 1 by: (1) facilitating analyses of large datasets, (2) enabling creation and analyses of groups of sequences, (3) enabling specification of domains and genes, (4) expanding the repertoire of statistical methods for molecular evolutionary studies and (5) adding new modules for visual representation of input data and output results on the Windows platform.

17. NEBcutter V2.0

NEBcutter V2.0 (New England Biolabs) is an online tool for determining the restriction enzymes that cut a particular DNA sequence. The user supplies the sequence as a text file, a FASTA file or a GenBank accession number. The tool takes the DNA sequence and finds the sites for all Type II and commercially available restriction enzymes; the output is displayed once the sequence has been submitted. The maximum size of the input file is 1 megabyte, and the maximum sequence length is 300 kilobases.
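At its core, locating restriction sites is a search for each enzyme's recognition sequence in the DNA. The sketch below illustrates that idea with three well-known enzymes; it is not NEBcutter's code, the function name is an assumption, and it reports where each recognition sequence starts (1-based) rather than the exact cut position within the site.

```python
# Recognition sequences of three commonly used Type II enzymes.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(dna, enzymes=ENZYMES):
    """Map each enzyme name to the 1-based start positions of its sites."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)          # convert to 1-based
            start = dna.find(site, start + 1)    # allow overlapping sites
        hits[name] = positions
    return hits


if __name__ == "__main__":
    # Hypothetical test sequence containing one site for each enzyme.
    print(find_sites("ATGGCCAAGCTTGGATCCGAATTC"))
```

Because these recognition sequences are palindromic, searching one strand is enough; enzymes with non-palindromic sites would also need a search of the reverse complement.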

18. Gene Expression and Microarray Data Analysis Tools

GEO (http://www.ncbi.nlm.nih.gov/geo/): Gene Expression Omnibus is a public functional genomics data repository supporting MIAME-compliant data submissions, where array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

ArrayExpress (http://www.ebi.ac.uk/arrayexpress/): The ArrayExpress Archive is a database of functional genomics experiments, including gene expression, where one can query and download data collected to MIAME and MINSEQE standards.

Bioconductor (http://bioconductor.org/): Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data.

19. Genome Browser and Comparative Analysis Tools

Argo genome browser (http://www.broad.mit.edu/annotation/argo/): The Argo Genome Browser is the Broad Institute's production tool for visualizing and manually annotating whole genomes.

Ensembl genome browser (http://www.ensembl.org/index.html): The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.

UCSC genome browser (http://genome.ucsc.edu/): This site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides portals to the ENCODE and Neandertal projects.

VISTA genome browser (http://pipeline.lbl.gov/cgi-bin/gateway2): VISTA is a comprehensive suite of programs and databases for comparative analysis of genomic sequences. There are two ways of using VISTA: you can submit your own sequences and alignments for analysis (VISTA servers) or examine pre-computed whole-genome alignments of different species.

20. Pathway Analysis Tools

Ingenuity Pathway Analysis (http://www.ingenuity.com/products/ipa): IPA helps in understanding complex 'omics data at multiple levels by integrating data from a variety of experimental platforms and providing insight into the molecular and chemical interactions, cellular phenotypes, and disease processes of the system.

KEGG Pathway (http://www.genome.jp/kegg/pathway.html): KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge of molecular interaction and reaction networks.

Panther Pathway (http://www.pantherdb.org/pathway/): PANTHER Pathway consists of over 176, primarily signaling, pathways, each with subfamilies and protein sequences mapped to individual pathway components. Pathway diagrams are interactive and include tools for visualizing gene expression data in the context of the diagrams.

Application of Bioinformatics Tools in Genome Analysis

Bioinformatics tools and software are used for the following purposes:

1. Primer designing
2. DNA sequence analysis
3. Genome annotation
4. Phylogenetic analysis
5. Gene and protein expression analysis
6. Transcriptome and proteome analysis etc.

Studying and understanding the genome of an organism and its function can ultimately benefit human welfare. The genome information generated for any species through various means needs robust analysis with advanced bioinformatics software packages; in farm animals, such analysis is helpful in planning appropriate breeding programmes for genetic improvement. With the current deluge of data, computational methods have become indispensable to biological investigations. Originally developed for the analysis of biological sequences, bioinformatics now encompasses a wide range of subject areas including structural biology, genomics and gene expression studies.

References

Braberg, H., Webb, B.M., Tjioe, E., Pieper, U., Sali, A. and Madhusudhan, M.S. 2012. SALIGN: a web server for alignment of multiple protein sequences and structures. Bioinformatics 28(15): 2072-2073.

Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5): 1792-1797.

Kumar, S., Tamura, K., Jakobsen, I.B. and Nei, M. 2001. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17(12): 1244-1245.

Rice, P., Longden, I. and Bleasby, A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16(6): 276-277.

Sievers, F., Wilm, A., Dineen, D.G., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Soding, J., Thompson, J.D. and Higgins, D.G. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7: 539. doi:10.1038/msb.2011.75.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25: 4876-4882.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680.

Chapter 2
Data Mining in Animal Genetics
Sachinandan De, Meenu Chopra and Biswajit Brahma
Animal Genomics Laboratory, Animal Biotechnology Centre, National Dairy Research Institute, Karnal, Haryana-132001

Biologists are leading the current research on genome characterization (sequencing, alignment, transcription), producing a huge quantity of raw data about the genomes of many organisms. Recent years have witnessed continued growth in the amount of data stored in biological databanks. The data sets are often becoming so large that they are difficult to exploit. Extracting knowledge from this raw data by data mining is an important area of bioinformatics. However, it is difficult to handle this genomic information with current bioinformatics data mining tools, because the data are heterogeneous, huge in quantity and geographically distributed.

The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expression, gene variation, gene families, proteins and protein domains are integrated with the analytical, search and retrieval resources through the NCBI website. Entrez, a text-based search and retrieval system, provides a fast and easy way to navigate across diverse biological databases. Customized genomic BLAST enables sequence similarity searches against a special collection of organism-specific sequence data, and the resulting alignments can be viewed within a genomic context using NCBI's genome browser, Map Viewer. Comparative genome analysis tools lead to further understanding of evolutionary processes, quickening the pace of discovery.
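Entrez searches can also be issued programmatically through NCBI's E-utilities web service, which is not described in the text above and is mentioned here only as an illustration. The sketch below merely builds a search URL for the ESearch endpoint with its db, term and retmax parameters; it does not contact the server. The function name and the example query are hypothetical, and NCBI's usage guidelines should be consulted before sending requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Return an ESearch URL for a text query against one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical query: nucleotide records for cattle.
    print(build_esearch_url("nucleotide", "Bos taurus[Organism]", retmax=5))
```

Fetching the resulting URL returns a list of record identifiers, which can then be passed to other E-utilities to retrieve the records themselves.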

NCBI
Recent advances in biotechnology and bioinformatics have led to a flood of genomic data and tremendous growth in the number of associated databases. As of February 2008, the NCBI Genome Project collection described more than 2,000 genome sequencing projects: 1,500 bacteria and archaea (631 complete genomes, 462 draft assemblies, and 507 in progress), as listed at the NCBI Genome Project site http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, and almost 500 eukaryotic genomes (23 complete, 195 draft assemblies, and 221 in progress), as listed at http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi. Information on complete and ongoing genome projects is also available in the Genomes OnLine Database (GOLD) (Liolios et al., 2007), a community-supported World Wide Web resource. Hundreds of thousands of genomic sequences for viruses, organelles, and plasmids are available in the three public databases of the International Nucleotide Sequence Database Collaboration (INSDC, www.insdc.org): EMBL (Cochrane et al., 2008), GenBank (Benson et al., 2008), and the DNA Data Bank of Japan (Sugawara et al., 2008).

Additional biomedical data are stored in an increasing number of other databases. According to the 15th annual Database Issue of the journal Nucleic Acids Research (NAR), the number of databases crossed the 1,000 mark in 2008. That issue listed 1,078 databases, 110 more than in the previous year (Galperin, 2008). Navigating through the large number of genomic and other related "omic" resources is a great challenge for the average researcher. Understanding the basics of the data management systems developed for the maintenance, search, and retrieval of this large volume of genomic sequences provides the necessary assistance in traveling through the information space. Depending on the focus and goal of the research project, or the level of interest, the user selects a particular route for accessing the genomic databases and resources.
These are (1) text searches, (2) direct genome browsing, and (3) searches by sequence similarity. All of these search types enable navigation through precomputed links to other NCBI resources. Entrez is the text-based search and retrieval system used at NCBI for all of the major databases, and it provides an organizing principle for biomedical information. Entrez integrates data from a large number of sources, formats, and databases into a uniform information model and retrieval system. The actual databases from which records are retrieved, and on which the Entrez indexes are based, have different designs depending on the type of data, and reside on different machines. These are referred to as the "source databases." A common theme in the implementation of Entrez is that some functions are unique to each source database, whereas others are common to all Entrez databases. Each Entrez database ("node") can be searched independently by selecting the database from the main Entrez Web page (http://www.ncbi.nlm.nih.gov/sites/gquery). Typing a query into the text box at the top of the Web page and clicking the "Go" button returns a list of DocSum records that match the query in each Entrez category. These categories include nucleotides, proteins, genomes, publications (PubMed), taxonomy, and many other databases.

PubMed Overview
The National Library of Medicine (NLM, USA) has been indexing the biomedical literature since 1879 to help give health professionals access to the information necessary for research, health care, and education. What was once a printed index to articles, the Index Medicus, became a database now known as MEDLINE. MEDLINE contains journal citations and abstracts for biomedical literature from around the world. Since 1996, free access to MEDLINE has been available to the public online via PubMed. PubMed is a web-based retrieval system developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine. It is part of NCBI's vast retrieval system, known as Entrez. PubMed is a database of bibliographic information drawn primarily from the life sciences literature. It contains links to full-text articles at participating publishers' Web sites, as well as links to third-party sites such as libraries and sequencing centers. PubMed also provides access and links to the integrated molecular biology and chemistry databases maintained by NCBI.

Primary Gene Sequence Database
GenBank is the NIH genetic sequence database, an archival collection of all publicly available DNA sequences (Benson et al., 2008). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ) (Sugawara et al., 2008), the European Molecular Biology Laboratory (EMBL) (Cochrane et al., 2008), and GenBank at NCBI. These three organizations exchange data on a daily basis. Many journals require submission of sequence information to a database prior to publication, so that an accession number is available to appear in the paper. As of February 2008, GenBank release 164.0 (ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb164.release.notes) contained more than 83 billion bases in over 80 million sequence entries. The data come from large sequencing centers as well as from individual experimentalists. These sequences are accessible through a Web interface by text queries using Entrez or by sequence queries using BLAST. Quarterly GenBank releases can also be downloaded via FTP.

Genome Sequence Databases
The genome sequencing era that started about 20 years ago has brought into being a range of genome resources. Genomic studies of model organisms give insight into human biology, enabling better prevention and treatment of human diseases. Comparative genome analysis leads to further understanding of fundamental concepts of evolutionary biology and genetics. A review of genome resources (Hillary and Maria, 2006) reports on a selection of genomes of model species, from microbes to human. Species-specific genomic databases hold a wealth of invaluable information on genome biology, phenotype, and genetics. However, primary genomic sequences for all species are archived in public repositories that provide reliable, free, and stable access to sequence information. In addition, NCBI provides several genomic biology tools and online resources, including group-specific and organism-specific pages that contain links to many relevant websites and databases.

A genome browser is a graphical interface for displaying genomic data from a biological database. Genome browsers enable researchers to visualize and browse entire genomes (most hold many complete genomes) together with annotated data, including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotations usually come from multiple diverse sources. Genome browsers differ from ordinary biological databases in that they display data in a graphical format, with genome coordinates on one axis and the location of annotations indicated by a space-filling graphic that shows the occurrence of genes, etc. A few important genome browsers are listed below.

Integrated Genome Browser (IGB): a cross-platform, Java-based desktop genome viewer.
VISTA genome browser.
Ensembl: the Ensembl Genome Browser (Sanger Institute and EBI).
GBrowse: the GMOD GBrowse Project.
Genostar GenoBrowser: a standalone application to display and explore genomic data from any kind of file (EMBL, GenBank, Fasta, GFF...).
Integrated Microbial Genomes (IMG): system by the DOE-Joint Genome Institute.
OMGBrowse: an extensible automated genome annotating service, based on JBrowse.
UCSC Genome Bioinformatics: Genome Browser and Tools (UCSC).
Viral Genome Organizer (VGO): a genome browser providing visualization and analysis tools for annotated whole genomes from the eleven virus families in the VBRC (Viral Bioinformatics Resource Center) databases.
X:Map: a genome browser that shows Affymetrix Exon Microarray hit locations alongside the gene, transcript and exon data on a Google Maps API.

FTP Resources for Genome Data
The source genome records can be accessed from the GenBank directory; these are the records that were initially deposited by the original submitters. The reference genomes, assemblies, and associated genes and proteins can be downloaded from the Genomes and RefSeq directories. Information on the data content of these FTP directories is located in the README files.

Full release database, daily updates, or WGS files: ftp://ftp.ncbi.nih.gov/genbank/
Complete genomes/chromosomes, contigs and reference sequence mRNAs and proteins: ftp://ftp.ncbi.nih.gov/genomes/
Curated RefSeq full release or daily updates: ftp://ftp.ncbi.nih.gov/refseq/
Curated and non-curated protein clusters from microbial and organelle genomes: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS

BLAST
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between protein or nucleotide sequences. The program compares nucleotide or protein sequences to the sequences in a database and calculates the statistical significance of the matches. The BLAST Quick Start mini-course provides an introduction to BLAST and then describes the practical application of the different BLAST programs with examples (www.ncbi.nlm.nih.gov/Class/minicourses). In each example, emphasis is placed on practical step-by-step procedures, although relevant theory is also given where it affects the choice of BLAST program, parameters, and database.

The original BLAST program used a protein "query" sequence to scan a protein sequence database. A version operating on nucleotide "query" sequences and a nucleotide sequence database soon followed. The introduction of an intermediate layer, in which nucleotide sequences are translated into their corresponding protein sequences according to a specified genetic code, allows cross-comparisons between nucleotide and protein sequences. Specialized variants of BLAST allow fast searches of nucleotide databases with very large query sequences, or the generation of alignments between a single pair of sequences. Both the standalone and the web version of BLAST are available from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). The web version provides searches of the complete genome of Homo sapiens as well as those of many model organisms, including mouse, rat, fruit fly, and Arabidopsis thaliana, allowing BLAST alignments to be seen in a full genomic context (Liolios et al., 2007).

Genome BLAST refers to the application of any of the BLAST search programs to the complete genomic sequence of an organism, or to the transcript and protein sequences derived from its annotation. Genome BLAST services are available at NCBI for a variety of organisms, including human, mouse, rat, fruit fly, and many others in a growing list. At a minimum, MegaBLAST and "blastn" searches against the complete genome are supported. These are usually offered in conjunction with "tblastn" searches against the genome; "blastp" and "blastx" searches against the proteins annotated on the genome; and MegaBLAST, "blastn" and "tblastn" searches against collections of transcript sequences that have been mapped to the genome. Hits to the genome are displayed graphically within NCBI's MapViewer to show their genomic context. A protein query can also be searched manually against the conserved domain database.

BLAST2Sequences is used to compare two sequences, protein or nucleotide, using any one of the principal BLAST variants: "blastp," "blastn," "tblastn," "blastx," "tblastx," or MegaBLAST. The output of BLAST2Sequences consists of the traditional pair-wise alignments generated by the principal BLAST program used, supplemented with a dot plot representation of these alignments. The dot plot is useful for highlighting deletions and duplications of segments between two sequences. The translated variants of BLAST2Sequences are useful for the detection of exons.
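The choice among the principal BLAST variants described above follows from the type of the query and the type of the database. The sketch below is only an illustration of that decision; the function name choose_blast_program and its arguments are hypothetical and are not part of any NCBI software.

```python
def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the name of the BLAST variant suited to a query/database pair.

    query_type and db_type are "protein" or "nucleotide". translate_both
    selects tblastx, which compares a translated nucleotide query against
    a translated nucleotide database.
    """
    programs = {
        ("protein", "protein"): "blastp",        # protein query vs protein database
        ("nucleotide", "nucleotide"): "blastn",  # nucleotide query vs nucleotide database
        ("nucleotide", "protein"): "blastx",     # translated query vs protein database
        ("protein", "nucleotide"): "tblastn",    # protein query vs translated database
    }
    pair = (query_type, db_type)
    if pair not in programs:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    if translate_both:
        if pair != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return programs[pair]


print(choose_blast_program("nucleotide", "protein"))  # blastx
```

MegaBLAST is left out of the table because it is chosen for speed with very large or highly similar nucleotide queries rather than by sequence type.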

Tools for Advanced Users
The Entrez Programming Utilities (eUtils) are a set of eight server-side programs that provide a stable interface to the Entrez query and database system. The eUtils use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve data, and thus represent a structured interface to the Entrez system databases. To access these data, a piece of software first posts an eUtils URL to NCBI, then retrieves the results of this posting, and finally processes the data as required. The software can therefore be written in any computer language that can send a URL to the eUtils server and interpret the XML response, such as Perl, Python, Java, and C++. Combining eUtils components to form customized data pipelines within these applications is a powerful approach to data manipulation. More information and training on this process are available through a course on NCBI Powerscripting: http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html.
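The post-then-parse cycle described above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions: the base URL and the ESearch parameters db, term and retmax are taken from NCBI's public eUtils documentation rather than from this chapter, the helper names are hypothetical, and the XML shown is a hand-written sample of an ESearch response, not real output. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: current public base address of the eUtils server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Build the fixed-syntax ESearch URL for a text query against one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def parse_esearch_ids(xml_text):
    """Extract the list of record identifiers from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Hand-written sample response (illustrative identifiers only).
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList>"
    "</eSearchResult>"
)

print(build_esearch_url("pubmed", "bovine genome", retmax=5))
print(parse_esearch_ids(SAMPLE_RESPONSE))  # ['111', '222']
```

In a real pipeline the URL would be sent with an HTTP client and the returned identifiers passed on to a further eUtils call to fetch the records themselves.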

References
Liolios, K., Mavrommatis, K., Tavernarakis, N., Kyrpides, N. C. (2007) The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36(Database issue), D475-D479.
Cochrane, G., Akhtar, R., Aldebert, P., Althorpe, N., Baldwin, A., Bates, K., Bhattacharyya, S., Bonfield, J., Bower, L., Browne, P., Castro, M., Cox, T., Demiralp, F., Eberhardt, R., Faruque, N., Hoad, G., Jang, M., Kulikova, T., Labarga, A., Leinonen, R., Leonard, S., Lin, Q., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Nardone, F., Plaister, S., Robinson, S., Sobhany, S., Vaughan, R., Wu, D., Zhu, W., Apweiler, R., Hubbard, T., Birney, E. (2008) Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. Nucleic Acids Res 36(Database issue), D5-D12.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Wheeler, D. L. (2008) GenBank. Nucleic Acids Res 36(Database issue), D25-D30.
Sugawara, H., Ogasawara, O., Okubo, K., Gojobori, T., Tateno, Y. (2008) DDBJ with new system and face. Nucleic Acids Res 36(Database issue), D22-D24.
Galperin, M. Y. (2008) The molecular biology database collection: 2008 update. Nucleic Acids Res 36(Database issue), D2-D4.
Hillary, E. S., Maria, A. S., eds. (2006) Genomes (Cold Spring Harbor Monograph Series, 46). Cold Spring Harbor, New York.

Chapter 3
Designing and In Silico Quality Checking of PCR Primers
C.S. Mukhopadhyay
Guru Angad Dev Veterinary and Animal Sciences University, Ludhiana-141004

Polymerase chain reaction (PCR) amplifies a target nucleotide sequence in vitro, and requires a short complementary oligonucleotide sequence to initiate the DNA amplification. Literally, "to prime" means to initiate or start. A primer is a short synthetic oligonucleotide used in PCR; in vivo, by contrast, a short RNA primer is required to initiate DNA replication in all species. Software tools are used to design primer sequences that are complementary to the region of the template (or target) DNA to which the primers are intended to anneal.

Specificity and efficiency are critical for any set of PCR primers. Specificity guards against mispriming, which may occur when primers are poorly designed and which results in non-specific amplification. The specificity of a primer is determined by the length and sequence of the oligo, as well as by the sequence pattern (repetitive or single copy, part of a multigene family or not, etc.) of the template used in the PCR. The efficiency of a primer pair is determined by the fold increase of amplicon in each cycle: an efficient primer pair gives an almost two-fold increase (1.8 to 1.95) in PCR product per cycle.

The present chapter discusses the salient features of a "good" primer and demonstrates primer designing in the following sections:
1. General rules for designing primers for standard PCR.
2. Demonstration of primer designing using the free online "Primer3 (Version 4)" software (Rozen and Skaletsky, 2000) (http://www.frodo.wi.mit.edu/)
3. Checking the quality of the designed primers (by determining the possible secondary structures) using the "IDT Oligo Analyzer" online software (http://eu.idtdna.com/analyzer/applications/oligoanalyzer/)
4. Detection of secondary structures in the amplicon (target sequence) using the "UNAFold Software" (http://www.idtdna.com/Scitools/Applications/UNAFold/)
5. Finally, determining the specificity of the designed primers and cross-checking whether the primers also anneal to any other non-specific sequence in the template, using Primer-BLAST (Ye et al., 2012) (http://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome)
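The effect of per-cycle efficiency mentioned in the introduction can be made concrete with a little arithmetic. The sketch below assumes the idealized model in which the amount of amplicon is multiplied by a constant factor in every cycle; the function name fold_amplification is hypothetical, and real reactions plateau in later cycles.

```python
def fold_amplification(per_cycle_fold, cycles):
    """Total fold increase of amplicon after a number of cycles,
    assuming the same fold increase in every cycle (idealized)."""
    return per_cycle_fold ** cycles


# Compare a perfect doubling with the 1.8 to 1.95 range quoted for efficient primers.
for fold in (2.0, 1.95, 1.8):
    print(fold, f"{fold_amplification(fold, 30):.3e}")
```

Because the per-cycle factor is raised to the power of the cycle number, a small drop in efficiency compounds into a large difference in final yield.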

General Rules for Primer Designing
The specificity and efficiency of a primer depend on several factors that must be taken into account while designing primers (Thornton and Basu, 2011). The features of a "good" primer constitute the rules outlined below:
1. Primer length: It determines the specificity of the primer and significantly affects its annealing to the template. A primer that is too short has low specificity and therefore induces non-specific amplification. Conversely, very long primers tend to bind the template less efficiently at the normal annealing temperature, owing to the higher probability of forming secondary structures such as hairpins. Longer primers also take more time to anneal to the complementary target sequence and to denature in the next cycle, which compromises the quantity of amplicon produced. Optimal primer length: The optimal length of general PCR primers is 18-24 bases. However, for multiplexing the length may be as long as 30 to 35 bases, while primers used for random priming (viz. RAPD) are kept short, from 8-mer (octamers) to 12-mer, to promote random priming.

2. Melting temperature (Tm): It is the temperature at which 50% of the DNA duplex dissociates to become single stranded. For primers, it can be defined as the temperature at which 50% of the primer molecules are hybridized to their complementary nucleotides in the template. The Tm of a primer is determined by primer length, base composition, salt concentration and the pH of the reaction milieu. Tm is calculated at 50 mM salt, since this is the standard molarity of the PCR master mix. Although the Tm can be calculated correctly only after considering all four of these factors, the Tm of oligos of 18 bases or fewer is approximated by the formula Tm = 2(A+T) + 4(G+C), where A, T, G and C are the numbers of the respective nucleotides present in the primer. The nearest-neighbour thermodynamic calculation of Tm is the most accurate. As a rule of thumb, the annealing temperature (Ta) for a primer pair is 5°C below the average Tm of the two primers. However, the optimal annealing temperature should be standardized empirically for each primer pair. A primer should anneal to the template before the template strands renature.
3. Optimal melting temperature: Ideally it lies between 52°C and 62°C; however, the G/C content of the primer is critical in determining the melting temperature. A Tm below 45°C should be avoided because of the potential for secondary annealing, and thereby spurious amplification. For automated sequencing, primers with a Tm above 65-70°C can lead to secondary priming artifacts and noisy sequences. A higher Tm (75°C to 80°C) is recommended for amplifying targets with high G/C content.
4. Primer pair Tm mismatch: The permissible Tm difference between the two primers is less than 5°C, preferably within 2°C. A Tm mismatch between the primers of a pair can lead to poor amplification.
The primer with the higher Tm will misprime at lower temperatures, while the primer with the lower Tm may not work at higher temperatures.
5. Non-specific amplification/cross homology: Primer pairs may amplify non-specific (unintended) targets, either by chance, or because of sequence similarity among genes belonging to the same gene family, or because of a repeat sequence in the designed primers. Primers containing highly repetitive sequences are prone to generate non-specific amplicons when amplifying genomic DNA. The designed primers should be checked with Primer-BLAST or nucleotide BLAST (http://www.ncbi.nlm.nih.gov/) against the non-redundant sequence database of NCBI (discussed later). Moreover, when designing cDNA-specific primers, it is always safer to span exon-intron boundaries in order to avoid amplifying genomic DNA (if gDNA contamination is present).
6. Primer G/C content: In general, the optimal G/C content is 45-55%, with an acceptable range of 40-60%. The G/C content ultimately determines the annealing temperature.
7. GC clamp: The 3'-terminus of the primer is very important, since DNA amplification proceeds in the 5' to 3' direction. The G/C pair is more stable than the A/T pair because it has three hydrogen bonds instead of two. The presence of "G" or "C" at the 3'-end is therefore preferred, as it acts as a "clamp", i.e. a tighter bond, and increases the efficiency of the primers. G/C clamp refers to the presence of "G" or "C" within the last 4 bases from the 3'-end of the primer. A G/C clamp thus prevents mispriming and enhances specific primer-template binding. Two G/C bases at the 3'-end are acceptable. Three or more should be avoided, since they make the primer "sticky" owing to the higher Tm at the 3' terminus.
8. Adding RE sites to primers: Restriction endonuclease (RE) site(s) may need to be added at the 5' end of the primer so that the amplicon can be used in cloning and genetic engineering. When designing such primers, 4-6 nucleotides are added as the RE site just before the primer sequence (say, 18 bases long) at the 5' terminus. It is often necessary to add 2-3 more bases ahead of this RE site, since some REs need some extra room next to the RE site in order to attach to the sequence.
Under such conditions, two different sets of Tm values have to be considered. The first set corresponds to the core primers (18 bases long). After some cycles of PCR, the generated amplicons start to incorporate the 4-6 bases of the RE sites. The whole primer sequence (18 + 4 or 6 bases) is likely to have a different Tm from that of the core primer. The PCR may therefore be programmed with different annealing temperatures for two different sets of cycles in a single run.

9. Secondary structures in primers: Secondary structures refer to the pairing of primers with one another (self and heterologous), as well as to the formation of loop-like structures within a single primer (Figure 1).

Fig. 1: Diagrammatic representation of primer-to-primer annealing (A), hairpin loop structure (B) and primer-to-template annealing (C). The stability of the various DNA duplex structures is estimated from the nearest-neighbour dimer sequences in the pairing segment using NN parameters (D).
Source: Primer Design Assistant (PDA): a web-based primer design tool (http://nar.oxfordjournals.org/content/31/13/3751/F1.large.jpg)

Hairpins: These are formed by interactions (hydrogen bond formation) among the nucleotides of the same primer. Such loops hinder annealing of the primer to the target sequence and thus result in poor or no amplification. The acceptable ΔG for a 3'-end hairpin is higher than -2 kcal/mole (values such as -1.5 or -1.0 kcal/mole are better). In other words, since the value carries a negative sign, its magnitude should be small. For an internal hairpin, the ΔG should be higher than -3 kcal/mole. ΔG is a measure of the spontaneity with which the secondary structure forms, and its magnitude is the energy required to break that structure. Larger negative values (i.e. further from zero on the negative side) indicate a higher inclination for identical primers to hybridize to each other rather than to the template.
Self-dimer (homodimer): These dimers are formed by intermolecular interactions between two copies of the same primer (forward primer with forward primer, or reverse primer with reverse primer). The acceptable ΔG is higher than -5 kcal/mole for a 3'-end self-dimer and higher than -6 kcal/mole for an internal self-dimer.
Cross-dimer (heterodimer): It is produced by intermolecular interactions between the sense and antisense primers. The acceptable ΔG is higher than -5 kcal/mole for a 3'-end cross-dimer and higher than -6 kcal/mole for an internal cross-dimer.
10. Max 3'-end stability: The stability of the 3'-end is crucial for the specificity and efficiency of the primers. It is measured by the maximum ΔG of the last 5 bases at the 3'-end of the primer. Higher 3'-end stability improves priming efficiency; however, excessive stability can reduce specificity, because partial hybridization of the 3' terminus alone can induce non-specific extension. Hence, ΔG values more negative than -9 kcal/mole should be avoided. ΔG is the energy required to break the secondary structure, and larger negative values indicate a higher propensity for false priming, as the 3' end can initiate polymerization even if the remainder of the primer does not bind well.
11. Optimal amplicon size: The length of the product is determined by the purpose of the experiment rather than by the rules for designing "good" primers. The PCR protocol is modified according to the amplicon length and its G/C content. Nevertheless, the PCR ingredients, especially the quality of the Taq DNA polymerase, have a major role to play.
The amplicon length varies with the application of the product in different experiments:
Single strand conformation polymorphism (SSCP): less than 400 nucleotides (Markoff et al., 1997).
Cloning: 200 bp to several kilobases (kb). An insert larger than 3 kb requires a special Taq polymerase dedicated to amplifying longer amplicons.
Real-time PCR: 80-200 bp, since a longer amplicon distorts signal detection owing to the stronger signal generated by the fluorophores (Thornton and Basu, 2011).
RFLP studies: the amplicon length varies according to the requirements of the target sequence.
12. Stretches of nucleotides: Primers with long "polyG" or "polyC" stretches can promote mispriming. Runs of the same base increase the probability of primer-dimer and hairpin loop formation. Mispriming means annealing of a primer to an unintended template because of chance similarity between complementary sequences, which leads to low Gibbs free energy at the 3'-end. On the other hand, poly-A and poly-T stretches open up the primer-template complex and make it looser. The maximum acceptable run of a single nucleotide is 4 bases. Similarly, dinucleotide repeats increase mispriming and the formation of secondary structures. The maximum acceptable number of dinucleotide repeats in a primer sequence is 4.
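Several of the rules above are simple enough to check automatically. The following Python sketch applies the length, G/C content, Tm, G/C clamp and mononucleotide-run rules to a primer, using the approximate formula Tm = 2(A+T) + 4(G+C) and the rule-of-thumb annealing temperature. It is a minimal illustration, not a replacement for Primer3 or OligoAnalyzer: all function names are hypothetical, the thresholds are the ones quoted in this chapter, and secondary structures and ΔG are not evaluated.

```python
import re


def wallace_tm(primer):
    """Approximate Tm (deg C) from Tm = 2(A+T) + 4(G+C); meant for short oligos."""
    s = primer.upper()
    return 2 * (s.count("A") + s.count("T")) + 4 * (s.count("G") + s.count("C"))


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    s = primer.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def longest_run(primer):
    """Length of the longest stretch of a single repeated base."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", primer.upper()))


def gc_clamp_count(primer, window=4):
    """Number of G or C bases within the last `window` bases at the 3'-end."""
    return sum(base in "GC" for base in primer.upper()[-window:])


def annealing_temp(tm_forward, tm_reverse):
    """Rule of thumb: 5 deg C below the average Tm of the primer pair."""
    return (tm_forward + tm_reverse) / 2 - 5


def check_primer(primer):
    """Return a list of warnings for rules the primer does not satisfy."""
    warnings = []
    if not 18 <= len(primer) <= 24:
        warnings.append("length outside 18-24 bases")
    if not 40 <= gc_percent(primer) <= 60:
        warnings.append("G/C content outside 40-60%")
    if not 52 <= wallace_tm(primer) <= 62:
        warnings.append("Tm outside 52-62 deg C")
    if longest_run(primer) > 4:
        warnings.append("mononucleotide run longer than 4 bases")
    clamp = gc_clamp_count(primer)
    if clamp == 0 or clamp >= 3:
        warnings.append("G/C clamp missing or too strong")
    return warnings


primer = "GATGCGTACGACTGACTG"
print(wallace_tm(primer), round(gc_percent(primer), 1), check_primer(primer))
# prints: 56 55.6 []
```

The example primer is the 18-mer used later in this chapter; it has 8 A/T and 10 G/C bases, giving Tm = 2(8) + 4(10) = 56°C. For a pair with Tm values of 56°C and 60°C, annealing_temp(56, 60) gives a starting Ta of 53°C, to be refined empirically.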

Analyzing Gibbs Free Energy (Delta G)
The Gibbs free energy (ΔG) is a measure of the amount of work that can be extracted from a process operating at constant pressure. It thus indicates the spontaneity of the reaction. For oligos, ΔG represents the stability of a secondary structure: it is the amount of energy required to break that structure. A more negative ΔG (e.g. -6 kcal/mol versus -3 kcal/mol) indicates a more stable, and therefore undesirable, hairpin. Hairpins at the 3' end affect the reaction most adversely. Gibbs free energy is discussed in detail in this section.

The fundamental principle of natural systems is that they favor a more random state over an ordered one. At equilibrium there is no net flow of matter or of energy, because no driving force remains within the system. During PCR, primers react either with the template or with other oligos (forming homodimers, heterodimers or hairpin loops) whenever two single-stranded oligos and/or the template form a duplex. This can be summarized by the following equation:
[Oligo 1] + [Oligo 2] ⇌ [Duplex]
where the [ ] symbol represents the molar concentration of the species placed inside the square brackets.

For oligos, Gibbs free energy is the energy (kcal/mole) required to break the bonds between the complementary bases. It is the net exchange of energy between the system and its environment (http://cdn.idtdna.com/support/technical/TechnicalBulletinPDF/mFold_Delta_G_and_melting_temperature_expla). ΔG is calculated using the formula (Breslauer et al., 1986):
ΔG = ΔH - T × ΔS
where ΔH is the enthalpy of the primer (in kcal/mol), T is the temperature (in K) and ΔS is the entropy of the primer (in kcal/K/mol).
ΔH, the enthalpy: the total energy exchanged between the system and its surrounding environment. The terms exothermic and endothermic relate to the enthalpy change of a process.
ΔS, the entropy: the energy spent by the system to organize itself.
T: the absolute temperature of the system in kelvin (Celsius + 273.15).
The unit of ΔG is kcal/mole. The +/- sign associated with the energy value must also be taken into account. If the magnitude is large and the sign is negative, the reaction is spontaneous and hence exergonic (explained below).

The ΔG values of the primers can be estimated with software such as the "IDT Oligo Analyzer" online tool (http://eu.idtdna.com/analyzer/applications/oligoanalyzer/), both for the 3' ends of each oligo and for the homo- and heterodimers. The ΔG value for the five bases at the 3' end should not be more negative than -9 kcal/mole.

Now, let us consider the following example:
Sequence of a primer: 5'-GATGCGTACGACTGACTG-3'
Sequence of a portion of the template (i.e. the secondary sequence): 5'-CAGTCAGTCGTACGCATC-3'.
This sequence has been deliberately taken as the reverse complement of the primer sequence given above, to check

the ΔG value. The ΔG value can be calculated using any online software for checking the quality of primers, viz. OligoAnalyzer 3.1. The concentrations of oligo and Mg++ need not be reset for this calculation. The results obtained are as follows:
1. Maximum ΔG (i.e. the most negative ΔG that can be obtained from either the primary or the secondary sequence, assuming a complete match for that sequence) = -31.95 kcal/mole
2. Maximum ΔG obtained (for a complete match of all 18 nucleotides of the primer) = -31.95 kcal/mole

Now, let us consider another, longer secondary sequence: 5'-GACCTAGATAGGCCTTAAATGCTAGTACATCACGCGT-3'. It is a random sequence; hence any match will be due to mere chance. The results obtained after calculating the ΔG values are:
Maximum ΔG = -67.8 kcal/mole (the maximum ΔG has become more negative because the secondary sequence is longer)
The highest ΔG obtained using these two sequences is -8.09 kcal/mole.

This calculation shows that the value of ΔG moves further towards the negative side as the match between the complementary sequences increases. This energy value is the amount of energy required to break the hydrogen bonds between the hybridized DNA fragments. Hence, for a primer and its specific template the magnitude of this value should be large (ignoring the sign), while for unintended matches it should be small (i.e. the negative value should lie close to zero).

The Gibbs free energy can take one of the following values:
1. ΔG > 0: G(products) - G(reactants) > 0. It means that the system will go in the direction of producing the reactants (i.e. from double strand to single strands). This is a non-spontaneous reaction, which means the process can proceed spontaneously only in the reverse direction (duplex to single strands), not in the forward one (single strands to duplex). A non-spontaneous reaction is an endergonic (or unfavourable) reaction in which the standard change in free energy (i.e. the change in Gibbs free energy) is positive,

and energy is absorbed by the system to start the reaction. 2. ΔG