Priyanka Anjoy · Kuldeep Kumar · Girish Chandra · Kishor Gaikwad, Editors
Genomics Data Analysis for Crop Improvement
SPRINGER PROTOCOLS HANDBOOKS
Springer Protocols Handbooks collects a diverse range of step-by-step laboratory methods and protocols from across the life and biomedical sciences. Each protocol is provided in the Springer Protocol format: readily-reproducible in a step-by-step fashion. Each protocol opens with an introductory overview, a list of the materials and reagents needed to complete the experiment, and is followed by a detailed procedure supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. With a focus on large comprehensive protocol collections and an international authorship, Springer Protocols Handbooks are a valuable addition to the laboratory.
Genomics Data Analysis for Crop Improvement Edited by
Priyanka Anjoy National Accounts Division, Ministry of Statistics and Programme Implementation, New Delhi, Delhi, India
Kuldeep Kumar Indian Institute for Pulses Research, Indian Council of Agricultural Research, Kanpur, Uttar Pradesh, India
Girish Chandra Department of Statistics, University of Delhi, New Delhi, India
Kishor Gaikwad National Institute for Plant Biotechnology, Indian Council of Agricultural Research, New Delhi, Delhi, India
ISSN 1949-2448 ISSN 1949-2456 (electronic)
Springer Protocols Handbooks
ISBN 978-981-99-6912-8 ISBN 978-981-99-6913-5 (eBook)
https://doi.org/10.1007/978-981-99-6913-5
Mathematics Subject Classification: 68P15, 92B15, 92C40, 92C75

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.
In the Memory of Dr. Dharmendra Mallick
Our beloved teacher, guide, and devoted educator Dr. Dharmendra Kumar Mallick left us for his heavenly abode on May 16, 2021. He was a source of motivation for students and fellows, a humble and passionate being who always spread joy and positivity. Sir, we will always cherish your memories fondly in our hearts. Dr. Dharmendra Kumar Mallick, a doyen of botany, finally hung up his hat on May 16, 2021. He began his humble academic journey at Delhi University as a Ph.D. student supervised by Prof. S.K. Sawhney. His teaching career began at ANDC and continued in Deshbandhu College, first as an ad hoc and later as a permanent faculty member; in between, he also taught at Hindu College for a brief period. Since 2005, Dr. Mallick was an integral part of Deshbandhu College, contributing immensely to the growth and progress of the institution by taking up many key positions over the years. As a member of the Governing Body of the college, he was part of many key decisions that have taken us so far as an institution. His constant love for research never faded. He was selected for the prestigious Plantae Fellowship (2017–2018) for his contribution to and active involvement in the subject of botany, and he received many awards and distinctions for his understanding and high standard of research in the field of plant physiology. An alumnus of this university and a member of DUBS (Delhi University Botanical Society), he served on many committees in various positions, all to the benefit of the teaching-learning process of botany as a subject and of academics as a whole. As a member of the Committee of Courses, he was actively involved in the restructuring and revision of the botany curriculum.
In the Memory of Dr. Hukum Chandra
(7th November 1972 to 26th April 2021) Dr. Hukum Chandra was an Indian statistician known for his distinguished contributions to statistics and survey sampling. He was an eminent scientist who pioneered the introduction and popularization of the "small area estimation" technique in the official statistics system of India, and he contributed to this book as a resource person. Dr. Chandra had over 23 years of experience in research, teaching, training, and consultancy, chiefly at the Indian Agricultural Statistics Research Institute, New Delhi. His contributions to statistics, particularly small area estimation, are path-breaking innovations for statisticians. He completed his M.Sc. in Statistics at the University of Delhi, his Ph.D. at the University of Southampton, UK, and postdoctoral research at the University of Wollongong, Australia. During his academic career he was bestowed with the Cochran-Hansen Award of the International Association of Survey Statisticians and the Young Researcher/Student Award of the American Statistical Association, and he was a recipient of the Commonwealth Scholarship offered by the Commonwealth Scholarship Commission in the United Kingdom. His outstanding contributions to the profession were also recognized with many awards, including the National Award in Statistics of the Ministry of Statistics and Programme Implementation, Government of India; the National Fellowship and the Lal Bahadur Shastri Outstanding Young Scientist Award of the Indian Council of Agricultural Research; the Recognition Award of the National Academy of Agricultural Sciences; the Professor P.V. Sukhatme Gold Medal Award; and the Dr. D.N. Lal Memorial Award of the Indian Society of Agricultural Statistics. He left a significant mark as a statistical expert and consultant in various national and international organizations and committees, including many inter-ministerial and institutional committees in India and the Food and Agriculture Organization of the United Nations in Sri Lanka, Ethiopia, Myanmar, and Italy. He was an Elected Member of the International Statistical Institute, the Netherlands; a Fellow of the National Academy of Agricultural Sciences, India; and a Fellow of the Indian Society of Agricultural Statistics, and he served as a Council Member of the International Association of Survey Statisticians. Dr. Chandra also contributed voluntarily to many academic activities, serving as associate editor, guest editor, and reviewer for renowned scientific journals, as a member of scientific committees, and as convener of different national and international conferences. His work covers a diverse range of methodological and applied problems in statistics, particularly survey design and estimation methods; small area estimation; bootstrap methods; disaggregate-level estimation and analysis of agricultural, socio-economic, and health indicators; spatial models for survey data; and statistical methods for the improvement of agricultural and livestock statistics.
In his career, he served as an investigator in more than 25 national and international projects, publishing more than 125 research papers, 4 books, and several technical bulletins, reports, book chapters, working papers, and training and teaching reference manuals. He delivered a number of invited talks on many national and international platforms of repute worldwide.
Preface

The main endeavor in bringing out this volume is to provide wide-ranging programming solutions for genomics data handling and analysis, so as to address the complex problems associated with crop improvement programs, which are essential for India's developmental programs. We have tried to include both the basics and comprehensive programming solutions for genomics data handling, encompassing biological fields from both plants and animals. The volume comprises 14 chapters covering recent developments in several methods of genomic data analysis for crop improvement, written by eminent researchers in their areas of expertise. Real applications of a wide range of key bioinformatics topics are presented, including assembly, annotation, and visualization of NGS data; expression profiles of coding and noncoding RNA; statistical and quantitative genetics; trait-based association analysis; QTL mapping; and artificial intelligence in genomic studies.

In Chap. 1 of this volume, statistical and biological data analysis using programming languages is presented, with special emphasis on the R and Perl languages. Chapter 2 describes the Python language with particular emphasis on its use by biologists. The assembly, annotation, and visualization of NGS data are presented in Chap. 3. Some historical aspects of statistical and quantitative genetics studies are contained in Chap. 4. Chapter 5 deals with the mapping of quantitative trait loci in the context of dissecting complex traits. Trait-based association mapping in plants is described in Chap. 6. In Chap. 7, the meta-QTL strategy and its application in candidate gene discovery are explained. Chapter 8 narrates the role of databases and bioinformatics tools in crop improvement. An overview of bioinformatics databases and tools for genomic research on crop improvement is presented in Chap. 9. The analysis of public domain NGS datasets and databases for plant virus detection and identification is carried out in Chap. 10. Chapter 11 covers tree genome databases in the context of developing cyber-infrastructure for forest trees, while the development of databases for genomic research is presented in Chap. 12. The role of artificial intelligence in genomic studies is discussed in Chap. 13. The last chapter, Chap. 14, deals with genes and the dogma of molecular biology.

We hope that this volume will appeal to researchers and students engaged in the fields of bioinformatics, biometrics, statistics, and the life sciences. Here, we wish to remember the late Dr. Hukum Chandra, Indian Agricultural Statistics Research Institute, whose incredible initial contribution made it possible to bring out this volume. He will live on among the scientific fraternity through his significant contributions to statistics and allied areas. We extend our thanks and appreciation to the authors for their continuous support during the finalization of the book.

New Delhi, Delhi, India
Kanpur, Uttar Pradesh, India
New Delhi, Delhi, India
New Delhi, Delhi, India
Priyanka Anjoy
Kuldeep Kumar
Girish Chandra
Kishor Gaikwad
Contents

1. Statistical and Biological Data Analysis Using Programming Languages (Ritwika Das, Soumya Sharma, and Debopam Rakshit)
2. Python for Biologists (Rajkumar Chakraborty and Yasha Hasija)
3. Assembly, Annotation and Visualization of NGS Data (Kalyani M. Barbadikar, Tejas C. Bosamia, Mazahar Moin, and M. Sheshu Madhav)
4. Statistical and Quantitative Genetics Studies (Rumesh Ranjan, Wajhat Un Nisa, Abhijit K. Das, Viqar Un Nisa, Sittal Thapa, Tosh Garg, Surinder K. Sandhu, and Yogesh Vikal)
5. Mapping of Quantitative Traits Loci: Harnessing Genomics Revolution for Dissecting Complex Traits (Sanchika Snehi, Mukesh Choudhary, Santosh Kumar, Deepanshu Jayaswal, Sudhir Kumar, and Nitish Ranjan Prakash)
6. Trait Based Association Mapping in Plants (Priyanka Jain, Bipratip Dutta, and Amitha Mithra Sevanthi)
7. Meta-analysis of Mapping Studies: Integrating QTLs Towards Candidate Gene Discovery (Anita Kumari, Divya Sharma, Sahil, Kuldeep Kumar, Amitha Mithra Sevanthi, and Manu Agarwal)
8. Role of Databases and Bioinformatics Tools in Crop Improvement (Madhu Rani)
9. Overview of the Bioinformatics Databases and Tools for Genome Research and Crop Improvement (Divya Selvakumar, Selva Babu Selvamani, and Jayakanthan Mannu)
10. Public Domain Databases: A Gold Mine for Identification and Genome Reconstruction of Plant Viruses and Viroids (V. Kavi Sidharthan and V. K. Baranwal)
11. Tree Genome Databases: A New Era in the Development of Cyber-Infrastructures for Forest Trees (Ayushman Malakar, Girish Chandra, and Santan Barthwal)
12. Development of Biological Databases for Genomic Research (Jatin Bedi, Shbana Begam, and Samarth Godara)
13. Artificial Intelligence in Genomic Studies (Shbana Begam, Jatin Bedi, and Samarth Godara)
14. Basics of the Molecular Biology: From Genes to Its Function (Ria Mukhopadhyay, Sahanob Nath, Deepak Kumar, Nandita Sahana, and Somnath Mandal)
Chapter 1
Statistical and Biological Data Analysis Using Programming Languages
Ritwika Das, Soumya Sharma, and Debopam Rakshit

Abstract
Rapid progress in high-throughput sequencing technologies has generated an unprecedentedly huge amount of complex and heterogeneous biological data and has opened the door for researchers to carry out different types of analysis in various areas of the biological sciences, like genomics, transcriptomics, proteomics, metabolomics, etc. Hence, the need for fast and efficient computational tools has been felt for performing various types of in silico analysis of these biological data. Over the years, many computer programming languages like Java, Perl, R, Python, etc. have emerged for software development, which helps in the visualization of generated data and in the subsequent statistical data analysis through which proper inference can be drawn. The main endeavour in bringing out this book is to provide wide-ranging programming solutions for genomics data handling and analysis to address complex problems associated with crop improvement programs. Special emphasis has been given to the R and Perl programming languages, as these are highly popular amongst biologists, especially biometricians and bioinformaticians. Key features of these two programming languages, viz. installation procedures, basic syntax, etc., and their applications in areas of biological data analysis, supported by suitable examples, are described in the successive sections. The targeted audiences of this book are primarily students, data analysts, scientists and researchers from various fields of the biological sciences. This book will guide the readers in solving various emerging bioinformatics problems, such as assembly of NGS data, annotation, interpretation, visualization, mapping of QTLs, association mapping, genomic selection, etc.

Key words Bioconductor, BioPerl, CRAN, DESeq2, ggplot2, GWAS, MASS, Perl, R, Regular expression, rMVP, RStudio
1 Introduction

With the advent of Next Generation Sequencing (NGS) technologies and their high-throughput downstream analysis algorithms, the interpretation of complex biological processes has become a demanding task in which the use of appropriate programming languages is crucial. Programming languages make rigorous computation simple and greatly increase the power of computation; they are designed to interact with machines so that a problem can be solved. The massive computational power of programming
languages helps data scientists and software programmers to take on larger and more complex endeavors, and the use of programming languages makes code reusable. Everything we do today on our computers and mobile devices has evolved on the basis of various programming languages; in a nutshell, programming languages have revolutionized our daily activities and proved to be a blessing for mankind. For statistical data analysis, some commonly used languages and systems include SPSS, SAS, STATA, R, Perl, Python, C, C++, Java, etc. C and C++ are fully compiled languages, suitable for system-intensive tasks, and are also very useful in the context of biological research. Each of these languages has some specific features. Java source code must first be compiled to bytecode, which is then executed by the Java interpreter; in contrast, code written in Python, Perl, or R does not need to be compiled separately. Python and Perl are scripting languages suitable for parsing, web scripting and pipeline development; both use automatic memory management and have large free libraries. R is very useful for statistical data analysis, data mining and machine learning-based analysis. Open-source software projects like BioPerl, BioJava, BioPython, BioRuby, etc. have also been developed to facilitate research in computational biology and bioinformatics. In the following sections, the basic features of two very important programming languages, viz. Perl and R, and their applications in omics research are discussed briefly.
2 Perl

Perl (Practical Extraction and Report Language) was created by Larry Wall in 1987. It is a general-purpose programming language originally developed for text manipulation and is now used for a wide range of tasks including system administration, web development, network programming, GUI development, and more. In the hierarchy of programming languages, Perl is located halfway between high-level languages such as Pascal, C and C++, and shell scripting languages (languages that add control structure to Unix command-line instructions) such as sh, sed and awk. Key features of Perl are as follows:
(a) Perl is a stable, open-source and platform-independent programming language.
(b) Perl supports Unicode.
(c) It supports both procedural and object-oriented programming.
(d) Perl is an interpreted language; no separate compilation is needed.
(e) Fast and easy text processing and file handling capability.
(f) Perl offers extremely strong regular expression capabilities, especially pattern matching, which is very useful for bioinformatics research.
(g) Perl DBI supports third-party databases including Oracle, Sybase, MySQL, etc. and makes web database integration easy.
(h) Perl is known as "the duct tape of the internet".
(i) It can handle encrypted web data, including e-commerce transactions.
(j) Perl is extensible. There are >20,000 third-party modules available from the Comprehensive Perl Archive Network (CPAN).

2.1 Installation of Perl
Perl can be installed on any operating system, i.e., Windows, macOS or Linux. The installation file can be downloaded from https://www.perl.org/; the most recent version of Perl at the time of writing is 5.34.0. After the installation process, open the command prompt and type:
C:\Users>perl -v
If the installation is successful, the following output will be shown on screen (Fig. 1.1).

2.2 Perl Basics

Perl scripts are written in a text file and saved with the .pl extension. Each instruction line must end with a semicolon (;). A line starting with a # indicates a comment in Perl, and the interpreter ignores it during execution. To execute a program, we use the command "perl file_name.pl" (Fig. 1.2). Perl has three basic data types:
(a) Scalar: It is preceded by a dollar ($) sign. It stores either a number, a character or a string. Example: $num = 10; or, $char = 'd'; or, $string = "perl doc"; etc.
(b) Array: It is preceded by an at (@) sign. It stores an ordered list of scalar values. Example: @numbers = (10, 45, 89); or, @list = ("Ram", "Shyam", "Rabi"); etc. We have to use index numbers to access any value in the array (Fig. 1.3); indexing starts at 0.
(c) Hash: It is preceded by a percent (%) sign. Hashes are unordered sets of key/value pairs. Keys should be unique strings; values can be any scalars.
Similar to other programming languages, Perl also supports various arithmetic, logical, relational and bitwise operators.
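Before moving on, here is a minimal script pulling the three data types together (the names and values are illustrative only, not taken from the figures):

#!/usr/bin/perl
# scalar: holds a single number or string
my $gene = "Hd3a";
# array: ordered list of scalars, accessed by 0-based index
my @samples = ("leaf", "root", "seed");
print "Second sample: $samples[1]\n";    # prints "root"
# hash: unordered key/value pairs with unique string keys
my %lengths = ("Hd3a" => 537, "OsMADS1" => 741);
print "Length of $gene: $lengths{$gene}\n";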
Fig. 1.1 Installation of Perl in Windows operating system
Fig. 1.2 (a) Simple Perl command, (b) Execution of a Perl programme
Fig. 1.3 (a) Example of a Perl array, (b) Output of second element of an array
Apart from these, Perl has some unique string and list handling operators. Some of these are listed below:
(a) Pop: It removes the last value of a list.
(b) Push: It inserts a new value at the end of a list.
(c) Shift: It removes the first value of a list.
(d) Unshift: It inserts a new value at the beginning of a list (Fig. 1.4).
(e) Reverse: It takes a list of values and returns the list in the opposite order.
(f) Sort: It takes a list of values and returns the list sorted in lexicographical order (Fig. 1.5).
Fig. 1.4 (a) Codes and (b) Outputs of Pop, Push, Shift and Unshift functions in Perl
(g) Split: It cuts a string wherever a particular separator occurs and makes a list out of the pieces.
(h) Join: It joins the values in a list using a particular joining character and makes a string out of them (Fig. 1.6).
Fig. 1.5 (a) Codes and (b) Outputs of Reverse and Sort functions in Perl
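As a quick illustration of the list operators described above (a hypothetical example; the identifiers are arbitrary):

my @ids = ("seq1", "seq2", "seq3");
push(@ids, "seq4");        # @ids is now seq1 seq2 seq3 seq4
my $last  = pop(@ids);     # removes and returns "seq4"
my $first = shift(@ids);   # removes and returns "seq1"
unshift(@ids, "seq0");     # @ids is now seq0 seq2 seq3
my @rev    = reverse(@ids);          # seq3 seq2 seq0
my @sorted = sort(@ids);             # seq0 seq2 seq3
my @fields = split(/,/, "A,T,G,C");  # list: A T G C
my $joined = join("-", @fields);     # string: "A-T-G-C"
print "$joined\n";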
2.3 Perl Regular Expression
A regular expression is a string of characters that defines the pattern or patterns we are searching for. To apply a regular expression, we use the pattern binding operators =~ (for affirmation) or !~ (for negation). There are two main regular expression operators in Perl:
• Match (m//): It is used to match a string or statement against a regular expression.
• Substitute (s///): It replaces the matched text with some new text (Fig. 1.7).
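A small sketch of both operators applied to an arbitrary example string (assumed here, not taken from the figures):

my $seq = "ATGGCGTGA";
if ($seq =~ m/ATG/) {
    print "Start codon found\n";     # match operator
}
# substitute operator: delete a trailing TGA stop codon
(my $no_stop = $seq) =~ s/TGA$//;
print "$no_stop\n";                  # prints ATGGCG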
Fig. 1.6 (a) Codes and (b) Outputs of Split and Join functions in Perl
In bioinformatics data analysis, these regular expression operators can be used in many ways. The simplest example is the conversion of a DNA sequence to an RNA sequence by substituting T with U (Fig. 1.8). Similarly, the reverse complement of a given DNA sequence can also be obtained by using these matching and substitution operators.
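For instance, the conversion shown in Fig. 1.8 and a reverse complement might be written as follows (a minimal sketch; the sequence is arbitrary):

my $dna = "ATGCTTAGC";
# DNA to RNA: substitute every T with U (the g modifier means global)
(my $rna = $dna) =~ s/T/U/g;
print "RNA: $rna\n";
# reverse complement: reverse the string, then transliterate the bases
my $revcomp = reverse($dna);
$revcomp =~ tr/ACGT/TGCA/;
print "Reverse complement: $revcomp\n";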
2.4 Application of Perl in Bioinformatics

Due to its powerful regular expressions for pattern matching and its easy programming features, Perl has become a very popular programming language for bioinformatics data analysis.
Fig. 1.7 (a) Codes and (b) Outputs of Match and Substitution operations in Perl
Fig. 1.8 Conversion of DNA to RNA sequence using Perl regular expression
Several tools have been developed using Perl to perform biological data analysis. Among these, the two most important tools are MISA [1] and Primer3 [2].
MISA (MIcroSAtellite Identification tool) is used to mine SSR markers from a given FASTA sequence. A Primer3 input file can be prepared using the MISA output and the flanking sequences corresponding to each microsatellite marker; Primer3 then generates sets of forward and reverse primers for each marker sequence. Integrating these two tools, several databases and web servers have been developed for genomic data analysis, such as TomSatDB [3], BuffSatDB [4] and PolyMorphPredict [5]. Due to the advancement of new and efficient high-throughput sequencing technologies, a huge amount of biological sequence data has become available in various public domains. To analyze these data, machine learning approaches have emerged as a very powerful and efficient tool. Genomic feature extraction from input sequences is the first step of such an analysis, and Perl scripts can be used for this, for example, to calculate the GC% of each DNA sequence in an input multi-FASTA file (Fig. 1.9). A multiline FASTA file can likewise be converted into a single-line FASTA file using a Perl script (Fig. 1.10).
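In the spirit of Fig. 1.9, such a GC% calculation might be sketched as follows (the file handling details and output format are assumptions):

#!/usr/bin/perl
use strict; use warnings;
# run as: perl gc.pl input.fasta
my ($header, $seq) = ("", "");
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^>(.*)/) {           # a new FASTA header starts here
        print_gc($header, $seq) if $seq;
        ($header, $seq) = ($1, "");
    } else {
        $seq .= uc $line;              # accumulate the sequence lines
    }
}
print_gc($header, $seq) if $seq;

sub print_gc {
    my ($h, $s) = @_;
    my $gc = () = $s =~ /[GC]/g;       # count G and C characters
    printf "%s\tGC%% = %.2f\n", $h, 100 * $gc / length($s);
}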
2.5 Perl Modules
A module in Perl is a collection of related subroutines and variables that perform a set of programming tasks. These are reusable. The name of a Perl module ends with a .pm extension. Various Perl modules are available on the Comprehensive Perl Archive Network (CPAN).
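For illustration, a tiny module of one's own might look as follows (the file name, package name and subroutine are hypothetical):

# File: SeqUtils.pm
package SeqUtils;
use strict; use warnings;
use Exporter qw(import);
our @EXPORT_OK = qw(gc_content);

# return the GC percentage of a DNA string
sub gc_content {
    my ($seq) = @_;
    my $gc = () = $seq =~ /[GCgc]/g;
    return 100 * $gc / length($seq);
}

1;  # a module must end with a true value

A script in the same directory could then reuse the subroutine with: use SeqUtils qw(gc_content); print gc_content("ATGGCC"), "\n";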
2.6 BioPerl
BioPerl is an active open-source software project supported by the Open Bioinformatics Foundation, and it played an integral role in the Human Genome Project. BioPerl consists of a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics data analysis. These modules are reusable and extensible and can be used as a standard for manipulating molecular biological data. Several tasks can be performed using these modules, such as accessing databases in different formats, sequence manipulation, and the execution and parsing of the results of molecular biology programs. BioPerl modules can be obtained from www.BioPerl.org. BioPerl modules are named in the form Bio::<module name>; to use any module for data analysis, write: use Bio::<module name>;
Some important BioPerl modules along with their specific functionalities for bioinformatics data analysis are mentioned in Table 1.1.
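As a flavour of how these modules are used, the sketch below reads a FASTA file with Bio::SeqIO and prints the ID and length of each sequence (assuming BioPerl is installed; the input file name is an assumption):

use strict; use warnings;
use Bio::SeqIO;

# open a multi-FASTA file for reading
my $in = Bio::SeqIO->new(-file => "input.fasta", -format => "fasta");
while (my $seq = $in->next_seq) {
    # each $seq is a Bio::Seq object with id(), length(), seq(), etc.
    print $seq->id, "\t", $seq->length, "\n";
}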
3 R

R was designed by Ross Ihaka and Robert Gentleman in the 1990s at the University of Auckland, New Zealand (https://www.r-project.org/), and is currently developed by the R Core Team.
Fig. 1.9 Calculation of GC% for each DNA sequence using Perl; (a) Example of an input multi FASTA file, (b) Perl code for calculation of GC% and, (c) Output of the Perl program to calculate GC%
R is an open-source, case-sensitive programming language. It is an implementation of the S programming language, which was developed by John Chambers at Bell Laboratories, and it was named partly after the first letter of its creators' names and partly because of its inheritance from S.
Fig. 1.10 Conversion of a multiline FASTA file to a single file FASTA file using Perl; (a) Example of an input multiline FASTA file, (b) Perl code for conversion of multiline to single line FASTA, and (c) Output single line FASTA file
Table 1.1 BioPerl modules and their functions

Module name | Function
Bio::Seq | Sequence object creation
Bio::SeqIO | Handling of sequence input and output
Bio::Seq::LargeSeq | Handling of large sequences
Bio::Tools::SeqStats | Provides sequence statistics like count_monomers, count_codons, molecular weights, reverse complements, etc.
Bio::DB::GenBank | Provides access to the GenBank database
Bio::Tools::Run::RemoteBlast | Running BLAST remotely
Bio::Tools::Run::StandAloneBlast | Running BLAST locally on a PC
Bio::Tools::HMMER::Results | Parsing HMMER results
Bio::SimpleAlign | Multiple sequence alignment
Bio::AlignIO | Pairwise alignment
Bio::ClustalW | MSA using ClustalW
Bio::SeqFeature | Sequence feature object
Bio::Tools::RestrictionEnzyme | Locates restriction sites on sequences
Bio::Tools::SigCleave | Finding amino acid cleavage sites
Bio::Tools::SeqPatterns | Finding regular expression patterns in sequences
Bio::Tools::Oddcodes | Rewriting amino acid sequences in abbreviated codes for statistical analysis
Bio::LocationI | Location information of a sequence
Bio::Tools::EPCR | Parsing output of the ePCR program
Bio::TreeIO | Dealing with evolutionary trees
R is extensively used by software programmers, statisticians, data scientists and data miners, and it has numerous applications in domains like healthcare, academics, consulting, finance, media, and many more. Its vast applicability in statistics, data visualization and machine learning has contributed to the analysis and interpretation of biological data from various experimental techniques. Here are some advantages of the R software:
1. Open source: R is an open-source programming language, freely available, and users do not need to incur any cost to use it.
Fig. 1.11 Interface of R-GUI after successful installation of R
2. Fast and continuous growth: Any user can contribute to the further growth of its capabilities by creating packages. This convenience is attracting more users and creating a larger user base.
3. Platform independence: The performance and usability of R are independent of the operating system; it can easily be used on Windows, Linux and macOS.
3.1 Installation of R
The R programming language can be installed on Windows, Linux or macOS systems directly from https://cran.r-project.org/ (Fig. 1.11). As of 31st December 2021, the latest version is R 4.1.2. R is not only a statistics package; it also allows integration with other languages (C, C++). Objects, functions and packages can easily be created in R, and since R is syntactically similar to other widely used languages, it is easy to code in and to learn. Programs can be written in R in any of the widely used IDEs like RStudio, Rattle, Tinn-R, etc. The RStudio IDE is available for desktop and server use: the desktop version runs the program locally on the desktop, whereas the server version provides a browser-based interface to a version of R running on a remote Linux server. RStudio provides both open-source and commercial editions and can be downloaded from https://www.rstudio.com/products/rstudio/download/. To use RStudio, the R software must be installed on the machine beforehand. After successful installation, the interface of RStudio will look as shown in Fig. 1.12.
Fig. 1.12 Interface of RStudio after installation
The top left window in RStudio is the editor, where code is written. The advantage of using this editor over the R-GUI is that any error in the code can be corrected in its original location without rewriting it, whereas in the R-GUI, if an error is found in the code after execution, the corrected code must be typed again.

The bottom left window has three tabs: Console, Terminal, and Jobs. The 'Console' tab is similar to the R console, where code is executed. The 'Terminal' tab provides access to the system shell; it supports advanced source control operations, execution of long-running jobs, remote logins, system administration of RStudio Workbench or RStudio Server, and full-screen terminal programs. The 'Jobs' tab lets a time-consuming R script run in the background while the IDE remains usable.

The top right window has four tabs: Environment, History, Connections, and Tutorial. In the 'Environment' tab, one can see the objects loaded in the current session. The 'History' tab contains the history of executed code. The 'Connections' tab helps to establish connections to data sources and access the data. The 'Tutorial' tab hosts tutorials powered by the "learnr" package.

The bottom right window has five tabs: Files, Plots, Packages, Help, and Viewer. The 'Files' tab gives access to the file directory on the hard drive. The 'Plots' tab shows all the plots produced from R code. The 'Packages' tab lists all the installed R packages and allows users to install and load packages. The 'Help' tab opens the help pages for R functions. The 'Viewer' tab allows users to view local web content.
3.2 Statistical Analysis and Visualization Using R
R software is very useful for statistical analysis. All types of basic statistical analysis can be done using R, like descriptive statistics, testing of hypotheses, non-parametric tests, matrix algebra, multivariate analysis, ANOVA, regression analysis, time-series analysis, machine learning, etc. R has a huge number of built-in functions in its base version, and these cover almost all basic statistical analyses. For more specific analyses, dedicated R packages are used. A few of these are built into the base R environment: 29 packages are supplied with R (called "standard" and "recommended" packages), and many more are available through CRAN (the Comprehensive R Archive Network). These packages consist of functions and code to perform mathematical and statistical analyses or to support visualization. A fascinating aspect of R is that it allows its users to extend its capabilities by creating new packages. This facility is very useful for programmers and academicians who wish to popularize the methodologies they have developed in their respective subjects, by making R packages so that others can use them easily. In RStudio, packages can be installed and updated from the 'Packages' tab in the bottom right window, after which a package can be loaded by clicking the check box on its left. Alternatively, installation and loading of packages can also be done by running code as follows:
> install.packages("package_name") # installation of package
> library(package_name) # loading of package
The text written after the hashtag (#) symbol is a comment; the program ignores comments in the code. Note that a new package needs to be installed only once, but it must be loaded in each new session. R is also very good for visualization: various publication-quality plots can be drawn using R code, such as line diagrams, bar diagrams, pie charts, box plots, histograms, density plots, time plots, scatter plots, Q-Q plots and mosaic plots. Three-dimensional plots can also be drawn in R. As an example, consider the pairwise plots of the Iris dataset, which is readily available in the built-in R package "datasets". The dataset contains 50 sample observations of four features (sepal length, sepal width, petal length and petal width) for each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). The R code and the pairwise scatter plots produced using the 'plot()' function are given in Fig. 1.13. In the following section, we discuss some popular methods used in statistical data analysis. For the above-mentioned Iris dataset, we can calculate the pairwise correlations among the four variables. This can be done using the "psych" package, and its output is given in Fig. 1.14 along with a heat map of the correlation matrix.
> library(datasets)
> data(iris)
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
> plot(iris)

Fig. 1.13 (a) Summary of the Iris dataset and R code to create a scatter plot, (b) Output pairwise scatter plots
Statistical and Biological Data Analysis Using Programming Languages
17
> library(psych) > corr corr Call:corr.test(x = iris[, 1:4], y = NULL, use = "pairwise", method = "pearson") Correlation matrix Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length 1.00 -0.12 0.87 0.82 Sepal.Width -0.12 1.00 -0.43 -0.37 Petal.Length 0.87 -0.43 1.00 0.96 Petal.Width 0.82 -0.37 0.96 1.00 Sample Size [1] 150 Probability values (Entries above the diagonal are adjusted for multiple tests.) Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length 0.00 0.15 0 0 Sepal.Width 0.15 0.00 0 0 Petal.Length 0.00 0.00 0 0 Petal.Width 0.00 0.00 0 0 To see confidence intervals of the correlations, print with the short= FALSE option > cor.plot(corr$r)
(a)
(b) Fig. 1.14 (a) R code and correlation matrix for the features of Iris dataset, (b) Heat map of the correlation matrix
method. The Linear regression analysis can be done using the ‘lm()’ function. R programming is also being widely used in the area of artificial intelligence and machine learning. Various techniques appeared in literature around a few decades ago. But due to a lack of data and computational processing power, they were not popularized at that
18
Ritwika Das et al.
time and nowadays they are in limelight. Machine learning techniques can be broadly classified as supervised and unsupervised learning methods. Linear discriminant analysis is one of the popularly used supervised machine learning methods. It can be done in R by using the ‘lda()’ function of the “MASS” package. The ‘createDataPartition()’ function of the “caret” package can be used to partition the whole dataset into a training and a testing dataset. Let us take randomly 80% of the data as the training dataset and the remaining 20% of data as the testing dataset. The linear discriminant analysis of the Iris dataset can be done as in Fig. 1.15. From the above output, it can be seen that the training dataset comprises of equal proportion of each of the species. And for the testing dataset, all the predictions are correct. Again, cluster analysis is one of the most popular unsupervised machine learning methods. Hierarchical cluster analysis can be done in R using the in-built ‘hclust()’ function. For illustration purposes, let us consider the USArrests dataset from the package “datasets”. This data set contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973 along with the population living in urban areas. The hierarchical clustering of these 50 states based on this dataset using R software can be done as in Fig. 1.16. R is also very useful for data visualization. The “ggplot2” and “lattice” are two popular packages for this purpose. The former package creates graphics based on ‘The Grammar of Graphics’ and the latter is an implementation of Trellis graphics for R, a high-level data visualization system with an emphasis on the multivariate dataset. 3.3 Application of R in Bioinformatics
R is one of the most widely-used and powerful programming languages in bioinformatics. Due to its data handling, modeling capabilities and flexibility, R is becoming the most widely used software in bioinformatics. In life sciences especially in bioinformatics, R has been frequently used for statistical analysis of biological data from various experiments like microarray, RNA-Seq, ChIP-Seq, whole-genome sequencing, small RNA-seq, single-cell RNA sequencing, etc. and also for data visualizations to create high quality multi-dimensional interactive graphs and plots. For example, R can be used for building co-expression networks between genes using their expression values which can reveal many interactions in pathways that give insight into the function of genes altogether. Heatmaps can be generated using R to visualize the differential expression patterns of genes. There are so many R packages available that are helpful for undergoing machine learning based biological data analysis. In the area of omics data analysis, Bioconductor (http://www.bioconductor.org/) is very useful software. It is an R programming language based free, open-source and open development software project for the analysis and
Statistical and Biological Data Analysis Using Programming Languages
19
> library(MASS) > library(caret) > set.seed(1234) # Take any arbitrary number as seed > training training.data testing.data lda.analysis lda.analysis Call: lda(Species ~ ., data = training.data) Prior probabilities of groups: setosa versicolor virginica 0.3333333 0.3333333 0.3333333
Group means: Sepal.Length Sepal.Width Petal.Length Petal.Width setosa 5.0050 3.4150 1.4575 0.2525 versicolor 5.9175 2.7700 4.2400 1.3275 virginica 6.6950 3.0125 5.6050 2.0225 Coefficients of linear discriminants: LD1 LD2 Sepal.Length 0.929781 -0.4358611 Sepal.Width 1.360846 -2.0827471 Petal.Length -2.214094 0.8831912 Petal.Width -2.935442 -2.2903447 Proportion of trace: LD1 LD2 0.9913 0.0087 > species.prediction species.prediction$class [1] setosa setosa setosa setosa setosa setosa [7] setosa setosa setosa setosa versicolor versicolo r [13] versicolor versicolor versicolor versicolor versicolor versicolor [19] versicolor versicolor virginica virginica virginica virginica [25] virginica virginica virginica virginica virginica virginica Levels: setosa versicolor virginica > table(testing.data$Species,species.prediction$class) setosa versicolor virginica setosa 10 0 0 versicolor 0 10 0 virginica 0 0 10
Fig. 1.15 Output of the linear discriminant analysis
It provides access to powerful statistical and graphical methods for the analysis of genomic data, and it facilitates the integration of biological metadata like GenBank, GO, LocusLink and PubMed in the analysis of experimental data.
> library(datasets)
> data("USArrests")
> clust <- hclust(d = dist(USArrests), method = "centroid")
> clust

Call:
hclust(d = dist(USArrests), method = "centroid")

Cluster method   : centroid
Distance         : euclidean
Number of objects: 50

> plot(clust, hang = -0.01, main = "Hierarchical Cluster: Centroid method",
+      sub = "")
Fig. 1.16 (a) R code for hierarchical clustering of the USArrests dataset, (b) Dendrogram of the cluster analysis
Bioconductor contains a large number of R packages highly useful for analyzing biological datasets such as microarray and RNA-seq data. There are two releases of this software every year; the current release of Bioconductor (version 3.14) contains 2083 R packages for various data analyses in the area of bioinformatics research. To install any R package from Bioconductor, we have to use the following code:
> if (!require("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> BiocManager::install("package_name")
Table 1.2 R packages for biological data analysis

Type of analysis | R package
Exploratory data analysis and data visualization for biological sequence (DNA and protein) data | seqinr
Fitting generalized linear models | glm2
Embedding the SQLite database engine in R and providing an interface compliant with the DBI package; useful for quality control and pre-processing of SNP data | RSQLite
Genome-wide association studies | rMVP
Classification and regression training | caret
Support Vector Machine approach | e1071
Random forest approach | randomForest
Creating graphs and plots for data visualization | ggplot2
Generating heatmaps in R | pheatmap
Efficient manipulation of biological strings | Biostrings
Linear models for microarray data | limma
Differential gene expression analysis based on the negative binomial distribution | DESeq2
Handling and analysis of high-throughput microbiome census data | phyloseq
Gene set variation analysis for microarray and RNA-seq data | GSVA
Routines for the functional analysis of biological networks | BioNet
Multiplex PCR primer design and analysis | openPrimeR
Classify diseases and build associated gene networks using gene expression profiles | geNetclassifier
Some important R packages (from CRAN and Bioconductor) that are helpful for the statistical analysis and visualization of omics data are listed in Table 1.2. A GWAS analysis can be carried out in R in the following ten-step process using the R package "rMVP":
1. Install the R package "rMVP" for GWAS computation:
> install.packages("rMVP")
2. Load the installed package:
> library(rMVP)
3. Set the working directory to the folder containing the genotype and phenotype data for the GWAS analysis:
> setwd("/Users/GWAS")
4. The existing genotypic and phenotypic data need to be formatted for further computation. The acceptable file formats of genotypic data for GWAS computation using the rMVP package are HapMap, VCF, binary and numeric. Caution: the names of the genotypes should be the same in the genotypic and the phenotypic data; otherwise, an error message will be displayed. The genotypic and phenotypic files used here are mdp_genotype.hmp.txt and mdp_phenotype.txt, respectively. The MVP.Data function will format the data and produce the output files required for the GWAS analysis (the listing of the output directory below is a plausible reconstruction of garbled source code):
> MVP.Data(fileHMP = "mdp_genotype.hmp.txt", filePhe = "mdp_phenotype.txt",
+          sep.map = "\t", sep.phe = "\t", fileKin = T, out = "res.hmp")
> show_res <- list.files()
> show_res
The MVP.Data function will generate the following files: "res.hmp.geno.bin", "res.hmp.geno.desc", "res.hmp.geno.ind", "res.hmp.geno.map", "res.hmp.kin.bin", "res.hmp.kin.desc", "res.hmp.phe".
5. Import the formatted genotypic data using the attach.big.matrix function:
> genotypic <- attach.big.matrix("res.hmp.geno.desc")
6. Import the phenotypic data:
> phenotypic <- read.table("res.hmp.phe", header = TRUE)
7. Import the map information:
> map_information <- read.table("res.hmp.geno.map", header = TRUE)
8. Import the kinship matrix:
> kinship <- attach.big.matrix("res.hmp.kin.desc")
9. Create a folder for the results and set it as the working directory:
> dir.create("PCA_Kinship")
> setwd(paste0(wd, "/PCA_Kinship"))
10. Compute the genome-wide association using the MVP function. After this, all the results will be automatically saved in the PCA_Kinship folder created earlier (the exact arguments of the call below are reconstructed from the variable names in the garbled source):
> GWAS_PCA_Kin <- MVP(phe = phenotypic, geno = genotypic, map = map_information,
+                     K = kinship, method = c("GLM", "MLM", "FarmCPU"))

As an example of differential gene expression analysis, the sample dataset from the Bioconductor package "airway" can be used to prepare a count matrix and a sample metadata file (Fig. 1.17):
> BiocManager::install("airway")
> library(airway)
> data(airway)
> airway
class: RangedSummarizedExperiment
dim: 64102 8
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
> sample_info <- as.data.frame(colData(airway))
> sample_info <- sample_info[, c(2, 3)]
> sample_info$dex <- gsub("trt", "treated", sample_info$dex)
> sample_info$dex <- gsub("untrt", "untreated", sample_info$dex)
> names(sample_info) <- c("cellLine", "dexamethasone")
> # Get the samplewise metadata file
> write.table(sample_info, file = "/sample_info.csv", sep = ',',
+             col.names = T, row.names = T, quote = F)
> # Get the matrix of read counts for each gene in every sample
> countsData <- assay(airway)
> write.table(countsData, file = "/counts_data.csv", sep = ',',
+             col.names = T, row.names = T, quote = F)
Fig. 1.17 (a) R code to collect the sample dataset from the "airway" package, (b) Screenshot of the count matrix, (c) Screenshot of the metadata file
Before the analysis, we have to make sure that the column names of the count matrix are present in, and in the same order as, the row names of the metadata file. If so, we load the package "DESeq2" to perform the subsequent differential gene expression analysis: we create a DESeqDataSet object and then run the 'DESeq()' function (Fig. 1.18). Here, we are trying to find the genes that are differentially expressed under dexamethasone-treated conditions as compared to untreated conditions; hence, the reference level is set to 'untreated'. After the analysis, the result contains base means, log2FoldChange values, p-values, adjusted p-values, etc. for each gene. Testing at the 1% level, if the adjusted p-value for a gene is >0.01, the result may have been obtained purely by chance, i.e., it is non-significant; otherwise, the gene is declared differentially expressed (adjusted p-value <0.01). For such a gene, if the log2FoldChange value is >0 the gene is upregulated, and if it is <0 the gene is downregulated.

R can also be used to extract numerical features from protein sequences, for example with the "protr" and "seqinr" packages, for subsequent model building:
1. Install and load the required packages:
> install.packages("protr")
> install.packages("seqinr")
> library(protr)
> library(seqinr)
2. Select the protein sequence file for feature extraction.
> BiocManager::install("DESeq2")
> library(DESeq2)
> # read in counts data
> counts_data <- read.csv("counts_data.csv", row.names = 1)
> # read in sample info
> colData <- read.csv("sample_info.csv", row.names = 1)
> # making sure the row names in colData match the column names in counts_data
> all(colnames(counts_data) %in% rownames(colData))
[1] TRUE
> # are they in the same order?
> all(colnames(counts_data) == rownames(colData))
[1] TRUE
> dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = colData,
+                               design = ~ dexamethasone)
> dds
class: DESeqDataSet
dim: 64102 8
metadata(1): version
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(2): cellLine dexamethasone
> # pre-filtering: removing rows with low gene counts
> # keeping rows that have at least 10 reads total
> keep <- rowSums(counts(dds)) >= 10
> dds <- dds[keep, ]
> # set the factor level
> dds$dexamethasone <- relevel(dds$dexamethasone, ref = "untreated")
> # --------Run DESeq----------------------
> dds <- DESeq(dds)
> res <- results(dds)
> res
log2 fold change (MLE): dexamethasone treated vs untreated
Wald test p-value: dexamethasone treated vs untreated
DataFrame with 22369 rows and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue     padj
ENSG00000000003  708.5979     -0.3788229  0.173155 -2.187769 0.0286865 0.138470
ENSG00000000419  520.2963      0.2037893  0.100742  2.022878 0.0430857 0.182998
ENSG00000000457  237.1621      0.0340631  0.126476  0.269325 0.7876795 0.929805
ENSG00000000460   57.9324     -0.1171564  0.301583 -0.388472 0.6976669 0.894231
ENSG00000000971 5817.3108      0.4409793  0.258776  1.704099 0.0883626 0.297042
...                   ...            ...       ...       ...       ...      ...
ENSG00000273483   2.68955       0.600441  1.084447  0.553684 0.5797949       NA
ENSG00000273485   1.28646       0.194074  1.346550  0.144127 0.8854003       NA
ENSG00000273486  15.45244      -0.113321  0.426034 -0.265991 0.7902460 0.930697
Fig. 1.18 Differential gene expression analysis using the “DESeq2” package in R
> x <- readFASTA("protein.fasta")
The example input file (protein.fasta) contains:
>ABU86209.1 putative NB-ARC domain-containing protein, partial [Oryza sativa]
NLEGQLEIYNLKNVKRIEDVKGVNLHTKENLRHLTLCWGKFRDGSMLAENANEVLEALQPPKRLQSLKIW
RYTGLVFPRWIAKTSSLQNLVKLFLVNCDQCQKLPTIWCLKTLELLCLDQMKCIEYICNYDTVDAEECYD
ISQAFPKLREMTLLNMQSLKGWQEVGRSEIITLPQLEEMTVINCPMFKMMPATPVLKHFMVEGEPKLCSS
3. Change the input to character format for further processing:
> y <- x[[1]]
4. Extract the amino acid composition (AAC) from the sequence:
> aac_feature <- extractAAC(y)
> head(aac_feature)
A 0.0333333333333333
R 0.0380952380952381
N 0.0571428571428571
D 0.0333333333333333
C 0.0476190476190476
E 0.0857142857142857
ENSG00000273487   8.16327       1.017800  0.575797  1.767637 0.0771216 0.271627
ENSG00000273488   8.58437       0.218105  0.570714  0.382161 0.7023421 0.896550
> summary(res)

out of 22369 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 1884, 8.4%
LFC < 0 (down)     : 1502, 6.7%
outliers [1]       : 51, 0.23%
low counts [2]     : 3903, 17%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> res0.01 <- results(dds, alpha = 0.01)
> summary(res0.01)

out of 22369 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)       : 1030, 4.6%
LFC < 0 (down)     : 708, 3.2%
outliers [1]       : 51, 0.23%
low counts [2]     : 5200, 23%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Fig. 1.18 (continued)
> # MA plot
> plotMA(res)
> # Volcano plot
> library(ggplot2)
> library(tidyverse)
> df <- as.data.frame(res)
> df$diffexpressed <- "NO"
> # if log2FoldChange > 0 and padj < 0.01, set as "UP"
> df$diffexpressed[df$log2FoldChange > 0 & df$padj < 0.01] <- "UP"
> # if log2FoldChange < 0 and padj < 0.01, set as "DOWN"
> df$diffexpressed[df$log2FoldChange < 0 & df$padj < 0.01] <- "DOWN"
> ggplot(df, aes(log2FoldChange, -log10(padj), col = diffexpressed)) +
+   geom_point() +
+   scale_color_manual(values = c("red", "black", "green"))
Warning message:
Removed 3954 rows containing missing values (geom_point).
> # Developing heatmap of first 10 genes for better demonstration
> library(pheatmap)
> library(RColorBrewer)
> breaksList = seq(-0.4, 0.5, by = 0.04)
> rowLabel = row.names(counts_data[1:10, ])
> pheatmap(df$log2FoldChange[1:10],
+          color = colorRampPalette(c("dark blue", "white", "yellow"))(25),
+          breaks = breaksList, border_color = "black", cellheight = 25,
+          cellwidth = 25, cluster_rows = F, cluster_cols = F, fontsize = 12,
+          labels_row = rowLabel)
Fig. 1.19 (a) R code to visualize the result of the differential gene expression analysis, (b) MA plot showing significantly upregulated and downregulated genes as blue dots, (c) Volcano plot representing upregulated genes as green, downregulated genes as red and non-significant genes as black dots, (d) Heatmap representing the expression levels of the first ten genes in terms of log2FoldChange values on a scale of -0.4 to 0.4, where blue represents downregulated genes, yellow represents upregulated genes, and the expression levels of the remaining genes are represented by a gradation of colour between blue and yellow
Fig. 1.19 (continued)
5. Extract the CTD (composition/transition/distribution) descriptors from the sequence:
> ctd_feature <- extractCTDC(y)
> head(ctd_feature)
hydrophobicity.Group1 0.347619047619048
hydrophobicity.Group2 0.252380952380952
hydrophobicity.Group3 0.4
normwaalsvolume.Group1 0.295238095238095
normwaalsvolume.Group2 0.452380952380952
normwaalsvolume.Group3 0.252380952380952
6. Extract the PAAC (pseudo amino acid composition) feature:
> paac_feature <- extractPAAC(y)
> head(paac_feature)
Xc1.A 1.83953032139242
Xc1.R 2.10232036730562
Xc1.N 3.15348055095843
Xc1.D 1.83953032139242
Xc1.C 2.62790045913203
Xc1.E 4.73022082643765
Similarly, other sequence- and structure-based features can also be extracted using various R packages for subsequent model building and biological data analysis in R.
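As a final illustrative sketch (assuming the objects from the previous steps exist), the individual descriptor vectors can be concatenated into one feature vector per protein; stacking such vectors row-wise yields a feature matrix for model building:

# combine the descriptors computed above into a single named feature vector
features <- c(aac_feature, ctd_feature, paac_feature)
length(features)  # total number of extracted features
# for a list of sequences, the same idea gives a feature matrix:
# feature_matrix <- do.call(rbind, lapply(seqs, function(s)
#   c(extractAAC(s), extractCTDC(s), extractPAAC(s))))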
4 Conclusion

The advancement of programming languages makes rigorous data analysis and graphical visualization easy to implement. Both R and Perl are open-source software, which multiplies their popularity. The powerful pattern recognition features of Perl have made feature extraction, SSR and SNP mining, polymorphism detection and several other omics analyses faster, easier and more user-friendly, and despite some limitations, the introduction of BioPerl modules has facilitated many kinds of omics data analysis. R software is very popular among data scientists. The development of several R packages under the CRAN and Bioconductor projects,
along with their detailed documentation, has helped researchers to understand and easily apply code to huge and complex biological datasets for various statistical analyses and for the visualization of results as graphs, plots, etc. Apart from R and Perl, many other programming languages, like Python, C, C++ and Java, have been found very useful for data analysis, and open-source projects like BioJava, BioPython and BioRuby have been developed to carry out data analyses in computational biology and bioinformatics. The area of computer programming is growing very fast, as is the development of new modules, packages and programming languages for specialized data analyses. These modules and in-house scripts are simple, fast and easy to use. Recently, artificial intelligence and big data approaches, which depend heavily on programming languages like R and Python, have been applied in computational biology and bioinformatics. Hence, researchers should develop good computer coding skills to create wonders in the area of omics data analysis in the near future.

References
1. Thiel T, Michalek W, Varshney RK, Graner A (2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet 106(3):411–422
2. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG (2012) Primer3—new capabilities and interfaces. Nucleic Acids Res 40(15):e115
3. Iquebal MA, Arora V, Verma N, Rai A, Kumar D (2013) First whole genome based microsatellite DNA marker database of tomato for mapping and variety identification. BMC Plant Biol 13:197
4. Sarika, Arora V, Iquebal MA, Rai A, Kumar D (2013) In silico mining of putative microsatellite markers from whole genome sequence of water buffalo (Bubalus bubalis) and development of first BuffSatDB. BMC Genomics 14:43
5. Das R, Arora V, Jaiswal S, Iquebal MA, Angadi UB, Fatma S, Singh R, Shil S, Rai A, Kumar D (2019) PolyMorphPredict: a universal web-tool for rapid polymorphic microsatellite marker discovery from whole genome and transcriptome data. Front Plant Sci 9:1966
Further Reading
http://www.bioconductor.org/
https://cran.r-project.org/
https://www.perl.org/
https://www.r-project.org/
https://www.rstudio.com/products/rstudio/download/
Chapter 2
Python for Biologists
Rajkumar Chakraborty and Yasha Hasija

Abstract
Recent advancements in the fields of life and computer sciences have revolutionized problem-solving through the application of computational techniques. Python, a free general-purpose programming language, has emerged as a versatile tool for addressing various computational challenges. Python's simplicity and wide-ranging capabilities make it an ideal choice for research labs to tackle their daily obstacles effectively. By harnessing the power of computers and leveraging an appropriate programming language like Python, tasks such as data manipulation, retrieval and parsing of biological data, automation, and simulation of biological problems can be efficiently executed. This chapter aims to provide a comprehensive overview of the Python programming language, shedding light on its functionality and potential applications. The primary focus lies on introducing fundamental concepts such as data structures and flow control statements. Furthermore, it delves into more advanced topics such as file access, functions, and modules, providing in-depth coverage.

Key words Python, Syntax, Programming, Object oriented programming, Data handling
1 Introduction
Before discussing Python, it's important to understand why people in the life sciences should learn to code. A key reason is the flood of biological data, including sequences, annotations, interactions, physiologically active substances, and so on. For example, as of August 2021, GenBank (NCBI) [1], one of the largest databases for nucleotide sequences, contained 232 million sequences (https://www.ncbi.nlm.nih.gov/genbank/statistics/). The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database currently contains 2482.5 million annotated sequence entries (https://www.ebi.ac.uk/ena/about/statistics). It is now possible to create low-cost, high-throughput profiles of biological systems thanks to a wave of new technologies, which generate an enormous amount of biological data [2]. As a result, the age of "big data" is beginning, and it is vital to close the gap between high-throughput technical innovation and the human capacity to store, analyse, and integrate biological datasets [3]. To get the most out of this data, it needs to be
analysed by computers. To assign tasks to computers and achieve the desired results, knowledge of programming languages is essential. Filtering, merging, and sub-setting are all common problems in biological research; identifying similarities between lists and customising data formats for storage are among the preliminary tasks researchers have to perform. The first step in developing a hypothesis from big data is the curation of massive datasets, a time-consuming and tedious task that involves searching databases, literature, and other sources for data. Python, a widely adopted programming language, has an interesting history. Guido van Rossum, the creator of Python, had been primarily responsible for adding new features and fixing bugs until Python 2.0, which was introduced in 2000 by the PythonLabs team at BeOpen [4]; the 2.x series was developed partly to foster community contributions and reduce dependence on van Rossum. The final release in the Python 2 series was Python 2.7, which reached its end of life in 2020. In 2008, Python 3.0 was unveiled. It was not merely a bug-fix release but incorporated significant changes that rendered it incompatible with earlier Python versions [5]. This reform aimed to remove redundant constructs that offered multiple approaches to the same outcome and thereby simplify the language for novice programmers. Python 3 introduced improvements and new features, making it a more powerful and efficient language. Python has gained immense popularity, particularly in the domains of data science and machine learning. Its straightforward syntax and readability have made it highly recommended as a first programming language for beginners, allowing programmers to focus on programming principles rather than syntax. Python's success can also be attributed to its strong community support, where developers actively contribute to its growth and evolution. This community-driven approach has led to the creation of numerous libraries that come pre-installed or can be freely installed; these libraries eliminate the need for custom code and allow for the rapid construction of complex applications. One of the main advantages of learning Python is the availability of popular libraries, making it a versatile language. Frameworks such as Django, Flask, and Pylons are used to create static and dynamic websites, while libraries like Pandas, NumPy, and Matplotlib are useful for data science and visualization. Scikit-learn and TensorFlow are powerful machine learning and deep learning packages, and desktop applications can be built with PyQt, Gtk, and wxWidgets. For mobile applications, frameworks like BeeWare and Kivy are ideal. In biology, Python is widely used for bioinformatics and genomics applications, as it can handle massive biological datasets for analysis and cleaning [6]. With high-throughput technology advancements in the life sciences, Python has become crucial in handling the large volumes of data generated from technologies like
Next-Generation Sequencing (NGS), transcriptomics analysis, systems biology interactions, and more [7, 8].
2 Installing Python
The most recent stable release of the Python programming language is Python 3.10, which can be downloaded for free from the Python Software Foundation's website (https://www.python.org/). After installing Python, launch the Python shell in Windows, or type 'python3' in the Mac or Linux terminal; a banner similar to the following appears:
Python 3.10.x (...) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>>
Instructions are typed immediately following '>>>'. Let's begin by inputting our first command and pressing enter (Table 2.1).
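A quick way to confirm which interpreter is actually running is to ask Python itself; this minimal snippet works on any Python 3 installation:
import sys
print(sys.version)      # full version string of the running interpreter
print(sys.executable)   # path to the Python binary being used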
3 Installing Anaconda Distribution
Python has a plethora of packages for reducing code, but installing each package individually takes time. The Anaconda Python distribution expedites setup: packages such as IPython and Jupyter Notebook come bundled with it. This chapter uses Jupyter Notebook, which is built on IPython, to write and run code. Users can use the Jupyter Notebook App to write, edit, and run code in a web browser, and the app can also be used offline. It provides autocompletion for variables and package names. Because Jupyter notebooks are easy to share, users can download and run this chapter's code. The Anaconda distribution can be found on its official website (https://www.anaconda.com/distribution/).
Table 2.1 Simple Python command to print text
Input:
>>> print('Welcome to the world of python')
Output:
Welcome to the world of python
Fig. 2.1 Python 3 notebook in a new window

3.1 Running Jupyter Notebook
After installing Anaconda, open Jupyter and write the first line of code. Open the Anaconda command prompt in Windows, or use the terminal on Linux or Mac OS. When users type 'jupyter notebook', the application launches at http://localhost:8888 (provided port 8888 is not in use). Alternatively, search for 'Anaconda Navigator' to open a Jupyter notebook. The Files tab displays a list of the files and directories in the current working directory, and open notebooks or terminals are displayed in the Running tab. A cluster is a collection of computers that are linked to a single node. Using the 'New' button, create a new file, folder, or Jupyter notebook: when users click the button, a dropdown menu appears; to create a Python 3 Jupyter notebook, select 'Python 3' beneath the notebook section (Fig. 2.1). The new notebook has no title; it can be renamed by clicking the 'Untitled' text. Code is executed in cells using the Python 3 kernel. Type the same code as in the Python terminal, print("Welcome to Python"), then either click the 'Run' button or press 'Shift' and 'Enter'; the notebook prints "Welcome to Python". Variables and imports are shared among the cells of a notebook, which enables a logical split of code without the need to re-import libraries, re-create variables, or redefine functions in each cell.
There are several fundamental patterns that are utilised in the construction of a programme. These building blocks are not exclusive to Python; they are applicable to programmes written in virtually any programming language:
Input: accepting data or information from the user. The input can come from the keyboard, a file such as a FASTQ or PDB file, or even a sensor such as a biological device or colour detector.
Output: showing the result, storing it in a file, or occasionally sending commands to other devices, as in robotics or automation.
Sequential execution: running statements one after another in the order specified in the script.
Conditional execution: examining certain criteria and executing or skipping certain instructions.
Repeated execution: executing a collection of statements continuously, possibly with small variations.
Reuse: writing a set of instructions once, giving it a name, and then reusing it throughout the programme.
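As a small illustration of these building blocks working together, the hedged sketch below (the sequence and counting task are invented for demonstration) reads a DNA sequence from the keyboard and reports its GC count:
seq = input('Enter a DNA sequence: ')   # input
gc = 0
for base in seq:                        # repeated execution
    if base in 'GC':                    # conditional execution
        gc = gc + 1                     # statements run sequentially inside the loop
print('GC count:', gc)                  # output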
4 Errors in Python
Python generates fairly comprehensive error messages that include information on the statement and library involved, although correcting or comprehending mistakes can still be tedious at times. There are several sorts of mistakes: some are recognised by Python and produce an explicit message, while others are invisible to Python and lead to unexpected results when the programme runs. The three primary types of errors are as follows:
Syntax errors are the simplest to comprehend and rectify. They occur when a statement violates Python's grammar rules and the interpreter becomes confused. Python indicates precisely where it became confused, down to the line and term, and asks the user to rectify it. This is the error most often encountered by beginners; structuring statements correctly is a necessary condition for their execution.
Logic errors are mistakes that Python cannot detect, and they may cause the programme to terminate unexpectedly. They occur when a statement is grammatically correct but does not convey the intended meaning. Logical mistakes are bugs, and the debugging procedure helps here: one has to go through the programme step by step in order to locate the bug.
Semantic errors occur when the statements are grammatically accurate and in the correct order, but an operation itself is invalid, for instance, attempting to add or subtract a numeric value and a string. This is not possible and results in an error pointing at the offending operation or statement.
Readers will come across several errors, and fixing them needs the ability to ask questions, as mentioned previously. With this foundational knowledge and environment, Python programmes can be written. This chapter discusses the code thoroughly so that it is understandable and useful.
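To make the string/number mismatch described above concrete, the following hedged sketch (the variable name is invented) shows the failing operation as a comment and then the corrected version:
count = 5
# print('Total: ' + count)      # TypeError: can only concatenate str (not "int") to str
print('Total: ' + str(count))   # works: the integer is converted to a string first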
Table 2.2 Hello to Python
Code:
#Lets print "Hello Python"
print("Hello Python")
Output:
Hello Python
Let's get started with Python comment syntax. Python disregards any statement or line that begins with '#'. Comments clarify code, and several comments will be used to describe the code in this chapter. After pressing 'Ctrl + Enter' to execute the Table 2.2 code in a Jupyter notebook, the first line, which begins with '#', is disregarded, and the output is "Hello Python". The first line is thus a comment indicating that the code will print 'Hello Python'.
5 Datatypes and Operators
5.1 Datatypes
The term "datatypes" in Python refers to the different types of data that can be saved and processed, including characters, numbers, and booleans. There are four primary datatypes in Python that are commonly used:
• Integers (int)
• Decimal or floating-point numbers (float)
• Boolean values (bool), which are either true or false
• Strings (str), which are collections of characters, like text
Python provides two primary ways of representing numbers: integer and floating-point. Floating-point numbers or decimals such as 1.0, 3.14, and -2.33 can take up more storage than integers or whole numbers like 1, 3, -4, and 0. Boolean datatypes are used to establish conditions in conditional statements, and they can only have two values, True or False. Lastly, string data is the most common type of data encountered by biologists, as DNA, RNA, and protein sequences and names are all typically text-based; therefore, this chapter includes a dedicated section on strings. It is important to note that string data is always enclosed in quotes. For example, the peptide 'MKSGSGGGSP' would be a Python string.
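The four datatypes can be illustrated with a short sketch (the values are invented for demonstration); the built-in type() function reports the datatype of each value:
count = 42                # int
gc_content = 0.48         # float
is_coding = True          # bool
peptide = 'MKSGSGGGSP'    # str
print(type(count), type(gc_content), type(is_coding), type(peptide))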
5.2 Operators
There are some standard operators used in Python, including '+', '-', '*', '/', '=', and '**', which represent addition, subtraction, multiplication, division, assignment, and exponentiation, respectively. When performing an operation on an integer and a float, the result will always be of float type. However, when operating on two integers, the result will be of integer type, except for division,
which always returns a float type result. If you want to get an integer type result from division, you can use the integer division operator ‘//’.
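A few one-liners (operand values invented for illustration) demonstrate these rules:
print(7 + 3)     # 10   (int + int gives int)
print(7 / 2)     # 3.5  (division always returns a float)
print(7 // 2)    # 3    (integer division)
print(2 ** 10)   # 1024 (exponentiation)
print(3 * 1.5)   # 4.5  (int with float gives float)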
6 Variables
In Python, variables can be compared to algebraic variables in mathematics, which have both a name and a value. The '=' operator is used to declare a variable and assign a value to it: the name is located on the left side of the assignment operator, while the value is located on the right side (Table 2.3). Variables play a crucial role in programming by enabling code readability, maintainability, and reusability. They allow us to store and manipulate data, making it accessible throughout our program. Take, for instance, the scenario of working with large protein or nucleotide sequences: writing the entire sequence repeatedly would be impractical and hinder code comprehension. Here, variables come to our rescue; we can store the sequence in a variable and reuse it as needed, enhancing code efficiency and manageability. Python offers great flexibility when it comes to variables. They can be assigned values of various data types, including integers, floats, booleans, and strings. Additionally, variables can be assigned to other variables, allowing for dynamic data manipulation. Reassigning a variable with a new value replaces its previous value, which cannot be recovered unless explicitly stored elsewhere. Notably, Python permits reassigning a variable to a different data type, an uncommon feature in many programming languages: an integer variable can be reassigned to a string and vice versa. Python also offers simultaneous assignment of values to multiple variables in a single statement, which is relatively uncommon in other languages. For example, the statement name, age = 'John', 25 assigns the value 'John' to the variable name and the value 25 to the variable age. This concise syntax streamlines code and enhances readability. It is vital to remember that variable names in Python are case-sensitive: a variable declared as gene_symbol cannot be referenced as Gene_Symbol or GENE_SYMBOL. Correct spelling and capitalization must be adhered to in order to avoid program errors.
Table 2.3 Assigning variables in Python
Code:
height = 155
print(height)
Output:
155
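The flexibility described above can be demonstrated with a short sketch (the names and values are invented):
name, age = 'John', 25    # simultaneous assignment to two variables
gene_symbol = 101         # an int ...
gene_symbol = 'TP53'      # ... reassigned to a str, which Python permits
print(name, age, gene_symbol)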
Table 2.4 Keywords in Python
False    else      import    pass      yield
None     break     except    in        raise
True     class     finally   is        return
and      continue  for       lambda    try
as       def       from      nonlocal  while
assert   del       global    not       with
elif     if        or
6.1 Rules for Variable Naming
In Python, variable names must follow this set of rules:
• The name of a variable must begin with a letter or an underscore.
• The rest of the name must be composed of letters, digits, or underscores; special characters such as '@' or '.' are not permitted.
• Variable names are case-sensitive: 'myVar' and 'myvar' are two different variables.
• Thirty-three terms are not allowed as variable names because they are part of Python 3.7's lexicon and are referred to as keywords. Table 2.4 contains the list of Python keywords.
A few valid and invalid examples are shown below.
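All identifiers in this sketch are invented for illustration:
_gene_id = 'AT1G01010'    # valid: begins with an underscore
sample2 = 'leaf'          # valid: digits are allowed after the first character
# 2sample = 'root'        # invalid: SyntaxError, a name cannot begin with a digit
# class = 'Plantae'       # invalid: SyntaxError, 'class' is a reserved keyword
print(_gene_id, sample2)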
7 Strings
Strings are a fundamental data type in computer programming, and they are especially important in bioinformatics research, where text manipulation is a common task. For example, string manipulation is frequently used in sequence analysis, pattern detection, and data mining from text files. In Python, a string is a sequence of characters that can be created by enclosing them in single quotes, double quotes, triple single quotes, or triple double quotes.
Table 2.5 Defining strings in Python
Code:
sequence_1 = 'ATCGTACGA'
print(sequence_1)
Output: ATCGTACGA

Code:
sequence_2 = "ATCGTACGA"
print(sequence_2)
Output: ATCGTACGA

Code:
sequence_3 = '''ATCGTACGA'''
print(sequence_3)
Output: ATCGTACGA

Code:
sequence_4 = """ATCGTACGA"""
print(sequence_4)
Output: ATCGTACGA

Code:
sequence_5 = '''ATCGTACGA
CTAGCTAG'''
print(sequence_5)
Output:
ATCGTACGA
CTAGCTAG

Code:
sequence_6 = """ATCGTACGA
CTAGCTAG"""
print(sequence_6)
Output:
ATCGTACGA
CTAGCTAG
When using single or double quotes, the string can only span a single line, but when triple quotes are used, the string can span multiple lines. It is important to note that the quotes used to open and close a string must match. Consider the example in Table 2.5.

7.1 String Indexing
In Python, string indexing allows us to access individual characters within a string by their position. Let’s take the example string ‘PLANT’. String indexing in Python starts from 0, so the first character of the string would have an index of 0. For instance, in the string ‘PLANT’, the character ‘P’ can be accessed using string indexing with the index 0. Similarly, the character ‘L’ would have an index of 1, ‘A’ would have an index of 2, ‘N’ would have an index of 3, and ‘T’ would have an index of 4. In addition to regular indexing, Python also supports reverse indexing. Reverse indexing starts from -1, where -1 represents the last character of the string. In our example string ‘PLANT’, using reverse indexing, the character ‘T’ can be accessed with the index -1. Similarly, ‘N’ would have an index of -2, ‘A’ would have an index of -3 and so on. String indexing in Python allows us to retrieve specific characters or perform operations on individual characters within a string, providing flexibility and control when working with text data. The use of character indexing in Python can be demonstrated in Table 2.6, which displays the forward and backward indexes of nucleotides in a DNA sequence. The first row shows the sequence,
Table 2.6 String indexing in Python
Sequence             A    T    T    G    C
Forward indexing     0    1    2    3    4
Backward indexing   -5   -4   -3   -2   -1
the second row represents the forward index, and the third row shows the backward index.
In Python, string slicing allows us to extract a portion of a string by specifying a range of indices. Let's take the example string 'ATTGC' to understand string slicing. To perform string slicing, we use the syntax string[start:end], where 'start' is the index of the first character to include in the slice and 'end' is the index of the first character to exclude; the character at the 'end' index is not part of the resulting slice. Using our example string 'ATTGC', let's explore different slicing scenarios:
• string[1:4]: slices the string from index 1 to index 4 (excluding index 4), resulting in the substring 'TTG'; it includes the characters at indices 1, 2, and 3.
• string[:3]: slices the string from the beginning (index 0) to index 3 (excluding index 3), resulting in the substring 'ATT'.
• string[2:]: slices the string from index 2 to the end, resulting in the substring 'TGC'.
• string[-3:-1]: demonstrates reverse indexing; it slices from the third-last character (index -3) to the second-last character (index -1), resulting in the substring 'TG'.
String slicing allows us to extract substrings from a larger string based on our specific needs; it is a powerful feature that enables us to manipulate and process text data efficiently, as the short sketch below shows.
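seq = 'ATTGC'
print(seq[0], seq[-1])    # A C  (first and last characters)
print(seq[1:4])           # TTG  (index 4 excluded)
print(seq[:3], seq[2:])   # ATT TGC
print(seq[-3:-1])         # TG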
7.2 Operations on Strings
There are a few ways to concatenate, or join, strings. String addition, denoted by the '+' operator, allows us to concatenate two or more strings: the resulting string is formed by joining the characters of the original strings. For example, the operation 'ATT' + 'GC' yields 'ATTGC'. String multiplication, denoted by the '*' operator, enables us to repeat a string a certain number of times. When we multiply a string by an integer using the '*'
operator, the resulting string is formed by repeating the original string the specified number of times. For instance, if we multiply the example string 'ATTGC' by 2, the result is 'ATTGCATTGC'. Both string addition and multiplication provide flexibility when working with strings in Python: addition allows us to combine strings, while multiplication allows us to repeat strings as needed for various applications.
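Both operations in two lines (variable names invented for illustration):
primer = 'ATT' + 'GC'   # string addition -> 'ATTGC'
repeat = 'ATTGC' * 2    # string multiplication -> 'ATTGCATTGC'
print(primer, repeat)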
7.3 Commands in Strings
Several string-handling commands include count(), find(), and len(). Using the example string 'ATTGC', let's explore a few common string commands in Python:
• Length: the len() function allows us to determine the length of a string. For instance, len('ATTGC') returns 5, indicating that the string consists of five characters.
• Upper- and lower-case conversion: Python provides the methods upper() and lower() to convert a string to upper case or lower case, respectively. Since 'ATTGC' is already in upper case, applying upper() leaves it as 'ATTGC', while applying lower() yields 'attgc'.
• Substring search: we can check whether a particular substring exists within a string using the in keyword. For example, to check if the substring 'TT' is present in 'ATTGC', we can write 'TT' in 'ATTGC', which returns True.
• String replacement: the replace() method allows us to replace specific characters or substrings within a string. For instance, calling 'ATTGC'.replace('T', 'A') replaces all occurrences of 'T' with 'A', resulting in the string 'AAAGC'.
• Splitting and joining: the split() method splits a string into a list of substrings based on a delimiter. For example, 'ATTGC'.split('G') results in the list ['ATT', 'C'], since 'G' acts as the delimiter. Conversely, the join() method concatenates a list of strings into a single string using a specified delimiter: ''.join(['ATT', 'GC']) yields 'ATTGC'.
These are just a few examples of common string operations; Python provides a rich set of built-in string methods and functions, allowing for versatile manipulation and analysis of text data.
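All of the commands above, applied to the example string in one short sketch:
seq = 'ATTGC'
print(len(seq))                # 5
print(seq.lower())             # attgc
print('TT' in seq)             # True
print(seq.replace('T', 'A'))   # AAAGC
print(seq.split('G'))          # ['ATT', 'C']
print(''.join(['ATT', 'GC']))  # ATTGC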
8 Python Lists and Tuples
After learning about datatypes like integers, strings, and booleans, this section covers lists. A list is a versatile data structure that allows us to store and manipulate collections of items. A list is an ordered sequence of elements enclosed within square brackets []. Lists can contain elements of different data types, such as numbers, strings, or even other lists, and the elements within a list are separated by commas. Lists in Python are mutable, which means that we can modify, add, or remove elements. Lists provide several useful operations and methods for working with data. We can access individual elements in a list using indexing, where the first element has an index of 0. Slicing allows us to extract a portion of the list by specifying a range of indices. Elements in a list can be modified by assigning new values to specific indices. Additionally, we can append or insert new elements at the end or at a specific position in the list, respectively; conversely, we can remove elements using methods such as remove(), pop(), or the del keyword. Lists also support various built-in functions and methods for common operations: we can determine the length of a list using the len() function, concatenate or repeat lists using operators like '+' and '*', sort the elements using sort(), and find specific values or counts using methods such as index() and count(). The flexibility and functionality of lists make them a fundamental data structure in Python. They are commonly used for storing and manipulating collections of data, making them invaluable in a wide range of applications and algorithms.
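A short sketch of the common list operations described above (the gene names and expression values are invented for illustration):
genes = ['GeneA', 'GeneB', 'GeneC']
genes.append('GeneD')      # add to the end
genes.insert(0, 'Gene0')   # insert at a given position
genes.remove('Gene0')      # remove by value
expression = [10.5, 8.2, 15.1]
expression.sort()          # sort in place
print(len(genes), genes.index('GeneB'), expression)   # 4 1 [8.2, 10.5, 15.1]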
8.1 Accessing Values in List
Fig. 2.2 Python list indexes
Like strings, list items also have indexes, starting at 0 for forward indexing and at -1 for backward indexing. The items inside a list can be accessed using square brackets [] and indexes. List indexing and slicing allow us to access and manipulate elements within a list. Let's consider the example list ['Moss', 'Embryophyte', 'Thallophyte', 'Conifer'] to explore list indexing and slicing (Fig. 2.2).
8.1.1 List Indexing
• Indexing in Python starts from 0, so to access the first element of the list we use index 0: in our example, list[0] gives the value 'Moss'.
• Similarly, list[1] gives 'Embryophyte' and list[2] gives 'Thallophyte'.
8.1.2 List Slicing
List slicing allows us to extract a portion of the list by specifying a range of indices.
• For example, list[1:3] gives a new list containing the elements from index 1 to index 2 (excluding index 3); in this case, the resulting list is ['Embryophyte', 'Thallophyte'].
• If we omit the starting index, as in list[:2], the slice includes elements from the beginning of the list up to index 1; here, the resulting list is ['Moss', 'Embryophyte'].
• Conversely, if we omit the ending index, as in list[2:], the slice includes elements from index 2 to the end of the list; here, the resulting list is ['Thallophyte', 'Conifer'].
• We can also use negative indices for slicing. For instance, list[-2:] gives the last two elements of the list, ['Thallophyte', 'Conifer'].
List indexing and slicing provide flexible ways to access specific elements or extract subsets of a list; they are essential techniques for manipulating and working with list data in Python, as the short sketch below illustrates.
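plants = ['Moss', 'Embryophyte', 'Thallophyte', 'Conifer']
print(plants[0])     # Moss
print(plants[1:3])   # ['Embryophyte', 'Thallophyte']
print(plants[-2:])   # ['Thallophyte', 'Conifer']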
8.2 Tuples
A tuple is an immutable data structure that allows us to store a collection of elements. Tuples are like lists, but the key difference is that tuples cannot be modified once they are created. Tuples are defined by enclosing the elements within parentheses (). The elements within a tuple can be of different data types, and they are separated by commas. Tuples are commonly used when we need to store a fixed set of values that should remain unchanged. While we cannot modify the elements of a tuple, we can access them using indexing or slicing, just like in lists. Tuples are particularly useful when we want to ensure the integrity and immutability of data, such as storing coordinates, constant configurations, or database records.
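A minimal sketch of tuple behaviour (the coordinate values are invented for illustration):
coordinates = (51.47, -0.45)
print(coordinates[0], coordinates[-1])   # indexing works exactly as for lists
# coordinates[0] = 52.0   # would raise TypeError: tuples cannot be modified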
9 Dictionary in Python
A dictionary is a powerful data structure that allows us to store and retrieve data using key-value pairs. Dictionaries are defined by
enclosing the key-value pairs within curly braces {} or by using the dict() constructor. Keys within a dictionary must be unique and immutable, such as strings, numbers, or tuples, while the corresponding values can be of any data type. Dictionaries are unordered, meaning that the order of the key-value pairs may vary. To access the values in a dictionary, we use the corresponding keys instead of indices as in lists or tuples. Dictionaries provide efficient lookup and retrieval of values based on their associated keys, making them ideal for scenarios where fast access to data is required. They are widely used for tasks such as data mapping, configuration settings, or organizing data in a structured manner (Table 2.7 and Fig. 2.3). A few common dictionary operations in Python:
• Accessing values: we access the values in a dictionary by referring to their respective keys. For example, crop['Name'] returns 'Rose', crop['Genus'] returns 'Rosa', and so on. This allows specific information associated with each key to be retrieved.
• Updating values: we can update the value of a specific key. For instance, to change the species from 'Bracteatae' to 'Damascena', we use the assignment operator: crop['Species'] = 'Damascena'. This modifies the value associated with the 'Species' key.
• Adding new key-value pairs: to add a new key-value pair, we assign a value to a previously unused key. For example, crop['Family'] = 'Rosaceae' adds a new key 'Family' with the value 'Rosaceae' to the crop dictionary.
• Removing key-value pairs: to remove a key-value pair, the del keyword is used, followed by the dictionary name and the key to be deleted. For instance, del crop['Kingdom'] removes the key-value pair associated with the key 'Kingdom'.
• Checking key existence: to check whether a specific key exists in the dictionary, the in keyword is used. For example, 'Genus' in crop returns True if the key 'Genus' exists in the crop dictionary and False otherwise.
These are just a few examples of operations that can be performed on Python dictionaries. Dictionaries provide a convenient way to store and access data using key-value pairs, making them a powerful tool for organizing and manipulating information.
Table 2.7 Creating a Python dictionary
Code:
crop = {}
crop = {'Name': 'Rose', 'Kingdom': 'Plantae', 'Genus': 'Rosa', 'Species': 'Bracteatae'}
print(crop)
print(type(crop))
Output:
{'Name': 'Rose', 'Kingdom': 'Plantae', 'Genus': 'Rosa', 'Species': 'Bracteatae'}
<class 'dict'>
Fig. 2.3 Python dictionary key: value pairs
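The operations listed above can be run in sequence on the crop dictionary from Table 2.7; a minimal sketch:
crop = {'Name': 'Rose', 'Kingdom': 'Plantae', 'Genus': 'Rosa', 'Species': 'Bracteatae'}
print(crop['Name'])             # Rose  (accessing a value by key)
crop['Species'] = 'Damascena'   # updating a value
crop['Family'] = 'Rosaceae'     # adding a new key-value pair
del crop['Kingdom']             # removing a key-value pair
print('Genus' in crop)          # True  (checking key existence)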
10 Conditional Statements
Up to this point, the programs in this chapter have been relatively elementary and unable to make informed decisions. To introduce decision-making capabilities, conditional statements play a vital role. Computers, similar to a light switch, operate on two fundamental states, True and False. These binary states are referred to as booleans in Python and form the foundation of logical and comparative operations.
• Python offers a range of logical operators to evaluate and combine boolean values. The logical AND operator, denoted by and, returns True if both operands on either side of it are True; otherwise, it returns False. For instance, x and y yields True only if both x and y are True.
• Python provides the logical OR operator, denoted by or, which returns True if at least one of the operands is True; it returns False only if both operands are False. For example, x or y results in True if either x or y (or both) are True.
• Python encompasses the logical NOT operator, denoted by not, which negates the boolean value of its operand. If the operand is True, not returns False, and vice versa. For instance, not x yields True if x is False, and False if x is True.
Additionally, Python provides a set of comparative operators to compare values and produce a boolean result:
• Equal to (==): returns True if the operands are equal.
• Not equal to (!=): returns True if the operands are not equal.
• Greater than (>): returns True if the left operand is greater than the right operand.
• Less than (<): returns True if the left operand is less than the right operand.
• Greater than or equal to (>=): returns True if the left operand is greater than or equal to the right operand.
• Less than or equal to (<=): returns True if the left operand is less than or equal to the right operand.
The example in Table 2.8 uses the > operator in the if statement. If the condition evaluates to True (which it does in this case), the code block indented under the if statement is executed. In this example, the code block contains a print statement that prints the string 'downregulated' to the console. Since the value of normal_expression is indeed greater than perturbation_expression, the condition is satisfied, and the code
Fig. 2.4 Syntax of ‘if’ statement
Table 2.8 If statement
Code:
normal_expression = 7
perturbation_expression = 3.5
if normal_expression > perturbation_expression:
    print('downregulated')
Output:
downregulated
Table 2.9 If-else statement
Code:
normal_expression = 7
perturbation_expression = 3.5
if normal_expression > perturbation_expression:
    print('Gene is downregulated')
else:
    print('Gene is upregulated')
Output:
Gene is downregulated
within the if block is executed. As a result, running this code outputs 'downregulated' to the console. In the next example (Table 2.9), an else statement has been added to the existing if statement; the else statement provides an alternative code block to be executed if the condition specified in the if statement evaluates to False.
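A hedged sketch extending Table 2.9 with an elif branch for the case of equal expression values (the values and the no-change message are assumptions added for illustration):
normal_expression = 7
perturbation_expression = 7
print(normal_expression == perturbation_expression)   # True
if normal_expression > perturbation_expression:
    print('Gene is downregulated')
elif normal_expression == perturbation_expression:
    print('No change in expression')
else:
    print('Gene is upregulated')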
11 Loops in Python
In software development, it is sometimes essential to repeat a specific block of code until a certain condition is satisfied or until a statement becomes false, depending on the specific requirements. This iterative behaviour is achieved using loops, which are control structures in programming, and programming languages offer several options for implementing them. There are two main types of loop structure. The first is the while loop, which iterates through a block of code as long as the specified condition is true; the code inside the while block executes repeatedly until the condition evaluates to false. The second is the for loop, which is commonly used when the programmer needs to iterate over a sequence of values, such as a range of numbers or a list of items; the for loop executes its block of code once for each value in the sequence. By utilizing loops, a user can write more efficient and concise code.
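A minimal sketch of both loop types (the loop bodies are invented for illustration):
a = 0
while a < 3:               # while: repeats as long as the condition is true
    print('while:', a)
    a = a + 1

for base in ['A', 'T', 'G', 'C']:   # for: runs once per item in the sequence
    print('for:', base)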
Fig. 2.5 Syntax of a Python while loop

Table 2.10 Simple while loop
Code:
a = 0
while a < 5:      # loop bound assumed; the original condition was lost in extraction
    print(a)
    a = a + 1
Output:
0
1
2
3
4

Code (comparison of mean expression values, discussed in Sect. 12.1):
if control_mean > treated_mean:
    print('Control gene is upregulated')
else:
    print('Treated gene is upregulated')
Output:
Treated gene is upregulated
plant'. Afterwards, the function can be used with any flower name, and it will print it. Functions can take any number of arguments as input.

12.1 Returning Values
Functions can return values by using the return keyword (Table 2.15). The function in Table 2.15 returns the mean expression for a list of expression values of a gene measured in triplicate experiments. The mean expressions of the control and treated conditions are calculated and then compared to see whether the treated gene is upregulated or downregulated; a sketch of this pattern is shown below.
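Since Table 2.15 itself is not reproduced here, the following is a minimal sketch of the pattern it describes (the function name, variable names and values are assumptions):
def mean_expression(values):
    return sum(values) / len(values)   # 'return' hands the result back to the caller

control_mean = mean_expression([3.2, 3.6, 3.4])   # triplicate control values (invented)
treated_mean = mean_expression([7.1, 6.8, 7.3])   # triplicate treated values (invented)
if control_mean > treated_mean:
    print('Control gene is upregulated')
else:
    print('Treated gene is upregulated')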
Once the return keyword is executed, the function terminates immediately and returns the value. Python functions can also return more than one value; refer to the Python documentation for more details.
13 Modules in Python
Modules in Python are simply '.py' files that contain Python code and may be imported into another Python programme. A module is a library of pre-written code, a file containing a set of functions that may be added to a bigger programme. Modules allow us to arrange related portions of code together, such as a set of functions, a set of classes, or anything else. According to best practice, big sections of Python code should be divided into modules of no more than 300-400 lines. A module's components are the lines of code that define and execute classes, variables, and functions for use in another application; modules thus help modularize Python programming. The import keyword allows modules to be used in code, and using modules improves a program's stability. Web development, database construction, image analysis, data science, statistics, machine learning, and so on are all possible using Python tools and libraries. Packages contain modules, whereas libraries contain functions. The Python Package Index (PyPI) contains approximately 395,666 packages to help developers [9]. Using the pip installer, users can install any PyPI package; with Anaconda, both the conda and pip installers can be used. Enter pip install PackageName, or conda install PackageName for an Anaconda distribution, into the command prompt or Linux terminal to install a package. The Anaconda Python distribution comes with preloaded data science packages. After installing the packages, modules can be imported with the from and import keywords. Users may occasionally experience issues while installing modules; in this case, users must properly examine the dependencies of the targeted modules and libraries, as well as their versions.
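A short sketch of the import styles mentioned above, using two modules from Python's standard library so that no installation is needed:
import math                     # import a whole module
from statistics import mean     # import a single name from a module

print(math.log2(8))             # 3.0
print(mean([7.1, 6.8, 7.3]))    # 7.066666666666666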
14 Classes and Objects
In traditional procedural programming, code is divided into multiple functions, and variables are used to store and manage data within these functions. However, as programs grow in size, managing a large number of functions becomes challenging: code duplication and interdependencies between functions can lead to issues when making changes. To address these challenges, object-oriented programming (OOP) comes into play. OOP introduces several key
concepts to build code that is more readable, maintainable, and reusable. Inheritance is one such concept, whereby a class can inherit properties and methods from another class. This allows the creation of specialized subclasses that inherit common characteristics and behaviours from a parent class: for example, an 'animal' class can have subclasses like 'cow' and 'dog' with their own unique attributes and methods. Inheritance promotes code reuse, improving code efficiency and maintainability.
• Encapsulation is another important OOP concept that involves hiding the internal information of an object from the outside world. It allows controlled access to data and prevents accidental modifications. For instance, a 'Person' class can have properties like 'name' and 'age', with methods to retrieve and set these attributes. Encapsulation provides data security and enables code modification without affecting the rest of the codebase.
• Abstraction focuses on hiding the inner workings of an object, allowing users to utilize it without knowing the underlying implementation details, which simplifies code usage and maintenance. For example, a 'plant' class can have methods like 'count the quantity of fruits' and 'count the leaves'; users can simply call these methods without needing to understand their internal workings. Abstraction enhances code usability and comprehension.
• Polymorphism enables objects to take on multiple forms or states. This flexibility allows code to be easily updated and adapted to different contexts. For instance, an 'mRNA' class can have subclasses representing different types of mRNA, each implementing common methods in its own unique way. Polymorphism enables code flexibility and adaptability.
Understanding these concepts of inheritance, encapsulation, abstraction, and polymorphism is crucial in OOP; they facilitate the creation of code that is easier to use, maintain, and modify. In OOP, related variables and functions are encapsulated into objects, which serve as self-contained entities. Objects are instances of classes, which act as blueprints for creating objects with shared characteristics and functions; Python itself treats data types like integers, strings, lists, and dictionaries as classes. Methods are the internal functions of a class that define its behaviours. By embracing OOP principles, developers can write code that is more organized, reusable, and adaptable, and Python provides a robust implementation of OOP through classes, objects, and methods. To begin building a custom object in Python, a class is first declared with the keyword class. Consider developing a class that represents a plant's details; each object will represent individual
Table 2.16 Creating an empty class
Code:
class Plant:
    pass
Table 2.17 Initialization using the __init__() method
Code:
class Plant:
    def __init__(self):
        pass
plants. To begin, the class must be created: let us start by creating an empty class named Plant (Table 2.16). Here the pass keyword is used to show that the class is empty.
14.1 The __init__() Method
__init__() is a special method which initializes an object; it is automatically executed every time an object of the class is created. The __init__() method is typically used for operations required before the object is produced (Table 2.17). When __init__() is specified in a class definition, it should have self as its first parameter: the self parameter refers to the object itself and is used to retrieve or set variables within the class. This first argument does not strictly have to be named self, but doing so is the convention.
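A minimal sketch of an __init__() that actually stores data (the attribute names are invented for illustration):
class Plant:
    def __init__(self, name, height):
        self.name = name       # attributes are attached to the new object via self
        self.height = height

rose = Plant('Rose', 155)      # __init__() runs automatically here
print(rose.name, rose.height)  # Rose 155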
15 File Handling in Python
File handling in Python involves working with files on the system, including tasks such as reading from files, writing to files, and modifying existing files. Python provides various file opening modes to specify the intended operation on the file. The file opening modes are as follows:
• 'r': read mode. This mode allows reading from an existing file. It is used to retrieve data from a file without modifying its contents. For example, to read a FASTA file containing biological sequences, open it in read mode: file = open('sequences.fasta', 'r').
• 'w': write mode. This mode is used to create a new file or overwrite the contents of an existing file. It is commonly used to write data to a file. For instance, to write data to a CSV file, open it in write mode: file = open('data.csv', 'w').
• 'a': append mode. This mode allows appending data to the end of an existing file without overwriting its contents. It is useful for adding new data to an existing file. To append data to a CSV file, open it in append mode: file = open('data.csv', 'a').
• 'x': exclusive creation mode. This mode creates a new file but raises an error if the file already exists. It is useful for ensuring that a file does not already exist before creating it.
• 'b': binary mode. This mode is used when dealing with non-text files, such as images or binary data. It ensures that the file is opened in binary format rather than text format.
Python for Biologists
59
Table 2.19 Reading from and writing to a CSV file Code import csv # Writing to a CSV file data = [ [’Gene’, ’Expression’], [’GeneA’, 10.5], [’GeneB’, 8.2], [’GeneC’, 15.1] ] with open(’expression_data.csv’, ’w’, newline=’’) as file: writer = csv.writer(file) writer.writerows(data)
file formats, such as FASTA and CSV, providing flexibility in reading, writing, and manipulating the content of these files.
16
Data Handling Pandas is a highly effective tool for data manipulation and analysis widely utilized in the data science community. Built on Python, it provides a user-friendly interface for loading, manipulating, and analyzing data effortlessly. One of its key advantages is its speed and simplicity, enabling processing of vast datasets with millions of rows without performance issues. At the heart of Pandas lies the DataFrame, a versatile and userfriendly data structure. Users can effortlessly add or delete columns, slice and index data, and handle missing values. Pandas seamlessly handles reading data from CSV files, a popular data storage format. Once data is loaded into a DataFrame, a plethora of operations becomes available, including statistical analysis, data cleaning, data visualization, and more. For instance, Pandas allows answering questions such as “What are the average, median, highest, and lowest values in each column?” or “Are there any connections between rows A and B?” Incorrect data can be easily removed, and sorting can be applied to specific rows or columns of interest. Matplotlib library integration enables transformation of data into visual representations like lines, bars, bubbles, or histograms. Following data processing and analysis, the results can be exported to CSV files, databases, or other storage locations. Python offers a myriad of powerful libraries and tools that can complement Pandas. Matplotlib, Seaborn, and Scipy are examples of libraries used for data visualization, statistical tests, and mathematical modeling. Proficiency in Python and its libraries opens up various job opportunities and equips individuals to tackle modern
60
Rajkumar Chakraborty and Yasha Hasija
challenges. However, this chapter only scratches the surface of Python’s capabilities. Additional topics to explore include algorithms, machine learning, and natural language processing. Interested readers can find high-quality learning resources such as online tutorials, books, and courses to delve deeper into Python and its vast ecosystem of tools and applications.
17
Conclusions and Future Prospectors The next generation of bioinformation will require to employ computational tools to derive knowledge from existing biological data. Bioinformatics is concerned with the collection, retrieval, and modelling of data for the purposes of analysis, visualisation, and prediction. This is performed through the creation of algorithms and software. As computer technology advances, numerous advancements in the field of bioinformatics emerge worldwide. Day by day, high throughput technologies become more affordable and efficient, resulting in the addition of a large volume of data that is kept in optimised databases with correct annotations. Numerous methodologies formerly employed in the realm of computer science are finding application in computational biology. Algorithm optimization, mathematical modelling of systems, graph theory, network analysis, data science, and artificial intelligence are all being applied to biological data for the purpose of conducting research and developing predicting tools. Furthermore, the complexity of the models requires an unparalleled amount of flexibility in software tools to allow investigators to design and evaluate novel ideas. Next-generation Sequencing (NGS) is one of the most important technological advances in the life sciences over the last decade. Whole Genome Sequencing (WGS), RAD-Seq, RNA-Seq, Chip-Seq, and other technologies are widely employed to examine critical biological topics. With good reason, these are also known as high-throughput sequencing technologies; they create massive volumes of data that must be analysed. NGS is the primary reason that computational biology is becoming a “big data” discipline. This is a field that, above all, necessitates powerful bioinformatics techniques. Professionals with these skills are in high demand. However, programming is not limited to NGS; it can also be used for many other activities such as literature searches, editing DNA and protein sequences, and data analysis and display. Biopython is an open-source library for bioinformatics computation [10]. PyMed is another library that can assist researchers in creating consistent and understandable batch search queries in PubMed, making literature searches a joy. Python, with its libraries, is a powerful tool for manipulating, exploring, and visualising large amounts of data. Pandas is used for data manipulation, and Seaborn is used for data visualisation. The interactive and dynamic character
Python for Biologists
61
of python, as well as its relative simplicity and ease of use, make it a good alternatives for developing software solutions. Python makes storing, organising, analysing, and displaying massive amounts of data a piece of cake. As a result, python will play an increasingly important role in the era of digital biology.
18
Exercises 1. What is the purpose of conditional statements, and how do Booleans play a role in them? 2. Describe some of the finest ways to name a Python variable. 3. If we were given the following DNA sequence: ‘TGGGACA AGGGGTCACCCGAGTGCTGTCTTCCAATCTACTT,’ Determine the DNA fragments that would be produced if this sequence was digested by a restriction enzyme with the recognition site ‘CCC’ using Python. 4. Which datatypes are not permitted to be used as keys in the Python dictionary? 5. Create a function that takes two lists and returns all the items in each that are common. 6. Create a sequence-based DNA class and a method for retrieving the complementary string of the DNA.
References 1. Benson DA, Cavanaugh M, Clark K et al (2013) GenBank. Nucleic Acids Res 41:D36– D42. https://doi.org/10.1093/NAR/ GKS1195 2. Sobti RC, Ali A, Dolma P et al (2022) Emerging techniques in biological sciences. In: Advances in animal experimentation and modeling: understanding life phenomena, pp 3–18. https://doi.org/10.1016/ B978-0-323-90583-1.00013-1 3. Dash S, Shakyawar SK, Sharma M, Kaushik S (2019) Big data in healthcare: management, analysis and future prospects. J Big Data 6:1– 25. https://doi.org/10.1186/S40537-0190217-0/FIGURES/6 4. What’s new in Python 2.0. https://web. archive.org/web/20091214142515/http:// www.amk.ca/python/2.0. Accessed 21 Aug 2022 5. Python 3.0 Release. Python.org. https://www. python.org/download/releases/3.0/. Accessed 21 Aug 2022 6. Ekmekci B, McAnany CE, Mura C (2016) An introduction to programming for bioscientists:
a python-based primer. PLOS Comput Biol 12:e1004867. https://doi.org/10.1371/ JOURNAL.PCBI.1004867 7. Gupta N, Verma VK (2019) Next-generation sequencing and its application: empowering in public health beyond reality. In: Arora P (ed) Microbial technology for the welfare of society. Springer. https://doi.org/10.1007/ 978-981-13-8844-6_15 8. Greener JG, Kandathil SM, Moffat L, Jones DT (2021) A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23(1): 40–55. https://doi.org/10.1038/s41580021-00407-0 9. PyPI. The Python Package Index. https:// pypi.org/. Accessed 21 Aug 2022 10. Cock PJA, Antao T, Chang JT et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423. https://doi.org/10.1093/BIOINFORMAT ICS/BTP163
Chapter 3 Assembly, Annotation and Visualization of NGS Data Kalyani M. Barbadikar, Tejas C. Bosamia, Mazahar Moin, and M. Sheshu Madhav Abstract The next-generation sequencing technologies have revolutionized the field of agriculture with the advent of ever-evolving chemistries and a reduction in the price of sequencing per base. The last decade has witnessed a substantial upsurge in the chemistries, databases, tools, and bioinformatics platforms. The number of data (GBs) deposited in publicly available databases is increasing exponentially day by day. From unraveling the basic structural information to understanding the minute patterns of gene functions, NGS technologies are being used to reveal various aspects of plant growth, development, stress resistance/tolerance, yield, nutrient content etc. on a modest budget. Climate change is also driving the need of more focused research using plant lines/germplasm for climate resilience. Availability of such data is extremely worthy in the current scenario of ever evolving chemistries, information technology, artificial intelligence. To have a conglomeration with the biological data, assembling, annotation and representation or visualization of the big data sets is extremely essential. These technologies include short read as well as long read sequencing individually or together offers high-throughput, scalability and speed by using various pipelines/platforms/graphical user interface. The tools and softwares deployed for executing the assembly, annotation and visualization need to be selected carefully based on the aims and applications of the study. The plausible roles of genetics and epigenetic regulation in a particular tissue and condition can be very well studied using different applications like genome sequencing, epigenetic, methylome sequencing, transcriptome sequencing, targeted sequencing etc. In this review, we have presented an account of the considerations for assembling the reads together, annotation, visualization for various applications along with various tools available for executing the same. Key words Next-generation sequencing, Long-read sequencing, Assemblers, Gene ontology, Genome browser
1 Introduction The next-generation sequencing (NGS) technologies have de facto turned indispensable. Loads of data (GB) are added day-by-day in public repositories. The interpretation of proportionate increase in the amount of data generated is considerably challenging. Whether the objective is whole-genome sequencing to compare mutations between the samples [1], RNA-seq to learn about differentially
Fig. 3.1 NGS workflow and applications of NGS
expressed genes [2], or studying microbial diversity as part of metagenomics [3], NGS data is the starting point. For each of these, the starting material is DNA or RNA, followed by library preparation, sequencing of the samples, and finally data analysis [4, 5]. The strengths of NGS are that many samples can be run simultaneously, a large amount of data is generated from a single sequencing run, and analysis is fast [6–8]; multiplexing of samples is a major benefit. NGS technologies have a wide range of applications in humans, plants, archaea, fungi, bacteria, and viruses (Fig. 3.1). The software, pipelines, platforms, and tools for downstream analysis of sequenced data are available both as freeware and as paid solutions covering the complete analysis. The most critical starting point of any sequencing procedure is the tissue, the time point (for expression studies), and the integrity and quality of the isolated nucleic acids. Most available tools and databases are free, online, and operated through a command-line interface. Public repositories such as GitHub (open-source licenses) are extremely useful for deploying deposited tools/pipelines and for developing new ones (https://gist.github.com).
2 Considerations for Next-Generation Sequencing

2.1 Points to Be Considered Before Starting a Next-Generation Sequencing Project
Several points must be considered to obtain proper results with NGS technologies. Select individuals with pure genetic backgrounds that are good representatives of the species and provide sufficient amounts of DNA; quality and quantity should be adequate so that, if required, each sequencing run can use the same DNA sample. Extract RNA from the same individual used for DNA isolation so that the RNA sequencing data can be deployed for proper assembly and annotation of coding
regions. The sequencing platform and assembly tools/software/programs should be decided at the time of drafting any genome sequencing project. Table 3.1 gives definitions of commonly used NGS terminology.

2.2 Properties of the Genome to Be Taken into Consideration
Every genome sequencing project is different because each species' genome has different properties, and the sequencing platform, assembly tool, and annotation also influence the output. It is worth examining the following properties of the genome before starting the project.

Genome size: A certain amount of sequencing data is needed to reduce assembly errors and increase precision. The coverage of the sequencing data is measured relative to the genome size; generally, greater than 60× coverage is required for Illumina platforms, so the bigger the genome, the more data is needed (see the short calculator sketch after this list of genome properties). Approximate genome sizes can be referred from existing databases, such as for plants (http://data.kew.org/cvalues), for fungi (http://www.zbi.ee/fungalgenomesize), and for animals (http://www.genomesize.com), or estimated by flow cytometry, which provides a priori knowledge of the genome size of the organism under study [9]. Genome comparison and synteny tools such as SyntTax, Cinteny Server, CoGe, SimpleSynteny, Synteny Portal, AutoGRAPH, Sibelia, Kablammo, M1CR0B1AL1Z3R, GeneOrder 4.0, CoreGenes, Panseq, WebACT, EDGAR, and PARIGA can be used for checking the synteny between genomes.

Repeats: Sequence fragments that occur multiple times at different places in the genome are considered repeats, and they have an enormous effect on the assembly outcome. The assembler cannot distinguish identical reads originating from different locations, so sequences from distant loci can be mis-assembled, and assemblies of high-repeat-content genomes are generally fragmented. It is better to generate long-read data to deal with repeats, since long reads provide sufficient unique sequence flanking the repeats.

Heterozygosity: Assembly programs are designed to generate one consensus sequence from each homologous region and fail to detect allelic differences. In heterozygous individuals, sequences from homologous alleles assemble separately, so a region may be reported twice in a diploid organism, while less variable regions are reported only once; heterozygosity therefore increases the chance of fragmented assemblies. It is very important to select individuals with low heterozygosity, such as inbred lines; if possible, sequencing haploid tissue nullifies the problem altogether.
Table 3.1 Commonly used terms in NGS

Fasta: A text-based format for representing either nucleotide or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
Fastq: A text file that contains the sequence data from the clusters that pass filter on a flow cell.
Read length: Number of base pairs (bp) sequenced from a DNA fragment.
Single read: Sequencing DNA fragments from one end to the other; useful for some applications, such as small RNA sequencing, and can be a fast and economical option.
Quality: Phred-scaled quality score assigned by the variant caller. Higher scores correspond to higher confidence in the variant.
Quality score: A way to assign confidence to a particular base within a read. Some sequencers have their own proprietary quality encoding, but most have adopted Phred+33 encoding. Each quality score represents the probability of an incorrect base call at that position.
Scaffold: A portion of the genome sequence reconstructed from end-sequenced whole-genome shotgun clones. Scaffolds are composed of contigs and gaps.
Contig: A series of overlapping DNA sequences used to make a physical map that reconstructs the original DNA sequence of a chromosome or a region of a chromosome.
N50: Defines assembly quality in terms of contiguity. Given a set of contigs, N50 is the length of the shortest contig such that contigs of that length or longer cover 50% of the assembly.
L50: The smallest number of contigs whose summed length makes up half of the genome size.
Coverage: Average number of sequencing reads that align to, or "cover," each base in the sequenced sample. The Lander/Waterman equation calculates coverage (C) from read length (L), number of reads (N), and haploid genome length (G): C = LN/G.
BLAST: A computer algorithm that rapidly aligns and compares a query DNA sequence with a database of sequences.
Assembly: Aligning and merging fragments of a longer DNA sequence in order to reconstruct the original sequence.
Mate pair: Involves generating long-insert paired-end DNA libraries.
Paired end: After a DNA fragment is read from one end, the process starts again in the other direction. In addition to producing twice the number of sequencing reads, this method enables more accurate read alignment and detection of structural rearrangements.
Sequencing by synthesis (SBS): Chemically modified nucleotides bind to the DNA template strand through natural complementarity.
Insert size: The length of the sequence between a pair of reads.
Barcode: Sequences added to each DNA fragment during NGS library preparation so that each read can be identified and sorted before the final data analysis.
Adapter: Adapter ligation adds the full complement of sequencing primer hybridization sites for single, paired-end, and indexed reads.
SNP calling: Aims to determine the positions at which there are polymorphisms, or at which at least one base differs from a reference sequence; the latter is also sometimes referred to as 'variant calling'.
Pipeline: A bioinformatics pipeline is composed of a wide array of software algorithms that process raw sequencing data and generate a list of annotated sequence variants. Pipelines are either designed and developed by a vendor, with or without customization by the laboratory, or developed entirely by the laboratory.
Tool: Package for analysis.
Database: Organized and systematic collection of data, mostly retrievable.
Trimming: Removal of low-quality bases from sequencing reads.
GO: Formal representation of a body of knowledge within a given domain. Ontologies usually consist of a set of classes with relations that operate between them.
BAM: Binary Alignment Map, the comprehensive raw data of genome sequencing; the lossless, compressed binary representation of SAM (Sequence Alignment Map) files.
BED: A simple format for assigning basic features to a sequence. It consists of one line per feature, each containing 3–12 columns of data, plus optional track definition lines; generally used for user-defined sequence features and for graphical representations of features.
SAM: Sequence Alignment/Map format, a TAB-delimited text format consisting of an optional header section and an alignment section.
VCF: A text file format that contains information about variants found at specific positions in a reference genome.
GFF: A tab-delimited text file that holds information on any feature that can be applied to a nucleic acid or protein sequence.
GTF: A file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF) but contains some additional conventions specific to gene information.
HMM: Hidden Markov Models in bioinformatics. A specific HMM can be developed to infer the genotype at each position of the genome by incorporating the mapping quality of each read and the corresponding base quality into the emission probability of the HMM.
NCBI: National Center for Biotechnology Information, a source of public biomedical databases, software tools for analyzing molecular and genomic data, and research in computational biology.
SRA: Sequence Read Archive, available through multiple cloud providers and NCBI servers; the largest publicly available repository of high-throughput sequencing data.
Bioproject: A collection of biological data related to a single initiative, originating from a single organization or a consortium (accession numbers PRJNAXXXXX).
Biosample: Descriptive information about the physical biological specimen from which experimental data are derived (accession numbers SAMNXXXXXXXX).
Ploidy level: Higher ploidy greatly increases the number of alleles in homologous regions and is very difficult to manage during assembly; it creates problems similar to repeats in tetraploid and hexaploid genomes, and the assembly of a high-ploidy genome will often be fragmented.

GC content: Extremely low or high GC content in a genomic region causes problems on the 454 and Illumina platforms, resulting in very low or no coverage of those regions. Sequencing technologies that are unbiased with respect to GC content, including PacBio and Nanopore, can be used when dealing with high- or low-GC genomic regions.

Pooling of individuals: Pooling of individuals is generally avoided; however, it can sometimes be difficult to extract sufficient DNA from a single individual, in which case closely related individuals or inbreds can be pooled. Note that pooling may increase genetic variability, just like heterozygosity, and result in fragmented assemblies. If DNA is insufficient, whole-genome amplification may be used prior to sequencing, though it has drawbacks such as uneven coverage of different regions and the generation of chimeric sequences consisting of two or more unrelated sequences.

Presence of other organisms: Contamination from other organisms is common in genome assembly because samples are contaminated or carry symbionts. Care should be taken that the DNA concentration of other
organisms does not exceed the DNA of interest; small amounts of contamination are manageable and rarely cause problems.

Organelle DNA: The amount of organelle DNA in the sample should not exceed the nuclear DNA. Tissue with a higher nuclear-to-organelle DNA ratio can be chosen to avoid such problems.
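To make the genome size and GC checks above concrete, the Lander/Waterman relation from Table 3.1 (C = LN/G) and a GC-content estimate can be computed in a few lines. The following is a minimal sketch in Python; the read counts, genome size, and test fragment are made-up example values, not recommendations.

```python
def expected_coverage(read_length_bp: int, num_reads: int, genome_size_bp: int) -> float:
    """Lander/Waterman expected coverage: C = L * N / G."""
    return read_length_bp * num_reads / genome_size_bp

def gc_content(seq: str) -> float:
    """Fraction of G/C bases among unambiguous bases in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0

# Hypothetical example: 150 bp paired-end reads, 200 million read pairs, 1 Gb genome
cov = expected_coverage(150, 2 * 200_000_000, 1_000_000_000)
print(f"Expected coverage: {cov:.0f}x")              # 60x, the rule of thumb above
print(f"GC of test fragment: {gc_content('ATGCGGCCAT'):.2f}")
```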
3 Next-Generation Sequencing Platforms

The first-generation Sanger sequencing has limitations in read length, which have been met by second-generation sequencing (SGS) and third-generation sequencing (TGS). Short-read sequencers on the market, for example Illumina's NovaSeq, HiSeq, NextSeq, and MiSeq, Thermo Fisher's Ion Torrent sequencers, and BGI's MGISEQ and BGISEQ, sequence up to 600 bases. Long-read sequencers produce reads longer than 10 kb; currently Pacific Biosciences' (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies' (ONT) nanopore sequencing are the most widely deployed [10]. For long-read sequencing in particular, a catalogue of tools covering the various steps of TGS data analysis is available online (https://long-read-tools.org/table.html). Sample tissues and isolated nucleic acids should be thoroughly checked for quality (high-quality, pure, intact DNA/RNA) before proceeding to NGS. An appropriate sequencing platform should be selected according to the experimental plan: number of samples, replications, coverage, accuracy, and type of sample tissue. De novo sequencing, genome finishing, and structural-variant detection benefit from long-insert paired-end DNA libraries, and long- and short-read platforms can be combined for better downstream analysis options.
3.1 Critical Factors Before Choosing the Sequencing Platform
The choice of sequencing platform influences the cost and success of the assembly process. Different platforms generate different types of data that are analyzed with different assembly programs; because assembly programs are specific to the data type, the analysis pipeline should be decided before sequencing is performed. The method developed by Frederick Sanger and colleagues is known as the first-generation sequencing platform; it has been replaced by high-throughput platforms, namely second-generation sequencing (SGS: 454 pyrosequencing, Illumina sequencing, and SOLiD sequencing) and third-generation sequencing (TGS: PacBio and Nanopore). However, Sanger
sequencing is still widely used in small-scale projects and as a supporting platform for closing gaps between contigs generated by SGS and TGS. SGS generates a huge amount of data at a low per-nucleotide cost, which is the main reason it dominates the market; however, as discussed earlier, heterozygous and high/low-GC regions are not assembled correctly from SGS data alone. These problems can be overcome with TGS, which generates very long reads of 10,000–15,000 bp, with some reads exceeding 100,000 bp. Long reads are, however, more error-prone, with 10–15% error rates that require correction before and after assembly. In addition, supporting technologies can strengthen an existing genome assembly, such as optical mapping (Bionano), linked-read technology (10x Genomics Chromium system), and the genome-folding-based technique Hi-C (which analyzes chromatin interactions genome-wide); these technologies allow users to generate chromosome-level assemblies. It is difficult to single out one technique over the others, but any of them can have the upper hand for determining large-scale structural changes and improving the contiguity of an existing assembly. TGS has a clear advantage over SGS in producing less fragmented assemblies, but it has higher requirements for DNA quality, quantity, and computational resources, and it is costlier per nucleotide than SGS. Combining SGS and TGS may be even better, as errors in TGS reads can be corrected with SGS data, improving the overall quality of the assembly.

Computational resources: For genome assembly, running time and memory requirements are proportional to the amount of data and the genome size, so it is important to select an analysis tool suited to the computational resources available. Moreover, annotation that integrates external data (e.g., RNA-seq) makes the computational part more complex and intensive.
4 Demultiplexing of Raw Sequencing Data

The raw output of NGS is in the form of BCL (binary base call) files generated by Illumina sequencers during the sequencing run. After sequencing, the pooled data must be separated and assigned to each specific sample; this process of splitting sequencing data into separate per-sample files is called demultiplexing. It converts the .bcl format into the FASTQ format, a text-based format for storing nucleotide sequences and their associated quality scores [11]. BCL files contain the data from all samples in a sequencing run.
There will be nucleotides from every sample, after demultiplexing the sequencing samples can be specifically separated. After conversion from. bcl to .fastq, the fastq data consists of sequence id or information about specific sample type, raw nucleotide data from that sample, a plus symbol which is used simply as a spacer and finally quality score values for each nucleotide [12]. The quality score represents the accuracy of identified nucleotide for every nucleotide that is sequenced. The symbols represent the low-quality score, numbers indicate better score and letters code for best accuracy. For example, the quality score Q10 represents the probability of occurrence of incorrect base in one in ten bases indicating an overall accuracy of 90%, similarly, Q20 represents a probability of incorrect base call in one in 20 bases meaning 99% accuracy in base call, whereas Q30 represents 99.9% accuracy with a probability that only a single base out of every thousand will be incorrectly called. When the sequencing quality reaches Q30, almost all the reads (a read is a sequence of nucleotides that will be sequenced) considered perfect with very little ambiguity in the sequencing and hence, Q30 is considered as a quality benchmark in NGS [13]. Some of the tools that are available for demultiplexing NGS data include GBSX [14], Je (scRNA-seq and iCLIP) [15], DigestiFlow [16], ngsComposer [17] etc.
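As a minimal illustration of how Phred+33 quality strings decode into Q scores and error probabilities, consider the sketch below; the four-line FASTQ record is invented for the example.

```python
def phred33_scores(quality_string: str) -> list[int]:
    """Convert a Phred+33 quality string to integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]

def error_probability(q: int) -> float:
    """Phred definition: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# A made-up four-line FASTQ record: id, sequence, spacer, quality string
record = ["@SAMPLE1:run1:0001", "ACGTACGT", "+", "IIIIFFFF"]
scores = phred33_scores(record[3])
print(scores)                                        # [40, 40, 40, 40, 37, 37, 37, 37]
print([round(error_probability(q), 5) for q in scores])
q30_fraction = sum(q >= 30 for q in scores) / len(scores)
print(f"Bases at Q30 or above: {q30_fraction:.0%}")  # 100%
```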
5 Assembly

Assembly is the process of putting reads together to recover biological meaning. The workflow is the same regardless of sequencing technology: sequencing, quality control (QC), assembly, and validation. During QC, sequences are screened for overall quality and for the presence of adaptor sequences and other contaminants. In the assembly stage, several assembly programs are tried, often with different parameters, and the results are validated against several assembly criteria; the aim is a genome assembly with the least fragmentation and the fewest mis-assemblies. Assemblies can be generated by two approaches: de novo and reference-based. In de novo assembly, the genome is reconstructed from the information in overlapping reads, while a reference can be used to guide the mapping of reads or to reorder already existing de novo assembled contigs. De novo assembly is more challenging, as genome size, topological complexity, and the non-randomness of sequences can all cause problems. Several genome assemblers have been developed in the past decade, each differing in the types of reads required, the type of graph constructed, the way sequencing errors are corrected, and the ability to deal with different read lengths. Based on their algorithms, however, all assemblers can be classified into two
broad classes: overlap-layout-consensus (OLC) and de Bruijn graph (DBG). OLC is generally suitable for long reads; it works well with a small number of reads with adequate overlap, but it is computationally complex and resource-consuming. The ability to handle repeats in a complex genome is one advantage of OLC long-read assemblers, though they suffer from lower accuracy. OLC assembly proceeds in three steps: (1) identification of candidate overlaps, (2) fragment layout formation, and (3) consensus sequence formation. In contrast, DBG is more popular for the analysis of short reads. A de Bruijn graph is a data structure that represents the overlap structure of short reads; note that DBG assemblers are most useful in species with few repeats in the genome. The DBG-based approach is also known as the k-mer graph approach: reads are decomposed into k-mers, each node of the graph represents a k-mer, and edges represent overlaps between reads. Contigs are generated by traversing an Eulerian path through the graph (see the sketch below). Table 3.2 lists commonly used assemblers for various kinds of data.
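To make the k-mer graph idea concrete, the following toy sketch decomposes reads into k-mers and records the (k-1)-mer overlaps as edges; real DBG assemblers add error correction, graph simplification, and Eulerian path traversal on top of this.

```python
from collections import defaultdict

def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer seen
    in the reads contributes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy overlapping reads drawn from the sequence "ACGTGCA"
reads = ["ACGTG", "CGTGC", "GTGCA"]
for node, successors in de_bruijn_graph(reads, k=4).items():
    print(node, "->", successors)   # duplicate edges arise where reads overlap
```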
5.1 Accuracy
The accuracy and contiguity of an assembly depend on the algorithms used, the experimental design, library insert sizes, read accuracy, and genome complexity. Assessing assembly quality is therefore challenging and requires both statistical and biological validation, using parameters such as overall assembly size, contiguity measures (N50, NG50, NA50, and NGA50), assembly likelihood scores, and the completeness of the genome assembly (BUSCO score and/or RNA mapping). The N50 statistic is the most widely used contiguity measure: it is the length of the shortest contig such that contigs of that length or longer cover 50% of the assembly (see the sketch at the end of this subsection). NG50 is the same as N50 except that the calculation is based on the genome size rather than the assembly size, and NA50 and NGA50 are analogous to N50 and NG50 with contigs replaced by blocks aligned to the reference. Since each sequencing platform has inherent advantages and disadvantages, using both SGS and TGS can improve assembly quality; assemblers such as LoRDEC, DBG2OLC, and Jabba have been developed to handle short as well as long reads, an approach known as hybrid assembly. In hybrid approaches, data from different platforms are combined either at the read level or at the contig level, typically through the following steps: generate contigs from short reads using a DBG, map the contigs to long reads, correct long-read errors by multiple sequence alignment, build the best overlap graph using OLC, and generate the final consensus. For metagenomic assembly, the large data volume, sequencing quality, unequal representation of the microbial community, and closely related strains of the same species are the major challenges. The unavailability of sequenced data for
Table 3.2 List of widely used assemblers. Each entry gives: assembler; data types; platform compatibility; type of assembler/algorithm used; year.

ABySS; genome; Illumina, SOLiD; de novo/DBG; 2008
ALLPATHS-LG; genome; Illumina, SOLiD; de novo/DBG; 2011
Celera WGA Assembler; genome; Sanger, Illumina, 454; de novo and reference assembly/OLC; 2008
CLC Genomics Workbench; genome; Sanger, 454, Illumina, SOLiD; de novo and reference assembly/DBG; 2008
DNASTAR; genome, exomes, transcriptomes, metagenomes, ESTs; Illumina, SOLiD, 454, Ion Torrent, Sanger; de novo/DBG; 2007
Newbler; genome, ESTs; 454, Sanger; de novo/OLC; 2004
PASHA; genome; Illumina; de novo/DBG; 2011
Trinity; transcriptome; Illumina, 454, SOLiD; de novo/DBG; 2011
SOAPdenovo; genome, metagenomes; Illumina; de novo/DBG; 2009
SPAdes; genome, metagenomes; Illumina, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore; de novo/DBG; 2012
Velvet; genome; Sanger, 454, Illumina, SOLiD; de novo/DBG; 2007
LoRDEC; genome; Illumina, PacBio; hybrid assembler; 2014
DBG2OLC; genome; Illumina, PacBio, Oxford Nanopore; hybrid assembler; 2016
Jabba; genome; Illumina, PacBio; hybrid assembler; 2016
MEGAHIT; metagenome; Illumina; de novo; 2015
IDBA-UD; metagenome; Illumina; de novo; 2014
certain species also needs to be addressed. Metagenome assembly can be broadly classified into strain-level and consensus assembly.
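Because N50 and L50 (Table 3.1) recur throughout assembly comparison, a minimal sketch of their computation from a list of contig lengths is shown below; the contig lengths are made-up values.

```python
def n50_l50(contig_lengths: list[int]) -> tuple[int, int]:
    """N50: length of the shortest contig at which the length-sorted contigs
    reach half of the total assembly size. L50: how many contigs that takes."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i
    raise ValueError("empty contig list")

# Made-up assembly of six contigs (total 1000 kb)
contigs_kb = [400, 250, 150, 100, 60, 40]
n50, l50 = n50_l50(contigs_kb)
print(f"N50 = {n50} kb, L50 = {l50} contigs")   # N50 = 250 kb, L50 = 2
```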
5.2 Assembly Finishing

Genome assemblies generated by different assemblers must be scrutinized and checked for accuracy because of low coverage in some regions, poor data quality, and errors made while handling repeats. This re-examination of assemblies is known as assembly finishing and involves three main steps: gap closure, assembly
validation, and genome refinement. Gaps can be identified in the assembly and closed by direct PCR, mate-pair libraries, and primer walking; other relevant information such as BAC libraries, transcriptome data, ESTs, and physical maps is also used for gap closure, and recent techniques such as optical mapping and Hi-C are becoming popular for this purpose. Assembled contigs can be analyzed with a number of tools, including Consed, Autofinish, BACCardI, and GAP4. After assembling the sequences, it is vital to check the correctness, accuracy, and coverage of the assembly: apart from technical measures such as N50 and N90, it is equally important to check gene completeness for functional relevance. Tools such as Benchmarking Universal Single-Copy Orthologs (BUSCO) (https://busco.ezlab.org/), MaGuS (https://github.com/vlasmirnov/MAGUS), and the Quality Assessment Tool for Genome Assemblies, QUAST (http://quast.sourceforge.net/), can be employed to check the quality, contiguity, and completeness of genome assemblies.
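Such quality checks are straightforward to script. The sketch below assumes QUAST and BUSCO are installed and on the PATH; the file names and the lineage dataset are placeholders, and the flags should be verified against the versions installed locally.

```python
import subprocess

assembly = "assembly.fasta"    # placeholder file names
reference = "reference.fasta"

# QUAST: contiguity metrics (N50 etc.; misassemblies if a reference is given)
subprocess.run(["quast.py", assembly, "-r", reference, "-o", "quast_out"],
               check=True)

# BUSCO: completeness against single-copy orthologs of a lineage dataset
subprocess.run(["busco", "-i", assembly, "-l", "poales_odb10",
                "-m", "genome", "-o", "busco_out"], check=True)
```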
6 Annotation

Annotation is a multi-level process that includes prediction of protein-coding genes as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons, and other mobile elements. Annotation depends on the type of sequencing (DNA or RNA) as well as on the availability of a genome reference; based on these two factors, one can select the suite/tool/platform for annotating the assembly. Annotation needs to be done at both the structural and functional levels. For whole-genome assemblies, a process called structural annotation is necessary: it detects genes, including their exon/intron structures, within a given assembly. If no genome reference is available, ab initio annotation needs to be done. Ab initio annotation relies on ab initio gene predictors, which in turn rely on training data to construct an algorithm or model; prediction is then made from the genomic sequence in question using statistical analysis and other gene signals such as k-mer statistics and open reading frame length.
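As a toy illustration of the simplest such gene signal, an open reading frame, the sketch below scans the forward strand of a sequence in all three frames; real ab initio predictors such as AUGUSTUS combine many more signals with trained models.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) coordinates of ORFs on the forward strand."""
    seq, orfs = seq.upper(), []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == START and start is None:
                start = i                      # open an ORF at the first ATG
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF at the stop codon
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17)]: ATG AAA TTT GGG TAA
```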
6.1 Annotation Quality Scoring Parameters
Annotation edit distance (AED) measures how well an annotation agrees with overlapping aligned ESTs, mRNA-seq, and protein homology data. AED values range from 0 to 1, with 0 denoting perfect agreement of the annotation with the aligned evidence and 1 denoting no evidence support for the annotation [18]. Annotation scores such as the annotation confidence score (ACS) are computed by combining sequence and textual annotation similarity using a modified version of a logistic curve. The most important
feature of the proposed scoring scheme is that it generates a score reflecting the annotation quality of genes, automatically adjusting for the number of genomes used to compute the score and their phylogenetic distance. Updating and re-annotating genome annotations is necessary to keep the information accurate and relevant, because knowledge of gene products expands every day through downstream research such as comparative genomics, transcriptomics, proteomics, and metabolomics. Several pipelines, both de novo and reference-based, are available for genomes and transcriptomes, and several suites exist for functional annotation. Automatic annotation databases and tools are available for non-coding genes (https://m.ensembl.org/info/genome/genebuild/ncrna.html). RefSeq is a highly curated collection of annotated genomes and transcripts that is widely used as a reference in genome projects and analysis tools and is considered to contain high-quality annotations. Although annotation tasks such as protein function prediction are challenging for machine-learning models, these models are likely to play a big role in future annotation given the constant increase in available data. Table 3.3 lists various tools/software/databases used in annotation.

6.2 Functional Annotation
Functional annotation is the process of collecting information about, and describing, a gene's biological identity. The most widely used functional annotation is Gene Ontology (GO), which provides defined GO terms classified as 'Biological Process' (molecular events), 'Cellular Component' (the location of a protein at the cellular level), and 'Molecular Function' (the activity of gene products at the molecular level). Protein annotation is mostly based on InterPro (https://www.ebi.ac.uk/interpro/), which describes domains, protein families, conserved sites, binding sites, protein architecture, etc.; the databases CATH, CDD, HAMAP, MobiDB Lite, PANTHER, Pfam, PIRSF, PRINTS, PROSITE, SFLD, SMART, SUPERFAMILY, and TIGRFAMs are integrated into InterPro. Hidden Markov models (HMMs) are used for domain identification and annotation [19] and are widely applied to model correlations in gene prediction, alignment, annotation, structural alignment, etc. [20]. Table 3.4 describes databases for gene ontology and enzyme classification, Tables 3.5 and 3.6 describe annotation pipelines available online and standalone for prokaryotes, eukaryotes, and plants, and the third-generation sequencing technologies require very specialized tools for clustering, alignment, annotation, and visualization (Table 3.7). For prokaryotic genomes or reference assemblies, annotation can be made very specific with respect to the genome
Table 3.3 Annotation tools and databases for eukaryotes, by gene/RNA class

mRNA (ab initio): AUGUSTUS, FGENESH, GENSCAN, EuGene, Genome-wide Event finding and Motif discovery (GEM)
miRNA: SUmirFind/SUmirFold, sRNAanno, miRNA Digger, deepSOM, HHMMiR, microRPM, MiPred, miRA, miRanalyzer, miRCat2, miReader, miRHunter, mirinho, miRLocator, miRNA-dis, Mirnovo, miRPlex, miR-PREFeR, miRvial, plantMirP, triplet-SVM, MIReNA, miRDeep-P2, miRNAFold, miRPlant, miRBase, Pseudogene.org, Dfam
rRNA, tRNA, lncRNA, repeats, and related databases: RNAmmer, tRNAscan-SE, GtRNAdb, tRNA-DL, NONCODE, RepeatModeler, Repbase, Rhea, DAIRYdb, ChEBI
organization; annotation tools for such specialized downstream annotation are given in Table 3.8.

6.2.1 Critical Points to Be Considered for Annotation
Experimental evidence and electronic annotations can differ, so gene annotations must be selected carefully. The quality of the annotation should be checked with two or three available tools, along with the protein identity percentage when a reference is available. The absence of a gene or transcript from transcriptome data should be verified thoroughly, as it does not always correspond to the absence of that gene at that time: a gene/transcript with a low copy number may simply have escaped the
Table 3.4 Annotation databases for gene ontology, visualization, and enzymes

Annotation databases for gene ontology and visualization:
AmiGO: search the Gene Ontology data for annotations, gene products, and terms (http://amigo.geneontology.org/amigo)
REVIGO: a web server that summarizes long, unintelligible lists of GO terms (http://revigo.irb.hr/)
AgriGO: a web-based tool and database for gene ontology analysis (http://bioinfo.cau.edu.cn/agriGO/)
Blast2GO: a bioinformatics platform for high-quality functional annotation and analysis of genomic datasets (https://www.blast2go.com/)
DAVID: a database for annotation, visualization and integrated discovery (https://david.ncifcrf.gov/)
GO FEAT: web-based functional annotation tool for genomic and transcriptomic data (http://computationalbiology.ufpa.br/gofeat/)
Trinotate: de novo RNA-seq annotation (https://rnabio.org/module-07-trinotate/0007/02/01/Trinotate/)
TRAPID: online tool for the fast, reliable, and user-friendly analysis of de novo transcriptomes (http://bioinformatics.psb.ugent.be/trapid_02/)
TOA: package for automated functional annotation in non-model plant species (https://github.com/GGFHF/TOA/)
EnTAP: a eukaryotic non-model annotation pipeline (https://entap.readthedocs.io/en/v0.8.0-beta/introduction.html)
EggNOG: a database of orthology relationships, functional annotation, and gene evolutionary histories (http://eggnog5.embl.de/#/app/home)
Phytozome: plant comparative genomics portal (https://jgi.doe.gov/data-and-tools/phytozome/)
TreeGenes: genomic and phenomic information for tree plants (https://treegenesdb.org/)
Inparanoid: a program that identifies orthologs (https://inparanoid.sbc.su.se/cgi-bin/faq.cgi)
CoGe: a platform for performing comparative genomics (https://genomevolution.org/coge/)
OrthoFinder: a comprehensive platform for comparative genomics (https://github.com/davidemms/OrthoFinder)
OrthoMCL: a genome-scale algorithm for grouping orthologous protein sequences (https://orthomcl.org/orthomcl/app)
PFAM: a large collection of protein families (http://pfam.xfam.org/)
TIGRFAM: a prokaryotic protein families database (https://www.ncbi.nlm.nih.gov/genome/annotation_prok/tigrfams/)
PANTHER: a protein classification database (http://www.pantherdb.org/)
SMART: a protein database that covers many proteomes (http://smart.embl-heidelberg.de/)
CDD: a protein annotation resource consisting of a collection of well-annotated multiple sequence alignments (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml)

Annotation databases for enzymes:
Mercator: classifies protein or gene sequences (https://plabipd.de/portal/mercator-sequence-annotation)
BRENDA: collection of enzyme functional data (https://www.brenda-enzymes.org/)
KEGG: database resource for understanding high-level functions (https://www.genome.jp/kegg/)
MetaCyc: curated metabolic database containing metabolic pathways, enzymes, and metabolites (https://metacyc.org)
Plant Metabolic Network: broad network of plant metabolic pathway databases (http://www.plantcyc.org)
BioCyc: collection of organism-specific metabolic pathway and genome databases (https://biocyc.org/)
Reactome: visualization, interaction, and pathway database (https://reactome.org/)
MapMan: annotation and visualization (http://mapman.gabipd.org/web/guest/mapcave)
criteria used for annotation. Cross-checking is therefore a must, and statements about the presence or absence of genes should be made with caution and with respect to functional relevance. The stringency of parameters used in the bioinformatics tools/software should be checked manually, alongside the default settings, two or three times before arriving at solid, confirmed statements regarding the functional annotations.
7 Visualization

After the variants are annotated, they are visualized using visualization tools and genome browsers. Visualizing the variants reveals information such as mapping quality, the aligned reads, and
Table 3.5 Annotation pipelines available for plants and prokaryotes

MAKER2: genome annotation pipeline (https://www.yandell-lab.org)
Hayai-Annotation Plants v2.0: functional annotation system specialized in plant species (https://plantgarden.jp/en/list/tool)
GOMAP: Gene Ontology Meta Annotator for Plants (https://dill-picl.org/projects/gomap/)
MicrobeAnnotator: a user-friendly, comprehensive functional annotation pipeline for microbial genomes (https://github.com/cruizperez/MicrobeAnnotator)
NCBI Eukaryotic Annotation Pipeline: a pipeline for the execution of all annotation steps (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/)
BRAKER1: a pipeline for RNA-seq-based genome annotation (GeneMark-ET and AUGUSTUS) (https://github.com/Gaius-Augustus/BRAKER)
MyPro: high-quality prokaryotic genome assembly and annotation (https://sourceforge.net/projects/sb2nhri/files/MyPro/)
PAGIT: a toolkit for improving the quality of genome assemblies created via assembly software (https://www.sanger.ac.uk/tool/pagit/)
KBase: assemble and annotate prokaryotic genomes (https://www.kbase.us/)
CODON: tool for manual curation, beyond prediction and annotation; software for manual curation of prokaryotic genomes (https://github.com/codonlibrary/codonPython)
VIRULIGN: codon-correct alignment and annotation of viral genomes (https://github.com/rega-cev/virulign)
GenSAS v6.0: whole-genome structural and functional annotation for eukaryotes and prokaryotes (https://www.gensas.org/)
Prokka: rapid annotation of prokaryotic genomes (https://github.com/tseemann/prokka)
Rapid Annotations using Subsystems Technology toolkit (RASTtk): a fully automated annotation service for complete, or near-complete, archaeal and bacterial genomes (http://rast.nmpdr.org)
GAMOLA2: annotation and curation of draft and complete microbial genomes (http://sanger-pathogens.github.io/Artemis/Artemis/)
NCBI Prokaryotic Genome Annotation Pipeline (PGAP): annotates bacterial and archaeal genomes (chromosomes and plasmids), combining ab initio gene prediction algorithms with homology-based methods (https://www.ncbi.nlm.nih.gov/genome/annotation_prok/)
RefSeq: an integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein records (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/)
DFAST: a prokaryotic genome annotation pipeline (https://github.com/nigyta/dfast_core/, https://dfast.nig.ac.jp/)
Genome Sequence Annotation Server (GenSAS): a web-based genome annotation platform for structural and functional annotation (https://www.gensas.org)
GO FEAT: a rapid web-based functional annotation tool for genomic and transcriptomic data (http://computationalbiology.ufpa.br/gofeat/, https://github.com/fabriciopa/gofeat)
ChIPseeker: R package for annotating ChIP-seq data; visualizes coverage of ChIP-seq data, peak annotation, and average profiles (https://www.bioconductor.org/packages/devel/bioc/vignettes/ChIPseeker/instdoc/ChIPseeker.html)
ChIP-Seq tools: web server providing access to a set of useful tools for common ChIP-seq data analysis tasks (https://ccg.epfl.ch//chipseq/)
OmicsBox: complete software package (https://www.biobam.com/omicsbox/)
Chloe: plant organelle annotation tool (https://chloe.plantenergy.edu.au)
GeneSeqer@PlantGDB: genome database (http://brendelgroup.org/bioinformatics2go/GeneSeqer.php)
Plant mitochondrial genome annotation: genome database (https://dogma.ccbb.utexas.edu/mitofy/)
BlastKOALA: KEGG Orthology And Links Annotation (https://www.kegg.jp/blastkoala/)
PGA: command-line tool, software package (https://bio.tools/Plastid_Genome_Annotator, https://github.com/quxiaojian/PGA)
Hayai-Annotation Plants: database (https://github.com/aghelfi/Hayai-Annotation-Plants)
OrthoVenn: database (http://probes.pw.usda.gov/OrthoVenn)
EggNOG: database (http://eggnog5.embl.de/#/app/home)
OrthoMCL: database (https://orthomcl.org/orthomcl/app)
COG protein: database (https://www.ncbi.nlm.nih.gov/research/cog-project/)
Table 3.6 Pipelines/platforms for annotation for plants

OmicsBox: https://www.biobam.com/omicsbox/
Chloe: https://chloe.plantenergy.edu.au
GeneSeqer@PlantGDB: http://brendelgroup.org/bioinformatics2go/GeneSeqer.php
Plant mitochondrial genome annotation: https://dogma.ccbb.utexas.edu/mitofy/
BlastKOALA: https://www.kegg.jp/blastkoala/
PGA: https://bio.tools/Plastid_Genome_Annotator, https://github.com/quxiaojian/PGA
Hayai-Annotation Plants: https://github.com/aghelfi/Hayai-Annotation-Plants
BRAKER2: https://github.com/Gaius-Augustus/BRAKER
OrthoVenn: http://probes.pw.usda.gov/OrthoVenn
EggNOG: http://eggnog5.embl.de/#/app/home
OrthoMCL: https://orthomcl.org/orthomcl/app
COG protein: https://www.ncbi.nlm.nih.gov/research/cog-project/
annotation information, which includes the consequence and impact of variants and the scores of different annotation tools (Table 3.9). Three popular tools for data visualization are the Integrative Genomics Viewer (IGV) [21], the genome browser GBrowse [22], and JBrowse [23]. Other visualization tools serve specific applications such as gene expression, reduced representations of gene sets, variants, and network analysis (Table 3.10). Volcano plots can also be used to visualize statistically significant changes in gene expression or to demonstrate large changes in gene
Table 3.7 Annotation tools/servers/programs exclusively for long-read sequencing

Clustering:
CARNAC: https://hal.archives-ouvertes.fr/hal-01930211
isONclust: https://github.com/ksahlin/isONclust

Alignment:
GMAP: https://github.com/juliangehring/GMAP-GSNAP
BLASR: https://github.com/PacificBiosciences/blasr
BBMap: https://www.osti.gov/biblio/1241166-bbmap-fast-accurate-splice-aware-aligner
Magic-BLAST: https://ncbi.github.io/magicblast/
Minimap2: https://github.com/lh3/minimap2
Meta-aligner: http://brl.ce.sharif.edu/software/meta-aligner/
IMOS: http://brl.ce.sharif.edu/software/imos/
deSALT: https://github.com/hitbc/deSALT
graphmap2: https://nanoporetech.com/resource-centre/graphmap2-splice-aware-rna-seq-mapper-long-reads

Prediction:
BRAKER1: https://bioinf.uni-greifswald.de/bioinf/braker/
CodingQuarry: https://sourceforge.net/projects/codingquarry/

Visualization:
MatchAnnot: https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-clusterView-Plots
Iso-Seq: https://github.com/PacificBiosciences/IsoSeq
IsoSeq-Browser: https://github.com/goeckslab/isoseq-browser
IsoView: https://github.com/JMF47/IsoView

Webserver:
NanoGalaxy: https://galaxyproject.org/use/nanogalaxy/

Annotation pipeline:
LoReAn: https://github.com/lfaino/LoReAn
MAKER2: https://github.com/wuying1984/MAKER2_PM_genome_annotation

Transcriptome assembly:
IDP-denovo: www.healthcare.uiowa.edu/labs/au/IDP-denovo/

Annotation:
Tombo: https://nanoporetech.github.io/tombo/resquiggle.html
IDP: https://github.com/RuRuYa/IDP-denovo
NanoMod: https://github.com/WGLab/NanoMod
Mandalorion Episode II: https://github.com/rvolden/Mandalorion-Episode-II
Pinfish: https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms
TAPIS: https://help.rc.ufl.edu/doc/TAPIS
FLAIR: https://github.com/flairNLP/flair
SQANTI: https://github.com/ConesaLab/SQANTI
TAMA: https://github.com/GenomeRIK/tama
Table 3.8 Specialized prokaryotic annotation tools

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats): CRISPRfinder, CRISPRmap, CRISPI, CRISPRTarget, CRISPy-web
Regulatory proteins/antibiotic resistance: ResFinder, ARG-ANNOT, CARD, MEGARes, BacMet, P2CS (Prokaryotic 2-Component Systems), P2RP (Predicted Prokaryotic Regulatory Proteins)
Virulence determinants: T3SEdb, SIEVE T3SE, VirulenceFinder, ClanTox, T3DB (the Toxin and Toxin Target Database), TAfinder 2.0, VFDB, PAIDB (Pathogenicity Island Database), Gypsy Database, PanDaTox (Pan-genomic Database for Genomic Elements Toxic to Bacteria), PathogenFinder (predicts pathogenic potential), VirulentPred, Effective
Genomic islands and prophages: Phage_Finder, Prophinder, PHAST (PHAge Search Tool), PHASTER, Prophage Hunter, IslandViewer, PAIDB (PAthogenicity Island DataBase), MTGIpick
expression [24]. In a volcano plot, each dot represents a gene: the log fold change (e.g., computed from FPKM values) is plotted on the x-axis, and -log10(p-value) on the y-axis. The plot shows at a glance which genes are not statistically significant and which show decreased or increased expression. In a typical variant workflow, BCL files are converted into FASTQ through demultiplexing, FASTQ files are converted into SAM or BAM files using an aligner such as BWA, variants are detected with GATK or VarScan2, and finally the data are visualized in IGV, GBrowse, or JBrowse.
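A volcano plot of this kind takes only a few lines with matplotlib; the sketch below uses simulated fold changes and p-values rather than real RNA-seq output.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
log2_fc = rng.normal(0, 2, 500)             # made-up log2 fold changes
pvals = 10 ** -rng.exponential(1.5, 500)    # made-up p-values in (0, 1]

# Simple significance rule: p < 0.05 and at least a two-fold change
sig = (pvals < 0.05) & (np.abs(log2_fc) > 1)
plt.scatter(log2_fc, -np.log10(pvals), s=8, c=np.where(sig, "red", "grey"))
plt.axhline(-np.log10(0.05), ls="--", lw=0.8)
plt.axvline(-1, ls="--", lw=0.8)
plt.axvline(1, ls="--", lw=0.8)
plt.xlabel("log2 fold change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano plot (simulated data)")
plt.show()
```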
Table 3.9 Differences among the three widely used genome browsers

IGV: the easiest of the three to use; full functionality; faster analysis; can be run on one computer at a time; needs a lot of computing power for large data sets.
GBrowse: more difficult than IGV; creates a website for the project; relatively slow; the same project can be operated by many people simultaneously; requires internet for real-time collaboration.
JBrowse: more difficult than IGV; an advanced version of GBrowse and faster than it; fixes bugs better than GBrowse, and the overall output is cleaner and provides more information; requires internet for real-time collaboration.

8 Workflow of NGS According to Applications
8.1 Whole Genome Sequencing (WGS) Data Analysis
All analyses begin with assembly or alignment: the raw data are converted into FASTQ before being aligned to a reference sequence. Many tools are available for read alignment in NGS, such as Sambamba [25], Kart [26], GASAL2 [27], and the Burrows-Wheeler Aligner (BWA) [28]. BWA, for example, is a software package for mapping sequences against a reference genome [29]; it is the most commonly used tool, handles very long reads, and is one of the quickest aligners. BWA offers three algorithms: BWA-backtrack [30] for very short reads, such as those under 50 bases; BWA-SW [28], suitable when there are gaps in the alignment; and BWA-MEM (BWA's Maximal Exact Match) [31], which is the latest, is faster and more accurate, and is usually preferred for standard Illumina sequencing (see the scripted sketch at the end of this subsection). If a reference genome is not available, de novo assembly must be performed: overlapping paired reads are combined into contiguous sequences to generate contigs [32], which are then combined into scaffolds to produce a complete assembly; scaffolds can contain gaps between contigs where the nucleotide sequence is unknown. Alignment information is stored in Sequence Alignment/Map (SAM) files, the universal file format for mapped sequence reads [33], which contain the sequence and quality score of each read. Many tools are available for comparing samples and for identifying single-nucleotide polymorphisms (SNPs) [34, 35], insertions and deletions (indels) [36, 37], and copy number variations (CNVs). The two major programs for comparing two different samples are the Genome Analysis Toolkit (GATK) and VarScan2 (variant detection in massively parallel sequencing data). GATK
Table 3.10 Visualization tools/servers/websites/programs for specific representations

CGView Server: http://cgview.ca/
Circos: http://circos.ca/
Jena Prokaryotic Genome Viewer (JPGV): http://jpgv.leibniz-fli.de/cgi/index.pl
GenomeVx: http://wolfe.ucd.ie/GenomeVx/
myGenomeBrowser: https://framagit.org/BBRIC/myGenomeBrowser
DNAPlotter: https://www.sanger.ac.uk/tool/dnaplotter/
OrganellarGenomeDRAW: https://chlorobox.mpimp-golm.mpg.de/OGDraw.html
MeV: https://mev.tm4.org/#/about
VolcaNoseR: https://huygens.science.uva.nl/VolcaNoseR
Heatmapper: http://www.heatmapper.ca/
Bubble plot: https://www.data-to-viz.com/graph/bubble.html
Venny: https://bioinfogp.cnb.csic.es/tools/venny/
WGCNA, BiocManager: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/
BEAVR: https://github.com/developerpiru/BEAVR
Volcanoplot: https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html
eVITTA: https://tau.cmmt.ubc.ca/eVITTA/
DEBrowser: https://bioconductor.org/packages/release/bioc/html/debrowser.html
Quickomics: http://quickomics.bxgenomics.com
Dr. Tom: https://www.bgi.com/global/dr-tom-system-webinar-registration-page/
eFP-Seq_Browser: https://bar.utoronto.ca/eFP-Seq_Browser/
EaSeq: https://easeq.net/downloadeaseq/
was developed at the Broad Institute and is used for variant detection and genotyping; both programs highlight nucleotide variations relative to the reference genome. VarScan2, on the other hand, was developed at the Genome Institute at Washington University and uses a limited number of tools for the detection of SNPs, indels, and CNVs [38]. GATK uses more stringent criteria and has a longer run time, and hence increases the chance of false
negatives [39]. VarScan2, on the other hand, uses less stringent criteria and has a shorter run time, and therefore increases the chance of false positives.
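As a concrete sketch of the FASTQ-to-VCF route described above, the script below shells out to BWA-MEM, samtools, and GATK; the file names are placeholders, and options (for example, the read groups and reference dictionary that GATK additionally requires) should be checked against the installed versions.

```python
import subprocess

ref, r1, r2 = "ref.fa", "reads_R1.fastq.gz", "reads_R2.fastq.gz"  # placeholders

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["bwa", "index", ref])                        # index the reference once
with open("aln.sam", "w") as sam:                 # align paired-end reads
    subprocess.run(["bwa", "mem", ref, r1, r2], stdout=sam, check=True)
run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.sam"])
run(["samtools", "index", "aln.sorted.bam"])
run(["gatk", "HaplotypeCaller", "-R", ref,        # call SNPs and indels
     "-I", "aln.sorted.bam", "-O", "variants.vcf"])
```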
8.2 RNA-Seq Data Analysis

RNA sequencing (RNA-seq) refers to the quantification of the RNA sequences present in a sample using NGS. In single-end sequencing, a fragment from the sense or antisense strand is read, which is adequate for studying gene expression; paired-end sequencing reads the fragment from both ends and is used for analyzing alternative splicing and for de novo transcriptome assembly of the total RNA content of a cell (microRNAs, mRNAs, tRNAs, rRNAs, long non-coding RNAs (lncRNAs), and degraded RNAs). RNA-seq analysis begins with assembly or alignment [40]. Once the reads are aligned to the reference genome, counts must first be normalized for gene length: the raw number of reads mapping to a gene depends on its length, whereas expression comparisons require a length-independent measure. Gene expression is therefore normalized as fragments per kilobase of transcript per million mapped reads (FPKM) using software such as StringTie, which performs transcript assembly and quantification for RNA-seq [41] and reports an FPKM value for each gene, with higher values indicating higher expression [42] (a small worked example follows below). StringTie is also used to identify alternative transcripts resulting from alternative splicing of mRNAs, which is particularly important during developmental stages. DESeq2 was developed to analyze differential gene expression, often displayed as heat maps in which dark-colored cells indicate higher expression and pale cells indicate decreased expression relative to the controls.
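The FPKM arithmetic itself is simple: fragments * 1e9 / (gene length in bp * total mapped fragments). A minimal sketch with made-up counts:

```python
def fpkm(fragments: int, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """FPKM = fragments * 1e9 / (gene_length_bp * total_mapped_fragments):
    normalizes a raw count by gene length (per kb) and library size (per million)."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)

# Made-up example: 500 fragments on a 2 kb gene, 20 million mapped fragments
print(f"{fpkm(500, 2_000, 20_000_000):.2f} FPKM")   # 12.50
```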
8.3 Targeted Sequencing
Whole-genome sequencing (WGS) evaluates the nucleotides present in the entire genome of a cell, whole-exome sequencing (WES) analyzes the sequences present in the protein-coding parts of genes, and targeted sequencing (TS) reads the nucleotides of specific portions of DNA or of genes linked by physiological or signal transduction pathways; both intergenic and intragenic regions, such as introns and exons, can be analyzed with this approach. Targeted amplicon sequencing capitalizes on the capacity of NGS to sequence targeted regions from a large number of samples [43]. TS offers higher multiplexing capacity and lower data-analysis requirements; it is easily scalable, simple to execute, neither time-consuming nor labor-intensive, relatively inexpensive, and applicable to a broad diversity of organisms. It also facilitates the identification of somatic variations and low-frequency variants, which enables the recognition of novel functional variants and biomarker discovery and paves the way for translational research. The depth of coverage achievable by TS is very high compared with WGS and WES: for a small portion of a genome, the coverage
obtained through TS is 5000×, while WGS and WES provide 30× and 200× coverage, respectively [44]; very rare variants can therefore be detected through TS because of its high sensitivity. TS requires a pre-sequencing sample preparation step called target enrichment, in which target sequences (DNA or RNA) are either directly amplified by PCR (multiplex-PCR-based methods) [45, 46] or captured (hybrid capture methods) [47, 48] and then sequenced [49]. The PCR-based method yields tagged amplicons that are ready to be sequenced directly on a next-generation platform, without extensive amplicon preparation steps (e.g., several rounds of ligation and purification) or the purchase of every possible combination of locus-specific primers; it also gives complete control over the PCR protocols while still generating a tagged amplicon library. In hybrid capture, a genomic or cDNA library is created with appropriate sequencing motifs (adaptors or primers); the enrichment step then uses probes designed to hybridize to the DNA or RNA targets to be captured. After hybridization, the DNA/RNA molecules are captured by streptavidin-coated magnetic beads that bind the biotin attached to the probes, unbound molecules are washed away, and the captured DNA/RNA is carried forward to sequencing. Between the PCR-based and hybrid capture methods, the latter offers higher sensitivity and covers from about 100 kb up to 200 Mb of DNA, whereas the PCR-based method is suitable for targets of about 100–5000 kb. TruSeq provides a streamlined DNA/RNA library preparation workflow for targeted NGS. Once sequencing reads are obtained, a bioinformatics workflow is used to accurately map reads to a reference genome, call variants, and ensure the variants pass quality metrics. TS analysis can use the in silico application BarcodeCrucher to take raw next-generation sequence reads, perform quality-control checks, and convert the data into FASTA format organized by gene and sample, ready for phylogenetic analyses.
8.4 Metagenomics
Metagenomics is the culture-independent sequencing of the microorganisms present in a particular niche; it collectively analyzes the genomes of microbes for genetic diversity or for studies of a particular niche. Recent tools/servers for species identification and annotation are listed in Table 3.11.

8.5 Variant Discovery

Applications such as SNP genotyping, QTL-seq, MutMap (mutation mapping), and haplotyping require an understanding of nucleotide-level variation. Data from the sequencing platform are processed, the reads are aligned to the desired reference genome (parent, species-specific reference, wild type, etc.), and variants are called,
Table 3.11 Tools and servers available for metagenomics studies

MG-RAST (the Metagenomics RAST): https://www.mg-rast.org/
DIAMOND: http://www.diamondsearch.org
MEGAN: http://megan.husonlab.org
Unicycler: https://github.com/rrwick/Unicycler#installation
Medaka: https://github.com/nanoporetech/medaka
MetaErg: https://github.com/xiaoli-dong/metaerg
ATLAS: https://github.com/metagenome-atlas/atlas
KAUST Metagenomic Analysis Platform (KMAP): https://www.cbrc.kaust.edu.sa/aamg/kmap.start
LoReAn: https://github.com/lfaino/LoReAn
NCBI Prokaryotic Genomes Automatic Annotation Pipeline: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
BASys Bacterial Annotation Tool: https://www.hsls.pitt.edu/obrc/index.php?page=URL1132678306
Viral Genome ORF Reader (VIGOR): https://www.viprbrc.org/brc/vigorAnnotator.spg?method=ShowCleanInputPage&decorator=flavi
MAKER Web Annotation Service (MWAS): http://www.yandell-lab.org/software/mwas.html
GenSAS (Genome Sequence Annotation Server): https://www.gensas.org/
MicroScope: https://mage.genoscope.cns.fr/microscope/home/index.php
METAGENassist: http://www.metagenassist.ca/METAGENassist/faces/Home.jsp
Orphelia: http://orphelia.gobics.de
MetaBin: https://www.rdocumentation.org/packages/meta/versions/4.15-1/topics/metabin
AmphoraNet: https://pitgroup.org/amphoranet/
Real Time Metagenomics: https://edwards.sdsu.edu/rtmg/
EBI Metagenomics: https://www.ebi.ac.uk/metagenomics/pipelines/2.0
Kaiju: https://kaiju.binf.ku.dk/
16S Classifier: http://metabiosys.iiserb.ac.in/16Sclassifier
SpeciesFinder 1.0: https://cge.cbs.dtu.dk/services/SpeciesFinder/
PlasmidFinder 1.3: https://cge.cbs.dtu.dk/services/PlasmidFinder/
PhyloPythiaS: https://github.com/algbioi/ppsp/wiki
Virtual Metagenome: https://vmg.hemm.org/
MetaPhlAn2 (version 2.0.0): https://huttenhower.sph.harvard.edu/metaphlan2/
CoMet-Universe: http://comet2.gobics.de/
BEACON: https://www3.beacon-center.org/blog/tag/metagenomics/
JGI IMG: https://img.jgi.doe.gov/
WebMGA: http://weizhong-lab.ucsd.edu/webMGA/
COV2HTML: https://mmonot.eu/COV2HTML/
The variant information is stored in files in the variant call format (VCF); tools for variant annotation are listed in Table 3.12.
9 Conclusions

Next-generation sequencing (NGS) technologies serve molecular biologists in their day-to-day work to understand the genome, gene expression, proteomics, the epigenome, methylation patterns, variation in genic/intergenic regions, alternative splicing, and metabolite profiling with much higher scalability, accuracy, and reliability, in less time than conventional methods. The availability of various platforms, chemistries, tools, and software for NGS analysis has encouraged researchers to deploy these technologies in their everyday laboratory and field activities. Nonetheless, the appropriate computational steps, statistical significance, and protocols should be validated in each case, depending on the species, stage, tissue, replications, etc. under consideration, and logistics should be planned according to the intended NGS application. Replicated data at multiple time points are always more reliable for transcriptome analysis than data from a single sample. Generating high-quality data is essential, since reliability, accuracy, and coverage are the main criteria for downstream assembly and annotation, and such annotations in turn give reliable results for biological interpretation. It is always preferable to deploy two or three software tools, or a pipeline, to confirm and optimize the results. The assembly should be checked for completeness not only in terms of gene coverage but also on the basis of protein-coding functions. Similarly, results can be checked thoroughly using different parameters or options in the bioinformatics programs to confirm their biological concurrence and significance. The bioinformatics data are hosted by the NCBI and can be easily updated as sequencing projects progress.
Table 3.12 Annotation tools/databases/websites for variant calling

SnpEff: Genomic variant annotation and functional effect prediction toolbox. http://pcingola.github.io/SnpEff/
Ensembl Variant Effect Predictor (VEP): Effect of variants (SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. https://asia.ensembl.org/info/docs/tools/vep/index.html
SeattleSeq: Server providing annotation of SNVs and small indels, both known and novel. https://snp.gs.washington.edu/SeattleSeqAnnotation154/
AnnTools: Tool for annotating single nucleotide substitutions (SNP/SNV), small insertions/deletions (indels), and copy number variations (CNV). http://anntools.sourceforge.net/
ANNOVAR: Functional annotation of genetic variants detected from diverse genomes. https://annovar.openbioinformatics.org/en/latest/
PLANET-SNP: Identification of efficient and functional SNPs. http://www.ncgd.nbri.res.in/PLANET-SNP-Pipeline.aspx
Galaxy: Web-based analysis of variants by predicting their molecular effects on genes and proteins. https://galaxyproject.github.io/training-material/topics/variant-analysis/
Such continual updating is an important activity that must be carried out by the data depositors for the benefit of the scientific community using the data.

References

1. Reiman A, Kikuchi H, Scocchia D, Smith P, Tsang YW, Snead D, Cree IA (2017) Validation of an NGS mutation detection panel for melanoma. BMC Cancer 17(1):1–7. https://doi.org/10.1186/s12885-017-3149-0
2. Shahjaman M, Mollah MMH, Rahman MR, Islam SS, Mollah MNH (2020) Robust identification of differentially expressed genes from RNA-seq data. Genomics 112(2):2000–2010. https://doi.org/10.1016/j.ygeno.2019.11.012
3. Jiang B, Song K, Ren J, Deng M, Sun F, Zhang X (2012) Comparison of metagenomic samples using sequence signatures. BMC Genomics 13(1):1–17. https://doi.org/10.1186/1471-2164-13-730
4. Lim JS, Choi BS, Lee JS, Shin C, Yang TJ, Rhee JS, Choi IY (2012) Survey of the applications of NGS to whole-genome sequencing and expression profiling. Genomics Inform 10(1):1–8. https://doi.org/10.5808/GI.2012.10.1.1
5. Lorenz DJ, Gill RS, Mitra R, Datta S (2014) Using RNA-seq data to detect differentially expressed genes. In: Statistical analysis of next generation sequencing data. Springer, Cham, pp 25–49. https://doi.org/10.1007/978-3-319-07212-8_2
6. Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Turner DJ (2008) A large genome center's improvements to the Illumina sequencing system. Nat Methods 5(12):1005–1010. https://doi.org/10.1038/nmeth.1270
7. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 6:5448. https://doi.org/10.1101/pdb.prot5448
8. Raczy C, Petrovski R, Saunders CT, Chorny I, Kruglyak S, Margulies EH et al (2013) Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041–2043. https://doi.org/10.1093/bioinformatics/btt314
9. Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, Amselem J, Bouri L, Bocs S, Klopp C, Gibrat JF, Vlasova A, Leskosek BL, Soler L, Binzer-Panchal M, Lantz H (2018) Ten steps to get started in genome assembly and annotation. F1000Res. https://doi.org/10.12688/f1000research.13598.1
10. Amarasinghe SL, Su S, Dong X et al (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21:30. https://doi.org/10.1186/s13059-020-1935-5
11. Akgün M, Bayrak AO, Ozer B, Sağıroğlu MŞ (2015) Privacy preserving processing of genomic data: a survey. J Biomed Inform 56:103–111. https://doi.org/10.1016/j.jbi.2015.05.022
12. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, DePristo MA (2013) From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 43(1):11–10. https://doi.org/10.1002/0471250953.bi1110s43
13. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Gu Y (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13(1):1–13. https://doi.org/10.1186/1471-2164-13-341
14. Herten K, Hestand MS, Vermeesch JR, Van Houdt JK (2015) GBSX: a toolkit for experimental design and demultiplexing genotyping by sequencing experiments. BMC Bioinformatics 16(1):1–6. https://doi.org/10.1186/s12859-015-0514-3
15. Girardot C, Scholtalbers J, Sauer S, Su SY, Furlong EE (2016) Je, a versatile suite to handle multiplexed NGS libraries with unique molecular identifiers. BMC Bioinformatics 17(1):1–6. https://doi.org/10.1186/s12859-016-1284-2
16. Holtgrewe M, Nieminen M, Messerschmidt C, Beule D (2019) DigestiFlow: reproducible demultiplexing for the single cell era. PeerJ Preprints 7:e27717v3. https://doi.org/10.7287/peerj.preprints.27717v4
17. Kuster RD, Yencho GC, Olukolu BA (2021) ngsComposer: an automated pipeline for empirically based NGS data quality filtering. Brief Bioinformatics 22(5):bbab092. https://doi.org/10.1093/bib/bbab092
18. Eilbeck K, Moore B, Holt C, Yandell M (2009) Quantitative measures for the management and comparison of annotated genomes. BMC Bioinformatics 10(1):1–15. https://doi.org/10.1186/1471-2105-10-67
19. Yoon BJ (2009) Hidden Markov models and their applications in biological sequence analysis. Curr Genomics 10(6):402–415. https://doi.org/10.2174/138920209789177575
20. Bolger ME, Arsova B, Usadel B (2018) Plant genome and transcriptome annotations: from misconceptions to simple solutions. Brief Bioinformatics 3:437–449. https://doi.org/10.1093/bib/bbw135
21. Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinformatics 14(2):178–192. https://doi.org/10.1093/bib/bbs017
22. Donlin MJ (2009) Using the generic genome browser (GBrowse). Curr Protoc Bioinformatics 28(1):9–9. https://doi.org/10.1002/0471250953.bi0909s17
23. Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, Holmes IH (2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 17(1):1–12. https://doi.org/10.1186/s13059-016-0924-1
24. Goedhart J, Luijsterburg MS (2020) VolcaNoseR is a web app for creating, exploring, labeling and sharing volcano plots. Sci Rep 10(1):1–5. https://doi.org/10.1038/s41598-020-76603-3
25. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics 31(12):2032–2034. https://doi.org/10.1093/bioinformatics/btv098
26. Lin HN, Hsu WL (2017) Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 33(15):2281–2287. https://doi.org/10.1093/bioinformatics/btx189
27. Ahmed N, Lévy J, Ren S, Mushtaq H, Bertels K, Al-Ars Z (2019) GASAL2: a GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinformatics 20(1):1–20. https://doi.org/10.1186/s12859-019-3086-9
28. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
29. Abuín JM, Pichel JC, Pena TF, Amigo J (2015) BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies. Bioinformatics 31(24):4003–4005. https://doi.org/10.1093/bioinformatics/btv506
30. Abuín JM, Pichel JC, Pena TF, Amigo J (2016) SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PLoS One 11(5):e0155461. https://doi.org/10.1371/journal.pone.0155461
31. Houtgast EJ, Sima VM, Bertels K, Al-Ars Z (2018) Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths. Comput Biol Chem 75:54–64. https://doi.org/10.1016/j.compbiolchem.2018.03.024
32. Du H, Liang C (2019) Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads. Nat Commun 10(1):1–10. https://doi.org/10.1038/s41467-019-12196-4
33. Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng HW (2011) Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15):2031–2037. https://doi.org/10.1093/bioinformatics/btr319
34. Grant JR, Arantes AS, Liao X, Stothard P (2011) In-depth annotation of SNPs arising from resequencing projects using NGS-SNP. Bioinformatics 27(16):2300–2301. https://doi.org/10.1093/bioinformatics/btr372
35. Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM (2013) An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS One 8(12):e85024. https://doi.org/10.1371/journal.pone.0085024
36. Ratan A, Olson TL, Loughran TP, Miller W (2015) Identification of indels in next-generation sequencing data. BMC Bioinformatics 16(1):1–8. https://doi.org/10.1186/s12859-015-0483-6
37. Au CH, Leung AY, Kwong A, Chan TL, Ma ES (2017) INDELseek: detection of complex insertions and deletions from next-generation sequencing data. BMC Genomics 18(1):1–7. https://doi.org/10.1186/s12864-016-3449-9
38. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Wilson RK (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 22(3):568–576. https://doi.org/10.1101/gr.129684.111
39. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, DePristo MA (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303. https://doi.org/10.1101/gr.107524.110
40. Magar ND, Shah P, Harish K, Bosamia TC, Barbadikar KM, Shukla YM, Phule A, Zala HN, Madhav MS, Mangrauthia SK, Neeraja CN (2022) Gene expression and transcriptome sequencing: basics, analysis, advances. In: Gene expression. IntechOpen. https://doi.org/10.5772/intechopen.105929
41. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33(3):290–295. https://doi.org/10.1038/nbt.3122
42. Joo MS, Shin SB, Kim EJ, Koo HJ, Yim H, Kim SG (2019) Nrf2-lncRNA controls cell fate by modulating p53-dependent Nrf2 activation as an miRNA sponge for Plk2 and p21cip1. FASEB J 33(7):7953–7969. https://doi.org/10.1096/fj.201802744R
43. Bybee SM, Bracken-Grissom H, Haynes BD, Hermansen RA, Byers RL, Clement MJ, Crandall KA (2011) Targeted amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa phylogenetics. Genome Biol Evol 3:1312–1323. https://doi.org/10.1093/gbe/evr106
44. Chen R, Aldred MA, Xu W, Zein J, Bazeley P, Comhair SA, NHLBI Severe Asthma Research Program (SARP) (2021) Comparison of whole genome sequencing and targeted sequencing for mitochondrial DNA. Mitochondrion 58:303–310. https://doi.org/10.1016/j.mito.2021.01.006
45. Ganal MW, Altmann T, Röder MS (2009) SNP identification in crop plants. Curr Opin Plant Biol 12(2):211–217. https://doi.org/10.1016/j.pbi.2008.12.009
46. Onda Y, Takahagi K, Shimizu M, Inoue K, Mochida K (2018) Multiplex PCR targeted amplicon sequencing (MTA-Seq): simple, flexible, and versatile SNP genotyping by highly multiplexed PCR amplicon sequencing. Front Plant Sci 9:201. https://doi.org/10.3389/fpls.2018.00201
47. Hill CB, Wong D, Tibbits J, Forrest K, Hayden M, Zhang XQ, Li C (2019) Targeted enrichment by solution-based hybrid capture to identify genetic sequence variants in barley. Sci Data 6(1):1–8. https://doi.org/10.1038/s41597-019-0011-z
48. Ostezan A, McDonald SC, Tran DT, Souza RSE, Li Z (2021) Target region sequencing and applications in plants. J Crop Sci Biotechnol 24(1):13–26. https://doi.org/10.1007/s12892-020-00056-3
49. Cronn R, Knaus BJ, Liston A, Maughan PJ, Parks M, Syring JV, Udall J (2012) Targeted enrichment strategies for next-generation plant biology. Am J Bot 99(2):291–311. https://doi.org/10.3732/ajb.1100356
Chapter 4

Statistical and Quantitative Genetics Studies

Rumesh Ranjan, Wajhat Un Nisa, Abhijit K. Das, Viqar Un Nisa, Sittal Thapa, Tosh Garg, Surinder K. Sandhu, and Yogesh Vikal

Abstract

Quantitative genetics and plant breeding are closely intertwined, and the two disciplines have benefited from each other for the past 100 years. In fact, the majority of economically significant traits in crop and livestock species are quantitative rather than qualitative in nature. Several biometricians have made significant contributions to understanding how quantitative genetics works, and different methods of plant breeding have evolved and are still evolving. Traditional plant breeding methods emphasize the components of quantitative variance, but accuracy, time, and effectiveness are the three key constraints conventional breeding faces in dissecting that variance. High-throughput next-generation sequencing has made it easier, faster, and more precise to understand the genetics of complex traits, which has also shortened the breeding cycles required for the desired genetic gain. This chapter focuses on the statistical tools that plant breeders use to dissect the genetics of economically significant complex traits, covering both traditional and modern plant breeding methods.

Key words Quantitative genetics, Conventional approaches, Modern molecular methods, Software, Analysis
1 Introduction

The agricultural revolution is regarded as one of the most important turning points in the process of human evolution. The shift from a hunting and gathering lifestyle to the cultivation of selected plants for livelihood, dating back to about 12,000 BC, was a paradigm shift; the transition was gradual and began at different times in different parts of the globe. The oldest domesticated crops are thought to be wheat and barley, in the Fertile Crescent, about 10,000 years ago, and these two major crops still provide about 50% of world food requirements [1]. With the increase in population, many innovations have been made in agriculture, and what was once an art has become a science, now called agricultural science. To maintain food availability at the pace of the growing population, agricultural scientists around the world have
invented and adopted different technologies that boost global crop production. Breeding nutritious and productive crops requires much research effort. Plant breeding proceeds in a series of activities, from selection through hybridization to the release of hybrids/varieties. Early plant breeding was an art, relying on unplanned selection of plants with suitable traits for the next season. The discovery of Mendel's laws of inheritance laid the foundation of genetics, and planned selection turned plant breeding into a science during the twentieth century. Plant breeding then became a matter of "crossing the best, picking the best, and hoping for the best". Although many individuals contributed to scientific plant breeding in the past, it was only after Darwin (On the Origin of Species, 1859) and Mendel (laws of inheritance, 1865) that scientific plant breeding truly came to light. Mendel deliberately selected simple characters, viz., tall vs. short plants or red vs. white flowers, whose categories did not overlap, so that the variation was discontinuous; Darwin's five-year voyage, by contrast, resulted in a theory of evolution by natural selection, in which variation is continuous. The views of Darwin and Mendel therefore appeared to contradict each other, and the scientific community was divided into two groups. Traits of the former kind are called qualitative, i.e., individuals can be grouped into distinct classes; traits of the latter kind are called quantitative, i.e., individuals cannot be grouped into distinct classes. Experimental and natural populations exhibit continuous variation, which is also important for the plant breeder, as economically important traits in crops are mostly quantitative in nature. The analysis of such variation starts with the frequency distribution, and such distributions are analyzed in terms of their means, variances, and covariances.
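As a small illustration of this starting point, the following R sketch (using simulated, hypothetical trait values rather than real field data) summarizes the frequency distribution of a trait and its covariance with a second trait:

```r
# A hypothetical illustration: summarizing continuous variation in a trait.
set.seed(42)
yield  <- rnorm(200, mean = 55, sd = 6)    # simulated trait values, e.g. q/ha
height <- 0.8 * yield + rnorm(200, 0, 5)   # a second, correlated trait
hist(yield, breaks = 15, main = "Frequency distribution of yield")
mean(yield)                                # measure of central tendency
var(yield); sd(yield)                      # measures of spread
cov(yield, height); cor(yield, height)     # association between two traits
```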
2 History

Darwin's cousin, Francis Galton, pioneered the concept of regression by estimating the relationship between the heights of fathers and their sons (essentially heritability) as a means of predicting the response to selection [2]. Galton was indeed able to demonstrate that there must be a hereditary component to continuous variation; however, only a little progress was made in understanding the genetic implications of these statistical quantities. Pearson [3] used mathematics to study inheritance and published many articles during 1890–1900. After the rediscovery of Mendel's work in 1900, most scientists favoured discontinuous variation and neglected the theory of continuous variation. Galton's students (Pearson and others) were unwilling to accept the simple Mendelian ratios and argued that the particulate mode of variation had little relevance to their blending, or
continuous, type of inheritance. This became one of the great controversies in the history of biology, dividing scientists into two camps, and several new concepts of inheritance were worked out. Bateson and Punnett [4] pointed out that genes interact and that traits may be controlled by more than one gene. Yule [5] settled the controversy between the two groups, advising that there is no conflict between particulate inheritance and blending inheritance. He proposed that continuous variation is the outcome of many genes, each with a small effect on phenotypic development, and stated that these genes are transmitted in Mendelian fashion. Johannsen [6], while working on the Princess variety of common bean, proved that the phenotype is the result of an individual's genotype and environment, i.e., P = G + E. He showed that variation between purelines is genetic while variation within a pureline is environmental; thus selection between purelines is effective and selection within a line is not. Nilsson-Ehle [7] provided experimental evidence to support Yule's hypothesis, also known as the multiple-factor hypothesis. He crossed several white- and red-kernelled wheats and found different F2 ratios in different crosses. He concluded that a 3:1 ratio indicates a single-gene difference between the parents, 15:1 a two-gene difference, and 63:1 a three-gene difference. On closer study of a cross giving the 15:1 ratio, he found that the red seeds differed in the intensity of redness and could be regrouped into four shades of red plus white, in the ratio 1 (dark red):4 (medium-dark red):6 (medium red):4 (light red):1 (white). He was thus able to show that certain characters are governed by many genes with small and cumulative effects. The American geneticist East demonstrated in 1916 that polygenic characters were in perfect agreement with Mendelian segregation [8]. He selected true-breeding varieties of Nicotiana longiflora for an inheritance study. The parents differed strikingly in corolla length when grown under similar conditions. The F1 was intermediate between the two parents, and its variability was comparable to that of the parents, being due to the environment. The F2 generation was more variable than the parents and the F1, owing to segregation and recombination of Mendelian genes; the results observed in the parental, F1, and F2 generations thus confirmed Mendelian expectations. Finally, Fisher demonstrated how the biometrical findings could be interpreted in terms of Mendelian factors [9]. He showed that environmental effects can be partitioned from genetic effects statistically, and he divided genetic variance into additive (d), dominance (h), and epistatic (e) components (i.e., G = d + h + e), thereby laying the foundation of the classical quantitative genetics approach by bringing together the approaches of Galton and Mendel. Later, many new concepts were developed in the field of
quantitative genetics by several biometricians; the contributions of some of them are presented in Table 4.1. In a polygenic system, quantitative differences arise from the small effects of individual genes relative to the variation that arises from the environment. The quantitative (multiple-factor) inheritance hypothesis rests on the following points:
– Multiple genes may regulate a quantitative trait, each with a separate effect, large or small, but not necessarily equal.
– A major gene has a large influence, whereas a minor gene (polygene) has an effect too small to be observed separately under experimental conditions.
– The difference between major and minor genes is not absolute and depends on the experiment, as a major gene in one environment may behave as a minor gene in another.
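To make the multiple-factor expectation above concrete, the following R sketch (a hypothetical calculation, not from any software cited in this chapter) computes the expected F2 class frequencies for n unlinked loci with equal, cumulative effects; grouping the coloured classes against white reproduces Nilsson-Ehle's 3:1, 15:1, and 63:1 ratios:

```r
# Expected F2 frequencies by number of "red" alleles (0..2n), assuming n
# unlinked loci with equal additive effects: binomial coefficients C(2n, k).
f2_classes <- function(n) {
  k <- 0:(2 * n)
  setNames(choose(2 * n, k), paste0(k, "_alleles"))
}
f2_classes(2)   # 1 4 6 4 1 -> four shades of red plus white (1:4:6:4:1)
# Red vs. white: white plants carry zero red alleles, i.e. 1 plant in 4^n
sapply(1:3, function(n) 4^n - 1)   # 3, 15, 63 -> the 3:1, 15:1, 63:1 ratios
```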
3 Quantitative Genetics and Plant Breeding

The use of quantitative genetics varies greatly depending on the field to which it is applied. Plant breeders look to quantitative genetics for answers to many questions [34], such as:
– The selection of the most promising germplasm.
– The breeding methods to be used.
– The development and performance of hybrids/varieties.
Quantitative genetics approaches can conceivably answer all of these questions. Variability and selection are the two basic foundations of plant breeding. In plant breeding, biometrical techniques are used for (a) assessment of variability, (b) selection of promising lines, (c) inheritance studies from crosses of suitable parents, and (d) judgment of varietal adaptation.
3.1 Tools for Quantitative Genetics and Software Used by Plant Breeders
The prime objective of plant breeders is to identify and select the individuals with the best genotypic values from a large population, and expanding breeding programs across large populations has always been an effort-intensive job. Plant breeding has always been predictive in the sense that, in the equation P = G + E, the G component is predicted from the observed P. Since the environment plays a vital role in determining the phenotypic character of a genotype, selection based on phenotypic observation alone is accurate only when E tends to zero; moreover, separating the effects of genotype and environment by inspection is almost impossible. In such a scenario, quantitative genetics theory proves to be an effective tool for distinguishing genetic effects from environmental effects.
Table 4.1 Contribution of biometricians in the field of quantitative genetics (S. No.; biometrician; contribution; reference)

1. Wright (1921): Gave the term inbreeding coefficient in the manuscript "Systems of Mating"; developed the concept of path analysis [10]
2. Haldane (1924): Articles on the mathematical theory of natural and artificial selection [11]
3. Lush (1940): Defined heritability in the narrow sense, heritability in the broad sense, and the selection differential [12]
4. Mahalanobis (1928): Developed the concept of D2 statistics while studying Chinese head measurements; C. R. Rao (1952) used D2 statistics to study diversity in plants [13]
5. Sprague and Tatum (1942): Gave the idea of the combining ability of a parent while working on maize [14]
6. Malécot (1948): Defined the coefficient of parentage [15]
7. Mather (1949): Provided the scaling test and further divided the epistasis of Fisher's model as G = d + h + i + j + l, where i, j, and l are the additive × additive, additive × dominance, and dominance × dominance epistatic effects [16]
8. Comstock and Robinson (1948): Developed the concept of biparental mating; the three North Carolina designs are NCD1, NCD2, and NCD3 [17]
9. Cavalli (1952): Developed the concept of the joint scaling test [18]
10. Jinks and Hayman (1953): Developed the graphical approach for diallel cross analysis [19]
11. Gauch and Zobel (1988): Developed the numerical approach for diallel cross analysis [20]
12. Kempthorne (1957): Introduction to Genetic Statistics (publication); gave concepts like line × tester analysis, partial diallel cross analysis, and the restricted selection index [21]
13. Hanson and Johnson (1957): Developed the concept of the general selection index [22]
14. Anderson (1957): Developed the concept of metroglyph analysis [23]
15. Dewey and Lu (1959): First applied path coefficient analysis in plant selection while working on crested wheatgrass [24]
16. Hayman (1958): Proposed the six- and five-parameter models of generation mean analysis [25]
17. Jinks and Johnson (1958): Proposed the three-parameter model of generation mean analysis [26]
18. Falconer (1961): Book, Introduction to Quantitative Genetics [27]
19. Kearsey and Jinks (1968): Developed the concept of triple test cross analysis [28]
20. Rawlings and Cockerham (1962): Developed the concept of triallel and quadriallel cross analysis [29]
21. Finlay and Wilkinson (1963): Gave the first approach for stability analysis [30]
22. Eberhart and Russell (1966): Concept of a stability analysis model [31]
23. Freeman and Perkins (1971): Also gave a concept of a stability analysis model [32]
24. Elston and Stewart (1973): Concept of the mixed inheritance model, i.e., a single major gene plus small minor polygenes [33]
Computer simulation programs have made the analysis of quantitative traits easier for plant breeders. Collaboration with experts in artificial intelligence, operations research, and data science, and the incorporation and integration of simulation tools into plant breeding, are the need of the hour. Multiple software tools are available for the analysis of quantitative data; they are described under the following subheadings.

3.1.1 Assessment of Variability
Variability refers to the presence of differences among individuals within a population, which may be due to differences in genetic makeup or to environmental effects. Selection of genotypes depends on the magnitude of variability present in the breeding population, and variability in the gene pool of a crop species is the main factor responsible for success in crop improvement. Early selections were based on visual observation, but different modern biometrical methods are now available and deployed for the systematic assessment of genetic variability. Simple measures of variability such as the range, standard deviation, variance, standard error, coefficient of variation, and covariance are used by plant breeders to assess phenotypic variability. Analysis of variance (ANOVA) provides estimates of the components of variability (phenotypic variation, genotypic variation, environmental variation, heritability, genetic advance, etc.), whereas genetic variance components can also be estimated from the covariances between relatives [35]. The assessment of genotypic variation involves a mating design, i.e., crossing a number of genotypes in a definite fashion (diallel cross, partial diallel cross, line × tester, biparental cross, triple test cross, generation mean analysis, triallel cross, or quadriallel cross). The F1s thus obtained are
evaluated in replicated trials. All three components, viz., the additive, dominance, and epistatic components, can be estimated only from generation mean analysis, triallel crosses, and quadriallel crosses, whereas the other breeding designs provide information about additive and dominance variance only. Other biometric methods for assessing variability are metroglyph analysis, D2 statistics, and principal component analysis. For a detailed study, the authors suggest referring to the books "Statistical and Biometrical Techniques in Plant Breeding" by J. R. Sharma and "Biometrical Methods in Quantitative Genetic Analysis" by R. K. Singh and B. D. Chaudhary. Different statistical software packages are available these days to make the analysis easier and more accurate; a worked sketch of the basic variance-component calculations is given below, followed by the software most frequently used by breeders.
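As a hedged illustration of this partitioning (a minimal sketch with simulated data and plot-basis formulas; the packages listed below implement fuller versions), variance components, broad-sense heritability, and genetic advance can be derived from a randomized complete block ANOVA as follows:

```r
# Minimal sketch: variance components from an RCBD trial (simulated data).
set.seed(1)
dat <- expand.grid(genotype = factor(1:20), block = factor(1:3))
dat$yield <- 50 + rnorm(20, 0, 4)[dat$genotype] + rnorm(nrow(dat), 0, 2)

r   <- nlevels(dat$block)                 # number of replications
fit <- aov(yield ~ genotype + block, data = dat)
ms  <- summary(fit)[[1]][["Mean Sq"]]
msg <- ms[1]; mse <- ms[3]                # genotype and error mean squares

var_g <- (msg - mse) / r                  # genotypic variance
var_p <- var_g + mse                      # phenotypic variance (plot basis)
H2    <- var_g / var_p                    # broad-sense heritability
GCV   <- 100 * sqrt(var_g) / mean(dat$yield)
PCV   <- 100 * sqrt(var_p) / mean(dat$yield)
GA    <- 2.06 * H2 * sqrt(var_p)          # genetic advance at 5% selection
round(c(H2 = H2, GCV = GCV, PCV = PCV, GA = GA), 2)
```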
3.1.2 Software for Assessment of Variability

OpStat: A free online statistical data analysis tool developed at CCS Haryana Agricultural University, Hisar, India, by the computer programmer O. P. Sheoran; it can perform almost every basic analysis that a breeder requires. Analyses such as one-factor analysis involving one-way ANOVA, two-factor analysis involving two-way ANOVA, three-factor analysis, and principal component analysis can be performed easily by inputting the data in the prescribed format.
GenStat: Another popular statistical software package for summarizing and comparing data, building model relationships, and analyzing research trials. The software is loaded with features, including the analysis of experimental data and basic statistics, and can perform multivariate analysis, geostatistics, resampling methods, statistical distribution fitting, experimental design, and spatial analysis (VSN International).
IndoStat: Developed at a Hyderabad-based private firm by Murli Mohan Kehtan. The software is user-friendly, simple, and menu-driven. The agriculture module comprises software for plant breeding and genetics and advanced biometrics, and performs the wide range of evaluations that plant breeders demand. Data can be cut and pasted from an Excel spreadsheet, or imported and exported in the format provided. Notably, it covers the quantitative tools used by plant breeders, viz., variance-covariance matrices with ANOVA and ANCOVA; genetic parameters such as PCV%, GCV%, h2, and genetic advance; and D2 statistics with the Mahalanobis and canonical-roots methods. It also provides high-resolution graphics and attractive graphs (Mahalanobis D2 including dendrograms, 2D/3D plotting of canonical roots, cluster means, cluster distances, metroglyph plots, etc.) wherever needed, which can be pasted into PPT or other documents, with formatting options available.
PB Tools (Plant Breeding Tools): A plant breeding software package developed at IRRI. It is a free application with a graphical user interface (GUI), designed with the Eclipse Rich Client Platform (RCP) and the R programming language. ANOVA tables, variance components, heritability, and pairwise comparisons are among the outputs, which vary depending on the model. Modules for generating design layouts, breeding value prediction based on genotypic genetic relationships, and multi-trait selection indices will be added in the future.
STAR (Statistical Tool for Agricultural Research): Also developed at IRRI for crop scientists, using the Eclipse Rich Client Platform (RCP) and the R programming language, with a user-friendly GUI. Its current version includes modules for generating the randomization and layout of common experimental designs, data management, and basic statistical analysis, such as descriptive statistics, hypothesis testing, and ANOVA of designed experiments.
SPSS (Statistical Package for the Social Sciences): With a simple user interface and powerful computing, SPSS offers advanced statistical analysis. It provides a broad range of data analysis services, including descriptive statistics, bivariate statistics, ANOVA, means, correlation studies, and several nonparametric tests. Regression model prediction, cluster analysis, and factor analysis can also be worked out using this software. Hypothesis testing, MANOVA (multivariate analysis of variance), and t-tests, the routine analyses for breeders, are easily handled by SPSS.

3.2 Determination of Yield Components and Selection of Elite Genotypes
The extent of genetic variability present in the population and the heritability of the particular trait are the major factors on which plant breeders base selection. In plant breeding programs, selection is the most important activity, and it is more effective for highly heritable traits than for traits of low heritability; for the former, direct selection is possible. Most economically important traits, such as yield, are quantitative and low in heritability, so direct selection is not effective and it is desirable to select indirectly to improve yield. Yield is the end product and is the result of component traits. To identify the component traits, breeders usually use correlation, path analysis, and discriminant function analysis (selection index). Pearson introduced the correlation coefficient (r) in 1902; it measures the direction and degree of the relationship between two or more components [36]. Path analysis, on the other hand, uses standardized partial regression coefficients to divide the correlation coefficient into measures of direct and indirect effects, quantifying the direct and indirect contributions of each independent variable to the dependent variable. The discriminant function
refers to the statistical approach proposed by [37], which is used in the construction of the selection index. Here the selection index is constructed from characters associated with the dependent trait, say yield; based on character combinations, desirable genotypes are discriminated from undesirable ones. A small sketch of path analysis is given below, followed by the software used by breeders.
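The following R sketch (hypothetical simulated traits; not the output of any package named below) shows how the direct path effects are obtained as standardized partial regression coefficients, P = Rxx^-1 rxy, and how each correlation with yield decomposes into direct and indirect effects:

```r
# Path analysis sketch: correlations split into direct and indirect effects.
set.seed(7)
traits <- data.frame(tillers = rnorm(100), grains = rnorm(100),
                     seed_wt = rnorm(100))
yield  <- 0.5 * traits$tillers + 0.3 * traits$grains +
          0.2 * traits$seed_wt + rnorm(100, 0, 0.5)

Rxx <- cor(traits)              # correlations among component traits
rxy <- cor(traits, yield)       # correlations of components with yield
P   <- solve(Rxx, rxy)          # direct (path) effects
# Entry (i, j) = r_ij * P_j: indirect effect of trait i via trait j
# (diagonal entries are the direct effects); each row sums to r_iy.
decomp <- Rxx %*% diag(as.vector(P))
dimnames(decomp) <- list(rownames(P), paste0("via_", rownames(P)))
round(cbind(decomp, total_r = as.vector(rxy)), 3)
```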
3.2.1 Software for Assessment of Yield Components and Selection of Elite Genotypes

RIndSel: Selection based on a single trait may lead to ambiguous results, as a line showing excellent performance for one trait might not perform similarly for other related traits; it is therefore always desirable to know the net worth of an individual. With the concept of the selection index, multiple sources of information can be merged into a single value reflecting the net genetic merit of an individual, making selection effective. RIndSel is a graphical user interface that carries out exactly this task. The net worth of an individual can be worked out based on phenotype and/or genotype. Phenotypic and genetic covariances are estimated under the two experimental designs that RIndSel covers: randomized complete block design and alpha-lattice design. For available phenotypic values, RIndSel offers five methods, viz., the Smith selection index, the eigen selection index method (ESIM), the restrictive Kempthorne and Nordskog selection index, and the restrictive eigen selection index method (RESIM), with the economic weight for each trait predefined by the user. When molecular data are available, the Lande and Thompson method and the molecular ESIM method (MESIM) are used.
Indostat: Correlation and path analysis with partial R2 (coefficient of determination), including a path diagram, are the strengths of this software.
OpStat: This software also analyses correlation and path analysis, but no path diagram is generated in the output file.
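As background to the index methods listed above, here is a minimal sketch of the classical Smith index, b = P^-1 G a, using illustrative (not estimated) covariance matrices and weights:

```r
# Smith selection index sketch: illustrative covariances, two traits.
P <- matrix(c(1.0, 0.4,
              0.4, 1.5), 2, 2)   # phenotypic variance-covariance matrix
G <- matrix(c(0.6, 0.3,
              0.3, 0.9), 2, 2)   # genotypic variance-covariance matrix
a <- c(1, 2)                     # economic weights set by the breeder
b <- solve(P) %*% G %*% a        # index coefficients, b = P^-1 G a
x <- c(52, 18)                   # phenotypic records of one candidate
I <- sum(b * x)                  # index score used to rank candidates
b; I
```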
3.3 Choice of Suitable Parents and Breeding Procedures
One must take special care in selecting parents before hybridization. The selection of parents should always be made on genotypic value, since superior parents selected on phenotypic value alone may not yield good recombinants in segregating generations. Normally, combining ability effects reflect the genetic value of parents and hybrids; therefore, parents are chosen based on their combining ability. There are several biometrical techniques for evaluating genotypes in terms of their genetic value: top cross, polycross, line × tester, diallel cross, partial diallel cross, triallel, and quadriallel analysis. Breeding value refers to the genetic worth of an individual; it is measured from the mean performance of the offspring, expresses the value transmitted from parents to offspring, and is calculated from the additive genetic effect.
After the selection of the parents, the breeding procedure to be followed depends on the genetics of the quantitative traits. Hence, before initiating any breeding program it is of utmost importance to know the gene action involved in the expression of the various quantitative traits, which can be assessed through combining ability analysis. Combining ability refers to the ability of a genotype to transmit superior performance to its progenies. Combining ability analysis provides information on the nature and magnitude of the various types of gene action involved in the expression of quantitative characters; it also helps in evaluating inbreds in terms of their genetic value, selecting suitable parents for hybridization, and identifying superior cross combinations. There are two types of combining ability, viz., general combining ability (GCA) and specific combining ability (SCA). According to [14], GCA indicates the average performance of a line in a series of hybrid combinations, while SCA is the relative performance of a particular cross that departs from what would be expected based on the average performance of the lines involved. GCA is attributed to additive gene effects and SCA to non-additive effects (dominance and epistatic interactions). GCA and SCA can also be defined based on the covariances of half-sibs and full-sibs [21]. GCA and SCA can be estimated using different mating designs such as line × tester, diallel cross, partial diallel cross, triallel cross, quadriallel cross, polycross, top cross, biparental crosses (NCD-I, NCD-II, NCD-III), and triple test cross; a small worked sketch follows, and the software used for combining ability analysis is listed below.
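For intuition about what such software estimates, a hedged sketch using a hypothetical balanced table of line × tester cross means (a real analysis would also test these effects against the ANOVA error):

```r
# GCA/SCA from a balanced line x tester table of cross means (hypothetical).
X <- matrix(c(55, 60, 58,
              62, 59, 61,
              57, 63, 60), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("L", 1:3), paste0("T", 1:3)))
gm       <- mean(X)
gca_line <- rowMeans(X) - gm              # GCA effects of lines
gca_test <- colMeans(X) - gm              # GCA effects of testers
# SCA: deviation of each cross from what its parents' GCAs predict
sca <- X - outer(gca_line, rep(1, 3)) - outer(rep(1, 3), gca_test) - gm
round(gca_line, 2); round(gca_test, 2); round(sca, 2)
```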
3.3.1 Software for Selection of Suitable Parents

AGD-R (Analysis of Genetic Designs with R): A series of R programs, developed at CIMMYT, that compute statistical analyses for line × tester, diallel, and North Carolina designs. AGD-R comes with a graphical JAVA interface that allows the user to quickly select input files, run the analysis, and inspect the results.
SAS (Statistical Analysis Software): A powerful command-driven statistical tool for advanced analytics, multivariate analysis, and predictive analysis that presents results through a wide array of visualization graphs. The software encompasses a myriad of algorithms for advanced analysis with publication-quality graphics, and various computer programs beneficial to breeders have been built on it. The SASQuantis program estimates gene effects following Hayman's model, performs heritability studies, calculates genetic variances and predicted gain from selection using Wright's and Warner's models, and estimates the number of effective factors following Wright's, Mather's, and Lande's methods. SASHAYDIALL is another SAS program, for Hayman's diallel analysis.
PBTools: This software also analyzes diallel crosses (Griffing method), triple test crosses, mating designs, and generation mean analysis.
OPStat: Other facilities in this online software include generation mean analysis, diallel analysis, path analysis, line × tester analysis, and partial diallel analysis, which can all be carried out easily.
IndoStat: User-friendly software for diallel analysis (Griffing Methods 1 and 2, Models I and II), including Hayman's Vr-Wr approach, and line × tester analysis with heterosis tables and the relevant graphs. It also offers advanced biometric tools such as multi-location diallel analysis (Griffing Method 2, Daljeet Singh model) and multi-location line × tester analysis with heterosis tables and graphics. Other important analyses, such as generation mean analysis with the 5-, 6-, and 3-parameter models, the joint scaling test (Cavalli) and its testing, biparental mating designs, and triple test cross analysis, add to the appeal of this software.

3.4 Assessment of Stability of Genotypes
3.4.1 Models for Stability Analysis
Once lines or hybrids have been developed through these different procedures, they are subjected to multi-location yield trials to determine their performance in fluctuating environments. In multi-location yield trials, lines are evaluated for their stability, which refers to the suitability or performance of a crop over a large geographical area of production. For stable crop production over both regions and years, stability of genotypes in the face of fluctuating environments is a must. Stability analysis provides significant information for identifying adaptable genotypes and forecasting how different genotypes will respond to changing environments; similarly, it is useful for predicting the response of various types of gene action across environments. Comstock and Moll [38] classified environments into two kinds, viz., micro-environments and macro-environments: a micro-environment is the environment of a single plant, comprising everything that impinges on it, whereas a macro-environment is experienced collectively within a given area and period. Allard and Bradshaw [39] coined the terms predictable and unpredictable environmental variation to distinguish the environmental sources of variation that contribute to genotype × environment (G × E) interaction. Most breeders aim to develop genotypes with low G × E interaction, so that they give stable and high economic returns. Different models have been developed over time for the identification of stable, well-buffered genotypes:
1. Traditional models: (a) the stability factor model [40], (b) the ecovalence model of Wricke [41], and (c) the stability variance model of Shukla [42].
2. Regression coefficient models: (a) the Finlay and Wilkinson model (1963), (b) the Eberhart and Russell model (1966), (c) the Perkins and Jinks model (1968), and (d) the Freeman and Perkins model (1971). In earlier days, regression coefficient models were the ones mostly used by breeders, and the ranking of varieties for stability remains essentially the same across these models. Among them, the regression model of Eberhart and Russell is the most commonly used, as it is simple yet informative.
3. Principal component analysis: (a) the additive main effects and multiplicative interaction (AMMI) model [43, 44] and (b) the GGE biplot [45–47]. The AMMI and GGE biplot models combine features of the earlier models and are the ones mostly used by breeders these days. Williams [48] and Pike and Silverberg [49] originated the AMMI model, which integrates additive components for the main effects with multiplicative components for the genotype-environment interaction (G × E): a univariate technique (ANOVA) models the additive main effects of genotype and environment, and a multivariate technique (PCA) models the multiplicative G × E effect. Several researchers [50–52] have reported the efficiency of the AMMI model for G × E analysis. Yan [45] proposed GGE biplot analysis, which takes both the genotype and the G × E components into account. The only differences between these models are in the initial steps of the analysis, where GGE analyzes G plus G × E while AMMI separates G from G × E, and in the final steps, where the biplots for interpretation are built [53]. For a detailed study, the authors suggest referring to the book "Quantitative Genetics and Biometrical Techniques in Plant Breeding" by Gunasekaran M and Nadarajan N, and the research papers on the respective models [43–47, 52]. A minimal worked sketch of the Eberhart and Russell regression follows.
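This sketch uses simulated genotype × environment means; a full Eberhart and Russell analysis would also subtract the pooled error from the deviation mean square:

```r
# Eberhart-Russell sketch: regress each genotype on the environmental index.
set.seed(3)
Y <- matrix(rnorm(5 * 6, 50, 5), nrow = 5,
            dimnames = list(paste0("G", 1:5), paste0("E", 1:6)))
I <- colMeans(Y) - mean(Y)                 # environmental index I_j
stab <- t(apply(Y, 1, function(y) {
  fit <- lm(y ~ I)
  c(mean = mean(y),
    bi   = unname(coef(fit)[2]),           # regression coefficient b_i
    s2di = sum(residuals(fit)^2) / (length(y) - 2))  # deviation variance
}))
round(stab, 2)   # stable genotype: b_i near 1, small s2di, high mean
```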
3.4.2 Software for Stability Analysis

R environment: Undoubtedly one of the most preferred and sophisticated options available for statistical computing and graphical output. The platform is highly extensible, and with the integration of numerous packages almost any data analysis can be carried out. A package like agricolae is a must-have for plant breeders; for graphical presentation, ggplot2 offers almost infinite options; and the metan package is used for multi-environment data handling and stability analysis. R has effective data handling and storage facilities and can produce publication-quality graphics.
The use of R opens numerous possibilities, and with its easily extensible packages it can handle almost all analyses.
GEA-R: The superiority of a genotype is judged by its performance across a wide range of environments. With G × E interaction playing a major role in the exploitation of the genetic potential of genotypes, several methods have been developed, including AMMI and stability analysis. GEA-R helps solve this kind of problem and acts as a powerful tool for the breeder in selecting the most favourable genotypes performing well across all test environments.
GGEbiplot: A user-friendly software package with wide application in the biplot analysis of experimental data. It generates biplots under all possible centring and scaling models. In GGE biplot analysis, a multi-environment trial data set is used, taking cultivars as entries and environments as testers. The software provides interactive visualization of the biplot, and based on the graphical output researchers can rank genotype performance in any given environment and compare a set of genotypes across different environments. Grouping genotypes based on performance across environments, and grouping environments based on genotype performance, are further benefits of GGEbiplot.
Indostat: Stability analyses under models such as Eberhart and Russell, Freeman and Perkins, or Perkins and Jinks can be run, with good graphs. This software also includes Huehn's nonparametric stability analysis and AMMI models with biplots.
SAS (Statistical Analysis Software): For stability analysis across environments, STABSAS was developed. Another program, META, analyzes multi-environment trials and examines the stability of genotypes across several growth environments, while SASGENE analyzes the inheritance and linkage of qualitative traits.
PBTools: Modules for evaluating single- and multi-environment trials (METs) laid out in widely used experimental designs are included in this software. For METs, the regression-based stability parameter and Shukla's stability variance are derived, and the additive main effects and multiplicative interaction (AMMI) model can also be evaluated.
4 Selection of Quantitative Traits

Selection is an important activity for breeders, and the success of a breeding program depends on it. Breeders have to decide which individuals will be allowed to produce the next generation and how these selected individuals will be mated to each other to generate further genetic variability. The fundamental effect of selection is to change the gene frequencies in the population. The effect of
selection on quantitative characters can be measured in terms of changes in the genetic properties of a population, such as means, variances, and covariances. To anticipate the response to a single generation of selection, knowledge of the heritability (h2) is necessary. The selection response (R) is defined as the change in the mean over one generation, i.e., the difference between the mean phenotypic value of the offspring of selected parents and the mean of the initial population before selection. Thus R = h2S, where the selection differential (S) is defined as the difference between the mean of the selected individuals and the mean of the initial population before selection. This equation is commonly referred to as the breeders' equation [54] and is widely used in artificial and natural selection theory and experiments [55–58]. The breeders' equation, developed by Lush, can thus be used to predict the selection response, also known as the anticipated genetic gain. The selection differential can be estimated as S = i σph, where i is the intensity of selection and σph is the phenotypic standard deviation. The proportion of the population selected determines the intensity of selection, which can be obtained from tabulated values [52]. Efficiency, expressed as genetic gain per unit time, is evaluated by incorporating the number of years required for each cycle (L) as a denominator in the selection equation [59], giving R = i h2 σph / L. The denominator L defines the time required for one breeding cycle, comprising all of the recombination, evaluation, and selection time for a new set of crosses from the evaluated parents. Improving selection accuracy via h2 depends largely on the researcher's accuracy and expertise, and can be achieved by refining the quality of trials or increasing the number of replications to provide more screening power, which ultimately raises the genetic gain per unit time. Selection intensity i can be improved by reducing the number of selected individuals, by increasing the size of the selection population, or both. The cycle time L can be shortened by reducing the interval between making crosses, evaluating those crosses, and making subsequent crosses from the selected progenies (rapid generation advance) [60]. A numerical sketch of these gain calculations follows.
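The parameter values below are assumed and purely illustrative; for truncation selection on a normally distributed trait, the selection intensity i is the normal-density ordinate at the truncation point divided by the selected proportion p:

```r
# Breeders' equation per unit time: R = i * h2 * sigma_p / L.
sel_intensity <- function(p) dnorm(qnorm(1 - p)) / p
h2      <- 0.4      # assumed narrow-sense heritability
sigma_p <- 6.0      # assumed phenotypic standard deviation
p       <- 0.05     # proportion selected; gives i of about 2.06
L       <- 4        # assumed years per breeding cycle
R_cycle <- sel_intensity(p) * h2 * sigma_p   # expected gain per cycle
R_year  <- R_cycle / L                       # expected gain per year
round(c(i = sel_intensity(p), R_cycle = R_cycle, R_year = R_year), 3)
# Halving L (e.g. via rapid generation advance) doubles the gain per year.
```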
With the advent of cheaper and quicker next-generation sequencing (NGS) technologies, there has been an explosion in the availability of data in the scientific community, giving rise to several cost-effective marker technologies. Combined with the more sophisticated and accurate high-throughput phenotyping technology now available to judge genotypes, genotype-phenotype relationships can be studied with higher fidelity and resolution than ever before. These sequencing technologies, once costly, have become regular practice among researchers and a cost-effective, handy tool for breeders, and large plant populations are being sequenced to pinpoint particular genes, narrowing the confidence interval of a quantitative trait locus (QTL) and thereby acting as a powerful tool for modeling complex genotype-phenotype relationships at the whole-genome level. By integrating this genetic information into selection programs, breeders can make accurate selections by knowing the worth of agriculturally important QTLs and genes, their relative contributions, and hence the genetic value of each parent. Marker-assisted selection (MAS) identifies the major contributing genes and QTLs from prior studies, incorporates them into the selection program, and establishes the genetic worth of each line, making selection very effective. NGS technologies have also proven useful for identifying loci in diverse populations [61]. To reduce the cycle length L, approaches such as shuttle breeding and speed breeding have been introduced. Rapid genetic gains in different crops have become possible only through MAS and speed breeding schemes. Compared with conventional crop improvement approaches, combined speed breeding and MAS schemes have cut the period to 4–9 years per cycle, with enhanced genetic gain each year for every trait, and have been the best approach for traits with low heritability. Jighly et al. [62] showed that higher genetic gain per cycle can be achieved by running more speed breeding rounds, and that shortening the cycle time makes the increase more pronounced per year. Quantitative genetics is integrating with molecular genetics, and this expanding field needs more attention. Many techniques for the characterization and mapping of the polygenes causing quantitative variation are gaining momentum with plant breeders.
5 Molecular Quantitative Genetics and Mapping of Quantitative Trait Loci

Owing to advances in molecular biology, researchers have developed different approaches to dissect the genomic location and position of complex quantitative traits. Some of these approaches, and the software used for their analysis, are discussed briefly under the following subheadings.
5.1 QTL Mapping

Linkage (QTL, or biparental) mapping is a family-based mapping analysis that exploits the recombination and segregation in the progenies of biparental crosses; this determines both its mapping resolution and its allele richness. It identifies the loci or regions that co-segregate with the trait of interest in the population. The approach is robust and can identify rare alleles, but the limited number of recombination cycles restricts its resolution.
5.1.1 Approaches for QTL Mapping
Various approaches are available for QTL analysis. Broadly, they can be classified into two groups: (i) single-QTL mapping and (ii) multiple-QTL mapping. Within each of these mapping methods there are several approaches based on regression analysis, maximum likelihood parameter estimation, Bayesian models, etc.
1. Single-QTL mapping: This method detects a single QTL at a time without taking into consideration the other genes affecting the trait of interest. Since quantitative traits are obviously controlled by polygenes, QTL identified using this method are not very reliable. Single marker analysis and simple interval mapping are the two common methods in this group.
(a) Single marker analysis (SMA): The earliest method for QTL detection, in which every marker is tested separately for its association with the target trait. The method compares the phenotypic means of the different marker-genotype groups to detect a QTL near the site of the marker. Single marker analysis is simple to compute using common statistical software and includes statistical tests such as t-tests, analysis of variance (ANOVA), and linear regression. The most commonly used approach is linear regression, as the calculated coefficient of determination (R2) of a marker estimates the phenotypic variation explained by the QTL linked to that marker. The major advantage of this method is that it does not require a complete linkage map of the species, and the analysis can be done with basic statistical software packages. However, because a single marker is analyzed at a time, a true QTL may go undetected owing to recombination between the marker and the QTL as their distance increases, and the magnitude of the QTL effect may be underestimated. These problems can be minimized by using a large number of markers distributed uniformly throughout the genome (about every 15 cM) [63].
(b) Simple interval mapping (SIM): In comparison to SMA, interval mapping estimates the likely site of a QTL between adjacent linked markers on a linkage map. Lander and Botstein [64] developed the interval mapping procedure known as simple interval mapping. Its strength is that it can provide precise QTL estimates, since there is no confounding by the rate of recombination between QTL and marker, and it is statistically more powerful. It detects QTL at several locations,
SIM detects QTL at several locations, but as a one-dimensional search. It provides a logarithm of odds (LOD) score curve that allows the localization of significant QTL on a linkage map. Its disadvantage is that implementation requires more computation time than single-marker analysis.
2. Multiple QTL mapping (MQM): As quantitative traits are governed by a large number of QTLs, SMA is likely to yield biased results. MQM takes into consideration the effects of other genes and combines multiple regression analysis with SIM to avoid the biases of SMA. Hence, MQM provides many advantages, such as reduction of residual variation, increased power of QTL detection, and detection of QTL × QTL interactions.
(a) Composite interval mapping (CIM): It combines interval mapping with multiple regression analysis. First, CIM carries out SMA with a stepwise forward regression method. The approach selects the marker with the highest LOD score first and then adds the one with the second-highest LOD score. This accounts for its precision, as markers are re-evaluated for their significance before the next-highest LOD score is added.
(b) Inclusive composite interval mapping (ICIM): It is based on a linear regression model built on the marker information, and it modifies two properties of the CIM algorithm. First, the number of markers used is larger than the number of QTLs; second, markers are selected by stepwise regression. The advantages of ICIM are its accuracy in detecting true-positive QTLs, a smaller number of false positives than CIM, and co-factor selection that does not suffer from randomness. ICIM detects QTL in the genomic regions with the highest LOD scores; the markers with significant regression coefficients provide the estimates of the background effects.
(c) Multiple interval mapping (MIM): It detects QTL in multiple marker intervals simultaneously and avoids the complicated procedure used in CIM for the selection of background markers.
(d) Bayesian multiple QTL mapping: Bayesian QTL mapping detects multiple QTLs. It works on MCMC (Markov Chain Monte Carlo) sampling and treats every QTL as a random variable. For bi-parental mapping populations, the Bayesian method offers an advantage particularly where high-density linkage maps are available, and it handles missing genotypes well.
For complex crosses, software packages such as QTL Express and MCQTL are used; here the significance of genetic effects can be tested using Wilcoxon rank-sum statistics.
5.1.2 Software Used in QTL Mapping
MapMaker/QTL: This was the first software to become widely available [65]. It uses non-parametric methods and interval mapping based on linkage maps. It runs on most computers, is command-driven, and generally lacks a graphical user interface; output can be saved as PostScript files.
PLABQTL: Developed by Utz and Melchinger [66] in 1996, it is mostly applicable to top-cross progenies. It uses multiple regression procedures for SIM and CIM. The software can calculate and compare various LOD curves simultaneously for fitting different model assumptions. It also analyses QTL × E interactions and handles missing marker data.
QTL Cartographer: This software has a powerful graphical interface and is compatible with all versions of Windows. It performs SMA, SIM, CIM, and MIM along with epistasis analysis, Bayesian interval mapping, multiple trait analysis, and multiple trait MIM analysis. It readily maps different traits as well, and is available at http://statgen.ncsu.edu/qtlcart/WQTLCart.htm.
Map Manager QT/QTX: It offers a better graphical interface and is available at http://mapmanager.org/mmQTX.htm. It detects QTLs by various fast regression methods, including SMA, SIM, and CIM; LOD thresholds are calculated by permutation.
R/qtl: It works under the R environment, which is freely available at http://www.r-project.org/, and operates on nearly every platform. Its updated versions perform SMA, SIM, regression interval mapping, CIM, and MQM. R/qtl imputes missing data and identifies QTL and QTL hotspots; in addition, cis-trans relationships and QTL interactions can be visualized with the program (a minimal usage sketch is given at the end of this section).
R/qtlbim: This package carries out Bayesian interval mapping [67] and is available at http://qtlbim.org. It maps multiple interacting QTLs and can handle continuous and binary traits.
QTL Express: The first software package in which QTL analysis could be performed on outbred populations. It implements a multiple regression approach; the inputs are a marker linkage map, phenotype data, and genotype data.
FlexQTL: This software is based on Bayesian theory, generally implemented through Markov Chain Monte Carlo simulation, and is used for QTL mapping. It can calculate probabilities of genes being identical by descent, provided marker data and pedigree are available.
QGene: QGene is written in Java with a GUI and can perform several QTL mapping methods. It displays superimposed QTL profiles across many chromosomes and works for a large number of traits. Its disadvantages are that it cannot estimate QTL × QTL interactions, handles only genetic covariates, and cannot take into account data from multiple environments.
Epistat: It is a DOS-based interactive program used specifically for the analysis of epistatic interactions. A related program, QTL Café, runs in any Java-enabled web browser.
QTL IciMapping: QTL IciMapping is user-friendly, open-source software that efficiently prepares marker linkage maps and may be used to create high-density linkage maps and map quantitative trait loci in biparental populations [68]. Microsoft .NET Framework 2.0 (x86)/3.0/3.5 is required to execute the program on Windows XP/Vista/7/8. Statistically, the ideal QTL mapping approach should have high detection power and a low false discovery rate, which helps explain why this QTL mapping software is considered user-friendly by most researchers [69].
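To make the R/qtl entry above concrete, the following is a minimal sketch, not an authoritative protocol, of SIM and CIM with R/qtl; it uses the package's built-in hyper backcross dataset so that no external files are assumed.

library(qtl)  # R/qtl; install.packages("qtl") if the package is absent

data(hyper)                              # built-in example backcross (mouse blood pressure)
hyper <- calc.genoprob(hyper, step = 1)  # genotype probabilities on a 1-cM grid

out.sim <- scanone(hyper, method = "em")  # simple interval mapping (EM algorithm)
out.cim <- cim(hyper, n.marcovar = 3)     # composite interval mapping with 3 cofactors

perm <- scanone(hyper, method = "em", n.perm = 1000)  # permutation LOD threshold
summary(out.sim, perms = perm, alpha = 0.05)          # peaks above the 5% genome-wide level

plot(out.sim, out.cim, col = c("blue", "red"))        # overlaid LOD curves for comparison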
5.2 Genome-Wide Association Studies (GWAS)
Recent advances in next-generation sequencing technologies have expedited the cost-effective discovery of single-nucleotide polymorphisms (SNPs) and made them the marker of choice. As SNPs are present at high density in the genome, they have been extensively used to identify the small haplotype blocks that correlate with quantitative trait variation. Traditional bi-parental mapping offered low resolution, mainly because of the limited number of recombination events possible in bi-parental mapping populations. To overcome this limitation, the genome-wide association study (GWAS) approach was developed to identify causal genes in genetically diverse populations, thus providing higher resolution, often to the gene level. The approach has extended its arena to human, animal, and plant studies for dissecting the genetic basis of variation, including diseases in humans and various physiological and agronomic traits in plants [70]. Klein et al. [71] performed GWAS for the first time in humans and found a variant of the Complement Factor H gene strongly associated with age-related macular degeneration. Later on, it was used to find trait associations in various model crops, as summarized in Table 4.2. GWAS is carried out on large natural populations or germplasm sets that have accumulated historical recombination events over time; in more precise terms, it exploits historical linkage disequilibrium (LD), built up over dozens or hundreds of generations, that persists among the representative accessions. The rapid decay of this LD is what improves the resolution of association analysis.
Table 4.2 List of model crops studied

Model crop    Studies performed
A. thaliana   ACCELERATED CELL DEATH6 (ACD6) for leaf necrosis
Maize         An outcrossing plant, with an LD that decays at approximately 2000 bp on average (a distance fivefold shorter than that in A. thaliana)
Rice          Heading date, grain size, and starch quality
The effectiveness of GWAS depends on several factors: phenotypic variation, population size, population structure, allele frequency, and linkage disequilibrium.
5.2.1 Workflow of GWAS
GWAS starts with a diverse association panel, which is phenotyped in multiple locations for the trait of interest and then genotyped with SNP markers (Fig. 4.1). The genotypic and phenotypic data are analyzed using different software to find marker-trait associations. The significance of associations is tested against thresholds such as the false discovery rate (FDR) and the Bonferroni correction. A minimal sketch of such an analysis in R follows.
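The sketch below uses the GWAS() function of the R package rrBLUP, one of several possible tools for this workflow; the objects pheno.df and geno.df, and the trait name 'yield', are hypothetical placeholders in the layout that rrBLUP expects.

library(rrBLUP)

# pheno.df: column 1 = line ID, later columns = trait values (here 'yield')
# geno.df : columns = marker, chrom, pos, then one {-1,0,1} score column per line
scores <- GWAS(pheno.df, geno.df,
               n.PC    = 3,     # principal components to model population structure (Q)
               min.MAF = 0.05,  # drop markers with minor allele frequency below 5%
               P3D     = TRUE,  # EMMAX-style: variance components estimated once
               plot    = TRUE)  # draws Manhattan and Q-Q plots

# 'scores' carries -log10(p) per marker; apply a Bonferroni threshold
threshold <- -log10(0.05 / nrow(geno.df))
sig.mta <- scores[scores$yield > threshold, ]  # significant marker-trait associations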
5.2.2 Software Used for GWAS Analysis
TASSEL (Trait Analysis by Association, Evolution, and Linkage): A widely and persistently used package for conducting GWAS. It implements several powerful statistical approaches, such as GLM, logistic regression, and MLM, and additionally performs PCA and kinship estimation to analyze population structure. The new version (TASSEL 5.0) performs sequence analysis along with SNP calling for GBS data. Results are visualized as PCA scatter plots, genetic-distance heat maps, Manhattan plots for GWAS results, and LD plots [73]. The general linear model (GLM), as a function of the Q method, was incorporated in TASSEL for structure analysis. This model supports F-tests and permutations; because significance derived from the F-test assumes that the analyzed trait has normally distributed residuals, TASSEL offers two options: the first forms transformation functions that produce roughly normal error terms, while the second uses a permutation test to generate P-values that are not distribution dependent [72]. As Q + K is considered a powerful combination, it was implemented as a mixed linear model. TASSEL is free software and can be downloaded from http://www.maizegenetics.net/tassel.
EMMAX (Efficient Mixed-Model Association eXpedited): A fast computational tool used for association analysis that keeps population structure in consideration. EMMAX works on a variance-component algorithm and can analyze GWAS datasets within a matter of hours.
Fig. 4.1 The flow diagram depicts the overall working of GWAS, which is divided into three stages: (1) phenotyping, (2) genotyping, and (3) results as significant marker-trait associations
EMMAX is also based on mixed models and hence uses a kinship matrix to control for population structure. A method even more efficient than EMMAX is GEMMA (genome-wide efficient mixed-model analysis), which uses spectral decomposition to build the likelihood function.
GEMMA is therefore much easier to evaluate and can handle large samples and large numbers of markers [74]. These mixed-model variants are more efficient because they overcome the computational-speed bottleneck; EMMA itself can also be used for inbred populations.
FaST-LMM (factored spectrally transformed linear mixed model): This method is more efficient still, reducing computation time to a few hours. Earlier LMM/MLM methods were preferred to account for cryptic relatedness, but their major drawbacks were computational cost and statistical power. A further concern, raised during a cattle study, was that some flanking SNPs, when incorporated into the kinship matrix, compete with the main effects (fitted as fixed effects) and thus lower the power of causal-locus detection; this phenomenon was called "proximal contamination" [75]. To overcome this, Lippert and co-workers proposed FaST-LMM. It captures polygenic background effects by working with a subset of markers chosen beforehand [76]. Missing data can easily be imputed for any (genetic or non-genetic) covariate, so the approach can implement stepwise conditional analyses [77].
PLINK: A widely used toolset for GWAS and population-based linkage analyses that can handle large datasets of phenotypes and genotypes. The package has diverse functions, namely population stratification detection, summary statistics for quality control, association analysis, identity-by-descent estimation, and data management. It is free software that can be downloaded from http://zzz.bwh.harvard.edu/plink/. The package was developed by Purcell et al. [78] at the Center for Human Genetic Research. Results are presented graphically as Manhattan plots and Q-Q plots, with access to multidimensional scaling (for population structure).
GenABEL: The GenABEL package is implemented as an R library and benefits from R's graphical interface; results are easily visualized, and the package offers efficient data storage and handling together with robust procedures for quality control of genetic data. It provides an interface to standard and specialized R data types and functions. The package is available at http://cran.r-project.org and was developed under the GenABEL project, which comprises other packages as well.
GAPIT: With advances in the free R environment, various packages are now available. GAPIT, the widely used Genomic Association and Prediction Integrated Tool, was first released by Lipka and colleagues in 2012. It implemented the mixed linear model (MLM), general linear model (GLM), compressed MLM, and genomic best linear unbiased prediction (gBLUP). In 2016 the second version was released, adding compressed MLM and Settlement of mixed linear models Under Progressively Exclusive Relationship
(SUPER); the only drawback was that these remained single-locus tests. More recently, version 3.0 of GAPIT was released from Zhiwu Zhang's lab, comprising three multi-locus test methods: the Multiple Loci Mixed Model (MLMM), Fixed and random model Circulating Probability Unification (FarmCPU), and Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) [79]. The major advantage of GAPIT is that it can handle large amounts of data (SNPs and genotypes) while reducing computational time without compromising statistical power [80].
FarmCPU: The approach works by substituting kinship with a set of markers associated with the causal genes. To test markers one at a time across the genome, they are fitted as fixed effects in a fixed-effect model. The maximum likelihood method is used to optimize the set of markers in the MLM and to avoid overfitting; the variance-covariance structure is defined by the associated markers.
SPAGeDi: Spatial Pattern Analysis of Genetic Diversity software uses data from codominant genetic markers and estimates genetic distances between populations or relatedness coefficients between individuals [81]. The package performs kinship analysis, generates the K matrix, and can incorporate spatial coordinates.
Commercial packages include the following.
Genstat: This general statistics software combines the features needed to carry out GWAS, including population structure analysis with PCA and kinship. Along with the value of a marker for the traits, it provides G × E interaction analysis. It is flexible software developed by VSNi, with a user-friendly interface and a powerful programming language.
JMP Genomics: One of the more robust tools for analyzing omics data, namely expression analysis, association studies, and predictive modeling. It uses a relationship matrix, computed from marker data rather than pedigree information (Kinship Matrix tool), to measure relatedness between individuals; it can compute identity-by-descent, identity-by-state, or allele-sharing similarity. To analyze population structure, the Q matrix is obtained by principal component analysis (PCA) or multidimensional scaling (MDS), both of which are data-reduction methods. A sketch of how the K and Q inputs can be computed in R is given below.
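For readers working in R, the following sketch shows one common way to construct the K (kinship) and Q (structure) inputs that the mixed-model packages above consume; the marker matrix M (lines in rows, markers in columns, coded -1/0/1) is a hypothetical placeholder.

library(rrBLUP)

K <- A.mat(M)     # VanRaden-type genomic (realized) relationship matrix from markers
pca <- prcomp(M)  # principal component analysis of the marker matrix
Q <- pca$x[, 1:3] # first principal components serve as structure covariates (Q matrix)

heatmap(K)        # quick visual check of family clusters within the panel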
6 Genomic Selection (GS)
One of the foremost drawbacks of MAS is that it cannot identify small-effect QTLs. Genomic selection (GS) was therefore proposed to rectify this. The scheme utilizes information from genome-wide markers: molecular markers spread throughout the genome can be used to predict the phenotypic value of a particular
individual based on all genotypic effects, as reported by Meuwissen et al. [82]; this is referred to as genomic selection (GS). The foundation of this approach lies in the hypothesis that quantitative characters are profoundly polygenic. Accordingly, their variation can be captured by modeling all genome-wide markers, thereby calculating the genomic estimated breeding value (GEBV) and enabling the identification of superior genotypes based on genomic estimated breeding values or genetic values (GEGV). Reducing the breeding cycle, and thereby increasing selection gain per unit of time, is the cardinal feature of this approach. A prerequisite for GS is a high correlation between the GEBVs and the actual trait values, for which various prediction subsets must be evaluated [83, 84].
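In symbols (notation introduced here for clarity, not taken from the original text), the GEBV of line i is the sum of its marker scores weighted by the estimated marker effects:

% x_{ij}: score of line i at marker j; \hat{\beta}_j: estimated effect of marker j
\mathrm{GEBV}_i = \sum_{j=1}^{m} x_{ij}\,\hat{\beta}_j,
\qquad \text{with } \hat{\boldsymbol{\beta}} \text{ estimated from the training model }
\mathbf{y} = \mu\mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \mathbf{e}.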
6.1 Workflow of Genomic Selection (GS)
A separate set of individuals, the training set, should be representative of the breeding population (the testing set), and the two sets should be related in order to enhance prediction accuracy (Fig. 4.2). The training set should be fairly large, to avoid low accuracy and to capture the trait variance associated with markers. The training population is evaluated in multiple locations and genotyped as well; both sets of data (phenotypic and genotypic) are used to build a training model.
Fig. 4.2 General scheme of genomic selection
For the breeding population, only genotyping is performed; the marker information is then used to estimate breeding values, the GEBVs, which become the index for direct selection of lines [85].
6.2 Models Used for Computing GEBVs
Information from markers (predictors) is used by various prediction models to estimate marker effects, and several such models have been developed.
1. Stepwise regression: It fits a large number of markers into the model, but retains only those showing a significant effect. The major constraints of this approach are its limited GEBV accuracy and that it detects only a limited number of QTLs, as the significance thresholds are less stringent [86].
2. Ridge regression: This method was used by Meuwissen et al. [82] for calculating BLUPs (best linear unbiased predictor estimates) for markers. In this model, all markers are treated as random effects following a normal distribution. The approach is superior to stepwise regression because it avoids the bias introduced by selecting only significant markers.
3. Bayesian approaches: These can estimate the variance separately for each marker. There are two common Bayesian models, Bayes A and Bayes B. The former can detect large-effect QTLs, whereas the latter retains superior GEBV prediction accuracy over the earlier approaches even as marker density increases; it can thus be considered the better state-of-the-art approach. A minimal ridge-regression sketch in R follows.
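The sketch below illustrates model 2 (ridge regression BLUP) with the R package rrBLUP; y.train (training phenotypes) and M.train/M.breed (training and breeding-population marker matrices coded -1/0/1) are hypothetical placeholders.

library(rrBLUP)

fit <- mixed.solve(y.train, Z = M.train)  # model: y = mu + Z u + e, u ~ N(0, I * Vu)

u.hat <- fit$u                                               # BLUPs of the marker effects
gebv  <- as.vector(M.breed %*% u.hat) + as.vector(fit$beta)  # GEBVs of untested candidates

best <- order(gebv, decreasing = TRUE)[1:20]  # select, e.g., the top 20 lines by GEBV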
6.3 Software Used for Genomic Selection
Ridge regression BLUP (RR-BLUP): This method was examined in [82, 87]. It is efficient at prediction with unreplicated training data because it fits a single variance component along with the residual error, and it provides a fast maximum-likelihood algorithm for mixed models.
Genomic best linear unbiased prediction (gBLUP): The method uses genomic relationships to estimate the genetic worth (merit) of an individual or genotype. Marker information is used to determine the relationship matrix, which in turn supplies the covariances. In 2008, Harris and co-workers applied this method to check the reliability of genomic prediction in 4500 dairy cattle; the reliability was 16–33%, which was higher than that of breeding values based on parent-average information for traits representing milk production. Generally, BLUP has delivered a fair amount of genetic gain in cattle. In traditional BLUP, covariances were estimated from pedigree information; with recent advances, DNA information is used to generate the so-called general relationship matrix (GRM), which acts as a substitute for the numerator relationship matrix. The method was developed by VanRaden [88]. gBLUP is usually praised for three main features that make it more efficient.
First and foremost is its accuracy; second is its computational efficiency; and third, gBLUP information can be incorporated with pedigree information in a single-step method [89].
Adaptive Multi-BLUP (AM-BLUP): AM-BLUP is an impressive method that identifies and weights genomic regions with large effects on complex traits. It was proposed by Speed and Balding [90] and is considered more efficient than many Bayesian methods. The method uses a general relationship matrix along with SNPs to identify regions associated with the trait of interest; higher prediction accuracy is its core feature. A gBLUP sketch in R is given below.
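The following gBLUP sketch uses rrBLUP::kin.blup; the data frame dat (columns 'line' and 'yield', one record per line) and the marker matrix M are hypothetical, and line IDs must match between dat and the relationship matrix.

library(rrBLUP)

K <- A.mat(M)                           # genomic relationship matrix (GRM) from markers
rownames(K) <- colnames(K) <- dat$line  # IDs in K must match the 'line' column of dat

ans <- kin.blup(data = dat, geno = "line", pheno = "yield", K = K)
head(sort(ans$g, decreasing = TRUE))    # lines with the highest genomic breeding values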
7 Conclusion
Most economically important traits are complex in nature, and dissecting complex traits and integrating them into crop improvement is a main objective of plant breeders. The genetics of complex traits is not governed by the action of a few major genes; rather, cassettes of genes regulate their expression. Many biometricians have made landmark contributions to deciphering the mechanisms of quantitative genetics, and such contributions continue to date. This chapter mainly emphasized the tools used conventionally by breeders to study the genetics of economically important complex traits. The main limitations of conventional breeding in decoding complex traits are accuracy, duration, and effectiveness. With the advent of high-throughput next-generation sequencing, understanding the genetics of complex traits has become easy, accurate, and precise within less time, and as a result the time needed per unit of genetic gain has been shortened. To overcome the limitations of the conventional approach, the chapter also covered the different molecular mapping approaches, i.e., family mapping and population mapping, that breeders are using to decode complex traits, integrate them into crops, and release varieties in fast-track mode. There are many landmark examples in crops where conventional and molecular quantitative genetics have been integrated to develop crop varieties that are cultivated on a large scale by farmers and used by end-users as products.
References
1. Haas M, Schreiber M, Mascher M (2019) Domestication and crop evolution of wheat and barley: genes, genomics, and future directions. J Integr Plant Biol 61(3):204–225 2. Galton F (1889) Natural inheritance. Macmillan & Co
3. Pearson K (1894) Contributions to the mathematical theory of evolution. Philos Trans R Soc London A 185:71–110 4. Bateson W, Punnett RC (1905) 1908. Experimental studies in the physiology of heredity. In: Peters JA (ed) Classic papers in genetics. Prentice-Hall, Englewood Cliffs, NJ, pp 42–59
5. Yule GU (1906) On the theory of inheritance of quantitative compound characters on the basis of Mendel's laws-a preliminary note. In: Rep 3rd Int Conf Genetics, pp 140, 142 6. Johannsen W (1903) Ueber Erblichkeit in Populationen und in reinen Linien: ein Beitrag zur Beleuchtung schwebender Selektionsfragen. G. Fischer 7. Nilsson-Ehle H (1909) Kreuzungsuntersuchungen an Hafer und Weizen, vol 5(2). H. Ohlssons Buchdruckerei 8. East EM (1916) Studies on size inheritance in Nicotiana. Genetics 1(2):164–176 9. Fisher RA (1919) XV.—The correlation between relatives on the supposition of Mendelian inheritance. Earth Environ Sci Trans R Soc Edinb 52(2):399–433 10. Wright S (1921) Systems of mating. I. The biometric relations between parent and offspring. Genetics 6(2):111–123 11. Haldane JBS (1924) A mathematical theory of natural and artificial selection. Part II: the influence of partial self-fertilisation, inbreeding, assortative mating, and selective fertilisation on the composition of Mendelian populations, and on natural selection. Biol Rev 1(3):158–163 12. Lush JL (1940) Intra-sire correlations or regressions of offspring on dam as a method of estimating heritability of characteristics. J Anim Sci 1940(1):293–301 13. Mahalanobis PC (1928) Statistical study of the Chinese head. Proceedings of the Indian Science Congress (Calcutta) 14. Sprague GF, Tatum LA (1942) General vs. specific combining ability in single crosses of corn. J Am Soc Agron 34:923–932 15. Malécot G (1948) Les Mathématiques de l'Hérédité. Masson, Paris (translated as The Mathematics of Heredity) 16. Mather K (1949) Biometrical genetics. Methuen and Co. Ltd., London, p 162 17. Comstock RE, Robinson HF (1948) The components of genetic variance in populations of biparental progenies and their use in estimating the average degree of dominance. Biometrics 4:254–266 18. Cavalli LL (1952) An analysis of linkage in quantitative inheritance 19. Jinks JL, Hayman BI (1953) The analysis of diallel cross. Maize Genetics News Letter 27:48–54 20. Gauch HG, Zobel RW (1988) Predictive and postdictive success of statistical analyses of yield trials. Theor Appl Genet 76(1):1–10
21. Kempthorne O (1957) An introduction to genetic statistics. Wiley/Chapman and Hall, New York/London 22. Hanson WD, Johnson HW (1957) Methods for calculating and evaluating a general selection index obtained by pooling information from two or more experiments. Genetics 42(4):421–432 23. Anderson E (1957) A semigraphical method for the analysis of complex problems. Proc Natl Acad Sci U S A 43(10):923–927 24. Dewey DR, Lu K (1959) A correlation and path-coefficient analysis of components of crested wheatgrass seed production. Agron J 51(9):515–518 25. Hayman BI (1958) The separation of epistatic from additive and dominance variation in generation means. Heredity 12:371–390 26. Jinks JL, Jones RM (1958) Estimation of the components of heterosis. Genetics 43(2):223–234 27. Falconer DS (1961) Introduction to quantitative genetics. Pearson Education India 28. Kearsey MJ, Jinks JL (1968) A general method of detecting additive, dominance and epistatic variation for metrical traits I. Theory. Heredity 23(3):403–409 29. Rawlings JO, Cockerham CC (1962) Triallel analysis. Crop Sci 2(3):228–231 30. Finlay KW, Wilkinson GN (1963) The analysis of adaptation in a plant-breeding programme. Aust J Agric Res 14(6):742–754 31. Eberhart SA, Russell WA (1966) Stability parameters for comparing varieties. Crop Sci 6(1):36–40 32. Freeman GH, Perkins JM (1971) Environmental and genotype-environmental components of variability VIII. Relations between genotypes grown in different environments and measures of these environments. Heredity 27(1):15–23 33. Elston RC, Stewart J (1973) The analysis of quantitative traits for simple genetic models from parental, F1 and backcross data. Genetics 73(4):695–711 34. Dudley JW, Moll RH (1969) Interpretation and use of estimates of heritability and genetic variances in plant breeding. Crop Sci 9(3):257–262 35. Cockerham CC (1963) Estimation of genetic variances. Statistical genetics and plant breeding. NAS-NRC 982:53–94 36. Pearson K (1902) On the fundamental conceptions of biology. Biometrika 1(3):320–344 37. Smith HF (1936) A discriminant function for plant selection. Ann Eugenics 7(3):240–250
38. Comstock RE, Moll RH (1963) Genotype-environment interactions. Statistical genetics and plant breeding (No. REP-1173, CIMMYT) 39. Allard RW, Bradshaw AD (1964) Implications of genotype × environmental interactions in applied plant breeding. Crop Sci 4(5):503–508 40. Lewis D (1954) Gene-environment interaction: a relationship between dominance, heterosis, phenotypic stability and variability. Heredity 8(3):333–356 41. Wricke G (1964) Zur Berechnung der Ökovalenz bei Sommerweizen und Hafer. Z Pflanzenzüchtung 52(2):127 42. Shukla GK (1972) Some statistical aspects of partitioning genotype environmental components of variability. Heredity 29(2):237–245 43. Gauch HG (1992) Statistical analysis of regional yield trials: AMMI analysis of factorial designs. Elsevier Science Publishers 44. Freeman GH (1990) Modern statistical methods for analyzing genotype–environment interactions. In: Kang MS (ed) Genotype × environment interaction and plant breeding. Louisiana State University Agricultural Center, Baton Rouge, LA, pp 118–125 45. Yan W (2001) GGEbiplot—a Windows application for graphical analysis of multienvironment trial data and other types of two-way data. Agron J 93(5):1111–1118 46. Yan W, Kang MS (2002) GGE biplot analysis: a graphical tool for breeders, geneticists, and agronomists. CRC Press 47. Yan W, Tinker NA (2006) Biplot analysis of multi-environment trial data: principles and applications. Can J Plant Sci 86(3):623–645 48. Williams EJ (1952) The interpretation of interactions in factorial experiments. Biometrika 39:65–81 49. Pike EW, Silverberg TR (1952) Designing mechanical computers. Mach Des 24:131–137 50. Crossa J (1990) Statistical analyses of multilocation trials. In: Advances in agronomy, vol 44. Academic, pp 55–85 51. Annicchiarico P (1997) Additive main effects and multiplicative interaction (AMMI) analysis of genotype-location interaction in variety trials repeated over years. Theor Appl Genet 94(8):1072–1077 52. Gauch HG (2006) Statistical analysis of yield trials by AMMI and GGE. Crop Sci 46(4):1488–1500 53. Neisse AC, Kirch JL, Hongyu K (2018) AMMI and GGE biplot for genotype × environment interaction: a medoid-based hierarchical
cluster analysis approach for high-dimensional data. Biom Lett 55(2):97–121 54. Lush JL (1943) Animal breeding plans, 2nd edn 55. Lande R (1976) Natural selection and random genetic drift in phenotypic evolution. Evolution 30:314–334 56. Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics. Pearson Education India 57. Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits 58. Heywood JS (2005) An exact form of the breeder's equation for the evolution of a quantitative trait under natural selection. Evolution 59(11):2287–2298 59. Eberhart SA (1970) Factors affecting efficiencies of breeding methods. Afr Soils 15:669–680 60. Covarrubias-Pazaran G, Martini JW, Quinn M, Atlin G (2021) Strengthening public breeding pipelines by emphasizing quantitative genetics principles and open source data management. Front Plant Sci 12 61. Varshney RK, Terauchi R, McCouch SR (2014) Harvesting the promising fruits of genomics: applying genome sequencing technologies to crop breeding. PLoS Biol 12(6):e1001883 62. Jighly A, Lin Z, Pembleton LW, Cogan NO, Spangenberg GC, Hayes BJ, Daetwyler HD (2019) Boosting genetic gain in allogamous crops via speed breeding and genomic selection. Front Plant Sci 10:1364 63. Tanksley SD (1993) Mapping polygenes. Annu Rev Genet 27(1):205–233 64. Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121(1):185–199 65. Lincoln SE, Daly MJ, Lander ES (1993) Constructing genetic linkage maps with MAPMAKER/EXP version 3.0: a tutorial and reference manual. Whitehead Institute for Biomedical Research technical report, 3 66. Utz HF, Melchinger AE (1996) PLABQTL: a program for composite interval mapping of QTL. J Quant Trait Loci 2(1):1–5 67. Yandell BS, Mehta T, Banerjee S, Shriner D, Venkataraman R, Moon JY et al (2007) R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 23(5):641–643 68. Meng L, Li H, Zhang L, Wang J (2015) QTL IciMapping: integrated software for genetic linkage map construction and quantitative
trait locus mapping in biparental populations. Crop J 3:269–283 69. Ranjan R, Yadav R, Jain N, Sinha N, Bainsla NK, Gaikwad KB, Kumar M (2021) Epistatic QTLs play a major role in nitrogen use efficiency and its component traits in Indian spring wheat. Agriculture 11(11):1149 70. Chen W, Wang W, Peng M, Gong L, Gao Y, Wan J et al (2016) Comparative and parallel genome-wide association studies for metabolic and agronomic traits in cereals. Nat Commun 2016(7):12767. https://doi.org/10.1038/ncomms12767 71. Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, Haynes C, Henning AK et al (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308(5720):385–389 72. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23(19):2633–2635 73. Alqudah AM, Sallam A, Baenziger PS, Börner A (2020) GWAS: fast-forwarding gene identification and characterization in temperate cereals: lessons from barley—a review. J Adv Res 22:119–135 74. Zhou X, Stephens M (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44(7):821–824 75. Listgarten J, Lippert C, Heckerman D (2013) FaST-LMM-select for addressing confounding from spatial structure and rare variants. Nat Genet 45(5):470–471 76. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D (2011) FaST linear mixed models for genome-wide association studies. Nat Methods 8(10):833–835 77. Eu-Ahsunthornwattana J, Howey RA, Cordell HJ (2014) Accounting for relatedness in family-based association studies: application to Genetic Analysis Workshop 18 data. In: BMC proceedings, vol 8(1). BioMed Central, pp 1–5 78. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D et al (2007) PLINK: a tool set for whole-genome association and
population-based linkage analyses. Am J Hum Genet 81(3):559–575 79. Lipka AE, Tian F, Wang Q, Peiffer J, Li M, Bradbury PJ et al (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics 28(18):2397–2399 80. Wang J, Zhang Z (2018) GAPIT version 3: an interactive analytical tool for genomic association and prediction. Preprint 81. Hardy OJ, Vekemans X (2002) SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Mol Ecol Notes 2(4):618–620 82. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829 83. Heffner EL, Lorenz AJ, Jannink JL, Sorrells ME (2010) Plant breeding with genomic selection: gain per unit time and cost. Crop Sci 50(5):1681–1690 84. Wong CK, Bernardo R (2008) Genomewide selection in oil palm: increasing selection gain per unit time and cost with small populations. Theor Appl Genet 116(6):815–824 85. Singh BD, Singh AK (2015) Marker-assisted plant breeding: principles and practices. Springer, New Delhi, pp 77–122 86. Heffner EL, Sorrells ME, Jannink JL (2009) Genomic selection for crop improvement. Crop Sci 49:1–12 87. Habier D, Fernando RL, Dekkers JC (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177(4):2389–2397 88. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423 89. Misztal I, Legarra A, Aguilar I (2009) Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J Dairy Sci 92(9):4648–4655 90. Speed D, Balding DJ (2014) MultiBLUP: improved SNP-based prediction for complex traits. Genome Res 24(9):1550–1557
Chapter 5
Mapping of Quantitative Traits Loci: Harnessing Genomics Revolution for Dissecting Complex Traits
Sanchika Snehi, Mukesh Choudhary, Santosh Kumar, Deepanshu Jayaswal, Sudhir Kumar, and Nitish Ranjan Prakash
Abstract
With the onset of the genomic era, mapping of Quantitative Trait Loci (QTLs) appears to be a feasible solution for dissecting the complex architecture of numerous multigenic traits of significance. Great progress has been made in QTL mapping in the recent past; advances in fine mapping and expression studies, conjugated with cheaper prices, present a new arena for QTL mapping of under-studied species that lack reference genome sequences. Here we review the basics of and advances in mapping of QTLs, QTG (Quantitative Trait Gene) sequencing, meta-QTL analysis, and statistical software for mapping and validating QTL. This review also emphasizes precision phenotyping through the amalgamation of AI into trait phenotyping. Despite the identification of numerous major and minor QTLs governing various traits, only a few have been utilized in crop improvement programs. Furthermore, validation of identified QTLs and their introgression into elite lines are strongly recommended.
Key words QTL mapping, Markers, Mapping population, Methods of QTL mapping, SNP genotyping
1 Introduction
The great majority of economically important traits exhibit polygenic inheritance and environmental influence in their phenotypic expression; such traits are known as quantitative traits. Hence, a QTL may be defined as any genomic region associated with the existing variation for a trait phenotype in a relevant reference population that can be identified by one or more sequence-based DNA markers when applied in conjugation with a suitable experimental design and statistical analysis method [1]. Quantitative trait locus (QTL) mapping is a genome-wide inference of the relationship between genotype at various genomic locations and phenotype for a set of quantitative traits, in terms of the number, genomic positions, effects, and interactions of QTL. This localization is relevant for the ultimate identification of responsible genes and also for
understanding the underlying genetic mechanisms of the trait variation [2]. Mapping QTL also helps to decipher the number of QTLs significantly contributing to the trait variation in a population. It also aids in understanding the extent of variation due to the additive, dominance, and epistatic effects of QTL. The nature of the genetic correlation between different traits in a genomic region, which may be due to pleiotropy or close linkage, can also be apprehended from QTL mapping. The QTL so discovered can be useful entry points for many other investigations, to understand the fine detail of the DNA sequence polymorphisms in the region of the QTL and the gene-to-phenotype relationships for the traits [3]. In the recent past, numerous QTLs governing various quantitative traits have been discovered and mapped [4]. Based upon the size of effect, the extent of environmental influence on their expression, the type of effect produced, and their mode of action, QTLs have been grouped under different categories. QTLs producing a direct effect on the expression of the concerned traits are known as main-effect QTLs, while those interacting with main-effect QTLs to influence the expression of the concerned traits are known as epistatic QTLs [5]. Main-effect QTLs are further classified into major QTL (explaining ≥10% of the phenotypic variation of the concerned trait) and minor QTL (explaining <10% of the phenotypic variation) [5]. Similarly, QTLs whose expression is little affected by the environment are known as stable QTL, while the expression of unstable QTL is greatly influenced by the environment. In general, major QTLs show stable expression across environments, while the expression of most minor QTLs is environment sensitive. QTLs affecting expression levels can be classified as expression QTL (eQTL) and metabolic QTL (mQTL): eQTLs affect the level of RNA transcript produced by various genes, while mQTLs regulate the rates and levels of various metabolic reactions. Plant breeding relies on the output of genetic studies (QTL mapping and trait inheritance), wherein a trait is broadly dissected and utilized for crop improvement using sound genetic principles and the rationale behind them [3]. QTLs have been widely used in crop improvement programs for a variety of traits, such as yield improvement and tolerance to several abiotic stresses including drought, heat, salinity, and cold. For several biotic stresses, such as disease resistance, pest resistance, nematode resistance, and resistance to parasitic plants, QTLs have been identified and are being used in modern crop improvement programs [4]. Saltol for salinity tolerance and Sub1 for submergence tolerance are the two most widely used QTLs across the globe in breeding superior rice tolerant to seedling-stage salinity and to flash floods, respectively [6].
2 Principle of QTL Mapping
QTL mapping assumes co-segregation of genes and markers via chromosome recombination during meiosis, thus allowing their analysis in the progeny [7]. QTL analysis is based on the principle of detecting an association between the phenotype and the genotype of individuals. Markers are used to partition the mapping population into different genotypic groups, based on the presence or absence of a particular marker locus, and to determine whether significant differences exist between groups with respect to the trait being measured [1, 8]. A significant difference between the phenotypic means of the groups, depending on the marker system and type of population, indicates that the marker locus being used to partition the mapping population is linked to a QTL controlling the trait. The general procedure for linkage mapping is briefly summarized in Fig. 5.1.
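The partitioning principle can be illustrated with a toy example in R (the numbers are invented purely for illustration): lines are grouped by their genotype at one marker and the group means are compared.

trait <- c(12.1, 14.3, 11.8, 15.0, 14.7, 11.5, 12.4, 15.2)  # phenotypes of 8 lines
mk    <- c("aa", "AA", "aa", "AA", "AA", "aa", "aa", "AA")  # genotype class at one marker

t.test(trait ~ mk)  # a significant mean difference suggests a QTL linked to the marker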
3 Prerequisites of QTL Mapping
The basic requirements for QTL mapping are:
1. A suitable mapping population;
2. A saturated linkage map of the species using molecular markers;
3. Reliable means of phenotypic evaluation;
Fig. 5.1 The general procedure for QTL mapping in plants
4. An appropriate statistical/software package for QTL detection and mapping.
3.1 Molecular Markers
A molecular genetic marker is a DNA sequence with a known chromosomal location and a specific detection assay. Molecular markers may be closely associated with a target gene and act as signs or flags [9]. The development of DNA-based markers has been a major breakthrough in the mapping of QTLs [1]. One of the main uses of DNA markers in plant breeding has been the construction of linkage maps, which have been widely utilized for the identification of chromosomal regions harbouring genes regulating quantitative traits [10]. The first application of DNA markers in QTL mapping was by Paterson et al. [11], in which 29 putative QTLs were mapped for fruit size, pH, and soluble solids using 237 backcross progenies derived from a cross between Solanum lycopersicum and Lycopersicon chmielewskii. The different types of molecular markers are given in Fig. 5.2.
(a) Restriction Fragment Length Polymorphism (RFLP) was the first molecular marker, based on hybridization. RFLP is polymorphism detected through the differential hybridization of cloned DNA to DNA fragments in a sample of restriction-enzyme-digested DNAs; it is defined by a specific enzyme-probe combination. It has good repeatability and is particularly useful in fingerprinting and comparative genome mapping. It is co-dominant in nature and hence can be used effectively to differentiate homozygotes from heterozygotes. However, the assay is tedious and time consuming and requires a large amount of DNA [12].
(b) Random Amplified Polymorphic DNA (RAPD) detects polymorphism based on differential PCR amplification of a sample of DNAs. It is dominant in nature.
Fig. 5.2 Types of molecular markers used in QTL mapping studies
Random primers, usually 10 nucleotides long, are used; they can anneal at a multitude of genomic locations. The amplified PCR products are then run on gel electrophoresis to separate the fragments and visualize them under UV light. Polymorphism arises due to mutation at primer binding sites, which leads to the absence of a band of a specific size. Poor reproducibility and dominant nature are the major limitations in the use of these markers [13].
(c) Amplified Fragment Length Polymorphism (AFLP): The limitations of the RAPD and RFLP techniques were overcome through the development of AFLP markers. AFLP detects polymorphism using a procedure that combines restriction digestion with a frequent cutter (tetra-cutter) and a rare cutter (hexa-cutter) and ligation of adaptors to the fragment ends. PCR amplification is then done using primers that include the adaptor and cut-site sequences. Visualization is done as for RAPD. These markers show stable amplification and higher repeatability; however, they are also dominant in nature [14].
(d) Sequence Characterized Amplified Region (SCAR) is a PCR-based marker representing a single, genetically defined locus that is identified by PCR amplification of genomic DNA with a pair of specific oligonucleotide primers. SCAR primers are designed based on the end sequence of a polymorphic RAPD fragment to amplify the corresponding locus [15].
(e) Simple Sequence Repeat (SSR), also referred to as a microsatellite marker. Microsatellites, with variable tandem repeats ranging from one to ten nucleotides, remain dispersed throughout the genome of most eukaryotic organisms. The basis of the SSR marker is length polymorphism resulting from differences in the number of repeats; generally, di- and tri-nucleotide repeats are used as markers. Unique sequences flanking these repeats are used to design primers for PCR amplification, and the PCR products are then run on gel electrophoresis and visualized under UV. These markers are abundant and uniformly distributed in the genome, hyper-variable, and co-dominant in nature, with very high reproducibility. They are a powerful tool for genotypic differentiation, seed purity evaluation, and MAS, and are widely used in QTL mapping in crop plants [16].
(f) Inter Simple Sequence Repeats (ISSR) are di- or tri-nucleotide repeats anchored by two to four nucleotides at one of the two ends. A single primer consisting of the repeat sequence is used, so that if two inversely oriented microsatellites are present within an amplifiable distance from each
other, the inter-repeat sequence is amplified. These markers have high background noise and are less favourable for MAS [17].
(g) Sequence Tagged Sites (STS) are short, unique sequences identifying a specific locus that can be amplified by the polymerase chain reaction. Each STS is characterized by a pair of PCR primers, designed by sequencing an RFLP probe representing a mapped low-copy-number sequence. Polymorphism in STS appears either as the presence or absence of an amplification product or as a length polymorphism, which converts the STS into a co-dominant marker. STS can also be developed from any cloned sequence, such as RAPD and AFLP fragments.
(h) Cleaved Amplified Polymorphic Sequences (CAPS) are a form of genetic variation in the length of DNA fragments generated by restriction digestion of PCR products. The primers can be derived from a gene bank, genomic or cDNA clones, or cloned RAPD bands. This marker class is co-dominant in nature.
(i) An Expressed Sequence Tag (EST) is a DNA fragment representing the sequence of a cDNA clone corresponding to an mRNA molecule or part of it. Thus, ESTs serve as markers for genetic and physical mapping and as clones for expression analysis. EST databases have proven to be a tremendous resource for finding genes and for inter-specific sequence comparison, and ESTs have also been utilized for developing SSR markers.
(j) Diversity Array Technology (DArTseq) provides a great opportunity for genotyping polymorphic loci distributed over the genome. It is highly reproducible and is based on microarray hybridization technology. It does not require previous sequence information for the detection of loci for a trait of interest [18]. The most important benefits of this technique are that it is high throughput and very economical.
(k) The last 30 years have witnessed continuous development in molecular marker technology, from RFLP to SNPs and diversity-array-technology-based markers. A Single Nucleotide Polymorphism (SNP) represents a difference found at the single-nucleotide level; this type of polymorphism results from substitutions, deletions, or insertions. SNPs are abundant in plant genomes but have previously been costly and time-consuming to apply to crop improvement programs [2]. Of late, there has been a dramatic reduction in the cost and time of high-throughput SNP genotyping platforms, a reduced cost of genomic DNA sequencing, and robust algorithms for big-data handling.
Fig. 5.3 SNP genotyping methods and assays used to detect SNPs across genomes
These developments have expanded the options available to plant breeders for utilizing abundant SNP markers in their programs. Advances in next-generation sequencing (NGS) technologies have taken the implementation of SNPs for genetic analysis to a new level. Genotyping-by-sequencing (GBS) has opened new doors for breeders with cost-effective, genome-wide scanning and multiplexed sequencing platforms. In principle, GBS can simultaneously perform SNP discovery and genotyping, which is particularly advantageous for under-studied species that lack reference genome sequences. Their co-dominance, abundance, and cost-effectiveness have made SNPs ideal for the construction of genetic maps in plants; SNP-based genetic maps have been developed in several crops such as rice [19], cotton [20], soybean [21], maize [22], and Brassica [23]. SNP genotyping is the measurement of genetic variation at single nucleotide polymorphisms (SNPs) between members of a species; the different kinds of SNP genotyping assays are listed in Fig. 5.3. Different types of markers, like RFLP, AFLP, ISSR, SSR, ESTs, DArT, and SNPs, have been widely used for the construction of linkage maps in crop plants [24]. In general, 100–200 markers are found adequate for the construction of linkage maps [25], albeit the number varies between studies and depends on the species' genome size, as species with larger genomes require a larger number of markers.
Fig. 5.4 Different kinds of biparental mapping populations and their scheme of development. (RILs recombinant inbred lines; NILs near isogenic lines; BILs backcross inbred lines; DH doubled haploid; CSSLs chromosome segment substitution lines)
However, with the advent of NGS, numerous DNA markers are now utilized for high-resolution genetic mapping.
3.2 Mapping Populations
A population of genetically related individuals, developed by crossing two or more different lines, whose segregating generations and progenies are handled in a definite fashion and used for linkage mapping, is called a mapping population [26]. The parental lines must be diverse enough to differentiate the progenies into distinct phenotypic classes that can be used for mapping the quantitative traits [2]. A brief outline of the methods of generating several mapping populations is given in Fig. 5.4. Among these, the F2 and its segregating generations, F2:3, and backcross populations are mortal populations, which cannot be reconstituted as such with the same individuals [24]. Populations containing fixed, non-segregating individuals are called immortal populations; these can be reconstituted and replicated repeatedly [27]. The F2 and F2:3 are the best populations for identification and detection of QTLs (both additive-effect and epistatic QTLs), while Recombinant Inbred Lines (RILs), Backcross Inbred Lines (BILs),
and Doubled Haploids (DHs) are good for detection of both major- and minor-effect QTLs (only additive-effect QTLs) [28]. Chromosome Segment Substitution Lines (CSSLs) and Near Isogenic Lines (NILs) are best for fine mapping of already identified QTLs [29].
3.3 Phenotyping for QTL Mapping
Accurate, efficient, and robust phenotyping is a prerequisite for the identification of true QTL, and recent advances in phenotyping have heavily augmented present-day QTL mapping. Recent advances in the realm of functional genomics have increased the number of genes that are sequenced and have led to the identification of scores of genes influencing key traits [30]; however, without extensive phenotyping it is not possible to assign the precise phenotypic role of each gene. This calls for protocols for the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at different levels of organization, including the cell, tissue, organ, individual plant, plot, and field. These phenotypes can be linked to genomic information using QTL mapping or association studies and can further be utilized for crop improvement in the desired direction. With the rapid development of novel sensors, image-processing technology, data-analysis technology, and phenotyping systems at multiple scales, such as microscopic, ground, and aerial phenotyping platforms, effective solutions for high-throughput crop phenotyping in the field as well as in controlled environments have been provided [31]. The development of intelligent phenotyping robots combines the detection accuracy of stationary phenotyping platforms with the detection efficiency of UAV (drone) platforms. The combination of multi-source data obtained from robotic screening using artificial intelligence with image-based intelligent phenotyping and data recording can explain new biological phenomena in a multi-dimensional manner [32]. Accelerating the application of artificial intelligence and other modern information technologies in agriculture, especially in crop genetic studies and breeding, is urgently needed.
4 Methods of QTL Mapping
Quantitative Trait Loci (QTL) mapping utilizes basic Mendelian genetics for linkage map construction and modern computational power for analyzing and studying quantitative traits in a Mendelian fashion [9]. The basic principles and prerequisites for QTL mapping have already been explained in the previous sections. This section deals with the basics of linkage map construction and the different QTL mapping methods.
4.1 Linkage Map Construction
A linkage map, or genetic map, provides landmarks in the genomes of the parents and is used to analyse which regions govern the trait of interest [33]. It indicates the relative positions of genetic markers in terms of centimorgans (cM) alongside chromosomal landmarks, which serve as positions or regions on the chromosome governing certain quantitative traits [9]. The prediction of QTL regions using such a map is possible by utilizing the segregation of markers in the progenies of contrasting parents differing for the trait of interest; these recombinants are generated by crossing over and recombination events occurring during meiosis [7]. The frequency of recombinants is used to calculate the map distance in centimorgans (cM) by means of a mapping function. Readers are assumed to be familiar with the basic concepts of linkage mapping, types of mapping functions, and their application to various types of biparental populations, or they can consult any textbook of genetics [34]. In mapping populations where heterozygous genotypes can be traced and visualized using classical genetic approaches, the recombination fraction can be calculated easily by simple Mendelian methods, although heavy computation is required when many molecular markers are used. In populations such as F2, RILs, and other similar populations, where it is not possible to trace the genotype of the F1 using classical genetics, the maximum likelihood (ML) method is used for determining the genetic distances and constructing a linkage map [2, 35]. Similarly, in a natural population, a map is generated using linkage disequilibrium (LD), the non-random association between two loci within a population otherwise at Hardy-Weinberg equilibrium [36]. The linkage between molecular markers, i.e., the non-independence of segregation of two loci, can be estimated by the chi-square method or by logarithm of odds (LOD) estimates. Because QTL mapping experiments utilize many loci at a time, the LOD score is the natural choice for map construction. The ratio of the odds of linkage between two loci to the odds of no linkage, expressed as a logarithm, is called the logarithm of odds (LOD) score or LOD value of linkage [37]; the maximum likelihood method is used to calculate the LOD value of linkage between two markers. A LOD value of more than 3 is generally considered significant and is used for linkage map construction; it indicates that linkage between the two loci is 1000 times more likely than no linkage [35]. In a QTL mapping experiment, molecular markers that are linked together and show dependent segregation are grouped into linkage groups [38]. The number of linkage groups within a particular mapping population ideally cannot be fewer than the haploid number of chromosomes; a linkage group therefore represents a chromosome or a chromosomal segment.
Mapping of Quantitative Traits Loci: Harnessing Genomics Revolution for. . .
135
Fig. 5.5 Steps involved in linkage map construction
segment. Uneven distribution of molecular markers over chromosomes, less number of markers assayed, types of mapping population, and small mapping population size determine the effectiveness of linkage map construction in any QTL mapping experiment and in turn affect accuracy, effectiveness, and efficiency in the identification of robust and major QTLs and prediction of their contribution in phenotype determination [5]. Considering the advancement in genotyping assay, ease in analysing big dataset, robust and reliable phenotyping platforms, phenomics assisted phenotyping and low cost genotyping, the QTL mapping using several biparental, multiparent, cross-pollinated populations and breeding advanced pedigree lines cross have become easy and will rule the trait mapping and characterization in near future [3, 39, 40]. Analyzing and generating genetic map distances with two or few markers are easy with manual calculations, but not suitable for the vast amount of genotyping data generated out of QTL mapping experiments utilizing a number of whole genome molecular markers [41]. For such mapping experiments, advanced mapping software such as MapMaker, JoinMap etc. is used and is described in the later part of this chapter. A figure explaining the steps involved in the construction of the linkage map is given in Fig. 5.5. 4.2 Linkage Based QTL Mapping
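To make the mapping-function idea concrete, the short R sketch below converts an observed recombination fraction into Haldane and Kosambi map distances and computes a two-point LOD score. It is an illustrative helper written for this chapter, not part of any of the packages discussed later, and the numbers are simulated.

# Convert a recombination fraction r into map distance (cM) using the
# two standard mapping functions.
haldane <- function(r) -50 * log(1 - 2 * r)                 # assumes no interference
kosambi <- function(r)  25 * log((1 + 2 * r) / (1 - 2 * r)) # accounts for partial interference

r <- 0.10        # 10% recombinants observed between two markers
haldane(r)       # ~11.16 cM
kosambi(r)       # ~10.14 cM

# Two-point LOD score: likelihood of linkage at recombination fraction r
# versus free recombination (r = 0.5), for nrec recombinants in n progeny.
lod2pt <- function(nrec, n, r) {
  l1 <- nrec * log10(r) + (n - nrec) * log10(1 - r)  # linked model
  l0 <- n * log10(0.5)                               # unlinked model
  l1 - l0
}
lod2pt(nrec = 10, n = 100, r = 0.10)  # ~16; LOD > 3 supports linkage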
4.2 Linkage Based QTL Mapping

Searching for a QTL within the genetic map of the parents amounts to finding an association between a genomic region, marked by markers present on the map, and the phenotypic value of the trait [2]. The QTL mapping methods are explained in the following sections.
Fig. 5.6 Marker-trait association analysed using simple linear regression, y = b0 + b1x + e, where y = trait mean value of a genotype in the population; b0 = mean of the trait over the population; b1 = additive effect of allelic substitution from 'a' to 'A'; e = error associated with environmental influence on the genotype, phenotyping error, etc.; x = genotypic code for the locus being tested (-1 for the female parent and 1 for the male or donor parent)

4.2.1 Single Marker Analysis
A polymorphic molecular marker can differentiate the individuals of a population into a number of genotypic classes. For example, a marker with two alleles "A" and "a" can have three genotypic classes, "AA", "Aa" and "aa", provided that the species under study is diploid. The mean phenotypic value of one genotypic class may differ significantly from another, which indicates the presence of a QTL in the proximity of the marker under study [42]. When a QTL is tightly linked to the marker, they tend to co-segregate, resulting in significant differences between the means of the genotypic classes. Markers that are distant from the QTL, loosely linked, or present on different chromosomes segregate independently and show no differences between genotypic classes. Such significant marker-trait associations can be detected by analysis of variance (ANOVA), linear regression, maximum likelihood estimation, the likelihood ratio test, or a simple t-test [1, 8]. Regression analysis yields the coefficient of determination (R2), which represents the amount of phenotypic variation explained by the identified QTL [35]. A significant marker-trait association tested by linear regression is illustrated in Fig. 5.6. This method does not require any linkage map construction. Each marker is checked independently for marker-trait association (this may result in many false-positive QTL detections, as variation in genotypic frequency may also be associated with segregation distortion, biased allelic behaviour, etc.).
Detection of a QTL depends on the population size (number of individuals), the type of population, recombination between marker and trait, and the effect of the QTL. This method is not able to detect small-effect QTLs, QTL × QTL interactions, QTL × environment interactions, or the position of the QTL on the chromosomal map [2].
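Because single marker analysis is just a regression of phenotype on genotype, it can be sketched in a few lines of base R. The data below are simulated, and the -1/1 genotype coding follows Fig. 5.6; this is a minimal illustration, not a replacement for dedicated QTL software.

# Minimal single-marker test in base R on simulated data.
set.seed(1)
n <- 200
x <- sample(c(-1, 1), n, replace = TRUE)  # genotype codes for the two parental classes
y <- 10 + 1.5 * x + rnorm(n, sd = 2)      # trait with an additive QTL effect

fit <- lm(y ~ x)                 # y = b0 + b1*x + e, as in Fig. 5.6
summary(fit)$coefficients        # b1 estimates the additive effect of allelic substitution
summary(fit)$r.squared           # R^2 = phenotypic variation explained by this marker
anova(fit)                       # equivalent test of the marker-trait association

# In practice this test is repeated for every marker, so p-values must be
# corrected for multiple testing, e.g. p.adjust(p, method = "BH").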
4.2.2 Simple Interval Mapping

Marker intervals are like boundaries that define the ends or neighbourhood of a detected QTL. It is always better to define the boundaries and location of the genomic region governing a trait, so as to further fine map it and identify the candidate gene(s). Simple interval mapping (SIM) utilizes the linkage map and tests for the possible presence of a QTL at each small step (e.g. 2 cM) within a marker interval. It considers one marker interval at a time and therefore cannot identify the true effect of a QTL in that interval or separate out the effects of QTLs present in other intervals [43]. Likewise, the presence of more than one QTL in an interval may go undetected, as they are predicted as a single merged QTL [9]. The statistical significance of the predicted QTL is measured at each step within a marker interval in terms of a LOD score and depicted along the genetic map. In general, a LOD score of 3 is taken as the threshold; LOD values ≥3 are considered significant, and those regions are said to harbour a QTL [5]. SIM also utilizes analysis of variance (ANOVA), simple linear regression, the maximum likelihood ratio, and the t-test to predict QTLs [43]. Using a marker interval brings a significant improvement in detecting major and minor QTLs, as it reduces the impact of recombination between QTL and marker (the chance of recombination between the two flanking markers and the QTL is much lower than between a single marker and the QTL). The output of SIM includes the expected QTL position within an interval, the effect of the QTL, the phenotypic variance explained (as the coefficient of determination, R2), the significance of the QTL (as a LOD value), the parental allelic contribution of the QTL, and estimates of additive and dominance effects [35]. The major limitation of this method is that it cannot separate the effects of linked QTLs, nor remove the effect of QTLs in other intervals from the QTL in the interval under study [33]. It is also unable to detect QTL × QTL interactions or QTL × environment interactions. Simple interval mapping is illustrated in Fig. 5.7a.
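A minimal sketch of simple interval mapping with the R/qtl package (listed in Table 5.1), using the demo backcross that ships with the package; the permutation-based threshold shown is an alternative to the fixed LOD = 3 rule described above.

# Simple interval mapping with R/qtl.
library(qtl)
data(hyper)                              # example backcross shipped with R/qtl
hyper <- calc.genoprob(hyper, step = 2)  # QTL genotype probabilities every 2 cM
out.sim <- scanone(hyper, method = "em") # interval mapping by maximum likelihood

# Genome-wide LOD threshold from 1000 permutations instead of a fixed LOD = 3
operm <- scanone(hyper, method = "em", n.perm = 1000)
summary(out.sim, perms = operm, alpha = 0.05)  # significant QTL peaks
plot(out.sim)                                  # LOD curves along the linkage map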
4.2.3 Composite Interval Mapping
Composite interval mapping (CIM) performs simple interval mapping together with multiple regression [44]. It is able to remove the effect of background QTLs present elsewhere in the genome (other markers are used as covariates in the analysis) from the QTL present in the interval under study, and thus fine-tunes the map by reducing the QTL confidence interval and sharpening the QTL peak (Fig. 5.7b). CIM is based on the premise that the phenotypic variation between two individuals both carrying the QTL allele at position 1 must arise from other QTLs in the genome, which may confound the QTL at position 1. CIM first performs single marker analysis to identify all the significant markers harbouring QTLs and then uses typical forward or stepwise regression to build multiple-marker models. The marker with the most significant effect on the trait is modelled together with the second most significant marker, their phenotypic contribution is tested, and if significant another marker is added, until all informative markers are included. These significant markers (all markers for which the model remains significant) are used as cofactors while performing interval mapping by multiple regression analysis. CIM estimates the positions of QTLs on the map, their statistical significance (LOD score), the percentage of phenotypic variation explained (R2), the source of alleles (parental source), and estimates of gene effects. It is not able to analyse QTL × QTL interactions or QTL × environment interactions [35].

Fig. 5.7 QTL predicted between markers M1 and M2 (a) by simple interval mapping; (b) by composite interval mapping. The QTL peak is shown and the QTL interval is marked (the QTL confidence interval denotes the interval from where the QTL curve first exceeds a LOD score of 3 to where it falls back below 3). The LOD threshold of 3 is shown as a dotted line. Map distances are given in centimorgans (cM). Within the interval M1-M2 itself, another peak with LOD = 6 appears in CIM but is absent in SIM; simple interval mapping is not able to differentiate QTLs present in the same interval
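The same demo data can be run through R/qtl's composite interval mapping to see the effect of marker cofactors; the number of covariates and the window size below are illustrative defaults, not recommendations.

# Composite interval mapping in R/qtl: marker cofactors absorb background QTLs.
library(qtl)
data(hyper)
hyper <- calc.genoprob(hyper, step = 2)
out.cim <- cim(hyper, n.marcovar = 3, window = 10)  # 3 cofactors, 10 cM exclusion window
summary(out.cim, threshold = 3)  # peaks above the usual LOD threshold
plot(out.cim)                    # compare with the SIM curve: sharper, narrower peaks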
4.2.4 Inclusive Composite Interval Mapping

Inclusive composite interval mapping (ICIM) includes all markers in the linear regression model, whereas composite interval mapping (CIM) uses only the significant markers. Important markers harbouring QTLs are identified by standard stepwise regression, and the significant markers (those with significant regression coefficients) are used as cofactors, while the regression coefficients of the other markers are kept at zero during the analysis [45]. By keeping all markers in the analysis, the chance of false positives is minimized. This method is able to detect dominance as well as two-gene epistasis. If a QTL peak lies in the middle of a marker interval, ICIM distributes the QTL effect equally between the two flanking markers; if the peak is closer to one marker, the effect is placed closer to that marker. ICIM also predicts the phenotypic values of individuals harbouring the QTLs. It is better than CIM in detecting true QTLs, enhancing QTL prediction, narrowing the confidence interval, and detecting two-gene epistasis, and it is simpler and easier to perform [2]. Joint inclusive composite interval mapping (JICIM) is a slight modification of ICIM used to map QTLs in multiple-cross populations sharing one common parent, e.g. nested association mapping (NAM) populations [46].
4.2.5 Multiple Interval Mapping
As the name suggests, multiple interval mapping (MIM) considers multiple intervals at a time rather than a single interval as in the other methods [47]. It detects multiple QTLs at a time and estimates QTL × QTL interactions as well as QTL × environment interactions. It does not use the complicated procedures of selecting markers by stepwise or forward selection to design a marker model; instead, it uses multiple selection procedures such as combined forward and backward selection, forward search methods, etc. Like CIM, it can utilize an expectation-maximization algorithm, and additionally it can calculate epistatic interactions. As it considers many intervals together, the calculation is computationally intensive and difficult to run when the number of intervals is large. Apart from this, the Bayesian QTL mapping method is also used to detect multiple QTLs at a time, utilizing reversible-jump Markov chain Monte Carlo (MCMC) to explore models with varying numbers of QTLs [48].
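R/qtl's stepwiseqtl() gives a flavour of multiple-QTL model search: several positions and their pairwise (epistatic) interactions are fitted jointly. It is in the spirit of MIM, though not the exact MIM algorithm, and the penalty values below are illustrative only (in practice they are derived from two-dimensional permutation scans).

# Multiple-QTL model search in R/qtl (several intervals and QTL x QTL
# interactions fitted jointly).
library(qtl)
data(hyper)
hyper <- calc.genoprob(hyper, step = 2)
pen <- c(3.0, 4.5, 2.5)  # main / heavy-interaction / light-interaction penalties (illustrative)
out.sq <- stepwiseqtl(hyper, max.qtl = 4, penalties = pen,
                      method = "hk", additive.only = FALSE)  # allow epistatic terms
out.sq  # the chosen multiple-QTL model with positions and formula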
4.3 Bulk Sequencing Based QTL Mapping

Whole genome resequencing and high-throughput computation technologies have changed the whole scenario of QTL mapping in the past decade. Several QTL mapping experiments have used SNP-based genotyping (using SNP genotyping assays), chip-based microarray genotyping, genotyping by sequencing (low depth), whole genome resequencing of genotypes, etc. Low-cost phenomics technology has also aided this [2]. As previously discussed in this chapter, SNP-based genotyping has been used in standard QTL mapping methods. The newer approach of whole genome resequencing-based bulked segregant analysis (WG-BSA) used for QTL mapping is explained in this section.
4.3.1 QTL-Seq
QTL-Seq is a bulked segregant analysis (BSA)-based QTL mapping approach utilizing whole genome resequencing of low and high bulks [49]. The individuals of the mapping population evaluated for a particular trait typically display a normal distribution. Twenty to fifty individuals from each extreme (low and high) are selected and classified into low and high bulks. DNA isolated from these genotypes is pooled in equal quantity (at equal concentration) as low-bulk and high-bulk DNA. The bulked DNA is sequenced and the reads are aligned to the reference genome. The SNP-index is calculated for the two bulks and compared to identify putative SNPs that differentiate the two extreme bulks. Genomic regions harbouring SNPs with contrasting SNP-index values between the bulks are considered to be linked to the QTL [50]. Zhang et al. [51] used QTL-seq for the identification of a QTL governing powdery mildew resistance in Korean cucumber.
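The core QTL-seq statistic is simple to compute once per-bulk allele depths are available. The sketch below uses hypothetical read-depth columns to show the SNP-index of each bulk and the delta SNP-index; real pipelines additionally smooth these values in sliding windows and derive confidence intervals by simulation.

# Per-SNP index and delta SNP-index on hypothetical bulk read depths.
snp <- data.frame(
  pos        = c(1e5, 2e5, 3e5),
  alt_high   = c(28, 30, 5),  depth_high = c(30, 32, 30),
  alt_low    = c(3, 2, 14),   depth_low  = c(30, 28, 29)
)
snp$index_high <- snp$alt_high / snp$depth_high  # SNP-index of the high bulk
snp$index_low  <- snp$alt_low  / snp$depth_low   # SNP-index of the low bulk
snp$delta      <- snp$index_high - snp$index_low # ~0 away from the QTL; near +/-1 at the QTL
snp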
4.3.2 QTG-Seq
Quantitative trait gene sequencing (QTG-seq) is a method to fast-track gene identification for a quantitative trait. Zhang et al. [52] used QTG-seq to identify a candidate gene for the plant height QTL qPH7 in maize by sequencing bulked segregant pools. In QTG-seq, two genotypes differing for the trait of interest are crossed and segregating generations such as F2 and backcrosses are developed. QTL mapping is performed in the F2 and reconfirmed in the F2:3. From the backcross generation (BC1F1), individual plants heterozygous for the desired QTL of interest and homozygous for the other QTLs are selected. These plants are selfed to generate BC1F1:2 progenies, which are evaluated extensively for the desired phenotype. The extreme genotypes are selected and used for bulked segregant analysis with a modified algorithm. The rationale behind this method is to convert a quantitative trait into a nearly qualitative trait and use it to fine map the QTL and identify the candidate gene, a process that is otherwise time-consuming and labor-intensive even in the era of "big data".

4.4 Modified Approaches of QTL Mapping
QTL mapping methods have been modified at several stages to suit the need for better identification and validation of QTLs. Instead of bi-parental mapping populations, multiparent populations such as nested association mapping (NAM) panels, multiparent advanced generation inter-cross (MAGIC) populations, advanced backcrosses, inter-crossed segregants, etc. have been used for QTL mapping [53]. Epigenetic marks, such as DNA methylation, small RNAs, and chromatin remodelling, are sometimes heritable and can serve as stress memory or developmental signals. Such heritable marks have been used to map QTLs for traits related to epigenetic reprogramming [54]. Similarly, transcriptomic and metabolomic studies have also been used to map QTLs or to substantiate QTL mapping studies [55, 56].
4.5 Meta-QTL Analysis
Generally, mapped QTLs have large QTL intervals, may overlap with QTLs at the same position in other QTL mapping studies, and show inconsistencies arising from the different sizes of linkage maps in various populations. Such phenomena restrict the use of mapped QTLs in plant breeding unless a QTL has a large effect, is fine mapped to a very small interval, or its candidate gene is known [57, 58]. Meta-analysis of QTLs deals with the identification of consensus QTLs linked with the trait of interest, or with a meta-trait governed by several component traits. Such a consensus QTL has a small confidence interval and high phenotypic influence and can help in candidate gene identification as well as fine mapping. QTL mapping studies related to the trait of interest have to be collected from the literature and used to build a database of all the QTLs and their linkage maps. The QTL and map files have to be generated as specified in the manual of the BioMercator_v4.2 software [59], and Meta-QTLs are then predicted as per the instructions given in the manual. Shen et al. [60] co-localized QTLs for morphological traits in Brassica using meta-analysis.
5 Software and Packages for QTL Mapping

Currently, numerous software packages are available for linkage map construction and QTL mapping in bi-parental or multi-parental populations, and most are open access/free (Table 5.1). MAPMAKER was the first landmark software for the construction of linkage maps and QTL mapping, i.e., interval mapping [61]. Although it is still preferred by many research groups, it cannot compute CIM (the effect of other QTLs on locating the target QTL) and is limited to mapping of F2 populations.
Table 5.1 List of software used for linkage map construction and QTL mapping

Name | Platform (OS) | Interface | Input files | Analysis (comments) | Licence | Reference | URL
MAPMAKER_v3.0b | UNIX, DOS, Mac, VMS | CLI menus script | Text (tab delimited) | Linkage map construction; QTL mapping limited to F2 populations | Free | [61] | ftp://genome.wi.mit.edu/pub/mapmaker3
JoinMap_v5 | 64-bit MS-Windows 10 | GUI | Excel, MCD file format | Genetic linkage analysis; linkage map construction | Free/commercial | [81] | https://www.kyazma.nl/index.php/JoinMap/
QTL Cartographer_v2.5 | UNIX™, DOS, Windows, Mac OS | Script | Text file (tab delimited) | SIM/CIM; QTL and environment effects | Free | [62, 78] | http://statgen.ncsu.edu/qtlcart/
PLABQTL_v1.2 | DOS, Windows | CLI script | .qin (prepared using a personal editor/text processing software) | SIM/CIM; QTL and environment effects; multi-environment QTL analysis | Free | [63] | https://www.uni-hohenheim.de/plantbreeding/software/
MQTL_v0.98 | DOS, Sun OS | CLI script | Text (tab delimited) | Multi-environment QTL analysis | Free | [88] | ftp://gnome.agrenv.mcgill.ca/pub/genetics/software/MQTL/
Map Manager QTX_v_b29 | MS-Windows, Mac OS | GUI | Excel (QTL Cartographer format) | Linkage map construction and data preparation for other software (H/K/M mapping functions) | Free | [65] | http://www.mapmanager.org/mmQTX.html
QGene_v4.0 | UNIX™, DOS, Windows, Mac OS | GUI | .qdf (.txt based) | Genetic linkage analysis/maps; SIM | Free | [64] | http://www.qgene.org/
Epistat | DOS 3.0 or higher | GUI | Text (tab delimited) | QTL main effects and interactions between pairs of QTLs (graphs) | Free | [89] | Via e-mail to Chase@Bioscience.utah.edu
Multimapper | Linux, Windows (Cygwin), Sun OS | Script | – | Bayesian QTL mapping (F2 families or backcrosses); multi-QTL models | Free | [72] | http://www.RNI.Helsinki.FI/~mjs/
HAPPY_v2.1 | UNIX, Windows XP | R package based (script) | – | QTL mapping in F2 and MAGIC RIL populations (descended from founders with known genomes) | Free | [66] | http://mtweb.cs.ucl.ac.uk/mus/www/HAPPY/happyR.shtml
Map Chart_v2.32 | MS-Windows | GUI | Text file (tab delimited) | Charting tool for genetic linkage studies | Free | [80] | https://www.wur.nl/en/show/mapchart.htm
QTL Express | Web-based (Java; Netscape Navigator/Communicator recommended browser) | Web-based user interface (first such application) | – | QTL mapping in outbred populations; single/multi-QTLs | Free | [67] | http://gaow.github.io/genetic-analysis-software/q/qtl-express/
MapQTL_v6.0 | Windows (95/98/ME/NT4.0/2000/XP/Vista, 32-bit) | CLI script | – | Genetic linkage analysis/maps (MQM mapping) | Commercial | [76] | http://www.kyazma.nl/index.php/mc.MapQTL/sc
R/QTL_v1.48-1 | UNIX, or compiled code for Windows or Mac | R script based package | CSV | Linkage analysis and QTL mapping | Free | [74] | https://rqtl.org/download/
QTLMap_v0.9.7 | Windows, Unix | Script | – | Linkage analysis (interval mapping); handles large numbers of markers (SNP) and traits (eQTL) | Free | [90] | https://forge-dga.jouy.inra.fr/projects/qtlmap
QTL Network | Windows, Linux, Unix, MacOS | GUI and CLI; web-based server | – | QTL mapping and visualisation; QTL main effects, epistasis and QE interactions | Free | [68] | http://ibi.zju.edu.cn/software/qtlnetwork
One Map_v2.1.3 | Windows or Mac OS | R package (script) | Excel/CSV | QTL mapping in clonal F1 populations | Free | [70] | https://github.com/augusto-garcia/onemap
PROC QTL | SAS environment | SAS (script) | SAS datasets | QTL mapping of continuous and categorical traits (multi-QTL) | Free | [77] | http://www.statgen.ucr.edu/software.html
QTL Network R_v0.1-6 | All OS (R environment) | R package | CSV | Visualizing QTL mapping results | Free | [91] | http://cran.r-project.org/web/packages/QTLNetworkR/index.html
R/mpMap_v1.14 | UNIX, or compiled code for Windows or Mac | R script based package | Text (tab delimited)/CSV | Linkage maps and QTL mapping in multi-parent RILs | Free | [73] | www.cmis.csiro.au/mpMap
Lep-MAP_v3 | Unix, Mac OS and Windows | Java based | Text (tab delimited) | Linkage map construction | Free | [92] | https://sourceforge.net/projects/lep-map3/
DOQTL_v1.19.0 | Windows or Mac OS | R package (script) | Text file (tab delimited) | QTL mapping in multi-parent outbred populations (hidden Markov model) | Free | [79] | https://bioconductor.riken.jp/packages/3.0/bioc/html/DOQTL.html
QTL IciMapping_v4.2 | Windows XP/Vista/7/10, with .NET Framework 4.0 | GUI | MS Excel 2003/2007 or text (tab delimited) | Linkage analysis and QTL mapping (K/H/M mapping functions) | Free | [69] | https://www.isbreeding.net/software/?type=detail&id=27
R/qtl charts_v0.14 | Windows or Mac OS | R package (script) | CSV | Creates interactive charts for QTL data (use with R/qtl) | Free | [82] | https://github.com/kbroman/qtlcharts
GACD_v1.2 | Windows XP/Vista/7/10, with .NET Framework 4.0 | GUI | CSV | Linkage map construction and QTL mapping | Free | [71] | https://www.isbreeding.net/software/?type=detail&id=29
R/QTL2 | UNIX, or compiled code for Windows or Mac | R script based package | CSV | QTL analysis software | Free | [75] | https://github.com/rqtl/qtl2
QTL.gCIMapping_v3.2 | Windows, Mac and Linux | GUI/script code | *.csv or *.txt formats (like ICIM/QTL Cartographer) | Detects small-effect and closely linked QTLs | Free | [85] | https://cran.r-project.org/web/packages/QTL.gCIMapping.GUI/index.html; https://cran.r-project.org/web/packages/QTL.gCIMapping/index.html
Therefore, over time, emphasis was laid on developing software capable of executing CIM. As a result, subsequent software such as Win QTL Cartographer [62], PLABQTL [63], QGene [64], Map Manager QTX [65], HAPPY [66], and QTL Express [67] were developed. Map Manager QTX includes background markers as covariates in weighted least-squares regression [65]. Later, with the development and adoption of mixed linear model and inclusive CIM (ICIM) methods, new software, namely QTLNetwork [68] and QTL IciMapping [69], were released. Margarido et al. [70] developed the OneMap software for QTL mapping in clonal F1 populations, and Zhang et al. [71] developed GACD for QTL mapping in clonal F1 and double-cross populations. Subsequently, with the development of multiple-QTL mapping approaches, namely MIM, multi-marker analysis, Bayesian shrinkage estimation, and penalized maximum likelihood, new packages such as Multimapper [72], mpMap [73], R/qtl [74, 75], MapQTL [76], PROC QTL [77], Win QTL Cartographer [78], and DOQTL [79] were released. A few packages were made exclusively for graphical representation of QTL maps or linkage maps, such as MapChart [80], JoinMap [81], and the R/qtlcharts package [82]. Recently, to detect small-effect (closely linked) QTLs, genome-wide composite interval mapping was proposed by Wang et al. [83] and Wen et al. [84], and QTL.gCIMapping was developed by Zhang et al. [85] to detect such small-effect, closely linked QTLs. The literature indicates greater use of R packages nowadays, as they are open access and provide flexible options for graphical representation (Table 5.1). In addition, the development of graphical user interfaces (GUI) and web-based interfaces has made analysis easier for molecular breeders (Table 5.1). In general, different software provides similar but not exactly the same results, probably due to the use of different algorithms (statistical procedures/data formats/computer platforms). Therefore, it is always recommended to first decide the aim of the QTL analysis and then select a suitable mapping algorithm. For example, main-effect QTLs can be detected by most software, such as MAPMAKER, R/qtl, and QTL Cartographer, but minor-effect QTLs need to be detected with software such as QTL.gCIMapping. Hence, with the emerging use of machine learning and advances in statistical approaches, the future is likely to witness easier-to-use software with better graphical representation. QTL-Seq analysis can be performed using the QTLseqr R package (v0.7.5.2), available at https://rdrr.io/github/bmansfeld/QTLseqR/. It identifies QTLs using two statistical approaches: QTL-seq and G′ [86]. It imports SNPs from GATK or from a delimited file that contains allele read depths for each bulk. SNP filtering is then performed to remove low-confidence SNPs, after which the QTL-seq or G′ analysis provides graphs and a tabular summary. The entire script with demo data is available at the link for the software. Recently, Sugihara et al. [87] developed a high-performance, R-package-based pipeline for QTL-Seq (available at https://github.com/YuSugihara/QTL-seq) that uses FASTQ (with or without trimming) and BAM files for analysis. The basic flow chart of the QTL-seq pipeline and candidate gene prediction is given in Fig. 5.8.

Fig. 5.8 Flow-chart explaining QTL-seq for identification of QTLs and prediction of candidate genes using the next-generation sequencing-based bulked segregant analysis approach
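A hedged sketch of the QTLseqr workflow just described is shown below; the function names follow the QTLseqr documentation, but the file name, bulk labels, and filtering thresholds are placeholders to be adapted to the user's own data.

# Sketch of a QTLseqr run (placeholders: "bulks.table", bulk names, thresholds).
library(QTLseqr)
df <- importFromGATK(file = "bulks.table",   # GATK VariantsToTable output
                     highBulk = "HIGH_BULK",
                     lowBulk  = "LOW_BULK")
df <- filterSNPs(df, refAlleleFreq = 0.20,   # remove low-confidence SNPs
                 minTotalDepth = 40, maxTotalDepth = 200, minGQ = 99)
df <- runQTLseqAnalysis(df, windowSize = 1e6,  # delta SNP-index with simulated CIs
                        popStruc = "F2", bulkSize = 25, replications = 10000)
df <- runGprimeAnalysis(df, windowSize = 1e6)  # alternative G' statistic
plotQTLStats(df, var = "Gprime", plotThreshold = TRUE, q = 0.01)  # genome-wide plot
getQTLTable(df, method = "Gprime", alpha = 0.01)                  # tabular QTL summary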
6 Validation of QTL

QTL mapping does not by itself affirm the effectiveness of the linked markers in discriminating individuals from different populations or environments for the trait concerned. The process of assessing the ability of a linked marker to differentiate individuals from different populations or environments for the trait concerned is therefore called QTL validation (Fig. 5.9). QTL validation is the proofreading of identified QTLs, confirming their marker-QTL association in unrelated germplasm or in different genetic backgrounds [93]. The earlier assumption was that QTLs identified in mapping studies could be used directly in marker-assisted breeding, but a QTL reported in one mapping population is not necessarily reflected in another. QTL validation also estimates the significance of different genetic backgrounds on QTL expression. Since QTL mapping is only an anticipatory statistical dissection of marker-linked loci, validation reduces the probability of statistical error. It provides authentication of the presence of the identified QTLs and also reveals their stability, paving the way for downstream work such as transfer of the QTL, allele mining in the QTL-associated genomic regions, and prediction and validation of candidate genes. Thus, QTL validation in different genetic backgrounds and environments is necessary before a QTL is used in marker-assisted breeding. Validation also reveals any undesirable effect of the QTL on the performance of lines possessing it. A QTL of interest can be validated by phenotypic and genotypic screening of the linked marker in newly developed mapping populations or in advanced breeding generations of the same mapping population from which the QTL was originally reported. Validation can also be performed in a group of cultivars, elite breeding/germplasm lines, or across populations to confirm the marker-trait association. Validation in multiple mapping populations allows interval mapping to be performed on each population to confirm QTLs common among them. The most commonly used approach is phenotypic and genotypic screening of the linked marker in mapping populations generated from parents not involved in the primary study; this helps validate the effect size of the QTL, the marker-trait association, and the position of the QTL. However, the development of a new mapping population is very tedious and time-consuming and requires substantial resources. The development of a set of near-isogenic lines (NILs) with the same donor but different recurrent parents for the QTL of interest provides a more reliable method of validating QTL effects. NILs have proven to be an efficacious resource for QTL validation, and NIL population development is considered the first step in fine-mapping of identified QTLs [94]. The homogeneous genetic background of NILs gives a more accurate estimation of the QTL allele effect [95]. NILs are usually developed through five to six generations of backcrossing, which requires significant resources, effort, and time; it is therefore better to combine QTL identification, QTL effect assessment, and utilization in the ongoing breeding program. Heterogeneous inbred family analysis is a quicker method to develop NILs from RIL mapping populations, in which the RILs are screened with the QTL-linked markers to identify those that are segregating or heterogeneous for these markers. Validation of the QTL in these NILs gives estimates of QTL effects and determines the position of the QTL of interest in different genetic backgrounds. However, if the RIL population is of the F6 or a later generation and the population size is small, the probability of obtaining many different NILs for a given QTL is limited, and the resulting estimates of QTL effect are applicable only to the mapping population concerned. Even so, analysis of many different NIL pairs in different genetic backgrounds is preferable for a more accurate estimate of QTL effects.

Fig. 5.9 Procedures involved in fine mapping and validation of a QTL and identification of candidate genes
7 Fine Mapping of QTLs

Tightly or closely linked markers provide the basis for fine mapping, which helps in revealing gene function and regulation, as well as in developing lines expressing the traits of interest. QTL fine mapping is the method to reveal very closely located markers (

[Analyses > Statistics > Connectivity (Infomap)] computes descriptive statistics for the markers of each linkage group in all the input studies and searches for the same marker IDs. Ideally, there should be a minimum of two common markers to establish the connection between maps and develop a consensus map. However, in some cases the consensus map is not generated due to the absence of common markers, in which case the user has the flexibility to input information on additional markers in the genetic map file (described above). The most common source of additional markers is the organism's reference genetic map. While Infomap shows the markers in tabular form, MMap View [Analyses > Statistics > Connectivity image creator (MMap View)] helps in visualizing the common markers on the linkage groups of the individual genetic maps.
5.5 Consensus Map

The first part of the meta-analysis procedure consists of building a consensus genetic marker map (Analyses > Map compilation > ConsMap) on the basis of the estimated genetic distances of markers, using a weighted least squares (WLS) strategy. The consensus map shows all the common markers as well as the unique markers from the individual input maps. It is highly advisable to integrate an organism's reference genetic map (if available) to correct offsets in marker distance and to further saturate the marker density of the consensus map.
5.6 Projection of QTLs
Once the consensus map is built, the second part of the meta-analysis procedure consists of projecting the QTL locations (Analyses > Map compilation > QTL projection) onto it. Projection is implemented using an algorithm that finds an optimal context, i.e., a pair of common markers flanking the QTL in the original input map, whose estimated distance is consistent between the maps. The distance between the markers flanking each QTL is compared between the individual and consensus maps, and the minimal distance is used for projecting the QTL onto the consensus map. The software also performs homogeneity testing of the flanking-marker intervals between the original and consensus maps and favors a lower p-value for projection.
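The projection rule described above amounts to rescaling a QTL position between its flanking common markers. The toy R function below illustrates only this interpolation step (all positions in cM, values invented); BioMercator's actual implementation additionally tests the homogeneity of the flanking intervals.

# Linear interpolation of a QTL position from an original map onto a consensus map.
project_qtl <- function(qtl, left_orig, right_orig, left_cons, right_cons) {
  scale <- (right_cons - left_cons) / (right_orig - left_orig)  # interval rescaling
  left_cons + (qtl - left_orig) * scale
}
# A QTL at 42 cM, flanked by markers at 38/50 cM on the original map that sit
# at 40/55 cM on the consensus map, projects to 45 cM:
project_qtl(qtl = 42, left_orig = 38, right_orig = 50, left_cons = 40, right_cons = 55)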
5.7 Meta-QTL Analysis
The BioMercator software offers two different algorithms for the discovery of Meta-QTLs. Both algorithms can group multiple related traits into meta-traits and execute a single meta-QTL analysis; any combination of traits that appears important to the user can be used to define meta-traits [91]. In the following sections we briefly describe the two algorithms.

5.8 Gerber and Goffinet Algorithm
The Gerber and Goffinet algorithm [Analyses > QTL meta-analysis > Meta-analysis (Gerber and Goffinet)] relies on a modified Akaike Information Criterion (AIC, an estimator of prediction error) and runs five different models to identify Meta-QTLs. The models, named the 1-, 2-, 3-, 4- and N-QTL models, represent the number of discovered Meta-QTLs (the N-model signifies that every input QTL is a Meta-QTL). The most likely Meta-QTL configuration for each of the five models is calculated by maximum likelihood, assuming a Gaussian distribution. Consensus QTL positions are determined for each model as the mean of the positions of the overlapping QTLs. Simulations under different scenarios are run for each model to test its quality. The best model among the five is determined using an Akaike-type statistical criterion, revealing the number of most significant Meta-QTLs. The output of the algorithm is a text file listing the five models (1, 2, 3, 4 and N Meta-QTLs) and their corresponding AIC values; the model with the lowest AIC value is considered the best. Although BioMercator displays the best model, the user has the flexibility to generate and view the maps of all the models, showing the selected QTLs and Meta-QTLs. However, the biggest drawback of the Gerber and Goffinet algorithm is that only models with one to four Meta-QTLs per chromosome can be evaluated, which forces the user to repeat the meta-analysis several times to completely cover a linkage group.
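The model-selection idea can be illustrated in R by fitting Gaussian mixtures with one to four components to the QTL peak positions of a linkage group and choosing the component number with the lowest AIC. The sketch below uses the mclust package and simulated positions; it mimics the principle only, not BioMercator's exact modified-AIC implementation.

# Fit 1..4-component Gaussian mixtures to simulated QTL peak positions (cM)
# and pick the number of Meta-QTLs by AIC.
library(mclust)
set.seed(7)
peaks <- c(rnorm(8, mean = 35, sd = 3), rnorm(6, mean = 72, sd = 4))  # 14 input QTLs

aic <- sapply(1:4, function(k) {
  m <- Mclust(peaks, G = k, verbose = FALSE)
  -2 * m$loglik + 2 * m$df        # AIC = deviance + 2 * number of parameters
})
best <- which.min(aic)            # number of Meta-QTLs favoured by AIC (here: 2)
aic; best
Mclust(peaks, G = best, verbose = FALSE)$parameters$mean  # consensus Meta-QTL positions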
5.9 Veyrieras Algorithm
To address the above-mentioned drawback, the Veyrieras algorithm, an upgraded computational and statistical algorithm, was included in BioMercatorV4 [Analyses > QTL meta-analyses > meta-analysis 1/2 (Veyrieras) > meta-analysis 2/2 (Veyrieras)] for carrying out Meta-QTL analysis. The algorithm determines the actual number of Meta-QTLs among the reported QTLs using a Gaussian mixture model-based clustering approach. Meta-analysis of the entire linkage group, with no limitation on the number of identified Meta-QTLs, and probabilistic clustering of QTLs are the two advanced features of the Veyrieras algorithm over the Gerber and Goffinet method. The algorithm is run in two parts, namely meta-analysis 1/2 and meta-analysis 2/2. Output of the 1/2 analysis is produced in the form of three plain text files: *_res.txt, which contains a summary of the clustering for each linkage group; *_crit.txt, which summarizes the values of the models according to the criteria used; and *_model.txt, which gives the optimal model number according to each of the criteria. The *_model.txt file has four columns: criterion, chromosome number, trait, and model number. The different models/criteria predict the number of Meta-QTLs on a chromosome along with information about peak position, CI and weightage. The criteria include the AIC (Akaike Information Criterion), AICc (AIC correction), AIC3 (AIC 3 candidate models), BIC (Bayesian Information Criterion) and AWE (Average Weight of Evidence) [91]. AIC is considered the best criterion, as it can estimate the information lost in different statistical models (Raza et al. 2019). The lowest number of Meta-QTLs predicted by most of the models (at least 3 of 5) is taken as the Meta-QTL number for that chromosome. The 2/2 meta-analysis helps in visualizing the QTL clustering results generated in the 1/2 meta-analysis. Here the user is required to input the kmax value and model number based on the *_model.txt file, which provides the best model with the number of Meta-QTLs. The value of kmax should not exceed the total number of QTLs on the linkage group (see Table 7.2).

Table 7.2 Troubleshooting of some common errors encountered during Meta-QTL analysis using BioMercator

Issue | Possible solution(s)
Inability to load map and QTL files | Input files should be in txt format; both txt files should be tabulation-delimited; the map name in the map file and the QTL file should be the same
Failed map connectivity; consensus map not generated | Incorporate common markers in the map file
Meta-analysis 2/2 failed | kmax (the maximum number of Meta-QTLs evaluated by the algorithm) should not exceed the number of QTLs on the linkage group; if chromosome Z has N QTLs, kmax cannot be greater than N-1

5.10 Output and Its Interpretation
The putative trait-linked region, i.e., a QTL, can contain thousands of genes spanning a chromosome, out of which only a few are likely to contribute to the trait of interest. Meta-QTL analysis reduces the confidence interval of overlapping QTL clusters, thereby helping to shortlist candidate genes by reducing their number from thousands to a few hundred. The output of a meta-QTL analysis provides the model that best explains the distribution of QTL clusters lying within a meta-QTL region. In accordance with both the definition and published meta-QTL analyses, we propose that only regions containing at least two overlapping QTL clusters should be regarded as Meta-QTLs; Meta-QTLs containing a single QTL are often discarded from further analysis. Additionally, as per previously published reports, if there are several Meta-QTLs on a chromosome, the Meta-QTL region derived from the highest number of overlapping QTL clusters should be considered the most important and should be explored for mining of candidate genes.
6 Meta-QTL Analysis: Way Forward

The ultimate goal of any mapping study is to identify the gene(s) contributing to a trait. Meta-QTL analysis helps in reducing the genetic interval of a quantitative trait locus by finding the shortest genomic region common to the overlapping QTLs. The genetic position (in cM) of a Meta-QTL is estimated with the help of the left and right flanking markers, and the physical positions of the markers on the linkage groups are determined with the help of marker databases such as the Gramene marker database. A well-annotated genome sequence is then used to ascertain the number and ontologies of the genes lying within the flanking physical coordinates. The genome of an organism, with its structural and functional annotation, can be loaded into BioMercator and used to mine the genes underlying the meta regions. The gene information within each Meta-QTL can be exported from BioMercator as a GFF3 (General Feature Format) file; this file contains information on both QTLs and Meta-QTLs. Meta-analysis thus shifts the focus of attention from a few thousand genes in a QTL to a few hundred in a Meta-QTL, thereby reducing the effort required to investigate the genes underlying the trait. However, the number of genes within a Meta-QTL is still large, and additional approaches can be integrated to further shortlist the likely candidate genes. The ensuing section of this chapter elaborates on ways in which Meta-QTL data can be further refined to narrow down on the functionality and validation of candidate genes (Fig. 7.3). Some commonly used terminology in Meta-QTL analysis is also listed (Table 7.3).
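As an illustration of the gene-mining step, the base-R sketch below pulls the genes overlapping a Meta-QTL's physical interval out of a GFF3 annotation; the file name and coordinates are placeholders, and dedicated tools (BioMercator itself, or Bioconductor's GenomicRanges) are preferable for production use.

# Extract genes overlapping a Meta-QTL interval from a GFF3 file (placeholders:
# "annotation.gff3" and the chr2 coordinates).
gff <- read.table("annotation.gff3", sep = "\t", quote = "", comment.char = "#",
                  col.names = c("seqid", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"))
mqtl_chr <- "chr2"; mqtl_start <- 12.3e6; mqtl_end <- 14.1e6  # flanking-marker coordinates

genes <- subset(gff, type == "gene" & seqid == mqtl_chr &
                     start <= mqtl_end & end >= mqtl_start)   # any overlap with the interval
nrow(genes)             # number of candidate genes under the Meta-QTL
head(genes$attributes)  # gene IDs/annotations for downstream shortlisting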
6.1 Validation of Meta-QTLs and Prediction of Candidate Genes
Meta-QTLs having a small CI (