English Pages 375 [366] Year 2022
Shabir Hussain Wani Anuj Kumar Editors
Genomics of Cereal Crops
SPRINGER PROTOCOLS HANDBOOKS
For further volumes: http://www.springer.com/series/8623
Springer Protocols Handbooks collects a diverse range of step-by-step laboratory methods and protocols from across the life and biomedical sciences. Each protocol is provided in the Springer Protocol format: readily-reproducible in a step-by-step fashion. Each protocol opens with an introductory overview, a list of the materials and reagents needed to complete the experiment, and is followed by a detailed procedure supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. With a focus on large comprehensive protocol collections and an international authorship, Springer Protocols Handbooks are a valuable addition to the laboratory.
Genomics of Cereal Crops Edited by
Shabir Hussain Wani Sher-e-Kashmir University of Agricultura, KHUDWANI ANANTNAG, Jammu and Kashmir, India
Anuj Kumar Dalhousie University, Halifax, Canada
ISSN 1949-2448 ISSN 1949-2456 (electronic) Springer Protocols Handbooks ISBN 978-1-0716-2532-3 ISBN 978-1-0716-2533-0 (eBook) https://doi.org/10.1007/978-1-0716-2533-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Dedication SHW dedicates this book to his family. AK dedicates this book to his late grandmother Smt. Jagwati Devi.
Preface
Cereal crops are staple foods and major contributors to global food security. They are a major source of human nutrition, and their products are key sources of energy, carbohydrate, protein, and fiber. Cereals also provide various micronutrients, including vitamin E, B vitamins, magnesium, and zinc. The global population is predicted to exceed 9.6 billion by 2050, so food production must increase by 70% to meet the growing demand. Crop production is adversely affected by abiotic and biotic stresses such as drought, cold, salt, heat, pests, and pathogens. To cope with these factors, molecular strategies for crop stress tolerance need to be improved. Genomic elements such as QTLs, genes, noncoding RNAs, and transcription factors (TFs) are key stress-responsive regulators and prime candidates for crop improvement. With the advent of genomic methods, whole genome sequencing of crops has become possible, which has accelerated the identification and functional annotation of genomic candidates for crop improvement. This book compiles a wide range of methods and protocols that will allow plant biologists to understand the complex biology of growth and development in order to improve cereal crops. We expect the methods and practical protocols presented here to serve as a reliable guide for researchers, including plant molecular biologists, plant breeders, biotechnologists, genomicists, computational biologists, and bioinformaticians. The book should be useful for budding scientists as well as senior researchers, faculty members, teachers, and scholars involved in molecular and genomic aspects of cereal biology. The primary goal of this book is to give comprehensive practical guidance on genomic methods and resources serving different scientific purposes.
The protocol chapters are authored by leading scientists and academicians working in the field of cereal genomics. Chapters 1 and 2 provide updates on crop genomes and genomic resources, respectively. Chapter 3 focuses on next-generation sequencing (NGS) technologies and their applications in crop improvement. Chapter 4 introduces a step-by-step protocol for checking CRISPR editing events in transgenic wheat with an NGS approach. Chapter 5 describes a protocol for functional annotation of candidate genes of wheat using virus-induced gene silencing (VIGS). Chapter 6 gives details of common genomic tools frequently used for genetic improvement of cereals. Chapters 7 and 8 present step-by-step protocols for transcriptome analysis using reference-based and de novo assembly approaches, respectively. Chapters 9, 10, and 11 provide practical protocols for computational prediction of ncRNAs (miRNAs and ceRNAs) in cereal crops. Chapters 12–18 give methods for applications of functional genomics, including genotyping-by-sequencing (GBS) (Chapter 12), genomic selection using Bayesian methods (Chapter 13), single-cell sequencing (Chapter 14), genome-wide association study (GWAS) (Chapter 15), QTL interval mapping (Chapter 16), whole-genome bisulfite sequencing (Chapter 17), and genomic imprinting (Chapter 18). A method for the study of receptor–metabolite interaction is also included (Chapter 19).
In closing, we would like to express our sincere gratitude to the esteemed authors of chapters in this book. This book would not have been possible without their valuable efforts. We are also thankful to the members of the Methods in Molecular Biology Springer-Nature editorial team for guiding us through the assembly of a useful book on cereal genomics.
Shabir Hussain Wani, Srinagar, India
Anuj Kumar, Halifax, Canada
Contents
Dedication
Preface
Contributors
1 An Update on Progress and Challenges of Crop Genomes
P. Hima Kumar, N. Urmilla, M. Lakshmi Narasu, and S. Anil Kumar
2 Updates on Genomic Resources for Crop Improvement
Aditya Narayan, Pragya Chitkara, and Shailesh Kumar
3 Next-Generation Sequencing Technologies: Approaches and Applications for Crop Improvement
Anupam Singh, Goriparthi Ramakrishna, Tanvi Kaila, Swati Saxena, Sandhya Sharma, Ambika B. Gaikwad, M. Z. Abdin, and Kishor Gaikwad
4 Check CRISPR Editing Events in Transgenic Wheat with Next-Generation Sequencing
Junli Zhang
5 Virus Induced Gene Silencing: A Tool to Study Gene Function in Wheat
Gaganpreet Kaur Dhariwal, Raman Dhariwal, Michele Frick, and André Laroche
6 Common Genomic Tools and Their Implementations in Genetic Improvement of Cereals
Megha Katoch, Ajay Kumar, Simranjeet Kaur, Anuj Rana, and Avneesh Kumar
7 Protocol for Identification and Annotation of Differentially Expressed Genes Using Reference-Based Transcriptomic Approach
Jyotika Bhati, Himanshu Avashthi, Anuj Kumar, Sayanti Guha Majumdar, Neeraj Budhlakoti, and Dwijesh Chandra Mishra
8 Transcriptome Data Analysis Using a De Novo Assembly Approach
Himanshu Avashthi, Jyotika Bhati, Shikha Mittal, Ambuj Srivastava, Neeraj Budhlakoti, Anuj Kumar, Pramod Wasudeo Ramteke, Dwijesh Chandra Mishra, and Anil Kumar
9 Protocol for In Silico Identification and Functional Annotation of Abiotic Stress–Responsive MicroRNAs in Crop Plants
Anuj Kumar, Mansi Sharma, Tinku Gautam, Prabina Kumar Meher, Jyotika Bhati, Himanshu Avashthi, Neeraj Budhlakoti, Dwijesh Chandra Mishra, Ulavappa Basavanneppa Angadi, and Krishna Pal Singh
10 Functional Annotation of miRNAs in Rice Using ARMOUR
Neeti Sanan-Mishra and Kavita Goswami
11 Identification of ceRNAs in Cereal Crops: A Computational Approach
Tinku Gautam, Hemant Sharma, Rakhi Singh, and Anuj Kumar
12 Genotyping-by-Sequencing (GBS) Method for Accelerating Marker-Assisted Selection (MAS) Program
Laavanya Rayaprolu, Santosh P. Deshpande, and Rajeev Gupta
13 Genomic Selection Using Bayesian Methods: Models, Software, and Application
Prabina Kumar Meher, Anuj Kumar, and Sukanta Kumar Pradhan
14 Approaches of Single-Cell Analysis in Crop Improvement
Upasna Srivastava and Satendra Singh
15 Genome-Wide Association Study (GWAS) for Trait Analysis in Crops
Meenu Kumari, Lakesh Muduli, Prabina Kumar Meher, and Sukanta Kumar Pradhan
16 QTL Interval Mapping for Agronomic and Quality Traits in Crops
Vandana Jaiswal, Vijay Gahlaut, and Sanjay Kumar
17 Whole-Genome Bisulfite Sequencing for Detection of DNA Methylation in Crops
Vijay Gahlaut, Vandana Jaiswal, and Sanjay Kumar
18 Tools and Techniques for Genomic Imprinting
Neeraj Budhlakoti, Sayanti Guha Majumdar, Amar Kant Kushwaha, Chirag Maheshwari, Muzaffar Hasan, D. C. Mishra, Anuj Kumar, Jyotika Bhati, and Anil Rai
19 Computational Methods for Receptor–Metabolite Interaction Studies in Crops
Anu Dalal, Ankit Singh, Gourav Choudhir, Sushil Kumar, and Anuj Kumar
Index
Contributors
M. Z. ABDIN • Centre for Transgenic Plant Development, Department of Biotechnology, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
ULAVAPPA BASAVANNEPPA ANGADI • Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
HIMANSHU AVASHTHI • Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
JYOTIKA BHATI • Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
NEERAJ BUDHLAKOTI • Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
PRAGYA CHITKARA • Bioinformatics Lab, National Institute of Plant Genome Research (NIPGR), New Delhi, India
GOURAV CHOUDHIR • Centre for Rural Development and Technology, Indian Institute of Technology, Delhi, New Delhi, India
ANU DALAL • Department of Chemistry, Indian Institute of Technology, Delhi, New Delhi, India
SANTOSH P. DESHPANDE • International Crops Research Institute for Semi-Arid and Tropics, Hyderabad, India
GAGANPREET KAUR DHARIWAL • Agriculture and Agri-Food Canada, Lethbridge Research and Development Centre, Lethbridge, AB, Canada
RAMAN DHARIWAL • Agriculture and Agri-Food Canada, Lethbridge Research and Development Centre, Lethbridge, AB, Canada
MICHELE FRICK • Agriculture and Agri-Food Canada, Lethbridge Research and Development Centre, Lethbridge, AB, Canada
VIJAY GAHLAUT • CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
AMBIKA B. GAIKWAD • ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
KISHOR GAIKWAD • ICAR-National Institute for Plant Biotechnology, New Delhi, India
TINKU GAUTAM • Department of Genetics and Plant Breeding, Ch. Charan Singh University, Meerut, Uttar Pradesh, India
KAVITA GOSWAMI • Plant RNAi Biology Group, International Centre for Genetic Engineering and Biotechnology, New Delhi, India
RAJEEV GUPTA • Cereal Crops Research Unit, US Department of Agriculture (USDA) Agricultural Research Service (ARS), Fargo, ND, USA
MUZAFFAR HASAN • ICAR-Central Institute of Agricultural Engineering, Bhopal, India
VANDANA JAISWAL • CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
TANVI KAILA • ICAR-National Institute for Plant Biotechnology, New Delhi, India
MEGHA KATOCH • Department of Agricultural Biotechnology, CSK Himachal Pradesh Agricultural University, Palampur, Himachal Pradesh, India
SIMRANJEET KAUR • Department of Botany, Akal University, Bathinda, Punjab, India
AJAY KUMAR • Department of Botany and Microbiology, Gurukul Kangri Vishwavidyalaya, Haridwar, Uttarakhand, India
ANIL KUMAR • Department of Molecular Biology and Genetic Engineering, G. B. Pant University of Agriculture and Technology, Pantnagar, Uttarakhand, India; Rani Lakshmi Bai Central Agricultural University, Jhansi, Uttar Pradesh, India
ANUJ KUMAR • Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
AVNEESH KUMAR • Department of Botany, Akal University, Bathinda, Punjab, India
P. HIMA KUMAR • Centre for Biotechnology, Institute of Science & Technology, JNT University, Hyderabad, Telangana, India
S. ANIL KUMAR • Centre for Biotechnology, Institute of Science & Technology, JNT University, Hyderabad, Telangana, India; Department of Biotechnology, Vignan’s Foundation for Science, Technology & Research, Guntur, Andhra Pradesh, India
SANJAY KUMAR • CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
SHAILESH KUMAR • Bioinformatics Lab, National Institute of Plant Genome Research (NIPGR), New Delhi, India
SUSHIL KUMAR • Shaheed Mangal Pandey Government Girls PG College Madhavpuram, Meerut, India
MEENU KUMARI • ICAR-Research Complex for Eastern Region, Ranchi, India
AMAR KANT KUSHWAHA • ICAR-Central Institute for Subtropical Horticulture, Lucknow, India
ANDRÉ LAROCHE • Agriculture and Agri-Food Canada, Lethbridge Research and Development Centre, Lethbridge, AB, Canada
CHIRAG MAHESHWARI • ICAR-Indian Agricultural Research Institute, New Delhi, India
SAYANTI GUHA MAJUMDAR • Division for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
PRABINA KUMAR MEHER • ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
DWIJESH CHANDRA MISHRA • Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, Delhi, India
SHIKHA MITTAL • Division of Genomic Resources, ICAR-National Bureau of Plant Genetic Resources, Pusa, New Delhi, Delhi, India
LAKESH MUDULI • Department of Plant Breeding and Genetics, College of Agriculture, OUAT, Bhubaneswar, Odisha, India
M. LAKSHMI NARASU • Centre for Biotechnology, Institute of Science & Technology, JNT University, Hyderabad, Telangana, India
ADITYA NARAYAN • University of Virginia, Charlottesville, VA, USA
SUKANTA KUMAR PRADHAN • Orissa University of Agriculture and Technology, Bhubaneswar, Orissa, India; Department of Bioinformatics, CPGS, OUAT, Bhubaneswar, Odisha, India
ANIL RAI • ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
GORIPARTHI RAMAKRISHNA • ICAR-National Institute for Plant Biotechnology, New Delhi, India
PRAMOD WASUDEO RAMTEKE • Department of Biotechnology, Dr. Ambedkar College, Nagpur, Maharashtra, India; Faculty of Life Sciences, Mandsaur University, Mandsaur, Madhya Pradesh, India
ANUJ RANA • Department of Microbiology, CCS Haryana Agriculture University, Hisar, Haryana, India
LAAVANYA RAYAPROLU • International Crops Research Institute for Semi-Arid and Tropics, Hyderabad, India
NEETI SANAN-MISHRA • Plant RNAi Biology Group, International Centre for Genetic Engineering and Biotechnology, New Delhi, India
SWATI SAXENA • ICAR-National Institute for Plant Biotechnology, New Delhi, India
HEMANT SHARMA • Department of Genetics and Plant Breeding, Ch. Charan Singh University, Meerut, Uttar Pradesh, India
MANSI SHARMA • Dalhousie University, Halifax, NS, Canada
SANDHYA SHARMA • ICAR-National Institute for Plant Biotechnology, New Delhi, India
ANKIT SINGH • Centre for Rural Development and Technology, Indian Institute of Technology, Delhi, New Delhi, India
ANUPAM SINGH • ICAR-National Institute for Plant Biotechnology, New Delhi, India; Centre for Transgenic Plant Development, Department of Biotechnology, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
KRISHNA PAL SINGH • Biophysics Unit, College of Basic Sciences and Humanities, GB Pant University of Agriculture and Technology, Pantnagar, Uttarakhand, India; Mahatma Jyotiba Phule Rohilkhand University, Bareilly, Uttar Pradesh, India
RAKHI SINGH • Department of Genetics and Plant Breeding, Ch. Charan Singh University, Meerut, Uttar Pradesh, India
SATENDRA SINGH • Department of Computational Biology and Bioinformatics JSSB, Sam Higginbottom Institute of Agriculture, Technology and Sciences (Formerly Allahabad Agriculture Institute), Allahabad, Uttar Pradesh, India
AMBUJ SRIVASTAVA • Institute of Engineering in Medicine, University of California, San Diego, CA, USA
UPASNA SRIVASTAVA • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa-no-ha campus, Chiba, Japan
N. URMILLA • Department of Biotechnology, Vignan’s Foundation for Science, Technology & Research, Guntur, Andhra Pradesh, India
JUNLI ZHANG • Department of Plant Sciences, University of California, Davis, CA, USA
Chapter 1
An Update on Progress and Challenges of Crop Genomes
P. Hima Kumar, N. Urmilla, M. Lakshmi Narasu, and S. Anil Kumar
Abstract
Demand for food crops is rising steadily as the population grows and natural resources decline. Sustainable measures need to be developed to ensure food security and thereby keep malnourishment in check. More than 100 crop genomes have been sequenced since Arabidopsis, the first plant genome, became available. New sequencing technologies have eased the complex problems associated with sequencing crop genomes. The falling cost of sequencing has enabled the sequencing of orphan crops, which are important for local economies. Genome sequencing has also provided a clear understanding of the gene architecture underlying complex traits in crops. The development of high-quality genetic maps aids the translation of genomics from lab to field.
Key words Crop genomics, Food security, Genome sequencing, Next generation technologies
1 Introduction
Crop genomics plays an important role in the development of climate-smart crops for global food security. The world population is expected to reach 11 billion by 2100, and around 830 million people are expected to experience hunger by 2030 [1, 2]. Nearly 38% of Earth’s land surface is used for agriculture [3]. The father of the Green Revolution, Norman Borlaug, summarized the importance of food crops: “Without food, man can live at most but a few weeks; without it, all other components of social justice are meaningless.” Agricultural production depends on many factors. Abiotic and biotic stresses result in severe yield losses. A burgeoning population, decreasing natural resources, irregular climatic conditions, and global warming compound this problem further. Climate change suppresses plant immunity and alters the dynamics of crop pests, resulting in famine conditions [4]. Enhancing food production by both conventional and nonconventional approaches is the need of the hour to bridge the gap between a growing population and food security and to prevent famines in the foreseeable future [2]. Crop breeding and transgenic approaches have aided the development of high-yielding crops with increased productivity and improved nutrition [5]. To meet the United Nations Zero Hunger target by 2030, the genetic improvement of crops needs to be scaled up. Genome characterization of crops, together with the analysis of large genomic datasets using computational tools, is important to meet these challenges. Genome sequencing has become a routine technology applied to organisms from bacteria to humans [6, 7]. The Arabidopsis genome, the first plant genome, was released in 2000, followed in 2002 by rice, the first crop genome; together they laid the foundation of crop genomics [8–10]. Only a few reviews of crop genomics are available: Michael and VanBuren [11], Bevan [12], Kersey [13], Varshney [14], and Purugganan and Jackson [15] reviewed the importance of genome sequencing and the progress and challenges in sequencing crop genomes. The present review focuses on updates and identified gaps in genome sequencing for the development of next generation crop domestication.
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
2 Materials
2.1 Progress and Challenges of Crop Genomes
The genome sequences aided the development of novel crops that resist abiotic and biotic stresses and give enhanced yield [12, 15]. New technologies such as sequencing of complex genomes, epigenomic analysis, high-throughput phenotyping, satellite imaging, CRISPR–Cas9 editing, long-read single-molecule sequencing, machine learning, and artificial intelligence have advanced crop genomics [16–22]. The introduction of next generation sequencing (NGS) technology in 2005 revolutionized genome sequencing, and more than 1000 genomes (Table 1) are now available (https://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes). Around 391,000 species of land plants and 8000 green algae with diverse genomes have been reported [108–110]. Plant genome size also varies widely, from about 60 Mb in Genlisea to about 150 Gb in Paris japonica [111, 112]. The 3000 Rice Genome Project and the 3000 Chickpea Genome Sequencing project have provided valuable information for genetic gains, and the extended gene pool aids breeders in crop improvement [113, 114]. A decade ago, genome assembly tools developed for nonplant species were not suited to the common bottlenecks of plant genomes, such as genome size, heterozygosity, repeat content, paralogy, and the assembly of 100–200 bp reads, and they produced draft assemblies with many contigs [115, 116]. The advent of NGS technologies improved genome sequencing and resolved many of these problems, resulting in rapid sequencing [21]. NGS eased whole genome sequencing and assembly through innovations in short-read sequencing, economical long-read sequencing, nanopore methods, high-throughput chromosome conformation capture, efficient restriction maps, and optical mapping [117–124]. NGS led to the rapid development of several new reference genome sequences, either in draft form or as high-quality near-complete assemblies [125].

Table 1 Non-exhaustive list of sequenced plant genomes
S. No. | Name of the plant | Genome size | No. of predicted genes | References
1 | Oryza sativa (short grain rice) ssp. japonica | 430 Mb | | [8]
2 | Oryza sativa (long grain rice) ssp. indica | | | [9]
3 | Vitis vinifera | | | [23]
4 | Carica papaya | 372 Mbp | 28,629 | [24]
5 | Physcomitrella patens ssp. patens str. Gransden | 462.3 Mb | | [25]
6 | Cucumis sativus | 350 Mbp | 26,682 | [26]
7 | Sorghum bicolor | 730 Mb | 34,496 | [27]
8 | Zea mays | 2.3 Gb | 39,656 | [28]
9 | Ricinus communis (castor bean) | 320 Mbp | 31,237 | [29]
10 | Oryza glaberrima | | | [30]
11 | Glycine max | 1115 Mbp | 46,430 | [31]
12 | Malus domestica | ~742.3 Mbp | 57,386 | [32]
13 | Phoenix dactylifera | 658 Mbp | 28,800 | [33]
14 | Theobroma cacao | | | [34]
15 | Selaginella moellendorffii | | | [35]
16 | Fragaria vesca | 240 Mbp | 34,809 | [36]
17 | Brassica rapa | 485 Mbp | 41,174 | [37]
18 | Setaria italica | | | [38]
19 | Musa acuminata | 523 Mbp | 36,542 | [39]
20 | Cucumis melo | 450 Mbp | 27,427 | [40]
21 | Oryza rufipogon | 406 Mb | 37,071 | [41]
22 | Manihot esculenta | ~760 Mb | 30,666 | [42]
23 | Cajanus cajan (pigeon pea) | | | [43]
24 | Solanum lycopersicum | 900 Mbp | 34,727 | [44]
25 | Linum usitatissimum | ~350 Mbp | 43,384 | [45]
26 | Prunus mume | | | [46]
27 | Amborella trichopoda | | | [47]
28 | Citrullus lanatus | ca 425 Mbp | 23,440 | [48]
29 | Cicer arietinum L. | | | [49]
30 | Nelumbo nucifera (sacred lotus) | 929 Mb | | [50]
31 | Elaeis guineensis | ~1800 Mbp | 34,800 | [51]
32 | Prunus persica | 265 Mbp | 27,852 | [52]
33 | Pyrus bretschneideri | | | [53]
34 | Citrus sinensis | 471.88 Mb | | [54]
35 | Brassica napus | 1130 Mbp | 101,040 | [55]
36 | Beta vulgaris (sugar beet) | 714–758 Mb | 27,421 | [56]
37 | Capsicum annuum (pepper) cv. CM334 | ~3.48 Gbp | 34,903 | [57]
38 | Phaseolus vulgaris (common bean) | 520 Mbp | 31,638 | [58]
39 | Capsicum annuum (pepper) cv. Zunla-1 | ~3.48 Gbp | 35,336 | [59]
40 | Capsicum annuum var. glabriusculum | ~3.48 Gbp | 34,476 | [59]
41 | Ananas comosus | 382 Mb | 27,024 | [60]
42 | Arachis duranensis | | | [61]
43 | Arachis ipaensis | | | [61]
44 | Amaranthus hypochondriacus | 403.9 Mb | 23,847 | [62]
45 | Ginkgo biloba | 11.75 Gb | 41,840 | [63]
46 | Picea abies (Norway spruce) | 19.6 Gb | 26,359 | [64]
47 | Picea glauca (white spruce) | 20.8 Gb | 14,462 | [64]
48 | Pinus taeda (loblolly pine) | 20.15 Gb | 9024 | [64]
49 | Pinus lambertiana (sugar pine) | 31 Gb | 13,936 | [64]
50 | Helianthus annuus | 3.6 Gb | 52,232 | [65]
51 | Marchantia polymorpha | 225.8 Mb | | [66]
52 | Rhodiola crenulata | 344.5 Mb | 35,517 | [67]
53 | Chenopodium quinoa | 1.39–1.50 Gb | 44,776 | [68]
54 | Dimocarpus longan | | | [69]
55 | Pseudotsuga menziesii | 16 Gb | 54,830 | [70]
56 | Lactuca sativa | 2.5 Gb | 38,919 | [71]
57 | Durio zibethinus | ~738 Mbp | | [72]
58 | Pennisetum glaucum | ~1.79 Gb | 38,579 | [73]
59 | Mentha x piperita | 353 Mb | 35,597 | [74]
60 | Cocos nucifera | ~2.42 Gb | | [75]
61 | Triticum aestivum (bread wheat) | 14.5 Gb | 107,891 | [76]
62 | Primula vulgaris | 474 Mb | | [77]
63 | Azolla filiculoides | 0.75 Gb | | [78]
64 | Salvinia cucullata | 0.26 Gb | | [78]
65 | Rubus occidentalis | 290 Mbp | | [79]
66 | Gnetum montanum | 4.07 Gb | 27,491 | [80]
67 | Siraitia grosvenorii | 456.5 Mbp | 30,565 | [81]
68 | Cucurbita argyrosperma | 228.8 Mbp | 27,998 | [82]
69 | Sclerocarya birrea (Marula) | | 18,397 | [83]
70 | Lablab purpureus | | 20,946 | [83]
71 | Vigna subterranea | | 31,707 | [83]
72 | Carya illinoinensis | 651.31 Mb | | [84]
73 | C. cathayensis | 706.43 Mb | | [84]
74 | Larix sibirica | 12.34 Gb | | [85]
75 | Xanthoceras sorbifolium | 504.2 Mb | 24,672 | [86]
76 | Abies alba | 18.16 Gb | 94,205 | [87]
77 | Pleurozium schreberi (feather moss) | 318 Mb | | [88]
78 | Solanum aethiopicum | 1.02 Gbp | 34,906 | [89]
79 | Trochodendron aralioides (wheel tree) | 1.614 Gb | 35,328 | [90]
80 | Castanea mollissima | 785.53 Mb | 36,479 | [91]
81 | Eruca sativa | 851 Mbp | 45,438 | [92]
82 | Coix lacryma-jobi L. | 1.619 Gb | 39,629 | [93]
83 | Eriobotrya japonica | 760.1 Mb | 45,743 | [94]
84 | Prunus salicina | 284.2 Mbp | 24,448 | [95]
85 | Amphicarpaea edgeworthii | 299 Mb | 27,899 | [96]
86 | Macadamia jansenii | 750 Mb | | [97]
87 | Juglans sigillata | 536.50 Mb | | [98]
88 | Macadamia integrifolia | 745 Mb | 34,274 | [99]
89 | Simmondsia chinensis (jojoba) | 887 Mb | 23,490 | [100]
90 | Diospyros oleifera | 849.53 Mb | 28,580 | [101]
91 | Fontinalis antipyretica (greater water-moss) | 385.2 Mb | | [102]
92 | Juglans regia | 540 Mb | | [103]
93 | Ricinus communis L. (wild castor) | 318.13 Mb | 30,066 | [104]
94 | Annona muricata | 799.11 Mb | 23,375 | [105]
95 | Digitaria exilis | 761 Mb | | [106]
96 | Corylus heterophylla Fisch (Asian hazel) | 370.75 Mb | 27,591 | [107]
Source: Wikipedia (https://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes)
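Genome sizes in Table 1 are reported in mixed units (Mb, Mbp, Gb, Gbp), which makes direct comparison awkward. The following is a minimal sketch, not part of the original chapter, of how such strings could be normalized to base pairs. The helper name `genome_size_bp` is hypothetical, and the example values are the Genlisea and Paris japonica sizes quoted in the text.

```python
import re

# Multipliers for the size units that appear in Table 1.
UNIT_TO_BP = {"mb": 1e6, "mbp": 1e6, "gb": 1e9, "gbp": 1e9}

def genome_size_bp(text):
    """Convert a size string such as '~742.3 Mbp' or '14.5 Gb' to base pairs.

    Hypothetical helper for illustration. For a range such as '714-758 Mb',
    the number adjacent to the unit is the one that is used.
    """
    match = re.search(r"([\d.]+)\s*(mbp|mb|gbp|gb)\b", text.lower())
    if match is None:
        raise ValueError(f"unrecognized genome size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_BP[unit]

smallest = genome_size_bp("60 Mb")   # Genlisea, as quoted in the text
largest = genome_size_bp("150 Gb")   # Paris japonica, as quoted in the text
print(round(largest / smallest))     # 2500
```

Under these assumptions the largest reported plant genome is about 2500 times the size of the smallest, which illustrates why assembly tools tuned for small genomes struggled with plants.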
3 Methods
Analysis of large genomes such as maize and barley, and even of polyploid genomes such as cotton (2.5 Gb), wheat (~17 Gb), and sugarcane (3.13 Gb), has become possible through reduction of genome complexity, sequencing of individual chromosomes, and sequencing of diploid progenitors [76, 121, 126–128]. This enables easy comparison of genetic maps. The reduced cost of sequencing has led to the sequencing of orphan crops, extending the gene pool for crop improvement [129, 130]. Exploring the genome, including the core and accessory genomes, can reveal genes that were favored during evolution. Analysis of high-quality genomes leads to an understanding of pan-genomes. Even after pan-genome analysis, it remains unclear how plants maintain adaptive variation despite variable gene content. The growing number of quantitative trait locus (QTL) and genome-wide association studies (GWAS) has helped in understanding the genetic architecture of plant traits such as grain size, growth, and disease resistance [131]. Genome sequencing has helped in dissecting the mechanisms behind gene function and regulation for complex traits such as crop behavior, crop diversification, biotic and abiotic stress tolerance, and epigenetic interactions. Genomics-assisted breeding (GAB) aids cultivar development by exploiting allelic variation, allowing crops with higher nutritional value to be developed in an economical and timely manner. Genome editing tools such as CRISPR–Cas9 aid in understanding the relationship between a gene and its outcome [132]. Genome sequencing assists the development of climate-smart crop domestication for global food security in unpredictable climates. Comparative genomics aids in understanding the role of complex traits when selecting parents for breeding. Crop genome sequencing has connected genotype to phenotype for enhanced yield. Crop genomics needs to be advanced further with new tools and technologies for analyzing high-throughput datasets, which require high-end computational resources. Integrating big data with breeding is the need of the hour.
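The distinction between core and accessory genomes mentioned above can be sketched with simple set operations. This is an illustrative sketch only, under the assumption that each accession is already represented as a set of gene identifiers. The function name `split_pan_genome`, the accession names, and the gene identifiers are all hypothetical, and real pan-genome pipelines involve gene clustering and presence/absence calling that are not shown here.

```python
def split_pan_genome(gene_sets):
    """Return (core, accessory) gene sets from per-accession gene sets.

    core      = genes present in every accession
    accessory = genes present in at least one, but not all, accessions
    """
    accessions = list(gene_sets.values())
    pan = set().union(*accessions)           # every gene seen anywhere
    core = set.intersection(*accessions)     # genes shared by all accessions
    return core, pan - core

# Hypothetical gene identifiers for three accessions of one crop.
genes = {
    "accession_A": {"g1", "g2", "g3"},
    "accession_B": {"g1", "g2", "g4"},
    "accession_C": {"g1", "g2", "g3", "g5"},
}
core, accessory = split_pan_genome(genes)
print(sorted(core))       # ['g1', 'g2']
print(sorted(accessory))  # ['g3', 'g4', 'g5']
```

In this toy case the pan-genome has five genes, two of which are core. The accessory fraction is where variable gene content, and possibly adaptive variation, would be sought.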
Acknowledgements PHK acknowledges UGC for DR. D. S. Kothari Postdoctoral Fellowship Scheme [F.4-2/2006 (BSR)/BL/14-15/0392]. S. Anil Kumar acknowledges the SERB-NPDF fellowship (PDF/ 2015/000929). References 1. Raman R (2017) The impact of genetically modified (GM) crops in modern agriculture: a review. GM Crops Food 8:195–208 2. Food and Agriculture Organization (2019) The state of food security and nutrition in the world 2020. FAO 3. Foley JA, Ramankutty N, Brauman KA et al (2011) Solutions for a cultivated planet. Nature 478:337–342 4. Garrett KA, Dendy SP, Frank EE et al (2006) Climate change effects on plant disease: genomes to ecosystems. Annu Rev Phytopathol 844:489–509 5. Barros J, Temple S, Dixon RA (2019) Development and commercialization of reduced lignin alfalfa. Curr Opin Biotechnol 56:48–54 6. Fleischmann RD, Adams MD, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512 7. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351 8. Goff SA, Ricke D, Lan TH et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100 9. Yu J, Hu S, Wang J et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92 10. Kaul S, Koo HL, Jenkins J et al (2000) Analysis of the genome sequence of the flowering
plant Arabidopsis thaliana. Nature 408: 796–815 11. Michael TP, VanBuren R (2015) Progress, challenges and the future of crop genomes. Curr Opin Biotechnol 24:71–81 12. Bevan MW, Uauy C, Wulff BB et al (2017) Genomic innovation for crop improvement. Nature 543:346–354 13. Kersey PJ (2019) Plant genome sequences: past, present, future. Curr Opin Biotechnol 48:1–8 14. Varshney RK, Bohra A, Yu J et al (2021) Designing future crops: genomics-assisted breeding comes of age. Trends Plant Sci 26: 631–649 15. Purugganan MD, Jackson SA (2021) Advancing crop genomics from lab to field. Nat Genet 53:595–601 16. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351 17. EPIC Planning Committee (2012) Reading the second code: mapping epigenomes to understand plant growth, development, and adaptation to the environment. Plant Cell 24: 2257–2261 18. Clevers JG, Kooistra L, Van den Brande MM (2017) Using Sentinel-2 data for retrieving LAI and leaf and canopy chlorophyll content of a potato crop. Remote Sens 9:405
19. Araus JL, Kefauver SC, Zaman-Allah M et al (2018) Translating high-throughput phenotyping into genetic gain. Trends Plant Sci 23: 451–466 20. Harfouche AL, Jacobson DA, Kainer D et al (2019) Accelerating climate resilient plant breeding by applying next-generation artificial intelligence. Trends Biotechnol 37: 1217–1235 21. Chen K, Wang Y, Zhang R et al (2019a) CRISPR/Cas genome editing and precision plant breeding in agriculture. Annu Rev Plant Biol 70:667–697 22. Wang H, Cimen E, Singh N et al (2020) Deep learning for plant genomics and crop improvement. Curr Opin Plant Biol 54: 34–41 23. Jaillon O, Aury JM, Noel B et al (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467 24. Ming R, Hou S, Feng Y et al (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452:991–996 25. Rensing SA, Lang D, Zimmer AD et al (2008) The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319:64–69 26. Huang S, Li R, Zhang Z et al (2009) The genome of the cucumber, Cucumis sativus L. Nat Genet 41:1275–1281 27. Paterson AH, Bowers JE, Bruggmann R et al (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551–556 28. Schnable PS, Ware D, Fulton RS et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115 29. Chan AP, Crabtree J, Zhao Q et al (2010) Draft genome sequence of the oilseed species Ricinus communis. Nat Biotechnol 28: 951–956 30. Hurwitz BL, Kudrna D, Yu Y et al (2010) Rice structural variation: a comparative analysis of structural variation between rice and three of its closest relatives in the genus Oryza. Plant J 63:990–1003 31. Schmutz J, Cannon SB, Schlueter J et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178–183 32. 
Velasco R, Zharkikh A, Affourtit J et al (2010) The genome of the domesticated apple (Malus domestica Borkh.). Nat Genet 42: 833–839 33. Al-Dous EK, George B, Al-Mahmoud ME et al (2011) De novo genome sequencing and
comparative genomics of date palm (Phoenix dactylifera). Nature Biotechnol 29:521–527 34. Argout X, Salse J, Aury JM et al (2011) The genome of Theobroma cacao. Nat Genet 43: 101–108 35. Banks JA, Nishiyama T, Hasebe M et al (2011) The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science 332:960–963 36. Shulaev V, Sargent DJ, Crowhurst RN et al (2011) The genome of woodland strawberry (Fragaria vesca). Nat Genet 43:109–116 37. Wang X, Wang H, Wang J et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43:1035–1039 38. Bennetzen JL, Schmutz J, Wang H et al (2012) Reference genome sequence of the model plant Setaria. Nat Biotechnol 30: 555–561 39. D’hont A, Denoeud F, Aury JM et al (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488:213–217 40. Garcia-Mas J, Benjak A, Sanseverino W et al (2012) The genome of melon (Cucumis melo L.). Proc Natl Acad Sci U S A 109: 11872–11877 41. Huang X, Kurata N, Wang ZX et al (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501 42. Prochnik S, Marri PR, Desany B et al (2012) The cassava genome: current progress, future directions. Trop Plant Biol 5:88–94 43. Singh NK, Gupta DK, Jayaswal PK et al (2012) The first draft of the pigeonpea genome sequence. J Plant Biochem Biotechnol 21:98–112 44. Tomato Genome Consortium (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635–641 45. Wang Z, Hobson N, Galindo L et al (2012) The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant J 72:461–473 46. Zhang Q, Chen W, Sun L et al (2012) The genome of Prunus mume. Nat Commun 3: 1–8 47. Albert VA, Barbazuk WB, dePamphilis CW et al (2013) The Amborella genome and the evolution of flowering plants. Science 342: 1241089 48. 
Guo S, Zhang J, Sun H et al (2013) The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat Genet 45:51–58
49. Jain M, Misra G, Patel RK, Priya P et al (2013) A draft genome sequence of the pulse crop chickpea (Cicer arietinum L.). Plant J 74:715–729 50. Ming R, VanBuren R, Liu Y et al (2013) Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.). Genome Biol 14:1–1 51. Singh R, Ong-Abdullah M, Low ET et al (2013) Oil palm genome sequence reveals divergence of interfertile species in old and new worlds. Nature 500:335–339 52. Verde I, Abbott AG, Scalabrin S et al (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45:487–494 53. Wu J, Wang Z, Shi Z et al (2013) The genome of the pear (Pyrus bretschneideri Rehd.). Genome Res 23:396–408 54. Xu Q, Chen LL, Ruan X et al (2013) The draft genome of sweet orange (Citrus sinensis). Nat Genet 45:59–66 55. Chalhoub B, Denoeud F, Liu S (2014) Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345:950–953 56. Dohm JC, Minoche AE, Holtgräwe D et al (2014) The genome of the recently domesticated crop plant sugar beet (Beta vulgaris). Nature 505:546–549 57. Kim S, Park M, Yeom SI et al (2014) Genome sequence of the hot pepper provides insights into the evolution of pungency in capsicum species. Nat Genet 46:270–278 58. Schmutz J, McClean PE, Mamidi S et al (2014) A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet 46:707–713 59. Qin C, Yu C, Shen Y et al (2014) Whole-genome sequencing of cultivated and wild peppers provides insights into capsicum domestication and specialization. Proc Natl Acad Sci U S A 111:5135–5140 60. Ming R, VanBuren R, Wai CM et al (2015) The pineapple genome and the evolution of CAM photosynthesis. Nat Genet 47:1435–1442 61.
Bertioli DJ, Cannon SB, Froenicke L et al (2016) The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nat Genet 48:438–446 62. Clouse JW, Adhikary D, Page JT et al (2016) The Amaranth genome: genome, transcriptome, and physical map assembly. Plant Genome 9(plantgenome2015):07
63. Guan R, Zhao Y, Zhang HE et al (2016) Draft genome of the living fossil Ginkgo biloba. Gigascience 5:49 64. Stevens KA, Wegrzyn JL, Zimin A et al (2016) Sequence of the sugar pine megagenome. Genetics 204:1613–1626 65. Badouin H, Gouzy J, Grassa CJ et al (2017) The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546:148–152 66. Bowman JL, Kohchi T, Yamato KT et al (2017) Insights into land plant evolution garnered from the Marchantia polymorpha genome. Cell 171:287–304 67. Fu Y, Li L, Hao S et al (2017) Draft genome sequence of the Tibetan medicinal herb Rhodiola crenulata. Gigascience 6:1–5 68. Jarvis DE, Ho YS, Lightfoot DJ et al (2017) The genome of Chenopodium quinoa. Nature 542:307–312 69. Lin Y, Min J, Lai R et al (2017) Genome-wide sequencing of longan (Dimocarpus longan Lour.) provides insights into molecular basis of its polyphenol-rich characteristics. Gigascience 6:1–14 70. Neale DB, McGuire PE, Wheeler NC et al (2017) The Douglas-fir genome sequence reveals specialization of the photosynthetic apparatus in Pinaceae. G3 (Bethesda) 7: 3157–3167 71. Reyes-Chin-Wo S, Wang Z, Yang X et al (2017) Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat Commun 8:1–1 72. Teh BT, Lim K, Yong CH et al (2017) The draft genome of tropical fruit durian (Durio zibethinus). Nat Genet 49:1633–1641 73. Varshney RK, Shi C, Thudi M et al (2017) Pearl millet genome sequence provides a resource to improve agronomic traits in arid environments. Nat Biotechnol 35:969–976 74. Vining KJ, Johnson SR, Ahkami A et al (2017) Draft genome sequence of Mentha longifolia and development of resources for mint cultivar improvement. Mol Plant 10: 323–339 75. Xiao Y, Xu P, Fan H et al (2017) The genome draft of coconut (Cocos nucifera). Gigascience 6:1–11 76. 
Appels R, Eversole K, Stein N et al (2018) International wheat genome sequencing consortium (IWGSC) shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361:eaar7191 77. Cocker JM, Wright J, Li J et al (2018) Primula vulgaris (primrose) genome assembly, annotation and gene expression, with comparative
genomics on the heterostyly supergene. Sci Rep 8:17942 78. Li FW, Brouwer P, Carretero-Paulet L et al (2018) Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nat Plants 4:460–472 79. VanBuren R, Wai CM, Colle M et al (2018) A near complete, chromosome-scale assembly of the black raspberry (Rubus occidentalis) genome. Gigascience 7:giy094 80. Wan T, Liu ZM, Li LF et al (2018) A genome for gnetophytes and early evolution of seed plants. Nat Plants 4:82–89 81. Xia M, Han X, He H et al (2018) Improved de novo genome assembly and analysis of the Chinese cucurbit Siraitia grosvenorii, also known as monk fruit or luo-han-guo. Gigascience 7:giy067 82. Barrera-Redondo J, Ibarra-Laclette E, Vázquez-Lobo A et al (2019) The genome of Cucurbita argyrosperma (silver-seed gourd) reveals faster rates of protein-coding gene and long noncoding RNA turnover and neofunctionalization within Cucurbita. Mol Plant 12:506–520 83. Chang Y, Liu H, Liu M et al (2019) The draft genomes of five agriculturally important African orphan crops. Gigascience 8:giy152 84. Huang Y, Xiao L, Zhang Z et al (2019) The genomes of pecan and Chinese hickory provide insights into Carya evolution and nut nutrition. Gigascience 8:giz036 85. Kuzmin DA, Feranchuk SI, Sharov VV et al (2019) Stepwise large genome assembly approach: a case of Siberian larch (Larix sibirica Ledeb). BMC Bioinformatics 20:35–46 86. Liang Q, Li H, Li S et al (2019) The genome assembly and annotation of yellowhorn (Xanthoceras sorbifolium Bunge). Gigascience 8:giz071 87. Mosca E, Cruz F, Gómez-Garrido J et al (2019) A reference genome sequence for the European silver fir (Abies alba Mill.): a community-generated genomic resource. G3 (Bethesda) 9:2039–2049 88. Pederson ER, Warshan D, Rasmussen U (2019) Genome sequencing of Pleurozium schreberi: the assembled and annotated draft genome of a Pleurocarpous feather Moss. G3 (Bethesda) 9:2791–2797 89.
Song B, Song Y, Fu Y et al (2019) Draft genome sequence of Solanum aethiopicum provides insights into disease resistance, drought tolerance, and the evolution of the genome. Gigascience 8:giz115 90. Strijk JS, Hinsinger DD, Zhang F et al (2019) Trochodendron aralioides, the first
chromosome-level draft genome in Trochodendrales and a valuable resource for basal eudicot research. Gigascience 8:giz136 91. Xing Y, Liu Y, Zhang Q et al (2019) Hybrid de novo genome assembly of Chinese chestnut (Castanea mollissima). Gigascience 8:giz112 92. Bell L, Chadwick M, Puranik M et al (2020) The Eruca sativa genome and transcriptome: a targeted analysis of sulfur metabolism and glucosinolate biosynthesis pre and postharvest. Front Plant Sci 11:525102 93. Guo C, Wang Y, Yang A et al (2020) The Coix genome provides insights into Panicoideae evolution and papery hull domestication. Mol Plant 13:309–320 94. Jiang S, An H, Xu F et al (2020) Chromosome-level genome assembly and annotation of the loquat (Eriobotrya japonica) genome. Gigascience 9:giaa015 95. Liu C, Feng C, Peng W et al (2020) Chromosome-level draft genome of a diploid plum (Prunus salicina). Gigascience 9:giaa130 96. Liu Y, Zhang X, Han K et al (2020) Insights into amphicarpy from the compact genome of the legume Amphicarpaea edgeworthii. Plant Biotechnol J 19:952–965 97. Murigneux V, Rai SK, Furtado A et al (2020) Comparison of long-read methods for sequencing and assembly of a plant genome. Gigascience 9:giaa146 98. Ning DL, Wu T, Xiao LJ et al (2020) Chromosomal-level assembly of Juglans sigillata genome using Nanopore, BioNano, and HiC analysis. Gigascience 9:giaa006 99. Nock CJ, Baten A, Mauleon R et al (2020) Chromosome-scale assembly and annotation of the macadamia genome (Macadamia integrifolia HAES 741). G3 (Bethesda) 10:3497–3504 100. Sturtevant D, Lu S, Zhou ZW et al (2020) The genome of jojoba (Simmondsia chinensis): a taxonomically isolated species that directs wax ester accumulation in its seeds. Sci Adv 6:eaay3240 101. Suo Y, Sun P, Cheng H et al (2020) A high-quality chromosomal genome assembly of Diospyros oleifera Cheng. Gigascience 9:giz164 102. Yu J, Li L, Wang S et al (2020) Draft genome of the aquatic moss Fontinalis antipyretica (Fontinalaceae, Bryophyta).
Gigabyte 2020:1–9 103. Zhang J, Zhang W, Ji F et al (2020) A high-quality walnut genome assembly reveals extensive gene expression divergences after whole-genome duplication. Plant Biotechnol J 18:1848 104. Lu J, Pan C, Fan W et al (2021) A chromosome-level assembly of a wild Castor genome provides new insights into the adaptive evolution in a Tropical Desert. Genomics Proteomics Bioinformatics. https://doi.org/10.1016/j.gpb.2021.04.003 105. Strijk JS, Hinsinger DD, Roeder MM et al (2021) Chromosome-level reference genome of the soursop (Annona muricata): a new resource for Magnoliid research and tropical pomology. Mol Ecol Resour 21:1608–1619 106. Wang X, Chen S, Ma X et al (2021) Genome sequence and genetic diversity analysis of an under-domesticated orphan crop, white fonio (Digitaria exilis). Gigascience 10:giab013 107. Zhao T, Ma W, Yang Z et al (2021) A chromosome-level reference genome of the hazelnut, Corylus heterophylla Fisch. Gigascience 10:giab027 108. Mora C, Tittensor DP, Adl S et al (2011) How many species are there on earth and in the ocean? PLoS Biol 9:e1001127 109. Guiry MD (2012) How many species of algae are there? J Phycol 48:1057–1063 110. Willis K (2017) State of the world’s plants. Royal Botanics Gardens Kew. U.S.A 111. Pellicer J, Fay MF, Leitch IJ (2010) The largest eukaryotic genome of them all? Bot J Linn Soc 164:10–15 112. Fleischmann A, Michael TP, Rivadavia F et al (2014) Evolution of genome size and chromosome number in the carnivorous plant genus Genlisea (Lentibulariaceae), with a new estimate of the minimum genome size in angiosperms. Ann Bot 114:1651–1663 113. Varshney RK (2016) Exciting journey of 10 years from genomes to fields and markets: some success stories of genomics-assisted breeding in chickpea, pigeonpea and groundnut. Plant Sci 242:98–107 114. Wang W, Mauleon R, Hu Z et al (2018) Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557:43–49 115. 
Metzker ML (2009) Sequencing technologies - the next generation. Nat Rev Genet 11:31–46 116.
Lamesch P, Berardini TZ, Li D et al (2012) The Arabidopsis information resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40:D1202–D1210 117. Branton D, Deamer DW, Marziali A et al (2008) The potential and challenges of nanopore sequencing. Nat Biotechnol 26: 1146–1153
118. Dixon JR, Selvaraj S, Yue F et al (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485:376–380 119. Levy-Sakin M, Ebenstein Y (2013) Beyond sequencing: optical mapping of DNA in the age of nanotechnology and nanoscopy. Curr Opin Biotechnol 24:690–696 120. Roberts R, Carneiro M, Schatz M (2013) The advantages of SMRT sequencing. Genome Biol 14:405 121. Jiao Y, Peluso P, Shi J et al (2017) Improved maize reference genome with single-molecule technologies. Nature 546:524–527 122. Belser C, Istace B, Denis E et al (2018) Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat Plants 4:879–887 123. Zhang X, Zhang S, Zhao Q et al (2019) Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants 5:833–845 124. Choi JY, Lye ZN, Groen SC et al (2020) Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice. Genome Biol 21:1–27 125. Stein JC, Yu Y, Copetti D et al (2018) Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat Genet 50:285–296 126. Mascher M, Gundlach H, Himmelbach A et al (2017) A chromosome conformation capture ordered sequence of the barley genome. Nature 544:427–433 127. Zhang J, Zhang X, Tang H et al (2018) Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet 50:1565–1573 128. Wang M, Tu L, Yuan D et al (2019) Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat Genet 51: 224–229 129. Varshney RK, Ribaut JM, Buckler ES et al (2012) Can genomics boost productivity of orphan crops? Nat Biotechnol 30:1172–1176 130. Brozynska M, Furtado A, Henry R (2016) Genomics of crop wild relatives: expanding the gene pool for crop improvement. Plant Biotechnol J 14:1070–1085 131. 
Liu HJ, Yan J (2019) Crop genome-wide association study: a harvest of biological relevance. Plant J 97:8–18 132. Chen F, Song Y, Li X et al (2019b) Genome sequences of horticultural plants: past, present, and future. Hortic Res 1:6
Chapter 2
Updates on Genomic Resources for Crop Improvement
Aditya Narayan, Pragya Chitkara, and Shailesh Kumar
Abstract
An increasing number of crop genomic resources, together with novel technical achievements in genome analytics, has dramatically changed the landscape of agricultural research. These developments have improved our capacity to meet global challenges around food production and must be understood to better serve the needs of the human population. In this chapter, we provide a comprehensive review of historical changes in the technologies that allow for improved plant genotyping, molecular marker discovery, and decoding of plant genomes. We then explore the resources and databases available for multi-omics analysis and conclude with a discussion of translational genomics considerations. Ultimately, this chapter will serve as a tool for bioinformaticians and researchers exploring the field of crop genomics.
Key words Genomic, Resources, Molecular Markers, Plant Genomics, Genotyping, Omics tools, Translational genomics, Databases, Tools
1 Introduction
The manipulation and engineering of plants for the benefit of society is an age-old practice that has continually influenced human development. Plant genetics and agricultural methods have been iterated upon repeatedly to meet society’s nutritional needs. Plants are therefore a significant part of broader conversations about public health, environmental change, and social needs, and their importance has only grown with climate and population changes [1]. A significant shift in the development of plants with more desirable qualities began with the foundations of Mendelian genetics, which led plant breeders to cross plants with ideal traits to produce improved variants [2]. Such desirable traits include resistance to certain diseases, greater crop yield, suitability as biofuel crops, and increased nutritional value to better serve the anticipated global population.
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Aditya Narayan et al.
Genome editing has developed significantly in the past several decades, with increasing investment leading to novel experimental and bioinformatic techniques that allow scientists to understand and modify critical sequences. For example, genome sequencing of Arabidopsis thaliana in 2000 accelerated the understanding of plant genomes and the application of then-fledgling techniques to a relatively underexplored field [3]. Further, techniques now exist that allow even non-specialists to alter the regulation of gene expression and so gain further insight into organismal functional genomics. Rapid innovation in this space has lent energy to the agricultural sciences by making it more accessible to develop or remove traits in plants through the isolation of important genes and pathways. Advances include high-throughput sequencing methods, means of conducting proteomic and metabolomic analyses at scale, and the creation of large plant databases and bioresources. With this explosion of technical advancements, it is necessary to understand the basic terminology used in studying genomic tools and resources for crop improvement. This chapter therefore provides a foundational understanding of molecular markers, -omics approaches to studying plant genetic material, functional genomics, bioinformatics resources, and the translation of genetic insights into agronomically important traits.
2 Advances in Molecular Marker Discovery and Genotyping
Many methodologies and scientific advances in plant research rest on the identification of novel molecular markers and genotypes. Collected sequence data are an essential resource for future insights into biological phenomena and functional genomics [4, 5]. A wide array of markers exists within plant genomics. For example, genetic markers are DNA sequences with a known location that are associated with a specific gene or trait. By extension, molecular markers are nucleotide sequences that may be mapped to individuals or species based on specific polymorphisms such as insertions, deletions, single nucleotide polymorphisms (SNPs), and translocations. Mondini et al. describe the ideal molecular marker as codominant or dominant in its expression, reproducible, and capable of detecting polymorphisms [6]. Recent years have seen the development of a number of techniques for detecting molecular markers. Examples include:
• Restriction Fragment Length Polymorphism (RFLP): Restriction enzymes are used to cleave DNA sequences of interest, after which gel electrophoresis is applied to separate the fragments by size. Polymorphisms that alter fragment length, for example by creating or abolishing a restriction site, may thus be detected [7].
• Polymerase Chain Reaction (PCR): PCR amplifies sections of DNA using sequence-specific primers flanking the region of interest in the genome. Molecular markers may be identified through Random Amplification of Polymorphic DNA (RAPD), in which identical 10-mer primers anneal to and amplify random sections of the genome. If a mutation occurs at a previously complementary site, the corresponding fragment is no longer produced, and this change in the fragment pattern is detectable via gel electrophoresis [8]. Another PCR-based technique is Amplified Fragment Length Polymorphism (AFLP), which can rapidly generate large numbers of marker fragments without knowledge of the template sequence or large quantities of DNA. DNA is digested using restriction enzymes, after which adaptor sequences are ligated to the sticky ends of the resulting fragments. Primers complementary to the adaptor sequence and restriction site are then used to amplify only selected fragments. AFLP is labor intensive but benefits from its reproducibility [9].
• Simple-Sequence Repeats (SSRs): SSRs, or microsatellites, are tandemly repeated short motifs (1–6 nt) that vary in repeat number. These sequences are abundantly dispersed across the entire genome, including protein-coding genes. PCR allows for rapid detection of SSR copy number, particularly because the sequences flanking SSRs are generally conserved. Notably, SSRs may also be applied to studying chloroplast and mitochondrial sequences [10–12]. A related technique examines Inter Simple Sequence Repeats (ISSRs), which are DNA sequences of 100–3000 bp located between oppositely oriented microsatellites. These sequences may be amplified using PCR and separated by gel electrophoresis, and the resulting fragments examined [13].
• Retrotransposons: Retrotransposons are widely distributed repetitive DNA sequences present throughout the genome. Given that they constitute between 40% and 60% of the plant genome, they represent a natural molecular marker target. A number of techniques may be employed to identify retrotransposons, and these marker systems broadly rely on PCR amplification between the termini of the retrotransposon and flanking sequences [14, 15].
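Two of the marker systems above reduce to simple string operations once sequences are in hand. As a minimal sketch (not a method taken from the cited works), the Python below shows an in-silico restriction digest, in which losing a recognition site changes the fragment-length pattern as in RFLP, and a regular-expression scan for microsatellites with 1–6 nt motifs. The function names, the toy alleles, and the minimum of four repeats are illustrative assumptions; GAATTC is the EcoRI recognition site, cut after the first G.

```python
import re

def digest_lengths(seq, site, cut_offset=0):
    """Return sorted fragment lengths after cutting a linear sequence at every site.

    cut_offset is the cut position inside the recognition site
    (1 for EcoRI, which cuts G^AATTC). Hypothetical helper for illustration.
    """
    # A lookahead finds every site start without consuming characters.
    cuts = [m.start() + cut_offset
            for m in re.finditer("(?=" + re.escape(site) + ")", seq)]
    bounds = [0] + cuts + [len(seq)]
    return sorted(end - start for start, end in zip(bounds, bounds[1:]))

def find_ssrs(seq, min_repeats=4):
    """Yield (start, motif, repeat_count) for tandem repeats of 1-6 nt motifs."""
    # The lazy {1,6}? tries the shortest motif first, so ATATATAT is
    # reported as (AT)4 rather than (ATAT)2.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    for m in pattern.finditer(seq):
        motif = m.group(1)
        yield m.start(), motif, len(m.group(0)) // len(motif)

# Toy alleles (assumed data): allele_b has a point mutation in the second site.
allele_a = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"
allele_b = "AAAA" + "GAATTC" + "CCCCCC" + "GAGTTC" + "TT"
pattern_a = digest_lengths(allele_a, "GAATTC", cut_offset=1)
pattern_b = digest_lengths(allele_b, "GAATTC", cut_offset=1)

# Toy sequence containing an (AT)5 and a (CAG)4 microsatellite.
ssr_hits = list(find_ssrs("GGC" + "AT" * 5 + "CCT" + "CAG" * 4 + "T"))

if __name__ == "__main__":
    print("allele A fragments:", pattern_a)
    print("allele B fragments:", pattern_b)
    print("SSRs:", ssr_hits)
```

The two alleles give different fragment-length patterns because the mutated site is no longer cut, which is the signal an RFLP gel would show. The SSR scan reports the shortest repeating unit, so compound or imperfect repeats would need a more careful treatment than this sketch provides.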
Beyond these marker systems, recent innovations in DNA sequencing technologies and the growth of databases and resources have made it easier to develop new markers. High-throughput sequencing approaches allow high-accuracy, low-cost sequencing at scale. A brief overview of commonly applied sequencing technologies is as follows:
• Sanger Sequencing: A long-established and widely applied sequencing technique with average read lengths of between 700 and 900 bp. The technique relies on the selective incorporation of chain-terminating dideoxynucleotides (ddNTPs) during DNA replication in four separate reactions (with one ddNTP present in each). After the added primers are extended along the template DNA, the resulting fragments are separated by gel electrophoresis to determine the position at which each ddNTP was incorporated, from which the full sequence is reconstructed [16].
• Roche 454 Genome Sequencing: Allows for the production of large quantities of short sequences (smaller than 300 bp) by annealing DNA strands to resin beads via ligated adapter sequences. Strands are amplified by PCR, and each nucleotide addition releases light, which is recorded to construct the sequence [17].
• Illumina: The majority of Illumina approaches apply sequencing by synthesis, in which fluorescently labeled nucleotides are used to extend nucleotide chains that have been immobilized on a flow cell surface. Each labeled dNTP reversibly terminates the chain and is read to determine the nucleotide at that site; the label and terminating group are then cleaved to allow integration of the next dNTP [18, 19].
• Ion Torrent Sequencing Technology: A small semiconductor is used to monitor microwells that each contain a nucleotide template. Nucleotides are added in sequence, and each incorporation causes the release of a proton. This in turn causes a detectable change in pH, from which the sequence may be determined [20].
• PacBio’s Single-Molecule Real-Time (SMRT) Sequencing: This approach applies sequencing by synthesis with fluorescent dyes, whose emissions are detected using zero-mode waveguides. A polymerase is immobilized in each waveguide together with the DNA template, and each incorporation of a fluorescent dye-bound nucleotide is detected. This technique allows for very long read lengths of 10–15 kb [21].
• Oxford Nanopore Sequencing: DNA molecules are threaded through a nanopore while an ionic current passes across the pore, and disruptions in that current as the strand moves through are used to read the sequence. This technique works around the need for PCR amplification, is effective for long sequences, and is relatively low-cost [22].
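The chain-termination logic behind Sanger sequencing can be illustrated with a small simulation. This is a sketch under simplifying assumptions rather than a base-calling implementation: every terminated fragment is assumed to be observed exactly once, fragment lengths are counted from the first base added after the primer, and all function names and the toy template are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def synthesized_strand(template):
    """Strand built by the polymerase: the reverse complement of the template (both written 5'->3')."""
    return template.translate(_COMPLEMENT)[::-1]

def chain_termination_lanes(template):
    """Simulate the four ddNTP reactions.

    Each lane lists the lengths of the fragments that ended with that ddNTP.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand(template), start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes):
    """Read bands from the shortest to the longest fragment across the four lanes."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

template = "CTAGGCAT"  # toy template (assumed data)
lanes = chain_termination_lanes(template)
read = read_gel(lanes)

if __name__ == "__main__":
    print(lanes)
    print(read)
```

Sorting the bands by fragment length plays the role of electrophoresis: the shortest fragment identifies the first incorporated base, the next shortest the second, and so on.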
Sequencing technologies have thus progressed rapidly in the last several years. At present, a wide array of genome sequencing projects has been conducted for plant species, including, but not limited to, poplar, soybean, papaya, and rice [23–26]. The ability to understand molecular markers and to rapidly gather, store, and disseminate plant genetic information will facilitate dramatic research advances. In particular, post-genome sequencing projects allow the evaluation of agricultural plant population structures and the identification of key genes.
3 Decoding Plant Genomes
In order to study, manipulate, and apply the genetic resources found within plants, it is necessary to have reference genomes available [27]. However, given the relative infancy of many -omics methods, it is often necessary to create a reference genome for a species when none is otherwise available. This may be a technically challenging and costly task, and thus we will briefly describe the process of creating such references. Broadly, reference genome assemblies are representations of an organism’s genome that have been constructed through de novo assembly of short sequences into comprehensive genomic templates, in a manner akin to completing a puzzle [28]. Analyses that rely on a reference genome align input DNA sequences to the assembly and compare them with it to identify variants or map key loci. Sequencing a new example of a strain or species for which a reference exists is known as resequencing, and it facilitates rapid study of small-scale differences without creating an entirely new reference. However, resequencing restricts analysis, as only certain areas of the genome are amenable to it, and analyses must be updated frequently to match the rate at which new, more robust reference assemblies are created.
With access to more robust assemblies and given the increase in genome sequencing projects, it is clear that reference genomes may allow us insight into the structural genomic variants that influence phenotype. Plants often display a wide range of genomic variation that may at first glance appear minor but is significant when considering its functional impact within a species. Accordingly, many studies seek to explore the complete genetic content of a species, the pangenome [29]. The pangenome describes the full set of genes in a clade, which may vary even among closely related strains.
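As a rough illustration of the pangenome concept, gene content can be treated as sets: the union across accessions approximates the pangenome, and the intersection gives the genes shared by all of them, often called the core genome. The sketch below assumes that gene presence/absence calls are already available; the accession names and gene identifiers are invented for the example, and a real analysis would first have to resolve orthology and annotation differences.

```python
def summarize_pangenome(gene_sets):
    """Split gene content into pan, core, and dispensable sets.

    gene_sets maps an accession name to the collection of gene IDs detected in it.
    """
    if not gene_sets:
        raise ValueError("at least one accession is required")
    contents = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*contents)         # every gene seen in any accession
    core = set.intersection(*contents)   # genes present in all accessions
    return {"pan": pan, "core": core, "dispensable": pan - core}

# Hypothetical presence/absence calls for three accessions.
accessions = {
    "accession_1": {"geneA", "geneB", "geneC"},
    "accession_2": {"geneA", "geneB", "geneD"},
    "accession_3": {"geneA", "geneB", "geneC", "geneE"},
}
summary = summarize_pangenome(accessions)

if __name__ == "__main__":
    for name, genes in summary.items():
        print(name, sorted(genes))
```

Adding further accessions can only grow the pan set and shrink the core set, which is why the number and diversity of sampled genomes matters when interpreting such summaries.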
Study of the pangenome by comparison of reference genomes may lead to evolutionary insight and allow us to understand what the “core genome” of a species is (what sequences are common between individuals), the full breadth of genes present in a species, and the like. This line of research may further be expanded to not only consider a specific species, but also genomic diversity at the genus level. These super-pangenomes are relevant in crop research as the wild relatives of crops may possess genetic diversity that was lost when the plants were adapted for agricultural purposes [30]. Several pan-genomic studies have taken place in plants, such as rice, soybean, and maize, illustrating one significant application of reference genomes toward understanding species diversity [31–33]. Read lengths vary from short (150–250 bp) to far larger (>15 kbp) fragments and de novo assembly for complex eukaryotes
18
Aditya Narayan et al.
often relies on shotgun sequencing. Shotgun sequencing begins with the isolation of DNA, which is then sheared into fragments and randomly sequenced to collect information on the whole genome. These fragments overlap and as such, the similarity of overlapping fragments may be examined in order to determine which sequences are adjacent to each other. When the fragments are fully aligned, the resulting contiguous sequence is known as a contig. With this approach, long reads with more overlapping sequences are far easier to assemble relative to shorter sequences due to the number of gaps between reads and repeated elements. Accordingly, a large challenge de novo assembly results from the presence of repetitive sequences as highly similar sequences lead to miss-assembly and longer sequences are costly to produce for largescale projects. This is a particular concern in plants given the abundance of repetitive elements in their genomes and the potential for whole genome duplication events. Chaisson et al. described three types of assembly gaps that may occur, namely muted gaps (when the assembly is shorter than the true genome due to repetitive sequences which are not able to be amplified or cannot be cloned/propagated), coverage gaps (no sequence reads sampled at loci), and repetitive element-related gaps (gaps due to the presence of repetitive elements with varied sequences/length/ etc.) [34]. A wide array of assembly algorithms exist for different input reads and purposes. A brief summary of commonly applied tools for various organisms is provided below: l
• Assembly By Short Sequences (ABySS): This C++-based algorithm accepts paired-end as well as single-end sequencing reads. It is a parallelized sequence assembler that applies de Bruijn graphs (a compact representation based on k-mers, which is well suited to short read lengths) to identify sequence overlaps [35].
• Velvet: A set of algorithms written in C that accepts both paired-end and single-end sequencing reads and applies de Bruijn graphs to create contigs of significant length from very short sequences [36].
• SGA: A C++-based algorithm that accepts paired-end sequencing reads only. It is a memory-efficient, parallelizable algorithm that uses an overlap-based string graph model of assembly rather than de Bruijn graphs [37].
• Edena: A C++-based algorithm that accepts paired-end and single-end sequencing reads to generate contigs for bacterial sequencing. The algorithm follows an overlap-layout-consensus approach rooted in classical overlap graph representations [38].
• SOAPdenovo2: A C- and C++-based tool that accepts single-end and paired-end sequencing reads for graph construction. The algorithm benefits from its low memory usage, ability to resolve the majority of repeat regions, and optimization for large genomes. It is optimized for reads generated by the Illumina Genome Analyzer (GA) [39].
After contigs are created, they are oriented relative to each other by the process of scaffolding. Scaffolding programs link contigs to form larger scaffolds, and while the majority of assembly tools include scaffolding capabilities, stand-alone software tools are also available (e.g., OPERA and LINKS) [40, 41]. Gaps within scaffolds are then subjected to gap-closing algorithms (e.g., Sealer) to minimize the number of aberrancies [42]. Given the complexity of the algorithms applied, it is then necessary to assess the quality of the assembly using tools such as the Tool for Analyzing Mate Pairs in Assemblies (TAMPA) or Recognition of Errors in Assemblies using Paired Reads (REAPR) [43, 44].
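The de Bruijn graph idea used by several of the assemblers listed above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions (error-free reads, a single strand, no coverage information) and is not the implementation used by ABySS, Velvet, or SOAPdenovo2. The function names `build_de_bruijn` and `walk_contig` and the example reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # add each distinct k-mer only once
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_contig(graph, start):
    """Extend a contig from `start` while exactly one successor exists."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:  # stop on cycles, which repeats can create
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads sampled from one short sequence
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_contig(graph, "ATG")
print(contig)
```

In this sketch the walk stops as soon as a node has more than one successor. That is a simplified picture of why repetitive sequences fragment an assembly: a k-mer shared by two different genomic contexts creates a branch that short reads alone cannot resolve.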
4 Functional Genomic Studies: Multi-Omics Resources
Robust analysis of plant phenotypes is rooted not only in genomic analysis but also in studies conducted at every biological scale [45]. Accordingly, it is necessary to study the full breadth and depth of the -omics levels underlying plant physiology. In this section, we provide a brief overview of tools that may be employed to study the transcriptome, proteome, and metabolome.
4.1 Transcriptomics
High-throughput analysis of gene expression is a key approach to studying regulatory motifs, candidate genes relevant to agronomic traits, and potential gene functions. The accumulation of large-scale gene expression data sets and publicly available databases facilitates comparative analyses. In addition, small RNA (sRNA) analysis is becoming an increasingly useful part of the broader transcriptomic toolkit [46]. Sequence tag-based techniques are one approach to transcriptomic studies and generally rely on sequencing expressed sequence tags (ESTs) from cDNA libraries [47]. These tags are then assembled into transcript sequences using various assembly algorithms to generate transcriptomic profiles with associated abundance levels. A variety of databases, such as EGENES or TIGR, house these plant ESTs for rapid research applications [48, 49]. However, this random sequencing approach is performed on a large scale and is relatively costly and time-intensive. Related approaches include, but are not limited to:
• Serial Analysis of Gene Expression (SAGE): mRNA is isolated and converted into cDNA, which is in turn bound to beads via biotin attached to the primers used in amplification. It is then cleaved to create bound fragments of varying lengths. Adapters are then ligated, and the cDNAs are cleaved from the bead, leaving a short “tag” of the original cDNA, which is then treated with DNA polymerase to create blunt-ended cDNA fragments. These tag fragments are ligated together into ditags, amplified by PCR, cleaved again, concatenated, and cloned to create a SAGE library. This allows for the identification of a large number of transcripts by analysis of transcript tag sequences [50].
• Massively Parallel Signature Sequencing (MPSS): This technique creates short sequence tags by sequencing 16–20 bp from the 3′ end of cDNA using arrays of microbeads. A number of databases that apply the MPSS technique exist for plants such as rice and Arabidopsis [51–53].
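The principle shared by these tag-based methods, namely that the number of times a tag is observed reflects the abundance of its transcript, can be sketched in a few lines of Python. This is a deliberately simplified illustration. It assumes that each tag is simply the last `tag_length` bases of a cDNA, whereas real SAGE and MPSS tags are anchored at restriction sites. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_tags(cdnas, tag_length=17):
    """Count the tag formed by the last `tag_length` bases of each cDNA."""
    tags = Counter()
    for seq in cdnas:
        if len(seq) >= tag_length:  # skip sequences too short to yield a tag
            tags[seq[-tag_length:]] += 1
    return tags


def relative_abundance(tags):
    """Convert raw tag counts into fractions of all tags observed."""
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()}


# Toy cDNAs; a 4-base tag is used only to keep the example readable
cdnas = ["AAACCCGGGT", "TTTCCCGGGT", "AAACCCTTTA", "GGGCCCGGGT"]
tags = count_tags(cdnas, tag_length=4)
abundance = relative_abundance(tags)
print(tags.most_common())
print(abundance)
```

In practice each tag would then be mapped back to a gene or transcript, which is where the EST and genome databases mentioned above come in.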
Outside of EST-based approaches, it is also possible to explore hybridization-based platforms (microarrays and other DNA chip-based techniques). These methodologies generate comprehensive data sets based on parallel hybridization with DNA immobilized on slides or chips. Microarrays may be divided into spotted arrays (prepared by depositing spots of cDNA on a glass slide and then hybridizing fluorescently labeled targets) and on-chip synthesis methods (in which in situ synthesized DNA microarrays serve as templates for RNA transcription, and the transcripts are captured) [54–57]. Such transcriptome data have been gathered and made widely accessible via databases such as NCBI’s Gene Expression Omnibus (GEO), ArrayExpress from the European Bioinformatics Institute, and Genevestigator [58–60].
4.2 Proteomics
Proteomics constitutes the study of protein function, structure, interactions, and networks. It stands as the natural avenue for exploration in phylogenetic research, as the proteins present in an individual or species bear directly on agronomically relevant properties such as growth, responses to environmental changes, and so on. Rapid advancements in proteomics have allowed for progression from more time-intensive techniques such as protein separation via chromatographic approaches, protein quantification methods, protein–protein interaction networks, and mass spectrometry. This has led to a subsequent explosion in high-throughput data and, in turn, the creation of open-source data repositories [61].
The study of proteins present in a sample begins with sample preparation. Protein fractionation and precipitation (via SDS-PAGE, two-dimensional gel electrophoresis, or chromatographic techniques such as ion-exchange or affinity methods) are common approaches, though those seeking to apply them must be aware of the potential contamination and error intrinsic to wet lab approaches. There also exist gel-free methods in which protein mixtures are digested and separated by online separation methods such as multidimensional protein identification technology (MudPIT) [62]. This approach is ideal for high-throughput analysis [63]. Subsequent identification steps allow for mass fingerprinting using mass spectrometry, in which proteins are fragmented, ionized, and detected. The resultant mass fingerprints may be used to query databases (via tools such as the Basic Local Alignment Search Tool (BLAST) and its variants, such as BLASTN for nucleotide searches and BLASTP for protein databases) to identify proteins present in a sample.
Proteomic analysis may also take place on the scale of organelles through cell fractionation and organelle isolation, or through alternative techniques such as isotope-coded affinity tags (ICAT), which apply chemical labeling for mass spectrometry [64, 65]. Such exploration may facilitate an understanding of organelle-specific properties such as protein trafficking.
Further, it is possible to conduct protein quantification in order to gain more detailed insights into protein dynamics and crop physiology under different conditions. The most commonly applied quantitative methods may be divided into traditional methods, which are readily accessible in wet lab formats, and more complex alternative assays. Total protein quantification methods include the measurement of UV absorption at 280 nm, bicinchoninic acid (BCA) assays, and Bradford assays. Alternative methods include difference gel electrophoresis (two-dimensional gel electrophoresis with fluorescently tagged proteins), Lowry assays, and other commercially available methods. For quantification of individual proteins, the enzyme-linked immunosorbent assay (ELISA), western blot, and mass spectrometry techniques may be applied.
Following translation, proteins are subject to rapid alterations via posttranslational modifications (PTMs) [66]. These changes dramatically alter protein complexity and dynamics and thus must be studied to better understand how plant proteins influence and regulate phenotypic events [67, 68].
Such studies often involve complex sample preparation steps, including affinity-based methods, immunoprecipitation (particularly in the case of phosphorylation and ubiquitination), and phase partitioning approaches (most relevant to glycosylphosphatidylinositol). Phosphorylation, in particular, is one of the most significant and most robustly studied PTMs, as it has been shown to influence photosynthesis as well as pathogen responses [69, 70]. In vivo phosphorylation sites can be mapped through mass spectrometry-based approaches accompanied by chromatographic methods such as ion-exchange. Another critical PTM, ubiquitination, which targets proteins for destruction by the proteasome, has been studied using a combination of affinity chromatography and mass spectrometry to explore ubiquitin and ubiquitin-like protein systems.
The final level of proteomic study is structural proteomics, as 3D structures allow for an understanding of protein interactions, drug interactions, and the study of novel proteins through homology-based methods. The most common methods for elucidating the structure of proteins are NMR and X-ray crystallography. However, experimental determination of protein structures may be both time-consuming and inefficient, thereby limiting the scope of exploration. In recent years there has been a rapid increase in the number of tools giving researchers access to modeling methods through a wide breadth of techniques. Of these, homology modeling has become the preferred choice in many cases for obtaining the 3D coordinates of plant protein structures, particularly as a preliminary analysis step, since it is less rigorous than the experimental approaches mentioned above. In this approach, models of novel proteins are built from the structural information available for evolutionarily related proteins. A wide array of tools is available to create homology models at scale, including MODELLER, I-TASSER, SWISS-MODEL, Phyre2, and HHpred [71–75]. The study of protein structures has led to the consolidation of protein structural information in databases such as the Protein Data Bank (PDB) [76]. In summary, protein profiling steps include sample preparation, separation, detection or quantification, identification, and structural analysis.
4.3 Metabolomics
The final level of analysis aims to explore metabolic systems through analysis of the metabolites present in a particular tissue, cell, species, etc. Metabolomic approaches allow for the study of the deep complexity present in plant species to better understand cellular systems. There exists a wide array of metabolomic analysis tools and platforms for metabolic profiling. Such analysis begins with the collection of metabolic data using instruments such as NMR, mass spectrometry (coupled to gas or liquid chromatography), Fourier-transform ion cyclotron resonance mass spectrometry (FT-MS), and the like [77]. Based on the data collected, varying analytical steps may be performed. In the case of mass spectrometry or NMR spectra, the data must be adjusted to account for background noise, peak alignment, and so on. The data are then used to identify metabolites present in the sample via database queries and statistical analysis, such as principal component or multivariate analysis, to cluster metabolites or samples based on the hypothesis being explored. Visualization via heatmaps and metabolic maps may be performed to better understand the relationships between the metabolites present. A wide array of data analysis, pathway mapping, and visualization tools exists for the purpose of metabolomic analyses:
• PRIMe (Platform for RIKEN Metabolomics) is a web-based tool that supports transcriptomic to metabolomic analyses, with the ability to create two-dimensional feature maps among other functions [78].
• Cytoscape is a software environment that allows integrated biomolecular interaction networks to be created [79].
• InterSpin is a set of integrated tools specifically designed to conduct batch assignments of NMR peaks [80].
• MetabolomeExpress is a combined data repository and web-based pipeline enabling processing, analysis, visualization, meta-analysis, and dissemination of metabolite profiles [81].
• iPath3.0 is a web-based application for visualizing and analyzing cellular pathways that allows for rapid exploration of multi-omics data with a range of customizable features [82].
• KEGG is a reference resource that offers a database of knowledge on molecular interaction and reaction networks via pathway maps, as well as orthology-based approaches to improve knowledge of novel organisms [83].
• KaPPA-View is a web-based tool for integrating both transcript and metabolite data to create metabolic pathway maps [84].
• MapMan is a tool that allows for the visualization of large data sets as metabolic maps and for rapid comparison of omics data in plants [85].
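The database-query step of metabolite identification described above can be sketched as a mass lookup within a tolerance window. The Python example below is a minimal illustration and not the method of any tool listed here. It assumes positive-mode [M+H]+ ions, a small hand-written reference table of neutral monoisotopic masses, and a 10 ppm tolerance. The names `REFERENCE_MASSES`, `ppm_error`, and `annotate_peaks`, as well as the peak values, are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da

# Illustrative reference table: neutral monoisotopic masses in Da
REFERENCE_MASSES = {
    "proline": 115.06333,
    "glucose": 180.06339,
    "citric acid": 192.02700,
    "sucrose": 342.11621,
}


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def annotate_peaks(peaks_mz, reference=REFERENCE_MASSES, tolerance_ppm=10.0):
    """Match observed [M+H]+ peaks to reference compounds within a ppm window."""
    hits = {}
    for mz in peaks_mz:
        hits[mz] = [
            name
            for name, mass in reference.items()
            if abs(ppm_error(mz, mass + PROTON_MASS)) <= tolerance_ppm
        ]
    return hits


# Hypothetical peak list after noise filtering and peak alignment
peaks = [116.0706, 181.0707, 250.0000]
annotations = annotate_peaks(peaks)
for mz, names in annotations.items():
    print(mz, names or "unassigned")
```

A mass match alone is only a candidate annotation, since isomers share the same mass. Real workflows combine such lookups with retention time, fragmentation spectra, or NMR data before the statistical and pathway-mapping steps.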
In summary, there are an enormous number of tools and processes to explore the variety of -omics data available to understand the full breadth and depth of plant biological complexity. In the next section, we will review some of the opportunities available to translate genomic insights to biological changes.
5 Translational Genomics: Bridging the Genotype-to-Phenotype Gap
The abundant advances in sequencing, genotyping, and -omics analysis technologies have guided the creation of innumerable resources with respect to model crops. Translational genomics exists to transfer genomic insights from model plants to novel crops and reference genomes such that they may influence breeding, functional biology, or other species-specific goals [86]. This approach relies on comparative genomics to study similarities and differences among plant species. The greater the evolutionary relatedness between species, the greater the accuracy of functional annotation of novel sequences, and thus plants with more closely related model species are more attractive targets for translational genomic approaches [87]. Gene models in a reference genome may be created through RNA sequencing, and this may, in turn, allow for functional annotation of protein-coding sequences based on sequence homology (as determined by tools such as BLAST, which query protein databases). The domain architecture of proteins may also be determined by applying algorithms such as hidden Markov models [88]. Gene networks may then be created to study gene interactions based on protein–protein interaction networks and co-expression. Databases such as BioGRID may be queried to better understand these protein–protein interactions [89]. Gene network models may be used to predict regulatory relationships and thereby understand the key roles of model plant genes, or as predictive tools for studies in novel genomes.
Genome annotation may also involve analysis of synteny, or conservation of gene order, to explore common ancestry and the common traits of species descended from that common ancestor. Understanding gene order conservation may then lead to phenotypic annotations relevant for breeding, such as the identification of quantitative trait loci (QTLs) that are linked to desirable traits [90]. A final mechanism of gene annotation is the exploration of genetic markers associated with traits identified by genome-wide association studies (GWAS). This allows for the isolation of domestication-related phenotypes, which may then be transferred to new genomes to create specific crop phenotypes. Effectively, translational genomics requires the identification of sequences linked to desirable phenotypes via sequence homology or conservation of gene order in model plants and the subsequent transfer of those genomic resources to newly constructed genomes. In particular, molecular breeding is a method of translating genomic information to new lines and encompasses a variety of techniques such as marker-assisted backcrossing (MABC), marker-assisted recurrent selection (MARS), and genomic selection (GS) [91].
To facilitate further translational genomic research, it is necessary to integrate genomic resources into databases with genomic locations, plant ontology, associated GWAS results, and the like. A selection of integrated plant-specific resources is included below:
• EnsemblPlants: Provides whole-genome data on a wide range of plant species and offers a variety of tools for visualizing, comparing, and analyzing plant data [92].
• ChloroplastDB: Contains chloroplast genome data, which are highly relevant to the study of key plant processes such as photosynthesis [93].
• KEGG: As mentioned previously, KEGG contains whole-genome and large-scale EST data as well as a variety of analytical tools [83].
• PlantGDB: A database of molecular sequence data for plant species that organizes ESTs into annotated contigs [94].
• MIPSPlantsDB: A plant database that contains information on Arabidopsis, Medicago, Lotus, rice, maize, and tomato and is effective for comparative studies [95].
• MEGANTE: A web-based annotation system that allows lay scientists to easily access genome sequence annotation tools [96].
• AgBase: A resource for structural and functional annotation of agronomically relevant genomes, with high ease of use, gene ontology annotations, and a range of integrated tools [97].
6 Conclusion
The rapid advancement of high-throughput sequencing technologies and tools for genome alteration has allowed biological data to be gathered and manipulated at unprecedented scales. In this chapter, we have discussed analytical methods for the study and creation of reference genomes of crops and model plants, with the intent of understanding the steps that scientists may take moving forward to apply the abundance of genetic knowledge at our fingertips to crop breeding. Translational genomics stands as a natural avenue for the creation of plant matter that will address critical global issues, and, in the future, it will be necessary for the scientific field to explore more techniques for transferring genomic resources to novel plants, as well as how to create more robust databases and other resources for plant-specific studies.
References
1. Brown M, Funk C (2008) Food security under climate change. Science 319(5863):580–581. https://doi.org/10.1126/science.1154102
2. Smýkal P, Varshney RK, Singh VK et al (2016) From Mendel’s discovery on pea to today’s plant genetics and breeding. Theor Appl Genet 129(12):2267–2280. https://doi.org/10.1007/s00122-016-2803-2
3. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. https://doi.org/10.1038/35048692
4. Brady S, Provart N (2009) Web-queryable large-scale data sets for hypothesis generation in plant biology. Plant Cell 21(4):1034–1051. https://doi.org/10.1105/tpc.109.066050
5. Dhanapal A, Govindaraj M (2015) Unlimited thirst for genome sequencing, data interpretation, and database usage in genomic era: the road towards fast-track crop plant improvement. Genet Res Int 2015:1–15. https://doi.org/10.1155/2015/684321
6. Mondini L, Noorani A, Pagnotta MA (2009) Assessing plant genetic diversity by molecular tools. Diversity 1(1):19–35
7. Barnes SR (1991) RFLP analysis of complex traits in crop plants. Symp Soc Exp Biol 45:219–228
8. Deragon JM, Landry BS (1992) RAPD and other PCR-based analyses of plant genomes using DNA extracted from small leaf disks. PCR Methods Appl 1(3):175–180. https://doi.org/10.1101/gr.1.3.175
9. Qi X, Lindhout P (1997) Development of AFLP markers in barley. Mol Gen Genet 254(3):330–336. https://doi.org/10.1007/s004380050423
10. Feng S, He R, Lu J et al (2016) Development of SSR markers and assessment of genetic diversity in medicinal Chrysanthemum morifolium cultivars. Front Genet 7:113. https://doi.org/10.3389/fgene.2016.00113
11. Purwoko D, Cartealy IC, Tajuddin T, Dinarti D, Sudarsono S (2019) SSR identification and marker development for sago palm based on NGS genome data. Breed Sci 69(1):1–10. https://doi.org/10.1270/jsbbs.18061
12. Vieira ML, Santini L, Diniz AL, Munhoz CF (2016) Microsatellite markers: what they mean and why they are so useful. Genet Mol Biol 39(3):312–328. https://doi.org/10.1590/1678-4685-GMB-2016-0027
13. Godwin ID, Aitken EA, Smith LW (1997) Application of inter simple sequence repeat (ISSR) markers to plant genetics. Electrophoresis 18(9):1524–1528. https://doi.org/10.1002/elps.1150180906
14. Grzebelus D (2006) Transposon insertion polymorphism as a new source of molecular markers. J Fruit Ornam Plant Res 14(Suppl 1):21–29
15. Kumar A, Bennetzen JL (1999) Plant retrotransposons. Annu Rev Genet 33(1):479–532
16. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74(12):5463–5467. https://doi.org/10.1073/pnas.74.12.5463
17. Thudi M, Li Y, Jackson SA, May GD, Varshney RK (2012) Current state-of-art of sequencing technologies for plant genomics research. Brief Funct Genomics 11(1):3–11. https://doi.org/10.1093/bfgp/elr045
18. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT (2012) Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample [published correction appears in PLoS One. 2012;7(3):10.1371/annotation/64ba358f-a483-46c2-b224-eaa5b9a33939]. PLoS One 7(2):e30087. https://doi.org/10.1371/journal.pone.0030087
19. Slatko BE, Gardner AF, Ausubel FM (2018) Overview of next-generation sequencing technologies. Curr Protoc Mol Biol 122(1):e59. https://doi.org/10.1002/cpmb.59
20. Lahens NF, Ricciotti E, Smirnova O et al (2017) A comparison of Illumina and Ion Torrent sequencing platforms in the context of differential gene expression. BMC Genomics 18(1):602. https://doi.org/10.1186/s12864-017-4011-0
21. Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13(5):278–289. https://doi.org/10.1016/j.gpb.2015.08.002
22. Mikheyev AS, Tin MM (2014) A first look at the Oxford Nanopore MinION sequencer. Mol Ecol Resour 14(6):1097–1102. https://doi.org/10.1111/1755-0998.12324
23. Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Helsten U et al (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604
24. Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178–183
25. Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH et al (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452:991–996
26. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100
27. Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnol J 8(1):2–9. https://doi.org/10.1111/j.1467-7652.2009.00459.x
28. Jung H, Winefield C, Bombarely A, Prentis P, Waterhouse P (2019) Tools and strategies for long-read sequencing and de novo assembly of plant genomes. Trends Plant Sci 24(8):700–724. https://doi.org/10.1016/j.tplants.2019.05.003
29. Golicz AA, Batley J, Edwards D (2016) Towards plant pangenomics. Plant Biotechnol J 14(4):1099–1105. https://doi.org/10.1111/pbi.12499
30. Khan A, Garg V, Roorkiwal M, Golicz A, Edwards D, Varshney R (2020) Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci 25(2):148–158. https://doi.org/10.1016/j.tplants.2019.10.012
31. Li YH, Zhou G, Ma J, Jiang W, Jin LG, Zhang Z, Guo Y (2014) De novo assembly of soybean wild relatives for pangenome analysis of diversity and agronomic traits. Nat Biotechnol 32(10):1045–1052
32. Schatz M, Maron L, Stein J, Wences A, Gurtowski J, Biggers E, Lee H (2014) Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica. Genome Biol 15(11):506
33. Hirsch C, Foerster J, Johnson J et al (2014) Insights into the maize pan-genome and pan-transcriptome. Plant Cell 26(1):121–135. https://doi.org/10.1105/tpc.113.119982
34. Chaisson MJP, Wilson RK, Eichler EE (2015) Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16(11):627–640
35. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117–1123. https://doi.org/10.1101/gr.089532.108
36. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
37. Simpson JT, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res 22:549–556
38. Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 18:802–809
39. Luo R, Liu B, Xie Y et al (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler [published correction appears in Gigascience. 2015;4:30]. Gigascience 1(1):18. https://doi.org/10.1186/2047-217X-1-18
40. Gao S, Sung W-K, Nagarajan N (2011) Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 18(11):1681–1691
41. Warren RL et al (2015) LINKS: scalable alignment-free scaffolding of draft genomes with long reads. Gigascience 4:35
42. Paulino D, Warren RL, Vandervalk BP, Raymond A, Jackman SD, Birol I (2015) Sealer: a scalable gap-closing application for finishing draft genomes. BMC Bioinformatics 16(1):230
43. Dew IM, Walenz B, Sutton G (2005) A tool for analyzing mate pairs in assemblies (TAMPA). J Comput Biol 12(5):497–513
44. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD (2013) REAPR: a universal tool for genome assembly evaluation. Genome Biol 14(5):R47
45. Morrell P, Buckler E, Ross-Ibarra J (2011) Crop genomics: advances and applications. Nat Rev Genet 13(2):85–96. https://doi.org/10.1038/nrg3097
46. Harbers M, Carninci P (2005) Tag-based approaches for transcriptome research and genome annotation. Nat Methods 2:495–502
47. Rudd S (2003) Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci 8(7):321–329. https://doi.org/10.1016/S1360-1385(03)00131-6
48. Masoudi-Nejad A, Goto S, Jauregui R et al (2007) EGENES: transcriptome-based plant database of genes with metabolic pathway information and expressed sequence tag indices in KEGG. Plant Physiol 144(2):857–866. https://doi.org/10.1104/pp.106.095059
49. Chan AP, Pertea G, Cheung F et al (2006) The TIGR maize database. Nucleic Acids Res 34(Database issue):D771–D776. https://doi.org/10.1093/nar/gkj072
50. Hu M, Polyak K (2006) Serial analysis of gene expression. Nat Protoc 1:1743–1760. https://doi.org/10.1038/nprot.2006.269
51. Reinartz J, Bruyns E, Lin JZ et al (2002) Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief Funct Genomic Proteomic 1(1):95–104. https://doi.org/10.1093/bfgp/1.1.95
52. Lu C, Kulkarni K, Souret FF, MuthuValliappan R, Tej SS, Poethig RS et al (2006) MicroRNAs and other small RNAs enriched in the Arabidopsis RNA-dependent RNA polymerase-2 mutant. Genome Res 16:1276–1288
53. Nobuta K, Venu RC, Lu C, Belo A, Vemaraju K, Kulkarni K et al (2007) An expression atlas of rice mRNAs and small RNAs. Nat Biotechnol 25:473–477
54. Rickman DS, Herbert CJ, Aggerbeck LP (2003) Optimizing spotting solutions for increased reproducibility of cDNA microarrays. Nucleic Acids Res 31(18):e109. https://doi.org/10.1093/nar/gng109
55. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H et al (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14:1675–1680
56. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
57. Lietard J, Somoza M (2019) Spotting, transcription and in situ synthesis: three routes for the fabrication of RNA microarrays. Comput Struct Biotechnol J 17:862–868. https://doi.org/10.1016/j.csbj.2019.06.004
58. Clough E, Barrett T (2016) The gene expression omnibus database. Methods Mol Biol 1418:93–110. https://doi.org/10.1007/978-1-4939-3578-9_5
59. Parkinson H, Kapushesky M, Shojatalab M et al (2007) ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database):D747–D750. https://doi.org/10.1093/nar/gkl995
60. Hruz T, Laule O, Szabo G et al (2008) Genevestigator v3: a reference expression database for the meta-analysis of transcriptomes. Adv Bioinforma 2008:420747. https://doi.org/10.1155/2008/420747
61. Agrawal G, Pedreschi R, Barkla B et al (2012) Translational plant proteomics: a perspective. J Proteome 75(15):4588–4601. https://doi.org/10.1016/j.jprot.2012.03.055
62. Schirmer EC, Yates JR 3rd, Gerace L (2003) MudPIT: a powerful proteomics tool for discovery. Discov Med 3(18):38–39
63. Yates JR, Ruse CI, Nakorchevsky A (2009) Proteomics by mass spectrometry: approaches, advances, and applications. Annu Rev Biomed Eng 11:49–79
64. Duclos S, Desjardins M (2011) Organelle proteomics. Methods Mol Biol 753:117–128. https://doi.org/10.1007/978-1-61779-148-2_8
65. Dunkley TP, Watson R, Griffin JL, Dupree P, Lilley KS (2004) Localization of organelle proteins by isotope tagging (LOPIT). Mol Cell Proteomics 3(11):1128–1134. https://doi.org/10.1074/mcp.T400009-MCP200
66. Kwon SJ, Choi EY, Choi YJ, Ahn JH, Park OK (2006) Proteomics studies of post-translational modifications in plants. J Exp Botany 57(7):1547–1551. https://doi.org/10.1093/jxb/erj137
67. Arsova B, Watt M, Usadel B (2018) Monitoring of plant protein post-translational modifications using targeted proteomics. Front Plant Sci 9:1168. https://doi.org/10.3389/fpls.2018.01168
68. Grabsztunowicz M, Koskela M, Mulo P (2017) Post-translational modifications in regulation of chloroplast function: recent advances. Front Plant Sci 8:240. https://doi.org/10.3389/fpls.2017.00240
69. Xing T, Ouellet T, Miki BL (2002) Towards genomic and proteomic studies of protein phosphorylation in plant-pathogen interactions. Trends Plant Sci 7:224–230
70. Bennett J (1983) Regulation of photosynthesis by reversible phosphorylation of the light-harvesting chlorophyll a/b protein. Biochem J 212(1):1–13. https://doi.org/10.1042/bj2120001
71. Webb B, Sali A (2014) Protein structure modeling with MODELLER. Methods Mol Biol 1137:1–15. https://doi.org/10.1007/978-1-4939-0366-5_1
72. Zhang Y (2008) I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9:40. https://doi.org/10.1186/1471-2105-9-40
73. Schwede T, Kopp J, Guex N, Peitsch MC (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res 31(13):3381–3385. https://doi.org/10.1093/nar/gkg520
74. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10(6):845–858. https://doi.org/10.1038/nprot.2015.053
75. Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33(Web Server issue):W244–W248. https://doi.org/10.1093/nar/gki408
76. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242. https://doi.org/10.1093/nar/28.1.235
77. Fiehn O, Kopka J, Dormann P, Altmann T, Trethewey RN, Willmitzer L (2000) Metabolite profiling for plant functional genomics. Nat Biotechnol 18:1157–1161
78. Sakurai T, Yamada Y, Sawada Y et al (2013) PRIMe update: innovative content for plant metabolomics and integration of gene expression and metabolite accumulation. Plant Cell Physiol 54(2):e5. https://doi.org/10.1093/pcp/pcs184
79. Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504. https://doi.org/10.1101/gr.1239303
80. Yamada S, Ito K, Kurotani A, Yamada Y, Chikayama E, Kikuchi J (2019) InterSpin: integrated supportive webtools for low- and high-field NMR analyses toward molecular complexity. ACS Omega 4(2):3361–3369. https://doi.org/10.1021/acsomega.8b02714
81. Carroll AJ, Badger MR, Harvey MA (2010) The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets. BMC Bioinformatics 11:376. https://doi.org/10.1186/1471-2105-11-376
82. Darzi Y, Letunic I, Bork P, Yamada T (2018) iPath3.0: interactive pathways explorer v3. Nucleic Acids Res 46(W1):W510–W513. https://doi.org/10.1093/nar/gky299
83. Kanehisa M (2016) KEGG bioinformatics resource for plant genomics and metabolomics. Methods Mol Biol 1374:55–70. https://doi.org/10.1007/978-1-4939-3167-5_3
84. Tokimatsu T, Sakurai N, Suzuki H et al (2005) KaPPA-view: a web-based analysis tool for integration of transcript and metabolite data on plant metabolic pathway maps. Plant Physiol 138(3):1289–1300. https://doi.org/10.1104/pp.105.060525
85.
Thimm O, Bl€asing O, Gibon Y et al (2004) MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J 37(6):914–939. https://doi.org/10.1111/j. 1365-313x.2004.02016.x 86. Elma MJ, Salentijn EMJ, Pereira A, Angenent GC, van der Linden GC, Krens F, Smulders MJM, Vosman B (2007) Plant translational
Genomic Resources for Crops genomics: from model species to crops. Mol Breeding 20:1–13 87. Rossignol M, Peltier J, Mock H, Matros A, Maldonado A, Jorrı´n J (2006) Plant proteome analysis: a 2004–2006 update. Proteomics 6(20):5529–5548. https://doi.org/10.1002/ pmic.200600260 88. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL, Tate J, Punta M (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222–D230 89. Oughtred R, Stark C, Breitkreutz BJ et al (2019) The BioGRID interaction database: 2019 update. Nucleic Acids Res 47(D1): D529–D541. https://doi.org/10.1093/nar/ gky1079 90. Paterson AH, Freeling M, Tang HB, Wang XY (2010) Insights from the comparison of plant genome sequences. Annu Rev Plant Biol 61: 349–372 91. Varshney R, Kudapa H, Pazhamala L et al (2014) Translational genomics in agriculture: some examples in grain legumes. CRC Crit Rev Plant Sci 34(1–3):169–194. https://doi.org/ 10.1080/07352689.2014.897909 92. Bolser D, Staines DM, Pritchard E, Kersey P (2016) Ensembl plants: integrating tools for visualizing, mining, and analyzing plant
29
genomics data. Methods Mol Biol 1374: 115–140. https://doi.org/10.1007/978-14939-3167-5_6 93. Cui L, Veeraraghavan N, Richter A, Wall K, Jansen RK, Leebens-Mack J, Makalowska I, dePamphilis CW (2006) ChloroplastDB: the Chloroplast Genome Database. Nucleic Acids Res 34(suppl_1):D692–D696. https://doi. org/10.1093/nar/gkj055 94. Dong Q, Schlueter SD, Brendel V (2004) PlantGDB, plant genome database and analysis tools. Nucleic Acids Res 32(Database issue): D354–D359. https://doi.org/10.1093/nar/ gkh046 95. Spannagl M, Noubibou O, Haase D, Yang L, Gundlach H, Hindemitt T, Klee K, Haberer G, Schoof H, Mayer KFX (2007) MIPSPlantsDB—plant database resource for integrative and comparative plant genome research. Nucleic Acids Res 35(suppl_1):D834–D840. https://doi.org/10.1093/nar/gkl945 96. Numa H, Itoh T (2014) MEGANTE: a web-based system for integrated plant genome annotation. Plant Cell Physiol 55(1):e2. https://doi.org/10.1093/pcp/pct157 97. McCarthy FM, Gresham CR, Buza TJ et al (2011) AgBase: supporting functional modeling in agricultural organisms. Nucleic Acids Res 39(Database issue):D497–D506. https:// doi.org/10.1093/nar/gkq1115
Chapter 3

Next-Generation Sequencing Technologies: Approaches and Applications for Crop Improvement

Anupam Singh, Goriparthi Ramakrishna, Tanvi Kaila, Swati Saxena, Sandhya Sharma, Ambika B. Gaikwad, M. Z. Abdin, and Kishor Gaikwad

Abstract

The persistent efforts toward attaining food security and balanced nutrition are challenged by deteriorating natural resources, aberrant climate change, and population growth, calling for innovative technologies to overcome the constraints on crop production. Crop improvement through multifaceted approaches that combine conventional and genomic technologies is necessary for developing biotic and abiotic stress-tolerant varieties with high yield and desirable nutritional quality. A detailed understanding of complex plant genomes and genetic diversity is necessary to meet these challenges. Before 2004, genome sequencing depended mostly on Sanger sequencing technology, which, though accurate, was not high-throughput. The successful sequencing of the Arabidopsis and rice genomes encouraged the sequencing of many other crop and model plants. Since then, sequencing technologies, accompanied by high-power computing, have evolved at an astounding pace into more advanced, innovative, and competitive next-generation sequencing (NGS) and next-NGS technologies. NGS technologies are low cost, rapid, and high-throughput. The advancement of NGS and next-NGS technologies, combined with automated phenotyping techniques, has accelerated the crop improvement process. NGS technology enables the generation of reference genomes and the re-sequencing of related species to understand genetic diversity; transcriptome sequencing, which provides insight into complex gene networks; metagenomics; and high-throughput genotyping methods such as genotyping by sequencing (GBS) and QTL mapping, which have been used successfully in crop improvement programs. In the present chapter, we briefly describe the different generations of sequencing technologies, the current status of advanced NGS technologies, and their applications in crop improvement, including de novo nuclear/organellar genome assembly, re-sequencing, functional genomics, epigenetics, marker development for introgression of important agronomic traits, population genetics, evolutionary biology, and pan-genomics.

Key words Next-generation sequencing, Long-read sequencing, Crop improvement, Genomics, Transcriptomics, Epigenomics, Marker development
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction

Decoding the chemical structure of biological macromolecules (DNA, RNA, and proteins) is necessary for understanding their structural and functional characteristics and the varied effects they have. A nitrogenous base, a sugar, and a phosphate are the three fundamental components of every nucleotide present in DNA or RNA. DNA is composed of four nucleotides: two purines, adenine (A) and guanine (G), and two pyrimidines, cytosine (C) and thymine (T), while RNA contains uracil (U) instead of thymine. Knowledge of the nucleotide sequence is necessary for understanding the inheritance, structure, and function of a gene. The first breakthrough was the deciphering of the 3D structure of DNA by Watson and Crick in 1953 [1]. The first nucleic acid to be sequenced was the yeast alanine tRNA, reported by Robert Holley in 1965 [2]; the genome of bacteriophage MS2, which is also an RNA molecule, was sequenced later [3]. The first DNA sequencing report came in 1968, when Wu reported the sequence of 12 bases of the bacteriophage lambda cohesive ends using primer extension methods [4–6]. Gilbert and Maxam reported the sequence of the 24 bases of the lactose-repressor binding site in 1973 [7]. In 1975, Sanger and Coulson introduced a sequencing approach based on primer extension with different DNA polymerases, called the Plus and Minus method [8]. A breakthrough came in 1977, when two methods transformed the field of DNA sequencing: the chemical degradation of labeled DNA fragments, developed by Maxam and Gilbert, and the chain termination method, developed by Sanger and Coulson [9, 10]. The shotgun sequencing strategy was later developed on the basis of these methods and used for the de novo genome assembly of bacteriophage lambda in 1982 [11]. By 1987, Smith, Hood, and Applied Biosystems had developed automated, fluorescence-based Sanger sequencing machines, known as the first-generation sequencing platforms [12–14].
In the 1990s, the development of the "hierarchical shotgun/clone by clone" genome sequencing and assembly strategy led to the success of the Arabidopsis Genome Initiative in 2000 [15], the Human Genome Project (HGP) in 2002 [16], and the International Rice Genome Sequencing Project (IRGSP) in 2005 [17]. Throughout the 1980s and 1990s, many scientific groups explored alternatives to electrophoresis-based sequencing, but success came only gradually, around the completion of the HGP, in the form of the second-generation or next-generation sequencing (NGS) technologies [18]. These NGS methods relied on the sequencing-by-synthesis (SBS) approach, which involves multiple cycles of polymerase-mediated addition of nucleotides followed by signal detection (fluorescence or voltage). In 2005, 454 Life Sciences commercially released the first NGS platform, based on pyrosequencing [19]. Advances in NGS technologies and computational methods have reduced the time and cost of sequencing many fold, but read lengths remained shorter than those of Sanger's sequencing [20]. The repeat-rich genomes of plants and other organisms require longer reads to resolve the complexities of their genome organization. Exploration of long-read sequencing strategies started back in the 1980s; most of these strategies did not progress far, except for two methods that led to the development of single-molecule real-time (SMRT) sequencing platforms, also known as next-NGS or third-generation sequencing techniques. The first was developed by Pacific Biosciences (PacBio SMRT) [21], with read lengths ranging from 10 to 100 kb, and the second is nanopore sequencing, developed by Oxford Nanopore Technologies (ONT) [22], with a maximum read length of up to 900 kb [23]. These long-read sequencing techniques facilitated the detection of epigenetic modifications such as methylation and the sequencing of full-length transcripts (Iso-Seq). Further, these long-read sequencing methods, in combination with techniques like mate-pair sequencing, 10x Genomics linked-read sequencing, Hi-C, and optical mapping, have proved to be scalable, cost-effective means of generating chromosome-scale assemblies of large complex genomes. NGS technologies have further helped to catalog genetic variation among related species, to identify candidate genes and mutations linked to specific traits, and to unravel regulatory gene networks through RNA sequencing, exome sequencing, and metagenomics. The history of DNA sequencing technology has been discussed thoroughly in many reviews [20, 24–26]. The key developments in DNA sequencing technology are listed in Table 1.
Table 1 List of milestone achievements in the history of DNA sequencing technology

Year | Technology development | References
1953 | Discovery of the 3D structure of DNA | [1]
1953 | Identification of amino acid sequence of insulin protein | [27]
1965 | Sequencing of yeast alanine tRNA | [2]
1968 | Sequencing of lambda phage cohesive ends | [6]
1975 | Development of Plus and Minus sequencing technology by Sanger and Coulson | [28]
1977 | Development of chemical degradation method of DNA sequencing | [9]
1977 | Development of chain termination method of DNA sequencing | [10]
1977 | The first whole genome to be sequenced is that of PhiX174 | [29]
1982 | Genome sequencing of the lambda phage genome | [11]
1983 | The invention of polymerase chain reaction technology by Kary B. Mullis | [30]
1986–1987 | DNA sequencing with fluorescent chain-terminating dideoxynucleotides | [12]
1986 | Applied Biosystems released the first-ever automated Sanger's sequencing machine (ABI 370 PRISM™) |
1988 | Development of sequencing technique by serial addition of dNTPs | [31]
1990 | Paired-end sequencing | [32]
1996 | Pyrosequencing | [33]
1999 | PCR enrichment of DNA colonies in gels | [34]
2003 | Amplification of DNA fragments on beads using emulsion PCR | [35]
2003 | Application of zero-mode waveguide approach for single-molecule sequencing | [21]
2005 | Sequencing by ligation of bead-based DNA colonies | [36]
2005 | Four-color reversible terminator nucleotides | [37, 38]
2010 | Application of single-molecule sequencing for the detection of DNA methylation | [39]
2010 | Single-base resolution electron tunneling through a solid-state detector | [40]
2011 | Sequencing by proton detection using a semiconductor | [41]
2015 | Development of nanopore sequencing and release of MinION | https://nanoporetech.com/
2018 | Commercial release of PromethION (ONT) | https://nanoporetech.com/

2 Evolution of Sequencing Technologies

2.1 First-Generation Sequencing Technology

2.1.1 Maxam and Gilbert Method

Methodology

Allan Maxam and Walter Gilbert developed the chemical cleavage/degradation method of DNA sequencing in 1977 [9]. It is based on partial degradation of the labeled DNA molecule by chemicals (dimethyl sulfate, hydrazine, and hydrazine + NaCl) that cut the DNA at specific bases. The resulting DNA fragments are electrophoresed on a polyacrylamide gel and the sequence is identified by autoradiography. The method is carried out in four separate reactions: G > A (strong G, weak A), A > G (strong A, weak G), C + T, and C. Cleavage at purines requires methylation of the purines by dimethyl sulfate, followed by heat treatment to break the glycosidic bond and alkali treatment at 90 °C to release the sugar from the phosphodiester bond, resulting in a single-strand DNA break. Similarly, hydrazine displaces the pyrimidine bases in DNA and, when followed by treatment with piperidine, results in cleavage at cytosine and thymine [9] (Fig. 1).

Fig. 1 Maxam and Gilbert's DNA sequencing method

1. Guanine/Adenine cleavage (G > A): Dimethyl sulfate is used to methylate both purines, guanine and adenine, in the DNA at the N7 and N3 positions, respectively. Because guanine is methylated more readily than adenine, autoradiography of the resolved polyacrylamide gel shows darker bands corresponding to guanines and lighter bands corresponding to adenines.
2. Adenine-specific cleavage: The glycosidic bond of methylated adenosine is less stable than that of methylated guanosine. Hence, gentle treatment with dilute acid preferentially removes methylated adenines. The subsequent alkali treatment breaks the DNA, and autoradiography of the resolved polyacrylamide gel reveals dark and light bands corresponding to adenines and guanines, respectively.
3. Cytosine/Thymine cleavage: Hydrazine is used to displace the nitrogenous base at thymine and cytosine positions, leaving ribosylurea (an abasic site). Further treatment with 0.5 M piperidine cleaves the DNA by eliminating the products of the hydrazine reaction (ribosylurea and hydrazone) and the phosphates. Finally, autoradiography of the fragments resolved on a polyacrylamide gel gives a pattern of bands of similar intensity corresponding to cleavage at both cytosines and thymines.
4. Cytosine-specific cleavage: In the presence of 2 M NaCl, the reaction of hydrazine with thymine is preferentially suppressed, resulting in darker bands corresponding only to cytosines.

2.1.2 Sanger's Sequencing
In 1977, Frederick Sanger and colleagues developed a new DNA sequencing method similar to the earlier reported "Plus and Minus" method [28]. In this method, nucleotide base analogs such as 2′,3′-dideoxyribonucleoside triphosphates and arabinonucleoside triphosphates are added to the primer extension reaction mixture. These base analogs prevent DNA polymerase from adding a further deoxyribonucleotide triphosphate (dNTP) because their sugar lacks the 3′-OH group, and they therefore act as chain terminators. Hence, this method is also called the "dideoxy method," the "chain termination method," or, famously, "Sanger's sequencing" [10].
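The chain-termination idea can be illustrated with a small Python sketch. It is a toy model, not a description of any instrument software: it assumes an idealized reaction in which every position of the newly synthesized strand yields exactly one terminated product, and the template string and function names are hypothetical. Each ddNTP lane collects the lengths of the products that end in that base, and reading the lanes from the shortest to the longest product recovers the sequence of the synthesized strand.

```python
# Toy model of the four-lane chain-termination readout (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_lanes(template_3to5):
    """Return, per ddNTP lane, the lengths of the terminated extension products.

    The template is written 3'->5', so the strand built by the polymerase
    (5'->3') is simply its base-by-base complement.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        # A ddNTP incorporated at this position stops extension here, giving
        # a product of exactly this length in the lane of that ddNTP.
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the four lanes from the shortest product to the longest."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = sanger_lanes("TACGGTCA")  # hypothetical template, written 3'->5'
print(lanes)
print(read_gel(lanes))  # sequence of the newly synthesized strand, 5'->3'
```

In a real reaction the ddNTP:dNTP ratio determines how often termination happens at each position; the sketch ignores this and only shows why fragment length plus lane identity is enough to read the sequence.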
Methodology
This technique involves amplification of a labeled DNA fragment in the presence of a dideoxyribonucleotide triphosphate (ddNTP) and dNTPs in a specific ratio. The procedure is carried out in four separate reactions, one for each of the ddNTPs (ddATP, ddGTP, ddCTP, and ddTTP). Random incorporation of a ddNTP terminates chain elongation by the polymerase, resulting in amplicons of different sizes that each end with a specific ddNTP [42]. These amplification products are resolved on a polyacrylamide gel in separate lanes, followed by autoradiography to identify the nucleotide sequence of the template DNA molecule (Fig. 2). Sanger's sequencing was later improved in several ways. Laser detection of fluorescently labeled ddNTPs allowed a single reaction mixture to be used for all ddNTPs instead of four separate reactions, and replaced phosphorus or tritium radiolabeling [43]. Capillary-based electrophoresis provided greater resolving power than polyacrylamide gels or chromatography [44]. These improvements led to the automated, fluorescence-based Sanger sequencing machines referred to as first-generation sequencing technologies (Fig. 3). In 1986, Applied Biosystems released the first-ever automated Sanger's sequencing machine (ABI 370A PRISM™) for commercial purposes [12]. The read length of Sanger's sequencing ranges between 400 and 900 bp.

Fig. 2 Sanger's chain termination method of DNA sequencing

Applications
1. The first genome to be sequenced by Sanger's sequencing method was the phiX174 genome, with a size of 5374 bp [29]. This was followed by:
2. The bacteriophage λ genome of 48,501 bp in 1982 [11].
3. The Haemophilus influenzae genome of around 2 Mb in 1995 [45].
4. The yeast (Saccharomyces cerevisiae) genome of 12 Mb in 1996 [46].
5. The Caenorhabditis elegans genome of 100 Mb in 1998 [47].
6. Further, automated Sanger's sequencing was used in major genome sequencing projects:
(a) In 2000, the Drosophila melanogaster genome (~175 Mb) was sequenced [48].
(b) The Arabidopsis Genome Initiative (AGI) was started in 1996 and completed sequencing of the 115.4 Mb Arabidopsis thaliana genome in 2000 [15].
(c) The Human Genome Project (HGP) in 2002 (~3 Gb) [16].
(d) Sequencing of the 389 Mb rice genome was completed in 2005 by the International Rice Genome Sequencing Project (IRGSP) [49].
(e) The cultivated soybean (Glycine max) genome was sequenced in 2010 [50].

Fig. 3 Methodology of first-generation sequencing technology (automated Sanger's sequencing technique)

2.2 Second/Next-Generation Sequencing Technologies
Alternative methods were developed to overcome the limitations of first-generation sequencing. Immobilizing the DNA templates on a two-dimensional surface allows all reactions to occur in a single setup instead of in separate tubes for each reaction. Bacterial cloning was replaced by in vitro amplification of the template DNA to be sequenced. Fluorescently labeled nucleotides are detected directly, immediately after incorporation by the polymerase, rather than by detecting resolved DNA fragments. The sequencing process includes multiple cycles of polymerase-mediated addition of fluorescently labeled dNTPs followed by imaging, and is thus known as the sequencing-by-synthesis (SBS) approach. Together, these improvements contributed to the evolution of second-generation or next-generation sequencing (NGS) technologies [25]. Table 2 summarizes the different sequencing technologies with their features. The main features of NGS technologies are:

1. They generate millions of short reads (35–700 bp) through massively parallel sequencing.
2. They are very rapid (less time-consuming) compared to the first-generation sequencing methods.
3. The cost of sequencing is low and continues to decrease with the advancement of NGS technology.
4. The sequencing output is detected directly instead of by electrophoretic separation [51].
5. They require less DNA/RNA input (nanograms of starting material are sufficient).
6. They sequence whole genomes quickly.
7. They sequence target regions in depth.

The NGS techniques are broadly categorized into two groups based on the sequencing approach: (1) sequencing by ligation (SBL) and (2) sequencing-by-synthesis (SBS). In the SBL approach, a probe containing a known base with a specific fluorescent tag hybridizes to a template DNA fragment and is then ligated to an adjacent anchor oligonucleotide that is complementary to the adapter sequence. The color of the fluorophore indicates the known bases within the probe, which are complementary to specific positions in the template [20]. In the SBS approach, a polymerase incorporates labeled or unlabeled dNTPs into the growing strand complementary to the template DNA, and a signal such as fluorescence or a change in ionic concentration indicates the base incorporated into the elongating strand. The sequence of the template can thus be deduced by repeating the cycles of dNTP addition and signal detection. The SBS methods are further classified into either cyclic reversible termination (CRT) or single-nucleotide addition (SNA). The nucleotides used in CRT are similar to the fluorescently labeled ddNTPs of Sanger's sequencing, but in CRT the 3′-OH is reversibly blocked and the nucleotide carries a fluorophore, which restricts the addition of the next nucleotide. After the fluorescence signal is read, the fluorophore and the block are cleaved to allow the addition of the next nucleotide [52]. Based on these approaches, five major NGS platforms were released: the Roche/454 pyrosequencing platform, launched in 2005; the Illumina/Solexa Genome Analyzer in 2006; ABI/SOLiD in 2007; Ion Torrent semiconductor sequencing technology, released by Life Technologies in 2010; and nanoball sequencing by Complete Genomics/BGI in 2015. Here, we briefly describe these five sequencing platforms.
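The difference between the two SBS modes can be sketched in Python. This is a minimal illustration under simplifying assumptions (error-free chemistry, the template written in the order in which the polymerase reads it, and a hypothetical flow order "TACG"); the function names are not taken from any sequencing software. In the CRT function exactly one base is added and recorded per cycle, whereas in the SNA function one dNTP species is supplied per flow and the signal equals the number of identical bases incorporated in a row.

```python
# Toy comparison of cyclic reversible termination (CRT) and
# single-nucleotide addition (SNA); illustrative only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def crt_read(template, cycles):
    """CRT: one terminated, labeled nucleotide is added and imaged per cycle."""
    calls = []
    for position in range(min(cycles, len(template))):
        # The terminator halts extension after a single base; the base is
        # recorded, then the label and the block are removed for the next cycle.
        calls.append(COMPLEMENT[template[position]])
    return "".join(calls)


def sna_flowgram(template, flow_order="TACG", max_flows=400):
    """SNA: flow one dNTP species at a time; signal = bases incorporated."""
    position = 0
    flows = []
    for flow in range(max_flows):
        if position >= len(template):
            break
        nucleotide = flow_order[flow % len(flow_order)]
        signal = 0
        # Unterminated dNTPs extend through a whole homopolymer run at once.
        while position < len(template) and COMPLEMENT[template[position]] == nucleotide:
            signal += 1
            position += 1
        flows.append((nucleotide, signal))
    return flows


def flowgram_to_read(flows):
    """Convert (nucleotide, signal) pairs back into a base sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


template = "AAGCT"  # hypothetical template, in the order the polymerase reads it
print(crt_read(template, cycles=3))
print(sna_flowgram(template))
print(flowgram_to_read(sna_flowgram(template)))
```

The sketch also hints at why homopolymer runs are harder for SNA chemistries: the run length must be inferred from signal intensity within a single flow, whereas CRT reads each base in its own cycle.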
Table 2 A summary of sequencing technologies with their characteristic features [20, 26]

Platform | Instrument | Reads per run | Avg. read length (bp) | Read type | Error type | Error rate (%) | Data generated per run | Run time | Year of introduction/release
First-generation sequencing
ABI/Sanger | 370 Prism | | | | | | | | 1986
ABI/Sanger | 3730xl | 96 | 400–900 | SE | NA | 0.3 | 0.69–2.1 Mb | | 2002
Second-generation sequencing
Roche 454 | GS20 | 200 | 100 | SE, PE | InDel | 1 | 0.02 Gb | NA | 2005
Roche 454 | GS FLX | 400 | 250 | SE, PE | InDel | 1 | 0.1 Gb | NA | 2007
Roche 454 | GS FLX Titanium | 1 M | 450 | SE, PE | InDel | 1 | 0.45 Gb | 10 h | 2009
Roche 454 | GS FLX Titanium+ | 1 M | 700 | SE, PE | InDel | 1 | 0.7 Gb | 23 h | 2011
Roche 454 | GS Junior | 0.1 M | 400 | SE, PE | InDel | 1 | 35 Mb | 10 h | 2010
Roche 454 | GS Junior+ | 0.1 M | 700 | SE, PE | InDel | 1 | 70 Mb | 18 h | 2014
Solexa | Genome Analyzer | | | | | | 1 Gb | | 2006
Illumina | MiniSeq | 14–50 M | 75–150 | SE, PE | Mismatch | 1 | 1.6–7.5 Gb | 7–24 h | 2013
Illumina | MiSeq | 12–50 M | 25–300 | SE, PE | Mismatch | 0.1 | 0.54–15 Gb | 4–56 h | 2011
Illumina | NextSeq 500/550 | 260–800 M | 75–150 | SE, PE | Mismatch | 1 | 16–120 Gb | 11–29 h | 2014
Illumina | NextSeq 1000/2000 | 0.4–1.1 B | 25–150 | SE, PE | Mismatch | 0.1 | 40–330 Gb | 11–48 h | 2020
Illumina | HiSeq 2500 | 0.3–4 B | 36–250 | SE, PE | Mismatch | 0.1 | 9–500 Gb | |
Illumina | HiSeq 3000/4000 | 5 B | 50–150 | SE, PE | Mismatch | 0.1 | Up to 1.5 Tb | |
Illumina | HiSeq X | 3 B | 150 | SE, PE | Mismatch | 0.1 | 900 Gb | | 2014
Illumina | NovaSeq | 0.65–20 B | 35–250 | SE, PE | Mismatch | 0.1 | 65–3000 Gb | | 2017
ABI | SOLiD 5500 Wildfire | 700 M | 50–75 | SE | AT bias | 0.1 | 80–160 Gb | 6 days | 2011
ABI | SOLiD 5500xl | 1.4 B | 50–75 | SE | AT bias | 0.1 | 160–320 Gb | 10 days | 2013
Ion Torrent | PGM 314 chip v2 | 0.4–0.55 M | 200–400 | SE | InDel | 1 | 30–100 Mb | 3.7–23 h | 2011
Ion Torrent | PGM 316 chip v2 | 2–3 M | 200–400 | SE | InDel | 1 | 0.6–1 Gb | 3–4.9 h | 2011
Ion Torrent | PGM 318 chip v2 | 4–5.5 M | 200–400 | SE | InDel | 1 | 0.6–2 Gb | 4–7.3 h | 2013
Ion Torrent | Ion Proton | 60–80 M | 200 | SE | InDel | 1 | 10 Gb | 2–4 h | 2012
Ion Torrent | Ion S5/S5XL 520 | 3–5 M | 200–400 | SE | InDel | 1 | 0.6–2 Gb | 2.5–4 h | 2015
Ion Torrent | Ion S5/S5XL 530 | 15–20 M | 200–400 | SE | InDel | 1 | 3–8 Gb | 2.5–4 h | 2015
Ion Torrent | Ion S5/S5XL 540 | 60–80 M | 200 | SE | InDel | 1 | 10–15 Gb | 25 h | 2015
BGI | BGISEQ-500 FCS | NA | 50–100 | SE, PE | AT bias | 0.1 | 8–40 Gb | 24 h | 2015
BGI | BGISEQ-500 FCL | NA | 50–100 | SE, PE | AT bias | 0.1 | 40–200 Gb | 24 h | 2015
Third-generation sequencing
PacBio | RS II | 55,000 | 10–15 kb | NA | InDel | 12 | 0.5–1 Gb | 4 h | 2014
PacBio | Sequel | 350,000 | 8–12 kb | NA | NA | NA | 3.5–7 Gb | 0.5–6 h | 2016
Oxford Nanopore | MinION MK | 100,000 | Up to 200 kb | 1D, 2D | InDel/Mismatch | 12 | 1.5 Gb | 48 h | 2015
1 kb to 2 Mb [69].
2.3.1 Helicos Biosciences/HeliScope Genetic Analysis System
Helicos Biosciences released the first commercial SMS-based sequencing technology, the HeliScope Genetic Analysis System, which is similar in principle to the Illumina NGS technology except that it omits bridge amplification of the template DNA [70]. In this method, the DNA template is attached to a planar surface and fluorescent reversible terminator dNTPs (virtual terminators) [71] are added, so that a single base is incorporated at a time and the resulting fluorescence image is recorded. The fluorophore is then cleaved to allow the addition of a dNTP in the next cycle. It was the first technology to sequence non-amplified DNA, thus avoiding the biases and errors introduced by PCR amplification. The technology was not successful because of its shorter read length and longer run time.
2.3.2 Pacific Biosciences/PacBio SMRT™ Sequencing

In 2011, Pacific Biosciences commercially released the PacBio RS sequencer, based on SMRT sequencing technology [72], with average read lengths of up to 1.5 kb and a high average error rate of 13%. The sequencer called "Sequel" is based on improved sequencing chemistry, which has increased average read lengths and output per run more than tenfold and 100-fold, respectively. The Sequel sequencer can produce long reads of 10–100 kb owing to the circular templates called "SMRT-bell," while the error rate has remained the same (13%) for a "single pass" [73].

Principle
The PacBio sequencers use SMRT sequencing technology, in which polymerization occurs in microwells covered with a metallic film containing tiny holes called zero-mode waveguides (ZMWs) [21]. ZMWs work on the principle that light decays rapidly when it passes through a hole whose diameter is smaller than its wavelength. Because the resulting zone of laser excitation is very small, individual fluorophore molecules can be visualized at the bottom of the ZMW even against the background of neighboring molecules in solution.
Methodology
Library preparation involves ligation of hairpin adapter sequences to both ends of the template dsDNA to make a closed circular ssDNA template called a SMRT-bell [74]. The library is then transferred onto a specialized flow cell containing zero-mode waveguides (ZMWs). Each ZMW contains a single polymerase molecule immobilized at its transparent bottom, which can bind to either of the hairpin adapters of the input SMRT-bell template and initiate replication by incorporating fluorescently labeled nucleotides; the fluorescent signal is recorded after excitation of the fluorophore by a laser (Fig. 10a). The camera system records the color and duration of the fluorescence signal in real time (in the form of a movie). The fluorophore is then cleaved and washed away, allowing the addition of the next base [72]. As the SMRT-bell DNA library is a closed circle, replication of one strand of the template by the polymerase continues onto the complementary strand through the hairpin adapters. If the DNA polymerase remains active long enough, both strands of the SMRT-bell template can be sequenced multiple times (called "passes"), generating a single continuous long read (CLR). The adapter sequences in the CLR can then be identified and removed to generate "sub-reads." The consensus of the multiple sub-reads generated from a single ZMW is known as a circular consensus sequence (CCS); it has higher accuracy, which is expressed as a quality value (QV). The quality of the CCS increases with the number of passes. It was reported that at 25 passes PacBio SMRT sequencing can reach an accuracy of 99.99% (QV40), which is similar to that of Illumina sequencing, and that the accuracy can reach 99.999% (QV50) with 50 passes. There is a trade-off between the length of the template molecule and sequencing accuracy, because fewer passes are generated for longer molecules [68].
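Two ideas in this paragraph lend themselves to a short Python sketch: the Phred-style relation between accuracy and quality value, QV = -10 log10(1 - accuracy), and the way a consensus over several sub-reads suppresses random single-pass errors. The sketch below is a simplified assumption, not the PacBio CCS algorithm: it uses a per-column majority vote over hypothetical sub-reads of equal length and ignores insertions and deletions, which a real consensus caller must handle by alignment.

```python
import math
from collections import Counter


def accuracy_to_qv(accuracy):
    """Phred-scaled quality value for a given accuracy (0 <= accuracy < 1)."""
    return -10 * math.log10(1 - accuracy)


def qv_to_accuracy(qv):
    """Accuracy implied by a Phred-scaled quality value."""
    return 1 - 10 ** (-qv / 10)


def majority_consensus(subreads):
    """Toy consensus: most frequent base per column of equal-length sub-reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


# Hypothetical sub-reads from one ZMW, each carrying a different random error.
subreads = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(majority_consensus(subreads))
print(round(accuracy_to_qv(0.99), 2))
print(qv_to_accuracy(30))
```

Because each pass tends to make its errors at different positions, the vote at every column is dominated by the correct base once enough passes are available, which is the intuition behind the rise in CCS quality with the number of passes.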
Fig. 10 Third-generation sequencing technologies. (a) PacBio-SMRT sequencing, (b) Oxford nanopore sequencing [20]

PacBio sequencing offers several advantages beyond SMS, such as the detection of modified bases and the production of very long reads of 10–100 kb, which are useful for de novo genome assemblies [20].

2.3.3 Oxford Nanopore Technology/Nanopore Sequencing
The basic principle of nanopore sequencing was developed long before second-generation sequencing technology emerged. In 2014, Oxford Nanopore Technologies (ONT) commercially released the first nanopore sequencer, the MinION [75, 76]. A great deal of research is in progress on using nonbiological, solid-state technology to generate suitable nanopores with the ability to sequence double-stranded DNA molecules [77–79].
Principle
ONT sequencing is based on the earlier finding that a single-stranded RNA or DNA molecule can be translocated across a lipid bilayer through large ion channels such as α-hemolysin by electrophoresis. The movement of the nucleic acid through the ion channel restricts ion flow, resulting in a decrease in the current for a duration proportional to the length of the nucleic acid [80]. A nanopore sequencer consists of a flow cell in which two compartments filled with an ionic solution are separated by a lipid bilayer containing 2048 (in MinION) or 12,000 (in PromethION) individual nanopores.
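The relation between translocation and current blockade can be sketched with a few lines of Python. The example is purely illustrative: the current values, the sampling rate, and the 80% threshold are hypothetical, and the function is not part of any ONT software. It scans a sampled current trace, treats every stretch in which the current falls below a fraction of the open-pore level as one translocation event, and reports how long each event lasted.

```python
def blockade_durations(current_pa, sample_rate_hz, open_pore_pa, threshold=0.8):
    """Return the duration in seconds of each current blockade in a trace.

    A blockade is a run of consecutive samples whose current is below
    threshold * open_pore_pa (all numeric choices here are hypothetical).
    """
    cutoff = threshold * open_pore_pa
    durations = []
    run = 0  # number of consecutive samples below the cutoff
    for value in current_pa:
        if value < cutoff:
            run += 1
        elif run:
            durations.append(run / sample_rate_hz)
            run = 0
    if run:  # the trace ended while the pore was still blocked
        durations.append(run / sample_rate_hz)
    return durations


# Hypothetical trace in picoamperes, sampled 1000 times per second.
trace = [100, 100, 60, 62, 58, 100, 100, 55, 57, 100]
print(blockade_durations(trace, sample_rate_hz=1000, open_pore_pa=100))
```

Under the principle stated above, the longer event in such a trace would correspond to the longer nucleic acid molecule.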
Methodology
In nanopore sequencing, high molecular weight DNA is either fragmented or used directly for library preparation. The template DNA is then end-repaired so that adapters can be ligated. These adapters are DNA–protein complexes that interact with the polymerase, helicase, or motor protein attached to the nanopore and ensure the translocation of the DNA through the pore by a ratcheting mechanism [81]. When a DNA or RNA molecule is translocated through a nanopore, a change in the ionic current is observed and attributed to the nucleotide sequence (Fig. 10b). The change in current across the nanopore is read by a sensor several thousand times per second and is displayed as a squiggle plot. Finally, the output data are processed by the MinKNOW software and analyzed to determine the nucleotide sequence of the template DNA [68]. In this technology, a hairpin adapter is ligated to one end of the template dsDNA. Sequencing starts with the direct strand, generating the "template read," continues through the hairpin structure, and then proceeds along the complementary strand, generating the "complement read"; individually, these reads are called 1D reads. The consensus sequence generated by combining the template and complement reads is called a two-directional or 2D read [82, 83]. In nanopore sequencing, the read length depends on the quality and length of the template DNA fragment and is not limited by the technology. Therefore, if high-quality, very long DNA fragments are provided, extremely long reads of up to 1 Mb can be generated [23]. The limitations of ONT include the high sequencing error rate (~13%) [73] and the inability to sequence the same strand multiple times, as is possible with PacBio SMRT sequencing.
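The relation between the two 1D reads and the 2D read can be illustrated with a toy Python sketch. It rests on strong simplifying assumptions that are not taken from ONT software: both reads are already base-called, have the same length, and contain no insertions or deletions, and the read strings are hypothetical. The complement read is reverse-complemented so that it lies on the template read, and only bases on which both reads agree are kept; disagreements are masked with "N".

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence given as A/C/G/T characters."""
    return seq.translate(_COMPLEMENT)[::-1]


def two_direction_consensus(template_read, complement_read):
    """Toy 2D call: keep bases supported by both 1D reads, mask the rest."""
    projected = reverse_complement(complement_read)
    if len(projected) != len(template_read):
        raise ValueError("toy example expects reads of equal length (no indels)")
    return "".join(
        t if t == c else "N" for t, c in zip(template_read, projected)
    )


template_read = "ACGGTTAC"    # hypothetical 1D read of the direct strand
complement_read = "GTAACCGA"  # hypothetical 1D read of the other strand, one error
print(two_direction_consensus(template_read, complement_read))
```

A real 2D consensus has to align the two reads first and weigh them by signal quality; the sketch only shows why reading both strands of the same molecule gives a second, independent observation of every position.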
2.4 Synthetic Long-Read Sequencing
Illumina and 10x Genomics have developed synthetic long-read (SLR) sequencing as an alternative to the long-read sequencing of PacBio and ONT. Currently, two SLR systems are available: Illumina SLR sequencing (formerly "Moleculo") and the 10x Genomics gel bead-in-emulsion (GEM) based sequencing technology. Starting from a collection of very long, high-quality DNA fragments, the SLR technology partitions a few fragments into each microwell or emulsion droplet (micelle). In Moleculo sequencing, the templates in each partition are randomly fragmented and barcoded adapters are added to the fragments, whereas in the 10x Genomics system the template DNA fragments are randomly amplified with barcoded primers. The resulting barcoded DNA fragments are used for short-read library preparation and sequencing on the Illumina platform. Finally, the short reads are assembled into large contigs. Because reads carrying the same barcode must derive from the same original large fragment, the barcode sequences in the assembled contigs can be used to detect misassemblies [84].
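The barcode logic described above can be sketched in a few lines of Python. This is an illustrative sketch only: the read representation as `(barcode, sequence)` tuples, the barcode labels, and the function name are assumptions made for the example and do not correspond to any vendor software.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Collect short reads that share a barcode.

    reads: iterable of (barcode, sequence) tuples (hypothetical format).
    Reads with the same barcode are assumed to come from the same
    partition and therefore from the same few long DNA fragments.
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Toy input: three short reads from two partitions.
reads = [("BC01", "ACGT"), ("BC02", "TTGA"), ("BC01", "GTCA")]
groups = group_reads_by_barcode(reads)
print(groups)
```

Each group of reads would then be assembled separately to reconstruct the long fragment(s) of its partition, which is a much simpler problem than assembling all short reads at once.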
Anupam Singh et al.
Fig. 11 Synthetic long-read sequencing techniques: (a) Illumina Moleculo sequencing, (b) 10x Genomics linked-read sequencing [20]
2.4.1 Illumina SLR Sequencing
In Illumina SLR/Moleculo sequencing technology, high-quality genomic DNA is sheared into 8–10 kb fragments, and barcoded adapters are ligated to these fragments so that the ends of assembled contigs can be identified during downstream processing (Fig. 11a). The barcoded fragments are then transferred into a microtiter plate at ~3000 fragments per well, with each well containing a single barcode, where they undergo further fragmentation and barcoded adapter ligation. The DNA fragments from the different wells are then pooled and processed for standard Illumina short-read sequencing; the output data are processed and assembled to reconstruct the original long fragments [84].
2.4.2 10x Genomics SLR Sequencing
In 10x Genomics GemCode sequencing, high molecular weight DNA molecules of ~100 kb are captured in micelles of an emulsion (GEMs) that contain gel beads with specific barcoded adapters (Fig. 11b). The 10x Genomics method requires only a very small amount of starting DNA (~1 ng). In each GEM, the gel bead dissolves and releases the barcoded adapter molecules. Next, smaller fragments of template DNA are amplified from the original large DNA molecule; each fragment carries a barcode identifying the source GEM. These barcoded fragments are pooled, and the library is prepared for sequencing on a standard Illumina short-read sequencer. The resulting short reads are assembled to form a series of anchored fragments of up to ~50 kb [20, 85].
3 Alternative Approaches
3.1 Hi-C Sequencing
The high degree of packaging and organization of chromatin, together with its dynamic structure, drives various aspects of gene regulation, chromosome morphogenesis, genetic instability, and gene inheritance and transmission [86]. The spatial organization of chromatin can be studied using various approaches, which can be broadly classified into microscopic and molecular assays. Among the molecular assays, chromosome conformation capture (3C) is a major method, and many other methods have been developed from it (Fig. 12). In special variants of the 3C technology (ChIP-loop, ChIP-Seq, and ChIA-PET), immunoprecipitation is used to examine the role of a protein factor. More recently, methods such as 4C, 5C, and Hi-C have been established that exploit next-generation sequencing (NGS) to examine the 3C ligation product library on a large scale [86].
Fig. 12 Overview of Hi–C technology. (Adapted from de Wit E and de Laat W 2012, [87])
Methodology
The 3C method measures the population-averaged frequency with which two or more DNA fragments are linked in three-dimensional space, and thereby quantifies the interactions between genomic loci that lie close together in 3D space. The interacting loci are cross-linked with formaldehyde, followed by chromatin solubilization and fragmentation with restriction enzymes. Interacting fragments are then ligated together and purified to produce a genomic library of chimeric DNA molecules (Fig. 12) [86].
Advantages
1. Investigation of histone modifications (low-resolution Hi-C data).
2. Analysis of transcription factor interactions (high-resolution Hi-C).
3. Deduction of the polymer structure of chromatin.
4. Comparative study of chromosome organization.
3.2 Optical Mapping
Optical mapping is an approach for creating ordered, genome-wide restriction maps from single DNA molecules. In this technique, large DNA molecules of 300 kb to a few megabases (Mb) are subjected to restriction digestion on a glass slide surface and visualized under a fluorescence microscope. DNA fragments stained with an intercalating dye are visualized and sized from the intensity of their fluorescence. In this way, well-ordered restriction maps (Rmaps) are produced from digital images of fully and partially digested molecules (Fig. 13). These Rmaps are used to construct genome-wide physical restriction maps, which provide insights into long-range genome structure and genome variation [88, 89]. Most present-day optical mapping technologies were developed by two companies, OpGen and Bionano Genomics. The Argus technology was released by OpGen, while Irys and Saphyr are from Bionano Genomics. Later, in 2011, BGI acquired the Argus platform [90].
Advantages
1. High-resolution optical maps of DNA serve as a scaffold for the accurate alignment of contigs generated by shotgun sequencing.
2. The physical maps act as a scaffold to guide and/or validate DNA sequencing-based genome assemblies.
3. Characterization of whole genomes (e.g., black pepper, cotton) [91, 92].
4. Identification of structural polymorphism in genomes (e.g., maize, clover) [93, 94].
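The idea of an ordered restriction map can be illustrated computationally. The sketch below performs an in silico digest of a sequence and returns the ordered fragment lengths, which is the kind of pattern an Rmap records and against which sequence contigs can be compared. It is a simplified illustration: the function name, the `cut_offset` parameter, and the toy sequence are assumptions made for the example, and measurement error in fragment sizing is ignored.

```python
def in_silico_rmap(sequence, site, cut_offset=0):
    """Return ordered fragment lengths after cutting at every site.

    sequence:   DNA string (hypothetical contig).
    site:       recognition sequence of the restriction enzyme.
    cut_offset: position of the cut within the site (assumed 0 here,
                i.e. the cut is placed at the start of the site).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Keep the fragment order; drop empty fragments at the ends.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy contig with two GAATTC sites.
contig = "AAGAATTCTTTGAATTCGG"
fragments = in_silico_rmap(contig, "GAATTC")
print(fragments)  # ordered fragment lengths
```

Because the order of fragment sizes is preserved, such a pattern can be matched against an experimentally derived Rmap to place or validate a contig.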
Fig. 13 Steps in optical mapping based genome assembly
4 Cost of Sequencing
According to Moore's law, the capacity of computing hardware doubles roughly every 2 years, and NGS technology is no exception to this trend. The HGP, initiated in 1990, took 10 years to sequence the 2.9 Gb human genome [16] and cost about US$ 2.7 billion. By 2008, sequencing a single human genome took about 2 months on the Roche 454 sequencing platform and cost US$ 1 million [95]. Advances in computational resources helped to increase the throughput of NGS technologies and reduced the cost of sequencing many-fold. The cost of sequencing was US$ 5292 per Mb in 2001 and had fallen to US$ 0.01 per Mb by 2020, mainly owing to the advancement of NGS technologies (https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data) (Fig. 14).
Fig. 14 Graphical representation of change in sequencing cost and mean read length during 2001–2020
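The per-megabase costs quoted in this section can be turned into an average halving time, which allows a direct comparison with the 2-year doubling period of Moore's law. The short calculation below is a sketch that assumes a constant exponential decline between 2001 and 2020, which is a simplification of the real cost curve; the function name is hypothetical.

```python
import math


def halving_time_years(cost_start, cost_end, years):
    """Average time for the cost to halve, assuming exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings


# Values quoted in the text: US$ 5292 per Mb (2001), US$ 0.01 per Mb (2020).
fold_reduction = 5292 / 0.01
t_half = halving_time_years(5292, 0.01, 2020 - 2001)
print(f"fold reduction: {fold_reduction:,.0f}")
print(f"average halving time: {t_half:.2f} years")
```

Under this assumption the cost per megabase halved roughly once per year on average, that is, faster than the 2-year period associated with Moore's law.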
5 Applications of NGS Technology in Crop Improvement
5.1 De Novo Sequencing of Crop Plant Genomes
As many as 748 land plants, with genome sizes ranging from 128.32 kb to 27.6 Gb, have been sequenced and/or published (https://www.ncbi.nlm.nih.gov/genome/browse/overview). The Arabidopsis thaliana genome was the first plant genome to be sequenced; it was completed in 2000 by the Arabidopsis Genome Initiative (AGI) [15]. In 2005, the rice genome was sequenced by the International Rice Genome Sequencing Project (IRGSP) [49]. Both projects relied on the BAC-by-BAC strategy using Sanger sequencing platforms. After 2005, advances in NGS technology lowered the cost and time of sequencing and enabled many scientific groups to initiate genome sequencing projects for many other crop plants, such as sorghum [96], soybean [50], tomato [97], grape [98], maize [99], apple [100], and pigeonpea [101, 102] (Table 3). Further development of long-read sequencing technology facilitated the sequencing of large, repeat-rich, complex (polyploid) crop plant genomes. The large (16.24 Gb) and repeat-rich genome of garlic (Allium sativum) was sequenced and assembled using Illumina (NGS), 10x Genomics (SLR), PacBio (SMRT), ONT nanopore sequencing, and Hi-C technologies [115].
Several strategies have been adopted, in combination with various sequencing technologies, for the sequencing and assembly of large, complex crop plant genomes. One such approach is to reduce genome complexity by sequencing DNA collected from haploid tissues (gametophyte) or from doubled haploids generated for sequencing. This approach can be used for sequencing the genomes of trees, whose high heterozygosity, long generation time, and large size make it challenging to develop a highly homozygous inbred population. It was used to sequence the genome of loblolly pine (Pinus taeda L.), an important tree for the wood industry; haploid gametophyte tissue was used to generate the 22 Gb high-quality genome assembly [142]. Similarly, to reduce the genome complexity of allotetraploid upland cotton (Gossypium hirsutum), the genome of an allohaploid developed by pollen culture was sequenced to a depth of 245× coverage with Illumina short-read sequencing. The contig sequences were then aligned against a dense genetic map to generate a highly contiguous genome assembly, which covered 96% of the estimated 2.5 Gb genome. Fluorescence in situ hybridization was further used to confirm a successful allotetraploid genome assembly [140]. In the second approach, the genomes of the diploid progenitor species are sequenced first, followed by that of the polyploid plant itself.
The progenitor genomes are then used to sort the sub-genomes of the polyploid plant. For example, allopolyploid Brassica napus originated from two diploid progenitor species that underwent a genome triplication event long before the formation of rapeseed. For the genome assembly of Brassica napus, the genomes of the diploid progenitors were sequenced first, and their assemblies were used to sort the Brassica napus genome into the respective sub-genomes. However, many sequence scaffolds showed ambiguous assignment to homologous groups owing to homologous recombination between the two sub-genomes and gene loss during evolution [140]. A similar approach was used to sequence the polyploid genome of Indian mustard (Brassica juncea), which was sequenced using a combination of short- and long-read sequencing. Optical maps from BioNano Genomics were used for scaffolding the assembled contigs [140]. Finally, a high-quality Brassica juncea genome was assembled, representing the A and B sub-genomes of 402 Mb and 547 Mb, respectively [118]. The allo-octoploid (2n = 8x = 56) cultivated strawberry (Fragaria × ananassa) originated from interspecific hybridization between two wild octoploid progenitors, Fragaria virginiana and Fragaria chiloensis, which are themselves the products of fusion and interaction of four diploid progenitors through natural hybridization. The woodland strawberry (Fragaria vesca) is believed to be one of the four diploid progenitors (2n = 2x = 14) of cultivated strawberry. The genome of woodland strawberry was sequenced in 2010 using NGS technology [118]. The cultivated strawberry genome was sequenced in 2019 using more advanced sequencing technologies, which facilitated the identification and confirmation of all four diploid progenitors [118]. With a similar approach, the large allotetraploid genome of peanut (Arachis hypogaea) was assembled. A. hypogaea originated by natural hybridization between two diploid progenitors, A. duranensis and A. ipaensis. The essentially complete genome assemblies of A. duranensis (1.2 Gb) and A. ipaensis (1.5 Gb) were shown to align directly with the genetic map of A. hypogaea [116]. Illumina SLR sequencing of the A. hypogaea genome showed 98–99% similarity with the diploid progenitor genomes, with variations arising from homologous exchange between sub-genomes [117].
The third approach involves reducing polyploid genome complexity by sequencing DNA from purified chromosomes. This approach was successfully applied to sequence the large allohexaploid genome of bread wheat (Triticum aestivum L.). In the bread wheat genome, three sub-genomes, A, B, and D, are present independently. Wheat can tolerate the loss of an entire chromosome, so a radiation mutant population with aneuploidy was generated, and the different chromosomes were sorted using flow cytometry [143]. Sequencing of the DNA from flow-sorted chromosomal segments enabled the precise allocation of most genes to the respective A, B, and D sub-genomes [141].
Table 3 List of some commercially important plants with sequenced genomes
S. No | Common name | Species | Genotype | Estimated genome size (Mb) | Reference assembly size (Mb) | Chromosomes | Organelles | Assemblies | References
1 | Apple | Malus domestica | Golden delicious | 742.3 | 703.358 | 17 | 1 | 3 | [100]
2 | Banana | Musa acuminata | DH-Pahang | 523 | 472.231 | 11 | – | 1 | [103]
3 | Barley | Hordeum vulgare | – | 5100 | 4257.71 | 7 | – | 23 | [104]
4 | Barrel medic | Medicago truncatula | Mt3.5 | 500 | 412.924 | 8 | 1 | 6 | [105]
5 | Black pepper | Piper nigrum | Lampung Dasun Kecil | 761.74 | 761.22 | 26 | – | 1 | [91]
6 | Black gram | Vigna mungo | Chai Nat 80 (CN80) | 574 | 498.912 | 11 | – | 1 | [106]
7 | Field mustard, rape mustard | Brassica rapa | Chiifu-401-42 | 485 | 352.983 | 10 | 1 | 4 | [107]
8 | Chickpea | Cicer arietinum | CDC Frontier | 738.09 | 530.894 | 8 | 1 | 5 | [108]
9 | Chillies/pepper | Capsicum annuum | CM334 | 3500 | 2935.88 | 12 | 2 | 6 | [109]
10 | Cocoa tree | Theobroma cacao | B97-61/B2 | 430 | 324.88 | 10 | 1 | 2 | [110]
11 | Coconut | Cocos nucifera | Hainan Tall | 2720 | 2202.46 | 16 | – | 3 | [111]
12 | Coffee | Coffea arabica | CCC135-36 | 1300 | 1094.45 | 22 | 1 | 1 | https://www.ncbi.nlm.nih.gov/genome/?term=coffea+arabica
13 | Common bean | Phaseolus vulgaris | G19833 | 587 | 521.077 | 11 | – | 2 | [112]
14 | Cucumber | Cucumis sativus | IL 9930 | 350 | 226.641 | 7 | 4 | 3 | [113]
15 | Melon | Cucumis melo | DHL92 | 450 | 374.928 | 12 | 1 | 7 | [114]
16 | Garlic | Allium sativum | Ershuizao | 16,900 | 16,559.4 | 7 | – | 1 | [115]
17 | Groundnut (CWR) | Arachis duranensis | V14167 | 1250 | 1084.26 | 10 | – | 2 | [116]
18 | Groundnut (CWR) | Arachis ipaensis | K30076 | 1560 | 1353.5 | 10 | – | 1 | [116]
19 | Groundnut/peanut | Arachis hypogaea | Tifrunner | 2700 | 2557.07 | 20 | 1 | 3 | [117]
20 | Indian mustard | Brassica juncea | T84-66 | 922 | 954.861 | 18 | – | 1 | [118]
21 | Lotus | Lotus japonicus | Miyakojima MG-20 | 472 | 394.455 | – | – | 1 | [119]
22 | Maize | Zea mays | B73 | 2320 | 2182.79 | 10 | 2 | 42 | [99]
23 | Mango | Mangifera indica | Alphonso | 360 | 392.983 | 20 | 2 | 1 | [120]
24 | Mung bean | Vigna radiata | VC1973A | 579 | 463.638 | 11 | 2 | 3 | [121]
25 | Neem tree | Azadirachta indica | – | 364 | 261.458 | – | – | 1 | [122]
26 | Opium poppy | Papaver somniferum | High Noscapine 1 | 2870 | 2715.53 | 11 | 1 | 3 | [123]
27 | Pea | Pisum sativum | Caméor | 4450 | 4275.93 | – | – | 1 | [124]
28 | Peach | Prunus persica | Lovell | 265 | 227.569 | 8 | 1 | 4 | [125]
29 | Date palm | Phoenix dactylifera | Khalas | 671 | 556.481 | – | 2 | 4 | [126]
30 | Pineapple | Ananas comosus | F153 | 526 | 382.056 | 25 | 1 | 3 | [127]
31 | Pigeonpea | Cajanus cajan | Asha/ICPL 87119 | 833 | 592.971 | 11 | 1 | 2 | [101, 102]
32 | Pistachio | Pistacia vera | Batoury | 600 | 671.28 | – | 1 | 1 | [128]
33 | Potato | Solanum tuberosum | DM1-3-516 R44 | 844 | 705.934 | – | 1 | 12 | [129]
34 | Rapeseed | Brassica napus | Darmor-bzh | 1130 | 976.191 | 19 | 2 | 3 | [130]
35 | Rice | Oryza sativa | Nipponbare | 389 | 374.423 | 12 | – | 81 | [17]
36 | Rose | Rosa chinensis | HapOB | 532 | 513.854 | 7 | 2 | 1 | [131]
37 | Rubber tree | Hevea brasiliensis | cultivar Reyan7-33-97 | 1460 | 1373.53 | – | 1 | 5 | [132]
38 | Sorghum | Sorghum bicolor | BTx623 | 818 | 709.345 | 10 | 2 | 6 | [96]
39 | Soybean | Glycine max | Williams 82 | 1150 | 979.046 | 20 | 2 | 12 | [50]
40 | Woodland strawberry | Fragaria vesca | Hawaii 4 | 240 | 214.373 | 7 | 1 | 1 | [133]
41 | Cultivated strawberry | Fragaria × ananassa | Camarosa | 813.4 | 697.762 | 28 | – | 2 | [134]
42 | Sweet orange | Citrus sinensis | Valencia | 367 | 327.83 | 18 | 1 | 2 | [135]
43 | Sugar cane | Saccharum spontaneum | AP85-441 | 3360 | 3133.29 | 32 | – | 2 | [136]
44 | Tea | Camellia sinensis | Shuchazao | 2980 | 3113.46 | 15 | – | 2 | [137]
45 | Thale cress | Arabidopsis thaliana | Ecotype: Columbia | 135 | 119.669 | 5 | 2 | 101 | [138]
46 | Tobacco | Nicotiana tabacum | Flue-cured, Burley, Oriental | 4460 | 3643.47 | – | 2 | 4 | [139]
47 | Tomato | Solanum lycopersicum | Heinz 1706 | 900 | 828.349 | 12 | 2 | 6 | [97]
48 | Upland cotton | Gossypium hirsutum | TM-1 | 2250 | 2189.14 | 26 | 2 | 4 | [140]
49 | Wheat | Triticum aestivum | Chinese Spring | 1576 | 15,418.8 | 21 | 1 | 35 | [141]
50 | Wine grape | Vitis vinifera | PN40024 | 475 | 486.197 | 19 | 2 | 8 | [98]
5.2 Re-sequencing of Related Species/Other Plant Genetic Resources
The availability of high-quality reference genome assemblies of crop plants has accelerated evolutionary studies aimed at understanding genome complexity and the evolutionary relationships between genotypes and species [144]. NGS technology has enabled the identification of structural variations in chromosomes that play a significant role in plant evolution. By sequencing sets of related genotypes within a species, individually or pooled, and comparing them with the reference genome, genome-wide variations can be identified, such as microsatellites or simple sequence repeats (SSRs), single-nucleotide polymorphisms (SNPs), insertions/deletions (InDels), and structural variations including translocations, chromosome fusions, and copy number variations (CNVs). Further, these variations can be converted into genetic markers, mainly SSRs and SNPs. Three different approaches make use of re-sequencing: (1) genome-wide identification of structural variations, (2) population genomics, that is, genotyping by sequencing (GBS) and genome-wide association studies (GWAS), and (3) pan-genome analysis.
5.2.1 Genome-Wide Identification of Structural Variations
The first step toward the re-sequencing of plant genomes was the initiation of the 1001 Genomes project (http://www.1001genomes.org) [145], which aimed to unveil the genomic variation in Arabidopsis collections. The 3000 rice genomes (3K RG) project was initiated to identify the genomic variation and population structure of international rice collections [146]. The 3K RG project reiterated the presence of five groups of rice cultivars and identified 29 million single-nucleotide polymorphisms, 2.4 million small InDels, and over 90,000 structural variations [147]. Similar re-sequencing studies have been conducted in many other crop plants, such as maize [148], grape [149], soybean [150, 151], pigeonpea [152], chickpea [153], pearl millet [154], tomato [155], rapeseed [156], and poplar [157]. For example, sequencing the genome of the Arabidopsis reference accession (Col-0) along with two other accessions (Bur-0 and Tsu-1), and subsequent mapping against the reference genome, led to the identification of 823,325 SNPs and 79,961 small InDels [158]. In 2010, Kim et al. re-sequenced Glycine soja, a wild relative of cultivated soybean, identified 2.5 million SNPs and 196,000 InDels (−35 to +14 bp), and estimated the divergence time of G. max and G. soja at 0.267 ± 0.03 MYA, which predates soybean domestication [159]. A similar study was conducted with a soybean cultivar (DS-9712) and wild G. soja (DC2008-1) to identify genomic variations linked to seed permeability traits [160]. Whole-genome re-sequencing data of 292 Cajanus accessions revealed a total of 17.2 million genomic variations, including 15.1 million SNPs, 0.9 million small insertions, and 1.2 million small deletions (InDels of 1–5 bp in length) [152]. The re-sequencing of 44 sorghum lines representing improved inbred lines, landraces, and wild species revealed 4.9 million high-quality SNPs, 1.9 million InDels, and specific gene loss and gain events in S. bicolor [161].
The re-sequencing of 302 soybean accessions, including wild, landrace, and improved accessions, at >11× depth revealed ~9.8 million SNPs, 876,799 small InDels, and 1614 CNVs [150]. Similarly, whole-genome re-sequencing of 429 chickpea accessions facilitated the identification of 4.97 million SNPs, 0.59 million small InDels, 4931 CNVs, and 60,742 presence-absence variations (PAVs) [153].
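The variant tallies reported in these re-sequencing studies (SNPs, small insertions, small deletions) come from classifying each difference between a sample and the reference. The sketch below shows one simple way to do this from reference and alternate alleles. It is illustrative only: the allele-pair input format, the category labels, and the function names are assumptions made for the example, and it does not reproduce the pipeline of any study cited here.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify a variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "MNP"  # same length, more than one base substituted


def tally_variants(variants):
    """Count variant classes for an iterable of (ref, alt) pairs."""
    return Counter(classify_variant(ref, alt) for ref, alt in variants)


# Hypothetical calls: two SNPs, one 2-bp insertion, one 3-bp deletion.
calls = [("A", "G"), ("C", "T"), ("T", "TAG"), ("GACT", "G")]
counts = tally_variants(calls)
print(counts)
```

Summing such counts over all accessions gives totals of the kind quoted above; for instance, the pigeonpea figures of 15.1 million SNPs, 0.9 million insertions, and 1.2 million deletions add up to the reported 17.2 million variations.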
5.2.2 Population Genomics (GWAS and GBS)
Genotyping by sequencing (GBS) is a method for discovering SNPs in a genome and in populations without any prior knowledge of the genome sequence [162]. In the GBS technique, genomic DNA is digested with restriction enzymes, barcoded adapters are ligated to the fragments, and the fragments are PCR-amplified and pooled to produce multiplexed libraries for sequencing [162]. The GBS technique has proven to be robust and successful in producing large numbers of SNPs for genetic analyses and genotyping [163, 164]. GBS has been applied efficiently in a wide range of plant breeding applications, such as genomic selection (GS), genomic diversity (GD) studies, genome-wide association (GWA), linkage analysis (LA), and marker discovery. The GBS approach has been used successfully for genetic analysis and marker development in crops such as rapeseed, lettuce, and soybean [165–167]. GWAS is a method for studying the associations between a genome-wide set of SNPs and phenotypic traits of interest. Initially, GWAS analysis was carried out using microarrays; later, with the advancement of sequencing technologies, GWAS became a powerful tool for detecting natural genomic variations corresponding to complex traits in crop plants. The quantitative evaluation is based on linkage disequilibrium (LD), assessed through genotyping and phenotyping of diverse individuals. Thus, GWAS in crops uses a set of stable accessions or varieties that are phenotyped repeatedly for several traits and genotyped once, in order to analyze the association between the traits of interest and the identified genomic variations. GWAS have been carried out successfully in many crops, such as rice and maize [168, 169]. A high-density haplotype map of the rice genome was generated using low-coverage genome sequencing of 1083 cultivated rice accessions (covering both subspecies of O. sativa, indica and japonica) and 446 wild rice accessions (Oryza rufipogon) [170]. The GWAS analysis was conducted with the set of 446 wild rice (O. rufipogon) accessions for leaf sheath color and tiller angle [170]. A similar study in rice identified the alleles associated with ten grain-related traits and flowering time using a comprehensive data set of ~1.3 million SNPs [171].
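The core of a GWAS, testing each SNP for association with a trait, can be sketched as a single-marker regression of the phenotype on the allele dosage (0, 1, or 2 copies of the alternate allele). The example below is a deliberately minimal sketch: it assumes additive allele effects, uses invented toy data, and omits the corrections for population structure, kinship, and multiple testing that real GWAS analyses require. The function name is hypothetical.

```python
def marker_association(dosages, phenotypes):
    """Regress phenotype on allele dosage for one SNP.

    Returns (slope, r_squared): the estimated additive effect per
    allele copy and the fraction of trait variance it explains.
    """
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching genotype and phenotype lists")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("marker or trait shows no variation")
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Toy data: six accessions genotyped at one SNP and phenotyped once.
dosages = [0, 1, 2, 0, 1, 2]
trait = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
effect, r2 = marker_association(dosages, trait)
print(effect, r2)
```

Repeating this test for every SNP in the genome-wide set, and ranking markers by the strength of association, gives the basic scan from which candidate loci are reported.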
In another study, researchers identified 30,984 SNP markers by sequencing 176 RILs (indica × japonica) using a 384-plex GBS protocol and identified QTLs associated with leaf width and aluminum tolerance [172]. In maize, GWAS of 368 lines resulted in the identification of ~1 million SNPs, including 74 loci linked to kernel oil content and fatty acid composition [173]. Similarly, genotyping of 2815 maize inbred lines resulted in the identification of 681,257 SNPs distributed across the entire genome, some of which were linked to known candidate genes controlling flowering time, kernel color, and sweetness [174]. In foxtail millet, a GWAS involving GBS of 916 diverse accessions identified multiple loci associated with many agronomic traits [175]. In sorghum, 917 diverse accessions from the worldwide collection were used to identify ~0.2 million SNPs through GBS [176]. Varala et al. [177] identified 4294–14,550 SNPs through GBS of four soybean accessions. A total of 205,614 SNPs were identified by re-sequencing 31 soybean accessions, providing a valuable genomic resource for soybean breeding programs [151]. Similarly, re-sequencing of 84 tetraploid potato cultivars revealed 129,156 genomic variations [178]. The application of GBS to a large collection of potato cultivars identified alleles strongly associated with maturity and flesh color [178]. The identification of SNPs by GBS for constructing high-density genetic linkage maps has become an important tool for numerous applications in plant breeding [179]. GBS analysis was used to incorporate thousands of markers into the bread wheat map [180]. GBS and a high-resolution SNP map were used to confirm the location of the semi-dwarfing gene (ari-e) on barley chromosome 5H [181].
5.2.3 Pan-Genome Analysis
A pan-genome is defined as the entire set of genes present in a group of organisms of a biological clade, such as a species or genus. The pan-genome is divided into core and dispensable genes. The core genes are the set of genes present in all the studied individuals, while dispensable genes are present only in a subgroup or are specific to an individual [182]. The pan-genome concept was developed and first used to study bacteria in 2005, when genome sequencing of several Streptococcus agalactiae isolates revealed that 80% of the identified genes (core genes) were present in all isolates, while 20% (dispensable genes) were absent from at least one isolate [182]. Initially, pan-genome analysis was not conducted in plants because of the high cost of sequencing and the assumption that the rate of variation in higher organisms is low compared with bacteria [183]. Pan-genome studies have since been conducted in many crops, including Brassica, maize, wheat, rice, soybean, and pigeonpea, and the applications of pan-genome analysis for crop improvement have been widely reviewed [184–186]. In the first published plant pan-genome analysis, seven wild soybean accessions were sequenced, and the individual genome assemblies were compared to identify the dispensable genes associated with biomass, flowering and maturity time, seed composition, organ size, and disease resistance in Glycine soja [187]. Because generating high-quality genome assemblies for multiple individuals is very costly, several plant pan-genome studies, including those of Brassica oleracea (ten genotypes) [188], bread wheat (18 genotypes) [189], and rapeseed (53 genotypes) [190], have adopted the iterative mapping and assembly approach [191]. These early plant pan-genome studies revealed large variation (15–40%) in gene content among the studied species and showed that most of the dispensable genes are associated with biotic and abiotic stress tolerance [184].
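The partition of a pan-genome into core and dispensable genes follows directly from the definitions above and can be expressed with set operations. The sketch below assumes that gene presence has already been determined for each accession; the accession names, gene identifiers, and function name are hypothetical.

```python
def partition_pan_genome(gene_sets):
    """Split a pan-genome into core and dispensable genes.

    gene_sets: dict mapping accession name -> set of gene identifiers
               present in that accession (hypothetical input format).
    Returns (pan, core, dispensable) as sets.
    """
    if not gene_sets:
        raise ValueError("at least one accession is required")
    all_sets = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*all_sets)            # every gene seen anywhere
    core = set.intersection(*all_sets)      # genes present in all accessions
    dispensable = pan - core                # genes absent from at least one
    return pan, core, dispensable


# Toy presence data for three accessions.
presence = {
    "acc1": {"g1", "g2", "g3"},
    "acc2": {"g1", "g2", "g4"},
    "acc3": {"g1", "g2", "g3", "g5"},
}
pan, core, dispensable = partition_pan_genome(presence)
print(len(pan), sorted(core), sorted(dispensable))
print(f"dispensable fraction: {len(dispensable) / len(pan):.0%}")
```

The dispensable fraction computed this way corresponds to the kind of gene-content variation reported for the early plant pan-genomes.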
Recently, pan-genome studies have been conducted for many crop plants (Table 4). These include the analysis of 54 Brachypodium distachyon individuals, which identified 7135 new genes that are not present in the reference genome; a population-specific core gene set was also observed in this study [196]. Pan-genome analysis of sesame (Sesamum indicum) was conducted using five genotypes, including old and modern sesame cultivars [200]. In the recent pan-genome study of 89 pigeonpea (Cajanus cajan) accessions, three genes were found to be associated with seed weight [206]. The pan-genome sequencing study of Brassica oleracea revealed that 18.7% of the 61,379 genes were affected by PAVs. The identified set of dispensable genes was reported to be associated with various agronomic traits such as disease resistance, flowering time, biosynthesis of vitamins, and glucosinolate metabolism [188]. The rice (Oryza sativa L.) Nipponbare reference genome has been found to lack many genes associated with vital functions, such as GW5 [209], Sub1A [210], and Pikm-1 [211]. This indicates that a single reference genome is not sufficient to capture the entire genetic variation of a species. In addition to the above-mentioned species, pan-genome studies conducted in some other plants are listed in Table 4.
Table 4 A list of pan-genome studies conducted in crop plants
Year of publication | Species | Number of accessions | Approach for pan-genome construction | No. of genes in pan-genome | Sequencing method | References
2014 | Brassica rapa | 3 | De novo assembly | 41,858 | Illumina HiSeq | [192]
2014 | Glycine soja | 7 | De novo assembly | 59,080 | Illumina HiSeq | [187]
2014 | Oryza sativa | 3 | De novo assembly | 39,891 | Illumina HiSeq | [193]
2014 | Zea mays | 503 | De novo transcriptome | 41,903 | Illumina HiSeq | [194]
2015 | Oryza sativa | 1483 | Iterative mapping and assembly | NA | Illumina HiSeq | [195]
2016 | Brassica oleracea | 10 | Iterative mapping and assembly | 61,379 | Illumina HiSeq | [188]
2017 | Brachypodium distachyon | 54 | De novo assembly | 37,886 | Illumina HiSeq | [196]
2017 | Medicago truncatula | 15 | De novo assembly | 75,000 | Illumina HiSeq | [197]
2017 | Triticum aestivum | 19 | Iterative mapping and assembly | 139,747 | Illumina HiSeq | [189]
2018 | Brassica napus | 53 | Iterative mapping and assembly | 94,013 | Illumina HiSeq | [190]
2018 | Capsicum annuum | 383 | Iterative mapping and assembly | 51,757 | Illumina HiSeq | [198]
2018 | O. sativa/O. rufipogon | 67 | Iterative mapping and assembly | 42,580 | Illumina HiSeq | [199]
2018 | Oryza sativa | 3010 | Map-to-pan | 48,098 | Illumina HiSeq, PacBio | [147]
2019 | Sesamum indicum | 5 | De novo assembly | 26,472 | Illumina HiSeq | [200]
2019 | Helianthus annuus | 493 | Iterative mapping and assembly | 61,205 | Illumina HiSeq | [201]
2019 | Solanum lycopersicum | 725 | De novo assembly | 40,369 | Illumina NextSeq | [202]
2020 | Brassica napus | 9 | Map-to-pan | 105,672 | Illumina HiSeq, PacBio | [203]
2020 | Juglans | 6 | De novo assembly | 26,458 | Illumina HiSeq | [204]
2020 | Glycine max | 29 | De novo assembly | 57,492 | Illumina HiSeq, PacBio | [205]
2020 | Cajanus cajan | 89 | Iterative mapping and assembly | 55,512 | Illumina HiSeq | [206]
2021 | Sorghum | 354 | Iterative mapping and assembly | 35,719 | Illumina HiSeq | [207]
2021 | Zea mays | 26 | De novo assembly | >10,300 | Illumina, PacBio, Optical mapping | [208]
5.2.4 Marker Development (SSRs, SNPs, and InDels)
Molecular markers have been an important tool for plant genetics and breeding. Various molecular marker techniques have been developed and used in plant molecular breeding for decades [162]. The first molecular marker developed and used for plant genotyping was restriction fragment length polymorphism (RFLP) [212]. However, the application of RFLP is limited because it requires complicated hybridization and radioactivity, depends on the number of available probes, and is time-consuming [213]. Later, several PCR-based markers were developed, such as amplified fragment length polymorphisms (AFLPs), random amplification of polymorphic DNA (RAPD), sequence characterized amplified regions (SCAR), cleaved amplified polymorphic sequences (CAPS), simple sequence repeats (SSRs), direct amplification of length polymorphisms (DALP), and diversity arrays technology (DArT) [214]. Before the advancement of sequencing technologies, SSR markers were the most extensively used molecular markers in genetic and plant breeding studies. Compared with other marker systems, SSR markers have many advantages, such as genome-wide distribution, a higher polymorphism rate, and amenability to automation [215]. In the recent past, SNPs have become widely used molecular markers owing to the advancement of NGS technology. SNPs have proven advantageous over other molecular markers because of their abundance in the genome and their relative ease of detection. NGS has also facilitated the discovery of other marker types. For example, 23,438 putative SSRs were identified by sequencing and de novo assembly of the sesame genome and were successfully used for diversity analysis of accessions from 12 countries [216]. De novo genic SSRs have been identified in several non-model plants, including but not limited to Hevea brasiliensis [217] and Cajanus scarabaeoides [218].
Whole-genome sequencing and genome-wide analysis of foxtail millet resulted in the identification of 30,706 transposable elements (TEs), which were used to develop 20,278 TE-based markers, such as Inter Retrotransposon Amplified Polymorphisms, Retrotransposon Based Insertion Polymorphisms, Repeat Junction-Junction Markers, Repeat Junction Markers, Retrotransposon Microsatellite Amplified Polymorphisms, and Insertion Site-Based Polymorphisms. These markers were further used to screen 96 Setaria italica accessions and three wild Setaria accessions [219]. Similarly, 2687 InDel-based markers were identified from genome sequencing of three Phaseolus vulgaris L. genotypes and were used to generate a genetic map [219].
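Mining SSRs from assembled sequence, as in the studies mentioned in this section, amounts to searching for short motifs repeated in tandem. The following sketch uses a regular expression with a backreference to find perfect repeats. The thresholds (motifs of 2–6 bp, at least four copies), the function name, and the toy sequence are assumptions made for illustration; dedicated SSR-mining tools apply more elaborate criteria, including motif-specific repeat thresholds and imperfect repeats.

```python
import re


def find_ssrs(sequence, min_repeats=4):
    """Find perfect tandem repeats with motifs of 2-6 bp.

    Returns a list of (start, motif, copies) tuples, where start is
    the 0-based position of the repeat in the sequence.
    """
    # A motif of 2-6 bases followed by at least (min_repeats - 1)
    # further copies of itself; the lazy quantifier prefers the
    # shortest motif that satisfies the pattern.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    ssrs = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        copies = len(match.group(0)) // len(motif)
        ssrs.append((match.start(), motif, copies))
    return ssrs


# Toy sequence containing an (AG)5 microsatellite.
hits = find_ssrs("TTAGAGAGAGAGCC")
print(hits)
```

Primers designed in the unique sequence flanking each hit can then be used to test whether the repeat number is polymorphic among accessions.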
5.2.5 QTL Mapping
A QTL is a region or locus of a genome that is associated with a quantitative trait. A QTL can be a single gene or a group of linked genes contributing to a specific trait. QTL mapping based on linkage and marker–trait associations has been successfully used for gene pyramiding and for the selection of diverse populations for tolerance to biotic and abiotic stresses. Various molecular markers like RFLPs, RAPD,
SSRs, AFLP, and SNPs have been used to identify QTLs in plants. The submergence tolerance QTL, SUBMERGENCE 1 (SUB1) in rice is probably the most successful example of QTL utilization in crop plants [220]. QTL mapping was conducted using NGS-based approaches in many crops, including but not limited to rice [221], maize [222], and wheat [223].
5.2.6 Linkage Mapping/Association Mapping
A linkage map is constructed based on the position of, and relative genetic distance between, segregating markers along chromosomes. The main objective of linkage mapping is the identification of a gene or genomic region governing phenotypic variation. Two approaches have been successfully used for the genetic mapping of important agronomic traits in crop plants: linkage mapping and association mapping, the latter also known as linkage disequilibrium (LD) mapping. Over the past two decades, linkage mapping has been successfully used in crop plants for the identification of many QTLs and for gene cloning [224]. Linkage mapping involves studying the genetic recombination events that occurred in a mapping population developed from biparental or multiparent crosses. Conversely, association mapping is a population-based mapping method that draws on elite cultivars, landraces, exotic accessions, and wild relatives. Association mapping detects the historical recombination events accumulated over hundreds of generations, thus providing higher resolution and greater allele numbers [225]. NGS-based association mapping studies have been conducted in many crop plants, such as maize [173], soybean [150], sorghum [176], and potato [178].
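The notion of "relative genetic distance between segregating markers" can be made concrete with a small worked example. The sketch below is illustrative only: it assumes that recombinant and total progeny counts between two markers are already available, and it uses the Haldane and Kosambi mapping functions, two standard choices that the chapter itself does not name. The function names are hypothetical.

```python
import math


def recombination_fraction(n_recombinant, n_total):
    """Estimate the recombination fraction r between two markers from progeny counts."""
    if n_total <= 0 or not 0 <= n_recombinant <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_recombinant <= n_total")
    return n_recombinant / n_total


def _check_r(r):
    # Both mapping functions are only defined for 0 <= r < 0.5.
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")


def haldane_cm(r):
    """Haldane map distance in centimorgans (assumes no crossover interference)."""
    _check_r(r)
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in centimorgans (allows for crossover interference)."""
    _check_r(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical counts: 10 recombinant plants out of 100 progeny scored.
r = recombination_fraction(10, 100)
print(f"r = {r:.2f}, Haldane = {haldane_cm(r):.2f} cM, Kosambi = {kosambi_cm(r):.2f} cM")
```

For small recombination fractions both functions give distances close to 100 × r cM; they diverge as r approaches 0.5, which is one reason dense marker sets are preferred for map construction.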
5.2.7 Genomics-Assisted Breeding
In the NGS era, the application of genomic information has become common practice in crop breeding programs to develop elite cultivars with increased yield, tolerance to biotic/abiotic stresses, and better nutrition. Two approaches have been successfully used in genomics-assisted breeding: (1) marker-assisted selection (MAS) and (2) genomic selection (GS) [226]. In MAS, molecular markers linked to a specific gene or to QTLs associated with a trait of interest are used to select progeny that carry favorable alleles for that trait. GS, in contrast, uses all available markers present throughout a genome to predict the breeding value. Over the past two decades, with the advancement of NGS technologies, genome-wide genotyping approaches have increased the number of markers available for selecting plants in breeding programs manyfold [227]. MAS is successfully used for the introgression of recessive alleles, gene pyramiding, selection of traits that are expressed only in the later stages of plant development, and traits that are expensive or difficult to phenotype [228]. MAS has been successfully used to improve elite cultivars of many crops, for example, for biomass accumulation in Triticale [229] and drought tolerance in chickpea [230].
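The contrast between MAS and GS can be illustrated with a toy calculation. The sketch below is one simple way to "use all available markers to predict the breeding value": a ridge regression of phenotype on marker genotypes, written in plain Python. It is an assumption-laden illustration, not the method of any study cited here; the genotype coding (0/1/2 copies of an allele), the penalty `lam`, the data, and the function names are all hypothetical.

```python
def fit_ridge(genotypes, phenotypes, lam=1.0):
    """Estimate marker effects b by solving (X'X + lam*I) b = X'y on centered data."""
    n, m = len(genotypes), len(genotypes[0])
    mean_y = sum(phenotypes) / n
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    y = [v - mean_y for v in phenotypes]

    # Normal equations with the ridge penalty added to the diagonal.
    a = [[sum(x[i][j] * x[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(m)] for j in range(m)]
    rhs = [sum(x[i][j] * y[i] for i in range(n)) for j in range(m)]

    # Gaussian elimination with partial pivoting (fine for a handful of markers).
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        rhs[c], rhs[p] = rhs[p], rhs[c]
        for r in range(c + 1, m):
            f = a[r][c] / a[c][c]
            for k in range(c, m):
                a[r][k] -= f * a[c][k]
            rhs[r] -= f * rhs[c]

    # Back substitution.
    effects = [0.0] * m
    for c in reversed(range(m)):
        tail = sum(a[c][k] * effects[k] for k in range(c + 1, m))
        effects[c] = (rhs[c] - tail) / a[c][c]
    return {"mean": mean_y, "col_means": col_means, "effects": effects}


def predict_gebv(model, genotype):
    """Genomic estimated breeding value of one unphenotyped line."""
    return model["mean"] + sum(
        (g - mu) * b
        for g, mu, b in zip(genotype, model["col_means"], model["effects"])
    )


# Hypothetical training set: 6 lines, 2 markers; only marker 0 affects the trait.
train_genotypes = [[0, 0], [1, 2], [2, 1], [0, 2], [2, 0], [1, 1]]
train_phenotypes = [10, 12, 14, 10, 14, 12]
model = fit_ridge(train_genotypes, train_phenotypes, lam=1.0)
candidates = {"line_A": [2, 0], "line_B": [0, 2]}
ranking = sorted(candidates, key=lambda name: predict_gebv(model, candidates[name]), reverse=True)
print(ranking)
```

In practice the number of markers far exceeds the number of phenotyped lines, which is why a shrinkage penalty (or an equivalent mixed model) is needed; dedicated statistical packages would normally be used rather than hand-written elimination.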
Originally, GS was intended for application in livestock breeding [231]. However, it has been successfully used in a wide range of crops including but not limited to maize, wheat, and cassava [232, 233].
5.2.8 Genotyping Arrays
NGS technologies have enabled the development of array-based genotyping platforms. Most of these genotyping arrays are based on technologies from either Illumina or Affymetrix. Illumina BeadArray technology uses beads coated with specific oligonucleotides that fit into patterned microwells, followed by fluorescent detection of highly multiplexed SNPs. Initially, BeadXpress and then the GoldenGate assay detected 48–384 and 384–3072 SNPs per sample, respectively [234]. The latest higher-density Infinium assays can detect 3000 to over 5 million SNPs per sample [235]. In Affymetrix GeneChip arrays, oligonucleotides are printed on an array using a photolithographic technique, followed by hybridization-based SNP calling [236]. More recently, the Affymetrix Axiom technology has become capable of simultaneously genotyping 384 samples with 50K SNPs or 96 samples with 650K SNPs. Many SNP-genotyping arrays have been developed from NGS datasets and used to improve breeding efficiency in many crops, including maize (60K SNPs) [237], rice (44K SNPs) [168], chickpea (2068 SNPs) [238], and pigeonpea (1616 SNPs) [239].
5.3 Organellar Genome Sequencing
Mitochondria and chloroplasts are unique among organelles in having their own genomes. Structurally, these genomes can be circular or linear, and their replication characteristics differ from those of the nuclear genome. Mitochondria are the cell's powerhouse, while chloroplasts are responsible for photosynthesis. The roles of these organelles in different cellular functions make them ideal candidates for in-depth genome analysis [240]. The sequencing of organelle (chloroplast and mitochondrial) genomes has increased significantly in the last decade owing to substantial breakthroughs in NGS technology. The first organelle genomes to be sequenced were the chloroplast genomes of tobacco [241] and the liverwort (Marchantia polymorpha) [242]. The advancement of next-generation sequencing technology encouraged many scientific groups to sequence the organellar genomes of many crop plants. To date, around 5296 chloroplast, 280 mitochondrial, and 957 plastid genome sequences of land plant species have been submitted to the NCBI database (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/). These organelle genome data have given new insights into the mutational processes that affect mtDNA and ptDNA and increased our understanding of genomic diversity and gene expression dynamics in these organelles [243–245]. Additionally, mtDNA sequencing has played a key role in identifying essential agricultural features, including cytoplasmic male sterility caused by mitochondrial genome mutations.
5.4 Functional Genomics in Crop Improvement
The field of functional genomics refers to the assessment of gene function using global (genome-wide or system-wide) experimental approaches, building on data generated by structural genomics, including DNA sequencing, gene expression (RNA-Seq or microarray), noncoding RNA, and protein–DNA, protein–RNA, and protein–protein interactions [246]. In contrast, classical methods based on forward and reverse genetics work gene by gene and are therefore time-consuming and expensive ways to decipher the biological function of novel genes. The advent of NGS technologies and their application has accelerated the comprehensive understanding of the molecular mechanisms that regulate plant development, growth, and responses to various biotic and abiotic stresses. High-throughput technologies like transcriptomics, epigenomics, and proteomics are important aspects of functional genomics.
5.4.1 Transcriptomics
After the completion of Arabidopsis genome sequencing [15], microarray platforms were initially used for transcriptome analysis of Arabidopsis [247]. The first transcriptome analysis using NGS technology was reported in 2007 [248]. Transcriptome analysis has been successfully used for the identification of genes and pathways associated with developmental stages and stress responses in various crop plants [249–251]. An expression atlas comprises the mRNA profiles of a set of tissues from different developmental stages that together represent the entire life cycle of a plant. Expression atlases provide comprehensive information on differentially expressed genes across tissues, including their tissue-specific and temporal expression patterns. Expression atlases of model and crop plants such as Arabidopsis, rice, and soybean have been used for the development of digital comparative transcriptome analysis tools and have proven useful for crop improvement. The first plant expression atlas was that of Arabidopsis, consisting of 79 different tissues collected from both vegetative and reproductive stages of plant development [252]. The rice gene expression atlas included 39 tissues representing the entire life cycle of rice [253]. The soybean expression atlas consists of 14 tissues collected at different stages, including vegetative, reproductive, and root nodule [254]. The chickpea gene expression atlas consists of RNA-Seq data from 27 samples at five major developmental stages of the plant [255]. The pigeonpea expression atlas consists of the mRNA profiles of 30 tissues/organs representing developmental stages from germination to senescence [256]. Expression atlases of other crop plants such as maize [257] and wheat [258] have also been reported.
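One way an expression atlas is queried for "tissue-specific" genes can be sketched with a single summary statistic. The example below uses the tau index, a commonly used tissue-specificity measure that is an assumption here (the chapter does not name a particular metric); the expression values are invented and the function name is hypothetical.

```python
def tau_index(expression):
    """Tissue-specificity index: 0 = uniform expression, 1 = expressed in one tissue only.

    `expression` is a list of non-negative expression values (for example TPM),
    one per tissue, for a single gene.
    """
    if len(expression) < 2:
        raise ValueError("need expression values for at least two tissues")
    if any(v < 0 for v in expression):
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        raise ValueError("gene is not expressed in any tissue")
    return sum(1 - v / x_max for v in expression) / (len(expression) - 1)


# Hypothetical gene profiles across three tissues (root, leaf, nodule).
print(tau_index([0, 0, 10]))   # nodule-specific gene
print(tau_index([5, 5, 5]))    # housekeeping-like gene
```

A threshold on such an index, combined with the tissue of maximum expression, is one simple way to shortlist candidate genes from an atlas before more formal differential expression testing.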
5.4.2 Epigenomics or Epigenetics
Epigenetics is defined as the heritable changes in the structure of chromatin that can significantly regulate gene expression and affect cellular functions without involving changes in the corresponding DNA sequence. Epigenetic regulation involves various molecular mechanisms such as histone modifications, DNA methylation, and noncoding RNA (ncRNA) interactions. These mechanisms regulate the open or closed states of chromatin, which in turn determine the activation or silencing of genes, respectively, and further control the onset of gene expression networks in specific cell types and tissues and under different developmental or environmental stimuli [259]. Many NGS-based epigenomics approaches, such as whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and ChIP-Seq, have been developed and successfully used in many crop plants.
5.4.3 MethylC-seq/Whole-Genome Bisulfite Sequencing
Whole-genome bisulfite sequencing, also known as BS-Seq or MethylC-seq, is an NGS-based approach used to determine the cytosine methylation profile of genomic DNA. In this method, target DNA is treated with sodium bisulfite to convert unmethylated cytosines into uracil [260], followed by sequencing. WGBS was used for the first time in plants for decoding the methylome of Arabidopsis [260] and was later extended to other crops. Rapid improvement and application of bioinformatics have facilitated the identification of global methylation patterns and of differential methylation status in specific regions or even at particular loci. For example, separate WGBS studies involving embryo and endosperm of maize [261] revealed that the endosperm genome is hypomethylated compared to the embryo. A study involving WGBS of a pigeonpea cytoplasmic male-sterile line, restorer line, and F1 (hybrid) revealed the possible epigenetic regulation of pollen fertility related genes [262].
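The logic of bisulfite-based methylation calling can be sketched in a few lines. Because unmethylated cytosines are converted to uracil and read as thymine after amplification, while methylated cytosines remain cytosine, the methylation level at a reference cytosine can be estimated as the fraction of covering reads that still show C. The example below is a simplified illustration under strong assumptions: reads are already aligned to the forward strand without gaps, and the sequences, positions, and function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def methylation_levels(reference, aligned_reads):
    """Return {position: fraction methylated} for each covered reference cytosine.

    `aligned_reads` is a list of (start, read_sequence) pairs on the forward strand.
    """
    counts = {i: [0, 0] for i, base in enumerate(reference) if base == "C"}
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos in counts:
                if base == "C":      # protected, hence methylated
                    counts[pos][0] += 1
                elif base == "T":    # converted, hence unmethylated
                    counts[pos][1] += 1
    return {pos: c / (c + t) for pos, (c, t) in counts.items() if c + t > 0}


reference = "ACGTCCA"
# Two hypothetical molecules: one methylated at position 1, one fully unmethylated.
reads = [
    (0, bisulfite_convert(reference, {1})),
    (0, bisulfite_convert(reference, set())),
]
print(methylation_levels(reference, reads))
```

Real pipelines additionally handle both strands, sequence context (CG, CHG, CHH in plants), incomplete conversion, and sequencing errors, all of which are ignored in this sketch.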
5.4.4 Reduced Representation Bisulfite Sequencing
This approach is similar to the GBS strategy except for the use of sodium bisulfite treatment of the DNA. RRBS was originally intended for studying the mammalian methylome at low cost [263]. In the RRBS approach, the target genomic DNA is first digested with restriction enzymes like MspI, followed by size selection of the fragmented DNA (40–220 bp) for subsequent bisulfite conversion and sequencing [263]. RRBS has been successfully used for methylation profiling of plant genomes, such as oak populations [264] and Brassica rapa [265]. Many alternative approaches to bisulfite conversion have been developed, such as methylated DNA immunoprecipitation sequencing (MeDIP-seq), methyl-binding domain sequencing (MBD-seq) [266], and methyl capture sequencing (MethylCap-seq) [267]. MeDIP-seq uses a specific antibody to capture DNA fragments containing methylated cytosine [268]. These enrichment-based approaches are capable of capturing methylated cytosines or total cytosines from very low quantities of DNA [269].
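The digestion and size-selection step of RRBS described above can be mimicked in silico to estimate which parts of a genome would be represented. The sketch below assumes the standard MspI recognition site CCGG with the cut between the two cytosines (C^CGG) and uses the 40–220 bp window given in the text; the test sequence and function names are hypothetical.

```python
def mspi_fragments(seq):
    """Cut a sequence at every MspI site (C^CGG) and return the fragments in order."""
    site = "CCGG"
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + 1)          # MspI cuts after the first C of CCGG
        i = seq.find(site, i + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len=40, max_len=220):
    """Keep fragments within the size window used for library preparation."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Hypothetical sequence with two MspI sites.
genome = "A" * 10 + "CCGG" + "T" * 50 + "CCGG" + "A" * 300
fragments = mspi_fragments(genome)
selected = size_select(fragments)
print([len(f) for f in fragments], [len(f) for f in selected])
```

Because every retained fragment begins and ends at a CCGG site, the selected fraction is enriched for CpG-containing regions, which is what keeps the sequencing cost low relative to WGBS.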
5.4.5 Chromatin Immunoprecipitation
Chromatin immunoprecipitation (ChIP) is a modification of the chromatin conformation capture (3C) technique, used to study DNA–protein interactions in vivo. In this technique, the interactions between genomic DNA and proteins are fixed by treatment with formaldehyde, followed by restriction digestion to generate small fragments ranging in size from 200 to 800 bp. The DNA–protein complex fragments are then captured by immunoprecipitation using an antibody specific to the protein of interest. The immunoprecipitated fragments are subjected to acid treatment to release the DNA from the DNA–protein complex, and the released DNA fragments are amplified by PCR. Similarly, to capture histone modification landscapes, antibodies specific to particular histone modifications (e.g., H3K27me3 or H3K4me3) are used [270]. The application of NGS technologies in combination with ChIP led to the development of ChIP-seq. In this technology, the immunoprecipitated DNA fragments are sequenced using NGS platforms [270], and the resulting epigenetic profiles are compared to the corresponding transcriptome (RNA-Seq or microarray) data to assess the epigenetic regulation of gene expression.
5.4.6 Exome Sequencing
Exome sequencing refers to the targeted capture and sequencing of the protein-coding regions of the genome (representing ~1–2% of the genome), with a focus on the identification of functional variants across the genome [271]. Exome sequencing has been successfully used to characterize genetic diversity in the coding regions of many crop plants such as maize [272], wheat [273], and soybean [274].
5.5 Metagenomics
Metagenomics refers to the study of genetic material obtained directly from surrounding environments, such as the rhizosphere and the phyllosphere, where diverse microorganisms grow and interact with plants [275, 276]. These plant–microbial interactions can be beneficial, neutral, or detrimental to plant health and development [277]. Two symbiotic plant–microbe interactions have been extensively studied and are well understood: vesicular arbuscular mycorrhiza (VAM) and root nodule (RN). The root nodule interactions mainly involve bacterial species belonging to the genera Rhizobium, Sinorhizobium, Azorhizobium, Bradyrhizobium, and Mesorhizobium [277]. The advancement of NGS technologies has increased our understanding of microbial populations in the rhizosphere and phyllosphere through various approaches such as 16S/18S/ITS amplicon sequencing, microbial whole-genome sequencing, metagenomics, metatranscriptomics, complete plasmid sequencing, and microbial single-cell sequencing. 16S/18S/ITS sequencing is the most common and powerful method to explore microbial diversity in plant–microbe interactions. NGS technology has facilitated the whole-genome sequencing of various strains of Rhizobia, viz., Rhizobium leguminosarum bv. viciae [278] and R. sullae type strain IS123T
[279]. The Bradyrhizobium amphicarpaeae sp. nov. 39S1MBT genome was sequenced using the PacBio-SMRT platform [280]. Sequencing and analysis of the Rhizobium leguminosarum genome revealed that it shared more common genes with S. meliloti and M. loti than with the closely related Agrobacterium tumefaciens [278]. Metatranscriptomics has enabled the study of changes in microbial gene expression in response to changes in the host environment or the developmental stage of the host. Comparative transcriptomics analysis of R. leguminosarum biovar viciae 3841 in the rhizosphere of Medicago sativa, pea, and Beta vulgaris revealed the expression of a common set of genes, including rmrA, encoding an efflux pump, and dctA, involved in C4-dicarboxylate transport [281]. Liu et al. [282] studied the effect of soybean root exudates on two B. diazoefficiens strains, 4534 and 4222, and found differential expression of various genes encoding two-component systems (nodW, phyR-sEcfG), bacterial chemotaxis (cheA), indole-3-acetic acid (IAA) metabolism, and ATP-binding cassette (ABC) transport proteins.
5.6 Paleogenomics
Paleogenomics refers to the study of the evolutionary history of organisms based on the genomic information provided by high-throughput NGS technologies. Two approaches have been developed using NGS technologies to retrieve both macro- and microevolutionary information in paleogenomics. The first is an indirect (synchronic) method that compares modern genomes to predict/reconstruct the ancestral genomes that existed millions of years ago (macroevolution). The second is a direct (allochronic) method that involves sequencing genetic material recovered from plant subfossils that have been preserved for over 10,000 years (microevolution) [283]. The recent advancement of NGS technologies has enabled the comparison of modern genomes with each other to reveal their evolutionary history based on the reconstructed genomes of their most recent common ancestors (MRCA). NGS technologies have facilitated the reconstruction of the ancestral angiosperm karyotype (AAK), comprising a set of 22,899 genes that have been conserved in modern-day crops since 190–238 MYA. These results agree with previous reports, based on evolutionary timescale approaches, that the angiosperms evolved some 250 MYA [284] and with the earliest plant fossil records recovered from the late Triassic era [285]. The ancestral monocot karyotype (AMK), reported with five proto-chromosomes comprising 6707 ordered proto-genes, and the ancestral eudicot karyotype (AEK), consisting of 6284 ordered proto-genes distributed across seven proto-chromosomes, diverged from the AAK [285]. The reconstruction of ancestral genomes revealed that modern plant genomes are a mosaic of
reconstructed ancestral proto-chromosomal segments. Moreover, the availability of these reconstructed ancestral genomes helps us to understand the evolutionary plasticity of genome organization at the gene, chromosome, genome, and species levels over more than 200 million years of plant evolution [285]. Besides the reconstruction of karyotypes of ancestral genomes for monocots, dicots, and whole angiosperms (AMK, AEK, and AAK), ancestral genomes have been proposed for the Rosaceae [286], Brassicaceae [287], and Cucurbitaceae [288], as well as for the legumes [289]. The ancestral grass karyotype (AGK) reconstructed from rice, wheat, barley, Brachypodium, Setaria, sorghum, and maize has seven proto-chromosomes containing 8581 proto-genes (9430 in Wang et al. [290]) and covers up to 30 Mb [291]. The recent advancements in high-throughput NGS technologies have facilitated the sequencing of ancient nuclear DNA recovered from fossils and herbarium collections of various plants such as oak, maize, sunflower, barley, cotton, and Arabidopsis. Archaeogenomic studies conducted in maize have revealed the origin and evolution of modern-day maize crops through human-centric selection pressures [292, 293]. The genome sequencing of archeological emmer wheat from 3000-year-old Egyptian remains revealed the genetic erosion of biodiversity over time, as it contained haplotypes that are absent in modern-day wheat cultivars [294]. Genetic analysis of historical herbarium collections of sweet potato and modern samples revealed the pre-Columbian movement of the species from South America to Oceania [295] and the evolution of the modern-day crop from an Ipomoea trifida ancestor [296].
Similarly, the genetic analysis of 88 samples including both contemporary and historical potatoes, along with some herbarium specimens originally collected by Charles Darwin, and one collected 359 years ago was used to understand the evolution of photoperiod sensitivity of the potato crop by quantifying change in allele frequencies at the StCDF1 gene [297].
6 Future Perspectives
6.1 Genome Diversity
Although DNA sequencing is a relatively recent technology, it has already proved its value in molecular biology and crop improvement. So far, the genome of the hot-spring red alga Cyanidioschyzon merolae is the first and only eukaryotic genome to be sequenced completely without gaps [298]. In the recent past, the advancement of third-generation sequencing technologies has produced many near-complete genome assemblies by resolving the repeat-rich regions of genomes. There are still millions of living species on Earth whose genomes are yet to be sequenced, and the availability of their genomes may provide a
comprehensive view of genetic diversity and evolution. In 2018, the Earth BioGenome Project (EBP) was initiated with the aim of sequencing, cataloging, and characterizing the genomes of all of Earth's eukaryotic biodiversity over a span of 10 years. The EBP encompasses large-scale genome sequencing projects such as the sequencing of 10,000 vertebrates (G10K), 10,000 plants (10KP), and thousands of fungal and invertebrate species.
6.2 Population Genomics
Large-scale/population-scale whole-genome re-sequencing has been carried out in only a few crop plants, such as rice [147], soybean [150], pearl millet [154], chickpea [153], pigeonpea [152], and maize [173]. Further advancement of genome sequencing technologies may reduce the cost and time of sequencing and thereby encourage scientists across the world to initiate population-scale genome sequencing of other crop plants.
6.3 Developmental Biology
Recent technologies like single-cell RNA sequencing enable sequencing-based profiling of single cells from different developmental stages and tissues. Such profiling can facilitate the understanding of molecular mechanisms underlying agronomically important and complex traits like photoperiod sensitivity, heterosis, and apomixis.
6.4 Portable Real-Time Sequencers
Currently available Oxford Nanopore sequencers range in weight from 87 g (MinION) to 28 kg (PromethION) and can start producing usable data within 2 min of run initiation (https://nanoporetech.com/products/comparison). These miniature instruments can facilitate in situ sequencing of samples.
6.5 Unconventional Application of DNA Sequencing
The amount of data generated every day is posing a challenge to the world's storage capabilities. DNA, the ultimate code of natural genetic information, provides a stable, energy-efficient, and sustainable means of data storage. With the exponential growth in DNA synthesis and sequencing technology, DNA has also been identified as a potential medium for data encryption, which may progress into DNA cryptography and steganography. DNA has been used for data storage since 2012 [299]. One of the main advantages of DNA as a data storage medium is its very high storage density. Theoretically, DNA can encode 2 bits per nucleotide (nt), which amounts to approximately 455 exabytes for just 1 g of single-stranded DNA [299–301].
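The figures quoted above can be checked with a short calculation. The sketch below packs bytes into nucleotides at 2 bits per nt and reproduces the order of magnitude of the theoretical density. Two assumptions are not from the text: the mapping A=00, C=01, G=10, T=11 is arbitrary, and the average mass of about 330 g/mol per nucleotide of single-stranded DNA is a commonly used approximation. Practical storage schemes use more constrained encodings to avoid homopolymer runs and to add error correction.

```python
BASES = "ACGT"              # assumed mapping: A=00, C=01, G=10, T=11
AVOGADRO = 6.02214076e23    # molecules per mole


def encode(data: bytes) -> str:
    """Encode bytes as DNA at 2 bits per nucleotide (4 nt per byte)."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def decode(dna: str) -> bytes:
    """Invert encode(); the length must be a multiple of 4 nt."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def theoretical_exabytes_per_gram(nt_mass_g_per_mol=330.0, bits_per_nt=2):
    """Upper-bound storage density of 1 g of single-stranded DNA, in exabytes (1e18 bytes)."""
    nucleotides_per_gram = AVOGADRO / nt_mass_g_per_mol
    return nucleotides_per_gram * bits_per_nt / 8 / 1e18


message = b"NGS"
strand = encode(message)
print(strand, decode(strand) == message)
print(round(theoretical_exabytes_per_gram()), "EB per gram (theoretical)")
```

Under these assumptions, 1 g of single-stranded DNA holds roughly 1.8 × 10^21 nucleotides, or about 4.6 × 10^20 bytes, which is consistent with the approximately 455 exabytes cited in the text.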
References 1. Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171: 737–738. https://doi.org/10.1038/ 171737a0
2. Holley RW, Everett GA, Madison JT, Zamir A (1965) Nucleotide sequences in the yeast alanine transfer ribonucleic acid. J Biol Chem 240:2122–2128
Next Generation Sequencing Technologies for Crops 3. Fiers W, Contreras R, Duerinck F et al (1976) Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 260: 500–507. https://doi.org/10.1038/ 260500a0 4. Padmanabhan R, Padmanabhan R, Wu R (1972) Nucleotide sequence analysis of DNA: IX. Use of oligonucleotides of defined sequence as primers in DNA sequence analysis. Biochem Biophys Res Commun 48: 1295–1302. https://doi.org/10.1016/ 0006-291X(72)90852-2 5. Wu R (1972) Nucleotide sequence analysis of DNA. Nat New Biol 236:198–200. https:// doi.org/10.1038/newbio236198a0 6. Wu R, Kaiser AD (1968) Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. J Mol Biol 35: 523–537. https://doi.org/10.1016/S00222836(68)80012-9 7. Gilbert W, Maxam A (1973) The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70:3581–3584. https://doi.org/ 10.1073/pnas.70.12.3581 8. Bernreiter A (2017) Molecular diagnostics to identify fungal plant pathogens – a review of current methods. Rev Cientı´fica Ecuatoriana 4:26–35 9. Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74:560–564. https://doi.org/10.1073/ pnas.74.2.560 10. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467. https://doi.org/10.1073/pnas. 74.12.5463 11. Sanger F, Coulson AR, Hong GF et al (1982) Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162:729–773 12. Smith LM, Sanders JZ, Kaiser RJ et al (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321:674–679. https://doi.org/10.1038/321674a0 13. Hood LE, Hunkapiller MW, Smith LM (1987) Automated DNA sequencing and analysis of the human genome. Genomics 1: 201–212. https://doi.org/10.1016/08887543(87)90046-2 14. Hunkapiller T, Kaiser RJ, Koop BF, Hood L (1991) Large-scale and automated DNA sequence determination. Science 254:59–67. 
https://doi.org/10.1126/science.1925562 15. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:
81
796–815. https://doi.org/10.1038/ 35048692 16. Craig Venter J, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351. https://doi.org/ 10.1126/science.1058040 17. Matsumoto T, Wu J, Kanamori H et al (2005) The map-based sequence of the rice genome. Nature 436:793–800. https://doi.org/10. 1038/nature03895 18. Shendure J (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145. https://doi.org/10.1038/nbt1486 19. Margulies M, Egholm M, Altman WE et al (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380. https://doi.org/10.1038/ nature03959 20. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of nextgeneration sequencing technologies. Nat Rev Genet 17:333–351. https://doi.org/10. 1038/nrg.2016.49 21. Levene MJ, Korlach J, Turner SW et al (2003) Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299: 682–686. https://doi.org/10.1126/science. 1079700 22. Haque F, Li J, Wu H-C et al (2013) Solidstate and biological nanopore for real-time sensing of single chemical and sequencing of DNA. Nano Today 8:56–74. https://doi. org/10.1016/j.nantod.2012.12.008 23. Jain M, Koren S, Miga KH et al (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338–345. https://doi.org/10.1038/nbt. 4060 24. Heather JM, Chain B (2016) The sequence of sequencers: the history of sequencing DNA. Genomics 107:1–8. https://doi.org/10. 1016/j.ygeno.2015.11.003 25. Shendure J, Balasubramanian S, Church GM et al (2017) DNA sequencing at 40: past, present and future. Nature 550:345. https:// doi.org/10.1038/nature24286 26. Kchouk M, Gibrat JF, Elloumi M (2017) Generations of sequencing technologies: from first to next generation. Biol Med 9:1. https://doi.org/10.4172/0974-8369. 1000395 27. Sanger F (1959) Chemistry of insulin. Science 129:1340–1344. https://doi.org/10.1126/ science.129.3359.1340 28. 
Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol
82
Anupam Singh et al.
94:441–448. https://doi.org/10.1016/ 0022-2836(75)90213-2 29. Sanger F, Air GM, Barrell BG et al (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687–695. https:// doi.org/10.1038/265687a0 30. Mullis KB, Faloona FA (1987) Specific synthesis of DNA in vitro via a polymerasecatalyzed chain reaction. Methods Enzymol 155:335–350. https://doi.org/10.1016/ 0076-6879(87)55023-6 31. Hyman ED (1988) A new method of sequencing DNA. Anal Biochem 174: 423–436. https://doi.org/10.1016/00032697(88)90041-3 32. Edwards A, Voss H, Rice P et al (1990) Automated DNA sequencing of the human HPRT locus. Genomics 6:593–608. https://doi. org/10.1016/0888-7543(90)90493-E 33. Ronaghi M, Karamohamed S, Pettersson B et al (1996) Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242:84–89. https://doi.org/ 10.1006/abio.1996.0432 34. Mitra RD, Church GM (1999) In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27:e34. https://doi.org/10. 1093/nar/27.24.e34 35. Dressman D, Yan H, Traverso G et al (2003) Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci 100:8817–8822. https://doi. org/10.1073/pnas.1133470100 36. Shendure J, Porreca GJ, Reppas NB et al (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–1732. https://doi.org/10.1126/sci ence.1117389 37. Seo TS, Bai X, Kim DH et al (2005) Fourcolor DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides. Proc Natl Acad Sci 102:5926–5931. https:// doi.org/10.1073/pnas.0501965102 38. Ruparel H, Bi L, Li Z et al (2005) Design and synthesis of a 30 -O-allyl photocleavable fluorescent nucleotide as a reversible terminator for DNA sequencing by synthesis. Proc Natl Acad Sci 102:5932–5937. https://doi.org/ 10.1073/pnas.0501962102 39. 
Flusberg BA, Webster DR, Lee JH et al (2010) Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 7:461–465. https://doi.org/ 10.1038/nmeth.1459
40. Huang S, He J, Chang S et al (2010) Identifying single bases in a DNA oligomer with electron tunnelling. Nat Nanotechnol 5: 868–873. https://doi.org/10.1038/nnano. 2010.213 41. Rothberg JM, Hinz W, Rearick TM et al (2011) An integrated semiconductor device enabling non-optical genome sequencing. Nature 475:348–352. https://doi.org/10. 1038/nature10242 42. Chidgeavadze ZG, Beabealashvilli RS, Atrazhev AM et al (1984) 20 ,30 -Dideoxy-30 aminonucleoside 50 -triphosphates are the terminators of DNA synthesis catalyzed by DNA polymerases. Nucleic Acids Res 12: 1671–1686. https://doi.org/10.1093/nar/ 12.3.1671 43. Prober JM, Trainor GL, Dam RJ et al (1987) A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238:336–341. https://doi. org/10.1126/science.2443975 44. Zhang J, Fang Y, Hou JY et al (1995) Use of non-cross-linked polyacrylamide for fourcolor DNA sequencing by capillary electrophoresis separation of fragments up to 640 bases in length in two hours. Anal Chem 67:4589–4593. https://doi.org/10. 1021/ac00120a026 45. Fleischmann RD, Adams MD, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512. https://doi.org/ 10.1126/science.7542800 46. Goffeau A, Barrell BG, Bussey H et al (1996) Life with 6000 genes. Science 274:546–567. https://doi.org/10.1126/science.274. 5287.546 47. C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2018. https://doi. org/10.1126/science.282.5396.2012 48. Adams MD, Celniker SE, Holt RA et al (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195. https://doi.org/10.1126/science.287.5461. 2185 49. International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436:793–800. https:// doi.org/10.1038/nature03895 50. 
Schmutz J, Cannon SB, Schlueter J et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178. https://doi. org/10.1038/nature08670
51. Mitra RD, Shendure J, Olejnik J et al (2003) Fluorescent in situ sequencing on polymerase colonies. Anal Biochem 320:55–65. https://doi.org/10.1016/s0003-2697(03)00291-4
52. Ju J, Kim DH, Bi L et al (2006) Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proc Natl Acad Sci 103:19635–19640. https://doi.org/10.1073/pnas.0609513103
53. Nyrén P, Lundin A (1985) Enzymatic method for continuous monitoring of inorganic pyrophosphate synthesis. Anal Biochem 151:504–509. https://doi.org/10.1016/0003-2697(85)90211-8
54. Myllykangas S, Buenrostro J, Ji HP (2012) Overview of sequencing technology platforms. In: Bioinformatics for high throughput sequencing. Springer, New York, NY
55. Huse SM, Huber JA, Morrison HG et al (2007) Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 8:R143. https://doi.org/10.1186/gb-2007-8-7-r143
56. GenomeWeb (2015) Roche shutting down 454 sequencing business. GenomeWeb. https://www.genomeweb.com/sequencing/roche-shutting-down-%20454sequencing-bus
57. Balasubramanian S (2015) Solexa sequencing: decoding genomes on a population scale. Clin Chem 61:21–24. https://doi.org/10.1373/clinchem.2014.221747
58. Voelkerding KV, Dames SA, Durtschi JD (2009) Next-generation sequencing: from basic research to diagnostics. Clin Chem 55:641–658. https://doi.org/10.1373/clinchem.2008.112789
59. Reuter JA, Spacek DV, Snyder MP (2015) High-throughput sequencing technologies. Mol Cell 58:586–597. https://doi.org/10.1016/j.molcel.2015.05.004
60. Kulski JK (2016) Next-generation sequencing: an overview of the history, tools, and “omic” applications. IntechOpen, Rijeka. Chapter 1
61. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402. https://doi.org/10.1146/annurev.genom.9.081307.164359
62. Alic AS, Ruzafa D, Dopazo J, Blanquer I (2016) Objective review of de novo standalone error correction methods for NGS data. Wiley Interdiscip Rev Comput Mol Sci 6:111–146. https://doi.org/10.1002/wcms.1239
63. Drmanac R, Sparks AB, Callow MJ et al (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327:78–81. https://doi.org/10.1126/science.1181498
64. Li Q, Zhao X, Zhang W et al (2019) Reliable multiplex sequencing with rare index mis-assignment on DNB-based NGS platform. BMC Genomics 20:1–13. https://doi.org/10.1186/s12864-019-5569-5
65. Xu Y, Lin Z, Tang C et al (2019) A new massively parallel nanoball sequencing platform for whole exome research. BMC Bioinformatics 20:1–9. https://doi.org/10.1186/s12859-019-2751-3
66. Salzberg SL, Yorke JA (2005) Beware of mis-assembled genomes. Bioinformatics 21:4320–4321. https://doi.org/10.1093/bioinformatics/bti769
67. Daber R, Sukhadia S, Morrissette JJD (2013) Understanding the limitations of next generation sequencing informatics, an approach to clinical pipeline validation using artificial data sets. Cancer Genet 206:441–448. https://doi.org/10.1016/j.cancergen.2013.11.005
68. van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C (2018) The third revolution in sequencing technology. Trends Genet 34:666–681. https://doi.org/10.1016/j.tig.2018.05.008
69. Payne A, Holmes N, Rakyan V, Loose M (2019) BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35:2193–2198. https://doi.org/10.1093/bioinformatics/bty841
70. Braslavsky I, Hebert B, Kartalov E, Quake SR (2003) Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci 100:3960–3964. https://doi.org/10.1073/pnas.0230489100
71. Bowers J, Mitchell J, Beer E et al (2009) Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods 6:593–595. https://doi.org/10.1038/nmeth.1354
72. Eid J, Fehr A, Gray J et al (2009) Real-time DNA sequencing from single polymerase molecules. Science 323:133–138. https://doi.org/10.1126/science.1162986
73. Ip CLC, Loose M, Tyson JR et al (2015) MinION analysis and reference consortium: phase 1 data release and analysis. F1000Research 4:1075. https://doi.org/10.12688/f1000research.7201.1
74. Travers KJ, Chin C-S, Rank DR et al (2010) A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res 38:e159. https://doi.org/10.1093/nar/gkq543
75. Eisenstein M (2012) Oxford Nanopore announcement sets sequencing sector abuzz. Nat Biotechnol 30:295–296
76. Loman NJ, Quinlan AR (2014) Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 30:3399–3401. https://doi.org/10.1093/bioinformatics/btu555
77. Li J, Stein D, McMullan C et al (2001) Ion-beam sculpting at nanometre length scales. Nature 412:166–169. https://doi.org/10.1038/35084037
78. Dekker C (2007) Solid-state nanopores. Nat Nanotechnol 2:209–215. https://doi.org/10.1038/nnano.2007.27
79. Deamer D, Akeson M, Branton D (2016) Three decades of nanopore sequencing. Nat Biotechnol 34:518–524. https://doi.org/10.1038/nbt.3423
80. Kasianowicz JJ, Brandin E, Branton D, Deamer DW (1996) Characterization of individual polynucleotide molecules using a membrane channel. Proc Natl Acad Sci U S A 93:13770–13773. https://doi.org/10.1073/pnas.93.24.13770
81. Branton D, Deamer DW, Marziali A et al (2008) The potential and challenges of nanopore sequencing. Nat Biotechnol 26:1146–1153. https://doi.org/10.1038/nbt.1495
82. Lu H, Giordano F, Ning Z (2016) Oxford nanopore MinION sequencing and genome assembly. Genom Proteom Bioinformatics 14:265–279. https://doi.org/10.1016/j.gpb.2016.05.004
83. Jain M, Olsen HE, Paten B, Akeson M (2016) The Oxford nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 17:239. https://doi.org/10.1186/s13059-016-1103-0
84. McCoy RC, Taylor RW, Blauwkamp TA et al (2014) Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One 9:e106689
85. Heger M (2016) 10X Genomics, Pacific Biosciences provide business updates at JP Morgan Healthcare Conference. GenomeWeb. https://www.genomeweb.com/sequencing-technology/10x-genomics-pacific-biosciences-provide-business-updates-jp-morgan-healthcare
86. Belton JM, McCord RP, Gibcus JH et al (2012) Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58:268–276. https://doi.org/10.1016/j.ymeth.2012.05.001
87. de Wit E, de Laat W (2012) A decade of 3C technologies: insights into nuclear organization. Genes Dev 26:11–24. https://doi.org/10.1101/gad.179804.111
88. Aston C, Mishra B, Schwartz DC (1999) Optical mapping and its potential for large-scale sequencing projects. Trends Biotechnol 17:297–302. https://doi.org/10.1016/S0167-7799(99)01326-8
89. Schwartz DC, Li X, Hernandez LI et al (1993) Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262:110–114. https://doi.org/10.1126/science.8211116
90. Yuan Y, Chung CY-L, Chan T-F (2020) Advances in optical mapping for genomic research. Comput Struct Biotechnol J 18:2051–2062. https://doi.org/10.1016/j.csbj.2020.07.018
91. Hu L, Xu Z, Wang M et al (2019) The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis. Nat Commun 10:4702. https://doi.org/10.1038/s41467-019-12607-6
92. Wang M, Tu L, Yuan D et al (2019) Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat Genet 51:224–229. https://doi.org/10.1038/s41588-018-0282-x
93. Jiao Y, Peluso P, Shi J et al (2017) Improved maize reference genome with single-molecule technologies. Nature 546:524–527. https://doi.org/10.1038/nature22971
94. Yuan Y, Milec Z, Bayer PE et al (2018) Large-scale structural variation detection in subterranean clover subtypes using optical mapping. Front Plant Sci 9:971
95. Wheeler DA, Srinivasan M, Egholm M et al (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872–876. https://doi.org/10.1038/nature06884
96. Paterson AH, Bowers JE, Bruggmann R et al (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457:551–556. https://doi.org/10.1038/nature07723
97. Sato S, Tabata S, Hirakawa H et al (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485:635–641. https://doi.org/10.1038/nature11119
98. Jaillon O, Aury J-M, Noel B et al (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467. https://doi.org/10.1038/nature06148
99. Schnable PS, Ware D, Fulton RS et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115. https://doi.org/10.1126/science.1178534
100. Velasco R, Zharkikh A, Affourtit J et al (2010) The genome of the domesticated apple (Malus domestica Borkh.). Nat Genet 42:833–839. https://doi.org/10.1038/ng.654
101. Varshney RK, Chen W, Li Y et al (2012) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nat Biotechnol 30:83–89. https://doi.org/10.1038/nbt.2022
102. Singh NK, Gupta DK, Jayaswal PK et al (2011) The first draft of the pigeonpea genome sequence. J Plant Biochem Biotechnol 21:98–112. https://doi.org/10.1007/s13562-011-0088-8
103. D’Hont A, Denoeud F, Aury J-M et al (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488:213–217. https://doi.org/10.1038/nature11241
104. Mayer KFX, Waugh R, Langridge P et al (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature 491:711–716. https://doi.org/10.1038/nature11543
105. Young ND, Debellé F, Oldroyd GED et al (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480:520. https://doi.org/10.1038/nature10625
106. Pootakham W, Nawae W, Naktang C et al (2021) A chromosome-scale assembly of the black gram (Vigna mungo) genome. Mol Ecol Resour 21:238–250. https://doi.org/10.1111/1755-0998.13243
107. Wang X, Wang H, Wang J et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43:1035–1039. https://doi.org/10.1038/ng.919
108. Varshney RK, Song C, Saxena RK et al (2013) Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement. Nat Biotechnol 31:240–246. https://doi.org/10.1038/nbt.2491
109. Kim S, Park M, Yeom S-I et al (2014) Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat Genet 46:270–278. https://doi.org/10.1038/ng.2877
110. Argout X, Salse J, Aury J-M et al (2011) The genome of Theobroma cacao. Nat Genet 43:101–108. https://doi.org/10.1038/ng.736
111. Xiao Y, Xu P, Fan H et al (2017) The genome draft of coconut (Cocos nucifera). Gigascience 6:1–11. https://doi.org/10.1093/gigascience/gix095
112. Schmutz J, McClean PE, Mamidi S et al (2014) A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet 46:707–713. https://doi.org/10.1038/ng.3008
113. Huang S, Li R, Zhang Z et al (2009) The genome of the cucumber, Cucumis sativus L. Nat Genet 41:1275–1281. https://doi.org/10.1038/ng.475
114. Garcia-Mas J, Benjak A, Sanseverino W et al (2012) The genome of melon (Cucumis melo L.). Proc Natl Acad Sci 109:11872–11877. https://doi.org/10.1073/pnas.1205415109
115. Sun X, Zhu S, Li N et al (2020) A chromosome-level genome assembly of garlic (Allium sativum) provides insights into genome evolution and allicin biosynthesis. Mol Plant 13:1328–1339. https://doi.org/10.1016/j.molp.2020.07.019
116. Bertioli DJ, Cannon SB, Froenicke L et al (2016) The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nat Genet 48:438–446. https://doi.org/10.1038/ng.3517
117. Bertioli DJ, Jenkins J, Clevenger J et al (2019) The genome sequence of segmental allotetraploid peanut Arachis hypogaea. Nat Genet 51:877–884. https://doi.org/10.1038/s41588-019-0405-z
118. Yang J, Liu D, Wang X et al (2016) The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nat Genet 48:1225–1232. https://doi.org/10.1038/ng.3657
119. Sato S, Nakamura Y, Kaneko T et al (2008) Genome structure of the legume, Lotus japonicus. DNA Res 15:227. https://doi.org/10.1093/dnares/dsn008
120. Wang P, Luo Y, Huang J et al (2020) The genome evolution and domestication of tropical fruit mango. Genome Biol 21:60. https://doi.org/10.1186/s13059-020-01959-8
121. Kang YJ, Kim SK, Kim MY et al (2014) Genome sequence of mungbean and insights into evolution within Vigna species. Nat Commun 5:5443. https://doi.org/10.1038/ncomms6443
122. Krishnan NM, Pattnaik S, Jain P et al (2012) A draft of the genome and four transcriptomes of a medicinal and pesticidal angiosperm Azadirachta indica. BMC Genomics 13:464. https://doi.org/10.1186/1471-2164-13-464
123. Guo L, Winzer T, Yang X et al (2018) The opium poppy genome and morphinan production. Science 362:343–347. https://doi.org/10.1126/science.aat4096
124. Kreplak J, Madoui M-A, Cápal P et al (2019) A reference genome for pea provides insight into legume genome evolution. Nat Genet 51:1411–1422. https://doi.org/10.1038/s41588-019-0480-1
125. Verde I, Abbott AG, Scalabrin S et al (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45:487–494. https://doi.org/10.1038/ng.2586
126. Al-Mssallem IS, Hu S, Zhang X et al (2013) Genome sequence of the date palm Phoenix dactylifera L. Nat Commun 4:2274. https://doi.org/10.1038/ncomms3274
127. Ming R, VanBuren R, Wai CM et al (2015) The pineapple genome and the evolution of CAM photosynthesis. Nat Genet 47:1435–1442. https://doi.org/10.1038/ng.3435
128. Zeng L, Tu XL, Dai H et al (2019) Whole genomes and transcriptomes reveal adaptation and domestication of pistachio. Genome Biol 20:79. https://doi.org/10.1186/s13059-019-1686-3
129. Xu X, Pan S, Cheng S et al (2011) Genome sequence and analysis of the tuber crop potato. Nature 475:189–195. https://doi.org/10.1038/nature10158
130. Rousseau-Gueutin M, Belser C, Da Silva C et al (2020) Long-read assembly of the Brassica napus reference genome Darmor-bzh. Gigascience 9:giaa137. https://doi.org/10.1093/gigascience/giaa137
131. Hibrand Saint-Oyant L, Ruttink T, Hamama L et al (2018) A high-quality genome sequence of Rosa chinensis to elucidate ornamental traits. Nat Plants 4:473–484. https://doi.org/10.1038/s41477-018-0166-1
132. Tang C, Yang M, Fang Y et al (2016) The rubber tree genome reveals new insights into rubber production and species adaptation. Nat Plants 2:16073. https://doi.org/10.1038/nplants.2016.73
133. Shulaev V, Sargent DJ, Crowhurst RN et al (2011) The genome of woodland strawberry (Fragaria vesca). Nat Genet 43:109–116. https://doi.org/10.1038/ng.740
134. Edger PP, Poorten TJ, VanBuren R et al (2019) Origin and evolution of the octoploid strawberry genome. Nat Genet 51:541–547. https://doi.org/10.1038/s41588-019-0356-4
135. Xu Q, Chen L-L, Ruan X et al (2013) The draft genome of sweet orange (Citrus sinensis). Nat Genet 45:59–66. https://doi.org/10.1038/ng.2472
136. Zhang J, Zhang X, Tang H et al (2018) Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet 50:1565–1573. https://doi.org/10.1038/s41588-018-0237-2
137. Wei C, Yang H, Wang S et al (2018) Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality. Proc Natl Acad Sci 115:E4151–E4158. https://doi.org/10.1073/pnas.1719622115
138. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. https://doi.org/10.1038/35048692
139. Sierro N, Battey JND, Ouadi S et al (2014) The tobacco genome sequence and its comparison with those of tomato and potato. Nat Commun 5:3833. https://doi.org/10.1038/ncomms4833
140. Li F, Fan G, Lu C et al (2015) Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat Biotechnol 33:524–530. https://doi.org/10.1038/nbt.3208
141. Appels R, Eversole K, Stein N et al (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361:eaar7191. https://doi.org/10.1126/science.aar7191
142. Zimin A, Stevens KA, Crepeau MW et al (2014) Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics 196:875–890. https://doi.org/10.1534/genetics.113.159715
143. Šafář J, Bartoš J, Janda J et al (2004) Dissecting large and complex genomes: flow sorting and BAC cloning of individual chromosomes from bread wheat. Plant J 39:960–968. https://doi.org/10.1111/j.1365-313X.2004.02179.x
144. Feuillet C, Leach JE, Rogers J et al (2011) Crop genome sequencing: lessons and rationales. Trends Plant Sci 16:77–88. https://doi.org/10.1016/j.tplants.2010.10.005
145. Weigel D, Mott R (2009) The 1001 genomes project for Arabidopsis thaliana. Genome Biol 10:107. https://doi.org/10.1186/gb-2009-10-5-107
146. The 3000 rice genomes project (2014) The 3,000 rice genomes project. Gigascience 3:7. https://doi.org/10.1186/2047-217X-3-7
147. Wang W, Mauleon R, Hu Z et al (2018) Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557:43–49. https://doi.org/10.1038/s41586-018-0063-9
148. Lai J, Li R, Xu X et al (2010) Genome-wide patterns of genetic variation among elite maize inbred lines. Nat Genet 42:1027–1030. https://doi.org/10.1038/ng.684
149. Liang Z, Duan S, Sheng J et al (2019) Whole-genome resequencing of 472 Vitis accessions for grapevine diversity and demographic history analyses. Nat Commun 10:1190. https://doi.org/10.1038/s41467-019-09135-8
150. Zhou Z, Jiang Y, Wang Z et al (2015) Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol 33:408–414. https://doi.org/10.1038/nbt.3096
151. Lam H-M, Xu X, Liu X et al (2010) Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat Genet 42:1053–1059. https://doi.org/10.1038/ng.715
152. Varshney RK, Saxena RK, Upadhyaya HD et al (2017) Whole-genome resequencing of 292 pigeonpea accessions identifies genomic regions associated with domestication and agronomic traits. Nat Genet 49:1082–1088. https://doi.org/10.1038/ng.3872
153. Varshney RK, Thudi M, Roorkiwal M et al (2019) Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nat Genet 51:857–864. https://doi.org/10.1038/s41588-019-0401-3
154. Varshney RK, Shi C, Thudi M et al (2017) Pearl millet genome sequence provides a resource to improve agronomic traits in arid environments. Nat Biotechnol 35:969–976. https://doi.org/10.1038/nbt.3943
155. Causse M, Desplat N, Pascual L et al (2013) Whole genome resequencing in tomato reveals variation associated with introgression and breeding events. BMC Genomics 14:791. https://doi.org/10.1186/1471-2164-14-791
156. Wu D, Liang Z, Yan T et al (2019) Whole-genome resequencing of a worldwide collection of rapeseed accessions reveals the genetic basis of ecotype divergence. Mol Plant 12:30–43. https://doi.org/10.1016/j.molp.2018.11.007
157. Slavov GT, DiFazio SP, Martin J et al (2012) Genome resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the forest tree Populus trichocarpa. New Phytol 196:713–725. https://doi.org/10.1111/j.1469-8137.2012.04258.x
158. Ossowski S, Schneeberger K, Clark RM et al (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18:2024–2033. https://doi.org/10.1101/gr.080200.108
159. Kim MY, Lee S, Van K et al (2010) Whole-genome sequencing and intensive analysis of the undomesticated soybean (Glycine soja Sieb. and Zucc.) genome. Proc Natl Acad Sci 107:22032–22037. https://doi.org/10.1073/pnas.1009526107
160. Ramakrishna G, Kaur P, Nigam D et al (2018) Genome-wide identification and characterization of InDels and SNPs in Glycine max and Glycine soja for contrasting seed permeability traits. BMC Plant Biol 18:141. https://doi.org/10.1186/s12870-018-1341-2
161. Mace ES, Tai S, Gilding EK et al (2013) Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum. Nat Commun 4:2320. https://doi.org/10.1038/ncomms3320
162. He J, Zhao X, Laroche A et al (2014) Genotyping-by-sequencing (GBS), an ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front Plant Sci 5:1–8. https://doi.org/10.3389/fpls.2014.00484
163. Elshire RJ, Glaubitz JC, Sun Q et al (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:e19379
164. Beissinger TM, Hirsch CN, Sekhon RS et al (2013) Marker density and read depth for genotyping populations using genotyping-by-sequencing. Genetics 193:1073–1081. https://doi.org/10.1534/genetics.112.147710
165. Bus A, Hecht J, Huettel B et al (2012) High-throughput polymorphism detection and genotyping in Brassica napus using next-generation RAD sequencing. BMC Genomics 13:281. https://doi.org/10.1186/1471-2164-13-281
166. Truong HT, Ramos AM, Yalcin F et al (2012) Sequence-based genotyping for marker discovery and co-dominant scoring in germplasm and populations. PLoS One 7:e37565
167. Sonah H, Bastien M, Iquira E et al (2013) An improved genotyping by sequencing (GBS) approach offering increased versatility and efficiency of SNP discovery and genotyping. PLoS One 8:e54603
168. Zhao K, Tung C-W, Eizenga GC et al (2011) Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat Commun 2:467. https://doi.org/10.1038/ncomms1467
169. Tian F, Bradbury PJ, Brown PJ et al (2011) Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat Genet 43:159–162. https://doi.org/10.1038/ng.746
170. Huang X, Kurata N, Wei X et al (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501. https://doi.org/10.1038/nature11532
171. Huang X, Zhao Y, Wei X et al (2012) Genome-wide association study of flowering time and grain yield traits in a worldwide collection of rice germplasm. Nat Genet 44:32–39. https://doi.org/10.1038/ng.1018
172. Spindel J, Wright M, Chen C et al (2013) Bridging the genotyping gap: using genotyping by sequencing (GBS) to add high-density SNP markers and new value to traditional bi-parental mapping and breeding populations. Theor Appl Genet 126:2699–2716. https://doi.org/10.1007/s00122-013-2166-x
173. Li H, Peng Z, Yang X et al (2013) Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat Genet 45:43–50. https://doi.org/10.1038/ng.2484
174. Romay MC, Millard MJ, Glaubitz JC et al (2013) Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol 14:R55. https://doi.org/10.1186/gb-2013-14-6-r55
175. Jia G, Huang X, Zhi H et al (2013) A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nat Genet 45:957–961. https://doi.org/10.1038/ng.2673
176. Morris GP, Ramu P, Deshpande SP et al (2013) Population genomic and genome-wide association studies of agroclimatic traits in sorghum. Proc Natl Acad Sci 110:453–458. https://doi.org/10.1073/pnas.1215985110
177. Varala K, Swaminathan K, Li Y, Hudson ME (2011) Rapid genotyping of soybean cultivars using high throughput sequencing. PLoS One 6:e24811
178. Uitdewilligen JGAML, Wolters A-MA, D’hoop BB et al (2013) A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLoS One 8:e62355
179. Ward JA, Bhangoo J, Fernández-Fernández F et al (2013) Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation. BMC Genomics 14:2. https://doi.org/10.1186/1471-2164-14-2
180. Poland JA, Brown PJ, Sorrells ME, Jannink J-L (2012) Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS One 7:e32253
181. Liu H, Bayer M, Druka A et al (2014) An evaluation of genotyping by sequencing (GBS) to map the Breviaristatum-e (ari-e) locus in cultivated barley. BMC Genomics 15:104. https://doi.org/10.1186/1471-2164-15-104
182. Tettelin H, Masignani V, Cieslewicz MJ et al (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A 102:13950–13955. https://doi.org/10.1073/pnas.0506758102
183. Golicz AA, Bayer PE, Bhalla PL et al (2020) Pangenomics comes of age: from bacteria to plant and animal applications. Trends Genet 36:132–145. https://doi.org/10.1016/j.tig.2019.11.006
184. Bayer PE, Golicz AA, Scheben A et al (2020) Plant pan-genomes are the new reference. Nat Plants 6:914–920. https://doi.org/10.1038/s41477-020-0733-0
185. Tao Y, Zhao X, Mace E et al (2019) Exploring and exploiting pan-genomics for crop improvement. Mol Plant 12:156–169. https://doi.org/10.1016/j.molp.2018.12.016
186. Della Coletta R, Qiu Y, Ou S et al (2021) How the pan-genome is changing crop genomics and improvement. Genome Biol 22:3. https://doi.org/10.1186/s13059-020-02224-8
187. Li Y, Zhou G, Ma J et al (2014) De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat Biotechnol 32:1045–1052. https://doi.org/10.1038/nbt.2979
188. Golicz AA, Bayer PE, Barker GC et al (2016) The pangenome of an agronomically important crop plant Brassica oleracea. Nat Commun 7:13390. https://doi.org/10.1038/ncomms13390
189. Montenegro JD, Golicz AA, Bayer PE et al (2017) The pangenome of hexaploid bread wheat. Plant J 90:1007–1013. https://doi.org/10.1111/tpj.13515
190. Hurgobin B, Golicz AA, Bayer PE et al (2018) Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol J 16:1265–1274. https://doi.org/10.1111/pbi.12867
191. Golicz AA, Batley J, Edwards D (2016) Towards plant pangenomics. Plant Biotechnol J 14:1099–1105. https://doi.org/10.1111/pbi.12499
192. Lin K, Zhang N, Severing EI et al (2014) Beyond genomic variation--comparison and functional annotation of three Brassica rapa genomes: a turnip, a rapid cycling and a Chinese cabbage. BMC Genomics 15:250. https://doi.org/10.1186/1471-2164-15-250
193. Schatz MC, Maron LG, Stein JC et al (2014) Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica. Genome Biol 15:506. https://doi.org/10.1186/PREACCEPT-2784872521277375
194. Hirsch CN, Foerster JM, Johnson JM et al (2014) Insights into the maize pan-genome and pan-transcriptome. Plant Cell 26:121–135. https://doi.org/10.1105/tpc.113.119982
195. Yao W, Li G, Zhao H et al (2015) Exploring the rice dispensable genome using a metagenome-like assembly strategy. Genome Biol 16:187. https://doi.org/10.1186/s13059-015-0757-3
196. Gordon SP, Contreras-Moreira B, Woods DP et al (2017) Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat Commun 8:2184. https://doi.org/10.1038/s41467-017-02292-8
197. Zhou P, Silverstein KAT, Ramaraj T et al (2017) Exploring structural variation and gene family architecture with de novo assemblies of 15 Medicago genomes. BMC Genomics 18:261. https://doi.org/10.1186/s12864-017-3654-1
198. Ou L, Li D, Lv J et al (2018) Pan-genome of cultivated pepper (Capsicum) and its use in gene presence-absence variation analyses. New Phytol 220:360–363
199. Zhao Q, Feng Q, Lu H et al (2018) Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat Genet 50:278–284. https://doi.org/10.1038/s41588-018-0041-z
200. Yu J, Golicz AA, Lu K et al (2019) Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars. Plant Biotechnol J 17:881–892. https://doi.org/10.1111/pbi.13022
201. Hübner S, Bercovich N, Todesco M et al (2019) Sunflower pan-genome analysis shows that hybridization altered gene content and disease resistance. Nat Plants 5:54–62. https://doi.org/10.1038/s41477-018-0329-0
202. Gao L, Gonda I, Sun H et al (2019) The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat Genet 51:1044. https://doi.org/10.1038/s41588-019-0410-2
203. Song J-M, Guan Z, Hu J et al (2020) Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nat Plants 6:34–45. https://doi.org/10.1038/s41477-019-0577-7
204. Trouern-Trend AJ, Falk T, Zaman S et al (2020) Comparative genomics of six Juglans species reveals disease-associated gene family contractions. Plant J 102:410–423. https://doi.org/10.1111/tpj.14630
205. Liu Y, Du H, Li P et al (2020) Pan-genome of wild and cultivated soybeans. Cell 182:162–176.e13. https://doi.org/10.1016/j.cell.2020.05.023
206. Zhao J, Bayer PE, Ruperao P et al (2020) Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnol J 18:1946–1954. https://doi.org/10.1111/pbi.13354
207. Ruperao P, Thirunavukkarasu N, Gandham P et al (2021) Sorghum pan-genome explores the functional utility for genomic-assisted breeding to accelerate the genetic gain. Front Plant Sci 12:963
208. Hufford MB, Seetharam AS, Woodhouse MR et al (2021) De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373:655–662. https://doi.org/10.1126/science.abg5289
209. Shomura A, Izawa T, Ebana K et al (2008) Deletion in a gene associated with grain size increased yields during rice domestication. Nat Genet 40:1023–1028. https://doi.org/10.1038/ng.169
210. Xu K, Xu X, Fukao T et al (2006) Sub1A is an ethylene-response-factor-like gene that confers submergence tolerance to rice. Nature 442:705–708. https://doi.org/10.1038/nature04920
211. Ashikawa I, Hayashi N, Yamane H et al (2008) Two adjacent nucleotide-binding site–leucine-rich repeat class genes are required to confer Pikm-specific rice blast resistance. Genetics 180:2267–2276. https://doi.org/10.1534/genetics.108.095034
212. Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–331
213. Bernatzky R, Tanksley SD (1986) Toward a saturated linkage map in tomato based on isozymes and random cDNA sequences. Genetics 112:887–898
214. Konieczny A, Ausubel FM (1993) A procedure for mapping Arabidopsis mutations using co-dominant ecotype-specific PCR-based markers. Plant J 4:403–410. https://doi.org/10.1046/j.1365-313x.1993.04020403.x
215. Gupta PK, Varshney RK (2000) The development and use of microsatellite markers for genetic analysis and plant breeding with emphasis on bread wheat. Euphytica 113:163–185. https://doi.org/10.1023/A:1003910819967
216. Wei X, Wang L, Zhang Y et al (2014) Development of simple sequence repeat (SSR) markers of sesame (Sesamum indicum) from a genome survey. Molecules 19:5150–5162. https://doi.org/10.3390/molecules19045150
217. Salgado LR, Koop DM, Pinheiro DG et al (2014) De novo transcriptome analysis of Hevea brasiliensis tissues by RNA-seq and screening for molecular markers. BMC Genomics 15:236. https://doi.org/10.1186/1471-2164-15-236
218. Nigam D, Saxena S, Ramakrishna G et al (2017) De novo assembly and characterization of Cajanus scarabaeoides (L.) Thouars transcriptome by paired-end sequencing. Front Mol Biosci 4:48. https://doi.org/10.3389/fmolb.2017.00048
219. Yadav CB, Bonthala VS, Muthamilarasan M et al (2015) Genome-wide development of transposable elements-based markers in foxtail millet and construction of an integrated database. DNA Res 22:79–90. https://doi.org/10.1093/dnares/dsu039
220. Bailey-Serres J, Fukao T, Ronald P et al (2010) Submergence tolerant rice: SUB1’s journey from landrace to modern cultivar. Rice 3:138–147. https://doi.org/10.1007/s12284-010-9048-5
221. Takagi H, Abe A, Yoshida K et al (2013) QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J 74:174–183. https://doi.org/10.1111/tpj.12105
222. Liu S, Yeh C-T, Tang HM et al (2012) Gene mapping via bulked segregant RNA-Seq (BSR-Seq). PLoS One 7:e36406
223. Hussain W, Baenziger PS, Belamkar V et al (2017) Genotyping-by-sequencing derived high-density linkage map and its application to QTL mapping of flag leaf traits in bread wheat. Sci Rep 7:16394. https://doi.org/10.1038/s41598-017-16006-z
224. Price AH (2006) Believe it or not, QTLs are accurate! Trends Plant Sci 11:213–216. https://doi.org/10.1016/j.tplants.2006.03.006
225. Zhu C, Gore M, Buckler ES, Yu J (2008) Status and prospects of association mapping in plants. Plant Genome 1:5. https://doi.org/10.3835/plantgenome2008.02.0089
226. Varshney RK, Graner A, Sorrells ME (2005) Genomics-assisted breeding for crop improvement. Trends Plant Sci 10:621–630. https://doi.org/10.1016/j.tplants.2005.10.004
227. Varshney RK, Terauchi R, McCouch SR (2014) Harvesting the promising fruits of genomics: applying genome sequencing technologies to crop breeding. PLoS Biol 12:1–8. https://doi.org/10.1371/journal.pbio.1001883
228. Collard BCY, Mackill DJ (2008) Marker-assisted selection: an approach for precision plant breeding in the twenty-first century. Philos Trans R Soc Lond Ser B Biol Sci 363:557–572. https://doi.org/10.1098/rstb.2007.2170
229. Busemeyer L, Ruckelshausen A, Möller K et al (2013) Precision phenotyping of biomass accumulation in triticale reveals temporal genetic patterns of regulation. Sci Rep 3:2442. https://doi.org/10.1038/srep02442
230. Varshney RK, Thudi M, Nayak SN et al (2014) Genetic dissection of drought tolerance in chickpea (Cicer arietinum L.). Theor Appl Genet 127:445–462. https://doi.org/10.1007/s00122-013-2230-6
231. Hayes B, Goddard M (2010) Genome-wide association and genomic selection in animal breeding. Genome 53:876–883. https://doi.org/10.1139/G10-076
232. Crossa J, Pérez P, Hickey J et al (2014) Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity 112:48–60. https://doi.org/10.1038/hdy.2013.16
233. de Oliveira EJ, de Resende MDV, da Silva Santos V et al (2012) Genome-wide selection in cassava. Euphytica 187:263–276. https://doi.org/10.1007/s10681-012-0722-0
234. Shen R, Fan J-B, Campbell D et al (2005) High-throughput SNP genotyping on universal bead arrays. Mutat Res 573:70–82. https://doi.org/10.1016/j.mrfmmm.2004.07.022
235. Steemers FJ, Gunderson KL (2007) Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2:41–49. https://doi.org/10.1002/biot.200600213
236. Matsuzaki H, Dong S, Loi H et al (2004) Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 1:109–111. https://doi.org/10.1038/nmeth718
237. Ganal MW, Durstewitz G, Polley A et al (2011) A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6:e28334. https://doi.org/10.1371/journal.pone.0028334
238. Hiremath PJ, Kumar A, Penmetsa RV et al (2012) Large-scale development of cost-effective SNP marker assays for diversity assessment and genetic mapping in chickpea and comparative mapping in legumes. Plant Biotechnol J 10:716–732. https://doi.org/10.1111/j.1467-7652.2012.00710.x
239.
Saxena RK, Penmetsa RV, Upadhyaya HD et al (2012) Large-scale development of cost-effective single-nucleotide polymorphism marker assays for genetic mapping in pigeonpea and comparative mapping in legumes. DNA Res 19:449–461. https:// doi.org/10.1093/dnares/dss025 240. Sanand S, Srivastava H, Kaila T et al (2020) Methods and tools for plant organelle genome sequencing, assembly, and
91
downstream analysis. In: Methods in molecular biology. Humana Press, Totowa, NJ, pp 49–98 241. Shinozaki K, Ohme M, Tanaka M et al (1986) The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J 5: 2043–2049 242. Ohyama K, Fukuzawa H, Kohchi T et al (1986) Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322:572–574. https://doi.org/10. 1038/322572a0 243. Segovia R, Pett W, Trewick S, Lavrov DV (2011) Extensive and evolutionarily persistent mitochondrial tRNA editing in Velvet Worms (phylum Onychophora). Mol Biol Evol 28:2873–2881. https://doi.org/10. 1093/molbev/msr113 244. Fitzgerald TL, Shapter FM, McDonald S et al (2011) Genome diversity in wild grasses under environmental stress. Proc Natl Acad Sci U S A 108:21140–21145. https://doi. org/10.1073/pnas.1115203108 245. Hardouin EA, Tautz D (2013) Increased mitochondrial mutation frequency after an island colonization: positive selection or accumulation of slightly deleterious mutations? Biol Lett 9:20121123. https://doi.org/10. 1098/rsbl.2012.1123 246. Hieter P, Boguski M (1997) Functional genomics: it’s all how you read it. Science 278: 601–602. https://doi.org/10.1126/science. 278.5338.601 247. Yamada K, Lim J, Dale JM et al (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842–846. https://doi.org/10.1126/science. 1088305 248. Weber APM, Weber KL, Carr K et al (2007) Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol 144:32–42. https://doi.org/10. 1104/pp.107.096677 249. Lowe R, Shirley N, Bleackley M et al (2017) Transcriptomics technologies. PLoS Comput Biol 13:e1005457 250. Nejat N, Ramalingam A, Mantri N (2018) Advances in transcriptomics of plants. Adv Biochem Eng Biotechnol 164:161–185. https://doi.org/10.1007/10_2017_52 251. 
Ramakrishna G, Kaur P, Singh A et al (2021) Comparative transcriptome analyses revealed different heat stress responses in pigeonpea (Cajanus cajan) and its crop wild relatives.
92
Anupam Singh et al.
Plant Cell Rep 40:881. https://doi.org/10. 1007/s00299-021-02686-5 252. Schmid M, Davison TS, Henz SR et al (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet 37:501–506. https://doi.org/10.1038/ng1543 253. Wang L, Xie W, Chen Y et al (2010) A dynamic gene expression atlas covering the entire life cycle of rice. Plant J 61:752–766. https://doi.org/10.1111/j.1365-313X. 2009.04100.x 254. Severin AJ, Woody JL, Bolon YT et al (2010) RNA-Seq atlas of glycine max: a guide to the soybean transcriptome. BMC Plant Biol 10: 160. https://doi.org/10.1186/1471-222910-160 255. Kudapa H, Garg V, Chitikineni A, Varshney RK (2018) The RNA-Seq-based high resolution gene expression atlas of chickpea (Cicer arietinum L.) reveals dynamic spatio-temporal changes associated with growth and development. Plant Cell Environ 41:2209–2225. https://doi.org/10.1111/pce.13210 256. Pazhamala LT, Purohit S, Saxena RK et al (2017) Gene expression atlas of pigeonpea and its application to gain insights into genes associated with pollen fertility implicated in seed formation. J Exp Bot 68:2037–2054. https://doi.org/10.1093/jxb/erx010 257. Sekhon RS, Briskine R, Hirsch CN et al (2013) Maize gene atlas developed by RNA sequencing and comparative evaluation of transcriptomes based on RNA sequencing and microarrays. PLoS One 8:e61005 258. Ramı´rez-Gonza´lez RH, Borrill P, Lang D et al (2018) The transcriptional landscape of polyploid wheat. Science 361:eaar6089. https:// doi.org/10.1126/science.aar6089 259. Allis CD, Jenuwein T (2016) The molecular hallmarks of epigenetic control. Nat Rev Genet 17:487–500. https://doi.org/10. 1038/nrg.2016.59 260. Gu H, Smith ZD, Bock C et al (2011) Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat Protoc 6: 468–481. https://doi.org/10.1038/nprot. 2010.190 261. 
Wang P, Xia H, Zhang Y et al (2015) Genome-wide high-resolution mapping of DNA methylation identifies epigenetic variation across embryo and endosperm in Maize (Zea may). BMC Genomics 16:21. https:// doi.org/10.1186/s12864-014-1204-7 262. Junaid A, Kumar H, Rao AR et al (2018) Unravelling the epigenomic interactions between parental inbreds resulting in an
altered hybrid methylome in pigeonpea. DNA Res 25:361–373. https://doi.org/10. 1093/dnares/dsy008 263. Meissner A, Gnirke A, Bell GW et al (2005) Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 33: 5868–5877. https://doi.org/10.1093/nar/ gki901 264. Platt A, Gugger PF, Pellegrini M, Sork VL (2015) Genome-wide signature of local adaptation linked to variable CpG methylation in oak populations. Mol Ecol 24:3823–3830. https://doi.org/10.1111/mec.13230 265. Chen X, Ge X, Wang J et al (2015) Genomewide DNA methylation profiling by modified reduced representation bisulfite sequencing in Brassica rapa suggests that epigenetic modifications play a key role in polyploid genome evolution. Front Plant Sci 6:836. https://doi. org/10.3389/fpls.2015.00836 266. Clark C, Palta P, Joyce CJ et al (2012) A comparison of the whole genome approach of MeDIP-Seq to the targeted approach of the infinium HumanMethylation450 BeadChip® for methylome profiling. PLoS One 7:e50233 267. Taiwo O, Wilson GA, Morris T et al (2012) Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 7: 617–636. https://doi.org/10.1038/nprot. 2012.012 268. Zhao M-T, Whyte JJ, Hopkins GM et al (2014) Methylated DNA immunoprecipitation and high-throughput sequencing (MeDIP-seq) using low amounts of genomic DNA. Cell Reprogr 16:175–184. https:// doi.org/10.1089/cell.2014.0002 269. Weng Y-I, Huang TH-M, Yan PS (2009) Methylated DNA immunoprecipitation and microarray-based analysis: detection of DNA methylation in breast cancer cell lines. Methods Mol Biol 590:165–176. https://doi.org/ 10.1007/978-1-60327-378-7 270. Park PJ (2009) ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10:669–680. https://doi.org/10. 1038/nrg2641 271. Warr A, Robert C, Hume D et al (2015) Exome sequencing: current and future perspectives. G3 5:1543–1550. https://doi. org/10.1534/g3.115.018564 272. 
Muraya MM, Schmutzer T, Ulpinnis C et al (2015) Targeted sequencing reveals largescale sequence polymorphism in maize candidate genes for biomass production and composition. PLoS One 10:e0132120
Next Generation Sequencing Technologies for Crops 273. Saintenac C, Jiang D, Akhunov ED (2011) Targeted analysis of nucleotide and copy number variation by exon capture in allotetraploid wheat genome. Genome Biol 12:R88. https://doi.org/10.1186/gb-2011-129-r88 274. Haun WJ, Hyten DL, Xu WW et al (2011) The composition and origins of genomic variation among individuals of the soybean reference cultivar Williams 82. Plant Physiol 155: 645–655. https://doi.org/10.1104/pp.110. 166736 275. Vorholt JA (2012) Microbial life in the phyllosphere. Nat Rev Microbiol 10:828–840. https://doi.org/10.1038/nrmicro2910 276. Bulgarelli D, Schlaeppi K, Spaepen S et al (2013) Structure and functions of the bacterial microbiota of plants. Annu Rev Plant Biol 64:807–838. https://doi.org/10.1146/ annurev-arplant-050312-120106 277. Newton AC, Fitt BDL, Atkins SD et al (2010) Pathogenesis, parasitism and mutualism in the trophic space of microbe-plant interactions. Trends Microbiol 18:365–373. https://doi. org/10.1016/j.tim.2010.06.002 278. Young JPW, Crossman LC, Johnston AWB et al (2006) The genome of Rhizobium leguminosarum has recognizable core and accessory components. Genome Biol 7:R34. https://doi.org/10.1186/gb-2006-7-4-r34 279. Sablok G, Rosselli R, Seeman T et al (2017) Draft genome sequence of the nitrogen-fixing rhizobium sullae type strain IS123T focusing on the key genes for symbiosis with its host Hedysarum coronarium L. Front Microbiol 8:1348 280. Bromfield ESP, Cloutier S, Nguyen HDT (2019) Description and complete genome sequence of bradyrhizobium amphicarpaeae sp. Nov., harbouring photosystem and nitrogenfixation genes. Int J Syst Evol Microbiol 69:2841–2848. https://doi.org/10.1099/ ijsem.0.003569 281. Ramachandran VK, East AK, Karunakaran R et al (2011) Adaptation of Rhizobium leguminosarumto pea, alfalfa and sugar beet rhizospheres investigated by comparative transcriptomics. Genome Biol 12:R106. https://doi.org/10.1186/gb-2011-12-10r106 282. 
Liu X, Wei S, Wang F et al (2012) Burkholderia and Cupriavidus spp. are the preferred symbionts of Mimosa spp. in southern China. FEMS Microbiol Ecol 80:417–426. https:// doi.org/10.1111/j.1574-6941.2012. 01310.x
93
283. Pont C, Wagner S, Kremer A et al (2019) Paleogenomics: reconstruction of plant evolutionary trajectories from modern and ancient DNA. Genome Biol 20:1–17. https://doi.org/10.1186/s13059-0191627-1 284. Barba-Montoya J, dos Reis M, Schneider H et al (2018) Constraining uncertainty in the timescale of angiosperm evolution and the veracity of a Cretaceous Terrestrial Revolution. New Phytol 218:819–834. https://doi. org/10.1111/nph.15011 285. Murat F, Armero A, Pont C et al (2017) Reconstructing the genome of the most recent common ancestor of flowering plants. Nat Genet 49:490–496. https://doi.org/10. 1038/ng.3813 286. Raymond O, Gouzy J, Just J et al (2018) The Rosa genome provides new insights into the domestication of modern roses. Nat Genet 50:772–777. https://doi.org/10.1038/ s41588-018-0110-3 287. Murat F, Louis A, Maumus F et al (2015) Understanding Brassicaceae evolution through ancestral genome reconstruction. Genome Biol 16:262. https://doi.org/10. 1186/s13059-015-0814-y 288. Wu S, Shamimuzzaman M, Sun H et al (2017) The bottle gourd genome provides insights into Cucurbitaceae evolution and facilitates mapping of a Papaya ring-spot virus resistance locus. Plant J 92:963–975. https://doi.org/10.1111/tpj.13722 289. Wang J, Sun P, Li Y et al (2017) Hierarchically aligning 10 legume genomes establishes a family-level genomics platform. Plant Physiol 174:284–300. https://doi.org/10.1104/pp. 16.01981 290. Wang X, Wang J, Jin D et al (2015) Genome alignment spanning major poaceae lineages reveals heterogeneous evolutionary rates and alters inferred dates for key evolutionary events. Mol Plant 8:885–898. https://doi. org/10.1016/j.molp.2015.04.004 291. Murat F, Xu J-H, Tannier E et al (2010) Ancestral grass karyotype reconstruction unravels new mechanisms of genome shuffling as a source of plant evolution. Genome Res 20:1545–1557. https://doi.org/10. 1101/gr.109744.110 292. 
Ramos-Madrigal J, Smith BD, MorenoMayar JV et al (2016) Genome sequence of a 5,310-year-old maize cob provides insights into the early stages of maize domestication. Curr Biol 26:3195–3201. https://doi.org/ 10.1016/j.cub.2016.09.036
94
Anupam Singh et al.
293. Kistler L, Maezumi SY, Gregorio de Souza J et al (2018) Multiproxy evidence highlights a complex evolutionary legacy of maize in South America. Science 362:1309–1313. https://doi.org/10.1126/science.aav0207 294. Scott MF, Botigue´ LR, Brace S et al (2019) A 3,000-year-old Egyptian emmer wheat genome reveals dispersal and domestication history. Nat Plants 5:1120–1128. https:// doi.org/10.1038/s41477-019-0534-5 295. Roullier C, Benoit L, McKey DB, Lebot V (2013) Historical collections reveal patterns of diffusion of sweet potato in Oceania obscured by modern plant movements and recombination. Proc Natl Acad Sci 110: 2205–2210. https://doi.org/10.1073/ pnas.1211049110 ˜ oz-Rodrı´guez P, Carruthers T, Wood 296. Mun JRI et al (2018) Reconciling conflicting phylogenies in the origin of sweet potato and dispersal to polynesia. Curr Biol 28: 1246–1256.e12. https://doi.org/10.1016/ j.cub.2018.03.020
297. Gutaker RM, Weiß CL, Ellis D et al (2019) The origins and adaptation of European potatoes reconstructed from historical genomes. Nat Ecol Evol 3:1093–1101. https://doi. org/10.1038/s41559-019-0921-3 298. Nozaki H, Takano H, Misumi O et al (2007) A 100%-complete sequence reveals unusually simple genomic features in the hot-spring red alga Cyanidioschyzon merolae. BMC Biol 5: 28. https://doi.org/10.1186/1741-70075-28 299. Church GM, Gao Y, Kosuri S (2012) Nextgeneration digital information storage in DNA. Science 337:1628. https://doi.org/ 10.1126/science.1226355 300. Arcadia CE, Kennedy E, Geiser J et al (2020) Multicomponent molecular memory. Nat Commun 11:691. https://doi.org/10. 1038/s41467-020-14455-1 301. Kim J, Bae JH, Baym M, Zhang DY (2020) Metastable hybridization-based DNA information storage to allow rapid and permanent erasure. Nat Commun 11:1–8. https://doi. org/10.1038/s41467-020-18842-6
Chapter 4
Check CRISPR Editing Events in Transgenic Wheat with Next-Generation Sequencing
Junli Zhang
Abstract
CRISPR-Cas9 is a convenient tool to create knockout mutants for desired genes. Early generations of transgenic plants usually have multiple editing events. To select the best plants for future study, these plants need to be screened for editing type and proportion. However, the traditional restriction enzyme method cannot reveal the editing type, and Sanger sequencing cannot resolve a mixture of different editing events. Next-generation sequencing (NGS) is the best choice for screening CRISPR-Cas9-induced edits. With NGS, hundreds to thousands of T1 plants can be sequenced without genome-specific primers. Here we present the detailed procedure for using NGS to select plants with the best editing, including primer design, PCR setup, sample preparation, and sequencing result analysis.
Key words CRISPR, Wheat, Next-generation sequencing
1 Introduction
CRISPR (clustered regularly interspaced short palindromic repeats) is a simple and efficient genome editing technology and has been widely used to precisely knock out genes of interest in multiple crops [1]. CRISPR is well suited to knocking out genes in polyploid species: only one guide RNA (gRNA) is needed to knock out all the homologous or homeologous genes, a difficult task to accomplish with other mutation methods. Wheat is an allohexaploid species containing three genomes (A, B, and D). Homeologous genes among the three genomes are usually 97% similar, so it is very easy to find gRNAs that target all three genomes. The recent breakthrough in wheat transformation [2] makes it much easier to get tens to hundreds of transgenic plants in almost any wheat genotype. After getting the transgenic plants, the editing efficiency of
Supplementary Information The online version contains supplementary material available at [https://doi.org/10.1007/978-1-0716-2533-0_4].
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
each plant needs to be checked. We can use the restriction enzyme method to check whether a recognition site near the Protospacer Adjacent Motif (PAM) sequence is destroyed, but this approach has multiple limitations. First, it needs genome-specific primers flanking the PAM; second, it requires a restriction enzyme recognition sequence on the CRISPR cutting site; finally, it cannot tell the frequency of different editing events. Next-generation sequencing (NGS) can overcome all the limitations mentioned above. More and more sequencing labs and companies now provide CRISPR sequencing (or amplicon sequencing) services (see Note 1). You submit PCR amplicons, and the provider adds barcodes and sequencing adapters to your samples before sequencing. Here we describe how to use NGS to check the CRISPR editing efficiency in transgenic plants.
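The second limitation above (the restriction site must sit on the cutting site) can be illustrated with a minimal Python sketch. It assumes SpCas9 with an NGG PAM immediately 3′ of the protospacer on the given strand and a cut 3 bp upstream of the PAM; the function names and sequences are hypothetical and are not part of this protocol.

```python
# Illustrative sketch only: sequences and function names are made up.
# Assumption: SpCas9, NGG PAM right after the protospacer, cut 3 bp 5' of the PAM.

def cut_position(amplicon: str, protospacer: str) -> int:
    """Return the 0-based index of the first base after the expected cut."""
    start = amplicon.find(protospacer)
    if start == -1:
        raise ValueError("protospacer not found in amplicon")
    pam_start = start + len(protospacer)
    pam = amplicon[pam_start:pam_start + 3]
    if len(pam) != 3 or pam[1:] != "GG":
        raise ValueError("no NGG PAM directly after the protospacer")
    return pam_start - 3


def site_spans_cut(amplicon: str, protospacer: str, site: str) -> bool:
    """True if any occurrence of the restriction site contains the cut point."""
    cut = cut_position(amplicon, protospacer)
    pos = amplicon.find(site)
    while pos != -1:
        # The cut must fall between two bases of the recognition site.
        if pos < cut < pos + len(site):
            return True
        pos = amplicon.find(site, pos + 1)
    return False


protospacer = "ACGTACGTACGTACGAATTC"             # hypothetical 20-nt target
amplicon = "TTACG" + protospacer + "TGG" + "ACCTA"
print(site_spans_cut(amplicon, protospacer, "GAATTC"))  # EcoRI site on the cut
print(site_spans_cut(amplicon, protospacer, "GGATCC"))  # BamHI site absent
```

If no enzyme passes such a check for a given gRNA, the restriction enzyme method cannot be used for that target, which is one reason the chapter turns to NGS.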
2 Materials
2.1 Used by All Procedures
1. Pipettes and pipette tips (10 μL, 20 μL, 200 μL, 1 mL).
2. 1.5 mL Eppendorf tubes.
3. Sterile deionized water.
2.2 PCRs
1. PCR machine.
2. Thin-walled PCR tubes.
3. Taq polymerase buffer, 10×.
4. Taq polymerase (commercial or homemade).
5. 50 mM MgCl2.
6. 10 μM dNTPs.
7. 10 μM Forward primer with the left adapter.
8. 10 μM Reverse primer with the right adapter.
9. 10 μM Left barcodes.
10. 10 μM Right barcodes.
11. Isolated genomic DNA.
2.3 PCR Product Recovery
1. Homemade beads (protocol can be found here: https://ethanomics.files.wordpress.com/2012/08/serapure_v2-2.pdf; you can also use the commercial Agencourt AMPure XP Beads (Beckman, A63881)).
2. Freshly prepared 70% ethanol: 1.0 mL per DNA sample.
3. Elution buffer (10 mM Tris-HCl, pH 8.0): 30 μL per DNA pool.
4. Magnet for 1.5 mL tubes.
2.4 DNA Concentration Measurement
1. Qubit Fluorometer (version 2 or higher, Thermo Fisher Scientific).
2. Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Q32851).
3 Methods
CRISPR sequencing normally yields 50,000 to 100,000 reads per sample; our lab usually pools PCR amplicons of up to 300 samples. In order to pool multiple samples, barcodes need to be added to each sample via two rounds of PCR (Fig. 1): the first round adds adapters to your PCR target (for priming by the second-round primers), and the second round adds barcodes to the ends of the PCR amplicons. Most sequencing providers use Illumina PE150 (paired-end 150 bp) for sequencing.
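The pooling logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lab's software: it assumes the reads of one submitted pool are shared evenly among the pooled plants, and that each plant is identified by one (left barcode, right barcode) pair. The barcode and plant names are placeholders.

```python
from itertools import product

# Illustrative sketch only: names and barcode labels are hypothetical.

def mean_reads_per_plant(total_reads: int, n_plants: int) -> float:
    """Average depth per plant if reads are shared evenly (an idealization)."""
    if n_plants <= 0:
        raise ValueError("n_plants must be positive")
    return total_reads / n_plants


def assign_barcode_pairs(samples, left_barcodes, right_barcodes):
    """Give every sample a unique (left, right) barcode combination."""
    pairs = list(product(left_barcodes, right_barcodes))
    if len(samples) > len(pairs):
        raise ValueError("not enough barcode combinations for all samples")
    return dict(zip(samples, pairs))


# With 300 pooled plants, 50,000 to 100,000 reads give roughly
# 167 to 333 reads per plant on average.
for total in (50_000, 100_000):
    print(total, round(mean_reads_per_plant(total, 300)))

# Two left and two right barcodes can label up to four plants.
print(assign_barcode_pairs(["plant1", "plant2", "plant3"],
                           ["L1", "L2"], ["R1", "R2"]))
```

Because the number of combinations is the product of the two barcode sets, a modest number of left and right barcodes is enough to label several hundred plants.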
3.1 Design Primers
1. PCR amplicon should be about 150 to 400 bp, ideally [...] Load Genome from File => select your fasta file used above; to load the bam files: go to menu File => Load from File => select one or more .bam files. Visit the IGV website for more information.
Fig. 6 Use IGV to check the alignments and editing details
4 Notes
1. We have been using the sequencing services in the USA provided by MGH CCIB DNA Core (https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/crispr_sequencing_main.jsp) and Genewiz (https://www.genewiz.com/Public/Services/Next-Generation-Sequencing/Amplicon-Sequencing-Services/Amplicon-EZ).
2. If there are still dimers after the cleaning, you can clean up again with 0.8× of homemade beads. Adjust the final concentration and submit samples based on the instructions of your sequencing service provider. You can also use gel recovery to clean up your PCR product instead of beads.
3. All these analyses are done in your computer's memory. Your browser may freeze if it is overloaded, but there should be no problem for regular amplicon sequencing, which normally yields 50,000 to 100,000 reads per sample. Refresh the page if you need to redo the analysis or if there are problems. The root URL of the web tools may change in the future (right now: https://junli.netlify.app/), so check my GitHub repository (https://github.com/pinbo/junli-blog) for an update of the root URL if the current URLs are not accessible. Another useful tool you can try: https://crispresso.pinellolab.partners.org/submission.
4. There might be great variation in the number of reads for each individual in the pooled samples, caused by DNA concentration variation among samples and primer efficiency differences. It might not be easy to normalize hundreds of DNA samples, but make sure to collect a similar amount of tissue from each sample. If you need to pool PCR amplicons of different primers, you can pool the PCR product of each pair of primers first, measure the concentration of each pool after cleaning, and then combine the pools so that each sample contributes an equal amount of DNA. For example, you have two pairs of primers (P1 and P2). P1 amplified 40 samples and P2 amplified 56 samples. After cleaning, the P1 pool has a concentration of 30 ng/μL and the P2 pool has 40 ng/μL. You need to submit 25 μL DNA at 20 ng/μL (500 ng DNA in total). Then you will need 500 ng × 40 / (40 + 56) = 208 ng DNA from P1 and 500 − 208 = 292 ng from P2, that is, 208 ng / 30 ng/μL = 6.9 μL of P1 DNA and 292 ng / 40 ng/μL = 7.3 μL of P2 DNA. Just add 25 − 6.9 − 7.3 = 10.8 μL of water.
5. Random mutations may be introduced during PCR, especially when using homemade Taq. That is okay for our indel check. If the same SNP shows up in almost all the reads, it is possibly a real variation between your material and the reference sequence.
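The pooling arithmetic in Note 4 can be sketched in Python as follows. This is a minimal illustration, assuming each primer pool should contribute DNA in proportion to its number of samples, as in the worked example; the function and argument names are hypothetical.

```python
# Illustrative sketch of the Note 4 calculation; names are hypothetical.

def pooling_volumes(pools, total_ng, final_volume_ul):
    """Compute the volume to take from each cleaned pool plus the water to add.

    pools maps a pool name to (number of samples, concentration in ng/uL).
    Each pool contributes DNA in proportion to its number of samples.
    """
    total_samples = sum(n for n, _ in pools.values())
    volumes = {}
    for name, (n_samples, conc) in pools.items():
        ng_needed = total_ng * n_samples / total_samples
        volumes[name] = ng_needed / conc
    water = final_volume_ul - sum(volumes.values())
    if water < 0:
        raise ValueError("pools are too dilute for the requested final volume")
    return volumes, water


# Numbers from Note 4: P1 = 40 samples at 30 ng/uL, P2 = 56 samples at 40 ng/uL,
# submitting 25 uL at 20 ng/uL (500 ng in total).
volumes, water = pooling_volumes({"P1": (40, 30.0), "P2": (56, 40.0)},
                                 total_ng=500, final_volume_ul=25)
for name, vol in volumes.items():
    print(f"{name}: {vol:.1f} uL")
print(f"water: {water:.1f} uL")
```

Run on the numbers in Note 4, this gives 6.9 μL of P1, 7.3 μL of P2, and 10.8 μL of water, matching the worked example. The check on negative water volume flags pools that are too dilute to reach the requested concentration.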
Acknowledgements
This project was supported by the Agriculture and Food Research Initiative Competitive Grants 2017-67007-25939 (WheatCAP) and 2022-68013-36439 (WheatCAP) from the USDA National Institute of Food and Agriculture, and Howard Hughes Medical Institute funding to Dr. Jorge Dubcovsky. I am grateful to Drs. Gilad Gabay, Qiujie Liu, Chao Bian, and Chaozhong Zhang for useful discussions and testing the method. Dr. Chao Bian also helped test the barcodes and adaptors. Many thanks to Saarah Kuzay and Priscilla Glenn for critical reviewing and editing the manuscript.
References
1. Chen K, Wang Y, Zhang R et al (2019) CRISPR/Cas genome editing and precision plant breeding in agriculture. Annu Rev Plant Biol 70:667–697
2. Debernardi JM, Tricoli DM, Ercoli MF et al (2020) A GRF–GIF chimeric protein improves the regeneration efficiency of transgenic plants. Nat Biotechnol 38:1274–1279
3. Chen S, Zhou Y, Chen Y, Gu J (2018) Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884–i890
4. Connelly JP, Pruett-Miller SM (2019) CRIS.Py: a versatile and high-throughput analysis program for CRISPR-based genome editing. Sci Rep 9:4194
5. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760
6. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
Chapter 5
Virus Induced Gene Silencing: A Tool to Study Gene Function in Wheat
Gaganpreet Kaur Dhariwal, Raman Dhariwal, Michele Frick, and André Laroche
Abstract
Bread wheat is a staple crop for human consumption. Recent technological advances have made available not only a high-quality, annotated reference genome of wheat (IWGSC RefSeq v2.1) but also pangenomic sequences of cultivars of different origin, including stress-tolerant and susceptible cultivars, and transcriptomic sequences of lines resistant/tolerant to different biotic and abiotic stresses. However, despite the vast progress made in sequencing the genomes and transcriptomes, functional annotation of wheat genes is still lagging behind that of model plants. The host defense response to viruses, known as virus-induced gene silencing (VIGS), has been effectively manipulated as a tool to study gene function in model and crop plants. VIGS allows researchers to generate and screen large loss-of-function phenotypic data sets with less effort, no requirement for stable transformation, and in a relatively short span of time compared to other methods such as TILLING, EcoTILLING, and RNAi. In this chapter, we discuss the basics of posttranscriptional gene silencing and describe step-by-step protocols for functional annotation of wheat genes using the barley stripe mosaic virus-based VIGS system.
Key words BSMV, dsRNAs, Functional annotation, RISC, RNA silencing, siRNAs
1 Introduction
Bread wheat (Triticum aestivum L.), a staple crop for human consumption, is the youngest polyploid (segmental allohexaploid) among all crop plant species that evolved about 10,000 years ago and belongs to subtribe Triticinae and tribe Triticeae of the family Poaceae [1–3]. The bread wheat genome is complex and extremely large (~17 Gb), which is ~sevenfold larger than the maize genome, ~37-fold larger than the rice genome, and ~128-fold larger than the Arabidopsis genome [4]. It consists of ~80% repeated sequences and ~70% transposable elements [1, 4]. Thanks to the efforts of the wheat scientific community, the complex wheat genome has now been sequenced and a partially annotated high quality reference
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
genome sequence (IWGSC RefSeq v2.1; IWGSC Annotation v2.1) is available to further explore crop improvement opportunities [5]. Moreover, recent advances in next-generation sequencing technology have also enabled sequencing of genomes and transcriptomes of wheat genotypes of different origins, including lines with different levels of tolerance/resistance to abiotic and biotic stresses [6, 7] (for reviews, see [8, 9]). Intense analyses of different gene families in various model plant systems have also greatly improved our knowledge of their molecular functions and phenotypic effects. Consequently, inventories of genes with differential expression under several stress conditions have been established for many plant species [9]. However, despite the vast progress made in sequencing the genomes and transcriptomes, and the intense characterization of gene families in model plants, functional annotation of wheat genes is still lagging behind [9, 10]. Reverse genetics techniques such as TILLING [11, 12], EcoTILLING [13], T-DNA tagging [14], and transposon tagging (for review, see [15]) have been used effectively in the past for functional annotation of genes in plants with relatively small genomes. However, these techniques require tiresome effort and a long time to generate mutant populations and annotate gene function in species with larger genomes such as wheat.
All life forms are equipped with highly sophisticated, inherent, ancient natural defense systems to effectively detect exogenous and altered endogenous nucleic acids, a common building block of life [16].
The host responses to different types of exogenous nucleic acids, that is, viruses, transposons, transgenes, and double-stranded (ds)RNA, known as quelling in fungi [17], RNA interference in vertebrates and invertebrates [18], and posttranscriptional gene silencing (PTGS) in plants [19–24], have been effectively manipulated as a tool to study gene functions in different organisms [18, 25–34]. Virus-induced gene silencing (VIGS) [35], one of the several variants of PTGS, is an effective reverse genetic approach to transiently silence expression of endogenous genes in a sequence-specific manner and offers several benefits such as fast screening, systemic expression, and loss-of-function phenotype screening in minimum time [31, 32] (for a review, see [36]). Moreover, VIGS is particularly useful in identifying the real causal gene among multiple candidate genes, such as those underlying a major quantitative trait locus interval or those showing specific expression patterns during transcriptome analysis for a particular trait [37]. The initial step in deciphering a gene function using VIGS is the development of a recombinant virus carrying a nucleic acid fragment homologous to the endogenous gene to be silenced [31]. Infection with the recombinant virus leads to synthesis of large amounts of viral dsRNA or dsRNA intermediate (replicative form), an intermediate molecule in viral replication, within the cells [26, 33]. In addition
to formation of dsRNA intermediates, viral single-stranded RNA can also fold into secondary structures such as helix/stem-loop hairpins after infection [38]. Once dsRNA (replicative form and/or secondary structures) reaches sufficient levels in the cytoplasm, the host activates the RNA silencing pathway: as a defense response, DICER, an RNase-like enzyme, targets and cleaves whole viral dsRNA/hairpin RNA molecules into 20–25 nucleotide long small interfering RNAs (siRNAs) with two-nucleotide 3′ overhangs [33, 36, 39]. The siRNAs are then incorporated into the RNA-induced silencing complex (RISC), where they guide the specific degradation or suppression of sufficiently complementary endogenous mRNAs at the posttranscriptional level [36, 40–42]. In essence, the passenger (sense) strand first separates from the guide (antisense) strand and the RISC and is degraded; thereafter, the guide strand of the siRNA, which remains associated with the RISC, targets homologous mRNAs of the endogenous gene for degradation, eventually resulting in downregulation of the host gene transcript level [36, 39, 43]. See Fig. 1 for the molecular mechanism of VIGS induced by positive-strand RNA viruses (+ssRNA viruses) like the barley stripe mosaic virus (BSMV) in plants.
All VIGS/PTGS methods rely on the activity and sufficient availability of dsRNAs corresponding to the targeted endogenous gene [26, 33, 41]. dsRNAs may be generated by host RNA-dependent RNA polymerases (RdRp) [44] from PTGS-inducing "inverted repeat" or "coexpressed sense and antisense RNA" transgenes [45–48], either to generate stably heritable RNAi plants or to study transient silencing. Most VIGS methods, however, rely mainly on dsRNAs formed as replicative intermediates within the cell during replication of the positive-sense strand of single-stranded (ss)RNA viruses [49]. Recombinant viruses can also be effectively constructed to express hairpin-like dsRNAs to mediate VIGS in plants [32, 50].
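The sequence specificity of the dicing and targeting steps described above can be illustrated with a simplified Python sketch. It is an idealization, not part of the protocols in this chapter: it assumes 21-nt windows (within the 20–25 nt range given in the text), counts perfect matches only, takes both sequences in the sense orientation, and uses made-up sequences and hypothetical function names. Real silencing tolerates some mismatches, so this only gives a rough picture of which transcripts share siRNA-sized stretches with a silencing fragment.

```python
# Illustrative sketch only: sequences and function names are hypothetical.

def kmers(seq: str, k: int = 21) -> set:
    """All k-nt windows of a sequence (RNA is converted to DNA letters)."""
    seq = seq.upper().replace("U", "T")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_windows(fragment: str, transcript: str, k: int = 21) -> set:
    """siRNA-sized windows of the silencing fragment found in a transcript."""
    return kmers(fragment, k) & kmers(transcript, k)


# Toy example with short windows (k=6) so the result is easy to follow.
fragment = "ACGTACGTTT"      # made-up silencing fragment
transcript = "GGACGTACGA"    # made-up transcript
print(sorted(shared_windows(fragment, transcript, k=6)))
```

A transcript that shares many such windows with the silencing fragment is a plausible silencing target, whether intended or not, which is why fragments are usually chosen from gene-specific regions.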
The literature on the use of PTGS/VIGS methods for studying gene functions has also grown tremendously over the past three decades. Readers are advised to consult different research and review articles published elsewhere for further details about PTGS in general [19–25, 46, 51–53] and VIGS in monocotyledonous host species including wheat [31–33, 35, 36, 54–58]. Several VIGS methods have also been published in different crop species but are beyond the scope of this chapter; readers are referred to protocols/methods published elsewhere [26, 27, 37]. In this chapter, we describe the step-by-step experimental protocols for two methods of VIGS using the modified tripartite (α, β, and γ) RNA genome of BSMV (see Fig. 2 for the genome organization of BSMV) [31–33, 59, 60] to study gene functions in wheat. These protocols can be used effectively in barley as well. While all three RNAs (α, β, and γ) together are required to cause infection [37],
110
Gaganpreet Kaur Dhariwal et al.
Fig. 1 Molecular mechanism of virus-induced gene silencing (VIGS) induced by positive-strand RNA viruses (+ssRNA viruses) in plants. After successful infection, recombinant BSMV virus (shown as blue-red color line capped at 50 end and tRNA like structure at 30 end; virus RNA is shown in blue color, while gene silencing construct asGOI or hpGOI in red) undergoes replication, subgenomic (sg) transcription, and translation. During replication, negative strand is synthesized first using viral RdRp and form, together with positive strand, dsRNA intermediate (replicative form), a main source for binding of DCL. In addition, viral single-stranded RNAs (viral genome RNA or mRNA produced from sg transcription) fold into secondary structures such as helix/stem-loop hairpin. dsRNA intermediate and stem of hairpin serve as binding sites for DCL, which cleave them into 20–25 nt long siRNAs with two nucleotide 30 overhangs. Artificially designed hairpin gene silencing construct provide additional dsRNA template (stem of hairpin) for DCL for increasing the efficiency of silencing. Newly made siRNAs load into RISC complex where passenger (RNA sense) strand separates from the guide strand (RNA antisense) and RISC complex, and degrades, thereafter the RISC targets homologous virus RNAs and mRNAs of endogenous target gene using guide-strand specificity. Eventually it results into degradation of both virus RNA and transcript of target gene in cell cytoplasm. AGO argonaute protein, asGOI antisense gene of interest fragment, DCL dicer-like protein, dsRNA double stranded RNA, gRNA genomic RNA, hpGOI hairpin gene of interest fragment, mRNA messenger RNA, RdRp RNA dependent RNA polymerase, RISC RNA-induced silencing complex, siRNA short-interfering RNA, ssRNA single-stranded RNA, TFs transcription factors
VIGS to Study Gene Functions in Wheat
111
Fig. 2 Genome organization and genetic modifications of barley stripe mosaic virus (BSMV). Rectangle boxes represent the open reading frames (ORFs) for different proteins/enzymes and are named accordingly for the three RNA genomes: (1) RNAα, (2) RNAβ, and (3) RNAγ. TGB triple gene block—multifunctional movement protein
γ-genome is used to carry short (up to 400 nt) fragments homologous to mRNA of endogenous plant gene. Upon infection, the wheat plant induces a defense response. The first method (hereafter referred as “the classical method”) uses ligation-dependent cloning of sense and/or antisense strand of target gene and in vitro transcription of virus genomes followed by direct inoculation of wheat plants, while the second method (hereafter referred as “Agrobacterium mediated method”) uses ligationindependent cloning of gene construct and in planta production of recombinant virus in Nicotiana benthamiana followed by secondary inoculations of wheat plants. The workflows of two methods are presented in Fig. 3.
2 Materials
Prepare solutions and buffers using analytical grade reagents in Milli-Q Type I water or other ultrapure water (resistivity 18 MΩ-cm at 25 °C) [61]. Store solutions, buffers and reagents at room temperature unless indicated otherwise on the product or in this protocol. Dispose of biological and other waste material carefully, following local bylaws and regulations for waste disposal.
2.1 Plant Material, Bacterial Strains, and Virus Vectors
This workflow requires seeds of a BSMV susceptible wheat cultivar and N. benthamiana, competent bacterial cells of Escherichia coli (E. coli) DH5α/DH10B (ThermoFisher Scientific Inc., Nepean, ON, Canada) and Agrobacterium tumefaciens strain C58 (Lifeasible, Shirley, NY, USA), and BSMV vectors pα42, pβ.Δβa, pγ42-MCS, and γ.bPDS4-as developed by Holzberg et al. [31], pCaBS-α and pCaBS-β developed by Yuan et al. [59] and pCa-γbLIC-ccdB, a modified version of the pCa-γbLIC [59] vector developed by Huang et al. [62].
Fig. 3 Workflow of virus-induced gene silencing (VIGS) induced by recombinant BSMV in wheat
2.2 Growth and Culture Media and Antibiotics
LB Agar, LB broth, YM broth, ampicillin, kanamycin, and rifampicin.
LB Agar
Bacto tryptone: 5.0 g.
Bacto yeast extract: 2.5 g.
Sodium chloride: 5.0 g.
Bacto Agar: 7.5 g.
Water to 500 mL.
Autoclave at 121 °C/15 psi for 20 min. Cool to 55 °C. Add antibiotics. Pour into plates.
LB Broth
Tryptone: 10.0 g.
Yeast extract: 5.0 g.
Sodium chloride: 5.0 g.
Water to 1 L.
Autoclave at 121 °C/15 psi for 20 min and cool to RT.
SOC medium
Prepare 1 M NaCl and 1 M KCl stock solutions.
Prepare and filter-sterilize a 2 M Mg++ stock (1 M MgCl2.6H2O and 1 M MgSO4.7H2O).
Prepare and filter-sterilize a 2 M glucose solution.
To 97 mL H2O add 2 g Bacto tryptone, 0.5 g yeast extract, 1 mL 1 M NaCl, and 0.25 mL 1 M KCl.
Autoclave at 121 °C/15 psi for 20 min and cool to RT.
Add 1 mL Mg++ stock and 1 mL glucose stock (20 mM final concentration each). Filter-sterilize.
YM Broth
Yeast extract (0.04%): 0.2 g.
Mannitol (1.00%): 5.0 g.
Sodium chloride (1.7 mM): 0.05 g.
MgSO4.7H2O (0.8 mM): 0.10 g.
K2HPO4 (2.2 mM): 0.192 g.
Water to 500 mL.
Autoclave at 121 °C/15 psi for 20 min and cool to RT.
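The concentrations given in parentheses in the YM broth recipe can be cross-checked from the weighed amounts. The following is a minimal sketch in Python, not part of the published protocol; the molar masses are standard textbook values supplied here as assumptions, and the function names are illustrative only.

```python
# Molar masses in g/mol (standard values; assumed here, not taken from the chapter).
MOLAR_MASS = {
    "NaCl": 58.44,
    "MgSO4.7H2O": 246.47,
    "K2HPO4": 174.18,
}


def millimolar(mass_g, molar_mass, volume_l):
    """Concentration in mM for mass_g of solute dissolved in volume_l litres."""
    return mass_g / molar_mass / volume_l * 1000.0


def percent_w_v(mass_g, volume_ml):
    """Concentration in % (w/v), i.e. grams of solute per 100 mL."""
    return mass_g / volume_ml * 100.0


# Amounts from the YM broth recipe above (per 500 mL of medium).
ym_salts_g = {"NaCl": 0.05, "MgSO4.7H2O": 0.10, "K2HPO4": 0.192}

for name, grams in ym_salts_g.items():
    print(f"{name}: {millimolar(grams, MOLAR_MASS[name], 0.5):.1f} mM")
print(f"Yeast extract: {percent_w_v(0.2, 500):.2f}% (w/v)")
print(f"Mannitol: {percent_w_v(5.0, 500):.2f}% (w/v)")
```

The same two helpers can be reused to sanity-check other recipes in this section before weighing out reagents.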
Antibiotics
Ampicillin (100 mg/mL): 1.0 g of sodium ampicillin dissolved in 10 mL of water. Filter-sterilize and store in aliquots at −20 °C.
Kanamycin (50 mg/mL; Kan50): 0.5 g of kanamycin dissolved in 10 mL of water. Filter-sterilize and store in aliquots at −20 °C.
Rifampicin (50 mg/mL): 250 mg of rifampicin dissolved in 5 mL of methanol. Vortex immediately to prevent it from sticking to the tube. Add ~5 drops of 10 N NaOH to facilitate dissolving. Rifampicin is light sensitive; cover media and plates containing it to avoid loss of activity.
2.3 Kits and Reagents
This workflow requires the kits listed below. Follow the manufacturer's instructions for preparing solutions.
2.3.1 The Classical Method
Plasmid DNA Isolation and Purification
1. QIAGEN Plasmid Midi Kit and QIAprep Spin Miniprep Kit (QIAGEN Inc., Toronto, ON, Canada). Store at room temperature (15–25 °C).
2. Dialysis tubing cellulose membrane (roll of dried flat tubing membrane; molecular weight cut-off: 14,000) and clips (MilliporeSigma, Oakville, ON, Canada). Store at room temperature.
3. Restriction enzymes PacI, NotI, SwaI, BssHII, MluI and SpeI (New England Biolabs® Inc., Ipswich, MA, USA). Store at −20 °C.
Ligation
1. T4 DNA ligase (New England Biolabs® Inc.). Store at −20 °C.
In Vitro Transcription
1. mMESSAGE mMACHINE™ T7 Transcription Kit (ThermoFisher Scientific Inc.). Store kit contents at −20 °C except “nuclease-free water,” which can be stored at any temperature.
2. RNase Inhibitor (New England Biolabs® Inc.). Store at −20 °C.
3. Proteinase K, molecular biology grade (New England Biolabs® Inc.). Store at −20 °C.
Plant Inoculations Using In Vitro Transcribed Viral RNAs
1. 10× Glycine Phosphate (GP) buffer: Add and mix 18.77 g glycine and 26.13 g K2HPO4 (dipotassium phosphate) into nuclease-free double distilled water, bring volume up to 500 mL and autoclave at 121 °C/15 psi for 20 min.
2. Inoculation buffer FES: Add 50 mL 10× GP buffer, 2.5 g sodium pyrophosphate, 2.5 g bentonite and 2.5 g celite into some nuclease-free double distilled water, bring volume up to 250 mL and autoclave at 121 °C/15 psi for 20 min. Note that bentonite and celite will not dissolve.
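When only a few plants are inoculated, a smaller batch of FES buffer may be convenient. The sketch below scales the 250 mL FES recipe given above to another volume by simple proportionality; the 50 mL target and the function name are illustrative assumptions, not part of the protocol.

```python
# FES inoculation buffer as listed above, for a 250 mL batch.
FES_250_ML = {
    "10x GP buffer (mL)": 50.0,
    "sodium pyrophosphate (g)": 2.5,
    "bentonite (g)": 2.5,
    "celite (g)": 2.5,
}


def scale_recipe(recipe, base_volume_ml, target_volume_ml):
    """Scale every ingredient linearly from base_volume_ml to target_volume_ml."""
    factor = target_volume_ml / base_volume_ml
    return {name: round(amount * factor, 3) for name, amount in recipe.items()}


# Hypothetical example: a 50 mL batch instead of 250 mL.
small_batch = scale_recipe(FES_250_ML, 250.0, 50.0)
for name, amount in small_batch.items():
    print(f"{name}: {amount}")
print("Bring to a final volume of 50 mL with nuclease-free water.")
```

Scaling keeps the proportions of the published recipe: the 10× GP buffer remains diluted fivefold and each solid remains at 1% (w/v).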
2.3.2 Agrobacterium Based Method
Plasmid DNA Isolation and Purification
1. QIAGEN Plasmid Midi Kit and QIAprep Spin Miniprep Kit (QIAGEN Inc.). Store at room temperature (15–25 °C).
2. Restriction enzyme ApaI and T4 DNA polymerase (New England Biolabs® Inc.). Store at −20 °C.
3. dTTP solution, molecular biology grade (100 mM) (ThermoFisher Scientific Inc.) and dATP solution, molecular biology grade (100 mM) (New England Biolabs® Inc.). Store both solutions at −20 °C.
4. DTT (dithiothreitol; 100 mM) (ThermoFisher Scientific Inc.). Store at 4 °C.
Plant RNA Isolation and cDNA Synthesis
1. DEPC (diethylpyrocarbonate)-treated water: Prepare 0.1% (v/v) DEPC water by stirring for 2 h at 37 °C, followed by autoclaving at 121 °C/15 psi for 30 min to inactivate traces of DEPC (see Note 1).
2. 10× MOPS buffer: For 100 mL volume, add 4.18 g MOPS to DEPC-treated water, dissolve and bring the pH to 7.0 with NaOH. Then add 2 mL sodium acetate (1 M) and 2 mL EDTA (0.5 M, pH 8) before adjusting the final volume to 100 mL with DEPC water (see Note 2).
3. QIAGEN RNeasy Plant Mini Kit (QIAGEN Inc.) or TRI Reagent® solution (MilliporeSigma). Store the QIAGEN kit at room temperature (15–25 °C) and TRI Reagent® solution at 4 °C.
4. Invitrogen™ Quant-iT™ RNA Assay Kit (ThermoFisher Scientific Inc.). Store in a refrigerator (2–8 °C) and protect from light.
5. Agilent RNA 6000 Nano Kit (Agilent Technologies, Inc., Mississauga, ON, Canada) (see Note 3). Store all reagents following the manufacturer’s instructions for the individual reagent and protect from light.
6. Invitrogen™ SuperScript™ III First-Strand Synthesis System (ThermoFisher Scientific Inc.). Store the kit contents at −20 °C.
7. RNA loading dye (6×): 50% glycerol (v/v), 10 mM EDTA (pH 8), 0.25% bromophenol blue (w/v), 0.25% xylene cyanol (w/v).
8. RNaseZap® RNase decontamination solution (ThermoFisher Scientific Inc.). Store at room temperature.
PCR Product Purification
1. QIAquick PCR Purification Kit (QIAGEN Inc.). Store at room temperature (15–25 °C).
2.3.3 Other Reagents and Solutions
1. Ethidium bromide solution (ThermoFisher Scientific Inc.). Store at room temperature.
2. GelRed nucleic acid gel stain (Biotium, Hayward, CA).
3. 2 M CaCl2 (CaCl2, anhydrous: 22.198 g, double distilled water to 100 mL, filter-sterilize).
4. Nuclease-free water (QIAGEN Inc.). Store at room temperature.
5. Formaldehyde solution 37% (ThermoFisher Scientific Inc.). Store at room temperature.
6. TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0):
Tris 1 M: 40 mL
EDTA 0.5 M: 8 mL
Adjust final volume with water up to 4 L and autoclave for 15 min.
7. DNA loading dye (6×): 30% (v/v) glycerol, 0.25% (w/v) bromophenol blue, 0.25% (w/v) xylene cyanol FF. Store at 4 °C.
8. DNA loading buffer (3 parts TE: 1 part 6× loading dye).
9. TAE buffer (50× stock solution). To prepare 1 L of 50× TAE, dissolve the following components in 600 mL deionized water:
Tris base (FW = 121): 242 g
Glacial acetic acid: 57.1 mL
0.5 M EDTA (pH 8.0): 100 mL
Adjust final volume to 1 L with deionized water.
10. TBE buffer (10× stock solution). To prepare 1 L of 10× TBE buffer, dissolve the following components in 600 mL of deionized water:
Tris base (FW = 121): 108 g
Boric acid (FW = 61.8): 55 g
0.5 M EDTA (pH 8.0): 40 mL
Adjust final volume to 1 L with deionized water.
11. Sodium acetate solution (3 M, pH 5.2). Dissolve 246.1 g of anhydrous sodium acetate in 800 mL of deionized water. Adjust the pH to 5.2 with glacial acetic acid and leave the solution overnight to cool. Adjust the pH to 5.2 once more before adjusting the final volume to 1 L with deionized water. Filter-sterilize the solution.
12. β-mercaptoethanol (β-ME; 14.3 M) (MilliporeSigma).
13. Ethanol (70% and 100%).
14. Isopropanol.
15. Formamide (45.04 g/mol).
16. Agarose.
17. Glycerol.
2.4 Computational Tools
This workflow requires a desktop PC or laptop running Windows OS (e.g., 64-bit, Intel Core-i7 processor, 16 GB RAM and 250 GB hard drive). The following open-source software is needed to complete parts of the VIGS workflows.
1. siFi (http://labtools.ipk-gatersleben.de/) is an easy-to-use Python (v. 2.7) graphical user interface (GUI) based software tool for Microsoft Windows™-based systems for long double-stranded RNAi (RNA interference)-target design and off-target prediction.
3 Methods
The basic workflows of the two VIGS methods (classical and Agrobacterium mediated VIGS methods) are presented in Fig. 3. Both methods of VIGS can be divided into five main steps: designing of gene silencing constructs, cloning of gene silencing constructs, production of recombinant virus genomes, plant inoculations, and assessment of gene silencing. In this chapter, we describe VIGS protocols using both methods for silencing the wheat phytoene desaturase (TaPDS) gene at the seedling developmental stage (Zadoks 12–15) [63].
3.1 Designing of Gene Silencing Constructs
As gene silencing in VIGS depends on the homology of the gene silencing construct with the target endogenous gene(s) to be silenced, VIGS constructs are designed to decipher the function of (1) an entire small gene family by targeting a common exonic region for simultaneous silencing, (2) an individual homoeologous gene by targeting unique 5′ or 3′ untranslated regions (UTR) (see Note 4), and (3) an alternative transcript of a specific gene by targeting the alternative exon(s).
3.1.1 Selection of Target Region
We recommend designing gene silencing constructs of around 50 to 100 nucleotides (nt) for the classical method and from 50 to 400 nt for the Agrobacterium mediated method, using the open-source software siFi (v. 1.2.3–0008) to maximize silencing and minimize off-targets.
Fig. 4 Screenshot of main menu of siFi (v. 1.2.3–0008) program for designing of RNAi/VIGS construct and off-target prediction
Fig. 5 Screenshot of Ensembl gene models of Phytoene desaturase gene shown on Ensembl wheat webpage
1. Download the latest siFi installer (e.g., sifi1.2.3–0008.exe) (41.1 MB) from “http://labtools.ipk-gatersleben.de/” and install the siFi software program on your Windows OS. See Fig. 4 for a screenshot of the main menu of siFi (v. 1.2.3–0008).
2. Open https://plants.ensembl.org/Triticum_aestivum/Info/Index to browse the Wheat web page of EnsemblPlants.
3. Paste “Phytoene desaturase” or name/EnsemblID of your gene of interest in search box on left hand right corner and hit “Go.”
4. Click on the “TraesCS4B02G300100” hyperlink for Phytoene desaturase (or the EnsemblID of your gene of interest) on the search results webpage. Then left-click anywhere on the exon-intron region of the TraesCS4B02G300100.1 predicted gene model (or the model of your gene of interest) shown as an image (see Fig. 5 for Ensembl gene models of the “Phytoene desaturase”/TaPDS gene) in the “Summary” section of the tab named “Gene: TraesCS4B02G300100,” followed by a left click on the “Transcript” hyperlink “TraesCS4B02G300100.1” in the next prompt window.
5. Then click on “cDNA” under the summary section of “Transcript-based displays,” a left-hand sidebar menu on the “Trans: TraesCS4B02G300100.1” tab of the new webpage, followed by the “Download sequence” button under the cDNA sequence title. Click the check box “Select/deselect all” and then the check box “cDNA (transcripts)” on the prompt to select just the cDNA of TraesCS4B02G300100.1 or your gene of interest. Then hit the “Download” button to download the sequence in “FASTA” (*.fa) format.
6. Locate the downloaded FASTA file and change its extension from “*.fa” to “*.fasta” (see Note 5). It can be changed by clicking the file name and selecting “rename” in the prompt window. Save this file in your VIGS folder.
7. Browse to “ftp://ftp.ensemblgenomes.org/pub/plants/release-50/fasta/triticum_aestivum/cdna/”, right-click on the “Triticum_aestivum.IWGSC.cdna.all.fa.gz” file and select “Save link as...” from the popup window. On the prompt, change to the desired (VIGS) folder and click the “Save” button to save the selected cDNA archive.
8. Unzip the zipped cDNA archive (*.gz file) using a suitable program such as 7-zip, if available on your computer (see Note 6).
9. Locate the unzipped cDNA archive (FASTA file *.fa) and change its extension from “*.fa” to “*.fasta” (see Note 5).
10. Start the siFi software from the start menu of your computer and select the “RNAi design” tab in the prompt window (see Fig. 6 for a screenshot of the menu displayed on start).
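The file preparation in steps 6, 8 and 9 (decompressing the cDNA archive and changing the extension from “*.fa” to “*.fasta”) can also be scripted instead of using a file manager and 7-zip. The following is a minimal sketch using only the Python 3 standard library; the function names are illustrative and are not part of siFi or of the published protocol.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_fasta(gz_path):
    """Decompress '<name>.fa.gz' and save it next to the archive as '<name>.fasta'."""
    gz_path = Path(gz_path)
    if gz_path.suffixes[-2:] != [".fa", ".gz"]:
        raise ValueError(f"expected a .fa.gz file, got {gz_path.name}")
    out_path = gz_path.with_name(gz_path.name[: -len(".fa.gz")] + ".fasta")
    # Stream the decompressed data so the large cDNA archive is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def rename_fa_to_fasta(fa_path):
    """Rename '<name>.fa' to '<name>.fasta' in place and return the new path."""
    fa_path = Path(fa_path)
    if fa_path.suffix != ".fa":
        raise ValueError(f"expected a .fa file, got {fa_path.name}")
    target = fa_path.with_suffix(".fasta")
    fa_path.rename(target)
    return target
```

For example, calling `gunzip_to_fasta` on the downloaded “Triticum_aestivum.IWGSC.cdna.all.fa.gz” in the VIGS folder would write “Triticum_aestivum.IWGSC.cdna.all.fasta” to the same folder, which is the file name selected later when the siFi database is created.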
Fig. 6 Screenshot of menu display on starting of siFi software program
Fig. 7 Screenshot of “Database management” tab options available in siFi software program
11. Click on “Database management” tab on main window of siFi software followed by click on “Create new database” button (see Fig. 7 for screenshot of “Database management” tab options).
12. Click “Choose source file” in the prompt window, browse to your VIGS folder and select the cDNA archive “Triticum_aestivum.IWGSC.cdna.all.fasta” to create a new database in siFi. Click the next button in the bottom right corner after opening the cDNA file.
13. Enter a database name, “VIGS” or any other name of your choice such as “Ref1,” in the next window of siFi. Again, click the next button in the bottom right corner. This creates a local reference sequence database on your computer using Bowtie [64] (see Note 7).
14. When the progress bar is complete (100%) and shows “Database successfully created,” click the “Finish” button on the right side of the open siFi window.
15. Open the “RNAi design” window in siFi by clicking on the “RNAi design” tab.
16. Either paste the cDNA sequence of TraesCS4B02G300100.1 (or your gene of interest) directly into the blank sequence space/window or upload the downloaded and renamed sequence “Triticum_aestivum_TraesCS4B02G300100.fasta” by browsing with the “Open File” button.
17. Choose the reference database, that is, “VIGS,” from the drop-down menu and press the “Start” button on the same “RNAi design” window using the default parameters. This opens the siRNA target window (see Fig. 8 for a screenshot of the “target window”).
18. Select the true targets by clicking the check boxes for hits with a significantly higher number of siRNA hits. For TaPDS in this study, select the first five hits (all chromosome 4 homoeologous copies) and then press “OK”.
19. This opens the “RNAi design plot” window and displays the RNAi design plot of your gene (see Fig. 9 for the RNAi design plot of TaPDS mRNA based on the targets selected in the last step). Select the target region of your cDNA based on the “efficient siRNAs” peaks (shown by red lines; Fig. 9). For TaPDS in this study, we selected a region from 21 to 226 nt (for an expanded view of the RNAi design plot, see Fig. 10).
20. To check the off-targets of siRNAs, go to the main menu of siFi and click “Options” followed by “Switch mode” – “Off-target prediction”. See Fig. 11 for a screenshot of the selections made to switch to “Off-target prediction”.
21. Paste the selected (21 to 226 nt) cDNA sequence of the TaPDS gene transcript (or the selected transcript sequence of your gene of interest) into the blank sequence space/window on the “Off-target prediction” tab.
Fig. 8 Screenshot of “target window” of siFi software program
Fig. 9 RNAi design plot of TaPDS mRNA/cDNA
Fig. 10 Expanded view of RNAi design plot of selected region of TaPDS mRNA/cDNA
Fig. 11 Screenshot of selections made to switch from “RNAi design” mode to “Off-target prediction” mode in main menu of siFi software program
22. As in step 17, choose the reference database, that is, VIGS, from the drop-down menu and press the “Start” button using the default parameters (see Fig. 12 for a screenshot of the options selected for “Off-target prediction”). This opens the off-target plot window.
23. Based on the off-target results (see Fig. 13 for RNAi/VIGS targets and off-targets on the selected region), select the TaPDS transcript region from 115 to 226 nt to design a TraesCS4B02G300100-specific gene silencing construct. The finally selected region (115 to 226 nt) will target both transcripts of TaPDS on chromosome 4B specifically (Fig. 13).
Fig. 12 Screenshot of “Off-target prediction” mode page of siFi software program and options selected for off-target prediction for selected region of TaPDS mRNA/cDNA
3.1.2 Design of Gene Silencing Construct
Gene silencing constructs can be designed using cDNA fragments in (1) sense, (2) antisense, or (3) both orientations (as hairpin RNA, hpRNA). While mainly the antisense orientation has been utilized in most VIGS systems, hpRNA constructs, which fold back as dsRNA after transcription, have shown a strong silencing effect in some VIGS systems [32, 65]. In this protocol, two gene silencing constructs (a hairpin construct for the classical method and an antisense construct for the Agrobacterium based method) will be designed from 115 to 226 nt of TaPDS gene transcript TraesCS4B02G300100.1.
Fig. 13 RNAi design plot of selected region of TaPDS mRNA/cDNA for “off-target prediction”. Four panels (TraesCS4B02G300100.1, TraesCS4B02G300100.2, TraesCS4D02G299000.1, and TraesCS4D02G299000.2) plot siRNA counts per position against RNAi trigger sequence position for all siRNAs and efficient siRNAs. Total/efficient siRNA hits: 186/103 for each TraesCS4B02G300100 transcript and 29/14 for each TraesCS4D02G299000 transcript
The Classical Method
Several modifications of BSMV vectors are available for VIGS. For the classical method, use the BSMV VIGS vectors (pα42, pβ.Δβa, pγ42-MCS, and γ.bPDS4-as) developed by Holzberg et al. [31], which require cloning of the gene silencing construct into the PacI and NotI restriction sites of the γ.bPDS4-as vector. Use pγ42-MCS as a negative control.
1. For the hairpin construct for the classical method, reselect a short (e.g., 50 nt) region (from 151 to 200 nt of TaPDS gene transcript TraesCS4B02G300100.1) from the 5′ UTR (e.g., see TaPDS.1s-151-200) and predict its reverse, complement, and reverse complement strands as shown below.
>TaPDS.1s-151-200
5'-GCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCC-3'
>TaPDS.1s-151-200 reverse
5'-CCTCTACCTGGCTCGCCGCTCCGCCTACCCTTCTCCCTCCTCCCTCAGCG-3'
>TaPDS.1s-151-200 complement
5'-CGCTGAGGGAGGAGGGAGAAGGGTAGGCGGAGCGGCGAGCCAGGTAGAGG-3'
>TaPDS.1s-151-200 reverse complement
5'-GGAGATGGACCGAGCGGCGAGGCGGATGGGAAGAGGGAGGAGGGAGTCGC-3'
2. Add above sequences in the following manner. >TaPDS.1hp-151-200oligo 5'GCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCGGAGATGGACCGAGCGGCGAGGCGGAT GGGAAGAGGGAGGAGGGAGTCGC-3'
3'CGCTGAGGGAGGAGGGAGAAGGGTAGGCGGAGCGGCGAGCCAGGTAGAGGCCTCTACCTGGCTCGCCGCTCCGCCTA CCCTTCTCCCTCCTCCCTCAGCG-5'
3. Add the PacI (shown in underlined black font) and NotI (shown in underlined purple font) restriction enzyme site sequences as shown below. >TaPDS.1hp-151-200oligo 5'TAAGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCGGAGATGGACCGAGCGGCGAGGCGGATGGGAAG AGGGAGGAGGGAGTCGCGC-3'
3'TAATTCGCTGAGGGAGGAGGGAGAAGGGTAGGCGGAGCGGCGAGCCAGGTAGAGGCCTCTACCTGGCTCGCCGCTCCGCCTACCCT TCTCCCTCCTCCCTCAGCGCGCCGG-5'
4. Add a short intron sequence (as shown in lower case black font) between forward and reverse complementary strands to achieve a loop in hairpin (see Note 8): >TaPDS.1hp-151-200oligo 5'TAAGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCgtcaagagagGGAGATGGACCGAGCGGCGAGGCGGATGGGAAGA GGGAGGAGGGAGTCGCGC-3'
3'TAATTCGCTGAGGGAGGAGGGAGAAGGGTAGGCGGAGCGGCGAGCCAGGTAGAGGcagttctctcCCTCTACCTGGCTCGCCGCTCCGCCTACCCTT CTCCCTCCTCCCTCAGCGCGCCGG-5'
5. For efficient cloning, cut the above two sequences into four sequences, arrange them in 5′–3′ orientation as shown below, and send them for synthesis separately.
>TaPDS.1hp-151-200oligo_1
5'-TAAGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCgtcaagagag-3'
>TaPDS.1hp-151-200oligo_2 5'-GGAGATGGACCGAGCGGCGAGGCGGATGGGAAGAGGGAGGAGGGAGTCGCGC-3'
>TaPDS.1hp-151-200oligo_3 5'-GGCCGCGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCctctcttgac3'
>TaPDS.1hp-151-200oligo_4 5'-GGAGATGGACCGAGCGGCGAGGCGGATGGGAAGAGGGAGGAGGGAGTCGCTTAAT-3'
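The strand manipulations in steps 1, 2 and 4 (reverse complement of the selected region, and joining sense strand, loop and reverse complement into a hairpin) are easy to get wrong by hand. The sketch below reproduces them in plain Python as a cross-check; it uses the 50-nt TaPDS region and the 10-nt loop shown above, does not add the PacI/NotI site residues, and its function names are illustrative only.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def subsequence(seq, start, end):
    """Extract positions start..end using 1-based, inclusive coordinates (e.g. 151-200)."""
    return seq[start - 1:end]


def hairpin(sense, loop):
    """Sense arm + loop + reverse complement of the sense arm (top strand, 5'->3')."""
    return sense + loop + reverse_complement(sense)


# Selected region TaPDS.1s-151-200 and loop sequence as given in the steps above.
SENSE = "GCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCC"
LOOP = "gtcaagagag"

print("reverse complement:", reverse_complement(SENSE))
print("hairpin core      :", hairpin(SENSE, LOOP))
print("hairpin length    :", len(hairpin(SENSE, LOOP)), "nt")
```

For a different gene, `subsequence(cdna, 151, 200)` would extract the corresponding region from a cDNA string before the hairpin is assembled; the restriction-site ends still have to be added as described in step 3.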
Agrobacterium Based Method
For Agrobacterium mediated VIGS, we use the three binary BSMV VIGS vectors pCaBS-α, pCaBS-β, and pCa-γbLIC-ccdB developed by Yuan et al. [59] and Huang et al. [62]. Vector pCa-γbLIC-ccdB carries a ccdB gene for efficient selection of recombinant clones [62] and is a modified version of the pCa-γbLIC vector originally developed by Yuan et al. [59]. We utilize the pCa-γbLIC-ccdB vector for ligation-independent cloning (LIC) of a short gene silencing construct selected from 115 to 226 nt of TaPDS gene transcript TraesCS4B02G300100.1.
1. For LIC, one can also use the same hairpin design as used above for the classical method, but this requires modification of the sequence ends. To make the above sequences suitable for LIC, remove the PacI and NotI restriction sites and add the LIC cohesive sequences 5′-AAGGAAG-3′ and 5′-AACCACCACCACCG-3′ to the 5′ ends of the forward and reverse strands, respectively (see Note 9). After making these changes (as shown below), the sequences can be sent for synthesis.
>TaPDS.1hp-151-200oligoLIC_1
5'-AAGGAAGGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCgtcaagagag-3'
>TaPDS.1hp-151-200oligoLIC_2
5'-GGAGATGGACCGAGCGGCGAGGCGGATGGGAAGAGGGAGGAGGGAGTCGC-3'
>TaPDS.1hp-151-200oligoLIC_3
5'-AACCACCACCACCGGCGACTCCCTCCTCCCTCTTCCCATCCGCCTCGCCGCTCGGTCCATCTCCctctcttgac-3'
>TaPDS.1hp-151-200oligoLIC_4
5'-GGAGATGGACCGAGCGGCGAGGCGGATGGGAAGAGGGAGGAGGGAGTCGC-3'
2. Alternatively, one can design PCR primers from the selected region of the gene of interest, based on siFi results, using any standard PCR primer designing tool, and then add the adaptor sequences 5′-AAGGAAGTTTAA-3′ and 5′-AACCACCACCACCGT-3′ to the 5′ ends of the forward and reverse primers, respectively. These primers can be used to amplify the target region from two types of templates: (1) cDNA synthesized from RNA isolated from a specific plant stage/tissue/organ and (2) synthetic double-stranded cDNA of the target region of the gene of interest, such as gBlocks™ Gene Fragments (Integrated DNA Technologies, Inc., Coralville, IA, USA). The added 5′ adaptor sequences permit LIC of the amplified PCR product into the pCa-γbLIC/pCa-γbLIC-ccdB vector following treatment of both insert and vector with ApaI and T4 DNA polymerase. Though this method only allows cloning of the gene silencing construct in sense or antisense orientation, a large fragment (200 to 400 nt) can be cloned easily.
3.2 Cloning of Gene Silencing Constructs
All three BSMV RNAs (α, β, and γ) are required to cause successful infection of wheat plants but only BSMV RNA-γ is modified to carry short (up to 400 nt) gene silencing construct [31] in most of the systems developed so far. Therefore, both methods of cloning (ligation-dependent and -independent) are focused on cloning of gene silencing construct in the cDNA-γ (pγ42-MCS or γ.bPDS4-as in the classical method; pCa-γbLIC-ccdB in Agrobacterium based method) corresponding to RNA-γ but all three BSMV vectors (α, β, and γ) will be first produced in large quantities.
3.2.1 Ligation-Dependent Cloning
To produce plasmids in large quantity, they should be multiplied by growing cultures of transformed (with pα42, pβ.Δβa, pγ42-MCS, and γ.bPDS4-as plasmids) E. coli bacterial cells such as DH5α, DH10B, or JM109 followed by a suitable plasmid isolation method. However, initial step depends on conditions of plasmids/vectors available. If plasmids are available to you in circular DNA form, then start with transformations as given below in step 1, otherwise, if plasmids are available as glycerol stocks of transformed bacterial cells, pick the bacterial growth from glycerol stocks using sterile loop and streak on the appropriate agar plates and follow the instructions from step 4 in this section.
Preparation of Plasmid DNAs
1. Transform DH5α or DH10B competent cells with the pα42, pβ.Δβa, pγ42-MCS, and γ.bPDS4-as plasmids. If required, competent cells can be prepared in the lab following a previously developed protocol [61, 66].
2. Incubate transformed samples at 37 °C for 60 min in a shaking incubator.
3. Plate 100 μL of the transformed cell culture on LB agar plates containing 100 μg/mL ampicillin.
4. Incubate the LB agar plates upside-down overnight at 37 °C.
5. Pick 1–4 colonies from the overnight-grown LB agar plates of each transformation (pα42, pβ.Δβa, pγ42-MCS, and γ.bPDS4-as) and inoculate each separately into 200 mL of LB broth (containing ampicillin) in 500 mL conical flasks.
6. Incubate the media flasks of all transformations overnight at 200 rpm and 37 °C.
7. Isolate plasmid DNAs using the QIAGEN Plasmid Midi Kit following the manufacturer’s instructions (see Note 10).
8. Elute DNAs into a 15 mL round-bottom centrifuge tube.
9. Check the quality and concentration of the isolated plasmid DNAs by running them on a 0.8% agarose gel at 80 V for 2 h; mix 1 μL plasmid with 199 μL of loading buffer, and load 1 μL, 5 μL and 10 μL of the 200 μL diluted plasmid on the gel (see Note 11).
10. Digest high quality γ.bPDS4-as plasmid (100 μL; 1 μg/μL) (see Note 12) first with PacI restriction enzyme in a total reaction volume of 250 μL (H2O: 97.5 μL; plasmid DNA: 100 μL; 10× buffer: 25 μL; BSA: 2.5 μL; and PacI: 25 μL of 10 U/μL; overnight at 37 °C) following the manufacturer’s instructions (see Note 13).
11. Precipitate the linearized γ.bPDS4-as plasmid using the following recipe:
DNA: 250 μL (100 μg)
3 M sodium acetate (pH 5.2): 25 μL (1/10 of the total volume)
100% chilled ethanol: 550 μL
Keep at 4 °C for 10 min.
12. Centrifuge the precipitate at 20,000 × g for 10 min.
13. Carefully decant off the supernatant and keep the pellet.
14. Wash the pellet in 70% ethanol; keep the pellet in 70% ethanol for 30 min to 1 h at 4 °C.
15. Repeat steps 12 and 13 and dry the pellet for 5–10 min at room temperature.
16. Dissolve the pellet in 240 μL (volume required for second digestion) nuclease-free water.
17. Digest the linearized γ.bPDS4-as plasmid a second time with NotI restriction enzyme in a total reaction volume of 300 μL (linearized DNA: 240 μL, 10× buffer O with BSA: 30 μL, and NotI: 30 μL of 10 U/μL; overnight at 37 °C) following the manufacturer’s instructions (see Note 13).
18. Check and separate the digested γ plasmid using 0.8% agarose gel electrophoresis. Add 60 μL of 6× loading dye to the digested DNA and incubate the mixture at 60 °C for 5 min to reduce the chances of religation of the original barley PDS insert into the vector. Transfer the heated samples to ice. To load the total volume of 360 μL (300 μL digested DNA and 60 μL loading dye) in the gel, use a wide comb or join 8–10 teeth of the comb with cello tape and cast the gel. Run the gel overnight at 50 V (25 mA) or at 70 V if power supply shows 2 for upregulated genes and fold change Open (Fig. 4a).
Fig. 3 Snapshots of steps involved in downloading the FastQC tool
Jyotika Bhati et al.
Fig. 4 Results of FastQC Graphical User Interface
Once the file is selected and uploaded, the quality results are displayed almost immediately (Fig. 4b–f). Figure 4b shows a detailed view of the range of quality values across all bases at each position in the raw fastq file. The per base sequence content is summarized in Fig. 4c, which shows the proportion of each nucleotide at each position in the raw read file. The GC content across the length of each sequence in the raw file is summarized in Fig. 4d. The blue bell-shaped curve is the theoretical/expected GC distribution, whereas the red curve is the observed GC content for the input file. The degree of duplication in the input file is shown in Fig. 4e. Further, the sequence length distribution is summarized in Fig. 4f, which shows the distribution of fragment sizes in the input file. For further reference, the FastQC result can be saved via the File -> Save report menu. On the command line, first load the fastqc module with the command “module load fastqc-0.11.8”, followed by the main command for the quality check, “fastqc sra_data.fastq.gz”.
The command-line results of FastQC can be saved as .html and .zip files.
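When many read files have to be checked, the command-line call described above can be wrapped in a small script. The following is a minimal sketch in Python, assuming that the `fastqc` executable is already on the PATH (for example after the `module load` command above); the output directory name `fastqc_reports` and the function names are hypothetical choices, not part of the protocol. `--outdir` is the FastQC option that sets where reports are written.

```python
import shutil
import subprocess
from pathlib import Path


def fastqc_command(fastq_files, outdir):
    """Build the FastQC command line for one or more FASTQ files."""
    return ["fastqc", "--outdir", str(outdir), *[str(f) for f in fastq_files]]


def run_fastqc(fastq_files, outdir="fastqc_reports"):
    """Run FastQC; the .html and .zip reports are written into outdir."""
    if shutil.which("fastqc") is None:
        raise RuntimeError("fastqc not found on PATH; load the FastQC module first")
    # FastQC expects the output directory to exist before it starts.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(fastqc_command(fastq_files, outdir), check=True)


# Show the command that would be executed for the file used in this chapter.
print(" ".join(fastqc_command(["sra_data.fastq.gz"], "fastqc_reports")))
```

Calling `run_fastqc(["sra_data.fastq.gz"])` would then execute the printed command on a system where FastQC is installed.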
3.3 Trimming of Raw Reads
To start with, we load the Trimmomatic module with the command “module load trimmomatic-0.39” and then run the trimming command. Trimmomatic works with both compressed and uncompressed FASTQ files [23].
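A paired-end Trimmomatic call using the options discussed in this section might look like the sketch below. The threshold values are the ones used in the Trimmomatic manual's own paired-end example, not values taken from this protocol; the jar location, the adapter file, the input file names, and the helper names (`run`, `trim_pe`) are assumptions for illustration.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Hypothetical jar location; adjust to the local installation.
TRIMMOMATIC_JAR="${TRIMMOMATIC_JAR:-trimmomatic-0.39.jar}"

trim_pe() {
  # $1 = sample prefix; expects <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz (assumed names)
  p=$1
  run java -jar "$TRIMMOMATIC_JAR" PE -phred33 -trimlog "${p}_trim.log" \
    "${p}_R1.fastq.gz" "${p}_R2.fastq.gz" \
    "${p}_R1_paired.fastq" "${p}_R1_unpaired.fastq" \
    "${p}_R2_paired.fastq" "${p}_R2_unpaired.fastq" \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
    SLIDINGWINDOW:4:15 MINLEN:36
}

trim_pe sra_data
```

The order of the four output files (paired and unpaired for R1, then paired and unpaired for R2) is fixed by Trimmomatic's paired-end mode, so it is worth keeping the naming explicit as above.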
The parameters above may be interpreted as follows and can be customized to the needs of the experiment (see Note 3).
(a) READ_TYPE: PE (paired-end) or SE (single-end), depending on the raw read files.
(b) TRIMLOG: User-assigned filename in which the log of all read trimmings is saved.
(c) PHRED33: Specifies that the quality scores are Phred-33 encoded (TOPHRED33 converts scores to Phred-33).
(d) PHRED64: Specifies that the quality scores are Phred-64 encoded (TOPHRED64 converts scores to Phred-64).
(e) ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
(f) LEADING: Cut bases off the start of a read if they are below a threshold quality.
(g) TRAILING: Cut bases off the end of a read if they are below a threshold quality.
(h) SLIDINGWINDOW: Sliding window trimming. Scanning starts at the 5′ end and the read is clipped once the average quality within the window falls below a threshold.
(i) MINLEN: Drop the read if it is shorter than a specified length.
For paired-end input, Trimmomatic writes a paired.fastq and an unpaired.fastq file for each read direction. In our analysis, we obtained four files: sra_data_R1_paired.fastq, sra_data_R1_unpaired.fastq, sra_data_R2_paired.fastq, and sra_data_R2_unpaired.fastq.
Fig. 5 Step-wise trimming procedure using Trimmomatic
An overview of the trimming procedure is shown in Fig. 5. Once the trimming results are available, the paired.fastq files can be checked again for quality using FastQC (see Note 4).
3.4 Indexing of the Reference Genome
To perform the reference-based assembly, we need to download the genome file for Triticum aestivum (or the genome of the most closely related available species) from NCBI. To start, select the “Genome” option from the drop-down list on the NCBI homepage and search for the name of the reference species (here, Triticum aestivum) (Fig. 6a). A detailed summary of the genome is displayed, and the genome fasta file can be downloaded by clicking on “genome” as shown in Fig. 6b. A pop-up window then opens to browse for the genome file (Fig. 6c).
Fig. 6 Steps involved in downloading the reference Triticum aestivum genome
The above step can also be performed on the command line by using the “wget” command with the complete URL of the reference genome, as displayed below.
After decompressing the reference genome file using “gunzip,” we renamed it Ta.fa. To index the reference genome, the bowtie2-build command from Bowtie2 can be used. If a large genome such as the wheat genome is to be indexed, the --large-index parameter is added to the command; for smaller genomes, this option can be omitted.
The output is a set of six files with the suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2 (the extension is .bt2l when a large index is built). Together, these files form the index and are required for aligning reads to the reference.
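The download, decompression, and indexing steps described above can be sketched as follows. The URL is a placeholder only (the actual link has to be copied from the NCBI Genome page), and here the file is saved directly under the name Ta.fa.gz rather than renamed afterwards; the `run` helper and the function name `index_reference` are illustrative.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Placeholder URL (assumption): replace with the link shown on the NCBI Genome page.
GENOME_URL="${GENOME_URL:-https://example.org/path/to/genome.fna.gz}"

index_reference() {
  run wget -O Ta.fa.gz "$GENOME_URL"         # download under a fixed local name
  run gunzip Ta.fa.gz                        # leaves Ta.fa
  run bowtie2-build --large-index Ta.fa Ta   # "Ta" is the index basename
}

index_reference
```

The index basename chosen here (Ta) is what the aligner is later given instead of the fasta file itself.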
3.5 Mapping of Raw Reads to Reference Genome
TopHat [6] is used for mapping the raw reads to the wheat reference genome. Here, the output directories “tophat_1” and “tophat_2” are for the salinity stress and control conditions, respectively.
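A TopHat call using the two options discussed in this section could look like the sketch below. The index basename Ta follows the previous step; the read file names and the inner distance of 50 (TopHat's documented default) are assumptions, as are the helper names `run` and `map_reads`.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Expected mean inner distance between mates; 50 is an assumed value.
INNER_DIST="${INNER_DIST:-50}"

map_reads() {
  # $1 = output directory, $2 = forward reads, $3 = reverse reads
  run tophat -o "$1" -r "$INNER_DIST" Ta "$2" "$3"
}

# Hypothetical file names for the two conditions.
map_reads tophat_1 stress_R1_paired.fastq stress_R2_paired.fastq
map_reads tophat_2 control_R1_paired.fastq control_R2_paired.fastq
```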
The parameters above may be interpreted as follows.
(a) -o: Sets the name of the directory in which the TopHat output is saved. The default path (if not assigned) is “./tophat_out”.
(b) -r: The expected (mean) inner distance between the mate pairs.
TopHat produces a number of result files: (a) accepted_hits.bam, (b) deletions.bed, (c) junctions.bed, (d) prep_reads.info, (e) align_summary.txt, (f) insertions.bed, and (g) unmapped.bam. The file accepted_hits.bam is the read alignment file used for further analysis.
3.6 Identification of Differentially Expressed Genes
To identify the DEGs, we first need to download the genome annotation file for Triticum aestivum (in either .gff or .gtf format) from NCBI as shown in Fig. 7. After decompressing the reference annotation file using “gunzip,” we renamed it Ta.gff.
Fig. 7 Stepwise flowchart to download the annotation file for Triticum aestivum genome
We now proceed with the Cufflinks program, which assembles transcriptomes from RNA-Seq data and quantifies their expression. Cufflinks [10] uses the alignment file (accepted_hits.bam) generated by TopHat along with the annotation file. Here, the output directories “cufflink_1” and “cufflink_2” are for salinity stress and control, respectively.
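A Cufflinks call consistent with this description might look like the sketch below, run once per condition. The thread count of 8 and the helper names (`run`, `assemble`) are assumptions; the directory and annotation file names follow the text.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

THREADS="${THREADS:-8}"   # assumed value; Cufflinks defaults to 1

assemble() {
  # $1 = TopHat output directory, $2 = Cufflinks output directory
  run cufflinks -o "$2" -p "$THREADS" -g Ta.gff "$1/accepted_hits.bam"
}

assemble tophat_1 cufflink_1   # salinity stress
assemble tophat_2 cufflink_2   # control
```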
The parameters above may be interpreted as follows.
(a) -o: Output directory in which Cufflinks writes the results.
(b) -p: Number of threads to use. Default value is 1.
(c) -g: Uses the supplied reference annotation file (gff3/gtf format) to guide the assembly.
Cufflinks generates four output files: (a) genes.fpkm_tracking, (b) isoforms.fpkm_tracking, (c) transcripts.gtf, and (d) skipped.gtf. The “genes.fpkm_tracking” file contains the estimated gene-level expression values, and the “isoforms.fpkm_tracking” file contains the estimated isoform-level expression values. Both files are in FPKM tracking format. Skipped fragments are recorded in the “skipped.gtf” file. The GTF file “transcripts.gtf” contains the isoforms assembled by Cufflinks, with one GTF record per row. Each record is either a transcript or an exon within a transcript.
When there are multiple RNA-Seq read datasets, the transcripts assembled from each of them must be merged into a master transcriptome using cuffmerge. This is the major step needed for differential expression analysis of the newly assembled transcripts.
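The merge step can be sketched as below: a text file listing the per-sample transcripts.gtf files is written first and then passed to cuffmerge. The thread count and the `run` helper are assumptions; the file and directory names follow the text, and the paths are built from the current directory.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the cuffmerge command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

THREADS="${THREADS:-8}"   # assumed value

# One transcripts.gtf per line, with full paths (this file is really written).
printf '%s\n' \
  "$PWD/cufflink_1/transcripts.gtf" \
  "$PWD/cufflink_2/transcripts.gtf" > assembly.txt

# By default cuffmerge writes merged_asm/merged.gtf.
run cuffmerge -g Ta.gff -s Ta.fa -p "$THREADS" assembly.txt
```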
The parameters above may be interpreted as follows.
(a) -g: Reference annotation file of the reference species.
(b) -s: Points to the genomic DNA sequences of the reference.
(c) -p: Number of threads to use. Default value is 1.
(d) The assembly.txt file lists the transcripts.gtf files of all samples (generated by Cufflinks) with their complete paths.
Cuffmerge produces a GTF file, “merged.gtf”, that merges all the input assemblies. To identify DEGs between two conditions (control and treated), the cuffdiff program is used. It identifies genes that are upregulated or downregulated between two or more conditions, along with genes undergoing isoform-level regulation.
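A cuffdiff call with the options discussed in this section might look like the following sketch. The output directory name diff_out, the sample labels, and the thread count are assumptions; the merged annotation and the two alignment files follow from the earlier steps. Note that in cuffdiff the merged GTF is a positional argument that follows the options.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

THREADS="${THREADS:-8}"   # assumed value

diff_expr() {
  # $1 = output directory (hypothetical name), $2 = comma-separated sample labels
  run cuffdiff -o "$1" -b Ta.fa -p "$THREADS" -L "$2" -u \
    merged_asm/merged.gtf \
    tophat_1/accepted_hits.bam tophat_2/accepted_hits.bam
}

diff_expr diff_out salinity,control
```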
The parameters above may be interpreted as follows.
(a) -o: Output directory for the cuffdiff results.
(b) -b: Enables fragment bias correction using the reference genome file in fasta format.
(c) -p: Number of threads to use. Default value is 1.
(d) -L: Label for each sample, which will be included in the output files.
(e) -u: Enables multi-read correction; the merged assembly file (merged.gtf) is passed as a positional argument.
Cuffdiff calculates the FPKM for each transcript and produces four sets of output files: (a) FPKM tracking files, (b) count tracking files, (c) read group tracking files, and (d) differential expression tests. To further identify the DEGs, the data in gene_exp.diff are exported to an Excel sheet and two filters are applied: (a) a p-value cutoff and (b) a fold-change cutoff (fold change > 2 for upregulated genes and fold change < 2 for downregulated genes); both may be user defined.
3.7 Annotation of Identified DEGs
To proceed further, the loci of upregulated and downregulated genes were extracted from the cuffdiff results, and the corresponding fasta sequences were extracted using bedtools.
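The extraction can be sketched on the command line as below, as an alternative to filtering in a spreadsheet. The thresholds (p-value below 0.05, absolute log2 fold change above 2), the output directory diff_out, and the helper names are assumptions and should be replaced by the values chosen for the experiment. The awk filter relies on the standard column layout of gene_exp.diff (column 4 locus, column 10 log2 fold change, column 12 p-value); rows with infinite fold changes are not handled specially, and sequence names are assumed to contain no colon or hyphen.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the bedtools commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# User-chosen thresholds (assumptions, not values from the protocol).
P_MAX="${P_MAX:-0.05}"
LFC_MIN="${LFC_MIN:-2}"

degs_to_bed() {
  # $1 = gene_exp.diff, $2 = "up" or "down"; prints chrom, start, end, gene_id
  awk -F '\t' -v p="$P_MAX" -v lfc="$LFC_MIN" -v dir="$2" '
    NR == 1 { next }                               # skip the header line
    $12 + 0 >= p { next }                          # p-value not below the cutoff
    (dir == "up"   && $10 + 0 <=  lfc) { next }    # keep only log2FC >  lfc
    (dir == "down" && $10 + 0 >= -lfc) { next }    # keep only log2FC < -lfc
    {
      n = split($4, a, /[:-]/)                     # chr:start-end
      # Coordinates are copied as-is; check whether the start needs shifting for BED.
      if (n == 3) print a[1] "\t" a[2] "\t" a[3] "\t" $2
    }' "$1"
}

if [ -f diff_out/gene_exp.diff ]; then
  degs_to_bed diff_out/gene_exp.diff up   > up.bed
  degs_to_bed diff_out/gene_exp.diff down > down.bed
fi

run bedtools getfasta -fi Ta.fa -bed up.bed   -fo up.fasta
run bedtools getfasta -fi Ta.fa -bed down.bed -fo down.fasta
```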
(a) -fi: Sets the reference genome in which to search for the fasta sequences of the DEGs.
(b) -bed: File containing the sequence ID (matching the one in the reference genome file) and the start and end positions of the DEGs.
(c) -fo: Output fasta file.
The fasta files for upregulated and downregulated DEGs are aligned to the NCBI ‘nr’ database using the BLASTX module of the BLAST program. The homology search can be performed either with online BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) or with the standalone version of BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download), which is mainly used for large datasets (see Note 5).
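With standalone BLAST+, the search could be run as in the sketch below. It assumes a locally formatted nr protein database reachable under the name nr; the E-value cutoff, the tabular output format, the single best hit per query, the thread count, and the file names are illustrative choices rather than values from this protocol.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() { printf '%s\n' "$*"; [ "$DRY_RUN" = "1" ] || "$@"; }

THREADS="${THREADS:-8}"   # assumed value

annotate() {
  # $1 = fasta file of DEG sequences, $2 = tabular output file
  run blastx -query "$1" -db nr -out "$2" \
    -outfmt 6 -evalue 1e-5 -max_target_seqs 1 -num_threads "$THREADS"
}

annotate up.fasta up_blastx.tsv
annotate down.fasta down_blastx.tsv
```

In tabular format 6, the second column of each output line holds the subject accession, which is the identifier carried forward to the GO search.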
Once the accession numbers are obtained from the BLAST output, they are searched for Gene Ontology (GO) IDs using the Gene Ontology Resource [24] available at http://geneontology.org/. The identified GO IDs are then submitted to AgriGO [25] to produce additional graphs. In AgriGO, as a first step, we selected the Singular Enrichment Analysis (SEA) option. In step 2, one can either choose the reference species or use a customized annotation if the reference species is not available. Next, we selected the reference annotation option from the drop-down list. Finally, the job can be submitted with the other parameters left at their defaults (Fig. 8). The AgriGO annotation results can be viewed graphically as shown in Fig. 9. Detailed results can also be downloaded as a table or spreadsheet (Fig. 10).
4 Notes
1. All the tools used in the pipeline discussed here are open source.
2. The tools in the pipeline should be selected carefully, based on the requirements of the experiment and the analysis to be performed.
3. The parameters should be customized carefully to the needs of the experiment.
4. It is advisable to check data quality twice, that is, before and after trimming. A suitable p-value and fold-change threshold should be chosen based on the requirements of the experiment.
5. For large datasets, the BLAST homology search must be performed with standalone NCBI BLAST.
Fig. 8 Data submission steps using AgriGO
Fig. 9 Flowchart of AgriGO annotation results
Fig. 10 Representation of AgriGO annotation results as bar chart
References
1. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
2. Ozsolak F, Milos PM (2010) RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12:87
3. Marguerat S, Bahler J (2010) RNA-seq: from technology to biology. Cell Mol Life Sci 67:569–579
4. Baruzzo G, Hayer KE, Kim EJ et al (2017) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods 14:135–139
5. Florea LD, Salzberg SL (2013) Genome-guided transcriptome assembly in the age of next-generation sequencing. IEEE/ACM Trans Comput Biol Bioinform 10(5):1234–1240
6. Kim D, Pertea G, Trapnell C et al (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36
7. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
8. Kim D, Paggi JM, Park C et al (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37(8):907–915
9. Langmead B, Salzberg S (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359
10. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
11. Pertea M, Pertea GM, Antonescu CM et al (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33(3):290–295
12. Shao M, Kingsford C (2017) Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol 35(12):1167–1169
13. Maretty L, Sibbesen JA, Krogh A (2014) Bayesian transcriptome assembly. Genome Biol 15(10):501
14. Behera S, Voshall A, Moriyama EN (2021) Plant transcriptome assembly: review and benchmarking. In: Helder IN (ed) Bioinformatics [internet]. Exon Publications, Brisbane
15. Zhou Y, Li XH, Guo QH et al (2021) Salt responsive alternative splicing of a RING finger E3 ligase modulates the salt stress tolerance by fine-tuning the balance of COP9 signalosome subunit 5A. PLoS Genet 17(11):e1009898
16. Runxuan Z, Cristiane PG, Calixto Y et al (2017) A high quality Arabidopsis transcriptome for accurate transcript-level analysis of alternative splicing. Nucleic Acids Res 45(9):5061–5073
17. Ruben G, Anamarija B, Francisco JE et al (2021) Plant virus evolution under strong drought conditions results in a transition from parasitism to mutualism. Proc Natl Acad Sci 118(6):e2020990118
18. Vitoriano CB, Calixto CPG (2021) Reading between the lines: RNA-seq data mining reveals the alternative message of the Rice leaf transcriptome in response to heat stress. Plants (Basel) 10:1647
19. Barakate A, Orr J, Schreiber M et al (2021) Barley anther and Meiocyte transcriptome dynamics in meiotic prophase I. Front Plant Sci 11:619404
20. Rapazote-Flores P, Bayer M, Milne L et al (2019) BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq. BMC Genomics 20:968
21. Li Y, Mi X, Zhao S et al (2020) Comprehensive profiling of alternative splicing landscape during cold acclimation in tea plant. BMC Genomics 21:65
22. Zhao Y, Cheng X, Liu X et al (2018) The wheat MYB transcription factor TaMYB is involved in drought stress responses in Arabidopsis. Front Plant Sci 9:1426
23. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120
24. The Gene Ontology Consortium (2021) The gene ontology resource: enriching a GOld mine. Nucleic Acids Res 49(D1):D325–D334
25. Zhou D, Xin Z, Yi L et al (2010) agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res 38(S2):W64–W70
Chapter 8
Transcriptome Data Analysis Using a De Novo Assembly Approach
Himanshu Avashthi, Jyotika Bhati, Shikha Mittal, Ambuj Srivastava, Neeraj Budhlakoti, Anuj Kumar, Pramod Wasudeo Ramteke, Dwijesh Chandra Mishra, and Anil Kumar
Abstract
Characterization and profiling of gene expression data, more formally called the transcriptome, are crucial steps in revealing the involvement of RNA in a variety of biological processes. RNA characteristics have been investigated in numerous studies employing whole transcriptome sequencing (RNA-seq) data in all the major crops. To date, a number of crop genomes have been sequenced, which allows researchers to understand the underlying biological mechanisms and to discover and characterise the candidate genes responsible. Gene expression profiling by RNA-seq is a useful tool for identifying and understanding biological processes and mechanisms, such as the coding, decoding, regulation, and expression of genes. RNA-seq analysis may be divided into two broad categories, reference based and de novo, depending on the availability of a reference genome for the species or crop under consideration. In this chapter, we introduce basic RNA-seq analysis approaches, pipelines, and software, focusing particularly on de novo transcriptome assembly and identification of differentially expressed genes (DEGs)/transcripts, using finger millet as an example.
Key words Bioinformatics approach, De novo assembly, RNA-Seq, Next generation sequencing (NGS), Differential gene expression (DGE)
1 Introduction
The transcriptome is a snapshot of the complete set of RNA transcripts in a given cell at a specific developmental stage or physiological condition [1]. The transcriptome must be understood in order to interpret the functional parts of the genome and to comprehend the underlying mechanisms. Microarray technologies have been used successfully in the past to identify differentially expressed genes between developmental stages or between healthy and diseased groups. Advances in NGS technology, especially RNA-seq, have rapidly replaced microarray technology because
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
of its better resolution and higher reproducibility. RNA-seq data can be used to identify genes or transcripts specific to a condition, which may be further utilized to develop stress-resistant millet crops. There are broadly two methods for studying transcripts specific to a condition: (1) the candidate gene approach and (2) global transcriptome profiling through RNA-seq data. The candidate gene approach is commonly used for selected genes and generally relies on two techniques: semiquantitative (RT-PCR) and quantitative (qRT-PCR) reverse transcription polymerase chain reaction. Transcriptome profiling, on the other hand, is used to identify DEGs at the whole transcriptome level [2, 3]. Each approach is discussed in turn below.
1.1 Candidate Gene Approach
The plant adaptation system is triggered by the activation of various molecules involved in signal transduction. The initial step in any stress response is stress perception and subsequent molecular signalling, which involves a number of genes whose expression influences the stress response. For example, in response to abiotic stresses, abscisic acid plays a vital role as a signalling molecule [4]. In this approach, we mainly identify stress-responsive (abiotic and biotic) genes, plant protective or defence-related genes, and genes responsible for various diseases, growth and development, or a particular experimental condition.
1.2 Global Transcriptome Profiling
The advent of massively parallel next generation sequencing technology, the decreasing cost of sequencing, and the availability of huge genomic and transcriptomic datasets have enabled researchers to use RNA-seq to study plant responses to specific treatment conditions. In terms of reliability and precision, RNA sequencing (RNA-seq) has become a standard approach for global transcriptome profiling. It provides better results than microarrays and other expression methods [5]. The use of bioinformatics tools to analyze RNA-seq data yields a wealth of information that is critical for understanding plant responses to stress. The majority of transcriptome profiling studies in crops have been conducted to better understand the mechanisms underlying gene expression changes caused by abiotic and biotic stresses and by numerous developmental processes during various phases of the plant’s life cycle. RNA-seq based analyses range from simple experimental designs (a single genotype with control and stress conditions) to complicated experimental designs (multiple experimental factors). Transcriptome analysis and identification of DEGs includes several steps such as generation of reads or downloading of raw data from public
Fig. 1 Steps involved in de novo assembly and differential gene expression analysis
databases, quality assessment and preprocessing, assembly of reads, abundance estimation, expression quantification, normalization, and identification of differentially expressed genes [6–8]. At each step, several alternatives (online or offline) are available, and numerous pipelines have been built using combinations of these alternatives. Although a few studies have compared the various methods, there is no standard workflow of RNA-seq data analysis for a specific genotype or experimental condition [9]. In this chapter, we describe the basic protocol used in de novo analysis of RNA-seq data (for paired-end reads) using finger millet as an example [10]. The steps of NGS-based de novo assembly and differential gene expression analysis are illustrated in Fig. 1.
2 Materials
2.1 Hardware Requirements
Hardware requirements for RNA-seq analysis may vary depending on the size of the data to be processed and the memory footprints of the software. Online services, such as Galaxy (http://usegalaxy.org), can also be used to perform transcriptome analysis, but to deal with large amounts of data on a regular basis, usually a high-performance computing facility with 64-bit architecture and
Linux-based operating systems are needed, with high-speed Internet connectivity [11]. In this chapter, to assemble the transcriptome and identify differentially expressed genes in two contrasting genotypes of finger millet (low Ca2+ (GPHCPB-1) and high Ca2+ (GPHCPB-45)), we used the supercomputing facility ASHOKA (Advanced Supercomputing Hub for OMICS Knowledge in Agriculture) available at ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi.
2.2 Data Analysis Tools
1. SRA toolkit: SRA toolkit (https://github.com/ncbi/sra-tools) is free software available from the NCBI website. It converts sequence read archive (SRA) data into fastq format using the “fastq-dump” command. It can be installed on Windows, Linux, and Mac operating systems [12].
2. FastQC: FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a quick and easy way to assess the quality of raw sequence data from high-throughput sequencing platforms. It also flags potential problems in the data that you should be aware of before starting further analysis. It takes a raw sequence file (fastq, bam, or sam) as input, calculates the sequence composition, sequencing quality, and statistical parameters, and generates a graphical report as an html file. It is available for Windows, Linux, and Mac operating systems [13].
3. Trimmomatic: Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) is a tool to trim and crop Illumina reads. It removes adapter sequences, contamination, and low quality reads and provides clean reads. It can be used for both paired-end and single-end reads. It is a fast, multithreaded command line tool [14].
4. Trinity: Trinity (https://github.com/trinityrnaseq/trinityrnaseq) is a de novo assembly tool that builds a transcriptome assembly from RNA-seq data without a genome. A genome-guided version of Trinity is also available. Trinity also contains a set of perl scripts for generating statistics to check assembly quality and for wrapping external tools for downstream analysis [15].
5. RSEM: RSEM (https://github.com/deweylab/RSEM) is a user-friendly tool for calculating gene and isoform abundances from paired-end or single-end RNA-Seq data [16].
6. edgeR: edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for differential expression analysis. It is concerned with assessing relative changes in expression levels between conditions rather than absolute expression levels [17].
3 Methods
3.1 Experimental Design
Before deploying the RNA-seq approach for transcriptome analysis, it is necessary to carefully plan the experimental design, with consideration of sample preparation methods, NGS platform selection, various sequencing parameters such as sequencing depth and coverage, numbers of technical and biological replicates, sequencing length, paired-end or single-end sequencing, and other data analysis methods according to the goals of the study [6–9].
3.2 Generation of Sequencing Reads
1. Extract total RNA from the different samples and ensure that the isolated RNA is of good quality and integrity.
2. Construct cDNA libraries according to the manufacturer’s instructions using a commercially available library preparation kit. The quality of RNA and cDNA can be checked using agarose gel electrophoresis and spectrophotometric methods [18].
3. After validating the insert size and quantifying the library, load the sample onto the flow cell for cluster generation using commercially available cluster generation kits and specific sequencing primers.
3.3 Downloading of Raw Transcriptome Data
The raw RNA-sequencing data can be obtained from sequencing data repositories such as the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) of NCBI and the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena) of EBI in SRA and/or FASTQ file format, using the project ID of the RNA-seq experiment and the accession numbers of the SRA files. For our example, data were downloaded from the SRA database of NCBI. Two contrasting genotypes of finger millet (differing in calcium content), SRR1151079 and SRR1151080, were considered.
wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/srapub-run-1/SRR1151079/SRR1151079.1
wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/srapub-run-1/SRR1151080/SRR1151080.1
3.4 Dumping of SRA Data into Fastq File Format
After downloading, the SRA files need to be dumped into fastq files using the SRA Toolkit [12]. Both SRA files were converted to fastq format using the “fastq-dump” command of the SRA toolkit.
/transcriptome_data/sratoolkit.2.10.8-ubuntu64/bin/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR1151079
/transcriptome_data/sratoolkit.2.10.8-ubuntu64/bin/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR1151080
Here, we used --split-files because the data (SRR1151079 and SRR1151080) consist of paired-end reads; this option splits each SRA file into two fastq files (forward and reverse), that is, SRR1151079_1.fastq, SRR1151079_2.fastq, SRR1151080_1.fastq, and SRR1151080_2.fastq.
3.5 Quality Assessment and Preprocessing of Raw Reads
The first step of RNA-seq analysis is to examine the quality of the raw sequence data produced by the sequencer. A number of quality control measures provide information about data quality and identify potential technical issues that may arise during sample preparation and sequencing on the next-generation sequencing (NGS) machine. FastQC is the most commonly used program for this purpose; it requires the Java Runtime Environment (JRE) and can be used both on the command line and through a Graphical User Interface (GUI). It generates an HTML-based graphical report that describes the quality of the raw reads and supports quick assessment. The report includes basic statistics, per base sequence quality, per tile sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented sequences, and adapter content. These 11 quality modules are shown in Fig. 2. In the figure, a green tick indicates a normal result, an orange triangle a slightly abnormal result, and a red cross a very unusual result. For more details, complete documentation on FastQC reports has been provided by the developers [14]. Based on the interpretation of the quality report, preprocessing is applied to the raw reads, which generally comprises removal of adapter sequences and low quality reads, filtering of contaminants, and trimming of sequences with low quality scores. For these steps, NGSQC-Toolkit [19] and HTQC-Toolkit [20] (for quality control) and Trimmomatic [14], Trim Galore [21], Cutadapt [22], and FASTX-Toolkit [23] (for preprocessing) are the most widely used publicly available tools. Of these, Trimmomatic, a command-line application written in Java, has been the most widely used for Illumina single-end and paired-end read data.
In our example dataset, we used FastQC and Trimmomatic for quality control and preprocessing, respectively.
1. FastQC can be used to assess the quality of the sequenced reads; it generates a variety of graphical reports based on read statistics and quality control measures. First, check the quality of the reads before the trimming step. If the data are of poor quality or contain excess adapter sequences, apply trimming. After trimming, check the quality of the improved reads again.
Fig. 2 Graphical representation of FastQC generated report of each parameter
/transcriptome_data/FastQC-0.11.8/fastqc SRR1151079_1.fastq SRR1151079_2.fastq
/transcriptome_data/FastQC-0.11.8/fastqc SRR1151080_1.fastq SRR1151080_2.fastq
Here, to assess quality, each command takes the forward (_1.fastq) and reverse (_2.fastq) files of one accession generated by the SRA Toolkit as input. For each input file, FastQC generates one html file and one zip file; for SRR1151080, for example, these are SRR1151080_1_fastqc.html and SRR1151080_2_fastqc.html, and SRR1151080_1_fastqc.zip and SRR1151080_2_fastqc.zip. The HTML file contains a graphical report of the raw reads covering the 11 quality modules mentioned earlier. Trimmomatic can be used for both paired-end and single-end reads (see Note 1). When using Trimmomatic, apply the following preprocessing methods according to the requirements derived from the FastQC report. Good quality sequence reads downloaded from repositories may not require any editing, and editing alone is not enough to make low-quality reads fit for further investigation.
2. Remove adapters and other Illumina-specific sequences with the ILLUMINACLIP step.
3. AVGQUAL can be used to filter low-quality reads whose average quality is below a certain threshold.
4. Quality trimming can be performed using the CROP, HEADCROP, LEADING, TRAILING, SLIDINGWINDOW, MINLEN, and TOPHRED33 parameters. CROP: cuts the read to a specified length; HEADCROP: cuts the specified number of bases from the beginning of the read; LEADING/TRAILING: cuts bases off the start/end of a read if they are below a threshold quality; SLIDINGWINDOW: carries out sliding window trimming; MINLEN: drops the read if it is shorter than a certain length; TOPHRED33: converts quality scores to Phred-33 (see Note 2).
5. With MINLEN, filter the retained reads based on read length and drop the reads that are shorter than a set threshold length (usually HC_LC_R1.fastq
cat HC_R2.fastq LC_R2.fastq >HC_LC_R2.fastq
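The concatenation of the two genotypes' reads before assembly can be sketched as below. The text preserves only the command for the reverse reads; the sketch assumes the forward reads (HC_R1.fastq and LC_R1.fastq) were combined in the same way. The function names `concat_reads` and `count_reads` are hypothetical helpers, and the read count is a simple sanity check that assumes four lines per fastq record.

```shell
#!/bin/sh
concat_reads() {
  # $1 = mate number (1 or 2); joins the HC and LC reads of that mate into one file
  cat "HC_R$1.fastq" "LC_R$1.fastq" > "HC_LC_R$1.fastq"
}

count_reads() {
  # Number of records in an uncompressed fastq file (4 lines per record)
  awk 'END { print NR / 4 }' "$1"
}

# Only run when the per-genotype files are present in the working directory.
if [ -f HC_R1.fastq ] && [ -f LC_R1.fastq ]; then concat_reads 1; fi
if [ -f HC_R2.fastq ] && [ -f LC_R2.fastq ]; then concat_reads 2; fi
```

Both mates must be concatenated in the same sample order (HC first, then LC) so that read pairs stay aligned between the two combined files; comparing `count_reads HC_LC_R1.fastq` with `count_reads HC_LC_R2.fastq` is a quick check.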
1. After this, assemble the concatenated forward (HC_LC_R1.fastq) and reverse (HC_LC_R2.fastq) reads into transcripts by running the command below with the de novo assembly tool Trinity. Set the parameters, directory path, names of the input files (HC_LC_R1.fastq and HC_LC_R2.fastq), RAM (50G), and number of processors (CPU 6) to use.
Trinity --seqType fq --left HC_LC_R1.fastq --right HC_LC_R2.fastq --max_memory 50G --CPU 6 --no_bowtie
2. When the job is completed, it generates a directory named trinity_out_dir containing a single assembled transcript file named Trinity.fasta.
3. To check whether the assembly completed successfully, run the “TrinityStats.pl” script to produce an assembly statistics report with the length and number of transcripts generated, as well as the contig N50 value.
perl /trinity/trinityrnaseq-Trinity-v2.5.1/util/TrinityStats.pl /transcriptome_data/Assembly/Trinity.fasta >trinity_stats.txt
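To make the N50 statistic concrete, the sketch below computes it directly from a fasta file: contig lengths are sorted from longest to shortest and summed until at least half of the total assembled bases is reached, and the length of the contig at that point is the N50. This is an illustration only, with a hypothetical function name; TrinityStats.pl remains the tool to use for the actual report, and it additionally reports statistics that this sketch does not.

```shell
#!/bin/sh
fasta_n50() {
  # $1 = fasta file; prints the N50 over all sequences in the file
  awk '/^>/ { if (len) print len; len = 0; next }   # header: emit previous length
       { len += length($0) }                         # sequence line: add its length
       END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
           half = total / 2; run = 0
           for (i = 1; i <= NR; i++) {
             run += lens[i]
             if (run >= half) { print lens[i]; exit }
           }
         }'
}

# Run on the Trinity output only if it is present.
if [ -f trinity_out_dir/Trinity.fasta ]; then
  fasta_n50 trinity_out_dir/Trinity.fasta
fi
```

For example, contigs of 10, 6, 4, and 2 bases total 22 bases; the running sum reaches half of that (11) at the second contig, so the N50 is 6.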
3.7 Transcript Abundance Estimation
1. Using the perl script “align_and_estimate_abundance.pl” with the parameters “--est_method RSEM” and “--aln_method bowtie” or “bowtie2”, map the reads of each sample to the assembled transcripts independently.
/trinity/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left LC_R1.fastq --right LC_R2.fastq --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir rsem_outdir_LC
/trinity/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left HC_R1.fastq --right HC_R2.fastq --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir rsem_outdir_HC
Fig. 4 Showing FPKM value of each transcript in RSEM output
When the job is finished, you will find two output files in each output directory (rsem_outdir_HC and rsem_outdir_LC), named RSEM.isoforms.results and RSEM.genes.results (Fig. 4), which contain the abundance estimates.
2. Using the files generated in the first step, build a matrix of transcript count data for each sample in a tab-delimited text file by running the perl script “abundance_estimates_to_matrix.pl” found in Trinity’s utility (util) directory. For the command below, we renamed the RSEM.isoforms.results file of each sample to HC.isoforms.results and LC.isoforms.results.
/trinity/trinityrnaseq-Trinity-v2.5.1/util/abundance_estimates_to_matrix.pl --est_method RSEM --gene_trans_map none /transcriptome_data/HC.isoforms.results /transcriptome_data/LC.isoforms.results
When the job is finished, you will find two files, namely RSEM.isoform.counts.matrix and RSEM.isoform.TMM.EXPR.matrix. The RSEM.isoform.counts.matrix file (Fig. 5) is used for downstream differential expression analysis, and the RSEM.isoform.TMM.EXPR.matrix file serves as the gene expression matrix.
3.8 Normalization of Raw Count Data and Identification of Differentially Expressed Genes
Expression levels in RNA-seq analysis are measured from read counts, i.e., the number of fragments mapped to each transcript, which is expected to be proportional to transcript abundance in the RNA sample. Before comparing gene expression levels across samples, read counts must be normalized to remove nonuniformities among samples arising from technical sources of variation such as sequencing depth, library size, and nucleotide
Fig. 5 Showing RSEM generated counts matrix
compositions [7]. Some of the general methodologies employed in DGE analysis software programs to normalize RNA-seq data are as follows [27].
3.8.1 FPKM (Fragments per Kilobase per Million) and RPKM (Reads per Kilobase per Million)
These two methods perform the simplest normalization. The gene/transcript count is divided by the transcript length (in kilobases) and by the total number of mapped reads in each library (in millions). RPKM is used for single-end reads, while FPKM is used for paired-end reads. Both approaches are widely used for within-sample and between-sample normalisation and are implemented in most RNA-seq analysis programs, including DESeq, edgeR, and RSeQC [28]. Here, we used Empirical Analysis of Digital Gene Expression Data in R (edgeR) for differential expression analysis, run through the perl script “run_DE_analysis.pl” bundled with the Trinity software.
perl /opt/software/applications/trinity/trinityrnaseq-Trinity-v2.5.1/Analysis/DifferentialExpression/run_DE_analysis.pl --matrix RSEM.isoform.counts.matrix --method edgeR --output deg --dispersion .05
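The FPKM/RPKM normalization described in this section can be written as FPKM = (fragment count × 10^9) / (transcript length in bp × total mapped fragments). The short Python sketch below is only an illustration of that arithmetic, not the code used by RSEM or edgeR; the function name fpkm and its argument names are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6)
    scaling. For single-end data the same formula yields RPKM.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # 100 fragments on a 2,000 bp transcript in a library of 10 million
    # mapped fragments.
    print(fpkm(100, 2000, 10_000_000))
```

Because the value depends on both transcript length and library size, two transcripts with the same raw count can have very different FPKM values.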
When the job is finished, it generates one PDF file (RSEM.isoform.counts.matrix.HC_vs_LC.edgeR.DE_results.MA_n_Volcano.pdf) containing the MA plot and volcano plot. It also generates two tab-delimited files: (i) RSEM.isoform.counts.matrix.HC_vs_LC.edgeR.count_matrix and (ii) RSEM.isoform.counts.matrix.HC_vs_LC.edgeR.DE_results. With the help of the second (ii) tab-delimited file (Fig. 6), we can identify differentially expressed genes by applying filters such as Log2 fold change (2 and 2) and p-value (2.
(b) The size of bulges is not more than one or two nucleotides.
(c) The miRNA does start and end from a bulge.
(d) Maximum mismatches are not more than two nucleotides.
5. In psRNATarget analysis, “Expect value” should be less than 2.5 for prediction of accurate targets.
References
1. Chand Jha U, Nayyar H, Mantri N, Siddique KHM (2021) Non-coding RNAs in legumes: their emerging roles in regulating biotic/abiotic stress responses and plant growth and development. Cells 10:1674. https://doi.org/10.3390/CELLS10071674
2. Wang J, Meng X, Dobrovolskaya OB et al (2017) Non-coding RNAs and their roles in stress response in plants. Genomics Proteomics Bioinformatics 15:301. https://doi.org/10.1016/J.GPB.2017.01.007
3. Sharma Y, Sharma A, Madhu et al (2022) Long non-coding RNAs as emerging regulators of pathogen response in plants. Noncoding RNA 8(1):4. https://doi.org/10.3390/NCRNA8010004
4. Pachnis V, Belayew A, Tilghman SM (1984) Locus unlinked to alpha-fetoprotein under the control of the murine raf and Rif genes. Proc Natl Acad Sci 81:5523–5527. https://doi.org/10.1073/PNAS.81.17.5523
5. Zhang H, Guo H, Hu W, Ji W (2020) The emerging role of long non-coding RNAs in plant defense against fungal stress. Int J Mol Sci 21:2659. https://doi.org/10.3390/IJMS21082659
6. Wani SH, Vijayan R, Choudhary M et al (2021) Nitrogen use efficiency (NUE): elucidated mechanisms, mapped genes and gene networks in maize (Zea mays L.). Physiol Mol Biol Plants 27:2875–2891. https://doi.org/10.1007/S12298-021-01113-Z
7. Kumar A, Sharma M, Kumar S et al (2018) Functional and structural insights into candidate genes associated with nitrogen and phosphorus nutrition in wheat (Triticum aestivum L.). Int J Biol Macromol 118:76–91. https://doi.org/10.1016/J.IJBIOMAC.2018.06.009
8. Suravajhala P, Kumar A, Pandeya A et al (2018) A web resource for nutrient use efficiency-related genes, quantitative trait loci and microRNAs in important cereals and model plants. F1000Res 7. https://doi.org/10.12688/F1000RESEARCH.14561.1
9. Kumar A, Batra R, Gahlaut V et al (2018) Genome-wide identification and characterization of gene family for RWP-RK transcription factors in wheat (Triticum aestivum L.). PLoS One 13:e0208409. https://doi.org/10.1371/JOURNAL.PONE.0208409
10. Kumar A, Sharma M, Gahlaut V et al (2019) Genome-wide identification, characterization, and expression profiling of SPX gene family in wheat. Int J Biol Macromol 140:17–32. https://doi.org/10.1016/J.IJBIOMAC.2019.08.105
11. Kumar A, Gahlaut V, Nagaraju M (2020) Transcription factors and their roles in phosphorus stress tolerance in crop plants. In: Transcription factors for abiotic stress tolerance in plants, pp 201–224. https://doi.org/10.1016/B978-012-819334-1.00011-3
12. Wani SH, Tripathi P, Zaid A et al (2018) Transcriptional regulation of osmotic stress tolerance in wheat (Triticum aestivum L.). Plant Mol Biol 97:469–487. https://doi.org/10.1007/S11103-018-0761-6
13. Gahlaut V, Jaiswal V, Kumar A, Gupta PK (2016) Transcription factors involved in drought tolerance and their possible role in developing drought tolerant cultivars with emphasis on wheat (Triticum aestivum L.). Theor Appl Genet 129:2019–2042. https://doi.org/10.1007/S00122-016-2794-Z
14. Djami-Tchatchou AT, Sanan-Mishra N, Ntushelo K, Dubery IA (2017) Functional roles of microRNAs in agronomically important plants-potential as targets for crop improvement and protection. Front Plant Sci 8:378. https://doi.org/10.3389/FPLS.2017.00378
15. Zhou M, Luo H (2013) MicroRNA-mediated gene regulation: potential applications for plant genetic engineering. Plant Mol Biol 83:59–75. https://doi.org/10.1007/S11103-0130089-1
16. Zhang B (2015) MicroRNA: a new target for improving plant tolerance to abiotic stress. J Exp Bot 66:1749. https://doi.org/10.1093/JXB/ERV013
17. Sun G (2012) MicroRNAs and their diverse functions in plants. Plant Mol Biol 80:17–36. https://doi.org/10.1007/S11103-0119817-6
18. Navarro L, Dunoyer P, Jay F et al (2006) A plant miRNA contributes to antibacterial resistance by repressing auxin signaling. Science 312:436–439. https://doi.org/10.1126/SCIENCE.1126088
19. Nanda S, Yuan SY, Lai FX et al (2020) Identification and analysis of miRNAs in IR56 rice in response to BPH infestations of different virulence levels. Sci Rep 10:1–13. https://doi.org/10.1038/s41598-020-76198-9
20. Parmar S, Gharat SA, Tagirasa R et al (2020) Identification and expression analysis of miRNAs and elucidation of their role in salt tolerance in rice varieties susceptible and tolerant to salinity. PLoS One 15:e0230958. https://doi.org/10.1371/JOURNAL.PONE.0230958
21. Chen SY, Su MH, Kremling KA et al (2020) Identification of miRNA-eQTLs in maize mature leaf by GWAS. BMC Genomics 21:1–13. https://doi.org/10.1186/S12864-02007073-0
22. Zhou Z, Cao Y, Li T et al (2020) MicroRNAs are involved in maize immunity against fusarium verticillioides ear rot. Genomics Proteomics Bioinformatics 18:241–255. https://doi.org/10.1016/J.GPB.2019.11.006
23. Zhao Z, Xue Y, Yang H et al (2016) Genome-wide identification of miRNAs and their targets involved in the developing internodes under maize ears by responding to hormone signaling. PLoS One 11:e0164026. https://doi.org/10.1371/JOURNAL.PONE.0164026
24. Zare S, Nazarian-Firouzabadi F, Ismaili A, Pakniyat H (2019) Identification of miRNAs and evaluation of candidate genes expression profile associated with drought stress in barley. Plant Gene 20:100205. https://doi.org/10.1016/J.PLGENE.2019.100205
25. Ye Z, Zeng J, Long L et al (2021) Identification of microRNAs in response to low potassium stress in the shoots of Tibetan wild barley and cultivated. Curr Plant Biol 25:100193. https://doi.org/10.1016/J.CPB.2020.100193
26. He X, Han Z, Yin H et al (2021) High-throughput sequencing-based identification of miRNAs and their target mRNAs in wheat variety Qing Mai 6 under salt stress condition. Front Genet 12:1467. https://doi.org/10.3389/FGENE.2021.724527
27. Singroha G, Sharma P, Sunkur R (2021) Current status of microRNA-mediated regulation of drought stress responses in cereals. Physiol Plant 172:1808–1821. https://doi.org/10.1111/PPL.13451
28. Sihag P, Sagwal V, Kumar A et al (2021) Discovery of miRNAs and development of heat-responsive miRNA-SSR markers for characterization of wheat germplasm for terminal heat tolerance breeding. Front Genet 12:1336. https://doi.org/10.3389/FGENE.2021.699420
29. Parveen A, Mustafa SH, Yadav P, Kumar A (2019) Applications of machine learning in miRNA discovery and target prediction. Curr Genomics 20:537. https://doi.org/10.2174/1389202921666200106111813
30. Meher PK, Begam S, Sahu TK et al (2022) ASRmiRNA: abiotic stress-responsive miRNA prediction in plants by using machine learning algorithms with pseudo K-tuple nucleotide compositional features. Int J Mol Sci 23:1612. https://doi.org/10.3390/IJMS23031612
31. Kumar A, Chauhan A, Sharma M et al (2017) Genome-wide mining, characterization and development of miRNA-SSRs in Arabidopsis thaliana. bioRxiv 203851. https://doi.org/10.1101/203851
32. Tyagi S, Kumar A, Gautam T et al (2021) Development and use of miRNA-derived SSR markers for the study of genetic diversity, population structure, and characterization of genotypes for breeding heat tolerant wheat varieties. PLoS One 16:e0231063. https://doi.org/10.1371/JOURNAL.PONE.0231063
33. Sagwal V, Sihag P, Singh Y et al (2022) Development and characterization of nitrogen and phosphorus use efficiency responsive genic and miRNA derived SSR markers in wheat. Heredity 2022:1–11. https://doi.org/10.1038/s41437-022-00506-4
34. Kozomara A, Birgaoanu M, Griffiths-Jones S (2019) miRBase: from microRNA sequences to function. Nucleic Acids Res 47:D155–D162. https://doi.org/10.1093/NAR/GKY1141
35. Griffiths-Jones S, Grocock RJ, van Dongen S et al (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34:D140. https://doi.org/10.1093/NAR/GKJ112
36. Acland A, Agarwala R, Barrett T et al (2013) Database resources of the National Center for biotechnology information. Nucleic Acids Res 41:D8–D20. https://doi.org/10.1093/NAR/GKS1189
37. Sayers EW, Agarwala R, Bolton EE et al (2019) Database resources of the National Center for biotechnology information. Nucleic Acids Res 47:D23–D28. https://doi.org/10.1093/NAR/GKY1069
38. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
39. Johnson M, Zaretskaya I, Raytselis Y et al (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36:W5. https://doi.org/10.1093/NAR/GKN201
40. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31:3406. https://doi.org/10.1093/NAR/GKG595
41. Dai X, Zhao PX (2011) psRNATarget: a plant small RNA target analysis server. Nucleic Acids Res 39:W155. https://doi.org/10.1093/NAR/GKR319
42. Dai X, Zhuang Z, Zhao PX (2018) psRNATarget: a plant small RNA target analysis server (2017 release). Nucleic Acids Res 46:W49–W54. https://doi.org/10.1093/NAR/GKY316
Chapter 10
Functional Annotation of miRNAs in Rice Using ARMOUR
Neeti Sanan-Mishra and Kavita Goswami
Abstract
The role of miRNAs and the significance of their interaction with mRNAs have been well established in a wide range of essential biological processes in plants. Many online databases report miRNAs and their target transcripts in a variety of plants. ARMOUR (A Rice miRNA–mRNA interaction resource) is a cohesive database for analyses of miRNAs and their predicted target mRNAs across 7 Indian rice cultivars in 38 different tissue or abiotic stress conditions. It covers the profiles of 689 known and 1664 putative novel miRNAs. The information on miRNA profiles is supplemented by sequence information for the mature and hairpin structures. ARMOUR provides the flexibility to query the database in multiple ways using preset or custom text searches. It also facilitates searching for the target mRNAs and determining their gene ontology enrichment and associated biological pathways. The interactive user interface allows ARMOUR to serve as an integrated resource for the investigation of miRNAs in rice and related plant species.
Key words Rice, microRNA, Transcripts, Gene annotation, Expression, Database
1 Introduction
miRNAs comprise a major class of small, non-coding RNAs that are processed from transcripts arising from different loci in the genome. They can regulate gene expression at the transcriptional and posttranscriptional levels in a sequence-specific manner [1, 2]. A single miRNA may regulate the expression of many cognate transcripts. Thus, miRNAs may act as master switches in many biological pathways related to growth, development, and response to stresses. The biogenesis and functions of miRNAs have been described in many articles [2–4]. Comprehensive profiling of miRNAs and their targets is necessary to understand their function in the context of the evolution and development of plants. Next-generation sequencing (NGS) technology, along with computational approaches, has accelerated the identification and prediction of miRNAs and their targets [4–6]. The founding
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_10, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
principles that have led to the development of specific prediction algorithms are, however, based on experimental studies of conserved miRNA sequences and their validated targets. Computational approaches are now widely used as a rapid, rigorous, and cost-effective means of identifying miRNAs and/or their targets from NGS datasets [6–9]. The primary resources for plant miRNAs include miRBase [10–12] and the Plant microRNA Database (PMRD) [13]. miRBase, hosted at the University of Manchester, contains 604 precursor and 738 mature rice miRNAs (Version 22) [12], while PMRD contains 2773 experimentally validated and computationally predicted miRNAs [13]. In addition, NCBI GEO/SRA holds close to 80,000 experimental series that include several rice NGS profiles. The accumulation of these data has created the need for a comprehensive database integrating information on the expression profiles of miRNAs and their target mRNAs. ARMOUR consolidates extensive NGS datasets of rice miRNAs to provide information on expression profiles, targets, and associated biological pathways [14].
2 Materials
2.1 Database Content
1. Data from leaf, root, flag-leaf, and panicle tissues of 7 Indian rice cultivars grown under normal, salt stress, and heat stress conditions.
2. The rice cultivars used were Pusa Basmati 1 (PB1), Lalat (high yielding), Satabdi (dry season), BPT 5206 (heat susceptible), Annapurna (dwarf heat tolerant), N-22 (heat/drought tolerant), and Pokkali (salt tolerant).
3. Rice miRNAs identified using NGS datasets from 38 different developmental and stress conditions, representing 7 varieties of rice.
4. The established or known miRNAs of rice were obtained from miRBase Rel 21 and searched in the NGS libraries, while the novel putative miRNAs were predicted using the miRcat tool in the UEA sRNA Workbench.
5. 689 known and 1664 predicted novel miRNAs are included in the database (Table 1).
6. The backend databases available for search comprise the miRNA precursor sequence database and the miRNA mature sequence database from miRBase. The transcript database was obtained from the MSU rice database release 7 [16].
ARMOUR for miRNAs Annotations
Table 1 Statistics on database content
Number of datasets included                     38
Number of known miRNAs                          689
Number of putative novel miRNAs                 1664
Number of miRNA target transcripts              14,890
Number of unique miRNA–mRNA associations        26,321
Number of associated pathways                   118
Number of GO-biological processes               2400
Number of GO-molecular functions                1868
Number of GO-cellular components                487
Fig. 1 Overview of user interaction options
3 Methods
The ARMOUR database is designed with a biologist-friendly user interface (UI) and user experience (UX). The home page provides an interactive view that familiarizes the user with the database. The UI allows the database to be queried in four different ways to inspect miRNA–mRNA interactions together with the available information on expression profiles (Fig. 1). The following sections describe the features of ARMOUR design and usage in detail.
Home page: http://armour.icgeb.trieste.it/login
Operating system(s): Platform independent.
Programming languages: HTML, CSS, MySQL, jQuery, SQL.
License: Not required.
Restrictions: None.
3.1 Key Features of the Database
1. Device-independent responsive front end.
2. Floating frames to ensure ease of navigation on all screens.
3. Fluid design for tables.
4. Reader-friendly color coding.
5. Ease of navigation on all screens.
6. Preloaded sample identifiers on each query interface.
7. Client-side validation of queries.
8. Simplified BLAST search.
9. Keyword filter on the query results.
10. Sorting functionality on table headers.
11. Interactive matrix table representation in advanced search.
12. Summary charts for gene ontology and pathway queries.
13. Sequence-level highlighting of miRNA structure.
3.2 Database Design and Utility
1. The database was designed to be portable so that multiple applications can be developed on top of it.
2. Specialised scripts were used for hassle-free updating, which allows the database to scale without any change in the design.
3. All data were stored in a MySQL v5.6 database, an RDBMS (Relational Database Management System) that is a preferred choice for biological databases.
4. Data normalization was applied to keep redundancy to a minimum, since one miRNA can target many mRNAs and vice versa.
5. The UI was built using HTML5 (Hyper Text Mark-up Language) and CSS3 (Cascading Style Sheets).
6. The UX was driven by PHP (Hypertext Preprocessor) and jQuery.
7. SQL queries were designed for memory-efficient data retrieval.
8. The API (application programming interface) is a distinctive feature of the design and schema.
9. An ER (Entity Relationship) tool was used to create a connected diagram. It comprises a total of 13 tables representing comprehensive and heterogeneous information related to miRNAs and mRNAs.
10. All images are generated on the fly through server-side scripting.
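The normalization point in the list above (one miRNA can target many mRNAs and vice versa) is commonly handled with a junction table. The following Python sketch shows the idea under stated assumptions: it uses an in-memory SQLite database rather than MySQL so that it is self-contained, and the table names (mirna, transcript, mirna_target), column names, and identifiers (miR-A, T1, and so on) are hypothetical placeholders, not the actual ARMOUR schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a many-to-many miRNA/transcript link."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mirna (
            mirna_id INTEGER PRIMARY KEY,
            name     TEXT UNIQUE NOT NULL
        );
        CREATE TABLE transcript (
            transcript_id INTEGER PRIMARY KEY,
            name          TEXT UNIQUE NOT NULL
        );
        -- Junction table: each row is one miRNA-transcript association,
        -- so neither miRNA nor transcript details are ever duplicated.
        CREATE TABLE mirna_target (
            mirna_id      INTEGER NOT NULL REFERENCES mirna(mirna_id),
            transcript_id INTEGER NOT NULL REFERENCES transcript(transcript_id),
            PRIMARY KEY (mirna_id, transcript_id)
        );
        """
    )
    conn.executemany("INSERT INTO mirna (name) VALUES (?)",
                     [("miR-A",), ("miR-B",)])
    conn.executemany("INSERT INTO transcript (name) VALUES (?)",
                     [("T1",), ("T2",), ("T3",)])
    pairs = [("miR-A", "T1"), ("miR-A", "T2"), ("miR-B", "T2"), ("miR-B", "T3")]
    conn.executemany(
        """
        INSERT INTO mirna_target (mirna_id, transcript_id)
        SELECT m.mirna_id, t.transcript_id
        FROM mirna AS m, transcript AS t
        WHERE m.name = ? AND t.name = ?
        """,
        pairs,
    )
    return conn


def targets_of(conn, mirna_name):
    """Names of all transcripts associated with one miRNA."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM mirna_target AS mt
        JOIN mirna AS m ON m.mirna_id = mt.mirna_id
        JOIN transcript AS t ON t.transcript_id = mt.transcript_id
        WHERE m.name = ?
        ORDER BY t.name
        """,
        (mirna_name,),
    )
    return [row[0] for row in rows]


def mirnas_targeting(conn, transcript_name):
    """Names of all miRNAs associated with one transcript."""
    rows = conn.execute(
        """
        SELECT m.name
        FROM mirna_target AS mt
        JOIN mirna AS m ON m.mirna_id = mt.mirna_id
        JOIN transcript AS t ON t.transcript_id = mt.transcript_id
        WHERE t.name = ?
        ORDER BY m.name
        """,
        (transcript_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(targets_of(db, "miR-A"))
    print(mirnas_targeting(db, "T2"))
```

With this layout, adding a new association only adds one row to the junction table, which is one way to keep redundancy low as the number of datasets grows.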
3.3 Database Access
1. It is designed to be queried in four different ways (Fig. 1) and any of the given search options can be used to examine miRNA–mRNA interaction with readily accessible expression level information.
2. A local installation of NCBI-BLAST v 2.2.3 [15] is implemented in the ARMOUR database. The BLAST variants available to users include blastn, tblastn, and tblastx, with a user-defined expect threshold (E-value) cutoff for searching.
3. The search can be performed directly from the home page using miRNA IDs or the other keys provided.
4. The user can search the database in two different ways.
(a) Sequence-based search: This covers the sequences of miRNAs and their precursors. It also uses the transcript database obtained from MSU7 [16].
(b) Query builder interface: This allows users to search the database with an unlimited number of keywords. These can also be used in combination with a list of miRNA or transcript IDs (Fig. 1). Keywords can be searched against Gene Description, GO (Gene Ontology), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
5. Every query generates results in a unique correlation matrix format that provides all information related to the query on the miRNA, precursor sequences, expression profiles, and target transcripts.
6. The matrix is populated with the number of gene hits, which can be re-queried against the database.
7. The matrix also shows how many genes are unique to, or shared within and across, multiple keywords.
3.4 Nomenclature and Gene Annotation
1. The annotation data type was restricted to varchar to ensure simplicity and fast querying and retrieval.
2. miRNAs can be searched directly from the home page using the miRNA IDs (Fig. 1).
3. miRNA IDs include the miRBase ID, miRBase accession, and miRNA family. The nomenclature of known miRNAs and their cognate targets conforms to that of miRBase and the Rice Genome Annotation Project.
4. The miRNA IDs of all putative novel miRNAs are represented as Novel-n (where n is an integer ranging from 1 to 1664). The predicted novel miRNA IDs will be updated periodically as the miRNAs are validated and entered into miRBase.
5. The transcript or gene IDs covered include the Entrez Gene ID, MSU7 Transcript ID, RAPDB gene ID, Uniprot ID, and RefSEQ mRNA ID.
6. Transcript or gene annotation includes gene description, RNA type (coding or noncoding), chromosome number, gene start, gene end, orientation, and transcript length.
7. The identified target transcripts were further annotated by identifying their GO categories. GO terms form a structured vocabulary that describes the biological process (P), molecular function (F), and cellular component (C) of gene products. To identify the pathways targeted and affected by these miRNAs through their target transcripts, the associated KEGG pathway information [17] is also provided.
3.5 miRNA or Transcript Identification and Scoring
1. The miRNA ID can be used to fetch all information related to the miRNA, its expression profiles, precursor miRNA, and target transcripts (Fig. 2).
2. Transcripts targeted by both known and predicted miRNAs were identified using psRNATarget (Plant Small RNA Target Analysis Server) [18] with default parameters and the dataset from Rice Annotation Project Version 7.
3. Targets were qualified if they satisfied all of the following criteria.
(a) Expectation score between 0 and 3.
(b) Unpaired energy (UPE) score ranging from 0.0 to 25.0.
(c) High-scoring segment pair (HSP) size of 15 to 20 nucleotides.
(d) Maximum number of transcripts targeted less than or equal to 200 hits.
Fig. 2 Overall schema of database development and functionality representation
(e) Flanking length around the target site for accessibility analysis of 17 nucleotides upstream and 13 nucleotides downstream.
(f) Central mismatch leading to translational inhibition located between nucleotides 9 and 11 from the 5′ end of the miRNA.
3.6 Expression Analysis
1. The normalized expression of the known and putative novel miRNAs listed in the database is accessible for the various libraries as reads per million.
2. The log2 fold-change, which identifies miRNA deregulation, can be determined. This is a unique feature of ARMOUR.
3. The analysis can be performed by conditional filtering in one or multiple samples or conditions, based on a cutoff in fold-expression (FE), to identify differential expression (DE).
4. The FE and DE results are provided in an interactive table with read values and annotation.
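The fold-change filtering described in the list above can be sketched in a few lines of Python. This is only an illustration and not the ARMOUR implementation: the function names, the pseudocount of 1 added to avoid division by zero, the default cutoff, and the miRNA names and read values in the demo are all assumptions made for the example.

```python
import math


def log2_fold_change(treated_rpm, control_rpm, pseudocount=1.0):
    """log2 ratio of two reads-per-million values.

    A pseudocount is added to both values (an assumption of this sketch)
    so that miRNAs with zero reads in one library do not cause an error.
    """
    return math.log2((treated_rpm + pseudocount) / (control_rpm + pseudocount))


def filter_deregulated(profiles, cutoff=1.0):
    """Keep miRNAs whose absolute log2 fold-change reaches the cutoff.

    profiles maps a miRNA name to a (control_rpm, treated_rpm) pair.
    """
    kept = {}
    for name, (control_rpm, treated_rpm) in profiles.items():
        value = log2_fold_change(treated_rpm, control_rpm)
        if abs(value) >= cutoff:
            kept[name] = value
    return kept


if __name__ == "__main__":
    # Hypothetical reads-per-million values, for illustration only.
    demo = {"miR-A": (1.0, 7.0), "miR-B": (10.0, 11.0), "miR-C": (3.0, 0.0)}
    for name, value in filter_deregulated(demo).items():
        print(name, round(value, 2))
```

A positive value indicates higher expression in the treated sample and a negative value indicates lower expression; the cutoff controls how strong a change must be to count as differential.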
4 Conclusion
Rice is one of the staple food crops of the world, and its genome has been sequenced and mapped [19]. Several publications describe the role of rice miRNAs in plant development and stress response [3, 4, 20–22]. ARMOUR provides a comprehensive integrome resource for rice to speed up the meta-analysis of miRNA-mediated gene regulation under various conditions. It integrates miRNA expression data with predicted target information for analyzing miRNA-associated phenotypes and biological functions by associating gene-level information with conventional GO pathways. This web application provides users with rice miRNA–mRNA integration data along with multiple interfaces to interact with the backend database through gene expression, sequence-based search, and a custom query builder. It is therefore a useful resource for researchers investigating miRNAs and their functional impacts in rice or related cereal crops.
Location: ARMOUR can be accessed at http://www.icgeb.org/mishra-lab.html
Acknowledgments The authors thank Dr. Deepti Mittal, Dr. Mohammed Aslam, and Dr. Neha Sharma for their help with library preparation. We are grateful to Ms. Rashmi Renu Sahoo, Mr. Yusuf Khan, and Ms. Anita Tripathi for assistance with analyzing the sequencing
data. The authors thank team Bionivid for their help on developing the database. We acknowledge the help of Mr. Dario Palmisano in hosting the database on the website. The work was supported by grants from ICGEB and Department of Biotechnology, Government of India.
References
1. Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2):281–297
2. Achkar NP, Cambiagno DA, Manavella PA (2016) miRNA biogenesis: a dynamic pathway. Trends Plant Sci 21(12):1034–1044
3. Sanan-Mishra N, Mukherjee SK (2007) A peep into the plant miRNA world. Open Plant Sci J 1:1–9
4. Sanan-Mishra N, Kumari A (2020) Role of RNA interference in seed germination. In: Plant Small RNA, pp 101–116
5. Goswami K, Tripathi A, Gautam B, Sanan-Mishra N (2019) Impact of next-generation sequencing in elucidating the role of microRNA related to multiple abiotic stresses. In: Molecular Plant Abiotic Stress: Biology and Biotechnology, pp 389–426
6. Motameny S, Wolters S, Nurnberg P, Schumacher B (2010) Next generation sequencing of miRNAs – strategies, resources and methods. Genes 1(1):70–84
7. Yang X, Zhang H, Li L (2011) Global analysis of gene-level microRNA expression in Arabidopsis using deep sequencing data. Genomics 98(1):40–46
8. Tripathi A, Goswami K, Sanan-Mishra N (2015) Role of bioinformatics in establishing miRs as modulators of abiotic stress responses: the new revolution. Front Physiol 6:286
9. Yang X, Li L (2012) Analyzing the microRNA transcriptome in plants using deep sequencing data. Biology 1(2):297–310
10. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34(Database issue):D140–D144
11. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36(Database issue):D154–D158
12. Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42(Database issue):D68–D73
13. Zhang Z, Yu J, Li D, Liu F, Zhou X, Wang T, Ling Y, Su Z (2010) PMRD: plant microRNA database. Nucleic Acids Res 38(Database issue):D806–D813
14. Sanan-Mishra N, Tripathi A, Goswami K, Shukla RN, Vasudevan M, Goswami H (2018) ARMOUR - a rice miRNA: mRNA interaction resource. Front Plant Sci 9:602
15. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36(Web Server issue):W5–W9
16. Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L, Orvis J, Haas B, Wortman J, Buell CR (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res 35(Database issue):D883–D887
17. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30
18. Dai X, Zhao PX (2011) psRNATarget: a plant small RNA target analysis server. Nucleic Acids Res 39(Web Server issue):W155–W159
19. Sasaki T, Burr B (2000) International rice genome sequencing project: the effort to completely sequence the rice genome. Curr Opin Plant Biol 3(2):138–141
20. Jones-Rhoades MW, Bartel DP, Bartel B (2006) MicroRNAs and their regulatory roles in plants. Annu Rev Plant Biol 57:19–53
21. Lv DK, Bai X, Li Y, Ding XD, Ge Y, Cai H, Ji W, Wu N, Zhu YM (2010) Profiling of cold-stress-responsive miRNAs in rice by microarrays. Gene 459(1–2):39–47
22. Ding Y, Chen Z, Zhu C (2011) Microarray-based analysis of cadmium-responsive microRNAs in rice (Oryza sativa). J Exp Bot 62(10):3563–3573
Chapter 11
Identification of ceRNAs in Cereal Crops: A Computational Approach
Tinku Gautam, Hemant Sharma, Rakhi Singh, and Anuj Kumar
Abstract
In recent years, it has become evident that noncoding RNAs (ncRNAs), both short (miRNA/siRNA) and long noncoding RNAs (lncRNAs), play an important role in the regulation of gene expression. MicroRNAs (miRNAs) are among the most studied RNA molecules; their function is to negatively regulate the expression of target genes by complementary binding. Competing endogenous RNAs (ceRNAs), or target mimics (TMs), are transcripts that sequester miRNAs and thereby deregulate the expression of miRNA target genes. Recent advances in the study of miRNA-target-ceRNA networks have gradually revealed the functional significance of ceRNAs in regulating normal development and stress response processes in plants and animals. The computational identification of ceRNAs is therefore an important and necessary step toward a deeper understanding of the regulatory mechanisms of various biological processes. Here, we provide a pipeline that can be used to identify candidate ceRNAs in plants.
Key words ceRNA, Target mimics, miRNA, Wheat, TAPIR, Computational identification
1 Introduction
MicroRNAs (miRNAs) are small (18–24 nt), endogenous RNAs whose main function is to negatively regulate the expression of target genes [1]. So far, these tiny RNA molecules are among the most studied noncoding RNAs (ncRNAs) in all biological systems [2]. In brief, a miRNA binds to its target messenger RNA (mRNA) in a complementary, sequence-specific manner and silences it at the post-transcriptional level, either by inducing cleavage of the target mRNA or, in some cases, by inhibiting its translation into protein. miRNAs therefore act as key regulators of diverse developmental processes, stress responses, metabolic processes, and apoptosis [3]. At the gene level, the expression of miRNAs is mainly controlled by different transcription factors (TFs) or by cis-regulatory elements present in the promoter regions of miRNA genes.
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_11, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Tinku Gautam et al.
However, the regulation of miRNAs at the post-transcriptional level is relatively complex, and a number of theories have been put forward. In one of these, Salmena et al. [4] proposed a post-transcriptional regulatory model of miRNAs, namely, the “ceRNA hypothesis.” This hypothesis suggests that some endogenously produced RNAs competitively sponge miRNAs and are therefore named competing endogenous RNAs (ceRNAs). A ceRNA is an RNA molecule that negatively regulates the activity of the miRNAs it captures and thereby indirectly controls the expression of miRNA target genes. In 2007, an ncRNA named Induced by Phosphate Starvation 1 (IPS1) was identified that can sequester miR399 owing to a 23-nucleotide binding site similar to the binding site in PHO2 mRNA (involved in phosphate homeostasis) [5]. The IPS1 site is similar in sequence to that of PHO2, except for a 3-nt mismatch at positions 10–12 that forms a loop at the cleavage site. This loop prevents cleavage of IPS1, which results in the sequestering of miR399; as a consequence, the original target of miR399 (PHO2) is deregulated. This mechanism is known as target mimicry, and the transcripts that compete for the miRNA are known as target mimics. Another ceRNA, identified in rice and named MIKKI (a retrotransposon-derived transcript), acts as a decoy for osa-miR171 and antagonizes the inhibition of SCARECROW-Like TFs, which function in rice root development [6]. Recent research in plant and animal systems makes it clear that ceRNAs are not always ncRNAs: transcripts originating from protein-coding genes, pseudogenes, and transposable elements, as well as circular RNAs, can function as ceRNAs to regulate the activity of miRNAs [6–10]. Therefore, almost any transcript in the transcriptome of a species can potentially function as a ceRNA and can be screened during ceRNA identification.
Keeping in view the importance of ceRNAs in crop improvement, we systematically describe a pipeline (including required programs, parameters, and databases) for ceRNA identification in plants based on high-throughput sequencing data.
2 Material
2.1 Computational Facility
A standard desktop or laptop computer (Windows 10/11; a high-speed processor of the tenth generation or newer) with a high-speed Internet connection is required.
2.2 Database and Analysis Tools
1. miRBase Database (this database is a collection of published miRNA sequences; http://www.mirbase.org) [10].
Protocol for ceRNAs Identification
2. SSEARCH program of the UVa FASTA server (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi).
3. TAPIR web server (http://bioinformatics.psb.ugent.be/webtools/tapir) [11].
2.3 Data Set Required for ceRNA Identification
1. Mature sequences of the miRNAs of interest, which can be downloaded from the miRBase database or derived from in-house sRNA experimental data.
2. A collection of transcripts, which can be downloaded from the Ensembl Plants database (http://plants.ensembl.org/index.html) or derived from in-house RNA-seq experiments.
3 Methods
In this chapter, we use the most studied miRNA-ceRNA pair (miR399-IPS1) from Arabidopsis to demonstrate the methodology and also provide some results from wheat.
3.1 Retrieval of miRNA Sequences from miRBase Database
1. Open the miRBase database (http://www.mirbase.org/) and navigate to the search option on the home page.
2. Enter ath-miR399a in the search box and download the mature miRNA sequence (Fig. 1).
3.2 Retrieval of Transcript Sequence
1. Go to the Ensembl Plants database (http://plants.ensembl.org/index.html) and select Arabidopsis from the species list.
2. Enter IPS1 (AT3G09922) in the search box and download the mRNA (cDNA) sequence in FASTA format (Fig. 2).
3.3 Identification of ceRNAs
The following are the two main methods for the computational identification of ceRNAs in plants.
1. Using the SSEARCH program of the FASTA server.
(a) Open the home page of the UVa FASTA server and open “SSEARCH” from the program list (https://fasta.bioch.virginia.edu/fasta_www2/fasta_intro.shtml).
(b) Click on the option “Align two sequences.”
(c) Choose FASTA: DNA:DNA from “(A) Program.”
(d) Paste the miRNA sequence in “(B.1) Enter first query sequence.”
(e) Then choose DNA (rev-com only).
(f) Paste the transcript sequence in “(C.1) Enter the second sequence.”
Fig. 1 Retrieval of mature miRNA sequence from miRBase database
(g) Click on the “Compare sequence” option with the default parameters (Fig. 3).
(h) Manually check the search results (Fig. 4) against the rules prescribed by Ivashuta et al. [12] for ceRNA identification. For the rules, see the “Notes” section.
2. Identification of ceRNAs using the TAPIR web server.
(a) Open the home page of TAPIR using the URL http://bioinformatics.psb.ugent.be/webtools/tapir.
(b) Click on “Precise,” one of the three options on the home page.
(c) Paste the miRNA sequence in the first box and the transcript sequences in the second box.
(d) Select the option “Target mimicry search” with the default parameters (Fig. 5).
Fig. 2 Retrieval of transcript sequence from Ensembl Plants database
(e) Submit the sequences; after a few minutes, the results can be explored (Fig. 6).
(f) Manually check these results against the rules provided in the “Notes” section.
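Both the “DNA (rev-com only)” setting in SSEARCH and the target mimicry search in TAPIR rest on pairing between a miRNA and the reverse complement of a transcript site. The following is a minimal Python sketch of that idea only; the function name and the example sequence are illustrative assumptions, not a miRBase record or part of either tool.

```python
# Minimal sketch: derive the DNA site that would pair perfectly with a miRNA.
# The helper name and the example sequence are hypothetical placeholders.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def perfect_target_site(mirna: str) -> str:
    """Return the cDNA site (5'->3') fully complementary to the given miRNA."""
    dna = mirna.upper().replace("U", "T")   # miRNA records use U; cDNA uses T
    return "".join(COMPLEMENT[base] for base in reversed(dna))

example_mirna = "AUGCUAGCUAGGAUCCGAUCA"     # made-up 21-nt sequence
print(perfect_target_site(example_mirna))
```

A candidate ceRNA site is expected to deviate from this perfect site around the cleavage position, which is what the rules in the “Notes” section test.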
4 Notes
Rules for the selection of candidate ceRNAs:
1. A bulge of 1–5 nt or a mismatch of 1–2 nt should be present in the ceRNA sequence corresponding to miRNA bases 10–11.
2. No bulge is allowed in any other region of the ceRNA sequence.
3. Only one mismatch is allowed at the first base.
4. Mismatches in a row should not exceed 2.
5. There should be no more than three mismatches overall.
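The manual check against these rules can be approximated in code. The sketch below is one possible reading of the rules, not the authors' implementation: the alignment input format, the function name, the treatment of G:U pairs as mismatches, and the choice to count the overall mismatch limit outside the central site are all assumptions.

```python
# Sketch of the candidate-ceRNA rules; input format and names are assumptions.
# mirna_aln:  miRNA written 5'->3', with '-' where the transcript has bulged bases
# target_aln: transcript site written 3'->5' so that column i pairs with column i,
#             with '-' where a miRNA base has no partner

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def is_candidate_cerna(mirna_aln: str, target_aln: str) -> bool:
    mirna = mirna_aln.upper().replace("T", "U")
    target = target_aln.upper().replace("T", "U")
    if len(mirna) != len(target):
        raise ValueError("aligned strings must have equal length")

    pos = 0             # miRNA bases read so far (1-based position of last base)
    central_bulge = 0   # bulged transcript bases between miRNA bases 10 and 11
    mismatches = []     # miRNA positions that are not Watson-Crick paired

    for m, t in zip(mirna, target):
        if m == "-":                    # extra transcript base (bulge)
            if pos != 10:               # rule 2: bulge only between bases 10 and 11
                return False
            central_bulge += 1
            continue
        pos += 1
        if t == "-":                    # unpaired miRNA base: treated as disallowed
            return False
        if (m, t) not in WATSON_CRICK:  # assumption: G:U counts as a mismatch
            mismatches.append(pos)

    central_mm = [p for p in mismatches if p in (10, 11)]
    other_mm = [p for p in mismatches if p not in (10, 11)]

    # Rule 1: a 1-5 nt bulge or a 1-2 nt mismatch opposite miRNA bases 10-11.
    if central_bulge > 5 or (central_bulge == 0 and not central_mm):
        return False
    # Rule 3 tolerates a mismatch at base 1; it is counted with the others here.
    # Rule 4: no more than two mismatches in a row.
    run = longest = 0
    previous = None
    for p in mismatches:
        run = run + 1 if previous is not None and p == previous + 1 else 1
        longest = max(longest, run)
        previous = p
    if longest > 2:
        return False
    # Rule 5: at most three mismatches outside the central site (an interpretation).
    return len(other_mm) <= 3
```

Such a filter can only shortlist alignments; the SSEARCH or TAPIR output still has to be converted into the two aligned strings, and borderline cases are best inspected by eye.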
Fig. 3 Pipeline for the computational identification of ceRNAs in plants using ssearch fasta program
Fig. 4 Identification of ceRNAs: (a) bulge in the IPS1 transcript, (b) some more examples using wheat transcripts
Fig. 5 Pipeline for the identification of ceRNAs using TAPIR web server
Fig. 6 Results produced by TAPIR web server
References
1. Steinkraus BR, Toegel M, Fulga TA (2016) Tiny giants of gene regulation: experimental strategies for microRNA functional studies. Wiley Interdiscip Rev Dev Biol 5:311–362. https://doi.org/10.1002/wdev.223
2. Paschoal AR, Maracaja-Coutinho V, Setubal JC et al (2012) Non-coding transcription characterization and annotation: a guide and web resource for non-coding RNA databases. RNA Biol 9:274–282. https://doi.org/10.4161/rna.19352
3. Ameres SL, Zamore PD (2013) Diversifying microRNA sequence and function. Nat Rev Mol Cell Biol 14:475–488. https://doi.org/10.1038/nrm3611
4. Salmena L, Poliseno L, Tay Y, Kats L, Pandolfi PP (2011) A ceRNA hypothesis: the Rosetta stone of a hidden RNA language? Cell 146:353–358. https://doi.org/10.1016/j.cell.2011.07.014
5. Franco-Zorrilla JM, Valli A, Todesco M et al (2007) Target mimicry provides a new mechanism for regulation of microRNA activity. Nat Genet 39:1033–1037. https://doi.org/10.1038/ng2079
6. Cho J, Paszkowski J (2017) Regulation of rice root development by a retrotransposon acting as a microRNA sponge. eLife 6:30038. https://doi.org/10.7554/eLife.30038.001
7. Tay Y, Kats L, Salmena L, Weiss D, Tan SM, Ala U et al (2011) Coding-independent regulation of the tumor suppressor PTEN by competing endogenous mRNAs. Cell 147:344–357. https://doi.org/10.1016/j.cell.2011.09.029
8. Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP (2010) A coding independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465:1033–1038. https://doi.org/10.1038/nature09144
9. Witkos TM, Krzyzosiak WJ, Fiszer A, Koscianska E (2018) A potential role of extended simple sequence repeats in competing endogenous RNA crosstalk. RNA Biol 15:1399–1409. https://doi.org/10.1080/15476286.2018.1536593
10. Kozomara A, Birgaoanu M, Griffiths-Jones S (2019) miRBase: from microRNA sequences to function. Nucleic Acids Res 47:55–62. https://doi.org/10.1093/nar/gky1141
11. Bonnet E, He Y, Billiau K, Van de Peer Y (2010) TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. Bioinformatics 26:1566–1568. https://doi.org/10.1093/bioinformatics/btq233
12. Ivashuta S, Banks IR, Wiggins BE et al (2011) Regulation of gene expression in plants through miRNA inactivation. PLoS One 6:e21330. https://doi.org/10.1371/journal.pone.0021330
Chapter 12
Genotyping-by-Sequencing (GBS) Method for Accelerating Marker-Assisted Selection (MAS) Program
Laavanya Rayaprolu, Santosh P. Deshpande, and Rajeev Gupta
Abstract
Marker-assisted selection (MAS) plays a pivotal role in breeding programs, where molecular DNA markers are used to support phenotypic selection in crop improvement. Of the several marker types in use, single-nucleotide polymorphisms (SNPs) have been widely identified and effectively applied. Next-generation sequencing (NGS) technologies have brought significant changes to whole-genome sequencing and are revolutionizing plant breeding. Genotyping-by-sequencing (GBS) is a rapid, cost-effective, and high-throughput NGS method that enables genotyping of large populations together with the discovery of SNPs. The GBS approach includes digestion of genomic DNA with restriction enzymes followed by ligation of barcode adapters, PCR amplification, and sequencing of the amplified DNA pool on a single lane of a flow cell. This method has been developed and applied in the sequencing of multiplexed genomic samples. GBS has been implemented successfully in genome-wide association studies (GWAS), diversity studies, QTL mapping, genetic linkage analysis, marker discovery, and genomic selection in large-scale plant breeding programs.
Key words Marker-assisted selection (MAS), Genotyping-by-sequencing (GBS), Next-generation sequencing (NGS), Single-nucleotide polymorphism (SNP), Genomic selection (GS)
1 Introduction
Plant breeding involves the phenotypic selection of individuals with desired traits in a segregating population. However, selection based on trait phenotype has several limitations, especially concerning genotype × environment (G × E) interactions. Additionally, phenotypic selection procedures are often expensive, laborious, and time-consuming, and show the least selection gain in trait value, especially for traits with complex inheritance. To address these limitations, an approach that supports phenotypic selection with molecular markers was developed to provide reliable indirect selection for target traits. The discovery of DNA markers led to the identification and mapping of desired genes for specific traits for crop improvement. Analysis of polymorphic markers leads to better selection gains compared to morphological and biochemical
Shabir Hussain Wani and Anuj Kumar (eds.), Genomics of Cereal Crops, Springer Protocols Handbooks, https://doi.org/10.1007/978-1-0716-2533-0_12, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Laavanya Rayaprolu et al.
markers. Molecular markers are independent of the environment and detectable at all stages of plant growth. Molecular markers linked with a gene/QTL cosegregate with the trait phenotype across generations. DNA markers detect allelic variation in the genes underlying desired traits and therefore increase the efficiency of plant breeding. Marker-assisted selection (MAS) was developed to overcome the problems of conventional plant breeding by shifting from the selection of phenotypes to the selection of genes, either directly or indirectly.
2 Materials
2.1 Molecular Markers
Several molecular markers have been developed in molecular breeding to enhance crop improvement. The first marker type, restriction fragment length polymorphism (RFLP) [1], was used for genetic map construction. The first group of genetic linkage maps of sorghum consisted of RFLP markers derived from maize probes [2]. RFLP markers were challenging to use: they were time-consuming, relied on radioactive labeling, had a limited number of available probes, and required complicated hybridization [3]. Later, several PCR-based markers were developed that were less time-consuming and inexpensive. These improved markers include random amplification of polymorphic DNA (RAPD) [4], sequence characterized amplified region (SCAR) [5], cleaved amplified polymorphic sequences (CAPS) [6], simple sequence repeats (SSRs) [7, 8], amplified fragment length polymorphisms (AFLPs) [9], and direct amplification of length polymorphisms (DALP) [10]. Subsequently, the identification of variation at single base-pair resolution was introduced, and the use of single-nucleotide polymorphisms (SNPs) [11] as DNA markers for plant genotyping increased. Over the past decade, SNP-based marker techniques have improved in marker density at low cost. The most common systems for fluorescent detection of SNP-specific hybridization probes on PCR products are Taqman, Molecular Beacons, and Invader. SNP-specific PCR primer extension products are used in the homogeneous Mass-Extend (hME) assay, and their outputs are read on a MALDI-TOF mass spectrometer. These methods yield on the order of hundreds to thousands of SNPs per day [12]. Molecular markers are a prerequisite for gene mapping, segregation analysis, genetic diagnosis, phylogenetic analysis, and numerous biological applications. Among the various types of molecular markers available, SNPs are the most suitable for genome-wide analysis [12, 13].
2.2 Marker-Assisted Selection
Marker-assisted selection (MAS) is the use of trait-linked molecular markers for indirect selection of the desired allele/gene underlying a trait of interest for crop improvement. MAS increases the efficiency of breeding for trait selection compared to
Molecular Markers to Assist Crop Breeding
conventional breeding [13, 14]. MAS has several applications and advantages over phenotypic selection. Several traits are expressed only at the flowering or harvest stage; with DNA markers, the trait genotype can be identified at early stages so that appropriate crosses can be planned. MAS is highly useful for selecting traits with recessive inheritance, which cannot be determined from the phenotype in the heterozygous condition. A few traits require a specific environment for selection, and MAS can be used to select the genotypes in a nonspecific environment. Traits with lengthy phenotyping protocols, such as biochemical traits, can be assessed using DNA markers. For traits with low heritability and G × E interaction, selection through markers is the quicker route to higher genetic gain. MAS is useful for pyramiding two or more unlinked genes contributing to the same phenotype and, in backcross breeding, for speeding up recovery of the recurrent parent genome [14]. The success of MAS depends on strong marker–trait associations for the trait concerned. The markers should be tightly associated with the trait, codominant, and highly polymorphic, with adequate recombination between the markers and the rest of the genome, and should lie on a saturated linkage map. Low-cost selection of individuals and data analysis is an important consideration in MAS [14, 15]. Plant breeders use MAS to trace dominant or recessive alleles for the trait of interest, which helps to identify suitable individuals among segregating progenies [15]. Marker-assisted selection includes (1) marker-assisted backcross breeding, (2) marker-assisted gene/QTL pyramiding, (3) marker-assisted recurrent selection, and (4) genomic selection (GS) [14, 15].
2.3 Marker-Assisted Backcross Breeding
In marker-assisted backcross breeding, indirect selection is made using DNA markers linked to the desired gene, so that introgression of the target gene/QTL is carried out more precisely and efficiently. There are three levels of selection in backcross breeding: foreground selection, recombinant selection, and background selection. Foreground selection uses markers linked to the trait of interest. In recombinant selection, backcross progeny carrying the target gene are selected using tightly linked flanking markers to minimize linkage drag. In background selection, backcross progeny with maximum recovery of the recurrent parent genome are selected [14].
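As a point of reference for background selection, the expected recurrent parent genome share without any marker-based selection follows the standard expectation 1 − (1/2)^(n+1) after n backcrosses. This formula is textbook background rather than a result stated in this chapter, and the function name below is hypothetical. A short Python sketch:

```python
# Expected recurrent parent genome (RPG) fraction without background selection.
# Standard expectation 1 - (1/2)^(n+1); not a figure taken from this chapter.

def expected_recurrent_parent_fraction(n_backcrosses: int) -> float:
    """Return the expected RPG fraction in generation BC_n (n = 0 gives the F1)."""
    if n_backcrosses < 0:
        raise ValueError("number of backcrosses must be non-negative")
    return 1.0 - 0.5 ** (n_backcrosses + 1)

for n in range(1, 7):
    print(f"BC{n}: {expected_recurrent_parent_fraction(n):.2%}")
```

Individual progeny scatter around these averages, which is what background markers exploit: plants above the expected recovery in a given generation can be advanced preferentially.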
2.4 Gene/QTL Pyramiding
In marker-assisted gene pyramiding, multiple genes from different donors are simultaneously combined in a single genotype using linked markers [16]. With the availability of an array of molecular markers and genetic maps, MAS has become possible both for traits governed by major genes as well as for QTLs. The use of molecular markers in breeding depends on various factors such as a genetic map with molecular markers linked to the major gene(s) or QTLs. A tight association between the markers and the major gene(s) or
the QTLs is observed where the marker is genetically associated with the trait of interest and there is a low genetic distance between the marker and the gene. Adequate recombination between the markers associated with the trait(s) of interest and the rest of the genome is an important parameter for gene/QTL pyramiding [15].
2.5 Marker-Assisted Recurrent Selection
Recurrent selection involves repeated cycles of selection and intermating to improve a segregating population. Several selection cycles are possible within 1 year, accumulating favorable QTL alleles in the breeding population [14].
2.6 Genomic Selection
Genomic selection uses whole-genome marker data as predictors of performance and consequently delivers more accurate predictions of trait values for use in selection. In GS, hundreds or thousands of DNA markers covering the whole genome are used so that all genes are in linkage disequilibrium with at least one marker. This approach has become feasible because of the availability of a large number of SNPs and of new methods to genotype them efficiently [14].
2.7 Next-Generation Sequencing (NGS)
NGS relies on massively parallel sequencing, producing millions of sequences simultaneously at a low cost [13]. Several NGS platforms are in use, such as Roche 454 FLX Titanium [17], Illumina MiSeq and HiSeq2500 [18], and Ion Torrent PGM [19]. The development of high-throughput sequencing technologies has lowered the cost of DNA sequencing to a great extent [13]. All NGS platforms share a similar protocol of DNA template preparation followed by ligation of universal adapters at both ends of the sheared DNA fragments. Sequencing proceeds iteratively: nucleotides are incorporated, a signal is emitted, and the signal is detected by the sequencer [20]. Most NGS platforms generate reliable sequences and display near-perfect coverage behavior on GC-rich, neutral, and moderately AT-rich genomes [13]. In Illumina sequencing, DNA molecules and primers are first attached to a slide and amplified with polymerase so that local clonal DNA colonies are formed. To determine the sequence, four types of reversible terminator bases are added and nonincorporated nucleotides are washed away. A camera takes images of the fluorescently labeled nucleotides, and then the dye, along with the terminal 3′ blocker, is chemically removed from the DNA, allowing the next cycle to begin. The DNA chains are extended one nucleotide at a time, and image acquisition can be performed at a delayed moment, allowing very large arrays of DNA colonies to be captured by sequential images taken from a single camera [21]. Illumina generates shorter reads (50–300 bp), with sequencing throughputs ranging from 1.5 to 600 Gbp. The MiSeq sequencer and the high-throughput HiSeq2500 sequencer are some
of the instruments commercialized by Illumina [13]. NGS technologies have been used extensively in genotyping and have enabled novel applications [13, 21]. NGS is being used for whole-genome sequencing to discover large numbers of SNPs and to explore diversity within species for genome-wide association studies (GWAS) [22]. Throughout the process, accuracy and throughput at lower cost must be prioritized. Multiplex sequencing, in which samples are tagged with short DNA sequences (barcodes) and then pooled into a single sequencing channel [13], is also being used for mapping SNPs in cereals [22].
3 Methods
3.1 Genotyping-by-Sequencing (GBS)
The GBS technique is feasible for studying diversity in large genome panels [22]. GBS generates large numbers of SNPs for use in genetic analyses and genotyping [23]. GBS can target 2.3% of a genome for sequencing [24]. The advantages of the GBS method include low cost, reduced sample handling, fewer PCR and purification steps, no reference sequence limits, and efficient barcoding [25]. GBS is widely used for genomics-assisted plant breeding and provides a simple and efficient method for genetic map construction [13, 26]. Library preparation in GBS is simple and amenable to use on large numbers of individuals [22]. Modified library preparation methods yield more SNPs with greater coverage and depth, resulting in higher efficiency [13]. There are two different types of GBS protocols: (1) restriction enzyme digestion, in which no specific SNPs are targeted and which is ideal for discovering new markers for MAS programs; and (2) multiplex enrichment PCR, in which a set of SNPs has been defined for a section of the genome. A multiplex sequencing strategy was introduced that uses inexpensive barcoding techniques, increasing efficiency and cutting costs. GBS was introduced to study complex genomes, population studies, germplasm characterization, plant genetics, breeding, and SNP identification in diverse crops at low cost [13, 27].
3.2 GBS Methodology [13, 24, 28] (Fig. 1)
1. Sample Preparation: High-quality, contamination-free genomic DNA is crucial for GBS protocols. DNA from each sample should be accurately quantified, balanced across samples, and of high quality.
2. Restriction Enzyme: ApeKI, a type II restriction endonuclease, is the most commonly used restriction enzyme and creates a 5′ overhang. Sometimes, a two-enzyme (PstI–MspI) GBS protocol is followed, depending on the complexity of the sequencing.
3. Adapter Design: Customized barcode adapters are needed that are specific to a single restriction overhang sequence and terminate with a 4–8 bp barcode on the end of the top strand. The adapter overhang is complementary to the “sticky” end generated by ApeKI. To minimize the risk of misidentifying samples, all pairwise combinations of barcodes differ by a minimum of three mutational steps.
4. Ligation: Ligation and digestion are done in the same tube/plate. NEB Buffer4 with the addition of ATP is used for the ligation reaction. The concentration of Adapter1 needs to be adjusted according to the species. Adapter2 is a Y-adapter and is not amplified unless the PCR reaction has first proceeded from Adapter1 on the other end of the same fragment.
5. Library Construction: In this protocol, the DNA samples and the barcode and common adapter pairs are plated and dried, followed by digestion of the samples with ApeKI. The adapters are ligated to the ends of the genomic DNA fragments, and the T4 ligase is inactivated by heating. Appropriate primers with binding sites on the ligated adapters are added, followed by PCR. The PCR products are cleaned up, and the fragment sizes of the resulting library are checked on a DNA analyzer. Libraries without adapter dimers are retained for DNA sequencing.
6. Multiplexing and PCR Amplification: The ligated samples are multiplexed, followed by PCR amplification. For Illumina HiSeq, the multiplexed library is PCR amplified using a short extension time. Fragments in the 200–500 bp range, which are suitable for bridge amplification, are enriched in this procedure, and fragments that have a PstI/MspI cut site amplify.
7. Sequencing on MiSeq: After sequencing, the GBS workflow generates a de-multiplexed set of FASTQ files with the adapter sequences removed. The denatured and diluted library containing PhiX is loaded onto a MiSeq Reagent Kit cartridge and the run is initiated. The data are downloaded in FASTQ format, with each sample represented by a forward and a reverse read file labeled R1 and R2.
Fig. 1 Steps of GBS. (a) Tissue is obtained from plant species. (b) Leaf tissue is ground for DNA isolation, quantification, and normalization. (c) DNA digestion with restriction enzymes. (d) Ligation of adapters. (e) Barcoding region in adapter 1 in random PstI–MseI restricted DNA fragments. (f) DNA fragments with different barcodes from different biological samples. (g) Bioinformatics analysis of sequences from the library on an NGS sequencer. (h) Applications of GBS: SNP identification, gene/QTL mapping, diversity analysis, GWAS, high-density genome maps, phylogenetics, identification of candidate genes, linkage analysis, marker discovery, and genomic selection. (Source: [12])
Presently, the most efficient approach for plant genotyping with NGS technologies is the Reduced Representation Library (RRL) [29]. This approach involves cutting the entire genome with specific restriction enzyme(s) to reduce genome complexity for the organism of interest. The resulting sequence dataset has higher read coverage per locus while allowing a higher level of multiplexing with uniquely barcoded adapters for different samples [12]. The limitation of RRL is that important genomic regions are not captured by GBS libraries when no restriction sites surround those regions. To overcome this, multiple GBS libraries with different enzyme combinations are used. Table 1 depicts the different methods of GBS. Source: [12].
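The barcode constraint in the adapter design step (every pair of barcodes differing by at least three mutational steps) can be checked before ordering adapters. The sketch below interprets a “mutational step” as one substitution, insertion, or deletion (edit distance), which is an assumption; the function names and barcodes are hypothetical and not taken from any published barcode set.

```python
# Sketch: flag barcode pairs that are closer than a minimum edit distance.
# "Mutational step" is interpreted as one substitution/insertion/deletion.
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def too_close_pairs(barcodes, min_steps=3):
    """Return (barcode1, barcode2, distance) for every pair below min_steps."""
    return [(x, y, d) for x, y in combinations(barcodes, 2)
            if (d := edit_distance(x, y)) < min_steps]

example_barcodes = ["CTCC", "TGCA", "ACTA", "CTCA"]   # illustrative only
for pair in too_close_pairs(example_barcodes):
    print(pair)
```

An empty result means the set satisfies the constraint under this interpretation; any reported pair is a candidate for replacement.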
4 Notes
Application in Plant Breeding
GBS is a powerful, rapid, and low-cost way to genotype different populations, with applications in diversity studies, genetic linkage analysis, GWAS, molecular marker discovery, and genomic selection. GBS helps to identify high-density SNP markers and construct genetic linkage maps for various applications in plant breeding [28]. NGS has improved marker quality and coverage, which is an essential prerequisite for the identification of trait-specific SNPs [13]. GBS is being used to sequence whole genomes and resequence recombinant inbred lines (RILs) to map traits and enhance breeding programs [13, 30]. Genomes of many crops such as maize, sorghum, wheat, barley, Arabidopsis, rice, potato, and cassava have been sequenced using GBS at a low cost
Table 1 Types of GBS protocols
Method | Restriction enzyme | Insert size | Barcodes | Sequencing platform | Sequencing mode
RAD-seq (Restriction site-associated DNA sequencing) | SbfI or EcoRI | Size-selection | 96 | Illumina | Paired-end
MSG (Multiplex shotgun genotyping) | MseI | Size-selection | 384 | Illumina | Single end
GBS (Genotype by sequencing) | ApeKI |
3
library(BGLR)
data(wheat)
Y : 0 if the ith genotype of jth SNP is AA
The Bayesian regression models differ in their prior assumptions on the effects of SNPs; all such regression models can be formulated as
$$y = 1_n \mu + Z\beta + \epsilon \quad \text{or} \quad y = \mu + \sum_{j=1}^{m} z_j \beta_j + \epsilon \qquad (1)$$
Here, $y$ is the vector of phenotypes of $n$ individuals, $\mu$ is the overall mean, $z_j$ is the vector of genotypes (0, 1, 2) corresponding to the $j$th SNP, $\beta_j$ is the effect of the $j$th SNP, and $\epsilon$ is the random vector of errors or residuals. For all the Bayesian regression models, $\epsilon$ is assigned $N(0, I\sigma_e^2)$, where the residual variance $\sigma_e^2$ is assigned a scaled-inverse chi-squared prior, that is, $\sigma_e^2 \sim \chi^{-2}(\sigma_e^2 \mid d_e, S_e)$, with $d_e$ and $S_e$ the degrees of freedom and scale parameter, respectively. The overall mean $\mu$ is assigned a flat prior.
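To make Eq. (1) concrete, the following Python sketch simulates phenotypes from the model. It is illustrative only: genotypes are drawn uniformly from {0, 1, 2} and SNP effects from a normal distribution, both of which are simplifying assumptions, and the function name and default values are hypothetical.

```python
# Sketch: simulate y = mu + sum_j z_j * beta_j + e for n individuals and m SNPs.
# Uniform genotype sampling and normal effects are simplifying assumptions.
import random

def simulate_phenotypes(n, m, mu=10.0, sigma_beta=0.5, sigma_e=1.0, seed=42):
    rng = random.Random(seed)
    # Z: n x m genotype matrix coded as 0, 1, or 2
    Z = [[rng.choice((0, 1, 2)) for _ in range(m)] for _ in range(n)]
    # beta: one effect per SNP
    beta = [rng.gauss(0.0, sigma_beta) for _ in range(m)]
    # y: overall mean + marker contributions + residual
    y = [mu + sum(z * b for z, b in zip(row, beta)) + rng.gauss(0.0, sigma_e)
         for row in Z]
    return Z, beta, y

Z, beta, y = simulate_phenotypes(n=5, m=3)
for row, value in zip(Z, y):
    print(row, round(value, 3))
```

The Bayesian models described next differ only in the prior placed on `beta`; the likelihood implied by this simulation is common to all of them.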
Genomic Selection using Bayesian Methods
3.1 BayesA
For BayesA, all $m$ SNPs are considered to have effects on the trait variance, with some of them having moderate to large effects. The prior distribution of the SNP effects is a scaled-t distribution [24], represented for easy computation as a mixture of scaled-normal distributions, that is, $f(\beta_j \mid \sigma_{\beta_j}^2) \sim N(\beta_j \mid 0, \sigma_{\beta_j}^2)$. The variances of the marker effects are assigned an inverse chi-squared prior with degrees of freedom $\nu$ and scale parameter $S$, that is, $f(\sigma_{\beta_j}^2 \mid \nu, S) \sim \chi^{-2}(\nu, S)$, where $S \sim \Gamma(r, s)$, with $r$ and $s$ the rate and shape parameters, respectively. The value of $r$ is set to a certain value, and $s$ is determined in such a way that the total contribution of the linear predictor equals the R-squared. The rate and shape parameters for the residual variance are determined in a similar manner.
3.2 BayesB
In BayesB, a large proportion $(1 - \pi)$ of SNPs are assumed to have zero effect, whilst a small proportion $(\pi)$ of SNPs are considered to have effects. The prior distribution of the effect of each SNP is a mixture of a scaled-t distribution (similar to BayesA) with probability $\pi$ and a point mass at zero with probability $1 - \pi$. In other words,
$$f(\beta_j \mid \sigma_{\beta_j}^2, \pi) = \begin{cases} N(0, \sigma_{\beta_j}^2) & \text{with probability } \pi \\ 0 & \text{with probability } (1 - \pi) \end{cases}$$
Similar to BayesA, the variances of the marker effects are assigned inverse chi-squared priors, that is,
$$f(\sigma_{\beta_j}^2 \mid \nu, S, \pi) = \begin{cases} \chi^{-2}(\nu, S) & \text{with probability } \pi \\ 0 & \text{with probability } (1 - \pi) \end{cases}$$
Here, the other hyperparameters $\nu$, $S$, $r$, and $s$ are determined in the same way as for BayesA. The parameter $\pi$ is considered unknown and is assigned a Beta prior, that is, $\pi \sim \mathrm{Beta}(p_0, \pi_0)$ with $p_0 > 0$ and $\pi_0 \in [0, 1]$. The Beta prior is parameterised in such a way that $E(\pi) = \pi_0$ and $\mathrm{var}(\pi) = \frac{\pi_0 (1 - \pi_0)}{1 + p_0}$.
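The stated mean and variance of the Beta prior correspond to the usual shape parameters $a = p_0\pi_0$ and $b = p_0(1 - \pi_0)$; this mapping is inferred from the moments given above rather than quoted from the chapter. The Python sketch below checks the moments and draws effects from a point-mass mixture. For brevity it uses a single fixed slab variance instead of locus-specific variances, which is a simplification, and all function names are hypothetical.

```python
# Sketch: Beta(p0, pi0) parameterisation and a spike-and-slab draw of SNP effects.
# Shape mapping a = p0*pi0, b = p0*(1 - pi0) is inferred from the stated moments.
import random

def beta_shapes(p0: float, pi0: float):
    """Convert (p0, pi0) into standard Beta shape parameters (a, b)."""
    return p0 * pi0, p0 * (1.0 - pi0)

def beta_mean_var(a: float, b: float):
    """Mean and variance of a Beta(a, b) distribution."""
    total = a + b
    return a / total, a * b / (total * total * (total + 1.0))

def draw_mixture_effects(m, p0, pi0, sigma_beta=1.0, seed=1):
    """Draw pi from its prior, then m effects that are nonzero with probability pi."""
    rng = random.Random(seed)
    a, b = beta_shapes(p0, pi0)
    pi = rng.betavariate(a, b)
    effects = [rng.gauss(0.0, sigma_beta) if rng.random() < pi else 0.0
               for _ in range(m)]
    return pi, effects

a, b = beta_shapes(p0=10.0, pi0=0.2)
mean, var = beta_mean_var(a, b)
print(mean, var, 0.2 * 0.8 / (1 + 10.0))   # the last two values should agree
```

Larger values of $p_0$ concentrate the prior around $\pi_0$, so $p_0$ acts as a prior sample size for the proportion of SNPs with nonzero effects.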
3.3 BayesC
In this model, a large group of SNPs is assumed to have no effect, and a small group of SNPs has larger effects on the variability of the trait. Further, SNPs with nonzero effects share a common variance instead of the locus-specific variance of the BayesB model [25]. Similar to the BayesB model,
$$f(\beta_j \mid \sigma_{\beta}^2, \pi) = \begin{cases} N(0, \sigma_{\beta}^2) & \text{with probability } \pi \\ 0 & \text{with probability } (1 - \pi) \end{cases}$$
Prabina Kumar Meher et al.
The prior distribution of the variance of the marker effects is an inverse chi-squared distribution, and all other assumptions about the hyperparameters are the same as in the BayesB model.
3.4 Bayesian LASSO (BLASSO)
In this model, the nonzero marker effects are assigned a double exponential prior [15], with the heavy tails accommodating the larger effects. Further, the double exponential distribution is represented as independent normal densities with zero mean and marker-specific variance, that is, $f(\beta_j \mid \tau_j^2, \sigma_e^2) \sim N(\beta_j \mid 0, \tau_j^2 \sigma_e^2)$, where $f(\tau_j^2 \mid \lambda^2/2) \sim \mathrm{Exp}(\tau_j^2 \mid \lambda^2/2)$ and $f(\lambda^2 \mid r, s) \sim \Gamma(r, s)$. The rate parameter $r$ and shape parameter $s$ are estimated in the same fashion as explained for BayesA and BayesB.
3.5 Bayesian Ridge Regression (BRR)
Here, the nonzero marker effects are assigned independent and identically distributed Gaussian priors with a common variance for all the effects. Statistically, $f(\beta_j \mid \sigma_{\beta}^2) \sim N(\beta_j \mid 0, \sigma_{\beta}^2)$, where the variance $\sigma_{\beta}^2$ is assigned a scaled-inverse chi-squared prior, that is, $f(\sigma_{\beta}^2 \mid \nu, S) \sim \chi^{-2}(\nu, S)$. The hyperparameters $\nu$ and $S$ are determined as explained for BayesA.
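For fixed variances, the Gaussian prior of BRR leads to a ridge-type conditional posterior mean, $\hat{\beta} = (Z'Z + \lambda I)^{-1} Z'(y - 1_n\mu)$ with $\lambda = \sigma_e^2 / \sigma_\beta^2$. This is a standard result added here for illustration; in the chapter's workflow the variances are sampled rather than fixed. The Python sketch below solves the system with plain Gaussian elimination; function names and the toy data are hypothetical.

```python
# Sketch: ridge-type conditional posterior mean of marker effects for fixed variances.
# beta_hat = (Z'Z + lambda*I)^(-1) Z'(y - mu), lambda = sigma2_e / sigma2_beta.

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ridge_effects(Z, y, mu, sigma2_e, sigma2_beta):
    lam = sigma2_e / sigma2_beta
    m = len(Z[0])
    resid = [yi - mu for yi in y]
    # Normal equations: (Z'Z + lam*I) beta = Z'(y - mu)
    A = [[sum(row[j] * row[k] for row in Z) + (lam if j == k else 0.0)
          for k in range(m)] for j in range(m)]
    rhs = [sum(row[j] * r for row, r in zip(Z, resid)) for j in range(m)]
    return solve_linear_system(A, rhs)

Z = [[0, 1], [1, 2], [2, 0], [1, 1]]               # toy genotypes (0/1/2)
y = [10.4, 12.1, 11.0, 11.3]                       # toy phenotypes
print(ridge_effects(Z, y, mu=10.0, sigma2_e=1.0, sigma2_beta=0.5))
```

A larger $\lambda$ (small $\sigma_\beta^2$ relative to $\sigma_e^2$) shrinks all effects more strongly toward zero, which is the uniform shrinkage that distinguishes BRR from the locus-specific priors above.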
3.6 BayesCπ
The BayesCπ [25, 26] method replaces the locus-specific variance of the marker effects in BayesB with a common variance. It also resembles the BayesC model with regard to the distribution of the marker effects, but the proportion of nonzero marker effects ($\pi$) is unknown and is assigned a uniform prior. The value of $\pi$ is determined in such a way that the prior density of the additive SNP effects is zero with probability $1 - \pi$ and normally distributed with probability $\pi$. Statistically,
$$f(\beta_j \mid \sigma_{\beta}^2, \pi) = \begin{cases} N(0, \sigma_{\beta}^2) & \text{with probability } \pi \\ 0 & \text{with probability } (1 - \pi) \end{cases}$$
3.7 BayesU
In the BayesU model [18], the marker effects are assigned a horseshoe prior [19] with locus-specific variance. Statistically,

$$
\beta_j \sim N(0, \tau^2 \sigma_j^2),
$$

where $\sigma_j$ is the local parameter, which is assigned a positive half-Cauchy prior with 1 degree of freedom, that is, $\sigma_j \sim C^{+}(0, 1)$, and the global parameter τ is assigned a flat prior.

3.8 BayesHP
In the BayesHP model, the marker effects are assigned a horseshoe prior with an additional local parameter, known as the Horseshoe+ prior, again with locus-specific variance. Statistically,

$$
\beta_j \sim N(0, \tau^2 \sigma_j^2), \quad
\sigma_j \sim C^{+}(0, \alpha_j), \quad
\alpha_j \sim C^{+}(0, 1), \quad
\tau \sim C^{+}(0, N^{-1}).
$$
Genomic Selection using Bayesian Methods
Here, $\sigma_j$ and $\alpha_j$ are the local parameters and τ is the global parameter, each assigned a positive half-Cauchy prior, and N is the size of the training dataset. The positive half-Cauchy distribution is represented as a mixture of inverse gamma (IG) distributions. More clearly,

$$
\begin{aligned}
\sigma_j^2 &\sim IG\!\left(\tfrac{1}{2}, \tfrac{1}{\theta_j}\right), &
\theta_j &\sim IG\!\left(\tfrac{1}{2}, \tfrac{1}{\alpha_j^2}\right), \\
\alpha_j^2 &\sim IG\!\left(\tfrac{1}{2}, \tfrac{1}{\vartheta_j}\right), &
\vartheta_j &\sim IG\!\left(\tfrac{1}{2}, 1\right), \\
\tau^2 &\sim IG\!\left(\tfrac{1}{2}, \tfrac{1}{\varphi}\right), &
\varphi &\sim IG\!\left(\tfrac{1}{2}, N^2\right).
\end{aligned}
$$

3.9 BayesHE

In BayesU and BayesHP, the local parameter $\sigma_j$ is assigned a positive half-Cauchy prior with a fixed degree of freedom, whereas in the BayesHE model the local parameter is assigned a half-t prior with an unknown degree of freedom.
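Statistically,

$$
\beta_j \sim N(0, \tau^2 \sigma_j^2), \quad
\sigma_j \sim \text{half-}t^{+}(\vartheta, 1), \quad
\tau \sim C^{+}(0, N^{-1}), \quad
\vartheta \sim \operatorname{Gamma}(a, b).
$$

By representing the t and Cauchy distributions as mixtures of inverse gamma (IG) distributions, the prior is reformulated as $\beta_j \sim N(0, \tau^2 \sigma_j^2)$, where $\sigma_j^2 \sim IG\!\left(\tfrac{\vartheta}{2}, \tfrac{\vartheta}{\theta_j}\right)$, $\theta_j \sim IG\!\left(\tfrac{1}{2}, 1\right)$, $\tau^2 \sim IG\!\left(\tfrac{1}{2}, \tfrac{1}{\varphi}\right)$, $\varphi \sim IG\!\left(\tfrac{1}{2}, N^2\right)$, and $\vartheta \sim \operatorname{Gamma}(a, b)$. More details on the estimation of the hyperparameters through the Gibbs sampling strategy can be found in Shi et al. [6].

3.10 Implementing BayesA, BayesB, BayesC, BLASSO, and BRR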
In this section, we discuss how to implement the different Bayesian methods in R to estimate marker effects, error variance, heritability, and breeding values.
3.10.1 Model with Only Random Effect of Markers
This model accounts only for the effects of the markers, without any other fixed effects or covariates.
> library(BGLR)
> model_1
> library(BGLR)
> model_2
> library(BGLR)
> model_3
X B h2_new varU_new varE_new varU VarE Herit
> awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "."}' Sorted_merged.bed > Sorted_merged_bed.saf
(f) Generate a HOMER-format BED file from the SAF or BED file, and then annotate the peaks.
> awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "."}' Sorted_merged.bed > Sorted_merged_Homer.bed
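The conversion used in the steps above can be tried on a small example before running it on real peaks. The sketch below is illustrative only: the file names toy_merged.bed and toy_merged.saf and the three intervals are made up, and OFS is set in a BEGIN block, which is an equivalent but more conventional way of writing the same awk command. It shows how the 0-based BED start becomes a 1-based start and how the peak identifier is assembled.

```shell
# Toy input: three merged peaks in BED layout (chrom, 0-based start, end).
# File names and coordinates are hypothetical, for illustration only.
printf 'chr1\t99\t200\nchr1\t499\t650\nchr2\t0\t120\n' > toy_merged.bed

# Build the five-column layout used above: ID, chrom, 1-based start, end, strand.
# BEGIN sets the tab output separator once, before any input line is read.
awk 'BEGIN { OFS = "\t" }
     { start = $2 + 1; print $1 "_" start "_" $3, $1, start, $3, "." }' \
    toy_merged.bed > toy_merged.saf

# Show the converted peaks.
cat toy_merged.saf
```

For the first toy interval (chr1, 99, 200), the output line carries the identifier chr1_100_200 followed by chr1, 100, 200, and a dot in the strand column, all separated by tabs.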
(g) Annotate the peaks using HOMER.
> annotatePeaks.pl Sorted_merged_Homer.bed hg19 1> peakAnn.csv 2> annLog.txt
Single Cell Analysis in Crops Improvements
(h) To convert the large signal files into smaller bigWig files, first sort the bedGraph (.bdg) files by chromosome and start coordinate.
> find TC* -name '*.bdg' | parallel "sort -k1,1 -k2,2n {} > {.}.sort.bdg"
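As a small check of the sort keys used above, the sketch below sorts a made-up four-line bedGraph file. The file names toy.bdg and toy.sort.bdg are hypothetical, and LC_ALL=C is an addition of this sketch, assumed here so that the chromosome ordering is the same across systems. The first key compares chromosome names as text and the second compares start positions as numbers.

```shell
# Hypothetical unsorted bedGraph: chrom, start, end, signal.
printf 'chr2\t100\t200\t1.5\nchr1\t300\t400\t0.7\nchr10\t0\t50\t2.0\nchr1\t5\t90\t3.1\n' > toy.bdg

# -k1,1 compares chromosome names as text; -k2,2n compares starts as numbers,
# so chr1:5 is placed before chr1:300 (a plain text sort would reverse them).
LC_ALL=C sort -k1,1 -k2,2n toy.bdg > toy.sort.bdg

# Show the sorted intervals.
cat toy.sort.bdg
```

The sorted file lists chr1:5, chr1:300, chr10:0, and chr2:100 in that order. Note that the text comparison on the first key places chr10 before chr2.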
When a q-value threshold is used to filter peak detection, the lower the value, the smaller the final peak set. When a p-value threshold is used instead, the opposite happens: the lower the p-value, the larger the final peak set, suggesting that the p-value filter acts on the left tail of the distribution. Here are some examples.
At p = 0.05:
> macs2 callpeak -t atac.bed.gz -g hs -n test_p05 -p 0.05
At p = 0.01:
> macs2 callpeak -t atac.bed.gz -g hs -n test_p01 -p 0.01
At q = 0.01:
> macs2 callpeak -t atac.bed.gz -g hs -n test -q 0.01
Motif Analysis for scATAC Data
A motif is a specific base sequence with high affinity for certain proteins. Motif analysis is carried out with the findMotifsGenome.pl tool of HOMER. Reliable peaks are first screened according to the peak information obtained from peak calling and annotation. Each predicted motif is then matched to its corresponding transcription factor [72]. Here are some helpful scripts for finding motifs with HOMER, which reports both novel and known motifs as output.
## scATAC HOMER commands
> findMotifsGenome.pl homer_peaks.bed hg19 motif homer -len 8,10,12
(d) Make a consensus BED file from the narrowPeak files of the ATAC replicates.
## First, load the Bioconductor libraries for working with genomic ranges:
> library(rtracklayer)
> library(GenomicRanges)
Upasna Srivastava and Satendra Singh
## Get the list of .narrowPeak files in the current directory. We can treat them all as replicates. The same code should work with .bed or .gff files.
> peaks