Plant Omics: Advances in Big Data Biology 1789247519, 9781789247510

This book provides a comprehensive overview of plant omics and big data in the fields of plant and crop biology. It disc


English Pages 309 [311] Year 2022

Table of contents :
Plant Omics
CABI BIOTECHNOLOGY SERIES
Copyright
Contents
Contributors
Preface
1 Plant Genomics
1.1 Introduction
1.2 Advanced Technologies in Plant Genomics
1.3 Status of Fabaceae Genomics
1.4 Status of Poaceae Genomics
1.5 Conclusion
References
2 Plant Transcriptomics: Data-driven Global Approach to Understand Cellular Processes and Their Reg
2.1 Introduction
2.2 Overview of RNA-Seq-Based Transcriptome Profiling
2.2.1 Phase-IA: Sampling time-point, replication, and depth of coverage
2.2.2 Phase-IB: Single or paired-end sequence reads - platform and error rate
2.2.3 Phase-II: Factors in processing of sequence reads and their limitations
2.2.4 Phase-III: Choosing the reference and mapping in model plant species
2.2.5 Phase-III: De novo or hybrid assembly for non-model species
2.2.6 Phase-III: Choice of aligner
2.2.7 Phase-IV: Detection of differentially expressed transcripts and their gene loci
2.3 Conclusions and Perspectives
Acknowledgment
References
3 Plant Proteomics
3.1 Introduction
3.2 Proteomic Technology in Plant Science
3.3 Plant-subcellular Proteomics
3.3.1 Importance of plant-subcellular proteomics
3.3.2 Subcellular proteomics: understanding mechanism in soybean under flooding stress
3.4 Plant Proteomics of Post-translational Modifications
3.4.1 Importance of post-translational modifications in plants
3.4.2 Post-translational modifications: understanding mechanism in soybean under flooding stress
3.5 Plant Proteomics: Understanding Environmental Stress Responses
3.5.1 Plant proteomics: understanding interaction between plants and biotic stress
3.5.2 Plant proteomics: understanding signaling mechanism under abiotic stresses
3.6 Future Perspective
References
4 Plant Metabolomics: The Great Potential of Plant Metabolomics in Big Data Biology
4.1 Introduction
4.2 Analytical Targets and Techniques
4.2.1 Analytical targets in plant metabolomics
4.2.1.1 Central metabolites
4.2.1.2 Secondary metabolites
4.2.2 Analytical methods for plant metabolomics
4.2.3 Metabolite identification/annotation in metabolomics data
4.3 The Importance of Sharing Metabolomics Data
4.3.1 Metabolome data repositories
4.3.2 Toward reproducible metabolome data analysis
4.3.3 Future metabolomics data analysis enhancing new biological discoveries
4.4 Conclusions and Outlook
References
5 Plant Phenomics
5.1 Introduction to Plant Phenomics
5.2 Basic Technologies for Plant Phenotyping
5.3 Indoor Phenotyping
5.3.1 Indoor phenotyping platforms
5.3.1.1 Laboratory or growth chamber
5.3.1.2 Greenhouse
5.3.2 Limitations of the current indoor phenotyping platforms
5.4 Field Phenotyping
5.4.1 Field phenotyping platforms
5.4.1.1 Satellites
5.4.1.2 UAVs
5.4.1.3 Ground-based platforms
5.4.2 Limitations of the current field phenotyping platforms
5.5 Conclusion and Future Perspectives
References
6 Plant Non-coding Transcriptomics: Overview of lncRNAs in Abiotic Stress Responses
6.1 Introduction
6.2 History of ncRNA Research
6.3 Classification of ncRNAs
6.4 Molecular Functions of ncRNA
6.4.1 miRNAs
6.4.2 Trans-acting siRNAs (ta-siRNAs) and phased siRNAs (pha-siRNAs)
6.4.3 Pol IV- and Pol V-derived lncRNAs and siRNAs
6.4.4 RNA interfering events induced by cis-natural antisense RNAs (cis-NATs)
6.4.5 Cis-NATs enhance mRNA translation
6.4.6 Cis-NATs derived from RNA degradation
6.4.7 lncRNAs COLDAIR, COOLAIR, and COLDWRAP that regulate chromatin modification at the FLC locus
6.4.8 ENOD40 and ASCO, mRNA-like long intergenic ncRNAs that regulate alternative splicing events by
6.4.9 APOLO and HID1, long intergenic ncRNAs forming RNA-DNA hybrids that repress gene expression
6.4.10 ceRNA/RNA Soggy/RNA decoy mimic miRNA targets
6.4.11 Circular RNA
6.4.12 RNA polymerase III-derived lncRNAs
6.4.13 Viroids: sub-viral plant-pathogenic lncRNAs
6.5 Concluding Remarks
Acknowledgments
References
7 Plant Epigenomics
7.1 Significance of Histone Modifications
7.1.1 Histone proteins in plants
7.1.2 Functions of conservative modification sites in canonical histone proteins
7.1.3 The genome-wide distribution and responsiveness of major histone modifications
7.1.4 Histones and histone modifications in the construction of genomes and chromosome structures
7.2 DNA Methylation
7.2.1 DNA methylation in plants
7.2.2 DNA methylation mechanism in A. thaliana
7.2.3 Genome-wide DNA methylation patterns in plant genomes
7.2.4 Methods to investigate global DNA methylation patterns
References
8 Plant Organellar Omics
8.1 Introduction
8.2 Nucleus
8.3 Endoplasmic Reticulum
8.4 Golgi Apparatus
8.5 Vacuole
8.6 Peroxisome
8.7 Oil Body
8.8 Plastid
8.9 Mitochondrion
8.10 Databases for Images/Movies of Organelle Dynamics
8.11 Conclusions
Acknowledgment
References
9 Plant Cis-elements and Transcription Factors
9.1 Introduction
9.2 Methods to Infer TF-DNA Interactions
9.2.1 Wet-lab approaches
9.2.2 Dry-lab approaches
9.3 Related Databases for TFs and Cis-elements
9.3.1 TF-related databases
9.3.2 Cis-element-related databases
9.4 Advanced Analysis in GRNs
9.5 Prospective View on Studies of Gene Regulation
References
10 Plant Gene Expression Network
10.1 Introduction
10.2 Visualization of Relationships of Genes by GENs: Nodes and Edges
10.3 Types of Relationships in GENs
10.4 Similarity and/or Reciprocity in Gene Expression Profiles
10.4.1 PCC
10.4.2 DCA
10.5 Common Regulatory Mechanisms in Gene Expressions
10.6 Sequence Similarities in mRNAs
10.7 Similarities in the Biological Functions of Expressed Genes
10.7.1 GENs with computational annotations of genes
10.7.2 GENs containing knowledge-based information and ontology for biological functions
10.7.3 GENs with metabolic pathway information
10.8 Network Construction Tools with Multiple Types of Information about Genes
10.9 Knowledge-bases for RNA-Seq Data, Expression Data, and GENs
Acknowledgments
References
11 Plant Hormones: Gene Family Organization and Homolog Interactions of Genes for Gibberellin Metabo
11.1 Plant Hormones and Height Control
11.2 Brassica napus
11.3 GAs
11.3.1 GA metabolism
11.3.2 GA signaling
11.3.3 GA-auxotroph and response mutants
11.4 GA Metabolism and Signaling Genes in B. napus: Gene Family Diversity and Gene Expression
11.4.1 Early GA biosynthesis (synthesis of GA12)
11.4.2 BnaGA20ox
11.4.3 BnaGA3ox
11.4.4 BnaGA2ox
11.4.5 GA signaling genes
11.5 Expression of Homeologous Genes
11.6 General Discussion
Acknowledgments
References
12 Plant-Pathogen Interaction: New Era of Plant-Pathogen Interaction Studies: “Omics” Perspectives
12.1 Introduction
12.2 Overview of Plant Defense against Pathogens
12.3 Transcriptome of Plant and Pathogen Interactions: Providing a Global Understanding of the Host-
12.4 Proteomics and Plant-Pathogen Interactome: Network Analysis
12.5 NLRome Provides a Comprehensive Way to Study NLRs
12.6 NLR and Avr Interaction Could Be Divided into Three Patterns
12.7 NLRs Function in Singleton, Pair, or Network
12.8 Pan-NLRome Reveals Diversity of NLRs
12.9 Concluding Remarks
Acknowledgments
References
13 Plant GWAS
13.1 Introduction
13.2 Core Processes in GWAS
13.2.1 Associating genotypic variations with phenotypic variations
13.2.2 Preparing GWAS populations
13.2.3 Checking phenotype data
13.2.4 Mixed linear model
13.2.5 Analyzing statistical significance
13.2.6 GWAS software
13.3 Graphical Representation of GWAS Results
13.3.1 Manhattan plot
13.3.2 Quantile-quantile (QQ) plot
13.4 Case Studies
13.4.1 Arabidopsis
13.4.2 Rice
13.5 Problems with GWAS
13.5.1 Functional validation of GWAS results
13.5.2 Spurious association, rare alleles
13.6 Conclusion and Prospects
References
14 Plant Genomic Selection: a Concept That Uses Genomics Data in Plant Breeding
14.1 Introduction
14.2 Core Processes in GS
14.2.1 Preparation of training data
14.2.2 Construction of GS model
14.3 Implementation of GS in Practical Plant Breeding
14.4 Advanced Topics in GS
14.4.1 GS model incorporating G × E effects
14.4.2 DNA marker selection for GS model construction
14.4.3 Combination with other omics
14.5 Concluding Remarks
References
15 Plant Genome Editing
15.1 Introduction
15.2 Genome Editing Using CRISPR-Cas9 in Plants: an Overview
15.3 Genome Manipulation Using a CRISPR-dCas9-based System Without DSB Induction
15.4 Engineered Cas9 and Newly Discovered Cas Proteins for Plant Genome Editing
15.5 Prime Editing
15.6 Conclusions
References
16 Introduction of Deep Learning Approaches in Plant Omics Research
16.1 Introduction
16.2 Supervised Learning
16.2.1 Classification task: CNN
16.2.2 Regression task: RNN, LSTM
16.3 Unsupervised Learning
16.3.1 Generation task: GAN
16.3.2 Dimensionality reduction task: AE, word2vec
16.4 Deep Reinforcement Learning: DQN
16.5 Other Deep Learning Techniques: GNN, Transformer, AutoML
16.5.1 Deep learning for graphs: GNN
16.5.2 Natural language processing: transformer
16.5.3 Automatic machine learning: AutoML
16.6 Summary
References
17 Deep Learning on Images and Genetic Sequences in Plants: Classifications and Regressions
17.1 Introduction
17.2 Deep Learning for Plant Images
17.2.1 Deep learning for taxonomic classification of plant images
17.2.2 Deep learning for stress/disease diagnosis based on plant images
17.2.3 Deep learning for non-invasive prediction of plant images
17.2.4 Deep learning for regression and quantification of plant images
17.2.5 Deep learning for automated sorting of plant images
17.3 Deep Learning for DNA Sequences
17.4 Deep Learning for Amino Acid Sequences: Prediction of Protein Folding
17.5 CNN Guides for Beginners: Tips and Precautions in Practice
17.5.1 Installing libraries and preparing data for application of a CNN
17.5.2 Evaluation of CNN model performance
17.5.3 Interpretability and explainability of CNN models
17.6 Future Perspectives
Acknowledgments
References
18 Deep Learning in Plant Omics: Object Detection and Image Segmentation
18.1 Introduction
18.2 Object Detection and Image Segmentation in Plant Phenomics
18.2.1 Object detection and its applications
18.2.2 Image segmentation and its applications
18.3 Current Challenges of Object Detection and Image Segmentation for Plant Phenomics
18.3.1 Data annotation cost
18.3.2 Generalization capability of current deep learning models
18.4 Conclusion and Future Perspective
References
19 Plant Experimental Resources
19.1 Introduction
19.2 Overview of Arabidopsis Resources
19.2.1 Arabidopsis seed resources for omics analysis
19.2.2 Arabidopsis DNA resources
19.3 Overview of Experimental Plant Resources for Crop Research
19.3.1 Rice resources
19.3.2 Wheat resources
19.3.3 Tomato resources
19.3.4 Legume resources
19.4 Conclusion and Perspective
References
20 Plant Omics Databases: an Online Resource Guide
20.1 Introduction
20.2 Arabidopsis Omics Databases
20.2.1 Arabidopsis genome databases
20.2.2 Arabidopsis epigenome databases
20.2.3 Arabidopsis transcriptome databases
20.2.4 Arabidopsis proteome databases
20.3 Omics Databases for Crop Plants
20.3.1 Rice (Oryza sativa L.)
20.3.2 Wheat (Triticum aestivum L.)
20.3.3 Maize (Zea mays)
20.3.4 Soybean (Glycine max)
20.3.5 Tomato (Solanum lycopersicum)
20.3.6 Pepper (Capsicum annuum)
20.4 Databases for Bryophytes
20.5 Databases for Other Plant Species
20.6 Portals for Plant Omics Databases
20.6.1 Bio-Analytic Resource for Plant Biology (BAR)
20.6.2 Gramene
20.6.3 Phytozome
20.7 Future Perspectives
References
Index


Plant Omics: Advances in Big Data Biology
1789247519, 9781789247510

Author / Uploaded
Hajime Ohyanagi
Kentaro Yano
Eiji Yamamoto
Ai Kitazumi




CABI BIOTECHNOLOGY SERIES

Plant Omics: Advances in Big Data Biology
Edited by Hajime Ohyanagi, Eiji Yamamoto, Ai Kitazumi and Kentaro Yano


CABI BIOTECHNOLOGY SERIES

Biotechnology, in particular the use of transgenic organisms, has a wide range of applications including agriculture, forestry, food and health. There is evidence that it could make a major impact in producing plants and animals that are able to resist stresses and diseases, thereby increasing food security. There is also potential to produce pharmaceuticals in plants through biotechnology, and provide foods that are nutritionally enhanced. Genetically modified organisms can also be used in cleaning up pollution and contamination. However, the application of biotechnology has raised concerns about biosafety, and it is vital to ensure that genetically modified organisms do not pose new risks to the environment or health. To understand the full potential of biotechnology and the issues that relate to it, scientists need access to information that not only provides an overview of and background to the field, but also keeps them up to date with the latest research findings.

This series, which extends the scope of CABI’s successful “Biotechnology in Agriculture” series, addresses all topics relating to biotechnology including transgenic organisms, molecular analysis techniques, molecular pharming, in vitro culture, public opinion, economics, development and biosafety. Aimed at researchers, upper-level students and policy makers, titles in the series provide international coverage of topics related to biotechnology, including both a synthesis of facts and discussions of future research perspectives and possible solutions.

Titles Available

1. Animal Nutrition with Transgenic Plants. Edited by G. Flachowsky
2. Plant-derived Pharmaceuticals: Principles and Applications for Developing Countries. Edited by K.L. Hefferon
3. Transgenic Insects: Techniques and Applications. Edited by M.Q. Benedict
4. Bt Resistance: Characterization and Strategies for GM Crops Producing Bacillus thuringiensis Toxins. Edited by Mario Soberón, Yulin Gao and Alejandra Bravo
5. Plant Gene Silencing: Mechanisms and Applications. Edited by Tamas Dalmay
6. Ethical Tensions from New Technology: The Case of Agricultural Biotechnology. Edited by Harvey James
7. GM Food Systems and their Economic Impact. Tatjana Brankov and Koviljko Lovre
8. Endophyte Biotechnology: Potential for Agriculture and Pharmacology. Edited by Alexander Schouten
9. Forest Genomics and Biotechnology. Richard Meilan and Matias Kirst
10. Transgenic Insects: Techniques and Applications, 2nd Edition. Edited by M.Q. Benedict and Maxwell J. Scott
11. Plant Omics: Advances in Big Data Biology. Edited by Hajime Ohyanagi, Eiji Yamamoto, Ai Kitazumi and Kentaro Yano
12. Next-generation Sequencing and Agriculture. Edited by Phillip Bayer and Dave Edwards
13. Aquaculture and Fisheries Biotechnology: Genetic Approaches. Rex A Dunham


CABI is a trading name of CAB International CABI Nosworthy Way Wallingford Oxfordshire OX10 8DE UK Tel: +44 (0)1491 832111 E-mail: [email protected] Website: www.cabi.org

CABI 200 Portland Street Boston MA 02114 USA T: +1 (617)682-9015 E-mail: [email protected]

© CAB International 2023. All rights reserved. No part of this publication may be reproduced in any form or by any means, electronically, mechanically, by photocopying, recording or otherwise, without the prior permission of the copyright owners.

The views expressed in this publication are those of the author(s) and do not necessarily represent those of, and should not be attributed to, CAB International (CABI). Any images, figures and tables not otherwise attributed are the author(s)’ own. References to internet websites (URLs) were accurate at the time of writing.

CAB International and, where different, the copyright owner shall not be liable for technical or other errors or omissions contained herein. The information is supplied without obligation and on the understanding that any person who acts upon it, or otherwise changes their position in reliance thereon, does so entirely at their own risk. Information supplied is neither intended nor implied to be a substitute for professional advice. The reader/user accepts all risks and responsibility for losses, damages, costs and other consequences resulting directly or indirectly from using this information. CABI’s Terms and Conditions, including its full disclaimer, may be found at https://www.cabi.org/terms-and-conditions/.

A catalogue record for this book is available from the British Library, London, UK.

ISBN-13: 9781789247510 (hardback)
9781789247527 (ePDF)
9781789247534 (ePub)

DOI: 10.1079/9781789247534.0000

Commissioning Editor: David Hemming
Editorial Assistant: Emma McCann
Production Editor: Marta Patiño

Typeset by Exeter Premedia Services Pvt Ltd, Chennai, India
Printed and bound in the UK by Severn, Gloucester

Contents

Contributors
Preface

PART I: Baseline Knowledge

1. Plant Genomics
Masaru Bamba, Kenta Shirasawa, Sachiko Isobe, Nadia Kamal, Klaus Mayer and Shusei Sato
1.1. Introduction
1.2. Advanced Technologies in Plant Genomics
1.3. Status of Fabaceae Genomics
1.4. Status of Poaceae Genomics
1.5. Conclusion

2. Plant Transcriptomics: Data-driven Global Approach to Understand Cellular Processes and Their Regulation in Model and Non-Model Plants
Ai Kitazumi, Isaiah C.M. Pabuayon, Kevin R. Cushman, Kentaro Yano and Benildo G. de los Reyes
2.1. Introduction
2.2. Overview of RNA-Seq-Based Transcriptome Profiling
2.2.1. Phase-IA: Sampling time-point, replication, and depth of coverage
2.2.2. Phase-IB: Single or paired-end sequence reads – platform and error rate
2.2.3. Phase-II: Factors in processing of sequence reads and their limitations
2.2.4. Phase-III: Choosing the reference and mapping in model plant species
2.2.5. Phase-III: De novo or hybrid assembly for non-model species
2.2.6. Phase-III: Choice of aligner
2.2.7. Phase-IV: Detection of differentially expressed transcripts and their gene loci
2.3. Conclusions and Perspectives

3. Plant Proteomics
Setsuko Komatsu and Ghazala Mustafa
3.1. Introduction
3.2. Proteomic Technology in Plant Science
3.3. Plant-subcellular Proteomics
3.3.1. Importance of plant-subcellular proteomics
3.3.2. Subcellular proteomics: understanding mechanism in soybean under flooding stress
3.4. Plant Proteomics of Post-translational Modifications
3.4.1. Importance of post-translational modifications in plants
3.4.2. Post-translational modifications: understanding mechanism in soybean under flooding stress
3.5. Plant Proteomics: Understanding Environmental Stress Responses
3.5.1. Plant proteomics: understanding interaction between plant and biotic stress
3.5.2. Plant proteomics: understanding signaling mechanism under abiotic stresses
3.6. Future Perspective

4. Plant Metabolomics: The Great Potential of Plant Metabolomics in Big Data Biology
Miyako Kusano and Atsushi Fukushima
4.1. Introduction
4.2. Analytical Targets and Techniques
4.2.1. Analytical targets in plant metabolomics
4.2.2. Analytical methods for plant metabolomics
4.2.3. Metabolite identification/annotation in metabolomics data
4.3. The Importance of Sharing Metabolomics Data
4.3.1. Metabolome data repositories
4.3.2. Towards reproducible metabolome data analysis
4.3.3. Future metabolomics data analysis enhancing new biological discoveries
4.4. Conclusions and Outlook

5. Plant Phenomics
Wei Guo and Jiangsan Zhao
5.1. Introduction to Plant Phenomics
5.2. Basic Technologies for Plant Phenotyping
5.3. Indoor Phenotyping
5.3.1. Indoor phenotyping platforms
5.3.2. Limitations of the current indoor phenotyping platforms
5.4. Field Phenotyping
5.4.1. Field phenotyping platforms
5.4.2. Limitations of the current field phenotyping platforms
5.5. Conclusion and Future Perspectives

6. Plant Non-coding Transcriptomics: Overview of lncRNAs in Abiotic Stress Responses
Akihiro Matsui and Motoaki Seki
6.1. Introduction
6.2. History of ncRNA Research
6.3. Classification of ncRNAs
6.4. Molecular Functions of ncRNA
6.4.1. MicroRNAs
6.4.2. Trans-acting siRNAs (ta-siRNAs) and phased siRNAs (pha-siRNAs)
6.4.3. Pol IV- and Pol V-derived lncRNAs and siRNAs
6.4.4. RNA interfering events induced by cis-natural antisense RNAs (cis-NATs)
6.4.5. Cis-NATs enhance mRNA translation
6.4.6. Cis-NATs derived from RNA degradation
6.4.7. lncRNAs COLDAIR, COOLAIR, and COLDWRAP that regulate chromatin modification at the FLC locus
6.4.8. ENOD40 and ASCO, mRNA-like long intergenic ncRNAs that regulate alternative splicing events by interacting with RNA-binding protein
6.4.9. APOLO and HID1, long intergenic ncRNAs forming RNA–DNA hybrids that repress gene expression
6.4.10. ceRNA/RNA Soggy/RNA decoy mimic miRNA targets
6.4.11. Circular RNA
6.4.12. RNA polymerase III-derived lncRNAs
6.4.13. Viroids: sub-viral plant-pathogenic lncRNAs
6.5. Concluding Remarks

7. Plant Epigenomics
Taiko Kim To and Jong-Myong Kim
7.1. Significance of Histone Modifications
7.1.1. Histone proteins in plants
7.1.2. Functions of conservative modification sites in canonical histone proteins
7.1.3. The genome-wide distribution and responsiveness of major histone modifications
7.1.4. Histones and histone modifications in the construction of genomes and chromosome structures
7.2. DNA Methylation
7.2.1. DNA methylation in plants
7.2.2. DNA methylation mechanism in A. thaliana
7.2.3. Genome-wide DNA methylation patterns in plant genomes
7.2.4. Methods to investigate global DNA methylation patterns

8. Plant Organellar Omics
Masatake Kanai, Kentaro Tamura, Katarzyna Tarnawska-Glatt, Shino Goto-Yamada, Kenji Yamada and Shoji Mano
8.1. Introduction
8.2. Nucleus
8.3. Endoplasmic Reticulum
8.4. Golgi Apparatus
8.5. Vacuole
8.6. Peroxisome
8.7. Oil body
8.8. Plastid
8.9. Mitochondrion
8.10. Databases for Images/Movies of Organelle Dynamics
8.11. Conclusions

PART II: Advanced Topics

9. Plant Cis-elements and Transcription Factors
Chi-Nga Chow, Kuan-Chieh Tseng and Wen-Chi Chang
9.1. Introduction
9.2. Methods to Infer TF–DNA Interactions
9.2.1. Wet-lab approaches
9.2.2. Dry-lab approaches
9.3. Related Databases for TFs and Cis-elements
9.3.1. TF-related databases
9.3.2. Cis-element-related databases
9.4. Advanced Analysis in GRNs
9.5. Prospective View on Studies of Gene Regulation

10. Plant Gene Expression Network
Miyu Asari, Ai Kitazumi, Eiji Nambara, Benildo G. de los Reyes and Kentaro Yano
10.1. Introduction
10.2. Visualization of Relationships of Genes by GENs: Nodes and Edges
10.3. Types of Relationships in GENs
10.4. Similarity and/or Reciprocity in Gene Expression Profiles
10.4.1. PCC
10.4.2. DCA
10.5. Common Regulatory Mechanisms in Gene Expressions
10.6. Sequence Similarities in mRNAs
10.7. Similarities in the Biological Functions of Expressed Genes
10.7.1. GENs with computational annotations of genes
10.7.2. GENs containing knowledge-based information and ontology for biological functions
10.7.3. GENs with metabolic pathway information
10.8. Network Construction Tools with Multiple Types of Information about Genes
10.9. Knowledge bases for RNA-Seq Data, Expression Data and GENs

11. Plant Hormones: Gene Family Organization and Homolog Interactions of Genes for Gibberellin Metabolism and Signaling in Allotetraploid Brassica napus
Eiji Nambara, Dawei Yan, Jing Wen, Arjun Sharma, Frederik Nguyen, Ange Yan, Karin Uruma and Kentaro Yano
11.1. Plant Hormones and Height Control
11.2. Brassica napus
11.3. GAs
11.3.1. GA metabolism
11.3.2. GA signaling
11.3.3. GA-auxotroph and response mutants
11.4. GA Metabolism and Signaling Genes in B. napus: Gene Family Diversity and Gene Expression
11.4.1. Early GA biosynthesis (synthesis of GA12)
11.4.2. BnaGA20ox
11.4.3. BnaGA3ox
11.4.4. BnaGA2ox
11.4.5. GA signaling genes
11.5. Expression of Homeologous Genes
11.6. General Discussion

12. Plant–Pathogen Interaction: New Era of Plant–Pathogen Interaction Studies: “Omics” Perspectives
Shu’an Zheng and Ryohei Terauchi
12.1. Introduction
12.2. Overview of Plant Defense against Pathogens
12.3. Transcriptome of Plant and Pathogen Interactions: Providing a Global Understanding of the Host–Pathogen Interplay
12.4. Proteomics and Plant–Pathogen Interactome: Network Analysis
12.5. NLRome Provides a Comprehensive Way to Study NLRs
12.6. NLR and Avr Interaction Could Be Divided into Three Patterns
12.7. NLRs Function in Singleton, Pair, or Network
12.8. Pan-NLRome Reveals Diversity of NLRs
12.9. Concluding Remarks

13. Plant GWAS
Matthew Shenton
13.1. Introduction
13.2. Core Processes in GWAS
13.2.1. Associating genotypic variations with phenotypic variations
13.2.2. Preparing GWAS populations
13.2.3. Checking phenotype data
13.2.4. Mixed linear model
13.2.5. Analyzing statistical significance
13.2.6. GWAS software
13.3. Graphical Representation of GWAS Results
13.3.1. Manhattan plot
13.3.2. Quantile–quantile (QQ) plot
13.4. Case Studies
13.4.1. Arabidopsis
13.4.2. Rice
13.5. Problems with GWAS
13.5.1. Functional validation of GWAS results
13.5.2. Spurious association, rare alleles
13.6. Conclusion and Prospects

14. Plant Genomic Selection: a Concept That Uses Genomics Data in Plant Breeding
Eiji Yamamoto
14.1. Introduction
14.2. Core Processes in GS
14.2.1. Preparation of training data
14.2.2. Construction of GS model
14.3. Implementation of GS in Practical Plant Breeding
14.4. Advanced Topics in GS
14.4.1. GS model incorporating G × E effects
14.4.2. DNA marker selection for GS model construction
14.4.3. Combination with other omics
14.5. Concluding Remarks

15. Plant Genome Editing
Naoki Wada, Yuriko Osakabe and Keishi Osakabe
15.1. Introduction
15.2. Genome Editing Using CRISPR-Cas9 in Plants: an Overview
15.3. Genome Manipulation Using a CRISPR-dCas9-based System Without DSB Induction
15.4. Engineered Cas9 and Newly Discovered Cas Proteins for Plant Genome Editing
15.5. Prime Editing
15.6. Conclusions

16. Introduction of Deep Learning Approaches in Plant Omics Research
Eli Kaminuma
16.1. Introduction
16.2. Supervised Learning
16.2.1. Classification task: CNN
16.2.2. Regression task: RNN, LSTM
16.3. Unsupervised Learning
16.3.1. Generation task: GAN
16.3.2. Dimensionality reduction task: AE, Word2vec
16.4. Deep Reinforcement Learning: DQN
16.5. Other Deep Learning Techniques: GNN, Transformer, AutoML
16.5.1. Deep learning for graphs: GNN
16.5.2. Natural language processing: Transformer
16.5.3. Automatic machine learning: AutoML
16.6. Summary

17. Deep Learning on Images and Genetic Sequences in Plants: Classifications and Regressions
Kanae Masuda and Takashi Akagi
17.1. Introduction
17.2. Deep Learning for Plant Images
17.2.1. Deep learning for taxonomic classification of plant images
17.2.2. Deep learning for stress/disease diagnosis based on plant images
17.2.3. Deep learning for non-invasive prediction of plant images
17.2.4. Deep learning for regression and quantification of plant images
17.2.5. Deep learning for automated sorting of plant images
17.3. Deep Learning for DNA Sequences
17.4. Deep Learning for Amino Acid Sequences: Prediction of Protein Folding
17.5. CNN Guides for Beginners: Tips and Precautions in Practice
17.5.1. Installing libraries and preparing data for application of a CNN
17.5.2. Evaluation of CNN model performance
17.5.3. Interpretability and explainability of CNN models
17.6. Future Perspectives

18. Deep Learning in Plant Omics: Object Detection and Image Segmentation
Wei Guo and Akshay L. Chandra
18.1. Introduction
18.2. Object Detection and Image Segmentation in Plant Phenomics
18.2.1. Object detection and its applications
18.2.2. Image segmentation and its applications
18.3. Current Challenges of Object Detection and Image Segmentation for Plant Phenomics
18.3.1. Data annotation cost
18.3.2. Generalization capability of current deep learning models
18.4. Conclusion and Future Perspective

PART III: Resources

19. Plant Experimental Resources
Masatomo Kobayashi
19.1. Introduction
19.2. Overview of Arabidopsis Resources
19.2.1. Arabidopsis seed resources for omics analysis
19.2.2. Arabidopsis DNA resources
19.3. Overview of Experimental Plant Resources for Crop Research
19.3.1. Rice resources
19.3.2. Wheat resources
19.3.3. Tomato resources
19.3.4. Legume resources
19.4. Conclusion and Perspective

20. Plant Omics Databases: an Online Resource Guide
Feng Li, Yingtian Deng, Eiji Yamamoto and Zhenya Liu
20.1. Introduction
20.2. Arabidopsis Omics Databases
20.2.1. Arabidopsis genome databases
20.2.2. Arabidopsis epigenome databases
20.2.3. Arabidopsis transcriptome databases
20.2.4. Arabidopsis proteome databases
20.3. Omics Databases for Crop Plants
20.3.1. Rice (Oryza sativa L.)
20.3.2. Wheat (Triticum aestivum L.)
20.3.3. Maize (Zea mays)
20.3.4. Soybean (Glycine max)
20.3.5. Tomato (Solanum lycopersicum)
20.3.6. Pepper (Capsicum annuum)
20.4. Databases for Bryophytes
20.5. Databases for Other Plant Species
20.6. Portals for Plant Omics Databases
20.6.1. Bio-Analytic Resource for Plant Biology (BAR)
20.6.2. Gramene
20.6.3. Phytozome
20.7. Future Perspectives

Index

Supplementary materials for Chapter 11 can be accessed via the QR code below.

Contributors

Takashi Akagi, Graduate School of Environmental and Life Science, Okayama University, Okayama, Japan. [email protected]
Miyu Asari, School of Agriculture, Meiji University, Kawasaki, Japan.
Masaru Bamba, Graduate School of Life Sciences, Tohoku University, Sendai, Japan.
Akshay L. Chandra, Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, Kandi, India.
Wen-Chi Chang, College of Biosciences and Biotechnology, Institute of Tropical Plant Sciences and Microbiology, National Cheng Kung University, Tainan, Taiwan; Department of Life Sciences, National Cheng Kung University, Tainan, Taiwan. [email protected]
Chi-Nga Chow, College of Biosciences and Biotechnology, Institute of Tropical Plant Sciences and Microbiology, National Cheng Kung University, Tainan, Taiwan.
Kevin R. Cushman, Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA.
Benildo G. de los Reyes, Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA. [email protected]
Yingtian Deng, Key Laboratory of Horticultural Plant Biology, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei, China.
Atsushi Fukushima, RIKEN Center for Sustainable Resource Science, Yokohama, Japan; Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Kyoto, Japan.
Shino Goto-Yamada, Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
Wei Guo, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan. [email protected]
Sachiko Isobe, Laboratory of Plant Genetics and Genomics, Kazusa DNA Research Institute, Kisarazu, Japan.


Nadia Kamal, Plant Genome and Systems Biology, Helmholtz Zentrum München, Munich, Germany.
Eli Kaminuma, Graduate School of Design and Architecture, Nagoya City University, Nagoya, Japan. [email protected]
Masatake Kanai, Laboratory of Organelle Regulation, National Institute for Basic Biology, Okazaki, Aichi, Japan.
Jong-Myong Kim, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan; Ac-Planta Inc., Tokyo, Japan. [email protected]
Taiko Kim To, Department of Biological Sciences, The University of Tokyo, Tokyo, Japan.
Ai Kitazumi, Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA. [email protected]
Masatomo Kobayashi, RIKEN BioResource Research Center, Tsukuba, Japan. [email protected]
Setsuko Komatsu, Faculty of Environment and Information Sciences, Fukui University of Technology, Fukui, Japan. [email protected]
Miyako Kusano, Faculty of Life and Environmental Science, University of Tsukuba, Tsukuba, Japan; Tsukuba-Plant Innovation Research Center, University of Tsukuba, Tsukuba, Japan; RIKEN Center for Sustainable Resource Science, Yokohama, Japan. [email protected]
Feng Li, Key Laboratory of Horticultural Plant Biology, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei, China. [email protected]
Zhenya Liu, Key Laboratory of Horticultural Plant Biology, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei, China.
Shoji Mano, Laboratory of Organelle Regulation, National Institute for Basic Biology, Okazaki, Japan; Department of Basic Biology, SOKENDAI (The Graduate University for Advanced Studies), Okazaki, Japan. [email protected]
Kanae Masuda, Graduate School of Environmental and Life Science, Okayama University, Okayama, Japan.
Akihiro Matsui, Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Japan; Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, Wako, Japan. [email protected]
Klaus Mayer, Plant Genome and Systems Biology, Helmholtz Zentrum München, Munich, Germany.
Ghazala Mustafa, Department of Plant Sciences, Quaid-I-Azam University, Islamabad, Pakistan.
Eiji Nambara, Department of Cell and Systems Biology, University of Toronto, Ontario, Canada. eiji. [email protected]
Frederik Nguyen, Department of Cell and Systems Biology, University of Toronto, Ontario, Canada.
Hajime Ohyanagi, JCRAC Data Center, National Center for Global Health and Medicine, Tokyo, Japan. [email protected]
Keishi Osakabe, Tokushima University, Tokushima, Japan. [email protected]
Yuriko Osakabe, Tokyo Institute of Technology, Yokohama, Japan.


Isaiah C.M. Pabuayon, Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA.
Shusei Sato, Graduate School of Life Sciences, Tohoku University, Sendai, Japan. shuseis@ige.tohoku.ac.jp
Motoaki Seki, Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Japan; Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, Wako, Japan; Kihara Institute for Biological Research, Yokohama City University, Yokohama, Japan. [email protected]
Arjun Sharma, Department of Cell and Systems Biology, University of Toronto, Ontario, Canada.
Matthew Shenton, NARO Institute of Crop Science, Tsukuba, Japan. [email protected]
Kenta Shirasawa, Laboratory of Plant Genetics and Genomics, Kazusa DNA Research Institute, Kisarazu, Japan.
Kentaro Tamura, Department of Environmental and Life Sciences, School of Food and Nutritional Sciences, University of Shizuoka, Shizuoka, Japan.
Katarzyna Tarnawska-Glatt, Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
Ryohei Terauchi, Laboratory of Crop Evolution, Kyoto University, Kyoto, Japan; Iwate Biotechnology Research Center, Iwate, Japan. [email protected]
Kuan-Chieh Tseng, Department of Life Sciences, National Cheng Kung University, Tainan, Taiwan.
Karin Uruma, Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Japan.
Naoki Wada, Tokushima University, Tokushima, Japan.
Jing Wen, National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, China.
Kenji Yamada, Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
Eiji Yamamoto, Graduate School of Agriculture, Meiji University, Kawasaki, Japan. yame@meiji.ac.jp
Ange Yan, Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Japan.
Dawei Yan, Department of Plant Sciences, University of California, Davis, USA.
Kentaro Yano, School of Agriculture, Meiji University, Kawasaki, Japan. [email protected]
Jiangsan Zhao, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan.
Shu’an Zheng, Laboratory of Crop Evolution, Kyoto University, Kyoto, Japan.

Preface

I gave her one, they gave him two,
You gave us three or more;
They all returned from him to you,
Though they were mine before.

from Alice’s Adventures in Wonderland, Lewis Carroll (1865)

***

The concept and term genome (gene suffixed by -ome, which refers to a totality or complete set) have been appreciated and exploited for more than a century. Building on this idea, the further elucidation of biological systems has revealed the extremely complex, stochastic, yet resilient and well-orchestrated nature of biology and has given names to the branches of omics, such as transcriptome, proteome, metabolome, phenome, and so forth, instead of confining the exploration to one gene at a time. As the suffix -ome suggests, each omics is inherently a big data biology whose ultimate goal is to integrate the myriad data into one, as in the above quote from the chapter “Alice’s Evidence” in Alice’s Adventures in Wonderland. For a long time, addressing the totality of biology was no more than a half-fledged hope, but advances in the technology of molecular biology have given wings to the pursuit of such objectives. Among the kingdoms of life, Plantae is essential to humankind and has provided model organisms from early genetics to modern basic science. The goal of this book is to provide baseline knowledge to students as a guide to omics and to present recent advances in selected topics with a focus on plant omics. This book, Plant Omics: Advances in Big Data Biology, has three sections, corresponding to baseline knowledge, advanced topics, and resources. The baseline section covers plant genomics (Chapter 1), transcriptomics (Chapter 2), proteomics (Chapter 3), metabolomics (Chapter 4), phenomics (Chapter 5), non-coding transcriptomics (Chapter 6), epigenomics (Chapter 7), and organellar omics (Chapter 8).
In the later chapters, advanced topics such as plant cis-elements and transcription factors (Chapter 9), gene expression networks (Chapter 10), hormones (Chapter 11), plant–pathogen interactions (Chapter 12), GWAS (Genome-Wide Association Studies) (Chapter 13), genomic selection (Chapter 14), genome editing (Chapter 15), and deep learning (Chapters 16, 17, and 18) are dissected by cutting-edge plant scientists. In the last two chapters, valuable archives for plant experimental resources (Chapter 19) and online omics databases (Chapter 20) are summarized by resource specialists.

As the editors, we would like to express our sincere gratitude to all the authors for their great contributions to this book. We hope that this book will serve as a guide for students and be an inspiring read for researchers from various fields. We thank Alison Smith, David Hemming, Ali Thompson, Emma McCann, and Marta Patiño of CABI for their continuous guidance and encouragement during all the stages of this project.

Hajime Ohyanagi Eiji Yamamoto Ai Kitazumi Kentaro Yano

1 Plant Genomics

Masaru Bamba¹, Kenta Shirasawa², Sachiko Isobe², Nadia Kamal³, Klaus Mayer³ and Shusei Sato¹*
¹Graduate School of Life Sciences, Tohoku University, Japan; ²Laboratory of Plant Genetics and Genomics, Kazusa DNA Research Institute, Japan; ³Plant Genome and Systems Biology, Helmholtz Zentrum München, Munich, Germany

Abstract

In this post-genomic era, we have easy access to the complete genetic information of living organisms, and that information has become essential for biological research. The flourishing of genomics has resulted from the progress of DNA sequencing technologies, the development of computational analysis environments, and the establishment of biological resources. Plant genomics is one of the research fields that has benefited most from these technical advances. This chapter presents the evolution and transition of DNA sequencing technologies and gives concrete examples of the expansion of plant genomic research.

1.1 Introduction

As was the case in other biological lineages, such as Saccharomyces cerevisiae in eukaryotes and Caenorhabditis elegans in multicellular organisms, plant genome analysis started with a single general model, Arabidopsis thaliana, which was sequenced by a large multinational consortium using the Sanger sequencing method (Arabidopsis Genome Initiative, 2000). The resulting Arabidopsis genome information has served as a solid infrastructure for the plant research community. Along with the progress of DNA sequencing technologies, the target of plant genome analysis shifted toward a wide range of plant species (Fig. 1.1), and a variety of plant genome information has been applied as the basis for integrating multiple biological omics data (Rai et al., 2017). The development of long-read sequencing technologies has made it feasible to sequence not only a single representative genome but also multiple accessions within the same species (Golicz et al., 2020). In this chapter, we attempt to shed light on the status of plant genomics by describing the latest genome sequencing technologies and giving concrete examples of genome/pan-genome analysis in two plant taxa (Fabaceae and Poaceae).

*Corresponding author: shuseis@ige.tohoku.ac.jp
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0001

Fig. 1.1. Sequenced plant genomes: important and milestone species in plant genomics. Plants whose whole genomes have been sequenced and that were chosen as important/milestone species. Phylogenetic relationships (A) among green plants, except for seed plants (based on Wickett et al., 2014); and (B) among seed plants (based on Angiosperm Phylogeny Group (APG) IV). Taxonomic characteristics are shown on the branches. The common name or cultivar follows the scientific name.

1.2 Advanced Technologies in Plant Genomics

Genome sequencing technology has advanced dramatically in the 15 years since the appearance of next-generation sequencing (NGS) technologies. The first phase of NGS technology was the development of massively parallel sequencing platforms with read lengths of approximately 50–300 bp, the so-called second-generation sequencing technology, in contrast to the first-generation dideoxy chain termination method (Sanger method). This was followed by the third and fourth generations, in which single DNA molecules are sequenced without amplification, with median read lengths of approximately 10–20 kbp and some reads longer than 50 kbp. Because NGS platforms have turned over rapidly in the past decade, the platforms used at the beginning of the NGS era, such as Roche 454 and ABI SOLiD, have already become obsolete. The Illumina HiSeq, a representative short-read sequencing platform of the 2010s, was also discontinued recently. Despite frequent platform updates, all of the fundamental NGS technologies are considered to be already in place, since no new sequencing concept has been introduced during the past several years.

The current NGS platforms can roughly be classified into the following four categories: (i) bench-top short-read sequencing (e.g., Illumina MiSeq, Thermo Fisher Ion Proton); (ii) large-scale short-read sequencing (e.g., Illumina NovaSeq, MGI DNB-Seq); (iii) accurate long-read sequencing (e.g., PacBio Sequel II); and (iv) ultra-long-read sequencing (e.g., Oxford Nanopore Technologies). Short-read sequencing platforms are frequently used to detect small variants (single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels)) by whole-genome shotgun sequencing or by reduced-representation methods such as GBS (Elshire et al., 2011), RAD-Seq (Baird et al., 2008), and GRAS-Di (Miki et al., 2020). Because of their massive data output, short-read sequencing platforms are also used for gene expression and protein–DNA interaction analyses through RNA sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq), respectively. Long-read sequencing platforms, on the other hand, are mainly applied to whole-genome assembly and structural variant identification at both the genome and transcriptome levels. Several other technologies also assist whole-genome assembly and structural variant identification, such as optical mapping (e.g., Bionano Saphyr, available at https://bionanogenomics.com/, accessed July 2022) and Hi-C libraries, which are based on a genome-wide chromatin conformation capture method (Lieberman-Aiden et al., 2009).

Until long-read sequencing platforms gained popularity, short reads were used for genome assembly (Giani et al., 2020). Because many plant genomes are highly repetitive, de novo assembly (building a genome from scratch without any reference genome information) of plant genomes from short-read sequences tended to be a time- and labor-consuming process. The problem arose because the true genomic origin of short reads derived from repetitive regions could not be identified. Plant genome sequencing projects were therefore carried out focusing on a single representative accession of the target species.
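The repeat problem described above can be illustrated with a toy example. The following is a minimal sketch, assuming exact string matching in place of a real aligner; the sequences, the `find_all` helper, and the read lengths are hypothetical and chosen only for illustration. A read that lies entirely inside a repeat matches every copy of that repeat, whereas a read long enough to reach into the unique flanking sequence has a single possible origin.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # step by 1 so overlapping hits are kept
    return positions


# Hypothetical toy sequences: two identical repeat copies separated by unique DNA.
REPEAT = "ACGTACGTAC"
GENOME = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG"

short_read = REPEAT[2:8]   # 6 bp, lies entirely inside the repeat
long_read = GENOME[3:18]   # 15 bp, spans the first repeat copy plus unique flanks

short_hits = find_all(GENOME, short_read)
long_hits = find_all(GENOME, long_read)

print("short read placements:", short_hits)  # more than one candidate origin
print("long read placements:", long_hits)    # a single, unambiguous origin
```

In real plant genomes the repeats are transposable elements that can be many kilobases long and present in thousands of copies, so the same ambiguity applies to any read shorter than the repeat unit.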
The availability of long-read sequencing technologies is expected to help overcome the difficulties of assembling repeat-rich regions. Oxford Nanopore Technologies sequencers produce ultra-long reads of > 100 kb in length, but the sequences are error-prone (Dumschott et al., 2020). Furthermore, ultra-long-read sequencing requires the extraction of intact high-molecular-weight DNA, which is still challenging in plants because of the presence of cell walls and various secondary metabolites (Dumschott et al., 2020). Thus, the fourth-generation sequencing platforms still have room for improvement in plant genome analysis. Among current practical approaches, HiFi reads, which are generated on the PacBio Sequel II by circular consensus sequencing (CCS), in which the same molecule is read multiple times, are considered suitable for plant genome sequencing because of their accuracy (which has improved from 90% to more than 99.9%) at read lengths of 10–20 kb (Hon et al., 2020). Because of this high accuracy, contigs constructed from HiFi reads do not require error correction after assembly. In addition, Hi-C and comparable methods, such as Omni-C, have contributed greatly to constructing proximity maps for generating chromosome-scale scaffolds, although it is recommended to confirm the results by comparison with optical and/or linkage maps (Udall and Dawe, 2018).

These technologies allow us to establish high-quality (reference-level) plant genomes and to compare them more easily and efficiently. Comparing many genomes allows us to estimate a plant’s historical trajectory with population genomics approaches and to infer which genetic polymorphisms are responsible for phenotypic variation (Bamba et al., 2019). Furthermore, comparing high-quality genomes removes the limitation of focusing only on differences in the core genome shared among all focal organisms. The progress of sequencing technologies is therefore bringing, and will continue to bring, plant genomics research into the era of pan-genome analysis.
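Why reading the same molecule several times raises accuracy can be seen with a back-of-the-envelope calculation. The sketch below is a deliberately simplified model, not the algorithm used by the PacBio software: it assumes that each pass calls a given base incorrectly with the same independent probability (10%, matching the 90% single-pass accuracy quoted above) and that the consensus is a plain majority vote. The function name `consensus_error` is hypothetical.

```python
from math import comb


def consensus_error(per_pass_error: float, passes: int) -> float:
    """Probability that a majority vote over `passes` independent reads is wrong.

    Simplifying assumption: the consensus base is wrong whenever more than
    half of the passes are wrong (an odd number of passes avoids ties).
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("passes must be a positive odd number")
    e = per_pass_error
    majority = passes // 2 + 1
    # Binomial tail: sum over all outcomes with at least `majority` wrong passes.
    return sum(
        comb(passes, k) * e**k * (1 - e) ** (passes - k)
        for k in range(majority, passes + 1)
    )


for n in (1, 3, 5, 9):
    err = consensus_error(0.10, n)
    print(f"{n} passes: consensus accuracy = {100 * (1 - err):.3f}%")
```

Under these assumptions, nine passes are already enough to push the per-base consensus error below 0.1%. Real single-molecule errors are dominated by insertions and deletions and are not fully independent, so the number of passes needed in practice differs from this idealized estimate.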

1.3 Status of Fabaceae Genomics

Fabaceae (Leguminosae) is the third-largest family of flowering plants, consisting of 751 genera and 19,500 species (Christenhusz and Byng, 2016). The economic value of Fabaceae for human consumption is second only to that of Gramineae (Poaceae), and its most significant trait from an ecological viewpoint is the biological nitrogen-fixing symbiosis with nodule bacteria called rhizobia (Bennett, 2011). Since nitrogen-fixing symbiosis can help reduce the use of chemical fertilizer for plant growth, leguminous plants are drawing attention for sustainable crop production, and significant efforts have therefore been made to establish genomic resources for them.

Among the Fabaceae, a draft genome of Lotus japonicus was published in 2008, ahead of other legumes (Sato et al., 2008), followed by the complete genomes of soybean (Glycine max) (Schmutz et al., 2010) and Medicago truncatula (Young et al., 2011). L. japonicus and M. truncatula are used as model legume species for nitrogen-fixing symbiosis, while soybean is used to study the molecular basis of seed protein and oil production. In addition to these three species, genome analyses have been carried out for 12 other leguminous species, including pigeon pea (Varshney et al., 2012), chickpea (Varshney et al., 2013), mung bean (Kang et al., 2014), common bean (Schmutz et al., 2014), adzuki bean (Kang et al., 2015), hyacinth bean (Chang et al., 2019), white lupin (Hufnagel et al., 2020), pea (Kreplak et al., 2019), bambara groundnut (Chang et al., 2019), cowpea (Lonardi et al., 2019), asparagus bean (Xia et al., 2019), and black lentil (Pootakham et al., 2020). The genomes of three further species, lima bean (Wisser et al., 2021), rice bean (Kaul et al., 2019), and cluster bean (Gaikwad et al., 2020), have been posted on preprint servers, and the lentil genome has not been published but is available as pre-released information (KnowPulse, available at https://knowpulse.usask.ca/, accessed July 2022). Among other leguminous plants, genomes of 30 species, including clover (Istvánek et al., 2014), lupin (Hane et al., 2017), and peanut (Bertioli et al., 2019), have been published.
In total, the genomes of 46 species belonging to 29 genera were available at the time of writing (December 2020), and it can be said that the genomes of all major commercial legumes have been revealed, except for faba bean (Vicia faba). Whole-genome sequencing of faba bean is challenging because of its large genome (around 13 Gbp); however, advanced sequencing platforms should make it achievable.

For four crop species (peanut, pigeon pea, soybean, and white lupin) and for M. truncatula, germplasm collections and pan-genomic data that can be used to detect structural variation are available. Soybean has the largest germplasm collection, consisting of over 50,000 lines (Liu et al., 2020), 2819 of which have been re-sequenced, and 23 genomes have been assembled to reference-level quality. Additionally, genome data for the wild relative of soybean (Glycine soja) have been established for over 100 accessions. For peanut, re-sequencing data for a large germplasm collection (over 10,000 lines) and five high-quality genome assemblies are available. Peanut is an allotetraploid species, and genome information on its putative progenitor species (Arachis duranensis, A. ipaensis, and A. monticola) is also available (Bertioli et al., 2016; Yin et al., 2018). In pigeon pea and white lupin, germplasm re-sequencing and pan-genomic data are available, although there is currently no information on wild relatives (Zhao et al., 2020; Hufnagel et al., 2021). In Medicago, re-sequencing data on germplasm of M. truncatula and M. sativa are available. The pan-genome of M. truncatula can be used to detect structural variation (Zhou et al., 2017), and re-sequencing-level pan-genomes of M. sativa are available (Shen et al., 2020). In addition, re-sequencing-level pan-genome information is available for six leguminous species: adzuki bean (Yang et al., 2015), common bean (Lobaton et al., 2018), pea (Kreplak et al., 2019), chickpea (Varshney et al., 2019), L. japonicus (Shah et al., 2020), and black lentil (Pootakham et al., 2020). In L. japonicus, pan-genome and germplasm data exist for 136 wild accessions, and these have been used to understand the adaptation history of the species in its natural environment (Shah et al., 2020). Chickpea, common bean, and pea have pan-genome information consisting of 429, 35, and 42 lines, respectively (Lobaton et al., 2018; Varshney et al., 2019; Kreplak et al., 2019).
For Vigna, pan-genomes comprising 49 genomes of adzuki bean and 89 genomes of black lentil are available (Yang et al., 2015; Pootakham et al., 2020); furthermore, the cowpea pan-genome project (CowpeaPan) is in progress.

1.4 Status of Poaceae Genomics

In the Poaceae, the rice genome (Oryza sativa subsp. japonica cv. Nipponbare) was determined ahead of all other monocots (International Rice Genome Sequencing Project, 2005). This genomic information has been used as a reference for sequencing other Poaceae crops, such as maize (Schnable et al., 2009), sorghum (Paterson et al., 2009), barley (International Barley Genome Sequencing Consortium, 2012), and wheat (International Wheat Genome Sequencing Consortium, IWGSC, 2018). Information on whole-genome variants among other rice cultivars (indica, Guangluai-3, Nongken-58, and Kasalath) (Sakai et al., 2014) and in the wild species Oryza rufipogon and O. longistaminata, which are candidates for the origin of cultivated rice, has also been published. In addition, the rice genome collection comprises more than 200 high-quality collections and more than 450 low-quality collections so far (Huang et al., 2012). The rice genome has therefore become an essential resource for Poaceae agriculture.

One of the most significant milestones in recent Poaceae genome research is the determination of the cereal crop genomes. In recent years there have been significant breakthroughs in sequencing technologies and in the ability to assemble even the largest and most complex cereal genomes, such as those of bread wheat and barley. For both species, reference-quality genome assemblies have been generated, in 2017 for barley (Mascher et al., 2017) and in 2018 for wheat (IWGSC, 2018), using novel computational strategies and genome assembly algorithms. High repeat content (> 80%), high transposon activity, large genome sizes (e.g., 17 Gb for bread wheat, about five times the size of the human genome), and polyploidy complicated the assembly of cereal genomes for a long time. Single reference genomes are an invaluable tool for better understanding cereal biology and unlocking the gene content as well as regulatory networks. To assess the genetic potential of natural variation in cereal crops, however, multi-genome comparisons become essential.
As a consequence, genome projects involving the generation and comparative analysis of multiple reference genome sequences for wheat and barley have started to arise. A pan-genome represents the entire set of genes within a species, and the major objectives of these pan-genome projects are to determine the core gene set, i.e., the set of genes shared by all lines, and the dispensable genes, i.e., genes shared by only some lines or found in a single line (singleton genes). Other main areas of study are structural variation, single nucleotide polymorphisms (SNPs), particular genes and quantitative trait loci (QTLs) involved in specific traits, copy number variations (CNVs), presence–absence variations (PAVs), and many more.

One recent pan-genome project working on a polyploid species with a large genome is the international 10+ wheat genome project, coordinated by Prof. Curtis Pozniak of the University of Saskatchewan. For this project, wheat lines from all around the world were chosen to ensure maximum genetic diversity and hence a pan-genome that is as complete as possible. The selected 10+ bread wheat cultivars were subsequently sequenced using Illumina short-read technology and assembled with NRGene’s DeNovoMagic (NRGene, Ness Ziona, Israel) algorithm, leading to high-quality chromosome-scale genome assemblies. Comparing these reference genomes revealed variation in gene content, which likely reflects the complex breeding history of the selected lines as well as adaptation to diverse environments throughout that history. Extensive efforts to improve grain yield and quality and to make plants more resistant to pests and diseases are also reflected in the genic space. By comparing the chromosomal structure of the reference sequences, a diversity of structural rearrangements and introgressions from wild relatives could be identified. This also highlights the importance of multiple reference genomes in high-quality genome projects, since they enable the investigation of chromosomal translocations, duplications, and deletions with high accuracy.

For the barley pan-genome project, 20 diverse barley lines were selected from 22,000 barley accessions that were previously hosted at IPK Gatersleben and have been genetically characterized (Milner et al., 2019; Jayakodi et al., 2020). The selected 20 lines represent the major barley germplasm and include eight cultivars, 11 landraces and one wild barley accession (Hordeum vulgare subsp. spontaneum).
High-quality reference genome assemblies were generated for the 20 accessions using either the TRITEX pipeline (Monat et al., 2019) or other short-read assembly algorithms. Comparative structural analysis of the 20 barley lines showed that the single-copy barley core genome present in all lines comprised 402.5 Mb and included almost the entire annotated gene space. On the other hand, PAV was found in a total of 235.9 Mb of single-copy sequence in the panel of 20 accessions, representing the variable component of the pan-genome. In the study by Jayakodi et al. (2020), a method based on chromosome conformation capture sequencing (Hi-C) (Himmelbach et al., 2018) was used to study large chromosomal inversions (> 1 Mb). The genomes of 70 accessions were analyzed, and 42 inversions ranging from 4 Mb to 141 Mb in size were identified. The majority of these inversions were located in the proximal regions of the chromosome arms, which are known for their low recombination rate.

In summary, the newly sequenced wheat and barley reference genomes provide an unprecedented basis for functional gene discovery and breeding to improve cereals. Subsequent project phases include the generation of de novo gene predictions for all assemblies based on extensive transcriptomic data. These data will be the basis for in-depth insights into the functional and regulatory organization of the wheat and barley pan-genomes.
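The core and dispensable gene sets that pan-genome projects aim to determine can be expressed with simple set operations once a presence–absence table of genes per line is available. The following is a minimal sketch under that assumption; the function `classify_pan_genome`, the line names, and the gene identifiers are hypothetical, and real projects must first decide which genes in different assemblies count as "the same" (for example by orthology clustering), which is not shown here.

```python
from collections import Counter


def classify_pan_genome(gene_sets: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split a pan-genome into core, partially shared, and singleton genes.

    `gene_sets` maps each line (accession) to the set of gene IDs present in it.
    """
    n_lines = len(gene_sets)
    if n_lines < 2:
        raise ValueError("at least two lines are needed to compare gene content")
    # Count in how many lines each gene is present.
    counts = Counter(gene for genes in gene_sets.values() for gene in genes)
    core = {g for g, c in counts.items() if c == n_lines}   # present in all lines
    singleton = {g for g, c in counts.items() if c == 1}    # present in one line only
    shared_by_some = set(counts) - core - singleton         # present in 2..n-1 lines
    return {
        "pan": set(counts),
        "core": core,
        "shared_by_some": shared_by_some,
        "singleton": singleton,
    }


# Hypothetical presence-absence data for three lines.
lines = {
    "line_A": {"g1", "g2", "g3", "g4"},
    "line_B": {"g1", "g2", "g3"},
    "line_C": {"g1", "g2", "g5"},
}
result = classify_pan_genome(lines)
for category in ("pan", "core", "shared_by_some", "singleton"):
    print(category, sorted(result[category]))
```

Here the dispensable genes in the sense used above are the union of the `shared_by_some` and `singleton` categories.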

1.5 Conclusion

Genome information on plant species has been an important tool for anchoring the extensive datasets produced by related analyses. The high-throughput capacity introduced by NGS technologies has made it feasible, in a wide range of plant species, to apply advanced genetic approaches such as population genomics and genomic selection to large germplasm resources. The cost reduction and enhanced quality of long-read sequencing technology will make complex genomes accessible to whole-genome investigation as well as pan-genome analysis, both of which offer a broader understanding of the genetic diversity of gene pools in the target species. Accumulating comprehensive genome information will continue to be the basis for plant research by integrating the large variety of information provided by advancing plant omics approaches.

References

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.
Baird, N.A., Etter, P.D., Atwood, T.S., Currey, M.C., Shiver, A.L. et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3(10), e3376. DOI: 10.1371/journal.pone.0003376.
Bamba, M., Kawaguchi, Y.W. and Tsuchimatsu, T. (2019) Plant adaptation and speciation studied by population genomic approaches. Development, Growth & Differentiation 61(1), 12–24. DOI: 10.1111/dgd.12578.
Bennett, B.C. (2011) Twenty-five economically important plant families. Encyclopedia of Life Support Systems (EOLSS), Economic Botany. Available at: https://docplayer.net/20954333-Twentyfive-economically-important-plant-families.html (accessed June 2022).
Bertioli, D.J., Cannon, S.B., Froenicke, L., Huang, G., Farmer, A.D. et al. (2016) The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nature Genetics 48(4), 438–446. DOI: 10.1038/ng.3517.
Bertioli, D.J., Jenkins, J., Clevenger, J., Dudchenko, O., Gao, D. et al. (2019) The genome sequence of segmental allotetraploid peanut Arachis hypogaea. Nature Genetics 51(5), 877–884. DOI: 10.1038/s41588-019-0405-z.
Chang, Y., Liu, H., Liu, M., Liao, X., Sahu, S.K. et al. (2019) The draft genomes of five agriculturally important African orphan crops. GigaScience 8(3), giy152. DOI: 10.1093/gigascience/giy152.
Christenhusz, M.J.M. and Byng, J.W. (2016) The number of known plants species in the world and its annual increase. Phytotaxa 261(3), 201. DOI: 10.11646/phytotaxa.261.3.1.
Dumschott, K., Schmidt, M.H.-W., Chawla, H.S., Snowdon, R. and Usadel, B. (2020) Oxford nanopore sequencing: new opportunities for plant genomics? Journal of Experimental Botany 71(18), 5313–5322. DOI: 10.1093/jxb/eraa263.
Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K. et al. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6(5), e19379. DOI: 10.1371/journal.pone.0019379.
Gaikwad, K., Ramakrishna, G., Srivastava, H., Saxena, S., Kaila, T. et al. (2020) Chromosome scale reference genome of cluster bean (Cyamopsis tetragonoloba (L.) Taub). Genomics. DOI: 10.1101/2020.05.16.098434.
Giani, A.M., Gallo, G.R., Gianfranceschi, L. and Formenti, G. (2020) Long walk to genomics: history and current approaches to genome sequencing and assembly. Computational and Structural Biotechnology Journal 18, 9–19. DOI: 10.1016/j.csbj.2019.11.002.
Golicz, A.A., Bayer, P.E., Bhalla, P.L., Batley, J. and Edwards, D. (2020) Pangenomics comes of age: from bacteria to plant and animal applications. Trends in Genetics 36(2), 132–145. DOI: 10.1016/j.tig.2019.11.006.
Hane, J.K., Ming, Y., Kamphuis, L.G., Nelson, M.N., Garg, G. et al. (2017) A comprehensive draft genome sequence for lupin (Lupinus angustifolius), an emerging health food: insights into plant-microbe interactions and legume evolution. Plant Biotechnology Journal 15(3), 318–330. DOI: 10.1111/pbi.12615.
Himmelbach, A., Ruban, A., Walde, I., Šimková, H., Doležel, J. et al. (2018) Discovery of multi-megabase polymorphic inversions by chromosome conformation capture sequencing in large-genome plant species. The Plant Journal 96(6), 1309–1316. DOI: 10.1111/tpj.14109.
Hon, T., Mars, K., Young, G., Tsai, Y.-C., Karalius, J.W. et al. (2020) Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data 7(1), 399. DOI: 10.1038/s41597-020-00743-4.
Huang, X., Kurata, N., Wei, X., Wang, Z.-X., Wang, A. et al. (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490(7421), 497–501. DOI: 10.1038/nature11532.
Hufnagel, B., Marques, A., Soriano, A., Marquès, L., Divol, F. et al. (2020) High-quality genome sequence of white lupin provides insight into soil exploration and seed quality. Nature Communications 11(1), 1–12. DOI: 10.1038/s41467-019-14197-9.
Hufnagel, B., Soriano, A., Taylor, J., Divol, F., Kroc, M. et al. (2021) Pangenome of white lupin provides insights into the diversity of the species. Plant Biotechnology Journal 19(12), 2532–2543. DOI: 10.1111/pbi.13678.
International Barley Genome Sequencing Consortium (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716.
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800.
International Wheat Genome Sequencing Consortium (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, 6403.
Istvánek, J., Jaros, M., Krenek, A. and Řepková, J. (2014) Genome assembly and annotation for red clover (Trifolium pratense; Fabaceae). American Journal of Botany 101(2), 327–337. DOI: 10.3732/ajb.1300340.
Jayakodi, M., Padmarasu, S., Haberer, G., Bonthala, V.S., Gundlach, H. et al. (2020) The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588(7837), 284–289. DOI: 10.1038/s41586-020-2947-8.
Kang, Y.J., Kim, S.K., Kim, M.Y., Lestari, P., Kim, K.H. et al. (2014) Genome sequence of mungbean and insights into evolution within Vigna species. Nature Communications 5, 5443. DOI: 10.1038/ncomms6443.
Kang, Y.J., Satyawan, D., Shim, S., Lee, T., Lee, J. et al. (2015) Draft genome sequence of adzuki bean, Vigna angularis. Scientific Reports 5, 1–8. DOI: 10.1038/srep08069.
Kaul, T., Eswaran, M., Thangaraj, A., Meyyazhagan, A., Nehra, M. et al. (2019) Rice bean (Vigna umbellata) draft genome sequence: unravelling the late flowering and unpalatability related genomic resources for efficient domestication of this underutilized crop. [bioRxiv]. DOI: 10.1101/816595.
Kreplak, J., Madoui, M.-A., Cápal, P., Novák, P., Labadie, K. et al. (2019) A reference genome for pea provides insight into legume genome evolution. Nature Genetics 51(9), 1411–1422. DOI: 10.1038/s41588-019-0480-1.
Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T. et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950), 289–293. DOI: 10.1126/science.1181369.
Liu, Y., Du, H., Li, P., Shen, Y., Peng, H. et al. (2020) Pan-genome of wild and cultivated soybeans. Cell 182(1), 162–176. DOI: 10.1016/j.cell.2020.05.023.

8

M. Bamba et al.

Lobaton, J.D., Miller, T., Gil, J., Ariza, D., de la Hoz, J.F. et al. (2018) Resequencing of common bean identifies regions of inter-gene pool introgression and provides comprehensive resources for molecular breeding. The Plant Genome 11(2), 170068. DOI: 10.3835/plantgenome2017.08.0068. Lonardi, S., Muñoz-Amatriaín, M., Liang, Q., Shu, S., Wanamaker, S.I. et al. (2019) The genome of cowpea (Vigna unguiculata [L.] Walp.). The Plant Journal 98(5), 767–782. DOI: 10.1111/tpj.14349. Mascher, M., Gundlach, H., Himmelbach, A., Beier, S., Twardziok, S.O. et al. (2017) A chromosome conformation capture ordered sequence of the barley genome. Nature 544(7651), 427–433. DOI: 10.1038/nature22043. Miki, Y., Yoshida, K., Enoki, H., Komura, S., Suzuki, K. et al. (2020) GRAS- Di system facilitates high- density genetic map construction and QTL identification in recombinant inbred lines of the wheat progenitor Aegilops tauschii. Scientific Reports 10(1), 21455–21455. DOI: 10.1038/ s41598-020-78589-4. Milner, S.G., Jost, M., Taketa, S., Mazón, E.R., Himmelbach, A. et al. (2019) Genebank genomics highlights the diversity of a global barley collection. Nature Genetics 51(2), 319–326. DOI: 10.1038/ s41588-018-0266-x. Monat, C., Padmarasu, S., Lux, T., Wicker, T., Gundlach, H. et al. (2019) TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biology 20(1), 284. DOI: 10.1186/s13059-019-1899-5. Paterson, A.H., Bowers, J.E., Bruggmann, R., Dubchak, I., Grimwood, J. et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457(7229), 551–556. DOI: 10.1038/nature07723. Pootakham, W., Nawae, W., Naktang, C., Sonthirod, C., Yoocha, T. et al. (2020) A chromosome-scale assembly of the black gram (Vigna mungo) genome. Molecular Ecology Resources 21(1), 238–250. DOI: 10.1111/1755-0998.13243. Rai, A., Saito, K. and Yamazaki, M. (2017) Integrated omics analysis of specialized metabolism in medicinal plants. 
The Plant Journal 90(4), 764–787. DOI: 10.1111/tpj.13485. Sakai, H., Kanamori, H., Arai-Kichise, Y., Shibata-Hatta, M., Ebana, K. et al. (2014) Construction of pseudomolecule sequences of the aus rice cultivar Kasalath for comparative genomics of Asian cultivated rice. DNA Research 21(4), 397–405. DOI: 10.1093/dnares/dsu006. Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Kato, T. et al. (2008) Genome structure of the legume, Lotus japonicus. DNA Research 15(4), 227–239. DOI: 10.1093/dnares/dsn008. Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463(7278), 178–183. DOI: 10.1038/nature08670. Schmutz, J., McClean, P.E., Mamidi, S., Wu, G.A., Cannon, S.B. et al. (2014) A reference genome for common bean and genome-wide analysis of dual domestications. Nature Genetics 46(7), 707–713. DOI: 10.1038/ng.3008. Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F. et al. (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326(5956), 1112–1115. DOI: 10.1126/science.1178534. Shah, N., Wakabayashi, T., Kawamura, Y., Skovbjerg, C.K., Wang, M.-Z. et al. (2020) Extreme genetic signatures of local adaptation during Lotus japonicus colonization of Japan. Nature Communications 11(1), 253. DOI: 10.1038/s41467-019-14213-y. Shen, C., Du, H., Chen, Z., Lu, H., Zhu, F. et al. (2020) The chromosome-level genome sequence of the autotetraploid alfalfa and resequencing of core germplasms provide genomic resources for alfalfa research. Molecular Plant 13(9), 1250–1261. DOI: 10.1016/j.molp.2020.07.003. Udall, J.A. and Dawe, R.K. (2018) Is it ordered correctly? Validating genome assemblies by optical mapping. The Plant Cell 30(1), 7–14. DOI: 10.1105/tpc.17.00514. Varshney, R.K., Chen, W., Li, Y., Bharti, A.K., Saxena, R.K. et al. (2012) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nature Biotechnology 30(1), 83–89. 
DOI: 10.1038/nbt.2022. Varshney, R.K., Song, C., Saxena, R.K., Azam, S., Yu, S. et al. (2013) Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement. Nature Biotechnology 31(3), 240–246. DOI: 10.1038/nbt.2491. Varshney, R.K., Thudi, M., Roorkiwal, M., He, W., Upadhyaya, H.D. et al. (2019) Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nature Genetics 51(5), 857–864. DOI: 10.1038/s41588-019-0401-3. Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Carpenter, E. et al. (2014) Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences 111(45), E4859–E4868. DOI: 10.1073/pnas.1323926111.

Plant Genomics

9

Wisser, R.J., Oppenheim, S.J., Ernest, E.G., Mhora, T.T., Dumas, M.D. et al. (2021) Genome assembly of a Mesoamerican derived variety of lima bean: a foundational cultivar in the Mid-Atlantic USA. G3|Genes|Genomes|Genetics 11(11), jkab207. DOI: 10.1093/g3journal/jkab207. Xia, Q., Pan, L., Zhang, R., Ni, X., Wang, Y. et al. (2019) The genome assembly of asparagus bean, Vigna unguiculata ssp. sesquipedialis. Scientific Data 6(1), 1–10. DOI: 10.1038/s41597-019-0130-6. Yang, K., Tian, Z., Chen, C., Luo, L., Zhao, B, et al. (2015) Genome sequencing of adzuki bean (Vigna angularis) provides insight into high starch and low fat accumulation and domestication. Proceedings of the National Academy of Sciences 112(43), 13213–13218. DOI: 10.1073/pnas.1420949112. Yin, D., Ji, C., Ma, X., Li, H., Zhang, W. et al. (2018) Genome of an allotetraploid wild peanut Arachis monticola: a de novo assembly. GigaScience 7(6), 1–9. DOI: 10.1093/gigascience/giy066. Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B. et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480(7378), 520–524. DOI: 10.1038/ nature10625. Zhao, J., Bayer, P.E., Ruperao, P., Saxena, R.K., Khan, A.W. et al. (2020) Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnology Journal 18(9), 1946–1954. DOI: 10.1111/pbi.13354. Zhou, P., Silverstein, K.A.T., Ramaraj, T., Guhlin, J., Denny, R. et al. (2017) Exploring structural variation and gene family architecture with de novo assemblies of 15 Medicago genomes. BMC Genomics 18(1), 1–14. DOI: 10.1186/s12864-017-3654-1.

2

Plant Transcriptomics: Data-driven Global Approach to Understand Cellular Processes and Their Regulation in Model and Non-Model Plants

Ai Kitazumi¹, Isaiah C.M. Pabuayon¹, Kevin R. Cushman¹, Kentaro Yano² and Benildo G. de los Reyes¹*
¹Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA; ²School of Agriculture, Meiji University, Kawasaki, Japan

Abstract

Under the new paradigms of integrative and network biology, comparing expression changes among a small subset of candidate genes across phenotypic variants is hardly informative or conclusive in the context of cellular response regulation and the underlying genetic mechanisms. Integrating global changes in gene expression under multiple conditions with systematically curated genomic databases is key to the efficient extraction of robust and biologically meaningful patterns and signatures that reflect cellular states. Despite the increasing availability of a wide array of computational tools, the resolution of RNA-seq-based transcriptome profiling is only as good as the experimental design, which determines the window of information revealed relative to the hypothesis being tested, and this intricacy is often underestimated. In this chapter, we discuss the important aspects of data analytics and the basic principles that must be taken into consideration to better bridge the design of wet-lab experiments with the requirements of robust dry-lab knowledge dissection and integration. We also highlight how the assumptions and requirements differ between transcriptome experiments in plant genetic models, which have comprehensive and annotated genomes for reference-guided assembly and extraction of biological knowledge, and experiments in non-model plant species, which rely on de novo assembly of transcriptome datasets followed by homology-based comparison with closely related species that have a reference genome.

*Corresponding author: benildo.reyes@ttu.edu

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0002

2.1 Introduction

The inherent potential of every single cell in multicellular organisms such as plants is defined by the same nuclear genome. Multicellularity is achieved because the genome is expressed in many different ways, facilitating differentiation, morphogenesis, growth, and adaptive responses. Intricately regulated expression of the genome in time and space creates a vast, dynamic, and enormously complex transcriptome that varies among cell types in response to intrinsic and extrinsic signals (Araújo et al., 2017). Transcriptome profiles represent different networks of gene induction and repression that uniquely identify every cell type under specific conditions, i.e., spatio-temporal signatures. Thus, transcriptome analysis is a central connecting bridge between genotype and phenotype. It also represents a core component of large data-driven exploratory research and hypothesis-driven discovery in integrative plant biology.

Being sessile organisms, plants exhibit a high degree of plasticity. The core of such inherent adaptive potential is the dynamic changes in the transcriptome, which represent the outcomes of integrating different developmental and environmental signals (Paaby and Rockman, 2014). An important aspect of the dynamic nature of the transcriptome is the contribution of extensive gene duplication, which is an important feature of the genomes of many plant species. Duplicated genes in large families serve as substrates for sub-functionalization through the creation of novel and/or specialized networks comprising distinct subsets of genes with multiple paralogs (Panchy et al., 2016). Differential regulation of individual paralogs and their interaction with their associated genes across the genome contribute to the large permutation of transcriptome configurations that define different adaptive responses in plants (Das et al., 2016; Kitazumi et al., 2018; Pabuayon et al., 2020).

The diversity with which the genome can be expressed differentially to configure adaptive transcriptomic responses is the consequence of multiple layers of regulation. Gene expression fluxes are the outcomes of regulation at the level of transcriptional initiation (i.e., cis-regulation by enhancers and silencers, and trans-regulation by transcriptional activators and repressors) (Shlyueva et al., 2014; de los Reyes et al., 2015), post-transcriptional transcript degradation by microRNAs (miRNAs) (Jones-Rhoades et al., 2006; Kitazumi et al., 2015; Pabuayon et al., 2020), and epigenomic or chromatin-level control through DNA methylation, noncoding RNAs (ncRNA), and histone modification (Gibney and Nolan, 2010; Law and Jacobsen, 2010; de los Reyes et al., 2018; de los Reyes, 2019). Collectively, these layers of regulation define the full potential of the genome to configure a vast array of transcriptome states (i.e., spatio-temporal signatures) to account for the complex requirements of multicellularity and adaptation. Understanding the biological implications of spatio-temporal fluxes in the transcriptome, qualitatively and quantitatively, is a critical first step for understanding the intricate mechanisms governing cellular-level and whole-organismal-level responses.

During the past three decades, we have seen the evolution of the technology and approaches used for profiling the transcriptome, from the semi-global clone-by-clone sequencing of expressed sequence tags (ESTs) to the global hybridization-based profiling by microarray, and later to the first-generation global sequencing-based platforms such as massively parallel signature sequencing (MPSS) (Wang et al., 2009). A more recent innovation was the application of next-generation sequencing (NGS) technology, which led to a paradigm shift that allowed an even more universal scope of profiling the spatio-temporal transcriptome fluxes by direct sampling and deep sequencing of transcripts (RNA-seq technology), which was not possible with the earlier technologies. RNA-seq technology not only provided a powerful means of capturing the most subtle changes in transcriptome fluxes at high resolution and dynamic range, but it also allowed the profiling of qualitative changes by revealing the contributions of alternative splicing (Trapnell et al., 2012; Sibley et al., 2016). In effect, RNA-seq technology afforded a truly comprehensive view of the vast array of expression capacities of the genome, further allowing the interpretation of such changes in the context of regulatory networks and synergistic interactions, which are key to a meaningful view of the intricacy of cellular and biological functions in relation to genotype and adaptive phenotypes.

2.2 Overview of RNA-Seq-Based Transcriptome Profiling

Transcriptomics by RNA-seq facilitates the profiling of transcript abundance for every gene locus and its alternative splicing variants, miRNAs, and all other classes of ncRNAs across the entire euchromatic and heterochromatic regions of the genome (Wang et al., 2009). Each sequenced fragment of reverse-transcribed RNA from a fragmented pool of cellular RNA is called a read, or short read, as opposed to the long reads generated from RNA without fragmentation. Sequence reads are generated from one end (single-end reads) or both ends (paired-end reads) of fragmented RNA molecules. The sequences generated are mapped against a reference genome (i.e., the complete or near-complete genomic sequence of the target organism, hereafter referred to as the reference for mapping) by finding the best-matching site for each read in the genome that represents the sample, a process called reference-based mapping. The genomic location and the number of reads per genomic location (i.e., depth) are used to estimate the degree of transcriptional activity. Alternatively, de novo assembly is conducted in the absence of a reference genome, which will be discussed in subsequent sections. RNA-seq-based transcriptome analysis has a wide range of applications, from exploratory investigation and comparative quantitative and qualitative analysis of cell-type-, tissue-, organ-, developmental stage-, and/or treatment-specific patterns of transcription, to more hypothesis-driven investigation and confirmation of the downstream target genes of a single mutation event (e.g., overexpression and knockout), and global analysis of large-scale co-activation or co-suppression of genes in a regulatory network.
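As a rough illustration of how read depth per genomic location can be turned into a per-gene count, the sketch below tallies mapped reads that start inside each gene interval. The gene coordinates, read positions, and the function name `count_reads_per_gene` are hypothetical and chosen only for illustration; real pipelines work from alignment files and handle strand, splicing, and multi-mapping, none of which is modelled here.

```python
# Hypothetical gene models: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical mapped reads: (chromosome, leftmost mapped position).
READS = [
    ("chr1", 120),
    ("chr1", 450),
    ("chr1", 810),
    ("chr1", 600),   # falls between the two genes, so it is not counted
    ("chr2", 150),   # different chromosome, so it is not counted
]


def count_reads_per_gene(reads, genes):
    """Count reads whose mapped start position lies inside each gene interval."""
    counts = {name: 0 for name in genes}
    for chrom, pos in reads:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts


depth = count_reads_per_gene(READS, GENES)
print(depth)
```

The nested loop is adequate for a toy example; with millions of reads, an interval index or a dedicated counting tool would be the more realistic choice.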
According to the same principle, RNA-seq-based transcriptome profiling can be used to identify the various signals, processing, interaction, targeting, and fate of transcribed protein-coding genes across the genome, including the mapping of transcription start sites (TSS-seq) (Yamashita et al., 2011), detection of selective polyadenylation for maturation of mRNA (3′-Seq) (Sanfilippo et al., 2017), profiling of small RNA (miRNA-seq) (Addo-Quaye et al., 2008), detection of RNA–protein interaction (RIP-seq) (Zhao et al., 2010), ribosome-associated mRNA quantification (ribo-seq) (Ingolia et al., 2009), detection of post-transcriptional RNA methylation (Meyer et al., 2012), quantification of RNA stability against degradation (BRIC-seq) (Imamachi et al., 2014), and sequencing upstream for variation in cis-element reporters (CRE-seq) (Kwasnieski et al., 2012). A more recent application of this technology is the profiling of short and/or long noncoding RNAs (ncRNA-seq) with potential transcriptional and post-transcriptional regulatory functions in the cell (Guttman et al., 2009; Ulitsky, 2016).

RNA-seq-based transcriptome profiling has been increasingly used as a powerful approach for building comprehensive spatio-temporal profiles of all genes (a transcriptome roadmap or atlas) across human populations (e.g., the Encyclopedia of DNA Elements (ENCODE) project) as well as across the entire spectrum of genetic diversity that represents species used as models for genetic studies, such as mouse (e.g., the functional annotation of mammalian genome (FANTOM) project) (Fantom Consortium and the Riken Genome Exploration Research Group Phase I & II Team, 2002; Encode Project Consortium, 2012). In these recent examples of global-scope transcriptome projects, profiles specific to tissue type, disease state, and developmental stage are surveyed in a systematic manner to allow direct comparison of samples or individuals across contrasting biological states. These projects not only succeeded in identifying new isoforms and regulatory elements beyond what can be achieved by DNA sequence-based prediction alone (Encode Project Consortium et al., 2020), but also identified a significant number of ncRNA loci and novel gene-coding loci that may have functions in transcriptional regulation and translational modulation, which would not have been identified by conventional ab initio gene prediction. These advances paved the way for investigating the contributions of epigenetic regulation to the dynamic nature of the transcriptome. Therefore, comprehensive profiling of the transcriptome is a crucial step for understanding the functional context of genes in biological processes, an area in which plant biology currently lags behind the more rapid advances in human and animal biology (Klepikova and Penin, 2019; Sjöstedt et al., 2020).
The resolution and biological interpretability of most typical transcriptome studies depend largely on a number of key factors, including the extent of sampling (i.e., time-points, tissue types, and developmental stages) and the scope of the comparative panel (i.e., a few representative genotypes for direct comparison versus larger populations of individuals across genetic populations) for mining common trends and patterns. Unlike the relatively more straightforward analysis of the genome sequence, the dynamic and stochastic nature of the transcriptome makes it impossible to add more genotypes or to compare different transcriptomes if the experiments were not conducted at the same time or in a directly comparable time window. Prior optimization of experimental conditions in relation to reference datasets in existing plant databases (e.g., Plant Omics Data Center: http://plantomics.mind.meiji.ac.jp/podc/; RiceXPro: https://ricexpro.dna.affrc.go.jp/) (all accessed July 2022) is beneficial so that spatial profiles established in large datasets can be used to add meaningful annotations. For this purpose, it is often recommended to include model species with well-investigated transcriptomes to serve as a baseline.

The scope of discovery and the depth of biological interpretation are also limited by the availability, or lack thereof, of comprehensively annotated reference genomes, which are readily available for plant species used as genetic models but often lacking or inadequate for less investigated non-model crops or orphan plant species. In reference-guided transcriptome analysis, major sources of error in mapping and annotation include loci that are not represented in the reference genome or that have extensive variation in exon structures or open reading frames, large genomic rearrangements (e.g., duplication, inversion, insertion, deletion, and translocation), and transposable elements.

In this chapter, we present and discuss the factors that are crucial for a standard transcriptomics experiment (Fig. 2.1). We describe the important aspects of data analytics in the context of the typical bulk RNA-seq approach and the appropriate strategies for better bridging the design of wet-lab experiments with the requirements for robust dry-lab knowledge dissection and integration.
We also highlight how the assumptions and requirements differ between transcriptome experiments in plant genetic models, which have comprehensively annotated reference genome sequences for reference-guided assembly and extraction of biological knowledge, and experiments in non-model plant species, which rely on de novo assembly of transcriptome datasets followed by homology-based comparison with closely related species for which annotated reference genome sequences or assemblies are available.


2.2.1 Phase-IA: Sampling time-point, replication, and depth of coverage

The first step in establishing a robust RNA-seq-based transcriptome profiling experiment in plants involves choosing the appropriate tissue or organ as the source of target RNA, determining a biologically meaningful temporal sampling design and methodology, and choosing a sequencing strategy that will generate data resolution adequate to the scope and nature of the central biological question (Fig. 2.1, top panel). In eukaryotic cells, including those of plants, the bulk of the products of transcription are housekeeping and maintenance RNAs such as rRNAs (and tRNAs), overshadowing the representation of protein-coding mRNAs, which make up only around 5% of the total RNA pool (Warner, 1999). To improve the representation of mRNAs in the target sample for RNA-seq library construction, enrichment by poly-A selection is performed to filter out the high-abundance rRNAs (and tRNAs). Alternatively, a procedure for rRNA depletion is performed in cases of reduced full-length mRNA abundance due to degradation. This approach is also effective for capturing target RNAs without poly-A tails, including the non-polyadenylated ncRNAs. Coupled with a procedure for target RNA size fractionation, RNA-seq technology can also be used to profile the expression of different types of small (21 nt to 27 nt) RNAs, i.e., miRNAs and small interfering RNAs (siRNAs), which are crucial for understanding the epigenomic-level regulation of transcriptome fluxes. Compared with static DNA information, the process of RNA transcription is inherently dynamic and noisy, and subtle differences in sampling strategies and background conditions can influence the relative abundances and steady state of the final mRNA products.
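Size fractionation is performed on the RNA itself, but the same 21 nt to 27 nt window can also be applied in silico to sequenced reads as a sanity check. The following minimal sketch assumes reads are already adapter-trimmed and held as plain strings; the function name `select_small_rnas` and the toy reads are hypothetical.

```python
def select_small_rnas(reads, min_len=21, max_len=27):
    """Keep reads whose length falls in the small-RNA window (inclusive)."""
    return [read for read in reads if min_len <= len(read) <= max_len]


# Toy reads of 18, 24, 27, 35, and 21 nt (sequences are arbitrary).
toy_reads = ["A" * 18, "ACGT" * 6, "G" * 27, "C" * 35, "T" * 21]

small_rnas = select_small_rnas(toy_reads)
print([len(read) for read in small_rnas])
```

Length alone does not distinguish miRNAs from siRNAs or from degradation fragments of longer transcripts, so a filter like this is only a first pass.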
Moreover, the bulk sampling approach (as opposed to the single-cell sampling approach) pools the stochastic variation of the transcriptome from one cell to another within a tissue or organ, and thus reflects the sum of multiple nuclei undergoing transcription at slightly variable stages of development and/or physiological status. It is important to acknowledge that outliers across a wide range of expression, which arise from the stochastic nature of gene expression among a group of cells, are most likely lost or averaged out in a bulk sampling experiment, and this reduces the resolution of the data and undermines biological interpretation.

Fig. 2.1. Standard workflow of a typical RNA-seq experiment and analysis (prior to gene network analysis) composed of four major component phases: (I) sampling from the biological experiments and sequencing of fragmented RNA molecules; (II) pre-processing of raw sequence reads; (III) mapping of the pre-processed sequence reads for model species or non-model species; and (IV) detection of differentially expressed genes and the degree of response in both a quantitative and qualitative manner by integrating available knowledge databases. The recommendation for sequencing and mapping strategy is described alongside the expected resolution for each analysis.

The overall purpose of most transcriptomics experiments is essentially to uncover major shifts in the transcriptional machinery as a consequence of upstream signaling, and to use the transcriptional changes as a means to understand the various components of the genetic machinery that contribute to the phenotype. Given this purpose, the sampling time-points for transcriptome profiling should encompass the entire window of the biological process. For example, upstream signaling cascades involving short-lived molecules such as reactive oxygen species (ROS) could end within a short period, often a few minutes, from the onset of the treatment. Direct responses to these types of signals are often also relatively short-lived and should be captured by narrowly spaced sampling time-points. On the other hand, secondary or tertiary effects of the initial or primary short-lived changes on gene expression occur somewhat later and are often sustained for a much longer period of time (Yun et al., 2010). To minimize the impact of background noise due to the effects of circadian rhythms, it is ideal that sampling from the control experiment (i.e., Control t0, where t = time-point) follows the same temporal schedule as the experiments with treatments (i.e., Control t1, t2, tn alongside Treatment t1, t2, tn).
However, if such sampling is not possible due to resource restrictions or a limited number of individual plants, the subsequent sampling during treatments (i.e., Treatment t1, t2, tn) should be synchronized with the circadian and physiological status of the control individuals (i.e., Control t0). This is often achieved by selecting plants or tissues of the same developmental age and sampling them at the same time of day as the initial control (t0) sampling. It is also highly desirable that the biological significance of the collected samples be evaluated empirically by RT-qPCR of a few well-known marker genes that are expected to change in expression in a predictable manner in response to similar treatments.
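The paired control and treatment design described above can be written down as a sample sheet before any tissue is collected. The sketch below is one possible layout, not a prescribed format: the function name `build_sample_sheet`, the condition labels, and the choice of three time-points and three biological replicates are all assumptions made for illustration.

```python
from itertools import product


def build_sample_sheet(time_points, n_replicates,
                       conditions=("Control", "Treatment")):
    """List (condition, time_point, replicate) for a paired time-course design.

    A Control t0 baseline is added first; every later time-point is sampled
    in both conditions so that circadian effects can be separated from
    treatment effects.
    """
    replicates = range(1, n_replicates + 1)
    sheet = [("Control", "t0", rep) for rep in replicates]
    for time_point, condition, rep in product(time_points, conditions, replicates):
        sheet.append((condition, time_point, rep))
    return sheet


sheet = build_sample_sheet(["t1", "t2", "t3"], n_replicates=3)
print(len(sheet), "libraries")
for row in sheet[:5]:
    print(row)
```

Counting the rows up front makes the cost of the full design explicit, which helps when deciding whether the reduced design (Control t0 only, with synchronized treatment sampling) is the realistic option.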


Taking into consideration the stochastic nature of the transcription process among cells in a given tissue or organ, and the potential confounding effects of a bulk-sampling strategy, allocation of robust biological replicates is far more important than technical replicates in any RNA-seq experiment. It is well established that NGS technology itself generates highly reproducible and consistent output across sequencing runs as long as the samples and processed libraries are of high quality. In contrast, different individuals within a genetically homogeneous population often show significant variation that is almost always captured as background noise or sample-to-sample variation in an RNA-seq experiment (Marioni et al., 2008; Schurch et al., 2016). Sequencing coverage or depth sufficient to yield biologically interpretable results has become relatively easy to achieve with the current capacity of most NGS platforms, which consistently generate billions of reads per run (discussed later). For example, around 2–8 Gbp of input was shown to be sufficient to recover the entire exonic regions in most model organisms (e.g., human, mouse, fly, Arabidopsis, and rice) regardless of genome size, and a further increase in sampling depth could facilitate detection of single-exon transcripts from noncoding regions (Patterson et al., 2019). With 150 bp paired-end sequencing, this roughly translates to 10–40 million reads, assuming 50 bp is removed from each pair during pre-processing. In tomato RNA-seq experiments, four biological replicates with 20 million reads per library (125 bp paired-end) are recommended for robust detection of differentially expressed genes (Lamarre et al., 2018).
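The conversion from a target yield in base pairs to a number of reads is simple arithmetic, sketched below. The function name `read_pairs_for_yield` is hypothetical, and the calculation interprets the figures above as read pairs of 2 × 150 bp that lose 50 bp per pair to trimming, which is an assumption about how the estimate was derived.

```python
def read_pairs_for_yield(target_bp, read_len=150, trimmed_per_pair=50):
    """Read pairs needed to reach target_bp of usable sequence.

    Each pair contributes two reads of read_len bases, minus the bases
    removed from the pair during pre-processing.
    """
    usable_bp_per_pair = 2 * read_len - trimmed_per_pair  # 250 bp by default
    return target_bp / usable_bp_per_pair


low = read_pairs_for_yield(2e9)   # 2 Gbp target
high = read_pairs_for_yield(8e9)  # 8 Gbp target
print(f"{low / 1e6:.0f} to {high / 1e6:.0f} million read pairs")
```

Under these assumptions the result is 8 to 32 million read pairs, the same order of magnitude as the 10–40 million quoted in the text.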

2.2.2 Phase-IB: Single or paired-end sequence reads – platform and error rate Having established the biologically meaningful target RNA samples and robust sampling time- points, the next step is to construct the sequencing libraries in an appropriate platform. The fundamental yet challenging task in RNA-seq experiments is to trace the original genomic loci from which the RNA fragments were derived after the total pool of target RNAs has been fragmented and

16

A. Kitazumi et al.

processed in accordance with the length requirements of the NGS pipeline. Michael Snyder, a pioneer and expert in both short-read and long-read RNA-seq technology, has remarked: “The way we do RNA-Seq now … is you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again. If you think about it, it’s kind of a crazy way to do things” (Nawy, 2013; Sharon et al., 2013). Indeed, reconstruction of the highly complex transcriptome through the RNA-seq approach is like a jigsaw puzzle without corner pieces or borders. The fragmentation of target RNA molecules is merely a technical requirement for anchoring the fragments to the slide for amplification, but it reduces the sequence identity of each transcription event. This leads to ambiguous mapping, called multi-mapping, where one fragment can be traced to multiple locations across the genome with the same score instead of pinpointing a single locus of true origin. This is arguably one of the biggest disadvantages of RNA-seq technology compared with microarrays. Although the probes in a microarray experiment do not differentiate the products of alternative splicing or multiple paralogs and often miss any transcripts not represented by a probe, RNA fragmentation, and the multi-mapping that results from it, is not an issue. Transposable elements and other repetitive sequences account for about 40% of the rice genome based on the Nipponbare reference genome model, and gene duplications and their resulting pseudogenes contribute to increased rates of multi-mapping of fragmented RNA sequence reads (Das et al., 2016; Panchy et al., 2016).

There are four major platforms for obtaining short sequence reads from target RNA samples, and read length, read quality, and throughput depend on the platform. These are the Roche 454, Ion Torrent, ABI’s SOLiD, and Illumina HiSeq/NovaSeq systems (Metzker, 2010; Escalona et al., 2016; Goodwin et al., 2016). Of these, the Illumina HiSeq/NovaSeq system is the most popular, followed by Thermo Fisher’s Ion Torrent, ABI’s SOLiD system (sequencing by ligation), and Roche 454 (Fig. 2.2). Library preparation is generally a similar process regardless of the platform (Fig. 2.1, top panel): the target RNA is first fragmented mechanically or enzymatically, then either adapters are ligated for amplification and cDNA synthesis (RNA ligation method), or cDNA is synthesized with random primers followed by degradation of the second strand of cDNA (dUTP method), to produce strand-specific libraries that distinguish sense from antisense transcripts (Levin et al., 2010). To increase template abundance, the cDNAs with adapters are amplified by PCR (normally for 15 cycles) in sequencing libraries constructed from nanograms of sample, which is necessary for enrichment of the target but is also a source of

Fig. 2.2. Proportion of adopted sequencing platforms in transcriptome studies based on available information from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database in representative plant species and in humans. Labels indicate the number of studies in the given categories. Data were obtained on 13 October 2020. The “others” category includes the following platforms: BGISEQ, Helicos, Roche 454, PacBio SMRT, Oxford Nanopore, Capillary, and Complete Genomics.

Plant Transcriptomics

redundancy and bias that can be detected as PCR duplication during pre-processing. In the Illumina system, the cDNA is further amplified by bridge PCR, where reactions occur using complementary sequences anchored on the surfaces of glass slides to form a cluster. In contrast, the other three platforms (Ion Torrent, SOLiD, and Roche 454) use emulsion PCR, where a single molecule is amplified on an individual bead. During sequencing, the templates in clusters on glass slides or on beads are amplified while incorporating labeled nucleotides with 3′-blocked dNTPs (Illumina), dNTPs without a 3′-block (Ion Torrent and Roche 454), or dinucleotide probes (SOLiD) (Goodwin et al., 2016).

In this sequencing step, the Illumina system has the advantage of precision and throughput compared with the others. Sequencing by synthesis with cyclic reversible termination (Illumina) is usually less error-prone than single-nucleotide addition (Roche 454 and Ion Torrent) because the former stops the synthesis reaction after each base, whereas the latter relies on supplying only one type of dNTP at a time. As a result, when a given template has a homopolymer, where the same base is repeated, the length of the homopolymer needs to be determined from the intensity of the signal, which is prone to error (Knief, 2014). Unlike single nucleotide substitution (SNS) errors, insertions and deletions (indels) are detrimental, particularly for de novo assembly of the transcriptome, where one indel could shift the open reading frame. As for throughput, emulsion PCR poses constraints because longer reads reduce the number of beads that can fit within a semiconductor chip or a glass slide with wells (i.e., longer reads require more space to accommodate bigger beads). In summary, in terms of maximum sequence length, precision, and throughput, the following comparisons can be established.
The shorter reads (up to 75 bp in SOLiD and 300 bp in Illumina) are generally more accurate (0.01–1% error rate) than the longer reads of up to 400 bp in Ion Torrent and 700 bp in Roche 454 (about 1.7% error rate) (Escalona et al., 2016). As for throughput, Illumina is capable of producing 3 billion reads per flow cell (in the case of HiSeq X: 150 bp paired-end), followed by 1.4 billion reads by SOLiD (in the case of 5500xl: 50 bp single-end), 15–20 M reads by Ion Torrent (in the case of S5 530: 400 bp single-end), and 1 M by Roche 454 (GS FLX 700–1000 bp), reflecting the negative correlation between size of beads and throughput (Knief, 2014; Goodwin et al., 2016). Combining high throughput with low error rates, the Illumina platform is becoming the standard short-read RNA-seq platform, accounting for more than 97% of studies in the public sequencing database today (Fig. 2.2).

Although they come at a cost in error rates, the Iso-Seq approach, which capitalizes on the single-molecule sequencing chemistry of PacBio, and Oxford Nanopore long-read sequencing achieve significantly increased read lengths as third-generation sequencing platforms (Goodwin et al., 2016). Given the high error rates (10–20% unless corrected using high depth or circular consensus reads), long-read data require extensive error correction and caution in interpretation. Nonetheless, long reads allow the detection of alternative splicing involving introns that exceed the length of short fragments, and single-molecule sequencing avoids the PCR duplication bias associated with short-read sequencing (Koren et al., 2017).

The experimental design should be decided based on the hypothesis of the experiments. For instance, if the target transcripts are known to have high copy numbers, as in the case of polyploids and highly heterozygous species, long reads could be a reasonable choice to build a gene model even if the transcriptome data are not used for quantitative analysis of transcript abundances. If the experiment is based on model species with solid gene models or on detection of conserved genes from non-model species, short-read sequencing might suffice and establishing a robust number of time-points might be worth the expenditure (approximately 100× more cost) (Goodwin et al., 2016).
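The multi-mapping problem described earlier in this section, and the way longer reads help to resolve it, can be illustrated with a minimal sketch. The toy genome, the repeat unit, and the function name `find_all` are hypothetical and chosen only for illustration; real aligners use indexed, mismatch-tolerant searches rather than exact string matching.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # continue after the last hit
    return positions


# Hypothetical toy genome: one repeat unit inserted twice between unique flanks.
REPEAT = "ACGTACGGTTCA"
GENOME = "TTGACCA" + REPEAT + "GGCATCGAT" + REPEAT + "CCATGGA"

short_read = "ACGGTTCA"        # lies entirely inside the repeat unit
long_read = "ACGGTTCAGGCAT"    # same start, but extends into a unique flank

for name, read in [("short", short_read), ("long", long_read)]:
    hits = find_all(GENOME, read)
    status = "multi-mapped" if len(hits) > 1 else "uniquely mapped"
    print(f"{name} read ({len(read)} bp): positions {hits} -> {status}")
```

In this toy case the short read matches both copies of the repeat with the same score, whereas the longer read reaches unique flanking sequence and is placed at a single position.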

2.2.3 Phase-II: Factors in processing of sequence reads and their limitations

The raw output from the sequencer (after base-calling) is in FASTQ format, which describes sequencing information (sequencer ID, adapter information), sequences, and the corresponding base qualities. In this second phase, the library quality, sequence bias, and insert length should be checked and compared across libraries (Fig. 2.1, middle panel). The raw output might contain sequences that are not of any biological significance but are rather artifacts of the library preparation, amplification, and sequencing procedures, such as adapter and index sequences. These artifacts can compromise the mapping process and need to be removed prior to mapping. This “cleaning” process is called pre-processing. As for the quality of bases, quality score cut-offs of 10, 20, and 30 correspond to error rates of 10%, 1%, and 0.1%, respectively, and usually a quality score of 20 (Q20, allowing 1% error) or 30 (Q30, allowing 0.1% error) is used to trim low-quality bases. Aligner algorithms are sensitive to mismatches that occur at the 5′-ends of sequence reads, so removal of technical contamination is crucial for later analysis. Current sequencing platforms are very capable of obtaining high-quality reads, and the proportion of reads that pass Q30 is used to evaluate sequencing quality and pre-processing. Beyond these critical factors, screening for bias in GC content and for sequence duplication (highly duplicated k-mers) is included in standard pre-processing. However, coding sequences are inherently biased regions of the genome, and any genome that has species-specific signatures would therefore create strong biases; these are generally not sufficient grounds to discard the sequencing data as long as they occur uniformly across replicates and conditions.
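The relationship between quality scores and error rates quoted above follows the Phred definition, Q = -10 log10(P). The sketch below shows the conversion and a simple 3′-end quality trim. It assumes the Phred+33 ASCII encoding for FASTQ quality strings, which the text does not specify; the function names are hypothetical, and dedicated trimming tools apply more elaborate sliding-window rules.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Remove trailing bases whose quality is below `min_q`."""
    scores = decode_phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


for q in (10, 20, 30):
    print(f"Q{q}: error rate {phred_to_error(q):.1%}")

# 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2 under Phred+33.
print(trim_3prime("ACGTAC", "III5##", min_q=20))
```

With a Q20 cut-off the two trailing Q2 bases are removed, while the Q20 base is retained because it meets the threshold.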

Nevertheless, a high number of reads obtained after pre-processing does not guarantee that they represent the near-complete transcriptome. Bias in sequencing due to PCR duplication during library construction is one of the factors that cannot be corrected by pre-processing. Any two identical sequences could be due to true biological effects (i.e., transcription from the same position) or due to PCR duplication that occurs upon cDNA synthesis. Additionally, degraded target RNA samples could reduce the representation of the coding region and compromise the resolution. To demonstrate this, RNA-seq data were examined from Arabidopsis thaliana with functional cytoplasmic RNA degradation pathways (VARICOSE, which removes the protective 5′ cap, and the exonuclease SUPPRESSOR OF VCS, responsible for 3′ to 5′ degradation) that had been treated with a transcriptional inhibitor (1 mM cordycepin) to mimic RNA degradation in the absence of new transcriptional activity (data from Sorenson et al., 2018). With the basic assumption that such samples resemble mistreated RNA extracts from a tissue, pre-processed reads were analyzed for recovery of total loci by de novo assembly. To evaluate the effect of time after termination of transcription without technical variability during library construction for sequencing, all samples were equalized in input read number at 60 million reads after pre-processing (Table 2.1).
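The sequence duplication mentioned at the start of the paragraph above can be summarized per library with a very simple count. The sketch below is only an illustration with hypothetical reads and a hypothetical function name; as noted in the text, identical sequences alone cannot distinguish PCR duplicates from genuine repeated transcription of the same position.

```python
from collections import Counter


def duplication_rate(reads: list[str]) -> float:
    """Fraction of reads that are exact copies of a sequence already seen."""
    if not reads:
        return 0.0
    unique = len(Counter(reads))
    return 1.0 - unique / len(reads)


# Hypothetical reads: "ACGT" occurs three times, the others once each.
example_reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
print(f"duplication rate: {duplication_rate(example_reads):.0%}")
```

Comparing this rate across libraries is more informative than its absolute value, because highly expressed transcripts inflate it even in the absence of PCR artifacts.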

Table 2.1. Number of detectable loci and percentage/average length of recovered loci in the Arabidopsis transcriptome from de novo assembly of eight libraries treated by transcription inhibitor (no treatment to 480-minute treatment). The 60 million sequence reads that passed the set criteria (Q30 and 80% of bases remaining after removal of low-quality reads) were used for assembly of each library.

Minutes after treatment | Proportion of reads that passed pre-processing (%) | Number of unique TAIR^a transcripts detected from de novo assembly | Average coverage of reference transcripts % (average length of detected reference transcripts)^b
0 | 91.10 | 30,015 | 87.8 (433.4)
7 | 90.96 | 29,534 | 85.8 (427.3)
15 | 90.73 | 29,365 | 85.9 (421.0)
30 | 91.20 | 28,074 | 82.1 (411.2)
60 | 91.52 | 26,336 | 77.0 (412.4)
120 | 91.20 | 23,887 | 70.0 (403.3)
240 | 90.96 | 21,209 | 61.9 (420.8)
480 | 90.18 | 17,836 | 52.0 (411.5)

^a The Arabidopsis Information Resource.
^b Araport version (Araport11-201606, total 33,864 loci) has been used to evaluate the degree of completeness of transcriptome assembly.


As a result, an average of 87.82% of all loci were recovered by de novo assembly alone from the 0-h control, which reflects the baseline of what 60 million single reads can discover in the model Arabidopsis thaliana transcriptome. The recovery rate progressively worsened to 51.96% after 8 h of treatment, and the average length of de novo assembled transcripts was also reduced. The point here is that even though 60 million pre-processed sequence reads that passed the Q30 cut-off were used for each library, the representation of loci in the de novo assembly decreased significantly with treatment over time. The authors observed that multi-exon transcripts were more likely to be degraded than single-exon transcripts, consistent with shorter transcript assembly (Table 2.1) (Sorenson et al., 2018). This demonstrates the efficiency of library preparation in RNA-seq experiments, in which libraries can be constructed from 50 ng of total RNA. However, the sequence reads obtained do not necessarily reflect the underlying biology, and the quality of the raw sample is a determining factor. In other words, high-quality libraries can be made even from degraded samples, but that does not guarantee a transcriptome of sufficient quality for robust biological interrogation. While RNA-seq has been successfully applied to rare archeological samples, this does not mean we can use it without precaution. Many sequencing facilities guarantee read quality and number as long as the sample RNA passes the minimum quality standard. It is important to emphasize the need to check the RNA integrity number (RIN), based on the degradation of 18S and 28S rRNA, and to handle the samples with extreme caution. With this, we will move on to analyze the pre-processed libraries by comparing model species and non-model species.

2.2.4 Phase III: Choosing the reference and mapping in model plant species

Having gained sequence datasets of adequate quality after pre-processing, the third phase of the RNA-seq pipeline is to obtain quantitative information on the expression of each transcript-producing gene locus by mapping the short sequence reads against the reference genome (Fig. 2.1, middle panel). The availability of a reference genome is a determining factor in this step. Mapping to well-annotated reference sequences of a model species is a privilege particular to plants, because there are gold-standard quality sequences for monocots and dicots that were established by physical mapping-based approaches (Arabidopsis Genome Initiative, 2000; International Rice Genome Sequencing Project, 2005; Kawahara et al., 2013). The percentage of gaps within a primary assembly (“golden path length”) is higher in human (4.95%) and mouse (2.86%) than in Arabidopsis thaliana (0.16%) and Oryza sativa (0.03%). Additionally, there are 91 more reference plant genomes that are annotated in Ensembl Plants (https://plants.ensembl.org/species.html) (accessed July 2022).

To reconstruct the original sequence of an expressed transcript that has been fragmented for the RNA sequencing pipeline, reads from RNA-seq need to be mapped back to the genomic locus of origin based on sequence similarity and alignment. Needless to say, the choice of the reference genome model against which to map the RNA sequence reads should be based primarily on genetic relatedness. Similarities in the coding regions are most important so that the loci and gene models from the reference can accurately capture the entirety of the transcriptome as revealed by the RNA-seq datasets. A good match between the species of interest and the reference sequence is expected to be manifested in high mapping rates, where the majority of reads are traced back to a specific location in the genome at high statistical significance. If there are multiple candidate reference genome sequences, experimental mapping and comparison of map rates and coverage could be beneficial to ensure a robust outcome and high biological interpretability of the transcriptome data matrix.
If the selected reference genome sequence is not appropriate for the sampled organism or genotype, the map rates could be either reduced or still high but with many examples of discordant mapping (i.e., where the two reads of a pair are mapped to distant genomic locations, generally on different chromosomes) and higher coverage in noncoding regions. As an extreme example in rice, the genetic distance between Oryza sativa (AA-genome) and O. officinalis (CC-genome) represents 9 million years of evolutionary divergence, whereas O. rufipogon (AA-genome) represents the progenitor of domesticated O. sativa within several thousands of years of divergence (Huang et al., 2012; Shenton et al., 2020). When the O. sativa Nipponbare reference genome (IRGSP-1.0) was used for the mapping of RNA-seq reads from O. sativa, O. rufipogon, and O. officinalis, the overall map rates were 90.41%, 90.52%, and 37.7%, respectively (data from Kitazumi et al., 2018 and from SRA data file DRR001374 of the NCBI’s Sequence Read Archive). These generally reflect the effects of genetic distance and conserved synteny. Even though O. officinalis is genetically very distant from O. sativa, the conservation of syntenic loci between their genomes was high enough to allow detection of orthology (Kitazumi et al., 2018). Although using O. sativa as a reference for O. officinalis mapping is an unrealistic strategy, it is worth noting that even with a different species such as O. rufipogon, the majority of reads could be mapped to the model reference genome (IRGSP-1.0).

There have been major efforts to increase the number of reference-quality sequence assemblies for the genus Oryza in addition to the gold-standard Nipponbare reference genome (IRGSP-1.0) to facilitate more comprehensive analysis of genetic diversity and evolution. It has been shown that conservation of synteny is generally high and that the overall structure is similar to the gold-standard reference genome (IRGSP-1.0) (Fig. 2.3). Sequence comparison across homologous chromosome 1, for example, generated a phylogenetic tree separating the major groups (i.e., japonica, indica, and aus, with O. nivara as an outgroup). The Nipponbare reference genome (IRGSP-1.0) and the long-read assembly of Nipponbare showed near-complete matches, whereas the Nipponbare reference and O. nivara had the highest divergence, as shown by the largest proportion of non-syntenic regions, as expected.
Synteny among the japonica group is generally high but lower in the indica group, reflecting higher diversity in the indica group. Interestingly, it is apparent that the assembly method has a greater influence than genetics does, as seen in the two assemblies each of Minghui 63 and Zhenshan 97, by short reads and by long reads, which clustered into different clades. Long reads have the advantage over short reads in making contiguous assemblies across repetitive regions and transposable element-derived sequences; therefore, they are expected to recover native chromosome sequences that cannot be obtained by reference-based assembly using the gold-standard Nipponbare. However, the genome structure revealed is quite similar to that of Nipponbare, indicating high synteny between japonica and indica, and even with the aus group, with one possible inversion in N22. In addition, there is a need to link the newly established genome annotation to the existing gene annotation (e.g., gene model and locus designation) of Nipponbare for biological interrogation. Given the higher conservation of synteny in coding sequences compared with noncoding sequences, sequencing new intra-species genomes of rice for RNA-seq might not be worth the cost. This is an important point for consideration in deciding whether to analyze the RNA-seq data as a model species or as a non-model species.
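The comparison of candidate references by map rate and discordant mapping, as suggested in this section, can be sketched as follows. All read counts, the candidate names, and the 10% discordance threshold are hypothetical placeholders; in practice these numbers would be taken from the summary report of whichever aligner is used for the trial mapping.

```python
from dataclasses import dataclass


@dataclass
class MappingSummary:
    reference: str
    total_pairs: int       # read pairs submitted to the aligner
    mapped_pairs: int      # pairs with at least one reported alignment
    discordant_pairs: int  # pairs whose mates map far apart or to different chromosomes

    @property
    def map_rate(self) -> float:
        return self.mapped_pairs / self.total_pairs

    @property
    def discordant_rate(self) -> float:
        return self.discordant_pairs / self.mapped_pairs if self.mapped_pairs else 0.0


def rank_references(summaries, max_discordant=0.10):
    """Drop references with excessive discordant mapping, then sort by map rate."""
    usable = [s for s in summaries if s.discordant_rate <= max_discordant]
    return sorted(usable, key=lambda s: s.map_rate, reverse=True)


# Hypothetical trial mapping of a data subset against three candidate references.
trial = [
    MappingSummary("candidate_A", 1_000_000, 904_000, 18_000),
    MappingSummary("candidate_B", 1_000_000, 910_000, 200_000),
    MappingSummary("candidate_C", 1_000_000, 377_000, 30_000),
]

for s in rank_references(trial):
    print(f"{s.reference}: map rate {s.map_rate:.1%}, discordant {s.discordant_rate:.1%}")
```

Here candidate_B has the highest raw map rate but is excluded because more than a fifth of its mapped pairs are discordant, which is the pattern the text associates with an unsuitable reference.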

2.2.5 Phase III: De novo or hybrid assembly for non-model species

One of the hallmarks of RNA-seq is the application to transcriptome profiling of non-model species where no reference genome and/or transcriptome assembly has been established. Although the potential impact of this application is large, this process is computationally expensive and prone to errors. Benchmarking with sequenced species showed that the majority of reads can be mapped to de novo assemblies but type-I error (i.e., transcript is truncated due to insufficient overlaps) and type-II error (i.e., chimeric transcripts fused by unspecific reads) are prominent, regardless of the assembly method used (Honaas et al., 2016). It is hard to generalize the proportion of errors of each type but, in an earlier example, 29% of loci were lost due to type-I error (incomplete assembly) and 57% of assemblies showed type-II error (chimeric assembly).

Circumventing such errors is a high priority and there are several ways to do so: (i) include long reads; (ii) use a relatively distant genome as a reference; and (iii) use reference-guided assembly. Long reads that reach a few thousand bases from PacBio and Oxford Nanopore allow sequencing without prior fragmentation or PCR amplification. As discussed earlier, the error rate is quite high, but if there is enough depth to correct such errors by scaffolding, or if sequences are read multiple times (i.e., circular correction), there is a good chance that near full-length gene models can be recovered. Methods to correct long reads with short reads, or hybrid assembly using both short and long reads, are available. The advantage of using a relatively close genome as a reference in model species has been discussed in the previous section, in the examples from the genus Oryza. Lastly, reference-guided assembly can take advantage of both the reference genome and de novo assembly. In this method, the reads are first anchored to a reference, and the reads mapped to each genomic location are then assembled de novo to recover the fragments that are not represented in the reference (Fig. 2.1, middle panel). The unmapped reads can be de novo assembled to increase representation of loci that are absent in the reference. This hybrid approach could recover more accurate gene models but is still inadequate in recovering loci with large sequence divergence.

Fig. 2.3. Pairwise comparison of chromosome structure of chr01 of reference IRGSP-1.0 Nipponbare and recent assemblies from japonica, indica, aus, and Oryza nivara as outgroup. The numbers 1–19 denote the assembly number, where 1 is reference Nipponbare and 2–19 are paired assemblies. The dendrogram shows genetic similarity between different cultivars. The color indicates distribution of similar sequences and lines indicate possible inversion. Underlines indicate that the assembly includes long reads by PacBio system. NCBI genome assemblies (from left): koshihikari (GCA_000164945.1), A123 (GCA_000817635.1), HEG4 (GCA_000817615.1), Carolina Gold Select (GCA_004007595.1), hitomebore (GCA_000321445.1), NPB_PacBio (GCA_003865235.1), Oryza nivara (GCA_000576065.1), HR22 (GCA_000725085.2), Samba mahsuri (GCA_001305255.1), Minghui 63 (GCA_001618785.1), Zhenshan 97 (GCA_001618795.1), IR8_PacBio (GCA_001889745.1), Minghui 63_PacBio (GCA_001623365.2), 9311_PacBio (GCA_003865215.1), Zhenshan 97_PacBio (GCA_001623345.2), Shuhui 498_PacBio (GCA_002151415.1), N22_PacBio (GCA_001952365.1). Assemblies of cv. Kasalath and reference Nipponbare (Os-Nipponbare-Reference-IRGSP-1.0) downloaded from IRGSP (https://rapdb.dna.affrc.go.jp/download/irgsp1.html) (accessed 5 November 2019). The syntenic region and chromosome rearrangement were detected by Smash (Pratas et al., 2015). The genetic distance shown as a dendrogram was calculated based on k-mers by kWIP v0.2.0 (Murray et al., 2017).

In many cases the research question is to identify novel genes from non-model species as a means of explaining the expression of novel phenotypes. This basic assumption is fundamentally incompatible with the scope of reference-based or reference-guided assembly. In such cases, de novo assembly reconstructs the entire transcriptome to use as a reference for the particular species of interest. The memory-consuming and error-prone de novo assembly of transcripts is conducted by aligning the sequence reads with each other based on overlaps (Fig. 2.1, middle panel). The output assembly needs to be annotated by searching for protein domains and by identifying orthologous genes across various databases.

In preparation for de novo assembly, important factors that influence the quality of such assembly are not only the sequencing quality and depth but also the assembly methods. Adequate depth, or a large number of reads, helps in finding overlaps and obtaining contiguous assemblies even among low-expression gene loci. In addition, sampling time-points and tissues could be a significant limiting factor. There are many genes that are expressed only in specific conditions, such as stress. In a previous study, analysis of ten different abiotic stress conditions led to the identification of 4279 novel genes that were not identified and annotated ab initio in previous annotated models (Kawahara et al., 2016). The rice genome is undoubtedly one of the most scrutinized references, through both NGS and cDNA libraries, but the gene loci with expression under specific stress conditions seem to have been overlooked or missed in past studies. In this context, the scope of sampling determines the resolution of the study.

In any case, we should be aware that a transcriptome assembly should be interpreted much more carefully than a genome assembly. A de novo assembled transcriptome could be partial and could contain many mis-assemblies that cannot be removed unless a genome or comprehensive full-length cDNA libraries can be used for filtration. The completeness of a transcriptome assembly is commonly evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO), using datasets from evolutionarily closely related species on a core set of genes that are expected for any functional transcriptome (Simão et al., 2015). The tools and data subsets used in de novo transcriptome assembly should be defined.
Essentially, assembly is based on overlapping sequences and the goal is to recursively connect two sequence reads into one contig, just like a round-robin tournament between reads to find the best pair with statistical confidence. The number of tournaments is proportional to the diversity of k-mers, the sequence fragments used for merging two reads into one, and the length of the k-mers is an important choice. For example, longer k-mers promote specific bridging because of the reduced number of candidates, but they risk a false rejection, where one incorrect base could prevent bridging between reads from a similar locus. On the other hand, shorter k-mers are tolerant of incorrect bases that occur outside the k-mer region, but because short sequences are less specific, a large number of reads will be found as candidates. Given this trade-off, many tools adopt multiple k-mer lengths during assembly. For instance, SPAdes uses k-mer lengths of 21, 33, and 55 for 150 bp paired-end assembly and 21, 33, 55, 77, 99, and 127 for 250 bp paired-end assembly (Nurk et al., 2013).

Upon implementation of de novo assembly, the entire process can be completed by assembling the individual libraries, or multiple libraries can be combined. It is recommended to assemble all runs from an experiment in a single batch in order to increase the diversity of sequences and maximize the recovery of various loci (Davidson et al., 2017). However, this method requires a significant amount of computing memory, and it might not make biological sense to group reads together if sampling was from genetically distant organisms. Even among genetically related individuals, the RNA samples from severe treatment or adverse conditions are expected to have different splicing patterns or different degrees of degradation, such that mixing with others would not be appropriate. In these cases, grouping according to genotype or conditions might be a sound strategy. The goal is to construct a reference in which the majority of transcripts are represented with as few duplications and errors as possible. The initial assemblies are expected to be redundant (e.g., multiple assemblies from one locus), and errors (chimeric assemblies) need to be removed before the assembly is used as a reference.
The redundant and chimeric transcripts can be identified by mapping the unassembled reads back to the initial assembly, because they show multiple mapping and discordant reads (i.e., paired reads mapped to different assemblies). This process of eliminating redundant and chimeric transcripts is called SuperTranscripts construction (Davidson and Oshlack, 2014; Davidson et al., 2017). SuperTranscripts do not necessarily equal unigenes that originate from a single locus; rather, they enable comparison between different libraries (e.g., control versus experimental) as a unit of locus. In comparing different samples (e.g., multiple samplings from genetically or geographically distant populations), the clustering might not merge well, as the clusters represent only one library but not others. In such cases, clustering at the protein level could help to reduce the SuperTranscripts in a functional context, using detection tools such as OrthoFinder and SonicParanoid (Emms and Kelly, 2015; Cosentino and Iwasaki, 2019). In all cases, the bottom line is not to lose the functional transcripts of interest, and completeness of the assembly should be evaluated on both the raw assembly and the SuperTranscripts. The completeness of the assembly is commonly evaluated by BUSCO to avoid oversimplification of the de novo assembly (Simão et al., 2015). Proper protein prediction serves as a basis for annotation and for links to GO terms and KEGG (Kyoto Encyclopedia of Genes and Genomes).
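The trade-off in k-mer length described earlier in this section can be made concrete with a toy example. The reads below are hypothetical, and the shared-k-mer test is a deliberate simplification: it only asks whether two reads would be considered candidates for bridging, and is not how any particular assembler (which typically builds a de Bruijn graph) is implemented.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def can_bridge(read_a: str, read_b: str, k: int) -> bool:
    """Two reads are bridging candidates if they share at least one k-mer."""
    return bool(kmers(read_a, k) & kmers(read_b, k))


READ_A = "ATGGCGTACGTTAGC"
CANDIDATES = {
    # true neighbour: starts with the last 9 bases of READ_A
    "clean_overlap": "TACGTTAGCCTAGGA",
    # same neighbour with one sequencing error inside the overlap (T -> A)
    "overlap_with_error": "TACGATAGCCTAGGA",
    # read from another locus that happens to share a short motif (GTTAG)
    "unrelated_locus": "CCTTGTTAGGACCAA",
}

for k in (4, 6):
    accepted = [name for name, read in CANDIDATES.items() if can_bridge(READ_A, read, k)]
    print(f"k={k}: candidates {accepted}")
```

With k = 4 all three reads are candidates, including the unrelated one; with k = 6 the unrelated read is excluded, but so is the true neighbour that carries a single error. This is the motivation for using several k-mer lengths in one assembly.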


2.2.6 Phase III: Choice of aligner

With a reference in hand, either by selecting from assemblies or de novo SuperTranscripts, the next step is to map the raw sequence reads to obtain abundant counts and establish expression levels across libraries (Fig. 2.1, middle panel). Most NGS read-alignment tools for short sequences can be used as long as the tool accommodates splice-aware mapping (e.g., allows mapping when a read spans multiple exons), which can align RNA reads against a genome with introns. There are many tools that can be used, and HISAT2 (as a successor to bowtie/tophat/Hisat) and STAR are freely available and regarded to be the most effective tools for RNA-seq mapping against reference (Dobin et al., 2013). In the case of SuperTranscript reference, theoretically the intron is absent, therefore any NGS aligner can be used.

Most short-read aligners utilize a seed-and-extend approach where short seeds from the 5′-end are anchored to a given reference to initiate the alignment. In the seeding step, the genome index is built by suffix array and the Burrows Wheeler Transform (BWT), which enables the search for the maximum alignment block(s) called maximal exact matches (MEM). The inherent nature of MEM is that it does not allow mismatches or gaps unless it was combined with backtracking searches or used with graph alignment or complemented by a more traditional Smith-Waterman-like approach (Keel and Snelling, 2018; Lin and Hsu, 2018). Therefore, mapping requires attention to the threshold in the initial seeding stage, which is a critical step in initiating alignment and loss of match. STAR and HISAT2 performances are comparable, but STAR has been shown to be slightly inefficient in genetically distant or polymorphic genotypes. This appears to be due to more tolerance of mismatches by recursive mapping of entire reads, whereas HISAT2 terminates global alignment when short MEM has been found (Kim et al., 2015, 2019; Schaarschmidt et al., 2020). Additionally, STAR requires much more computing memory than HISAT2. Therefore, the choice of aligner should be decided based on memory requirement, reference quality, and read quality. The output of mapping should be checked for map rates and occurrence of multi-mapped reads in case mapping parameters need some adjustments.

Recently, pseudo-mapping has been gaining attention along with the alignment-based methods described earlier. The programs Kallisto and Salmon (Bray et al., 2016; Patro et al., 2017) identified k-mers at the location that differentiates one transcript from other similar transcripts based on reference information, then reduced the query read sequence into the k-mer combination and assigned the most likely transcripts it originated from. In this way, no base-to-base alignment is done, thereby reducing memory requirement, and allowing the mapping to finish in minutes compared with hours in alignment-based mapping. This method uses multiple iteration bootstrapping and the expectation–maximization (EM) algorithm, which distributes ambiguous/multi-mapped reads into possible isoforms so that the coverage will be equalized across different isoforms. The drawback of this method is that it cannot find transcripts that are not in the reference transcriptome collection and reads can be mapped falsely if there was a match in k-mer regardless of the loci of origin. In other words, the read can be assigned to the wrong reference based on k-mer profile and discards most of the reads that do not contain k-mers represented in the index. This method is magnitudes faster than standard mapping, but it is not tolerant of mismatches or gaps like the traditional method. Successful mapping depends on completeness of the reference transcriptome sequences and at this point this method is suitable only for highly scrutinized genomes with near-complete annotation. It is ideal to use a subset of data to try different tools and methods (i.e., mapping, or pseudo-mapping) along with simulated data generated from the expected reference to ensure the transcriptome is quantified appropriately and to determine sensitivity and other parameters in mapping (Gourlé et al., 2019).
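The two ideas behind pseudo-mapping described above, k-mer compatibility without base-to-base alignment and EM-based distribution of ambiguous reads, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used by Kallisto or Salmon: the transcripts and reads are hypothetical, the function names are invented for this example, and the EM step ignores transcript length and fragment-level models.

```python
from collections import defaultdict


def build_index(transcripts: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read: str, index, k: int) -> set[str]:
    """Intersect the transcript sets of all indexed k-mers found in the read."""
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer absent from the index is skipped
        result = set(hits) if result is None else result & hits
    return result or set()


def em_abundance(compat_sets, names, n_iter=50):
    """Estimate relative abundances by splitting ambiguous reads with EM."""
    usable = [cs for cs in compat_sets if cs]  # reads without any hit are discarded
    theta = {n: 1.0 / len(names) for n in names}
    if not usable:
        return theta
    for _ in range(n_iter):
        expected = {n: 0.0 for n in names}
        for cs in usable:
            total = sum(theta[n] for n in cs)
            for n in cs:
                expected[n] += theta[n] / total  # share of this read given to n
        theta = {n: expected[n] / len(usable) for n in names}
    return theta


# Hypothetical isoforms sharing a first exon but differing in the second exon.
EXON_A, EXON_B, EXON_C = "ATGGCTAGCT", "GGATCCGTAA", "TTCAGACCGT"
TRANSCRIPTS = {"iso1": EXON_A + EXON_B, "iso2": EXON_A + EXON_C}
K = 5
INDEX = build_index(TRANSCRIPTS, K)

READS = ["ATGGCTAG", "GGCTAGCT", "GGATCCGT", "GATCCGTA",
         "ATCCGTAA", "TTCAGACC", "CCCCCCCC"]
COMPAT = [compatible_transcripts(r, INDEX, K) for r in READS]
THETA = em_abundance(COMPAT, list(TRANSCRIPTS))
print(COMPAT)
print({name: round(value, 3) for name, value in THETA.items()})
```

The two reads from the shared exon are compatible with both isoforms and are split by EM in proportion to the evidence from the isoform-specific reads, while the read with no indexed k-mer is dropped. The latter is the behaviour the text flags as a drawback when the reference transcriptome is incomplete.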

2.2.7 Phase IV: Detection of differentially expressed transcripts and their gene loci

The raw counts from the mapping and pseudo-mapping results, which represent all alignments, need to be converted to the number of reads per gene and transcript to establish the expression level and abundance across different libraries (Fig. 2.1, bottom panel). The values should be normalized across libraries to enable proper statistical testing and direct comparisons, using fragments/reads per kilobase of transcript per million fragments mapped (FPKM/RPKM), counts per million (CPM), transcripts per million (TPM), trimmed mean of M values (TMM), or various other methods (Abbas-Aghababazadeh et al., 2018).

The earlier methods for calculating expression levels across libraries are based on FPKM/RPKM, which takes gene length into account under the basic assumption that longer genes have more reads mapped to a given locus. In FPKM/RPKM, the number of reads mapped to a given locus is divided both by the total length of its exons (in kilobases) and by the total number of mapped reads in the library (in millions). FPKM/RPKM is still a valid metric for comparing the abundance of a given locus relative to other loci within a library, but not across libraries, because it results in large differences in overall averages. The reason for the discrepancy is that a given reference has the same number of genes and the same gene length distribution, so only the total number of reads is an effective denominator overall. As a result, FPKM/RPKM rests on the assumption that the transcription rate for each locus is equal and that the number of reads is linear to the depth of the library. However, the transcription rate differs for genes of different lengths, and regulation occurs at different levels (Hetzel et al., 2016). If gene models are developed across different libraries for alternative splicing events, FPKM/RPKM could overestimate the expression of a lowly expressed locus by predicting a truncated gene model, because the gene length in the denominator is reduced.

To circumvent this normalization issue, TPM and CPM are often used so that the averages across libraries are constant. In the TPM calculation, the read count of each locus is first divided by its length in kilobases, and the resulting values are then scaled so that they sum to 1 million in each library. In CPM, gene length is disregarded, and expression is essentially the proportion of mapped reads assigned to a locus, scaled to 1 million. Both TPM and CPM equalize the total per library and satisfy the statistical requirements.

The basic assumption of equal transcription capacity across libraries might not fit all kinds of research questions. For example, samples from plants that went through stress treatment for an extended period might be more compromised in transcriptional and biochemical components than plants grown under ideal conditions. Similar differences could be found among plant samples from different developmental stages (e.g., young versus senescent leaves), different tissues (e.g., flowers versus photosynthetic leaves), or different domestication and cultivation status (Sun et al., 2020). It has been widely recognized that housekeeping genes can deviate from the basic assumptions, and such thinking was used as part of the baseline for the analysis of microarray-based transcriptome datasets. Unless it is compared with single-cell sequencing, this will be hard to address computationally. From this perspective, TMM normalization of read counts could potentially address these issues by selecting a group of genes to represent constant expression levels.
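The length and depth scalings described above (CPM, FPKM/RPKM, TPM) can be compared side by side in a short sketch. The function names and the three-gene library are illustrative assumptions; real pipelines typically use effective transcript lengths and fragment rather than read counts.

```python
import numpy as np

def cpm(counts):
    """Counts per million: depth scaling only, gene length ignored."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum() * 1e6

def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    return counts / lengths_kb / (counts.sum() / 1e6)

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    rate = counts / lengths_kb
    return rate / rate.sum() * 1e6

if __name__ == "__main__":
    counts = [100, 200, 700]        # hypothetical reads per gene
    lengths = [1000, 2000, 7000]    # hypothetical exon lengths in bp
    print(cpm(counts))
    print(rpkm(counts, lengths), rpkm(counts, lengths).sum())
    print(tpm(counts, lengths), tpm(counts, lengths).sum())
```

CPM and TPM always sum to 1 million per library, whereas the RPKM total depends on the length distribution of the expressed genes, which is why RPKM averages differ between libraries.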
TMM normalization is implemented in edgeR, which models read counts with the negative binomial distribution, a model that fits gene expression data well because lowly expressed genes show relatively higher variability than highly expressed genes. First, genes/transcript loci with extreme expression are trimmed, i.e., the 30% highest and lowest genes in terms of fold-change expression, and the 5% most and least abundant genes/transcripts. The remaining untrimmed genes are then used to calculate the normalization factor across libraries. Essentially, the untrimmed group of genes/transcripts is treated as a group of “housekeeping genes” that represent the basal transcriptional capacity expected of different samples (e.g., tissues, treatments, genotypes). DESeq/DESeq2 normalization, which rests on a basic assumption similar to that of TMM normalization in edgeR, may also be used, but the number of genes detected as differentially expressed varies across benchmarks and experiments (Li et al., 2022). The choice should be made based on biological assumptions, expectations (stringency), and trial comparisons.
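The trimming logic can be made concrete with a simplified sketch. It assumes one sample and one reference library, applies the 30% fold-change and 5% abundance trims on each tail, and then takes an unweighted mean of the remaining log-ratios. The precision weighting and reference selection used in practice are left out, so the function name and behavior are illustrative only.

```python
import numpy as np

def tmm_factor(sample, ref, m_trim=0.30, a_trim=0.05):
    """Simplified trimmed-mean-of-M-values scaling factor (unweighted).

    sample, ref: raw read counts per gene for two libraries.
    """
    sample = np.asarray(sample, dtype=float)
    ref = np.asarray(ref, dtype=float)
    keep = (sample > 0) & (ref > 0)          # log-ratios need nonzero counts
    p_s = sample[keep] / sample.sum()        # proportions within each library
    p_r = ref[keep] / ref.sum()
    m = np.log2(p_s / p_r)                   # M: log fold change
    a = 0.5 * np.log2(p_s * p_r)             # A: mean log abundance
    m_lo, m_hi = np.quantile(m, [m_trim, 1 - m_trim])
    a_lo, a_hi = np.quantile(a, [a_trim, 1 - a_trim])
    core = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
    return 2 ** m[core].mean()               # factor from the untrimmed genes

if __name__ == "__main__":
    ref = np.arange(1, 21) * 50.0
    deeper = 2 * ref                         # same composition, twice the depth
    shifted = ref.copy()
    shifted[-1] *= 20                        # one gene strongly induced
    print(tmm_factor(deeper, ref))
    print(tmm_factor(shifted, ref))
```

A pure difference in sequencing depth gives a factor of 1, while a library dominated by one strongly induced gene gives a factor below 1, reflecting that the remaining genes received a smaller share of the reads.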

2.3 Conclusions and Perspectives

The normalization and statistical detection of differentially expressed genes serve as the basis for functional annotation. They also serve as the backbone for co-expression clustering, genetic network analysis, linking to biological pathways, and the extraction and synthesis of knowledge on processes and phenotypes. While many plant genes and traits can be examined through the paradigm of Mendelian genetics, the majority of traits, particularly those of agronomic significance whose expression is heavily influenced by environmental factors, are quantitative in nature, and their variances cannot be easily explained by simple mutation-based forward or reverse genetics. The fundamental promise of the RNA-seq approach is to decipher the individual actions and synergistic interactions of multiple genes at the spatio-temporal scale. The molecular basis of such quantitative traits is much more difficult to delineate and remains unexplained even in well-studied systems. For instance, human height is a classic example of a multi-genic trait, and extensive genomic and transcriptomic studies have led to the conclusion that only a fraction of genes has a direct contribution in cis (thus named core genes), and the rest (61–88%) are not predicted to have a direct contribution, functioning in trans (thus named peripheral genes). The synergy between the cis and trans components defines the regulatory networks that control cellular biochemical and physiological processes (Liu et al., 2019). For the peripheral genes, the associated variants cover 40–62% of the SNP database for human height alone (Boyle et al., 2017). The number of protein-coding genes and noncoding loci that comprise the trans components could easily be in the thousands. The case of human height is not a special or specific scenario; rather, the same principles apply to many traits across multicellular eukaryotes, including plants. For instance, QTL studies in rice for environmental stress tolerance and yield potential have been able to explain only about 30% of the phenotypic variances. It can be assumed that the major-effect QTLs that can be detected by genome-wide association studies (GWAS) function as core genes acting in cis (Moncada et al., 2001; Thomson et al., 2003; Zhang et al., 2005; Bimpong et al., 2011; Gonzaga et al., 2017). These findings illustrate the inherent limitations of the paradigm of “finding one transcription factor that explains all” or “finding a variant that can be used for predicting a trait”. They further support the need to understand how the genome functions in a holistic manner, a critical component of which is the ability to unravel the complex fluxes in the transcriptome in time and space, and how those fluxes translate into networks that define the additive effects of multiple genes, hence gene networks defining quantitative traits (de los Reyes, 2019).

Acknowledgment

This work was supported by the National Science Foundation (NSF-IOS) Plant Genome Research Program Grant-1602494, and Bayer CropScience Endowed Professorship. Genomic computations were performed using the supercomputing facilities at the Research Organization of Information and Systems (ROIS) National Institute of Genetics, Mishima, Japan.

References

Abbas-Aghababazadeh, F., Li, Q. and Fridley, B.L. (2018) Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing. PLoS ONE 13(10), e0206312. DOI: 10.1371/journal.pone.0206312.
Addo-Quaye, C., Eshoo, T.W., Bartel, D.P. and Axtell, M.J. (2008) Endogenous siRNA and miRNA targets identified by sequencing of the Arabidopsis degradome. Current Biology 18(10), 758–762. DOI: 10.1016/j.cub.2008.04.042.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.
Araújo, I.S., Pietsch, J.M., Keizer, E.M., Greese, B., Balkunde, R. et al. (2017) Stochastic gene expression in Arabidopsis thaliana. Nature Communications 8(1), 2132. DOI: 10.1038/s41467-017-02285-7.
Bimpong, I.K., Serraj, R., Chin, J.H., Ramos, J., Mendoza, E.M.T. et al. (2011) Identification of QTLs for drought-related traits in alien introgression lines derived from crosses of rice (Oryza sativa cv. IR64) × O. glaberrima under lowland moisture stress. Journal of Plant Biology 54(4), 237–250. DOI: 10.1007/s12374-011-9161-z.
Boyle, E.A., Li, Y.I. and Pritchard, J.K. (2017) An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186. DOI: 10.1016/j.cell.2017.05.038.
Bray, N.L., Pimentel, H., Melsted, P. and Pachter, L. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5), 525–527. DOI: 10.1038/nbt.3519.
Cosentino, S. and Iwasaki, W. (2019) SonicParanoid: fast, accurate and easy orthology inference. Bioinformatics (Oxford, England) 35(1), 149–151. DOI: 10.1093/bioinformatics/bty631.
Das, M., Haberer, G., Panda, A., Das Laha, S., Ghosh, T.C. et al. (2016) Expression pattern similarities support the prediction of orthologs retaining common functions after gene duplication events. Plant Physiology 171(4), 2343–2357. DOI: 10.1104/pp.15.01207.
Davidson, N.M. and Oshlack, A. (2014) Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biology 15(7), 410. DOI: 10.1186/s13059-014-0410-6.
Davidson, N.M., Hawkins, A.D.K. and Oshlack, A. (2017) SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes. Genome Biology 18(1), 148. DOI: 10.1186/s13059-017-1284-1.
de los Reyes, B.G. (2019) Genomic and epigenomic bases of transgressive segregation: new breeding paradigm for novel plant phenotypes. Plant Science 288, 110213. DOI: 10.1016/j.plantsci.2019.110213.
de los Reyes, B.G., Mohanty, B., Yun, S.J., Park, M.-R. and Lee, D.-Y. (2015) Upstream regulatory architecture of rice genes: summarizing the baseline towards genus-wide comparative analysis of regulatory networks and allele mining. Rice (New York, N.Y.) 8, 14. DOI: 10.1186/s12284-015-0041-x.
de los Reyes, B.G., Kim, Y.S., Mohanty, B., Kumar, A., Kitazumi, A. et al. (2018) Cold and water deficit regulatory mechanisms in rice: optimizing stress tolerance potential by pathway integration and network engineering. In: Sasaki, T. and Ashikari, M. (eds) Rice Genomics, Genetics and Breeding. Springer Nature, Singapore, pp. 317–359.
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29(1), 15–21. DOI: 10.1093/bioinformatics/bts635.
Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16, 157. DOI: 10.1186/s13059-015-0721-2.
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. DOI: 10.1038/nature11247.
ENCODE Project Consortium, Snyder, M.P., Gingeras, T.R., Moore, J.E., Weng, Z., Gerstein, M.B. et al. (2020) Perspectives on ENCODE. Nature 583(7818), 693–698. DOI: 10.1038/s41586-020-2449-8.
Escalona, M., Rocha, S. and Posada, D. (2016) A comparison of tools for the simulation of genomic next-generation sequencing data. Nature Reviews Genetics 17(8), 459–469. DOI: 10.1038/nrg.2016.57.
FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573.
Gibney, E.R. and Nolan, C.M. (2010) Epigenetics and gene expression. Heredity 105(1), 4–13. DOI: 10.1038/hdy.2010.54.

Gonzaga, Z.J.C., Carandang, J., Singh, A., Collard, B.C.Y., Thomson, M.J. et al. (2017) Mapping QTLs for submergence tolerance in rice using a population fixed for SUB1A tolerant allele. Molecular Breeding 37(4), 47. DOI: 10.1007/s11032-017-0637-5.
Goodwin, S., McPherson, J.D. and McCombie, W.R. (2016) Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics 17(6), 333–351. DOI: 10.1038/nrg.2016.49.
Gourlé, H., Karlsson-Lindsjö, O., Hayer, J. and Bongcam-Rudloff, E. (2019) Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics (Oxford, England) 35(3), 521–522. DOI: 10.1093/bioinformatics/bty630.
Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F. et al. (2009) Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458(7235), 223–227. DOI: 10.1038/nature07672.
Hetzel, J., Duttke, S.H., Benner, C. and Chory, J. (2016) Nascent RNA sequencing reveals distinct features in plant transcription. Proceedings of the National Academy of Sciences 113(43), 12316–12321. DOI: 10.1073/pnas.1603217113.
Honaas, L.A., Wafula, E.K., Wickett, N.J., Der, J.P., Zhang, Y. et al. (2016) Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome. PLoS ONE 11(1), e0146062. DOI: 10.1371/journal.pone.0146062.
Huang, X., Kurata, N., Wei, X., Wang, Z.-X., Wang, A. et al. (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490(7421), 497–501. DOI: 10.1038/nature11532.
Imamachi, N., Tani, H., Mizutani, R., Imamura, K., Irie, T. et al. (2014) BRIC-seq: a genome-wide approach for determining RNA stability in mammalian cells. Methods (San Diego, Calif.) 67(1), 55–63. DOI: 10.1016/j.ymeth.2013.07.014.
Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S. and Weissman, J.S. (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324(5924), 218–223. DOI: 10.1126/science.1168978.
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800.
Jones-Rhoades, M.W., Bartel, D.P. and Bartel, B. (2006) MicroRNAs and their regulatory roles in plants. Annual Review of Plant Biology 57, 19–53. DOI: 10.1146/annurev.arplant.57.032905.105218.
Kawahara, Y., de la Bastide, M., Hamilton, J.P., Kanamori, H., McCombie, W.R. et al. (2013) Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice (New York, N.Y.) 6(1), 4. DOI: 10.1186/1939-8433-6-4.
Kawahara, Y., Oono, Y., Wakimoto, H., Ogata, J., Kanamori, H. et al. (2016) TENOR: database for comprehensive mRNA-Seq experiments in rice. Plant & Cell Physiology 57(1), e7. DOI: 10.1093/pcp/pcv179.
Keel, B.N. and Snelling, W.M. (2018) Comparison of Burrows-Wheeler transform-based mapping algorithms used in high-throughput whole-genome sequencing: application to Illumina data for livestock genomes. Frontiers in Genetics 9, 35. DOI: 10.3389/fgene.2018.00035.
Kim, D., Langmead, B. and Salzberg, S.L. (2015) HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12(4), 357–360. DOI: 10.1038/nmeth.3317.
Kim, D., Paggi, J.M., Park, C., Bennett, C. and Salzberg, S.L. (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37(8), 907–915. DOI: 10.1038/s41587-019-0201-4.
Kitazumi, A., Kawahara, Y., Onda, T.S., De Koeyer, D. and de los Reyes, B.G. (2015) Implications of miR166 and miR159 induction to the basal response mechanisms of an andigena potato (Solanum tuberosum subsp. andigena) to salinity stress, predicted from network models in Arabidopsis. Genome 58, 13–24.
Kitazumi, A., Pabuayon, I.C.M., Ohyanagi, H., Fujita, M., Osti, B. et al. (2018) Potential of Oryza officinalis to augment the cold tolerance genetic mechanisms of Oryza sativa by network complementation. Scientific Reports 8(1), 16346. DOI: 10.1038/s41598-018-34608-z.
Klepikova, A.V. and Penin, A.A. (2019) Gene expression maps in plants: current state and prospects. Plants 8(9), 309. DOI: 10.3390/plants8090309.
Knief, C. (2014) Analysis of plant microbe interactions in the era of next generation sequencing technologies. Frontiers in Plant Science 5, 216. DOI: 10.3389/fpls.2014.00216.
Koren, S., Walenz, B.P., Berlin, K., Miller, J.R., Bergman, N.H. et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research 27(5), 722–736. DOI: 10.1101/gr.215087.116.

Kwasnieski, J.C., Mogno, I., Myers, C.A., Corbo, J.C. and Cohen, B.A. (2012) Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proceedings of the National Academy of Sciences 109(47), 19498–19503. DOI: 10.1073/pnas.1210678109.
Lamarre, S., Frasse, P., Zouine, M., Labourdette, D., Sainderichin, E. et al. (2018) Optimization of an RNA-Seq differential gene expression analysis depending on biological replicate number and library size. Frontiers in Plant Science 9, 108. DOI: 10.3389/fpls.2018.00108.
Law, J.A. and Jacobsen, S.E. (2010) Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genetics 11(3), 204–220. DOI: 10.1038/nrg2719.
Levin, J.Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D.A. et al. (2010) Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods 7(9), 709–715. DOI: 10.1038/nmeth.1491.
Li, Y., Ge, X., Peng, F., Li, W. and Li, J.J. (2022) Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology 23(1), 79. DOI: 10.1186/s13059-022-02648-4.
Lin, H.N. and Hsu, W.L. (2018) DART: a fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics (Oxford, England) 34(2), 190–197. DOI: 10.1093/bioinformatics/btx558.
Liu, X., Li, Y.I. and Pritchard, J.K. (2019) Trans effects on gene expression can drive omnigenic inheritance. Cell 177(4), 1022–1034. DOI: 10.1016/j.cell.2019.04.014.
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. and Gilad, Y. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18(9), 1509–1517. DOI: 10.1101/gr.079558.108.
Metzker, M.L. (2010) Sequencing technologies: the next generation. Nature Reviews Genetics 11(1), 31–46. DOI: 10.1038/nrg2626.
Meyer, K.D., Saletore, Y., Zumbo, P., Elemento, O., Mason, C.E. et al. (2012) Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell 149(7), 1635–1646. DOI: 10.1016/j.cell.2012.05.003.
Moncada, P., Martínez, C.P., Borrero, J., Chatel, M., Gauch Jr, H. et al. (2001) Quantitative trait loci for yield and yield components in an Oryza sativa × Oryza rufipogon BC2F2 population evaluated in an upland environment. Theoretical and Applied Genetics 102(1), 41–52. DOI: 10.1007/s001220051616.
Murray, K.D., Webers, C., Ong, C.S., Borevitz, J. and Warthmann, N. (2017) kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity. PLoS Computational Biology 13(9), e1005727. DOI: 10.1371/journal.pcbi.1005727.
Nawy, T. (2013) End-to-end RNA sequencing. Nature Methods 10(12), 1144–1145. DOI: 10.1038/nmeth.2750.
Nurk, S., Bankevich, A., Antipov, D., Gurevich, A., Korobeynikov, A. et al. (2013) Assembling genomes and mini-metagenomes from highly chimeric reads. In: Deng, M., Jiang, R., Sun, F. and Zhang, X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, Vol. 7821. Springer, Berlin, Heidelberg, pp. 158–170.
Paaby, A.B. and Rockman, M.V. (2014) Cryptic genetic variation: evolution’s hidden substrate. Nature Reviews Genetics 15(4), 247–258. DOI: 10.1038/nrg3688.
Pabuayon, I.C.M., Kitazumi, A., Gregorio, G.B., Singh, R.K. and de los Reyes, B.G. (2020) Contributions of adaptive plant architecture to transgressive salinity tolerance in recombinant inbred lines of rice: molecular mechanisms based on transcriptional networks. Frontiers in Genetics 11, 594569. DOI: 10.3389/fgene.2020.594569.
Panchy, N., Lehti-Shiu, M. and Shiu, S.H. (2016) Evolution of gene duplication in plants. Plant Physiology 171, 2294–2316. DOI: 10.1104/pp.16.00523.
Patro, R., Duggal, G., Love, M.I., Irizarry, R.A. and Kingsford, C. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419. DOI: 10.1038/nmeth.4197.
Patterson, J., Carpenter, E.J., Zhu, Z., An, D., Liang, X. et al. (2019) Impact of sequencing depth and technology on de novo RNA-Seq assembly. BMC Genomics 20(1), 604. DOI: 10.1186/s12864-019-5965-x.
Pratas, D., Silva, R.M., Pinho, A.J. and Ferreira, P.J.S.G. (2015) An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Scientific Reports 5, 10203. DOI: 10.1038/srep10203.
Sanfilippo, P., Miura, P. and Lai, E.C. (2017) Genome-wide profiling of the 3′ ends of polyadenylated RNAs. Methods (San Diego, Calif.) 126, 86–94. DOI: 10.1016/j.ymeth.2017.06.003.

Schaarschmidt, S., Fischer, A., Zuther, E. and Hincha, D.K. (2020) Evaluation of seven different RNA-seq alignment tools based on experimental data from the model plant Arabidopsis thaliana. International Journal of Molecular Sciences 21(5), 1720. DOI: 10.3390/ijms21051720.
Schurch, N.J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A. et al. (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA (New York, N.Y.) 22(6), 839–851. DOI: 10.1261/rna.053959.115.
Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013) A single-molecule long-read survey of the human transcriptome. Nature Biotechnology 31(11), 1009–1014. DOI: 10.1038/nbt.2705.
Shenton, M., Kobayashi, M., Terashima, S., Ohyanagi, H., Copetti, D. et al. (2020) Evolution and diversity of the wild rice Oryza officinalis complex, across continents genome types, and ploidy levels. Genome Biology and Evolution 12, 413–428. DOI: 10.1093/gbe/evaa037.
Shlyueva, D., Stampfel, G. and Stark, A. (2014) Transcriptional enhancers: from properties to genome-wide predictions. Nature Reviews Genetics 15(4), 272–286. DOI: 10.1038/nrg3682.
Sibley, C.R., Blazquez, L. and Ule, J. (2016) Lessons from non-canonical splicing. Nature Reviews Genetics 17(7), 407–421. DOI: 10.1038/nrg.2016.46.
Simão, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V. and Zdobnov, E.M. (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics (Oxford, England) 31(19), 3210–3212. DOI: 10.1093/bioinformatics/btv351.
Sjöstedt, E., Zhong, W., Fagerberg, L., Karlsson, M., Mitsios, N. et al. (2020) An atlas of the protein-coding genes in the human, pig, and mouse brain. Science 367(6482), eaay5947. DOI: 10.1126/science.aay5947.
Sorenson, R.S., Deshotel, M.J., Johnson, K., Adler, F.R. and Sieburth, L.E. (2018) Arabidopsis mRNA decay landscape arises from specialized RNA decay substrates, decapping-mediated feedback, and redundancy. Proceedings of the National Academy of Sciences 115(7), E1485–E1494. DOI: 10.1073/pnas.1712312115.
Sun, X.-M., Bowman, A., Priestman, M., Bertaux, F., Martinez-Segura, A. et al. (2020) Size-dependent increase in RNA polymerase II initiation rates mediates gene expression scaling with cell size. Current Biology 30(7), 1217–1230. DOI: 10.1016/j.cub.2020.01.053.
Thomson, M.J., Tai, T.H., McClung, A.M., Lai, X.-H., Hinga, M.E. et al. (2003) Mapping quantitative trait loci for yield, yield components and morphological traits in an advanced backcross population between Oryza rufipogon and the Oryza sativa cultivar Jefferson. Theoretical and Applied Genetics 107(3), 479–493. DOI: 10.1007/s00122-003-1270-8.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D. et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7(3), 562–578. DOI: 10.1038/nprot.2012.016.
Ulitsky, I. (2016) Evolution to the rescue: using comparative genomics to understand long non-coding RNAs. Nature Reviews Genetics 17(10), 601–614. DOI: 10.1038/nrg.2016.85.
Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63. DOI: 10.1038/nrg2484.
Warner, J.R. (1999) The economics of ribosome biosynthesis in yeast. Trends in Biochemical Sciences 24(11), 437–440. DOI: 10.1016/S0968-0004(99)01460-7.
Yamashita, R., Sathira, N.P., Kanai, A., Tanimoto, K., Arauchi, T. et al. (2011) Genome-wide characterization of transcriptional start sites in humans by integrative transcriptome analysis. Genome Research 21, 775–789. DOI: 10.1101/gr.110254.110.
Yun, K.-Y., Park, M.R., Mohanty, B., Herath, V., Xu, F. et al. (2010) Transcriptional regulatory network triggered by oxidative signals configures the early response mechanisms of japonica rice to chilling stress. BMC Plant Biology 10(1), 16. DOI: 10.1186/1471-2229-10-16.
Zhang, Z.-H., Li, S., Wei, L., Wei, C. and Ying-Guo, Z. (2005) A major QTL conferring cold tolerance at the early seedling stage using recombinant inbred lines of rice (Oryza sativa L.). Plant Science 168(2), 527–534. DOI: 10.1016/j.plantsci.2004.09.021.
Zhao, J., Ohsumi, T.K., Kung, J.T., Ogawa, Y., Grau, D.J. et al. (2010) Genome-wide identification of polycomb-associated RNAs by RIP-seq. Molecular Cell 40(6), 939–953. DOI: 10.1016/j.molcel.2010.12.011.

3 Plant Proteomics

Setsuko Komatsu1* and Ghazala Mustafa2
1Faculty of Environment and Information Sciences, Fukui University of Technology, Japan; 2Department of Plant Sciences, Quaid-I-Azam University, Islamabad, Pakistan

Abstract

The advent of proteomic techniques has made it possible to identify a broad spectrum of proteins in living systems. This capability is especially useful for crop breeding, as it gives clues not only about nutritional value but also about yield, and about how these factors are affected by adverse conditions. This chapter describes recent progress in plant proteomics and highlights the achievements made in understanding the proteins of soybean, a major crop, under flooding stress. The main emphasis is on crop responses to abiotic and biotic stresses. Rigorous genetic testing of the role of possibly important proteins can be conducted. Currently, a massive amount of research output at the DNA, mRNA, and protein levels is available, and it is suggested that the proteome is now a key data layer to be applied in practical crop breeding.

*Corresponding author: skomatsu@fukui-ut.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0003

3.1 Introduction

Proteomics is a cutting-edge technique that provides direct insight into the molecular functions of the cell, as it deals with the ultimate functional molecules rather than genetic codes or messenger molecules. Knowledge of plant proteins, as well as their dynamics in response to environmental and biological stressors, may significantly impact the improvement of their yield and nutritional properties. Functional genomic tools, such as proteomics, are widely used in plant research (Wang and Komatsu, 2016a). They are routinely employed in comprehensive profiling of complex protein extracts and deliver valuable qualitative and quantitative information on protein dynamics in plant organisms. Implementation of the cell fractionation technique allows reasonable simplification of the protein sample matrix and provides better insight into the molecular organization of individual compartments, as well as macromolecular complexes (Komatsu and Hossain, 2013). More specifically, the global proteomic approach addresses alterations in multiple biological processes, occurring sequentially and in parallel at the tissue, cell, or subcellular level (Watson et al., 2003). Based on the acquired information, plant development can be systematically characterized.

In natural conditions, living organisms are always faced with abiotic and biotic stresses (Mittler, 2006). Organisms must adjust their physiology and development to assure their survival under changing conditions over time scales ranging from diurnal, seasonal, and annual periodicity to several years or even several tens of years (Iizumi et al., 2014). As sessile organisms, plants have to endure a wide variety of environmental stresses while staying in the same habitat during their entire life cycle. Modulation of protein activity occurs at the level of transcription, which determines protein abundance, followed by regulation at the translational and post-translational levels (Hashiguchi and Komatsu, 2016). Among them, post-translational modifications are crucial because they can fine-tune protein function, localization, half-life, and interactions to mitigate the potential damage of environmental stresses. In this chapter, the advantages of proteomic techniques are discussed, and subcellular and post-translational proteomics are described. Furthermore, biological processes against flooding in soybean are chosen to describe the application of plant proteomics and to guide future practice in this area.

3.2 Proteomic Technology in Plant Science

In plant science, diverse proteomic techniques have been used to understand the molecular basis of different cellular functions, including plant responses to stress conditions (Komatsu et al., 2015). The first step in an efficient proteomic study is the extraction of proteins while maintaining their integrity. The choice of method for protein extraction largely depends on the type of plant organelle and organ, and on the nature of the proteins to be extracted. Another important point is the removal of contaminants from the sample during protein extraction. Multiple methods are used for the lysis and extraction of proteins from tissue. Natarajan et al. (2005) compared four protein extraction/solubilization methods (urea, thiourea/urea, phenol, and a modified trichloroacetic acid/acetone method) to determine their effectiveness in separating soybean seed proteins by two-dimensional gel electrophoresis. Another widely used method is trichloroacetic acid/acetone precipitation. In gel-free proteomics, the chloroform-methanol extraction method is used, followed by reduction with dithiothreitol, alkylation with iodoacetamide, and digestion with trypsin and lysyl endopeptidase (Komatsu et al., 2013). For subcellular proteomics, proteins from organelles are extracted by using multiple techniques (Wang and Komatsu, 2016a). Highly efficient lysis and extraction of proteins can be achieved using several lysis buffer formulations.

Each proteomic analysis follows one of two main strategies: gel-based and gel-free (Fig. 3.1). The gel-based method for protein quantification was established by O’Farrell (1975). It separates proteins in two dimensions of gel electrophoresis: the first dimension is isoelectric focusing, which separates proteins according to their isoelectric points; in the second dimension, proteins are separated according to their molecular weight by sodium dodecyl sulfate-polyacrylamide gel electrophoresis. In quantitative two-dimensional difference gel electrophoresis, the protein samples are labeled with a fluorescent dye that improves the quality and number of identified protein spots (Arruda et al., 2011). However, the gel-based strategy has certain limitations. For instance, low-abundance proteins and any proteins with post-translational modifications are not detectable with this strategy (León et al., 2013).

The gel-free techniques, which comprise in-solution digestion of protein samples, are categorized into two types: labeled and label-free. Metabolic in vivo labeling techniques such as SILAC (stable isotope labeling with amino acids in cell culture) and 15N labeling allow quantification with smaller measurement bias. ICAT (isotope-coded affinity tag), 18O labeling, TMT (tandem mass tags), and iTRAQ (isobaric tags for relative and absolute quantification) are chemical in vitro labeling methods that can be applied to static samples like clinical samples (Rauniyar and Yates, 2014). On the other hand, in label-free proteomic techniques, the digested peptides are separated by liquid chromatography followed by mass spectrometry. Liquid chromatography–mass spectrometry (LC–MS/MS) allows broad-range identification of proteins (Neilson et al., 2011). The label-free approach requires less time for sample preparation and identification. Gel-free techniques work very well for the characterization of subcellular proteomes. In particular, the advantage of gel-free/label-free proteomics lies in its simplicity of sample preparation and ease of data production.

In the past decade, many studies have been performed to explore the molecular mechanisms of plants using gel-based and gel-free proteomics (Table 3.1). Plant researchers opted for various


S. Komatsu and G. Mustafa

Fig. 3.1. Schematic representation of the proteomic technologies adapted for plant sciences. Integrated gel-based and gel-free approaches coupled with bioinformatic tools can be used for more comprehensive understanding of the molecular mechanisms.

strategies to explore the molecular mechanisms. Li et al. (2016) used iTRAQ labeling with cation exchange–nanoLC–MS/MS to understand the proteomic alterations in maize on exposure to lead stress. Lead stress increased the abundance of proteins related to stress, redox, signaling, and transport while decreasing that of proteins related to nucleotide metabolism, amino acid metabolism, RNA, and protein metabolism. In black pepper exposed to the biotic stressor Phytophthora capsici (a plant pathogen), proteins related to carbon fixation, cyano-amino acid metabolism, fructose and mannose metabolism, glutathione metabolism, and phenylpropanoid biosynthesis were decreased (Mahadevan et al., 2016).
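The digestion step of the gel-free workflow in Section 3.2 can be illustrated with a minimal in silico sketch in Python. It assumes the commonly used simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name tryptic_digest, its missed_cleavages parameter, and the example sequence are hypothetical and are not taken from the studies cited here; lysyl endopeptidase is not modeled.

```python
import re

# Simplified trypsin rule (assumption): cleave C-terminal to K or R,
# but not when the following residue is P.
_TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_digest(sequence: str, missed_cleavages: int = 0) -> list[str]:
    """Return peptides from an in silico tryptic digest of a protein sequence."""
    # Splitting on a zero-width pattern requires Python 3.7 or later.
    fragments = [f for f in _TRYPSIN_SITE.split(sequence.upper()) if f]
    peptides = []
    for start in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(fragments[start:end + 1]))
    return peptides


if __name__ == "__main__":
    example = "AAKRPGGRCCKPDD"  # hypothetical sequence, for illustration only
    print(tryptic_digest(example))
```

Allowing one or two missed cleavages is a common setting in database searches because digestion is rarely complete; the parameter here only enumerates the resulting longer peptides and does not model digestion efficiency.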

3.3 Plant-subcellular Proteomics

Subcellular proteomic studies provide in-depth understanding of subcellular function in plants; however, their experimental workflow includes steps aimed at purifying organelles. Subcellular proteomics has the potential to elucidate localized cellular responses and to investigate communication among subcellular components during plant growth and under stresses.

3.3.1 Importance of plant-subcellular proteomics

Plant cells are characterized by a high degree of compartmentalization and have evolved diverse developmental changes and responses to environmental conditions. Because of these diverse response strategies, plant cells show a diversified proteome. Subcellular organelle proteomics depicts the proteomic profiles that are present in particular organelles. Organelles are membrane-bounded compartments within a cell that regulate cellular function by interacting with one another (Gupta et al., 2016). Subcellular proteomics reveals the localized cellular responses and sheds light on the possible interactions between subcellular compartments during the processes of plant development and stress response. Briefly, the plant cell wall is a composite structure around the cell, acting as the first defensive layer toward external stress, and its proteins play a crucial role

Maize inbred line 9782

Soybean cv. Surge & Davison

Maize seeds

Rice cv. N22 & Mianhui101

Rice cv. Xiushui 134

Drought/heat (42°C day)

Drought/heat (42°C day)

High temperature (37°C)

Quinclorac herbicide SA pretreatment

Plant

Pb

Abiotic stress

Stresses

4 leaf stage

Anthesis stage

–

3 weeks

6 day old seedlings

Growth stage

6h

72 h

8h

7 days no-water

Treatment time

Table 3.1. Proteomic technology adopted by plant science.

Leaves

Anthers

Leaves

Leaves

Leaves

Organ

Major findings

Das et al., 2016

Li et al., 2016

References

iTRAQ, LC-MS/MS

iTRAQ, NSI-MS/MS, Q Exactive

Mu et al., 2017

Continued

Increased: ascorbate/glutathione metabolism, porphyrin/chlorophyll metabolism, heat shock protein 70
Decreased: chlorophyll synthesis related proteins, photosynthesis

Wang et al., 2017

Increased: small heat shock protein, β-expansins, lipid transfer proteins in N22 Decreased: α-expansins, three β-expansins in Mianhui 101

EI-MS/MS, Q Exactive, iTRAQ, SCX LC-MS/MS

Increased: heat shock proteins, small heat shock proteins, ethylene responsive protein, ripening-inducible like protein

Zhao et al., 2016

2D-DIGE, MALDI-TOF/MS

Increased: ROS levels, photosynthesis, heat stress-induced EF-Tu protein, heat shock-related proteins in Surge
Decreased: ROS levels, ABA, Calvin cycle, carbon fixation during both stresses in Davison

iTRAQ, SCX LC-MS/MS

Increased: stress, redox, signaling, transport
Decreased: nucleotide metabolism, amino acid metabolism, RNA, protein metabolism

Proteomic methodologies

Plant Proteomics 33

Foxtail millet var. Yugu1

Cassava cv. Xinxuan 048

4 var. of peanut

Drought

Drought

Drought

20 day old seedlings

30 days

3 weeks

Barley genotypes TF57 & TF58

4 leaf stage

Water logging

Growth stage

Plant

Stresses

Table 3.1. Continued Proteomic methodologies Major findings

References

Leaves

Leaves

Shan et al., 2018

Pan et al., 2018

Continued

2DE, PMF, MALDI-TOF/TOF MS

Increased: late embryogenesis abundant protein, calcium ion binding protein, sucrose synthase isoform
Decreased: RuBisCO, cytosolic ascorbate peroxidase,

Thangella et al., 2018

TMT, LC-MS/MS Increased: photosynthesis, carotenoid biosynthesis, aminoacyl-tRNA biosynthesis, photosynthesis antenna proteins, oxidative phosphorylation, RNA transport Decreased: porphyrin/chlorophyll metabolism, C5-branched dibasic acid metabolism, MAPK signaling pathway

TMT, SDS-PAGE, LC-MS/MS

Increased: late embryogenesis proteins, ROS scavenging enzymes, photosynthesis, glycolysis, TCA cycle, ATP synthesis
Decreased: receptor like protein kinase, pentatricopeptide repeat containing protein, aquaporin

Leaves & roots

2DE, MALDI-TOF MS/MS

Increased: photosynthesis metabolism- and energy-related proteins, ATP synthase subunit, heat shock protein 70 (TF58 genotype)
Decreased: pyruvate decarboxylase, 1-aminocyclopropane-1-carboxylic acid oxidase, glutamine synthetase, glutathione-S-transferases, β-1,3-glucanase (TF57 genotype)

Luan et al., 2018

Organ

10–20 days Plant tissues

190 days

7 days

3 weeks

Treatment time


Spinach var. Sp73

50 days old

Maize Vp16 & mutant vp16

Wheat genotype Jinmai 47

Okra cv. Wufu

Jojoba (Simmondsia chinensis)

Heat 37°C

Drought

Drought

300 mM/l NaCl

Cold 15-20°C (day)

7–8 leaf stage

3 weeks post-germination

15 days seedling

3 days post-germination

Seedlings

Soybean cv. Union85140

200 mM NaCl

Growth stage

Plant

Stresses

Table 3.1. Continued

7 days

48 h

4 days

7 days

72 h

24 h

Treatment time

iTRAQ, LC-MS/MS

Increased: TCA cycle related proteins, sucrose, proline, LEA proteins, chaperone protein Decreased: RuBisCO activase

5 & 6 leaf stages

SDS-PAGE, iTRAQ, SCX LC-MS/MS

Increased: ferredoxin 3, NADP-malic enzyme, glyceraldehyde-3-phosphate dehydrogenase of plastid 1
Decreased: photosynthesis, RuBisCO small chain proteins, RuBisCO activase

Continued

Gao et al., 2019

Zhan et al., 2019

Wang et al., 2019

Liu et al., 2019

SDS-PAGE, iTRAQ, SCX, LC-TOF/MS

Increased: α-amylase, histone H2A
Decreased: calcium ion binding, pentatricopeptide repeat-containing protein mitochondrial, pyruvate kinase

Pi et al., 2018

References

Li et al., 2019

Increased: cyanidin-3-arabinoside chloride and a dihydroxy B-ring flavonoid, MYB transcription factor phosphorylation

Major findings

2DE, iTRAQ, LC-MS

Increased: ROS scavenging pathways, protein synthesis/turnover, carbohydrate, amino acid metabolism
Decreased: photosynthesis

iTRAQ, LC-MS/MS, Q Exactive

Proteomic methodologies

Above-ground parts

TMT, LC-MS/MS

Increased: heat shock proteins, two α-crystalline heat shock protein
Decreased: ribonucleoprotein, stress-response A/B barrel domain-containing protein, conserved hypothetical protein, sucrose synthase

Leaves

Leaves

Leaves

Organ


Soybean var. Qihuang 34

Rapeseed cv. Zhongshuang 11

Tibetan hull-less barley cv. DQ & XL

Flooding/ submergence

N treatment

Drought

3-leaf stage

Black pepper var. Panniyur I

8–12 week old seedlings

Phytophthora capsici

Magnaporthe oryzae (Guy11 & JS153)

Rice cv. Nipponbare

Pea cv. Messire

30 days post-plantation

7 days

2 leaf stage

Growth stage

Didymella pinodes

Biotic stress

Plant

Stresses

Table 3.1. Continued

Plant tissue

Organ

24 h

24 h

Full flowering stage

48 h

Seedlings

Leaves

Leaves (subcellular)

Leaves

3 & 14 days Roots

7 days

Treatment time

Decreased: glycolysis/gluconeogenesis pathway, GPI, ADH, PDC, enolase, phosphoglycerate kinase, lignin biosynthesis

Major findings

iTRAQ, SCX MS

LC/MS

ESI LC-MS/MS

LC-MS/MS

Wang et al., 2020

Qin et al., 2019

Lin et al., 2019

References

Mahadevan et al., 2016

Continued

Increased: regulation of protein serine/threonine phosphatase activity, response to oxidative stress, protein folding, glutathione metabolic process

Lin et al., 2018

Increased: 7S-globulin-like protein Decreased: carbon fixation, cyano- amino acid metabolism, fructose, mannose metabolism, glutathione metabolism, phenylpropanoid biosynthesis

Increased: glutathione-S-transferases, stress related proteins, pisatin biosynthesis, amino acid, TCA cycle
Decreased: geranyl–geranyl diphosphate synthase, oxaloacetate

Desalegn et al., 2016

Increased: abscisic acid induced- regulated-responsive-activate, ethylene synthesis-degradations Decreased: cytochrome P450

TMT10-, LC-MS/MS

Increased: cell wall organization/biogenesis proteins
Decreased: peroxidases

iTRAQ, TOF-MS

Proteomic methodologies


6 weeks

Solanum tuberosum cv. Désirée

Wheat cv. Chinese Spring & Funelliformis mosseae

Agrobacterium (AGL1)

Xanthomonas translucens

49 day old seedlings

Growth stage

Plant

Stresses

Table 3.1. Continued

Organ

1 day

Leaves, shoots, roots

3 days post infection

Membrane

Treatment time

LC-MS/MS, Q Exactive

SDS-PAGE, LC-MS/MS

Proteomic methodologies

Increased: GSTs, peroxidases, acidic endochitinases, heat shock proteins

Increased: LRR-like receptor protein kinase
Decreased: photosynthetic chlorophyll a/b binding proteins, NAD(P)H-quinone oxidoreductases, cytochrome b6f complex

Major findings

Fiorilli et al., 2018

Burra et al., 2018

References


in cell structure, metabolism, cell division, signal transduction, and response to various environmental perturbations (Jacq et al., 2018). The nucleus, as one of the most important subcellular organelles, houses the largest number of proteins along with the genetic information (Petrovská et al., 2015). Mitochondria, as the powerhouses of the cell, regulate cellular metabolic processes through synthesis of macromolecules (Harvey, 2019). These and other subcellular organelles are involved in development and stress response; therefore, exploring the function of and interaction among subcellular compartments under different circumstances is essential for revealing the molecular response mechanisms.

The chloroplasts, the prominent organelles of plants, contain a highly complex and integrated set of proteins that function in a synchronized way to perform various cellular processes. Proteomic analysis of chloroplasts has been performed in wheat (Kamal et al., 2012), Arabidopsis (Ling et al., 2019), apple (Morkūnaitė‐Haimi et al., 2019), rice (Zhao et al., 2019), and poplar (Velikova et al., 2014). In poplar, the absence of isoprene triggers a rearrangement of the chloroplast protein profile to minimize the negative effects of stress. Ling et al. (2019) explained the chloroplast-associated protein degradation system, which is essential for organellar function and plant development.

The cell wall is a dynamic structure that confines the cell volume and protects against biotic/abiotic stresses (Calderan-Rodrigues et al., 2019). It acts as the first line of defense against pathogenic attack and is associated with signaling molecules that can modulate plant immunity (Bethke and Glazebrook, 2019). Cell wall proteomics has been carried out in different plant species such as Arabidopsis (Tan et al., 2013), rice (Pandey et al., 2010), Brassica (Ligat et al., 2011), wheat (Kong et al., 2010), chickpea (Elagamey et al., 2017), tomato (Dahal et al., 2010), and potato (Lim et al., 2012). In wheat, cell wall growth is reduced to lessen energy utilization under flooding stress (Kong et al., 2010). In tomato, bacterial infection increased the proteins related to pathogenesis and glycolysis, whereas proteins related to antioxidant activity and stress were decreased (Dahal et al., 2010). These studies show that cell wall proteins are regulated in response to biotic and abiotic stresses.

Proteomic studies of other subcellular organelles have been carried out to understand the cellular mechanisms. Organelles are specialized to carry out cellular functions in cooperation with other compartments. Vacuoles play an important role in cell homeostasis, detoxification, protein storage, and cellular trafficking pathways (Di Sansebastiano et al., 2017). The plasma membrane perceives signals and transfers them into the cell, which alters gene expression and in turn changes protein abundance and mediates physiological processes (Komatsu et al., 2007). Endoplasmic reticulum proteomics revealed that cytosolic calcium levels induced by flooding and drought stresses might disturb the endoplasmic reticulum environment for proper protein folding in soybean root tip (Wang and Komatsu, 2016b). Subcellular proteomic analysis under stress conditions provides a more in-depth understanding of the molecular functions of, and the coordination between, different organelles.

3.3.2 Subcellular proteomics: understanding mechanism in soybean under flooding stress

Based on flooding-responsive maps sketched from crude proteins, subcellular proteomics has been applied to interpret events in targeted subcellular compartments and to bridge interactions among cell compartments. Using proteomic techniques, soybean proteins affected by flooding stress were identified in subcellular organelles such as the nucleus (Komatsu et al., 2014; Oh et al., 2014; Yin and Komatsu, 2015, 2016), mitochondria (Komatsu et al., 2011; Kamal and Komatsu, 2015; Mustafa and Komatsu, 2016), endoplasmic reticulum (Komatsu et al., 2012; Wang and Komatsu, 2016b), plasma membrane (Komatsu et al., 2009), and cell wall (Komatsu et al., 2010). Under flooding, cell-wall metabolism was highlighted by proteomic analysis using crude proteins, and biosynthesis of jasmonic acid and reactive oxygen species scavenging contributed to suppressed lignification in the root, including the hypocotyl (Komatsu et al., 2010). In the plasma membrane, flooding-induced proteins related to the antioxidative system, stress, and signaling played roles in protection from oxidative damage, in protein degradation, and in ion homeostasis (Komatsu et al., 2009).

More studies have focused on the nucleus than on the cell wall or plasma membrane. They showed that RNA metabolism was suppressed and abscisic acid signaling was activated in response to flooding lasting from 3 h to 48 h (Komatsu et al., 2010; Yin and Komatsu, 2015; Yin and Komatsu, 2016), whereas DNA repair via acceleration of poly-ADP-ribosylation and signal transduction via interaction of RACK1 with 14-3-3 protein helped cells cope with prolonged flooding (Komatsu et al., 2014; Oh et al., 2014). Energy consumption was evidenced by impaired electron-transport chains in mitochondria, and the proteins involved in the tricarboxylic acid cycle were identified by mitochondrial proteomics in response to flooding (Komatsu et al., 2011; Kamal and Komatsu, 2015). Additionally, mitochondria are the target organelles of aluminum oxide nanoparticles under flooding stress. Other results indicated that aluminum oxide nanoparticles of various sizes affect mitochondrial proteins by regulating membrane permeability and tricarboxylic acid cycle activity under flooding stress. These findings provide insight into the effect of flooding on mitochondrial function in early-stage soybean and reveal that several mitochondrial proteins may play a role in the mitochondrial response to flooding stress (Mustafa and Komatsu, 2016). In addition, endoplasmic reticulum proteomics implied that flooding suppressed protein synthesis and caused dysfunction of protein folding, leading to reduction of glycoproteins (Komatsu et al., 2012; Wang and Komatsu, 2016b).

Collectively, subcellular proteomic studies provide in-depth understanding of subcellular function in response to flooding. Further proteomic investigation should give a clearer view of signal transduction among individual compartments and of the system-level reaction within cells under flooding stress.

3.4 Plant Proteomics of Post-translational Modifications

Protein post-translational modifications are among the fastest and earliest of plant responses to changes in the environment, making the mechanisms and dynamics of post-translational modifications an important area of plant science. Post-translational modifications regulate protein activity and localization, as well as protein–protein interactions in cellular processes, leading to intricate regulation of the plant’s response to environmental stimuli.

3.4.1 Importance of post-translational modifications in plants

In plants, stress-induced responses depend on highly coordinated interaction of signal transduction pathways. Hormonal crosstalk regulated through interaction among signaling molecules determines the plant’s response to abiotic stresses (Weng et al., 2015). Protein activity is modulated during transcription, which determines protein abundance, and subsequently at the translational and post-translational levels. Among these regulation levels, post-translational modifications are crucial, because they can fine-tune protein function, localization, half-life, and interaction to mitigate the potential damage of environmental stresses (Xiong et al., 2016).

Phosphorylation is a major post-translational modification in plants, responsible for regulating diverse cellular functions. Phosphorylation occurs primarily on serine, threonine, and tyrosine residues. Protein phosphorylation has been studied in crop plants including wheat (Lv et al., 2016), maize (Zörb et al., 2010), sugar beet (Yu et al., 2016), and chickpea (Kumar et al., 2014). Protein acetylation is another post-translational modification that regulates gene expression. In rice, lysine acetylation affected various cellular processes, including metabolic processes, signal transduction, RNA processing, protein translation, and stability (Nallamilli et al., 2014). In wheat, lysine acetylation was reported in 277 proteins mainly involved in energy production (Zhang et al., 2016). Environmental stresses thus modulate protein phosphorylation and acetylation, which influence crop productivity through energy regulation.
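Because phosphorylation occurs primarily on serine, threonine, and tyrosine, and the acetylation discussed above targets lysine, the residues that could in principle carry these modifications can be listed directly from a protein sequence. The following Python sketch does only that. The function name candidate_sites, the constants, and the example sequence are hypothetical; actual site assignment relies on mass spectrometry evidence, not on sequence alone.

```python
# Residues that can carry each modification, as stated in the text.
PHOSPHO_RESIDUES = "STY"  # serine, threonine, tyrosine
ACETYL_RESIDUES = "K"     # lysine


def candidate_sites(sequence: str, residues: str) -> list[tuple[int, str]]:
    """List (1-based position, residue) pairs for residues of interest."""
    return [
        (position, amino_acid)
        for position, amino_acid in enumerate(sequence.upper(), start=1)
        if amino_acid in residues
    ]


if __name__ == "__main__":
    example = "MASTKYL"  # hypothetical sequence, for illustration only
    print(candidate_sites(example, PHOSPHO_RESIDUES))
    print(candidate_sites(example, ACETYL_RESIDUES))
```

Such a list is only an upper bound on possible sites: most serine, threonine, tyrosine, and lysine residues in a protein are not modified under any given condition.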


Glycosylation is the most abundant post-translational modification in plants, involved in different biological functions. The majority of eukaryotic proteins are glycosylated. Protein glycosylation is important for protein folding, interaction, stability, mobility, and signal transduction (Roth et al., 2012). Proteins present on the cell surface are mostly glycosylated. In eukaryotes, glycosylation reactions occur in the endoplasmic reticulum. Secretory pathway proteins are modified by N-glycan through the process of N-glycosylation in the endoplasmic reticulum and follow a pathway which promotes efficient folding with the help of calnexin, calreticulin, and heat shock proteins (Williams, 2006). In plants, osmotic and drought stress activate cell death signals across species (Reis et al., 2016). Overexpression of heat shock proteins, which act as chaperones during protein folding inside the endoplasmic reticulum, can reduce cellular damage and increase tolerance against salt stress in tomato (Fu et al., 2016). These findings suggest that environmental stresses impair the process of protein folding, which leads to severe effects on plant growth and development.

Other extensively studied post-translational modifications in crop plants under abiotic stress include SUMOylation (involving small ubiquitin-like modifier proteins) (Ghimire et al., 2020), S-nitrosylation (Lindermayr et al., 2005), and succinylation (He et al., 2016). Research on protein post-translational modifications in crop plants and their role in regulating the stress responses is still at the descriptive stage. Among post-translational modifications, phosphorylation has been extensively studied in plants while others are still at the preliminary stages. During stress, phosphorylation, glycosylation, and acetylation act as stress-responsive factors. Substantial progress has been made in understanding histone post-translational modifications, explaining the mechanism of their contribution to forming a memory of repeated stresses (Suter and Widmer, 2013). Plant acclimation and tolerance to an abiotic stress are associated with significant changes in post-translational modifications of specific proteins. Post-translational modifications are vital for regulating protein function, subcellular localization, and protein activity/stability. Identification and characterization of post-translational modifications on a large scale will help to functionally characterize proteins and will improve our understanding of the mechanisms of crop acclimation and stress tolerance acquisition (Wu et al., 2016). Post-translational modifications are versatile within the cell, whereas our current knowledge in crop plants is very limited. Furthermore, the majority of the studies until now have focused on exploring the changes in post-translational modifications at a single time-point during the stress or recovery stage. Thus, time-course studies are required to understand the regulatory mechanisms of proteins involved in time-dependent response/recovery.

3.4.2 Post-translational modifications: understanding mechanism in soybean under flooding stress

Post-translational modifications regulate protein activity and localization, as well as protein–protein interactions in cellular processes, leading to elaborate regulation of the plant’s response to environmental stimuli (Hashiguchi and Komatsu, 2016). In soybean, the flooding response mediated by post-translational modifications has been clarified through advances in proteomics. Using proteomic techniques, soybean proteins affected by flooding stress were identified with post-translational modifications such as phosphorylation (Nanjo et al., 2010, 2012; Yin et al., 2014; Yin and Komatsu, 2015), glycosylation (Mustafa and Komatsu, 2014; Wang and Komatsu, 2016b), ubiquitination (Yanagawa and Komatsu, 2012), and S-nitrosylation (Hashiguchi and Komatsu, 2018). At initial flooding, hormone regulation of ethylene and abscisic acid participated in flooding tolerance via phosphorylation of eukaryotic translation initiation factor 4G, zinc-finger/BTB domain-containing protein 47, glycine-rich protein, and rRNA processing protein Rrp5 (Yin et al., 2014; Yin and Komatsu, 2015). Flooding altered phosphorylation status through dephosphorylated proteins involved in protein folding/cell structure and phosphorylated proteins involved in energy production (Nanjo et al., 2010). Phosphoproteins were enriched using a phosphoprotein purification column prior to digestion and mass spectrometry. The accumulation of proteins involved in energy production increased as a result of flooding, while accumulation of proteins involved in protein folding and cell structure maintenance decreased. Flooding induced changes in the phosphorylation status of proteins involved in energy generation, protein synthesis, and cell structure maintenance (Nanjo et al., 2012). The response to flooding stress may be regulated by modulation of both protein expression and phosphorylation state.

In addition, glycosylation and ubiquitination play roles in protein synthesis and degradation under flooding. For example, flooding reduced accumulation of glycoproteins, while it increased the abundance of proteins involved in glycolysis (Mustafa and Komatsu, 2014; Wang and Komatsu, 2016b). Moreover, flooding activated ubiquitin-mediated proteolysis, indicating involvement of the 26S proteasome in protein degradation (Yanagawa and Komatsu, 2012). In addition, S-nitrosylation comprised rapid molecular processes that changed the abundance of the active form of alcohol dehydrogenase in flooded soybean (Hashiguchi and Komatsu, 2018). These findings reveal the importance of post-translational modifications in the core flooding response of hormone signaling, energy metabolism, protein folding, and degradation. Furthermore, novel post-translational modifications of acetylation, succinylation, crotonylation, and carbohydration will add insights into flooding perception and response determination in soybean.
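As a small illustration of the N-glycosylation mentioned in Section 3.4.1, candidate N-glycosylation sites can be screened by searching for the consensus sequon N-X-S/T, where X is any residue except proline. The sequon rule is general biochemistry and is not stated in this chapter; the function name n_glyco_sequons and the example sequence in this Python sketch are hypothetical. A sequon marks a potential site only, and occupancy has to be confirmed experimentally.

```python
import re

# N-X-S/T sequon with X != P. The lookahead lets overlapping sequons
# (for example in "NNTS") be reported individually.
_SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glyco_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of asparagines that start an N-X-S/T sequon."""
    return [match.start() + 1 for match in _SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    example = "MNGSANPTLNNTS"  # hypothetical sequence, for illustration only
    print(n_glyco_sequons(example))
```

In the example, the asparagine followed by proline is skipped, which is why only three of the four asparagines are reported.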

3.5 Plant Proteomics: Understanding Environmental Stress Responses

Plant cells in natural conditions are always challenged with a wide range of adverse environmental changes that restrict plant growth and limit the productivity of plants such as crops. Changing environmental conditions greatly affect the accumulation of many proteins; therefore, analysis of alterations in the proteome is essential to understand the plant response to abiotic and biotic stresses (Table 3.1).
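The "Increased" and "Decreased" labels used in Table 3.1 summarize changes in protein abundance between stressed and control samples. A minimal Python sketch of such a classification from two intensity values is given below. It assumes a symmetric fold-change cutoff of 1.5, which is an illustrative choice and not a value taken from the studies in the table; the function name classify_abundance is hypothetical, and real analyses additionally require replicates and statistical testing.

```python
import math


def classify_abundance(control: float, treated: float, threshold: float = 1.5) -> str:
    """Label a protein as increased, decreased, or unchanged.

    `threshold` is the fold change (treated/control) required to call a
    change; 1.5 is an assumed, illustrative default.
    """
    if control <= 0 or treated <= 0:
        raise ValueError("intensities must be positive")
    log2_fold_change = math.log2(treated / control)
    cutoff = math.log2(threshold)
    if log2_fold_change >= cutoff:
        return "increased"
    if log2_fold_change <= -cutoff:
        return "decreased"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical intensities, for illustration only.
    print(classify_abundance(control=100.0, treated=200.0))
    print(classify_abundance(control=100.0, treated=50.0))
    print(classify_abundance(control=100.0, treated=110.0))
```

Working on the log2 scale makes increases and decreases symmetric, so a doubling and a halving are treated as changes of equal size.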

3.5.1 Plant proteomics: understanding interaction between plants and biotic stress

Plants are exposed to various environmental conditions during their growth and development. Plant distribution is largely affected by abiotic as well as biotic stresses. A wide range of biotic factors affect plant growth. Fungi, bacteria, and viruses severely damage plant growth and development. Plant pathogens cause serious crop losses and ultimately impact agricultural production. Proteomics has been used to study plant defense responses against pathogens (Grandellis et al., 2016). As a response mechanism, plants recognize certain pathogen-derived molecules called elicitors (Camejo et al., 2016). Once elicitors are recognized, different signal transduction pathways are activated, leading to the development of the plant’s response strategy. Reactive oxygen species accumulate as a response mechanism and either inhibit the invading pathogens or act as signaling molecules. Reactive oxygen species induce certain post-translational modifications that modify protein structure, function, localization, and cellular stability (Camejo et al., 2019). Currently, the focus of plant proteomic research is on the subcellular localization, interaction, and molecular/functional characterization of proteins, so that the molecular mechanisms involved in plant–pathogen interactions causing susceptibility or resistance become more understandable.

Phytopathogenic fungi are among the most damaging plant-parasitic organisms and cause serious diseases leading to reduced crop production. Proteomic studies have been carried out to understand the plant–fungus interaction and pathogenicity (Mehta et al., 2008; Mathesius, 2009; Bhadauria et al., 2010).
Understanding the pathogenic life cycle, mode of infection, and virulence factors is important for developing effective crop protection strategies, including the development of tolerant varieties and fungicides (McCouch, 2004; Kim and Hwang, 2007; Pariaud et al., 2009; Parker et al., 2009). In citrus, Alternaria alternata increased the accumulation of ferredoxin and cyclophilin in a susceptible variety at the final stages of infection (Santos Dória et al., 2019).

Plant-pathogenic bacteria are also responsible for causing serious damage to crop production. Pseudomonas syringae, a causative agent of leaf/flower blight in plants, regulated plant immunity in Arabidopsis by modulating the reactive oxygen species, energy balance, and glucosinolate biosynthesis (Zhang et al., 2018).

Plant viruses induce changes at genetic, metabolic, and physiological levels in host plants and impact the growth and productivity of important crop plants. In tobacco, tobacco mosaic virus infection altered the proteins related to photosynthesis, carbon metabolism, plant defense, and protein processing in the endoplasmic reticulum (Das et al., 2019). Plant–pathogen interactions are highly complex as they alter proteins from different metabolic pathways. Therefore, comprehensive global-level proteome comparisons are required to gain complete understanding of the interaction mechanisms of these pathogens.

Insects are another example of biotic stress on crop plants. Herbivorous insects directly damage the health of plants, while other insects act as vectors for transmission of various plant pathogenic microorganisms and viruses (Hooks and Fereres, 2006; Hogenhout et al., 2008). Proteomic strategies have been applied to understand the mechanism of virus acquisition and transmission by insect vectors. The majority of plant viruses are spread by insects. The mode of perception of stress by plants depends largely on the interaction mechanism of the three parties (plant, virus, and insect vector), with insect behavior influencing the transmission and spread of the plant virus (Hogenhout et al., 2008; Fereres and Moreno, 2009). Insects help in the transfer of virus to the host plant by damaging the physical barrier protecting the plant.
Insect proteins play an important role as virus receptors, in the uptake and transport of virus through various tissues, and in replication of the viruses (Uzest et al., 2007). Huo et al. (2014) explained the role of vitellogenin of Laodelphax striatellus (small brown planthopper) in transovarial transmission of Rice stripe virus. Proteins from Sogatella furcifera (white-backed planthopper) facilitate the transmission of southern rice black-streaked dwarf virus (Mar et al., 2014). These studies suggest that the interaction between plant and insect is quite complex and warrants more comprehensive studies.

3.5.2 Plant proteomics: understanding signaling mechanism under abiotic stresses

Abiotic stresses, such as flooding, drought, salinity, and high/low temperatures, are the major constraints that global plant growth faces at present. Plants respond to a stress by modulating the abundance of candidate proteins, either by accumulation or by synthesizing novel proteins primarily associated with the plant’s tolerance system. The cellular mechanisms of stress sensing and signal transduction into cellular organelles have been reported (Komatsu and Hossain, 2013). Nevertheless, the responses of plant cells to abiotic stresses differ in each organ (Komatsu and Hossain, 2013). As the correlation between the expression of mRNAs and the abundance of the corresponding proteins is difficult to assess in specific organs, proteomic techniques provide one of the best options for the functional analysis of translated regions of the genome. As an example, the results of proteomic analysis under drought and flooding stresses are described below.

Flooding and drought stresses exert deleterious effects on soybean growth (Oh and Komatsu, 2015). Considering the agricultural importance of soybean, clarification of the underlying mechanisms in response to combined stresses is of extreme value (Wang and Komatsu, 2018). Organ-specific analysis indicated that the root tip in early-stage soybean was more sensitive to both stresses than other organs. Protein quality control and calcium homeostasis were disrupted in the endoplasmic reticulum of soybean exposed to combined stresses. Furthermore, the increased cytosolic calcium in stressed soybean was confirmed to originate from the endoplasmic reticulum, and it further induced the accumulation of pyruvate decarboxylase. These proteomic findings suggest that calcium homeostasis might represent the bridge between the cytosol and subcellular compartments in cells of the soybean root tip in response to the combined stresses. In addition, calcium release from the endoplasmic reticulum was required for the unfolded protein response (Liu et al., 2011) and elevated cytosolic calcium directed pyruvate in stressed soybean (Wang and Komatsu, 2017), indicating the importance of calcium in protein metabolism and energy regulation for coping with flooding and drought stresses.

3.6 Future Perspective

To reiterate, proteomics is a cutting-edge technique that provides more in-depth understanding of the molecular mechanisms of plants. Plants have evolved various strategies at the molecular level that aid in their adaptation to changing environmental conditions. Proteomic approaches combined with bioinformatic methods permit detection of the dynamic cellular profiles of proteins, protein modification, and protein–protein interactions. Subcellular proteomics is critical to explore cellular responses. A growing number of subcellular proteomic studies have been carried out, in addition to studies of isolation protocols and purity assessments. Among subcellular proteomic studies conducted to date, those focused on identification and characterization of nuclear proteins are the most common. Although numerous proteins were identified in different subcellular compartments, the total number of specific proteins was far lower than expected, indicating that improved methods of isolation and detection are needed for low-abundance and novel proteins.

The role of post-translational modifications in regulating the stress response is yet to be fully explored. An outline of the mechanisms regulated by post-translational modifications as stress-responsive factors is available for phosphorylation, glycosylation, acetylation, and ubiquitination in plants. In the future, exploring the role of other post-translational modifications in plants will help elucidate the stress response mechanisms in detail. From the viewpoint of protein evolution, PeptideAtlas (available at http://www.peptideatlas.org, accessed July 2022) is a key resource. It is a multi-organism, publicly open compendium of peptides identified by proteomic techniques, and will facilitate comparative proteomic analysis among plant, animal, and other species. In addition, an important aspect of studying plant proteomics is the identification and characterization of proteins that have insufficient experimental support, which are called missing proteins. Furthermore, the convergence of diverse mass spectrometry techniques coupled with bioinformatic technology will help to obtain a more precise and comprehensive overview of plant stress response mechanisms (Fig. 3.1).

References
Arruda, S.C.C., Barbosa, H. de S., Azevedo, R.A. and Arruda, M.A.Z. (2011) Two-dimensional difference gel electrophoresis applied for analytical proteomics: fundamentals and applications to the study of plant proteomics. The Analyst 136(20), 4119–4126. DOI: 10.1039/c1an15513j.
Bethke, G. and Glazebrook, J. (2019) Measuring pectin properties to track cell wall alterations during plant-pathogen interactions. Methods in Molecular Biology (Clifton, N.J.) 1991, 55–60. DOI: 10.1007/978-1-4939-9458-8_6.
Bhadauria, V., Banniza, S., Wang, L.-X., Wei, Y.-D. and Peng, Y.L. (2010) Proteomic studies of phytopathogenic fungi, oomycetes and their interactions with hosts. European Journal of Plant Pathology 126(1), 81–95. DOI: 10.1007/s10658-009-9521-4.
Burra, D.D., Lenman, M., Levander, F., Resjö, S. and Andreasson, E. (2018) Comparative membrane-associated proteomics of three different immune reactions in potato. International Journal of Molecular Sciences 19(2), 538. DOI: 10.3390/ijms19020538.
Calderan-Rodrigues, M.J., Guimarães Fonseca, J., de Moraes, F.E., Vaz Setem, L., Carmanhanis Begossi, A. et al. (2019) Plant cell wall proteomics: a focus on monocot species, Brachypodium distachyon, Saccharum spp. and Oryza sativa. International Journal of Molecular Sciences 20(8), 1975. DOI: 10.3390/ijms20081975.

S. Komatsu and G. Mustafa

Camejo, D., Guzmán-Cedeño, Á. and Moreno, A. (2016) Reactive oxygen species, essential molecules, during plant-pathogen interactions. Plant Physiology and Biochemistry 103, 10–23. DOI: 10.1016/j.plaphy.2016.02.035.
Camejo, D., Guzmán-Cedeño, A., Vera-Macias, L. and Jiménez, A. (2019) Oxidative post-translational modifications controlling plant-pathogen interaction. Plant Physiology and Biochemistry 144, 110–117. DOI: 10.1016/j.plaphy.2019.09.020.
Dahal, D., Pich, A., Braun, H.P. and Wydra, K. (2010) Analysis of cell wall proteins regulated in stem of susceptible and resistant tomato species after inoculation with Ralstonia solanacearum: a proteomic approach. Plant Molecular Biology 73(6), 643–658. DOI: 10.1007/s11103-010-9646-z.
Das, A., Eldakak, M., Paudel, B., Kim, D.-W., Hemmati, H. et al. (2016) Leaf proteome analysis reveals prospective drought and heat stress response mechanisms in soybean. BioMed Research International 2016, 6021047. DOI: 10.1155/2016/6021047.
Das, P.P., Lin, Q. and Wong, S.M. (2019) Comparative proteomics of Tobacco mosaic virus-infected Nicotiana tabacum plants identified major host proteins involved in photosystems and plant defence. Journal of Proteomics 194, 191–199. DOI: 10.1016/j.jprot.2018.11.018.
Desalegn, G., Turetschek, R., Kaul, H.P. and Wienkoop, S. (2016) Microbial symbionts affect Pisum sativum proteome and metabolome under Didymella pinodes infection. Journal of Proteomics 143, 173–187. DOI: 10.1016/j.jprot.2016.03.018.
Di Sansebastiano, G.P., Barozzi, F., Piro, G., Denecke, J. and de Marcos Lousa, C. (2017) Trafficking routes to the plant vacuole: connecting alternative and classical pathways. Journal of Experimental Botany 69(1), 79–90. DOI: 10.1093/jxb/erx376.
Elagamey, E., Narula, K., Sinha, A., Ghosh, S., Abdellatef, M.A.E. et al. (2017) Quantitative extracellular matrix proteomics suggests cell wall reprogramming in host-specific immunity during vascular wilt caused by Fusarium oxysporum in chickpea. Proteomics 17(23–24), 1600374. DOI: 10.1002/pmic.201600374.
Fereres, A. and Moreno, A. (2009) Behavioural aspects influencing plant virus transmission by homopteran insects. Virus Research 141(2), 158–168. DOI: 10.1016/j.virusres.2008.10.020.
Fiorilli, V., Vannini, C., Ortolani, F., Garcia-Seco, D., Chiapello, M. et al. (2018) Omics approaches revealed how arbuscular mycorrhizal symbiosis enhances yield and resistance to leaf pathogen in wheat. Scientific Reports 8(1), 9625. DOI: 10.1038/s41598-018-27622-8.
Fu, C., Liu, X.X., Yang, W.W., Zhao, C.M. and Liu, J. (2016) Enhanced salt tolerance in tomato plants constitutively expressing heat-shock protein in the endoplasmic reticulum. Genetics and Molecular Research 15(2), 15028301. DOI: 10.4238/gmr.15028301.
Gao, F., Ma, P., Wu, Y., Zhou, Y. and Zhang, G. (2019) Quantitative proteomic analysis of the response to cold stress in Jojoba, a tropical woody crop. International Journal of Molecular Sciences 20(2), 243. DOI: 10.3390/ijms20020243.
Ghimire, S., Tang, X., Zhang, N., Liu, W. and Si, H. (2020) SUMO and SUMOylation in plant abiotic stress. Plant Growth Regulation 91(3), 317–325. DOI: 10.1007/s10725-020-00624-1.
Grandellis, C., Vranych, C.V., Piazza, A., Garavaglia, B.S., Natalia, G. et al. (2016) An overview of proteomics tools for understanding plant defense against pathogens. Current Issues in Molecular Biology 19, 129–136.
Gupta, D.B., Rai, Y., Gayali, S., Chakraborty, S. and Chakraborty, N. (2016) Plant organellar proteomics in response to dehydration: turning protein repertoire into insights. Frontiers in Plant Science 7, 460.
Harvey, A.J. (2019) Mitochondria in early development: linking the microenvironment, metabolism and the epigenome. Reproduction (Cambridge, England) 157(5), R159–R179. DOI: 10.1530/REP-18-0431.
Hashiguchi, A. and Komatsu, S. (2016) Impact of post-translational modifications of crop proteins under abiotic stress. Proteomes 4(4), 42. DOI: 10.3390/proteomes4040042.
Hashiguchi, A. and Komatsu, S. (2018) Early changes in S-nitrosoproteome in soybean seedlings under flooding stress. Plant Molecular Biology Reporter 36(5–6), 822–831. DOI: 10.1007/s11105-018-1124-9.
He, D., Wang, Q., Li, M., Damaris, R.N., Yi, X. et al. (2016) Global proteome analyses of lysine acetylation and succinylation reveal the widespread involvement of both modification in metabolism in the embryo of germinating rice seed. Journal of Proteome Research 15(3), 879–890. DOI: 10.1021/acs.jproteome.5b00805.
Hogenhout, S.A., Ammar, E.-D., Whitfield, A.E. and Redinbaugh, M.G. (2008) Insect vector interactions with persistently transmitted viruses. Annual Review of Phytopathology 46, 327–359. DOI: 10.1146/annurev.phyto.022508.092135.

Hooks, C.R.R. and Fereres, A. (2006) Protecting crops from non-persistently aphid-transmitted viruses: a review on the use of barrier plants as a management tool. Virus Research 120(1–2), 1–16. DOI: 10.1016/j.virusres.2006.02.006.
Huo, Y., Liu, W., Zhang, F., Chen, X., Li, L. et al. (2014) Transovarial transmission of a plant virus is mediated by vitellogenin of its insect vector. PLoS Pathogens 10(3), e1003949. DOI: 10.1371/journal.ppat.1003949.
Iizumi, T., Luo, J.-J., Challinor, A.J., Sakurai, G., Yokozawa, M. et al. (2014) Impacts of El Niño Southern Oscillation on the global yields of major crops. Nature Communications 5, 3712. DOI: 10.1038/ncomms4712.
Jacq, A., Burlat, V. and Jamet, E. (2018) Plant cell wall proteomics as a strategy to reveal candidate proteins involved in extracellular lipid metabolism. Current Protein & Peptide Science 19(2), 190–199. DOI: 10.2174/1389203718666170918152859.
Kamal, A.H.M. and Komatsu, S. (2015) Involvement of reactive oxygen species and mitochondrial proteins in biophoton emission in roots of soybean plants under flooding stress. Journal of Proteome Research 14(5), 2219–2236. DOI: 10.1021/acs.jproteome.5b00007.
Kamal, A.H.M., Cho, K., Kim, D.-E., Uozumi, N., Chung, K.-Y. et al. (2012) Changes in physiology and protein abundance in salt-stressed wheat chloroplasts. Molecular Biology Reports 39(9), 9059–9074. DOI: 10.1007/s11033-012-1777-7.
Kim, B.S. and Hwang, B.K. (2007) Microbial fungicides in the control of plant diseases. Journal of Phytopathology 155(11–12), 641–653. DOI: 10.1111/j.1439-0434.2007.01314.x.
Komatsu, S. and Hossain, Z. (2013) Organ-specific proteome analysis for identification of abiotic stress response mechanism in crop. Frontiers in Plant Science 4, 71. DOI: 10.3389/fpls.2013.00071.
Komatsu, S., Konishi, H. and Hashimoto, M. (2007) The proteomics of plant cell membranes. Journal of Experimental Botany 58(1), 103–112. DOI: 10.1093/jxb/erj209.
Komatsu, S., Wada, T., Abaléa, Y., Nouri, M.-Z., Nanjo, Y. et al. (2009) Analysis of plasma membrane proteome in soybean and application to flooding stress response. Journal of Proteome Research 8(10), 4487–4499. DOI: 10.1021/pr9002883.
Komatsu, S., Kobayashi, Y., Nishizawa, K., Nanjo, Y. and Furukawa, K. (2010) Comparative proteomics analysis of differentially expressed proteins in soybean cell wall during flooding stress. Amino Acids 39(5), 1435–1449. DOI: 10.1007/s00726-010-0608-1.
Komatsu, S., Yamamoto, A., Nakamura, T., Nouri, M.-Z., Nanjo, Y. et al. (2011) Comprehensive analysis of mitochondria in roots and hypocotyls of soybean under flooding stress using proteomics and metabolomics techniques. Journal of Proteome Research 10(9), 3993–4004. DOI: 10.1021/pr2001918.
Komatsu, S., Kuji, R., Nanjo, Y., Hiraga, S. and Furukawa, K. (2012) Comprehensive analysis of endoplasmic reticulum-enriched fraction in root tips of soybean under flooding stress using proteomics techniques. Journal of Proteomics 77, 531–560. DOI: 10.1016/j.jprot.2012.09.032.
Komatsu, S., Han, C., Nanjo, Y., Altaf-Un-Nahar, M., Wang, K. et al. (2013) Label-free quantitative proteomic analysis of abscisic acid effect in early-stage soybean under flooding. Journal of Proteome Research 12(11), 4769–4784. DOI: 10.1021/pr4001898.
Komatsu, S., Hiraga, S. and Nouri, M.Z. (2014) Analysis of flooding-responsive proteins localized in the nucleus of soybean root tips. Molecular Biology Reports 41(2), 1127–1139. DOI: 10.1007/s11033-013-2959-7.
Komatsu, S., Tougou, M. and Nanjo, Y. (2015) Proteomic techniques and management of flooding tolerance in soybean. Journal of Proteome Research 14(9), 3768–3778. DOI: 10.1021/acs.jproteome.5b00389.
Kong, F.J., Oyanagi, A. and Komatsu, S. (2010) Cell wall proteome of wheat roots under flooding stress using gel-based and LC MS/MS-based proteomics approaches. Biochimica et Biophysica Acta 1804(1), 124–136. DOI: 10.1016/j.bbapap.2009.09.023.
Kumar, R., Kumar, A., Subba, P., Gayali, S., Barua, P. et al. (2014) Nuclear phosphoproteome of developing chickpea seedlings (Cicer arietinum L.) and protein-kinase interaction network. Journal of Proteomics 105, 58–73. DOI: 10.1016/j.jprot.2014.04.002.
León, I.R., Schwämmle, V., Jensen, O.N. and Sprenger, R.R. (2013) Quantitative assessment of in-solution digestion efficiency identifies optimal protocols for unbiased protein analysis. Molecular & Cellular Proteomics 12(10), 2992–3005. DOI: 10.1074/mcp.M112.025585.
Ligat, L., Lauber, E., Albenne, C., San Clemente, H., Valot, B. et al. (2011) Analysis of the xylem sap proteome of Brassica oleracea reveals a high content in secreted proteins. Proteomics 11(9), 1798–1813. DOI: 10.1002/pmic.201000781.

Li, G.K., Gao, J., Peng, H., Shen, Y.O., Ding, H.P. et al. (2016) Proteomic changes in maize as a response to heavy metal (lead) stress revealed by iTRAQ quantitative proteomics. Genetics and Molecular Research 15(1), 1. DOI: 10.4238/gmr.15017254.
Lim, S., Chisholm, K., Coffin, R.H., Peters, R.D., Al-Mughrabi, K.I. et al. (2012) Protein profiling in potato (Solanum tuberosum L.) leaf tissues by differential centrifugation. Journal of Proteome Research 11(4), 2594–2601. DOI: 10.1021/pr201004k.
Lindermayr, C., Saalbach, G. and Durner, J. (2005) Proteomic identification of S-nitrosylated proteins in Arabidopsis. Plant Physiology 137(3), 921–930. DOI: 10.1104/pp.104.058719.
Ling, Q., Broad, W., Trösch, R., Töpel, M., Demiral Sert, T. et al. (2019) Ubiquitin-dependent chloroplast-associated protein degradation in plants. Science 363(6429), eaav4467. DOI: 10.1126/science.aav4467.
Lin, S., Nie, P., Ding, S., Zheng, L., Chen, C. et al. (2018) Quantitative proteomic analysis provides insights into rice defense mechanisms against Magnaporthe oryzae. International Journal of Molecular Sciences 19(7), E1950. DOI: 10.3390/ijms19071950.
Lin, Y., Li, W., Zhang, Y., Xia, C., Liu, Y. et al. (2019) Identification of genes/proteins related to submergence tolerance by transcriptome and proteome analyses in soybean. Scientific Reports 9(1), 14688. DOI: 10.1038/s41598-019-50757-1.
Li, S., Yu, J., Li, Y., Zhang, H., Bao, X. et al. (2019) Heat-responsive proteomics of a heat-sensitive spinach variety. International Journal of Molecular Sciences 20(16), E3872. DOI: 10.3390/ijms20163872.
Liu, L., Cui, F., Li, Q., Yin, B., Zhang, H. et al. (2011) The endoplasmic reticulum-associated degradation is necessary for plant salt tolerance. Cell Research 21(6), 957–969. DOI: 10.1038/cr.2010.181.
Liu, S., Zenda, T., Dong, A., Yang, Y., Liu, X. et al. (2019) Comparative proteomic and morpho-physiological analyses of maize wild-type Vp16 and mutant vp16 germinating seed responses to PEG-induced drought stress. International Journal of Molecular Sciences 20(22), 5586. DOI: 10.3390/ijms20225586.
Luan, H., Shen, H., Pan, Y., Guo, B., Lv, C. et al. (2018) Elucidating the hypoxic stress response in barley (Hordeum vulgare L.) during waterlogging: a proteomics approach. Scientific Reports 8(1), 9655. DOI: 10.1038/s41598-018-27726-1.
Lv, D.-W., Zhu, G.-R., Zhu, D., Bian, Y.-W., Liang, X.-N. et al. (2016) Proteomic and phosphoproteomic analysis reveals the response and defense mechanism in leaves of diploid wheat T. monococcum under salt stress and recovery. Journal of Proteomics 143, 93–105. DOI: 10.1016/j.jprot.2016.04.013.
Mahadevan, C., Krishnan, A., Saraswathy, G.G., Surendran, A., Jaleel, A. et al. (2016) Transcriptome-assisted label-free quantitative proteomics analysis reveals novel insights into Piper nigrum-Phytophthora capsici Phytopathosystem. Frontiers in Plant Science 7, 785. DOI: 10.3389/fpls.2016.00785.
Mar, T., Liu, W. and Wang, X. (2014) Proteomic analysis of interaction between P7-1 of Southern rice black-streaked dwarf virus and the insect vector reveals diverse insect proteins involved in successful transmission. Journal of Proteomics 102, 83–97. DOI: 10.1016/j.jprot.2014.03.004.
Mathesius, U. (2009) Comparative proteomic studies of root-microbe interactions. Journal of Proteomics 72(3), 353–366. DOI: 10.1016/j.jprot.2008.12.006.
McCouch, S. (2004) Diversifying selection in plant breeding. PLoS Biology 2(10), e347. DOI: 10.1371/journal.pbio.0020347.
Mehta, A., Brasileiro, A.C.M., Souza, D.S.L., Romano, E., Campos, M.A. et al. (2008) Plant-pathogen interactions: what is proteomics telling us? The FEBS Journal 275(15), 3731–3746. DOI: 10.1111/j.1742-4658.2008.06528.x.
Mittler, R. (2006) Abiotic stress, the field environment and stress combination. Trends in Plant Science 11(1), 15–19. DOI: 10.1016/j.tplants.2005.11.002.
Morkūnaitė‐Haimi, Š., Vinskiene, J., Stanienė, G. and Haimi, P. (2019) Differential chloroplast proteomics of temperature adaptation in apple (Malus domestica Borkh.) microshoots. Proteomics 19(19), 1800142. DOI: 10.1002/pmic.201800142.
Mustafa, G. and Komatsu, S. (2014) Quantitative proteomics reveals the effect of protein glycosylation in soybean root under flooding stress. Frontiers in Plant Science 5, 627. DOI: 10.3389/fpls.2014.00627.
Mu, Q., Zhang, W., Zhang, Y., Yan, H., Liu, K. et al. (2017) iTRAQ-based quantitative proteomics analysis on rice anther responding to high temperature. International Journal of Molecular Sciences 18(9), 1811. DOI: 10.3390/ijms18091811.
Mustafa, G. and Komatsu, S. (2016) Insights into the response of soybean mitochondrial proteins to various sizes of aluminum oxide nanoparticles under flooding stress. Journal of Proteome Research 15(12), 4464–4475. DOI: 10.1021/acs.jproteome.6b00572.

Nallamilli, B.R.R., Edelmann, M.J., Zhong, X., Tan, F., Mujahid, H. et al. (2014) Global analysis of lysine acetylation suggests the involvement of protein acetylation in diverse biological processes in rice (Oryza sativa). PLoS ONE 9(2), e89283. DOI: 10.1371/journal.pone.0089283.
Nanjo, Y., Skultety, L., Ashraf, Y. and Komatsu, S. (2010) Comparative proteomic analysis of early-stage soybean seedlings responses to flooding by using gel and gel-free techniques. Journal of Proteome Research 9(8), 3989–4002. DOI: 10.1021/pr100179f.
Nanjo, Y., Skultety, L., Uváčková, L., Klubicová, K., Hajduch, M. et al. (2012) Mass spectrometry-based analysis of proteomic changes in the root tips of flooded soybean seedlings. Journal of Proteome Research 11(1), 372–385. DOI: 10.1021/pr200701y.
Natarajan, S., Xu, C., Caperna, T.J. and Garrett, W.M. (2005) Comparison of protein solubilization methods suitable for proteomic analysis of soybean seed proteins. Analytical Biochemistry 342(2), 214–220. DOI: 10.1016/j.ab.2005.04.046.
Neilson, K.A., Ali, N.A., Muralidharan, S., Mirzaei, M., Mariani, M. et al. (2011) Less label, more free: approaches in label-free quantitative mass spectrometry. Proteomics 11(4), 535–553. DOI: 10.1002/pmic.201000553.
Oh, M.W. and Komatsu, S. (2015) Characterization of proteins in soybean roots under flooding and drought stresses. Journal of Proteomics 114, 161–181. DOI: 10.1016/j.jprot.2014.11.008.
Oh, M.W., Nanjo, Y. and Komatsu, S. (2014) Identification of nuclear proteins in soybean under flooding stress using proteomic technique. Protein and Peptide Letters 21(5), 458–467. DOI: 10.2174/09298665113206660120.
O’Farrell, P.H. (1975) High resolution two-dimensional electrophoresis of proteins. The Journal of Biological Chemistry 250(10), 4007–4021.
Pan, J., Li, Z., Wang, Q., Garrell, A.K., Liu, M. et al. (2018) Comparative proteomic investigation of drought responses in foxtail millet. BMC Plant Biology 18(1), 315. DOI: 10.1186/s12870-018-1533-9.
Pandey, A., Rajamani, U., Verma, J., Subba, P., Chakraborty, N. et al. (2010) Identification of extracellular matrix proteins of rice (Oryza sativa L.) involved in dehydration-responsive network: a proteomic approach. Journal of Proteome Research 9(7), 3443–3464. DOI: 10.1021/pr901098p.
Pariaud, B., Ravigné, V., Halkett, F., Goyeau, H., Carlier, J. et al. (2009) Aggressiveness and its role in the adaptation of plant pathogens. Plant Pathology 58(3), 409–424. DOI: 10.1111/j.1365-3059.2009.02039.x.
Parker, D., Beckmann, M., Zubair, H., Enot, D.P., Caracuel-Rios, Z. et al. (2009) Metabolomic analysis reveals a common pattern of metabolic re-programming during invasion of three host plant species by Magnaporthe grisea. The Plant Journal 59(5), 723–737. DOI: 10.1111/j.1365-313X.2009.03912.x.
Petrovská, B., Šebela, M. and Doležel, J. (2015) Inside a plant nucleus: discovering the proteins. Journal of Experimental Botany 66(6), 1627–1640. DOI: 10.1093/jxb/erv041.
Pi, E., Zhu, C., Fan, W., Huang, Y., Qu, L. et al. (2018) Quantitative phosphoproteomic and metabolomic analyses reveal GmMYB173 optimizes flavonoid metabolism in soybean under salt stress. Molecular & Cellular Proteomics 17(6), 1209–1224. DOI: 10.1074/mcp.RA117.000417.
Qin, L., Walk, T.C., Han, P., Chen, L., Zhang, S. et al. (2019) Adaption of roots to nitrogen deficiency revealed by 3D quantification and proteomic analysis. Plant Physiology 179(1), 329–347. DOI: 10.1104/pp.18.00716.
Rauniyar, N. and Yates, J.R. (2014) Isobaric labeling-based relative quantification in shotgun proteomics. Journal of Proteome Research 13(12), 5293–5309. DOI: 10.1021/pr500880b.
Reis, P.A.B., Carpinetti, P.A., Freitas, P.P.J., Santos, E.G.D., Camargos, L.F. et al. (2016) Functional and regulatory conservation of the soybean ER stress-induced DCD/NRP-mediated cell death signaling in plants. BMC Plant Biology 16(1), 156. DOI: 10.1186/s12870-016-0843-z.
Roth, Z., Yehezkel, G. and Khalaila, I. (2012) Identification and quantification of protein glycosylation. International Journal of Carbohydrate Chemistry 2012, 1–10. DOI: 10.1155/2012/640923.
Santos Dória, M., Silva Guedes, M., de Andrade Silva, E.M., Magalhães de Oliveira, T., Pirovani, C.P. et al. (2019) Comparative proteomics of two citrus varieties in response to infection by the fungus Alternaria alternata. International Journal of Biological Macromolecules 136, 410–423. DOI: 10.1016/j.ijbiomac.2019.06.069.
Shan, Z., Luo, X., Wei, M., Huang, T., Khan, A. et al. (2018) Physiological and proteomic analysis on long-term drought resistance of cassava (Manihot esculenta Crantz). Scientific Reports 8(1), 17982. DOI: 10.1038/s41598-018-35711-x.
Suter, L. and Widmer, A. (2013) Environmental heat and salt stress induce transgenerational phenotypic changes in Arabidopsis thaliana. PLoS ONE 8, e60364.

Tan, L., Eberhard, S., Pattathil, S., Warder, C., Glushka, J. et al. (2013) An Arabidopsis cell wall proteoglycan consists of pectin and arabinoxylan covalently linked to an arabinogalactan protein. Plant Cell 25, 270–287.
Thangella, P.A.V., Pasumarti, S.N.B.S., Pullakhandam, R., Geereddy, B.R. and Daggu, M.R. (2018) Differential expression of leaf proteins in four cultivars of peanut (Arachis hypogaea L.) under water stress. 3 Biotech 8(3), 157. DOI: 10.1007/s13205-018-1180-8.
Uzest, M., Gargani, D., Drucker, M., Hébrard, E., Garzo, E. et al. (2007) A protein key to plant virus transmission at the tip of the insect vector stylet. Proceedings of the National Academy of Sciences 104, 17959–17964. DOI: 10.1073/pnas.0706608104.
Velikova, V., Ghirardo, A., Vanzo, E., Merl, J., Hauck, S.M. et al. (2014) Genetic manipulation of isoprene emissions in poplar plants remodels the chloroplast proteome. Journal of Proteome Research 13(4), 2005–2018. DOI: 10.1021/pr401124z.
Wang, J., Islam, F., Li, L., Long, M., Yang, C. et al. (2017) Complementary RNA-sequencing based transcriptomics and iTRAQ proteomics reveal the mechanism of the alleviation of quinclorac stress by salicylic acid in Oryza sativa ssp. japonica. International Journal of Molecular Sciences 18(9), E1975. DOI: 10.3390/ijms18091975.
Wang, X. and Komatsu, S. (2016a) Plant subcellular proteomics: application for exploring optimal cell function in soybean. Journal of Proteomics 143, 45–56. DOI: 10.1016/j.jprot.2016.01.011.
Wang, X. and Komatsu, S. (2016b) Gel-free/label-free proteomic analysis of endoplasmic reticulum proteins in soybean root tips under flooding and drought stresses. Journal of Proteome Research 15(7), 2211–2227. DOI: 10.1021/acs.jproteome.6b00190.
Wang, X. and Komatsu, S. (2017) Proteomic analysis of calcium effects on soybean root tip under flooding and drought stresses. Plant & Cell Physiology 58(8), 1405–1420. DOI: 10.1093/pcp/pcx078.
Wang, X. and Komatsu, S. (2018) Proteomic approaches to uncover the flooding and drought stress response mechanisms in soybean. Journal of Proteomics 172, 201–215. DOI: 10.1016/j.jprot.2017.11.006.
Wang, Y., Zhang, X., Huang, G., Feng, F., Liu, X. et al. (2019) iTRAQ-based quantitative analysis of responsive proteins under PEG-induced drought stress in wheat leaves. International Journal of Molecular Sciences 20(11), 2621. DOI: 10.3390/ijms20112621.
Wang, Y., Sang, Z., Xu, S., Xu, Q., Zeng, X. et al. (2020) Comparative proteomics analysis of Tibetan hull-less barley under osmotic stress via data-independent acquisition mass spectrometry. GigaScience 9(3), 019. DOI: 10.1093/gigascience/giaa019.
Watson, B.S., Asirvatham, V.S., Wang, L. and Sumner, L.W. (2003) Mapping the proteome of barrel medic (Medicago truncatula). Plant Physiology 131, 1104–1112.
Weng, L., Zhao, F., Li, R. and Xiao, H. (2015) Cross-talk modulation between ABA and ethylene by transcription factor SlZFP2 during fruit development and ripening in tomato. Plant Signaling & Behavior 10(12), e1107691. DOI: 10.1080/15592324.2015.1107691.
Williams, D.B. (2006) Beyond lectins: the calnexin/calreticulin chaperone system of the endoplasmic reticulum. Journal of Cell Science 119(Pt 4), 615–623. DOI: 10.1242/jcs.02856.
Wu, X., Gong, F., Cao, D., Hu, X. and Wang, W. (2016) Advances in crop proteomics: PTMs of proteins under abiotic stress. Proteomics 16(5), 847–865. DOI: 10.1002/pmic.201500301.
Xiong, Y., Peng, X., Cheng, Z., Liu, W. and Wang, G.L. (2016) A comprehensive catalog of the lysine-acetylation targets in rice (Oryza sativa) based on proteomic analyses. Journal of Proteomics 138, 20–29. DOI: 10.1016/j.jprot.2016.01.019.
Yanagawa, Y. and Komatsu, S. (2012) Ubiquitin/proteasome-mediated proteolysis is involved in the response to flooding stress in soybean roots, independent of oxygen limitation. Plant Science 185–186, 250–258. DOI: 10.1016/j.plantsci.2011.11.014.
Yin, X. and Komatsu, S. (2015) Quantitative proteomics of nuclear phosphoproteins in the root tip of soybean during the initial stages of flooding stress. Journal of Proteomics 119, 183–195. DOI: 10.1016/j.jprot.2015.02.004.
Yin, X. and Komatsu, S. (2016) Nuclear proteomics reveals the role of protein synthesis and chromatin structure in root tip of soybean during the initial stage of flooding stress. Journal of Proteome Research 15(7), 2283–2298. DOI: 10.1021/acs.jproteome.6b00330.
Yin, X., Sakata, K. and Komatsu, S. (2014) Phosphoproteomics reveals the effect of ethylene in soybean root under flooding stress. Journal of Proteome Research 13(12), 5618–5634. DOI: 10.1021/pr500621c.

Yu, B., Li, J., Koh, J., Dufresne, C., Yang, N. et al. (2016) Quantitative proteomics and phosphoproteomics of sugar beet monosomic addition line M14 in response to salt stress. Journal of Proteomics 143, 286–297. DOI: 10.1016/j.jprot.2016.04.011.
Zhan, Y., Wu, Q., Chen, Y., Tang, M., Sun, C. et al. (2019) Comparative proteomic analysis of okra (Abelmoschus esculentus L.) seedlings under salt stress. BMC Genomics 20(1), 381. DOI: 10.1186/s12864-019-5737-7.
Zhang, T., Meng, L., Kong, W., Yin, Z., Wang, Y. et al. (2018) Quantitative proteomics reveals a role of JAZ7 in plant defense response to Pseudomonas syringae DC3000. Journal of Proteomics 175, 114–126. DOI: 10.1016/j.jprot.2018.01.002.
Zhang, Y., Song, L., Liang, W., Mu, P., Wang, S. et al. (2016) Comprehensive profiling of lysine acetylproteome analysis reveals diverse functions of lysine acetylation in common wheat. Scientific Reports 6, 21069. DOI: 10.1038/srep21069.
Zhao, F., Zhang, D., Zhao, Y., Wang, W., Yang, H. et al. (2016) The difference of physiological and proteomic changes in maize leaves adaptation to drought, heat, and combined both stresses. Frontiers in Plant Science 7, 1471. DOI: 10.3389/fpls.2016.01471.
Zhao, J., Xu, J., Chen, B., Cui, W., Zhou, Z. et al. (2019) Characterization of proteins involved in chloroplast targeting disturbed by rice stripe virus by novel protoplast−chloroplast proteomics. International Journal of Molecular Sciences 20(2), E253. DOI: 10.3390/ijms20020253.
Zörb, C., Schmitt, S. and Mühling, K.H. (2010) Proteomic changes in maize roots after short-term adjustment to saline growth conditions. Proteomics 10(24), 4441–4449. DOI: 10.1002/pmic.201000231.

4 Plant Metabolomics: The Great Potential of Plant Metabolomics in Big Data Biology

Miyako Kusano1,2,3* and Atsushi Fukushima3,4
1Faculty of Life and Environmental Science, University of Tsukuba, Japan; 2Tsukuba-Plant Innovation Research Center, University of Tsukuba, Japan; 3RIKEN Center for Sustainable Resource Science, Yokohama, Japan; 4Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Kyoto, Japan

Abstract
Metabolomics aims to analyze the so-called metabolome, which is composed of all the low-molecular-weight compounds present in biological systems. Approximately 1 million metabolites are estimated to exist in the plant kingdom. The remarkable development of a range of analytical platforms enables the separation, detection, characterization, and (semi-)quantification of primary and secondary metabolites in plants. Furthermore, spatial metabolomics techniques such as mass spectrometry imaging and magnetic resonance imaging have made great progress in visualizing the localization of metabolites in plant tissues. Building on the contributions of the metabolomics community, guidelines for metabolite identification confidence and the collection of metabolomics metadata have been proposed. Here, the analytical targets and techniques in plant metabolomics are summarized. Furthermore, the progress made in generating “reliable” metabolomics data for big data biology is introduced. In conclusion, metabolomics data analyses will boost the progress of big data biology, although validating data reliability remains important. When researchers upload metabolomics data, the minimum reporting standards proposed by the Metabolomics Standards Initiative should be provided along with metabolite identification confidence. More data should be shared worldwide to foster the development of computational approaches for outcome interpretation. Finally, the integration of other omics with metabolomics data and their synthesis at the system level are crucial steps toward linking genotype–metabotype–phenotype relationships and implementing “metabolic editing” in plants.

*Corresponding author: kusano.miyako.fp@u.tsukuba.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0004

Box 4.1. Terminology for understanding chromatography hyphenated to mass spectrometry.
Mass spectrum (MS): a histogram plot of the ion signal intensity as a function of mass-to-charge (m/z; the dimensionless quantity formed by dividing the mass number of an ion by its charge number).
Retention time (RT): the time from injection of the sample to elution of a compound, taken at the maximum signal of that compound at the detector.
Total ion current (TIC) chromatogram: the total ion intensity across the entire range of m/z being detected, at every time point in the analysis.

4.1 Introduction
Plants have a great ability to produce “primary metabolites,” organic compounds such as carbohydrates and amino acids that are produced from inorganic compounds via assimilation processes. Moreover, metabolites of low molecular weight (< 1500 Da) with diverse chemical structures can also be synthesized as “secondary metabolites.” Of these secondary metabolites, “specialized metabolites” are produced as complex blends of structurally diverse, biologically active metabolites, which are used for survival strategies under certain naïve environments. According to a statistical evaluation of the metabolite–plant species database, KNApSAcK, the total number of metabolites produced by plant species is approximately 1,060,000 (Afendi et al., 2012). The huge chemical diversity and wide dynamic range of metabolite concentrations (from femtomolar to millimolar order) of the plant metabolome hinder metabolite detection, quantification, and identification from an analytical point of view (Lei et al., 2011; Jorge et al., 2016). Due to the complex physicochemical properties of metabolites, sample preparation and extraction steps should be optimized prior to metabolomic analysis.

Currently, metabolite characterization in plant metabolomics is mainly carried out through mass spectrometry (MS)-based hyphenated technologies (see Box 4.1 for terminology), which couple MS with powerful separation techniques such as gas chromatography (GC), liquid chromatography (LC), and capillary electrophoresis (CE). Since chromatographic techniques are time-consuming, separation-free techniques such as direct infusion (DI)-MS have been introduced (Goodacre et al., 2003; Majchrzak et al., 2020). These techniques allow very high-throughput analysis through measurement of the complex mass spectra of crude samples/extracts. Direct analysis can also be achieved through desorption-based techniques, such as MS imaging (MSI). In section 4.2, the analytical targets and techniques in plant metabolomics are introduced. Furthermore, the relationship between identification confidence and different metabolomic techniques is also discussed.

Metabolite identification/annotation in the metabolomics workflow is likewise crucial for the biological interpretation of the obtained data. The metabolomics community has proposed defined metrics to assess the identification/annotation confidence of MS-based and nuclear magnetic resonance (NMR)-based metabolomics data (Sumner et al., 2007; Schrimpe-Rutledge et al., 2016; Viant et al., 2017). Here, the accepted levels of metabolite identification/annotation confidence in metabolomics for biological interpretation are explained.

Public metabolomics databases and repositories have been developed based on the Metabolomics Standards Initiative guidelines proposed by the metabolomics community (Fiehn et al., 2007a), such as the Gene Expression Omnibus (GEO) (Barrett et al., 2013), ArrayExpress (Parkinson et al., 2007), the European Nucleotide Archive (ENA) (Cummins


M. Kusano and A. Fukushima

[Figure 4.1. Method groups shown along an identification-confidence axis running from high to low, in the order extracted:
Stable-isotope labeling: FT-MS, NMR
(Semi-)quantification: GC-MS, LC-MS, CE-MS, NMR, DART-MS
Spatial metabolomics: MSI, DESI, MALDI, SIMS, MRI, LAESI-MS
Fingerprinting: in situ MS, DI-MS, FT-IR/NIR]

Fig. 4.1. Identification confidence in different metabolomics methods. CE, capillary electrophoresis; DART, direct analysis in real time; DESI, desorption electrospray ionization; DI, direct infusion; FT, Fourier transform; GC, gas chromatography; IR/NIR, infrared spectroscopy/near-infrared spectroscopy; LAESI, laser-ablation electrospray ionization; LC, liquid chromatography; MS, mass spectrometry; MSI, MS imaging; NMR, nuclear magnetic resonance; MRI, magnetic resonance imaging; SIMS, secondary ion mass spectrometry.

et al., 2022), the Genomic Expression Archive (GEA) (Kodama et al., 2019), and the Sequence Read Archive (SRA) (Leinonen et al., 2011). In section 4.3, the status and potential of metabolomics data collection with metadata for big data biology are discussed.

4.2 Analytical Targets and Techniques

As mentioned in the Introduction, the plant metabolome is highly complex, has a wide dynamic range, and is very diverse. First, analytical targets in plant metabolomics are introduced. There are three strategies to detect and identify or annotate metabolites: separation-based, non-separation-based, and desorption-based (Fig. 4.1). However, current analytical techniques cannot achieve absolute quantification of the metabolome in plants, even in the model plant Arabidopsis thaliana (Arabidopsis), which has one of the smallest genome sizes of all flowering plants (Arabidopsis Genome Initiative, 2000). Great efforts were also made in the mid-2000s to construct publicly available metabolomics databases with the aim of increasing the number of identified/annotated metabolites (Li and Gaquerel, 2021). In this section, the metabolites that are typically detected using current analytical techniques are introduced (Table 4.1). Furthermore, the metabolite identification/annotation confidence levels proposed by the metabolomics community are explained (Fig. 4.2). Then, the recent activities of metabolomics data repositories are shown.

4.2.1 Analytical targets in plant metabolomics

MS coupled with separation-based analytical platforms is mainly used for the detection of secondary metabolites, while central metabolites can be analyzed by MS-based and 1H-NMR-based metabolomics techniques (Table 4.1). Identification/annotation confidence is highly dependent on the purpose of the study (Fig. 4.1). To increase metabolite coverage, several analytical techniques can be combined (Kusano et al., 2011; Alseekh and Fernie, 2018). The physicochemical properties of the analytical targets, such as hydrophobicity, molecular weight, and volatility, should be carefully considered in the experimental set-up for metabolite extraction. Fractionation before injection into


Table 4.1. List of analytical platforms and instrument(s) detecting a variety of metabolite classes and typical metabolites.

| Platform/instrument(s) | Metabolite class | Typical metabolites |
| --- | --- | --- |
| GC-MS | Amino and organic acids, fatty acids, sugars and their derivatives, amines, and organic volatile compounds | Proteinogenic amino acids, TCA cycle intermediates, amino acid derivatives, sugar mono-, di-, and tri-saccharides, sugar monophosphates, polyamines, and aroma |
| LC-MS | Amino and organic acids, polar lipids, pigments, and plant specialized metabolites | Alkaloids, monogalactosyldiacylglycerols, carotenoids, flavonols, anthocyanins, and glucosinolates |
| CE-MS | Polar-ionic compounds | Proteinogenic amino acids, TCA cycle intermediates, amino acid derivatives, sugar mono- and di-phosphates, purines, pyrimidines, mono-, di-, and triphosphate nucleosides |
| NMR | Compounds in soluble mixtures | Proteinogenic amino acids, organic acids, sugars, phenolic compounds, alcohols, and esters |

Fig. 4.2. Metrics of identification confidence in metabolomics. Lv = Level.
Lv. 1: “identified metabolites” according to their match with reference standard compounds or metabolites whose full 2D structure has been elucidated by spectroscopic methods and/or MS spectra.
Lv. 2: “putatively identified compounds” according to literature data or database match containing at least two pieces of orthogonal information, including retention time (index), MS/MS fragment information, spectroscopic data, etc.
Lv. 3: “putatively annotated compounds” that are determined by database match only.
Lv. 4: “unknown compounds with molecular formula” supported by high resolution mass accuracy and/or a heuristic filtering approach.
Lv. 5: “unknown compounds” that can be reproducibly detected and quantified as signals.
Lv. 1 and Lv. 2 compound annotations can be used for the interpretation of biological data. Lv. 3 compound annotations are used for metabolomics data interpretation when other types of orthogonal data can be obtained at different biological layers, for example, genetic modification.
*The total number of Lv. 1–5 compound annotations is highly dependent on the accuracy and resolution of the analytical instrument.


the analytical instrument is required when the yield of the analytical target is very low.

4.2.1.1 Central metabolites

Central metabolites, including primary metabolites and their intermediates, are commonly detected in all plant species. Primary metabolites, such as carbohydrates, amino acids, organic acids, fatty acids, and nucleic acids, are essential for the growth, development, and reproduction of organisms. Since central metabolites, except for fatty acids and terpenoids, are highly hydrophilic, chemical derivatization is required to volatilize them prior to their separation on a capillary column. Through chemical derivatization, hydrophobic compounds with -COOH, -OH, and -NH2 groups are also derivatized to make them sufficiently volatile. GC-MS is a major technique for the detection of metabolites after derivatization because of the data reliability supported by the long history of capillary GC and the electron ionization (EI) method, and by the accumulation of MS spectra with retention information in metabolite databases (Fiehn, 2016). Over 100 central metabolites can be identified with reference compounds or annotated by referring to the MS spectra with retention information (Babushok et al., 2007). It should be noted that GC-MS-based metabolomics uses GC columns of similar polarity (5% diphenyl/95% dimethyl siloxane) as the gold standard, which enables direct comparison of retention information in reference libraries.

1H-NMR-based metabolite profiling is another major technique for the detection of central metabolites (Table 4.1). Approximately 100 metabolites can be detected using 1H-NMR-based metabolite profiling (Krishnan et al., 2004). The integrated analysis of GC-MS-based and NMR-based metabolite profiles of Brassica rapa var. perviridis increased the metabolite coverage (Ichihashi et al., 2020).
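The "retention information" that makes GC-MS libraries comparable is commonly expressed as a retention index relative to n-alkane standards. The chapter does not give a formula, so the sketch below assumes the widely used linear (temperature-programmed) retention index, RI = 100 × [n + (t − t_n)/(t_{n+1} − t_n)], where t_n and t_{n+1} are the retention times of the alkanes with n and n+1 carbons that bracket the analyte. The alkane retention times and the matching tolerance are invented for illustration.

```python
from bisect import bisect_right


def linear_retention_index(rt, alkane_rts):
    """Linear retention index of a peak from bracketing n-alkane standards.

    alkane_rts maps carbon number -> retention time and is assumed to
    increase monotonically with carbon number.
    """
    carbons = sorted(alkane_rts)
    times = [alkane_rts[c] for c in carbons]
    if len(times) < 2:
        raise ValueError("at least two alkane standards are needed")
    if not times[0] <= rt <= times[-1]:
        raise ValueError("retention time lies outside the alkane ladder")
    # Index of the alkane eluting at or just before the analyte.
    pos = min(bisect_right(times, rt) - 1, len(times) - 2)
    n, n_next = carbons[pos], carbons[pos + 1]
    t_n, t_next = times[pos], times[pos + 1]
    return 100.0 * (n + (n_next - n) * (rt - t_n) / (t_next - t_n))


def matches_library(ri, library_ri, tolerance=10.0):
    """True if an observed index is within a (hypothetical) tolerance window."""
    return abs(ri - library_ri) <= tolerance


# Invented alkane ladder: C10, C11, C12 retention times in seconds.
alkanes = {10: 300.0, 11: 360.0, 12: 420.0}
ri = linear_retention_index(330.0, alkanes)
```

Because the index is anchored to the alkane ladder rather than to absolute time, values measured on columns of similar polarity can be compared with library entries even when run times differ between laboratories.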
CE-MS differs from chromatographic separation-based methods such as GC and LC, and is well suited to the detection of ionized metabolites. CE-MS has a wide metabolic coverage for metabolites with molecular weight < 1500 Da because: (i) CE can separate compounds through a capillary column; (ii) CE-MS employs the electrospray ionization (ESI) method to generate ions; and (iii) the cation and anion modes can be applied to detect positively charged and negatively charged metabolites. However, the use of CE-MS in plant metabolomics for the detection of central metabolites has been limited compared with that of GC-MS. One of the reasons for this is the difficulty of detecting sugars using CE-MS. Sugars are the most abundant molecules in plants and are neutral in solution. Such very high concentrations of sugars in plant samples also hinder the (semi-)quantification of sugars using 1H-NMR because of the complex splitting patterns of sugar molecules (Govindaraju et al., 2000).

LC-MS can also be used for central metabolites by pre-treating the samples for the fractionation of targeted metabolite classes, such as sugars, sugar phosphates, amino acids/amines, organic acids, and nucleic acids, based on polarity and/or hydrophilicity (Table 4.1). It is important to choose appropriate columns to detect the targeted central metabolites. Although it is not a popular method in plant metabolomics, chemical derivatization can be utilized to increase detection coverage, retention, and stabilization, and to improve the sensitivity for polar metabolites (Bian et al., 2018; Willacey et al., 2019).
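As a rough illustration of the cation and anion modes mentioned above, the following sketch computes the m/z expected for protonated and deprotonated ions under ESI. It assumes simple [M+nH]n+ and [M−nH]n− ions only (no sodium, ammonium, or other adducts); the function names are hypothetical, and the monoisotopic masses of alanine and citric acid are supplied as example inputs.

```python
PROTON_MASS = 1.007276  # Da, approximate mass of a proton


def mz_protonated(neutral_mass, charge=1):
    """m/z of [M + nH]n+, as observed in cation (positive) mode."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def mz_deprotonated(neutral_mass, charge=1):
    """m/z of [M - nH]n-, as observed in anion (negative) mode."""
    return (neutral_mass - charge * PROTON_MASS) / charge


# Example inputs: monoisotopic masses in Da.
ALANINE = 89.04768       # C3H7NO2, an amino acid typically seen in cation mode
CITRIC_ACID = 192.02700  # C6H8O7, an organic acid typically seen in anion mode

alanine_cation = mz_protonated(ALANINE)
citrate_anion = mz_deprotonated(CITRIC_ACID)
```

A neutral sugar acquires no charge in the running buffer, so it does not migrate electrophoretically; this is the practical reason the text gives for the difficulty of detecting sugars by CE-MS.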

4.2.1.2 Secondary metabolites

Primary metabolites are necessary for the life of organisms, while the production of secondary metabolites is not always essential for survival. However, many secondary metabolites have biological functions, particularly in plants and microorganisms. This means that these kinds of secondary metabolites are far from being “secondary.” Thus, such metabolites are often termed specialized metabolites. In this section, the major classes of plant specialized metabolites are introduced.

Volatile organic compounds (VOCs). VOCs are low-molecular-weight organic compounds (< 200 Da) with a high vapor pressure at ambient temperature. It is estimated that 30,000 different plant VOCs are synthesized in various plant tissues (Peñuelas and Llusià, 2004). GC-MS is


the most commonly used analytical technique for non-targeted VOC profiling, and is used in combination with automatic VOC collection techniques such as solid-phase microextraction (SPME) and stir-bar sorptive extraction (SBSE) (Tholl et al., 2021). Recently, proton transfer reaction (PTR)-MS, which is a DI-MS technique for VOC detection, has been proposed for the real-time detection of VOCs emitted from plants (Majchrzak et al., 2020). Based on their biosynthetic origin and the pathways they belong to, VOCs are classified into furanones/pyrones, terpenoids, aliphatic VOCs, methyl-branched VOCs, amines, sulfur (S)-containing VOCs, and aromatic VOCs (Schwab et al., 2008). Among the aliphatic VOCs, methyl jasmonate and green leaf volatiles, which are derived from fatty acids, are synthesized through lipoxygenase (LOX) pathways (Dudareva et al., 2013; Bouwmeester et al., 2019). In contrast, methyl-branched VOCs are derived from the degradation of amino acids (Schwab et al., 2008). Furanones/pyrones, apocarotenoids (degradation products of carotenoids), mono- and sesquiterpenes, S-containing VOCs including isothiocyanates, and aromatic VOCs are known to be perceived by humans as pleasant or unpleasant flavors (Goff and Klee, 2006). Fractional distillation is often required to analyze flavor/aroma VOCs due to their low concentration.

Non-polar metabolites. The metabolites categorized in this class are chlorophylls, carotenoids, and membrane lipids (hereafter, lipids). These metabolites are hydrophobic; therefore, they are contained in the organic layer after liquid–liquid phase separation. An LC photodiode array detector (PDA) can detect chlorophylls and carotenoids if standard compounds are available. If not, the use of LC-PDA-MS or LC-PDA-MS/MS is recommended (de Rosso and Mercadante, 2007). Lipids have various functions in the biological events of cells from different organisms. For example, lipids provide insulation as permeable barriers to the external environment of cells. Plant lipids play indispensable roles, including the storage of metabolic energy, protection against dehydration and pathogens, electron transport, and light absorption. A typical lipid structure consists of hydrocarbons that include


nonpolar carbon–carbon or carbon–hydrogen bonds. This is a special characteristic of lipids in terms of their physicochemical properties. To boost the structural estimation accuracy, the specific characteristics of lipids have been matched by integrating LC-MS-based lipid profiling and the fragmentation patterns of lipids (Tsugawa et al., 2020).

Specialized metabolites with neutral and polar physicochemical properties. Plant secondary metabolite core structures have also been proposed (Wang et al., 2019). Metabolites such as nitrogen (N)-containing metabolites (also called alkaloids), phenylpropanoids, benzenoids, flavonoids, and di-, sester-, and tri-terpenes possess different physicochemical properties. Thus, it is impossible to detect these metabolites simultaneously using a single analytical instrument. Current LC-MS-based metabolite profiling without pre-fractionation can annotate approximately 50 polar metabolites, including flavanols, anthocyanins, and sinapic acid derivatives. This limitation is due to the abundance and physicochemical properties of analytes, as well as the difficulty of annotating the detected peaks based on the m/z, retention time, and MS/MS spectral patterns.

4.2.2 Analytical methods for plant metabolomics

As shown in Fig. 4.1, the identification confidence depends on the analytical method. Stable isotope labeling methods using MS or NMR can trace each labeled compound at the atomic level. When growing plants, in vivo labeling methods using 13C- or 15N-labeled nutrient sources can trace metabolite flux. In contrast, in vitro labeling methods use 13C- or deuterium-labeled chemicals for metabolite derivatization to improve metabolite detection, and isotope exchange with labeled solvents to obtain structural information.

Metabolite (semi-)quantification is a major method used in plant metabolomics. The separation technique(s) can be applied according to the physicochemical properties of the metabolites. EI and ESI are often applied in MS-based metabolomics. Since the EI method


can ionize gas-phase compounds under vacuum conditions, GC is the best choice for compound separation in EI. ESI is an atmospheric ionization method; thus, LC and CE are popular choices for compound separation in ESI. Particularly for ESI-based MS, ion suppression is a type of matrix effect that hinders analyte quantification (Lu et al., 2008; Furey et al., 2013). Atmospheric pressure chemical ionization (APCI), which is mostly applied for the ionization of uncharged, moderately polar to non-polar compounds with thermal stability (up to 1500 Da), can be combined with GC and LC separation. This is expected to expand metabolite coverage in plant metabolomics. Direct analysis in real time (DART) ionization requires no sample preparation. Solid, liquid, and powder samples can be directly analyzed to obtain m/z information with high mass resolution and accurate mass spectra. However, throughput and reproducibility need to be improved for application in plant metabolomics.

For non-targeted metabolomics, time-of-flight (TOF)-MS is widely used, because it can achieve high-throughput and accurate mass profiling. Quadrupole (Q)- and ion trap (IT)-MS have relatively higher sensitivity than TOF. However, these MS techniques have a limited resolution. A combination of other MS methods with TOF-MS, for example Q-TOF, can be employed to obtain MS/MS spectra. Triple Q-MS can also be applied to conduct targeted metabolite profiling using multiple reaction monitoring (MRM). Orbitrap and Fourier-transform ion cyclotron resonance (FT-ICR)-MS offer ion isolation at ultrahigh resolution compared with TOF-MS. The use of ultrahigh-resolution MS techniques enables mass differences as small as 0.001 atomic mass units to be detected.

As mentioned in section 4.2.1.1, 1H-NMR-based metabolite profiling is often applied to detect central metabolites. NMR spectra are highly reproducible and quantitative over a wide dynamic range, although the sensitivity is lower than that of MS-based metabolite profiling.
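To put the figure of 0.001 atomic mass units in context, the sketch below uses two standard definitions that the chapter does not spell out: the mass error in parts per million, and the resolving power m/Δm needed to separate two ions that differ by Δm at a given m/z. The function names and example values are illustrative assumptions; real instruments define resolution on peak width (for example at half maximum), so this is only an order-of-magnitude guide.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Mass error of a measurement in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def required_resolving_power(mz, delta_mz):
    """Resolving power (m / delta m) needed to separate ions delta_mz apart."""
    return mz / delta_mz


# A 0.001 u difference at m/z 500 corresponds to about 2 ppm,
# and separating such ions requires a resolving power of about 500,000.
error_at_500 = ppm_error(500.001, 500.0)
power_at_500 = required_resolving_power(500.0, 0.001)
```

The same 0.001 u difference is a larger relative error at low m/z and a smaller one at high m/z, which is why mass accuracy is usually quoted in ppm rather than in absolute units.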
Another advantage of NMR-based metabolomics is that it is non-destructive and can thus be utilized in in vivo experiments.

Spatial metabolomics can visualize the localization of metabolites in plant tissues without extraction and compound separation steps. MSI and MRI are used for spatial metabolomics. In plant metabolomics, the resolution of spatial metabolomics is of the order of several micrometers when applying MSI (Bartels and Svatoš, 2015; Petras et al., 2017). A data repository for these kinds of data will be required to accumulate spatial information and integrate metabolite data for big data biology.

4.2.3 Metabolite identification/annotation in metabolomics data

The identification confidence is very important for data reliability. To assess the identification confidence, the metabolomics community proposed five levels of compound annotations (Fig. 4.2). Among these, level (Lv.) 1 annotations correspond to “identified metabolites” using reference standard match or full 2D structure elucidation, with at least two orthogonal techniques (e.g., retention time (index) and accurate mass for MS-based metabolomics) defining the 2D structure confidently. Lv. 2 annotations correspond to “putatively identified” compounds, describing a probable structure that is matched to literature data or databases by diagnostic evidence. For these, there must be at least two orthogonal pieces of information excluding all other candidates. For biological data interpretation, Lv. 1 and 2 compound annotations should be used.
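The five levels can be read as a simple decision cascade. The sketch below is one simplified interpretation of the criteria summarized in Fig. 4.2, not an official implementation of the community guidelines; the `Evidence` fields and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence available for one detected feature (hypothetical fields)."""
    reference_standard_match: bool = False      # matched to an authentic standard
    structure_elucidated: bool = False          # full 2D structure elucidated
    database_or_literature_match: bool = False  # matched to a database or literature
    orthogonal_evidence: int = 0                # e.g. RT (index), accurate mass, MS/MS
    molecular_formula: bool = False             # formula supported by accurate mass


def confidence_level(ev):
    """Return the annotation level (1 = highest confidence, 5 = unknown signal)."""
    confirmed = ev.reference_standard_match or ev.structure_elucidated
    if confirmed and ev.orthogonal_evidence >= 2:
        return 1
    if ev.database_or_literature_match and ev.orthogonal_evidence >= 2:
        return 2
    if ev.database_or_literature_match:
        return 3
    if ev.molecular_formula:
        return 4
    return 5


def usable_for_biological_interpretation(level):
    """Per the text, Lv. 1 and Lv. 2 annotations should be used for interpretation."""
    return level in (1, 2)


peak = Evidence(database_or_literature_match=True, orthogonal_evidence=2)
level = confidence_level(peak)
```

In this reading, a feature with a database match supported by retention index and MS/MS evidence reaches Lv. 2, whereas the same match without orthogonal support stays at Lv. 3.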

4.3 The Importance of Sharing Metabolomics Data

Many databases associated with metabolome analyses have been developed over the past two decades (Tohge and Fernie, 2009; Fukushima and Kusano, 2013). Examples include METLIN (Smith et al., 2005), MassBank (Horai et al., 2010), BioMagResBank (BMRB) (Romero et al., 2020), KEGG (Kanehisa et al., 2021), KNApSAcK (Afendi et al., 2012), PubChem (Kim et al., 2021), ChemSpider (Little et al., 2012), LIPID MAPS (Sud et al., 2012), LipidBank (Watanabe et al., 2000), PathBank (Wishart et al., 2020), and PlantCyc (available at http://www.plantcyc.org, accessed July 2022). Since the establishment of the Metabolomics Standards Initiative (Fiehn et al., 2007a,b), many efforts have been made to ensure that metabolomics studies are conducted and reported appropriately.


4.3.1 Metabolome data repositories

As general-purpose repositories for metabolomics studies, there are two major databases, MetaboLights (Haug et al., 2020) and Metabolomics Workbench (Sud et al., 2016) (Table 4.2). As of July 2021, MetabolomeXchange (http://www.metabolomexchange.org/) indicates that over 800 and 1400 metabolomics datasets are publicly available in MetaboLights and the Metabolomics Workbench, respectively. MetabolomeExpress (Carroll et al., 2010) stores raw data from GC-MS instruments. Global Natural Products Social Molecular Networking (GNPS) is a web-based computational environment that aims to share tandem mass spectrometry data (Wang et al., 2016). MassIVE (https://massive.ucsd.edu/, accessed September 2022) is a ProteomeXchange (Deutsch et al., 2020) resource that archives metabolomics datasets submitted through GNPS. For NMR data collection, nmrML, an open XML-based exchange and storage format for NMR spectral data, has been provided to standardize the NMR raw data (Schober et al., 2018).

The Golm Metabolome Database (GMD) is a plant-specific metabolome database that provides publicly available mass spectral libraries, metabolite profiling datasets, and useful tools (Kopka et al., 2005; Hummel et al., 2010). MeRy-B provides metabolite profile data and knowledge from multiple research projects, especially in plant primary metabolism (Ferry-Dumazet et al., 2011). It is a web-based knowledge base for data storage, visualization, and analysis of plant NMR metabolomic profiles. MassBase (http://webs2.kazusa.or.jp/massbase/index.php/, accessed September 2022) is also used for raw data sharing across different species and different techniques (Ara et al., 2021). Metabolonote supports the publication of metabolomic metadata (Ara et al., 2015). The data and metadata for more than 140 Arabidopsis mutants are archived at the Plant and Microbial Metabolomics Resource (PMR) (https://metnetweb.gdcb.iastate.edu/, accessed September 2022) (Bais et al., 2012; Hur et al., 2013). It is a flexible database designed for data sharing in metabolomics, and implements data analysis tools. Chloroplast 2010 is a database for large-scale phenotypic screening of Arabidopsis chloroplast mutants, and contains MS-derived


amino acid and fatty acid data from more than 10,000 T-DNA insertion mutants (Gu et al., 2007; Lu et al., 2011; Bell et al., 2012). We previously developed the Metabolite Profiling Database for Knock-Out Mutants in Arabidopsis (MeKO) (Fukushima et al., 2014) for improving gene annotation. Fifty Arabidopsis mutants were chosen that resemble the wild-type plant (ecotype Columbia, Col). These mutants were characterized during normal plant growth using the previously established metabolite profiling method based on GC-MS (Kusano et al., 2007), which mainly detects primary metabolites.

4.3.2 Toward reproducible metabolome data analysis

As mentioned above, major public repositories for general metabolomics studies have been established. However, complete data sharing remains difficult, especially in a form that supports data reanalysis, reusability, and reproducibility. Recently, a systematic analysis of publicly available metabolomics metadata revealed that the majority of shared metadata leaves a great deal to be desired (Spicer et al., 2017a,b). Although minimum reporting guidelines for computational data analysis in this area have been proposed (Goodacre et al., 2007), the need for improving metadata completeness is still debated (Considine and Salek, 2019). Developing an efficient e-infrastructure and data analysis workflow for metabolomics is also important. Examples of these are Galaxy-based tools (Giacomoni et al., 2015; Davidson et al., 2016), XCMS Online (Tautenhahn et al., 2012; Huan et al., 2017), PhenoMeNal (Peters et al., 2019), MetabolomeExpress (Carroll et al., 2010, 2015), and MetaboAnalyst (Pang et al., 2021). MetabolomeExpress allows users to process, disseminate, and interpret GC-MS data, enabling meta-analysis in the systems. XCMS Online is freely available and covers the workflow from raw data processing to statistical data analysis, including uni- and multivariate analysis (Tautenhahn et al., 2012). Another collaborative approach is the use of Jupyter notebooks (https://www.metabolomicsworkbench.org/data/AnalyzeUsingJupyterNotebooks.php, accessed September 2022) and R markdown files for metabolomics

Table 4.2. Useful metabolome repositories and related resources. All URLs accessed in September 2022.

| Resource name | Description | URL | PubMed ID | Current component of MetabolomeXchange |
| --- | --- | --- | --- | --- |
| MetaboLights | A general-purpose repository for metabolomics studies in the European Bioinformatics Institute (EMBL-EBI) | https://www.ebi.ac.uk/metabolights/ | 31691833 | Yes |
| Metabolomics Workbench | A general-purpose repository for metabolomics studies in the USA | https://www.metabolomicsworkbench.org/ | 26467476 | Yes |
| MeRy-B | A web-based knowledge base for data storage, visualization, and analysis of plant NMR metabolomic profiles | http://services.cbib.u-bordeaux.fr/MERYB/index.php | 21668943 | Yes |
| Metabolonote | A Wiki-based database and management system that manages experimental metadata for metabolomics studies | http://metabolonote.kazusa.or.jp/ | 25905099 | Yes |
| The Golm Metabolome Database (GMD) | A well-established metabolome database, providing publicly available mass spectral libraries, metabolite profiling datasets, and useful tools | http://gmd.mpimp-golm.mpg.de/ | 15613389 | No |
| GNPS-MassIVE | Global Natural Products Social Molecular Networking (GNPS) is a web-based computational environment that aims to share tandem mass (MS/MS) spectrometry data. MassIVE is one of the ProteomeXchange resources and also archives metabolomics datasets submitted through GNPS | https://gnps.ucsd.edu | 32405051 | No |
| MassBase | A large-scaled depository of mass spectrometry datasets for metabolome analysis, including plant, algae, microorganisms, animals, and foods | http://webs2.kazusa.or.jp/massbase/ | 34177338 | No |
| RIKEN Plant Metabolome MetaDatabase (PMM) | An integrated plant metabolome data repository based on the semantic web | http://metabobank.riken.jp/pmm/db/plantMetabolomics | 34918130 | No |


studies (Mendez et al., 2019). In addition to enhancing the metadata of metabolomics data analysis, it is important to recognize the value of good experimental design and to be aware of the problems related to sample size.

We are currently developing an integrated plant metabolome data repository based on the semantic web, called the RIKEN Plant Metabolome MetaDatabase (RIKEN PMM) (http://metabobank.riken.jp/pmm/db/plantMetabolomics) (Fukushima et al., 2021). It mainly stores MS-based (e.g., GC-MS-based) metabolite profiling data of plants together with their detailed, structured experimental metadata, including sampling and experimental procedures. The metadata are described using the Resource Description Framework (RDF) and standardized vocabularies, such as the Metabolomics Standards Initiative Ontology (MSIO) (https://github.com/MSI-Metabolomics-Standards-Initiative/MSIO/, accessed September 2022), so that they can be integrated with various life and biomedical science data. RIKEN PMM implements intuitive and interactive operations for plant metabolome data, including raw data (netCDF format), mass spectra (NIST MSP format), and metabolite annotations. This feature is suitable not only for biologists who are interested in metabolomic phenotypes, but also for researchers who would like to investigate plant metabolomic approaches. In addition to existing accessor tools for general-purpose metabolomics repositories (Table 4.3), we also developed the R package ‘rRPMM’ (https://github.com/afukushima/rRPMM, accessed September 2022) to download and parse the metadata from the RIKEN PMM. The package can convert the downloaded metadata to an R list and visualize the species distribution in the metadatabase.
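The idea of RDF-style metadata and of summarizing a species distribution can be sketched in a few lines. The example below is not the RIKEN PMM schema and does not reproduce the rRPMM interface; the study identifiers, the `ex:` predicates, and the helper functions are all invented for illustration. It only shows how subject-predicate-object triples can be filtered and counted.

```python
from collections import Counter

# Hypothetical metadata as (subject, predicate, object) triples.
triples = [
    ("study:S1", "ex:organism", "Arabidopsis thaliana"),
    ("study:S1", "ex:platform", "GC-MS"),
    ("study:S2", "ex:organism", "Oryza sativa"),
    ("study:S2", "ex:platform", "GC-MS"),
    ("study:S3", "ex:organism", "Arabidopsis thaliana"),
    ("study:S3", "ex:platform", "LC-MS"),
]


def objects(triples, predicate):
    """All object values recorded for a given predicate."""
    return [o for _, p, o in triples if p == predicate]


def subjects_with(triples, predicate, obj):
    """Subjects that carry the given predicate-object pair."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# Species distribution across the (invented) studies.
species_counts = Counter(objects(triples, "ex:organism"))
gcms_studies = subjects_with(triples, "ex:platform", "GC-MS")
```

In a real semantic-web repository, the predicates would come from shared vocabularies such as MSIO, which is what allows metadata from different resources to be queried together.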

4.3.3 Future metabolomics data analysis enhancing new biological discoveries

The development of new data processing software, tools, and data repositories for metabolomics data will progress slowly but surely. Improvements in data standardization, exchange formats, and efficient and rapid data deposition could facilitate data sharing in the metabolomics community. While sharing, visualization, and meta-analysis of metabolite profile data are clearly very useful in biological science, constructing and managing metabolite-profile-oriented databases such as PMR (Hur et al., 2013) and MeKO (Fukushima et al., 2014) are still laborious tasks. The integration of metabolomics studies between different laboratories is challenging. Fundamentally, typical high-throughput instruments potentially generate systematic analytical variation among different datasets (so-called batch effects (Leek et al., 2010; Lazar et al., 2013)), hampering direct comparison with previously published data, especially absolute values of signal intensity in chromatograms. To overcome these difficulties, the NIST Standard Reference Material (SRM 1950, “Metabolites in Human Plasma”) (Phinney et al., 2013; Simón-Manso et al., 2013), for which the concentrations of approximately 100 analytes are provided, may improve the sharing of metabolomics datasets. Previous studies have evaluated human samples in different laboratories using different analytical platforms, indicating the need for reproducible metabolomics studies (Siskos et al., 2017; Izumi et al., 2019). Inter-laboratory studies of leaf development and its molecular phenotypes, including metabolite profiling, in different Arabidopsis accessions showed that metabolite profile data exhibited a strong genotype × environment effect (Massonnet et al., 2010). This suggests that controlling environmental conditions (e.g., light quality) is one of the key factors for reproducibility in plant metabolomics. Metabolomics-driven approaches, including large-scale comparative metabolite profiling, metabolite quantitative trait locus (mQTL), and genome-wide association studies (mGWAS), could be used to facilitate crop improvement and high-throughput breeding.

4.4 Conclusions and Outlook

Metabolomics is a crucial, integral part of systems biology, but beginners may hesitate to enter this area, because there are few scientists who are very good at integrating analytical chemistry with large-scale data analysis using chem- and bio-informatics, at least compared with their counterparts in

Table 4.3. List of accessor software tools for public metabolome (meta)datasets (All URLs accessed in September 2022.)

| Software name | Repository | URL | Raw data | Processed data | Software type |
| --- | --- | --- | --- | --- | --- |
| metabolomicsWorkbenchR | Metabolomics Workbench | https://bioconductor.org/packages/release/bioc/html/metabolomicsWorkbenchR.html | No | Yes | R |
| metabolighteR | MetaboLights | https://cran.r-project.org/web/packages/metabolighteR/index.html | No | No | R |
| rRPMM | RIKEN PMM | https://github.com/afukushima/rRPMM | Yes | No | R |
| mwtab | Metabolomics Workbench | https://github.com/MoseleyBioinformaticsLab/mwtab | Yes | Yes | Python |
| MTBLS Python-based REST service | MetaboLights | https://github.com/EBI-Metabolights/MtblsWS-Py | Yes | Yes | Python |


genomics and proteomics. Metabolomics data collected by GC-MS, LC-MS, and NMR analyses will boost the progress of big data biology. Since plants can produce various types of metabolites, the validation of data reliability is important. Depending on the purpose and situation, appropriate analytical techniques must be chosen to monitor the changes in metabolite levels in different plant species. In the future, minimum reporting standards as proposed by the Metabolomics Standards Initiative should be provided along with metabolite identification confidence when researchers upload metabolomics data (Sumner et al., 2007). Generally, metabolomics data do not provide easily interpretable outcomes. One of the key ways to further develop computational approaches for interpretation is to share more data with scientists worldwide. Integrating other omics data with metabolomics data and synthesizing them at the systems level is also valuable. This is an important step toward linking genotype–metabotype–phenotype relationships and implementing “metabolic editing” as the next generation of metabolic engineering in plants.

References

Afendi, F.M., Okada, T., Yamazaki, M., Hirai-Morita, A., Nakamura, Y. et al. (2012) KNApSAcK family databases: integrated metabolite-plant species databases for multifaceted plant research. Plant and Cell Physiology 53(2), e1. DOI: 10.1093/pcp/pcr165.
Alseekh, S. and Fernie, A.R. (2018) Metabolomics 20 years on: what have we learned and what hurdles remain? The Plant Journal 94(6), 933–942. DOI: 10.1111/tpj.13950.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.
Ara, T., Enomoto, M., Arita, M., Ikeda, C., Kera, K. et al. (2015) Metabolonote: a wiki-based database for managing hierarchical metadata of metabolome analyses. Frontiers in Bioengineering and Biotechnology 3, 38. DOI: 10.3389/fbioe.2015.00038.
Ara, T., Sakurai, N., Suzuki, H., Aoki, K., Saito, K. et al. (2021) MassBase: a large-scaled depository of mass spectrometry datasets for metabolome analysis. Plant Biotechnology 38(1), 167–171. DOI: 10.5511/plantbiotechnology.20.0911a.
Babushok, V.I., Linstrom, P.J., Reed, J.J., Zenkevich, I.G., Brown, R.L. et al. (2007) Development of a database of gas chromatographic retention properties of organic compounds. Journal of Chromatography A 1157(1–2), 414–421. DOI: 10.1016/j.chroma.2007.05.044.
Bais, P., Moon-Quanbeck, S.M., Nikolau, B.J. and Dickerson, J.A. (2012) Plantmetabolomics.org: mass spectrometry-based Arabidopsis metabolomics: database and tools update. Nucleic Acids Research 40(Database issue), D1216–D1220. DOI: 10.1093/nar/gkr969.
Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F. et al. (2013) NCBI GEO: archive for functional genomics data sets: update. Nucleic Acids Research 41(Database issue), D991–D995. DOI: 10.1093/nar/gks1193.
Bartels, B. and Svatoš, A. (2015) Spatially resolved in vivo plant metabolomics by laser ablation-based mass spectrometry imaging (MSI) techniques: LDI-MSI and LAESI. Frontiers in Plant Science 6, 471. DOI: 10.3389/fpls.2015.00471.
Bell, S.M., Burgoon, L.D. and Last, R.L. (2012) MIPHENO: data normalization for high throughput metabolite analysis. BMC Bioinformatics 13, 10. DOI: 10.1186/1471-2105-13-10.
Bian, X., Li, N., Tan, B., Sun, B., Guo, M.-Q. et al. (2018) Polarity-tuning derivatization-LC-MS approach for probing global carboxyl-containing metabolites in colorectal cancer. Analytical Chemistry 90(19), 11210–11215. DOI: 10.1021/acs.analchem.8b01873.
Bouwmeester, H., Schuurink, R.C., Bleeker, P.M. and Schiestl, F. (2019) The role of volatiles in plant communication. The Plant Journal 100(5), 892–907. DOI: 10.1111/tpj.14496.
Carroll, A.J., Badger, M.R. and Harvey Millar, A. (2010) The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets. BMC Bioinformatics 11, 376. DOI: 10.1186/1471-2105-11-376.
Carroll, A.J., Zhang, P., Whitehead, L., Kaines, S., Tcherkez, G. et al. (2015) PhenoMeter: a metabolome database search tool using statistical similarity matching of metabolic phenotypes for high-confidence detection of functional links. Frontiers in Bioengineering and Biotechnology 3, 106. DOI: 10.3389/fbioe.2015.00106.


Considine, E.C. and Salek, R.M. (2019) A tool to encourage minimum reporting guideline uptake for data analysis in metabolomics. Metabolites 9(3), E43. DOI: 10.3390/metabo9030043.
Cummins, C., Ahamed, A., Aslam, R., Burgin, J., Devraj, R. et al. (2022) The European Nucleotide Archive in 2021. Nucleic Acids Research 50(D1), D106–D110. DOI: 10.1093/nar/gkab1051.
Davidson, R.L., Weber, R.J.M., Liu, H., Sharma-Oates, A. and Viant, M.R. (2016) Galaxy-M: a Galaxy workflow for processing and analyzing direct infusion and liquid chromatography mass spectrometry-based metabolomics data. GigaScience 5, 10. DOI: 10.1186/s13742-016-0115-8.
de Rosso, V.V. and Mercadante, A.Z. (2007) Identification and quantification of carotenoids, by HPLC-PDA-MS/MS, from Amazonian fruits. Journal of Agricultural and Food Chemistry 55(13), 5062–5072. DOI: 10.1021/jf0705421.
Deutsch, E.W., Bandeira, N., Sharma, V., Perez-Riverol, Y., Carver, J.J. et al. (2020) The ProteomeXchange consortium in 2020: enabling “big data” approaches in proteomics. Nucleic Acids Research 48(D1), D1145–D1152. DOI: 10.1093/nar/gkz984.
Dudareva, N., Klempien, A., Muhlemann, J.K. and Kaplan, I. (2013) Biosynthesis, function and metabolic engineering of plant volatile organic compounds. The New Phytologist 198(1), 16–32. DOI: 10.1111/nph.12145.
Ferry-Dumazet, H., Gil, L., Deborde, C., Moing, A., Bernillon, S. et al. (2011) MeRy-B: a web knowledgebase for the storage, visualization, analysis and annotation of plant NMR metabolomic profiles. BMC Plant Biology 11, 104. DOI: 10.1186/1471-2229-11-104.
Fiehn, O. (2016) Metabolomics by gas chromatography-mass spectrometry: combined targeted and untargeted profiling. Current Protocols in Molecular Biology 114, 30. DOI: 10.1002/0471142727.mb3004s114.
Fiehn, O., Robertson, D., Griffin, J., van der Werf, M., Nikolau, B. et al. (2007a) The metabolomics standards initiative (MSI). Metabolomics 3(3), 175–178. DOI: 10.1007/s11306-007-0070-6.
Fiehn, O., Sumner, L.W., Rhee, S.Y., Ward, J., Dickerson, J. et al. (2007b) Minimum reporting standards for plant biology context information in metabolomic studies. Metabolomics 3(3), 195–201. DOI: 10.1007/s11306-007-0068-0.
Fukushima, A. and Kusano, M. (2013) Recent progress in the development of metabolome databases for plant systems biology. Frontiers in Plant Science 4, 73. DOI: 10.3389/fpls.2013.00073.
Fukushima, A., Kusano, M., Mejia, R.F., Iwasa, M., Kobayashi, M. et al. (2014) Metabolomic characterization of knockout mutants in Arabidopsis: development of a metabolite profiling database for knockout mutants in Arabidopsis. Plant Physiology 165, 948–961.
Fukushima, A., Takahashi, M., Nagasaki, H., Aono, Y., Kobayashi, M. et al. (2021) Development of RIKEN plant metabolome metaDatabase. Plant and Cell Physiology 63(3), 433–440. DOI: 10.1093/pcp/pcab173.
Furey, A., Moriarty, M., Bane, V., Kinsella, B. and Lehane, M. (2013) Ion suppression; a critical review on causes, evaluation, prevention and applications. Talanta 115, 104–122. DOI: 10.1016/j.talanta.2013.03.048.
Giacomoni, F., Le Corguille, G., Monsoor, M., Landi, M., Pericard, P. et al. (2015) Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. Bioinformatics 31, 1493–1495. DOI: 10.1093/bioinformatics/btu813.
Goodacre, R., York, E.V., Heald, J.K. and Scott, I.M. (2003) Chemometric discrimination of unfractionated plant extracts analyzed by electrospray mass spectrometry. Phytochemistry 62(6), 859–863. DOI: 10.1016/s0031-9422(02)00718-5.
Goff, S.A. and Klee, H.J. (2006) Plant volatile compounds: sensory cues for health and nutritional value? Science 311(5762), 815–819. DOI: 10.1126/science.1112614.
Goodacre, R., Broadhurst, D., Smilde, A.K., Kristal, B.S., Baker, J.D. et al. (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3(3), 231–241. DOI: 10.1007/s11306-007-0081-3.
Govindaraju, V., Young, K. and Maudsley, A.A. (2000) Proton NMR chemical shifts and coupling constants for brain metabolites. NMR in Biomedicine 13(3), 129–153. DOI: 10.1002/1099-1492(200005)13:3<129::aid-nbm619>3.0.co;2-v.
Gu, L., Jones, A.D. and Last, R.L. (2007) LC-MS/MS assay for protein amino acids and metabolically related compounds for large-scale screening of metabolic phenotypes. Analytical Chemistry 79(21), 8067–8075. DOI: 10.1021/ac070938b.


Haug, K., Cochrane, K., Nainala, V.C., Williams, M., Chang, J. et al. (2020) MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Research 48(D1), D440–D444. DOI: 10.1093/nar/gkz1019.
Horai, H., Arita, M., Kanaya, S., Nihei, Y., Ikeda, T. et al. (2010) MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry 45(7), 703–714. DOI: 10.1002/jms.1777.
Huan, T., Forsberg, E.M., Rinehart, D., Johnson, C.H., Ivanisevic, J. et al. (2017) Systems biology guided by XCMS online metabolomics. Nature Methods 14(5), 461–462. DOI: 10.1038/nmeth.4260.
Hummel, J., Strehmel, N., Selbig, J., Walther, D. and Kopka, J. (2010) Decision tree supported substructure prediction of metabolites from GC-MS profiles. Metabolomics 6(2), 322–333. DOI: 10.1007/s11306-010-0198-7.
Hur, M., Campbell, A.A., Almeida-de-Macedo, M., Li, L., Ransom, N. et al. (2013) A global approach to analysis and interpretation of metabolic data for plant natural product discovery. Natural Product Reports 30(4), 565–583. DOI: 10.1039/c3np20111b.
Ichihashi, Y., Date, Y., Shino, A., Shimizu, T., Shibata, A. et al. (2020) Multi-omics analysis on an agroecosystem reveals the significant role of organic nitrogen to increase agricultural crop yield. Proceedings of the National Academy of Sciences 117(25), 14552–14560. DOI: 10.1073/pnas.1917259117.
Izumi, Y., Matsuda, F., Hirayama, A., Ikeda, K., Kita, Y. et al. (2019) Inter-laboratory comparison of metabolite measurements for metabolomics data integration. Metabolites 9(11), E257. DOI: 10.3390/metabo9110257.
Jorge, T.F., Mata, A.T. and António, C. (2016) Mass spectrometry as a quantitative tool in plant metabolomics. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 374(2079), 20150370. DOI: 10.1098/rsta.2015.0370.
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. and Tanabe, M. (2021) KEGG: integrating viruses and cellular organisms. Nucleic Acids Research 49(D1), D545–D551. DOI: 10.1093/nar/gkaa970.
Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J. et al. (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Research 49(D1), D1388–D1395. DOI: 10.1093/nar/gkaa971.
Kodama, Y., Mashima, J., Kosuge, T. and Ogasawara, O. (2019) DDBJ update: the Genomic Expression Archive (GEA) for functional genomics data. Nucleic Acids Research 47(D1), D69–D73. DOI: 10.1093/nar/gky1002.
Kopka, J., Schauer, N., Krueger, S., Birkemeyer, C., Usadel, B. et al. (2005) GMD@CSB.DB: the Golm Metabolome Database. Bioinformatics (Oxford, England) 21(8), 1635–1638. DOI: 10.1093/bioinformatics/bti236.
Krishnan, P., Kruger, N.J. and Ratcliffe, R.G. (2004) Metabolite fingerprinting and profiling in plants using NMR. Journal of Experimental Botany 56(410), 255–265. DOI: 10.1093/jxb/eri010.
Kusano, M., Fukushima, A., Redestig, H. and Saito, K. (2011) Metabolomic approaches toward understanding nitrogen metabolism in plants. Journal of Experimental Botany 62(4), 1439–1453. DOI: 10.1093/jxb/erq417.
Kusano, M., Fukushima, A., Arita, M., Jonsson, P., Moritz, T. et al. (2007) Unbiased characterization of genotype-dependent metabolic regulations by metabolomic approach in Arabidopsis thaliana. BMC Systems Biology 1, 53. DOI: 10.1186/1752-0509-1-53.
Lazar, C., Meganck, S., Taminau, J., Steenhoff, D., Coletta, A. et al. (2013) Batch effect removal methods for microarray gene expression data integration: a survey. Briefings in Bioinformatics 14(4), 469–490. DOI: 10.1093/bib/bbs037.
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B. et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews. Genetics 11(10), 733–739. DOI: 10.1038/nrg2825.
Lei, Z., Huhman, D.V. and Sumner, L.W. (2011) Mass spectrometry strategies in metabolomics. The Journal of Biological Chemistry 286(29), 25435–25442. DOI: 10.1074/jbc.R111.238691.
Leinonen, R., Sugawara, H., Shumway, M. and International Nucleotide Sequence Database Collaboration (2011) The sequence read archive. Nucleic Acids Research 39(Database issue), D19–21. DOI: 10.1093/nar/gkq1019.
Li, D. and Gaquerel, E. (2021) Next-generation mass spectrometry metabolomics revives the functional analysis of plant metabolic diversity. Annual Review of Plant Biology 72, 867–891. DOI: 10.1146/annurev-arplant-071720-114836.


Little, J.L., Williams, A.J., Pshenichnov, A. and Tkachenko, V. (2012) Identification of “known unknowns” utilizing accurate mass data and ChemSpider. Journal of the American Society for Mass Spectrometry 23(1), 179–185. DOI: 10.1007/s13361-011-0265-y.
Lu, W., Bennett, B.D. and Rabinowitz, J.D. (2008) Analytical strategies for LC-MS-based targeted metabolomics. Journal of Chromatography B 871(2), 236–242. DOI: 10.1016/j.jchromb.2008.04.031.
Lu, Y., Savage, L.J., Larson, M.D., Wilkerson, C.G. and Last, R.L. (2011) Chloroplast 2010: a database for large-scale phenotypic screening of Arabidopsis mutants. Plant Physiology 155(4), 1589–1600. DOI: 10.1104/pp.110.170118.
Majchrzak, T., Wojnowski, W., Rutkowska, M. and Wasik, A. (2020) Real-time volatilomics: a novel approach for analyzing biological samples. Trends in Plant Science 25(3), 302–312. DOI: 10.1016/j.tplants.2019.12.005.
Massonnet, C., Vile, D., Fabre, J., Hannah, M.A., Caldana, C. et al. (2010) Probing the reproducibility of leaf growth and molecular phenotypes: a comparison of three Arabidopsis accessions cultivated in ten laboratories. Plant Physiology 152(4), 2142–2157. DOI: 10.1104/pp.109.148338.
Mendez, K.M., Pritchard, L., Reinke, S.N. and Broadhurst, D.I. (2019) Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing. Metabolomics 15(10), 125. DOI: 10.1007/s11306-019-1588-0.
Pang, Z., Chong, J., Zhou, G., de Lima Morais, D.A., Chang, L. et al. (2021) MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights. Nucleic Acids Research 49(W1), W388–W396. DOI: 10.1093/nar/gkab382.
Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R. et al. (2007) ArrayExpress: a public database of microarray experiments and gene expression profiles. Nucleic Acids Research 35(Database issue), D747–D750. DOI: 10.1093/nar/gkl995.
Peñuelas, J. and Llusià, J. (2004) Plant VOC emissions: making use of the unavoidable. Trends in Ecology & Evolution 19(8), 402–404. DOI: 10.1016/j.tree.2004.06.002.
Peters, K., Bradbury, J., Bergmann, S., Capuccini, M., Cascante, M. et al. (2019) PhenoMeNal: processing and analysis of metabolomics data in the cloud. GigaScience 8(2), giy149. DOI: 10.1093/gigascience/giy149.
Petras, D., Jarmusch, A.K. and Dorrestein, P.C. (2017) From single cells to our planet-recent advances in using mass spectrometry for spatially resolved metabolomics. Current Opinion in Chemical Biology 36, 24–31. DOI: 10.1016/j.cbpa.2016.12.018.
Phinney, K.W., Ballihaut, G., Bedner, M., Benford, B.S., Camara, J.E. et al. (2013) Development of a standard reference material for metabolomics research. Analytical Chemistry 85(24), 11732–11738. DOI: 10.1021/ac402689t.
Romero, P.R., Kobayashi, N., Wedell, J.R., Baskaran, K., Iwata, T. et al. (2020) BioMagResBank (BMRB) as a resource for structural biology. Methods in Molecular Biology 2112, 187–218.
Schober, D., Jacob, D., Wilson, M., Cruz, J.A., Marcu, A. et al. (2018) nmrML: a community supported open data standard for the description, storage, and exchange of NMR data. Analytical Chemistry 90(1), 649–656. DOI: 10.1021/acs.analchem.7b02795.
Schrimpe-Rutledge, A.C., Codreanu, S.G., Sherrod, S.D. and McLean, J.A. (2016) Untargeted metabolomics strategies-challenges and emerging directions. Journal of the American Society for Mass Spectrometry 27(12), 1897–1905. DOI: 10.1007/s13361-016-1469-y.
Schwab, W., Davidovich-Rikanati, R. and Lewinsohn, E. (2008) Biosynthesis of plant-derived flavor compounds. The Plant Journal 54(4), 712–732. DOI: 10.1111/j.1365-313X.2008.03446.x.
Simón-Manso, Y., Lowenthal, M.S., Kilpatrick, L.E., Sampson, M.L., Telu, K.H. et al. (2013) Metabolite profiling of a NIST standard reference material for human plasma (SRM 1950): GC-MS, LC-MS, NMR, and clinical laboratory analyses, libraries, and web-based resources. Analytical Chemistry 85(24), 11725–11731. DOI: 10.1021/ac402503m.
Siskos, A.P., Jain, P., Römisch-Margl, W., Bennett, M., Achaintre, D. et al. (2017) Interlaboratory reproducibility of a targeted metabolomics platform for analysis of human serum and plasma. Analytical Chemistry 89(1), 656–665. DOI: 10.1021/acs.analchem.6b02930.
Smith, C.A., O’Maille, G., Want, E.J., Qin, C., Trauger, S.A. et al. (2005) METLIN: a metabolite mass spectral database. Therapeutic Drug Monitoring 27, 747–751.
Spicer, R.A., Salek, R. and Steinbeck, C. (2017a) Compliance with minimum information guidelines in public metabolomics repositories. Scientific Data 4(1), 170137. DOI: 10.1038/sdata.2017.137.
Spicer, R.A., Salek, R. and Steinbeck, C. (2017b) A decade after the metabolomics standards initiative it’s time for a revision. Scientific Data 4(1), 170138. DOI: 10.1038/sdata.2017.138.


Sud, M., Fahy, E., Cotter, D., Azam, K., Vadivelu, I. et al. (2016) Metabolomics Workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Research 44(D1), D463–D470. DOI: 10.1093/nar/gkv1042.
Sud, M., Fahy, E., Cotter, D., Dennis, E.A. and Subramaniam, S. (2012) LIPID MAPS-nature lipidomics gateway: an online resource for students and educators interested in lipids. Journal of Chemical Education 89(2), 291–292. DOI: 10.1021/ed200088u.
Sumner, L.W., Amberg, A., Barrett, D., Beale, M.H., Beger, R. et al. (2007) Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics: Official Journal of the Metabolomic Society 3(3), 211–221. DOI: 10.1007/s11306-007-0082-2.
Tautenhahn, R., Patti, G.J., Rinehart, D. and Siuzdak, G. (2012) XCMS Online: a web-based platform to process untargeted metabolomic data. Analytical Chemistry 84(11), 5035–5039. DOI: 10.1021/ac300698c.
Tholl, D., Hossain, O., Weinhold, A., Röse, U.S.R. and Wei, Q. (2021) Trends and applications in plant volatile sampling and analysis. The Plant Journal 106(2), 314–325. DOI: 10.1111/tpj.15176.
Tohge, T. and Fernie, A.R. (2009) Web-based resources for mass-spectrometry-based metabolomics: a user’s guide. Phytochemistry 70(4), 450–456. DOI: 10.1016/j.phytochem.2009.02.004.
Tsugawa, H., Ikeda, K., Takahashi, M., Satoh, A., Mori, Y. et al. (2020) A lipidome atlas in MS-DIAL 4. Nature Biotechnology 38(10), 1159–1163. DOI: 10.1038/s41587-020-0531-2.
Viant, M.R., Kurland, I.J., Jones, M.R. and Dunn, W.B. (2017) How close are we to complete annotation of metabolomes? Current Opinion in Chemical Biology 36, 64–69. DOI: 10.1016/j.cbpa.2017.01.001.
Wang, S., Alseekh, S., Fernie, A.R. and Luo, J. (2019) The structure and function of major plant metabolite modifications. Molecular Plant 12(7), 899–919. DOI: 10.1016/j.molp.2019.06.001.
Wang, M., Carver, J.J., Phelan, V.V., Sanchez, L.M., Garg, N. et al. (2016) Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nature Biotechnology 34(8), 828–837. DOI: 10.1038/nbt.3597.
Watanabe, K., Yasugi, E. and Oshima, M. (2000) How to search the glycolipid data in “LIPIDBANK for Web” the newly developed lipid database in Japan. Trends in Glycoscience and Glycotechnology 12(65), 175–184. DOI: 10.4052/tigg.12.175.
Willacey, C.C.W., Naaktgeboren, M., Lucumi Moreno, E., Wegrzyn, A.B., van der Es, D. et al. (2019) LC-MS/MS analysis of the central energy and carbon metabolites in biological samples following derivatization by dimethylaminophenacyl bromide. Journal of Chromatography. A 1608, 460413. DOI: 10.1016/j.chroma.2019.460413.
Wishart, D.S., Li, C., Marcu, A., Badran, H., Pon, A. et al. (2020) PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Research 48(D1), D470–D478. DOI: 10.1093/nar/gkz861.

5 Plant Phenomics

Wei Guo* and Jiangsan Zhao
Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan

Abstract

Plant phenomics is the study of plant growth and performance using multidisciplinary approaches, including high-speed computing, computer vision, sensing, and robotics. Numerous phenotyping platforms, ranging from laboratories to growth chambers, greenhouses, and finally the field, have been established through substantial efforts devoted by the plant phenotyping community. These platforms offer opportunities to measure plant traits at different scales and automation levels, with reasonable throughput and accuracy. Different sensor technologies and data analysis pipelines have been successfully implemented in plant phenomics. However, the available phenotyping platforms are not without limitations, preventing the broader application of plant phenomics in crop breeding, vegetation monitoring, and precision agriculture. The present review summarizes recent advances in various phenotyping platforms in plant phenomics, highlights their limitations, and proposes research directions for future studies in this area.

*Corresponding author: guowei@g.ecc.u-tokyo.ac.jp
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0005

5.1 Introduction to Plant Phenomics

Genomics has shown continuous progress over the past three decades. The introduction of next-generation sequencing (NGS) has allowed for rapid and effortless sequencing of the whole genome of many plant species (Martin et al., 2013; Li et al., 2018) (see also Chapter 1 of this book). One of the most important objectives of genomics is to detect genes or loci that are associated with various plant traits (Sukumaran et al., 2015, 2018) (see also Chapter 12 of this book). In addition to contributing to advances in basic biology, such efforts could help solve future food problems through crop breeding (Hickey et al., 2019; Esposito et al., 2020) (see also Chapter 13 of this book).

Plant phenotypes are highly unstable, as opposed to genotypes. Plant phenotypes are determined by both genotype and its interaction with the ever-changing environment during plant growth. Fluctuations in the surrounding environment can result from a combination of several or all of the following factors: management, heterogeneous soil nutrient distribution, temperature and light changes, water and drought stress, and biotic stresses (Tardieu et al., 2017). Conventional phenotyping methods, such as manual measurement of plant traits, show low accuracy and throughput, which limits further advances in plant omics biology (Furbank and Tester, 2011; Minervini et al., 2015).

Plant phenomics is an emerging field involving the study of plant growth and performance using multidisciplinary approaches, including high-speed computing, computer vision, sensing, and robotics. Plant phenomics covers a wide range of plant traits, including cell structure, tissue and organ morphology, whole plant architecture, and canopy structure of a population. Therefore, plant phenomics contributes not only to the improvement of phenotyping throughput compared with conventional manual phenotyping methods (Fig. 5.1) but also to investigations in other research fields, such as functional genomics and responses to environmental change (Furbank and Tester, 2011; Ninomiya et al., 2019; Yang et al., 2020).

Fig. 5.1. Examples of plant phenotyping: (a) manual phenotyping and (b) high-throughput phenotyping using an unmanned aerial vehicle.

In this chapter, we briefly review the most relevant topics in plant phenomics. In addition, we discuss the limitations of the current phenotyping techniques and platforms and propose some research objectives for further development. Of note, we do not focus on specific topics in plant phenomics; therefore, further reading is recommended (Table 5.1). Instead, we hope that this chapter will inform general readers of plant phenomics and encourage researchers in related fields to accelerate advances in plant phenomics.

Table 5.1. List of reviews on specific topics in plant phenomics.

Topic | Reference
Root phenotyping | Atkinson et al. (2019); Tracy et al. (2020)
Shoot phenotyping | Gibbs et al. (2018)
Field phenotyping | Chawade et al. (2019); Jin et al. (2021b)
Indoor phenotyping | Xu et al. (2020)
Indoor and field phenotyping | Li et al. (2021)
Machine learning in plant phenotyping | Singh et al. (2016); Singh et al. (2018)

5.2 Basic Technologies for Plant Phenotyping

Conventional phenotyping involves qualitative or quantitative measurements of phenotypic traits by hand (Fig. 5.1a). These labor-intensive and time-consuming measurements often lack sufficient temporal resolution and are limited to small-scale experiments (van der Heijden et al., 2012; Atkinson et al., 2019). In addition, certain physiological and biochemical phenotypes are typically difficult to capture using these conventional phenotyping methods. Another disadvantage of conventional phenotyping is low objectivity: traits such as the severity of abiotic or biotic stresses and the timing of different events during plant development are often scored without clear criteria (Das Choudhury et al., 2019).

In this context, extensive efforts have been devoted to modernizing phenotyping, either partially or entirely, in plant phenomics. Advanced image sensing techniques, including those making use of X-ray, visual, near-infrared (NIR), and thermal infrared (TIR) sensors to form images of plants by detecting radiation reflected from, emitted by, or transmitted through them, have enabled the quantification of morphological, physiological, and temporal changes in plants using current phenotyping methods in a high-throughput manner (Cabrera-Bosquet et al., 2016; Rahaman et al., 2019). Recently, an increasing number of platforms that integrate automation, image sensing, and data analysis technologies have been deployed for high-throughput phenotyping purposes. In general, high-throughput phenotyping platforms can be classified as either plant-to-sensor (moving plants using conveyor transportation and phenotyping plants with well-calibrated sensors in a standardized and well-controlled environment) or sensor-to-plant (phenotyping plants in situ) (York, 2019). These platforms must be built cost-effectively because their designs closely depend on the experimental set-up and must frequently be customized to the specific aims of the experiment. For instance, it is difficult to use a platform with high-precision measurements on a single organ to assess field-grown plants.

Depending on the circumstances in which they are used, most systems can be classified as either indoor or field phenotyping platforms. Indoor platforms can provide well-controlled environments, enabling researchers to phenotype plants under carefully controlled conditions of nutrient supply, irrigation, light, temperature, and CO2 levels (Ma et al., 2019). Given the sessile nature of plants, field platforms are necessarily sensor-to-plant, allowing phenotyping in natural growing environments. Field conditions may never be fully reproduced or simulated inside a greenhouse or growth chamber, rendering field phenotyping an indispensable step to validate the discoveries of laboratory experiments (Virlet et al., 2017).

Although high-throughput systems can generate massive data, the processing and interpretation of these data represent the next major challenges. Researchers have used computer vision and pattern recognition technologies combined with data mining methods, including conventional statistics and advanced machine learning algorithms, to analyze image-based, non-invasive phenotypic data (Perez-Sanz et al., 2017; Singh et al., 2020). For instance, visualizing phenotypic data in real time and adjusting phenotyping conditions interactively can improve the accuracy of the acquired data. Furthermore, incorporating new algorithms can facilitate data interpretation in crop breeding and optimize agricultural management (Afonnikov et al., 2016). Overall, rapid and effective data processing and interpretation can significantly improve the phenotyping efficiency of different platforms.

5.3 Indoor Phenotyping

In indoor phenotyping, plants are grown and phenotyped under laboratory, growth chamber, or greenhouse conditions. An advantage of indoor phenotyping is that it can provide well-controlled environments. In addition, indoor phenotyping is more likely to permit precise phenotypic measurements, even at the organ and cellular levels. Growth chambers or greenhouses equipped with high-throughput and non-invasive imaging sensors are being increasingly used in various plant phenotyping tasks. For instance, plant growth has been estimated using 2D or 3D canopy structures of plants captured by visual imaging. Moreover, plant transpiration and photosynthetic efficiency have been visualized using thermal and fluorescence imaging. Both phenotypic and biochemical properties of plants can be measured simultaneously through NIR imaging.
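As a rough illustration of how plant growth might be estimated from a 2D top-view RGB image, the sketch below computes the fraction of pixels classified as plant. It is not a method described in this chapter: the use of the Excess Green index for segmentation, the threshold of 0.1, the function names, and the tiny example image are all assumptions chosen for illustration.

```python
def excess_green(r, g, b):
    """Excess Green index on chromatic coordinates: ExG = 2g' - r' - b'."""
    total = r + g + b
    if total == 0:  # pure black pixel carries no colour information
        return 0.0
    return (2 * g - r - b) / total


def canopy_cover(image, threshold=0.1):
    """Fraction of pixels classified as plant.

    image: list of rows, each row a list of (R, G, B) tuples.
    threshold: hypothetical ExG cut-off; it would need tuning per camera and lighting.
    """
    pixels = [px for row in image for px in row]
    if not pixels:
        raise ValueError("image contains no pixels")
    plant = sum(1 for r, g, b in pixels if excess_green(r, g, b) > threshold)
    return plant / len(pixels)


# Hypothetical 2 x 2 image: two leaf-like pixels, one soil-like, one grey.
image = [
    [(40, 160, 30), (120, 100, 80)],
    [(60, 140, 50), (200, 200, 200)],
]
print(canopy_cover(image))
```

Repeating such a measurement on images taken on successive days gives a simple growth curve, although a fixed threshold is sensitive to illumination, so controlled lighting or calibration would be needed in practice.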

5.3.1 Indoor phenotyping platforms 5.3.1.1 Laboratory or growth chamber Most cellular- and organ-level phenotyping is conducted using laboratory- or growth chamber- based platforms. These platforms include upright microscopy, confocal microscopy, laser ablation scanning, and micro- computed tomography (CT). Stomatal density (or index) is an indicator of water and gas exchange and the photosynthetic efficiency of the stomatal apparatus. Toda et al. (2021) developed a simple and affordable image-analysis platform for direct stomatal phenotyping based on microscopic images. Observation of fluorescent protein localization in transgenic plants using confocal microscopy enables the analysis of cell division and expansion at high temporal and spatial resolutions (Grandjean et al., 2004; Mohanty et al., 2009). In addition, confocal microscopy enables the non-destructive measurement of root growth (Wuyts et al., 2011). Laser ablation scanning has been used to analyze anatomical traits, such as root cortical aerenchyma, cell count, and size measurement (Galindo-Castañeda et al., 2018; Strock et al., 2019). Micro-CT has been used to image the cross-sections of maize stalks and

70

W. Guo and J. Zhao

vascular bundles, in which the number, area, and spatial distribution of different cell types could be quantified (du Plessis et al., 2016). In addition, X-ray micro-CT has been used to non-destructively elucidate the pattern of root branching in soil (Bao et al., 2014). 5.3.1.2 Greenhouse Over the past few decades, numerous plant phenotyping facilities have been developed, and most of them have been built to study individual plants under controlled or semi-controlled environments in a high-throughput manner (Fiorani and Schurr, 2013; Pieruschka and Schurr, 2019). In these facilities, various sensors, such as high-resolution cameras, light detection and range (LiDAR), X-ray, and magnetic resonance imaging (MRI) systems, and NIR sensors are installed to measure traits related to plant morphology, growth, and stress responses. Generally, in greenhouse phenotyping, individual plants are treated as assessment units, whereas organs (e.g., leaves, stems, flowers, and fruits) are segmented based on the research objective (see also Chapter 18 of this book). Leaves are the most accessible and frequently measured organs for determining plant growth status. The most commonly analyzed leaf parameters include single and total leaf area, leaf number, leaf thickness, stomatal size and number, and leaf distribution on the shoot (Gay and Hurd, 1975; Gary et al., 1991; Thompson et al., 2007). Light interception resulting from overlapping leaves affects the yield potential of crops, because leaf overlapping reduces the photoactive leaf area and airflow across leaf surfaces (Falster and Westoby, 2003). Stress identification, classification, and quantification are often performed using leaf images. To achieve this objective, different imaging techniques, such as visual, infrared, fluorescence, and NIR imaging (Chaerle and Van Der Straeten, 2000), have been employed. 
For instance, RGB-based images have been used to quantify the severity of foliar disease in soybean (Alves et al., 2021). Hyperspectral imaging has been used to measure the severity of water stress in tomatoes (Elvanidi et al., 2018) and aphid stress in cotton (Chen et al., 2018). Moreover, chlorophyll fluorescence imaging has been used to determine the extent of chilling injury in tomato seedlings (Dong et al., 2019). In addition, because fluorescence images reflect the efficiency of the photosynthetic center, they can be used to assess photosynthetic activity (Genty and Meyer, 1995).

The stem is often treated as a supporting organ shaping the canopy structure, which affects light interception for photosynthesis (Wells et al., 1993). In addition, stem diameter is an important indicator of lodging resistance (Aiken et al., 2003; Yamamoto et al., 2016; Robertson et al., 2017). A certain degree of photosynthesis also occurs in the stems (Henry et al., 2020). Traditionally, stem size and diameter were manually measured using a ruler and calipers, respectively. However, both visual and LiDAR imaging are now used to extract stem size precisely in a high-throughput manner (Rose et al., 2015; Narvaez et al., 2017; Paulus, 2019).

Root system architecture determines water uptake efficiency and capability (Zhao et al., 2017; Burridge et al., 2020). However, root phenotyping is generally more labor-intensive than shoot phenotyping, because roots are mostly hidden underground and inaccessible, which renders the measurement difficult. When plants are cultivated in soil, conventional root phenotyping is mainly performed destructively (i.e., digging out, washing, and scanning of the pieces, followed by image analysis). To enable non-invasive root phenotyping, several devices have been developed. For instance, a rhizotron is a piece of equipment or a facility with at least one transparent panel that allows for non-invasive root observation and measurement (Huck and Taylor, 1982); it allows continuous in situ root imaging. In addition, a minirhizotron is a transparent tube in which moving cameras are installed (Taylor et al., 1990), and root images can be obtained by inserting the tube into soil near the roots.
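As a small illustration of how a fluorescence image can be turned into a number for the leaf-level assessment described above, the sketch below computes the widely used ratio Fv/Fm = (Fm - F0) / Fm pixel by pixel. It is only a sketch under stated assumptions: the choice of Fv/Fm as the parameter, the function names, and the tiny 2 x 2 example images are illustrative and are not taken from the studies cited in this section.

```python
import math


def fv_fm(f0, fm):
    """Return Fv/Fm = (Fm - F0) / Fm for one pixel, or NaN if Fm is not positive."""
    if fm <= 0:
        return math.nan  # background or saturated-out pixel: no usable signal
    return (fm - f0) / fm


def fv_fm_image(f0_image, fm_image):
    """Apply fv_fm to every pixel of two equally sized images (lists of rows)."""
    return [
        [fv_fm(f0, fm) for f0, fm in zip(f0_row, fm_row)]
        for f0_row, fm_row in zip(f0_image, fm_image)
    ]


def mean_ignoring_nan(image):
    """Average all valid (non-NaN) pixels; NaN if no pixel is valid."""
    values = [v for row in image for v in row if not math.isnan(v)]
    return sum(values) / len(values) if values else math.nan


# Hypothetical dark-adapted minimal (F0) and maximal (Fm) fluorescence images.
f0_img = [[200.0, 250.0], [300.0, 0.0]]
fm_img = [[1000.0, 1000.0], [600.0, 0.0]]

ratio = fv_fm_image(f0_img, fm_img)
leaf_mean = mean_ignoring_nan(ratio)
print(ratio, leaf_mean)
```

Returning NaN for pixels without signal keeps background from dragging the leaf average down; in practice a proper leaf mask would replace this simple check.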

5.3.2 Limitations of the current indoor phenotyping platforms

In cell- and organ-level phenotyping, although high-end digital microscopy in conjunction with high-performance computers can capture and analyze thousands of images in a single day, tissue sampling and setting of samples onto the equipment remain low-throughput.

Plant Phenomics

The current non-invasive indoor root phenotyping systems have specific advantages and disadvantages. Although minirhizotrons enable non-invasive root phenotyping (Taylor et al., 1990), root images obtained using this system are limited to areas near the equipment, and the root growth environment is disturbed by the installation of the minirhizotron itself. MRI is more suitable than CT for imaging large root systems (e.g., at later developmental stages). In small root systems, however, CT provides more detailed information than MRI because of its higher spatial resolution (Metzner et al., 2015). Nevertheless, the most common drawback of the current indoor phenotyping systems is the cost of the facilities, which remain rather expensive for individual plant scientists, agronomists, breeders, and even small institutions. Therefore, affordable and easy-to-use facilities must be developed.

5.4 Field Phenotyping

Plant growth is affected by various environmental factors (Calvo et al., 2014). The environmental stressors of field-grown plants include soil characteristics, rainfall, temperature, weather, diseases, insect pests, and weeds. In addition, from the viewpoint of canopy structure, plants themselves create heterogeneity in their growth environment. For instance, competition for water, soil nutrients, and photosynthetically active radiation among plants in a canopy leads to differences in the distribution of these factors within the growth environment. Therefore, precise environmental control is impossible in field phenotyping. To address these problems, recent advances in field phenotyping include remote and/or proximal sensing systems that can measure plant phenotypes and collect environmental data (Araus and Cairns, 2014; Shakoor et al., 2017). Various sensors (e.g., hyperspectral, thermal, and LiDAR sensors) have been used in plant phenotyping. These sensors are installed in either ground-based or remote sensing systems. Ground-based sensing systems are installed in fields or on mobile platforms (e.g., robots and ground vehicles) and can observe target plants from distances of less than 1 m to a few meters. Thus, ground-based systems can achieve high spatial resolution. Remote sensing systems include UAVs (Maes and Steppe, 2019; Guo et al., 2021), blimps/balloons, and satellites. Another point to be considered is that the target traits in field phenotyping span different scales (Fig. 5.2). Therefore, platforms and/or data analysis systems specific to the target traits must often be developed, as discussed in the following sections.

Fig. 5.2. Levels in field phenotyping: (a) field level; (b) plot/canopy level; (c) plant level; and (d) organ level.

5.4.1 Field phenotyping platforms

5.4.1.1 Satellites

Because field trials are often large-scale in terms of the number of plants and area, satellite imaging is a powerful tool for evaluating crop performance through growth seasons (Tattaris et al., 2016). Satellites provide visual, multispectral, and hyperspectral images. Recently, commercially available daily satellite imagery has become popular in precision agriculture, from which abiotic or biotic stresses, uneven or variable-rate fertilization, and irrigation can be examined closely during crop development. Therefore, satellite imaging is preferred in large-scale studies. WorldView-3 captures panchromatic and multispectral images with spatial resolutions of 0.31 and 1.24 m ground sample distance (GSD), respectively (http://worldview3.digitalglobe.com/, accessed September 2022). Low-resolution images captured by Sentinel-2 (https://sentinel.esa.int/web/sentinel/missions/sentinel-2, accessed September 2022) and Landsat-8 (https://www.usgs.gov/landsat-missions/landsat-8, accessed September 2022) are freely available, whereas high-resolution images are only commercially available.

5.4.1.2 UAVs

UAVs fly at much lower altitudes than satellites and thus capture images at significantly higher spatial resolutions. The establishment of protocols and reduction in cost have made UAVs major phenotyping systems for efficient imaging and image analysis. Various sensors can be installed on UAVs (Zarco-Tejada et al., 2012; Adão et al., 2017). UAV RGB imaging has been used to detect nitrogen stress in rapeseed leaves (Zhang et al., 2020) and count sorghum heads (Guo et al., 2018; Ghosal et al., 2019). Thermal sensors on UAVs have been used to study the canopy temperature in maize (Zhang et al., 2019). Highly accurate crop height data could be extracted using multitemporal RGB and LiDAR sensors (Becirevic et al., 2019). Moreover, in potato, above-ground biomass and yield could be predicted using UAV-based RGB and hyperspectral images (Li et al., 2020).
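To make the link between flight altitude and spatial resolution concrete, the following sketch estimates the ground sample distance (GSD) of a nadir-pointing frame camera from simple similar-triangle geometry and compares it with the panchromatic GSD of WorldView-3 quoted above. The camera parameters are hypothetical, illustrative values and the function name is an assumption; lens distortion, terrain relief, and camera tilt are ignored.

```python
def ground_sample_distance(sensor_width_mm, focal_length_mm, image_width_px, altitude_m):
    """Approximate GSD in metres per pixel for a nadir-pointing frame camera."""
    if min(sensor_width_mm, focal_length_mm, image_width_px, altitude_m) <= 0:
        raise ValueError("all inputs must be positive")
    # Similar triangles: ground footprint width = sensor width * altitude / focal length.
    footprint_width_m = sensor_width_mm * altitude_m / focal_length_mm
    return footprint_width_m / image_width_px


# Hypothetical UAV camera and flight altitude (not taken from the chapter).
uav_gsd_m = ground_sample_distance(
    sensor_width_mm=13.2, focal_length_mm=8.8, image_width_px=5472, altitude_m=30.0
)

# Panchromatic GSD of WorldView-3 as quoted in the text.
WORLDVIEW3_PAN_GSD_M = 0.31

print(f"UAV GSD at 30 m: {uav_gsd_m * 100:.2f} cm/pixel")
print(f"One WorldView-3 pixel spans about {WORLDVIEW3_PAN_GSD_M / uav_gsd_m:.0f} UAV pixels")
```

Because GSD scales linearly with altitude in this model, halving the flight height halves the pixel size on the ground, at the price of covering a smaller area per image.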

5.4.1.3 Ground-based platforms

Ground-based platforms are placed proximal to the target plants to ensure the acquisition of high-resolution data. Field servers and high-resolution camera-based platforms have been used to monitor rice heading and flowering (Guo et al., 2015; Desai et al., 2019). Friedli et al. (2016) applied a pole-based terrestrial laser scanning platform to monitor canopy height changes in maize, soybean, and wheat.

In addition, handheld platforms have been used for phenotyping various traits. Specifically, handheld spectroradiometers have been used to assess the damage caused by various diseases, such as wheat powdery mildew and wheat stripe rust, or insects, such as aphids (Qiao et al., 2010). Handheld chlorophyll meters and chlorophyll fluorescence meters have been used to measure chlorophyll content as an indicator of drought stress in maize plants (Gekas et al., 2013; Roth et al., 2013). A handheld infrared thermometer has been used to estimate drought tolerance in maize (Zhang et al., 2019). A backpack-based platform is also useful for obtaining precise information on the canopy structure. A backpack-based LiDAR sensor, in conjunction with a fixed terrestrial 3D laser scanning system, enables precise measurement of various traits, such as plant height and spike number per area (Jin et al., 2021a; Zhu et al., 2021). Parallel phenotyping for multiple traits is preferable not only to reduce time and labor but also to obtain comprehensive and precise information on the overall health of plants. Mobile platforms, such as phenomobiles (Madec et al., 2017), are more advantageous than handheld ones, because the former can implement multiple sensors and can thus obtain information on multiple traits, including canopy height and, sometimes, even yield.

5.4.2 Limitations of the current field phenotyping platforms

From the perspective of hardware, plant phenotyping using the current satellite imagery technology is expensive. UAVs are limited by payload, power, policy, and other hardware constraints (Guo et al., 2021). Ground-based platforms move slowly, meaning that the environmental conditions may change by the time they move from one plot to another, and sometimes, these platforms may not be able to move because of excessively wet soil. Changing the environmental conditions is particularly problematic for some spectral (Virlet et al., 2017) and thermal sensing systems (Deery et al., 2014). Hyperspectral images at high spatial, spectral, and temporal resolutions are rare in practice because of the limitations of hyperspectral imaging sensors. Fixed ground-based platforms are less flexible in terms of experimental location, target crops, and farming practices (Kirchgessner et al., 2016; Virlet et al., 2017). The cost of fixed ground-based platforms can also increase when a sizable experimental site is to be covered (Araus et al., 2018; Mirnezami et al., 2021). However, handheld platforms are rather labor-intensive for field phenotyping. For root phenotyping in the field, non-invasive measurement systems need to be developed. Although ground-penetrating radar, electrical resistance tomography, and electromagnetic inductance are useful to achieve this objective, the current technology remains far from practical application, despite decades of research (Atkinson et al., 2019). From the software perspective, data transmission, storage, organization, and management remain major challenges (Neveu et al., 2019; Pommier et al., 2019). Moreover, image analysis of field phenotyping data requires further advances in image processing and machine and deep learning techniques (Costa et al., 2019; Demidchik et al., 2020; Yang et al., 2020). Additional details on popular deep learning techniques in plant phenotyping are provided in Chapters 16–18.


5.5 Conclusion and Future Perspectives

As an interdisciplinary research area, plant phenomics integrates plant science, data science, engineering, and other fields, aiming to contribute to other study areas as well as to solve future food production problems. Plant phenotyping infrastructure has been built rapidly worldwide. Indoor phenotyping facilities and techniques are well developed and have shown immense potential to accelerate research in other areas, such as breeding; nonetheless, field phenotyping is not without limitations, such as poor resolution, high hardware cost, and lack of algorithms that can manage complex and massive data. Moreover, linking phenomics and other omics data represents a current challenge that must be addressed through collaborative effort across multiple disciplines. We believe that it is important to encourage young researchers who are interested in plant phenomics to incorporate skills from multiple disciplines in order to develop advanced technical expertise in this area. Accelerated collaborations and communications with experts from diverse backgrounds will produce valuable ideas for sustainable advancement.

References

Adão, T., Hruška, J., Pádua, L., Bessa, J., Peres, E. et al. (2017) Hyperspectral imaging: a review on UAV-based sensors, data processing and applications for agriculture and forestry. Remote Sensing 9(11), 1110. DOI: 10.3390/rs9111110.
Afonnikov, D.A., Genaev, M.A., Doroshkov, A.V., Komyshev, E.G. and Pshenichnikova, T.A. (2016) Methods of high-throughput plant phenotyping for large-scale breeding and genetic experiments. Russian Journal of Genetics 52(7), 688–701. DOI: 10.1134/S1022795416070024.
Aiken, R.M., Nielsen, D.C. and Ahuja, L.R. (2003) Scaling effects of standing crop residues on the wind profile. Agronomy Journal 95(4), 1041–1046. DOI: 10.2134/agronj2003.1041.
Alves, K.S., Guimarães, M., Ascari, J.P., Queiroz, M.F., Alfenas, R.F. et al. (2021) RGB-based phenotyping of foliar disease severity under controlled conditions. Tropical Plant Pathology 47(1), 105–117. DOI: 10.1007/s40858-021-00448-y.
Araus, J.L. and Cairns, J.E. (2014) Field high-throughput phenotyping: the new crop breeding frontier. Trends in Plant Science 19(1), 52–61. DOI: 10.1016/j.tplants.2013.09.008.
Araus, J.L., Kefauver, S.C., Zaman-Allah, M., Olsen, M.S. and Cairns, J.E. (2018) Translating high-throughput phenotyping into genetic gain. Trends in Plant Science 23, 451–466. DOI: 10.1016/j.tplants.2018.02.001.
Atkinson, J.A., Pound, M.P., Bennett, M.J. and Wells, D.M. (2019) Uncovering the hidden half of plants using new advances in root phenotyping. Current Opinion in Biotechnology 55, 1–8. DOI: 10.1016/j.copbio.2018.06.002.


Bao, Y., Aggarwal, P., Robbins, N.E., Sturrock, C.J., Thompson, M.C. et al. (2014) Plant roots use a patterning mechanism to position lateral root branches toward available water. Proceedings of the National Academy of Sciences 111, 9319–9324. DOI: 10.1073/pnas.1400966111.
Becirevic, D., Klingbeil, L., Honecker, A., Schumann, H., Rascher, U. et al. (2019) On the derivation of crop heights from multitemporal UAV based imagery. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences IV-2/W5, 95–102. DOI: 10.5194/isprs-annals-IV-2-W5-95-2019.
Burridge, J.D., Rangarajan, H. and Lynch, J.P. (2020) Comparative phenomics of annual grain legume root architecture. Crop Science 60(5), 2574–2593. DOI: 10.1002/csc2.20241.
Cabrera-Bosquet, L., Fournier, C., Brichet, N., Welcker, C., Suard, B. et al. (2016) High-throughput estimation of incident light, light interception and radiation-use efficiency of thousands of plants in a phenotyping platform. The New Phytologist 212(1), 269–281. DOI: 10.1111/nph.14027.
Calvo, P., Nelson, L. and Kloepper, J.W. (2014) Agricultural uses of plant biostimulants. Plant and Soil 383(1–2), 3–41. DOI: 10.1007/s11104-014-2131-8.
Chaerle, L. and Van Der Straeten, D. (2000) Imaging techniques and the early detection of plant stress. Trends in Plant Science 5(11), 495–501. DOI: 10.1016/s1360-1385(00)01781-7.
Chawade, A., van Ham, J., Blomquist, H., Bagge, O., Alexandersson, E. et al. (2019) High-throughput field-phenotyping tools for plant breeding and precision agriculture. Agronomy 9(5), 258. DOI: 10.3390/agronomy9050258.
Chen, T., Zeng, R., Guo, W., Hou, X., Lan, Y. et al. (2018) Detection of stress in cotton (Gossypium hirsutum L.) caused by aphids using leaf level hyperspectral measurements. Sensors 18(9), 2798. DOI: 10.3390/s18092798.
Costa, J.M., Marques da Silva, J., Pinheiro, C., Barón, M., Mylona, P. et al. (2019) Opportunities and limitations of crop phenotyping in southern European countries. Frontiers in Plant Science 10, 1125. DOI: 10.3389/fpls.2019.01125.
Das Choudhury, S., Samal, A. and Awada, T. (2019) Leveraging image analysis for high-throughput plant phenotyping. Frontiers in Plant Science 10, 508. DOI: 10.3389/fpls.2019.00508.
Deery, D., Jimenez-Berni, J., Jones, H., Sirault, X. and Furbank, R. (2014) Proximal remote sensing buggies and potential applications for field-based phenotyping. Agronomy 4(3), 349–379. DOI: 10.3390/agronomy4030349.
Demidchik, V.V., Shashko, A.Y., Bandarenka, U.Y., Smolikova, G.N., Przhevalskaya, D.A. et al. (2020) Plant phenomics: fundamental bases, software and hardware platforms, and machine learning. Russian Journal of Plant Physiology 67(3), 397–412. DOI: 10.1134/S1021443720030061.
Desai, S.V., Balasubramanian, V.N., Fukatsu, T., Ninomiya, S. and Guo, W. (2019) Automatic estimation of heading date of paddy rice using deep learning. Plant Methods 15, 76. DOI: 10.1186/s13007-019-0457-1.
Dong, Z., Men, Y., Li, Z., Zou, Q., Ji, J. et al. (2019) Chlorophyll fluorescence imaging as a tool for analyzing the effects of chilling injury on tomato seedlings. Scientia Horticulturae 246, 490–497. DOI: 10.1016/j.scienta.2018.11.019.
du Plessis, A., le Roux, S.G. and Guelpa, A. (2016) The CT scanner facility at Stellenbosch university: an open access X-ray computed tomography laboratory. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 384, 42–49. DOI: 10.1016/j.nimb.2016.08.005.
Elvanidi, A., Katsoulas, N., Ferentinos, K.P., Bartzanas, T. and Kittas, C. (2018) Hyperspectral machine vision as a tool for water stress severity assessment in soilless tomato crop. Biosystems Engineering 165, 25–35. DOI: 10.1016/j.biosystemseng.2017.11.002.
Esposito, S., Carputo, D., Cardi, T. and Tripodi, P. (2020) Applications and trends of machine learning in genomics and phenomics for next-generation breeding. Plants 9, 34. DOI: 10.3390/plants9010034.
Falster, D.S. and Westoby, M. (2003) Plant height and evolutionary games. Trends in Ecology & Evolution 18(7), 337–343. DOI: 10.1016/S0169-5347(03)00061-2.
Fiorani, F. and Schurr, U. (2013) Future scenarios for plant phenotyping. Annual Review of Plant Biology 64, 267–291. DOI: 10.1146/annurev-arplant-050312-120137.
Friedli, M., Kirchgessner, N., Grieder, C., Liebisch, F., Mannale, M. et al. (2016) Terrestrial 3D laser scanning to track the increase in canopy height of both monocot and dicot crop species under field conditions. Plant Methods 12, 9. DOI: 10.1186/s13007-016-0109-7.
Furbank, R.T. and Tester, M. (2011) Phenomics: technologies to relieve the phenotyping bottleneck. Trends in Plant Science 16(12), 635–644. DOI: 10.1016/j.tplants.2011.09.005.


Galindo-Castañeda, T., Brown, K.M. and Lynch, J.P. (2018) Reduced root cortical burden improves growth and grain yield under low phosphorus availability in maize. Plant, Cell & Environment 41(7), 1579–1592. DOI: 10.1111/pce.13197.
Gary, C., Jones, J.W. and Longuenesse, J.J. (1991) Modelling daily changes in specific leaf area of tomato: the contribution of the leaf assimilate pool. Acta Horticulturae 328(328), 205–210. DOI: 10.17660/ActaHortic.1993.328.19.
Gay, A.P. and Hurd, R.G. (1975) The influence of light on stomatal density in the tomato. New Phytologist 75(1), 37–46. DOI: 10.1111/j.1469-8137.1975.tb01368.x.
Gekas, F., Pankou, C., Mylonas, I., Ninou, E., Sinapidou, E. et al. (2013) The use of chlorophyll meter readings for the selection of maize inbred lines under drought stress. In: Proceedings of World Academy of Science, Engineering and Technology, World Academy of Science, Engineering and Technology (WASET), p. 947. Available at: https://zenodo.org/record/1087546 (accessed 25 May 2022).
Genty, B. and Meyer, S. (1995) Quantitative mapping of leaf photosynthesis using chlorophyll fluorescence imaging. Functional Plant Biology 22(2), 277. DOI: 10.1071/PP9950277.
Ghosal, S., Zheng, B., Chapman, S.C., Potgieter, A.B., Jordan, D.R. et al. (2019) A weakly supervised deep learning framework for sorghum head detection and counting. Plant Phenomics 2019, 1525874. DOI: 10.34133/2019/1525874.
Gibbs, J.A., Pound, M., French, A.P., Wells, D.M., Murchie, E. et al. (2018) Plant phenotyping: an active vision cell for three-dimensional plant shoot reconstruction. Plant Physiology 178(2), 524–534. DOI: 10.1104/pp.18.00664.
Grandjean, O., Vernoux, T., Laufs, P., Belcram, K., Mizukami, Y. et al. (2004) In vivo analysis of cell division, cell growth, and differentiation at the shoot apical meristem in Arabidopsis. The Plant Cell 16(1), 74–87. DOI: 10.1105/tpc.017962.
Guo, W., Fukatsu, T. and Ninomiya, S. (2015) Automated characterization of flowering dynamics in rice using field-acquired time-series RGB images. Plant Methods 11, 7. DOI: 10.1186/s13007-015-0047-9.
Guo, W., Zheng, B., Potgieter, A.B., Diot, J., Watanabe, K. et al. (2018) Aerial imagery analysis – quantifying appearance and number of sorghum heads for applications in breeding and agronomy. Frontiers in Plant Science 9, 1544. DOI: 10.3389/fpls.2018.01544.
Guo, W., Carroll, M.E., Singh, A., Swetnam, T.L., Merchant, N. et al. (2021) UAS-based plant phenotyping for research and breeding applications. Plant Phenomics (Washington, D.C.) 2021, 9840192. DOI: 10.34133/2021/9840192.
Henry, R.J., Furtado, A. and Rangan, P. (2020) Pathways of photosynthesis in non-leaf tissues. Biology 9(12), E438. DOI: 10.3390/biology9120438.
Hickey, L.T., Hafeez, A.N., Robinson, H., Jackson, S.A., Leal-Bertioli, S.C.M. et al. (2019) Breeding crops to feed 10 billion. Nature Biotechnology 37(7), 744–754. DOI: 10.1038/s41587-019-0152-9.
Huck, M.G. and Taylor, H.M. (1982) The rhizotron as a tool for root research. Advances in Agronomy 35, 1–35. DOI: 10.1016/S0065-2113(08)60320-X.
Jin, S., Sun, X., Wu, F., Su, Y., Li, Y. et al. (2021a) Lidar sheds new light on plant phenomics for plant breeding and management: recent advances and future prospects. ISPRS Journal of Photogrammetry and Remote Sensing 171, 202–223. DOI: 10.1016/j.isprsjprs.2020.11.006.
Jin, X., Zarco-Tejada, P.J., Schmidhalter, U., Reynolds, M.P., Hawkesford, M.J. et al. (2021b) High-throughput estimation of crop traits: a review of ground and aerial phenotyping platforms. IEEE Geoscience and Remote Sensing Magazine 9(1), 200–231. DOI: 10.1109/MGRS.2020.2998816.
Kirchgessner, N., Liebisch, F., Yu, K., Pfeifer, J., Friedli, M. et al. (2016) The ETH field phenotyping platform FIP: a cable-suspended multi-sensor system. Functional Plant Biology 44(1), 154–168. DOI: 10.1071/FP16165.
Li, B., Xu, X., Zhang, L., Han, J., Bian, C. et al. (2020) Above-ground biomass estimation and yield prediction in potato by using UAV-based RGB and hyperspectral imaging. ISPRS Journal of Photogrammetry and Remote Sensing 162, 161–172. DOI: 10.1016/j.isprsjprs.2020.02.013.
Li, D., Quan, C., Song, Z., Li, X., Yu, G. et al. (2021) High-throughput plant phenotyping platform (HT3P) as a novel tool for estimating agronomic traits from the lab to the field. Frontiers in Bioengineering and Biotechnology 8, 623705. DOI: 10.3389/fbioe.2020.623705.
Li, H., Rasheed, A., Hickey, L.T. and He, Z. (2018) Fast-forwarding genetic gain. Trends in Plant Science 23(3), 184–186. DOI: 10.1016/j.tplants.2018.01.007.
Ma, D., Carpenter, N., Amatya, S., Maki, H., Wang, L. et al. (2019) Removal of greenhouse microclimate heterogeneity with conveyor system for indoor phenotyping. Computers and Electronics in Agriculture 166, 104979. DOI: 10.1016/j.compag.2019.104979.


Madec, S., Baret, F., De Solan, B., Thomas, S., Dutartre, D. et al. (2017) High-throughput phenotyping of plant height: comparing unmanned aerial vehicles and ground lidar estimates. Frontiers in Plant Science 8, 2002. DOI: 10.3389/fpls.2017.02002.
Maes, W.H. and Steppe, K. (2019) Perspectives for remote sensing with unmanned aerial vehicles in precision agriculture. Trends in Plant Science 24, 152–164. DOI: 10.1016/j.tplants.2018.11.007.
Martin, L., Fei, Z., Giovannoni, J. and Rose, J.K.C. (2013) Catalyzing plant science research with RNA-seq. Frontiers in Plant Science 4, 66. DOI: 10.3389/fpls.2013.00066.
Metzner, R., Eggert, A., van Dusschoten, D., Pflugfelder, D., Gerth, S. et al. (2015) Direct comparison of MRI and X-ray CT technologies for 3D imaging of root systems in soil: potential and challenges for root trait quantification. Plant Methods 11, 17. DOI: 10.1186/s13007-015-0060-z.
Minervini, M., Scharr, H. and Tsaftaris, S.A. (2015) Image analysis: the new bottleneck in plant phenotyping [applications corner]. IEEE Signal Processing Magazine 32(4), 126–131. DOI: 10.1109/MSP.2015.2405111.
Mirnezami, S.V., Srinivasan, S., Zhou, Y., Schnable, P.S. and Ganapathysubramanian, B. (2021) Detection of the progression of anthesis in field-grown maize tassels: a case study. Plant Phenomics 2021, 4238701. DOI: 10.34133/2021/4238701.
Mohanty, A., Luo, A., DeBlasio, S., Ling, X., Yang, Y. et al. (2009) Advancing cell biology and functional genomics in maize using fluorescent protein-tagged lines. Plant Physiology 149, 601–605. DOI: 10.1104/pp.108.130146.
Narvaez, F.Y., Reina, G., Torres-Torriti, M., Kantor, G. and Cheein, F.A. (2017) A survey of ranging and imaging techniques for precision agriculture phenotyping. IEEE/ASME Transactions on Mechatronics 22(6), 2428–2439. DOI: 10.1109/TMECH.2017.2760866.
Neveu, P., Tireau, A., Hilgert, N., Nègre, V., Mineau-Cesari, J. et al. (2019) Dealing with multi-source and multi-scale information in plant phenomics: the ontology-driven Phenotyping Hybrid Information System. The New Phytologist 221(1), 588–601. DOI: 10.1111/nph.15385.
Ninomiya, S., Baret, F. and Cheng, Z.-M.M. (2019) Plant phenomics: emerging transdisciplinary science. Plant Phenomics 2019, 2765120. DOI: 10.34133/2019/2765120.
Paulus, S. (2019) Measuring crops in 3D: using geometry for plant phenotyping. Plant Methods 15, 103. DOI: 10.1186/s13007-019-0490-0.
Perez-Sanz, F., Navarro, P.J. and Egea-Cortines, M. (2017) Plant phenomics: an overview of image acquisition technologies and image data analysis algorithms. GigaScience 6(11), 1–18. DOI: 10.1093/gigascience/gix092.
Pieruschka, R. and Schurr, U. (2019) Plant phenotyping: past, present, and future. Plant Phenomics 2019, 7507131. DOI: 10.34133/2019/7507131.
Pommier, C., Michotey, C., Cornut, G., Roumet, P., Duchêne, E. et al. (2019) Applying FAIR principles to plant phenotypic data management in GnpIS. Plant Phenomics 2019, 1671403. DOI: 10.34133/2019/1671403.
Qiao, H., Xia, B., Ma, X., Cheng, D. and Zhou, Y. (2010) Identification of damage by diseases and insect pests in winter wheat. Journal of Triticeae Crops 30, 770–774.
Rahaman, M.M., Ahsan, M.A. and Chen, M. (2019) Data-mining techniques for image-based plant phenotypic traits identification and classification. Scientific Reports 9, 19526. DOI: 10.1038/s41598-019-55609-6.
Robertson, D.J., Julias, M., Lee, S.Y. and Cook, D.D. (2017) Maize stalk lodging: morphological determinants of stalk strength. Crop Science 57(2), 926–934. DOI: 10.2135/cropsci2016.07.0569.
Rose, J.C., Paulus, S. and Kuhlmann, H. (2015) Accuracy analysis of a multi-view stereo approach for phenotyping of tomato plants at the organ level. Sensors (Basel, Switzerland) 15(5), 9651–9665. DOI: 10.3390/s150509651.
Roth, J.A., Ciampitti, I.A. and Vyn, T.J. (2013) Physiological evaluations of recent drought-tolerant maize hybrids at varying stress levels. Agronomy Journal 105(4), 1129–1141. DOI: 10.2134/agronj2013.0066.
Shakoor, N., Lee, S. and Mockler, T.C. (2017) High throughput phenotyping to accelerate crop breeding and monitoring of diseases in the field. Current Opinion in Plant Biology 38, 184–192. DOI: 10.1016/j.pbi.2017.05.006.
Singh, A., Ganapathysubramanian, B., Singh, A.K. and Sarkar, S. (2016) Machine learning for high-throughput stress phenotyping in plants. Trends in Plant Science 21(2), 110–124. DOI: 10.1016/j.tplants.2015.10.015.


Singh, A., Jones, S., Ganapathysubramanian, B., Sarkar, S., Mueller, D. et al. (2020) Challenges and opportunities in machine-augmented plant stress phenotyping. Trends in Plant Science 26, 53–69. DOI: 10.1016/j.tplants.2020.07.010.
Singh, A.K., Ganapathysubramanian, B., Sarkar, S. and Singh, A. (2018) Deep learning for plant stress phenotyping: trends and future perspectives. Trends in Plant Science 23(10), 883–898. DOI: 10.1016/j.tplants.2018.07.004.
Strock, C.F., Burridge, J., Massas, A.S.F., Beaver, J., Beebe, S. et al. (2019) Seedling root architecture and its relationship with seed yield across diverse environments in Phaseolus vulgaris. Field Crops Research 237, 53–64. DOI: 10.1016/j.fcr.2019.04.012.
Sukumaran, S., Dreisigacker, S., Lopes, M., Chavez, P. and Reynolds, M.P. (2015) Genome-wide association study for grain yield and related traits in an elite spring wheat population grown in temperate irrigated environments. Theoretical and Applied Genetics 128, 353–363. DOI: 10.1007/s00122-014-2435-3.
Sukumaran, S., Reynolds, M.P. and Sansaloni, C. (2018) Genome-wide association analyses identify QTL hotspots for yield and component traits in durum wheat grown under yield potential, drought, and heat stress environments. Frontiers in Plant Science 9, 81. DOI: 10.3389/fpls.2018.00081.
Tardieu, F., Cabrera-Bosquet, L., Pridmore, T. and Bennett, M. (2017) Plant phenomics, from sensors to knowledge. Current Biology 27, R770–R783. DOI: 10.1016/j.cub.2017.05.055.
Tattaris, M., Reynolds, M.P. and Chapman, S.C. (2016) A direct comparison of remote sensing approaches for high-throughput phenotyping in plant breeding. Frontiers in Plant Science 7, 1131. DOI: 10.3389/fpls.2016.01131.
Taylor, H.M., Upchurch, D.R. and McMichael, B.L. (1990) Applications and limitations of rhizotrons and minirhizotrons for root studies. Plant and Soil 129(1), 29–35. DOI: 10.1007/BF00011688.
Thompson, A.J., Andrews, J., Mulholland, B.J., McKee, J.M.T., Hilton, H.W. et al. (2007) Overproduction of abscisic acid in tomato increases transpiration efficiency and root hydraulic conductivity and influences leaf expansion. Plant Physiology 143, 1905–1917. DOI: 10.1104/pp.106.093559.
Toda, Y., Tameshige, T., Tomiyama, M., Kinoshita, T. and Shimizu, K.K. (2021) An affordable image-analysis platform to accelerate stomatal phenotyping during microscopic observation. Frontiers in Plant Science 12, 715309. DOI: 10.3389/fpls.2021.715309.
Tracy, S.R., Nagel, K.A., Postma, J.A., Fassbender, H., Wasson, A. et al. (2020) Crop improvement from phenotyping roots: highlights reveal expanding opportunities. Trends in Plant Science 25, 105–118. DOI: 10.1016/j.tplants.2019.10.015.
van der Heijden, G., Song, Y., Horgan, G., Polder, G., Dieleman, A. et al. (2012) SPICY: towards automated phenotyping of large pepper plants in the greenhouse. Functional Plant Biology 39, 870–877. DOI: 10.1071/FP12019.
Virlet, N., Sabermanesh, K., Sadeghi-Tehran, P. and Hawkesford, M.J. (2017) Field scanalyzer: an automated robotic field phenotyping platform for detailed crop monitoring. Functional Plant Biology 44, 143–153. DOI: 10.1071/FP16163.
Wells, R., Burton, J.W. and Kilen, T.C. (1993) Soybean growth and light interception: response to differing leaf and stem morphology. Crop Science 33(3), 520–524. DOI: 10.2135/cropsci1993.0011183X003300030020x.
Wuyts, N., Bengough, A.G., Roberts, T.J., Du, C., Bransby, M.F. et al. (2011) Automated motion estimation of root responses to sucrose in two Arabidopsis thaliana genotypes using confocal microscopy. Planta 234, 769–784. DOI: 10.1007/s00425-011-1435-7.
Xu, L., Chen, J., Ding, G., Lu, W., Ding, Y. et al. (2020) Indoor phenotyping platforms and associated trait measurement: progress and prospects. Smart Agriculture 2, 23–42. DOI: 10.12133/j.smartag.2020.2.1.202003-SA002.
Yamamoto, K., Guo, W. and Ninomiya, S. (2016) Node detection and internode length estimation of tomato seedlings based on image analysis and machine learning. Sensors (Basel, Switzerland) 16(7), 1044. DOI: 10.3390/s16071044.
Yang, W., Feng, H., Zhang, X., Zhang, J., Doonan, J.H. et al. (2020) Crop phenomics and high-throughput phenotyping: past decades, current challenges, and future perspectives. Molecular Plant 13(2), 187–214. DOI: 10.1016/j.molp.2020.01.008.
York, L.M. (2019) Functional phenomics: an emerging field integrating high-throughput phenotyping, physiology, and bioinformatics. Journal of Experimental Botany 70(2), 379–386. DOI: 10.1093/jxb/ery379.

78

W. Guo and J. Zhao

Zarco-Tejada, P.J., González-Dugo, V. and Berni, J.A.J. (2012) Fluorescence, temperature and narrow- band indices acquired from a UAV platform for water stress detection using a micro-hyperspectral imager and a thermal camera. Remote Sensing of Environment 117, 322–337. DOI: 10.1016/j. rse.2011.10.007. Zhang, L., Niu, Y., Zhang, H., Han, W., Li, G. et al. (2019) Maize canopy temperature extracted from UAV thermal and RGB imagery and its application in water stress monitoring. Frontiers in Plant Science 10, 1270. DOI: 10.3389/fpls.2019.01270. Zhao, J., Bodner, G., Rewald, B., Leitner, D., Nagel, K.A. et al. (2017) Root architecture simulation improves the inference from seedling root phenotyping towards mature root systems. Journal of Experimental Botany 68, 965–982. DOI: 10.1093/jxb/erw494. Zhu, Y., Sun, G., Ding, G., Zhou, J., Wen, M. et al. (2021) Large-scale field phenotyping using backpack LiDAR and CropQuant-3D to measure structural variation in wheat. Plant Physiology 187, 716–738. DOI: 10.1093/plphys/kiab324. Zhang, J., Xie, T., Yang, C., Song, H., Jiang, Z. et al. (2020) Segmenting purple rapeseed leaves in the field from UAV RGB imagery using deep learning as an auxiliary means for nitrogen stress detection. Remote Sensing 12(9), 1403. DOI: 10.3390/rs12091403.

6

Plant Non-coding Transcriptomics: Overview of lncRNAs in Abiotic Stress Responses

Akihiro Matsui1,2* and Motoaki Seki1,2,3*
1Plant Genomic Network Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Japan; 2Plant Epigenome Regulation Laboratory, RIKEN Cluster for Pioneering Research, Wako, Japan; 3Kihara Institute for Biological Research, Yokohama City University, Yokohama, Japan

Abstract
Extensive transcriptome studies of over 40 plant species have shown that very broad regions of plant genomes give rise to noncoding RNAs (ncRNAs). ncRNAs are known to regulate gene expression and are involved in DNA methylation, DNA structural modification, histone modification, RNA degradation, RNA masking, and translational regulation. Through these roles, ncRNAs influence plant growth and responses to biotic and abiotic stresses, and regulate heritable properties of gene expression. It is therefore important to elucidate the regulatory mechanisms of ncRNAs in plants, including their expression and their interactions with targets such as RNAs, DNAs, and proteins. In this review, recent research on the biogenesis and functions of ncRNAs is summarized and discussed.

6.1 Introduction

Plant genome-wide transcriptome analyses have revealed many types of noncoding RNAs (ncRNAs) that are involved in biological regulation, in addition to the housekeeping ncRNAs such as ribosomal (rRNA), transfer (tRNA), small nucleolar (snoRNA), and small nuclear (snRNA) RNAs (Yamada et al., 2003; Matsui et al., 2008; Liu et al., 2012; H. Wang et al., 2014). Transcription of plant ncRNAs has been recognized for about 30 years, but their biological functions and mechanisms of action are still not fully understood. The roles of plant ncRNAs, their common mechanisms, and their complexity have been highlighted only recently (Yu et al., 2019). For example, ncRNAs have been shown to function in DNA methylation, DNA structural modification, histone modification, RNA degradation, RNA masking, and translational regulation (Table 6.1). Although ncRNAs are currently classified according to their biogenesis, ncRNAs in the same category have diverse molecular functions and can form interconnected networks of RNA metabolism. To provide an overview of the complex metabolism and functions of plant ncRNAs, the well-characterized biogenesis pathways and molecular functions of ncRNAs are summarized and discussed in this review.

*Corresponding authors: akihiro.matsui@riken.jp and motoaki.seki@riken.jp © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0006


Table 6.1. Non-coding RNAs (ncRNAs) with known functions. Papers of particular interest that describe the biogenesis and molecular functions of the ncRNAs are summarized.

| Type of ncRNAs | RNA polymerase | Significant profile | Molecular function | Reference |
|---|---|---|---|---|
| miRNA | Pol II | Precursor has hairpin loop structures | RNA cleavage, translational regulation, DNA methylation | Qi et al., 2005; Yu et al., 2005; Voinnet, 2009; Zhang et al., 2015 |
| ta-siRNA | Pol II, RDR6 | "Head-to-tail" phased pattern | RNA cleavage in trans-acting | Allen et al., 2005; Ronemus et al., 2006 |
| pha-siRNA | Pol II | "Head-to-tail" phased pattern | RNA cleavage in cis-acting | Johnson et al., 2009; Song et al., 2012; Källman et al., 2013 |
| Pol IV- and Pol V-derived siRNA | Pol IV, Pol V, RDR2 | Regional wide range expression | DNA methylation | Li et al., 2006a; Zheng et al., 2007; Wierzbicki et al., 2008; Zheng et al., 2009; Matzke and Mosher, 2014 |
| nat-siRNA | Pol II | mRNA and/or ncRNA | RNA cleavage | Ron et al., 2010; Yuan et al., 2015 |
| cis-NAT | Pol II | mRNA and/or ncRNA | miRNA site masking, RNA splicing site masking | Bardou et al., 2014; Cho and Paszkowski, 2017 |
| cis-NAT | Pol II | Bind to sites around translational start | Translational regulation | Bazin et al., 2017; Deforges et al., 2019 |
| Stress-induced NAT | Pol II, RDRs | dsRNA derived from mRNA without 5'-CAP and 3'-PolyA | RNA degradation | Mourrain et al., 2000; Gazzani et al., 2004; Luo and Chen, 2007; Matsui et al., 2017 |
| COLDAIR | Pol II | lncRNA with 5'-CAP and without 3'-PolyA | Chromatin modifier recruitment | Pien et al., 2008; Buzas et al., 2011; Heo and Sung, 2011 |
| COOLAIR | Pol II | mRNA with 3'-PolyA | Chromatin modifier recruitment by RNA splicing | Liu et al., 2010; Csorba et al., 2014; Marquardt et al., 2014; Fang et al., 2020 |
| COLDWRAP | Pol II | lncRNA with 5'-CAP and without 3'-PolyA | Formation of repressive chromatin structure | Kim and Sung, 2017 |
| ENOD40 and ASCO | Pol II | – | Regulation of RNA splicing | Yang et al., 1993; Campalans et al., 2004; Bardou et al., 2014 |
| APOLO and HID1 | Pol II, Pol V | – | Formation of local chromatin loop structure and regulation of gene expression | Ariel et al., 2014; P. Wang et al., 2014b |
| ceRNA/RNA sponge/RNA decoy | Pol II | – | Inhibition of miRNA cleavage | Shin et al., 2006; Franco-Zorrilla et al., 2007 |
| circRNA | Pol II | Closed loop ncRNA | Inhibition of miRNA cleavage, promotion of alternative splicing | P. Wang et al., 2014; Sun et al., 2016; Conn et al., 2017; Frydrych Capelari et al., 2019 |
| Pol III-derived ncRNA | Pol III | lncRNA with USE-TATA type promoter | – | Wu et al., 2012; Li et al., 2016 |
| Viroid | Pol II, rolling circle replication | lncRNA | RNA cleavage, DNA methylation | Wassenegger et al., 1994; Ding, 2009; Navarro et al., 2012 |


6.2 History of ncRNA Research

The generic term ncRNA describes transcripts that have no open reading frame, or that have an open reading frame but do not appear to encode a functional protein (Ponting et al., 2009). Genome-wide transcriptome analyses have identified ncRNAs that are transcribed from different regions of the genome. ncRNAs < 40 nt long are classified as small RNAs, which include microRNAs (miRNAs) and small interfering RNAs (siRNAs), and ncRNAs > 200 nt long are classified as long noncoding RNAs (lncRNAs) (Ponting et al., 2009). Human lncRNAs were first reported in 1992 (Lukiw et al., 1992), and a functional miRNA, lin-4, was reported in the nematode Caenorhabditis elegans in 1993 (Lee et al., 1993). The development of large-scale transcriptome technologies, such as tiling arrays and next-generation sequencing (NGS), has accelerated the identification of ncRNAs and promoted investigations into their regulation. Approximately 75% of the human genome and about 85% of the yeast genome are transcribed, whereas protein-coding regions account for 73% of the yeast genome and only 2% of the human genome (David et al., 2006; Alexander et al., 2010; Djebali et al., 2012). In plants, the expression of the lncRNA early nodulin 40 (ENOD40) was first reported in Glycine max during nodule development (Yang et al., 1993). Subsequently, other biologically important plant lncRNAs, including MsENOD40, TPSI1, and OsPI1, were reported (Crespi et al., 1994; Liu et al., 1997; Wasaki et al., 2003). These reports indicate that ncRNAs play important biological roles in a wide range of organisms. The first analysis of an Arabidopsis tiling array detected 5817 novel transcriptional sequences (Yamada et al., 2003). A similar transcriptome profile was reported in rice (Li et al., 2006b).
Subsequent comprehensive analyses of the Arabidopsis thaliana transcriptome identified 40,000 lncRNAs, including more than 30,000 natural antisense transcripts (NATs) and more than 6000 long intergenic ncRNAs (lincRNAs) (Matsui et al., 2008; Liu et al., 2012; H. Wang et al., 2014). NGS results have indicated that 80–90% of the Arabidopsis genome is transcribed at some developmental stages and under some conditions (Ariel et al., 2015). Long ncRNAs have now been detected in more than 40 plant species, and comprehensive information about plant- and/or type-specific lncRNAs is available in several databases, as described in a previous review (Bhatia et al., 2017). Datasets of small RNAs are also available from about 2000 Arabidopsis libraries and 1300 libraries from more than 40 plant species (Feng et al., 2020; Lunardon et al., 2020). However, many of the biological roles played by ncRNAs in the life cycles of plants are still unknown.

6.3 Classification of ncRNAs

The molecular functions of ncRNAs depend not only on their biogenesis, but also on their positional relationship to nearby protein-coding genes and their interactions with proteins. Therefore, the classification of lncRNAs is often overlapping, inconsistent, and problematic. In this review, the first classification is based on size, which is a commonly used classification scheme. Small ncRNAs < 40 nt long include miRNAs and siRNAs; lncRNAs are > 200 nt long (Ponting et al., 2009); and ncRNAs 50–300 nt long are classified as intermediate-sized ncRNAs in Arabidopsis and rice (Wang et al., 2014a). Intermediate-sized ncRNAs are associated with positive histone marks, such as H3K4me3 and H3K9ac, but not generally with the negative mark H3K27me3 (Wang et al., 2014a). Their molecular function remains unclear. The second classification is based on the polyadenylation state of the lncRNAs. The lack of polyadenylation is thought to reflect differences in the transcribing RNA polymerase and in the polyadenylation process. Non-polyadenylated lncRNAs are often shorter than polyadenylated lncRNAs and are more readily induced by abiotic stresses (Matsui et al., 2008; Wang et al., 2014a). The third classification is based on the positional relationship of the lncRNA to adjacent protein-coding genes. The categories are: (i) long intergenic ncRNAs (lincRNAs), which are transcribed from the regions between two protein-coding genes; (ii) intronic ncRNAs, which are transcribed from introns of protein-coding genes; (iii) long sense ncRNAs, which are transcribed from the sense strand of protein-coding genes and overlap with exons; (iv) long antisense ncRNAs, which are transcribed from the antisense strand of protein-coding genes and overlap with exons; and (v) promoter-associated sense RNAs and upstream antisense RNAs, which are transcribed from regions close to the promoters of protein-coding genes (Ponting et al., 2009). Recent studies have shown that the location of lncRNAs is associated to some extent with their molecular functions.
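
The size-based and position-based schemes described in this section can be expressed as simple rules. The following Python sketch is illustrative only and is not taken from any published pipeline: the names Feature, size_class, and positional_class are hypothetical, the length thresholds are those quoted above, and the positional test considers a single protein-coding gene and omits the promoter-associated transcripts of category (v).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic interval (hypothetical helper type for this sketch)."""
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def size_class(length_nt):
    """Return every size label that applies to a transcript length.

    The intermediate (50-300 nt) and long (> 200 nt) ranges overlap,
    so a transcript may receive two labels or none.
    """
    labels = []
    if length_nt < 40:
        labels.append("small RNA")
    if 50 <= length_nt <= 300:
        labels.append("intermediate-sized ncRNA")
    if length_nt > 200:
        labels.append("lncRNA")
    return labels


def overlaps(a, b):
    """True if two intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def positional_class(ncrna, gene, exons):
    """Classify an ncRNA relative to ONE protein-coding gene (categories i-iv)."""
    if not overlaps(ncrna, gene):
        return "lincRNA"              # intergenic with respect to this gene
    if not any(overlaps(ncrna, exon) for exon in exons):
        return "intronic ncRNA"       # inside the gene but outside all exons
    if ncrna.strand == gene.strand:
        return "long sense ncRNA"
    return "long antisense ncRNA"


# Hypothetical gene with two exons on the plus strand.
gene = Feature(1000, 3000, "+")
exons = [Feature(1000, 1500, "+"), Feature(2500, 3000, "+")]
print(size_class(250))
print(positional_class(Feature(1400, 1900, "-"), gene, exons))
```

Because the intermediate and long size ranges overlap, the sketch returns a list of labels rather than a single class, which mirrors the overlapping nature of the classification noted above.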

6.4 Molecular Functions of ncRNA

The many possible combinations of transcribed regions, transcriptional modes, RNA maturation processes, and molecular functions involving targets such as RNAs, DNAs, and proteins complicate the classification of lncRNAs. In this section, typical examples from recent research on the relationships between ncRNAs and their biological and molecular functions are summarized to illustrate the molecular functions of lncRNAs (Table 6.1).

6.4.1 miRNAs

miRNAs play important roles in most biological processes in plants and animals. Primary transcripts (pri-miRNAs) are transcribed as lncRNA precursors by DNA-dependent RNA polymerase II (Pol II), in a manner similar to the transcription of protein-coding genes, and form hairpin-loop structures. The pri-miRNAs are processed by Dicer-like 1 (DCL1) into miRNA/miRNA∗ duplexes, mainly 21 nt long, which are derived from the double-stranded stem region of the pri-miRNAs (Voinnet, 2009). The miRNA/miRNA∗ duplex has 2 nt 3′-overhangs on both strands, and the 2′-OH position at the 3′-end of the miRNA is methylated by the small RNA methyltransferase HUA Enhancer 1 (HEN1) (Yu et al., 2005). The miRNA strand of the duplex binds to the Argonaute (AGO) protein in the RNA-induced silencing complex (RISC), and the miRNA∗ is ejected and degraded, except in some special cases (Mi et al., 2008). The bound miRNAs guide the RISC to target genes in trans by base pairing and predominantly mediate post-transcriptional gene silencing by cleavage and/or translational inhibition of the target mRNAs. Among the ten Arabidopsis AGO proteins, AGO1 was found to play the major role in miRNA-mediated repression of target genes (Zhang et al., 2015). The miRNAs in the RISC recognize their targets by pairing with a sequence that is nearly perfectly complementary to the 5′ region of the miRNA and incompletely matched to the 3′ region (Schwab et al., 2005; Axtell and Meyers, 2018). Finally, the RISC cleaves the target mRNA through the endonuclease activity of the P-element induced wimpy testes (PIWI) domain of the AGO protein (Qi et al., 2005). miRNAs also regulate genes by translational inhibition, which can accompany mRNA cleavage. For example, Apetala 2 and squamosa promoter binding protein-like 3 (SPL3) are regulated translationally by miR172 and miR156/7, respectively (Chen, 2004; Gandikota et al., 2007). miRNAs are important regulators of large sets of target genes for adapting to environmental changes, for triggering the immune response, and for promoting organ development (Li et al., 2012; Rogers and Chen, 2013). Importantly, miRNAs also exert negative feedback on the homeostasis of global miRNA metabolism by cleaving the mRNAs of components of the miRNA pathway. miR162, miR863-3p, and miR168 guide cleavage of the mRNAs of DCL1, of ARLPKs and SERRATE, and of AGO1, respectively (Xie et al., 2003; Vaucheret et al., 2006; Niu et al., 2016). The miRBase and PmiREN databases contain the sequences and annotations of miRNAs from > 80 plant species that were obtained by NGS (Kozomara et al., 2019; Guo et al., 2020). These data have been used to name the miRNAs (Axtell and Meyers, 2018). miRNAs are deeply involved in regulating the expression of genes involved in various biological processes and signaling networks.
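
The pairing rule described above, strict complementarity toward the 5′ region of the miRNA and more tolerance toward the 3′ region, can be illustrated with a small Python sketch. This is a simplified illustration, not a published target-prediction algorithm: the function names, the boundary of the 5′ region (five_prime_len), and the mismatch limits are assumptions chosen for the example, G:U wobble pairs are counted as mismatches, and the sequence used is an arbitrary toy sequence rather than a real miRNA.

```python
# Watson-Crick partners for RNA bases (G:U wobble is treated as a mismatch here).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the perfectly complementary site, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def mismatch_positions(mirna, target_site):
    """List the 1-based miRNA positions (from the miRNA 5' end) that do not pair.

    Both sequences are written 5'->3'; pairing is antiparallel, so the
    target site is reversed before the base-by-base comparison.
    """
    if len(mirna) != len(target_site):
        raise ValueError("miRNA and target site must have equal length")
    paired = target_site[::-1]
    return [i + 1 for i, (m, t) in enumerate(zip(mirna, paired))
            if COMPLEMENT[m] != t]


def is_candidate_target(mirna, target_site, five_prime_len=12,
                        max_5p_mismatches=1, max_3p_mismatches=3):
    """Apply illustrative (assumed) mismatch limits to the 5' and 3' regions."""
    mismatches = mismatch_positions(mirna, target_site)
    n_5p = sum(1 for pos in mismatches if pos <= five_prime_len)
    n_3p = len(mismatches) - n_5p
    return n_5p <= max_5p_mismatches and n_3p <= max_3p_mismatches


toy_mirna = "ACGUACGUACGUACGUACGUA"      # arbitrary 21 nt toy sequence
perfect_site = reverse_complement(toy_mirna)
print(mismatch_positions(toy_mirna, perfect_site))   # []
print(is_candidate_target(toy_mirna, perfect_site))  # True
```

Reporting mismatches by miRNA position, rather than as a single count, makes it easy to weight the 5′ and 3′ regions differently, which is the point of the rule.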

6.4.2 Trans-acting siRNAs (ta-siRNAs) and phased siRNAs (pha-siRNAs)

siRNAs are 21–25 nt RNAs that are produced when long double-stranded RNAs that trigger RNA interference are cleaved by Dicers, and they are classified into several types. Each type of siRNA is described below. The trans-acting siRNAs (ta-siRNAs) are specialized siRNAs that are found in plants (Allen et al., 2005; Ronemus et al., 2006). Production of ta-siRNAs is triggered by the cleavage of lncRNAs by miR173, miR390, and miR828. The cleaved 3′ fragments are stabilized by suppressor of gene silencing 3 (SGS3), and antisense-strand RNAs are synthesized by the polymerase RDR6 to form double-stranded RNAs (dsRNAs) (Allen et al., 2005). The dsRNAs are processed into siRNAs by DCL4 and double-stranded RNA-binding factor 4 (DRB4) (Nakazawa et al., 2007). These siRNAs show a characteristic “head-to-tail” phased pattern when the 21 nt sequences are mapped to the genome sequence. The ta-siRNAs function mainly in AGO1-mediated RISCs, and “act in trans” on the mRNAs of target genes by base pairing. In Arabidopsis, tight feedback regulation among miR390, TAS3 ta-siRNA, and ARF4, which is targeted by TAS3 ta-siRNA, is required for lateral root initiation (Marin et al., 2010). The miR390–TAS3–ARF4 regulatory loop is thought to function as a mediator that connects abiotic factors and developmental stages by regulating transcription of the miR390 and TAS3 precursors and SGS3 stability (Bartels and Sunkar, 2005; Zhong et al., 2013; Matsui et al., 2014). The phased siRNAs (pha-siRNAs) are derived from lncRNAs and/or specific mRNAs by miRNA cleavage, similar to ta-siRNAs, except that they “act in cis” on local mRNAs. The pha-siRNAs derived from NBS-LRR mRNAs have been shown to mediate the immune response in plants (Shivaprasad et al., 2012; Yang et al., 2015). Another example of mRNA-derived pha-siRNAs is the conserved pha-siRNAs derived from the mRNAs of pentatricopeptide repeat-containing proteins (PPRs), which are triggered by conserved miRNAs and ta-siRNAs (Xia et al., 2013). Besides the 21 nt pha-siRNAs, 24 nt pha-siRNAs are generated from repetitive genome sequences and expressed in the male organ of rice (Johnson et al., 2009; Song et al., 2012). The 24 nt pha-siRNAs are processed by a DCL3 homolog, DCL3b, to target DNA methylation at the transcriptional level (Wu et al., 2010; Song et al., 2012).
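
The “head-to-tail” phased pattern means that siRNA start positions fall at regular 21 nt intervals counted from the miRNA cleavage site. A minimal Python sketch of how such a register could be checked is given below. It is an illustration under stated assumptions rather than an established phasing score: the function names are hypothetical, all coordinates are assumed to lie on the same strand, and the 2 nt offset of antisense-strand siRNAs caused by the 3′-overhangs is ignored.

```python
from collections import Counter


def phase_register(start, cleavage_site, period=21):
    """Offset of an siRNA start from the phased register set by the cleavage site."""
    return (start - cleavage_site) % period


def in_phase_fraction(starts, cleavage_site, period=21):
    """Fraction of siRNA starts that fall exactly on the phased register."""
    if not starts:
        return 0.0
    hits = sum(1 for s in starts
               if phase_register(s, cleavage_site, period) == 0)
    return hits / len(starts)


def dominant_register(starts, period=21):
    """Most common register and its share when the cleavage site is unknown."""
    counts = Counter(s % period for s in starts)
    register, n = counts.most_common(1)[0]
    return register, n / len(starts)


# Hypothetical start coordinates: four phased siRNAs and one out-of-phase read.
starts = [100, 121, 142, 163, 150]
print(in_phase_fraction(starts, cleavage_site=100))  # 0.8
print(dominant_register(starts))                     # (16, 0.8)
```

With real small RNA sequencing data, read abundance would normally be taken into account as well; the sketch counts each start position once for simplicity.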

6.4.3 Pol IV- and Pol V-derived lncRNAs and siRNAs

DNA methylation is the process by which methyl groups are added to the cytosine bases in DNA (Matzke and Mosher, 2014). While methylation occurs in the cytosine–guanine (CG) context in animals, in plants the methylation of cytosine can occur in three contexts, CG, CHG, and CHH, where H represents any base other than G. In symmetric CG and CHG methylation, which occurs during DNA synthesis, both daughter DNA strands are recognized by the DNA methylation enzymes DNA methyltransferase 1 (MET1) and chromomethylase 3 (CMT3), which use one strand as a template (i.e., maintenance methylation). Asymmetric CHH methylation occurs in only one DNA strand by RNA-directed DNA methylation (RdDM) in response to various conditions including stress (i.e., de novo methylation). In de novo methylation by RdDM, siRNA plays a vital role in recruiting the methylation complex, which requires the plant-specific RNA polymerases Pol IV and Pol V, as well as Pol II-derived RNAs at some loci (Wierzbicki et al., 2008; Zheng et al., 2009). In a major RdDM pathway, 24 nt siRNAs are generated from Pol IV transcripts through processes mediated by RDR2 and DCL3, followed by incorporation into AGO4 and/or AGO6 (Li et al., 2006a; Zheng et al., 2007). Pol V transcribes lncRNAs from target regions; these lncRNAs are triphosphorylated or capped at the 5′-ends and lack poly(A) tails. The Pol V-derived lncRNAs recruit the siRNA–AGO4 complex in a sequence-dependent manner, with the involvement of the DDR complex, which comprises defective in RNA-directed DNA methylation 1 (DRD1), defective in meristem silencing 3 (DMS3), and RNA-directed DNA methylation 1 (RDM1) (Matzke and Mosher, 2014). Finally, the Dnmt3 class de novo methylase, domains rearranged methylase 2 (DRM2), is recruited to add a methyl group to cytosines at CHH sites (Wierzbicki et al., 2008). Pol V associates with some target sequences with the assistance of the SU(VAR)3-9 homologs SUVH2 and SUVH9, which bind to methylated DNA but do not function in histone methylation, suggesting that Pol V also contributes to the maintenance of methylated DNA (Johnson et al., 2014).
Thousands of genomic regions that generate Pol IV/RDR2-dependent transcripts have been identified (Li et al., 2015). Most of them are located in intergenic regions; 65% of them overlap with annotated repetitive elements and transposons and 9% of them overlap with protein-coding genes in the genome. RdDM appears to be crucial in regulating the activities of retrotransposable elements. For example, defective RdDM was associated with activation of the retrotransposon ONSEN in response to thermal stress (Ito et al., 2011). Heat stress was shown to increase aberrant read-through transcription, which decreased during recovery from the stress in wild-type plants but not in RdDM mutants (Popova et al., 2013). RdDM was also implicated in genome inheritance and reprogramming during the reproductive stage (Gehring, 2019). For instance, pollen siRNAs were found to be produced from de-silenced transposons through activation of the DNA demethylase DEMETER and suppression of the chromatin remodeler DDM1 in vegetative pollen cells, and the siRNAs produced were translocated to the sperm cells, where they reinforced RdDM (Slotkin et al., 2009). Although the function of RdDM in female gametogenesis remains largely unknown, siRNAs are considered to move from the central cell to the egg to promote transposable element silencing, which is essential for seed development in Arabidopsis (Ibarra et al., 2012). These data suggest that RdDM regulates gene expression in coordination with other epigenetic controls (Matzke and Mosher, 2014). Such an RdDM mechanism has a profound impact on plant species that are rich in transposable elements. For example, the percentage of loci with transposable elements within the 500 bp promoter region is higher in crop plants with large genomes (e.g., 39% in tomato, 41% in rice, and 45% in maize) than in Arabidopsis, which has a small genome (approximately 18%) (Lang et al., 2017). The critical developmental defects found in RdDM mutants of these crops show that RdDM has major effects on gene regulation in crop plants (Alleman et al., 2006; Erhard et al., 2009; Wei et al., 2014).
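
The three cytosine contexts defined at the start of this section (CG, CHG, and CHH, with H being A, C, or T) depend only on the two bases downstream of each cytosine, so they can be assigned with a short scan of the sequence. The Python sketch below is a minimal illustration; the function name cytosine_contexts is hypothetical, only the given strand is examined, and cytosines whose context cannot be determined (too close to the sequence end, or followed by an ambiguous base) are labeled "unknown".

```python
def cytosine_contexts(seq):
    """Return (0-based position, context) for every cytosine on the given strand."""
    seq = seq.upper()
    result = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        downstream = seq[i + 1:i + 3]          # up to two bases 3' of the C
        if downstream[:1] == "G":
            context = "CG"
        elif len(downstream) < 2 or any(b not in "ACGT" for b in downstream):
            context = "unknown"                # sequence end or ambiguous base
        elif downstream[1] == "G":
            context = "CHG"
        else:
            context = "CHH"
        result.append((i, context))
    return result


# Toy sequence containing one cytosine of each context.
print(cytosine_contexts("ACGTCAGCTTC"))
# [(1, 'CG'), (4, 'CHG'), (7, 'CHH'), (10, 'unknown')]
```

To cover both strands, the same function could be applied to the reverse complement of the sequence, with positions converted back to forward-strand coordinates.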

6.4.4 RNA interfering events induced by cis-natural antisense RNAs (cis-NATs)

Recent analyses have revealed a large number of sense–antisense NAT pairs, in which RNAs are transcribed from both the sense and antisense strands of the same genomic region; these pairs correspond to 70% of the annotated mRNAs expressed in plant genomes (Yamada et al., 2003; Matsui et al., 2008; H. Wang et al., 2014). The expression of cis-NAT pairs appears to be highly responsive to abiotic stresses (e.g., drought, cold, salt, and light) (Matsui et al., 2008; H. Wang et al., 2014). Many light-responsive NATs located between protein-coding genes, as well as other ncRNAs, were found to exhibit high levels of histone acetylation that was dynamically correlated with changes in NAT expression (H. Wang et al., 2014). It has been suggested that NATs trigger the production of small RNAs through RNA interference between sense and antisense transcripts, leading to gene silencing. Studies that integrated RNA sequencing and small RNA sequencing data reported the accumulation of small RNAs in the overlapping regions of NAT pairs, suggesting the occurrence of RNA interference events (Ron et al., 2010; Yuan et al., 2015). In Arabidopsis, cis-NAT expression increased the level of the sense mRNA through NAT-siRNAs and inhibited miRNA-directed cleavage (Gao et al., 2015). However, compared with the large number of candidate cis-NAT loci, few examples of siRNA-mediated effects on their transcription have been reported, which implies that NAT-siRNA production is not the main function of NAT pairs. RNA interfering events that do not involve siRNAs can instead mask target sites on mRNAs. Long ncRNAs can upregulate the expression of mRNAs by masking miRNA target sites through base pairing (Cho and Paszkowski, 2017) and can also regulate intron splicing of sense transcripts by masking splicing sites through complementary base pairing (Bardou et al., 2014).
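
Candidate sense–antisense NAT pairs of the kind discussed above are defined by transcripts on opposite strands that overlap at the same genomic region. The Python sketch below shows one simple way such pairs could be listed from transcript coordinates. It is a hedged illustration, not the method used in the cited studies: the Transcript type, the function names, the min_overlap parameter, and the example coordinates are all hypothetical, and the all-against-all comparison is only suitable for small inputs.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    """Minimal transcript record (hypothetical type for this sketch)."""
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_length(a, b):
    """Number of shared bases between two transcripts (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_cis_nat_pairs(transcripts, min_overlap=1):
    """Return (plus-strand name, minus-strand name, overlap) for candidate pairs."""
    plus = [t for t in transcripts if t.strand == "+"]
    minus = [t for t in transcripts if t.strand == "-"]
    pairs = []
    for p in plus:
        for m in minus:
            shared = overlap_length(p, m)
            if shared >= min_overlap:
                pairs.append((p.name, m.name, shared))
    return pairs


# Hypothetical annotation: only T1 and T2 overlap on opposite strands.
annotation = [
    Transcript("T1", "chr1", 100, 500, "+"),
    Transcript("T2", "chr1", 400, 900, "-"),
    Transcript("T3", "chr1", 1000, 1500, "-"),
    Transcript("T4", "chr2", 100, 500, "-"),
]
print(find_cis_nat_pairs(annotation))  # [('T1', 'T2', 101)]
```

For genome-scale annotations, sorting transcripts by position or using an interval tree would avoid the quadratic comparison.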

6.4.5 Cis-NATs enhance mRNA translation

Cis-NATs have been shown to regulate the translation of their cognate sense mRNAs in Arabidopsis (Bazin et al., 2017; Deforges et al., 2019), and the translational effect was confirmed for four out of five tested cis-NAT–sense mRNA pairs in transgenic plants and by transient expression in protoplasts (Deforges et al., 2019). In rice, an antisense lncRNA of a cis-NAT pair was found to increase the translation of the opposite-strand PHO1-2 mRNA, which encodes a protein that loads phosphate into the xylem (Jabnoune et al., 2013). This cis-NAT (cis-NATPHO1-2), which is derived from the antisense strand of the PHO1-2 locus, was expressed only under phosphate starvation. Knockdown and overexpression of cis-NATPHO1-2 altered the PHO1-2 protein level without affecting the expression or nuclear export of PHO1-2 mRNA. Interestingly, cis-NATPHO1-2 expression correlated positively with the transport of the sense–antisense pair to polysomes (Jabnoune et al., 2013). In mice, antisense ncRNAs containing a repeat sequence, termed SINEUPs, were demonstrated to promote protein translation by binding to the region around the start codon and Kozak sequence of the mRNA on the opposite strand (Carrieri et al., 2012; Takahashi et al., 2018). These results indicate that the initiation of translation can be regulated by a type of antisense lncRNA.

6.4.6 Cis-NATs derived from RNA degradation

Other antisense ncRNAs are associated with dsRNAs that are produced from cleaved mRNAs by RDR6 (Gazzani et al., 2004; Luo and Chen, 2007). These antisense ncRNAs are related to RNA quality control and/or RNA decay pathways. Aberrant RNAs with premature stop codons or excessively long 3′ untranslated regions are processed by nonsense-mediated decay (Belostotsky and Sieburth, 2009; Kurihara et al., 2009). The aberrant RNAs are recognized through the retention of Up-Frameshift (UPF) proteins; the cap structure is then removed, followed by poly(A) degradation, and the aberrant RNAs are degraded from the 5′ and/or 3′ directions (Belostotsky and Sieburth, 2009). Some aberrant RNAs that escape nonsense-mediated decay were found to be converted into dsRNAs by SGS3, SDE5, and RDR6, and processed into siRNAs by DCL4 and DCL2 (Mourrain et al., 2000; Parent et al., 2015a, 2015b). miRNA-directed RISC cleavage also produces siRNAs, suggesting that RNAs lacking the cap structure and polyadenylation are targets for siRNA production (Gazzani et al., 2004; Luo and Chen, 2007). However, siRNA production was found to be only slightly affected in RNA decay mutants (Branscheid et al., 2015; Martínez de Alba et al., 2015; Yu et al., 2015). Clearly, the switch from the degradation of aberrant RNAs to siRNA synthesis is yet to be fully understood. A genome-wide analysis of plants growing under normal conditions identified antisense lncRNAs at more than 6000 protein-coding loci, and the antisense lncRNAs were co-expressed with the protein-coding RNAs on the sense strand in plants subjected to abiotic stress (Matsui et al., 2008). Among them, antisense ncRNAs > 1000 nt long were detected from protein-coding loci, and dsRNA production was redundantly mediated by at least three of the six Arabidopsis RDRs (RDR1/2/6) (Matsui et al., 2017). The RDR1/2/6-dependent antisense lncRNAs promote degradation of the mRNA on the opposite strand, which corresponds to poly(A)- sense RNA degradation without siRNA production. Exoribonuclease activity is reduced by increased levels of the phosphonucleotide 3′-phosphoadenosine 5′-phosphate (PAP) (Estavillo et al., 2011), whereas de-capping activity is increased by MPK6- and SnRK2-mediated phosphorylation of DCP1 and VARICOSE (Xu and Chua, 2012; Soma et al., 2017). Abiotic stress seems to induce the accumulation of aberrant RNAs, such as the antisense lncRNAs, and to increase mRNA de-capping activity in plants (Matsui et al., 2019).

6.4.7 lncRNAs COLDAIR, COOLAIR, and COLDWRAP that regulate chromatin modification at the FLC locus

Many lncRNAs are involved in gene silencing through chromatin modification. One example is FLOWERING LOCUS C (FLC), which encodes a MADS-box-containing transcription factor, and its associated lncRNAs (e.g., COLDAIR, COOLAIR, and COLDWRAP), which orchestrate floral induction through epigenetic modification. COLDAIR is transcribed from the first intron of FLC as a 5′ capped, non-polyadenylated lncRNA by Pol II in the sense direction of FLC (Heo and Sung, 2011). During the vernalization period, COLDAIR expression is induced and COLDAIR then recruits the polycomb repressive complex 2 (PRC2) by interacting directly with one of its components, the H3K27 trimethyltransferase CURLY LEAF (CLF), thereby increasing H3K27me3 repressive marks at the FLC locus (Buzas et al., 2011; Heo and Sung, 2011). CLF may indirectly repress H3K4me3 positive marks at the FLC locus by increasing H3K27me3 levels, leading to reduced FLC transcription (Pien et al., 2008). COOLAIR is an lncRNA derived from the antisense strand of FLC, and it is induced before COLDAIR transcription and before the increase in H3K27me3 levels (Liu et al., 2010). COOLAIR was found to function during the early phase of vernalization independently of PRC2 and H3K27me3 (Csorba et al., 2014). COOLAIR also participates in the autonomous pathway to repress FLC. A short splice variant of COOLAIR, which is processed by the cleavage stimulation factors CstF64 and CstF77 and the spliceosome factor PRP8, affects the recruitment of the H3K4 histone demethylase homolog FLD (FLOWERING LOCUS D) to FLC, leading to H3K4me2 demethylation at FLC (Liu et al., 2007; Hornyik et al., 2010; Marquardt et al., 2014). Another study also reported direct interaction between COOLAIR processing components and chromatin modifiers (Fang et al., 2020). FCA and FY are linked to 3′ RNA processing of proximal COOLAIR (Liu et al., 2010). Set domain group 26 (SDG26), which associates with the chromatin modifiers luminidependens (LD) and FLD to promote H3K27me3 accumulation, interacts with the RNA-binding protein flowering control locus A (FCA) and the 3′ processing factor flowering locus Y (FY), suggesting that RNA processing promotes H3K27me3 (Fang et al., 2020). Different types of lncRNAs have been isolated from the FLC locus. COLDWRAP is a non-polyadenylated lncRNA that is associated with the upstream region of FLC and runs in the same direction as FLC (Kim and Sung, 2017). COLDWRAP is induced during vernalization and, together with the polycomb complex, is associated with the chromatin loop that forms between the FLC promoter and the 3′ end of the first intron, which stabilizes FLC repression.
Altogether, these studies on chromatin regulation of the FLC locus indicate that several lncRNAs mediate chromatin modification to modulate gene expression according to prevailing environmental conditions and developmental stages.


6.4.8 ENOD40 and ASCO, mRNA-like long intergenic ncRNAs that regulate alternative splicing events by interacting with RNA-binding protein

Several lncRNAs regulate the splicing of precursor mRNAs (pre-mRNAs) by interacting with RNA splicing factors. The lncRNA ENOD40, which is highly conserved in legumes and other plant species (Gultyaev and Roussis, 2007), is involved in the organogenesis of root symbiotic nodules (Yang et al., 1993). The co-localization of ENOD40 and MtRBP1 with splicing factors in nuclear speckles, which are sites of splicing regulation, has been reported in Medicago truncatula (Campalans et al., 2004). The translocation activity of ENOD40 was found to be conserved for AtNSRs, the closest Arabidopsis homologs of MtRBP1 (Bardou et al., 2014). AtNSRs act as nuclear regulators of alternative splicing in response to auxin, and can interact with an lncRNA designated alternative splicing competitor (ASCO), also known as the structured lncRNA lnc351 (Bardou et al., 2014). Expression of ENOD40 and ASCO in Arabidopsis affects the splicing pattern of target mRNAs. Additionally, ASCO was found to compete with an alternative splicing target for binding to NSRs in vitro, suggesting that lncRNA regulation of the splicing system is conserved among plant species. In wheat, more than 1000 lncRNAs were estimated to have small nuclear ribonucleoprotein motifs, implying that many lncRNAs play important roles in spliceosomes (Zhang et al., 2016).

6.4.9 APOLO and HID1, long intergenic ncRNAs forming RNA–DNA hybrids that repress gene expression

Several lncRNAs form RNA–DNA hybrids and regulate the formation of local chromatin loop structures by recruiting chromatin-modifying complexes. Auxin-regulated promoter loop RNA (APOLO), which was isolated as npc34 (Ben Amor et al., 2009), is transcribed by two RNA polymerases, Pol II and Pol V, from a region located about 5 kb upstream of Pinoid (PID), a gene that encodes a key regulator of polar auxin transport (Ariel et al., 2014). The progressive interaction between LIKE HETEROCHROMATIC PROTEIN 1 (LHP1) and APOLO promotes the formation of local chromatin loop structures and the basal epigenetic landscape, repressing PID and APOLO expression. Auxin treatment leads to decreases in DNA methylation and in the H3K27me3 repressive mark in the APOLO–PID genomic region. The reduction in repressive marks promotes the formation of an open loop structure, allowing the PID mRNA and APOLO lncRNA to be transcribed by Pol II. Subsequently, APOLO lncRNA gradually recruits LHP1 to close the loop. In turn, the transcription of APOLO by Pol V triggers RdDM, and PRC2 likely redeposits the repressive marks. Finally, the APOLO lncRNA-mediated chromatin loop is reformed and PID is down-regulated (Ariel et al., 2014). The HIDDEN TREASURE 1 (HID1) lncRNA was shown to repress the expression of phytochrome-interacting factor 3 (PIF3) by acting in trans as part of a large nuclear protein–RNA complex that binds to the chromatin of the first intron of PIF3 (Wang et al., 2014b). The rice homolog, OsHID1, rescued the phenotype of an hid1 mutant in Arabidopsis, suggesting that the function of HID1 is conserved across species (Wang et al., 2014b).

6.4.10 ceRNA/RNA sponge/RNA decoy mimic miRNA targets

RNA decoys are lncRNAs that can act as "RNA sponges", or competing endogenous RNAs (ceRNAs), through a short sequence that shares homology with the miRNA-binding sites of target mRNAs. The lncRNA INDUCED BY PHOSPHATE STARVATION 1 (IPS1), which prevents interaction between UBC24 mRNA and miR399, is a major regulator of plant phosphate homeostasis (Shin et al., 2006; Franco-Zorrilla et al., 2007). IPS1 and its family members, AT4.1 and AT4.2, have a highly conserved 22 nt sequence that is not perfectly complementary to miR399, due to a mismatch at positions 10–11 (Shin et al., 2006; Franco-Zorrilla et al., 2007). Although miR399 can bind to IPS1 lncRNA, no miR399-dependent cleavage has been detected, and induction of IPS1 under phosphate starvation resulted in the release of miR399 from its target UBC24 mRNA. Genome-wide analyses of Arabidopsis and rice identified other conserved lncRNAs that potentially act as target mimics of 20 miRNAs (Wu et al., 2013). Recently, 2669 non-redundant endogenous target mimics were detected by searching for miRNA target sites in genome and transcriptome datasets of 43 plant species (Deng et al., 2020). Long intergenic ncRNAs that mimic miRNA targets have also been found in animals, suggesting that this is a prevalent mechanism that prevents negative regulation of target genes by miRNAs (Kartha and Subramanian, 2014).
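The target-mimicry logic described above can be illustrated with a minimal Python sketch. It assumes an ungapped, equal-length duplex and accepts G:U wobble pairs, which is a simplification; the natural IPS1 site is not modelled here. The sequences, the threshold `max_other_mismatches`, and the function names `mismatched_positions` and `classify_site` are hypothetical and chosen only for illustration. A site that pairs well overall but is unpaired opposite miRNA position 10 or 11 is labelled mimic-like, following the description of IPS1 as binding miR399 without detectable cleavage.

```python
# Watson-Crick pairs plus G:U wobble (accepting wobble is an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
CLEAVAGE_POSITIONS = (10, 11)  # miRNA positions (counted from its 5' end) flanking the cleavage site


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (alphabet ACGU)."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def mismatched_positions(mirna, site):
    """Return the 1-based miRNA positions that do not pair with the site.

    Both sequences are written 5'->3'. The duplex is antiparallel, so miRNA
    position 1 faces the last base of the site.
    """
    if len(mirna) != len(site):
        raise ValueError("this sketch only handles ungapped, equal-length duplexes")
    return [i + 1 for i, pair in enumerate(zip(mirna, reversed(site))) if pair not in PAIRS]


def classify_site(mirna, site, max_other_mismatches=3):
    """Label a site by whether the pairing is broken at the cleavage positions."""
    mismatches = mismatched_positions(mirna, site)
    central = [p for p in mismatches if p in CLEAVAGE_POSITIONS]
    other = [p for p in mismatches if p not in CLEAVAGE_POSITIONS]
    if len(other) > max_other_mismatches:
        return "no stable pairing"
    return "mimic-like" if central else "cleavable-like"


# Hypothetical 21-nt miRNA (not a real miR399 sequence).
toy_mirna = "AUGCAUGCAUGCAUGCAUGCA"
perfect_site = reverse_complement(toy_mirna)

# Build a mimic-like site by breaking the pairs opposite miRNA positions 10 and 11.
bases = list(perfect_site)
bases[len(toy_mirna) - 10] = "C"  # faces U at miRNA position 10
bases[len(toy_mirna) - 11] = "A"  # faces G at miRNA position 11
mimic_site = "".join(bases)

print(classify_site(toy_mirna, perfect_site))
print(mismatched_positions(toy_mirna, mimic_site), classify_site(toy_mirna, mimic_site))
```

Genome-wide searches for endogenous target mimics would need gapped alignment and additional filters; this sketch only shows the positional rule.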

6.4.11 Circular RNA

In eukaryotes, closed-loop RNAs that have a covalent bond linking the 5′ and 3′ ends, called circRNAs, are produced by non-canonical splicing of pre-mRNAs, referred to as back splicing (Chen, 2016). circRNAs are stable and can be captured from rRNA-depleted total RNA by RNase R treatment. The presence of circRNAs in plants was confirmed initially in Arabidopsis (P. Wang et al., 2014). The plant circRNA databases, PlantCircNet (Zhang et al., 2017) and PlantcircBase 3.0 (Chu et al., 2018), contain information about >100,000 circRNAs from different species. In canonical splicing events, GT and AG dinucleotides mark the 5′ and 3′ splice sites, respectively. In animals, most circRNAs are spliced at the canonical GT/AG signal. In plants, the GT/AG splicing signal was found in Arabidopsis and cotton, whereas diverse non-GT/AG signals were reported in rice (Sun et al., 2016; Ye et al., 2017; Zhao et al., 2017). The molecular function of circRNAs is largely unknown, although there is some evidence that circRNAs play a role in controlling gene expression. In Arabidopsis, circRNAs were identified as candidate miRNA target mimics and the function of one of them was confirmed (Frydrych Capelari et al., 2019). In rice and Arabidopsis, only 6.6% and 5.0% of circRNAs, respectively, were found to contain putative miRNA-binding sites (Ye et al., 2015). An exonic circRNA derived from the SEPALLATA 3 (SEP3) transcript was found to bind to the DNA strand in the SEP3 locus, and the resulting RNA–DNA hybrid promoted exon-skipped alternative splicing of the SEP3 pre-mRNA (Conn et al., 2017). The molecular function and the biological role of circRNAs will be better understood once their expression patterns are characterized and their target sequences are identified.
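The GT/AG check mentioned above can be sketched in Python as follows. The sketch assumes 0-based, half-open coordinates on the forward genomic strand, and assumes that the circularized region is flanked upstream by the acceptor end of an intron and downstream by the donor end, which is how a back-splice junction is usually described. The toy sequence and the function names `backsplice_signal` and `is_canonical` are hypothetical.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (alphabet ACGT)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def backsplice_signal(genome, start, end, strand="+"):
    """Return the (donor, acceptor) dinucleotides flanking a circularized region.

    `start` and `end` are 0-based, half-open coordinates on the forward strand.
    In back splicing, the donor site downstream of the region (in transcript
    orientation) is joined to the acceptor site upstream of it.
    """
    if start < 2 or end + 2 > len(genome) or start >= end:
        raise ValueError("region must leave two flanking bases on each side")
    left_flank = genome[start - 2:start]
    right_flank = genome[end:end + 2]
    if strand == "+":
        return right_flank, left_flank
    if strand == "-":
        # On the minus strand the transcript runs right to left, so the flanks swap roles.
        return reverse_complement(left_flank), reverse_complement(right_flank)
    raise ValueError("strand must be '+' or '-'")


def is_canonical(genome, start, end, strand="+"):
    """True if the flanking dinucleotides are the canonical GT (donor) and AG (acceptor)."""
    return backsplice_signal(genome, start, end, strand) == ("GT", "AG")


# Toy locus: intron end (..AG), a 10-nt circularized exon, intron start (GT..).
toy_plus = "CCAG" + "ATGGCCATTG" + "GTAA"
toy_minus = reverse_complement(toy_plus)       # same locus placed on the minus strand
toy_noncanonical = "CCAC" + "ATGGCCATTG" + "CTAA"

print(backsplice_signal(toy_plus, 4, 14))
print(is_canonical(toy_minus, 4, 14, strand="-"))
print(backsplice_signal(toy_noncanonical, 4, 14))
```

Applied to a set of predicted junctions, a tally of the returned dinucleotide pairs would give the kind of splicing-signal comparison between species that the text refers to.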

6.4.12 RNA polymerase III-derived lncRNAs

NcRNAs that are <400 nt long are transcribed by RNA polymerase III (Pol III) (Schramm and Hernandez, 2002). Genome sequence analysis of Arabidopsis detected 20 novel putative lncRNAs (<300 nt long) that are transcribed by RNA polymerase III (Wu et al., 2012). Although their molecular function was unclear, the Pol III-derived lncRNAs were considered to have a biological function due to the specificity of the conditions in which expression is induced. AtR8 in Arabidopsis was induced by salicylic acid and expressed during seed germination, which decreased seed germination in the wild type and in an AtR8 partial deletion mutant (Li et al., 2016).

6.4.13 Viroids: sub-viral plant-pathogenic lncRNAs

Viroids are plant-pathogenic lncRNAs, 240–400 nt long, composed of a circular single-stranded RNA. During infection, viroids are amplified by rolling circle replication, in which host DNA-dependent RNA polymerases are redirected to use the viroid RNA as a template. The viroids are then transported to neighboring cells or vascular tissues (Ding, 2009). Interestingly, part of the peach latent mosaic viroid (PLMVd) sequence matched the chloroplast heat shock protein mRNA, suggesting that small RNAs derived from PLMVd guide the degradation of the host mRNA (Navarro et al., 2012). DNA methylation was induced on viroid-specific sequences in the host-plant genome when autonomous viroid RNA–RNA replication had taken place in transgenic tobacco plants expressing potato spindle tuber viroid (PSTVd) (Wassenegger et al., 1994).
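A search for viroid-derived small RNAs that could pair with a host mRNA can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the cited study: it requires perfect complementarity, considers only one viroid strand, and uses toy sequences with a short window so the example stays readable. The function names `circular_windows` and `candidate_guides` are hypothetical. Because the viroid genome is circular, windows that span the junction between the last and first nucleotides are included.

```python
def reverse_complement(rna):
    """Return the reverse complement of an RNA string (alphabet ACGU)."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def circular_windows(seq, k):
    """Yield (start, window) for every k-mer of a circular sequence,
    including windows that wrap around the end of the linear string."""
    if k < 1 or k > len(seq):
        raise ValueError("window length must be between 1 and the sequence length")
    doubled = seq + seq[:k - 1]
    for start in range(len(seq)):
        yield start, doubled[start:start + k]


def candidate_guides(viroid, host_mrna, k=21):
    """Return (viroid_start, window, host_position) for windows whose reverse
    complement occurs in the host mRNA (perfect complementarity is assumed)."""
    hits = []
    for start, window in circular_windows(viroid, k):
        position = host_mrna.find(reverse_complement(window))
        if position != -1:
            hits.append((start, window, position))
    return hits


# Toy data: a 20-nt "viroid" and a host fragment containing one complementary site.
toy_viroid = "GGAUCCAUGCUAGCUUAACG"
toy_host = "GGGG" + "UCCCGU" + "GGGG"

for hit in candidate_guides(toy_viroid, toy_host, k=6):
    print(hit)
```

With real data, `k` would be set to the small RNA length of interest (for example 21), and mismatches would normally be tolerated by a dedicated target-prediction tool.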

6.5 Concluding Remarks

This review highlights the classification of ncRNAs and their biological and molecular functions. NGS technologies have revealed that various types of ncRNAs are transcribed in different regions of a genome. The ncRNAs are known to play various roles in epigenetic regulation, transcription, processing, translational activity, and mRNA stability. Although studies have evolved from accidental phenotypic confirmation to functional analysis, further research is needed to discover the full genome-wide range of ncRNAs. Several studies have indicated that aptamer-tagged RNA sequences may be useful for studying the cellular localization of lncRNAs and their interactions with RNA-binding proteins. The secondary structure of lncRNAs is known to be required for their stability and protein binding. RNA-binding protein immunoprecipitation sequencing (RIP-sequencing) and computational analysis of sequence motifs and/or significant secondary structures are important tools for identifying potential functional parts of lncRNAs. The accumulation of knowledge from such studies will lead to a better understanding of the functional significance of new and known lncRNAs.

Acknowledgments

This work was supported by grants from RIKEN and Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science, and Technology (Innovative Areas 18H04791 and 18H04705) to M.S. We thank Margaret Biswas, PhD, from Edanz Group (https://en-author-services.edanz.com/ac, accessed September 2022) for editing a draft of this manuscript.

References

Alexander, R.P., Fang, G., Rozowsky, J., Snyder, M. and Gerstein, M.B. (2010) Annotating non-coding regions of the genome. Nature Reviews. Genetics 11(8), 559–571. DOI: 10.1038/nrg2814.
Alleman, M., Sidorenko, L., McGinnis, K., Seshadri, V., Dorweiler, J.E. et al. (2006) An RNA-dependent RNA polymerase is required for paramutation in maize. Nature 442(7100), 295–298. DOI: 10.1038/nature04884.
Allen, E., Xie, Z., Gustafson, A.M. and Carrington, J.C. (2005) MicroRNA-directed phasing during trans-acting siRNA biogenesis in plants. Cell 121(2), 207–221. DOI: 10.1016/j.cell.2005.04.004.
Ariel, F., Jegu, T., Latrasse, D., Romero-Barrios, N., Christ, A. et al. (2014) Noncoding transcription by alternative RNA polymerases dynamically regulates an auxin-driven chromatin loop. Molecular Cell 55(3), 383–396. DOI: 10.1016/j.molcel.2014.06.011.
Ariel, F., Romero-Barrios, N., Jégu, T., Benhamed, M. and Crespi, M. (2015) Battles and hijacks: noncoding transcription in plants. Trends in Plant Science 20(6), 362–371. DOI: 10.1016/j.tplants.2015.03.003.
Axtell, M.J. and Meyers, B.C. (2018) Revisiting criteria for plant microRNA annotation in the era of big data. The Plant Cell 30(2), 272–284. DOI: 10.1105/tpc.17.00851.
Bardou, F., Ariel, F., Simpson, C.G., Romero-Barrios, N., Laporte, P. et al. (2014) Long noncoding RNA modulates alternative splicing regulators in Arabidopsis. Developmental Cell 30(2), 166–176. DOI: 10.1016/j.devcel.2014.06.017.
Bartels, D. and Sunkar, R. (2005) Drought and salt tolerance in plants. Critical Reviews in Plant Sciences 24(1), 23–58. DOI: 10.1080/07352680590910410.
Bazin, J., Baerenfaller, K., Gosai, S.J., Gregory, B.D., Crespi, M. et al. (2017) Global analysis of ribosome-associated noncoding RNAs unveils new modes of translational regulation. Proceedings of the National Academy of Sciences 114(46), E10018–E10027. DOI: 10.1073/pnas.1708433114.
Belostotsky, D.A. and Sieburth, L.E. (2009) Kill the messenger: mRNA decay and plant development. Current Opinion in Plant Biology 12(1), 96–102. DOI: 10.1016/j.pbi.2008.09.003.
Ben Amor, B., Wirth, S., Merchan, F., Laporte, P., d’Aubenton-Carafa, Y. et al. (2009) Novel long non-protein coding RNAs involved in Arabidopsis differentiation and stress responses. Genome Research 19(1), 57–69. DOI: 10.1101/gr.080275.108.
Bhatia, G., Goyal, N., Sharma, S., Upadhyay, S.K. and Singh, K. (2017) Present scenario of long non-coding RNAs in plants. Non-Coding RNA 3(2), 16. DOI: 10.3390/ncrna3020016.
Branscheid, A., Marchais, A., Schott, G., Lange, H., Gagliardi, D. et al. (2015) SKI2 mediates degradation of RISC 5′-cleavage fragments and prevents secondary siRNA production from miRNA targets in Arabidopsis. Nucleic Acids Research 43(22), 10975–10988. DOI: 10.1093/nar/gkv1014.
Buzas, D.M., Robertson, M., Finnegan, E.J. and Helliwell, C.A. (2011) Transcription-dependence of histone H3 lysine 27 trimethylation at the Arabidopsis polycomb target gene FLC. The Plant Journal 65(6), 872–881. DOI: 10.1111/j.1365-313X.2010.04471.x.
Campalans, A., Kondorosi, A. and Crespi, M. (2004) Enod40, a short open reading frame-containing mRNA, induces cytoplasmic localization of a nuclear RNA binding protein in Medicago truncatula. The Plant Cell 16(4), 1047–1059. DOI: 10.1105/tpc.019406.
Carrieri, C., Cimatti, L., Biagioli, M., Beugnet, A., Zucchelli, S. et al. (2012) Long non-coding antisense RNA controls Uchl1 translation through an embedded SINEB2 repeat. Nature 491, 454–457. DOI: 10.1038/nature11508.
Chen, L.L. (2016) The biogenesis and emerging roles of circular RNAs. Nature Reviews Molecular Cell Biology 17, 205–211. DOI: 10.1038/nrm.2015.32.
Chen, X. (2004) A microRNA as a translational repressor of APETALA2 in Arabidopsis flower development. Science 303, 2022–2025. DOI: 10.1126/science.1088060.
Cho, J. and Paszkowski, J. (2017) Regulation of rice root development by a retrotransposon acting as a microRNA sponge. eLife 6, e30038. DOI: 10.7554/eLife.30038.
Chu, Q., Bai, P., Zhu, X., Zhang, X., Mao, L. et al. (2018) Characteristics of plant circular RNAs. Briefings in Bioinformatics 21, 135–143. DOI: 10.1093/bib/bby111.
Conn, V.M., Hugouvieux, V., Nayak, A., Conos, S.A., Capovilla, G. et al. (2017) A circRNA from SEPALLATA3 regulates splicing of its cognate mRNA through R-loop formation. Nature Plants 3, 17053. DOI: 10.1038/nplants.2017.53.

Crespi, M.D., Jurkevitch, E., Poiret, M., d’Aubenton-Carafa, Y., Petrovics, G. et al. (1994) enod40, a gene expressed during nodule organogenesis, codes for a non-translatable RNA involved in plant growth. The EMBO Journal 13(21), 5099–5112. DOI: 10.1002/j.1460-2075.1994.tb06839.x.
Csorba, T., Questa, J.I., Sun, Q. and Dean, C. (2014) Antisense COOLAIR mediates the coordinated switching of chromatin states at FLC during vernalization. Proceedings of the National Academy of Sciences 111(45), 16160–16165. DOI: 10.1073/pnas.1419030111.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J. et al. (2006) A high-resolution map of transcription in the yeast genome. Proceedings of the National Academy of Sciences 103(14), 5320–5325. DOI: 10.1073/pnas.0601091103.
Deforges, J., Reis, R.S., Jacquet, P., Sheppard, S., Gadekar, V.P. et al. (2019) Control of cognate sense mRNA translation by cis-natural antisense RNAs. Plant Physiology 180(1), 305–322. DOI: 10.1104/pp.19.00043.
Deng, J., Li, Q., Huang, L., Tang, W., Zheng, K. et al. (2020) PendoTMBase: a database for plant endogenous target mimics. Interdisciplinary Sciences, Computational Life Sciences 12(4), 526–529. DOI: 10.1007/s12539-020-00396-2.
Ding, B. (2009) The biology of viroid-host interactions. Annual Review of Phytopathology 47, 105–131. DOI: 10.1146/annurev-phyto-080508-081927.
Djebali, S., Davis, C.A., Merkel, A., Dobin, A., Lassmann, T. et al. (2012) Landscape of transcription in human cells. Nature 489(7414), 101–108. DOI: 10.1038/nature11233.
Erhard, K.F.Jr., Stonaker, J.L., Parkinson, S.E., Lim, J.P., Hale, C.J. et al. (2009) RNA polymerase IV functions in paramutation in Zea mays. Science 323(5918), 1201–1205. DOI: 10.1126/science.1164508.
Estavillo, G.M., Crisp, P.A., Pornsiriwong, W., Wirtz, M., Collinge, D. et al. (2011) Evidence for a SAL1-PAP chloroplast retrograde pathway that functions in drought and high light signaling in Arabidopsis. The Plant Cell 23(11), 3992–4012. DOI: 10.1105/tpc.111.091033.
Fang, X., Wu, Z., Raitskin, O., Webb, K., Voigt, P. et al. (2020) The 3′ processing of antisense RNAs physically links to chromatin-based transcriptional control. Proceedings of the National Academy of Sciences 117(26), 15316–15321. DOI: 10.1073/pnas.2007268117.
Feng, L., Zhang, F., Zhang, H., Zhao, Y., Meyers, B.C. et al. (2020) An online database for exploring over 2,000 Arabidopsis small RNA libraries. Plant Physiology 182(2), 685–691. DOI: 10.1104/pp.19.00959.
Franco-Zorrilla, J.M., Valli, A., Todesco, M., Mateos, I., Puga, M.I. et al. (2007) Target mimicry provides a new mechanism for regulation of microRNA activity. Nature Genetics 39(8), 1033–1037. DOI: 10.1038/ng2079.
Frydrych Capelari, É., da Fonseca, G.C., Guzman, F. and Margis, R. (2019) Circular and micro RNAs from Arabidopsis thaliana flowers are simultaneously isolated from AGO-IP libraries. Plants (Basel, Switzerland) 8(9), 302. DOI: 10.3390/plants8090302.
Gandikota, M., Birkenbihl, R.P., Höhmann, S., Cardon, G.H., Saedler, H. et al. (2007) The miRNA156/157 recognition element in the 3′ UTR of the Arabidopsis SBP box gene SPL3 prevents early flowering by translational inhibition in seedlings. The Plant Journal 49(4), 683–693. DOI: 10.1111/j.1365-313X.2006.02983.x.
Gao, W., Liu, W., Zhao, M. and Li, W.X. (2015) NERF encodes a RING E3 ligase important for drought resistance and enhances the expression of its antisense gene NFYA5 in Arabidopsis. Nucleic Acids Research 43(1), 607–617. DOI: 10.1093/nar/gku1325.
Gazzani, S., Lawrenson, T., Woodward, C., Headon, D. and Sablowski, R. (2004) A link between mRNA turnover and RNA interference in Arabidopsis. Science 306(5698), 1046–1048. DOI: 10.1126/science.1101092.
Gehring, M. (2019) Epigenetic dynamics during flowering plant reproduction: evidence for reprogramming? The New Phytologist 224(1), 91–96. DOI: 10.1111/nph.15856.
Gultyaev, A.P. and Roussis, A. (2007) Identification of conserved secondary structures and expansion segments in enod40 RNAs reveals new enod40 homologues in plants. Nucleic Acids Research 35(9), 3144–3152. DOI: 10.1093/nar/gkm173.
Guo, Z., Kuang, Z., Wang, Y., Zhao, Y., Tao, Y. et al. (2020) PmiREN: a comprehensive encyclopedia of plant miRNAs. Nucleic Acids Research 48(D1), D1114–D1121. DOI: 10.1093/nar/gkz894.
Heo, J.B. and Sung, S. (2011) Vernalization-mediated epigenetic silencing by a long intronic noncoding RNA. Science 331, 76–79. DOI: 10.1126/science.1197349.
Hornyik, C., Terzi, L.C. and Simpson, G.G. (2010) The spen family protein FPA controls alternative cleavage and polyadenylation of RNA. Developmental Cell 18, 203–213. DOI: 10.1016/j.devcel.2009.12.009.

Ibarra, C.A., Feng, X., Schoft, V.K., Hsieh, T.F., Uzawa, R. et al. (2012) Active DNA demethylation in plant companion cells reinforces transposon methylation in gametes. Science 14, 1360–1364. DOI: 10.1126/science.1224839.
Ito, H., Gaubert, H., Bucher, E., Mirouze, M., Vaillant, I. et al. (2011) An siRNA pathway prevents transgenerational retrotransposition in plants subjected to stress. Nature 472, 115–119. DOI: 10.1038/nature09861.
Jabnoune, M., Secco, D., Lecampion, C., Robaglia, C., Shu, O. et al. (2013) A rice cis-natural antisense RNA acts as a translational enhancer for its cognate mRNA and contributes to phosphate homeostasis and plant fitness. Plant Cell 25, 4166–4182. DOI: 10.1105/tpc.113.116251.
Johnson, C., Kasprzewska, A., Tennessen, K., Fernandes, J., Nan, G.-L. et al. (2009) Clusters and superclusters of phased small RNAs in the developing inflorescence of rice. Genome Research 19(8), 1429–1440. DOI: 10.1101/gr.089854.108.
Johnson, L.M., Du, J., Hale, C.J., Bischof, S., Feng, S. et al. (2014) SRA- and SET-domain-containing proteins link RNA polymerase V occupancy to DNA methylation. Nature 507(7490), 124–128. DOI: 10.1038/nature12931.
Källman, T., Chen, J., Gyllenstrand, N. and Lagercrantz, U. (2013) A significant fraction of 21-nucleotide small RNA originates from phased degradation of resistance genes in several perennial species. Plant Physiology 162(2), 741–754. DOI: 10.1104/pp.113.214643.
Kartha, R.V. and Subramanian, S. (2014) Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation. Frontiers in Genetics 5, 8. DOI: 10.3389/fgene.2014.00008.
Kim, D.H. and Sung, S. (2017) Vernalization-triggered intragenic chromatin loop formation by long noncoding RNAs. Developmental Cell 40(3), 302–312. DOI: 10.1016/j.devcel.2016.12.021.
Kozomara, A., Birgaoanu, M. and Griffiths-Jones, S. (2019) miRBase: from microRNA sequences to function. Nucleic Acids Research 47, D155–D162. DOI: 10.1093/nar/gky1141.
Kurihara, Y., Matsui, A., Hanada, K., Kawashima, M., Ishida, J. et al. (2009) Genome-wide suppression of aberrant mRNA-like noncoding RNAs by NMD in Arabidopsis. Proceedings of the National Academy of Sciences 106, 2453–2458. DOI: 10.1073/pnas.0808902106.
Lang, Z., Wang, Y., Tang, K., Tang, D., Datsenka, T. et al. (2017) Critical roles of DNA demethylation in the activation of ripening-induced genes and inhibition of ripening-repressed genes in tomato fruit. Proceedings of the National Academy of Sciences 114, E4511–E4519. DOI: 10.1073/pnas.1705233114.
Lee, R.C., Feinbaum, R.L. and Ambros, V. (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854. DOI: 10.1016/0092-8674(93)90529-Y.
Li, C.F., Pontes, O., El-Shami, M., Henderson, I.R., Bernatavichute, Y.V. et al. (2006a) An ARGONAUTE4-containing nuclear processing center colocalized with Cajal bodies in Arabidopsis thaliana. Cell 126, 93–106. DOI: 10.1016/j.cell.2006.05.032.
Li, L., Wang, X., Stolc, V., Li, X., Zhang, D. et al. (2006b) Genome-wide transcription analyses in rice using tiling microarrays. Nature Genetics 38, 124–129. DOI: 10.1038/ng1704.
Li, D., Huang, X., Liu, Z., Li, S., Okada, T. et al. (2016) Effect of AtR8 lncRNA partial deletion on Arabidopsis seed germination. Molecular Soil Biology 7, 1–7. DOI: 10.5376/msb.2016.07.0007.
Li, F., Pignatta, D., Bendix, C., Brunkard, J.O., Cohn, M.M. et al. (2012) MicroRNA regulation of plant innate immune receptors. Proceedings of the National Academy of Sciences 109, 1790–1805. DOI: 10.1073/pnas.1118282109.
Li, S., Vandivier, L.E., Tu, B., Gao, L., Won, S.Y. et al. (2015) Detection of Pol IV/RDR2-dependent transcripts at the genomic scale in Arabidopsis reveals features and regulation of siRNA biogenesis. Genome Research 25, 235–245. DOI: 10.1101/gr.182238.114.
Liu, C., Muchhal, U.S. and Raghothama, K.G. (1997) Differential expression of TPS11, a phosphate starvation-induced gene in tomato. Plant Molecular Biology 33, 867–874. DOI: 10.1023/A:1005729309569.
Liu, F., Marquardt, S., Lister, C., Swiezewski, S. and Dean, C. (2010) Targeted 3′ processing of antisense transcripts triggers Arabidopsis FLC chromatin silencing. Science 327, 94–97. DOI: 10.1126/science.1180278.
Liu, F., Quesada, V., Crevillén, P., Swiezewski, S. and Dean, C. (2007) The Arabidopsis RNA-binding protein FCA requires a lysine-specific demethylase 1 homolog to downregulate FLC. Molecular Cell 28, 398–407. DOI: 10.1016/j.molcel.2007.10.018.
Liu, J., Jung, C., Xu, J., Wang, H., Deng, S. et al. (2012) Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. Plant Cell 24, 4333–4345. DOI: 10.1105/tpc.112.102855.

Lukiw, W.J., Handley, P., Wong, L. and Crapper McLachlan, D.R. (1992) BC200 RNA in normal human neocortex, non-Alzheimer dementia (NAD), and senile dementia of the Alzheimer type (AD). Neurochemical Research 17, 591–597. DOI: 10.1007/BF00968788.
Lunardon, A., Johnson, N.R., Hagerott, E., Phifer, T., Polydore, S. et al. (2020) Integrated annotations and analyses of small RNA-producing loci from 47 diverse plants. Genome Research 30, 497–513. DOI: 10.1101/gr.256750.119.
Luo, Z. and Chen, Z. (2007) Improperly terminated, unpolyadenylated mRNA of sense transgenes is targeted by RDR6-mediated RNA silencing in Arabidopsis. Plant Cell 19, 943–958. DOI: 10.1105/tpc.106.045724.
Marin, E., Jouannet, V., Herz, A., Lokerse, A.S., Weijers, D. et al. (2010) miR390, Arabidopsis TAS3 tasiRNAs, and their AUXIN RESPONSE FACTOR targets define an autoregulatory network quantitatively regulating lateral root growth. Plant Cell 22, 1104–1117. DOI: 10.1105/tpc.109.072553.
Marquardt, S., Raitskin, O., Wu, Z., Liu, F., Sun, Q. et al. (2014) Functional consequences of splicing of the antisense transcript COOLAIR on FLC transcription. Molecular Cell 54, 156–165. DOI: 10.1016/j.molcel.2014.03.026.
Martínez de Alba, A.E., Moreno, A.B., Mallory, A.C. and Christ, A. (2015) In plants, decapping prevents RDR6-dependent production of small interfering RNAs from endogenous mRNAs. Nucleic Acids Research 43, 2902–2913. DOI: 10.1093/nar/gkv119.
Matsui, A., Mizunashi, K., Tanaka, M., Kaminuma, E., Nguyen, A.H. et al. (2014) tasiRNA-ARF pathway moderates floral architecture in Arabidopsis plants subjected to drought stress. BioMed Research International 2014, 303451. DOI: 10.1155/2014/303451.
Matsui, A., Ishida, J., Morosawa, T., Mochizuki, Y., Kaminuma, E. et al. (2008) Arabidopsis transcriptome analysis under drought, cold, high-salinity and ABA treatment conditions using a tiling array. Plant & Cell Physiology 49, 1135–1149. DOI: 10.1093/pcp/pcn101.
Matsui, A., Iida, K., Tanaka, M., Yamaguchi, K., Mizuhashi, K. et al. (2017) Novel stress-inducible antisense RNAs of protein-coding loci are synthesized by RNA-dependent RNA polymerase. Plant Physiology 175(1), 457–472. DOI: 10.1104/pp.17.00787.
Matsui, A., Nakaminami, K. and Seki, M. (2019) Biological function of changes in RNA metabolism in plant adaptation to abiotic stress. Plant & Cell Physiology 60(9), 1897–1905. DOI: 10.1093/pcp/pcz068.
Matzke, M.A. and Mosher, R.A. (2014) RNA-directed DNA methylation: an epigenetic pathway of increasing complexity. Nature Reviews. Genetics 15(6), 394–408. DOI: 10.1038/nrg3683.
Mi, S., Cai, T., Hu, Y., Chen, Y., Hodges, E. et al. (2008) Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the 5′ terminal nucleotide. Cell 133(1), 116–127. DOI: 10.1016/j.cell.2008.02.034.
Mourrain, P., Béclin, C., Elmayan, T., Feuerbach, F., Godon, C. et al. (2000) Arabidopsis SGS2 and SGS3 genes are required for posttranscriptional gene silencing and natural virus resistance. Cell 101(5), 533–542. DOI: 10.1016/s0092-8674(00)80863-6.
Nakazawa, Y., Hiraguri, A., Moriyama, H. and Fukuhara, T. (2007) The dsRNA-binding protein DRB4 interacts with the Dicer-like protein DCL4 in vivo and functions in the trans-acting siRNA pathway. Plant Molecular Biology 63, 777–785. DOI: 10.1261/rna.2455411.
Navarro, B., Gisel, A., Rodio, M.E., Delgado, S., Flores, R. et al. (2012) Small RNAs containing the pathogenic determinant of a chloroplast-replicating viroid guide the degradation of a host mRNA as predicted by RNA silencing. The Plant Journal 70(6), 991–1003. DOI: 10.1111/j.1365-313X.2012.04940.x.
Niu, D., Lii, Y.E., Chellappan, P., Lei, L., Peralta, K. et al. (2016) miRNA863-3p sequentially targets negative immune regulator ARLPKs and positive regulator SERRATE upon bacterial infection. Nature Communications 7, 11324. DOI: 10.1038/ncomms11324.
Parent, J.-S., Jauvion, V., Bouché, N., Béclin, C., Hachet, M. et al. (2015a) Post-transcriptional gene silencing triggered by sense transgenes involves uncapped antisense RNA and differs from silencing intentionally triggered by antisense transgenes. Nucleic Acids Research 43(17), 8464–8475. DOI: 10.1093/nar/gkv753.
Parent, J.-S., Bouteiller, N., Elmayan, T. and Vaucheret, H. (2015b) Respective contributions of Arabidopsis DCL2 and DCL4 to RNA silencing. The Plant Journal 81(2), 223–232. DOI: 10.1111/tpj.12720.
Pien, S., Fleury, D., Mylne, J.S., Crevillen, P., Inzé, D. et al. (2008) ARABIDOPSIS TRITHORAX1 dynamically regulates FLOWERING LOCUS C activation via histone 3 lysine 4 trimethylation. The Plant Cell 20(3), 580–588. DOI: 10.1105/tpc.108.058172.
Ponting, C.P., Oliver, P.L. and Reik, W. (2009) Evolution and functions of long noncoding RNAs. Cell 136(4), 629–641. DOI: 10.1016/j.cell.2009.02.006.

Popova, O.V., Dinh, H.Q., Aufsatz, W. and Jonak, C. (2013) The RdDM pathway is required for basal heat tolerance in Arabidopsis. Molecular Plant 6(2), 396–410. DOI: 10.1093/mp/sst023.
Qi, Y., Denli, A.M. and Hannon, G.J. (2005) Biochemical specialization within Arabidopsis RNA silencing pathways. Molecular Cell 19(3), 421–428. DOI: 10.1016/j.molcel.2005.06.014.
Rogers, K. and Chen, X. (2013) Biogenesis, turnover, and mode of action of plant microRNAs. The Plant Cell 25(7), 2383–2399. DOI: 10.1105/tpc.113.113159.
Ronemus, M., Vaughn, M.W. and Martienssen, R.A. (2006) MicroRNA-targeted and small interfering RNA-mediated mRNA degradation is regulated by argonaute, dicer, and RNA-dependent RNA polymerase in Arabidopsis. Plant Cell 18, 1559–1574. DOI: 10.1105/tpc.106.042127.
Ron, M., Alandete Saez, M., Eshed Williams, L., Fletcher, J.C. and McCormick, S. (2010) Proper regulation of a sperm-specific cis-nat-siRNA is essential for double fertilization in Arabidopsis. Genes & Development 24, 1010–1021. DOI: 10.1101/gad.1882810.
Schramm, L. and Hernandez, N. (2002) Recruitment of RNA polymerase III to its target promoters. Genes and Development 16, 2593–2620. DOI: 10.1101/gad.1018902.
Schwab, R., Palatnik, J.F., Riester, M., Schommer, C., Schmid, M. et al. (2005) Specific effects of microRNAs on the plant transcriptome. Developmental Cell 8, 517–527. DOI: 10.1016/j.devcel.2005.01.018.
Shin, H., Shin, H.S., Chen, R. and Harrison, M. (2006) Loss of At4 function impacts phosphate distribution between the roots and the shoots during phosphate starvation. Plant Journal 45, 712–726. DOI: 10.1111/j.1365-313X.2005.02629.x.
Shivaprasad, P.V., Chen, H.M., Patel, K., Bond, D.M., Santos, B.A. et al. (2012) A microRNA superfamily regulates nucleotide binding site-leucine-rich repeats and other mRNAs. Plant Cell 24, 859–874. DOI: 10.1105/tpc.111.095380.
Slotkin, R.K., Vaughn, M., Borges, F., Tanurdzić, M., Becker, J.D. et al. (2009) Epigenetic reprogramming and small RNA silencing of transposable elements in pollen. Cell 136, 461–472. DOI: 10.1016/j.cell.2008.12.038.
Soma, F., Mogami, J., Yoshida, T., Abekura, M., Takahashi, F. et al. (2017) ABA-unresponsive SnRK2 protein kinases regulate mRNA decay under osmotic stress in plants. Nature Plants 3, 16204. DOI: 10.1038/nplants.2016.204.
Song, X., Li, P., Zhai, J., Zhou, M., Ma, L. et al. (2012) Roles of DCL4 and DCL3b in rice phased small RNA biogenesis. Plant Journal 69, 462–474. DOI: 10.1111/j.1365-313X.2011.04805.x.
Sun, X.Y., Wang, L., Ding, J.C., Wang, Y.R., Wang, J.S. et al. (2016) Integrative analysis of Arabidopsis thaliana transcriptomics reveals intuitive splicing mechanism for circular RNA. FEBS Letters 590, 3510–3516. DOI: 10.1002/1873-3468.12440.
Takahashi, H., Kozhuharova, A., Sharma, H., Hirose, M., Ohyama, T. et al. (2018) Identification of functional features of synthetic SINEUPs, antisense lncRNAs that specifically enhance protein translation. PLoS ONE 13(2), e0183229. DOI: 10.1371/journal.pone.0183229.
Vaucheret, H., Mallory, A.C. and Bartel, D.P. (2006) AGO1 homeostasis entails coexpression of MIR168 and AGO1 and preferential stabilization of MIR168 by AGO1. Molecular Cell 22(1), 129–136. DOI: 10.1016/j.molcel.2006.03.011.
Voinnet, O. (2009) Origin, biogenesis, and activity of plant microRNAs. Cell 136(4), 669–687. DOI: 10.1016/j.cell.2009.01.046.
Wang, H., Chung, P.J., Liu, J., Jang, I.-C., Kean, M.J. et al. (2014) Genome-wide identification of long noncoding natural antisense transcripts and their responses to light in Arabidopsis. Genome Research 24(3), 444–453. DOI: 10.1101/gr.165555.113.
Wang, P.L., Bao, Y., Yee, M.-C., Barrett, S.P., Hogan, G.J. et al. (2014) Circular RNA is expressed across the eukaryotic tree of life. PLoS ONE 9(6), e90859. DOI: 10.1371/journal.pone.0090859.
Wang, Y., Wang, X., Deng, W., Fan, X., Liu, T.-T. et al. (2014a) Genomic features and regulatory roles of intermediate-sized non-coding RNAs in Arabidopsis. Molecular Plant 7(3), 514–527. DOI: 10.1093/mp/sst177.
Wang, Y., Fan, X., Lin, F., He, G., Terzaghi, W. et al. (2014b) Arabidopsis noncoding RNA mediates control of photomorphogenesis by red light. Proceedings of the National Academy of Sciences 111(28), 10359–10364. DOI: 10.1073/pnas.1409457111.
Wasaki, J., Yonetani, R., Shinano, T., Kai, M. and Osaki, M. (2003) Expression of the OsPI1 gene, cloned from rice roots using cDNA microarray, rapidly responds to phosphorus status. New Phytologist 158(2), 239–248. DOI: 10.1046/j.1469-8137.2003.00748.x.
Wassenegger, M., Heimes, S., Riedel, L. and Sänger, H.L. (1994) RNA-directed de novo methylation of genomic sequences in plants. Cell 76(3), 567–576. DOI: 10.1016/0092-8674(94)90119-8.

Wei, L., Gu, L., Song, X., Cui, X., Lu, Z. et al. (2014) Dicer-like 3 produces transposable element-associated 24-nt siRNAs that control agricultural traits in rice. Proceedings of the National Academy of Sciences 111(10), 3877–3882. DOI: 10.1073/pnas.1318131111.
Wierzbicki, A.T., Haag, J.R. and Pikaard, C.S. (2008) Noncoding transcription by RNA polymerase Pol IVb/Pol V mediates transcriptional silencing of overlapping and adjacent genes. Cell 135, 635–648. DOI: 10.1016/j.cell.2008.09.035.
Wu, H.J., Wang, Z.M., Wang, M. and Wang, X.J. (2013) Wide-spread long non-coding RNAs (lncRNAs) as endogenous target mimics (eTMs) for microRNAs in plants. Plant Physiology 161, 1875–1884. DOI: 10.1104/pp.113.215962.
Wu, J., Okada, T., Fukushima, T., Tsudzuki, T., Sugiura, M. et al. (2012) A novel hypoxic stress-responsive long non-coding RNA transcribed by RNA polymerase III in Arabidopsis. RNA Biology 9, 302–313. DOI: 10.4161/rna.19101.
Wu, L., Zhou, H.Y., Zhang, Q.Q., Zhang, J.G., Ni, F.R. et al. (2010) DNA methylation mediated by a microRNA pathway. Molecular Cell 38, 465–475. DOI: 10.1016/j.molcel.2010.03.008.
Xia, R., Meyers, B.C., Liu, Z., Beers, E.P. and Ye, S. (2013) MicroRNA superfamilies descended from miR390 and their roles in secondary small interfering RNA biogenesis in eudicots. Plant Cell 25, 1555–1572. DOI: 10.1105/tpc.113.110957.
Xie, Z.X., Kasschau, K.D. and Carrington, J.C. (2003) Negative feedback regulation of Dicer-like1 in Arabidopsis by microRNA-guided mRNA degradation. Current Biology 13, 784–789. DOI: 10.1016/S0960-9822(03)00281-1.
Xu, J. and Chua, N.H. (2012) Dehydration stress activates Arabidopsis MPK6 to signal DCP1 phosphorylation. EMBO Journal 31, 1975–1984. DOI: 10.1038/emboj.2012.56.
Yang, L., Mu, X., Liu, C., Cai, J., Shi, K. et al. (2015) Overexpression of potato miR482e enhanced plant sensitivity to Verticillium dahliae infection. Journal of Integrative Plant Biology 57, 1078–1088. DOI: 10.1111/jipb.12348.
Yamada, K., Lim, J., Dale, J.M., Chen, H., Shinn, P. et al. (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 31, 842–846. DOI: 10.1126/science.1088305.
Yang, W.C., Katinakis, P., Hendriks, P., Smolders, A., de Vries, F. et al. (1993) Characterization of gmenod40, a gene showing novel patterns of cell-specific expression during soybean nodule development. Plant Journal 3, 573–585. DOI: 10.1046/j.1365-313x.1993.03040573.x.
Ye, C.Y., Chen, L., Liu, C., Zhu, Q.H. and Fan, L. (2015) Widespread noncoding circular RNAs in plants. The New Phytologist 208(1), 88–95. DOI: 10.1111/nph.13585.
Ye, C.-Y., Zhang, X., Chu, Q., Liu, C., Yu, Y. et al. (2017) Full-length sequence assembly reveals circular RNAs with diverse non-GT/AG splicing signals in rice. RNA Biology 14(8), 1055–1063. DOI: 10.1080/15476286.2016.1245268.
Yuan, C., Wang, J., Harrison, A.P., Meng, X., Chen, D. et al. (2015) Genome-wide view of natural antisense transcripts in Arabidopsis thaliana. DNA Research 22, 233–243. DOI: 10.1093/dnares/dsv008.
Yu, A., Saudemont, B., Bouteiller, N., Elvira-Matelot, E., Lepère, G. et al. (2015) Second-site mutagenesis of a hypomorphic argonaute1 allele identifies SUPERKILLER3 as an endogenous suppressor of transgene posttranscriptional gene silencing. Plant Physiology 169(2), 1266–1274. DOI: 10.1104/pp.15.00585.
Yu, B., Yang, Z., Li, J., Minakhina, S., Yang, M. et al. (2005) Methylation as a crucial step in plant microRNA biogenesis. Science 307(5711), 932–935. DOI: 10.1126/science.1107130.
Yu, Y., Zhang, Y., Chen, X. and Chen, Y. (2019) Plant noncoding RNAs: hidden players in development and stress responses. Annual Review of Cell and Developmental Biology 35, 407–431. DOI: 10.1146/annurev-cellbio-100818-125218.
Zhang, H., Xia, R., Meyers, B.C. and Walbot, V. (2015) Evolution, functions, and mysteries of plant ARGONAUTE proteins. Current Opinion in Plant Biology 27, 84–90. DOI: 10.1016/j.pbi.2015.06.011.
Zhang, P., Meng, X., Chen, H., Liu, Y., Xue, J. et al. (2017) PlantCircNet: a database for plant circRNA-miRNA-mRNA regulatory networks. Database 2017, bax089. DOI: 10.1093/database/bax089.
Zhang, H., Hu, W., Hao, J., Lv, S., Wang, C. et al. (2016) Genome-wide identification and functional prediction of novel and fungi-responsive lincRNAs in Triticum aestivum. BMC Genomics 17, 238. DOI: 10.1186/s12864-016-2570-0.
Zhao, T., Wang, L.Y., Li, S., Xu, M., Guan, X.Y. et al. (2017) Characterization of conserved circular RNA in polyploid Gossypium species and their ancestors. FEBS Letters 591, 3660–3669. DOI: 10.1002/1873-3468.12868.


Zheng, B., Wang, Z., Li, S., Yu, B., Liu, J.-Y. et al. (2009) Intergenic transcription by RNA polymerase II coordinates Pol IV and Pol V in siRNA-directed transcriptional gene silencing in Arabidopsis. Genes & Development 23(24), 2850–2860. DOI: 10.1101/gad.1868009. Zheng, X., Zhu, J., Kapoor, A. and Zhu, J.K. (2007) Role of Arabidopsis AGO6 in siRNA accumulation, DNA methylation and transcriptional gene silencing. The EMBO Journal 26(6), 1691–1701. DOI: 10.1038/sj.emboj.7601603. Zhong, S.-H., Liu, J.-Z., Jin, H., Lin, L., Li, Q. et al. (2013) Warm temperatures induce transgenerational epigenetic release of RNA silencing by inhibiting siRNA biogenesis in Arabidopsis. Proceedings of the National Academy of Sciences 110(22), 9171–9176. DOI: 10.1073/pnas.1219655110.

7

Plant Epigenomics

Taiko Kim To1 and Jong-Myong Kim2,3*
1Department of Biological Sciences, The University of Tokyo, Tokyo, Japan; 2Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan; 3Ac-Planta Inc., Tokyo, Japan

Abstract Epigenetic regulation is highly conserved in eukaryotes. In general, events mediated by the genome – such as the regulation of gene activity and establishment and arrangement of chromatin structure – are coordinated by epigenetic and chromatin-remodeling factors. The concept of epigenetics was coined by Waddington (1957). In the context of embryogenesis, cells are affected by particular circumstances during development. This idea involves the possibility of “epi” factors that determine cell fate (“epi” is a Greek prefix meaning “external action or effect”). Currently, the definition of epigenetics remains interpretive in various contexts; however, the concept is based on “heritable phenotype changes that do not involve alterations in DNA sequence” and involves “nonheritable chromatin-level regulation that does not involve alterations in DNA sequence.” In this chapter, the basic concepts of genome-wide regulation of histone modifications and DNA methylation – known as epigenetic and chromatin regulatory factors – are the primary focus, with an emphasis on understanding the fundamental mechanisms of epigenetic regulation.

7.1 Significance of Histone Modifications

The importance of histone modifications as an “epi” factor was revealed by studies on the regulation of gene activity in budding yeast (Thompson and Grunstein, 1993; Kuo et al., 1996). Histone modifications function directly in gene silencing and activation (Grunstein and Gasser, 2013; Allis et al., 2015). The minimal unit of chromatin, the “nucleosome”, consists of genomic DNA and a histone octamer composed of two each of the core histone H2A, H2B, H3, and H4 proteins (Luger et al., 1997). The targets of modification are mainly the N-terminal regions of each histone protein; modifications such as acetylation, deacetylation, methylation, demethylation, phosphorylation, sumoylation, ubiquitination, and biotinylation of specific amino acid residues in the N-terminal regions of histones are carried out by specific histone modification enzymes (Xu et al., 2014; Shen et al., 2016; Zhao et al., 2020). Histone modifications exert significant functions in regulating life processes, and the primary types of histone modifications and their functions are widely conserved in eukaryotes. When histone modifications are altered, the accessibility of regulatory proteins such as transcription factors, RNA polymerase, and chromatin-remodeling factors changes; moreover, these changes are associated with gene activity and the higher-order structure of chromosomes. In other words, histone modification by enzymatic reactions links gene activity to genomic structure through compaction and relaxation of the three-dimensional structure of chromatin (summarized in Fig. 7.1).

Fig. 7.1. Structural change of chromatin mediated by histone modifications.

*Corresponding author: kim@ac-planta.com © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0007

7.1.1 Histone proteins in plants

In contrast to the budding yeast (Saccharomyces cerevisiae) genome, which contains two copies of each gene encoding the histone H2A, H2B, H3, and H4 proteins (Grunstein and Gasser, 2013), plant genomes generally encode multiple copies of histone genes (Talbert et al., 2012). The Arabidopsis thaliana genome encodes canonical and variant histone genes consisting of four H2A, nine histone H2A variant (H2A.X, H2A.W, and H2A.Z), one predicted histone H2A, two H2B, nine predicted H2B, eight histone H3 (H3.1 and H3.3), two H3 variant (MGH3/HTR10 and CENH3/HTR12), three predicted histone H3, and eight H4 genes (Talbert et al., 2012). In addition, three linker histone H1 (H1.1, H1.2, and H1.3) genes are also encoded. These histone proteins are classified by alignment of their amino acid sequences and functional motifs. The histone variants, which differ by insertions, deletions, and substitutions at specific residues, divide the labor of regulating gene activity and chromatin formation through protein–protein interactions. Interestingly, amino acid sequences are highly conserved among the eight histone H4 proteins, which are considered to be functionally exchangeable in A. thaliana (Talbert et al., 2012). The site specificities and order of amino acid residues that undergo chemical modification within the N-terminus of histone proteins depend on canonical and variant functions that are conserved in eukaryotes.
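The gene counts listed above can be tallied per histone family. The sketch below is only an illustration: the dictionary layout and the names `HISTONE_GENES`, `LINKER_H1_GENES`, and `family_totals` are assumptions introduced here, while the counts themselves are those given in the text (Talbert et al., 2012).

```python
# Gene counts per class, copied from the text (Talbert et al., 2012).
# The grouping into a nested dict is a hypothetical layout for illustration.
HISTONE_GENES = {
    "H2A": {"canonical": 4, "variant (H2A.X, H2A.W, H2A.Z)": 9, "predicted": 1},
    "H2B": {"canonical": 2, "predicted": 9},
    "H3": {"H3.1 and H3.3": 8, "variant (MGH3/HTR10, CENH3/HTR12)": 2, "predicted": 3},
    "H4": {"canonical": 8},
}
LINKER_H1_GENES = 3  # H1.1, H1.2, and H1.3


def family_totals(genes):
    """Sum the gene counts of all classes within each histone family."""
    return {family: sum(classes.values()) for family, classes in genes.items()}


totals = family_totals(HISTONE_GENES)
core_total = sum(totals.values())

for family, count in totals.items():
    print(f"{family}: {count} genes")
print(f"core histone genes: {core_total}; linker H1 genes: {LINKER_H1_GENES}")
```

With these counts, the four core families sum to 46 genes, and the three linker histone H1 genes are counted separately.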

7.1.2 Functions of conservative modification sites in canonical histone proteins

The chromatin regulation of genes that control the response to environmental changes and development is a fundamentally reversible mechanism. The enrichment of histone H3 lysine 9 acetylation (H3K9ac) and multiple sites of lysine residue acetylation in histone H4 (H4Ac: acetylation of H4K5, K8, K12, and/or K16) correlates with the transcriptional activation of genes (Tian et al., 2005; Charron et al., 2009). The enrichment of histone H3 lysine 4 tri-methylation (H3K4me3) also correlates with gene activation (Foroozani et al., 2021). Generally, these histone modifications, known as active marks of active genes and conserved from yeast to humans, can coexist in the same gene-coding regions. In contrast, histone H3 lysine 27 tri-methylation (H3K27me3) is known as a repressive mark conserved from yeast to humans in inactive gene regions. When H3K27me3 was highly enriched in gene-coding regions, gene expression was repressed. Conversely, under activating conditions, H3K27me3 was removed from the gene-coding regions within euchromatin (Charron et al., 2009). H3K27me3 is enriched at FLOWERING LOCUS C (FLC) upon exposure to cold, conferring stable epigenetic silencing for long-term memory (Finnegan et al., 2005; Yang et al., 2017). Other repressive marks for gene inactivation (silencing) include histone H3 lysine 9 di-methylation and tri-methylation (H3K9me2 and H3K9me3). With respect to heterochromatin silencing in plants, H3K9me2 specifically functions to construct heterochromatin (Bernatavichute et al., 2008). Chromatin immunoprecipitation (ChIP) is used to determine the alteration of histone modifications; methods that combine next-generation sequencing (NGS) with ChIP (i.e., ChIP-seq) have been established to analyze genome-wide distributions and alterations of histone modifications (Saleh et al., 2008; Kim et al., 2014; You et al., 2017; Chen et al., 2018).
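ChIP-seq signal for a histone mark is commonly summarized as the enrichment of ChIP reads over an input control in a given region. The following is a minimal sketch of that idea, not the procedure of the cited studies: the function name `log2_enrichment`, the counts-per-million scaling, the pseudocount, and the example read counts are all assumptions made for illustration.

```python
import math


def log2_enrichment(chip_reads, chip_total, input_reads, input_total, pseudocount=1.0):
    """Log2 ratio of library-size-normalized ChIP signal over input in one region.

    Read counts are scaled to counts per million (CPM) so that libraries of
    different depth are comparable; the pseudocount avoids division by zero.
    """
    chip_cpm = chip_reads / chip_total * 1e6
    input_cpm = input_reads / input_total * 1e6
    return math.log2((chip_cpm + pseudocount) / (input_cpm + pseudocount))


# Hypothetical region: 200 ChIP reads vs 50 input reads, 1 million reads per library.
score = log2_enrichment(200, 1_000_000, 50, 1_000_000, pseudocount=0.0)
print(f"log2 enrichment: {score:.2f}")
```

A positive value indicates that the mark is enriched relative to input in that region; comparing such values between conditions is one simple way to express an "alteration" of a histone modification.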

7.1.3 The genome-wide distribution and responsiveness of major histone modifications

Enrichment of H3K27me3 is restricted to the transcribed regions of single genes and is not associated with regions of low nucleosome density (Zhang et al., 2007). In plant development, H3K27me3 is known to be a major determinant of tissue-specific expression patterns. A decrease in H3K27me3 is also linked to gene activation of transcription factor families with tissue-specific development in A. thaliana (Lafos et al., 2011). H3K27me3, a repressive mark of gene activity, corresponds well with gene regulation and is dynamically deposited and removed during plant development on a genome-wide scale (Gan et al., 2015). In a mutant of RELATIVE OF EARLY FLOWERING 6 (REF6), a demethylase for H3K27me3 and H3K27me2, H3K27me3 levels were increased and the mRNA expression of hundreds of genes that regulate developmental patterning and responses to various stimuli was altered (Lu et al., 2011). Programmed gene regulation requires dynamic changes in cell characteristics, as in germ cell development; moreover, H3K27me3 functions as a major factor in cell fate decisions. The genome-wide loss of H3K27me3 leads to the transcription of genes essential for spermatogenesis to set the responsivity of the chromatin state in sperm for the next generation (Borg et al., 2020). In processes such as development, which involve relatively slow changes in gene expression, changes in H3K27me3 are thought to bring about rational changes in gene inactivation. Certainly, the enrichment of these histone marks fluctuates in response to gene expression. H3K4me3 is enriched at the transcriptional start sites (TSSs) of highly expressed genes without tissue specificity on a genome-wide scale (Zhang et al., 2009; van Dijk et al., 2010). Similarly, H3K9ac behaves in a manner that corresponds with gene activation. In dehydration-responsive gene regions, enrichment of H3K9ac occurs with transcriptional activation, in parallel with the accumulation of RNA polymerase II in gene-coding regions upon dehydration stimuli (Kim et al., 2008). Furthermore, the decrease in H3K9ac enrichment correlates with RNA polymerase II withdrawal from dehydration-responsive gene regions upon inactivation treatment (rehydration) (Kim et al., 2012). In contrast, H3K4me3 behaves differently from H3K9ac in dehydration-responsive gene regions. H3K4me3 is highly enriched throughout the entire coding region of the dehydration-responsive genes (van Dijk et al., 2010). The enrichment of H3K4me3 occurs upon gene activation, as is the case for H3K9ac; however, the reversal of H3K4me3 enrichment does not correlate with RNA polymerase II withdrawal.
H3K4me3 enrichment remained after transcriptional inactivation by RNA polymerase II withdrawal and decreased with time (Kim et al., 2012). When gene activation occurs, both H3K9ac and H3K4me3 are rapidly and highly enriched in the induced gene regions, but there is a timing difference in the removal of these active marks in A. thaliana. The high H3K27me3 levels present during transcriptionally inactive states do not interfere with the transition to active transcription and H3K4me3 accumulation in genes highly responsive to dehydration (Liu et al., 2014). Interestingly, this feature of H3K4me3 accumulation during rapid change may transmit information regarding stress memory (Ding et al., 2012). In the case of H4Ac, also known as an active mark whose enrichment corresponds to gene activation, it should be noted that accumulation of the mark does not always occur at the same time as transcriptional activation by the binding of RNA polymerase. H4Ac enrichment alone is insufficient to initiate transcriptional activation. When H4Ac was introduced by treating plants with acetic acid, the target genes of the dehydration stress response were not induced by the enrichment of H4Ac in gene-coding regions before dehydration stimuli (Kim et al., 2017). This result shows that H4Ac alone is insufficient for transcriptional initiation and that H4Ac can behave independently of the binding of transcription factors and the transcriptional machinery. In this experiment, because the genes with excessively accumulated H4Ac showed high expression levels in response to dehydration treatment, H4Ac functions directly to promote transcriptional efficiency upon activation. Thus, the strong and sudden changes in histone modifications seem to be written and erased with unusual dynamics. This suggests that the regulation of histone modifications depends on particular bioprocesses and the intensity of stimuli.

7.1.4 Histones and histone modifications in the construction of genomes and chromosome structures

In addition to conserved histone proteins and their modifications, various histone modifications and histone variants are involved in the precise regulation of gene activity. Here, we focus on histone variants that function in genome-wide regulation and the construction of chromatin structure. H3K9me2 enrichment correlates with the localization of repeat elements, transposon silencing, and heterochromatin structures (Bernatavichute et al., 2008). It is remarkably enriched in pericentromeric regions containing abundant transposons on each chromosome in A. thaliana. Higher-order chromosomal structure near the centromere is a common feature conserved in eukaryotic genomes (Drinnenberg et al., 2016). CENH3, a centromere-specific histone H3 variant, is the epigenetic signature that specifies centromere regions in eukaryotes (McKinley and Cheeseman, 2016; Keçeli et al., 2020). CENH3-containing nucleosomes exhibit a strong preference for a unique subset of centromeric repeats in A. thaliana (Maheshwari et al., 2017). H3.1 and H3.3 are two other known histone H3 variants, and their amino acid sequences are highly similar. H3.1 is enriched in heterochromatic regions, whereas H3.3 is not abundant in these regions in A. thaliana (Stroud et al., 2012; Wollmann et al., 2012). K27me1 of H3.1 functions as a heterochromatic mark that protects H3.3-enriched genes against heterochromatization during DNA replication (Jacob et al., 2014). In contrast, H3.3 enrichment containing K27me3 is dynamically linked to transcription and is involved in resetting covalent histone marks during plant development on a genome-wide scale (Wollmann et al., 2012).

7.2 DNA Methylation

DNA methylation is critical for the epigenetic regulation of gene expression, development, and genome stability in plants (Zhang et al., 2018). In mammals, the epigenome is reprogrammed and reset from generation to generation (Xu and Xie, 2018). In flowering plants, however, such systematic reprogramming of DNA methylation is limited to certain cells such as endosperm and vegetative nuclei in pollen, while in many cases, epigenetic variation (i.e., epialleles) is stably inherited over multiple generations (Gehring, 2019). Naturally occurring changes in DNA methylation can cause heritable changes in phenotype independent of alterations in DNA sequence. As such, DNA methylation in plants is of great interest not only in the study of epigenetics, but also for plant breeding.


Fig. 7.2. Distribution of major epi-information.

7.2.1 DNA methylation in plants

As plant genomes contain abundant transposable elements (TEs) and repetitive elements, silencing of these sequences is critical for plant genome stability (Huang et al., 2012). Cytosine methylation is enriched in TEs and silent genes and plays an important role in their suppression. At the genomic level, cytosine methylation is enriched in pericentromeric regions where TEs accumulate (Fig. 7.2). In plants, cytosine methylation can occur in three different contexts: CG, CHG, and CHH sites (where H can be A, T, or C; hereafter mCG, mCHG, and mCHH, respectively), all of which are commonly found in TEs (Ito and Kakutani, 2014). This contrasts with mammals, in which DNA methylation is predominantly limited to CG sites (Bird, 2002). Moreover, 5-hydroxymethylation, which is observed in mammals, has not yet been identified in plants (Erdmann et al., 2014). DNA methylation is also found within transcribed genes in flowering plants, but is absent at their transcription start and termination sites (Tran et al., 2005; Zemach et al., 2010; To et al., 2015). This so-called gene body methylation is limited to CG sites in Arabidopsis. Non-CG methylation is confined to TEs and pseudogenes (Tran et al., 2005). Thus, although both genes and TEs harbor DNA methylation, they exhibit different methylation patterns. Mutations that alter methylation patterns often cause serious phenotypic abnormalities in Arabidopsis (Finnegan et al., 1996; Miura et al., 2001; Saze et al., 2008) and other plant species (Tan et al., 2016; Ikeda et al., 2018), indicating the importance of accurately maintaining this fundamental epigenome map.

7.2.2 DNA methylation mechanism in A. thaliana

The mechanism of DNA methylation is classified into maintenance and de novo methylation. Maintenance methylation preserves DNA methylation at symmetrical sequences to prevent the dilution of DNA methylation after cell division. mCG in plants is maintained by an evolutionarily conserved DNA methyltransferase named METHYLTRANSFERASE 1 (MET1), which is an ortholog of DNMT1 in mammals (Finnegan and Dennis, 1993; Finnegan et al., 1996). mCHG is maintained by a different pathway that involves a plant-specific DNA methyltransferase known as CHROMOMETHYLASE 3 (CMT3) (Bartee et al., 2001; Lindroth et al., 2001). CMT3 binds to histone H3 dimethylated at the lysine 9 residue (H3K9me2) through its conserved CHROMO domain (Jackson et al., 2002; Du et al., 2012). H3K9me2 is a mark of constitutive heterochromatin in plants that is catalyzed by the Su(var)3-9 homolog (SUVH) histone methyltransferases SUVH4/KYP, SUVH5, and SUVH6, the recruitment of which is dependent on mCHG (Jackson et al., 2002; Johnson et al., 2002; Malagnac et al., 2002; Du et al., 2012). Thus, a self-reinforcing feedback loop maintains both mCHG and H3K9me2 in heterochromatic regions. Another member of the CMT family, CMT2, is involved in methylation at CHH sites in Arabidopsis (Zemach et al., 2013; Stroud et al., 2014). Through a mechanism similar to that of CMT3, CMT2 binds to H3K9me2 and maintains mCHH at heterochromatin. Thus, both DNA methylation and H3K9me2 can be stably inherited during cell division.
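The three sequence contexts that these pathways act on (CG, CHG, and CHH, where H is A, T, or C) can be told apart by inspecting the two bases downstream of each cytosine. The sketch below illustrates this on a single strand only; the function names `cytosine_context` and `classify_cytosines` and the example sequence are hypothetical and introduced here for illustration.

```python
H_BASES = set("ACT")  # H stands for any base except G


def cytosine_context(seq, i):
    """Return 'CG', 'CHG', or 'CHH' for the cytosine at index i of seq.

    Returns None if seq[i] is not C, if the context runs past the end of
    the sequence, or if it contains a base other than A, C, G, or T.
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    downstream = seq[i + 1 : i + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or downstream[0] not in H_BASES:
        return None
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in H_BASES:
        return "CHH"
    return None


def classify_cytosines(seq):
    """Map the index of each classifiable cytosine on this strand to its context."""
    contexts = {}
    for i in range(len(seq)):
        ctx = cytosine_context(seq, i)
        if ctx is not None:
            contexts[i] = ctx
    return contexts


print(classify_cytosines("ACGTCAGTCTAC"))
```

To cover both strands, the same function could be applied to the reverse complement of the sequence; that step is omitted here to keep the example short.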
Unlike in animals, cytosine methylation in plants occurs mainly in three sequence contexts, CG, CHG, and CHH (where H represents A, T, or C), and de novo establishment of DNA methylation depends on the RNA-directed DNA methylation (RdDM) pathway. In contrast to maintenance methylation, which is controlled by distinct enzymes in each context, de novo DNA methylation in plants is considered to be mediated by this single mechanism (Matzke and Mosher, 2014). In the RdDM machinery, small interfering RNAs (siRNAs) are generated by sequential enzymatic reactions with Pol IV, RDR2, and DCL3. The resulting siRNAs, 24 nt in length, guide the de novo DNA methyltransferase DOMAINS REARRANGED METHYLTRANSFERASE 2 (DRM2) to its target sequences to induce de novo methylation in all three contexts. Because CHH is an asymmetrical sequence, RdDM is also required to maintain high levels of mCHH, especially at the edges of long TEs, euchromatic short TEs (Stroud et al., 2013; Zemach et al., 2013), and noncoding TEs (To et al., 2020). More recently, de novo DNA methylation machinery other than RdDM has been suggested because efficient de novo DNA methylation at non-CG sites is observed at the coding regions of TEs in the absence of siRNAs (To et al., 2020). In addition, de novo DNA methylation has also been observed in moss (Yaari et al., 2019), suggesting that CMT proteins have de novo DNA methylation activity in plants. Beyond the methyltransferases themselves, DNA methylation is regulated at multiple layers. Mutations in the chromatin remodeler gene DECREASE IN DNA METHYLATION 1 (DDM1) lead to a significant loss of DNA methylation in all contexts in heterochromatic regions, while they scarcely affect transcribed genes (Vongs et al., 1993; Zemach et al., 2013). The loss is partially rescued by additional mutations in histone H1 genes, suggesting a genetic interaction between these factors (Zemach et al., 2013). The global DNA methylation pattern is associated with genomic localization of histone variants. The histone variant H2A.Z is evolutionarily conserved from yeast to mammals and plants (Scacchetti and Becker, 2020). Antagonistic localization between H2A.Z and mCG is observed not only in genes but also in TEs (Zilberman et al., 2008; Zemach et al., 2010; To et al., 2020). In heterochromatic regions, mCHG, mCHH, and H3K9me2 co-localize with another H2A variant, H2A.W, which is specific to the plant kingdom (Yelagandula et al., 2014; Kawashima et al., 2015).
The removal of epigenetic marks is also important for epigenomic dynamics. DNA demethylation in plants proceeds through a mechanism different from that in mammals, where ten-eleven translocation (TET) proteins convert 5-methylcytosines into 5-hydroxymethylcytosines (Wu and Zhang, 2011). In Arabidopsis, active demethylation is performed by base excision repair, which is dependent on DME, ROS1, and the homologs DML2 and DML3 (Zhang et al., 2018). DME is specifically expressed in the endosperm and is involved in genome imprinting (Choi et al., 2002). ROS1 is required to prevent DNA hypermethylation at many genomic loci in Arabidopsis, possibly preventing the spread of DNA methylation from TEs to nearby genes (Gong et al., 2002). Another example of a DNA methylation increase within genes is caused by mutation of INCREASE IN BONSAI METHYLATION 1 (IBM1), which encodes a Jumonji domain-containing histone demethylase (Saze et al., 2008). In the ibm1 mutants, the accumulation of H3K9me2 and non-CG methylation was observed in thousands of active gene bodies, although silent TEs were unaffected. A complex interplay is found between the writers and erasers in plants; the expression levels of the ROS1 and IBM1 genes are both attenuated in mutants with decreased DNA methylation (Zhang et al., 2018), as these genes have local heterochromatin within or near them.

7.2.3 Genome-wide DNA methylation patterns in plant genomes

Plant genomes vary among species, especially in terms of size and ploidy (Bennetzen and Wang, 2014). Overall, genome size reflects the content of repetitive elements; repetitive elements make up approximately 10% of the A. thaliana genome (125 Mb), whereas they make up 85% of the Zea mays genome (2.3 Gb) (Huang et al., 2012). The pattern and degree of DNA methylation also vary among species (Zemach et al., 2010; Niederhuth et al., 2016), as total DNA methylation levels increase with the content of highly methylated repeat elements. Furthermore, the difference in methylation levels is also due to the diversity of DNA methylation mechanisms. The CMT2 gene is not found in the Z. mays genome (Zemach et al., 2013); thus, RdDM is responsible for most mCHH found in relatively short regions. A mutation in the largest subunit of Pol IV (NRPD1) is found in Brassica rapa, which leads to low levels of Pol IV-dependent siRNAs (Huang et al., 2013). CMT3 is lost in some Brassicaceae species such as Eutrema salsugineum and Conringia planisiliqua (Bewick et al., 2016), leading to low mCHG levels in the genome. Global DNA methylation patterns can vary even within species (Eichten et al., 2013; Schmitz et al., 2013; Kawakatsu et al., 2016). The extent of spontaneous variation in DNA methylation has been evaluated in Arabidopsis (Schmitz et al., 2011). Genome-wide association study (GWAS) and quantitative trait loci (QTL) analyses have identified candidate loci that account for the natural epigenetic variation in A. thaliana (Dubin et al., 2015; Kawakatsu et al., 2016). Methylation polymorphisms can affect crop productivity. The MANTLED locus in the oil palm Elaeis guineensis is an outstanding example of a bridge between the epigenome and agriculture (Ong-Abdullah et al., 2015). GWAS analysis identified that floral sterility is associated with DNA methylation loss at a TE known as KARMA. Analyzing the epigenetic state of KARMA before planting will improve the efficiency of oil production from oil palms.
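The genome-size comparison at the start of this section can be made concrete with a short calculation. The sketch below simply multiplies the genome sizes and repeat fractions quoted in the text; the names `GENOMES` and `repeat_content_mb` are hypothetical, and the resulting values are only as approximate as the quoted percentages.

```python
# Genome size (Mb) and approximate repeat fraction, as quoted in the text
# (Huang et al., 2012). 2.3 Gb is written as 2300 Mb.
GENOMES = {
    "Arabidopsis thaliana": (125, 0.10),
    "Zea mays": (2300, 0.85),
}


def repeat_content_mb(size_mb, repeat_fraction):
    """Approximate amount of repetitive sequence in megabases."""
    return size_mb * repeat_fraction


for species, (size_mb, fraction) in GENOMES.items():
    repeats = repeat_content_mb(size_mb, fraction)
    print(f"{species}: ~{repeats:.1f} Mb repeats, ~{size_mb - repeats:.1f} Mb non-repeat")
```

Under these figures, most of the size difference between the two genomes is accounted for by repeats (roughly 12.5 Mb versus 1955 Mb), which is consistent with total methylation levels rising with repeat content.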

7.2.4 Methods to investigate global DNA methylation patterns

Several techniques have been developed to investigate global DNA methylation (Yong et al., 2016). Whole-genome bisulfite sequencing (WGBS) is the first choice as it can assess genome-wide methylation levels at single-base resolution. When DNA is treated with sodium bisulfite, unmethylated Cs are converted to Us, which are read as Ts by NGS, whereas methylated Cs remain as Cs. Recently, an alternative method called enzymatic methyl-seq (EM-seq), which requires a smaller amount of DNA, has been reported (Feng et al., 2020). Instead of the bisulfite reaction, this system uses enzymes: a mammalian protein involved in demethylation (TET2) first oxidizes methylated Cs to protect them, and a cytosine deaminase (APOBEC) then converts unmethylated Cs into Us. Multiple affinity-based techniques and cost-effective methodologies have also been developed to capture the entire methylation state using methylcytosine antibodies, methyl-CpG-binding proteins, and methylation-sensitive restriction enzymes (Yong et al., 2016). These methods provide less information, especially on methylation contexts, but may be a good choice for species with large genome sizes. The localization of major histone modifications and DNA methylation on specific genome structures is summarized in Fig. 7.2. From this distribution, the structural and functional significance of epigenetic marks can be readily understood. It is considered that epigenetic regulation is always linked to the regulation of both large-scale structures and the expression of individual genes in eukaryotic genomes.
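The logic of bisulfite-based methylation calling described above can be sketched in a few lines. This is a simplified illustration, not a WGBS pipeline: it ignores read alignment, strand handling, and incomplete conversion, and the function names `bisulfite_convert` and `methylation_level` as well as the example values are assumptions introduced here.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate how a strand reads after bisulfite treatment and sequencing.

    Unmethylated C is converted to U and read as T; methylated C stays C.
    methylated_positions holds the 0-based indices of methylated cytosines.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )


def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine (C / (C + T)).

    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    return c_reads / total if total else None


# Hypothetical example: only the cytosine at index 1 is methylated.
print(bisulfite_convert("ACGTCCGA", {1}))
# Hypothetical site covered by 8 reads showing C and 2 reads showing T.
print(methylation_level(8, 2))
```

Because every cytosine is scored individually, the same per-site ratio can be aggregated separately for CG, CHG, and CHH sites, which is what gives WGBS its context-level resolution.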

References

Allis, C.D., Jenuwein, T. and Reinberg, D. (2015) Overview and concepts. In: Allis, C.D., Caparros, M.-L., Jenuwein, T. and Reinberg, D. (eds) Epigenetics, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Bartee, L., Malagnac, F. and Bender, J. (2001) Arabidopsis cmt3 chromomethylase mutations block non-CG methylation and silencing of an endogenous gene. Genes & Development 15(14), 1753–1758. DOI: 10.1101/gad.905701. Bennetzen, J.L. and Wang, H. (2014) The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annual Review of Plant Biology 65, 505–530. DOI: 10.1146/annurev-arplant-050213-035811. Bernatavichute, Y.V., Zhang, X., Cokus, S., Pellegrini, M. and Jacobsen, S.E. (2008) Genome-wide association of histone H3 lysine nine methylation with CHG DNA methylation in Arabidopsis thaliana. PLoS ONE 3(9), e3156. DOI: 10.1371/journal.pone.0003156. Bewick, A.J., Ji, L., Niederhuth, C.E., Willing, E.-M., Hofmeister, B.T. et al. (2016) On the origin and evolutionary consequences of gene body DNA methylation. Proceedings of the National Academy of Sciences 113(32), 9111–9116. DOI: 10.1073/pnas.1604666113. Bird, A.P. (2002) DNA methylation patterns and epigenetic memory. Genes & Development 16(1), 6–21. DOI: 10.1101/gad.947102.


Borg, M., Jacob, Y., Susaki, D., LeBlanc, C., Buendía, D. et al. (2020) Targeted reprogramming of H3K27me3 resets epigenetic memory in plant paternal chromatin. Nature Cell Biology 22(6), 621–629. DOI: 10.1038/s41556-020-0515-y. Charron, J.B.F., He, H., Elling, A.A. and Deng, X.W. (2009) Dynamic landscapes of four histone modifications during deetiolation in Arabidopsis. The Plant Cell 21(12), 3732–3748. DOI: 10.1105/tpc.109.066845. Chen, X., Bhadauria, V. and Ma, B. (2018) ChIP-Seq: a powerful tool for studying protein-DNA interactions in plants. Current Issues in Molecular Biology 27, 171–180. DOI: 10.21775/cimb.027.171. Choi, Y., Gehring, M., Johnson, L., Hannon, M., Harada, J.J. et al. (2002) DEMETER, a DNA glycosylase domain protein, is required for endosperm gene imprinting and seed viability in Arabidopsis. Cell 110(1), 33–42. DOI: 10.1016/s0092-8674(02)00807-3. Ding, Y., Fromm, M. and Avramova, Z. (2012) Multiple exposures to drought “train” transcriptional responses in Arabidopsis. Nature Communications 3, 740. DOI: 10.1038/ncomms1732. Drinnenberg, I.A., Henikoff, S. and Malik, H.S. (2016) Evolutionary turnover of kinetochore proteins: a ship of theseus? Trends in Cell Biology 26(7), 498–510. DOI: 10.1016/j.tcb.2016.01.005. Du, J., Zhong, X., Bernatavichute, Y.V., Stroud, H., Feng, S. et al. (2012) Dual binding of chromomethylase domains to H3K9me2-containing nucleosomes directs DNA methylation in plants. Cell 151(1), 167–180. DOI: 10.1016/j.cell.2012.07.034. Dubin, M.J., Zhang, P., Meng, D., Remigereau, M.-S., Osborne, E.J. et al. (2015) DNA methylation in Arabidopsis has a genetic basis and shows evidence of local adaptation. ELife 4, e05255. DOI: 10.7554/eLife.05255. Eichten, S.R., Briskine, R., Song, J., Li, Q., Swanson-Wagner, R. et al. (2013) Epigenetic and genetic influences on DNA methylation variation in maize populations. The Plant Cell 25(8), 2783–2797. DOI: 10.1105/tpc.113.114793. Erdmann, R.M., Souza, A.L., Clish, C.B. and Gehring, M. 

8 Plant Organellar Omics

Masatake Kanai1, Kentaro Tamura2, Katarzyna Tarnawska-Glatt3, Shino Goto-Yamada3, Kenji Yamada3 and Shoji Mano1,4*

1Laboratory of Organelle Regulation, National Institute for Basic Biology, Okazaki, Japan; 2Department of Environmental and Life Sciences, School of Food and Nutritional Sciences, University of Shizuoka, Shizuoka, Japan; 3Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland; 4Department of Basic Biology, SOKENDAI (The Graduate University for Advanced Studies), Okazaki, Japan

Abstract

Plant cells contain a variety of organelles, whose functions support many biological processes. One of the characteristics of plant organelles is that they can dynamically change their functions in response to developmental stages, environmental changes, and external stimuli. To elucidate the molecular mechanisms of organelle function, omics analyses have been conducted at various levels, including genomics, transcriptomics, and proteomics. This chapter outlines current omics approaches for membrane-bound organelles. In recent years, imaging techniques have become indispensable in life science research and contribute to the visualization of organelle dynamics. This chapter also introduces useful databases on plant organelles.

8.1 Introduction

Many different subcellular structures are visible when plant cells are viewed microscopically (Fig. 8.1). Differentiated subcellular structures that perform specific functions are termed organelles. Organelle functions support homeostatic balance in diverse plant cellular activities such as seed germination and growth. One characteristic of plant organelles is the capacity to adapt their dynamic functions in response to developmental stage, environmental changes, and external stimuli such as light (Fig. 8.1). Effective analysis of plant omics data is critical for understanding organelle functions and dynamics in cells and tissues at the individual plant level. New technologies such as high-purity organelle isolation and specific detection of organellar proteins have produced large amounts of organelle-specific omics data. Although the amounts and types of omics data vary according to organelle type and plant species, overall, data availability is increasing rapidly. This chapter outlines current omics approaches for membrane-bound organelles.

*Corresponding author: mano@nibb.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0008

Fig. 8.1. Electron micrographs of Arabidopsis cotyledonary cells. Five-day dark-grown (A) and 7-day light-grown plants (B) were used for electron microscopic analysis. Ch, chloroplast; Mt, mitochondrion; N, nucleus; Ob, oil body; Pt, plastid; V, vacuole. Bar indicates 1 µm.

8.2 Nucleus

The nucleus is the most prominent eukaryotic organelle, with crucial functions in cell proliferation and gene expression regulation during development and in response to environmental stresses. The nucleus is enclosed by a nuclear envelope composed of a double membrane, nuclear pores, and nuclear lamina. Each of these structures plays specific roles, which are functionally connected, and proteins in different substructures interact with one another (Meier et al., 2017). The nucleoplasm is organized in a complex three-dimensional (3D) fashion that dynamically changes in response to external stimuli such as light, cold, heat, drought, and salinity (Asensi-Fabado et al., 2017; Bourbousse et al., 2019). Changes in chromatin structure control transcriptional responses and downstream reactions to adapt to environmental conditions. Nuclear bodies, which are membrane-less subnuclear structures, carry out specific nuclear functions (Dundr and Misteli, 2001; Petrovska et al., 2015), including synthesis and processing of pre-ribosomal RNA in the nucleolus, assembly of spliceosomal components in nuclear speckles, storage of microRNA processing machineries in the dicing body, and light signaling in the photobody.

Proteomics approaches have been used to understand nuclear organization and function in plants. Calikowski et al. (2003) identified 36 nuclear matrix proteins in cultured Arabidopsis cells and found cross-kingdom similarities in the protein composition of the nuclear matrix. Stress-related nuclear proteins were identified from proteomic changes in response to cold (Bae et al., 2003) and microbe-associated molecular pattern (MAMP)-triggered immunity (Fakih et al., 2016). Nuclear proteomics was also performed in crops including rice, maize, barley, soybean, and chickpea (Petrovska et al., 2015).

Recent technological improvements increased proteome coverage, allowing identification of more than 1500 proteins (Petrovska et al., 2015; Goto et al., 2019). The largest plant nuclear proteome identified to date is from barley, with more than 2400 proteins (http://barley.gambrinus.ueb.cas.cz/, accessed September 2022) (Blavet et al., 2017).

Our knowledge of 3D genome organization in plants is rapidly expanding, and its contributions to many cellular processes are also beginning to emerge (Doğan and Liu, 2018). The Hi-C technique, which detects whole-genome chromatin interactions (Dekker et al., 2017), revealed the evolution of complex and unique 3D genome organization (Grob and Grossniklaus, 2017; Sotelo-Silveira et al., 2018; Nützmann et al., 2020). Comparison of 3D genome organization in five different species (maize, tomato, sorghum, foxtail millet, and rice) (Dong et al., 2017) showed that chromatin organization into domains was not conserved across species and that large genome-bearing plants, such as maize and tomato, formed extensive chromatin loops between gene islands that were closely associated with active compartments. Bi et al. (2017) used restriction enzyme-mediated chromatin immunoprecipitation to examine the chromatin landscape at the nuclear periphery in Arabidopsis, and found that the chromatin exhibited non-random domain organization and functional partitioning in the nuclear space. This suggests that the plant nuclear periphery provides a functional platform for genome function. Further studies are required to integrate the various 3D chromatin structural features and address their impacts on fundamental plant growth and development processes.
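To make the idea of whole-genome chromatin interaction data more concrete, the following minimal Python sketch shows one common way such data can be summarized: pairs of interacting genomic positions are counted in a symmetric, binned contact matrix. It is an illustration only and not the pipeline used in the studies cited above. The function name `contact_matrix`, the single-chromosome coordinate system, the bin size, and the example read pairs are all hypothetical.

```python
def contact_matrix(pairs, genome_length, bin_size):
    """Count interacting position pairs in a symmetric binned matrix.

    pairs: iterable of (pos_a, pos_b) coordinates on one hypothetical chromosome.
    genome_length, bin_size: lengths in base pairs (illustrative values only).
    """
    n_bins = (genome_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Contacts are undirected, so mirror off-diagonal counts.
            matrix[j][i] += 1
    return matrix


# Hypothetical contacts on a 300 kb chromosome, binned at 100 kb.
example_pairs = [(100, 250_000), (120_000, 130_000), (260_000, 90_000)]
example_matrix = contact_matrix(example_pairs, genome_length=300_000, bin_size=100_000)
for row in example_matrix:
    print(row)
```

In a matrix of this kind, high counts far from the diagonal indicate frequent contacts between distant regions, which is the sort of signal interpreted as chromatin loops or compartments.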

8.3 Endoplasmic Reticulum

Endoplasmic reticulum (ER) is a membrane-bound eukaryotic organelle that forms a mesh-like structure (Staehelin, 1997). In plant cells, ER consists of tubes and sheets, and is mainly located at cell edges, where it is contiguous with the plasma membrane, to form cortical ER (Hepler et al., 1990; Staehelin, 1997). We can distinguish two kinds of ER in cells: smooth ER (sER), which does not have ribosomes, and rough ER (rER), which is studded with millions of membrane-bound ribosomes and functions mainly in the synthesis of secretory proteins (Vitale and Denecke, 1999). Synthesized proteins are folded and post-translationally modified by chaperones and specific enzymes (Boston et al., 1996). These modifications include the formation of disulfide bonds (Freedman et al., 1994), attachment of Asn-linked type polysaccharide side chains (Aebi et al., 2010), and glycosylphosphatidylinositol anchoring (Schultz et al., 1998). Improperly folded proteins undergo ER-associated protein degradation (ERAD) (Ellgaard and Helenius, 2003; Claessen et al., 2012; Liu and Li, 2014). Whereas rER is associated with protein synthesis, sER functions in lipid synthesis (Ohlrogge and Browse, 1995; Fagone and Jackowski, 2009) through accumulation of biosynthetic enzymes (Arondel et al., 1992; Karki et al., 2019).

Soluble luminal proteins of the ER possess an ER targeting signal peptide at their N-terminus and an ER retention signal at their C-terminus. The signal peptide contains a hydrophobic region of approximately 20 amino acids and a signal peptide cleavage site at the end (von Heijne, 1986). The ER retention signal consists of four amino acids, typically KDEL or HDEL (Denecke et al., 1992). Proteins with these sequences are expected to be found in the ER lumen. Consequently, in silico analysis can be used to predict the ER luminal proteins. No signature sequences have been identified for transmembrane and peripheral membrane ER proteins; therefore, in silico prediction of these proteins is more complex.

Many ER proteins have been identified from fractionation analysis with immunoblotting or by microscope observation with fluorescence-tag fusion proteins. ER chaperones, such as binding immunoglobulin protein (BiP) and protein disulfide isomerase (PDI), are frequently used as ER markers in immunoblotting after organelle fractionation or in situ immunohistochemical analysis (Matsushima et al., 2003b; Li et al., 2006; Tamura et al., 2007). In organelle fractionation experiments, changes in the specific gravity of ER can also be used to determine protein localization (Li et al., 2006). As ribosome-attached microsomes become heavier than those without ribosomes, microsomes originating from the ER shift toward the higher-density fractions when centrifuged in the presence of magnesium ions, which enhance ribosome binding to the membrane. In contrast, the microsomes shift toward the lower-density fractions in the presence of ethylenediaminetetraacetic acid (EDTA) that releases ribosomes from the membrane. Fusion proteins containing fluorescent tags, the ER targeting signal peptide, and the ER retention signal can also be used as ER markers (Ridge et al., 1999; Ueda et al., 2010). However, loss of the ER retention signal can sometimes allow proteins to escape from the ER and reach the vacuole (Tamura et al., 2004). The detection of such proteins in the vacuole can be performed under some experimental conditions. One solution is the use of a light-irradiated sample, because light efficiently reduces vacuolar fluorescence of green fluorescent protein (GFP)-fusion ER markers (Tamura et al., 2003). Although comprehensive omics approaches have been used infrequently, some ER proteins were discovered using localization of organelle proteins by isotope tagging (LOPIT) analysis or by analysis of random GFP::cDNA fusions (Cutler et al., 2000; Dunkley et al., 2006).

Plants use ER for protein storage in addition to the previously mentioned ER functions. ER compartments with storage function include precursor-accumulating (PAC) vesicles, ricinosomes/KDEL-tailed protease-accumulating vesicles (KV), and ER bodies (Hara-Nishimura et al., 1998; Toyooka et al., 2000; Schmid et al., 2001; Matsushima et al., 2003a). PAC vesicles accumulate precursors of seed storage proteins during the seed maturation stage and are involved in the bulk transport of storage proteins to the vacuoles (Hara-Nishimura et al., 1998). Proteomic analysis of PAC vesicles showed that they accumulated seed storage proteins and vacuolar sorting receptors (Shimada et al., 1997; Yamada et al., 1999). Ricinosomes/KVs are protease-accumulating vesicles with digestive functions that play a role during developmentally regulated programmed cell death (PCD) of cotyledons and endosperms (Toyooka et al., 2000; Greenwood et al., 2005). ER bodies are unique structures in cabbage family plants that accumulate β-glucosidases and are involved in activation of thioglucosides during plant defense responses to herbivores (Nakazaki et al., 2019; Yamada et al., 2020).
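The in silico prediction of ER luminal proteins described in this section can be sketched as a simple sequence screen: look for a hydrophobic stretch near the N-terminus and a C-terminal KDEL or HDEL tetrapeptide. The Python sketch below is a deliberately crude heuristic and not a substitute for dedicated signal peptide predictors. The residue set treated as hydrophobic, the 30-residue search region, the window size, the threshold, and the toy sequences are all assumptions chosen for illustration.

```python
# Residues treated as hydrophobic here (an illustrative assumption).
HYDROPHOBIC = set("AVLIMFWC")
# C-terminal ER retention signals named in the text.
RETENTION_SIGNALS = ("KDEL", "HDEL")


def has_er_retention_signal(seq):
    """True if the sequence ends with KDEL or HDEL."""
    return seq.upper().endswith(RETENTION_SIGNALS)


def has_hydrophobic_n_terminus(seq, region=30, window=10, min_fraction=0.7):
    """True if any window in the first `region` residues is mostly hydrophobic.

    region, window and min_fraction are hypothetical tuning parameters.
    """
    head = seq.upper()[:region]
    for start in range(len(head) - window + 1):
        win = head[start:start + window]
        if sum(aa in HYDROPHOBIC for aa in win) / window >= min_fraction:
            return True
    return False


def looks_like_er_luminal(seq):
    """Crude screen: N-terminal hydrophobic stretch plus C-terminal retention signal."""
    return has_hydrophobic_n_terminus(seq) and has_er_retention_signal(seq)


# Toy sequences (not real proteins).
candidate = "MKT" + "LLLVAVLLAL" + "SAQA" + "DEEDKTSGRE" * 3 + "HDEL"
no_retention = candidate[:-4] + "GSGS"
no_signal_peptide = "MSDEEKRKSG" * 4 + "KDEL"

for name, seq in [("candidate", candidate),
                  ("no_retention", no_retention),
                  ("no_signal_peptide", no_signal_peptide)]:
    print(name, looks_like_er_luminal(seq))
```

A screen of this kind applies only to soluble luminal proteins. As noted above, transmembrane and peripheral membrane ER proteins lack such signature sequences and would not be found this way.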

8.4 Golgi Apparatus

The Golgi apparatus consists of stacks of membrane-bound cisternae, with the number of cisternae varying according to organism and cell type (Hawes and Satiat-Jeunemaitre, 2005). The Golgi apparatus receives secretory proteins from the ER via COPII transport vesicles. Afterwards, proteins undergo modification and subsequent release to their final destinations, such as vacuoles and the cellular apoplast. Passage of proteins through the Golgi apparatus is unidirectional, from cis- to middle- and finally to trans-Golgi cisternae (Hawes and Satiat-Jeunemaitre, 2005). With some exceptions, such as the bulk transport of proteins to the vacuoles via PAC vesicles and KVs, most secretory proteins pass through the Golgi apparatus.

The primary function of the Golgi apparatus is protein modification, including the synthesis and attachment of polysaccharide chains (Hawes and Satiat-Jeunemaitre, 2005). For example, the common high-mannose type of Asn-linked polysaccharide side chain is modified to its complex type inside the Golgi cisternae (Nguema-Ona et al., 2014). The Golgi apparatus is also involved in the synthesis of O-linked glycosylated side chains at Ser or Thr amino acid residues (Nguema-Ona et al., 2014), as well as in the production of free polysaccharides like pectin (Sinclair et al., 2018).

Functional differences are apparent between cis-, middle-, and trans-Golgi cisternae. The cis-Golgi accepts proteins from the ER; therefore, the ER retention signal receptor, ERD2, is localized mainly on the cis-Golgi to recycle proteins that escape from the ER lumen back to the ER (Silva-Alvim et al., 2018). Protein sorting according to destination occurs in the trans-Golgi. For example, proteins involved in the formation of sorting vesicles are highly enriched in the trans-Golgi cisternae. Cis-, middle-, and trans-Golgi can be distinguished by specific marker proteins. A xyloglucan galactosyltransferase, an H+-PPase, and a GFP fusion of a rat sialyl transferase anchor sequence are used as Golgi markers (Saint-Jore et al., 2002; Tamura et al., 2005). Typical trans-Golgi markers include components of clathrin-coated vesicles, such as clathrin light chain (Wang et al., 2013). Alternative trans-Golgi markers are syntaxins such as Arabidopsis SYP41 and SYP61 (Uemura et al., 2004).

In silico identification of Golgi proteins in plants is hampered by the lack of sequence signatures for Golgi localization. Most Golgi-localized proteins identified to date were discovered by analysis of individual proteins. Several studies have described attempts to isolate the Golgi apparatus (Morré and Mollenhauer, 1964; Morré et al., 1993), and successful isolation facilitates the identification of Golgi proteins via proteomic approaches (Parsons et al., 2012). Proteomic approaches such as LOPIT analysis are powerful techniques for comprehensive identification of Golgi proteins (Dunkley et al., 2006; Nikolovski et al., 2012).

8.5 Vacuole

The vacuole is a sac-like structure that is filled with fluid and surrounded by a single membrane, termed the vacuolar membrane or tonoplast. Vacuoles vary in size and shape but are generally observed as giant organelles that occupy a large majority of the cell volume in many types of plant cells. The main roles of vacuoles are: isolation of harmful substances within the cell; accumulation of inorganic ions, primary and secondary metabolites, defense-related proteins, and waste-degrading enzymes; and control of turgor pressure (Marty, 1999; Hara-Nishimura and Hatsugai, 2011). During seed development, abundant storage proteins are produced and stored in the vacuole (Shimada et al., 2018a).

Numerous tonoplast-localized transporters, including ABC transporters and multidrug and toxic compound extrusion (MATE) family transporters, are involved in vacuolar sorting of various ions and metabolites (reviewed in Shitan and Yazaki, 2013). Vesicle-mediated transport is another major route for transporting components to the vacuole. Anthocyanin transport into the vacuole is mediated by vesicle trafficking from the ER as well as by ABC transporters, MATE transporters, or microautophagy (Shitan and Yazaki, 2013; Chanoca et al., 2015). Soluble and transmembrane vacuolar proteins are sorted in the ER by their ER signal sequence and then transported to the vacuole by vesicle-mediated transport directly from the ER or via the Golgi and/or prevacuolar compartments (reviewed in Xiang et al., 2013; Shimada et al., 2018a). Vacuolar sorting determinants are involved in protein trafficking from the Golgi apparatus to the vacuole, although it remains unclear whether these motifs contribute to all Golgi apparatus–vacuole transport of vacuolar proteins (Pereira et al., 2013; Xiang et al., 2013). Autophagy, a degradation process involved in elimination of unnecessary cellular components, also aids protein transport to the vacuole (reviewed in Yoshimoto and Ohsumi, 2018).

Diverse proteins and compounds from various pathways accumulate in the vacuole, and prediction of protein localization to the vacuole based on sequence analysis is challenging. Omics analyses of isolated vacuoles offer an alternative approach for identifying vacuolar components. Vacuoles isolated from protoplasts of Arabidopsis suspension cells, Arabidopsis rosette leaves, and cauliflower (Brassica oleracea) buds, and purified by sucrose density gradient centrifugation, were used for proteomic analyses (Shimaoka et al., 2004; Schmidt et al., 2007; Ohnishi et al., 2018). Carter et al. (2004) found that approximately 40% of identified proteins had predicted vacuolar localization when assessed using peptide signal prediction algorithms. The mean predicted pI of the vacuolar proteins was relatively acidic compared with all the predicted Arabidopsis proteins. This type of information supports annotation of vacuolar proteins in omics analyses.

Metabolomics analysis was conducted on vacuoles isolated from Arabidopsis suspension-cultured cells and barley leaves (Schneider et al., 2009; Ohnishi et al., 2018). Non-aqueous fractionation followed by mass spectrometry can allow organelle-specific omics analysis without organelle isolation, retaining local positional information within the cell. Although the separation accuracy of each organelle is low, this method was used for several plant species, including potato, spinach, and pea (Gerhardt and Heldt, 1984; Farré et al., 2001; Fürtauer et al., 2016; Schneider et al., 2019). Oikawa et al. (2011) observed metabolite dynamics in the vacuole and cytosol of a single internodal cell of Chara braunii, a green alga with long macroscopic cells. To evaluate vacuole fraction purity, the activities of marker enzymes (such as a soluble vacuolar protein α-mannosidase) and the presence of marker proteins (such as tonoplast proteins V-ATPase and γ-TIP) can be used alongside other organelle markers (Jauh et al., 1999; Shimaoka et al., 2004).
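The comparison of mean predicted pI between vacuolar proteins and the whole predicted proteome, mentioned above, can be illustrated with a short Python sketch. It estimates pI as the pH at which the net charge from ionizable side chains and the two termini is zero, then averages over a set of sequences. This is a simplified model under stated assumptions and not the method of Carter et al. (2004). The pKa values are approximate, commonly quoted figures (different tools use different sets), and the toy sequences are hypothetical.

```python
# Approximate pKa values (an assumption; published sets differ slightly).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch model)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # deprotonated C-terminus
    for aa in seq.upper():
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def predicted_pi(seq, tol=0.001):
    """Bisection search for the pH at which the net charge is zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged, so pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def mean_pi(sequences):
    """Mean predicted pI over a collection of sequences."""
    values = [predicted_pi(s) for s in sequences]
    return sum(values) / len(values)


# Toy sequence sets (not real proteins).
acidic_set = ["MDDEEGSADE", "MEEDSGDTAE"]
background_set = ["MDDEEGSADE", "MKKRRGSAKR", "MGSTAGKSAE"]
print(round(mean_pi(acidic_set), 2), round(mean_pi(background_set), 2))
```

The bisection works because net charge decreases monotonically as pH rises, so there is a single crossing point between pH 0 and pH 14.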

8.6 Peroxisome

Peroxisomes are single-membrane-bound organelles that are ubiquitous in eukaryotic cells. Peroxisomes have specialized functions according to organ type and can transform their functions flexibly (Nishimura et al., 1993; Kamada et al., 2003; Goto-Yamada et al., 2014). During germination, plant peroxisomes play pivotal roles in the fatty acid oxidation that supports sucrose production. Peroxisomes also participate in photorespiration in photosynthetic tissues and are involved in diverse biological processes, including phytohormone biosynthesis, generation of signaling molecules, and defense against pathogens (reviewed in Hu et al., 2012; Kao et al., 2018). Peroxisomal metabolic pathways contain several oxidases that produce hydrogen peroxide (H2O2), and thus peroxisomes harbor antioxidant enzymes, such as catalase (CAT), to degrade H2O2 (reviewed in Corpas et al., 2020).

Peroxisomes contain a variety of metabolic pathways for which substrates and compounds need to cross the peroxisomal lipid bilayer. Small hydrophilic molecules, such as fatty acids and organic acids, pass through pores in the peroxisome membrane by diffusion along their concentration gradients. Larger molecules, such as acyl-CoA and precursors of phytohormones, are transported by a peroxisomal ABC transporter using energy from ATP hydrolysis. Cofactors required for peroxisomal metabolism, such as ATP, NAD, and CoA, are imported into peroxisomes via membrane carriers (reviewed in Charton et al., 2019).

Peroxisomes do not have their own genome; peroxisomal proteins are encoded by the nuclear genome, synthesized in the cytosol, and transported to peroxisomes. Peroxisomal matrix proteins harbor peroxisomal targeting signal 1 (PTS1) at the C-terminus or PTS2 at the N-terminus of their peptide sequences, allowing the proteins to be recognized by specific receptors in the cytosol. PTS2-containing sequences are cleaved inside peroxisomes after transport. Peroxisomal membrane proteins are either inserted directly into the peroxisomal membrane or first localized to the ER membrane before being transported to the peroxisome (Hu et al., 2012; Kao et al., 2018; Reumann and Chowdhary, 2018).

Biochemical analyses of plant peroxisomes were historically performed using plant species from which large amounts of peroxisomes could be isolated, such as pumpkin and spinach. However, Arabidopsis was increasingly adopted for peroxisome studies following the completion of its genome sequence in 2000. Fukao et al. (2002) conducted proteomic analysis of peroxisomes isolated from Arabidopsis cotyledons, and subsequent proteomic studies identified peroxisome-localized proteins and novel peroxisomal metabolic processes using Arabidopsis cotyledons, leaves, and suspension-cultured cells, as well as soybean and spinach (Fukao et al., 2002, 2003; Arai et al., 2008; Babujee et al., 2010). Peroxisome isolation using a combination of two types of density gradient centrifugation improved the purity of isolated peroxisomes and increased the detection sensitivity of peroxisomal proteins (Reumann et al., 2007). The purity of isolated peroxisomes can be assessed using the activity of peroxisomal matrix enzymes (such as CAT, isocitrate lyase (ICL), and hydroxypyruvate reductase (HPR)) to distinguish peroxisomes from other organelles (Fukao et al., 2002; Reumann et al., 2007; Quan et al., 2013). Peroxisomal metabolic processes change dramatically in response to cellular/tissue states and can be distinguished with appropriate markers, such as ICL and HPR, which characterize etiolated and photosynthetic tissues, respectively.

Almost all peroxisomal matrix proteins contain PTS1 or PTS2, which facilitates large-scale sequence-based analysis of peroxisomal proteins (Nakai and Kanehisa, 1992). Subcellular localization of peroxisomes can be visualized using fluorescent proteins fused to a peroxisomal targeting signal (such as GFP-PTS1) (Mano et al., 2002; Reumann et al., 2007). Emanuelsson et al. (2003) combined analysis of PTS1 sequences from eight eukaryotic genomes with machine learning and improved predictions of protein localization to peroxisomes. The technique was refined using plant sequences for machine learning, and plant-specific prediction algorithms were developed (Lingner et al., 2011; Wang et al., 2017; Reumann and Chowdhary, 2018). Metabolomic analysis of several Arabidopsis peroxisome-defective mutants in the photorespiration pathway was used to validate a metabolomic database search tool, PhenoMeter (PM). The tool predicted diagnoses from the metabolic phenotype and also estimated the effects on other metabolic processes (Carroll et al., 2015). Although it requires improvement, the PM tool can potentially identify new links between peroxisome functions and metabolic processes.
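The sequence-based reasoning behind targeting-signal prediction can be illustrated with a minimal Python sketch. It assumes simplified consensus motifs that are not taken from the studies cited above: a C-terminal tripeptide matching [SA][KR][LM] (of which SKL is the canonical example) for PTS1, and the nonapeptide R[LI]x5HL within an arbitrarily chosen N-terminal window of 40 residues for PTS2. The function name, the window size, and the toy sequences are hypothetical, and published predictors rely on trained position-specific models rather than fixed patterns.

```python
import re

# Simplified consensus patterns (assumptions for illustration only).
PTS1_PATTERN = re.compile(r"[SA][KR][LM]$")   # C-terminal tripeptide, e.g. SKL
PTS2_PATTERN = re.compile(r"R[LI].{5}HL")     # nonapeptide R[LI]x5HL
PTS2_WINDOW = 40  # assumed: search only the first 40 residues for PTS2


def classify_pts(sequence: str) -> str:
    """Return 'PTS1', 'PTS2', 'PTS1+PTS2', or 'none' for a protein sequence."""
    seq = sequence.strip().upper()
    has_pts1 = PTS1_PATTERN.search(seq) is not None
    has_pts2 = PTS2_PATTERN.search(seq[:PTS2_WINDOW]) is not None
    if has_pts1 and has_pts2:
        return "PTS1+PTS2"
    if has_pts1:
        return "PTS1"
    if has_pts2:
        return "PTS2"
    return "none"


# Hypothetical toy sequences, not real proteins.
examples = {
    "toy_pts1_protein": "MADKLTTEGHVAAPQSKL",
    "toy_pts2_protein": "MEKAINRLQVLLGHLQPSSAAGGT",
    "toy_cytosolic": "MSTAVLENPGLGRKLSD",
}
results = {name: classify_pts(seq) for name, seq in examples.items()}
for name, label in results.items():
    print(name, label)
```

A fixed pattern of this kind would miss non-canonical signals and would flag proteins that merely end in a matching tripeptide, which is one reason the machine learning approaches described above were developed.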

8.7 Oil Body

The oil body (lipid body) is a single-layer membrane-bound organelle that does not have its own DNA. In plant cells, oils are stored inside the oil body, which occupies most of the cell volume in oilseeds such as those of Arabidopsis, rapeseed, and soybean. Oil bodies are also observed in organs that store little oil, such as leaves and roots, and have been confirmed to be ubiquitous in plant cells (Shimada et al., 2018b; Chapman et al., 2019). In addition to its role in oil storage, the oil body has crucial roles in lipid metabolism in plants. SUGAR-DEPENDENT1, a triacylglycerol (TAG) lipase on the oil body membrane, de-esterifies fatty acids from internally accumulated TAG and is responsible for the first step of TAG degradation (Eastmond, 2006; Kelly et al., 2013; Kim et al., 2014; Kanai et al., 2019). SEIPIN, another oil body membrane protein, is involved in oil body formation and size control, and contributes to the efficient storage and degradation of TAG in seeds (Cai et al., 2015). Oleosin, a major structural protein of oil body membranes, has monoacylglycerol acyltransferase and phospholipase activities (Parthibane et al., 2012), and is involved in the maintenance of oil body structure and in lipid synthesis and degradation. These findings indicate that oil body membrane proteins are a major factor controlling oil body function, and that proteome analysis of isolated oil bodies is a powerful tool for elucidating their roles.

The constituent proteins of oil body membranes have been widely studied, and various oil body proteins have been identified, mainly from plant seeds and algae that accumulate large amounts of oil. Recently, developments in mass spectrometry have allowed identification of proteins from small samples (Kretzschmar et al., 2020). Comparative proteome analysis using oil bodies isolated from different tissues and organs is underway to examine the changes in oil body constituent proteins after exposure to stress.

8.8 Plastid

Plastids are plant-specific organelles that have a range of crucial roles in plant cells. Plastids, which have their own DNA, can differentiate into several types, such as chloroplasts for photosynthesis in green tissues and amyloplasts in storage organs. Concomitant with plastid differentiation, the internal metabolic pathways change dramatically to reflect the differentiated functions (Solymosi et al., 2018). For example, undifferentiated proplastids in leaves differentiate into chloroplasts after exposure to light. In petals, proplastids differentiate into chromoplasts, which synthesize and store pigments such as carotenoids, and in cereal seeds they differentiate into amyloplasts, which accumulate starch; in each organ the metabolic pathways inside the plastids change accordingly. Thus, plastids have dedicated metabolic pathways that are tissue and organ specific (Rolland et al., 2018).

Proteome analysis of metabolic pathways in isolated chloroplasts has produced optimized isolation methods and databases of chloroplast-localized proteins (Baginsky and Gruissem, 2004; Armbruster et al., 2011; Demartini et al., 2011). Recent research has focused on changes in organelle functions in response to environmental changes, including proteomic analysis of plastids under stress conditions (Watson et al., 2018; Lande et al., 2020; Wang et al., 2020). Such studies provide overviews of the different types of plastid metabolic systems and also contribute to the elucidation of plastid differentiation mechanisms, a major topic in plant organelle research. Proteomic analysis is a powerful method to identify the proteins that control plastid differentiation (Kleffmann et al., 2007; Suzuki et al., 2015).

In addition to nuclear-encoded plastid-localized proteins, approximately 100 proteins are encoded in the plastid genome, many of which play crucial roles in controlling plastid functions. Transcriptional and translational regulation of plastid genes is extremely important for plastid functional regulation and has a long research history (de Vries and Archibald, 2018). Proteins derived from the nuclear genome are responsible for regulating transcription of plastid genome-encoded genes, and transcriptome analysis of plastid genomes using mutants has contributed to the elucidation of transcriptional and post-transcriptional regulation of plastid genomes (Kanai et al., 2013; Ito et al., 2018).

8.9 Mitochondrion

Mitochondria are endosymbiotic organelles that perform aerobic respiration and supply the energy required for cellular processes. Mitochondria also function in the biosynthesis of some iron–sulfur clusters and lipids, programmed cell death (PCD), and calcium uptake (Spinelli and Haigis, 2018). During evolution, genetic material from the ancestral endosymbiont was lost or transferred to the nucleus of the host cell. Mitochondrial genomes are circular, linear, or branched double-stranded DNA molecules (Smith and Keeling, 2015). Mitochondrial genome sizes vary up to 100-fold among plant species (Skippington et al., 2015). This variation is largely attributed to differences in the number and length of noncoding regions, and to intracellular transfers of DNA from the nuclear and chloroplast genomes. Intergenic regions in plant mitochondrial genomes are formed by large non-tandem repeats, which drive size changes and rearrangements through recombination. The extensive genomic variation is indicative of high recombination activity in plant mitochondria (Gualberto and Newton, 2017), and this is responsible for the rapid evolution of plant mitochondrial genomes.

Although mitochondrial functions depend primarily on thousands of nuclear-encoded genes, plant mitochondria possess unique machineries for transcription and translation, such as RNA editing, which modifies mRNA nucleotides to regulate gene expression. Plant mitochondria employ many RNA editing proteins, including pentatricopeptide repeat proteins (PPRs), which act as RNA editing factors. The Arabidopsis genome encodes more than 200 PPR proteins involved in cytidine-to-uridine deamination in mitochondria (Andrés-Colás et al., 2017; Tang and Luo, 2018). PPR-dependent RNA editing is involved in various cellular activities, including response to abiotic stress, carbon energy balance in the electron transport chain, and embryo development (Tang and Luo, 2018). Most RNA editing events lead to amino acid substitutions in proteins encoded by the mitochondrial genome. The editing sites tend to be located in protein structural cores (Yura et al., 2009), suggesting that RNA editing is required for the formation of functional 3D structures in mitochondrial proteins.

Mitochondria have their own translation apparatus, termed mitoribosomes. Plant mutants with defective mitoribosomes exhibit alterations in specific aspects of development, such as embryogenesis, leaf morphogenesis, and the formation of reproductive tissues (Robles and Quesada, 2017). In plants, mitoribosomes consist of three rRNA molecules encoded by the mitochondrial genome and ribosomal proteins encoded by the nuclear and mitochondrial genomes (Robles and Quesada, 2017). Although it was expected that mitoribosomes would structurally resemble bacterial ribosomes, recent cryo-electron microscopy revealed diverse mitoribosome compositions and structures (Waltz et al., 2020). Plant mitoribosomes contain PPR proteins as core ribosomal proteins. The PPRs contribute to specific RNA recognition and interactions, stabilize and maintain plant-specific ribosomal RNA expansions, and are involved in translation initiation. Ribosome profiling provided a genome-wide snapshot of mitochondrial translation in Arabidopsis (Planchard et al., 2018). Mitoribosome footprints are 27 nt or 28 nt long in plants, in contrast to mitoribosome footprints in animal mitochondria (24–37 nt long) (Mai et al., 2017). These results suggest that mitoribosomes have evolved to adapt to differences in mitochondrial genome content in the different eukaryotic groups. Further characterization of mitoribosomes from a variety of species will be required to fully comprehend the evolution of translation processes across eukaryotes.
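The footprint lengths reported above suggest a simple check that can be applied to ribosome profiling reads: tabulate read lengths and report the fraction falling in the expected window. The following Python sketch is a minimal illustration under stated assumptions, not the pipeline used in the cited studies. The function name and the toy reads are hypothetical, and the 27–28 nt window is simply the plant value quoted in the text.

```python
from collections import Counter


def footprint_length_profile(reads, min_len=27, max_len=28):
    """Count read lengths and return (length counts, fraction within window).

    reads: iterable of nucleotide strings (e.g. adapter-trimmed reads).
    """
    lengths = Counter(len(read) for read in reads)
    total = sum(lengths.values())
    in_window = sum(
        count for length, count in lengths.items() if min_len <= length <= max_len
    )
    fraction = in_window / total if total else 0.0
    return lengths, fraction


# Hypothetical toy reads of lengths 27, 28, 28, 31 and 24 nt.
toy_reads = ["A" * 27, "C" * 28, "G" * 28, "T" * 31, "A" * 24]
lengths, fraction = footprint_length_profile(toy_reads)
print(sorted(lengths.items()), fraction)
```

In practice the window would be chosen from the observed length distribution of each library rather than fixed in advance.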

8.10 Databases for Images/Movies of Organelle Dynamics

Advances in imaging techniques and improvements to microscopes and other equipment have allowed the generation of abundant image data showing dynamic organellar behavior in living plant cells. Although most images are analyzed and retained by individual research laboratories, some projects collate image data into publicly accessible databases. Two main database types are available: (i) databases for one organelle, specific cells/tissues, or specific plant species; and (ii) databases for all kinds of organelles in various plant species. The former includes the AtNoPDB (Arabidopsis Nucleolar Protein Database), FTFLP (Fluorescent Tagging of Full-length Proteins), LIPS (Live Images of Plant Stomata), and GFP Localisome Database, and the latter includes PODB3 (The Plant Organelles Database 3) (Table 8.1).

Table 8.1. Image-based databases.

Name of database          URL
AtNoPDB                   https://bioinf.hutton.ac.uk/cgi-bin/atnopdb/home
FTFLP                     https://gfp.dpb.carnegiescience.edu
LIPS                      https://www.higaki-lab.net/lips/
GFP Localisome Database   https://www.psb.ugent.be/papers/cellbiol/
PODB3                     http://podb.nibb.ac.jp/Organellome/

All URLs accessed in September 2022.

Researchers can use image-based databases to view organelle dynamics such as morphology, size, and movement. Image databases are potentially useful as rich data sources for computational analyses that extract biologically meaningful information, because images contain data for each organelle, such as size, length, morphology, and velocity. Some databases permit image data to be downloaded under stated terms of use, and users can reuse the downloaded data for further analyses such as quantification of various organelle characteristics. Combining image databases with other organelle omics data will improve our understanding of organelle dynamics.
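As an illustration of the kind of quantification mentioned above, the following Python sketch estimates the mean speed of a single organelle from its tracked positions in consecutive movie frames. It is a minimal example under stated assumptions: the function name, the pixel size, the frame interval, and the toy track are hypothetical, and obtaining the track itself (segmentation and tracking of the organelle) is outside its scope.

```python
import math


def mean_speed(track, frame_interval_s, um_per_pixel):
    """Mean speed (micrometers per second) along a tracked path.

    track: list of (x, y) pixel positions in consecutive frames.
    frame_interval_s: time between frames in seconds.
    um_per_pixel: image calibration in micrometers per pixel.
    """
    if len(track) < 2:
        raise ValueError("at least two positions are needed")
    # Sum the straight-line displacements between consecutive frames.
    path_px = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    elapsed_s = frame_interval_s * (len(track) - 1)
    return path_px * um_per_pixel / elapsed_s


# Hypothetical track: two steps of 5 pixels each, imaged every 2 s at 0.1 um/pixel.
toy_track = [(0, 0), (3, 4), (6, 8)]
speed = mean_speed(toy_track, frame_interval_s=2.0, um_per_pixel=0.1)
print(speed)
```

Because the path length is summed frame by frame, the estimate depends on the frame interval, so speeds are comparable only between movies acquired at similar temporal resolution.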

8.11 Conclusions

This chapter summarizes omics approaches for membrane-bound organelles. However, omics approaches are also valuable for understanding non-membrane-bound subcellular structures, such as microtubules and microfilaments, and inducible subcellular structures, such as ER bodies and stress granules.

Until recently, model plants such as Arabidopsis were the primary sources of plant materials for omics analyses because whole-genome information was available for integration with other omics data. However, next-generation sequencing can generate huge amounts of sequence-based data for almost all plant species, and omics approaches can now be used to address both the mechanisms of organelle dynamics that are common among plant species and those that are plant specific. In the near future, deep learning and artificial intelligence will be needed to maximize the potential of the huge amounts of data generated in omics-driven plant organelle research.

Acknowledgment

We thank Maki Kondo and the staff of the Spectrography and Bioimaging Facility, National Institute for Basic Biology Core Research Facilities, for performing transmission electron microscopy.

References

Aebi, M., Bernasconi, R., Clerc, S. and Molinari, M. (2010) N-glycan structures: recognition and processing in the ER. Trends in Biochemical Sciences 35(2), 74–82. DOI: 10.1016/j.tibs.2009.10.001.
Andrés-Colás, N., Zhu, Q., Takenaka, M., De Rybel, B., Weijers, D. et al. (2017) Multiple PPR protein interactions are involved in the RNA editing system in Arabidopsis mitochondria and plastids. Proceedings of the National Academy of Sciences 114(33), 8883–8888. DOI: 10.1073/pnas.1705815114.
Arai, Y., Hayashi, M. and Nishimura, M. (2008) Proteomic identification and characterization of a novel peroxisomal adenine nucleotide transporter supplying ATP for fatty acid β-oxidation in soybean and Arabidopsis. The Plant Cell 20(12), 3227–3240. DOI: 10.1105/tpc.108.062877.
Armbruster, U., Pesaresi, P., Pribil, M., Hertle, A. and Leister, D. (2011) Update on chloroplast research: new tools, new topics, and new trends. Molecular Plant 4(1), 1–16. DOI: 10.1093/mp/ssq060.
Arondel, V., Lemieux, B., Hwang, I., Gibson, S., Goodman, H.M. et al. (1992) Map-based cloning of a gene controlling omega-3 fatty acid desaturation in Arabidopsis. Science 258(5086), 1353–1355. DOI: 10.1126/science.1455229.


Asensi-Fabado, M.A., Amtmann, A. and Perrella, G. (2017) Plant responses to abiotic stress: the chromatin context of transcriptional regulation. Biochimica et Biophysica Acta (BBA) – Gene Regulatory Mechanisms 1860(1), 106–122. DOI: 10.1016/j.bbagrm.2016.07.015.
Babujee, L., Wurtz, V., Ma, C., Lueder, F., Soni, P. et al. (2010) The proteome map of spinach leaf peroxisomes indicates partial compartmentalization of phylloquinone (vitamin K1) biosynthesis in plant peroxisomes. Journal of Experimental Botany 61(5), 1441–1453. DOI: 10.1093/jxb/erq014.
Bae, M.S., Cho, E.J., Choi, E.Y. and Park, O.K. (2003) Analysis of the Arabidopsis nuclear proteome and its response to cold stress. The Plant Journal 36(5), 652–663. DOI: 10.1046/j.1365-313X.2003.01907.x.
Baginsky, S. and Gruissem, W. (2004) Chloroplast proteomics: potentials and challenges. Journal of Experimental Botany 55(400), 1213–1220. DOI: 10.1093/jxb/erh104.
Bi, X., Cheng, Y.J., Hu, B., Ma, X., Wu, R. et al. (2017) Nonrandom domain organization of the Arabidopsis genome at the nuclear periphery. Genome Research 27(7), 1162–1173. DOI: 10.1101/gr.215186.116.
Blavet, N., Uřinovská, J., Jeřábková, H., Chamrád, I., Vrána, J. et al. (2017) UNcleProt (Universal Nuclear Protein database of barley): the first nuclear protein database that distinguishes proteins from different phases of the cell cycle. Nucleus (Austin, Tex.) 8(1), 70–80. DOI: 10.1080/19491034.2016.1255391.
Boston, R.S., Viitanen, P.V. and Vierling, E. (1996) Molecular chaperones and protein folding in plants. Plant Molecular Biology 32(1–2), 191–222. DOI: 10.1007/BF00039383.
Bourbousse, C., Barneche, F. and Laloi, C. (2019) Plant chromatin catches the sun. Frontiers in Plant Science 10, 1728. DOI: 10.3389/fpls.2019.01728.
Cai, Y., Goodman, J.M., Pyc, M., Mullen, R.T., Dyer, J.M. et al. (2015) Arabidopsis SEIPIN proteins modulate triacylglycerol accumulation and influence lipid droplet proliferation. The Plant Cell 27(9), 2616–2636. DOI: 10.1105/tpc.15.00588.
Calikowski, T.T., Meulia, T. and Meier, I. (2003) A proteomic study of the Arabidopsis nuclear matrix. Journal of Cellular Biochemistry 90(2), 361–378. DOI: 10.1002/jcb.10624.
Carroll, A.J., Zhang, P., Whitehead, L., Kaines, S., Tcherkez, G. et al. (2015) PhenoMeter: a metabolome database search tool using statistical similarity matching of metabolic phenotypes for high-confidence detection of functional links. Frontiers in Bioengineering and Biotechnology 3, 106. DOI: 10.3389/fbioe.2015.00106.
Carter, C., Pan, S., Zouhar, J., Avila, E.L., Girke, T. et al. (2004) The vegetative vacuole proteome of Arabidopsis thaliana reveals predicted and unexpected proteins. The Plant Cell 16(12), 3285–3303. DOI: 10.1105/tpc.104.027078.
Chanoca, A., Kovinich, N., Burkel, B., Stecha, S., Bohorquez-Restrepo, A. et al. (2015) Anthocyanin vacuolar inclusions form by a microautophagy mechanism. The Plant Cell 27(9), 2545–2559. DOI: 10.1105/tpc.15.00589.
Chapman, K.D., Aziz, M., Dyer, J.M. and Mullen, R.T. (2019) Mechanisms of lipid droplet biogenesis. The Biochemical Journal 476(13), 1929–1942. DOI: 10.1042/BCJ20180021.
Charton, L., Plett, A. and Linka, N. (2019) Plant peroxisomal solute transporter proteins. Journal of Integrative Plant Biology 61(7), 817–835. DOI: 10.1111/jipb.12790.
Claessen, J.H., Kundrat, L. and Ploegh, H.L. (2012) Protein quality control in the ER: balancing the ubiquitin checkbook. Trends in Cell Biology 22, 22–32. DOI: 10.1016/j.tcb.2011.09.010.
Corpas, F.J., González-Gordo, S. and Palma, J.M. (2020) Plant peroxisomes: a factory of reactive species. Frontiers in Plant Science 11, 853. DOI: 10.3389/fpls.2020.00853.
Cutler, S.R., Ehrhardt, D.W., Griffitts, J.S. and Somerville, C.R. (2000) Random GFP::cDNA fusions enable visualization of subcellular structures in cells of Arabidopsis at a high frequency. Proceedings of the National Academy of Sciences 97(7), 3718–3723. DOI: 10.1073/pnas.97.7.3718.
de Vries, J. and Archibald, J.M. (2018) Plastid genomes. Current Biology 28(8), R336–R337. DOI: 10.1016/j.cub.2018.01.027.
Doğan, E.S. and Liu, C. (2018) Three-dimensional chromatin packing and positioning of plant genomes. Nature Plants 4, 521–529. DOI: 10.1038/s41477-018-0199-5.
Dekker, J., Belmont, A.S., Guttman, M., Leshyk, V.O. and Lis, J.T. (2017) The 4D nucleome project. Nature 549(7671), 219–226. DOI: 10.1038/nature23884.
Demartini, D.R., Carlini, C.R. and Thelen, J.J. (2011) Proteome databases and other online resources for chloroplast research in Arabidopsis. Methods in Molecular Biology 775, 93–115.
Denecke, J., De Rycke, R. and Botterman, J. (1992) Plant and mammalian sorting signals for protein retention in the endoplasmic reticulum contain a conserved epitope. The EMBO Journal 11(6), 2345–2355. DOI: 10.1002/j.1460-2075.1992.tb05294.x.


Dong, P., Tu, X., Chu, P.Y., Lü, P., Zhu, N. et al. (2017) 3D chromatin architecture of large plant genomes determined by local A/B compartments. Molecular Plant 10(12), 1497–1509. DOI: 10.1016/j.molp.2017.11.005.
Dundr, M. and Misteli, T. (2001) Functional architecture in the cell nucleus. Biochemical Journal 356, 297–310. DOI: 10.1042/bj3560297.
Dunkley, T.P.J., Hester, S., Shadforth, I.P., Runions, J., Weimar, T. et al. (2006) Mapping the Arabidopsis organelle proteome. Proceedings of the National Academy of Sciences 103(17), 6518–6523. DOI: 10.1073/pnas.0506958103.
Eastmond, P.J. (2006) SUGAR-DEPENDENT1 encodes a patatin domain triacylglycerol lipase that initiates storage oil breakdown in germinating Arabidopsis seeds. The Plant Cell 18(3), 665–675. DOI: 10.1105/tpc.105.040543.
Ellgaard, L. and Helenius, A. (2003) Quality control in the endoplasmic reticulum. Nature Reviews Molecular Cell Biology 4(3), 181–191. DOI: 10.1038/nrm1052.
Emanuelsson, O., Elofsson, A., von Heijne, G. and Cristóbal, S. (2003) In silico prediction of the peroxisomal proteome in fungi, plants and animals. Journal of Molecular Biology 330(2), 443–456. DOI: 10.1016/s0022-2836(03)00553-9.
Fagone, P. and Jackowski, S. (2009) Membrane phospholipid synthesis and endoplasmic reticulum function. Journal of Lipid Research 50, S311–S316. DOI: 10.1194/jlr.R800049-JLR200.
Fakih, Z., Ahmed, M.B., Letanneur, C. and Germain, H. (2016) An unbiased nuclear proteomics approach reveals novel nuclear protein components that participates in MAMP-triggered immunity. Plant Signaling & Behavior 11(6), e1183087. DOI: 10.1080/15592324.2016.1183087.
Farré, E.M., Tiessen, A., Roessner, U., Geigenberger, P., Trethewey, R.N. et al. (2001) Analysis of the compartmentation of glycolytic intermediates, nucleotides, sugars, organic acids, amino acids, and sugar alcohols in potato tubers using a nonaqueous fractionation method. Plant Physiology 127(2), 685–700. DOI: 10.1104/pp.010280.
Freedman, R.B., Hirst, T.R. and Tuite, M.F. (1994) Protein disulphide isomerase: building bridges in protein folding. Trends in Biochemical Sciences 19(8), 331–336. DOI: 10.1016/0968-0004(94)90072-8.
Fukao, Y., Hayashi, M. and Nishimura, M. (2002) Proteomic analysis of leaf peroxisomal proteins in greening cotyledons of Arabidopsis thaliana. Plant & Cell Physiology 43(7), 689–696. DOI: 10.1093/pcp/pcf101.
Fukao, Y., Hayashi, M., Hara-Nishimura, I. and Nishimura, M. (2003) Novel glyoxysomal protein kinase, GPK1, identified by proteomic analysis of glyoxysomes in etiolated cotyledons of Arabidopsis thaliana. Plant & Cell Physiology 44(10), 1002–1012. DOI: 10.1093/pcp/pcg145.
Fürtauer, L., Weckwerth, W. and Nägele, T. (2016) A benchtop fractionation procedure for subcellular analysis of the plant metabolome. Frontiers in Plant Science 7, 1912. DOI: 10.3389/fpls.2016.01912.
Gerhardt, R. and Heldt, H.W. (1984) Measurement of subcellular metabolite levels in leaves by fractionation of freeze-stopped material in nonaqueous media. Plant Physiology 75(3), 542–547. DOI: 10.1104/pp.75.3.542.
Goto, C., Hashizume, S., Fukao, Y., Hara-Nishimura, I. and Tamura, K. (2019) Comprehensive nuclear proteome of Arabidopsis obtained by sequential extraction. Nucleus 10(1), 81–92. DOI: 10.1080/19491034.2019.1603093.
Goto-Yamada, S., Mano, S., Nakamori, C., Kondo, M., Yamawaki, R. et al. (2014) Chaperone and protease functions of LON protease 2 modulate the peroxisomal transition and degradation with autophagy. Plant & Cell Physiology 55(3), 482–496. DOI: 10.1093/pcp/pcu017.
Greenwood, J.S., Helm, M. and Gietl, C. (2005) Ricinosomes and endosperm transfer cell structure in programmed cell death of the nucellus during Ricinus seed development. Proceedings of the National Academy of Sciences 102(6), 2238–2243. DOI: 10.1073/pnas.0409429102.
Grob, S. and Grossniklaus, U. (2017) Chromosome conformation capture-based studies reveal novel features of plant nuclear architecture. Current Opinion in Plant Biology 36, 149–157. DOI: 10.1016/j.pbi.2017.03.004.
Gualberto, J.M. and Newton, K.J. (2017) Plant mitochondrial genomes: dynamics and mechanisms of mutation. Annual Review of Plant Biology 68, 225–252. DOI: 10.1146/annurev-arplant-043015-112232.
Hara-Nishimura, I. and Hatsugai, N. (2011) The role of vacuole in plant cell death. Cell Death and Differentiation 18(8), 1298–1304. DOI: 10.1038/cdd.2011.70.
Hara-Nishimura, I., Shimada, T., Hatano, K., Takeuchi, Y. and Nishimura, M. (1998) Transport of storage proteins to protein storage vacuoles is mediated by large precursor-accumulating vesicles. The Plant Cell 10(5), 825–836. DOI: 10.1105/tpc.10.5.825.


Hawes, C. and Satiat-Jeunemaitre, B. (2005) The plant Golgi apparatus: going with the flow. Biochimica et Biophysica Acta 1744(2), 93–107. DOI: 10.1016/j.bbamcr.2005.03.009.
Hepler, P.K., Palevitz, B.A., Lancelle, S.A., McCauley, M.M. and Lichtscheidl, I. (1990) Cortical endoplasmic reticulum in plants. Journal of Cell Science 96(3), 355–373. DOI: 10.1242/jcs.96.3.355.
Hu, J., Baker, A., Bartel, B., Linka, N., Mullen, R.T. et al. (2012) Plant peroxisomes: biogenesis and function. The Plant Cell 24(6), 2279–2303. DOI: 10.1105/tpc.112.096586.
Ito, A., Sugita, C., Ichinose, M., Kato, Y., Yamamoto, H. et al. (2018) An evolutionarily conserved P-subfamily pentatricopeptide repeat protein is required to splice the plastid ndhA transcript in the moss Physcomitrella patens and Arabidopsis thaliana. The Plant Journal 94(4), 638–648. DOI: 10.1111/tpj.13884.
Jauh, G.Y., Phillips, T.E. and Rogers, J.C. (1999) Tonoplast intrinsic protein isoforms as markers for vacuolar functions. The Plant Cell 11(10), 1867–1882. DOI: 10.1105/tpc.11.10.1867.
Kamada, T., Nito, K., Hayashi, H., Mano, S., Hayashi, M. et al. (2003) Functional differentiation of peroxisomes revealed by expression profiles of peroxisomal genes in Arabidopsis thaliana. Plant & Cell Physiology 44(12), 1275–1289. DOI: 10.1093/pcp/pcg173.
Kanai, M., Hayashi, M., Kondo, M. and Nishimura, M. (2013) The plastidic DEAD-box RNA helicase 22, HS3, is essential for plastid functions both in seed development and in seedling growth. Plant & Cell Physiology 54(9), 1431–1440. DOI: 10.1093/pcp/pct091.
Kanai, M., Yamada, T., Hayashi, M., Mano, S. and Nishimura, M. (2019) Soybean (Glycine max L.) triacylglycerol lipase GmSDP1 regulates the quality and quantity of seed oil. Scientific Reports 9(1), 8924. DOI: 10.1038/s41598-019-45331-8.
Kao, Y.T., Gonzalez, K.L. and Bartel, B. (2018) Peroxisome function, biogenesis, and dynamics in plants. Plant Physiology 176(1), 162–177. DOI: 10.1104/pp.17.01050.
Karki, N., Johnson, B.S. and Bates, P.D. (2019) Metabolically distinct pools of phosphatidylcholine are involved in trafficking of fatty acids out of and into the chloroplast for membrane production. The Plant Cell 31(11), 2768–2788. DOI: 10.1105/tpc.19.00121.
Kelly, A.A., Shaw, E., Powers, S.J., Kurup, S. and Eastmond, P.J. (2013) Suppression of the SUGAR-DEPENDENT1 triacylglycerol lipase family during seed development enhances oil yield in oilseed rape (Brassica napus L.). Plant Biotechnology Journal 11(3), 355–361. DOI: 10.1111/pbi.12021.
Kim, M.J., Yang, S.W., Mao, H.Z., Veena, S.P., Yin, J.L. et al. (2014) Gene silencing of Sugar-dependent 1 (JcSDP1), encoding a patatin-domain triacylglycerol lipase, enhances seed oil accumulation in Jatropha curcas. Biotechnology for Biofuels 7(1), 36. DOI: 10.1186/1754-6834-7-36.
Kleffmann, T., von Zychlinski, A., Russenberger, D., Hirsch-Hoffmann, M., Gehrig, P. et al. (2007) Proteome dynamics during plastid differentiation in rice. Plant Physiology 143(2), 912–923. DOI: 10.1104/pp.106.090738.
Kretzschmar, F.K., Doner, N.M., Krawczyk, H.E., Scholz, P., Schmitt, K. et al. (2020) Identification of low-abundance lipid droplet proteins in seeds and seedlings. Plant Physiology 182(3), 1326–1345. DOI: 10.1104/pp.19.01255.
Lande, N.V., Barua, P., Gayen, D., Kumar, S., Varshney, S. et al. (2020) Dehydration-induced alterations in chloroplast proteome and reprogramming of cellular metabolism in developing chickpea delineate interrelated adaptive responses. Plant Physiology and Biochemistry 146, 337–348. DOI: 10.1016/j.plaphy.2019.11.034.
Li, L., Shimada, T., Takahashi, H., Ueda, H., Fukao, Y. et al. (2006) MAIGO2 is involved in exit of seed storage proteins from the endoplasmic reticulum in Arabidopsis thaliana. The Plant Cell 18(12), 3535–3547. DOI: 10.1105/tpc.106.046151.
Lingner, T., Kataya, A.R., Antonicelli, G.E., Benichou, A., Nilssen, K. et al. (2011) Identification of novel plant peroxisomal targeting signals by a combination of machine learning methods and in vivo subcellular targeting analyses. The Plant Cell 23(4), 1556–1572. DOI: 10.1105/tpc.111.084095.
Liu, Y. and Li, J. (2014) Endoplasmic reticulum-mediated protein quality control in Arabidopsis. Frontiers in Plant Science 5, 162. DOI: 10.3389/fpls.2014.00162.
Mai, N., Chrzanowska-Lightowlers, Z.M.A. and Lightowlers, R.N. (2017) The process of mammalian mitochondrial protein synthesis. Cell and Tissue Research 367(1), 5–20. DOI: 10.1007/s00441-016-2456-0.
Mano, S., Nakamori, C., Hayashi, M., Kato, A., Kondo, M. et al. (2002) Distribution and characterization of peroxisomes in Arabidopsis by visualization with GFP: dynamic morphology and actin-dependent movement. Plant & Cell Physiology 43(3), 331–341. DOI: 10.1093/pcp/pcf037.
Marty, F. (1999) Plant vacuoles. The Plant Cell 11(4), 587–600. DOI: 10.1105/tpc.11.4.587.


Matsushima, R., Kondo, M., Nishimura, M. and Hara-Nishimura, I. (2003b) A novel ER-derived compartment, the ER body, selectively accumulates a β-glucosidase with an ER-retention signal in Arabidopsis. The Plant Journal 33(3), 493–502. DOI: 10.1046/j.1365-313x.2003.01636.x.
Matsushima, R., Hayashi, Y., Yamada, K., Shimada, T., Nishimura, M. et al. (2003a) The ER body, a novel endoplasmic reticulum-derived structure in Arabidopsis. Plant & Cell Physiology 44(7), 661–666. DOI: 10.1093/pcp/pcg089.
Meier, I., Richards, E.J. and Evans, D.E. (2017) Cell biology of the plant nucleus. Annual Review of Plant Biology 68, 139–172. DOI: 10.1146/annurev-arplant-042916-041115.
Morré, D.J. and Mollenhauer, H.H. (1964) Isolation of the Golgi apparatus from plant cells. The Journal of Cell Biology 23, 295–305. DOI: 10.1083/jcb.23.2.295.
Morré, D.J., Keenan, T.W. and Morré, D.M. (1993) Golgi apparatus isolation and use in cell-free systems. Protoplasma 172(1), 12–26. DOI: 10.1007/BF01403717.
Nakai, K. and Kanehisa, M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14(4), 897–911. DOI: 10.1016/s0888-7543(05)80111-9.
Nakazaki, A., Yamada, K., Kunieda, T., Sugiyama, R., Hirai, M.Y. et al. (2019) Leaf endoplasmic reticulum bodies identified in Arabidopsis rosette leaves are involved in defense against herbivory. Plant Physiology 179(4), 1515–1524. DOI: 10.1104/pp.18.00984.
Nguema-Ona, E., Vicré-Gibouin, M., Gotté, M., Plancot, B., Lerouge, P. et al. (2014) Cell wall O-glycoproteins and N-glycoproteins: aspects of biosynthesis and function. Frontiers in Plant Science 5, 499. DOI: 10.3389/fpls.2014.00499.
Nikolovski, N., Rubtsov, D., Segura, M.P., Miles, G.P., Stevens, T.J. et al. (2012) Putative glycosyltransferases and other plant Golgi apparatus proteins are revealed by LOPIT proteomics. Plant Physiology 160(2), 1037–1051. DOI: 10.1104/pp.112.204263.
Nishimura, M., Takeuchi, Y., De Bellis, L. and Hara-Nishimura, I. (1993) Leaf peroxisomes are directly transformed to glyoxysomes during senescence of pumpkin cotyledons. Protoplasma 175(3–4), 131–137. DOI: 10.1007/BF01385011.
Nützmann, H.-W., Doerr, D., Ramírez-Colmenero, A., Sotelo-Fonseca, J.E., Wegel, E. et al. (2020) Active and repressed biosynthetic gene clusters have spatially distinct chromosome states. Proceedings of the National Academy of Sciences 117(24), 13800–13809. DOI: 10.1073/pnas.1920474117.
Ohlrogge, J. and Browse, J. (1995) Lipid biosynthesis. The Plant Cell 7(7), 957–970. DOI: 10.1105/tpc.7.7.957.
Ohnishi, M., Anegawa, A., Sugiyama, Y., Harada, K., Oikawa, A. et al. (2018) Molecular components of Arabidopsis intact vacuoles clarified with metabolomic and proteomic analyses. Plant & Cell Physiology 59(7), 1353–1362. DOI: 10.1093/pcp/pcy069.
Oikawa, A., Matsuda, F., Kikuyama, M., Mimura, T. and Saito, K. (2011) Metabolomics of a single vacuole reveals metabolic dynamism in an alga Chara australis. Plant Physiology 157(2), 544–551. DOI: 10.1104/pp.111.183772.
Parsons, H.T., Christiansen, K., Knierim, B., Carroll, A., Ito, J. et al. (2012) Isolation and proteomic characterization of the Arabidopsis Golgi defines functional and novel components involved in plant cell wall biosynthesis. Plant Physiology 159(1), 12–26. DOI: 10.1104/pp.111.193151.
Parthibane, V., Rajakumari, S., Venkateshwari, V., Iyappan, R. and Rajasekharan, R. (2012) Oleosin is bifunctional enzyme that has both monoacylglycerol acyltransferase and phospholipase activities. Journal of Biological Chemistry 287, 1946–1954. DOI: 10.1074/jbc.M111.309955.
Pereira, C., Pereira, S., Satiat-Jeunemaitre, B. and Pissarra, J. (2013) Cardosin A contains two vacuolar sorting signals using different vacuolar routes in tobacco epidermal cells. The Plant Journal 76, 87–100. DOI: 10.1111/tpj.12274.
Petrovska, B., Šebela, M. and Doležel, J. (2015) Inside a plant nucleus: discovering the proteins. Journal of Experimental Botany 66(6), 1627–1640. DOI: 10.1093/jxb/erv041.
Planchard, N., Bertin, P., Quadrado, M., Dargel-Graffin, C., Hatin, I. et al. (2018) The translational landscape of Arabidopsis mitochondria. Nucleic Acids Research 46, 6218–6228. DOI: 10.1093/nar/gky489.
Quan, S., Yang, P., Cassin-Ross, G., Kaur, N., Switzenberg, R. et al. (2013) Proteome analysis of peroxisomes from etiolated Arabidopsis seedlings identifies a peroxisomal protease involved in β-oxidation and development. Plant Physiology 163(4), 1518–1538. DOI: 10.1104/pp.113.223453.
Reumann, S. and Chowdhary, G. (2018) Prediction of peroxisomal matrix proteins in plants. In: del Río, L.A. and Schrader, M. (eds) Proteomics of Peroxisomes: Identifying Novel Functions and Regulatory Networks. Subcellular Biochemistry, Vol. 89. Springer, Singapore, pp. 125–138.

Plant Organellar Omics

121

Ridge, R.W., Uozumi, Y., Plazinski, J., Hurley, U.A. and Williamson, R.E. (1999) Developmental transitions and dynamics of the cortical ER of Arabidopsis cells seen with green fluorescent protein. Plant & Cell Physiology 40(12), 1253–1261. DOI: 10.1093/oxfordjournals.pcp.a029513. Reumann, S., Babujee, L., Ma, C., Wienkoop, S., Siemsen, T. et al. (2007) Proteome analysis of Arabidopsis leaf peroxisomes reveals novel targeting peptides, metabolic pathways, and defense mechanisms. The Plant Cell 19(10), 3170–3193. DOI: 10.1105/tpc.107.050989. Robles, P. and Quesada, V. (2017) Emerging roles of mitochondrial ribosomal proteins in plant development. International Journal of Molecular Sciences 18(12), 2595. DOI: 10.3390/ijms18122595. Rolland, N., Bouchnak, I., Moyet, L., Salvi, D. and Kuntz, M. (2018) The main functions of plastids. Methods in Molecular Biology 1829, 73–85. Saint-Jore, C.M., Evins, J., Batoko, H., Brandizzi, F., Moore, I. et al. (2002) Redistribution of membrane proteins between the Golgi apparatus and endoplasmic reticulum in plants is reversible and not dependent on cytoskeletal networks. The Plant Journal 29(5), 661–678. DOI: 10.1046/j.0960-7412.2002.01252.x. Schmid, M., Simpson, D.J., Sarioglu, H., Lottspeich, F. and Gietl, C. (2001) The ricinosomes of senescing plant tissue bud from the endoplasmic reticulum. Proceedings of the National Academy of Sciences 98(9), 5353–5358. DOI: 10.1073/pnas.061038298. Schmidt, U.G., Endler, A., Schelbert, S., Brunner, A., Schnell, M. et al. (2007) Novel tonoplast transporters identified using a proteomic approach with vacuoles isolated from cauliflower buds. Plant Physiology 145(1), 216–229. DOI: 10.1104/pp.107.096917. Schneider, S., Harant, D., Bachmann, G., Nägele, T., Lang, I. et al. (2019) Subcellular phenotyping: using proteomics to quantitatively link subcellular leaf protein and organelle distribution analyses of Pisum sativum cultivars. Frontiers in Plant Science 10, 638. DOI: 10.3389/fpls.2019.00638. 
Schneider, T., Schellenberg, M., Meyer, S., Keller, F., Gehrig, P. et al. (2009) Quantitative detection of changes in the leaf-mesophyll tonoplast proteome in dependency of a cadmium exposure of barley (Hordeum vulgare L.) plants. Proteomics 9(10), 2668–2677. DOI: 10.1002/pmic.200800806. Schultz, C., Gilson, P., Oxley, D., Youl, J. and Bacic, A. (1998) GPI-anchors on arabinogalactan-proteins: implications for signalling in plants. Trends in Plant Science 3(11), 426–431. DOI: 10.1016/ S1360-1385(98)01328-4. Shimada, T., Kuroyanagi, M., Nishimura, M. and Hara-Nishimura, I. (1997) A pumpkin 72-kda membrane protein of precursor-accumulating vesicles has characteristics of A vacuolar sorting receptor. Plant & Cell Physiology 38, 1414–1420. DOI: 10.1093/oxfordjournals.pcp.a029138. Shimada, T.L., Takagi, J., Ichino, T., Shirakawa, M. and Hara-Nishimura, I. (2018a) Plant vacuoles. Annual Review of Plant Biology 69(1), 123–145. DOI: 10.1146/annurev-arplant-042817-040508. Shimada, T.L., Hayashi, M. and Hara-Nishimura, I. (2018b) Membrane dynamics and multiple functions of oil bodies in seeds and leaves. Plant Physiology 176(1), 199–207. DOI: 10.1104/pp.17.01522. Shimaoka, T., Ohnishi, M., Sazuka, T., Mitsuhashi, N., Hara-Nishimura, I. et al. (2004) Isolation of intact vacuoles and proteomic analysis of tonoplast from suspension-cultured cells of Arabidopsis thaliana. Plant & Cell Physiology 45, 672–683. Shitan, N. and Yazaki, K. (2013) New insights into the transport mechanisms in plant vacuoles. International Review of Cell and Molecular Biology 305, 383–433. Silva-Alvim, F.A.L., An, J., Alvim, J.C., Foresti, O., Grippa, A. et al. (2018) Predominant golgi residency of the plant K/HDEL receptor is essential for its function in mediating ER retention. The Plant Cell 30, 2174–2196. Sinclair, R., Rosquete, M.R. and Drakakaki, G. (2018) Post-Golgi trafficking and transport of cell wall components. Frontiers in Plant Science 9, 1784. DOI: 10.3389/fpls.2018.01784. 
Skippington, E., Barkman, T.J., Rice, D.W. and Palmer, J.D. (2015) Miniaturized mitogenome of the parasitic plant Viscum scurruloideum is extremely divergent and dynamic and has lost all nad genes. Proceedings of the National Academy of Sciences 112, E3515–3524. Smith, D.R. and Keeling, P.J. (2015) Mitochondrial and plastid genome architecture: reoccurring themes, but significant differences at the extremes. Proceedings of the National Academy of Sciences 112, 10177–10184. DOI: 10.1073/pnas.1422049112. Solymosi, K., Lethin, J. and Aronsson, H. (2018) Diversity and plasticity of plastids in land plants. Methods in Molecular Biology 1829, 55–72. Sotelo-Silveira, M., Chávez Montes, R.A., Sotelo-Silveira, J.R., Marsch-Martínez, N. and de Folter, S. (2018) Entering the next dimension: plant genomes in 3D. Trends in Plant Science 23, 598–612. DOI: 10.1016/j.tplants.2018.03.014.

122

M. Kanai et al.

Spinelli, J.B. and Haigis, M.C. (2018) The multifaceted contributions of mitochondria to cellular metabolism. Nature Cell Biology 20, 745–754. DOI: 10.1038/s41556-018-0124-1. Staehelin, L.A. (1997) The plant ER: a dynamic organelle composed of a large number of discrete functional domains. The Plant Journal 11, 1151–1165. DOI: 10.1046/j.1365-313X.1997.11061151.x. Suzuki, M., Takahashi, S., Kondo, T., Dohra, H., Ito, Y. et al. (2015) Plastid proteomic analysis in tomato fruit development. PLOS ONE 10(9), e0137266. DOI: 10.1371/journal.pone.0137266. Tamura, K., Shimada, T., Kondo, M., Nishimura, M. and Hara-Nishimura, I. (2005) KATAMARI1/MURUS3 Is a novel Golgi membrane protein that is required for endomembrane organization in Arabidopsis. The Plant Cell 17(6), 1764–1776. DOI: 10.1105/tpc.105.031930. Tamura, K., Shimada, T., Ono, E., Tanaka, Y., Nagatani, A. et al. (2003) Why green fluorescent fusion proteins have not been observed in the vacuoles of higher plants. The Plant Journal 35(4), 545–555. DOI: 10.1046/j.1365-313x.2003.01822.x. Tamura, K., Yamada, K., Shimada, T. and Hara-Nishimura, I. (2004) Endoplasmic reticulum-resident proteins are constitutively transported to vacuoles for degradation. The Plant Journal 39(3), 393–402. DOI: 10.1111/j.1365-313X.2004.02141.x. Tamura, K., Takahashi, H., Kunieda, T., Fuji, K., Shimada, T. et al. (2007) Arabidopsis KAM2/GRV2 is required for proper endosome formation and functions in vacuolar sorting and determination of the embryo growth axis. The Plant Cell 19(1), 320–332. DOI: 10.1105/tpc.106.046631. Tang, W. and Luo, C. (2018) Molecular and functional diversity of RNA editing in plant mitochondria. Molecular Biotechnology 60(12), 935–945. DOI: 10.1007/s12033-018-0126-z. Toyooka, K., Okamoto, T. and Minamikawa, T. (2000) Mass transport of proform of a KDEL-tailed cysteine proteinase (SH-EP) to protein storage vacuoles by endoplasmic reticulum-derived vesicle is involved in protein mobilization in germinating seeds. 
The Journal of Cell Biology 148(3), 453–464. DOI: 10.1083/jcb.148.3.453. Ueda, H., Yokota, E., Kutsuna, N., Shimada, T., Tamura, K. et al. (2010) Myosin-dependent endoplasmic reticulum motility and F-actin organization in plant cells. Proceedings of the National Academy of Sciences 107(15), 6894–6899. DOI: 10.1073/pnas.0911482107. Uemura, T., Ueda, T., Ohniwa, R.L., Nakano, A., Takeyasu, K. et al. (2004) Systematic analysis of SNARE molecules in Arabidopsis: dissection of the post-Golgi network in plant cells. Cell Structure and Function 29(2), 49–65. DOI: 10.1247/csf.29.49. Vitale, A. and Denecke, J. (1999) The endoplasmic reticulum – Gateway of the secretory pathway. The Plant Cell 11(4), 615–628. DOI: 10.1105/tpc.11.4.615. von Heijne, G. (1986) A new method for predicting signal sequence cleavage sites. Nucleic Acids Research 14(11), 4683–4690. DOI: 10.1093/nar/14.11.4683. Waltz, F., Soufari, H., Bochler, A., Giegé, P. and Hashem, Y. (2020) Cryo-EM structure of the RNA-rich plant mitochondrial ribosome. Nature Plants 6(4), 377–383. DOI: 10.1038/s41477-020-0631-5. Wang, C., Yan, X., Chen, Q., Jiang, N., Fu, W. et al. (2013) Clathrin light chains regulate clathrin-mediated trafficking, auxin signaling, and development in Arabidopsis. The Plant Cell 25(2), 499–516. DOI: 10.1105/tpc.112.108373. Wang, J., Wang, Y., Gao, C., Jiang, L. and Guo, D. (2017) PPero, a computational model for plant PTS1 type peroxisomal protein prediction. PLOS ONE 12(1), e0168912. DOI: 10.1371/journal.pone. 0168912. Wang, Y., Li, X., Liu, N., Wei, S., Wang, J. et al. (2020) The iTRAQ-based chloroplast proteomic analysis of Triticum aestivum L. leaves subjected to drought stress and 5-aminolevulinic acid alleviation reveals several proteins involved in the protection of photosynthesis. BMC Plant Biology 20(1), 96. DOI: 10.1186/s12870-020-2297-6. Watson, S.J., Sowden, R.G. and Jarvis, P. (2018) Abiotic stress-induced chloroplast proteome remodelling: a mechanistic overview. 
Journal of Experimental Botany 69(11), 2773–2781. DOI: 10.1093/jxb/ ery053. Xiang, L., Etxeberria, E. and Van den Ende, W. (2013) Vacuolar protein sorting mechanisms in plants. The FEBS Journal 280(4), 979–993. DOI: 10.1111/febs.12092. Yamada, K., Goto-Yamada, S., Nakazaki, A., Kunieda, T., Kuwata, K. et al. (2020) Endoplasmic reticulum- derived bodies enable a single- cell chemical defense in Brassicaceae plants. Communications Biology 3(1), 21. DOI: 10.1038/s42003-019-0739-1. Yamada, K., Shimada, T., Kondo, M., Nishimura, M. and Hara-Nishimura, I. (1999) Multiple functional proteins are produced by cleaving Asn-Gln bonds of a single precursor by vacuolar processing enzyme. The Journal of Biological Chemistry 274(4), 2563–2570. DOI: 10.1074/jbc.274.4.2563.

Plant Organellar Omics

123

Yoshimoto, K. and Ohsumi, Y. (2018) Unveiling the molecular mechanisms of plant autophagy-from autophagosomes to vacuoles in plants. Plant & Cell Physiology 59(7), 1337–1344. DOI: 10.1093/ pcp/pcy112. Yura, K., Sulaiman, S., Hatta, Y., Shionyu, M. and Go, M. (2009) RESOPS: a database for analyzing the correspondence of RNA editing sites to protein three-dimensional structures. Plant & Cell Physiology 50(11), 1865–1873. DOI: 10.1093/pcp/pcp132.

9 Plant Cis-elements and Transcription Factors

Chi-Nga Chow1, Kuan-Chieh Tseng2 and Wen-Chi Chang1,2*

1College of Biosciences and Biotechnology, Institute of Tropical Plant Sciences and Microbiology, National Cheng Kung University, Tainan, Taiwan; 2Department of Life Sciences, National Cheng Kung University, Tainan, Taiwan

Abstract

High-throughput sequencing and computational techniques are now indispensable tools for interpreting the regulatory relationships between transcription factors (TFs) and their downstream target genes. Many analysis tools and online resources have also been developed to assist the construction of gene regulatory networks (GRNs). In this chapter, we illustrate the wet-lab and dry-lab approaches used in the construction of GRNs, online databases for plant TFs and cis-elements, and advanced analysis of GRNs. At the end of this chapter, we discuss prospects for studies of TFs and cis-elements. Overall, this chapter summarizes the basic framework and public resources for the construction of GRNs in plants.

*Corresponding author: sarah321@mail.ncku.edu.tw

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0009

9.1 Introduction

Plants are important resources for human life. As producers, plants also have irreplaceable roles in ecosystems. Unlike animals, plants cannot escape the dangers posed by environmental changes. Plants therefore require accurate control of gene expression to respond to internal and external stimuli. Advances in high-throughput sequencing help scientists to decipher the structures and functions of plant genomes. Many functional proteins and genes involved in gene regulatory networks (GRNs) have been discovered. Within GRNs, transcription factors (TFs) are among the most important regulators of gene transcription. Although TFs generally comprise only 5–10% of protein-coding genes in plants, they have a significant impact on the hundreds to thousands of target genes downstream in GRNs (Riechmann et al., 2000; Kummerfeld and Teichmann, 2006). The major role of a TF is to activate or repress transcription of its target genes by recognizing and binding specific DNA sequences, called cis-elements, that are commonly located in the upstream regions of target genes (i.e., promoter regions) (Yu et al., 2016). Based on promoter deletion analysis, studies have reported that a TF can tolerate sequence variants at some positions of a cis-element while other positions are conserved (Karthikeyan et al., 2009; Duan et al., 2016). Such TF DNA-binding specificity provides cues for capturing the possible cis-elements of target genes. Besides DNA composition, the three-dimensional structure and spatial effects of TFs are used as features to improve the accuracy and sensitivity of cis-element prediction models, but they increase the difficulty of model construction (Yamasaki et al., 2013). In addition to TF DNA-binding specificity, the spatio-temporal dynamics of TFs further complicate the prediction of target cis-elements.

Previous studies indicate that the construction of GRNs and their regulatory TFs augments our understanding of plant immunity, circadian clocks, stress response, and developmental processes (Yu et al., 2015; Hernando et al., 2017; Hickman et al., 2017; Ohama et al., 2017). In this chapter, we describe the common wet-lab and dry-lab approaches applied to identify cis-elements and to construct GRNs. Several studies and databases related to TFs and cis-elements are described according to plant species. Furthermore, we discuss concepts to improve the current GRN prediction models and a prospective view of studies of TFs and cis-elements. An overview of this chapter is shown in Fig. 9.1.


Fig. 9.1. Key aspects, approach, basic information, and profile of gene regulation by TFs. (A) Four key aspects in gene regulation include TFs (regulators), cis-elements (TF binding sites), target genes, and gene regulatory network. (B) Two different approaches are centered on protein and sequence. The arrows point to possible main purposes of the approaches. (C) The accessible information and profiles related to gene regulation.

9.2 Methods to Infer TF–DNA Interactions

9.2.1 Wet-lab approaches

The interactions between TFs and cis-elements trigger altered expression of target genes. The clustering of binding sites for a TF enables not only the construction of gene regulatory networks but also the characterization of the TF DNA-binding specificity. This section illustrates the basic principles of five high-throughput techniques that are commonly used to explore large-scale interactions between TFs and cis-elements. The advantages and limitations of each experimental technique are shown in Table 9.1.

Chromatin immunoprecipitation sequencing (ChIP-seq) is the most widely used high-throughput technique for deciphering the genomic landscapes of TFs in vivo (Kaufmann et al., 2010; Yamaguchi et al., 2014; Muhammad et al., 2019). ChIP-seq is designed to capture the genomic DNA fragments that are bound by the TF of interest. This technology pulls down the TF–DNA complex with anti-TF antibodies, followed by sequencing of the fragments. The DNA sequences obtained by ChIP-seq are expected to reflect not only the functional cis-elements but also the tissue- or cell-specific dynamic genomic features, such as DNA modification, chromatin states, and histone modifications. These genomic features lead to differential accessibility of chromatin, which is crucial information for integrating the spatio-temporal interaction of TFs and cis-elements into GRNs. However, the application of ChIP-seq is often hindered by the high costs of protocol and library development, and by the quality of the antibody. For example, ChIP-seq may fail because of low antibody quality, leading to nonspecific binding or low affinity, or because of the inherently transient expression of regulatory TFs upstream in a GRN.

To circumvent these limitations, alternative methods using recombinant TFs, namely the systematic evolution of ligands by exponential enrichment (SELEX) and protein-binding microarrays (PBMs), have been proposed (Lai et al., 2019). SELEX is implemented on purified or recombinant TFs with a series of synthetic DNA sequences (14–100 bp) (Djordjevic, 2007; Jolma et al., 2010), while PBM is based on epitope-tagged TFs and DNA oligonucleotide arrays (10–12-mer sequences) (Berger et al., 2006; Berger and Bulyk, 2009). Compared with the resolution of ChIP-seq data (ranging from 100 to 800 bp), SELEX and PBM provide higher resolution of TF binding sites (Muhammad et al., 2019). However, the synthetic DNA sequences lack genomic features. This makes it challenging to reproduce the TF–DNA interactions that occur in plant cells and leads to false-positive bindings. To overcome such limitations, O’Malley et al. developed DNA affinity purification sequencing (DAP-seq), which replaces synthetic DNA sequences with a genomic DNA library (100–400 bp fragments) (O’Malley et al., 2016; Bartlett et al., 2017). Compared with ChIP-seq, DAP-seq is a low-cost approach that retains partial genomic features, such as DNA modifications and genomic context.

The techniques described above are suited to studies that aim to infer the possible cis-elements for a TF of interest. The limitations of each technique should be considered during experimental design. In addition, many reviews advise using both in vivo and in vitro methods to infer TF–DNA interactions (Geertz and Maerkl, 2010; Lai et al., 2019).

The interactions between TFs and cis-elements can be deciphered by TF-centered or gene-centered approaches (Arda and Walhout, 2010; Fuxman Bass et al., 2016). Most high-throughput techniques are TF-centered (protein-centered) and use one or a few TFs of interest to predict the possible TF-binding DNA sequences. Only a few techniques are gene-centered (sequence-centered), identifying the TFs that potentially regulate a sequence of interest. The yeast one-hybrid (Y1H) assay is widely used for deducing both TF-centered and gene-centered interactions (Ou et al., 2011; Fuxman Bass et al., 2016). Briefly, Y1H is composed of two parts: DNA sequences, ranging from 200 bp to 3 kbp, as baits, and the selected TFs as prey. If the TFs interact with the DNA sequences, activation of the auxotrophic or colorimetric reporter genes is detected on selective medium. The Y1H assay is low-cost and convenient for biological and technical replicates. To exploit Y1H in high-throughput screening, recent studies performed Y1H on a robotic mating and arraying platform to reduce the labor-intensive and time-consuming work of yeast transformation (Reece-Hoyes et al., 2011; Yeh et al., 2019). As the preliminary step comprises the construction of a TF prey library, these systems are not suitable for species that lack genome sequencing and TF annotation. This is a major challenge for existing gene-centered methods. Moreover, TF–DNA interactions that require post-translational modifications or co-factors for TF activation are not detected in Y1H. This challenge also arises in other techniques that use recombinant TFs.

Table 9.1. Advantages and limitations of experimental techniques.

| Technique | Advantages | Limitations |
| --- | --- | --- |
| Chromatin immunoprecipitation sequencing (ChIP-seq) | Captures DNA sequences with genomic features, e.g., DNA modification, chromatin states, and histone modifications; in vivo interaction | High cost of constructing protocols and library; failures due to unstable quality of antibodies or low TF protein expression; requirement of reference genome |
| Systematic evolution of ligands by exponential enrichment (SELEX) | High resolution (14–100 bp sequence) | No genomic features for synthetic DNA sequences |
| Protein-binding microarrays (PBM) | High resolution (10–12-mer sequence); effective method to examine a wide range of TFs | No genomic features for synthetic DNA array |
| DNA affinity purification sequencing (DAP-seq) | Inexpensive method; DNA library harboring partial genomic DNA features, e.g., DNA modification | High failure rate in some TF families, e.g., MADS-box |
| Yeast one-hybrid (Y1H) | Low-cost method; applications in both TF-centered and gene-centered studies | Challenging to construct TF prey library in species without whole-genome sequencing; failure when the post-translational modifications required for TF activation are absent in yeast |

9.2.2 Dry-lab approaches

In the past decades, molecular biotechnology has advanced rapidly. However, TF DNA-binding specificity and gene regulatory networks remain largely unknown (Weirauch et al., 2014). Only a small proportion of plant TFs are well characterized. Although high-throughput techniques advance the investigation of TFs and their target genes, great challenges still exist in big data analysis, prediction model construction, and application to non-model plants. Previous studies propose that computational and statistical techniques, i.e., dry-lab approaches, are potential toolkits for identifying cis-elements and reconstructing gene regulatory networks (Das and Dai, 2007; Kim et al., 2009). The workflow of dry-lab procedures is shown in Fig. 9.2.

Fig. 9.2. The workflow of dry-lab approaches. The high-throughput binding sequences are generated from wet-lab experiments, such as ChIP-seq and SELEX. Such sequences are usually converted into a TF binding pattern (matrix) to represent TF binding specificity. During the cis-element prediction, sequences of interest are screened and selected by a motif scanner.

Cis-element prediction depends on two key steps: (i) extraction of TF DNA-binding specificity; and (ii) motif scanning. In the motif extraction step, the position weight matrix (PWM) is the most common model used to represent the consensus sequence of TF binding sites (Lai et al., 2019). During the generation of PWMs, the sequences are aligned on their functional centers, i.e., the part important for TF recognition, and then the occurrences of each nucleotide are counted at each position. A PWM is easy to read once converted to a sequence logo. However, a PWM treats each position independently, which means that the probability of a nucleotide at one position is not affected by the preceding and following positions. Furthermore, a PWM is not optimal for converting sequences of different lengths, since core sequences may intermingle with flanking sequences. To overcome such hurdles, several algorithms have been introduced that allow binding matrices to record more features (Chekmenev et al., 2005; Siddharthan, 2010; Yang and Ramsey, 2015). One widely discussed model is the TF Flexible Model (TFFM), which uses hidden Markov models to incorporate flexible-length sequences and position interdependence into the binding pattern (Mathelier and Wasserman, 2013). Here, we only describe PWM and TFFM, which are used routinely in the public databases and are generated conveniently by existing processing tools. Lai et al. (2019) reviewed additional models applied to discover TF binding motifs.

Another important process that influences the performance of TF binding site prediction is motif scanning, which determines whether sequences of interest (promoters) harbor cis-elements for TF binding. Several motif scanners are accessible from public resources, such as FIMO (Grant et al., 2011), RSAT (Patser and matrix-scan) (Hertz and Stormo, 1999; Turatsinze et al., 2008), and PoSSuMsearch (Beckstette et al., 2006). These tools use an individual PWM to identify TF binding sites across the sequences of interest. All of these tools use random sequence databases to measure the significance of scanned cis-elements. It should be noted that if the random sequence database is built by shuffling human or mammalian promoters, the threshold, i.e., p-value or e-value, may not be appropriate for plant-specific cis-elements.

Although confirmation experiments are still required to validate the algorithms and their results, dry-lab approaches can help scientists reduce the time and costs of preliminary experiments. These approaches also enable scientists to explore characteristics hidden in the vast high-throughput data, such as stress-responsive mechanisms, dynamic gene regulation during development, and orthologous conservation of TF DNA-binding specificity (Naika et al., 2013; Weirauch et al., 2014; Yu et al., 2015; Haque et al., 2019).
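The two dry-lab steps described in this section, building a PWM from aligned binding sites and scanning a promoter with it, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any tool cited here: the aligned sites, the promoter sequence, the uniform background, the pseudocount, and the score threshold are all hypothetical values chosen for the example, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """Count each nucleotide at each position of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("aligned sites must share one length")
    counts = [{base: 0 for base in BASES} for _ in range(length)]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts

def log_odds_pwm(counts, background=None, pseudocount=1.0):
    """Convert counts to log2(frequency / background) weights per position."""
    # Assumption: uniform background; a plant-specific background may differ.
    background = background or {base: 0.25 for base in BASES}
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for forward-strand windows scoring >= threshold."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        # Positions are scored independently and summed, as in a plain PWM.
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 3)))
    return hits

# Hypothetical aligned binding sites and promoter, for illustration only.
sites = ["CACGTG", "CACGTG", "CACGTT", "TACGTG", "CACGTG"]
pwm = log_odds_pwm(count_matrix(sites))
promoter = "TTGACCACGTGGCATAAACGTTT"
hits = scan(promoter, pwm, threshold=6.0)
for start, window, score in hits:
    print(start, window, score)
```

Scanners such as FIMO additionally convert scores into p-values against a background model; the fixed score cutoff used here is a simplification, and, as noted above, the choice of background matters when the target is a plant promoter.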

9.3 Related Databases for TFs and Cis-elements

9.3.1 TF-related databases

TF DNA-binding specificity is largely determined by the DNA-binding domains (DBDs) of a TF (Weirauch et al., 2014). Moreover, the type of TF, i.e., activator or repressor, is important information for dissecting the mechanisms of gene regulation. To obtain such protein-based information for a given TF, literature reviews and online database queries are usually required. In the scientific community, UniProt is a widely used database of comprehensive information about protein sequences, protein domains, and functional annotation, including TFs (UniProt Consortium, 2019). It is also an optimal resource for obtaining further protein information via external database linkages, such as protein–protein interactions, protein three-dimensional structures, and gene expression. However, the number of plant species stored in UniProt is limited. To overcome this problem, PlantTFDB was created to annotate all TFs with TF families, gene ontology, plant ontology, and orthologous groups for a wide range of plant species (Jin et al., 2017). To reconstruct GRNs, TFs and their binding matrices from high-throughput sequencing are integrated into PlantPAN3.0. In addition, TF information in PlantPAN3.0 includes the spatio-temporal profile of TFs and regulatory types, as well as structure-based and protein sequence-based annotation (Chow et al., 2019).

9.3.2 Cis-element-related databases A repository of cis-e lements is crucial for investigating TF DNA- b inding specificity and predicting TF binding sites. To provide user-f riendly and effective platforms, several studies have built web- b ased databases to facilitate searching and downloading TF DNA-b inding patterns (Table 9.2). For instance, JASPAR is a frequently updated TFBS repository which collects PWMs and TFFMs from literature and high-t hroughput datasets (Fornes et al., 2020). Among the resources developed for Arabidopsis thaliana, AthaMAP supports the genome- wide cis-e lements and post- t ranscriptional regulations of small RNA and microRNA to infer gene regulation (Hehl et al., 2016). AGRIS provides a manual literature-b ased collection of cis-e lements and target genes of Arabidopsis TFs (Yilmaz et al., 2011). Weirauch et al. (2014) proposed that TFs with high DBD similarity generally have very similar TF DNA- b inding patterns. In their database, CIS- B P, the inference thresholds were established to infer TF DNA-b inding patterns of large TFs derived from the experimentally verified PWMs of a small set of TFs. Such a strategy is beneficial for extending cis-e lement predictions into non-m odel plants. Moreover, Plant Cistrome Database generates the PWMs from DAP-s eq data and provides the genomic landscape of TF bindings for Arabidopsis thaliana and Zea mays (O’Malley et al., 2016). PlantPAN3.0 not only incorporates the PWMs from ten resources (e.g., TRANSFAC, PLACE, UniPROBE) (Higo et al., 1999; Matys et al., 2006; Hume et al., 2015), but also provides further functional assays for a group of target genes. For example, co- o ccurrence of TFs through co- o ccupancy of binding sites and protein–protein interactions are investigated to reconstruct GRNs (Chow

130

C.-N. Chow, K.-C. Tseng and W.-C. Chang

Table 9.2. The resources for collections of TF binding patterns. Database name

Features

Reference

JASPAR

PWMs and TFFMs for 20 plant species Customized promoter analysis

Fornes et al., 2020

AthaMAP

Genome-wide predicted TF binding sites for Arabidopsis thaliana Small RNA and microRNA target search

Hehl et al., 2016

AGRIS

Literature-based collection of cis-elements and target genes of Arabidopsis TFs cis-elements search for promoters of Arabidopsis thaliana (both experimentally varied and predicted)

Yilmaz et al., 2011

PlantPAN3.0

PWMs for 78 plant species Genome-wide TF binding sites search for 7 plant species (both experimentally varied and predicted) Customized promoter analysis Co-occurrence of TF binding sites

Chow et al., 2019

CIS-BP

Experimentally varied and inferred PWMs for 44 plant species

Weirauch et al., 2014

Plant Cistrome Database

PWMs for Arabidopsis thaliana and Zea mays Genome-wide landscape of DAP-seq data

O’Malley et al., 2016

et al., 2019). All databases described above provide download functions for the collection of TF DNA-binding patterns (Table 9.2). The promoter analysis function is available in most databases to identify cis-elements in customized sequences. With constantly increasing numbers of high-throughput datasets, many databases specifically focus on collecting ChIP-seq data, such as ChIPBase v2.0 (Zhou et al., 2017), Expresso (Aghamirzaie et al., 2017), ReMap 2020 (Chèneby et al., 2020), and PCSD (Liu et al., 2018). ChIP-seq databases contain the preference of cis-elements under different tissues and conditions. Furthermore, some databases integrate TF ChIP-seq data and other in vivo experimental data regarding epigenetic features. For instance, the genomic landscapes of TFs, histone modification marks, and other DNA-binding proteins (e.g., ARGONAUTE 4) from seven plant species are available in PCBase (part of PlantPAN3.0) (Chow et al., 2019). PlantRegMap provides not only the genomic landscape of TF binding sites but also the genomic annotation of chromatin accessibility by DNase-seq and MNase-seq data (Tian et al., 2020).
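As a rough illustration of what a promoter analysis function does with a stored PWM, the following minimal sketch converts a count matrix into log-odds scores and scans a sequence for windows above a threshold. The count matrix, pseudocount, uniform background, threshold, and function names are all invented for this example and are not taken from any of the databases above. Only the forward strand is scanned.

```python
import math

# Hypothetical count matrix for a 4-bp motif (counts per position).
# The numbers are invented for illustration only.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def counts_to_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert per-position counts into log2 odds scores (a simple PWM)."""
    length = len(counts["A"])
    pwm = []
    for pos in range(length):
        total = sum(counts[base][pos] for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2(((counts[base][pos] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for forward-strand windows at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = counts_to_pwm(COUNTS)
hits = scan("TTACGTGGACGTAA", pwm, threshold=5.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

A production scanner would also score the reverse complement and derive the threshold from a p-value or a background model rather than a fixed cutoff.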

9.4 Advanced Analysis in GRNs

After in silico prediction of cis-elements and reconstruction of GRNs, researchers often face high false-positive rates, because GRNs are usually dynamic under different conditions. A predicted regulation may therefore fail experimental verification simply because the wrong tissue or condition was tested. A prevalent strategy to address this issue is the use of co-expression analysis in the construction of GRNs (Vandepoele et al., 2009; Vermeirssen et al., 2014; Yu et al., 2015). This analysis is based on the theory that genes regulated by the same regulators typically show similar functions and expression. Co-expression analysis is available in several databases such as ATTED-II, TF2Network, ChIPBase v2.0, and PlantPAN3.0 (Zhou et al., 2017; Kulkarni et al., 2018; Obayashi et al., 2018; Chow et al., 2019). As GRNs often consist of hundreds of TFs and genes, it is critical to investigate what kinds of functions and pathways these genes are involved in. To this end, enrichment analysis of biological functions, such as gene ontology terms, is a powerful approach to elucidate representative biological mechanisms of GRNs (Vermeirssen et al., 2014; Defoort et al., 2018). Identification of

Plant Cis-elements and Transcription Factors


Fig. 9.3. Prospective view on studies of gene regulation. TAD, topologically associating domains; coF, co-factor; TF, transcription factor.


central hub TFs (key regulators) in a GRN is a good way to retrieve important candidates for further experimental validation (Vermeirssen et al., 2014; Ikeuchi et al., 2018). Overall, these approaches increase confidence and effectiveness for studying gene regulatory mechanisms.
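The two analyses mentioned in this section can be sketched in a few lines. The example below assumes that enrichment is tested with an upper-tail hypergeometric probability and that a hub TF is simply a TF with many distinct targets (high out-degree). Both are common choices, but they are assumptions here, not the specific methods of the cited studies, and the function names and toy numbers are invented.

```python
from collections import Counter
from math import comb


def enrichment_p_value(n_background, n_annotated, n_network, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes in the background set
    n_annotated:  background genes carrying the term (e.g., a GO term)
    n_network:    genes in the GRN being tested
    n_overlap:    GRN genes carrying the term
    """
    upper = min(n_annotated, n_network)
    favourable = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_network - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_network)


def rank_hub_tfs(regulatory_edges):
    """Rank TFs by out-degree, i.e., the number of distinct target genes."""
    out_degree = Counter(tf for tf, _target in sorted(set(regulatory_edges)))
    return out_degree.most_common()


# Invented toy numbers: 20 background genes, 5 annotated, 4 in the GRN, 3 overlapping.
p_value = enrichment_p_value(20, 5, 4, 3)

# Invented (TF, target) pairs; the duplicated edge is counted once.
edges = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("TF2", "g1"), ("TF1", "g1")]
hubs = rank_hub_tfs(edges)
print(p_value, hubs)
```

When many terms are tested at once, the resulting p-values would normally be corrected for multiple testing before interpretation.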

9.5 Prospective View on Studies of Gene Regulation

The physical interaction between TFs and cis-elements is the final step of gene regulation. In plant cells, gene regulation is a complex process influenced by multiple factors, such as chromosome folding, chromatin structure, and DNA modification (Dong et al., 2020b; Eriksson et al., 2020). Here, we briefly describe the genomic features that can be determined by high-throughput experiments (Fig. 9.3). Using chromosome conformation capture techniques, studies have revealed that chromosomes are not randomly distributed in nuclei (Dong et al., 2020b). According to their interactions and spatial organization, chromatin regions are classified into two types, euchromatin and heterochromatin, which are related to positive and negative gene regulation, respectively. Each chromatin region can be further divided into

functional units, i.e., topologically associating domains (TADs), which confine the interactions between genes and their distal cis-elements. A previous study also found that TADs, especially the regions at TAD borders, host transcriptional activation of genes across different tissues (Dong et al., 2020a). Additionally, histone ChIP-seq, DNase-seq, and ATAC-seq are frequently used to explore the chromatin states, chromatin accessibility, and TF footprinting of plant genomes (Sullivan et al., 2014; Maher et al., 2018; Lai et al., 2019). Overall, these high-throughput data allow the histone marks on packaged DNA (active or repressive) and the spatial organization of chromatin around cis-elements and target genes of TFs to be deduced. With growing numbers of plant studies, we believe that it is valuable to accumulate such features systematically for inferring interactions between cis-elements and TFs. Additionally, DNA structure has been reported to influence TF DNA-binding specificity across different TF families (Gordân et al., 2013; Muiño et al., 2014; Dror et al., 2015). Recent studies suggest that DNA structures contribute to TF recognition of non-sequence-specific DNA elements (Zhou et al., 2015; Yang et al., 2019). Overall, these findings suggest that DNA structure can help predict interactions between TFs and cis-elements.

References

Aghamirzaie, D., Raja Velmurugan, K., Wu, S., Altarawy, D., Heath, L.S. et al. (2017) Expresso: a database and web server for exploring the interaction of transcription factors and their target genes in Arabidopsis thaliana using ChIP-Seq peak data. F1000Research 6, 372. DOI: 10.12688/f1000research.10041.1.
Arda, H.E. and Walhout, A.J.M. (2010) Gene-centered regulatory networks. Briefings in Functional Genomics 9(1), 4–12. DOI: 10.1093/bfgp/elp049.
Bartlett, A., O’Malley, R.C., Huang, S.-S.C., Galli, M., Nery, J.R. et al. (2017) Mapping genome-wide transcription-factor binding sites using DAP-seq. Nature Protocols 12(8), 1659–1672. DOI: 10.1038/nprot.2017.055.
Beckstette, M., Homann, R., Giegerich, R. and Kurtz, S. (2006) Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 7, 389. DOI: 10.1186/1471-2105-7-389.
Berger, M.F. and Bulyk, M.L. (2009) Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature Protocols 4(3), 393–411. DOI: 10.1038/nprot.2008.195.
Berger, M.F., Philippakis, A.A., Qureshi, A.M., He, F.S., Estep, P.W., 3rd et al. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 24(11), 1429–1435. DOI: 10.1038/nbt1246.


Chekmenev, D.S., Haid, C. and Kel, A.E. (2005) P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Research 33(Web Server issue), W432–7. DOI: 10.1093/nar/gki441.
Chèneby, J., Ménétrier, Z., Mestdagh, M., Rosnet, T., Douida, A. et al. (2020) ReMap 2020: a database of regulatory regions from an integrative analysis of human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Research 48(D1), D180–D188. DOI: 10.1093/nar/gkz945.
Chow, C.-N., Lee, T.-Y., Hung, Y.-C., Li, G.-Z., Tseng, K.-C. et al. (2019) PlantPAN3.0: a new and updated resource for reconstructing transcriptional regulatory networks from ChIP-seq experiments in plants. Nucleic Acids Research 47(D1), D1155–D1163. DOI: 10.1093/nar/gky1081.
Das, M.K. and Dai, H.K. (2007) A survey of DNA motif finding algorithms. BMC Bioinformatics 8 Suppl 7, S21. DOI: 10.1186/1471-2105-8-S7-S21.
Defoort, J., Van de Peer, Y. and Vermeirssen, V. (2018) Function, dynamics and evolution of network motif modules in integrated gene regulatory networks of worm and plant. Nucleic Acids Research 46(13), 6480–6503. DOI: 10.1093/nar/gky468.
Djordjevic, M. (2007) SELEX experiments: new prospects, applications and data analysis in inferring regulatory pathways. Biomolecular Engineering 24(2), 179–189. DOI: 10.1016/j.bioeng.2007.03.001.
Dong, P., Tu, X., Li, H., Zhang, J., Grierson, D. et al. (2020a) Tissue-specific Hi-C analyses of rice, foxtail millet and maize suggest non-canonical function of plant chromatin domains. Journal of Integrative Plant Biology 62(2), 201–217. DOI: 10.1111/jipb.12809.
Dong, P., Tu, X., Liang, Z., Kang, B.-H., Zhong, S. et al. (2020b) Plant and animal chromatin three-dimensional organization: similar structures but different functions. Journal of Experimental Botany 71(17), 5119–5128. DOI: 10.1093/jxb/eraa220.
Dror, I., Golan, T., Levy, C., Rohs, R. and Mandel-Gutfreund, Y. (2015) A widespread role of the motif environment in transcription factor binding across diverse protein families. Genome Research 25(9), 1268–1280. DOI: 10.1101/gr.184671.114.
Duan, Y.-B., Li, J., Qin, R.-Y., Xu, R.-F., Li, H. et al. (2016) Identification of a regulatory element responsible for salt induction of rice OsRAV2 through ex situ and in situ promoter analysis. Plant Molecular Biology 90(1–2), 49–62. DOI: 10.1007/s11103-015-0393-z.
Eriksson, M.C., Szukala, A., Tian, B. and Paun, O. (2020) Current research frontiers in plant epigenetics: an introduction to a virtual issue. The New Phytologist 226(2), 285–288. DOI: 10.1111/nph.16493.
Fornes, O., Castro-Mondragon, J.A., Khan, A., van der Lee, R., Zhang, X. et al. (2020) JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Research 48(D1), D87–D92. DOI: 10.1093/nar/gkz1001.
Fuxman Bass, J.I., Reece-Hoyes, J.S. and Walhout, A.J.M. (2016) Gene-centered yeast one-hybrid assays. Cold Spring Harbor Protocols 2016(12). DOI: 10.1101/pdb.top077669.
Geertz, M. and Maerkl, S.J. (2010) Experimental strategies for studying transcription factor-DNA binding specificities. Briefings in Functional Genomics 9(5–6), 362–373. DOI: 10.1093/bfgp/elq023.
Gordân, R., Shen, N., Dror, I., Zhou, T., Horton, J. et al. (2013) Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Reports 3(4), 1093–1104. DOI: 10.1016/j.celrep.2013.03.014.
Grant, C.E., Bailey, T.L. and Noble, W.S. (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics (Oxford, England) 27(7), 1017–1018. DOI: 10.1093/bioinformatics/btr064.
Haque, S., Ahmad, J.S., Clark, N.M., Williams, C.M. and Sozzani, R. (2019) Computational prediction of gene regulatory networks in plant growth and development. Current Opinion in Plant Biology 47, 96–105. DOI: 10.1016/j.pbi.2018.10.005.
Hehl, R., Norval, L., Romanov, A. and Bülow, L. (2016) Boosting AthaMap database content with data from protein binding microarrays. Plant & Cell Physiology 57(1), e4. DOI: 10.1093/pcp/pcv156.
Hernando, C.E., Romanowski, A. and Yanovsky, M.J. (2017) Transcriptional and post-transcriptional control of the plant circadian gene regulatory network. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 1860(1), 84–94. DOI: 10.1016/j.bbagrm.2016.07.001.
Hertz, G.Z. and Stormo, G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (Oxford, England) 15(7–8), 563–577. DOI: 10.1093/bioinformatics/15.7.563.
Hickman, R., Van Verk, M.C., Van Dijken, A.J.H., Mendes, M.P., Vroegop-Vos, I.A. et al. (2017) Architecture and dynamics of the jasmonic acid gene regulatory network. The Plant Cell 29(9), 2086–2105. DOI: 10.1105/tpc.16.00958.


Higo, K., Ugawa, Y., Iwamoto, M. and Korenaga, T. (1999) Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Research 27(1), 297–300. DOI: 10.1093/nar/27.1.297.
Hume, M.A., Barrera, L.A., Gisselbrecht, S.S. and Bulyk, M.L. (2015) UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Research 43(Database issue), D117–22. DOI: 10.1093/nar/gku1045.
Ikeuchi, M., Shibata, M., Rymen, B., Iwase, A., Bågman, A.-M. et al. (2018) A gene regulatory network for cellular reprogramming in plant regeneration. Plant & Cell Physiology 59(4), 765–777. DOI: 10.1093/pcp/pcy013.
Jin, J., Tian, F., Yang, D.-C., Meng, Y.-Q., Kong, L. et al. (2017) PlantTFDB 4.0: toward a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Research 45(D1), D1040–D1045. DOI: 10.1093/nar/gkw982.
Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G. et al. (2010) Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Research 20(6), 861–873. DOI: 10.1101/gr.100552.109.
Karthikeyan, A.S., Ballachanda, D.N. and Raghothama, K.G. (2009) Promoter deletion analysis elucidates the role of cis elements and 5′UTR intron in spatiotemporal regulation of AtPht1;4 expression in Arabidopsis. Physiologia Plantarum 136(1), 10–18. DOI: 10.1111/j.1399-3054.2009.01207.x.
Kaufmann, K., Muiño, J.M., Østerås, M., Farinelli, L., Krajewski, P. et al. (2010) Chromatin immunoprecipitation (CHIP) of plant transcription factors followed by sequencing (CHIP-SEQ) or hybridization to whole genome arrays (CHIP-CHIP). Nature Protocols 5(3), 457–472. DOI: 10.1038/nprot.2009.244.
Kim, H.D., Shay, T., O’Shea, E.K. and Regev, A. (2009) Transcriptional regulatory circuits: predicting numbers from alphabets. Science 325(5939), 429–432. DOI: 10.1126/science.1171347.
Kulkarni, S.R., Vaneechoutte, D., Van de Velde, J. and Vandepoele, K. (2018) TF2Network: predicting transcription factor regulators and gene regulatory networks in Arabidopsis using publicly available binding site information. Nucleic Acids Research 46(6), e31. DOI: 10.1093/nar/gkx1279.
Kummerfeld, S.K. and Teichmann, S.A. (2006) DBD: a transcription factor prediction database. Nucleic Acids Research 34(Database issue), D74–81. DOI: 10.1093/nar/gkj131.
Lai, X., Stigliani, A., Vachon, G., Carles, C., Smaczniak, C. et al. (2019) Building transcription factor binding site models to understand gene regulation in plants. Molecular Plant 12(6), 743–763. DOI: 10.1016/j.molp.2018.10.010.
Liu, Y., Tian, T., Zhang, K., You, Q., Yan, H. et al. (2018) PCSD: a plant chromatin state database. Nucleic Acids Research 46(D1), D1157–D1167. DOI: 10.1093/nar/gkx919.
Maher, K.A., Bajic, M., Kajala, K., Reynoso, M., Pauluzzi, G. et al. (2018) Profiling of accessible chromatin regions across multiple plant species and cell types reveals common gene regulatory principles and new control modules. The Plant Cell 30(1), 15–36. DOI: 10.1105/tpc.17.00581.
Mathelier, A. and Wasserman, W.W. (2013) The next generation of transcription factor binding site prediction. PLoS Computational Biology 9(9), e1003214. DOI: 10.1371/journal.pcbi.1003214.
Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S. et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34(Database issue), D108–10. DOI: 10.1093/nar/gkj143.
Muhammad, I.I., Kong, S.L., Akmar Abdullah, S.N. and Munusamy, U. (2019) RNA-seq and ChIP-seq as complementary approaches for comprehension of plant transcriptional regulatory mechanism. International Journal of Molecular Sciences 21(1), E167. DOI: 10.3390/ijms21010167.
Muiño, J.M., Smaczniak, C., Angenent, G.C., Kaufmann, K. and van Dijk, A.D.J. (2014) Structural determinants of DNA recognition by plant MADS-domain transcription factors. Nucleic Acids Research 42(4), 2138–2146. DOI: 10.1093/nar/gkt1172.
Naika, M., Shameer, K., Mathew, O.K., Gowda, R. and Sowdhamini, R. (2013) STIFDB2: an updated version of plant stress-responsive transcription factor database with additional stress signals, stress-responsive transcription factor binding sites and stress-responsive genes in Arabidopsis and rice. Plant & Cell Physiology 54(2), e8. DOI: 10.1093/pcp/pcs185.
Obayashi, T., Aoki, Y., Tadaka, S., Kagaya, Y. and Kinoshita, K. (2018) ATTED-II in 2018: a plant coexpression database based on investigation of the statistical property of the mutual rank index. Plant & Cell Physiology 59(1), e3. DOI: 10.1093/pcp/pcx191.
Ohama, N., Sato, H., Shinozaki, K. and Yamaguchi-Shinozaki, K. (2017) Transcriptional regulatory network of plant heat stress response. Trends in Plant Science 22(1), 53–65. DOI: 10.1016/j.tplants.2016.08.015.


Ou, B., Yin, K.-Q., Liu, S.-N., Yang, Y., Gu, T. et al. (2011) A high-throughput screening system for Arabidopsis transcription factors and its application to Med25-dependent transcriptional regulation. Molecular Plant 4(3), 546–555. DOI: 10.1093/mp/ssr002.
O’Malley, R.C., Huang, S.-S.C., Song, L., Lewsey, M.G., Bartlett, A. et al. (2016) Cistrome and epicistrome features shape the regulatory DNA landscape. Cell 165(5), 1280–1292. DOI: 10.1016/j.cell.2016.04.038.
Reece-Hoyes, J.S., Diallo, A., Lajoie, B., Kent, A., Shrestha, S. et al. (2011) Enhanced yeast one-hybrid assays for high-throughput gene-centered regulatory network mapping. Nature Methods 8(12), 1059–1064. DOI: 10.1038/nmeth.1748.
Riechmann, J.L., Heard, J., Martin, G., Reuber, L., Jiang, C. et al. (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 290(5499), 2105–2110. DOI: 10.1126/science.290.5499.2105.
Siddharthan, R. (2010) Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix. PLoS ONE 5(3), e9722. DOI: 10.1371/journal.pone.0009722.
Sullivan, A.M., Arsovski, A.A., Lempe, J., Bubb, K.L., Weirauch, M.T. et al. (2014) Mapping and dynamics of regulatory DNA and transcription factor networks in A. thaliana. Cell Reports 8(6), 2015–2030. DOI: 10.1016/j.celrep.2014.08.019.
Tian, F., Yang, D.C., Meng, Y.Q., Jin, J. and Gao, G. (2020) PlantRegMap: charting functional regulatory maps in plants. Nucleic Acids Research 48(D1), D1104–D1113. DOI: 10.1093/nar/gkz1020.
Turatsinze, J.-V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008) Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nature Protocols 3(10), 1578–1588. DOI: 10.1038/nprot.2008.97.
Uniprot Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47(D1), D506–D515. DOI: 10.1093/nar/gky1049.
Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L. and Van de Peer, Y. (2009) Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiology 150(2), 535–546. DOI: 10.1104/pp.109.136028.
Vermeirssen, V., De Clercq, I., Van Parys, T., Van Breusegem, F. and Van de Peer, Y. (2014) Arabidopsis ensemble reverse-engineered gene regulatory network discloses interconnected transcription factors in oxidative stress. The Plant Cell 26(12), 4656–4679. DOI: 10.1105/tpc.114.131417.
Weirauch, M.T., Yang, A., Albu, M., Cote, A.G., Montenegro-Montero, A. et al. (2014) Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158(6), 1431–1443. DOI: 10.1016/j.cell.2014.08.009.
Yamaguchi, N., Winter, C.M., Wu, M.F., Kwon, C.S., William, D.A. et al. (2014) PROTOCOLS: chromatin immunoprecipitation from Arabidopsis tissues. The Arabidopsis Book 12, e0170. DOI: 10.1199/tab.0170.
Yamasaki, K., Kigawa, T., Seki, M., Shinozaki, K. and Yokoyama, S. (2013) DNA-binding domains of plant-specific transcription factors: structure, function, and evolution. Trends in Plant Science 18(5), 267–276. DOI: 10.1016/j.tplants.2012.09.001.
Yang, J. and Ramsey, S.A. (2015) A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites. Bioinformatics (Oxford, England) 31(21), 3445–3450. DOI: 10.1093/bioinformatics/btv391.
Yang, Jinyu, Ma, A., Hoppe, A.D., Wang, C., Li, Y. et al. (2019) Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework. Nucleic Acids Research 47(15), 7809–7824. DOI: 10.1093/nar/gkz672.
Yeh, C.-S., Wang, Z., Miao, F., Ma, H., Kao, C.-T. et al. (2019) A novel synthetic-genetic-array-based yeast one-hybrid system for high discovery rate and short processing time. Genome Research 29(8), 1343–1351. DOI: 10.1101/gr.245951.118.
Yilmaz, A., Mejia-Guerra, M.K., Kurz, K., Liang, X., Welch, L. et al. (2011) AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Research 39(Database issue), D1118–22. DOI: 10.1093/nar/gkq1120.
Yu, C.-P., Lin, J.J. and Li, W.H. (2016) Positional distribution of transcription factor binding sites in Arabidopsis thaliana. Scientific Reports 6(1), 25164. DOI: 10.1038/srep25164.
Yu, C.-P., Chen, S.C.-C., Chang, Y.-M., Liu, W.-Y., Lin, H.-H. et al. (2015) Transcriptome dynamics of developing maize leaves and genomewide prediction of cis elements and their cognate transcription factors. Proceedings of the National Academy of Sciences 112(19), E2477–E2486. DOI: 10.1073/pnas.1500605112.


Zhou, K.-R., Liu, S., Sun, W.-J., Zheng, L.-L., Zhou, H. et al. (2017) ChIPBase v2.0: decoding transcriptional regulatory networks of non-coding RNAs and protein-coding genes from ChIP-seq data. Nucleic Acids Research 45(D1), D43–D50. DOI: 10.1093/nar/gkw965.
Zhou, T., Shen, N., Yang, L., Abe, N., Horton, J. et al. (2015) Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences 112(15), 4654–4659. DOI: 10.1073/pnas.1422023112.

10 Plant Gene Expression Network

Miyu Asari(1), Ai Kitazumi(2), Eiji Nambara(3), Benildo G. de los Reyes(2) and Kentaro Yano(1)*
(1) School of Agriculture, Meiji University, Kawasaki, Japan
(2) Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA
(3) Department of Cell and Systems Biology, University of Toronto, Ontario, Canada

Abstract

The expression profile of a given gene represents the changes in its expression level, and it can be established from a variety of experiments covering different tissues, organs, developmental stages, and treatments. By using the wealth of publicly available RNA-seq data, we can obtain an overview of genome-wide gene expression profiles under a wide range of experimental conditions. Such gene expression profiles are particularly useful for mining sets of genes with similar expression profiles. The similarity indicates that these genes may be involved in the same biological process and/or be regulated by the same transcription factor. For large-scale expression data, gene expression network (GEN) analysis allows evaluation of the significance of the similarity in expression profiles among genes in a genome-wide manner. In order to efficiently predict the biological functions and regulatory mechanisms of expression for individual genes in GENs, we can assign various kinds of functional descriptions in GENs. This chapter introduces the types of GENs and the methods used to construct them, in addition to the database built from large-scale analysis of public datasets and the integration of existing annotation from the literature (e.g., knowledge bases) into GENs.

10.1 Introduction

Gene expression is under complex regulation that depends not only on developmental and tissue-specific influences but also on the environment, including biotic and abiotic stimuli. The fluctuation in expression levels, or the expression profile of a gene, provides important insights into the biological function of the gene. For example, a gene that is consistently expressed in leaves during the daytime, but less so in other tissues or at other times, likely plays a role in photosynthesis, in contrast to genes whose expression changes significantly in response to pathogen invasion. As such, comprehensive expression profiles established under various

experimental conditions could facilitate the understanding of gene functions. Similarities in expression profiles between genes are an effective index for predicting gene function and mechanisms of regulation of gene expression. If two genes have identical expression profiles, it is reasonable to expect that the two genes have the same biological function. In addition, identical expression profiles imply that the two genes are controlled by the same regulatory mechanism (e.g., transcription factor (TF) and cis-element). Therefore, genes with highly similar expression profiles likely have the same biological functions and regulatory mechanism. To precisely compare expression profiles between genes, the gene expression profiles must

*Corresponding author: kyano@meiji.ac.jp © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0010


M. Asari et al.

comprise expression levels from at least tens to hundreds of experimental conditions. If there are only two conditions in the profile (control and treatment), the genes in the genome are classified into only three categories (upregulation, downregulation, or no change in treatment compared with control). Because the number of categories is so small, each category contains many gene sets with different functions. Therefore, the number of experimental conditions in the profile should be increased prior to analysis. Empirically, more than several tens of experimental conditions are required for accurate comparisons of profiles. If it is difficult to perform RNA-seq experiments with many experimental conditions, RNA-seq data from public databases such as the NCBI Sequence Read Archive (SRA) (Leinonen et al., 2011) can be used instead. With such online RNA-seq data, we can populate the expression profile with multiple conditions for a genome of interest. Here, the genome has many gene sets with similar profiles, and thus it is difficult to grasp all the relationships between genes simultaneously when the relationships are described in a table format. Instead of a table, gene expression networks (GENs), which provide a graphical overview, are often used to quickly interpret gene sets having similar expression profiles. With a user-friendly graphical viewer, an overview of GENs helps us to intuitively understand individual gene sets with similar expression profiles. The information allows us to estimate the biological functions of genes, and the TFs and cis-elements controlling gene expression. GEN analysis using huge amounts of RNA-seq data is thus widely applied to identify genes associated with a trait of interest. In this chapter, we explain the various types of GENs and introduce current GEN construction methods, available online RNA-seq data, and knowledge bases and databases pertaining to GENs.
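The limitation of a two-condition profile can be made concrete with a small sketch. It assumes, purely for illustration, that a gene is called up- or downregulated when the absolute log2 fold change (with a pseudocount of 1) is at least 1. The threshold, the pseudocount, the function name, and the expression values are all invented for this example.

```python
import math
from collections import Counter


def classify_two_conditions(control, treatment, min_log2_fc=1.0, pseudocount=1.0):
    """Assign a gene to one of three categories from a control/treatment pair."""
    log2_fc = math.log2((treatment + pseudocount) / (control + pseudocount))
    if log2_fc >= min_log2_fc:
        return "up"
    if log2_fc <= -min_log2_fc:
        return "down"
    return "no change"


# Invented (control, treatment) expression levels for six hypothetical genes.
levels = {
    "g1": (10, 50),
    "g2": (100, 20),
    "g3": (30, 35),
    "g4": (5, 40),
    "g5": (80, 75),
    "g6": (60, 10),
}

categories = {gene: classify_two_conditions(c, t) for gene, (c, t) in levels.items()}
category_sizes = Counter(categories.values())
print(category_sizes)
```

However many genes are added to `levels`, they can only ever fall into these three categories, which is why profiles spanning many conditions are needed to separate genes with different functions.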

10.2 Visualization of Relationships of Genes by GENs: Nodes and Edges

A GEN consists of nodes and edges (Fig. 10.1). A node indicates a gene. Two related nodes

(genes) are connected by an edge (a line segment). Generally, the lengths of the edges have no meaning unless otherwise noted. As described in section 10.8, various kinds of genes and relationships can be depicted simultaneously by using different node shapes and edge styles. In the example of the GEN in Fig. 10.1, we can easily recognize that: (i) Gene A and Gene B have similar expression profiles; (ii) the protein coding sequences of Gene B and Gene C are similar; and (iii) a TF gene X (Gene X) regulates the expression of Gene A. The annotations of genes, such as metabolic pathway names and biological functions, are described in elliptical nodes in the GEN. Powerful and user-friendly software such as Cytoscape (Franz et al., 2016) is available for obtaining a genome-wide view of the relationships between genes.
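The three relationships read from Fig. 10.1 can be stored as a small typed graph. The sketch below is one possible in-memory representation, not the data model of any particular viewer. The edge-type labels and function names are hypothetical, and it assumes that similarity edges are undirected while regulation is directed from the TF to its target.

```python
from collections import defaultdict

# Relationships (i) to (iii) from the Fig. 10.1 example.
# The edge-type strings are hypothetical labels chosen for this sketch.
EDGES = [
    ("Gene A", "Gene B", "similar_expression"),
    ("Gene B", "Gene C", "similar_sequence"),
    ("Gene X", "Gene A", "regulates"),
]


def build_network(edges):
    """Build an adjacency list keyed by gene, keeping the type of every edge."""
    network = defaultdict(list)
    for source, target, kind in edges:
        network[source].append((target, kind))
        if kind != "regulates":
            # Similarity is symmetric; regulation is stored in one direction only.
            network[target].append((source, kind))
    return network


def neighbors(network, gene, kind):
    """Return the genes linked to `gene` by edges of the given type."""
    return sorted(target for target, k in network.get(gene, []) if k == kind)


network = build_network(EDGES)
print(neighbors(network, "Gene B", "similar_expression"))
print(neighbors(network, "Gene B", "similar_sequence"))
print(neighbors(network, "Gene X", "regulates"))
```

Keeping the edge type on every edge is what lets a viewer draw each relationship in its own line style.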

10.3 Types of Relationships in GENs

GENs represent similarities and/or reciprocities in expression profiles among genes. A network showing similarity in expression profiles is also called a co-expression network. Here, in addition to GENs, there are other useful networks that indicate relationships among genes. The gene regulatory network (GRN) is a network for the relationships among genes (target genes) and the TFs that control the expression of the target genes. There are also networks representing the sequence similarities in genes, mRNAs, or proteins. Furthermore, there is a network that connects genes and their functional annotations. These relationships are described in detail in sections 10.4 to 10.7. Several different relationships can be integrated into one network to increase the amount of information about genes. For example, by integrating GENs and GRNs, a group of genes with similar expression profiles and the TFs that control their expression can be overviewed simultaneously. Thus, integrating multiple types of relationships into the GEN provides a highly informative and efficient means to understand the biological functions of genes.
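A minimal sketch of integrating a GEN with a GRN follows. It assumes that a "group of genes with similar expression profiles" can be approximated by a connected component of the co-expression edges, which is only one of several possible definitions. The gene names, TF names, and function names are invented.

```python
def coexpression_modules(coexpression_edges):
    """Group genes into connected components of the co-expression edges."""
    adjacency = {}
    for a, b in coexpression_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, modules = set(), []
    for gene in sorted(adjacency):
        if gene in seen:
            continue
        stack, module = [gene], set()
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            if current in module:
                continue
            module.add(current)
            stack.extend(adjacency[current] - module)
        seen |= module
        modules.append(module)
    return modules


def regulators_of(module, grn_edges):
    """Map each TF to the members of `module` that it regulates."""
    regulators = {}
    for tf, target in grn_edges:
        if target in module:
            regulators.setdefault(tf, set()).add(target)
    return regulators


# Invented co-expression (GEN) and regulatory (GRN) edges.
gen_edges = [("A", "B"), ("B", "C"), ("D", "E")]
grn_edges = [("TF1", "A"), ("TF1", "C"), ("TF2", "D"), ("TF1", "Z")]

modules = coexpression_modules(gen_edges)
for module in modules:
    print(sorted(module), regulators_of(module, grn_edges))
```

Each printed line gives one co-expressed group together with the TFs that regulate its members, which is the simultaneous overview that the integration is meant to provide.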


Fig. 10.1. An example of a GEN. Circular and square nodes represent genes and transcription factors, respectively. The edge patterns (solid, wavy, dashed, and double lines), represent, respectively, the similarity in gene expression profiles between genes, the reciprocity in gene expression profiles between genes, similarity in protein coding sequences between genes, and the control of regulation by a transcription factor. The elliptical nodes show the annotations of genes, such as information on biological functions, metabolic pathway names, and Gene Ontology terms.

10.4 Similarity and/or Reciprocity in Gene Expression Profiles

As described above, the similarity in expression profiles provides a clue to understanding the biological functions of genes. Here, we consider paralogs in a genome. When they possess identical or highly similar DNA sequences, they have great potential to have the same biological functions, causally linked to the similarity in expression profiles. It implies that a similarity in profiles is a useful index for similarity in the biological functions of genes. Here, the detection of genes with the same biological role based on similarity in the expression profiles is not affected by the expression levels (Yano et al., 2006; Hamada et al., 2011; Ohyanagi et al., 2015). For instance, if the expression levels of Gene A are always twofold greater than those of Gene B, these genes are upregulated under the same experimental conditions. Rather than expression levels, the experimental conditions under which genes are upregulated are more informative in considering the biological functions of genes. Therefore, we should initially not consider expression levels but rather only expression profiles for the prediction of gene functions (Fig. 10.2).

Reciprocity in the expression profiles also indicates the relationship between the biological functions of two genes. In an opposite (reciprocal) expression profile (Fig. 10.2), the expression level of a gene increases when the other gene’s expression level decreases, and vice versa. This suggests that one gene acts as a suppressor for the other gene.

After obtaining the expression profiles of all the genes in the genome by RNA-Seq, gene pairs showing similarity/reciprocity in expression profiles can be identified. Then, GENs can


Fig. 10.2. An example of expression profiles, showing seven profiles (genes) under five experimental conditions. Gene D, Gene E, and Gene F have similar expression profiles: all three are upregulated under the second and fourth experimental conditions. In contrast, Gene C has a profile reciprocal to those of these three genes and is downregulated under the second and fourth conditions.

be constructed on the basis of the relationships of the similarity/reciprocity in the expression profiles. In the example shown in Fig. 10.1, we can easily interpret that Gene A has expression profiles similar to Gene B, Gene C, and Gene D. In addition, Gene D and Gene F have reciprocal expression profiles with each other. In this section we introduce two indices for evaluating the similarity and reciprocity in expression profiles between two genes: (i) the Pearson correlation coefficient (PCC); and (ii) distance from correspondence analysis (DCA). Another index, called mutual rank (MR), has also been proposed for identifying significantly similar expression profiles (Obayashi et al., 2011). A gene expression matrix can be used to represent genome-wide gene expression data (Fig. 10.3). Each row in the matrix describes the expression profile of a gene as a vector. When an expression matrix consists of n rows (genes) and m columns (e.g., samples, replications, experimental conditions), an expression profile (vector) has m expression

levels (elements of a vector). We can assess the statistical significance of similarity/reciprocity for each pair of two rows (vectors, genes) in the matrix. Out of n genes in the matrix, the number of all pairs is: ‍

n C2

=

( n! ) . 2· n−2 ! ‍

.In the following calculation of PCCs and DCAs for the individual pairs, we consider two genes, Gene X and Gene Y. The expression profiles of these genes are denoted by vectors x and y, respectively. The expression levels in the i-th sample (experiment) of these genes are denoted by the elements ‍xi ‍ and ‍yi ‍ (‍i = 1 ∼ m‍) of vectors x and y, respectively. Here, m ‍ ‍ indicates the number of samples (experiments).
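To give a feel for how quickly the number of pairs grows, the count above can be evaluated directly. The following is a minimal Python sketch; the function names and the gene labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb


def count_gene_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs, n! / (2 * (n - 2)!)."""
    return comb(n_genes, 2)


def iter_gene_pairs(gene_ids):
    """Yield every unordered pair of gene identifiers exactly once."""
    return combinations(gene_ids, 2)


# Hypothetical example: a small matrix and a genome-scale matrix.
small = list(iter_gene_pairs(["GeneA", "GeneB", "GeneC"]))
genome_scale = count_gene_pairs(30000)  # 449,985,000 pairs
```

Even for a genome with 30,000 genes, several hundred million pairs must be evaluated, which is why calculation time becomes a practical concern in the sections that follow.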

Fig. 10.3. An example of an expression matrix. (A) A gene expression matrix with n genes (rows) and m experimental conditions (columns). The elements in the matrix indicate expression levels of the individual genes under individual conditions. (B) The expression profile of Gene i, represented by the vector in the i-th row in the matrix (A).

10.4.1 PCC

Because the PCC is a fundamental statistic, it is often employed for GEN construction. The PCC ($r$) for two gene expression profiles is given by the following equation:

$$ r = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}} \qquad \text{(Eqn 10.1)} $$

where $\bar{x}$ and $\bar{y}$ are the averages of all elements (gene expression levels) of vectors x and y, respectively. The value of r ranges from –1 (negative correlation) to +1 (positive correlation), and r = 0 indicates no correlation. Therefore, gene pairs with significantly positive or negative correlations are statistically selected, so that the selected gene pairs have similar or reciprocal profiles, respectively. For example, threshold values of r > 0.8 and r < –0.8 are applied for similar and reciprocal profiles, respectively. Then, GENs are constructed by connecting two genes (nodes) in each selected pair by an edge. Different styles of edges, such as solid and dotted lines, are useful for indicating similar and reciprocal profiles, respectively.

While PCCs are a useful index for GEN construction, care must be taken in their use. First, PCC calculations are time-consuming because the expression matrix is large. The use of publicly available RNA-Seq data results in an expression matrix with a large number of columns (samples, experiments), leading to a longer calculation time. This can be addressed by eliminating columns from the matrix. Since the expression matrix obtained from online RNA-Seq data frequently contains many columns with similar experimental conditions, such as leaves in daytime, the majority of the redundant columns can be removed. The redundant columns are not informative in the PCC calculation because similar experimental conditions tend to show essentially constant expression levels. Second, even gene pairs with significant PCC values are potentially false positives. Such false positives are often observed when the expression levels of two genes are almost constant within the majority of columns in the matrix. Note that the risk of a false-positive PCC is also markedly higher when the expression matrix contains many columns with similar experimental conditions, because their expression levels tend to be similar.
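The selection of gene pairs by PCC can be sketched as follows. This is a minimal Python illustration of Eqn 10.1 with the example thresholds r > 0.8 and r < –0.8; the function names, the toy matrix, and the edge labels "similar" and "reciprocal" are assumptions made for this sketch, and it is not the implementation used by any particular tool.

```python
import numpy as np


def pcc(x, y):
    """Pearson correlation coefficient of two expression profiles (Eqn 10.1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
    if denom == 0:
        # A constant profile has no defined correlation.
        return float("nan")
    return float((dx * dy).sum() / denom)


def pcc_edges(matrix, gene_ids, threshold=0.8):
    """Return (gene_a, gene_b, r, kind) for pairs beyond the threshold."""
    matrix = np.asarray(matrix, dtype=float)
    edges = []
    for i in range(len(gene_ids)):
        for j in range(i + 1, len(gene_ids)):
            r = pcc(matrix[i], matrix[j])
            if r > threshold:
                edges.append((gene_ids[i], gene_ids[j], r, "similar"))
            elif r < -threshold:
                edges.append((gene_ids[i], gene_ids[j], r, "reciprocal"))
    return edges


# Toy matrix: 4 genes (rows) under 5 conditions (columns).
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
expr = [
    [1, 5, 1, 5, 1],    # GeneA
    [2, 10, 2, 10, 2],  # GeneB: twofold GeneA, same profile
    [5, 1, 5, 1, 5],    # GeneC: reciprocal to GeneA
    [1, 2, 3, 4, 5],    # GeneD: uncorrelated with GeneA
]
edges = pcc_edges(expr, genes)
```

In this toy example, Gene A and Gene B are connected as a similar pair although their expression levels differ twofold, and Gene C is connected to both as a reciprocal pair. Profiles that are exactly constant yield an undefined PCC and are skipped here; nearly constant profiles, the source of false positives noted above, would still need separate filtering.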

10.4.2 DCA

The index DCA, which is obtained by correspondence analysis (CA), is also useful for measuring the similarity and reciprocity in expression profiles between two genes and for constructing GENs (Yano et al., 2006; Hamada et al., 2011; Ohyanagi et al., 2015). CA is a multivariate method. It reduces the dimensions in a gene expression matrix, and summarizes the profiles of both rows (genes) and columns (experimental conditions) with lower dimensions. CA is quickly executable using the free software R (R Core Team, 2018) even on a personal computer, whereas PCC calculations require a long time.

When a gene expression matrix has m columns (experimental conditions), the expression profile of a gene is denoted as a vector with m elements (dimensions). CA gives new coordinates (scores) for each gene with m′ dimensions, where m′ must be a positive integer and less than m (m′ < m), allowing genes to be plotted in the new m′-dimensional subspace. Theoretically, two genes with the same expression profiles are given the same coordinates in the m′-dimensional subspace. As the expression profiles of two genes become more dissimilar, the coordinates of the two genes in the m′-dimensional subspace differ more. Thus, the distance, DCA (≥ 0), between plots (coordinates) of the two genes in the m′-dimensional subspace is an effective index for evaluating the similarity in the expression profiles of the genes (Gene A and Gene B in Fig. 10.4). Namely, when the DCA for a gene pair is equal to zero, the genes in this pair must have the same expression profiles. As the DCA increases between two genes (plots), the expression profiles of the genes become more dissimilar.

In addition, if two plots (genes) are symmetric with respect to the origin in the subspace, these genes have reciprocal expression profiles to each other (Fig. 10.4). Thus, to search for genes having reciprocal profiles to Gene A, we can consider a point which is symmetric to the plot of Gene A with respect to the origin. The coordinates of the symmetric point are easily obtained by taking the opposite signs of the coordinates of Gene A. When the coordinates of Gene A in 3-dimensional subspace are (3, –5, 9), the coordinates of the symmetric point are (–3, 5, –9). In Fig. 10.4, the plot of Gene C is the symmetric point of the plot of Gene A with respect to the origin. However, in general there are no genes (plots) exactly on the symmetric point. Recall that two genes having similar coordinates must have similar expression profiles in CA. Therefore, we can search for genes having coordinates near the symmetric point (Gene C). In Fig. 10.4A, since the plot of Gene D is located close to the plot of Gene C, Gene D is likely to have a mostly reciprocal profile to Gene A.

Fig. 10.4. Identification of genes with similar/reciprocal expression profiles by correspondence analysis (CA). CA reduces the number of dimensions in a gene expression profile. Genes are assigned coordinates in the resulting lower dimensional subspace. (A) This example shows the first three dimensions (Factor 1 to Factor 3) of the lower dimensional subspace obtained by CA. Four genes are plotted in the subspace. (B) The plots of Gene A and Gene B are located close to each other in the subspace (A), indicating that the expression profiles of the two genes are similar. On the other hand, the plots of Gene A and Gene C are symmetric with respect to the origin, theoretically meaning that the genes have completely reciprocal profiles. Since the plot of Gene D is located close to the plot of Gene C, Gene D is likely to have a mostly reciprocal profile to Gene A.


The DCA distance between two genes, Gene X and Gene Y, in the subspace with m′ dimensions is calculated by:

$$ \mathrm{DCA} = \sqrt{\sum_{i=1}^{m'} \left( \mathrm{Score\_x}_i - \mathrm{Score\_y}_i \right)^2} \qquad \text{(Eqn 10.2)} $$

where Score_x and Score_y are vectors (coordinates) with m′ dimensions for Gene X and Gene Y, respectively, and $\mathrm{Score\_x}_i$ and $\mathrm{Score\_y}_i$ are the i-th elements (coordinates) of Score_x and Score_y, respectively. A threshold value for the distance should be determined to select gene pairs with significant relationships (e.g., within the top 0.1% of all DCAs). Since expression profiles vary according to the kinds of experimental conditions, the appropriate threshold value of DCA differs for each expression matrix. This threshold should be carefully determined based on the statistical distribution of DCAs from all the gene pairs.
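The calculation of DCA can be illustrated with a short Python sketch. It assumes the standard formulation of CA (singular value decomposition of the standardized residuals of the matrix, with genes placed at their row principal coordinates), which may differ in detail from the R procedure used in the cited studies. The function names and the toy matrix are hypothetical, and every row and column of the matrix is assumed to have a positive sum.

```python
import numpy as np


def ca_row_scores(matrix, n_dims):
    """Row (gene) principal coordinates from a standard correspondence analysis."""
    counts = np.asarray(matrix, dtype=float)
    p = counts / counts.sum()          # correspondence matrix
    r = p.sum(axis=1)                  # row masses (genes)
    c = p.sum(axis=0)                  # column masses (conditions)
    expected = np.outer(r, c)
    residuals = (p - expected) / np.sqrt(expected)
    u, sing, _ = np.linalg.svd(residuals, full_matrices=False)
    scores = (u * sing) / np.sqrt(r)[:, None]
    return scores[:, :n_dims]          # keep the first m' dimensions


def dca(scores, a, b):
    """Euclidean distance between two genes in the CA subspace (Eqn 10.2)."""
    return float(np.linalg.norm(scores[a] - scores[b]))


def nearest_to_symmetric_point(scores, a):
    """Index of the gene closest to the point symmetric to gene a about the origin."""
    distances = np.linalg.norm(scores + scores[a], axis=1)
    distances[a] = np.inf              # exclude gene a itself
    return int(np.argmin(distances))


# Toy matrix: 4 genes (rows) under 5 conditions (columns).
expr = [
    [1, 5, 1, 5, 1],    # GeneA
    [2, 10, 2, 10, 2],  # GeneB: twofold GeneA, same profile
    [5, 1, 5, 1, 5],    # GeneC
    [3, 3, 3, 3, 4],    # GeneD
]
scores = ca_row_scores(expr, n_dims=2)
d_ab = dca(scores, 0, 1)
d_ac = dca(scores, 0, 2)
candidate = nearest_to_symmetric_point(scores, 0)
```

Because CA compares profiles rather than levels, Gene A and Gene B receive the same coordinates in this sketch, so their DCA is zero up to rounding error, whereas Gene A and Gene C are clearly separated. The last function follows the search for reciprocal candidates described above: it negates the coordinates of a gene and returns the nearest other gene to that symmetric point.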

10.5 Common Regulatory Mechanisms in Gene Expressions

Although most diploid genomes of higher plants comprise more than 10,000 genes, only a subset of the genes in the genome is selectively expressed for each spatio-temporal biological activity. In each cell, selective expression is controlled by a group of TFs. A genome has several tens of TF families, which can be classified according to their biological functions and nucleotide sequence patterns in the DNA-binding sites (cis-elements). For example, the genome of the model plant Arabidopsis (125 Mbp) has roughly 1500–1700 TF genes, which are grouped into about 50 different families (Arabidopsis Genome Initiative, 2000; Riechmann et al., 2000; Yilmaz et al., 2011; Celli et al., 2015). Regulatory factors (TFs and cis-elements) play a role in selectively regulating the transcription of specific target genes involved in discrete biological activities in the cell. Genes controlled by the same regulatory factor may be involved in the same biological process (e.g., a metabolic process). Therefore, the classification of expressed genes, including genes encoding TFs, according to their regulatory factors helps us to consider the biological functions of expressed genes.

GRNs are a type of GEN that show the relationships between TFs and their target genes. The GEN shown in Fig. 10.1 indicates that Gene A is regulated by TF genes X and Y. In addition to TFs, DNA sequence patterns of cis-element motifs are also useful for classifying genes in GENs. Information regarding cis-elements can be described in additional nodes (elliptical nodes in Fig. 10.1) to annotate the binding sites for each gene. The relationships between genes are complex, even within a single experimental condition, since they are the result of the action of a large number of TFs. GENs illustrate the complex relationships between genes so that they can be easily understood.

Information on regulatory factors is accessible from databases such as TAIR (Lamesch et al., 2012), AGRIS (Celli et al., 2015), PlantPAN (Chow et al., 2019), PlantTFDB (Jin et al., 2017), NEW PLACE (Higo et al., 1999), and Cistrome DAP (O’Malley et al., 2016). The information available online consists of data obtained from laboratory experiments or computational analysis. Next-generation sequencing (NGS) experiments, such as ChIP-Seq (see Chapter 7), ATAC-Seq, and DAP-Seq (O’Malley et al., 2016), provide comprehensive information on TFs and their binding sites. Computational predictions of TFs and their binding sites have also been performed successfully. The web database PlantPAN (Chow et al., 2019, see Chapter 9) provides the results of comprehensive computational prediction of TF binding sites (TFBSs) and their corresponding TFs.

10.6 Sequence Similarities in mRNAs

Gene duplication is a major source of paralogs (copy genes) in the genome. Paralogs often show similarity not only in DNA sequence but also in biological function and expression profile. The accumulation of DNA mutations in the gene(s) of paralogs causes biological functions between paralogs to diverge. From this viewpoint, similarity in mRNA sequences (homologs) strongly suggests the same biological function. Therefore, GENs providing information on mRNAs with significantly similar sequences help us to infer the biological functions of mRNAs.

Sequence similarity can be measured by several methods and tools. Here we introduce a GEN construction method using the BLAST package, since BLAST searches are simple, effective, and widely used (Altschul et al., 1990). To conduct a BLAST search (in-house BLAST, local BLAST), users first install the BLAST package on their computer. Packages for each computer operating system (Macintosh, Linux, and Windows) are available from the NCBI FTP site (https://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST/, accessed July 2022). The BLAST package contains multiple programs for performing BLAST searches. BLAST searches are executed from the command line in a terminal application (e.g., Terminal in Macintosh and Linux, and ‘Command Prompt’ in Windows).

A BLAST search requires FASTA files for creating a BLAST database and for submitting a query sequence(s). For this example, a FASTA file for the construction of a BLAST database should contain mRNA sequences. The program “makeblastdb” in the BLAST package is used to create the database from the FASTA file. Another FASTA file containing a query sequence(s) should also be prepared. When both FASTA files (for the database and query) contain nucleotide sequences, the program “blastn” is used for execution of the BLAST searches. The resulting BLAST search provides sequences (subjects, hits) in the database which have sequence patterns similar to the query sequence. The option “-evalue” in the “blastn” program allows us to set the threshold for the expected value (e-val, e-value) in the BLAST search. E-values of 1e-5 to 1e-30 are frequently employed as the threshold values to search the database for homologs. To search precisely for homologs, other indices shown in the search results, such as the alignment length, identity, and the number of gaps, are also useful.

In addition, the calculation of two additional indices for sequence coverage is effective for the statistical filtering of sequences obtained from the BLAST search. One index is “query coverage,” which is the ratio of alignment length to the length of the query sequence, and the other is “subject coverage,” which is the ratio of alignment length to the length of the subject sequence. BLAST searches allow us to select all sequence pairs (gene pairs, pairs of homolog candidates) showing significant sequence similarity. To construct GENs showing the relationships between homologs, all gene pairs (sequence pairs) are connected by edges.

We can identify sequences with similar functions more strictly by considering protein functional domains. Although the BLAST program is easily and quickly executed, it cannot identify common domains from protein sequences. This can be addressed by using InterProScan (Zdobnov and Apweiler, 2001), a powerful tool for mining functional domains in protein sequences.
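The filtering by e-value and coverage can be sketched in Python as follows. The sketch assumes that the BLAST results were written in tabular form with the columns qseqid, sseqid, pident, length, evalue, qlen, and slen, for example with the commands shown in the comments; the file names, the function names, and the coverage thresholds of 0.5 are hypothetical choices for illustration only.

```python
# Assumed commands that would produce the tabular input (file names are hypothetical):
#   makeblastdb -in mrna.fasta -dbtype nucl -out mrna_db
#   blastn -query query.fasta -db mrna_db -evalue 1e-5 \
#          -outfmt "6 qseqid sseqid pident length evalue qlen slen" -out hits.tsv


def parse_hits(tsv_text):
    """Parse tab-separated BLAST hits and add query/subject coverage."""
    hits = []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        qseqid, sseqid, pident, length, evalue, qlen, slen = line.split("\t")
        aln_len = int(length)
        hits.append({
            "query": qseqid,
            "subject": sseqid,
            "identity": float(pident),
            "evalue": float(evalue),
            # Coverage = alignment length / sequence length, as defined in the text.
            "query_coverage": aln_len / int(qlen),
            "subject_coverage": aln_len / int(slen),
        })
    return hits


def filter_hits(hits, max_evalue=1e-5, min_query_cov=0.5, min_subject_cov=0.5):
    """Keep hits that pass the e-value and both coverage thresholds."""
    return [
        h for h in hits
        if h["evalue"] <= max_evalue
        and h["query_coverage"] >= min_query_cov
        and h["subject_coverage"] >= min_subject_cov
    ]


# Two made-up hits: the second aligns over only 10% of the query.
example = "q1\ts1\t98.5\t900\t1e-50\t1000\t1200\nq1\ts2\t80.0\t100\t1e-6\t1000\t2000\n"
homolog_pairs = [(h["query"], h["subject"]) for h in filter_hits(parse_hits(example))]
```

Each pair that survives the filter is a homolog candidate and can be written as one edge of the GEN. The second hit in the example passes the e-value threshold but is removed because the alignment covers only a small fraction of both sequences.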

10.7 Similarities in the Biological Functions of Expressed Genes

A description of the biological function of a gene is called functional annotation. The commonality of functional annotations between genes is useful for identifying genes with similar functions. For example, Fig. 10.1 shows that Gene A and Gene D have the same functional annotations. There are various kinds of functional annotations, from descriptions roughly obtained using computational approaches, to accurate descriptions provided by expert curators. The methods for constructing a network providing such information are outlined below.

10.7.1 GENs with computational annotations of genes

When the biological function of a given gene is unknown (called an unknown gene), we frequently search the sequence database for sequences significantly similar to the unknown gene. Genes with significantly similar sequences likely have the same biological functions, and thus we can use the functional annotations of similar sequences in the database as references for the function of the unknown gene. Computational tools such as BLAST (Altschul et al., 1990) and InterPro (InterProScan) (Zdobnov and Apweiler, 2001) allow us to assign computational annotations to unknown genes.

BLAST is executable through web interfaces (Web-BLAST) or from the command line as described in section 10.6 (in-house BLAST, local BLAST). The publicly available BLAST databases are effective for obtaining functional annotations for unknown genes. NCBI (Sayers et al., 2021) provides the BLAST databases nt and nr, which are non-redundant nucleotide and protein sequence databases, respectively, generated from all the sequences submitted to the International Nucleotide Sequence Database Collaboration (INSDC) (Arita et al., 2021). We can comprehensively search the nt or nr database for sequences similar to an unknown gene by simply filling in the query sequence and selecting the database and other options in the web interface at the NCBI website. The nt and nr databases are also downloadable from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) and allow in-house BLAST searches on personal computers.

UniProt (UniProt Consortium, 2021) provides a well-curated protein sequence database. While the number of sequences in the database is small, the functional annotations are highly reliable since they are assigned by manual curation. Web-BLAST can be executed on the UniProt website. In-house BLAST is also easily performed, because the FASTA file for the database is downloadable from the UniProt website. Current data on genomic DNA sequences, mRNA sequences, protein coding sequences, and protein sequences of model plants such as Arabidopsis (Lamesch et al., 2012) and rice (Ouyang et al., 2007; Sakai et al., 2013) are also available online. The website Ensembl Plants stores sequence data for many plant species (Howe et al., 2020). We can generate BLAST databases from arbitrary online data, then use them for in-house BLAST searches.

The web tool InterPro (https://www.ebi.ac.uk/interpro/search/sequence/, accessed July 2022) allows users to collect functional annotations for the translated sequence (protein sequence) of an unknown gene. The annotations contain protein functional domains and Gene Ontology terms (Gene Ontology Consortium, 2015). The in-house tool InterProScan is also available to perform this protein analysis on users’ own computers.


10.7.2 GENs containing knowledge-based information and ontology for biological functions

The literature provides a wealth of knowledge-based information about gene functions. Published reports are the principal output of carefully deliberated research experiments and analyses. The extraction and summarization of detailed descriptions in the scientific literature, called manual curation, allow us to assign highly reliable functional annotations to various genes in many species. As expert manual curation has provided highly reliable knowledge-based information, valuable information about molecular functions has accumulated online. However, the descriptions in the literature (terms, words, or expressions in the text) often differ between authors, even if the authors are describing the same biological functions. The existence of synonyms makes interpretation and database searches difficult. To avoid such difficulties, several projects have defined ontology terms to describe the characteristics of, for example, genes, plants, and compounds, and make them available for annotation (Gene Ontology Consortium, 2015; Cooper and Jaiswal, 2016). For example, the ontology term cold temperature exposure (PECO_0007174) is defined to describe “low temperature,” “cold temperature treatment,” etc., in Plant Experimental Conditions Ontology (PECO) (Jaiswal et al., 2014). The use of ontology terms in place of the various terms and synonyms used in the original literature aids biological interpretation and database searches. Although knowledge-based annotations based on manual curation and ontology are limited, their use in GENs facilitates the understanding of gene functions.

While manual curation by experts (curators) provides highly reliable knowledge-based information, the number of articles (papers) which one curator can read in a day is limited. In addition, it is difficult for the curator to read articles outside of their specialty (specialism). An automated curation procedure would allow many articles in a wide range of specialized fields to be computationally processed rapidly. This computational approach is called natural language processing (NLP). NLP is a form of mechanical processing for text mining, and currently its accuracy is poor compared with manual curation. However, the NLP technique allows the collection of comprehensive information, which cannot be achieved by manual curation. Thus, an approach combining manual curation and NLP would address these problems. The Plant Omics Data Center (PODC) project (Ohyanagi et al., 2015) (see section 10.9 for details) has gathered comprehensive knowledge-based information on plant genes, accumulated by NLP prior to manual curation and editing, accelerating the accumulation of knowledge-based information on genes. The knowledge-based information in PODC contains the functions and the genetic and molecular interactions of each gene. Although such knowledge bases remain rare, knowledge-based information is valuable for understanding the relationships between genes linked by edges in GENs.
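The replacement of free-text synonyms by a single ontology term can be illustrated with a small Python sketch. The lookup table contains only the PECO example given above; the function name and the idea of a hand-made dictionary are assumptions for illustration, whereas a real curation workflow would draw the synonyms from the ontology itself.

```python
# Synonym table built from the single example in the text (PECO_0007174).
SYNONYM_TO_TERM = {
    "low temperature": ("PECO_0007174", "cold temperature exposure"),
    "cold temperature treatment": ("PECO_0007174", "cold temperature exposure"),
    "cold temperature exposure": ("PECO_0007174", "cold temperature exposure"),
}


def normalize_condition(description):
    """Map a free-text condition to (term ID, term name), or None if unknown."""
    key = " ".join(description.lower().split())  # ignore case and extra spaces
    return SYNONYM_TO_TERM.get(key)


hit = normalize_condition("  Low   Temperature ")
miss = normalize_condition("high light")
```

Descriptions that map to the same term ID can then be searched and grouped together regardless of the wording chosen by the original authors, while unmatched descriptions are returned as None and left for a curator.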

10.7.3 GENs with metabolic pathway information

Highlighting biosynthetic enzyme genes in GENs allows us to easily recognize GENs or sub-networks (gene modules, which are parts of a network) involved in a metabolic process. Additional nodes describing metabolic pathway names (e.g., “Steroid biosynthesis,” “Fatty acid degradation”) facilitate understanding of the metabolic functions of each GEN. The nodes for the annotations of metabolic pathways can be connected to enzyme genes (nodes) in GENs (Fig. 10.1). The information on metabolic pathways is accessible from knowledge bases such as KEGG (Kanehisa and Goto, 2000) and BRENDA (Chang et al., 2021). It contains the names of enzymes and enzyme genes in each species, the Enzyme Commission numbers (EC numbers), and the names of chemical reactions in which the enzymes participate. An EC number is defined and provided by the International Union of Biochemistry and Molecular Biology (Korzybski, 1962). It comprises four numbers defined according to the chemical reaction (e.g., the EC number 1.1.1.21 corresponds to the enzyme aldehyde reductase). KEGG allows users to browse metabolic pathway maps in a graphical viewer. On the pathway map, enzymes in a species of interest can be highlighted. Clicking the “Change Pathway Type” button shows the list of species names, and hyperlinks that allow the pathway map for each species to be browsed.

Integrating the information from gene expression matrices and pathway maps allows us to distinguish between active and inactive enzyme genes in each pathway map. First, the experimental condition of interest (e.g., gene expression in leaves in daytime) is selected in the gene expression matrix. Then, the genes are classified as expressed or unexpressed according to the expression data. Mapping (highlighting) the expressed and unexpressed genes on the metabolic pathway map using different highlight colors allows us to quickly recognize active and inactive enzyme genes and the activity of the metabolic pathways.

10.8 Network Construction Tools with Multiple Types of Information about Genes

To construct GENs, we must prepare a text file describing a list of gene pairs. A gene pair has a given relationship, such as similarity in expression profiles. Various kinds of relationships are available, as mentioned above (see sections 10.3 to 10.7). Each gene pair is described in a single row in a table format to depict GENs using a graphical viewer tool such as Cytoscape (Franz et al., 2016). In addition to the two gene names (a gene pair), we can describe the relationship (e.g., a PCC value) in the third column in each row in the table. The number of rows in the table is equal to the number of significant gene pairs (edges) in the GENs. Cytoscape and the R package WGCNA (Zhang and Horvath, 2005; Langfelder and Horvath, 2008) have been widely used to construct and graphically depict GENs. Cytoscape generates GENs from a text file describing a list of gene pairs, whereas WGCNA is an R package that allows us to analyze an expression matrix and construct GENs using the PCC method. In constructing GENs, users set a threshold value of PCC.
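The table of gene pairs described above can be produced with a few lines of Python. This is a minimal sketch that writes one row per edge with the two gene names and the PCC value in the third column; the function name, the column headers, and the number of decimal places are assumptions, and the gene pairs are made up.

```python
import csv
import io


def edge_table(edges):
    """Return a tab-separated table: gene_a, gene_b, pcc (one row per edge)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_a", "gene_b", "pcc"])  # header names are assumed
    for gene_a, gene_b, r in edges:
        writer.writerow([gene_a, gene_b, f"{r:.3f}"])
    return buffer.getvalue()


# Made-up significant pairs: one similar and one reciprocal profile.
table = edge_table([("GeneA", "GeneB", 0.91), ("GeneA", "GeneC", -0.87)])
```

The returned text can be saved to a file and imported into a network viewer as an edge list, with the third column available for styling edges, for example solid lines for positive and dotted lines for negative values.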


GENs that simultaneously describe various types of information help us consider the biological functions of a gene in the GENs using a multilateral approach. The different styles of nodes and edges are useful for representing the characteristics of genes and the relationships between genes, respectively. For example, genes encoding TFs and enzymes, and genes specifically expressed in a given organ, can be depicted by different shapes of nodes, such as circles, triangles, and squares, with/without background colors. Different styles of edges, such as solid and dotted lines with different colors and widths, are also available to indicate the degree of similarity and reciprocity in the expression profiles and in the sequence similarity between the genes. Additional nodes are effective for showing more complicated annotations. Users can add nodes for descriptions of biological functions, including Gene Ontology terms and metabolic pathways, such as EC numbers, chemical reactions, and pathway names. By identifying a set of genes connected with the same annotation node (description), we can easily classify and understand the candidate genes involved in the same biological process.
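The use of annotation nodes can be sketched as follows. In this hypothetical Python example, each gene is linked to one node per annotation, and the genes sharing an annotation node are then collected into a group; the function name, the gene labels, and their assignments to the pathway names are invented for illustration.

```python
from collections import defaultdict


def annotation_edges(gene_annotations):
    """Link genes to annotation nodes and group genes by shared annotation."""
    edges = []
    members = defaultdict(list)
    for gene, annotations in gene_annotations.items():
        for annotation in annotations:
            # One edge from the gene node to the annotation node.
            edges.append((gene, annotation, "annotation"))
            members[annotation].append(gene)
    return edges, dict(members)


# Invented assignments of genes to pathway-name nodes.
annotations = {
    "GeneA": ["Steroid biosynthesis"],
    "GeneB": ["Steroid biosynthesis", "Fatty acid degradation"],
    "GeneC": ["Fatty acid degradation"],
}
edges, groups = annotation_edges(annotations)
```

The resulting edges can be appended to the gene-pair table with a distinct edge type so that a viewer can draw them in a different style, and the groups list the candidate genes connected to the same annotation node.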

10.9 Knowledge-bases for RNA-Seq Data, Expression Data, and GENs

RNA-seq data generated with NGS technology are stored in the INSDC, consisting of the NCBI SRA, DDBJ DRA, and ENA databases (Leinonen et al., 2011; Kodama et al., 2012; Cummins et al., 2022). When a researcher performs RNA-seq analysis and publishes a report in a scientific journal, the researcher must deposit the RNA-seq data into these databases. The number of RNA-seq runs (sequencing data) for green plants in SRA is thus rapidly accumulating, increasing in 1 month from 314,000 runs (December 2021) to 327,000 runs (January 2022). For the model plant Arabidopsis, the number of RNA-seq runs increased from 46,159 (May 2021) to 53,616 (January 2022). These RNA-seq data stored in the online databases consist of various experimental results obtained from different conditions: different genomes, genotypes, organs, tissues, developmental stages (daily, seasonal and annual fluctuations), environmental conditions, and biotic and abiotic stress. Users can search for and retrieve RNA-seq data in SRA by experimental condition, including the plant species of interest.

Much sequencing information is available online, but the detail and terminology in the descriptions of the experimental conditions (materials and methods) differ between records (RNA-seq data, run). Researchers submitting RNA-seq data may use different terms to describe the same experimental condition, hampering effective searching of the database. Plant Omics Data Center (PODC) (Ohyanagi et al., 2015) has collected the comprehensive RNA-seq data of model plants stored in NCBI, and classified them by manual curation of the description of the experimental conditions of each RNA-seq experiment. In addition, GENs have been constructed with the manually curated RNA-seq data. Since the GENs are annotated with the manually curated information (experimental conditions), genes of interest are quickly and efficiently identified in the large-scale GENs.

In the PODC project, each RNA-seq run (experiment) is manually assigned one or more ontology terms in Plant Ontology (PO) (Cooper and Jaiswal, 2016) and PECO (Jaiswal et al., 2014). The PO terms describe organs and development (e.g., leaf, crown root, anther) while the PECO terms describe treatments and growth conditions (e.g., mineral salt exposure, gibberellic acid exposure, cold air temperature exposure). By using these terms, users can easily search for RNA-seq runs of interest in PODC and construct GENs with selected runs by following the instructions on the PODC web page.

PODC also contains the knowledge-based annotations of genes, allowing us to construct GENs more precisely. The knowledge-based annotations contain gene and protein names (e.g., synonyms), TFs and cis-elements controlling gene expression, and biological functions of genes. Along with progress in comprehensive genomic and genetic research, new identifiers (IDs) and names have been assigned to single genes or molecules. The various names for a single gene or molecule also hamper the efficient search and retrieval of gene information in databases. Cross-linking all names assigned to the same gene or molecule enables access to all the biological information in the databases, whatever name the information has been assigned.


Knowledge-based information on genes, TFs, and cis-elements in PODC has been accumulated by extensive manual curation of the scientific literature in NCBI PubMed. This information contains not only the various names of genes, TFs and cis-elements, target genes controlled by the TF, and the sequence patterns of DNA binding sites, but also the experimental methods used to detect the TFs and binding sites, such as ChIP-seq, promoter deletion, and yeast one-hybrid. PODC also stores the knowledge-based information on the biological functions of genes. This information was collected using a combined method, called the article to knowledge (A2K) approach, with automated and manual curation. In the automatic curation, an NLP technique was used for high-throughput text mining to extract information on gene functions from the literature. Although this step can quickly extract comprehensive information from a large body of literature, it makes mistakes such as grammatical errors in summarizing the text in an article. To refine the A2K conversion, high-quality knowledge-based information has been derived in the PODC project by manually editing the summaries produced by NLP. User-generated GENs can be integrated with these knowledge-based annotations, helping users to understand more deeply the biological function of each gene in the GENs.

PODC provides information on the integrated GENs from different species. GENs for different species are combined by connecting homologs between species with edges. This is an interesting approach for comparing the expressed genes and their regulatory mechanisms between homologs in different species. In PODC, users can search for genes involved in a given biological process in one or more species simultaneously, then depict GENs with the retrieved genes. In the GENs, genes (nodes) with similar expression profiles are connected by edges. Using the option to connect homologs within and between species by edges, we can easily combine GENs from different species. Moreover, another option for GEN construction in PODC automatically adds genes to the GENs (called extension of GENs). For the GEN extension, TF genes and homologs that are associated with the genes constituting the GENs are automatically appended. This option quickly helps users to understand whether a gene is evolutionarily conserved between species by the presence or absence of homologs in expressed genes and TF genes.

The database “Expression Atlas” in EBI (Moreno et al., 2022) also provides information on expression data obtained by RNA-Seq experiments. Expression Atlas currently stores expression data for 30 plant species. The expression levels of genes are determined using publicly available tools with transcriptome data in the GEO (Barrett et al., 2013) and ENA databases. The web page shows the list of experiments with hyperlinks for each species, allowing users to easily browse expression levels in a graphical heat map. The data are well inventoried, so users can access the data stored in Expression Atlas directly from the R software as well as by downloading files from the FTP site. The database ATTED-II provides information on GENs constructed using the PCC and MR indices (Obayashi et al., 2011). The GENs of nine plant species (including Arabidopsis, rice, soybean, and tomato) have been constructed using microarray experimental data.

Many databases and knowledge bases have now been constructed and are maintained, and the number of websites providing information on GENs and knowledge-based annotations has been increasing. The comparison of expression data using GENs and well-curated annotations of genes within and between species facilitates understanding of the biological functions and regulatory mechanisms of genes.

Acknowledgments

This work was supported by the Program on Open Innovation Platform with Enterprises, Research Institute and Academia, Japan Science and Technology Agency (JST, OPERA, JPMJOP1851), JST-Mirai Program Grant Numbers JPMJMI21C2 and JPMJMI21C3, National BioResource Project (NIG): TOMATO, and Grants-in-Aid for Scientific Research (B) (KAKENHI) 20H02977, 18KT0048, and 22H02318 in Japan. This work was also supported in part by Research Funding for the Computational Software Supporting Program from Meiji University.



References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410. DOI: 10.1016/S0022-2836(05)80360-2.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.
Arita, M., Karsch-Mizrachi, I. and Cochrane, G. (2021) The international nucleotide sequence database collaboration. Nucleic Acids Research 49(D1), D121–D124. DOI: 10.1093/nar/gkaa967.
Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F. et al. (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Research 41(Database issue), D991–D995. DOI: 10.1093/nar/gks1193.
Celli, F., Malapela, T., Wegner, K., Subirats, I., Kokoliou, E. et al. (2015) AGRIS: providing access to agricultural research data exploiting open data on the web. F1000Research 4, 110. DOI: 10.12688/f1000research.6354.1.
Chang, A., Jeske, L., Ulbrich, S., Hofmann, J., Koblitz, J. et al. (2021) BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Research 49(D1), D498–D508. DOI: 10.1093/nar/gkaa1025.
Chow, C.-N., Lee, T.-Y., Hung, Y.-C., Li, G.-Z., Tseng, K.-C. et al. (2019) PlantPAN3.0: a new and updated resource for reconstructing transcriptional regulatory networks from ChIP-seq experiments in plants. Nucleic Acids Research 47(D1), D1155–D1163. DOI: 10.1093/nar/gky1081.
Cooper, L. and Jaiswal, P. (2016) The Plant Ontology: a tool for plant genomics. Methods in Molecular Biology (Clifton, N.J.) 1374, 89–114. DOI: 10.1007/978-1-4939-3167-5_5.
Cummins, C., Ahamed, A., Aslam, R., Burgin, J., Devraj, R. et al. (2022) The European Nucleotide Archive in 2021. Nucleic Acids Research 50(D1), D106–D110. DOI: 10.1093/nar/gkab1051.
Franz, M., Lopes, C.T., Huck, G., Dong, Y., Sumer, O. et al. (2016) Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics (Oxford, England) 32(2), 309–311. DOI: 10.1093/bioinformatics/btv557.
Gene Ontology Consortium (2015) Gene Ontology Consortium: going forward. Nucleic Acids Research 43(Database issue), D1049–D1056. DOI: 10.1093/nar/gku1179.
Hamada, K., Hongo, K., Suwabe, K., Shimizu, A., Nagayama, T. et al. (2011) OryzaExpress: an integrated database of gene expression networks and omics annotations in rice. Plant & Cell Physiology 52(2), 220–229. DOI: 10.1093/pcp/pcq195.
Higo, K., Ugawa, Y., Iwamoto, M. and Korenaga, T. (1999) Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Research 27(1), 297–300. DOI: 10.1093/nar/27.1.297.
Howe, K.L., Contreras-Moreira, B., De Silva, N., Maslen, G., Akanni, W. et al. (2020) Ensembl Genomes 2020–enabling non-vertebrate genomic research. Nucleic Acids Research 48(D1), D689–D695. DOI: 10.1093/nar/gkz890.
Jaiswal, P., Cooper, L. and Moore, L. (2014) Plant environmental condition ontology (EO). Presentation at: Fourth Annual Summit of the Phenotype Ontology Research Coordination Network, Biosphere, 21–23 February 2014.
Jin, J., Tian, F., Yang, D.-C., Meng, Y.-Q., Kong, L. et al. (2017) PlantTFDB 4.0: toward a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Research 45(D1), D1040–D1045. DOI: 10.1093/nar/gkw982.
Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28(1), 27–30. DOI: 10.1093/nar/28.1.27.
Kodama, Y., Shumway, M., Leinonen, R. and International Nucleotide Sequence Database Collaboration (2012) The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Research 40(Database issue), D54–D56. DOI: 10.1093/nar/gkr854.
Korzybski, T. (1962) The basis for classification and nomenclature of enzymes according to the report of the commission on enzymes of the international union of biochemistry. Postepy Biochemii 8, 261–293.
Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C. et al. (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research 40(Database issue), D1202–D1210. DOI: 10.1093/nar/gkr1090.


Langfelder, P. and Horvath, S. (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559. DOI: 10.1186/1471-2105-9-559.
Leinonen, R., Sugawara, H., Shumway, M. and International Nucleotide Sequence Database Collaboration (2011) The sequence read archive. Nucleic Acids Research 39(Database issue), D19–D21. DOI: 10.1093/nar/gkq1019.
Moreno, P., Fexova, S., George, N., Manning, J.R., Miao, Z. et al. (2022) Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research 50(D1), D129–D140. DOI: 10.1093/nar/gkab1030.
Obayashi, T., Nishida, K., Kasahara, K. and Kinoshita, K. (2011) ATTED-II updates: condition-specific gene coexpression to extend coexpression analyses and applications to a broad range of flowering plants. Plant & Cell Physiology 52(2), 213–219. DOI: 10.1093/pcp/pcq203.
Ohyanagi, H., Takano, T., Terashima, S., Kobayashi, M., Kanno, M. et al. (2015) Plant Omics Data Center: an integrated web repository for interspecies gene expression networks with NLP-based curation. Plant & Cell Physiology 56(1), e9. DOI: 10.1093/pcp/pcu188.
O’Malley, R.C., Huang, S.-S.C., Song, L., Lewsey, M.G., Bartlett, A. et al. (2016) Cistrome and epicistrome features shape the regulatory DNA landscape. Cell 165(5), 1280–1292. DOI: 10.1016/j.cell.2016.04.038.
Ouyang, S., Zhu, W., Hamilton, J., Lin, H., Campbell, M. et al. (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Research 35(Database issue), D883–D887. DOI: 10.1093/nar/gkl976.
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. Available at: https://www.R-project.org/ (accessed 6 January 2022).
Riechmann, J.L., Heard, J., Martin, G., Reuber, L., Jiang, C. et al. (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 290(5499), 2105–2110. DOI: 10.1126/science.290.5499.2105.
Sakai, H., Lee, S.S., Tanaka, T., Numa, H., Kim, J. et al. (2013) Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics. Plant & Cell Physiology 54(2), e6. DOI: 10.1093/pcp/pcs183.
Sayers, E.W., Beck, J., Bolton, E.E., Bourexis, D., Brister, J.R. et al. (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 49(D1), D10–D17. DOI: 10.1093/nar/gkaa892.
UniProt Consortium (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research 49(D1), D480–D489. DOI: 10.1093/nar/gkaa1100.
Yano, K., Imai, K., Shimizu, A. and Hanashita, T. (2006) A new method for gene discovery in large-scale microarray data. Nucleic Acids Research 34(5), 1532–1539. DOI: 10.1093/nar/gkl058.
Yilmaz, A., Mejia-Guerra, M.K., Kurz, K., Liang, X., Welch, L. et al. (2011) AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Research 39, D1118–D1122. DOI: 10.1093/nar/gkq1120.
Zdobnov, E.M. and Apweiler, R. (2001) InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics (Oxford, England) 17(9), 847–848. DOI: 10.1093/bioinformatics/17.9.847.
Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4, Article17. DOI: 10.2202/1544-6115.1128.

11

Plant Hormones: Gene Family Organization and Homolog Interactions of Genes for Gibberellin Metabolism and Signaling in Allotetraploid Brassica napus

Eiji Nambara1*, Dawei Yan2, Jing Wen3, Arjun Sharma1, Frederik Nguyen1, Ange Yan4, Karin Uruma4 and Kentaro Yano4

1Department of Cell & Systems Biology, University of Toronto, Ontario, Canada; 2Department of Plant Sciences, University of California, Davis, USA; 3National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, China; 4Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Japan

Abstract

Gibberellins (GAs) are phytohormones that regulate a number of key traits for crop performance, such as plant height, flowering time, and nitrogen use efficiency. We have curated the gene family organization and expression of genes for GA biosynthesis, catabolism, and signaling in allopolyploid Brassica napus. Genes encoding early GA biosynthetic enzymes, including ent-copalyl diphosphate synthase (CPS), ent-kaurene synthase (KS), ent-kaurene oxidase (KO), and ent-kaurenoic acid oxidase (KAO), form small families that are composed of one or two homeologous gene pairs. Late GA biosynthesis enzymes, including GA 20-oxidase (GA20ox) and GA 3-oxidase (GA3ox), the GA catabolism enzyme GA 2-oxidase (GA2ox), and signaling proteins, including GIBBERELLIN INSENSITIVE DWARF1 (GID1) receptors, GID2/SLEEPY (SLY) F-box proteins, and DELLA repressors, are encoded by larger families of 5–15 pairs of homeologs (i.e., 10–30 genes). Overall, families of GA metabolism and signaling genes are well maintained among ten Brassica napus accessions with regard to the number of family members. However, this is achieved by an equilibrium between active gene duplications and losses. The BnaGA20ox, BnaGA2ox, and BnaGID2/SLY2 families show remarkable diversification within accessions. The BnaGA20ox and BnaGA2ox families show gene duplications and losses more frequently than other families, and a number of truncated genes are annotated for these families. Truncated GA2ox genes were frequently found in Brassica oleracea, suggesting that the active diversification of BnaGA2ox is derived from the progenitor. Gene duplications and losses are frequently observed for pairs of homeologs. Moreover, homeolog exchanges are enriched for family members showing gene duplications and/or losses. These observations suggest that homeolog interaction is associated with gene family maintenance. Transcriptomes at various developmental stages in the ZS11 accession showed that most homeolog pairs are co-expressed, and that gene expression divergence between homeologous genes is mostly tissue-specific. These analyses suggest that genomics-based breeding is suitable for effectively selecting GA-related traits from a limited source of genetic diversity in the young polyploid B. napus.

*Corresponding author: eiji.nambara@utoronto.ca
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0011


11.1 Plant Hormones and Height Control

Plant hormones are plant-derived signal molecules that are conserved in the plant kingdom. These signals impact plant growth, development, and physiology. Auxins, cytokinins, gibberellins (GAs), abscisic acid (ABA), ethylene, jasmonates (JAs), salicylic acid (SA), brassinosteroids (BRs), peptide hormones, and strigolactones have been well documented for their hormonal functions (Santner et al., 2009; Kamiya, 2010; Vanstraelen and Benková, 2012). Applying these hormones to plants, or blocking their functions by either mutations or chemicals, alters developmental and physiological outcomes. Each hormone induces a specific signaling pathway to regulate downstream events. Plant hormones regulate multiple morphological and physiological processes, although ABA, ethylene, JAs, and SA are sometimes called stress hormones and play prominent roles in stress responses (Fujita et al., 2006). Plant hormones often function together in either antagonistic or additive/synergistic manners. For example, antagonistic regulation by GAs and ABA in seed germination, and by auxins and cytokinins in the regeneration of de-differentiated callus cells, has been well documented (Koornneef et al., 2002; Ikeuchi et al., 2016). Hormone balance is a critical factor in evaluating plant performance. For example, Habataki, a high-yield rice cultivar, carries loci causing semi-dwarfism and increased yield (grain number per panicle). The major quantitative trait loci for plant height and yield encode a GA biosynthesis enzyme and a cytokinin catabolic enzyme, respectively (Ashikari et al., 2005). Due to the conserved functions of hormones, mutating orthologous “hormone genes” often leads to the expected beneficial phenotypes in different plant species. Most of the key “hormone genes”, including those encoding core signaling proteins and rate-limiting enzymes, have been identified (Santner et al., 2009; Takeuchi et al., 2021). Thus, these genes have been excellent targets for the molecular breeding of crops with limited phenotypic diversity.

Plant height is an important morphological trait for plant performance. Semi-dwarf cultivars are important for lodging tolerance (Hedden, 2003; Sakamoto and Matsuoka, 2004). GAs and BRs are plant growth hormones known to cause dwarfism when their functions are blocked. GA metabolism and signaling genes affect vital traits in crops, such as stem elongation, seed germination, and flowering (Koornneef et al., 2002; Sakamoto and Matsuoka, 2004; Izawa, 2021). The most noticeable phenotype of GA-deficient and GA-insensitive mutants is dwarfism (Sakamoto and Matsuoka, 2004). Semi-dwarf varieties of rice, semi-dwarf1 (sd1), and wheat, Reduced height-1 (Rht-1), which are caused by disrupted GA functions, were developed in the Green Revolution (1950s–1960s) and contributed to dramatic increases in yield (Hedden, 2003). BRs regulate key agricultural traits, such as plant height, leaf angle, and grain size (Tong and Chu, 2018). BR-dependent semi-dwarf cultivars, such as uzu of Hordeum vulgare, have been used in agriculture (Chono et al., 2003). BR-deficient mutants are also selected for developing ornamental miniature plants, such as tomato Micro-Tom and morning glory Kobito (Suzuki et al., 2003; Martí et al., 2006). These GA and BR semi-dwarf cultivars have recently been recognized for other phenotypes associated with dwarfism. They show enhanced abiotic stress responses (Colebrook et al., 2014; Illouz-Eliaz et al., 2020; Nolan et al., 2020). GA semi-dwarf cultivars require a higher nitrogen supply to reach maximum yield. This is reflected in the negative side of the Green Revolution, in which extra fertilizers were supplied to these lines. Recently, it was shown that GA signaling proteins interact directly with key regulators of growth and nitrogen use in rice (Li et al., 2018; Wu et al., 2020b).

11.2 Brassica napus

Oilseed rape (Brassica napus) is an important crop cultivated worldwide. It is the third-largest source of plant-based edible oil, with about 28 million tons produced per year globally (USDA, 2021). B. napus is also a crop for feed and biofuel (Durrett et al., 2008; Friedt et al., 2018; Le Thanh et al., 2019). Domestication and breeding of B. napus have produced genetically diverse accessions and cultivar types that are adapted to a wide range of climates, from winter and semi-winter to spring types. Breeding of B. napus has improved plant performance, such as oil quality with decreased glucosinolate and erucic acid contents, yield, flowering time, and pathogen resistance. Morphological and developmental modifications of phenotypes, such as flowering time, pathogen responses, pod shattering, and height control, directly impact both seed yield and environmental adaptation.

B. napus is an allotetraploid (2n = 4× = 38, AACC) crop formed by hybridization of Brassica rapa (2n = 20, AA) and Brassica oleracea (2n = 18, CC) about 7000 years ago (Allender and King, 2010; Chalhoub et al., 2014). Recent analysis indicates that B. napus originated from the ancestor of European turnip (B. rapa) and a common ancestor of cauliflower, broccoli, kohlrabi, and Chinese kale (B. oleracea) (Lu et al., 2019). Assembly of B. napus genomes from the winter-type Darmor-bzh and the semi-winter-type ZS11 and Tapidor cultivars has identified about 100,000 genes in 80–95% of the entire 1130 Mbp genome (Chalhoub et al., 2014; Bayer et al., 2017; Sun et al., 2017). The genomes of B. napus and its progenitors have experienced multiple rounds of chromosomal multiplication and local gene duplications/losses. The functionality of the B. napus genomes was altered soon after hybridization. Nascent or resynthesized B. napus displays genome-wide transcriptomic changes, including expression bias between the A and C subgenomes and dose compensation of expression levels relative to the progenitors (Zhang et al., 2015; Wu et al., 2018). After these initial transcriptomic changes, cultivars/accessions experienced complex histories during domestication and adaptation that diversified the B. napus genomes (Bayer et al., 2017; Sun et al., 2017; Lu et al., 2019). Most homeologous gene pairs show similar expression patterns, while about 7% of B. napus genes show differences in expression patterns between homeologous genes in Darmor-bzh. This suggests that B. napus is an early-stage polyploid crop with functional differentiation (Chalhoub et al., 2014). The genomes of B. napus have diversified through homeolog exchanges between the A and C subgenomes and through gene losses, which occur more frequently than in the progenitors (Chalhoub et al., 2014). Despite the remarkable progress in understanding gene functions, it remains unknown how functionally related gene families are maintained in the polyploid B. napus genomes.


11.3 GAs

11.3.1 GA metabolism

Over 100 structurally distinct GAs have been chemically identified, but only a small number of GAs, such as GA4 and GA1, possess biological activity (i.e., bioactive GAs) (Yamaguchi, 2008). Other GAs are mostly precursors or catabolites of the bioactive forms. In germinating seeds of Lepidium sativum, a member of the Brassicaceae, GA4 is the most abundant bioactive GA, and other bioactive GAs (i.e., GA1, GA3, GA6, and GA7) are also detected in plant extracts (Graeber et al., 2014). In Arabidopsis thaliana, too, the non-13-hydroxylation pathway predominates to produce GA4 as the primary bioactive form, while most higher plants produce more 13-hydroxy GAs in vegetative tissues (Fig. 11.1).

GA biosynthesis is divided into three phases based on the subcellular localization of the biosynthetic enzymes (Yamaguchi, 2008; Hedden and Sponsel, 2015). The initial phase is the conversion of geranylgeranyl diphosphate (GGDP) to ent-kaurene by ent-copalyl diphosphate synthase (CPS) and ent-kaurene synthase (KS) in plastids (Fig. 11.1). The second phase converts ent-kaurene to GA12, catalyzed by ent-kaurene oxidase (KO) and ent-kaurenoic acid oxidase (KAO) in the endoplasmic reticulum (Fig. 11.1). The last phase is the conversion of GA12 to bioactive GAs, catalyzed by GA 20-oxidase (GA20ox) and GA 3-oxidase (GA3ox) in the cytosol (Fig. 11.1). In this chapter, we designate the first and second phases as early GA biosynthesis and the third phase as late GA biosynthesis.

Loss of function of early GA biosynthesis genes causes severe GA deficiency in many plants (Phillips, 2016). Overexpression of AtCPS induces accumulation of GA12 and massive release of volatile ent-kaurene, but not accumulation of bioactive GAs (Fleet et al., 2003). This indicates that the biosynthesis of bioactive GAs is regulated downstream of GA12. The mobile nature of GA12 also supports the differential regulation of early and late GA biosynthesis (Regnault et al., 2015). Bioactive GA synthesis for stem elongation is primarily regulated at the level of GA20ox expression (Huang et al., 1998; Coles et al., 1999; Carrera et al., 2000). The last step to produce bioactive GA is catalyzed by GA3ox. Unlike in stem elongation, GA3ox plays a prominent role in regulating GA biosynthesis during seed germination of lettuce and A. thaliana (Toyomasu et al., 1993; Yamauchi et al., 2004).

GA 2-oxidase (GA2ox) is best characterized for its regulatory role in decreasing bioactive GA levels in various plants (Yamaguchi, 2008). Three structurally distinct classes of GA2ox are known: class I and class II GA2ox convert C19-GAs, including bioactive GAs, to less active metabolites, and class III members preferentially catalyze the conversion of C20-GA precursors to remove them from the GA biosynthesis pathway (Yamaguchi, 2008; Hedden and Sponsel, 2015). The enzymatic functions of GA2ox may be more diversified among family members than this rule suggests, and some GA2ox accept a broad range of substrates and catalyze different reactions (Lange et al., 2020). GA20ox, GA3ox, and GA2ox are involved in feedback and feedforward regulation, which function in GA homeostasis (Yamaguchi, 2008).

Fig. 11.1. GA biosynthesis, catabolism, and signaling pathways. Bioactive GAs are synthesized from geranylgeranyl diphosphate (GGDP). Early GA biosynthesis takes place in plastids and then in the endoplasmic reticulum to produce GA12, catalyzed by ent-copalyl diphosphate synthase (CPS), ent-kaurene synthase (KS), ent-kaurene oxidase (KO), and ent-kaurenoic acid oxidases (KAO). Late GA biosynthesis converts GA12 to bioactive GAs by GA 20-oxidases (GA20ox) and GA 3-oxidases (GA3ox). GA4 is the primary bioactive form of GA in Brassica plants. GA 13-hydroxylases (GA13ox) convert GA12 to GA53, the precursor for the less bioactive GA1. The conversion of GA4 to GA1 is also known in Brassica plants. GA 2-oxidase (GA2ox) converts C20- and C19-GAs, including GA4 and GA1, to deactivated metabolites. The core components of GA signaling are GID1 receptors, GID2/SLY F-box proteins, and DELLA repressors. GA responses are repressed by DELLA in the absence of bioactive GA, and bioactive GA accumulation facilitates the degradation of DELLA by GID1 and GID2/SLY. Black and blue arrows indicate enzymatic reactions and signal transduction events, respectively. Dashed blue lines indicate ligand–receptor interactions. Enzymes are shown in red.

11.3.2 GA signaling

The core GA signaling pathway requires three essential components: GIBBERELLIN INSENSITIVE DWARF1 (GID1) receptors, GID2/SLEEPY (SLY) F-box proteins, and DELLA proteins (Ueguchi-Tanaka et al., 2007; Sun, 2011). DELLA proteins are repressors of GA signaling that function as transcriptional regulators. Binding of bioactive GA to the GID1 receptor facilitates GID2-mediated DELLA degradation, which de-represses GA signaling (Ueguchi-Tanaka et al., 2007; Sun, 2011) (Fig. 11.1). Biochemical analysis of three A. thaliana GA receptors, AtGID1a, AtGID1b, and AtGID1c, shows that GA4 has approximately 100 times higher affinity for all AtGID1s than the other bioactive GAs, GA1 and GA3, supporting the idea that GA4 is the primary bioactive GA in A. thaliana (Nakajima et al., 2006). In B. rapa, 13-hydroxylation of GA4 to produce GA1 was reported (Rood and Hedden, 1994) (Fig. 11.1). It is possible that the conversion of GA4 to GA1 is a catabolic process for GA4. GA8, the 2β-hydroxylated metabolite of GA1, was more abundant in germinating seeds of particular accessions of B. napus than in other accessions, suggesting that 13-hydroxylated GAs contribute to accession-specific regulation of seed germination (Boter et al., 2019). The in vitro affinity of AtGID1b for GA4 is ten times higher than that of the other receptors, and AtGID1b shows a basal GA-independent interaction with DELLAs (Nakajima et al., 2006; Yamamoto et al., 2010).

The A. thaliana genome encodes two GID2/SLYs, SLY1 and SLY2 (Dill et al., 2004; Ariizumi et al., 2011). SLY1 interacts with multiple DELLAs for GA-dependent proteolysis, while SLY2 binds selectively to particular DELLAs (Ariizumi et al., 2011). Consistent with the biochemical analyses, the sly1 mutants, but not sly2, show semi-dwarf phenotypes. The sly2 mutation enhances dwarfism in the sly1 background (Ariizumi et al., 2011). DELLAs play a central role in GA-mediated transcription and hormonal crosstalk (Claeys et al., 2014; Davière and Achard, 2016). DELLAs are GRAS transcription factors that contain conserved DELLA and TVHYNP domains at the N-terminus, which are required for their binding to GID1 receptors. DELLAs can be stabilized by disruptions of these domains, which confer the semi-dominant GA-insensitive dwarf phenotype (Davière and Achard, 2016).

11.3.3 GA-auxotroph and response mutants

In A. thaliana, five loci conferring GA deficiency are categorized into two groups: non-germinating severe dwarfs (ga1, ga2, and ga3) and germinating semi-dwarfs (ga4 and ga5) (Koornneef et al., 1990). The GA1, GA2, and GA3 loci encode the early GA biosynthesis enzymes CPS, KS, and KO, respectively. The A. thaliana genome encodes a single gene for each of them; thus, mutations in these genes cause severe GA deficiency. Severe dwarf mutants defective in single-copy GA signaling genes, such as gid1 and gid2, have been reported in rice. The GA4 and GA5 loci encode AtGA3ox1 and AtGA20ox1, respectively. Redundancy and feedback regulation among family members of GA20ox and GA3ox are thought to be crucial for making these mutants semi-dwarf (Phillips, 2016). Another group of semi-dwarfs encompasses gain-of-function mutants defective in the DELLA and TVHYNP domains of DELLA proteins.

Semi-dwarfism is a beneficial trait for increasing yield; thus, GA biosynthesis and signaling mutants have contributed to establishing elite lines in crops. Most such mutations in crops disrupt genes for GA20ox, GA3ox, GA2ox, or DELLA (Phillips, 2016). In cereals, semi-dwarf mutants with a mutation in GA20ox include rice sd-1 and barley sdw1, and those with a mutation in DELLA include maize d8 and wheat rht8 mutants (Phillips, 2016). Genome editing of the maize GA20ox-3 by CRISPR-Cas9 technology also produced a semi-dwarf mutant (Zhang et al., 2020). In addition, a maize quantitative trait locus (QTL) controlling internode elongation was mapped to ZmGA3ox2 (Teng et al., 2013). The semi-dwarfism of the wheat Rht18 mutant is the result of enhanced expression of GA2ox9 (Ford et al., 2018). Aside from mutants of GA20ox, GA3ox, GA2ox, and DELLA, GA-sensitive dwarfs among commercial varieties include rice Tan-Ginbozu, which carries the d35 mutation, a weak allele of OsKO2 (Itoh et al., 2004). Recent studies show that GA-dependent height control is not simply a matter of lodging tolerance, but is directly linked to nutrient-growth coordination via the physical interaction of DELLAs with GROWTH-REGULATING FACTOR4 (GRF4) and nitrogen-responsive chromatin regulation (Li et al., 2018; Wu et al., 2020b). Decoupling of growth and nitrogen metabolism enhanced the effects of fertilizers, which underlies both the positive and negative aspects of the Green Revolution.

Mutant phenotypes are determined by multiple factors, such as redundancy within gene families and the different stringency/types of mutations. Gene family compositions and their expression patterns are important factors in evaluating and predicting mutant phenotypes. In this chapter, we review currently available data on GA metabolism and signaling genes in B. napus.

11.4 GA Metabolism and Signaling Genes in B. napus: Gene Family Diversity and Gene Expression

Coding sequences (CDSs) for GA metabolism and signaling genes were collected from the genome sequences of ten B. napus accessions, B. rapa, and B. oleracea (Table 11.1) by BLASTP using A. thaliana orthologs as queries at the BnTIR (http://yanglab.hzau.edu.cn/BnTIR, accessed September 2022) and Ensembl Plants (https://plants.ensembl.org/index.html, accessed September 2022) websites. The cut-off values are shown in Table 11.1. The phylogenetic trees of B. napus gene families from ten accessions typically show clades branched from the A. thaliana orthologs. Each clade is further divided into two to six subclades corresponding to the orthologs of B. napus and its progenitors (Fig. 11.2). A subclade of conserved members contains B. napus orthologs from all ten accessions, while diversified members form a subclade with missing orthologs from some accessions (Fig. 11.3) (Supplemental Fig. 11.4; materials available via QR code on p.xii). A homeologous exchange was counted when a gene was located at the homeologous locus. Transcriptome data of ZS11 were obtained from the BnTIR website (http://yanglab.hzau.edu.cn/BnTIR, accessed September 2022). This website releases transcriptomes of > 90 developmental samples in ZS11; thus, this dataset allows comparison of developmental expression patterns and expression dominance among gene family members (Liu et al., 2021).
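The counting step behind Table 11.1 can be sketched as follows. This is a minimal Python illustration, not the procedure actually used for the chapter: it assumes the BLASTP hits are available as 12-column tabular text (the standard BLAST+ `-outfmt 6` layout, with the subject ID in column 2 and the E-value in column 11), and it assumes that short proteins are reported separately from the main count. The E-value and length cut-offs are those given in the footnote of Table 11.1; the function names, gene IDs, and protein lengths are hypothetical.

```python
# Cut-offs from the footnote of Table 11.1.
EVALUE_CUTOFF = {"KS": 1e-150, "KAO": 1e-150}   # all other families: 1e-100
DEFAULT_EVALUE_CUTOFF = 1e-100
SHORT_LENGTH = {"CPS": 500}                     # all other families: 300 aa
DEFAULT_SHORT_LENGTH = 300


def parse_tabular_hits(lines):
    """Yield (subject_id, evalue) from BLAST tabular lines (outfmt 6 columns)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[1], float(fields[10])


def count_family_members(family, blast_lines, protein_lengths):
    """Return (full_length, short) counts of unique hits below the E-value cut-off."""
    cutoff = EVALUE_CUTOFF.get(family, DEFAULT_EVALUE_CUTOFF)
    short_len = SHORT_LENGTH.get(family, DEFAULT_SHORT_LENGTH)
    passing = {sid for sid, evalue in parse_tabular_hits(blast_lines) if evalue < cutoff}
    short = {sid for sid in passing if protein_lengths[sid] < short_len}
    return len(passing) - len(short), len(short)


# Hypothetical hits for one query; IDs and values are made up for illustration.
hits = [
    "AtCPS\tgeneA\t90.0\t800\t80\t0\t1\t800\t1\t800\t0.0\t1500",
    "AtCPS\tgeneB\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-120\t400",
    "AtCPS\tgeneC\t40.0\t200\t120\t2\t1\t200\t1\t200\t1e-50\t100",
]
lengths = {"geneA": 802, "geneB": 310, "geneC": 790}

full, short = count_family_members("CPS", hits, lengths)
print(f"CPS: {full} (+{short})")
```

Reporting the result as "full (+short)" mirrors the notation of Table 11.1, in which numbers in parentheses indicate short proteins.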

Table 11.1. Numbers of family members for GA metabolism and signaling genes in A. thaliana, B. oleracea, B. rapa, and ten accessions of B. napus.

Species (lines)    CPS       KS    KO    KAO    GA20ox     GA3ox     GA2ox       GID1    GID2/SLY    DELLA
A. thaliana        1         1     1     2      5          4         9           3       2           5
B. rapa            2         1     1     2      12         5 (+1)    16 (+1)     6       4           5
B. oleracea        1         1     1     2      12         4 (+2)    14 (+7)     5       4           5
B. napus
  Darmor (v10)     4         2     2     4      23 (+1)    11        33 (+3)     11      8           10
  Express617       4         2     2     3      20 (+2)    10        25 (+8)     10      8           10
  Gangan           2         1     2     4      24         10        32 (+4)     11      3           10
  No2127           1 (+3)    1     2     4      26 (+2)    11        31 (+3)     10      7           10
  Quinta           4         2     2     5      24 (+1)    11        31 (+7)     10      8           10
  Shengli          3         2     2     4      23         10        32 (+5)     10      7           10
  Tapidor          4         2     2     3      19 (+5)    10        30 (+13)    10      8           10
  Westar           3         2     2     4      22 (+1)    11        29 (+5)     10      3           10
  ZS11             3         2     2     4      22 (+2)    10        31 (+8)     10      8           10
  Zheyou7          5         2     2     4      23 (+1)    10        32 (+7)     10      3           11

BLASTP searches were conducted using Arabidopsis sequences as queries against the B. oleracea (TO1000), B. rapa (Chiifu), and B. napus (ZS11) proteomes at the BnTIR website. Other sequences from B. napus cultivars/lines were obtained by BLASTP using the ZS11 sequences as queries. Hits with E-values lower than E-100 (CPS, KO, GA20ox, GA3ox, GA2ox, GID1, GID2/SLY, DELLA) or E-150 (KS, KAO) were counted. Numbers in parentheses indicate the number of short proteins (< 500 amino acids for CPS, and < 300 amino acids for others).

11.4.1 Early GA biosynthesis (synthesis of GA12)

Four enzymes, CPS, KS, KO, and KAO, are involved in early GA biosynthesis. These genes form small multigene families (two to four members per accession), which comprise one or two homeologous gene pairs (Fig. 11.2). Most early GA biosynthesis genes in B. napus show a clear phylogenetic relationship to the progenitors’ orthologs, except for BnaC03.CPS, whose progenitor ortholog was not found in the TO1000 accession of B. oleracea (Fig. 11.2a). The degree of conservation within accessions varies among family members. It is worth mentioning that, with the exception of BnaA09.CPS, most of the conserved members of the early GA biosynthesis genes originated from B. oleracea (Fig. 11.2).

Most accessions have two homeologous gene pairs for CPS. BnaA09.CPS is the most conserved member and is present in all ten accessions, while the other three members (BnaC09.CPS, BnaA03.CPS, and BnaC03.CPS) are more diversified, with missing/truncated family members in some accessions (Fig. 11.2a). BnaA03.CPS and BnaC03.CPS in some accessions are annotated to encode either truncated or extended proteins (Supplemental Table 11.1; materials available via QR code on p.xii). The extended BnaCPS in the Darmor, ZS11, Tapidor, and Zheyou7 accessions contains a C-terminal extension formed by fusion to a neighboring F-box/RNI-like protein, although this fusion has not been validated experimentally. Current advances in sequencing technology for reading long transcripts should help to confirm these annotations (Yao et al., 2020). The short annotations of BnaA03.CPS in Shengli and Zheyou7 contain the same 5′ truncations, but again, these have not been experimentally validated, and it is unclear whether the first Met is correctly assigned. The short sequences of BnaA03.CPS in Gangan and its ortholog BraA03g028920 in B. rapa have 3′ truncations. The large truncations of BnaA03.CPSa and BnaC03.CPS in No2127 lack both the 5′ and 3′ regions. The short annotation of BnaA03.CPSb includes the 3′ region. Moreover, the truncated BnaA03.CPSa and BnaA03.CPSb in No2127 are tandemly located; thus, these may have originated from the same single gene. All extended and truncated annotations are either BnaA03.CPS or BnaC03.CPS (Supplemental Table 11.1; materials available via QR code on p.xii), which would appear to be unstable alleles that are not functionally maintained.

Gene families for KS, KO, and KAO are well conserved with regard to the number of family members (Fig. 11.2b–d). In particular, the BnaKO family is the most conserved among the ten accessions (Fig. 11.2c). The extended BnaKAO annotations, BnaA10.KAO1 in Darmor-bzh and BnaC05.KAO1 in Darmor-bzh, ZS11, and Zheyou7, are 5′ fusions to a galactosyltransferase family protein gene. A small truncation was found at the 3′ end of BnaA05.KAO2 in Zheyou7. These extensions and truncations require experimental validation.

Gene duplications and losses are found at a low frequency in the families of early GA biosynthesis genes (Fig. 11.3). One trend observed is


E. Nambara et al.

Fig. 11.2. Gene families and developmental expression patterns of early GA biosynthesis genes. Phylogenetic trees for CPS (a), KS (b), KO (c), and KAO (d) are shown. Coding sequences (CDSs) of the corresponding genes from ten B. napus accessions were used to construct the phylogenetic trees. Gene IDs and protein lengths are shown in Supplemental Table 11.1. Orthologs from A. thaliana, B. rapa, and B. oleracea are included as references. The 3′ extensions of BnaCPS (A03p30850.1_BnaDAR (338 bp), BnaC03G0283900TA (251 bp), BnaA03G0269400ZS (449 bp), BnaC03G0260400ZY (521 bp), C09p001980.1_BnaEXP (242 bp)) and the 5′ extensions of BnaKAO (A10p03690.1_BnaDAR (785 bp), C05p03400.1_BnaDAR (794 bp), BnaC05G0035200ZS (809 bp), and BnaC05G0031100ZY (809 bp)) were removed prior to constructing the phylogenetic trees. CDSs were aligned with ClustalW (gap opening penalty 7.0 and gap extension penalty 6.66 for both pairwise and multiple alignments). Neighbor-joining trees were constructed using MEGA X (1000 bootstrap replications with default settings). Reference genes are highlighted in red (A. thaliana), green (B. rapa), and pale green (B. oleracea). The last two or three letters of B. napus gene names indicate the accession: Darmor-bzh (DAR), Express617 (EXP), Gangan (GG), No2127 (NO), Quinta (QA), Shengli (SL), Tapidor (TA), Westar (WE), ZS11 (ZS), and Zheyou7 (ZY). Red asterisks highlight the homeologous exchanges. (e) Heatmap representing the expression of early GA biosynthesis genes in ZS11. Samples are root, stem lower (stem low), stem middle (stem mid), stem upper (stem up), leaf position 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, cotyledon, vegetative rosette leaf (veg. rosette), bud 2 mm, bud 4 mm, sepal, petal, filament, pollen, silique wall (sil. wall) at 20, 30, 40, 50, and 60 days after flowering (DAF), and immature seeds at 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, and 62 DAF (from left to right). The color scale for expression levels is shown at the right of the heatmap. The heatmap was created at http://yanglab.hzau.edu.cn/index/heatmap (accessed September 2022).

Plant Hormones


Fig. 11.3. Gene duplications and losses of GA biosynthesis and catabolism genes in B. napus accessions. Genes are shown by pale green cells. Truncated genes are yellow, and blue cells indicate that the corresponding genes were not found. Each line indicates a progenitor (Bra or Bo), Darmor-bzh (DAR), Express617 (EXP), Gangan (GG), No2127 (NO), Quinta (QA), Tapidor (TA), Westar (WE), ZS11 (ZS), and Zheyou7 (ZY) (from left to right). Homeologous interactions are highlighted in red (gene gain/loss), pale red (co-gene losses/truncations), and brown (homeologous exchange). Co-gene losses of functionally related genes are highlighted by black boxes. Gene IDs are shown in Supplemental Table 11.1.

the compensation or enhancement of family size changes between homeologous gene pairs. Gene duplication of BnaA09.CPS is found in Shengli and Zheyou7, and its homeolog BnaC09.CPS was lost in these accessions (Fig. 11.3). In No2127, both homeologous genes, BnaA03.CPS and BnaC03.CPS, are annotated to encode truncated CPS (Fig. 11.3). Another interesting finding is that Gangan and No2127 show decreased numbers of both CPS and KS genes. The CPS and KS gene families in the other eight accessions maintain their numbers of family members (Table 11.1). Because KS and CPS act together in ent-kaurene synthesis (Fig. 11.1), it is possible that these gene losses are not independent events. It would be interesting to examine the BnaCPS and BnaKS families in more accessions.
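The two patterns described here, a duplication of one homeolog accompanied by loss of the other, and loss or truncation of both homeologs, can be screened for mechanically once intact copy numbers are tabulated per accession. The following is a minimal sketch under stated assumptions: a duplication is encoded as two intact copies, a lost or truncated gene as zero intact copies, and the ZS11 entry is included only as an illustrative contrast. The function name, the labels, and the encoding are hypothetical and are not taken from Fig. 11.3.

```python
def classify_pair(copies_a, copies_c):
    """Label a homeologous gene pair from intact copy numbers in one accession."""
    if copies_a == 0 and copies_c == 0:
        return "co-loss"  # both homeologs lost or truncated
    if (copies_a >= 2 and copies_c == 0) or (copies_c >= 2 and copies_a == 0):
        return "compensation"  # one homeolog duplicated, the other lost
    if copies_a == 1 and copies_c == 1:
        return "maintained"
    return "other"


# (accession, A homeolog, C homeolog) -> (intact copies of A, intact copies of C)
# Illustrative encoding of the cases mentioned in the text (assumed counts).
pairs = {
    ("Shengli", "BnaA09.CPS", "BnaC09.CPS"): (2, 0),
    ("Zheyou7", "BnaA09.CPS", "BnaC09.CPS"): (2, 0),
    ("No2127", "BnaA03.CPS", "BnaC03.CPS"): (0, 0),
    ("ZS11", "BnaA09.CPS", "BnaC09.CPS"): (1, 1),
}
labels = {key: classify_pair(*counts) for key, counts in pairs.items()}
```

Applying the same classification across functionally linked families (for example CPS and KS) would be one way to check whether losses co-occur more often than expected, as the text suggests.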


Early GA biosynthesis genes are expressed at many developmental stages (Fig. 11.2e). This is different from the larger gene families with differentiated tissue-specific expression patterns for each member (see below). Homeologous gene pairs show similar expression patterns to each other (Fig. 11.2e), suggesting that homeologous genes are substantially redundant. BnaCPS and BnaKS from the A subgenome show higher levels of expression than those from the C subgenome. Interestingly, the expression of BnaC09.CPS is higher in stems than its homeolog. BnaC09.CPS would be a candidate gene to reduce stem elongation when mutated, although severe dwarfism would be expected when a null mutation is introduced. As an example of mutants defective in early GA biosynthesis in Brassica plants, a B. rapa nhm1 mutant defective in BraKS, which is encoded by a single gene, displays a non-heading phenotype (Gao et al., 2020).
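Comparisons such as "higher in stems than its homeolog" can be made explicit with a per-tissue log2 ratio between the A- and C-subgenome copies. The following is a minimal sketch, not the method used to generate Fig. 11.2e: the expression values are invented, and the pseudocount of 1 and the twofold threshold are illustrative assumptions.

```python
import math


def homeolog_bias(expr_a, expr_c, pseudocount=1.0, min_log2_ratio=1.0):
    """Classify per-tissue expression bias between two homeologs.

    expr_a and expr_c map tissue name -> expression level for the A- and
    C-subgenome copies. Returns tissue -> (log2 ratio A/C, call).
    """
    calls = {}
    for tissue in expr_a:
        ratio = math.log2((expr_a[tissue] + pseudocount) /
                          (expr_c[tissue] + pseudocount))
        if ratio >= min_log2_ratio:
            call = "A-biased"
        elif ratio <= -min_log2_ratio:
            call = "C-biased"
        else:
            call = "balanced"
        calls[tissue] = (ratio, call)
    return calls


# Hypothetical expression values for one homeologous pair (not ZS11 data).
a_copy = {"root": 15.0, "stem": 3.0, "leaf": 7.0}
c_copy = {"root": 3.0, "stem": 15.0, "leaf": 7.0}
bias = homeolog_bias(a_copy, c_copy)
```

The pseudocount keeps the ratio defined when one copy is not detected, which matters for pairs in which one homeolog is silent in a given tissue.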

11.4.2 BnaGA20ox

The BnaGA20ox genes form a large gene family. A phylogenetic tree of 225 BnaGA20ox CDSs from ten accessions shows six groups that correspond to five AtGA20ox genes and one unrelated group, designated BnaGA20ox1–BnaGA20ox5 and BnaGA20oxX (Supplemental Fig. 11.1a; materials available via QR code on p.xii). These are further divided into a total of 24 members (Supplemental Fig. 11.1a; materials available via QR code on p.xii). Twelve GA20ox genes from B. oleracea and 12 GA20ox genes from B. rapa each correspond phylogenetically to one member, while the genomes of the ten B. napus accessions contain 19–26 BnaGA20ox genes (Table 11.1; Supplemental Table 11.1; materials available via QR code on p.xii). A remarkable feature of the BnaGA20ox family is the number of truncated genes (Table 11.1; Supplemental Table 11.1; materials available via QR code on p.xii). A total of 16 truncated BnaGA20ox genes were found in the genomes of the ten B. napus accessions. Premature stop codons (i.e., 3′ truncations) were found in some truncated genes, such as BnaA03.GA20ox3a and BnaA03.GA20ox3b in Tapidor. A homeologous pair of 3′ truncated genes, BnaA08.GA20ox5 and BnaC06.GA20ox5, is present in Express617. Most truncated genes were found in particular family members of BnaGA20ox (Fig. 11.3), with 5 of the 16 truncated genes belonging to the BnaGA20oxX paralogs. It is worth mentioning that the truncated BnaGA20ox genes of ZS11 are either not expressed or expressed at low levels (Supplemental Fig. 11.2; materials available via QR code on p.xii); thus, these are likely pseudogenes. Six gene duplications were observed in the BnaGA20ox family (Fig. 11.3). The phylogenetic analysis indicates that the truncated gene C01p37580.1_BnaDAR in Darmor-bzh is phylogenetically related to BnaA01.GA20ox4, suggesting that homeologous exchange may be associated with the loss of BnaC01.GA20ox4 in this accession (Fig. 11.3; Supplemental Table 11.1; materials available via QR code on p.xii). No2127 is the only accession in which the number of family members is increased to 26 (Table 11.1; Fig. 11.3; Supplemental Fig. 11.1a; materials available via QR code on p.xii).

The 22 BnaGA20ox genes in ZS11 are differentially expressed and display unique tissue-specific patterns (Fig. 11.4). GA20ox is a rate-limiting enzyme of GA biosynthesis; thus, the tissue-specific expression of these genes is likely to impact the physiological functions of GA. Expression patterns of homeologous gene pairs are similar to each other, although genes from the A subgenome tend to be expressed more abundantly than their homeologs on the C subgenome (Fig. 11.4). The most abundantly expressed BnaGA20ox genes in the stem are BnaA01.GA20ox1 and BnaC01.GA20ox1, followed by BnaA02.GA20ox2 and BnaC09.GA20ox2 (Fig. 11.4). In addition, BnaC07.GA20ox1 was identified as a candidate plant height control gene in genome-wide association studies (GWAS) (Li et al., 2016). In A. thaliana, the expression of AtGA20ox1 predominates in the stem, and loss of function of AtGA20ox1/GA5 leads to a semi-dwarf phenotype. The atga20ox2 single mutant shows no obvious height phenotype, but the atga20ox2 mutation enhances the semi-dwarf phenotype of atga20ox1 in the double mutant (Rieu et al., 2008). Collectively, these findings suggest that GA20ox1 and GA20ox2 are the primary GA20ox enzymes regulating stem elongation in both A. thaliana and B. napus. In flowers and buds, three BnaGA20ox2 members are


Fig. 11.4. Developmental expression patterns of late GA biosynthesis and signaling genes. Heatmaps represent the expression of late GA biosynthesis and signaling genes in ZS11. Yellow boxes highlight expression patterns discussed in the text. Samples are the same as in Fig. 11.2. The heatmaps were created at http://yanglab.hzau.edu.cn/index/heatmap (accessed September 2022).

differentially expressed. In A. thaliana, AtGA20ox1 and AtGA20ox2 are redundantly involved in male fertility and flower organ development (Rieu et al., 2008). In leaves, the BnaGA20ox3 genes are the most abundantly expressed, followed by the BnaGA20ox2 homeologs. During seed development, BnaGA20ox3 and BnaGA20ox4 genes are abundantly expressed (Fig. 11.4). Within BnaGA20ox3, two pairs of homeologous genes have differentiated tissue-specific expression: one pair is expressed in leaves (BnaA10.GA20ox3 and BnaC09.GA20ox3) and the other during seed development (BnaA02.GA20ox3 and BnaC02.GA20ox3). Four BnaGA20ox4 genes are abundantly expressed during seed development, while the expression of BnaGA20oxX is below the detection limit in these samples. In the transcriptome of ZS11, no detectable expression of BnaGA20ox is observed in the root (Fig. 11.4). In A. thaliana, the root tip synthesizes bioactive GA for cell division and elongation, but the amount of GA it requires is lower than in other tissues (Tanimoto, 2012; Barker et al., 2021).

11.4.3 BnaGA3ox

The BnaGA3ox family is composed of 10–11 members. The phylogenetic tree of 105 BnaGA3ox CDSs from ten accessions shows four phylogenetic groups that correspond to the four A. thaliana GA3ox genes. These are further divided into 11 members (Supplemental Fig. 11.1b; materials available via QR code on p.xii). Each of the four BoGA3ox and five BraGA3ox genes corresponds to one member. The truncated BoGA3ox (Bo2g096080 and Bo6g125640) and BraGA3ox (BraA07g043080) genes are phylogenetically related to the other 2 of the 11 members (Supplemental Table 11.1; materials available via QR code on p.xii). This suggests that the bona fide progenitors of B. napus had at least five BoGA3ox and six BraGA3ox genes. No apparent gene gains (or gene duplications) were found for BnaGA3ox among the ten accessions. The homeologous GA3ox2 genes (BnaA07.GA3ox2 and BnaC06.GA3ox2) are unstable alleles, and six of the ten accessions lack both genes (Fig. 11.3).

Expression of BnaGA3ox is differentiated among the family members in ZS11. The homeologous GA3ox1 genes (BnaA06.GA3ox1 and BnaC05.GA3ox1) are expressed most abundantly in the stem. This coincides with the role of AtGA3ox1 in stem elongation in A. thaliana. Expression of the other homeologous BnaGA3ox1 genes (BnaA09.GA3ox1 and BnaC08.GA3ox1) is below the detection limit in these developmental samples. The homeologous BnaGA3ox4 genes (BnaA02.GA3ox4 and BnaC02.GA3ox4) are abundantly expressed during seed development. AtGA3ox4 is required for proper seed coat development in A. thaliana (Kim et al., 2005). It would be interesting to analyze the spatiotemporal expression of these genes at the tissue level in seeds. BnaC03.GA3ox3 is highly expressed in buds and flowers (Fig. 11.4).

11.4.4 BnaGA2ox

The GA2ox genes encode GA catabolic enzymes that form a large family with approximately 30 members in each B. napus accession. The genome of the A. thaliana Columbia accession encodes five class I/II GA2ox genes (AtGA2ox1–AtGA2ox6, with AtGA2ox5 being a pseudogene) and four class III AtGA2ox genes (AtGA2ox7–AtGA2ox10) (Lee and Zeevaart, 2005; Lange et al., 2020). The class I/II BnaGA2ox family contains five phylogenetic groups related to the five A. thaliana genes. These are further divided into 25 members, 22 of which correspond to 11 BraGA2ox and 11 BoGA2ox genes (Supplemental Fig. 11.1c; Supplemental Table 11.1; materials available via QR code on p.xii). In addition, two others are phylogenetically related to the truncated BoGA2ox genes Bo3g032250 and Bo3g032270 (Supplemental Table 11.1; materials available via QR code on p.xii). This indicates that the TO1000 accession of B. oleracea and Chiifu of B. rapa lack at least one GA2ox gene that was present in the bona fide progenitors that hybridized to form B. napus. Some GA2ox3 genes (BraA05g010560, BnaC03.GA2ox3_DAR, BnaA03.GA2ox3_DAR, BnaC03.GA2ox3_NO) are annotated as C-terminal fusion proteins to a P-loop containing nucleoside triphosphate hydrolase superfamily protein. The class III BnaGA2ox family is divided into 4 groups, which are further divided into 12 members (Supplemental Fig. 11.1c; Supplemental Table 11.1; materials available via QR code on p.xii). Only 5 BraGA2ox and 3 BoGA2ox genes are assigned as orthologs of these 12 members, and the other 4 members are phylogenetically related to truncated GA2ox genes in the progenitors (Supplemental Table 11.1; materials available via QR code on p.xii).

The most significant feature of the BnaGA2ox family is the large number of missing and truncated genes (Fig. 11.3). Among the 10 accessions, 23 orthologs are missing and 63 are truncated (Fig. 11.3). Both BnaGA2ox and BnaGA20ox have missing/truncated orthologs, but a noticeable difference between these families can be seen in the orthologs of the progenitors, in which a certain number of missing (four genes) and truncated (five genes) GA2ox genes were found, especially in B. oleracea (Fig. 11.3). This suggests that the GA2ox family began to diversify actively before B. napus was formed by natural interspecific hybridization and chromosome doubling. Consistent with this notion, similar family members are genetically unstable between B. napus and B. oleracea (Fig. 11.3). Among 306 GA2ox genes and 63 truncated genes from the 10 B. napus accessions, 16 gene duplication events were found, with only two successfully increasing the number of family members (Fig. 11.3). This suggests that diversification of BnaGA2ox is directed toward gene loss.

Most BnaGA2ox genes show differential tissue-specific expression (Supplemental Fig. 11.3; materials available via QR code on p.xii). Very high expression is observed in flower organs for BnaA09.GA2ox2, BnaC05.GA2ox2, BnaA10.GA2ox6, BnaC05.GA2ox6, and BnaA02.GA2ox9 (Supplemental Fig. 11.3; materials available via QR code on p.xii). In A. thaliana, the atga2ox10 mutant shows enhanced fertility and produces more seeds than the wild type (Lange et al., 2020). Note that AtGA2ox9 and AtGA2ox10 failed to specify distinct clades in the phylogenetic tree (Supplemental Fig. 11.1d; materials available via QR code on p.xii); thus, we cannot discuss the differences in physiological functions between these members. The ZS11 transcriptome also indicates that the root is a site of high BnaGA2ox expression, including BnaGA2ox1, BnaGA2ox2, BnaGA2ox6, and BnaGA2ox8 (Supplemental Fig. 11.3; materials available via QR code on p.xii). In A. thaliana, both atga2ox9 and atga2ox10 single mutants show better root growth than the wild type (Lange et al., 2020). BnaA10.GA2ox6 and BnaC03.GA2ox2 were shown to encode functional enzymes because their overexpression induced dwarfism and late flowering in A. thaliana and B. napus (Yan et al., 2017, 2021). BnaA03.GA2ox3 and BnaA03.GA2ox8 were identified as candidate genes for plant height control by GWAS (Li et al., 2016). Interestingly, BnaC03.GA2ox2 was also identified as 1 of 61 candidate flowering-time genes by GWAS (Lu et al., 2019). In A. thaliana, GA metabolism and flowering phenotypes have not been well investigated at the tissue level, and further studies are required. Moreover, the enhanced freezing tolerance of the atga2ox9 mutant is an interesting finding (Lange et al., 2020), because this trait would be important for winter-type cultivars.


11.4.5 GA signaling genes

We examined three gene families for core GA signaling components, GID1, GID2/SLY, and DELLA (Supplemental Fig. 11.1e–g; materials available via QR code on p.xii). The GID1 and DELLA families are highly conserved in B. napus accessions (Supplemental Fig. 11.1e, g; materials available via QR code on p.xii). These families have only a few missing genes, and no truncated genes are annotated (Supplemental Fig. 11.4; materials available via QR code on p.xii). There are no signs of homeologous exchanges for these families. This indicates that the GID1 and DELLA families are stably maintained in the B. napus accessions. Among the 10 accessions, 102 BnaGID1s are phylogenetically divided into 10 members related to 5 BoGID1 and 5 BraGID1 orthologs (Supplemental Fig. 11.1e; materials available via QR code on p.xii). In the BnaDELLA family, only two orthologs were not found in the ten accessions (Supplemental Fig. 11.1g; Supplemental Fig. 11.4; materials available via QR code on p.xii). Interestingly, two gene duplications were detected in ZS11 and Zheyou7, and both homeologous counterparts are lost (Supplemental Fig. 11.4; materials available via QR code on p.xii), suggesting that homeolog interactions are associated with these events. In contrast, the GID2/SLY family is remarkably diversified within B. napus accessions (Supplemental Fig. 11.1f; materials available via QR code on p.xii). The BnaGID2/SLY family is composed of one pair of conserved BnaSLY1 homeologs and zero to three pairs of variable BnaSLY2 homeologs (Supplemental Fig. 11.4; materials available via QR code on p.xii). The Gangan accession does not possess any BnaSLY2s in its genome, and only one BnaSLY2 was found in the Westar and Zheyou7 accessions. Interestingly, gene duplication of BnaSLY1 was found in the genome of Gangan, which lacks all six members of BnaSLY2. This may suggest counteraction between BnaSLY1 and BnaSLY2 in this accession.

The expression patterns of GA signaling genes in ZS11 differ from those of GA metabolism genes. For BnaGID1, BnaGID2/SLY, and BnaDELLA, one to two pairs of homeologous genes are abundantly expressed throughout developmental stages, while the other members are expressed in particular tissues (Fig. 11.4). This differs from the gene expression patterns of BnaGA20ox, BnaGA3ox, and BnaGA2ox, which are more tissue-specific (Fig. 11.4). The homeologous gene pairs BnaGID1a (BnaA05.GID1a and BnaC05.GID1a) and BnaGID1c (BnaA06.GID1c and BnaC07.GID1c) show similar expression patterns, suggesting little expression bias for BnaGID1 genes. On the other hand, the other homeologous genes, BnaA09.GID1b and BnaC08.GID1b, are weakly expressed throughout the plant life cycle, with highest expression in roots and flowers, respectively. BnaC05.RGL2 was reported to be upregulated during seed germination, although its ortholog was not found in ZS11 (Boter et al., 2019). Expression of BnaSLY1 (BnaA01.SLY1 and BnaC01.SLY1) is intense and ubiquitous, but members of BnaSLY2 are weakly expressed (Fig. 11.4). Overall expression patterns of BnaSLY1 are similar to those of BnaGID1. One difference in the developmental expression patterns between BnaSLY1 and BnaGID1 (BnaA05.GID1a, BnaC05.GID1a, BnaA06.GID1c, and BnaC07.GID1c) is at the late stage of seed development. Expression of BnaGID1 is upregulated, while BnaSLY1 is downregulated during the late stage of seed development (Fig. 11.4). This expression pattern may suggest that BnaGID1 and BnaSLY1 play distinct roles in controlling seed dormancy and germination.

Semi-dominant loci for the semi-dwarf ds-1 and ds-3 mutants of B. napus were mapped to BnaA06.RGA and BnaC07.RGA, respectively (Liu et al., 2010; Zhao et al., 2017). The BnaA06.rga/ds-1 mutant shows more severe dwarfism than BnaC07.rga/ds-3 (Zhao et al., 2017). In yeast two-hybrid assays, both ds-1 and ds-3 failed to interact with AtGID1a (Zhao et al., 2017). In ZS11, BnaA06.RGA/DS-1 is more abundantly expressed in the stem than is BnaC07.RGA/DS-3 (Fig. 11.4). The difference in expression abundance in the stem may affect the severity of the mutant dwarf phenotypes. A semi-dwarf dwf2/Brrga1-d mutant of B. rapa harbors a semi-dominant mutation in BraA06g040430, the ortholog of BnaA06.RGA/DS-1 (Muangprom et al., 2005). The similar mutant phenotypes of two orthologous genes in B. rapa and B. napus indicate that their function and expression patterns are conserved between these species. Moreover, resynthesized B. napus from a cross between Brrga1-d and B. oleracea shows the dwarf phenotype with a slight delay of flowering and an altered branching pattern (Muangprom et al., 2006). Furthermore, systematic mutagenesis of four BnaRGA genes by CRISPR-Cas9 identified a semi-dominant mutation in BnaA06.RGA that causes dwarfism, while quadruple mutants containing loss-of-function mutations in all four BnaRGA genes are taller than the parent (Yang et al., 2017). In addition to height control, a semi-dominant BnaA06.rga-D mutant shows enhanced drought tolerance (Wu et al., 2020a). It remains unknown why mutations in BnaA06.RGA and BnaC07.RGA, but not in BnaA09.RGA and BnaC09.RGA, were preferentially recovered by random or semi-random selection.

11.5 Expression of Homeologous Genes

Differential expression of homeologous genes is a factor determining genetic diversity in polyploid crops. B. napus is a relatively young crop; thus, expression partitioning (or expression dominance) between homeologous genes has progressed less than in older polyploid crops (Chalhoub et al., 2014). Even if not fully differentiated, B. napus A subgenome genes generally show higher expression than C subgenome genes, with tissue-specific expression dominance of one homeolog (Chalhoub et al., 2014; Wu et al., 2018). GA-related genes also follow these trends. In addition, tissue-specific expression dominance of one homeolog is occasionally found from either subgenome. For example, expression dominance of BnaC09.CPS in the stem, BnaC09.GA20ox2 in buds, BnaA09.GA2ox2 in flowers, BnaA09.GID1b and BnaA06.GID1c in roots, and BnaC05.GID1a in late embryos is characteristic of GA-related genes in ZS11. Differentiated expression of these genes might decrease the degree of redundancy and expand the genetic diversity among accessions.

Nascent canola provides insight into adaptation to the stress that polyploidy places on gene expression. The expression of GA-related genes in resynthesized B. napus was analyzed by comparison with that of the corresponding progenitors' orthologs using published transcriptome data (Zhang et al., 2015). One clear trend is that most genes show lower expression levels in resynthesized B. napus than do their orthologs in the progenitors (Fig. 11.5; 88% in leaves and 85% in silique wall). This is not unique to GA-related genes, because a similar result was reported in the genome-wide analysis of resynthesized B. napus genes (Wu et al., 2018). A similar phenomenon is also known for the expression of duplicated genes as the dosage-sharing hypothesis: “most young duplicates are down-regulated to match expression levels of single-copy genes” (Lan and Pritchard, 2016). In this regard, BnaA10.KAO1 and BnaC05.KAO1 are the only homeologous genes that show upregulation in resynthesized B. napus compared with the progenitors' orthologs (Supplemental Fig. 11.5; materials available via QR code on p.xii). It is possible that these genes have a unique mechanism for tolerating the increased gene dosage.
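The percentages quoted above are simple proportions: the share of genes whose expression in resynthesized B. napus falls below that of the progenitor ortholog. The following minimal sketch illustrates the calculation with invented values; the gene names `g1` to `g5`, the numbers, and the choice to skip genes without a progenitor ortholog are assumptions made for illustration and are not taken from Zhang et al. (2015).

```python
def percent_lower(resynthesized, progenitor):
    """Percentage of shared genes expressed lower in the resynthesized line.

    Both arguments map gene name -> expression level. Genes without a
    progenitor ortholog are skipped (an assumption of this sketch).
    """
    shared = [gene for gene in resynthesized if gene in progenitor]
    if not shared:
        raise ValueError("no genes in common")
    lower = sum(resynthesized[gene] < progenitor[gene] for gene in shared)
    return 100.0 * lower / len(shared)


# Hypothetical expression values for one tissue.
resyn = {"g1": 4.0, "g2": 1.5, "g3": 9.0, "g4": 0.5, "g5": 2.0, "g6": 3.0}
prog = {"g1": 8.0, "g2": 3.0, "g3": 6.0, "g4": 2.5, "g5": 5.0}
pct = percent_lower(resyn, prog)  # g6 has no ortholog and is skipped
```

Running the same function separately on leaf and silique wall values would yield one percentage per tissue, in the form reported for Fig. 11.5.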


11.6 General Discussion

Efficient selection of agronomically important genetic traits is crucial to make high-yield-producing crop cultivars. However, selecting agriculturally beneficial alleles from polyploid crops is more challenging than from diploid crops, due to the redundant nature of their genome structures. Therefore, it is crucial to understand the composition of gene families and the genetic stability of essential genes, such as plant hormone metabolism and signaling genes. This chapter summarizes gene families encoding GA metabolism and signaling genes in ten genome-sequenced B. napus cultivars. These could be a starting point for testing the usefulness of loci encoding GA-related genes by either molecular marker-based selection or genome editing to alter GA-related processes, such as plant height, stress tolerance, and nutrient use.

Based on publicly available data, we examined 10 families of GA metabolism and signaling genes from 10 B. napus accessions (1011 genes + 81 truncated genes), B. rapa (54 genes + 2 truncated genes), and B. oleracea (49 genes + 9 truncated genes). Overall, families for GA metabolism and signaling genes are well maintained among B. napus accessions regarding the number of family members. However, this maintenance is based on the equilibrium of active gene gains and losses. Phylogenetic analysis shows that the degree of conservation varies among gene family members. It is noteworthy that early GA biosynthesis genes have one or more conserved members, such as BnaA09.CPS, BnaC06.KS, BnaC07.KO, and BnaC04.KAO2. The other, more diversified members are characterized by gene losses/truncations in some accessions. On the other hand, the BnaGA20ox, BnaGA2ox, and BnaGID2/SLY families show unique diversification among B. napus accessions (Table 11.1). One characteristic feature of the BnaGA20ox and BnaGA2ox families is the number of truncated genes (Fig. 11.3). A similar trend was observed for GA2ox in B. oleracea, suggesting that active diversification of the BnaGA2ox family is largely derived from B. oleracea. In contrast, the diversified nature of BnaGA20ox and BnaSLY2 seems to be B. napus-specific. It is worth noting that GA20ox and GA2ox encode rate-limiting enzymes for GA biosynthesis and catabolism, respectively. The organization and regulation of these gene families directly impact endogenous GA levels. Gene families for rate-limiting enzymes of hormone metabolism tend to be larger than those for other enzymes in the same pathways. The differentiation of these families may contribute to establishing unique developmental regulation in individual plant species.

Our phylogenetic analysis also suggests that gene duplications and losses/truncations are often observed in both members of homeologous gene pairs. This suggests that homeologous interactions are associated with the gene family maintenance of GA-related genes in B. napus. Moreover, homeolog exchanges are more often observed in the BnaGA20ox and BnaGA2ox families than in others, especially where particular family members show frequent gene duplications, losses, and truncations (Fig. 11.3). Homeolog exchanges are thought to be important for establishing genome stability and evolution in newly formed polyploids (Mason and Wendel, 2020). Our analysis suggests that homeologous interactions are well associated with gene family maintenance for GA-related genes in B. napus. We found that the majority of homeolog exchanges are independent events in individual accessions, suggesting their involvement in


Fig. 11.5. Expression of orthologous genes encoding GA signaling components in resynthesized B. napus and its parents. Expression levels of GA signaling genes in leaf (L) and silique wall (S) in resynthesized B. napus and its parents were obtained from Zhang et al. (2015). The left two columns in the heatmap indicate expression of genes in resynthesized B. napus. A simplified phylogenetic tree is shown to indicate homeologous genes. The right two columns in the heatmap indicate the expression of orthologous genes in B. rapa and B. oleracea. Gray columns indicate that no orthologs were found in the progenitor genome.


genome diversity in polyploid genomes of B. napus. Expression dominance or divergence of one homeologous gene with respect to its counterpart can be observed in the tissue-specific expression in B. napus, which is consistent with other genome-wide analyses. Knowledge of gene family organization and of the differential expression of either homeologous genes or the entire gene family is required to evaluate and predict the phenotypes of GA-related mutants of B. napus. GA regulates important traits for B. napus, such as stem height, the timing of flowering, nitrogen use efficiency, stress tolerance, and pod shattering. Differential expression of B. napus homeologous genes and differential tissue-specific expression of gene family members are factors that expand genetic diversity and decrease gene redundancy. The expression divergence of homeologous genes in B. napus is still at an early stage compared with older polyploid crops. This suggests that genomics-associated breeding is suitable for this plant to exploit such genetic diversity effectively.

Acknowledgments

The authors thank Professor Peter Hedden (Rothamsted Research, UK) for critical reading and valuable comments on the manuscript, and Christine Nguyen, Xinyue Wang, Zhiwei Xue, and Ayako Nambara (CSB, University of Toronto) for confirming sequence data analysis. This work was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2019-04144 to E.N. and a Canada First Research Excellence Fund (CFREF) grant that established the Plant Phenotyping and Imaging Research Center (P2IRC) project.


12 Plant–Pathogen Interaction: New Era of Plant–Pathogen Interaction Studies: “Omics” Perspectives

Shu’an Zheng1 and Ryohei Terauchi1,2*
1Laboratory of Crop Evolution, Kyoto University, Kyoto, Japan; 2Iwate Biotechnology Research Center, Iwate, Japan

Abstract

Plants, including major crops, are constantly attacked by a plethora of microbial pathogens. Understanding molecular interactions between plants and pathogens is important for developing efficient measures to control disease. In this chapter, we review recent progress in the application of omics approaches to understand plant–pathogen interactions, particularly focusing on transcriptome, plant–pathogen protein interactome, and plant NLRome.

12.1 Introduction

A breakthrough in genetic studies of plant–pathogen interaction was made by Flor (1971), who proposed the “gene-for-gene” theory, in which plant resistance is manifested when a dominant resistance (R-) gene of a plant matches the dominant avirulence (Avr-) gene of a pathogen. Since then, studies of plant–pathogen interactions have focused mainly on the interactions between a single host molecule and a single pathogen molecule. With continuing discoveries about plant immunity, plant–pathogen interactions came to be understood as interactions between the networks of the respective organisms at the level of genes, proteins, and metabolites. When the genotype is translated to the phenotype in a given environment, complex biological systems and cellular networks work closely behind the scenes (Vidal et al., 2011).

The omics approach represents a shift of research attention from a few individual molecular components, such as a single gene, a single molecule, and individual gene–gene or protein–protein interactions, toward a more holistic analysis of entire cellular systems (Baker, 2013). The omics perspective and the systems approach enhance our knowledge of plant–microbe interactions. This chapter discusses the omics approach to plant–pathogen interactions by providing information on transcriptomics, the protein interactome, and the pan-NLRome.

*Corresponding author: terauchi.ryohei.3z@kyoto-u.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0012

12.2 Overview of Plant Defense against Pathogens

Plant disease is a major cause of crop losses. To fend off a wide variety of microbial pathogens, plants have developed a complex mechanism of pathogen recognition and immunity. Pathogens carry conserved molecular structures called pathogen-associated molecular patterns (PAMPs). The first layer of plant defense is triggered by recognition of PAMPs, resulting in the defense mechanism called PAMP-triggered immunity (PTI). Receptors such as receptor-like kinases (RLKs) and receptor-like proteins (RLPs) located on the cell surface detect PAMPs in the extracellular space. To counter this defense, pathogens secrete effector molecules that suppress PTI signaling and allow successful invasion of the plant. Plants have evolved a second layer of defense called effector-triggered immunity (ETI), in which cytosolic immune receptors of the nucleotide-binding leucine-rich repeat (NLR) class recognize pathogen effectors and mount strong immunity that is frequently associated with hypersensitive response (HR)-like cell death (Dangl and Jones, 2001). The interplay between pathogen and plant involving PAMPs, PTI, effectors, and ETI is conceptualized as the zigzag model (Jones and Dangl, 2006). In this chapter, we give an overview of the omics approach to understanding plant–pathogen interactions, with a special emphasis on effector–host protein interactions and ETI mediated by NLRs.

12.3 Transcriptome of Plant and Pathogen Interactions: Providing a Global Understanding of the Host–Pathogen Interplay

Due to its digital nature, the analysis of DNA and RNA sequences was the first category of biological information to be converted to the omics platform, resulting in the advent of genomics and transcriptomics. Transcriptomics allows the simultaneous analysis of gene expression in interacting plants and pathogens. Traditionally, the level of an individual transcript was measured by a hybridization-based method (e.g., northern blot analysis) using gene-specific DNA probes, or by PCR amplification of cDNA using gene-specific primers. These methods could only be applied to pre-selected target genes. Two decades ago, high-throughput methods were introduced, including cDNA microarray (Schena et al., 1995) and SuperSAGE (Matsumura et al., 2003), which enabled host and pathogen transcriptomes to be studied in parallel. More recently, RNA-seq became the most popular tool to address transcriptomes of plant–pathogen interactions.

cDNA microarrays are made by immobilizing a large number of cDNA probes on a chip surface. RNAs extracted from the sample are labeled with fluorescent dye and hybridized to the chip. The amount of RNA in the sample is correlated with the fluorescent signal intensity of the probe, which indicates the transcript level of each gene (Schena et al., 1995). This approach has been used successfully to understand, for example, effector-mediated manipulation of plant hormone homeostasis during the interaction between Arabidopsis and Pseudomonas syringae pv. tomato (de Torres-Zabala et al., 2007). However, the cDNA microarray has several drawbacks. First, it is costly: the need to design custom chips for hosts and pathogens can be prohibitive. Second, it can only address the expression of genes that are spotted on the chip. Lastly, it lacks accuracy, owing to the high background of hybridization signals (Schenk et al., 2012) and a narrower dynamic range of signals than sequence-based methods.

Apart from expressed sequence tag (EST) analysis (Adams et al., 1991), serial analysis of gene expression (SAGE) was the first method of large-scale transcriptomics based on DNA sequencing (Velculescu et al., 1995). Messenger RNA isolated from the cell is converted to cDNA, which is digested by an “anchoring enzyme”, NlaIII, and then ligated to linker fragments. By using the Type IIS restriction endonuclease BsmFI, a 15 bp “tag” fragment is released from a specific position of each cDNA. The tag fragments are concatenated and sequenced to obtain the transcriptome. The abundance of each tag sequence reflects the expression level of the corresponding gene. Thus, SAGE was considered a highly efficient method to study transcriptomes in general (Anisimov, 2008). However, the 15 bp tag sequence was sometimes too short to uniquely identify the corresponding gene. Therefore, Matsumura et al. (2003) modified SAGE to extract 26 bp tag fragments using the Type III restriction endonuclease EcoP15I as the tagging enzyme, resulting in SuperSAGE. This method was applied to studying interactions of host and pathogen organisms, including the rice–Magnaporthe oryzae interaction (Matsumura et al., 2003). The major drawback of SAGE and SuperSAGE is the complicated experimental procedure, which involves a series of restriction digestion and ligation steps.

With the development of next-generation sequencing (NGS), RNA-seq became the major approach to profile whole transcriptomes with high accuracy and a wide dynamic range. cDNA fragments converted from RNA are ligated to adaptors at one or both ends and subjected to high-throughput sequencing. The resulting reads can be aligned to a reference genome or reference transcripts, or used for de novo assembly of transcripts (Wang et al., 2009). With RNA-seq, gene expression of both plant and pathogen can be investigated during their interaction. Since 2008, RNA-seq has contributed significantly to understanding plant–pathogen interactions, including the following: Arabidopsis–Pseudomonas syringae pv. tomato (Howard et al., 2013); wheat–Zymoseptoria tritici (Rudd et al., 2015); sorghum–Bipolaris sorghicola (Yazawa et al., 2013); rice–Magnaporthe oryzae (Kawahara et al., 2012; Nishimura et al., 2016; Shimizu et al., 2019; Tian et al., 2020); and foxtail millet–Sclerospora graminicola (Kobayashi et al., 2017). Although RNA-seq is a powerful tool to study plant–microbe interactions, it still poses some challenges, including the bias caused by PCR amplification of cDNA (Hrdlickova et al., 2017). Moreover, although transcriptomics is remarkably useful for investigating gene expression in both plant and pathogen, it should be kept in mind that transcript levels do not necessarily correlate with protein levels (Gygi et al., 1999). Thus, transcriptome results need to be interpreted with caution.
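The tag-counting principle behind SAGE and SuperSAGE described in this section can be illustrated with a minimal Python sketch. It is an in silico simplification rather than the published protocol: the function names are hypothetical, each cDNA is assumed to be given as a plain 5'-to-3' string, and the tag is assumed to be the bases immediately downstream of the 3'-most NlaIII site (CATG).

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # recognition sequence of the anchoring enzyme NlaIII
TAG_LENGTH = 15       # classic SAGE tag length as given in the text


def extract_tag(cdna, tag_length=TAG_LENGTH):
    """Return the tag downstream of the 3'-most anchoring site, or None.

    Assumption for this sketch: the tag is the `tag_length` bases that
    directly follow the last CATG in the cDNA string.
    """
    pos = cdna.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no anchoring site, so no tag can be produced
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + tag_length]
    # Discard truncated tags that run off the end of the cDNA.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas, tag_length=TAG_LENGTH):
    """Count tag occurrences; the counts stand in for expression levels."""
    tags = (extract_tag(seq, tag_length) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)
```

Calling `count_tags(cdnas, tag_length=26)` mimics the longer SuperSAGE tags, which are less likely to match more than one gene. In a mixed host–pathogen sample, each tag would then be assigned to the plant or the pathogen by matching it against the two reference sequence sets.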

12.4 Proteomics and Plant–Pathogen Interactome: Network Analysis

Proteomics is the study of the structure and function of proteins on a large scale, and it also contributes to functional genomics (Yoithapprabhunath et al., 2015). Proteomics research goals can be divided into three groups. First, it addresses large-scale identification of proteins in a given cell, tissue, or organ under a certain condition. Second, it is used to compare the quantity of proteins between samples placed in different biological conditions. Finally, it studies protein–protein interactions (Pandey and Mann, 2000).

Protein–protein interaction (PPI) is crucial for studying plant–pathogen interaction. PPIs have conventionally been studied by yeast two-hybrid (Y2H) assay, co-immunoprecipitation (Co-IP) assay, and pull-down assay. These methods were initially employed to address interactions of two or a few proteins, but can be scaled up to the omics platform. In the plant–pathogen interaction field, large-scale PPI studies between pathogen effectors and host proteins have been conducted, mainly by Y2H assay and by in planta Co-IP followed by mass spectrometry of the precipitated proteins (e.g., Petre et al., 2015).

Mukhtar and colleagues reported the first large-scale plant–pathogen interaction network (interspecies interactome), covering effectors of the pathogens Pseudomonas syringae (a bacterium) and Hyaloperonospora arabidopsidis (an oomycete) and Arabidopsis proteins (Mukhtar et al., 2011). They identified Y2H interactions involving 83 pathogen effectors, 170 Arabidopsis immune proteins involved in PTI, ETI, and defense signaling, and 673 Arabidopsis proteins that interact with the immune proteins. This analysis revealed a total of 3148 potential interactions among the 926 proteins. Pathogen effectors tend to target highly interconnected host proteins (hubs), and the interactions between effectors and host immune receptors are indirect, which was interpreted as supporting the “guardee” model of plant–microbe interaction (Dangl and Jones, 2001). Later, an interactome study of the ascomycete fungus Golovinomyces orontii and Arabidopsis was carried out (Weßling et al., 2014) by testing Y2H interactions of 69 G. orontii effector candidate proteins against 12,000 Arabidopsis proteins. The resulting interactome, together with the report of Mukhtar et al. (2011), indicated that effectors from different kingdoms converge on common host target proteins. Weßling et al. (2014) also hypothesized that the plant could defend against pathogens more efficiently by monitoring these conserved target proteins as “guardees” than by monitoring the rapidly evolving effector molecules of the pathogens. Recently, González-Fuente et al. (2020) extended the Arabidopsis interactome by addressing its interactions with two bacterial pathogens, Ralstonia pseudosolanacearum and Xanthomonas campestris, and compiled 2035 previous publications on protein interactions, covering 48,200 Arabidopsis–Arabidopsis and 1300 Arabidopsis–effector protein interactions, into a database named EffectorK. Similar to previous reports, the study showed that effectors tend to target Arabidopsis hub proteins.

Even though the recent development of the plant–microbe interactome is impressive, we have to keep in mind that Y2H is prone to false positives and false negatives, and the results should be interpreted with caution. More importantly, current PPI results reflect static interactions in heterologous Y2H systems, whereas real PPIs are dynamic, with interacting proteins changing quickly in space and time. A technical breakthrough is needed to address the dynamic nature of PPI.
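The observation that effectors tend to target hub proteins can be illustrated with a small Python sketch that compares the number of interaction partners (the degree) of effector targets and non-targets. This is only a toy illustration under stated assumptions: the function names and input format are hypothetical, interactions are treated as undirected, and the cited studies relied on statistical tests rather than a simple comparison of means.

```python
from collections import defaultdict


def build_degree(host_edges):
    """Count distinct interaction partners of each host protein.

    `host_edges` is an iterable of (protein_a, protein_b) pairs taken from
    a host-host interactome; self-interactions are ignored.
    """
    partners = defaultdict(set)
    for a, b in host_edges:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}


def mean_degree(proteins, degree):
    """Mean degree over `proteins`; proteins absent from the network count as 0."""
    proteins = list(proteins)
    if not proteins:
        return 0.0
    return sum(degree.get(p, 0) for p in proteins) / len(proteins)


def compare_targets(host_edges, effector_edges):
    """Return (mean degree of effector targets, mean degree of non-targets).

    `effector_edges` is an iterable of (effector, host_protein) pairs.
    """
    degree = build_degree(host_edges)
    targets = {host for _, host in effector_edges}
    non_targets = set(degree) - targets
    return mean_degree(targets, degree), mean_degree(non_targets, degree)
```

A clearly higher mean degree for targets than for non-targets would be consistent with effectors converging on hubs. With real Y2H data, the difference would need to be assessed against a suitable null model, for example by repeatedly drawing random sets of host proteins of the same size as the target set.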

12.5 NLRome Provides a Comprehensive Way to Study NLRs

Plant NLRs are the central immune receptors in charge of recognizing invading pathogen effectors. NLR genes have been massively amplified in the plant kingdom. The number of NLR genes varies across plants; Arabidopsis, maize, sorghum, Brachypodium, and rice encode 149, 129, 245, 239, and 508 NLR genes in their genomes, respectively. As expected from its hexaploid nature, wheat has more than 1000 NLR genes (Meyers et al., 2003; Li et al., 2010; Baggs et al., 2017). Generally speaking, woody plants have a larger number of NLRs than non-woody plants (Li et al., 2010; Baggs et al., 2017). This expansion might be due to their longer lifespans, leading to more exposure to different pathogens (Baggs et al., 2017). NLR genes are known to be the most variable among plant genes (Clark et al., 2007).

NLRs have a conserved structure consisting of canonical nucleotide-binding (NB) and leucine-rich repeat (LRR) domains. Depending on the domain at their N-termini, they can be further divided into TIR-NLRs (TNLs), with a Toll/interleukin-1 receptor-like (TIR) domain, and CC-NLRs (CNLs), with a coiled-coil (CC) domain. TNLs are less numerous than CNLs in most angiosperms. CNLs are present in all monocot and dicot plants, whereas the TNL subclass is present only in dicots (Bezerra-Neto et al., 2020). Recently, RPW8-NB-ARC-LRR (RNL), which shares similarity with the RESISTANCE TO POWDERY MILDEW 8 (RPW8)-only proteins of Arabidopsis, was proposed as a novel class of NLR proteins that contains three families based on the homology of their RPW8 domain (Adachi et al., 2019a). The three subclasses of NLR have been found to require distinct downstream factors and to be involved in different signaling pathways (Bezerra-Neto et al., 2020).

New insights into NLR activation mechanisms have been obtained from recent structural studies (Horsefield et al., 2019; Wan et al., 2019; Wang et al., 2019). ZAR1 is a conserved CNL in Arabidopsis. A cryo-EM structural study revealed that activated ZAR1, bound to host proteins and the pathogen AVR effector, forms a pentameric complex in which the α-helices at the very N-terminus form a funnel-shaped structure (Wang et al., 2019). This funnel-like structure may be responsible for perturbing plasma membrane integrity and triggering cell death. Horsefield et al. (2019) and Wan et al. (2019) revealed that the self-associated TIR domain of plant TNLs gains NAD+ cleavage activity and that the resulting NAD+ depletion may trigger HR-like cell death, which is similar to the mechanism of Wallerian degeneration associated with neuropathy that is caused by the SARM1 (sterile alpha and TIR motif containing 1) protein of mammals. Martin et al. (2020) revealed that the TNL ROQ1, when bound by the XopQ effector, is activated and forms a tetramer, and that NAD+ cleavage by its TIR domain results in cytosolic Ca2+ influx, which in turn signals to the downstream immunity pathway.
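The subclass assignment by N-terminal domain described above can be written as a small lookup. The following Python sketch is purely illustrative: the domain labels and the function name are hypothetical, and the input is assumed to be an ordered list of domain annotations (N- to C-terminus) as might be produced by a domain-prediction tool.

```python
# Mapping from N-terminal domain to NLR subclass, as described in the text.
N_TERMINAL_CLASSES = {"TIR": "TNL", "CC": "CNL", "RPW8": "RNL"}


def classify_nlr(domains):
    """Classify a protein from its ordered domain labels.

    Returns None if the protein lacks the NB-ARC domain (not treated as an
    NLR in this sketch), the subclass name if the first domain is TIR, CC,
    or RPW8, and "unclassified" otherwise.
    """
    if "NB-ARC" not in domains:
        return None
    return N_TERMINAL_CLASSES.get(domains[0], "unclassified")
```

For example, `classify_nlr(["TIR", "NB-ARC", "LRR"])` returns `"TNL"`. Real NLR annotation is considerably more involved, because truncated genes and non-canonical domain architectures are common.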

12.6 NLR and Avr Interaction Could Be Divided into Three Patterns

The molecular interaction mechanisms between plant NLRs and pathogen avirulence (Avr) effectors can be categorized into three major patterns (Kourelis and van der Hoorn, 2018). In the first pattern, the LRR domain of the NLR directly binds the Avr effector and triggers the immune response. One of the best-studied cases is RPP1 of Arabidopsis, which recognizes an oomycete pathogen by direct binding to its effector ARABIDOPSIS THALIANA RECOGNIZED1 (ATR1) (Krasileva et al., 2010). In the second pattern, NLRs recognize Avr effectors through integrated domains (IDs), non-canonical domains with a wide variety of motifs showing similarity to host proteins (Kroj et al., 2016; Sarris et al., 2016). For example, in rice–Magnaporthe oryzae interactions, the pathogen effectors AVR-Pia and AVR1-CO39 directly bind the HMA ID of the RGA5 NLR (Cesari et al., 2013) and the effector AVR-Pik directly binds the HMA ID of the Pik-1 NLR (Maqbool et al., 2015), triggering HR-like cell death. Similarly, the Pseudomonas syringae effector AvrRps4 interacts with the WRKY ID of RRS1 in Arabidopsis (Sarris et al., 2015). An NLR with an ID is called a sensor NLR and usually requires another, genetically linked helper NLR to trigger the HR signal. The third pattern is represented by the guardee model, in which the NLR guards a host protein, and modification of that protein by the pathogen Avr effector activates the NLR to trigger defense. Examples include the Arabidopsis NLRs RPM1 and RPS2, which guard RPM1-interacting protein 4 (RIN4). These NLRs indirectly recognize four Avr effectors of bacterial pathogens by detecting the modification of RIN4 by the effectors (Afzal et al., 2013).

12.7 NLRs Function in Singleton, Pair, or Network

NLRs can function as singletons, in pairs, or in networks. One well-characterized example of a singleton NLR is ZAR1, which alone can perceive the effector and trigger HR-like cell death (Wang et al., 2019). Some NLRs function as pairs, including RRS1/RPS4 in Arabidopsis (Williams et al., 2014; Sarris et al., 2015), and Pik-1/Pik-2 (Ashikawa et al., 2008), RGA4/RGA5 (Okuyama et al., 2011), and Pii-1/Pii-2 (Fujisaki et al., 2015) in rice. Paired NLRs consist of a sensor NLR with integrated domains (IDs) and a helper NLR. Sensor NLRs play an essential role in effector recognition, and helper NLRs function as executors of defense signaling (Adachi et al., 2019b). Paired sensor and helper NLRs are usually tightly linked genetically. In NLR networks, by contrast, a sensor NLR requires a helper NLR that is not genetically linked to it. An example is found in Nicotiana benthamiana, where the helper NLRs NRC2, NRC3, and NRC4 are differentially required by sensor NLRs to trigger the immune response (Wu et al., 2017). Redundancy of helper NLRs may allow sensor NLRs to evolve to counteract rapidly evolving pathogens.

12.8 Pan-NLRome Reveals Diversity of NLRs

Considering the importance of NLRs in crop breeding and in the understanding of plant–pathogen interactions, the importance of exploring NLRs cannot be overemphasized. More remains unknown than known about NLRs in terms of their structure, function, interaction with other NLRs as well as with other host and pathogen proteins, and gene arrangement. A pan-NLRome, that is, a pan-genome study that focuses exclusively on NLR genes, can be used to explore the variation and conservation of NLRs within a given taxon. In Arabidopsis, the NLRs of 38 out of 64 accessions must be enumerated to reach 95% of pan-NLRome saturation, pointing to a wide diversity of NLRs among accessions (Van de Weyer et al., 2019) and suggesting that there are many unrecognized NLRs. The complement of NLRs in both model and non-model plants, and the systematic analysis of relationships between NLR phylogeny, the mode of recognition, and the amount of allelic diversity, should be further addressed (Barragan and Weigel, 2020). We expect that pan-NLRomes will contribute to crop breeding and to our understanding of plant–pathogen co-evolution.
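The idea of pan-NLRome saturation can be illustrated with a small sketch. It is not the procedure used by Van de Weyer et al. (2019); it simply assumes that each accession is represented by a set of NLR cluster identifiers and counts how many accessions are needed before the accumulated set reaches a given fraction of the full collection, averaged over random orderings. The function names and the toy data are hypothetical.

```python
import random

def accessions_to_saturation(nlr_sets, fraction=0.95):
    """Number of accessions (in the given order) needed before the cumulative
    set of NLR clusters reaches `fraction` of all clusters in the panel."""
    pan = set().union(*nlr_sets)          # every cluster seen in any accession
    target = fraction * len(pan)
    seen = set()
    for n, nlrs in enumerate(nlr_sets, start=1):
        seen |= nlrs
        if len(seen) >= target:
            return n
    return len(nlr_sets)

def mean_accessions_to_saturation(nlr_sets, fraction=0.95, n_orders=100, seed=1):
    """Average the result over random accession orders (seeded for repeatability)."""
    rng = random.Random(seed)
    order = list(nlr_sets)
    total = 0
    for _ in range(n_orders):
        rng.shuffle(order)
        total += accessions_to_saturation(order, fraction)
    return total / n_orders

# Hypothetical NLR cluster content of four accessions
toy_panel = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a"}]
print(accessions_to_saturation(toy_panel))
print(mean_accessions_to_saturation(toy_panel))
```

A panel that saturates only after most accessions have been added, as in the toy data, indicates that individual accessions carry many NLR clusters not shared with the others.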

12.9 Concluding Remarks

Understanding plant–pathogen interactions is important for developing efficient measures to control diseases of the crops that feed the world population. To support researchers worldwide in their efforts to fight crop diseases, all relevant data should be openly available to the public. In this regard, we are glad to see that omics data on plant–pathogen interactions are increasingly being shared in open databases, including Open Wheat Blast (http://openwheatblast.net, accessed September 2022) and Open Rice Blast (http://openriceblast.org, accessed September 2022). Zenodo (http://zenodo.org, accessed September 2022), a general-purpose open-access repository developed by OpenAIRE and maintained by CERN, would be a suitable platform for sharing omics data on plant–pathogen interactions.

Acknowledgments

This work was supported by JSPS KAKENHI grants 15H05779 and 20H05681 to R.T.

References Adachi, H., Derevnina, L. and Kamoun, S. (2019a) NLR singletons, pairs, and networks: evolution, assembly, and regulation of the intracellular immunoreceptor circuitry of plants. Current Opinion in Plant Biology 50, 121–131. DOI: 10.1016/j.pbi.2019.04.007. Adachi, Hiroaki, Contreras, M.P., Harant, A., Wu, C.-H., Derevnina, L. et al. (2019b) An N-terminal motif in NLR immune receptors is functionally conserved across distantly related plant species. ELife 8, e49956. DOI: 10.7554/eLife.49956. Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H. et al. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252(5013), 1651–1656. DOI: 10.1126/science.2047873. Afzal, A.J., Kim, J.H. and Mackey, D. (2013) The role of NOI-domain containing proteins in plant immune signaling. BMC Genomics 14, 327. DOI: 10.1186/1471-2164-14-327. Anisimov, S.V. (2008) Serial Analysis of Gene Expression (SAGE): 13 years of application in research. Current Pharmaceutical Biotechnology 9(5), 338–350. DOI: 10.2174/138920108785915148. Ashikawa, I., Hayashi, N., Yamane, H., Kanamori, H., Wu, J. et al. (2008) Two adjacent nucleotide-binding site- leucine- rich repeat class genes are required to confer Pikm- specific rice blast resistance. Genetics 180(4), 2267–2276. DOI: 10.1534/genetics.108.095034. Baggs, E., Dagdas, G. and Krasileva, K.V. (2017) NLR diversity, helpers and integrated domains: making sense of the NLR IDentity. Current Opinion in Plant Biology 38, 59–67. DOI: 10.1016/j. pbi.2017.04.012. Baker, M. (2013) Big biology: The ’omes puzzle. Nature 494(7438), 416–419. DOI: 10.1038/494416a. Barragan, A.C. and Weigel, D. (2020) Plant NLR diversity: the known unknowns of pan-NLRomes. The Plant Cell 33(4), 814–831. DOI: 10.1093/plcell/koaa002. Bezerra-Neto, J.P., Araújo, F.C., Ferreira-Neto, J.R., Silva, R.L., Borges, A.N. et al. 
(2020) NBS-LRR genes—plant health sentinels: structure, roles, evolution and biotechnological applications. In: Poltronieri, P. and Hong, Y. (eds) Applied Plant Biotechnology for Improving Resistance to Biotic Stress. Academic Press, San Diego, California, pp. 99–115. Cesari, S., Thilliez, G., Ribot, C., Chalvon, V., Michel, C. et al. (2013) The rice resistance protein pair RGA4/ RGA5 recognizes the Magnaporthe oryzae effectors AVR-Pia and AVR1-CO39 by direct binding. The Plant Cell 25(4), 1463–1481. DOI: 10.1105/tpc.112.107201. Clark, R.M., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G. et al. (2007) Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 317(5836), 338–342. DOI: 10.1126/science.1138632. Dangl, J.L. and Jones, J.D. (2001) Plant pathogens and integrated defence responses to infection. Nature 411(6839), 826–833. DOI: 10.1038/35081161. de Torres-Zabala, M., Truman, W., Bennett, M.H., Lafforgue, G., Mansfield, J.W. et al. (2007) Pseudomonas syringae pv. tomato hijacks the Arabidopsis abscisic acid signalling pathway to cause disease. The EMBO Journal 26(5), 1434–1443. DOI: 10.1038/sj.emboj.7601575.


Flor, H.H. (1971) Current status of the gene-for-gene concept. Annual Review of Phytopathology 9(1), 275–296. DOI: 10.1146/annurev.py.09.090171.001423. Fujisaki, K., Abe, Y., Ito, A., Saitoh, H., Yoshida, K. et al. (2015) Rice Exo70 interacts with a fungal effector, AVR-Pii, and is required for AVR-Pii-triggered immunity. The Plant Journal 83(5), 875–887. DOI: 10.1111/tpj.12934. González-Fuente, M., Carrère, S., Monachello, D., Marsella, B.G., Cazalé, A.-C. et al. (2020) EffectorK, a comprehensive resource to mine for Ralstonia, Xanthomonas, and other published effector interactors in the Arabidopsis proteome. Molecular Plant Pathology 21(10), 1257–1270. DOI: 10.1111/ mpp.12965. Gygi, S.P., Rochon, Y., Franza, B.R. and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology 19(3), 1720–1730. DOI: 10.1128/MCB.19.3.1720. Horsefield, S., Burdett, H., Zhang, X., Manik, M.K., Shi, Y. et al. (2019) NAD+ cleavage activity by animal and plant TIR domains in cell death pathways. Science 365(6455), 793–799. DOI: 10.1126/science. aax1911. Howard, B.E., Hu, Q., Babaoglu, A.C., Chandra, M., Borghi, M. et al. (2013) High- throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS ONE 8(10), e74183. DOI: 10.1371/journal.pone.0074183. Hrdlickova, R., Toloue, M. and Tian, B. (2017) RNA-Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews. RNA 8(1), e1364. DOI: 10.1002/wrna.1364. Jones, J.D.G. and Dangl, J.L. (2006) The plant immune system. Nature 444(7117), 323–329. DOI: 10.1038/ nature05286. Kawahara, Y., Oono, Y., Kanamori, H., Matsumoto, T., Itoh, T. et al. (2012) Simultaneous RNA-seq analysis of a mixed transcriptome of rice and blast fungus interaction. PloS ONE 7(11), e49423. DOI: 10.1371/journal.pone.0049423. Kobayashi, M., Hiraka, Y., Abe, A., Yaegashi, H., Natsume, S. et al. 
(2017) Genome analysis of the foxtail millet pathogen Sclerospora graminicola reveals the complex effector repertoire of graminicolous downy mildews. BMC Genomics 18(1), 897. DOI: 10.1186/s12864-017-4296-z. Kourelis, J. and van der Hoorn, R.A.L. (2018) Defended to the nines: 25 years of resistance gene cloning identifies nine mechanisms for R protein function. The Plant Cell 30(2), 285–299. DOI: 10.1105/ tpc.17.00579. Krasileva, K.V., Dahlbeck, D. and Staskawicz, B.J. (2010) Activation of an Arabidopsis resistance protein is specified by the in planta association of its leucine-rich repeat domain with the cognate oomycete effector. The Plant Cell 22(7), 2444–2458. DOI: 10.1105/tpc.110.075358. Kroj, T., Chanclud, E., Michel-Romiti, C., Grand, X. and Morel, J.B. (2016) Integration of decoy domains derived from protein targets of pathogen effectors into plant immune receptors is widespread. The New Phytologist 210(2), 618–626. DOI: 10.1111/nph.13869. Li, J., Ding, J., Zhang, W., Zhang, Y., Tang, P. et al. (2010) Unique evolutionary pattern of numbers of gramineous NBS-LRR genes. Molecular Genetics and Genomics 283(5), 427–438. DOI: 10.1007/ s00438-010-0527-6. Maqbool, A., Saitoh, H., Franceschetti, M., Stevenson, C.E.M., Uemura, A. et al. (2015) Structural basis of pathogen recognition by an integrated HMA domain in a plant NLR immune receptor. ELife 4, e08709. DOI: 10.7554/eLife.08709. Martin, R., Qi, T., Zhang, H., Liu, F., King, M. et al. (2020) Structure of the activated ROQ1 resistosome directly recognizing the pathogen effector XopQ. Science 370(6521), eabd9993. DOI: 10.1126/ science.abd9993. Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S. et al. (2003) Gene expression analysis of plant host- pathogen interactions by SuperSAGE. Proceedings of the National Academy of Sciences 100(26), 15718–15723. DOI: 10.1073/pnas.2536670100. Meyers, B.C., Kozik, A., Griego, A., Kuang, H. and Michelmore, R.W. 
(2003) Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. The Plant Cell 15(4), 809–834. DOI: 10.1105/tpc.009308. Mukhtar, M.S., Carvunis, A.-R., Dreze, M., Epple, P., Steinbrenner, J. et al. (2011) Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 333(6042), 596–601. DOI: 10.1126/science.1203659. Nishimura, T., Mochizuki, S., Ishii-Minami, N., Fujisawa, Y., Kawahara, Y. et al. (2016) Magnaporthe oryzae glycine-rich secretion protein, Rbf1 critically participates in pathogenicity through the focal formation of the biotrophic interfacial complex. PLoS Pathogens 12(10), e1005921. DOI: 10.1371/journal.ppat. 1005921.


Okuyama, Y., Kanzaki, H., Abe, A., Yoshida, K., Tamiru, M. et al. (2011) A multifaceted genomics approach allows the isolation of the rice Pia-blast resistance gene consisting of two adjacent NBS-LRR protein genes. The Plant Journal 66(3), 467–479. DOI: 10.1111/j.1365-313X.2011.04502.x. Pandey, A. and Mann, M. (2000) Proteomics to study genes and genomes. Nature 405(6788), 837–846. DOI: 10.1038/35015709. Petre, B., Saunders, D.G.O., Sklenar, J., Lorrain, C., Win, J. et al. (2015) Candidate effector proteins of the rust pathogen Melampsora larici-populina target diverse plant cell compartments. Molecular Plant-Microbe Interactions 28(6), 689–700. DOI: 10.1094/MPMI-01-15-0003-R. Rudd, J.J., Kanyuka, K., Hassani-Pak, K., Derbyshire, M., Andongabo, A. et al. (2015) Transcriptome and metabolite profiling of the infection cycle of Zymoseptoria tritici on wheat reveals a biphasic interaction with plant immunity involving differential pathogen chromosomal contributions and a variation on the hemibiotrophic lifestyle definition. Plant Physiology 167(3), 1158–1185. DOI: 10.1104/pp.114.255927. Sarris, P.F., Cevik, V., Dagdas, G., Jones, J.D.G. and Krasileva, K.V. (2016) Comparative analysis of plant immune receptor architectures uncovers host proteins likely targeted by pathogens. BMC Biology 14, 8. DOI: 10.1186/s12915-016-0228-7. Sarris, P.F., Duxbury, Z., Huh, S.U., Ma, Y., Segonzac, C. et al. (2015) A plant immune receptor detects pathogen effectors that target WRKY transcription factors. Cell 161(5), 1089–1100. DOI: 10.1016/j. cell.2015.04.024. Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235), 467–470. DOI: 10.1126/ science.270.5235.467. Schenk, P.M., Carvalhais, L.C. and Kazan, K. (2012) Unraveling plant- microbe interactions: can multi- species transcriptomics help? Trends in Biotechnology 30(3), 177–184. DOI: 10.1016/j. tibtech.2011.11.002. 
Shimizu, M., Nakano, Y., Hirabuchi, A., Yoshino, K., Kobayashi, M. et al. (2019) RNA-seq of in planta- expressed Magnaporthe oryzae genes identifies MoSVP as a highly expressed gene required for pathogenicity at the initial stage of infection. Molecular Plant Pathology 20(12), 1682–1695. DOI: 10.1111/mpp.12869. Tian, D., Chen, Z., Lin, Y., Chen, Z., Bui, K.T. et al. (2020) Weighted gene co-expression network coupled with a critical- time- point analysis during pathogenesis for predicting the molecular mechanism underlying blast resistance in rice. Rice (New York, N.Y.) 13(1), 81. DOI: 10.1186/ s12284-020-00439-8. Van de Weyer, A.-L., Monteiro, F., Furzer, O.J., Nishimura, M.T., Cevik, V. et al. (2019) A species-wide inventory of NLR genes and alleles in Arabidopsis thaliana. Cell 178(5), 1260–1272. DOI: 10.1016/j. cell.2019.07.038. Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. (1995) Serial analysis of gene expression. Science 270(5235), 484–487. DOI: 10.1126/science.270.5235.484. Vidal, M., Cusick, M.E. and Barabási, A.L. (2011) Interactome networks and human disease. Cell 144(6), 986–998. DOI: 10.1016/j.cell.2011.02.016. Wan, L., Essuman, K., Anderson, R.G., Sasaki, Y., Monteiro, F. et al. (2019) TIR domains of plant immune receptors are NAD+-cleaving enzymes that promote cell death. Science 365(6455), 799–803. DOI: 10.1126/science.aax1771. Wang, J., Hu, M., Wang, J., Qi, J., Han, Z. et al. (2019) Reconstitution and structure of a plant NLR resistosome conferring immunity. Science 364(6435), eaav5870. DOI: 10.1126/science.aav5870. Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics 10(1), 57–63. DOI: 10.1038/nrg2484. Weßling, R., Epple, P., Altmann, S., He, Y., Yang, L. et al. (2014) Convergent targeting of a common host protein-network by pathogen effectors from three kingdoms of life. Cell Host & Microbe 16(3), 364–375. DOI: 10.1016/j.chom.2014.08.004. 
Williams, S.J., Sohn, K.H., Wan, L., Bernoux, M., Sarris, P.F. et al. (2014) Structural basis for assembly and function of a heterodimeric plant immune receptor. Science 344(6181), 299–303. DOI: 10.1126/ science.1247357. Wu, C.-H., Abd-El-Haliem, A., Bozkurt, T.O., Belhaj, K., Terauchi, R. et al. (2017) NLR network mediates immunity to diverse plant pathogens. Proceedings of the National Academy of Sciences 114(30), 8113–8118. DOI: 10.1073/pnas.1702041114.


Yazawa, T., Kawahigashi, H., Matsumoto, T. and Mizuno, H. (2013) Simultaneous transcriptome analysis of Sorghum and Bipolaris sorghicola by using RNA-seq in combination with de novo transcriptome assembly. PloS ONE 8(4), e62460. DOI: 10.1371/journal.pone.0062460. Yoithapprabhunath, T.R., Nirmal, R.M., Santhadevy, A., Anusushanth, A., Charanya, D. et al. (2015) Role of proteomics in physiologic and pathologic conditions of dentistry: Overview. Journal of Pharmacy & Bioallied Sciences 7(Suppl. 2), S344–S349. DOI: 10.4103/0975-7406.163448.

13

Plant GWAS

Matthew Shenton* NARO Institute of Crop Science, Tsukuba, Japan

Abstract

Genome-wide association studies (GWAS) associate genomic polymorphisms with phenotypes. As the cost of genotyping using next-generation sequencing (NGS) has fallen, GWAS has become a popular technique that takes advantage of population diversity to rapidly identify genomic loci associated with phenotypic traits. This chapter provides a brief introduction to the principles and practicalities of GWAS, focusing primarily on its use in diploid crop species and on GWAS employing mixed linear models (MLM), which accommodate the effects of population structure and genetic relatedness within the study population. The core processes that make up a GWAS are reviewed, and some recommendations concerning software and best practices for achieving meaningful results are discussed.

13.1 Introduction

The identification of genes and loci that influence different phenotypes is fundamental in biology. The availability of DNA sequence-based molecular markers, in conjunction with statistical methods, has allowed the identification of specific genomic regions that influence the phenotypes of individuals, lines, and accessions. Quantitative trait locus (QTL) linkage mapping assesses the association between phenotypes and genotypes in a population generated from a cross between two parental lines (Paterson, 2002). Genome-wide association studies (GWAS), in contrast, use other types of populations, often encompassing broad genetic diversity. GWAS avoids the time burden of generating a mapping population in advance and has two major advantages:

1. GWAS can analyse multiple allelic variations simultaneously in a single study, whereas traditional bi-parental population-based QTL mapping can analyse only two.
2. GWAS has a higher mapping resolution than bi-parental population-based QTL mapping, because diverse populations have experienced large numbers of recombinations, compared with one in bi-parental F2 populations and several in recombinant inbred line (RIL) or MAGIC populations (Huang et al., 2015).

Thus, GWAS is now a popular way to rapidly identify loci associated with particular traits, especially in crop plants. Since the early 2000s, improved sequencing technologies (Chapter 1 of this book) have made GWAS a widely used technique for examining genotype–phenotype relationships in plants (Tibbs Cortes et al., 2021). Many excellent reviews are already available on GWAS methods and results (e.g., Xiao et al., 2017; Liu and Yan, 2019; Wang et al., 2019), and this chapter aims to provide a brief introduction to the most popular and accessible techniques, methods, and software as an introductory guide for researchers who are interested in exploiting this method. Although there are some GWAS tools for polyploid plants (e.g., Rosyara et al., 2016), the focus in this chapter is on diploid analysis, because polyploid analysis involves many technical difficulties and is still a developing research area (Bourke et al., 2018).

*matt.shenton@affrc.go.jp © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0013

13.2 Core Processes in GWAS

13.2.1 Associating genotypic variations with phenotypic variations

Historically, QTLs have been identified by mapping in biparental populations (Paterson, 2002). In QTL mapping, the probability of recombination between a marker and a causal locus increases with distance. This relationship can be used to estimate the position of a QTL on a genetic map using genome-wide molecular markers (typically hundreds of markers). However, moving from a position on the genetic map to an accurate physical position in the genome sequence is difficult, because a very large population is required to obtain a sufficient number of recombinations near a causal locus. In contrast, a typical GWAS takes advantage of many historical recombination events in a diverse population (often tens to hundreds), whereas F2 QTL mapping relies on only a single round of recombination. Therefore, the effective number of recombinations in the whole population near a particular marker is much higher in GWAS (Mitchell-Olds and Schmitt, 2006). In addition, very large numbers of single nucleotide polymorphisms (SNPs) are often available as markers; therefore, a high resolution is potentially available. The resolution that can actually be achieved, however, depends on the species and population: because SNPs are not inherited individually, there is a non-random association between nearby SNPs, which are thus in linkage disequilibrium (LD) (Gupta et al., 2005). The average size of regions containing SNPs that are in strong LD can extend to hundreds of kilobases in self-pollinating crops, such as rice and soybean.
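The non-random association between nearby SNPs can be quantified in several ways. The following is a minimal sketch, assuming that genotypes are already coded as allele counts (0, 1, 2) and using the squared correlation between the counts of two SNPs as a simple stand-in for the LD statistic r². The function name and the toy genotypes are illustrative only.

```python
from statistics import mean

def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    mean_a, mean_b = mean(snp_a), mean(snp_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)

# Hypothetical allele counts for three SNPs across eight inbred accessions
snp1 = [0, 0, 2, 2, 0, 2, 0, 2]
snp2 = [0, 0, 2, 2, 0, 2, 0, 2]  # always inherited together with snp1
snp3 = [0, 2, 0, 2, 0, 2, 2, 0]  # independent of snp1

print(genotype_r2(snp1, snp2))
print(genotype_r2(snp1, snp3))
```

Plotting such values against the physical distance between SNP pairs gives an impression of how quickly LD decays, and therefore of the mapping resolution that a given panel can offer.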

In GWAS, each bi-allelic SNP takes one of three possible values corresponding to the minor allele count; that is, 0, 1, or 2 for the genotypes AA, Aa, or aa, where a is the minor allele. The association between marker genotype and phenotypic values is calculated for all markers. For a marker closely linked to the causal polymorphism of a highly heritable trait, the mean trait value is strongly associated with the genotype pattern. A p-value is calculated to test the null hypothesis that there is no association between trait value and genotype pattern. If the null hypothesis is rejected, we can infer that the SNP is in LD with the causal polymorphism (see section 13.5 for more discussion).
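The coding and the per-marker test described above can be sketched as follows. This is a deliberately simple illustration, not the mixed-model analysis discussed later in the chapter: it fits an ordinary linear regression of the phenotype on the minor allele count and, as an assumption made to keep the example dependency-free, approximates the t distribution by a normal distribution when computing the p-value (reasonable only for large samples). The function names and the toy data are hypothetical, and the sketch assumes a polymorphic marker and a phenotype that is not perfectly predicted by it.

```python
import math
from statistics import mean

def minor_allele_counts(genotypes, minor="a"):
    """Code genotypes such as 'AA', 'Aa', 'aa' as 0, 1, or 2 copies of the minor allele."""
    return [g.count(minor) for g in genotypes]

def single_marker_test(counts, phenotypes):
    """Regress phenotype on allele count; return (slope, t statistic, approximate p-value)."""
    n = len(counts)
    mx, my = mean(counts), mean(phenotypes)
    sxx = sum((x - mx) ** 2 for x in counts)
    sxy = sum((x - mx) * (y - my) for x, y in zip(counts, phenotypes))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(counts, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)      # standard error of the slope
    t = slope / se
    p = math.erfc(abs(t) / math.sqrt(2))     # two-sided, normal approximation
    return slope, t, p

# Hypothetical genotypes and trait values for six accessions
genotypes = ["AA", "Aa", "aa", "AA", "Aa", "aa"]
trait = [1.0, 2.0, 3.0, 1.2, 2.1, 2.9]

counts = minor_allele_counts(genotypes)
slope, t_stat, p_value = single_marker_test(counts, trait)
print(counts, slope, p_value)
```

In a real analysis this test would be repeated for every marker, and the model would include terms for population structure and relatedness (section 13.2.4).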

13.2.2 Preparing GWAS populations

In general, the populations used for GWAS are based either on collections of genetic resources that aim to represent the entire range of genetic diversity in a species (Li et al., 2014; Alonso-Blanco et al., 2016; Chang et al., 2016; Tanaka et al., 2020a, b; Kajiya-Kanegae et al., 2021) or on populations adapted to specific areas or environments (e.g., Travis et al., 2015; Higgins et al., 2021). GWAS panels constructed from natural populations, crop varieties, and landraces are often structured. If the population is strongly structured, the association of traits with markers can be overestimated; for example, if many accessions in one subgroup have extreme trait values, many markers specific to that subgroup may display a correlation with the phenotype, effectively inflating their association with the trait. Similarly, even within a subgroup, more closely related varieties or accessions will be more genetically similar and share marker genotypes; therefore, their relatedness must also be taken into account when assessing the strength of the association between markers and the phenotype. This issue is discussed in section 13.2.4.

13.2.3 Checking phenotype data

A successful GWAS depends on high-quality phenotype data. In-depth knowledge of the structure and distribution of the data is also highly advantageous before performing GWAS. The mathematics of the mixed linear models (see section 13.2.4) used in GWAS presupposes that the variation in the phenotype data approximates a normal distribution, although some methods are also available for binary traits (Shenstone et al., 2018). If the data depart from the expected distribution, some of the sensitivity of the association analysis may be lost. In some cases, a transformation of the data that yields an approximately normally distributed phenotype (e.g., a Box-Cox power transformation, or a rank-based inverse normal transformation, which adjusts trait values by rank) can be advantageous.
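As an illustration of the kind of check and transformation mentioned above, the sketch below applies a Box-Cox power transformation with a fixed exponent and reports the sample skewness before and after. In practice the exponent is usually estimated from the data (for example by maximum likelihood); here it is simply supplied by hand, and the trait values are hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transformation; all values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]   # limiting case of the power family
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    """Sample skewness (third standardized moment); close to 0 for symmetric data."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed trait values
trait = [1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0]

for lam in (1.0, 0.5, 0.0):
    print(lam, round(skewness(box_cox(trait, lam)), 3))
```

An exponent of 1 leaves the shape of the distribution unchanged, whereas smaller exponents pull in the long right tail; a histogram or QQ plot of the transformed values remains the most direct way to judge the result.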

13.2.4 Mixed linear model

A de facto consensus has emerged that, in plant GWAS using diploid species and populations with natural variation, the most effective way to perform association analysis while correcting for population structure and relatedness between accessions is to use a mixed linear model (MLM) (Yu et al., 2006). The MLM procedure used in GWAS adjusts for relatedness of the accessions at two levels. First, a genetic relatedness matrix (GRM) (or kinship matrix) is estimated from genome-wide SNPs and accounts for relatedness between individual accessions. Second, structured groups in the population are also modeled, often using principal component analysis (PCA) to capture the major population stratification. Both the GRM and the principal components are used as explanatory variables in the MLM. Many GWAS papers that use an MLM in relation to a wide variety of species have been published recently, demonstrating the widespread popularity and acceptance of this method (e.g., Chen et al., 2021; Li et al., 2021; Yadav et al., 2021; Zhou et al., 2021).
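To make the notion of a genetic relatedness matrix concrete, the following sketch computes one commonly used estimator from allele counts: genotypes are centred by twice the allele frequency at each SNP, and the cross-products are scaled by 2 Σ p(1 − p). This is only one of several estimators, and the packages listed in Table 13.1 may use different scalings; the function name and the toy genotypes are hypothetical.

```python
def genetic_relatedness_matrix(genotypes):
    """genotypes[i][j] is the allele count (0/1/2) of accession i at SNP j.

    Returns an n x n list of lists: G = Z Z^T / (2 * sum_j p_j * (1 - p_j)),
    where Z holds the counts centred by 2 * p_j.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    centred = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale for k in range(n)]
        for i in range(n)
    ]

# Hypothetical panel: accessions 0 and 1 are identical, 2 and 3 form another group
panel = [
    [0, 0, 2, 2, 0],
    [0, 0, 2, 2, 0],
    [2, 2, 0, 0, 2],
    [2, 2, 0, 0, 0],
]

grm = genetic_relatedness_matrix(panel)
for row in grm:
    print([round(value, 2) for value in row])
```

Accessions from the same group receive positive values and accessions from different groups negative ones, which is the information the MLM uses to avoid counting closely related accessions as independent observations.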

13.2.5 Analyzing statistical significance

The p-value calculated for each SNP or marker in GWAS is the probability of observing an association at least as strong as the one observed if the null hypothesis (no association between the marker and phenotypic variation) were true. However, in recent GWAS experiments, where whole-genome resequencing data are available, the number of markers can be in the millions. With a commonly used p-value threshold of 0.05, tens or hundreds of thousands of false positives are therefore expected simply because of the enormous number of tests. This severe multiple testing problem has to be addressed using some method of correction so that the researcher can decide which peaks are truly meaningful associations and which occur by chance.

The Bonferroni correction is one of the most frequently used methods for establishing statistical significance thresholds. It simply divides the significance level by the number of tests (the number of markers, in the case of GWAS) and thereby controls the family-wise error rate (FWER), the probability of incorrectly rejecting at least one null hypothesis. However, because the SNPs used in a GWAS are not independent owing to LD, the Bonferroni correction is often excessively conservative.

The false discovery rate (FDR) is the expected proportion of false positives among all tests in which the null hypothesis is rejected (Benjamini and Hochberg, 1995). The Benjamini–Hochberg (BH) method uses q-values to estimate the FDR as follows:

1. The p-values from the tests are sorted in ascending order.
2. q-values are calculated using the following equation:

$$ q_i = \frac{m \times p_i}{i} $$

where $q_i$ is the q-value for the i-th p-value, $m$ is the number of tests, and $p_i$ is the i-th p-value. The largest p-value whose q-value is below the desired FDR level is chosen as the threshold. It should be noted that there may be no markers that meet this criterion (i.e., there are no significant associations).

Storey and Tibshirani (2003) found that BH-based q-values deviate from the actual FDR when the tests include an appreciable number of true positives. In such cases, tests with low p-values are more frequent, whereas the distribution of p-values is uniform when only a small number of true positives are included. Storey’s method addresses this issue by estimating the proportion of true positives among the tests and taking it into account when calculating q-values (Storey and Tibshirani, 2003).
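The two corrections described above can be sketched in a few lines. The example follows the q-value equation given in the text, with m the number of tests and i the rank of each p-value in ascending order; the function names and the p-values are hypothetical, and the adjustment proposed by Storey and Tibshirani (2003) is not included.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def bh_q_values(p_values):
    """Return (p, q) pairs in ascending order of p, with q_i = m * p_i / i."""
    m = len(p_values)
    ordered = sorted(p_values)
    return [(p, m * p / i) for i, p in enumerate(ordered, start=1)]

def bh_threshold(p_values, fdr=0.05):
    """Largest p-value whose q-value does not exceed the FDR level, or None."""
    passing = [p for p, q in bh_q_values(p_values) if q <= fdr]
    return max(passing) if passing else None

# Hypothetical p-values from five marker tests
p_values = [0.041, 0.001, 0.2, 0.008, 0.039]

print(bonferroni_threshold(0.05, len(p_values)))
print(bh_q_values(p_values))
print(bh_threshold(p_values, fdr=0.05))
```

With these toy values, two markers pass the BH threshold whereas only one would pass the Bonferroni threshold, illustrating why the Bonferroni correction is described as the more conservative of the two. When no q-value falls below the FDR level, the function returns None, corresponding to the case of no significant associations.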


Another method for establishing a statistical significance threshold is the permutation test. In a permutation test, the link between phenotypes and genotypes is broken by randomly shuffling the phenotypes among the accessions, and GWAS is performed on the shuffled data. This is repeated many times, and the smallest p-value from each permutation is recorded. For a 5% significance threshold, the 5th percentile of these p-values is chosen as the p-value threshold for the original data. Permutation testing is one of the most reliable methods for determining the significance threshold. However, according to Churchill and Doerge (1994), 1000 permutations are necessary for a 5% significance threshold, which is computationally intensive and impractical if millions of SNPs are used.

Any p-value threshold should be regarded as a guide. Regardless of the chosen threshold, independent measurements of phenotype or trait data from different experiments or environments are necessary for confidence. The generation of similar peaks from such independent data can be regarded as a reliable indicator of a meaningful genotype–phenotype association.
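The permutation procedure can be sketched as follows. To keep the example self-contained, the association statistic is the squared correlation between allele counts and phenotype rather than a p-value from a mixed model; for a simple regression with a fixed sample size the p-value decreases as this statistic increases, so the 95th percentile of the per-permutation maximum plays the role of the 5th percentile of the per-permutation minimum p-value. The function names, the seed, and the toy data are assumptions made for illustration.

```python
import random
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between allele counts and phenotype."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=1):
    """Empirical genome-wide threshold on the association statistic.

    markers: list of per-SNP allele-count vectors.
    Returns the (1 - alpha) quantile of the strongest association observed
    in each of n_perm phenotype shuffles.
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    strongest = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        strongest.append(max(r_squared(snp, shuffled) for snp in markers))
    strongest.sort()
    index = min(len(strongest) - 1, int((1 - alpha) * len(strongest)))
    return strongest[index]

# Hypothetical allele counts for three SNPs and a trait in ten accessions
markers = [
    [0, 0, 1, 1, 2, 2, 0, 2, 1, 0],
    [2, 0, 1, 0, 2, 1, 0, 1, 2, 0],
    [0, 1, 0, 2, 1, 0, 2, 2, 0, 1],
]
trait = [1.1, 0.9, 2.0, 2.2, 3.1, 2.9, 1.0, 3.0, 2.1, 0.8]

threshold = permutation_threshold(markers, trait, n_perm=200)
observed = [r_squared(snp, trait) for snp in markers]
print(threshold, observed)
```

Markers whose observed statistic exceeds the threshold would be declared significant at the chosen level. The cost of the procedure is the number of permutations multiplied by the cost of one genome scan, which is why it becomes impractical with millions of SNPs.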

13.2.6 GWAS software

Table 13.1 shows examples of GWAS software that use an MLM. Among these software packages, TASSEL is the only tool that has a graphical user interface. As for other statistical methods (e.g., genomic selection; see Chapter 14 of this book), many GWAS tools are implemented as R packages (https://www.r-project.org/, accessed September 2022).

13.3 Graphical Representation of GWAS Results

13.3.1 Manhattan plot

The most common graphical representation of GWAS results is the Manhattan plot (Fig. 13.1). The Manhattan plot shows the –log10(p) values on the y-axis (so that the highest-confidence associations appear highest on the plot) and the chromosomal position on the x-axis. This allows rapid assessment of the strength and position of marker associations with traits and instant visual comparison with known QTL positions.
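The coordinates of a Manhattan plot are straightforward to derive from GWAS output. The sketch below assumes that the results are available as (chromosome, position, p-value) records and that chromosome lengths are known; it places the chromosomes end to end along the x-axis and converts each p-value to –log10(p). The record layout, names, and values are hypothetical, and drawing the points is left to whichever plotting library is preferred.

```python
import math

def manhattan_coordinates(results, chrom_lengths):
    """Convert GWAS results to plotting coordinates.

    results: list of (chromosome, position_bp, p_value) tuples.
    chrom_lengths: dict mapping chromosome -> length in bp, in plotting order.
    Returns a list of (x, y) with x the cumulative genome position and
    y = -log10(p).
    """
    offsets, total = {}, 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = total   # where this chromosome starts on the x-axis
        total += length
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]

# Hypothetical results for two small chromosomes
chrom_lengths = {"chr1": 1000, "chr2": 2000}
results = [("chr1", 100, 0.01), ("chr2", 50, 1e-8), ("chr2", 1500, 0.5)]

for x, y in manhattan_coordinates(results, chrom_lengths):
    print(x, round(y, 2))
```

A significance threshold, such as one obtained by the methods in section 13.2.5, is drawn as a horizontal line at the corresponding –log10(p) value.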

13.3.2 Quantile–quantile (QQ) plot

QQ plots are commonly used to assess the success of MLMs in controlling for population structure and are often presented as a complement to the Manhattan plot for a GWAS experiment. The QQ plot is a plot of the observed versus expected –log10(p) values in the GWAS. Under the null hypothesis, the expected distribution of p-values is uniform; therefore, the scatter plot of the observed versus expected values forms a straight line. When there are significant associations, the plot has a tail at higher –log10(p) values, where the observed values exceed the expected ones. If –log10(p) values are inflated because of population structure, the plot already deviates from the straight line at lower –log10(p) values (Fig. 13.2).
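The points of a QQ plot can be computed as in the sketch below. The expected p-value for the i-th smallest of m observed p-values is taken to be i / (m + 1), which is one common convention for the quantiles of a uniform distribution; other conventions, such as (i − 0.5) / m, give very similar plots. The function name and the example values are hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot."""
    m = len(p_values)
    observed = sorted(p_values)  # smallest p-value first
    return [
        (-math.log10(i / (m + 1)), -math.log10(p))
        for i, p in enumerate(observed, start=1)
    ]

# Hypothetical p-values that follow the null expectation exactly
null_like = [i / 10 for i in range(1, 10)]
# The same values with one strongly associated marker
with_signal = [1e-6] + null_like[1:]

for expected, observed in qq_points(with_signal)[:3]:
    print(round(expected, 2), round(observed, 2))
```

For the null-like values every point lies on the diagonal, whereas the single small p-value in the second set produces the tail at high –log10(p) described above.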

Table 13.1. Popular software used in GWAS.

Software package | Features | Reference
plink | Many features, developed for human-scale analyses, widely used file format, command line | Chang et al., 2015
GCTA | Command line, heritability estimate from genotypes | Yang et al., 2011
gemma | Command line, efficient MLM analysis | Zhou and Stephens, 2012
Tassel | GUI and command-line interface | Bradbury et al., 2007
GAPIT | R-package, MLM GWAS, plotting, PCA analysis | Lipka et al., 2012
GENESIS | R-package, MLM GWAS, integrates with other Bioconductor packages | Gogarten et al., 2019
RAINBOWR | R-package, haplotype/window-based association, epistasis analysis | Hamazaki and Iwata, 2020


Fig. 13.1. An example of a Manhattan plot. GWAS was performed using the NARO world rice collection (WRC) and NARO Rice Core collection of Japanese landraces for the seed colour trait (Tanaka et al., 2020a). This trait is controlled by one strong QTL on chromosome 7, at the Rc locus. The dashed line represents the p-value threshold for a false discovery rate of 0.05 using the Benjamini–Hochberg procedure.

13.4 Case Studies

13.4.1 Arabidopsis

Fig. 13.2. An example of a QQ plot for rice GWAS. Quantile–quantile plot of expected versus observed p-values for the GWAS shown in Fig. 13.1.

In recent decades, Arabidopsis has become the dominant plant model in molecular genetics and genomics, and as such its GWAS resources are arguably the best developed. The principal resource for Arabidopsis association studies is the 1001 Genomes Project (https://1001genomes.org/, accessed September 2022) (Alonso-Blanco et al., 2016). The sequencing of 1135 Arabidopsis genomes sampled from locations around the world has provided a highly effective population for GWAS experiments that is available to researchers worldwide. Furthermore, a curated resource of Arabidopsis GWAS experiments and associations has been developed, the AraGWAS Catalog (https://aragwas.1001genomes.org/#/, accessed September 2022) (Togninalli et al., 2018). Each GWAS has been recomputed using a standard pipeline and up-to-date genomic information so that the results are comparable across studies. The p-value thresholds have been calculated by permutation of phenotypes.

13.4.2 Rice

As a model crop plant, rice has many well-developed resources for GWAS. The National Agriculture and Food Research Organization of Japan (NARO) has developed core collections consisting of rice accessions from around the world (Tanaka et al., 2020a) as well as Japanese landraces (Tanaka et al., 2020b). Readers interested in collaborative research that makes use of the resources in the NARO collection are encouraged to contact us. The 3K Genomes Project for rice provides the most comprehensive resource for genome information. Seeds and information on the rice 3K genomes (Li et al., 2014) are available from the International Rice Research Institute (IRRI) (https://snp-seek.irri.org/, accessed September 2022). The MBKbase (http://www.mbkbase.org/rice, accessed September 2022) provides similar tools for rice, soybeans, and wheat (Peng et al., 2020).

13.5 Problems with GWAS

13.5.1 Functional validation of GWAS results

In general, if the GWAS is successful, a number of markers in a genomic region comprising several genes will pass the calculated p-value threshold (Fig. 13.3). There are several standard methods that researchers tend to use to determine which candidate gene regions are important for the phenotype. The most convincing studies use mutant, RNAi, or gene-edited lines, near-isogenic lines (NILs) (Weigel, 2012), or similar materials to conclusively show the importance of a particular region. The candidate region can be narrowed down by considering gene annotations, gene expression data, and local information about linkage distribution, among other methods. It is important to remember that a marker-trait association is a correlation, and famously, correlation does not equal causation.

Fig. 13.3. Detailed view of the significant region in a Manhattan plot. Close-up of an interactive Manhattan plot showing the region passing the p-value threshold given in Fig. 13.1. The interactive Manhattan plot links to the NARO Tasuke system showing mutations and their consequences for the WRC and JRC collections (Kumagai et al., 2013). The significant region comprises approximately 250 kb (Chromosome 7, 6009271–6253806).

13.5.2 Spurious association, rare alleles

One phenomenon observed in GWAS is that when examining known causal loci, the highest –log10(p) values may be observed outside of the expected region (Dickson et al., 2010; Yano et al., 2016). This could be due to allelic heterogeneity in genes whose variation has a high phenotypic impact, which could be under strong selective pressure in different environments, resulting in multiple alleles at the same locus. It is difficult to achieve statistical significance when a gene includes allelic heterogeneity because an allele-specific polymorphism is compared with all other alleles, including null, intermediate, and fully functional alleles. If there is a bi-allelic polymorphism that is associated with the distinction between null and other functional alleles, that polymorphism will show higher statistical significance than the true causal polymorphisms (Yano et al., 2016).

Plant GWAS are often limited by population size and phenotyping cost. In contrast with human studies, phenotyping a large number of plants is difficult for most researchers, and the smallest usable populations are generally preferred. This brings some limitations to GWAS; for example, when selecting markers, SNPs with a minor allele frequency lower than 5% are often excluded from the analysis in order to avoid spurious associations that would otherwise arise by chance. Thus, if a rare allele is the cause of a phenotype, its causal SNP will be excluded from the analysis. Conversely, particularly in crop studies, rare alleles with extreme trait values are of great interest. To address this paradox, some studies have tried to pre-select the population with a phenotype in mind so that rare alleles associated with it are enriched and consequently are not excluded (Yang et al., 2015).

Alternatively, it would be advantageous to detect the effect of multiple rare variants occurring in close proximity, such as within a gene region. Several GWAS methods that consider groups of SNPs have already been proposed. For instance, Yano et al. (2016) used gene haplotypes to successfully identify causal loci in rice. In addition, Hamazaki and Iwata (2020) proposed a sophisticated method for considering groups of SNPs together, even when SNPs at the same locus may have opposing phenotypic effects. Although the large numbers of genetic markers made available by whole-genome sequencing are advantageous in some ways, they make genome-wide association analysis computationally intensive. Considering that SNPs are not inherited individually, it makes sense to consider haplotypes in GWAS using these methods. The software presented by Hamazaki and Iwata (2020), RAINBOW, also allows the calculation of epistatic effects (interactions between unlinked loci), which will further improve our understanding of complex traits in plants.
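The minor allele frequency filter discussed in this section can be sketched in a few lines. This is a minimal illustration rather than the procedure of any particular GWAS package: the function names, the 0/1/2 allele-count coding, and the toy marker names are assumptions.

```python
def minor_allele_frequency(genotypes):
    """MAF of one bi-allelic marker.

    genotypes: alternative-allele counts per line (0, 1 or 2);
    None marks a missing call and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(markers, threshold=0.05):
    """Keep only markers whose MAF is at least `threshold`."""
    return {name: g for name, g in markers.items()
            if minor_allele_frequency(g) >= threshold}


# Hypothetical panel of 40 lines.
markers = {
    "snp_common": [0, 1, 2, 1, 0, 2, 1, 0, 1, 2] * 4,  # MAF = 0.5
    "snp_rare": [2] + [0] * 39,                        # MAF = 0.025
    "snp_monomorphic": [0] * 40,                       # MAF = 0
}
kept = filter_by_maf(markers)
```

In this toy panel only `snp_common` survives. If the allele carried by the single line at `snp_rare` were causal, it would be removed before any association test is run, which is exactly the limitation described above.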

13.6 Conclusion and Prospects

In recent years, GWAS has become a popular method for rapidly identifying genotype–phenotype associations and candidate genes or polymorphisms causing trait variations in plants. The increasing availability of sequence data has extended the range of crop and model species in which GWAS has been performed. Although GWAS employing MLMs to accommodate population structure and relatedness has become a standard method, improvements in population design and new implementation of association methods (Hamazaki and Iwata, 2020; Hamazaki et al., 2020) will continue to improve the effectiveness of marker genotype–phenotype association using GWAS.

References

Alonso-Blanco, C., Andrade, J., Becker, C., Bemm, F. and Bergelson, J. (2016) 1,135 Genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166(2), 481–491. DOI: 10.1016/j.cell.2016.05.063.

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 57(1), 289–300. DOI: 10.1111/j.2517-6161.1995.tb02031.x.
Bourke, P.M., Voorrips, R.E., Visser, R.G.F. and Maliepaard, C. (2018) Tools for genetic studies in experimental populations of polyploids. Frontiers in Plant Science 9, 513. DOI: 10.3389/fpls.2018.00513.
Bradbury, P.J., Zhang, Z., Kroon, D.E., Casstevens, T.M., Ramdoss, Y. et al. (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics (Oxford, England) 23(19), 2633–2635. DOI: 10.1093/bioinformatics/btm308.
Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M. et al. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7. DOI: 10.1186/s13742-015-0047-8.
Chang, H.-X., Lipka, A.E., Domier, L.L. and Hartman, G.L. (2016) Characterization of disease resistance loci in the USDA soybean germplasm collection using genome-wide association studies. Phytopathology 106(10), 1139–1151. DOI: 10.1094/PHYTO-01-16-0042-FI.
Chen, Z.-Q., Zan, Y., Milesi, P., Zhou, L., Chen, J. et al. (2021) Leveraging breeding programs and genomic data in Norway spruce (Picea abies L. Karst) for GWAS analysis. Genome Biology 22(1), 179. DOI: 10.1186/s13059-021-02392-1.
Churchill, G.A. and Doerge, R.W. (1994) Empirical threshold values for quantitative trait mapping. Genetics 138(3), 963–971. DOI: 10.1093/genetics/138.3.963.
Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H. and Goldstein, D.B. (2010) Rare variants create synthetic genome-wide associations. PLoS Biology 8(1), e1000294. DOI: 10.1371/journal.pbio.1000294.
Gogarten, S.M., Sofer, T., Chen, H., Yu, C., Brody, J.A. et al. (2019) Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics (Oxford, England) 35(24), 5346–5348. DOI: 10.1093/bioinformatics/btz567.
Gupta, P.K., Rustgi, S. and Kulwal, P.L. (2005) Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant Molecular Biology 57(4), 461–485. DOI: 10.1007/s11103-005-0257-z.
Hamazaki, K. and Iwata, H. (2020) RAINBOW: haplotype-based genome-wide association study using a novel SNP-set method. PLoS Computational Biology 16(2), e1007663. DOI: 10.1371/journal.pcbi.1007663.
Hamazaki, K., Kajiya-Kanegae, H., Yamasaki, M., Ebana, K., Yabe, S. et al. (2020) Choosing the optimal population for a genome-wide association study: a simulation of whole-genome sequences from rice. The Plant Genome 13(1), e20005. DOI: 10.1002/tpg2.20005.
Higgins, J., Santos, B., Khanh, T.D., Trung, K.H., Duong, T.D. et al. (2021) Resequencing of 672 native rice accessions to explore genetic diversity and trait associations in Vietnam. Rice (New York, N.Y.) 14(1), 52. DOI: 10.1186/s12284-021-00481-0.
Huang, B.E., Verbyla, K.L., Verbyla, A.P., Raghavan, C., Singh, V.K. et al. (2015) MAGIC populations in crops: current status and future prospects. TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik 128(6), 999–1017. DOI: 10.1007/s00122-015-2506-0.
Kajiya-Kanegae, H., Nagasaki, H., Kaga, A., Hirano, K., Ogiso-Tanaka, E. et al. (2021) Whole-genome sequence diversity and association analysis of 198 soybean accessions in mini-core collections. DNA Research 28(1), dsaa032. DOI: 10.1093/dnares/dsaa032.
Kumagai, M., Kim, J., Itoh, R. and Itoh, T. (2013) TASUKE: a web-based visualization program for large-scale resequencing data. Bioinformatics (Oxford, England) 29(14), 1806–1808. DOI: 10.1093/bioinformatics/btt295.
Li, H., Hu, X., Lovell, J.T., Grabowski, P.P., Mamidi, S. et al. (2021) Genetic dissection of natural variation in oilseed traits of camelina by whole-genome resequencing and QTL mapping. The Plant Genome 14(2), e20110. DOI: 10.1002/tpg2.20110.
Li, J.-Y., Wang, J. and Zeigler, R.S. (2014) The 3,000 rice genomes project: new opportunities and challenges for future rice research. GigaScience 3, 8. DOI: 10.1186/2047-217X-3-8.
Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M. et al. (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics 28(18), 2397–2399. DOI: 10.1093/bioinformatics/bts444.
Liu, H.-J. and Yan, J. (2019) Crop genome-wide association study: a harvest of biological relevance. The Plant Journal 97(1), 8–18. DOI: 10.1111/tpj.14139.
Mitchell-Olds, T. and Schmitt, J. (2006) Genetic mechanisms and evolutionary significance of natural variation in Arabidopsis. Nature 441(7096), 947–952. DOI: 10.1038/nature04878.

Paterson, A.H. (2002) What has QTL mapping taught us about plant domestication? The New Phytologist 154(3), 591–608. DOI: 10.1046/j.1469-8137.2002.00420.x.
Peng, H., Wang, K., Chen, Z., Cao, Y., Gao, Q. et al. (2020) MBKbase for rice: an integrated omics knowledge base for molecular breeding in rice. Nucleic Acids Research 48(D1), D1085–D1092. DOI: 10.1093/nar/gkz921.
Rosyara, U.R., De Jong, W.S., Douches, D.S. and Endelman, J.B. (2016) Software for genome-wide association studies in autopolyploids and its application to potato. The Plant Genome 9(2). DOI: 10.3835/plantgenome2015.08.0073.
Shenstone, E., Cooper, J., Rice, B., Bohn, M., Jamann, T.M. et al. (2018) An assessment of the performance of the logistic mixed model for analyzing binary traits in maize and sorghum diversity panels. PLoS ONE 13, e0207752. DOI: 10.1371/journal.pone.0207752.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100, 9440–9445. DOI: 10.1073/pnas.1530509100.
Tanaka, N., Shenton, M., Kawahara, Y., Kumagai, M., Sakai, H. et al. (2020a) Whole-genome sequencing of the NARO world rice core collection (WRC) as the basis for diversity and association studies. Plant and Cell Physiology 61, 922–932. DOI: 10.1093/pcp/pcaa019.
Tanaka, N., Shenton, M., Kawahara, Y., Kumagai, M., Sakai, H. et al. (2020b) Investigation of the genetic diversity of a rice core collection of Japanese landraces using whole-genome sequencing. Plant and Cell Physiology 61, 2087–2096. DOI: 10.1093/pcp/pcaa125.
Tibbs Cortes, L., Zhang, Z. and Yu, J. (2021) Status and prospects of genome-wide association studies in plants. The Plant Genome 14, e20077. DOI: 10.1002/tpg2.20077.
Togninalli, M., Seren, Ü., Meng, D., Fitz, J., Nordborg, M. et al. (2018) The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Research 46(D1), D1150–D1156. DOI: 10.1093/nar/gkx954.
Travis, A.J., Norton, G.J., Datta, S., Sarma, R., Dasgupta, T. et al. (2015) Assessing the genetic diversity of rice originating from Bangladesh, Assam and West Bengal. Rice (New York, N.Y.) 8(1), 35. DOI: 10.1186/s12284-015-0068-z.
Wang, Q., Tang, J., Han, B. and Huang, X. (2019) Advances in genome-wide association studies of complex traits in rice. TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik 133(5), 1415–1425. DOI: 10.1007/s00122-019-03473-3.
Weigel, D. (2012) Natural variation in Arabidopsis: from molecular genetics to ecological genomics. Plant Physiology 158(1), 2–22. DOI: 10.1104/pp.111.189845.
Xiao, Y., Liu, H., Wu, L., Warburton, M. and Yan, J. (2017) Genome-wide association studies in maize: praise and stargaze. Molecular Plant 10(3), 359–374. DOI: 10.1016/j.molp.2016.12.008.
Yadav, C.B., Tokas, J., Yadav, D., Winters, A., Singh, R.B. et al. (2021) Identifying anti-oxidant biosynthesis genes in pearl millet [Pennisetum glaucum (L.) R. br.] using genome-wide association analysis. Frontiers in Plant Science 12, 599649. DOI: 10.3389/fpls.2021.599649.
Yang, J., Lee, S.H., Goddard, M.E. and Visscher, P.M. (2011) GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88(1), 76–82. DOI: 10.1016/j.ajhg.2010.11.011.
Yang, J., Jiang, H., Yeh, C.-T., Yu, J., Jeddeloh, J.A. et al. (2015) Extreme-phenotype genome-wide association study (XP-GWAS): a method for identifying trait-associated variants by sequencing pools of individuals selected from a diversity panel. The Plant Journal 84(3), 587–596. DOI: 10.1111/tpj.13029.
Yano, K., Yamamoto, E., Aya, K., Takeuchi, H., Lo, P.-C. et al. (2016) Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice. Nature Genetics 48(8), 927–934. DOI: 10.1038/ng.3596.
Yu, J., Pressoir, G., Briggs, W.H., Vroh Bi, I., Yamasaki, M. et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2), 203–208. DOI: 10.1038/ng1702.
Zhou, X. and Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44(7), 821–824. DOI: 10.1038/ng.2310.
Zhou, X., Guo, J., Pandey, M.K., Varshney, R.K., Huang, L. et al. (2021) Dissection of the genetic basis of yield-related traits in the Chinese peanut mini-core collection through genome-wide association studies. Frontiers in Plant Science 12, 637284. DOI: 10.3389/fpls.2021.637284.

14 Plant Genomic Selection: a Concept That Uses Genomics Data in Plant Breeding

Eiji Yamamoto*
Graduate School of Agriculture, Meiji University, Kawasaki, Japan

Abstract

Genomic selection (GS) is a concept that uses genomics data in animal and plant breeding. The core of GS is to perform breeding selection based on phenotype prediction models that use genome-wide genomic variants as the explanatory variables. Since its proposal in 2001, GS has been used in many practical breeding projects. However, GS is still not popular among plant scientists who are not familiar with genetics and breeding, because of its complicated theoretical background. In this chapter, a brief explanation of the basic GS process is provided, followed by its application in diverse breeding programs. In addition, advanced research projects employing a combination of GS and other omics approaches are introduced. The aim of this chapter is to stimulate the interest of plant scientists who are not currently involved in plant breeding and GS, and to help breeding scientists identify the methods and theories necessary to understand GS.

*yame@meiji.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0014

14.1 Introduction

The central dogma implies that all biological events originate from genetic information encoded in the genome. Therefore, genomic information has the potential to explain biological consequences (i.e., phenotypes) fully or partially, which carries two important messages for biologists. First, genomics is the basis for all areas of biology. Second, each phenotype can be designed by genomic modification. The latter relates strongly to breeding, i.e., the process of genetic improvement to develop cultivars or varieties that have agronomically and/or economically beneficial characteristics. Plant breeding is expected to play an essential role in nourishing a growing world population sustainably (Hickey et al., 2019).

There are two major approaches to plant breeding in the context of genomics: the use of molecular biology techniques represented by genome editing (Chapter 15 of this book) and the use of theories in genetics, the topic of this chapter. From the beginning of the 1980s, DNA marker-assisted selection (MAS) came into use in plant breeding programs (Bernardo, 2008; Xu and Crouch, 2008). In MAS, the genotypes of markers whose association with phenotypic variation has been verified are used as indicators of the presence or absence of the phenotype (Fig. 14.1A). Thus, a breeder can confirm trait introgression without time-consuming and costly phenotypic observation. To develop DNA markers for plant breeding, genetic mapping analyses such as genome-wide association studies (GWAS) (Chapter 13 of this book) were conducted to identify the association between a marker genotype and a phenotypic variation (Bernardo, 2008; Xu and Crouch, 2008). Each association between a marker genotype and a phenotypic variation is statistically detectable only when a DNA marker and a causal variant are in tight linkage and the causal variant has a large effect on the phenotypic variation (Yu et al., 2008). Therefore, MAS was especially efficient for traits controlled by a single gene or a small number of genes, often called monogenic and oligogenic traits, respectively. However, it has long been recognized that most agronomically and economically important traits, such as yield and flavor, are controlled by numerous small-effect genes. Traits under such control are called polygenic traits. Because it is theoretically impossible to identify an association between a DNA marker genotype and a phenotypic variation in polygenic traits (Yu et al., 2008), genetic improvement of a polygenic trait using MAS had been unfeasible until the mid-2000s.

Fig. 14.1. Schematic representations of genomic selection (GS). (A) Difference in DNA markers between traditional marker-assisted selection (MAS) and GS. Sideways pentagons indicate genes in a genome. The arrows indicate the positions of the DNA markers. The bars on the pentagons indicate the effect size of the gene. “Use” and “Not” indicate whether the DNA marker is used or not used in MAS or GS. In traditional MAS, only DNA markers close to a gene with a large effect are used, whereas all the DNA markers are used in GS. (B) Core processes in GS. Pheno., Geno., and GEBV indicate phenotype data, genotype data, and genomic estimated breeding value, respectively.

Genomic selection (GS) is currently the only form of MAS applicable to polygenic traits (Meuwissen et al., 2001). The concept of GS was first proposed at the beginning of the 2000s but gained popularity only after the mid-2000s (Taylor et al., 2016). The delay was primarily due to the unavailability of genomics technology that enables genotyping of hundreds of lines (individuals) with thousands of DNA markers (Davey et al., 2011). The idea of GS is based on the assumption that DNA markers from dense genome-wide genotyping will be in tight linkage with most genes controlling a trait (Fig. 14.1A). The genotype data from all DNA markers are used to predict the phenotype of lines, whereas traditional MAS for mono- and oligogenic traits uses only selected DNA markers whose association with phenotypic variation has been statistically verified (Fig. 14.1A). GS has already achieved great success in animal breeding (Taylor et al., 2016) and has now been implemented in various plant-breeding programs (Crossa et al., 2017). Despite its considerable impact on breeding efficiency, GS is still not widely recognized among biologists and plant breeders unfamiliar with theories in advanced genetics. This is attributable to the following three factors: (i) excessive technical jargon for non-statisticians; (ii) the lack of a common implementation method across plant breeding programs; and (iii) rapid progress in the development of new methods, which makes it difficult to keep track of the latest information on GS.

This chapter consists of sections that address each of the above issues. Section 14.2 provides a brief explanation of the core processes in GS and the methods used in each process (Fig. 14.1B). Section 14.3 introduces diversity in plant breeding and how GS has been implemented in each plant-breeding program. Section 14.4 introduces advanced topics in GS that include the adoption of omics approaches other than genomics. This chapter does not provide a detailed description of each topic, and therefore, further reading is recommended. Instead, the aim is to help readers identify the methods and theories necessary to understand GS and to implement GS in their breeding programs.

14.2 Core Processes in GS

The original concept of GS consists of three processes (Fig. 14.1B). The first process is the preparation of training data that are used for the subsequent model construction process. In GS, the training data are data from a population for which both DNA marker genotypes and phenotypic values are available. Therefore, the population used for GS model construction is referred to as a “training population” (Fig. 14.1B).

The second process is constructing a statistical model that predicts the phenotype of each line by using the DNA marker genotypes as the explanatory variables. Models constructed for GS are known by various names, such as genomic prediction models, genomic-enabled prediction models, genome-wide prediction models, whole-genome prediction models, and GS models. In this chapter, the term GS model will be used hereafter (Fig. 14.1B). Strictly speaking, the GS model output is not the predicted phenotypic value, because plant phenotypes vary depending on the growth environment and genotype-by-environment interaction (G × E) effects (van Eeuwijk et al., 2019). Instead, it is an estimate of the genetic potential of the line. This characteristic is preferable because plant breeding aims to develop cultivars or varieties with high genetic potential in agronomically and economically important traits. Therefore, the output of a GS model is referred to as the genomic estimated breeding value (GEBV) (Fig. 14.1B). As in other scientific fields that use statistical modeling, GS model accuracy can be evaluated by various indicators such as mean absolute error, mean squared error, and the correlation coefficient between GEBVs and actual observed phenotypic values. In GS, the correlation coefficient is more popular than the other indicators.

The third process is the GEBV-based selection of lines from a breeding population (Fig. 14.1B). In this process, promising lines are selected without actual phenotypic observations. Thus, GS can reduce the costs and effort required for phenotypic observation in plant breeding. The selection process varies considerably depending on plant species and breeding program. Therefore, topics on the selection process are introduced in another section (section 14.3). In the following subsections, methods and techniques used to prepare training data and construct GS models are introduced.
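The three accuracy indicators mentioned in this section can be compared with a short sketch. The function names and the toy values are assumptions chosen for illustration and do not come from any GS dataset.

```python
def pearson_r(x, y):
    """Correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical observed phenotypes and GEBVs for five lines.
# The GEBVs are deliberately on a different scale from the phenotypes.
observed = [4.1, 5.0, 5.6, 6.2, 7.0]
gebv = [0.2, 0.5, 0.4, 0.9, 1.1]

accuracy = pearson_r(observed, gebv)
mae = mean_absolute_error(observed, gebv)
mse = mean_squared_error(observed, gebv)
```

The correlation is unchanged when the GEBVs are shifted or rescaled, whereas the mean absolute error and mean squared error are not. The correlation therefore measures how well the lines are ranked, which is what matters when GEBVs are used for selection.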

14.2.1 Preparation of training data

In the preparation of training data, the genetic composition of the training population is important for constructing accurate GS models. A training population that is genetically close to the target breeding population, ideally consisting of full siblings, increases the accuracy of the GS model (Habier et al., 2007). To achieve high prediction accuracy by using genetically distant training populations, many lines (about 1000) are required (Edwards et al., 2019). Because resources such as cost and effort are limited in plant breeding programs, methods to optimize the genetic composition of training populations have been proposed. For example, Rincent et al. (2012) proposed that the maximization of the generalized coefficient of determination (CD) is a useful criterion to optimize the training population. Isidro et al. (2015) and Rincent et al. (2017) indicated that stratified CD or simple stratified sampling is recommended when the training population has a strong population structure (i.e., where lines in the training population can be divided into a small number of genetically close subpopulations). The efficiency of the above-mentioned methods was verified based on the assumption that information on a breeding population is unavailable. Akdemir and Isidro-Sánchez (2019) indicated that the genetic background of the breeding population should be considered in the optimization process of the training population to achieve higher GS model accuracy.

What and how to phenotype depends largely on the breeding program itself rather than on how to perform GS. However, how to genotype the training population is an issue that needs to be addressed by GS users. Because the idea of GS is based on the capture of small-effect genes distributed over the genome by using genome-wide DNA markers, numerous DNA markers are preferable. Since the GS model is a function that uses a specific set of genotype information as the explanatory variables, the same DNA marker set must be used for the training population and the breeding population (Fig. 14.1B). In addition, high reproducibility and a low missing rate are important for genotyping in GS. Therefore, SNP genotyping arrays have been preferred in plant GS, as in animal GS (Taylor et al., 2016). Genotyping-by-sequencing (GBS) is another high-throughput genotyping method (Elshire et al., 2011). GBS is less advantageous than SNP genotyping arrays in terms of reproducibility and missing rate. However, GBS detects so many DNA markers that thousands of SNPs with a low missing rate often remain even after aggressive filtering. Therefore, GBS is also used for genotyping in GS (Jarquín et al., 2014a; Spindel et al., 2015).

The simplest yet important question is how many DNA markers are required for GS. The number of DNA markers required for GS depends on the degree of linkage disequilibrium (LD) in the training and test populations (Daetwyler et al., 2010). LD is the correlation of genotype patterns between a nucleotide variant and a nearby nucleotide variant (Gupta et al., 2005). A larger number of DNA markers is required for GS using a population with a lower degree of LD. In general, the degree of LD is high in a population consisting of genetically close lines, such as a biparental F2 population. The degree of LD will be low in a population with genetically diverse lines and in a population that has experienced many genetic recombinations, even if it originates from a biparental population. Several hundred DNA markers would be sufficient for a population with a high LD state (Habier et al., 2009; Heffner et al., 2011). A larger number of DNA markers in genetically diverse populations results in higher GS model accuracy, although there is a diminishing return with more than 10,000 DNA markers (Spindel et al., 2015; Hong et al., 2020). This diminishing return has also been demonstrated in simulation studies (Pérez-Enciso et al., 2015; Iheshiulor et al., 2016).
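Because LD is described above as a correlation of genotype patterns, it can be illustrated with a minimal sketch. The squared correlation of genotype codes used here is a common approximation for unphased data, not a method prescribed by the studies cited above; the function name and the toy genotypes are assumptions.

```python
def ld_r2(marker_a, marker_b):
    """Squared correlation between the genotype codes of two markers."""
    n = len(marker_a)
    mean_a, mean_b = sum(marker_a) / n, sum(marker_b) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in zip(marker_a, marker_b))
    saa = sum((a - mean_a) ** 2 for a in marker_a)
    sbb = sum((b - mean_b) ** 2 for b in marker_b)
    if saa == 0 or sbb == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return sab * sab / (saa * sbb)


# Hypothetical genotypes of eight lines, coded {-1, 0, 1} = {aa, Aa, AA}.
marker_1 = [-1, -1, 0, 0, 1, 1, -1, 1]
marker_2 = [-1, -1, 0, 0, 1, 1, -1, 0]  # differs from marker_1 in one line
marker_3 = [1, -1, 0, 1, -1, 0, 0, 1]   # unrelated pattern

high_ld = ld_r2(marker_1, marker_2)
low_ld = ld_r2(marker_1, marker_3)
```

When neighboring markers show values close to 1, as `marker_1` and `marker_2` do here, one of them adds little information beyond the other. This is why fewer markers suffice in high-LD populations such as a biparental F2.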

14.2.2 Construction of GS model

Construction of the GS model is performed by regressing phenotypic values on the genotypes of genome-wide DNA markers (Fig. 14.2). In plant breeding, the number of markers (p) is much larger than the number of phenotyped lines (n). Therefore, ordinary least-squares regression is inapplicable for estimating the effects of the p markers simultaneously. To avoid the “curse of dimensionality (n ≪ p)” problem, various statistical modeling methods have been applied for GS model construction.

Fig. 14.2. Statistical modelling for genomic selection (GS). (A) Basic equation used for parametric GS models. y_i, μ, and ε_i indicate the phenotypic value of the i-th individual, the population-wide mean, and the residual of the i-th individual, respectively. β_j models the effect of the j-th DNA marker. x_ij is the genotype value of the j-th DNA marker of the i-th individual. The genotype values of DNA markers are coded as {–1, 0, 1} = {aa, Aa, AA}. (B) A visual comparison of DNA marker effects between different statistical modeling methods. The x-axis indicates DNA markers in genome coordinates. The y-axis indicates DNA marker effect. In the Bayesian LASSO, all the DNA markers appear to have small but non-zero effects. Meanwhile, in Bayes B, there are two prominent DNA markers with extremely large effects. The values were obtained using data on tomato fruit sugar content from Yamamoto et al. (2017).

Genomic best linear unbiased prediction with linear ridge kernel regression (GBLUP-RR) may be the most frequently used statistical modeling method in GS (VanRaden, 2008). GBLUP-RR uses a genomic relationship matrix (GRM) calculated from genome-wide genotype data as the kernel of the model. As in other scientific fields that use statistical modeling, regularized regression methods such as ridge regression (RR) (Hoerl and Kennard, 1970) and LASSO (Tibshirani, 1996) are also used in GS. However, in GS, Bayesian regularized regression methods are more popular than RR and LASSO. The methods include Bayes A, B (Meuwissen et al., 2001), C (Habier et al., 2011), etc. This naming rule is called the Bayesian alphabet, and different letters indicate different prior distributions of the marker effect, while the other core algorithms are the same (Fig. 14.2; Gianola, 2013). Park and Casella (2008) developed a Bayesian regularized regression version of LASSO, termed Bayesian LASSO (Fig. 14.2). Bayesian LASSO is also frequently used in GS.

The above-mentioned methods are parametric and capture only additive linear effects. Therefore, these methods have been extended to capture nonlinear effects such as dominance and epistatic effects (Vitezica et al., 2017; Varona et al., 2018). These extensions are based on integrating additional explanatory variables that encode nonlinear effects in a parametric modeling framework. Meanwhile, non-parametric methods are also used to capture nonlinear effects. Reproducing kernel Hilbert spaces (RKHS) regression is similar to GBLUP-RR but uses a nonlinear Gaussian kernel instead of an additive GRM (Gianola and van Kaam, 2008). Machine learning methods such as support vector machine (SVM) (Drucker et al., 1996), random forest (Breiman, 2001), and neural networks (Hornik et al., 1989) are also used in non-parametric GS models.

The use of different GS model construction methods results in different GEBVs and prediction accuracies. However, it has been reported that differences in prediction accuracy between methods were often small in most empirical studies (Heslot et al., 2012; de los Campos et al., 2013). Nevertheless, the advantages and disadvantages of methods in response to the genetic architecture of target traits have been revealed using simulation experiments. Habier et al. (2007, 2013) indicated that Bayesian regularized regression methods are advantageous when a training population and a breeding population are genetically distant. When nonlinear effects control a phenotype, non-parametric methods show higher prediction accuracy than parametric methods. In a small training population, random forest seems to be more suitable for capturing nonlinear effects than the other non-parametric methods (Onogi et al., 2015).
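To make the parametric model of Fig. 14.2A concrete, the following sketch estimates marker effects by ridge regression when markers outnumber lines. It is a minimal illustration under stated assumptions, not the algorithm of any GS package: the function names and toy data are hypothetical, μ is simply fixed at the phenotype mean, and the penalty `lam` is supplied by hand, whereas mixed-model implementations usually derive it from estimated variance components.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_marker_effects(x, y, lam):
    """Return (mu, beta) minimizing sum_i (y_i - mu - sum_j x_ij beta_j)^2
    + lam * sum_j beta_j^2, with mu fixed at the mean of y (a simplification).
    """
    n, p = len(x), len(x[0])
    mu = sum(y) / n
    y_centered = [v - mu for v in y]
    # Normal equations of ridge regression: (X'X + lam I) beta = X'y.
    xtx = [[sum(x[i][j] * x[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(x[i][j] * y_centered[i] for i in range(n)) for j in range(p)]
    return mu, solve_linear_system(xtx, xty)


def predict_gebv(x_new, mu, beta):
    return [mu + sum(g * b for g, b in zip(row, beta)) for row in x_new]


# Hypothetical data: n = 4 lines, p = 6 markers coded {-1, 0, 1}.
x = [[1, 0, -1, 1, 0, 1],
     [-1, 1, 0, 1, -1, 0],
     [0, -1, 1, -1, 1, 1],
     [1, 1, -1, 0, 0, -1]]
y = [5.2, 4.1, 6.3, 4.8]

mu, beta = ridge_marker_effects(x, y, lam=1.0)
gebv = predict_gebv(x, mu, beta)
```

With `lam = 0` the 6 × 6 matrix X'X would be singular because only four lines are available, which is the n ≪ p problem in miniature. Any positive penalty makes the system solvable, and larger penalties shrink all marker effects toward zero.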

In GS model construction, it is important to determine whether a GS model has predictive ability in the target breeding population. However, direct assessment of predictive ability is impossible because a breeding population’s phenotypic values are usually unavailable in advance. Moreover, if phenotypic values were available, it would be better to perform phenotypic selection than GS (Wong and Bernardo, 2008). To resolve this contradiction, the prediction accuracy of a GS model is instead estimated using cross-validation (CV) (Hastie et al., 2009). CV consists of three processes: (i) division of the training data into a tentative training population and a test population; (ii) construction of a tentative GS model by using the tentative training population; and (iii) evaluation of the prediction accuracy of the tentative GS model in the test population (Fig. 14.3). The standard method is K-fold CV (Fig. 14.3A). In K-fold CV, training data with n cases (each including a phenotypic value and genotypes of DNA markers) are divided into K folds of approximately equal size. In each iteration, data in K–1 folds (Fig. 14.3A, black rectangles) are used to construct a tentative GS model, which is then used to calculate GEBVs for the remaining testing fold (Fig. 14.3A, gray rectangle). This exercise is repeated until every fold has served as the testing fold, and the overall results are combined (Fig. 14.3A). If K equals n, the CV scheme is called leave-one-out cross-validation (LOOCV). In standard K-fold CV, the training data are randomly divided into folds (i.e., random CV). However, CV results are substantially affected by population structure, and random CV in a structured population results in strongly inflated prediction accuracy (Windhausen et al., 2012; Hickey et al., 2014; Werner et al., 2020). Therefore, if there is a strong population structure in a training population, fold division based on the population structure is necessary for a reliable CV.
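The three CV processes described above can be sketched as follows. This is a minimal illustration with assumed names and simulated data; in particular, `fit_marginal` and `predict_marginal` are a deliberately crude stand-in model invented for this example, and any GS model with a fit step and a predict step could be passed in instead.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly assign n lines to k folds of approximately equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[f::k] for f in range(k)]


def cross_validate(genotypes, phenotypes, k, fit, predict, seed=0):
    """Return one out-of-fold prediction per line (random K-fold CV)."""
    n = len(phenotypes)
    predictions = [None] * n
    for test_idx in k_fold_indices(n, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # (ii) tentative model from the tentative training population
        model = fit([genotypes[i] for i in train_idx],
                    [phenotypes[i] for i in train_idx])
        # (iii) predictions for the lines in the testing fold
        for i in test_idx:
            predictions[i] = predict(model, genotypes[i])
    return predictions


def fit_marginal(x_train, y_train):
    """Hypothetical stand-in model: averaged single-marker regressions."""
    n, p = len(x_train), len(x_train[0])
    mu = sum(y_train) / n
    effects = []
    for j in range(p):
        col = [row[j] for row in x_train]
        mean_j = sum(col) / n
        var = sum((c - mean_j) ** 2 for c in col)
        cov = sum((c - mean_j) * (v - mu) for c, v in zip(col, y_train))
        effects.append(cov / var / p if var > 0 else 0.0)
    return mu, effects


def predict_marginal(model, row):
    mu, effects = model
    return mu + sum(e * g for e, g in zip(effects, row))


# Simulated data: 20 lines, 5 markers, the first marker has a real effect.
rng = random.Random(7)
genotypes = [[rng.choice([-1, 0, 1]) for _ in range(5)] for _ in range(20)]
phenotypes = [2.0 * g[0] + rng.gauss(0, 0.5) for g in genotypes]

cv_pred = cross_validate(genotypes, phenotypes, 4, fit_marginal, predict_marginal)
loocv_pred = cross_validate(genotypes, phenotypes, 20, fit_marginal, predict_marginal)
```

The estimated accuracy is then the correlation between `cv_pred` and `phenotypes`, computed either per fold or over all lines. For a structured population, the random shuffle in `k_fold_indices` would be replaced by an assignment that keeps each subpopulation within a single fold.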
In plant breeding, cultivars and/or varieties are grown in fields different from where the breeding was performed. Moreover, the extreme climate changes in recent years suggest that growth conditions might be considerably different over the years, even in the same locations. Effects of growth environment differences and G × E on phenotype are problematic not only for agricultural productivity but also for estimating GS model accuracy. To address these issues, CV schemes have been proposed to estimate GS


E. Yamamoto

Fig. 14.3. Schematic representation of cross-validation (CV) methods used in genomic selection (GS). The horizontal axes in the panels indicate lines used for GS model construction and evaluation. The black and gray rectangles indicate datasets used as training populations and test populations, respectively. (A) K-fold CV. The panel represents a case where K = 4. r indicates the correlation coefficient between genomic estimated breeding values and the observed phenotypic values in the test lines selected at the fold. (B) CV that considers genotype-by-environment interaction (G × E) effects. The white rectangles indicate that the data of the lines in the environment are neither the training data nor the target of prediction.

model accuracy across different environments (Fig. 14.3B). CV2 predicts the performance of lines that have been evaluated in some environments but not in others (Burgueño et al., 2012). CV1 evaluates the performance of lines that have not been evaluated in any observed environment (Burgueño et al., 2012). CV0 predicts the performance of previously tested lines in untested locations (Jarquín et al., 2017). CV00 predicts the performance of untested lines in untested locations (Jarquín et al., 2017).

Although low-cost and high-throughput genotyping technologies are essential for the feasibility of GS, user-friendly and open-access software packages have also contributed to the widespread use of GS in plant breeding. Most software packages are available as R packages. Among them, rrBLUP (Endelman, 2011), BGLR

Plant Genomic Selection


Table 14.1. Methods to construct GS models and their corresponding R packages.

Method            Category         Reference                      R package
GBLUP-RR          Parametric       VanRaden (2008)                rrBLUP, BGLR
RR                Parametric       Hoerl and Kennard (1970)       glmnet
LASSO             Parametric       Tibshirani (1996)              glmnet
Bayes A           Parametric       Meuwissen et al. (2001)        BGLR
Bayes B           Parametric       Meuwissen et al. (2001)        BGLR, VIGoR
Bayes C           Parametric       Habier et al. (2011)           BGLR, VIGoR
Bayesian LASSO    Parametric       Park and Casella (2008)        BGLR, VIGoR
RKHS              Non-parametric   Gianola and van Kaam (2008)    rrBLUP, BGLR
SVM               Non-parametric   Drucker et al. (1996)          e1071
Random forest     Non-parametric   Breiman (2001)                 randomForest
Neural network    Non-parametric   Hornik et al. (1989)           nnet, tensorflow

Fig. 14.4. An example of how to implement genomic selection (GS) in practical breeding. The example assumes a case of tomato breeding. In the phenotypic selection (PS), breeding selections can be performed only twice in 3 years. On the other hand, breeding selection can be performed four times with GS.

(Pérez and de Los Campos, 2014), and VIGoR (Onogi and Iwata, 2016) were developed to be compatible with GS. Table 14.1 shows the statistical modeling methods used in GS and R packages that implement these methods.
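The four G × E cross-validation schemes (CV1, CV2, CV0, and CV00; Fig. 14.3B) differ only in which line-by-environment records are used for training, which are predicted, and which are set aside. The sketch below labels each cell of a lines × environments grid accordingly. It is an illustration under simple assumptions (a complete grid, one hidden environment per test line in CV2, and a single test environment); the function name `cv_roles` and its parameters are hypothetical.

```python
import numpy as np

TRAIN, TEST, UNUSED = 0, 1, 2   # roles of a line-by-environment record

def cv_roles(scheme, n_lines, n_envs, rng, test_frac=0.25, test_env=0):
    """Label every (line, environment) cell as training, test, or unused."""
    if n_envs < 2:
        raise ValueError("at least two environments are needed")
    roles = np.full((n_lines, n_envs), TRAIN)
    n_test = max(1, int(round(test_frac * n_lines)))
    test_lines = rng.choice(n_lines, size=n_test, replace=False)
    if scheme == "CV1":
        # Test lines have no records in any environment.
        roles[test_lines, :] = TEST
    elif scheme == "CV2":
        # Each test line is hidden in one environment but observed in the others.
        hidden_env = rng.integers(0, n_envs, size=n_test)
        roles[test_lines, hidden_env] = TEST
    elif scheme == "CV0":
        # All lines are observed elsewhere; one whole environment is predicted.
        roles[:, test_env] = TEST
    elif scheme == "CV00":
        # Neither the test lines nor the test environment appear in training.
        roles[test_lines, :] = UNUSED
        roles[:, test_env] = UNUSED
        roles[test_lines, test_env] = TEST
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return roles

rng = np.random.default_rng(2)
for scheme in ("CV1", "CV2", "CV0", "CV00"):
    roles = cv_roles(scheme, n_lines=8, n_envs=3, rng=rng)
    print(scheme)
    print(roles.T)   # rows = environments, columns = lines
```

The `UNUSED` label corresponds to the white rectangles in Fig. 14.3B: records that are neither training data nor prediction targets.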

14.3 Implementation of GS in Practical Plant Breeding

Because GS was developed to enhance breeding efficiency, how GS contributes to practical breeding is more important than how accurate GS models are. The main advantages of GS over traditional phenotypic selection are reducing the cost per selection cycle and the time required for variety development. Specifically, GS enables more breeding selection cycles than traditional phenotypic selection in a limited time unit (Fig. 14.4). These advantages were first proposed by using simulation experiments (Bernardo and Yu, 2007; Wong and Bernardo, 2008; Heffner et al., 2010, 2011), and subsequently demonstrated in empirical experiments (Combs and Bernardo, 2013; Beyene et al., 2015; Rutkoski et al., 2015; Yabe et al., 2018). To achieve significant genetic improvement, it is necessary to perform cycles of: (i) selection from a breeding population; (ii) crossing of the selected individuals; and (iii) development of a


new breeding population (Fig. 14.4). Therefore, a plant breeder might use GS models over multiple generations. However, it has been reported that GS model prediction accuracy decreases as breeding generations advance, because the many recombination events that accumulate over generations cause LD between DNA markers and genes to decay. To avoid this problem, reconstructing the GS model with an updated training population is recommended when GS is applied over multiple generations (Habier et al., 2009; Bassi et al., 2016; Neyhart et al., 2017).

F1 hybrid breeding is a popular breeding strategy in plants, because F1 hybrids often show high yield and high robustness against various stresses, although the genetic mechanism is largely unknown (Duvick, 2001). An F1 hybrid is developed by crossing parents that are homozygous at all loci (i.e., inbred lines). In F1 hybrid breeding, the number of combinations of inbred lines that should be tested increases rapidly (roughly with the square of the number of candidate parents). Therefore, pre-selection of promising F1 hybrids (or pre-removal of non-promising F1 hybrids) considerably reduces the costs and effort required for phenotypic evaluation and subsequent breeding selection. In GS for F1 hybrid breeding, the genotypes of all possible F1 hybrids can be derived from the genotypes of the parental inbred lines, because a DNA marker in an F1 hybrid is heterozygous only when the two inbred parents carry different alleles at that marker. Then, phenotype data of selected F1 hybrids and/or parental inbred lines are used to construct GS models. The efficiency of GS in F1 hybrid breeding has been demonstrated in major crops such as maize (Acosta-Pech et al., 2017), wheat (Basnet et al., 2019), rice (Xu et al., 2014), and strawberry (Yamamoto et al., 2021).

Progeny genotype prediction is more complicated when the parents are non-inbred lines (i.e., lines including numerous heterozygous loci).
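Before turning to non-inbred parents, the inbred case can be made concrete. The sketch below derives the marker genotypes of every single-cross F1 hybrid from the genotypes of its inbred parents, assuming biallelic markers coded as the number of copies of one allele (0 or 2 in an inbred line). The function names and the toy genotypes are hypothetical.

```python
import numpy as np
from itertools import combinations

def n_single_crosses(n_parents):
    """Number of distinct single crosses among n inbred parents: n(n-1)/2."""
    return n_parents * (n_parents - 1) // 2

def f1_genotypes(inbred_geno):
    """Derive F1 genotypes (0/1/2) for all parent pairs.

    inbred_geno: (n_parents, n_markers) array coded 0 or 2 (homozygous).
    Each parent transmits one copy of its allele, so the F1 code is the
    parental mean; it equals 1 (heterozygous) only where the parents differ.
    """
    inbred_geno = np.asarray(inbred_geno)
    if not np.isin(inbred_geno, (0, 2)).all():
        raise ValueError("inbred parents must be coded 0 or 2 at every marker")
    pairs = list(combinations(range(len(inbred_geno)), 2))
    f1 = np.array([(inbred_geno[i] + inbred_geno[j]) // 2 for i, j in pairs])
    return pairs, f1

# Three hypothetical inbred lines genotyped at four markers.
parents = np.array([
    [0, 2, 2, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 0],
])
pairs, f1 = f1_genotypes(parents)
for (i, j), geno in zip(pairs, f1):
    print(f"parent {i} x parent {j}: {geno}")
print("single crosses among 100 parents:", n_single_crosses(100))
```

The resulting `f1` matrix can be passed to a fitted GS model in the same way as observed genotypes, which is what allows untested hybrids to be ranked before any cross is made.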
To predict the progeny genotype from non-inbred parents, simulation of meiotic recombination is necessary, which is achieved by the following three steps: (i) estimation of linkage phases at heterozygous loci by using haplotype phasing software such as BEAGLE (Browning and Browning, 2007); (ii) approximation of meiotic recombination patterns between loci using genetic linkage map information; and (iii) Monte Carlo simulation of possible progeny genotypes by using the results from (i) and (ii). Iwata et al. (2013) demonstrated the efficacy of this method with real data from Japanese pear, especially for selecting parents that generate progeny with high performance in multiple traits. This efficacy was also confirmed in tomato GS (Yamamoto et al., 2017). In addition, Yamamoto et al. (2016) applied this method to design a breeding plan over generations.
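Steps (ii) and (iii) can be sketched as follows, taking the phased haplotypes from step (i) as given. This is a simplified illustration and not the procedure of the cited studies: it assumes a single chromosome, no crossover interference, and the Haldane map function for converting map distances to recombination fractions. The haplotypes, map positions, and marker effects are made up, and the function names are hypothetical.

```python
import numpy as np

def haldane_recombination(map_cm):
    """Recombination fraction between adjacent markers from positions in cM."""
    d_morgan = np.diff(map_cm) / 100.0
    return 0.5 * (1.0 - np.exp(-2.0 * d_morgan))

def simulate_gametes(hap_a, hap_b, rec_frac, n_gametes, rng):
    """Sample gametes from one parent with phased haplotypes hap_a and hap_b (0/1)."""
    n_markers = len(hap_a)
    start = rng.integers(0, 2, size=(n_gametes, 1))            # strand at first marker
    crossover = rng.random((n_gametes, n_markers - 1)) < rec_frac
    # Strand at each marker = starting strand flipped once per crossover so far.
    switches = np.concatenate(
        [np.zeros((n_gametes, 1), dtype=int), np.cumsum(crossover, axis=1)], axis=1
    )
    strand = (start + switches) % 2
    return np.where(strand == 0, hap_a, hap_b)

def simulate_progeny(parent1, parent2, rec_frac, n_progeny, rng):
    """Progeny genotypes (0/1/2) as the sum of one gamete from each parent."""
    g1 = simulate_gametes(parent1[0], parent1[1], rec_frac, n_progeny, rng)
    g2 = simulate_gametes(parent2[0], parent2[1], rec_frac, n_progeny, rng)
    return g1 + g2

rng = np.random.default_rng(3)
map_cm = np.array([0.0, 5.0, 20.0, 21.0, 60.0])     # hypothetical linkage map
rec = haldane_recombination(map_cm)

# Phased haplotypes of two hypothetical non-inbred parents.
parent1 = (np.array([1, 1, 0, 0, 1]), np.array([0, 1, 1, 0, 0]))
parent2 = (np.array([1, 0, 0, 1, 1]), np.array([1, 1, 0, 0, 0]))

progeny = simulate_progeny(parent1, parent2, rec, n_progeny=1000, rng=rng)

# Hypothetical marker effects from a fitted GS model give the predicted
# distribution of genomic values in the simulated family.
marker_effects = np.array([0.5, -0.2, 0.3, 0.0, 0.8])
gebv = progeny @ marker_effects
print("predicted family mean:", gebv.mean(), "sd:", gebv.std())
```

Comparing the simulated mean and spread across candidate parent pairs is one way such simulations can inform the choice of crosses. For several chromosomes, the same function could be applied per chromosome, or the recombination fraction between chromosome ends could be set to 0.5.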

14.4 Advanced Topics in GS

The topics covered in the above sections are well-established and empirically confirmed parts of GS studies. They are ready for application in practical plant breeding and are therefore valuable for breeders rather than for researchers. The following sections introduce topics that GS researchers are currently working on. Although these topics might not yet be ready for practical plant breeding, they provide useful information on what to expect in plant breeding in the near future.

14.4.1 GS model incorporating G × E effects

The factor that most strongly disturbs the central dogma is the effect of the growth environment and G × E (van Eeuwijk et al., 2019). G × E effects are observed when the ranking of phenotypic values within the same set of genotypes changes under different environmental conditions. Statistical modeling of G × E effects requires complicated methods, whereas the linear effect of the growth environment can be modeled simply by additive covariates. In the method developed by Jarquín et al. (2014b), environmental effects and G × E effects were modeled in the form of variance–covariance matrices and incorporated into multi-kernel GBLUP. Lopez-Cruz et al. (2015) extended this idea to enable estimation of the effect of each DNA marker in each environment. Crossa et al. (2016) extended it further to enable the detection of DNA markers with large effects by using the algorithm in Bayes B.

Crop modeling is another approach that predicts plant phenotype by using statistical modeling. In crop modeling, information on growth environment conditions such as temperature, photoperiod, and flowering time is used as explanatory variables, while genotype data are not used (Tardieu, 2003). Heslot et al. (2014) incorporated crop model values into a GS model that is an extension of GBLUP. This enabled phenotype prediction of lines in untested environments. Onogi et al. (2016) developed a method that simultaneously estimates the parameters of crop models and a GS model to achieve higher prediction accuracy.
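To illustrate what modeling G × E "in the form of a variance–covariance matrix" can look like, the sketch below builds three kernels over phenotype records: a genomic kernel, an environmental kernel, and their element-wise product as a G × E kernel. This follows a common reaction-norm construction and is offered as an assumption-laden sketch, not as the exact model of the cited papers. In particular, the variance weights are fixed by hand here, whereas in practice they would be estimated (for example with a multi-kernel model in BGLR). The covariates, phenotypes, and function names are all hypothetical.

```python
import numpy as np

def genomic_relationship(M):
    """Genomic relationship matrix from 0/1/2 marker codes (VanRaden-style scaling)."""
    p = M.mean(axis=0) / 2.0                      # allele frequencies
    Z = M - 2.0 * p                               # center each marker
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def environmental_relationship(W):
    """Similarity between environments from standardized environmental covariates."""
    Ws = (W - W.mean(axis=0)) / W.std(axis=0)
    return Ws @ Ws.T / W.shape[1]

def record_kernels(G, Omega, line_idx, env_idx):
    """Expand line- and environment-level matrices to one row/column per record."""
    K_g = G[np.ix_(line_idx, line_idx)]
    K_e = Omega[np.ix_(env_idx, env_idx)]
    K_ge = K_g * K_e                              # element-wise (Hadamard) product
    return K_g, K_e, K_ge

def kernel_blup(K, y, train, test, noise_var):
    """Predict test records from training records with a fixed combined kernel."""
    mu = y[train].mean()
    K_tt = K[np.ix_(train, train)]
    alpha = np.linalg.solve(K_tt + noise_var * np.eye(len(train)), y[train] - mu)
    return mu + K[np.ix_(test, train)] @ alpha

rng = np.random.default_rng(1)
n_lines, n_markers, n_envs = 30, 200, 4
M = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)
W = rng.normal(size=(n_envs, 3))                  # made-up environmental covariates

G = genomic_relationship(M)
Omega = environmental_relationship(W)

# One record per line-by-environment combination.
line_idx = np.repeat(np.arange(n_lines), n_envs)
env_idx = np.tile(np.arange(n_envs), n_lines)
K_g, K_e, K_ge = record_kernels(G, Omega, line_idx, env_idx)

# Variance weights fixed purely for illustration (an assumption).
K = 1.0 * K_g + 1.0 * K_e + 0.5 * K_ge
y = rng.normal(size=n_lines * n_envs)             # placeholder phenotypes

# Predict all lines in environment 0 from the other environments.
train = np.flatnonzero(env_idx != 0)
test = np.flatnonzero(env_idx == 0)
y_hat = kernel_blup(K, y, train, test, noise_var=1.0)
```

Because the environmental kernel is built from covariates rather than from environment labels, an environment without phenotype records still has non-zero covariance with the observed ones, which is what makes prediction for it possible in this kind of model.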

14.4.2 DNA marker selection for GS model construction

In the GS model, genotype data from all available DNA markers are used to predict phenotypes (Fig. 14.1A). Meanwhile, it has been reported that appropriate DNA marker selection (thinning) before GS model construction might increase GS prediction accuracy. In most cases, marker selection is performed based on GWAS results (Spindel et al., 2016; Bian and Holland, 2017; Rice and Lipka, 2019). In animal breeding, previously identified gene information is also used in marker selection for GS (Fang et al., 2017; Fragomeni et al., 2017; Raymond et al., 2018). With ample information on functionally characterized genes, this approach drastically increases GS model accuracy. GS model construction using functionally characterized gene information will also contribute to an increase in plant GS model accuracy, because a large amount of functional genomics information is also available for plant species (Yamamoto et al., 2012).
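A minimal sketch of association-based marker selection is given below. The absolute marker-trait correlation is used as a crude stand-in for single-marker GWAS statistics, which is an assumption made for brevity; a real analysis would use a proper GWAS model that accounts for population structure. The sketch computes the scores from the training lines only, a precaution commonly taken so that the held-out lines do not influence which markers are kept. Function names and data are hypothetical.

```python
import numpy as np

def marker_scores(X, y):
    """Absolute marker-trait correlation for each marker (stand-in for GWAS)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc.T @ yc) / denom
    return np.nan_to_num(np.abs(r))       # monomorphic markers get a score of 0

def select_markers(X_train, y_train, n_keep):
    """Indices of the n_keep highest-scoring markers, in genome order."""
    scores = marker_scores(X_train, y_train)
    return np.sort(np.argsort(scores)[::-1][:n_keep])

# Simulated toy data in which marker 7 has a large effect (values are made up).
rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(100, 50)).astype(float)
X[:, 3] = 1.0                                         # a monomorphic marker
y = 3.0 * X[:, 7] + rng.normal(0, 0.1, 100)

train = np.arange(80)                                 # first 80 lines for training
test = np.arange(80, 100)
keep = select_markers(X[train], y[train], n_keep=5)   # selection uses training data only

X_train_sel = X[train][:, keep]                       # reduced matrices for model fitting
X_test_sel = X[test][:, keep]
```

The reduced matrices can then be passed to any of the methods in Table 14.1 in place of the full marker set.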

14.4.3 Combination with other omics

Phenomics, transcriptomics, and metabolomics, as well as genomics, might contribute to more efficient plant breeding. The development of a multi-omics-based plant breeding system centered on GS is currently an active research area.

Phenomics approaches have already been actively incorporated into GS schemes (Crossa et al., 2017; Moreira et al., 2020). This movement has been driven by the development of high-throughput phenotyping (HTP) platforms. Details of HTP are available in Chapter 5 of this book. Advantages of HTP over traditional phenotypic observation include time-series measurements that track crop development through the life stages and its response to the environment. The use of such data is expected to bridge the gap between genotype and phenotype (Moreira et al., 2020). For example, Rutkoski et al. (2016) used canopy temperature and normalized difference vegetation index (NDVI) measured by unmanned aerial vehicles (UAVs) to improve the accuracy of GS models. On a larger scale, Millet et al. (2019) used high-resolution phenotype and environment data from an automated platform installed in a greenhouse for GS model construction. The GS models were then applied to predict maize yield in fields under diverse European environmental conditions. More information related to the combination of GS and HTP is available elsewhere (Crossa et al., 2017; Moreira et al., 2020).

Incorporating transcriptome and metabolome data into GS model construction is also a challenging field. For maize flowering time prediction, Azodi et al. (2020) demonstrated that adding transcriptome data as explanatory variables in the GS model increases prediction accuracy. The authors explained that the transcriptome contributed to GS model accuracy by capturing effects not reflected in the genome sequences. Tong et al. (2020) increased GS model accuracy for Arabidopsis growth rate by incorporating metabolome data and metabolic models. The authors found that using a line-specific metabolic model is necessary to increase GS model accuracy, while applying a common metabolic model decreased accuracy. Compared with phenomics, incorporating transcriptomics and metabolomics into GS appears to require further study of how to obtain data (e.g., throughput of analyses and data acquisition time) and how to integrate the data into the GS model. Nevertheless, these studies demonstrated the potential of the multi-omics approach to enhance plant breeding.

14.5 Concluding Remarks

In this chapter, key topics for GS in plant breeding and related studies have been introduced. In addition, as explained in section 14.4, researchers in plant breeding are now enthusiastic about the development of high-throughput plant breeding systems that use multi-omics data (Weckwerth et al., 2020). Since plant breeding centered on GS is a developing area, new knowledge and theories are published continually. It is hoped that the topics in this chapter will stimulate readers’ interest and help them access the latest information in this field.

References

Acosta-Pech, R., Crossa, J., de Los Campos, G., Teyssèdre, S., Claustres, B. et al. (2017) Genomic models with genotype × environment interaction for predicting hybrid performance: an application in maize hybrids. Theoretical and Applied Genetics 130(7), 1431–1440. DOI: 10.1007/s00122-017-2898-0.
Akdemir, D. and Isidro-Sánchez, J. (2019) Design of training populations for selective phenotyping in genomic prediction. Scientific Reports 9(1), 1446. DOI: 10.1038/s41598-018-38081-6.
Azodi, C.B., Pardo, J., VanBuren, R., de Los Campos, G. and Shiu, S.-H. (2020) Transcriptome-based prediction of complex traits in maize. The Plant Cell 32(1), 139–151. DOI: 10.1105/tpc.19.00332.
Basnet, B.R., Crossa, J., Dreisigacker, S., Pérez-Rodríguez, P., Manes, Y. et al. (2019) Hybrid wheat prediction using genomic, pedigree, and environmental covariables interaction models. The Plant Genome 12(1), 180051. DOI: 10.3835/plantgenome2018.07.0051.
Bassi, F.M., Bentley, A.R., Charmet, G., Ortiz, R. and Crossa, J. (2016) Breeding schemes for the implementation of genomic selection in wheat (Triticum spp.). Plant Science 242, 23–36. DOI: 10.1016/j.plantsci.2015.08.021.
Bernardo, R. (2008) Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Science 48(5), 1649–1664. DOI: 10.2135/cropsci2008.03.0131.
Bernardo, R. and Yu, J. (2007) Prospects for genomewide selection for quantitative traits in maize. Crop Science 47(3), 1082–1090. DOI: 10.2135/cropsci2006.11.0690.
Beyene, Y., Semagn, K., Mugo, S., Tarekegne, A., Babu, R. et al. (2015) Genetic gains in grain yield through genomic selection in eight bi-parental maize populations under drought stress. Crop Science 55(1), 154–163. DOI: 10.2135/cropsci2014.07.0460.
Bian, Y. and Holland, J.B. (2017) Enhancing genomic prediction with genome-wide association studies in multiparental maize populations. Heredity 118, 585–593. DOI: 10.1038/hdy.2017.4.
Breiman, L. (2001) Random forests. Machine Learning 45(1), 5–32. DOI: 10.1023/A:1010933404324.
Browning, S.R. and Browning, B.L. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. American Journal of Human Genetics 81(5), 1084–1097. DOI: 10.1086/521987.
Burgueño, J., de los Campos, G., Weigel, K. and Crossa, J. (2012) Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Science 52(2), 707–719. DOI: 10.2135/cropsci2011.06.0299.
Combs, E. and Bernardo, R. (2013) Genomewide selection to introgress semidwarf maize germplasm into U.S. Corn Belt inbreds. Crop Science 53(4), 1427–1436. DOI: 10.2135/cropsci2012.11.0666.
Crossa, J., de los Campos, G., Maccaferri, M., Tuberosa, R., Burgueño, J. et al. (2016) Extending the marker × environment interaction model for genomic-enabled prediction and genome-wide association analysis in durum wheat. Crop Science 56(5), 2193–2209. DOI: 10.2135/cropsci2015.04.0260.
Crossa, J., Pérez-Rodríguez, P., Cuevas, J., Montesinos-López, O., Jarquín, D. et al. (2017) Genomic selection in plant breeding: methods, models, and perspectives. Trends in Plant Science 22(11), 961–975. DOI: 10.1016/j.tplants.2017.08.011.
Daetwyler, H.D., Pong-Wong, R., Villanueva, B. and Woolliams, J.A. (2010) The impact of genetic architecture on genome-wide evaluation methods. Genetics 185(3), 1021–1031. DOI: 10.1534/genetics.110.116855.
Davey, J.W., Hohenlohe, P.A., Etter, P.D., Boone, J.Q., Catchen, J.M. et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics 12(7), 499–510. DOI: 10.1038/nrg3012.


de Los Campos, G., Hickey, J.M., Pong-Wong, R., Daetwyler, H.D. and Calus, M.P.L. (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193(2), 327–345. DOI: 10.1534/genetics.112.143313.
Drucker, H., Burges, C.J., Kaufman, L., Smola, A. and Vapnik, V. (1996) Support vector regression machines. Advances in Neural Information Processing Systems 9, 155–161.
Duvick, D.N. (2001) Biotechnology in the 1930s: the development of hybrid maize. Nature Reviews Genetics 2(1), 69–74. DOI: 10.1038/35047587.
Edwards, S.M., Buntjer, J.B., Jackson, R., Bentley, A.R., Lage, J. et al. (2019) The effects of training population design on genomic prediction accuracy in wheat. Theoretical and Applied Genetics 132(7), 1943–1952. DOI: 10.1007/s00122-019-03327-y.
Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K. et al. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6(5), e19379. DOI: 10.1371/journal.pone.0019379.
Endelman, J.B. (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4(3), 250–255. DOI: 10.3835/plantgenome2011.08.0024.
Fang, L., Sahana, G., Ma, P., Su, G., Yu, Y. et al. (2017) Use of biological priors enhances understanding of genetic architecture and genomic prediction of complex traits within and between dairy cattle breeds. BMC Genomics 18(1), 604. DOI: 10.1186/s12864-017-4004-z.
Fragomeni, B.O., Lourenco, D.A.L., Masuda, Y., Legarra, A. and Misztal, I. (2017) Incorporation of causative quantitative trait nucleotides in single-step GBLUP. Genetics, Selection, Evolution 49(1), 59. DOI: 10.1186/s12711-017-0335-0.
Gianola, D. (2013) Priors in whole-genome regression: the Bayesian alphabet returns. Genetics 194(3), 573–596. DOI: 10.1534/genetics.113.151753.
Gianola, D. and van Kaam, J.B.C.H.M. (2008) Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178(4), 2289–2303. DOI: 10.1534/genetics.107.084285.
Gupta, P.K., Rustgi, S. and Kulwal, P.L. (2005) Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant Molecular Biology 57(4), 461–485. DOI: 10.1007/s11103-005-0257-z.
Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177(4), 2389–2397. DOI: 10.1534/genetics.107.081190.
Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2009) Genomic selection using low-density marker panels. Genetics 182(1), 343–353. DOI: 10.1534/genetics.108.100289.
Habier, D., Fernando, R.L., Kizilkaya, K. and Garrick, D.J. (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12, 186. DOI: 10.1186/1471-2105-12-186.
Habier, D., Fernando, R.L. and Garrick, D.J. (2013) Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194(3), 597–607. DOI: 10.1534/genetics.113.152207.
Heffner, E.L., Lorenz, A.J., Jannink, J.L. and Sorrells, M.E. (2010) Plant breeding with genomic selection: gain per unit time and cost. Crop Science 50(5), 1681–1690. DOI: 10.2135/cropsci2009.11.0662.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York, NY. DOI: 10.1007/978-0-387-84858-7.
Heffner, E.L., Jannink, J.L., Iwata, H., Souza, E. and Sorrells, M.E. (2011) Genomic selection accuracy for grain quality traits in biparental wheat populations. Crop Science 51(6), 2597–2606. DOI: 10.2135/cropsci2011.05.0253.
Heslot, N., Akdemir, D., Sorrells, M.E. and Jannink, J.L. (2014) Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theoretical and Applied Genetics 127(2), 463–480. DOI: 10.1007/s00122-013-2231-5.
Heslot, N., Yang, H.P., Sorrells, M.E. and Jannink, J.L. (2012) Genomic selection in plant breeding: a comparison of models. Crop Science 52(1), 146–160. DOI: 10.2135/cropsci2011.06.0297.
Hickey, J.M., Dreisigacker, S., Crossa, J., Hearne, S., Babu, R. et al. (2014) Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Science 54(4), 1476–1488. DOI: 10.2135/cropsci2013.03.0195.
Hickey, L.T., Hafeez, A.N., Robinson, H., Jackson, S.A., Leal-Bertioli, S.C.M. et al. (2019) Breeding crops to feed 10 billion. Nature Biotechnology 37(7), 744–754. DOI: 10.1038/s41587-019-0152-9.


Hoerl, A.E. and Kennard, R.W. (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67. DOI: 10.1080/00401706.1970.10488634.
Hong, J.-P., Ro, N., Lee, H.-Y., Kim, G.W., Kwon, J.-K. et al. (2020) Genomic selection for prediction of fruit-related traits in pepper (Capsicum spp). Frontiers in Plant Science 11, 570871. DOI: 10.3389/fpls.2020.570871.
Hornik, K., Stinchcombe, M. and White, H. (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359–366. DOI: 10.1016/0893-6080(89)90020-8.
Iheshiulor, O.O., Woolliams, J.A., Yu, X., Wellmann, R. and Meuwissen, T.H. (2016) Within- and across-breed genomic prediction using whole-genome sequence and single nucleotide polymorphism panels. Genetics, Selection, Evolution 48, 15. DOI: 10.1186/s12711-016-0193-1.
Isidro, J., Jannink, J.L., Akdemir, D., Poland, J., Heslot, N. and Sorrells, M.E. (2015) Training set optimization under population structure in genomic selection. Theoretical and Applied Genetics 128, 145–158. DOI: 10.1007/s00122-014-2418-4.
Iwata, H., Hayashi, T., Terakami, S., Takada, N., Saito, T. and Yamamoto, T. (2013) Genomic prediction of trait segregation in a progeny population: a case study of Japanese pear (Pyrus pyrifolia). BMC Genetics 14, 81. DOI: 10.1186/1471-2156-14-81.
Jarquín, D., Kocak, K., Posadas, L., Hyma, K., Jedlicka, J. et al. (2014a) Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15, 740. DOI: 10.1186/1471-2164-15-740.
Jarquín, D., Lemes da Silva, C., Gaynor, R.C., Poland, J., Fritz, A. et al. (2017) Increasing genomic-enabled prediction accuracy by modeling genotype × environment interactions in Kansas wheat. The Plant Genome 10, 1–15. DOI: 10.3835/plantgenome2016.12.0130.
Jarquín, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J. et al. (2014b) A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics 127, 595–607. DOI: 10.1007/s00122-013-2243-1.
Lopez-Cruz, M., Crossa, J., Bonnett, D., Dreisigacker, S., Poland, J. et al. (2015) Increased prediction accuracy in wheat breeding trials using a marker × environment interaction genomic selection model. G3: Genes, Genomes, Genetics 5, 569–582. DOI: 10.1534/g3.114.016097.
Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4), 1819–1829. DOI: 10.1093/genetics/157.4.1819.
Millet, E.J., Kruijer, W., Coupel-Ledru, A., Alvarez Prado, S.A., Cabrera-Bosquet, L. et al. (2019) Genomic prediction of maize yield across European environmental conditions. Nature Genetics 51, 952–956. DOI: 10.1038/s41588-019-0414-y.
Moreira, F.F., Oliveira, H.R., Volenec, J.J., Rainey, K.M. and Brito, L.F. (2020) Integrating high-throughput phenotyping and statistical genomic methods to genetically improve longitudinal traits in crops. Frontiers in Plant Science 11, 681. DOI: 10.3389/fpls.2020.00681.
Neyhart, J.L., Tiede, T., Lorenz, A.J. and Smith, K.P. (2017) Evaluating methods of updating training data in long-term genomewide selection. G3: Genes, Genomes, Genetics 7, 1499–1510. DOI: 10.1534/g3.117.040550.
Onogi, A. and Iwata, H. (2016) VIGoR: variational Bayesian inference for genome-wide regression. Journal of Open Research Software 4(1), 11. DOI: 10.5334/jors.80.
Onogi, A., Ideta, O., Inoshita, Y., Ebana, K., Yoshioka, T. et al. (2015) Exploring the areas of applicability of whole-genome prediction methods for Asian rice (Oryza sativa L.). Theoretical and Applied Genetics 128, 41–53. DOI: 10.1007/s00122-014-2411-y.
Onogi, A., Watanabe, M., Mochizuki, T., Hayashi, T., Nakagawa, H. et al. (2016) Toward integration of genomic selection with crop modelling: the development of an integrated approach to predicting rice heading dates. Theoretical and Applied Genetics 129, 805–817. DOI: 10.1007/s00122-016-2667-5.
Park, T. and Casella, G. (2008) The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686. DOI: 10.1198/016214508000000337.
Pérez-Enciso, M., Rincón, J.C. and Legarra, A. (2015) Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genetics, Selection, Evolution 47, 43. DOI: 10.1186/s12711-015-0117-5.
Pérez, P. and de Los Campos, G. (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. DOI: 10.1534/genetics.114.164442.


Raymond, B., Bouwman, A.C., Schrooten, C., Houwing-Duistermaat, J. and Veerkamp, R.F. (2018) Utility of whole-genome sequence data for across-breed genomic prediction. Genetics, Selection, Evolution 50, 27. DOI: 10.1186/s12711-018-0396-8.
Rice, B. and Lipka, A.E. (2019) Evaluation of RR-BLUP genomic selection models that incorporate peak genome-wide association study signals in maize and sorghum. The Plant Genome 12(1), 180052. DOI: 10.3835/plantgenome2018.07.0052.
Rincent, R., Charcosset, A. and Moreau, L. (2017) Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theoretical and Applied Genetics 130(11), 2231–2247. DOI: 10.1007/s00122-017-2956-7.
Rincent, R., Laloë, D., Nicolas, S., Altmann, T., Brunel, D. et al. (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192(2), 715–728. DOI: 10.1534/genetics.112.141473.
Rutkoski, J., Poland, J., Mondal, S., Autrique, E., Pérez, L.G. et al. (2016) Canopy temperature and vegetation indices from high-throughput phenotyping improve accuracy of pedigree and genomic selection for grain yield in wheat. G3: Genes, Genomes, Genetics 6(9), 2799–2808. DOI: 10.1534/g3.116.032888.
Rutkoski, J., Singh, R.P., Huerta-Espino, J., Bhavani, S., Poland, J. et al. (2015) Genetic gain from phenotypic and genomic selection for quantitative resistance to stem rust of wheat. The Plant Genome 8(2), eplantgenome2014.10.0074. DOI: 10.3835/plantgenome2014.10.0074.
Spindel, J., Begum, H., Akdemir, D., Virk, P., Collard, B. et al. (2015) Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genetics 11(2), e1004982. DOI: 10.1371/journal.pgen.1004982.
Spindel, J.E., Begum, H., Akdemir, D., Collard, B., Redoña, E. et al. (2016) Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity 116(4), 395–408. DOI: 10.1038/hdy.2015.113.
Tardieu, F. (2003) Virtual plants: modelling as a tool for the genomics of tolerance to water deficit. Trends in Plant Science 8(1), 9–14. DOI: 10.1016/s1360-1385(02)00008-0.
Taylor, J.F., Taylor, K.H. and Decker, J.E. (2016) Holsteins are the genomic selection poster cows. Proceedings of the National Academy of Sciences 113(28), 7690–7692. DOI: 10.1073/pnas.1608144113.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society 58(1), 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x.
Tong, H., Küken, A. and Nikoloski, Z. (2020) Integrating molecular markers into metabolic models improves genomic selection for Arabidopsis growth. Nature Communications 11(1), 2410. DOI: 10.1038/s41467-020-16279-5.
van Eeuwijk, F.A., Bustos-Korts, D., Millet, E.J., Boer, M.P., Kruijer, W. et al. (2019) Modelling strategies for assessing and increasing the effectiveness of new phenotyping techniques in plant breeding. Plant Science 282, 23–39. DOI: 10.1016/j.plantsci.2018.06.018.
VanRaden, P.M. (2008) Efficient methods to compute genomic predictions. Journal of Dairy Science 91(11), 4414–4423. DOI: 10.3168/jds.2007-0980.
Varona, L., Legarra, A., Toro, M.A. and Vitezica, Z.G. (2018) Non-additive effects in genomic selection. Frontiers in Genetics 9, 78. DOI: 10.3389/fgene.2018.00078.
Vitezica, Z.G., Legarra, A., Toro, M.A. and Varona, L. (2017) Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics 206(3), 1297–1307. DOI: 10.1534/genetics.116.199406.
Weckwerth, W., Ghatak, A., Bellaire, A., Chaturvedi, P. and Varshney, R.K. (2020) PANOMICS meets germplasm. Plant Biotechnology Journal 18(7), 1507–1525. DOI: 10.1111/pbi.13372.
Werner, C.R., Gaynor, R.C., Gorjanc, G., Hickey, J.M., Kox, T. et al. (2020) How population structure impacts genomic selection accuracy in cross-validation: implications for practical breeding. Frontiers in Plant Science 11, 592977. DOI: 10.3389/fpls.2020.592977.
Windhausen, V.S., Atlin, G.N., Hickey, J.M., Crossa, J., Jannink, J.-L. et al. (2012) Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3: Genes, Genomes, Genetics 2(11), 1427–1436. DOI: 10.1534/g3.112.003699.


Wong, C.K. and Bernardo, R. (2008) Genomewide selection in oil palm: increasing selection gain per unit time and cost with small populations. Theoretical and Applied Genetics 116(6), 815–824. DOI: 10.1007/s00122-008-0715-5.
Xu, S., Zhu, D. and Zhang, Q. (2014) Predicting hybrid performance in rice using genomic best linear unbiased prediction. Proceedings of the National Academy of Sciences 111(34), 12456–12461. DOI: 10.1073/pnas.1413750111.
Xu, Y. and Crouch, J.H. (2008) Marker-assisted selection in plant breeding: from publications to practice. Crop Science 48(2), 391–407. DOI: 10.2135/cropsci2007.04.0191.
Yabe, S., Hara, T., Ueno, M., Enoki, H., Kimura, T. et al. (2018) Potential of genomic selection in mass selection breeding of an allogamous crop: an empirical study to increase yield of common buckwheat. Frontiers in Plant Science 9, 276. DOI: 10.3389/fpls.2018.00276.
Yamamoto, E., Yonemaru, J.I., Yamamoto, T. and Yano, M. (2012) OGRO: the overview of functionally characterized genes in rice online database. Rice 5(1), 26. DOI: 10.1186/1939-8433-5-26.
Yamamoto, E., Matsunaga, H., Onogi, A., Kajiya-Kanegae, H., Minamikawa, M. et al. (2016) A simulation-based breeding design that uses whole-genome prediction in tomato. Scientific Reports 6, 19454. DOI: 10.1038/srep19454.
Yamamoto, E., Matsunaga, H., Onogi, A., Ohyama, A., Miyatake, K. et al. (2017) Efficiency of genomic selection for breeding population design and phenotype prediction in tomato. Heredity 118(2), 202–209. DOI: 10.1038/hdy.2016.84.
Yamamoto, E., Kataoka, S., Shirasawa, K., Noguchi, Y. and Isobe, S. (2021) Genomic selection for F1 hybrid breeding in strawberry (Fragaria × ananassa). Frontiers in Plant Science 12, 645111. DOI: 10.3389/fpls.2021.645111.
Yu, J., Holland, J.B., McMullen, M.D. and Buckler, E.S. (2008) Genetic design and statistical power of nested association mapping in maize. Genetics 178(1), 539–551. DOI: 10.1534/genetics.107.074245.

15 Plant Genome Editing

Naoki Wada1, Yuriko Osakabe2 and Keishi Osakabe1*
1Tokushima University, Tokushima city, Tokushima, Japan; 2Tokyo Institute of Technology, Yokohama, Kanagawa, Japan

Abstract

Genome editing is an innovative technology that has brought about a revolution in the field of genetic engineering. In particular, the emergence of clustered regularly interspaced short palindromic repeat (CRISPR)-CRISPR-associated protein 9 (Cas9) has created a breakthrough in genetic engineering by enabling easy, efficient, and precise genome manipulation. In this chapter, we introduce recent advances in plant genome editing technologies that include precise genome editing by gene targeting, base editing, and a new strategy called prime editing. New CRISPR-Cas tools continue to be reported, resulting in successful applications of engineered Cas9 proteins or newly identified CRISPR-Cas systems such as Cas9-NG, CRISPR-CasΦ, and CRISPR-Cas type I-D systems. The applications of CRISPR-Cas are no longer limited to genome editing but have expanded to other purposes, such as RNA editing by CRISPR-Cas13, and transcriptional control and epigenetic modification by CRISPR-dCas9 fused with effector proteins. The toolbox will be updated further in the near future, bringing new approaches to achieve precise genome editing and genome manipulation in plants. These developments will strongly facilitate the functional analysis of plant genes and contribute to plant breeding.

15.1 Introduction

The development of plants with improved traits has long been required to address global problems such as food security and environmental issues. Traditionally, plant breeding has been performed by crossing and selection; however, these methods are labor-intensive and time-consuming. Genome editing is a newly developed technology that allows targeted modification of specific DNA sequences (Jinek et al., 2012; Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018). Generally, introduction of mutations into the target DNA sequence using genome editing technologies involves three common steps. First, an exogenous engineered nuclease consisting of a recognition module and nuclease domain recognizes the target DNA sequence. Then, the engineered nuclease binds to the target DNA sequence and induces double-strand breaks (DSBs) at, or in the vicinity of, the target site. These DSBs are then repaired by endogenous non-homologous end-joining (NHEJ) or homology-directed repair (HDR) pathways. While NHEJ is an error-prone repair process and often causes mutations, such as small insertions and deletions (indels), HDR leads to a precise repair of DSBs (Osakabe and Osakabe, 2015; Wang et al., 2016). Genome-editing technologies have now been applied successfully in various organisms, including plants (Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018).

*Corresponding author: kosakabe@tokushima-u.ac.jp
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0015

Fig. 15.1. Three genome editing technologies, ZFN, TALEN, and CRISPR-Cas9. These technologies induce DSBs in host genomes. DSBs are repaired by the NHEJ pathway or by the HDR pathway. The errors that occur during NHEJ-mediated DNA repair result in mutagenesis at the target site. HDR using a donor DNA as a template leads to precise mutagenesis at the target site.

Three major genome-editing technologies have been developed (Fig. 15.1) (Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018). The first of these, zinc-finger nuclease (ZFN), was developed as an engineered nuclease. Then, the more flexible transcription activator-like effector nuclease (TALEN) appeared. ZFNs and TALENs are composed of a sequence-specific DNA binding module and a FokI nuclease domain. The FokI nuclease domain requires dimerization to activate nuclease activity. Therefore, two modules should be designed to target closely located DNA sequences, allowing dimerization of FokI at the target DNA sequence. This requirement for dimerization provides specificity to ZFNs and TALENs. In practice, however, designing active ZFNs and TALENs is expensive and difficult (Osakabe et al., 2010; Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018). Finally, the clustered regularly interspaced short palindromic repeat (CRISPR)-CRISPR-associated protein 9 nuclease (Cas9) has emerged as a simpler and more flexible engineered nuclease (Jinek et al., 2012; Cong et al., 2013; Mali et al., 2013). Compared with ZFNs and TALENs, the CRISPR-Cas9 system is inexpensive and the experimental design is easy (Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018). The ease, simplicity, and high efficiency of the CRISPR-Cas9 system have facilitated its development into the most widely applied genome-editing tool. The CRISPR-Cas9 system consists of two major components: Cas9 protein and a guide

RNA (gRNA). Cas9 protein is an RNA-dependent DNA endonuclease that forms a complex with gRNA. The gRNA is a small RNA that includes 20 nt complementary to the target sequence and is required to recruit Cas9 protein to the target site. This is the most important difference between CRISPR-Cas9 and the other genome-editing technologies; that is, it relies on DNA–RNA interaction, instead of DNA–protein interaction, for recognition of the target DNA sequence. In the case of ZFNs and TALENs, which recognize specific sequences by DNA–protein interaction, design and expression of two different DNA-binding domains (500–700 amino acids in the case of TALENs) are required for each target site, making this process rather laborious. On the other hand, CRISPR-Cas9 recognizes specific sequences by DNA–RNA interaction; thus, design of an 18–20 bp oligonucleotide is all that is required. This characteristic makes it very easy to adapt the CRISPR-Cas9 system for genome editing applications. To function as a genome editing tool, the Cas9–gRNA complex must also recognize a specific protospacer adjacent motif (PAM), a short nucleotide sequence located immediately 3′ of the target sequence. In the case of Streptococcus pyogenes Cas9 (SpCas9) (the most commonly used Cas9 protein), 5′-NGG-3′ is recognized as the PAM. Recruitment of Cas9 induces DSBs at the target site in the host genome. In this chapter, we summarize recent approaches to targeted genome manipulation


using the CRISPR-Cas9 system. Targeted manipulation has also been performed without inducing DSBs using CRISPR-dCas9-based systems. We also introduce newly developed and discovered CRISPR-Cas systems as promising tools for plant genome engineering in the near future.
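The target-site rule described above for SpCas9 (a 20-nt sequence matched by the gRNA, immediately followed on its 3′ side by a 5′-NGG-3′ PAM) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_spcas9_targets` and its parameters are hypothetical, only the strand supplied is scanned, and practical guide design considers many further criteria.

```python
import re

def find_spcas9_targets(seq, protospacer_len=20):
    """List candidate SpCas9 sites on the given strand.

    Returns (start, protospacer, pam) tuples for every position where a
    protospacer of `protospacer_len` nt is immediately followed by an NGG PAM.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

if __name__ == "__main__":
    example = "ACGTACGTACGTACGTACGTAGGCC"  # made-up sequence
    for start, protospacer, pam in find_spcas9_targets(example):
        print(start, protospacer, pam)
```

To cover the opposite strand, the same function could be applied to the reverse complement of the input sequence.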

15.2 Genome Editing Using CRISPR-Cas9 in Plants: an Overview

The CRISPR-Cas9 system has been applied successfully in various plant species, including not only model plants such as Arabidopsis, but also crops such as rice, tobacco, sorghum, wheat, maize, soybean, tomato, potato, poplar, apple, and banana (Osakabe and Osakabe, 2015; Wang et al., 2016; Jaganathan et al., 2018; Wada et al., 2020). The induced mutations are inherited by the next generations, indicating that plant genome editing can be applied both to plant research and the production of useful plants.

An important advantage of using the CRISPR-Cas9 system is the ability to edit multiple target genes simultaneously (Hashimoto et al., 2018; Zsögön et al., 2018; Najera et al., 2019). For example, Zsögön et al. (2018) targeted six genes and introduced mutations into four genes simultaneously. Importantly, this study achieved the de novo domestication of wild tomato by targeting six loci important for key domestication traits. The results demonstrate that multiplex genome editing using CRISPR-Cas9 can be applied to mimic the domestication process during evolution in a short time-frame, with implications for the rapid production of new plants with desirable traits. Multiplex genome editing can also induce deletions with defined sizes between target sites (Hashimoto et al., 2018; Najera et al., 2019), which would be useful for the disruption of regulatory sequences and generation of knockout mutants whose gene functions were disrupted not by out-of-frame mutations but by the deletion of a defined region.

Gene targeting (GT) by CRISPR-Cas9 is another approach that can be used to engineer the plant genome precisely. GT can be achieved via the HDR pathway but in plant cells the efficiency of HDR is much lower than that of NHEJ. Several approaches have been reported


to enhance GT efficiency in plants, such as suppressing the NHEJ pathway (Endo et al., 2006; Qi et al., 2013b), amplifying donor DNAs using virus replicons (Baltes et al., 2014), and driving Cas9 expression using egg-specific promoters (Miki et al., 2018). Miki et al. (2018) reported a sequential transformation of Arabidopsis to enhance GT efficiency. The authors first generated parental lines expressing the Cas9 gene under the control of the egg cell- and early embryo-specific DD45 promoter derived from Arabidopsis. Then, using the floral-dip method, they introduced the donor DNA fragment and gRNA expression vectors into selected parental plants that showed high genome-editing activity. In this manner, they achieved heritable GT with 5–10% efficiency (according to the number of T2 Arabidopsis populations examined). However, GT efficiency needs to be improved further for precise genome editing.

Despite the various benefits of CRISPR-Cas9, an important concern is off-target effects, i.e., mutations induced at unintended sites by genome editing. Several methods, including SITE-seq (Cameron et al., 2017), Digenome-seq (Kim et al., 2015), CIRCLE-seq (Tsai et al., 2017), GUIDE-seq (Tsai et al., 2015), and DISCOVER-seq (Wienert et al., 2019), have been developed to detect off-target mutations in vitro and in vivo. Unbiased detection of off-target effects is preferable for evaluation of genome-edited outcomes. On the other hand, new types of mutations, which have not been addressed so far, have recently been reported. Kosicki et al. (2018) detected unexpected large deletions (up to 9.5 kb) that occurred as a result of Cas9-based genome editing in mammalian cells. Although such unexpected large deletions have not yet been reported in plants, the possibility of their occurrence should be taken into consideration.
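The experimental methods listed above detect off-target cleavage empirically. As a complement, candidate off-target sites are often shortlisted in silico by looking for genomic sequences that resemble the gRNA target and carry a PAM. The following Python sketch illustrates that idea under simplifying assumptions: the function names are hypothetical, only an NGG PAM on the supplied strand is considered, and the mismatch threshold of 3 is an arbitrary illustrative value rather than a recommendation from this chapter.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def candidate_off_targets(guide, genome, max_mismatches=3):
    """Scan one strand of `genome` for NGG-flanked sites that differ from
    `guide` by 1..max_mismatches bases (hypothetical helper; the on-target
    site, with 0 mismatches, is deliberately excluded)."""
    guide, genome = guide.upper(), genome.upper()
    n = len(guide)
    hits = []
    for i in range(len(genome) - n - 2):
        site = genome[i:i + n]
        pam = genome[i + n:i + n + 3]
        if pam[1:] != "GG":          # require an NGG PAM next to the site
            continue
        mismatches = count_mismatches(guide, site)
        if 0 < mismatches <= max_mismatches:
            hits.append((i, site, mismatches))
    return hits

if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"  # made-up guide
    genome = guide + "TGG" + "ACGTACGTACGTACGTACGA" + "AGG"
    print(candidate_off_targets(guide, genome))
```

A shortlist produced this way is only a prediction; it does not replace the unbiased experimental detection discussed in the text.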

15.3 Genome Manipulation Using a CRISPR-dCas9-based System Without DSB Induction

DSBs are key events in genome editing, but they run the risk of genome instability and unpredictable outcomes of DNA repair. Therefore, approaches to modify targeted DNA or gene expression without the need to induce DSBs have been explored. The key protein in this regard is a catalytically dead Cas9 variant (dCas9) that can bind to the target sequence but does not cleave the double-stranded DNA. The dCas9 protein can be fused to another effector protein that modifies either the genome or the epigenome without cleaving the double-stranded DNA (Qi et al., 2013a; Wang et al., 2016; Adli, 2018).

Fig. 15.2. Representative approaches for genome manipulation without DSB induction using the CRISPR-dCas9-based system. (A) Base editing. (B) Transcriptional control. (C) Prime editing. These approaches utilize dCas9 protein (A, B) or Cas9 nickase (C) fused to effector proteins. Each effector protein (and engineered gRNA) affects genome sequences or gene expression as shown in the figure.

dCas9 proteins fused to DNA deaminases have been developed as base editors that can modify the target DNA sequence without inducing DSBs (Fig. 15.2A) (Komor et al., 2016; Nishida et al., 2016; Gaudelli et al., 2017; Molla and Yang, 2019). Cytidine deaminase-incorporating DNA base editors (CBEs; Target-AID and BE) have been developed for nucleotide conversion from C to T, and an adenine deaminase-based DNA editor (ABE) has been developed for A to G conversion. Both types of base editors have been applied successfully to base editing in plants such as Arabidopsis, rice, wheat, maize, and tomato

(Gaudelli et al., 2017; Jaganathan et al., 2018; Hua et al., 2019b; Jin et al., 2019; Molla and Yang, 2019; Zhang et al., 2019).

Transcriptional activation and repression by CRISPR-dCas9 have been performed in Arabidopsis, tobacco, and rice by fusing dCas9 to a VP64 transcriptional activation domain (dCas9-VP64) or a repressor domain (dCas9-SRDX) (Fig. 15.2B) (Lowder et al., 2015; Piatek et al., 2015). Compared with a single gRNA, the use of multiple gRNAs enhanced gene activation and suppression (Lowder et al., 2015; Piatek et al., 2015). The engineering of fusion activators (e.g., dCas9-VPR) (Chavez et al., 2015) or recruitment of multiple activators in the dCas9-SAM (Konermann et al., 2015) or dCas9-SunTag (Konermann et al., 2015; Li et al., 2017; Selma et al., 2019) systems enhanced gene activation. Interestingly, the extent of gene activation was roughly negatively correlated with the basal expression levels of the target gene (Li et al., 2017). Selma et al. (2019) compared 43 SunTag,


SAM and scRNA-based transcription activation domain combinations in plant cells and found that the combination of dCas9-EDLL, gRNA2.1 and MS2-VPR gave the maximum gene activation. RNA-seq analysis indicated that significant transcriptional changes of unexpected genes did not occur, indicating the specificity of the system.

The dCas9 protein fused with an epigenetic modifier has been developed for the introduction of targeted epigenetic modifications to alter gene expression (Adli, 2018). As an example, Gallego-Bartolomé et al. (2018) applied a combination of the SunTag system, DNA demethylase TET1cd, and CRISPR-dCas9 (dCas9-SunTag-TET1) to upregulate the FLOWERING WAGENINGEN (FWA) gene, whose activation leads to a late-flowering phenotype in Arabidopsis. These authors achieved demethylation of the FWA gene promoter region, resulting in activation of FWA gene expression. The modified epigenetic status and late-flowering phenotype were stably inherited by subsequent generations even in the absence of the dCas9-SunTag-TET1 gene.
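The base conversions described in this section (C to T for CBEs, A to G for ABE) act on bases that fall inside a limited editing window of the target sequence. The Python sketch below illustrates the consequence: every editable base inside the window is a candidate for conversion, not only the intended one. The function name `base_edit` is hypothetical, the window of positions 4 to 8 (counted from the PAM-distal end) is an assumed illustrative value that is not taken from this chapter, and converting all in-window bases is a deliberate simplification of what is in reality a probabilistic outcome.

```python
def base_edit(protospacer, editor="CBE", window=(4, 8)):
    """Convert every editable base inside the editing window.

    `window` is 1-based and inclusive, counted from the PAM-distal end of the
    protospacer. Both the default window and this all-or-nothing behaviour are
    illustrative assumptions.
    """
    conversions = {"CBE": ("C", "T"), "ABE": ("A", "G")}
    source, product = conversions[editor]
    start, end = window
    bases = list(protospacer.upper())
    for pos in range(start, end + 1):
        if pos <= len(bases) and bases[pos - 1] == source:
            bases[pos - 1] = product
    return "".join(bases)

if __name__ == "__main__":
    target = "GACCCATGACGTTAGCATCG"  # made-up protospacer
    print(base_edit(target, "CBE"))  # both in-window cytosines are converted
    print(base_edit(target, "ABE"))  # the single in-window adenine is converted
```

In the made-up example, the cytosine at position 3 lies outside the assumed window and is left unchanged, whereas the two cytosines at positions 4 and 5 are both converted.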


15.4 Engineered Cas9 and Newly Discovered Cas Proteins for Plant Genome Editing

The applicability of the CRISPR-Cas9 system is limited by PAM specificity and off-target effects. In the case of SpCas9, sequences without an NGG motif cannot be selected as a target sequence. Therefore, several new strategies have been applied to broaden PAM compatibility and enhance specificity. These include rational SpCas9 engineering, the discovery and characterization of Cas9 orthologs, and the application of novel CRISPR-Cas systems from other sources (Table 15.1). The SpCas9 protein has been engineered extensively to broaden PAM compatibility or to enhance PAM specificity, while at the same time minimizing off-target effects (Karvelis et al., 2017). Kleinstiver et al. (2015) reported the production of SpCas9-VQR, SpCas9-EQR, and SpCas9-VRER, which recognize NGA-PAM, NGAG-PAM, and

Table 15.1. Examples of Cas9 variants and newly developed CRISPR-Cas tools.

| CRISPR-Cas | Classification | Target | Characteristics |
| --- | --- | --- | --- |
| CRISPR-Cas9 | Class 2 type II | DNA | SpCas9 recognizes NGG PAM. Induces DNA DSBs at target sites. |
| -Cas9-NG, SpG, SpRY | | DNA | Engineered Cas9 with broadened PAM compatibility: Cas9-NG (PAM: NG), SpG (PAM: NGN), SpRY (PAM: NRN, NYN) |
| -Cas9-HF | | DNA | Engineered Cas9 with high specificity |
| CRISPR-Cas12a (Cpf1) | Class 2 type V | DNA | Cas12a recognizes AT-rich PAM. Shows trans-cleavage activity on collateral ssDNA after recognition and cleavage of target DNA in vitro. |
| CRISPR-Cas13 (C2c2) | Class 2 type VI | RNA | Cas13 targets RNA. Shows trans-cleavage activity on collateral ssRNA after recognition and cleavage of target RNA in vitro. |
| CRISPR-Cas3 | Class 1 type I-E | DNA | CRISPR-Cas3 recognizes longer target sequences than Cas9. Cas3e induces long deletions toward 5' upstream of PAM sequence in human cells. |
| CRISPR-CasX (Cas12e) | Class 2 type V | DNA | CRISPR-CasX was isolated from uncultivated microbes. CasX is smaller than Cas9 and Cas12, and has a unique mechanism for DNA cleavage. |
| CRISPR-CasΦ (Cas12J) | Class 2 type V | DNA | CRISPR-CasΦ is the smallest CRISPR-Cas system so far. The size of CasΦ is about half that of Cas9. |
| CRISPR-Cas type I-D (TiD) | Class 1 type I-D | DNA | TiD recognizes longer target sequences than Cas9. Cas10d induces both small indels and bi-directional long-range deletions. |


NGCG-PAM, respectively. These engineered SpCas9 variants function in Arabidopsis and rice, although their activities were not as high as that of wild-type SpCas9 (X. Hu et al., 2016; X. Hu et al., 2018; Yamamoto et al., 2019). Nishimasu et al. (2018) developed SpCas9-NG, which recognizes NG-PAM, and applied it to genome editing in rice and Arabidopsis plants (Endo et al., 2019; Ge et al., 2019; Hua et al., 2019a; Niu et al., 2019; Zhong et al., 2019). Other engineered Cas proteins such as SpCas9-HF1 (Kleinstiver et al., 2016), eSpCas9 (Slaymaker et al., 2016), and HypaCas9 (Chen et al., 2017) have also been reported. SpCas9-HF1 and eSpCas9 have already been tested in rice and showed reduced off-target editing activities, suggesting that SpCas9-HF1 and eSpCas9 have high specificity in plant cells (Zhang et al., 2017). Directed evolution approaches have also been applied to produce three engineered SpCas9 proteins with high specificity: xCas9 (J.H. Hu et al., 2018), evoCas9 (Casini et al., 2018), and Sniper-Cas9 (Lee et al., 2018). xCas9 has expanded PAM preferences (NG, GAA, and GAT-PAM) and has been tested in Arabidopsis and rice (Endo et al., 2019; Ge et al., 2019; Hua et al., 2019a; Li et al., 2019; Niu et al., 2019; Wang et al., 2019; Zhong et al., 2019). Studies indicated that both xCas9 and Cas9-NG could cause mutations at some non-canonical PAMs in plants; however, their efficiency and specificity in plant cells appear to differ from those in mammalian cells. Hua et al. (2019a) reported that xCas9 could induce mutations effectively in rice, but other studies (Niu et al., 2019; Wang et al., 2019; Zhong et al., 2019) found that xCas9 showed much lower activity in rice calli than in mammalian cells (Wang et al., 2019), and failed to recognize NG PAM in tomato (Niu et al., 2019). xCas9 showed comparable activity to Cas9-WT at NGG PAM, and higher specificity than that of Cas9-WT, but genome editing activity at the NGH (H: A, T, or C) PAM in rice was not high. In contrast, Cas9-NG showed higher genome editing activity than xCas9 at almost all NG PAM sites in rice (Zhong et al., 2019). Hua et al. (2019a) also reported that Cas9-NG had high editing activity at several NG PAM sites tested (CGG, AGC, TGA, CGT). These studies indicate that Cas9-NG would be more suitable for genome editing at the NG PAM sites in plants. On the other hand, xCas9 would be suitable for use as a highly specific SpCas9 in plant cells. Recently, near-PAMless engineered Cas9 variants (SpG, SpRY) have been developed in human cells (Walton et al., 2020). Examination of the functionality of these new Cas9 variants has not yet been performed in plant cells.

Cas9 orthologs with different PAM preferences have been discovered from other bacteria (Murovec et al., 2017), e.g., NmCas9 from Neisseria meningitidis (Hou et al., 2013), SaCas9 from Staphylococcus aureus (Ran et al., 2015), StCas9 from Streptococcus thermophilus (Müller et al., 2016), FnCas9 from Francisella novicida (Hirano et al., 2016), and CjCas9 from Campylobacter jejuni (Kim et al., 2017). Genes encoding most of these proteins are smaller than that encoding SpCas9, which can be advantageous for gene delivery by viral vectors. FnCas9, StCas9, and SaCas9 have already been applied to genome editing in Arabidopsis and tobacco (Steinert et al., 2015; Zhang et al., 2018).

Finally, new types of Cas proteins other than Cas9, which belongs to the Class 2 type II CRISPR-Cas system, have been discovered in the past 5 years (Table 15.1). For example, Cas13 (C2c2) protein belongs to a type VI CRISPR-Cas system that recognizes RNA sequences and exhibits RNA-editing activity (Abudayyeh et al., 2016; Cox et al., 2017). The CRISPR-Cas13 system has been applied successfully to the targeted knockdown of endogenous genes in rice and Nicotiana benthamiana (Abudayyeh et al., 2016; Aman et al., 2018). Aman et al. (2018) applied the CRISPR-Cas13 system to engineer interference with an RNA virus in Arabidopsis, suggesting that the system can be used to engineer RNA-guided immunity against RNA viruses in plants. As another example, Yan et al. (2019) identified functional diversity in the Type V CRISPR-Cas system (Cas12c, Cas12g, Cas12h, and Cas12i). The functions of these Cas variants range from dsDNA nicking and cleavage activity to the collateral cleavage activity of ssRNA and ssDNA, suggesting that undiscovered Cas proteins with various functions still exist in nature.
In addition, new types of CRISPR-Cas systems have been identified from uncultivated microbes (Burstein et al., 2017). For example, CRISPR-Cas14, classified as a Type V CRISPR-Cas system, can target and cleave ssDNA independently of PAM sequences (Harrington et al., 2018). CasX and CasY (classified as Cas12e and Cas12d, respectively) have unique protein structures, distinct from those of known Cas proteins, and genome editing by CasX has been validated in Escherichia coli and human cells


(Liu et al., 2019). Recently, the smallest CRISPR-Cas system, CRISPR-CasΦ, was discovered in huge phages (Pausch et al., 2020). Although most of the CRISPR-Cas systems mentioned above have not yet been applied to plant genome editing, CRISPR-CasΦ has already been tested in plant cells, and was able to induce mutations in Arabidopsis (Pausch et al., 2020).

New genome editing technologies have also been developed from Class 1 type I CRISPR-Cas systems, the most abundant CRISPR-Cas systems in bacteria. CRISPR-Cas systems belonging to Class 1 type I include a multi-Cas subunit complex termed the "CRISPR-associated complex for antiviral defense" (Cascade). Cas3, a nuclease from the Thermobifida fusca and E. coli type I-E CRISPR-Cas systems (Dolan et al., 2019; Morisaka et al., 2019), induced large deletions, up to 100 kb, upstream of a target site in human cells. However, this technology has yet to be applied to plant genome editing.


Recently, we identified a new Class 1 type I CRISPR-Cas genome locus, named type I-D (TiD) (Osakabe et al., 2020). The TiD system consists of five Cas proteins (Cas3d, Cas5d, Cas6d, Cas7d, and Cas10d) and CRISPR RNA (crRNA) (Fig. 15.3A). We found that TiD recognizes a GTH (H: A, C, or T) PAM adjacent to the target sequence (35 or 36 bp). We also discovered that a unique effector protein, Cas10d, has single-stranded DNA nuclease activity in the TiD system. The expression of TiD Cas proteins and crRNA successfully induced mutagenesis at target sites in tomato. In addition, the mutagenesis profile was different from that of other CRISPR-Cas systems; both bi-directional long-range deletions and short indels were induced (Fig. 15.3B). Mutation analysis indicated that bi-allelic mutations were induced in the regenerated shoot, suggesting that the TiD system can be used to efficiently generate bi-allelic mutants in the first

Fig. 15.3. Schematic illustration of CRISPR-Cas type I-D (TiD) system. (A) TiD complex consisting of five Cas proteins (Cas3d, 5d, 6d, 7d, 10d) and crRNA. Target DNA including PAM (GTH) and 36 bp spacer are also shown in the figure. (B) Mutation patterns induced by the TiD system.


generation. The longer target sequence offers high specificity compared with the CRISPR-Cas9 system. Indeed, genome-wide analysis of rice and tomato genomes indicated that the number of potential off-target sites was smaller than that seen in the CRISPR-Cas9 system. The absence of off-target effects in tomato genomes was confirmed experimentally in TiD-edited samples, demonstrating that the TiD system represents a new genome editing tool with high specificity. The unique characteristics of the TiD system will be useful for creating gene knockouts by causing deletions. In addition, we induced mutagenesis successfully in human cells (Osakabe et al., 2021), suggesting that the TiD system can be useful for genome editing in many organisms. Further improvements to increase genome editing efficiency are ongoing. Collectively, the development of these new Cas proteins will expand the repertoire of plant genome-editing tools.
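The PAM preferences discussed in this section are written with IUPAC nucleotide codes (N, R, Y, H), and the practical question is usually which nucleases could use a given flanking sequence. The Python sketch below collects the PAMs named in this section and in Table 15.1 and checks a flanking sequence against them. It is an illustration under stated assumptions: the names `PAMS`, `pam_matches`, and `compatible_nucleases` are hypothetical, H is taken in its IUPAC sense (A, C, or T), and the sketch ignores on which side of the target the PAM lies and how efficiently each PAM is actually used.

```python
# IUPAC nucleotide codes needed for the PAMs named in this section.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "H": "ACT", "N": "ACGT",
}

# PAMs as listed in the text and Table 15.1 (efficiency is not modelled).
PAMS = {
    "SpCas9": ["NGG"],
    "SpCas9-VQR": ["NGA"],
    "SpCas9-EQR": ["NGAG"],
    "SpCas9-VRER": ["NGCG"],
    "Cas9-NG": ["NG"],
    "SpG": ["NGN"],
    "SpRY": ["NRN", "NYN"],
    "TiD": ["GTH"],
}

def pam_matches(pattern, flanking):
    """True if the first len(pattern) bases of `flanking` fit the IUPAC pattern."""
    flanking = flanking.upper()
    if len(flanking) < len(pattern):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, flanking))

def compatible_nucleases(flanking):
    """Names of nucleases whose listed PAM fits the flanking sequence."""
    return [name for name, patterns in PAMS.items()
            if any(pam_matches(p, flanking) for p in patterns)]

if __name__ == "__main__":
    for flank in ("TGG", "GTA", "AGCG"):
        print(flank, compatible_nucleases(flank))
```

A match in this sketch only means that the motif is formally compatible; as the studies cited above show, activity at a given PAM can differ considerably between variants and between plant and mammalian cells.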

15.5 Prime Editing

Recently, Anzalone et al. (2019) developed a new genome editing technology called prime editing in yeast and mammalian cells (Fig. 15.2C), achieving precise genome editing without inducing DSBs or requiring a donor DNA template (which is necessary for genome editing via HDR). Prime editing consists of two components: (i) Cas9 nickase fused to reverse transcriptase; and (ii) an engineered pegRNA (prime editing guide RNA), which consists of a primer binding site (PBS), the desired edited sequence, and a sequence that anneals to the target DNA sequence. First, the desired edited sequence added to the gRNA is reverse-transcribed into DNA by the reverse transcriptase fused to Cas9 nickase; the reverse-transcribed DNA is then incorporated into the nicked target site. Prime editing has achieved targeted insertions (up to 44 bp), deletions (up to 80 bp), and a variety of point mutations efficiently and precisely. Because DSBs are not required, fewer unintended mutations were generated than with Cas9-initiated HDR. In addition, prime editing enabled precise single-nucleotide replacement even in cases where multiple cytosines or adenines were present in the base-editing window, indicating the advantages of prime editing compared with base editing. Target scope is also expanded because prime editing, unlike base editing, is not limited by the need for a PAM sequence located at a suitable distance from the target nucleotides. Prime editing has been applied successfully to genome editing in plants (Lin et al., 2020). Prime editing is a promising technology for plant genome engineering, especially because it offers new strategies for knock-in of DNA fragments via an HDR-independent pathway.
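The relationship between the nicked target strand and the two pegRNA elements described above (the PBS and the template carrying the desired edit) can be illustrated with a short Python sketch. Several points are assumptions made for illustration and are not specified in this chapter: the function name `peg_extension`, the default PBS length of 13 nt, the use of DNA letters (T rather than U) for the RNA, and the convention that the template precedes the PBS when the extension is written 5′ to 3′. The sketch also leaves out the target-annealing (spacer) sequence and the gRNA scaffold.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def peg_extension(nicked_strand, nick_index, edited_downstream, pbs_len=13):
    """Return (rt_template, pbs), each written 5'->3' in DNA letters.

    nicked_strand     : strand that the Cas9 nickase cuts, written 5'->3'
    nick_index        : index of the first base on the 3' side of the nick
    edited_downstream : desired sequence (edit included) starting at the nick
    pbs_len           : PBS length; 13 is an assumed illustrative default
    """
    if nick_index < pbs_len:
        raise ValueError("not enough sequence upstream of the nick for the PBS")
    upstream = nicked_strand[nick_index - pbs_len:nick_index]
    pbs = revcomp(upstream)                   # anneals to the nicked 3' end
    rt_template = revcomp(edited_downstream)  # copied into DNA carrying the edit
    return rt_template, pbs

if __name__ == "__main__":
    strand = "AAAACCCCGGGGTTTTACGTACGT"  # made-up sequence
    rt_template, pbs = peg_extension(strand, 16, "ACCT", pbs_len=4)
    print(rt_template + pbs)  # extension under the assumed ordering
```

Under these assumptions the extension is simply the reverse complement of the edited strand spanning the PBS region and the edited region, which is why reverse transcription primed from the nicked 3′ end yields the edited sequence.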

15.6 Conclusions

The emergence of genome-editing technologies has revolutionized plant genome engineering. In particular, the CRISPR-Cas9 system has accelerated the speed of plant breeding by providing easy, efficient, and precise approaches to genome editing. In addition, CRISPR-Cas9 is no longer just scissors for cleaving genomic DNA; it can also change one nucleotide into another and modify transcriptional states and epigenetic environments at the target site. The toolbox continues to be updated with newly developed CRISPR-Cas systems, Cas proteins, effector modules, etc. For example, prime editing is a promising technology to achieve precise editing of plant genomes without relying on the inefficient HDR pathway. We have also developed the new and unique TiD genome editing system, which can induce both small indels and large deletions in tomato cells. The controlled induction of large deletions will represent a new strategy for gene knockout, deletion of gene clusters, and induction of chromosomal deletions. These technologies will facilitate the analysis of gene function and the generation of genome-edited plants with desired genotypes in a short time. Plant genome editing can also be used to generate null segregants, which carry genome-edited alleles but no transgenes, thereby alleviating concerns about genome-edited plants. The CRISPR toolbox for plant engineering will continue to expand in the near future, providing new approaches to achieve precise genome editing for functional analysis of plant genes and plant breeding.


References

Abudayyeh, O.O., Gootenberg, J.S., Konermann, S., Joung, J., Slaymaker, I.M. et al. (2016) C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector. Science 353(6299), aaf5573. DOI: 10.1126/science.aaf5573.
Adli, M. (2018) The CRISPR tool kit for genome editing and beyond. Nature Communications 9(1), 1911. DOI: 10.1038/s41467-018-04252-2.
Aman, R., Ali, Z., Butt, H., Mahas, A., Aljedaani, F. et al. (2018) RNA virus interference via CRISPR/Cas13a system in plants. Genome Biology 19(1), 1. DOI: 10.1186/s13059-017-1381-1.
Anzalone, A.V., Randolph, P.B., Davis, J.R., Sousa, A.A., Koblan, L.W. et al. (2019) Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576(7785), 149–157. DOI: 10.1038/s41586-019-1711-4.
Baltes, N.J., Gil-Humanes, J., Cermak, T., Atkins, P.A. and Voytas, D.F. (2014) DNA replicons for plant genome engineering. The Plant Cell 26(1), 151–163. DOI: 10.1105/tpc.113.119792.
Burstein, D., Harrington, L.B., Strutt, S.C., Probst, A.J., Anantharaman, K. et al. (2017) New CRISPR-Cas systems from uncultivated microbes. Nature 542(7640), 237–241. DOI: 10.1038/nature21059.
Cameron, P., Fuller, C.K., Donohoue, P.D., Jones, B.N., Thompson, M.S. et al. (2017) Mapping the genomic landscape of CRISPR-Cas9 cleavage. Nature Methods 14(6), 600–606. DOI: 10.1038/nmeth.4284.
Casini, A., Olivieri, M., Petris, G., Montagna, C., Reginato, G. et al. (2018) A highly specific SpCas9 variant is identified by in vivo screening in yeast. Nature Biotechnology 36(3), 265–271. DOI: 10.1038/nbt.4066.
Chavez, A., Scheiman, J., Vora, S., Pruitt, B.W., Tuttle, M. et al. (2015) Highly efficient Cas9-mediated transcriptional programming. Nature Methods 12(4), 326–328. DOI: 10.1038/nmeth.3312.
Chen, J.S., Dagdas, Y.S., Kleinstiver, B.P., Welch, M.M., Sousa, A.A. et al. (2017) Enhanced proofreading governs CRISPR-Cas9 targeting accuracy. Nature 550(7676), 407–410. DOI: 10.1038/nature24268.
Cong, L., Ran, F.A., Cox, D., Lin, S., Barretto, R. et al. (2013) Multiplex genome engineering using CRISPR/Cas systems. Science 339(6121), 819–823.
Cox, D.B.T., Gootenberg, J.S., Abudayyeh, O.O., Franklin, B., Kellner, M.J. et al. (2017) RNA editing with CRISPR-Cas13. Science 358(6366), 1019–1027.
Dolan, A.E., Hou, Z., Xiao, Y., Gramelspacher, M.J., Heo, J. et al. (2019) Introducing a spectrum of long-range genomic deletions in human embryonic stem cells using Type I CRISPR-Cas. Molecular Cell 74(5), 936–950.e5.
Endo, M., Ishikawa, Y., Osakabe, K., Nakayama, S., Kaya, H. et al. (2006) Increased frequency of homologous recombination and T-DNA integration in Arabidopsis CAF-1 mutants. The EMBO Journal 25(23), 5579–5590.
Endo, M., Mikami, M., Endo, A., Kaya, H., Itoh, T. et al. (2019) Genome editing in plants by engineered CRISPR-Cas9 recognizing NG PAM. Nature Plants 5(1), 14–17.
Gallego-Bartolomé, J., Gardiner, J., Liu, W., Papikian, A., Ghoshal, B. et al. (2018) Targeted DNA demethylation of the Arabidopsis genome using the human TET1 catalytic domain. Proceedings of the National Academy of Sciences 115(9), E2125–E2134. DOI: 10.1073/pnas.1716945115.
Gaudelli, N.M., Komor, A.C., Rees, H.A., Packer, M.S., Badran, A.H. et al. (2017) Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature 551(7681), 464–471. DOI: 10.1038/nature24644.
Ge, Z., Zheng, L., Zhao, Y., Jiang, J., Zhang, E.J. et al. (2019) Engineered xCas9 and SpCas9-NG variants broaden PAM recognition sites to generate mutations in Arabidopsis plants. Plant Biotechnology Journal 17(10), 1865–1867. DOI: 10.1111/pbi.13148.
Harrington, L.B., Burstein, D., Chen, J.S., Paez-Espino, D., Ma, E. et al. (2018) Programmed DNA destruction by miniature CRISPR-Cas14 enzymes. Science 362(6416), 839–842. DOI: 10.1126/science.aav4294.
Hashimoto, R., Ueta, R., Abe, C., Osakabe, Y. and Osakabe, K. (2018) Efficient multiplex genome editing induces precise, and self-ligated type mutations in tomato plants. Frontiers in Plant Science 9, 916. DOI: 10.3389/fpls.2018.00916.
Hirano, H., Gootenberg, J.S., Horii, T., Abudayyeh, O.O., Kimura, M. et al. (2016) Structure and engineering of Francisella novicida Cas9. Cell 164(5), 950–961.

214

N. Wada, Y. Osakabe and K. Osakabe

Hou, Z., Zhang, Y., Propson, N.E., Howden, S.E., Chu, L.F. et al. (2013) Efficient genome engineering in human pluripotent stem cells using cas9 from neisseria meningitidis. Proceedings of the National Academy of Sciences 110(39), 15644–15649. Hu, J.H., Miller, S.M., Geurts, M.H., Tang, W., Chen, L. et al. (2018) Evolved cas9 variants with broad PAM compatibility and high DNA specificity. Nature 556(7699), 57–63. Hu, X., Meng, X., Liu, Q., Li, J. and Wang, K. (2018) Increasing the efficiency of CRISPR-cas9-VQR precise genome editing in rice. Plant Biotechnology Journal 16(1), 292–297. Hu, X., Wang, C., Fu, Y., Liu, Q., Jiao, X. et al. (2016) Expanding the range of CRISPR/cas9 genome editing in rice. Molecular Plant 9(6), 943–945. Hua, K., Tao, X., Han, P., Wang, R. and Zhu, J.K. (2019a) Genome engineering in rice using Cas9 variants that recognize NG PAM sequences. Molecular Plant 12(7), 1003–1014. DOI: 10.1016/j. molp.2019.03.009. Hua, K., Tao, X. and Zhu, J.K. (2019b) Expanding the base editing scope in rice by using Cas9 variants. Plant Biotechnology Journal 17(2), 499–504. DOI: 10.1111/pbi.12993. Jaganathan, D., Ramasamy, K., Sellamuthu, G., Jayabalan, S. and Venkataraman, G. (2018) CRISPR for crop improvement: an update review. Frontiers in Plant Science 9, 985. DOI: 10.3389/ fpls.2018.00985. Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J.A. et al. (2012) A programmable dual-RNA- guided DNA endonuclease in adaptive bacterial immunity. Science 337(6096), 816–821. DOI: 10.1126/science.1225829. Jin, S., Zong, Y., Gao, Q., Zhu, Z., Wang, Y. et al. (2019) Cytosine, but not adenine, base editors induce genome-wide off-target mutations in rice. Science 364(6437), 292–295. DOI: 10.1126/science. aaw7166. Karvelis, T., Gasiunas, G. and Siksnys, V. (2017) Harnessing the natural diversity and in vitro evolution of Cas9 to expand the genome editing toolbox. Current Opinion in Microbiology 37, 88–94. Kim, D., Bae, S., Park, J., Kim, E., Kim, S. et al. 
(2015) Digenome-seq: genome-wide profiling of CRISPR- Cas9 off-target effects in human cells. Nature Methods 12(3), 237–243. Kim, E., Koo, T., Park, S.W., Kim, D., Kim, K. et al. (2017) In vivo genome editing with a small Cas9 orthologue derived from Campylobacter jejuni. Nature Communications 8, 4500. Kleinstiver, B.P., Pattanayak, V., Prew, M.S., Tsai, S.Q., Nguyen, N.T. et al. (2016) High-fidelity CRISPR- cas9 nucleases with no detectable genome-wide off-target effects. Nature 529(7587), 490–495. Kleinstiver, B.P., Prew, M.S., Tsai, S.Q., Topkar, V.V., Nguyen, N.T. et al. (2015) Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 523(7561), 481–485. Komor, A.C., Kim, Y.B., Packer, M.S., Zuris, J.A. and Liu, D.R. (2016) Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533(7603), 420–424. DOI: 10.1038/nature17946. Konermann, S., Brigham, M.D., Trevino, A.E., Joung, J., Abudayyeh, O.O. et al. (2015) Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature 517(7536), 583–588. DOI: 10.1038/nature14136. Kosicki, M., Tomberg, K. and Bradley, A. (2018) Repair of double-strand breaks induced by CRISPR-Cas9 leads to large deletions and complex rearrangements. Nature Biotechnology 36(8), 765–771. DOI: 10.1038/nbt.4192. Lee, J.K., Jeong, E., Lee, J., Jung, M., Shin, E. et al. (2018) Directed evolution of CRISPR-Cas9 to increase its specificity. Nature Communications 9(1), 3048. DOI: 10.1038/s41467-018-05477-x. Li, J., Luo, J., Xu, M., Li, S., Zhang, J. et al. (2019) Plant genome editing using xCas9 with expanded PAM compatibility. Journal of Genetics and Genomics = Yi Chuan Xue Bao 46(5), 277–280. DOI: 10.1016/j.jgg.2019.03.004. Li, Z., Zhang, D., Xiong, X., Yan, B., Xie, W. et al. (2017) A potent Cas9-derived gene activator for plant and mammalian cells. Nature Plants 3(12), 930–936. DOI: 10.1038/s41477-017-0046-0. Lin, Q., Zong, Y., Xue, C., Wang, S., Jin, S. et al. 
(2020) Prime genome editing in rice and wheat. Nature Biotechnology 38(5), 582–585. DOI: 10.1038/s41587-020-0455-x. Liu, J.-J., Orlova, N., Oakes, B.L., Ma, E., Spinner, H.B. et al. (2019) CasX enzymes comprise a distinct family of RNA- guided genome editors. Nature 566(7743), 218–223. DOI: 10.1038/ s41586-019-0908-x. Lowder, L.G., Zhang, D., Baltes, N.J., Paul, J.W., Tang, X. et al. (2015) A CRISPR/Cas9 toolbox for multiplexed plant genome editing and transcriptional regulation. Plant Physiology 169(2), 971–985. DOI: 10.1104/pp.15.00636.

Plant Genome Editing

215

Mali, P., Yang, L., Esvelt, K.M., Aach, J., Guell, M. et al. (2013) RNA-guided human genome engineering via Cas9. Science 339(6121), 823–826. DOI: 10.1126/science.1232033. Miki, D., Zhang, W., Zeng, W., Feng, Z. and Zhu, J.K. (2018) CRISPR/Cas9-mediated gene targeting in Arabidopsis using sequential transformation. Nature Communications 9(1), 1967. DOI: 10.1038/ s41467-018-04416-0. Molla, K.A. and Yang, Y. (2019) CRISPR/Cas- mediated base editing: technical considerations and practical applications. Trends in Biotechnology 37(10), 1121–1142. DOI: 10.1016/j. tibtech.2019.03.008. Morisaka, H., Yoshimi, K., Okuzaki, Y., Gee, P., Kunihiro, Y. et al. (2019) CRISPR-Cas3 induces broad and unidirectional genome editing in human cells. Nature Communications 10(1), 5302. DOI: 10.1038/ s41467-019-13226-x. Müller, M., Lee, C.M., Gasiunas, G., Davis, T.H., Cradick, T.J. et al. (2016) Streptococcus thermophilus CRISPR-Cas9 systems enable specific editing of the human genome. Molecular Therapy 24(3), 636–644. DOI: 10.1038/mt.2015.218. Murovec, J., Pirc, Ž. and Yang, B. (2017) New variants of CRISPR RNA-guided genome editing enzymes. Plant Biotechnology Journal 15(8), 917–926. DOI: 10.1111/pbi.12736. Najera, V.A., Twyman, R.M., Christou, P. and Zhu, C. (2019) Applications of multiplex genome editing in higher plants. Current Opinion in Biotechnology 59, 93–102. DOI: 10.1016/j.copbio.2019.02.015. Nishida, K., Arazoe, T., Yachie, N., Banno, S., Kakimoto, M. et al. (2016) Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353(6305), aaf8729. DOI: 10.1126/science.aaf8729. Nishimasu, H., Shi, X., Ishiguro, S., Gao, L., Hirano, S. et al. (2018) Engineered CRISPR-Cas9 nuclease with expanded targeting space. Science 361(6408), 1259–1262. DOI: 10.1126/science.aas9129. Niu, Q., Wu, S., Li, Y., Yang, X., Liu, P. et al. (2019) Expanding the scope of CRISPR/Cas9-mediated genome editing in plants using an xCas9 and Cas9-NG hybrid. 
Journal of Integrative Plant Biology 62(4), 398–402. DOI: 10.1111/jipb.12886. Osakabe, Y. and Osakabe, K. (2015) Genome editing with engineered nucleases in plants. Plant & Cell Physiology 56(3), 389–400. DOI: 10.1093/pcp/pcu170. Osakabe, K., Osakabe, Y. and Toki, S. (2010) Site-directed mutagenesis in Arabidopsis using custom- designed zinc finger nucleases. Proceedings of the National Academy of Sciences 107(26), 12034–12039. DOI: 10.1073/pnas.1000234107. Osakabe, K., Wada, N., Miyaji, T., Murakami, E., Marui, K. et al. (2020) Genome editing in plants using CRISPR type I-D nuclease. Communications Biology 3(1), 648. DOI: 10.1038/s42003-020-01366-6. Osakabe, K., Wada, N., Murakami, E., Miyashita, N. and Osakabe, Y. (2021) Genome editing in mammalian cells using the CRISPR type I-D nuclease. Nucleic Acids Research 49(11), 6347–6363. DOI: 10.1093/nar/gkab348. Pausch, P., Al-Shayeb, B., Bisom-Rapp, E., Tsuchida, C.A., Li, Z. et al. (2020) CRISPR-CasΦ from huge phages is a hypercompact genome editor. Science 369(6501), 333–337. DOI: 10.1126/science. abb1400. Piatek, A., Ali, Z., Baazim, H., Li, L., Abulfaraj, A. et al. (2015) RNA-guided transcriptional regulation in planta via synthetic dCas9-based transcription factors. Plant Biotechnology Journal 13(4), 578–589. DOI: 10.1111/pbi.12284. Qi, L.S., Larson, M.H., Gilbert, L.A., Doudna, J.A., Weissman, J.S. et al. (2013a) Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152(5), 1173–1183. DOI: 10.1016/j.cell.2013.02.022. Qi, Y., Zhang, Y., Zhang, F., Baller, J.A., Cleland, S.C. et al. (2013b) Increasing frequencies of site-specific mutagenesis and gene targeting in Arabidopsis by manipulating DNA repair pathways. Genome Research 23(3), 547–554. DOI: 10.1101/gr.145557.112. Ran, F.A., Cong, L., Yan, W.X., Scott, D.A., Gootenberg, J.S. et al. (2015) In vivo genome editing using Staphylococcus aureus Cas9. Nature 520(7546), 186–191. DOI: 10.1038/nature14299. 
Selma, S., Bernabé-Orts, J.M., Vazquez-Vilar, M., Diego-Martin, B., Ajenjo, M. et al. (2019) Strong gene activation in plants with genome-wide specificity using a new orthogonal CRISPR/Cas9-based programmable transcriptional activator. Plant Biotechnology Journal 17(9), 1703–1705. DOI: 10.1111/ pbi.13138. Slaymaker, I.M., Gao, L., Zetsche, B., Scott, D.A., Yan, W.X. et al. (2016) Rationally engineered Cas9 nucleases with improved specificity. Science 351(6268), 84–88. DOI: 10.1126/science.aad5227.

216

N. Wada, Y. Osakabe and K. Osakabe

Steinert, J., Schiml, S., Fauser, F. and Puchta, H. (2015) Highly efficient heritable plant genome engineering using Cas9 orthologues from Streptococcus thermophilus and Staphylococcus aureus. The Plant Journal 84(6), 1295–1305. DOI: 10.1111/tpj.13078. Tsai, S.Q., Zheng, Z., Nguyen, N.T., Liebers, M., Topkar, V.V. et al. (2015) GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nature Biotechnology 33(2), 187–197. DOI: 10.1038/nbt.3117. Tsai, S.Q., Nguyen, N.T., Malagon-Lopez, J., Topkar, V.V., Aryee, M.J. et al. (2017) CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets. Nature Methods 14(6), 607–614. DOI: 10.1038/nmeth.4278. Wada, N., Ueta, R., Osakabe, Y. and Osakabe, K. (2020) Precision genome editing in plants: state-of- the-art in CRISPR/Cas9-based genome engineering. BMC Plant Biology 20(1), 234. DOI: 10.1186/ s12870-020-02385-5. Walton, R.T., Christie, K.A., Whittaker, M.N. and Kleinstiver, B.P. (2020) Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368(6488), 290–296. DOI: 10.1126/ science.aba8853. Wang, H., La Russa, M. and Qi, L.S. (2016) CRISPR/Cas9 in genome editing and beyond. Annual Review of Biochemistry 85, 227–264. DOI: 10.1146/annurev-biochem-060815-014607. Wang, J., Meng, X., Hu, X., Sun, T., Li, J. et al. (2019) xCas9 expands the scope of genome editing with reduced efficiency in rice. Plant Biotechnology Journal 17(4), 709–711. DOI: 10.1111/pbi.13053. Wienert, B., Wyman, S.K., Richardson, C.D., Yeh, C.D., Akcakaya, P. et al. (2019) Unbiased detection of CRISPR off-targets in vivo using DISCOVER-Seq. Science 364(6437), 286–289. DOI: 10.1126/ science.aav9023. Yamamoto, A., Ishida, T., Yoshimura, M., Kimura, Y. and Sawa, S. (2019) Developing heritable mutations in Arabidopsis thaliana using a modified CRISPR/Cas9 toolkit comprising PAM-altered Cas9 variants and gRNAs. Plant & Cell Physiology 60(10), 2255–2262. 
DOI: 10.1093/pcp/pcz118. Yan, W.X., Hunnewell, P., Alfonse, L.E., Carte, J.M., Keston-Smith, E. et al. (2019) Functionally diverse type V CRISPR-Cas systems. Science 363(6422), 88–91. DOI: 10.1126/science.aav7271. Zhang, D., Zhang, H., Li, T., Chen, K., Qiu, J.-L. et al. (2017) Perfectly matched 20-nucleotide guide RNA sequences enable robust genome editing using high-fidelity SpCas9 nucleases. Genome Biology 18(1), 191. DOI: 10.1186/s13059-017-1325-9. Zhang, R., Liu, J., Chai, Z., Chen, S., Bai, Y. et al. (2019) Generation of herbicide tolerance traits and a new selectable marker in wheat using base editing. Nature Plants 5(5), 480–485. DOI: 10.1038/ s41477-019-0405-0. Zhang, T., Zheng, Q., Yi, X., An, H., Zhao, Y. et al. (2018) Establishing RNA virus resistance in plants by harnessing CRISPR immune system. Plant Biotechnology Journal 16(8), 1415–1423. DOI: 10.1111/ pbi.12881. Zhong, Z., Sretenovic, S., Ren, Q., Yang, L., Bao, Y. et al. (2019) Improving plant genome editing with high-fidelity xCas9 and non-canonical PAM-targeting Cas9-NG. Molecular Plant 12(7), 1027–1036. DOI: 10.1016/j.molp.2019.03.011. Zsögön, A., Čermák, T., Naves, E.R., Notini, M.M., Edel, K.H. et al. (2018) De novo domestication of wild tomato using genome editing. Nature Biotechnology 36, 1211–1216. DOI: 10.1038/nbt.4272.

16 Introduction of Deep Learning Approaches in Plant Omics Research

Eli Kaminuma* Graduate School of Design and Architecture, Nagoya City University, Nagoya, Japan

Abstract
In recent years, opportunities to apply artificial intelligence to experimental research data have been growing. Artificial intelligence (AI) is the field of computer science concerned with developing software algorithms that mimic human intellectual behavior. Machine learning is a subset of AI in which models learn numerically from data, and deep learning is in turn a subset of machine learning. Deep learning comprises many algorithms, and selecting among them is a key problem when applying deep learning to plant omics. The best algorithm depends on the task design and the input modality, and plant omics itself spans multiple layers, including genomics, transcriptomics, proteomics, metabolomics, and phenomics. This chapter introduces the major deep learning approaches together with their applications in life science, particularly in plant omics.

16.1 Introduction
The foundational model of deep learning is the Artificial Neural Network (ANN), a stack of layers of artificial neurons inspired by human nerve cells. When many layers of artificial neurons are stacked, the ANN is called a Deep Neural Network (DNN) (LeCun and Ranzato, 2013). In 2012, Professor Hinton's group at the University of Toronto proposed an eight-layer DNN for the image classification task at a machine learning competition, and it achieved an accuracy well above that of conventional algorithms (for details, see Chapter 17). In 2015, a DNN model for image classification proposed by Microsoft Research Asia exceeded the accuracy of human visual classification. DNNs are now used for many tasks beyond image recognition, and their applications have extended to multiple domains in life science, including bioinformatics (Min et al., 2017).

Deep learning (and machine learning in general) techniques can be broadly categorized into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning algorithms require datasets with correct answer labels, whereas unsupervised learning algorithms do not. Reinforcement learning algorithms define a reward function and optimize actions to obtain higher rewards through repeated trial-and-error interactions with a dynamic environment.

*kaminuma@sda.nagoya-cu.ac.jp
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0016

16.2 Supervised Learning
A supervised learning algorithm has two stages: (i) the model training phase, which uses correct answer labels; and (ii) the inference phase, which carries out the actual task on new input data. The main tasks in supervised learning are classification and regression, as described in the following sections.
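The two stages can be illustrated with a deliberately simple model. The sketch below is not a deep learning model: it uses a nearest-centroid classifier as a stand-in, and the leaf measurements and the class names `healthy` and `diseased` are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def train(samples, labels):
    """Training phase: learn one centroid per class from labelled data."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Inference phase: assign the class whose centroid is nearest to x."""
    def sq_dist(centroid):
        return sum((ci - xi) ** 2 for ci, xi in zip(centroid, x))
    return min(model, key=lambda y: sq_dist(model[y]))


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label equals the correct label."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Hypothetical toy data: two measurements (length, width) per leaf.
train_x = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["healthy", "healthy", "diseased", "diseased"]

model = train(train_x, train_y)          # (i) training with answer labels
print(predict(model, [1.1, 1.1]))        # (ii) inference on unseen input
print(accuracy(model, train_x, train_y))
```

In practice the accuracy would be measured on held-out data rather than on the training samples used here.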

16.2.1 Classification task: CNN
In the model training phase of a classification task, correct answer labels for the respective classes are required. The accuracy of the model is evaluated by comparing its classified output with the correct answer labels. Convolutional Neural Networks (CNNs) (LeCun et al., 2015) are the most commonly used deep learning models for image classification. The network structure of a CNN is characterized by convolutional layers and pooling layers. The convolutional layers perform mathematical convolution operations, which, as in digital image processing, extract features of the target image such as the local contours of an object. The pooling layers then reduce the spatial dimensions of the resulting feature maps. The details of CNN models are explained in Chapter 17.

In plant omics, many deep learning-based studies of plant phenomics have been published (Kaminuma et al., 2020), and these studies have raised a few challenges. The first challenge is how to select the color space of the input images for CNN models. Researchers often pre-process images by transforming RGB color images into grayscale or black-and-white images in order to extract the features of an object. Deep learning models, however, do not require such manual feature extraction. For the diagnosis of plant disease, Mohanty et al. (2016) explored how model accuracy varies with the color space of the input leaf images. They used the PlantVillage dataset (Hughes and Salathé, 2015), which includes 54,305 leaf images in 38 classes of plant disease, such as grape black measles, across 14 crop species. The same CNN models were applied to leaf images in three color-space conditions: Color, Grayscale, and Color Segmented (RGB color leaf segmented from a black background). The classification accuracies of the three conditions ranked as follows: Color > Color Segmented >> Grayscale. The report therefore suggested using raw RGB images, without any image transformation, as deep learning input.

The second challenge is how to select the plant body parts shown in the input images for CNN models. In information science research, a multimedia retrieval challenge named ImageCLEF has been held every year since 2003. LifeCLEF, a subsection of ImageCLEF started in 2014, contains BirdCLEF, PlantCLEF, and others. The PlantCLEF challenge focuses on plant species classification. In a PlantCLEF publication in 2017 (Goëau et al., 2017), 256,287 plant images from 10,000 plant species were classified. The accuracy of the CNN model ranked by plant body part as follows: Flower > Entire Body > Branch > Fruit > Leaf. The study indicated that the flower is the most distinguishable part for image-based plant species identification using deep learning. Interestingly, this ranking appears to be consistent with human intuition.
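The convolution and pooling operations described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture of any cited study: the 5x5 image and the hand-written edge kernel are hypothetical, and a trained CNN would learn its kernel weights from data.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution as used in CNN layers.

    Like most deep learning libraries, this slides the kernel without
    flipping it (strictly speaking, a cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value per size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 5x5 single-channel patch with a vertical edge after column 1.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge kernel (an assumption for illustration).
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)  # 4x4 feature map, large where the edge is
pooled = max_pool2d(features)     # 2x2 map after spatial dimension reduction
print(features)
print(pooled)
```

The feature map responds only at the column where the intensity changes, and pooling halves each spatial dimension while keeping that response.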

16.2.2 Regression task: RNN, LSTM
In a regression task the output and the correct answer labels are numerical variables, whereas in a classification task they are categorical variables. Popular deep learning models for regression on sequential data are Recurrent Neural Networks (RNNs), whose network structure is characterized by recurrent neurons that feed their output back into the network. Among RNN models, the Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) is widely used for time series data. The LSTM model retains information over long periods of time (long-term dependencies), whereas the conventional RNN model had difficulty retaining old information.

In plant research on time series data, crop yield modeling, which forecasts crop production and disease outbreaks from meteorological time series, has a long history. A review article on crop yield prediction using machine learning has been published (van Klompenburg et al., 2020). Xiao et al. (2019) used an LSTM model to predict the occurrence of cotton diseases (ten types) at six Indian sites from eight weather factors, such as temperature and wind speed, with an area under the curve (AUC) of 0.97. That research collected meteorological data and crop disease assessment data at specified locations. In contrast, de Freitas Cunha and Silva (2020) integrated multiple information sources (satellite remote sensing data, geospatial soil type, and farm management) for crop yield prediction.
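How an LSTM retains old information can be seen from a single cell update. The following is a minimal sketch of the standard LSTM gate equations for a scalar input and scalar state; the parameter values and the weather readings are hypothetical and untrained, chosen only so that the forget gate stays close to 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # cell state carries long-term memory
    h = o * math.tanh(c)              # hidden state is the output of the step
    return h, c


# Hypothetical, untrained parameters: a large forget-gate bias and a closed
# input gate keep the cell state almost unchanged across time steps.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a stored memory of 1.0
for reading in [0.3, -0.2, 0.5, 0.1]:  # made-up normalized weather values
    h, c = lstm_step(reading, h, c, params)
print(h, c)
```

With these settings the stored value survives all four steps almost intact, which is the long-term dependency behavior that a conventional RNN struggles to learn.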

16.3 Unsupervised Learning
This section covers two strategies in unsupervised deep learning: data generation and data reduction. It introduces the generation task for the former and the dimensionality reduction task for the latter.

16.3.1 Generation task: GAN
A generative adversarial network (GAN) is a deep learning algorithm that generates images by making two models compete: a discriminative model and a generative model (Goodfellow et al., 2014). GANs are well known as the technology behind deepfake programs that create fake videos and images (Suwajanakorn et al., 2017). GANs are also used to increase the number of images when training data are insufficient. For example, in a study on classifying diseased citrus leaves (Sun et al., 2020), model accuracy was 0.978 with GAN-based data generation compared with 0.955 without it.
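The competition between the two models is usually written as a minimax objective. The formulation below follows Goodfellow et al. (2014); the symbols are not defined in this chapter and are introduced here for illustration: $D$ is the discriminative model, $G$ the generative model, $p_{\mathrm{data}}$ the distribution of real images, and $p_{z}$ the noise distribution from which the generator draws its input.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator is trained to raise $V$ by telling real images from generated ones, while the generator is trained to lower it by producing images that the discriminator accepts as real.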

16.3.2 Dimensionality reduction task: AE, word2vec
An Auto-Encoder (AE) is a deep learning algorithm that connects an encoder to a decoder and trains the resulting neural network so that its output reproduces its input. One use of AE models is dimensionality reduction, achieved by giving the middle layer fewer neurons than the input. This dimensionality-reduced middle layer is called the latent space. A Variational Auto-Encoder (VAE) is an AE model whose latent variables are represented by probability distributions. Hu and Greene (2019) proposed a dimensionality reduction method for single-cell RNA transcriptomics that uses a VAE instead of conventional algorithms such as PCA and t-SNE.

Furthermore, text embedding algorithms from the field of Natural Language Processing (NLP), such as word2vec (Mikolov et al., 2013) for word embeddings and doc2vec (Le and Mikolov, 2014) for paragraph embeddings, have been adopted in other domains. In life science, dna2vec (Ng, 2017) and gene2vec (Du et al., 2019) have been published. The word embedding algorithm is not structured as an Auto-Encoder; therefore, the word2vec output is not equivalent to its input. The reduced space produced by this NLP-based approach is called a distributed representation space. A graph embedding algorithm (Zhong and Rajapakse, 2020) has also been proposed for protein–protein interaction prediction.
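The idea of a narrow middle layer can be sketched with the smallest possible Auto-Encoder. The example below is an assumption-laden toy, not the method of any cited study: it is linear, shares one weight vector between encoder and decoder (tied weights), and compresses hypothetical two-dimensional data into a one-dimensional latent space.

```python
def encode(w, x):
    """Encoder: project the input onto a one-dimensional latent variable."""
    return sum(wi * xi for wi, xi in zip(w, x))


def decode(w, z):
    """Decoder: map the latent variable back to input space (tied weights)."""
    return [z * wi for wi in w]


def reconstruction_loss(w, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(w, encode(w, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train(data, w, lr=0.01, epochs=500):
    """Full-batch gradient descent on the reconstruction loss."""
    n = len(data)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x in data:
            z = encode(w, x)
            r = [a - b for a, b in zip(decode(w, z), x)]  # residual x_hat - x
            rw = sum(ri * wi for ri, wi in zip(r, w))
            for k in range(len(w)):
                # d(loss)/d(w_k) for one sample, averaged over the dataset
                grad[k] += 2.0 * (rw * x[k] + z * r[k]) / n
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Hypothetical 2-D measurements that in fact vary along a single direction.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
w0 = [0.5, 0.5]                        # arbitrary starting weights
w = train(data, w0)
latent = [encode(w, x) for x in data]  # one number per sample instead of two
print(reconstruction_loss(w0, data), reconstruction_loss(w, data))
```

Because the toy data lie on a line, one latent value per sample is enough to reconstruct them; real omics data need nonlinear layers and a larger latent space.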

16.4 Deep Reinforcement Learning: DQN
Reinforcement learning is a machine learning approach that refines the behavior of a software agent in a given environment by trial and error. In 2013, DeepMind Technologies Ltd proposed the Deep Q-Network (DQN) for Atari 2600 games (Mnih et al., 2013), described by its authors as the first deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning. The proposed DQN algorithm outperformed all previous approaches on six of the seven Atari 2600 games tested and surpassed a human expert on three of them. Then, in 2016, DeepMind published AlphaGo (Silver et al., 2016), which defeated a top-ranked professional Go player in an official match. AlphaGo required large datasets of game records for training. In the following year, AlphaGo Zero (Silver et al., 2017), which is trained only by self-play without game record datasets, was released. DeepMind subsequently published MuZero (Schrittwieser et al., 2019), which learns through self-play to play Atari games as well as several board games. Applications of deep reinforcement learning in life science include drug design (Popova et al., 2018), protein structure prediction (AlphaFold and AlphaFold2, detailed in Chapter 17), crop yield prediction (Elavarasan and Vincent, 2020), and irrigation management (Zhou, 2020).
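The Q-learning update that DQN builds on can be sketched with a table instead of a neural network. Everything in this example is a hypothetical toy: a four-state corridor environment, a fixed learning rate and discount factor, and deterministic sweeps over all state-action pairs in place of true trial-and-error exploration, so that the result is reproducible.

```python
GAMMA = 0.9    # discount factor for future rewards
ALPHA = 0.5    # learning rate of the Q-value update
N_STATES = 4   # states 0..3 on a line; state 3 is the terminal goal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical toy environment: reward 1 only when the goal is reached."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(sweeps=300):
    """Tabular Q-learning; a DQN replaces this table with a neural network."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        # Exploration is replaced by sweeping every state-action pair,
        # which keeps the sketch deterministic.
        for s in range(N_STATES - 1):
            for a in (LEFT, RIGHT):
                nxt, r = step(s, a)
                target = r + GAMMA * max(q[nxt])       # reward plus best future value
                q[s][a] += ALPHA * (target - q[s][a])  # move estimate toward target
    return q


q = q_learning()
# Greedy policy: in each non-terminal state choose the action with the higher Q.
policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy moves right in every state, and the Q-values fall off by the discount factor with each additional step from the goal.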

16.5 Other Deep Learning Techniques: GNN, Transformer, AutoML
This section introduces other representative deep learning techniques: deep neural networks for graph-structured data, models for natural language processing, and the automation of hyperparameter search.

16.5.1 Deep learning for graphs: GNN
Graph Neural Network (GNN) is a general term for ANNs that take graph-formatted input. The input data for GNNs consist of nodes and edges (links between nodes) as graph components. Graph-based counterparts of CNNs, RNNs, auto-encoders, and deep reinforcement learning have all been reported (Zhang et al., 2020). For example, graph-based convolutional neural networks are called Graph Convolutional Networks (GCNs). The basic tasks of GNNs are node classification, graph classification, link prediction, and graph generation. In plant omics, Eguchi et al. (2019) classified alkaloids according to the starting substances of their biosynthetic pathways by training a GCN model on molecular fingerprint graphs of 12,460 alkaloid compounds. Zhou et al. (2020) proposed a GCN model that assigns gene ontology (GO) annotations (terms) to maize proteins based on their amino acid sequences. It successfully assigned around 40,000 GO terms as graph nodes. The proposed model also contains a CNN to handle the amino acid sequences.
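A single graph convolution step can be sketched as follows. This assumes the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), which the chapter does not spell out and which is not necessarily the variant used in the cited studies; the three-node graph, the node features and the identity weight are hypothetical.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so that each node also keeps its own features.
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by node degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features (matrix product norm x features).
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(len(features[0]))] for i in range(n)]
    # Linear transform followed by ReLU (matrix product agg x weights).
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(len(weights))))
             for o in range(len(weights[0]))] for i in range(n)]


# Hypothetical 3-node path graph 0 - 1 - 2, e.g. three atoms in a chain.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]  # only node 0 carries a signal
weights = [[1.0]]                 # untrained identity weight

out = gcn_layer(adjacency, features, weights)
print(out)
```

After one layer the signal has spread from node 0 to its direct neighbor but not yet to node 2; stacking layers widens the neighborhood each node can see.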

16.5.2 Natural language processing: transformer
Natural language processing (NLP) technology has been applied to biomedical text analysis in life science. Most current NLP deep learning models are based on Google's Transformer model (Vaswani et al., 2017), whereas RNNs such as LSTMs were used before its publication. The Transformer algorithm summarizes the relationships between words as a matrix by means of an Attention Mechanism. With this mechanism, the Transformer dramatically improved machine translation performance compared with conventional deep learning algorithms such as CNNs and RNNs. In particular, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), a bidirectional Transformer model proposed by Google, has been applied in various NLP studies. In life science, BioBERT (Lee et al., 2019), a BERT-based model trained on biomedical text from PubMed abstracts and PMC full-text articles, has been published. Transformer models are known to become more accurate when pre-trained on larger datasets (Kalyan et al., 2022). Model scale has grown accordingly: Google's Text-to-Text Transfer Transformer (T5) model (Raffel et al., 2020) has 11 billion parameters, and OpenAI's Generative Pre-trained Transformer 3 (GPT-3) model (Brown et al., 2020) has 175 billion parameters. GPT-3 became famous in August 2020, when a fake article it had generated reached the top rank on a social news site (Hacker News). Furthermore, the Vision Transformer (Dosovitskiy et al., 2020) has been shown to be more accurate than CNN models when the volume of pre-training data is large enough. Vision Transformer applications in plant omics research have already been reported, including a study that distinguished weeds from crops in UAV images (Reedha et al., 2021).
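The matrix of word-to-word relationships produced by the Attention Mechanism can be sketched with scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, as defined by Vaswani et al. (2017). The example is a simplification: the three two-dimensional token embeddings are hypothetical, and the learned projection matrices and multiple heads of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights.append(softmax(scores))  # one row of the relationship matrix
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values))
         for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Hypothetical 2-D embeddings for a three-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)  # self-attention
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and states how strongly one token attends to every token in the sentence, including itself.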

16.5.3 Automatic machine learning: AutoML
AutoML (automatic machine learning) aims to automate the training process, in particular model selection and hyperparameter search. The code needed to run AutoML is generally short. In most cases, however, the computational cost is very large, because AutoML chooses the best model and/or optimizes multiple hyperparameters by trying many combinations, in the simplest case all possible ones. There are many AutoML frameworks: Google AutoML (Google Brain, 2020), Amazon SageMaker AutoPilot (Amazon Web Service, 2021), and Microsoft Azure AutomatedML (Microsoft Azure Cloud Services, 2022) from proprietary cloud vendors, and Auto-sklearn (Feurer et al., 2020), PyCaret (PyCaret, 2020), AutoKeras (Jin et al., 2019), and Auto-PyTorch (Zimmer et al., 2020) as open-source environments. Table 16.1 summarizes which of these frameworks cover deep learning and classical machine learning algorithms. At present, the main targets of AutoML frameworks are structured tabular data and image data. A plant phenotyping application of AutoML based on CNN models has already been reported (Koh et al., 2020). AutoML environments should be a good solution for experimental biologists without expert programming skills.

Table 16.1. Features of AutoML frameworks.

AutoML framework               Deep learning   Classical ML   Cloud service
auto-sklearn                   -               ✓              -
PyCaret                        -               ✓              -
AutoKeras                      ✓               -              -
Auto-PyTorch                   ✓               -              -
Google AutoML                  ✓               ✓              ✓
Amazon SageMaker AutoPilot     ✓               ✓              ✓
Microsoft Azure Automated ML   ✓               ✓              ✓
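Why the computational effort grows so quickly can be seen from an exhaustive search over a small hyperparameter grid. This sketch does not use the API of any framework listed above: the search space, the function names and the closed-form `validation_score` are hypothetical stand-ins, and a real AutoML run would train and validate a model for every trial.

```python
from itertools import product


def validation_score(learning_rate, n_layers, batch_size):
    """Stand-in for training a model and measuring validation accuracy.

    Hypothetical closed-form score; a real run would train a model here.
    """
    return (1.0
            - abs(learning_rate - 0.01) * 10
            - abs(n_layers - 3) * 0.05
            - abs(batch_size - 32) / 1000)


# Hypothetical search space: 3 x 4 x 3 = 36 combinations.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_layers": [1, 2, 3, 4],
    "batch_size": [16, 32, 64],
}


def grid_search(space, score_fn):
    """Evaluate every combination and return the best one with its score."""
    names = list(space)
    best_params, best_score, n_trials = None, float("-inf"), 0
    for combo in product(*(space[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        n_trials += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_trials


best_params, best_score, n_trials = grid_search(search_space, validation_score)
print(best_params, n_trials)
```

The number of trials is the product of the sizes of all hyperparameter lists, so adding one more hyperparameter with five candidate values would multiply the cost by five. Practical frameworks often reduce this cost with smarter search strategies.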

16.6 Summary
This chapter has introduced representative deep learning algorithms applied in plant omics research. Transformer models are becoming the mainstream of deep learning algorithms, so Transformer-based plant omics research is a promising direction. The application domain of deep learning models is likely to expand, especially to multi-modal plant phenotyping analysis that combines images and text, and to digital twin simulation of plant bodies using 3D sensing data.

References
Amazon Web Service (2021) Amazon SageMaker Autopilot. Available at: https://github.com/aws/amazon-sagemaker-examples/tree/main/autopilot (accessed July 2022).
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D. et al. (2020) Language models are few-shot learners. arXiv 2005.14165.
de Freitas Cunha, R.L. and Silva, B. (2020) Estimating crop yields with remote sensing and deep learning. In: IEEE Latin American GRSS & ISPRS Remote Sensing Conference (LAGIRS), 22–26 March 2020, pp. 273–278. DOI: 10.1109/LAGIRS48042.2020.9165608.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 4171–4186 (Vol. 1).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X. et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv 2010.11929.
Du, J., Jia, P., Dai, Y., Tao, C., Zhao, Z. et al. (2019) Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20(Suppl. 1), 82. DOI: 10.1186/s12864-018-5370-x.
Eguchi, R., Ono, N., Hirai Morita, A., Katsuragi, T., Nakamura, S. et al. (2019) Classification of alkaloids according to the starting substances of their biosynthetic pathways using graph convolutional neural networks. BMC Bioinformatics 20(1), 380. DOI: 10.1186/s12859-019-2963-6.

Elavarasan, D. and Vincent, P.M.D. (2020) Crop yield prediction using deep reinforcement learning model for sustainable agrarian applications. IEEE Access: Practical Innovations, Open Solutions 8, 86886–86901. DOI: 10.1109/ACCESS.2020.2992480. Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M. and Hutter, F. (2020) Auto-sklearn 2.0. arXiv 2007(4074). Goëau, H., Bonnet, P. and Joly, A. (2017) Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017). In: Cappellato, L., Ferro, N., Goeuriot, L. and Mandl, T. (eds), CLEF 2017 Working Notes. Conference and Labs of the Evaluation Forum (CLEF), CEUR Workshop Proceedings. International Conference of the CLEF Association, 11–24 September 2017. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D. et al. (2014) Generative adversarial networks. Advances in Neural Information Processing Systems 27, 2672–2680. Google Brain (2020) Google AutoML. Available at: https://github.com/google/automl (accessed July 2022). Hochreiter, S. and Schmidhuber, J. (1997) Long short-term memory. Neural Computation 9(8), 1735–1780. DOI: 10.1162/neco.1997.9.8.1735. Hughes, D.P. and Salathe, M. (2015) An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 1511.08060. Hu, Q. and Greene, C.S. (2019) Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics. Pacific Symposium on Biocomputing 24, 362–373. DOI: 10.1101/385534. Jin, H., Song, Q. and Hu, X. (2019) Auto-Keras: an efficient neural architecture search system. In: KDD ’19, Anchorage, Alaska, pp. 1946–1956. DOI: 10.1145/3292500.3330648. Kalyan, K.S., Rajasekharan, A. and Sangeetha, S. (2022) AMMU: a survey of transformer-based biomedical pretrained language models. Journal of Biomedical Informatics 126, 103982. DOI: 10.1016/j. jbi.2021.103982. 
Kaminuma, E., Baba, Y., Mochizuki, M., Matsumoto, H., Ozaki, H. et al. (2020) DDBJ Data Analysis Challenge: a machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences. Genes & Genetic Systems 95, 43–50.
Koh, J.C., Spangenberg, G. and Kant, S. (2020) Automated machine learning for high throughput image-based plant phenotyping. bioRxiv 2020.12.03.410746.
Le, Q.V. and Mikolov, T. (2014) Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning 32, 1188–1196.
LeCun, Y. and Ranzato, M. (2013) Deep learning tutorial. In: Tutorials in International Conference on Machine Learning (ICML’13), 16 June 2013, pp. 1–29.
LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning. Nature 521(7553), 436–444. DOI: 10.1038/nature14539.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S. et al. (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240.
Microsoft Azure Cloud Services (2022) Microsoft Azure Machine Learning and AutomatedML. Available at: https://github.com/Azure/MachineLearningNotebooks (accessed July 2022).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J. (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of Advances in Neural Information Processing Systems. 26th International Conference (NIPS 13), 5–8 December, pp. 3111–3119.
Min, S., Lee, B. and Yoon, S. (2017) Deep learning in bioinformatics. Briefings in Bioinformatics 18, 851–869.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I. et al. (2013) Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
Mohanty, S.P., Hughes, D.P. and Salathé, M. (2016) Using deep learning for image-based plant disease detection. Frontiers in Plant Science 7, 1419. DOI: 10.3389/fpls.2016.01419.
Ng, P. (2017) Dna2vec: consistent vector representations of variable-length k-mers. arXiv:1701.06279.
Popova, M., Isayev, O. and Tropsha, A. (2018) Deep reinforcement learning for de novo drug design. Science Advances 4(7), eaap7885. DOI: 10.1126/sciadv.aap7885.
PyCaret (2020) PyCaret version 1.0.0. Available at: https://pycaret.org/about (accessed July 2022).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S. et al. (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1–67.


Reedha, R., Dericquebourg, E., Canals, R. and Hafiane, A. (2021) Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sensing 14(3), 592. DOI: 10.3390/rs14030592.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L. et al. (2019) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609. DOI: 10.1038/s41586-020-03051-4.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L. et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489. DOI: 10.1038/nature16961.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A. et al. (2017) Mastering the game of Go without human knowledge. Nature 550(7676), 354–359. DOI: 10.1038/nature24270.
Sun, R., Zhang, M., Yang, K. and Liu, J. (2020) Data enhancement for plant disease classification using generated lesions. Applied Sciences 10, 466.
Suwajanakorn, S., Seitz, S.M. and Kemelmacher-Shlizerman, I. (2017) Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics 36, 95.
van Klompenburg, T., Kassahun, A. and Catal, C. (2020) Crop yield prediction using machine learning: a systematic literature review. Computers and Electronics in Agriculture 177, 105709. DOI: 10.1016/j.compag.2020.105709.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L. et al. (2017) Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H. and Fergus, R. (eds), Proceedings of the Advances in Neural Information Processing Systems 30. Annual Conference on Neural Information Processing Systems, 4–9 December 2017, pp. 5998–6008.
Xiao, Q., Li, W., Kai, Y., Chen, P., Zhang, J. et al. (2019) Occurrence prediction of pests and diseases in cotton on the basis of weather factors by long short term memory network. BMC Bioinformatics 20, 688.
Zhang, Z., Cui, P. and Zhu, W. (2020) Deep learning on graphs: a survey. IEEE Transactions on Knowledge and Data Engineering.
Zhong, X. and Rajapakse, J.C. (2020) Graph embeddings on gene ontology annotations for protein-protein interaction prediction. BMC Bioinformatics 21(Suppl. 16), 560. DOI: 10.1186/s12859-020-03816-8.
Zhou, G., Wang, J., Zhang, X., Guo, M. and Yu, G. (2020) Predicting functions of maize proteins using graph convolutional network. BMC Bioinformatics 21(Suppl. 16), 420. DOI: 10.1186/s12859-020-03745-6.
Zhou, N. (2020) Intelligent control of agricultural irrigation based on reinforcement learning. Journal of Physics 1601(5), 052031. DOI: 10.1088/1742-6596/1601/5/052031.
Zimmer, L., Lindauer, M. and Hutter, F. (2020) Auto-Pytorch: multi-fidelity metalearning for efficient and robust autoDL. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(9), 3079–3090. DOI: 10.1109/TPAMI.2021.3067763.

17 Deep Learning on Images and Genetic Sequences in Plants: Classifications and Regressions

Kanae Masuda and Takashi Akagi*
Graduate School of Environmental and Life Science, Okayama University, Okayama, Japan

Abstract

Recent progress in deep learning (DL) frameworks has enabled the development of various high-quality classification and regression models. Their performance has often exceeded human standards, especially in image diagnosis, so it may be possible to reproduce an artificial “professional eye” for various objectives. Furthermore, the recent development of explainable DL technology (or explainable AI (X-AI)), which can visualize the relevance of DL predictions, allows biological interpretation of the predictions, potentially providing insights that only DL models may be able to recognize. Nevertheless, the application of DL frameworks has progressed less in plant science than in animal, medical, and social sciences. In this chapter, focusing mainly on classification and regression analyses, we introduce the current applications and potential prospects of DL technologies with respect to plant images and genetic sequences. As examples based on plant images, we take taxonomic classification, disease/stress diagnosis, non-invasive prediction, and implementation with automated sorting systems; as examples based on DNA or amino acid sequences, we take the prediction of gene functions, such as expression patterns and protein folding structures. For beginners in the use of DL techniques, tips and precautions regarding practical application of DL frameworks are also provided.

*Corresponding author: takashia@okayama-u.ac.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0017

17.1 Introduction

Deep neural networks (DNNs), the technology underlying deep learning (DL), have progressed substantially in the past few years, especially in image diagnosis. The concepts of DL technologies as applied to images can be divided into four main categories: (i) classification; (ii) prediction; (iii) regression (quantification); and (iv) object recognition (identification) (Singh et al., 2018). The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015), in which 1000 categories, including animals, plants, and products, are classified based on the ImageNet dataset with > 1 million images (Deng et al., 2009), has promoted the development of various DNN frameworks. It is noteworthy that in this image diagnosis contest, a convolutional neural network (CNN) (LeCun et al., 2015) (Table 17.1) named AlexNet, with only eight layers, exceeded the classification ability of conventional (or non-DL) machine learning (ML) models in 2012 (Krizhevsky et al., 2012). Thereafter, CNN models with more layers have been on a winning streak in the ILSVRC, and in 2015, a DNN model, named ResNet50, could exceed the human standard in image classification (Simonyan et al., 2013; He et al., 2016; Szegedy et al., 2016). This suggests that DNNs can “promptly” reproduce professional eyes, which would be based on “empirical” factors accumulated over decades. This may apply not only to normal images but also to other array data, such as various waveform data and nucleotide sequences, for which there have been no experts with long experience of observation. Even in the past couple of years, CNN models with complex and numerous layers, such as EfficientNet (Tan and Le, 2019), have continued to evolve rapidly. To meet the objectives of biologists, faster updating of knowledge of CNN models in biology would be indispensable.

Table 17.1. Representative CNN models.

CNN model | Layer numbers (Conv. + FC) | Features | Note (a)
AlexNet | 8 | - | Winner of 2012 ILSVRC
ZFNet | 8 | - | Winner of 2013 ILSVRC
VGGNet | 16 or 19 | - | -
GoogLeNet (Inceptions) | 22 | Inception module | Winner of 2014 ILSVRC
ResNet | 152 (many alternatives) | Residual learning | Winner of 2015 ILSVRC
SENet | Applicable in combination with other CNNs | Depthwise/pointwise convolution | -
EfficientNet | 18 | Compound model scaling | -
MobileNet | 15 | Depthwise separable convolution | Very accurate and speedy

(a) ILSVRC: the ImageNet Large Scale Visual Recognition Challenge.

17.2 Deep Learning for Plant Images

Regarding classification (or identification) by DNNs, a major, previously unresolved issue was detecting the image regions that are most relevant to the result. In other words, we had not been able to answer the question “Why did the DNN say that?” Importantly, some techniques are now able to address this problem. Feature visualization methods, such as layer-wise relevance propagation (LRP) (Bach et al., 2015), gradient-weighted class activation mapping (Grad-CAM) (Selvaraju et al., 2016), and guided backpropagation (Springenberg et al., 2014), have been proposed to elucidate the “black-box” nature of DNN learning. DNNs integrated with these methods are often called explainable artificial intelligence (X-AI). These techniques can examine the validity of trained DNN models, as well as check for misinterpretation, which is mainly derived from learning bias. For instance, these methods can reveal that a DNN has learned, as a key feature, a specific background that is strongly biased toward an objective category. In plant science (especially physiology and morphology), however, regions relevant to a predicted phenotype would correspond to a biological cause or index of the objective phenomenon. Thus, visualization of relevant regions would contribute to physiological or morphological interpretations of the predictions by DNNs, which would allow cell- or site-specific analyses of objective phenomena.

17.2.1 Deep learning for taxonomic classification of plant images

The simplest use of DL techniques in plants is automated taxonomic classification, as is the case for the ILSVRC described above (Russakovsky et al., 2015). With the Inception-ResNet v2 CNN model applied to the PlantCLEF 2016 image dataset (PlantCLEF, 2016; available at http://www.imageclef.org/lifeclef/2016/plant, accessed July 2022), which consists of approximately 117,000 images, normal RGB images of 1000 plant species could be classified with approximately 82%, 86%, and 88% accuracy at the species, genus, and family levels, respectively (Seeland et al., 2019). This study also visualized relevant regions for each classification with Grad-CAM++ (Chattopadhay et al., 2018), indicating the morphological features specific to each species. This example is a representative application of an open-source image dataset to CNN models to construct classification platforms. Recently, open-source plant image datasets, such as the Fruit360 dataset, which consists of about 90,000 images of 131 fruits and vegetables (https://www.kaggle.com/moltean/fruits, accessed September 2022), and the edible wild plants dataset, which includes 61 species (https://www.kaggle.com/gverzea/edible-wild-plants, accessed September 2022), have been accumulating in the Kaggle database (https://www.kaggle.com/, accessed September 2022). Although a de novo dataset may be required in terms of novelty of science, it is often difficult for a single research unit to acquire a big image dataset. Therefore, effective use (or merging) of existing datasets and image data registration (or simple submission to an open-access repository such as GitHub (https://github.co.jp/, accessed September 2022)) may be key to deploying DNN techniques in plant science.
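One simple way to read accuracies reported at several taxonomic levels is to score species-level predictions against a taxonomy, so that a wrong species within the correct genus still counts as correct at the genus and family levels. The sketch below illustrates this idea only; it is not necessarily how Seeland et al. (2019) computed their figures, and the function name, the toy taxonomy, and the labels are hypothetical.

```python
def accuracy_at_levels(true_species, pred_species, taxonomy):
    """Score species-level predictions at species, genus, and family level.

    `taxonomy` maps a species name to a (genus, family) tuple.
    """
    hits = {"species": 0, "genus": 0, "family": 0}
    for true, pred in zip(true_species, pred_species):
        hits["species"] += true == pred
        hits["genus"] += taxonomy[true][0] == taxonomy[pred][0]
        hits["family"] += taxonomy[true][1] == taxonomy[pred][1]
    return {level: count / len(true_species) for level, count in hits.items()}


# Toy taxonomy and labels, invented for illustration.
TAXONOMY = {
    "Malus domestica": ("Malus", "Rosaceae"),
    "Malus sylvestris": ("Malus", "Rosaceae"),
    "Pyrus communis": ("Pyrus", "Rosaceae"),
    "Zea mays": ("Zea", "Poaceae"),
}
true_labels = ["Malus domestica", "Malus domestica", "Pyrus communis", "Zea mays"]
pred_labels = ["Malus domestica", "Malus sylvestris", "Malus domestica", "Zea mays"]

levels = accuracy_at_levels(true_labels, pred_labels, TAXONOMY)
print(levels)  # {'species': 0.5, 'genus': 0.75, 'family': 1.0}
```

In this toy case, confusions between closely related species lower the species-level score but not the genus- or family-level scores, which matches the direction of the trend reported above.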

17.2.2 Deep learning for stress/disease diagnosis based on plant images

In plant science, DNN classification (or identification) models have been applied successfully, mainly to detect symptoms of stresses/diseases in many plant species (Ramcharan et al., 2017; Ferentinos, 2018; Ghosal et al., 2018; well reviewed by Singh et al., 2018). To date, for some major crops, such as apple (Mohanty et al., 2016; Ferentinos, 2018; Liu et al., 2018), tomato (Fuentes et al., 2017; Brahimi et al., 2017; Yamamoto et al., 2017), cassava (Ramcharan et al., 2017), maize (DeChant et al., 2017), cucumber (Fujita et al., 2016), wheat (Li et al., 2017), soybean (Ghosal et al., 2018), and certain tree crops (Sladojevic et al., 2016), existing or modified CNN models trained with image databases such as PlantVillage and Wheat Disease DataBase have been successfully applied for image-based phenotyping of plant stress or diseases (Singh et al., 2018). To evaluate the classification abilities of such models, Mohanty et al. (2016) compared the performance of AlexNet trained from scratch and GoogLeNet with transfer learning in detecting 26 kinds of diseases in 14 crops, using the PlantVillage image dataset (N = about 54,000; https://plantvillage.psu.edu/, accessed September 2022). Here, a model trained from scratch is one with no pre-training, whereas a model with transfer learning starts from weights pre-trained on a related task. GoogLeNet with transfer learning showed better performance (> 99%) than AlexNet from scratch (about 85%). However, models with simple structures, such as AlexNet or the VGG series (see Table 17.1), showed sufficient (or better) classification performance for apple leaf diseases (> 97% accuracy) (Liu et al., 2018) and tomato stresses and diseases (e.g., > 90% accuracy for leaf mold), in comparison with other architectures that exhibited higher performance in the ILSVRC (Fuentes et al., 2017). Most of these studies effectively used open-source datasets, and the authors emphasized the importance of data size, which can be augmented in some platforms, and data quality, with a preference for including “real data” that reflect the targeted classification situations, such as locations with uniform backgrounds, or cultivars if the target is crops.
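The distinction between training from scratch and transfer learning can be illustrated with a deliberately small NumPy sketch: a “pre-trained” base layer is kept frozen and only a new output layer is trained for the new classes. Everything here is an assumption made for illustration: the base weights are random stand-ins rather than real ImageNet weights, the data are random toy samples, and the function names are hypothetical. None of the cited studies worked this way; they used full CNNs in DL frameworks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a feature extractor pre-trained on a large dataset.
# Random weights are used purely for illustration (assumption).
W_BASE = rng.normal(scale=0.1, size=(32, 16))


def extract_features(x, w_base):
    """Frozen base: one dense layer with ReLU activation."""
    return np.maximum(x @ w_base, 0.0)


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())


def train_new_head(features, labels, n_classes, lr=0.05, epochs=300):
    """Train only a new softmax output layer by gradient descent."""
    w_head = np.zeros((features.shape[1], n_classes))
    targets = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ w_head)
        w_head -= lr * features.T @ (probs - targets) / len(labels)
    return w_head


# Toy target task: 60 samples and 3 hypothetical disease classes.
x_train = rng.uniform(size=(60, 32))
y_train = rng.integers(0, 3, size=60)

w_base_before = W_BASE.copy()
features = extract_features(x_train, W_BASE)
w_head = train_new_head(features, y_train, n_classes=3)

initial_loss = cross_entropy(softmax(features @ np.zeros_like(w_head)), y_train)
final_loss = cross_entropy(softmax(features @ w_head), y_train)
print(initial_loss, final_loss)
```

Training from scratch would instead initialize and update the base weights as well, which usually requires far more labeled images than fitting a new output layer alone.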

17.2.3 Deep learning for non-invasive prediction of plant images

Plant stresses and diseases are “visible” information for humans, and DNN phenotyping can directly detect the visible symptoms of these conditions from images, which can be confirmed by relevance visualization tools (Ghosal et al., 2018; Nagasubramanian et al., 2020). However, internal traits, such as internal disorders or ingredient compositions, and superficial microstructures are often not clearly visible and would be difficult for humans to recognize. Only a limited number of experts can possibly identify these issues based on their extensive experience, and it may be difficult to reproduce these skills. Notwithstanding, non-invasive prediction of such invisible traits in plants, especially in crops, is in high demand, from the viewpoints of both agriculture and experiments. To date, a limited number of DNNs have been applied to predict invisible characters in plant organs, such as internal physical damage of blueberry fruits (Wang et al., 2018), calyx-end cracking (Akagi et al., 2020) and seed numbers/locations (Masuda et al., 2021) in persimmon fruits, granulation in citrus fruits (Jie et al., 2021), and maturity in citrus fruits (Itakura et al., 2019) and blueberry fruits (Mu et al., 2020). Among these studies, simple RGB images could work for detection/classification of persimmon internal calyx-end cracking and seeds, and for quantification of maturity in blueberry fruit, whereas the other studies used non-RGB image data, such as hyperspectral images (HSI) or fluorescence spectra. In the detection of calyx-end cracking in persimmon fruit, with only about 3000 original images, five CNN models reached approximately 85% accuracy, which would exceed the performance of experts with decades of experience (Akagi et al., 2020). Interestingly, a combination of relevance visualization methods, such as Guided Grad-CAM (Springenberg et al., 2014; Selvaraju et al., 2016) and a series of LRPs (Iwana et al., 2019), could suggest indexes of this internal disorder in the original RGB images that no one had previously recognized. Among studies with non-RGB image data, Jie et al. (2021) used HSI data for the detection of very fine structures on the surfaces of citrus fruit and determined effective spectra for prediction of granulation by backpropagation of relevance. Itakura et al. (2019) applied CNN models to fluorescence spectroscopy data to determine hyperparameters effective for predicting sugar–acid ratios. Of course, few (or no) experts would be able to associate such HSI or fluorescence spectrum data with specific phenotypes. Taken together, DNNs could become an advantageous technique to rapidly (re)produce artificial professional eyes on unidentified features, especially for invisible characters or data.

17.2.4 Deep learning for regression and quantification of plant images

Regression (or quantification) is also an ability of CNNs in image diagnosis. The concept of quantification by DNNs covers a wide range of analyses. Note that here we reluctantly omit quantification tasks that mainly use segmentation or object recognition, as these are introduced in Chapter 18. Here, simple quantifications of leaf numbers and stress/disease severity are taken as examples. Ubbens et al. (2018) applied a typical regression CNN that consisted of four convolutional layers and a single fully connected layer, followed by an output layer with a single neuron (or node), to quantify rosette leaf numbers in Arabidopsis. Importantly, the authors successfully applied a generative adversarial network (GAN) to synthetically augment the training images that were less frequently encountered in the original dataset. Such typical regression models (with a single node in the output layer) can directly predict quantitative values, but the concept of visualization of relevant regions (or feature mapping) is not applicable to them. Wang et al. (2017) effectively predicted the severity of apple black rot disease at four levels using four representative CNN models: VGG16, VGG19, ResNet50, and Inception-v3 (Table 17.1). They ultimately reached about 90% accuracy with VGG16, using fewer than 500 images of diseased plants. Note that, strictly speaking, this study is a classification task rather than quantification, because the four levels of severity of apple black rot were classified on a nominal scale. A similar concept was also developed in the prediction of the severity of fruit calyx-end cracking (Akagi et al., 2020) and of seed numbers (Masuda et al., 2021) in persimmons. These studies demonstrated that in binary classifications, such as seeded or seedless, the prediction values would be highly correlated with quantitative values, such as seed numbers, which may suggest a new quantification method by CNN. Ghosal et al. (2018) developed a novel framework that applies an explainable CNN, named the explainable plant network (xPlNet), to classify stresses on soybean leaves and detect the relevant regions (or feature map) for these stresses, the intensity of which can be used to quantify stress severity.

17.2.5 Deep learning for automated sorting of plant images

Another effective application of CNNs for classification, especially in agriculture, is automated sorting of crops. Ponce et al. (2019) proposed an automated classification method for seven olive varieties, which are often harvested mixed. The authors developed a technique to extract each individual olive from the mixture in normal RGB images and reached about 95% accuracy for classification of the seven varieties with Inception-ResNet-v2. For lychee (Litchi chinensis), there is worldwide confusion regarding the nomenclature of cultivars, and it may be difficult to discriminate them properly. Osako et al. (2020) applied VGG16 for discrimination of four lychee cultivars in Taiwan. In this analysis, the subtle differences in fruit shape were evaluated as key features by relevance visualization in VGG16, which was consistent with the results from SHAPE analysis based on elliptic Fourier descriptors (Iwata and Ukai, 2002). Nasiri et al. (2019) also applied VGG16 for prediction of four classes (or grades) of date fruits. These classifications of grades or cultivars/varieties/lines by CNNs would direct automated sorting systems and lead to the actual introduction of machine sorting systems. These techniques would also be worth applying in experimental plant science.
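The regression CNN described in Section 17.2.4 (four convolutional layers, one fully connected layer, and an output layer with a single neuron) can be sketched as a plain NumPy forward pass. This is a minimal illustration of the architecture, not the implementation of Ubbens et al. (2018): the 3 × 3 kernels, the eight channels per layer, the max pooling after each convolution, the 64 × 64 input, and the random (untrained) weights are all assumptions made for this sketch, and the function names are hypothetical.

```python
import numpy as np


def conv_relu(x, kernels):
    """'Valid' 2-D convolution followed by ReLU.

    x has shape (H, W, C_in); kernels has shape (k, k, C_in, C_out).
    """
    k = kernels.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, kernels.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted view of the input times the kernel slice at (i, j).
            out += x[i:i + out_h, j:j + out_w, :] @ kernels[i, j]
    return np.maximum(out, 0.0)


def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size, :]
    return x.reshape(h, size, w, size, -1).max(axis=(1, 3))


def init_params(rng, channels=(3, 8, 8, 8, 8), fc_units=16, flat_size=32):
    """Random weights; flat_size=32 matches a 64 x 64 input (2 * 2 * 8)."""
    conv = [rng.normal(scale=0.1, size=(3, 3, c_in, c_out))
            for c_in, c_out in zip(channels[:-1], channels[1:])]
    return {
        "conv": conv,
        "fc_w": rng.normal(scale=0.1, size=(flat_size, fc_units)),
        "fc_b": np.zeros(fc_units),
        "out_w": rng.normal(scale=0.1, size=fc_units),
        "out_b": 0.0,
    }


def predict_count(image, params):
    x = image
    for kernels in params["conv"]:            # four convolutional layers
        x = max_pool(conv_relu(x, kernels))
    x = x.reshape(-1)                         # flatten to a vector
    x = np.maximum(x @ params["fc_w"] + params["fc_b"], 0.0)  # one FC layer
    # Single output neuron with linear activation: one quantitative value.
    return float(x @ params["out_w"] + params["out_b"])


rng = np.random.default_rng(0)
params = init_params(rng)
image = rng.uniform(size=(64, 64, 3))         # stand-in for a rescaled RGB image
prediction = predict_count(image, params)
print(prediction)
```

Because the weights are untrained, the printed value is meaningless; the point is only that the network ends in a single linear node and would be trained with a regression loss rather than a softmax over classes.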

17.3 Deep Learning for DNA Sequences

CNNs (and possibly recurrent neural networks) can be applied to any matrix (or “array” in computer science) data, including DNA and RNA sequences, which can be converted into one-hot arrays (Zou et al., 2019) (Fig. 17.1). Although only a few studies have applied this approach to plants to date, some advanced trials, mainly in medical science and perspective papers, have introduced applications of DNNs to DNA/RNA sequences. Alipanahi et al. (2015) proposed a CNN framework, named DeepBind, to predict the sequence specificities of nucleotide-binding proteins. They presented models trained with nucleotide motifs from existing experimental data for protein–nucleotide interactions, such as ChIP-Seq. In addition, they back-propagated the residues critical for protein-binding ability, which allowed the prediction of deleterious single nucleotide variations that cause certain severe diseases. Especially in medical science, this concept has been deployed in the prediction of splicing patterns or expression abundance involved in critical diseases, and further applied to genome-wide analyses, such as GWAS (reviewed by Leung et al., 2016). Tian et al. (2019) developed a CNN model, named MRCNN, to predict DNA methylation levels from DNA sequences. They used a genome-wide bisulfite sequencing dataset and made good predictions (> 90% accuracy) of CpG methylation levels. Furthermore, the relevance propagation of the CNN model allowed the identification of DNA motifs that were predominantly methylated.

Regarding plants, Washburn et al. (2019) proposed a novel approach to predict expression abundances of genes based on their promoter and terminator sequences, importantly by utilizing information on the “evolutionary relationships” of the genes. This was based on a comparison of genes belonging to the same gene family (orthologs or paralogs), in which the minimal differences that give rise to expression changes could be identified. To evaluate the validity of this hypothesis, the authors developed two approaches: (i) gene-family-guided splitting, in which training within a gene family was validated in test sets with different gene families; and (ii) ortholog contrast, in which training with expression contrasts between orthologs in two species, Sorghum bicolor and Zea mays, was validated in different gene families. The CNN was successfully interpreted to identify the genomic regions important for expression activation/suppression.

Although only a limited number of examples can be introduced here, the development of algorithms, models, and platforms for applying DNNs to DNA/RNA sequences is progressing rapidly. The concept of DNNs applied to nucleotides is consistent with that described in the image diagnosis section above, namely “rapid production of an artificial professional eye” on unidentified features in fields in which there are no experts. It has been suggested that very simple networks, potentially even fully connected models, can be applied to the prediction of DNA sequence features (Zou et al., 2019), and any applications to plant genomes would contribute to the deployment of this new concept.

Fig. 17.1. Conversion of nucleotide sequences into a two-dimensional one-hot array.
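The one-hot conversion of nucleotide sequences referred to in Fig. 17.1 can be sketched in a few lines of NumPy. The column order A, C, G, T, the treatment of U as T, and the all-zero row for ambiguous bases such as N are choices made for this illustration; they are assumptions and may differ from the layout in the figure or in any given tool.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # column order is an assumption of this sketch


def one_hot_encode(sequence):
    """Return a (sequence length, 4) array with one 1 per known base.

    RNA is handled by treating U as T; ambiguous bases (e.g. N) are
    left as all-zero rows.
    """
    index = {base: column for column, base in enumerate(NUCLEOTIDES)}
    array = np.zeros((len(sequence), len(NUCLEOTIDES)), dtype=np.float32)
    for position, base in enumerate(sequence.upper().replace("U", "T")):
        if base in index:
            array[position, index[base]] = 1.0
    return array


encoded = one_hot_encode("ACGTN")
print(encoded)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

The resulting array has one row per sequence position and four channels, so it can be fed to a one-dimensional CNN in the same way that an image array is fed to a two-dimensional CNN.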

17.4 Deep Learning for Amino Acid Sequences: Prediction of Protein Folding

Amino acid sequences are, of course, two-dimensional data, but can act as proteins only in three-dimensional structures. This conversion, called protein folding, is a major issue in analysing the physiological functions of proteins. Nevertheless, crystal structure analysis of protein folding is time consuming (several years for a single protein), and the prediction of protein folding rules or patterns from two-dimensional amino acid sequences is challenging. To overcome this issue, various prediction models have been proposed and developed in the Critical Assessment of protein Structure Prediction (CASP) competition (Avbelj and Moult, 1995). Notably, in 2018 (CASP13), the AlphaFold algorithm (Senior et al., 2020) achieved a leap in prediction ability, reaching almost 60% for the averaged global distance test total score (GDT-TS), even though the averaged GDT-TSs in previous top-performing models were up to 40% (Callaway, 2020). Furthermore, at CASP14 in 2020, AlphaFold2 reached approximately 90% for averaged GDT-TS (Jumper et al., 2020). This achievement can be regarded as a breakthrough against a significant challenge, the so-called protein folding problem. Future progress in the field of folding prediction may make crystal structure analysis obsolete.

17.5 CNN Guides for Beginners: Tips and Precautions in Practice

17.5.1 Installing libraries and preparing data for application of a CNN

In this section, as most readers are likely not to be information scientists but to be “wet” plant biologists or bioinformaticians, entry-level and basic implementations and applications of CNNs are introduced. Representative CNN models for image diagnosis can be implemented freely, based on DL frameworks (or backends) such as TensorFlow, Caffe, and PyTorch, with higher-level application programming interfaces (APIs). With Keras (https://keras.io/, accessed September 2022), for example, as one of the most popular higher-level APIs, CNN training can be performed with the minimal requirements of installing Python (and its derivative libraries, such as NumPy), TensorFlow (as the default setting for Keras), and objective modules/packages/libraries on any computer, including cloud servers, preferably with a graphics processing unit or units (GPU). Existing two-dimensional CNN models implemented with Keras, such as VGG16, Xception, ResNet50, Inception v3, Inception-ResNet v2, and MobileNet v1, are available from scratch or pre-trained with the ImageNet dataset (available at http://www.image-net.org/, accessed 21 September 2021). The pre-trained models can be freely fine-tuned to adjust to the user’s purposes, such as by adding or removing layers, by freezing weights in some layers, or by changing output classes. The input dataset should be a numerical array (NumPy format in Keras). RGB images would normally be two-dimensional arrays with three channels, whereas DNA/RNA sequences can be expressed by a one-hot one-dimensional array with four channels (as shown in Fig. 17.1). Preparation steps would require resizing and rescaling of images and setting of parameters, representatively optimizers, metrics, losses, learning rates, or epoch numbers, which often substantially affect training performance. Regarding optional parameters, augmentation of the training dataset through flipping, rotation, or changing the color saturation or illumination of images may allow better performance even with small sample sets. In classification tasks, setting the class weight can work to balance categories with biased sample numbers (the “class imbalance problem”). Fortunately, these parameters have been implemented in existing higher-level APIs, such as Keras. Training can be monitored through the transitions of loss values, which ideally show a synchronized decrease in the training and validation sets. Setting too many epochs often results in over-fitting to the training samples, where the training loss continues to decrease but the validation loss gradually increases.
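Several of the preparation steps above (rescaling, augmentation by flipping and rotation, class weighting, and watching the validation loss for over-fitting) can be sketched without any DL framework. The NumPy sketch below is illustrative only: the function names are hypothetical, the class-weight formula is the common “balanced” heuristic (number of samples divided by the number of classes times the class count) rather than a rule taken from this chapter, and the `patience` value is an arbitrary assumption.

```python
import numpy as np


def rescale(image_uint8):
    """Map 8-bit pixel values (0-255) to the range [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


def augment(image):
    """Return the (H, W, C) image plus flipped and rotated copies."""
    return [
        image,
        np.flip(image, axis=1),              # horizontal flip
        np.flip(image, axis=0),              # vertical flip
        np.rot90(image, k=1, axes=(0, 1)),   # 90-degree rotation
    ]


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * n_samples_in_class_c)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


def best_epoch(val_losses, patience=3):
    """Index of the lowest validation loss, stopping the scan once the
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_index = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_index = loss, epoch
        elif epoch - best_index >= patience:
            break
    return best_index


image = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)   # toy 4 x 4 RGB image
augmented = augment(rescale(image))

labels = [0] * 90 + [1] * 10                             # imbalanced toy labels
weights = balanced_class_weights(labels)

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]  # rises after epoch 2
stop_at = best_epoch(val_losses)
print(weights, stop_at)
```

In Keras, a dictionary of this form can be passed to the `class_weight` argument of `Model.fit`, and the `EarlyStopping` callback applies a similar patience rule during training; the sketch above only mirrors the logic and is not a replacement for those facilities.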

17.5.2 Evaluation of CNN model performance

Classification performance of trained models can be evaluated based on the confusion matrix (or further hypergeometric tests), accuracy, precision, recall, F1 score (or F-measure), breakeven point, and/or Intersection over Union (IoU). Further visualization of the prediction value distribution, the receiver operating characteristic (ROC) curve (and calculation of the ROC area under curve (ROC-AUC) value), and/or the precision-recall (PR) curve can support model evaluation and be applied to define characteristics of the model. As described above (Ghosal et al., 2018; Akagi et al., 2020; Masuda et al., 2021), if classified objects are annotated with ordinal, ranking, or proportional scales, correlations between their quantitative values and the distribution of prediction values or relevance levels in backpropagation of CNNs may be employed as new indexes of performance. In regression analysis with CNNs, a simple performance evaluation could be a correlation analysis between labels and predictions, chosen according to the regression loss. For instance, when the mean squared logarithmic error is used as the loss, the correlation on a logarithmic scale should be examined.
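As a minimal sketch of these evaluations, the functions below compute the confusion matrix, accuracy, precision, recall, and F1 score for a binary classifier, and, for regression, the mean squared logarithmic error together with a Pearson correlation on the same logarithmic scale. The function names and toy data are hypothetical, and using log(1 + x) for the “logarithmic correlation” is an interpretation adopted here to match the usual definition of the mean squared logarithmic error.

```python
import math


def binary_classification_metrics(y_true, y_pred):
    """Confusion-matrix-based metrics for labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mean_squared_log_error(y_true, y_pred):
    """Mean of (log(1 + true) - log(1 + predicted)) squared."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


def log_scale_correlation(y_true, y_pred):
    """Pearson correlation between log(1 + label) and log(1 + prediction)."""
    a = [math.log1p(t) for t in y_true]
    b = [math.log1p(p) for p in y_pred]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy classification example (1 = seeded, 0 = seedless; invented labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)

# Toy regression example (e.g. counts per fruit; invented values).
counts = [0, 1, 3, 7]
predicted = [0.2, 1.5, 2.4, 6.0]
print(mean_squared_log_error(counts, predicted),
      log_scale_correlation(counts, predicted))
```

In the classification example there are three true positives, three true negatives, one false positive, and one false negative, so accuracy, precision, recall, and F1 all equal 0.75.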

17.5.3 Interpretability and explainability of CNN models Relevance (or feature) visualization (called explainable AI (X-AI)), as introduced above, is a powerful tool, especially in combination with “wet” experiments such as microscopic observations and region-specific omics approaches in plant biology. To date (as of 6 January 2021), popular tools include LRP (Bach et al., 2015), guided backpropagation (Springenberg et al., 2014), Grad-CAM (Selvaraju et al., 2016), and their variants or combinations. These tools are freely available from the iNNvestigate library (Alber et al., 2019) and/or other Github sites. Users should note that these tools behave as independent concepts for visualization of features. Guided backpropagation and Grad-CAM are quite similar because they both assume that relevant regions have larger impacts on the class likelihood determined by the CNNs. Guided backpropagation aims to find the input pixels that have large impacts in the orthodox backpropagation procedure, whereas Grad-CAM finds high-impact regions not in the input image but in the feature map of the last (or potentially somewhat previous) convolutional layer. Feature maps are smaller than the input image, and the relevance map needs to be enlarged to have the same size as the original input image. Thus, Grad-CAM can provide only coarse visualizations. In contrast with these gradient-based methods, LRP aims to identify the input pixels that are highly relevant to the results by decomposing the class likelihood into the input pixels. This decomposition is achieved by applying a simple weighted reallocation rule (Bach et al., 2015). Note that guided backpropagation or LRP would apply to networks constituted of only simple unidirectional layers, whereas it would be difficult to apply these methods to ResNet with residue learning by shortcut layer connections (He et al., 2016), or to Inception/GoogLeNet with the inception module (Szegedy et al., 2016). 
A combination of guided backpropagation and Grad-CAM, called Guided Grad-CAM (Selvaraju et al., 2016), and LRP variants, such as LRP-Sequential A/B and LRP-Epsilon (Iwana et al., 2019; Montavon et al., 2019), use concepts that are basically identical to the original ones for CNN backpropagation. Guided Grad-CAM simply multiplies guided backpropagation and Grad-CAM for relevance visualization. The LRP variants slightly modify the parameters in the reallocation rule of LRP.
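The multiplication performed by Guided Grad-CAM can be sketched as follows, assuming that a pixel-level guided backpropagation map and a coarse Grad-CAM map have already been computed by other means. Both function names are hypothetical. Nearest-neighbor enlargement is used only to keep the example short; published implementations typically use smoother interpolation.

```python
import numpy as np


def upsample_nearest(cam, out_h, out_w):
    """Enlarge a coarse (h, w) map to (out_h, out_w) by nearest-neighbor lookup."""
    h, w = cam.shape
    rows = np.arange(out_h) * h // out_h   # source row for every output row
    cols = np.arange(out_w) * w // out_w   # source column for every output column
    return cam[np.ix_(rows, cols)]


def guided_grad_cam(guided_bp, cam):
    """Element-wise product of a pixel-level map and an enlarged coarse map.

    guided_bp: (H, W) guided backpropagation map at input resolution
    cam: (h, w) Grad-CAM map at feature-map resolution
    """
    cam_up = upsample_nearest(cam, *guided_bp.shape)
    return guided_bp * cam_up


# Toy example: a 2 x 2 coarse map applied to a 4 x 4 pixel-level map.
coarse = np.array([[0.0, 1.0], [2.0, 3.0]])
fine = np.ones((4, 4))
combined = guided_grad_cam(fine, coarse)
```

The product keeps the fine detail of guided backpropagation only where the coarse Grad-CAM map is large, which is why the combination is class-discriminative and high-resolution at the same time.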

17.6 Future Perspectives

In this chapter, we have introduced the potential of DL technology, which currently allows rapid classification and regression by capturing hidden patterns or rules that would often be difficult to detect even for experts with long experience. Even so, the current usages are quite limited to visual images. DL frameworks would be applicable to any format that can be converted into numeric arrays, as represented by AlphaFold2 (Jumper et al., 2020, 2021), as described above. Not limited to genetic sequences, applications of X-AI to any other format, such as gas chromatography and acoustic vibration data, might reveal new features that have not been defined by manual curation. Furthermore, the combination of X-AI and plant omics approaches may be promising. For instance, simple transcriptome analysis between position-specific relevant and non-relevant regions for prediction may be able to capture physiological reactions for premonitory signals of objective phenomena (such as shoot regeneration signals from a callus). As there are such wide frontiers with respect to merging DL technologies and biological approaches, more trials with more users of DL should be encouraged for further development in plant biology.

Acknowledgments We thank Sara J. Mason, MSc, from Edanz (https://jp.edanz.com/ac, accessed September 2022) for editing a draft of this manuscript.

References Akagi, T., Onishi, M., Masuda, K., Kuroki, R., Baba, K. et al. (2020) Explainable deep learning reproduces a “professional eye” on the diagnosis of internal disorders in persimmon fruit. Plant & Cell Physiology 61(11), 1967–1973. DOI: 10.1093/pcp/pcaa111. Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K.T. et al. (2019) INNvestigate neural networks! Journal of Machine Learning Research 20(93), 1–8. Alipanahi, B., Delong, A., Weirauch, M.T. and Frey, B.J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33(8), 831–838. DOI: 10.1038/nbt.3300. Avbelj, F. and Moult, J. (1995) Determination of the conformation of folding initiation sites in proteins by computer simulation. Proteins 23(2), 129–141. DOI: 10.1002/prot.340230203. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R. et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS ONE 10(7), e0130140. DOI: 10.1371/journal.pone.0130140. Brahimi, M., Boukhalfa, K. and Moussaoui, A. (2017) Deep learning for tomato diseases: classification and symptoms visualization. Applied Artificial Intelligence 31(4), 299–315. DOI: 10.1080/08839514.2017.1315516. Callaway, E. (2020) “It will change everything”: DeepMind’s AI makes gigantic leap in solving protein structures. Nature 588(7837), 203–204. DOI: 10.1038/d41586-020-03348-4. Chattopadhay, A., Sarkar, A., Howlader, P. and Balasubramanian, V.N. (2018) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, Lake Tahoe, NV, pp. 839–847. DOI: 10.1109/WACV.2018.00097.


DeChant, C., Wiesner-Hanks, T., Chen, S., Stewart, E.L., Yosinski, J. et al. (2017) Automated identification of northern leaf blight-infected maize plants from field imagery using deep learning. Phytopathology 107(11), 1426–1432. DOI: 10.1094/PHYTO-11-16-0417-R. Deng, J., Dong, W., Socher, R., Li, L.-J., Fei-Fei, L. et al. (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Miami, FL, pp. 248–255. DOI: 10.1109/CVPR.2009.5206848. Ferentinos, K.P. (2018) Deep learning models for plant disease detection and diagnosis. Computers and Electronics in Agriculture 145, 311–318. Fuentes, A., Yoon, S., Kim, S.C. and Park, D.S. (2017) A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors 17(9), 2022. DOI: 10.3390/s17092022. Fujita, E., Kawasaki, Y., Uga, H., Kagiwada, S. and Iyatomi, H. (2016) Basic investigation on a robust and practical plant diagnostic system. In: Proceedings of the 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 18–20 December 2016, pp. 989–992. Ghosal, S., Blystone, D., Singh, A.K., Ganapathysubramanian, B., Singh, A. et al. (2018) An explainable deep machine vision framework for plant stress phenotyping. Proceedings of the National Academy of Sciences 115(18), 4613–4618. DOI: 10.1073/pnas.1716999115. He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas NV, USA, 27–30 June 2016, pp. 770–778. Itakura, K., Saito, Y., Suzuki, T., Kondo, N. and Hosoi, F. (2019) Estimation of citrus maturity with fluorescence spectroscopy using deep learning. Horticulturae 5(1), 2. Iwana, B.K., Kuroki, R. and Uchida, S. (2019) Explaining convolutional neural networks using softmax gradient layer- wise relevance propagation. 
In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019, pp. 4176–4185. Iwata, H. and Ukai, Y. (2002) SHAPE: a computer program package for quantitative evaluation of biological shapes based on elliptic Fourier descriptors. Journal of Heredity 93(5), 384–385. Jie, D., Wu, S., Wang, P., Li, Y., Ye, D. and Wei, X. (2021) Research on Citrus grandis granulation determination based on hyperspectral imaging through deep learning. Food Analytical Methods 14(2), 280–289. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M. et al. (2020) High accuracy protein structure prediction using deep learning. In: 14th CASP (14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction, May–August 2020). Available at: https:// predictioncenter.org/casp14/doc/CASP14_Abstracts.pdf (accessed September 20, 2022). Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6), 84–90. LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning. Nature 521(7553), 436–444. Leung, M.K., Delong, A., Alipanahi, B. and Frey, B.J. (2016) Machine learning in genomic medicine: a review of computational problems and data sets. Proceedings of the IEEE 104(1), 176–197. Li, W., Lin, M., Huang, Y., Liu, H. and Zhou, X. (2017) Near infrared spectroscopy detection of the content of wheat based on improved deep belief network. Journal of Pharmaceutical Health Care and Sciences 887(1), 012046. Liu, B., Zhang, Y., He, D. and Li, Y. (2018) Identification of apple leaf diseases based on deep convolutional neural networks. Symmetry 10(1), 11. DOI: 10.3390/sym10010011. Masuda, K., Suzuki, M., Baba, K., Takeshita, K., Suzuki, T. et al. 
(2021) Noninvasive diagnosis of seedless fruit using deep learning in persimmon. The Horticulture Journal 90(2), 172–180. DOI: 10.2503/hortj.UTD-248. Mohanty, S.P., Hughes, D.P. and Salathé, M. (2016) Using deep learning for image-based plant disease detection. Frontiers in Plant Science 7, 1419. DOI: 10.3389/fpls.2016.01419. Montavon, G., Binder, A., Lapuschkin, S., Samek, W. and Müller, K.R. (2019) Layer-wise relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K. and Müller, K.-R. (eds) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, Cham, pp. 193–209. Mu, C., Yuan, Z., Ouyang, X., Sun, P. and Wang, B. (2020) Non-destructive detection of blueberry skin pigments and intrinsic fruit qualities based on deep learning. Journal of the Science of Food and Agriculture 101(8), 3165–3175. DOI: 10.1002/jsfa.10945.


Nagasubramanian, K., Singh, A.K., Singh, A., Sarkar, S. and Ganapathysubramanian, B. (2020) Usefulness of interpretability methods to explain deep learning based plant stress phenotyping. arXiv 2007.05729. Nasiri, A., Taheri-Garavand, A. and Zhang, Y.D. (2019) Image-based deep learning automated sorting of date fruit. Postharvest Biology and Technology 153, 133–141. DOI: 10.1016/j.postharvbio.2019.04.003. Osako, Y., Yamane, H., Lin, S.Y., Chen, P.A. and Tao, R. (2020) Cultivar discrimination of litchi fruit images using deep learning. Scientia Horticulturae 269, 109360. DOI: 10.1016/j.scienta.2020.109360. Ponce, J.M., Aquino, A. and Andujar, J.M. (2019) Olive-fruit variety classification by means of image processing and convolutional neural networks. IEEE Access: Practical Innovations, Open Solutions 7, 147629–147641. DOI: 10.1109/ACCESS.2019.2947160. Ramcharan, A., Baranowski, K., McCloskey, P., Ahmed, B., Legg, J. et al. (2017) Deep learning for image- based cassava disease detection. Frontiers in Plant Science 8, 1852. DOI: 10.3389/fpls.2017.01852. Russakovsky, O., Li, L.J. and Fei- Fei, L. (2015) Best of both worlds: human- machine collaboration for object annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2121–2131. Seeland, M., Rzanny, M., Boho, D., Wäldchen, J. and Mäder, P. (2019) Image-based classification of plant genus and family for trained and untrained plant species. BMC Bioinformatics 20(1), 4. Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D. et al. (2016) Grad-CAM: why did you say that? arXiv 1611.07450. Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L. et al. (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792), 706–710. Simonyan, K., Vedaldi, A. and Zisserman, A. (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv 1312.6034. 
Singh, A.K., Ganapathysubramanian, B., Sarkar, S. and Singh, A. (2018) Deep learning for plant stress phenotyping: trends and future perspectives. Trends in Plant Science 23(10), 883–898. Sladojevic, S., Arsenovic, M., Anderla, A., Culibrk, D. and Stefanovic, D. (2016) Deep neural networks based recognition of plant diseases by leaf image classification. Computational Intelligence and Neuroscience 2016(Article 3289801). DOI: 10.1155/2016/3289801. Springenberg, J.T., Dosovitskiy, A., Brox, T. and Riedmiller, M. (2014) Striving for simplicity: the all convolutional net. arXiv 1412.6806. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z. (2016) Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Las Vegas, NV, USA, June 27–30, 2016, pp. 2818–2826. DOI: 10.1109/CVPR.2016.308. Tan, M. and Le, Q.V. (2019) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv 1905.11946. Tian, Q., Zou, J., Tang, J., Fang, Y., Yu, Z. et al. (2019) MRCNN: a deep learning model for regression of genome- wide DNA methylation. BMC Genomics 20(Suppl. 2), 192. DOI: 10.1186/ s12864-019-5488-5. Ubbens, J., Cieslak, M., Prusinkiewicz, P. and Stavness, I. (2018) The use of plant models in deep learning: an application to leaf counting in rosette plants. Plant Methods 14(1), 6. DOI: 10.1186/ s13007-018-0273-z. Wang, G., Sun, Y. and Wang, J. (2017) Automatic image-based plant disease severity estimation using deep learning. Computational Intelligence and Neuroscience 2017, 2917536. DOI: 10.1155/2017/2917536. Wang, Z., Hu, M. and Zhai, G. (2018) Application of deep learning architectures for accurate and rapid detection of internal mechanical damage of blueberry using hyperspectral transmittance data. Sensors (Basel, Switzerland) 18(4), 1126. DOI: 10.3390/s18041126. Washburn, J.D., Mejia-Guerra, M.K., Ramstein, G., Kremling, K.A., Valluru, R. et al. 
(2019) Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences 116(12), 5542–5549. DOI: 10.1073/pnas.1814551116. Yamamoto, K., Togami, T. and Yamaguchi, N. (2017) Super-resolution of plant disease images for the acceleration of image-based phenotyping and vigor diagnosis in agriculture. Sensors (Basel, Switzerland) 17(11), 2557. DOI: 10.3390/s17112557. Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A. et al. (2019) A primer on deep learning in genomics. Nature Genetics 51(1), 12–18. DOI: 10.1038/s41588-018-0295-5.

18 Deep Learning in Plant Omics: Object Detection and Image Segmentation

Wei Guo1* and Akshay L. Chandra2
1Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan; 2Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, Kandi, India

Abstract Object detection and image segmentation are powerful computer vision techniques that have been widely used in plant phenotyping tasks, such as the identification of crop diseases/insects and the measurement/counting of organs (leaves, stems, fruits, etc.). This chapter summarizes recent object detection and image segmentation techniques and public datasets developed for plant phenotyping, discusses their current challenges, particularly focusing on model generalization capability, and offers a perspective on future approaches to make these techniques more helpful to the plant science community.

*Corresponding author: guowei@g.ecc.u-tokyo.ac.jp
© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0018

18.1 Introduction

Object detection and image segmentation are important research topics in computer vision and have been widely applied in real-world applications. Object detection refers to the task of locating objects in an image with bounding boxes, while also identifying their classes/types. Image segmentation is a much denser task, in which the model is expected to segment and assign a class to each pixel of the image. Figure 18.1 shows the differences between object detection and image segmentation.

Fig. 18.1. Visual illustration of the difference between object detection and image segmentation. (a) An object detection task. The blue boxes indicate bounding boxes identifying rice panicles. (b) An image segmentation task. The pink polygons indicate the exact regions of rice panicles.

Traditional non-deep learning object detection algorithms such as the Viola–Jones detector (Viola and Jones, 2021), the histogram of oriented gradients (HOG) detector (Dalal and Triggs, 2020), and the discriminatively trained part-based model (Felzenszwalb et al., 2009) have yielded remarkable results. However, the implementation of deep learning techniques is a key factor in ensuring the success of object detection. Deep learning algorithms for object detection can be categorized into two main types: two-stage and one-stage detectors. In two-stage detectors, the detection framework first defines an arbitrary number of region proposals (where an object might or might not be present). Subsequently, the framework uses simple object classification for each single-region proposal and regression to obtain precise bounding boxes. Though this framework achieved the highest accuracy when it was first implemented and still does so, the computing speed is slow owing to its two-stage nature (i.e., the generation of region proposals and the subsequent pruning stages mentioned above) (Girshick, 2015; Ren et al., 2015). Consequently, several studies have focused on the one-stage detectors. This framework directly predicts bounding boxes over images without the distinct steps of region-proposal generation and pruning. One-stage detectors innately prioritize inference speed at the cost of being slightly less accurate; hence, they are increasingly used in real-world and real-time applications. They are also effective for recognizing irregularly shaped objects or groups of small objects. The most popular one-stage detectors are YOLO (for “You Only Look Once”) (Redmon et al., 2016) and its latest variants (Redmon and Farhadi, 2018; Bochkovskiy et al., 2020; Wang et al., 2021a), the single-shot multibox detector (SSD) (Liu et al., 2016), and RetinaNet (Li et al., 2019). As a result of the open science strategy of the computer vision community, most popular deep learning-based object detection studies can be found online, for example at https://github.com/hoya012/deep_learning_object_detection (accessed 28 June 2022). For more details on object detection, see Jiao et al. (2019).

Image segmentation is a computer vision task that partitions an image into multiple segments at the pixel level. An obvious advantage of image segmentation over object detection is that one can gain a better understanding of the object of interest (shape, volume, pose, orientation, etc.). Although many traditional computer vision tools and techniques can perform image segmentation reasonably well (Stockman and Shapiro, 2001), we mainly focus on deep learning-based image segmentation studies hereafter. In this chapter, we divide image segmentation into the following three categories: (i) semantic segmentation; (ii) instance
segmentation; and (iii) panoptic segmentation. In semantic segmentation, the model assigns a class label to each pixel of the image (Hao et al., 2020). A typical semantic segmentation model architecture comprises a down-sampling stage followed by an up-sampling stage. Pooling and strided convolution operations are used in the down-sampling stage, while unpooling and transposed convolution operations are used in the up-sampling stage. In instance segmentation, the model treats multiple objects of the same class as distinct individual objects (or instances) (Ke et al., 2020). Instance segmentation is generally a more difficult task than semantic segmentation. Panoptic segmentation unifies instance segmentation and semantic segmentation into a single task in which the model assigns a class label to every pixel and additionally separates the distinct individual instances present in the image (Cheng et al., 2020). FastFCN (Wang and Ji, 2021), Mask R-CNN (He et al., 2017), Gated-SCNN (Takikawa et al., 2019), DeepLabV3 (Chen et al., 2018), and U-Net (Ronneberger et al., 2015) are popular frameworks for image segmentation.

To date, object detection and image segmentation have exploited the advances in deep learning. These tasks have been utilized in various real-world applications, such as the detection of traffic signals, signs, and road markings; pedestrian and vehicle detection in autonomous and assisted driving; the detection of nodules, tumors, lesions, microbleeds, and bone fractures in smart medical diagnostics; generic object detection/segmentation in intelligent robotics; and face recognition for automated closed-circuit television (CCTV) surveillance (Garcia-Garcia et al., 2018; Liu et al., 2020; Minaee et al., 2021). For more information, we refer readers to Hao et al. (2020).

Object detection and image segmentation have also been utilized in plant phenomics, such as crop disease diagnosis and organ detection/segmentation for high-throughput plant phenotyping (Chandra et al., 2020a; Jiang and Li, 2020). In this chapter, we introduce the latest object detection and image segmentation techniques and the public datasets developed for plant phenotyping. We also discuss current challenges in object detection and image segmentation for plant phenotyping.
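The bounding-box formulation and the pruning of overlapping candidates described in this introduction can be made concrete with a short sketch. It assumes that pruning is done by greedy non-maximum suppression (NMS) based on the IoU of box pairs, which is a common choice that the text above does not specify. The function names, the box format (x1, y1, x2, y2), and the toy boxes are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Toy example: two heavily overlapping candidates and one separate box.
candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.9, 0.8, 0.7]
kept = nms(candidates, confidences)
```

For densely packed targets such as panicles or wheat heads, the IoU threshold matters: a low value may suppress true neighboring objects, so it is usually tuned per dataset.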

18.2 Object Detection and Image Segmentation in Plant Phenomics

18.2.1 Object detection and its applications

Object detection plays an important role in plant phenotyping pipelines. The management of crops and weeds is crucial for farmers who expect a profitable yield, and several studies have focused on seedling detection and counting (Jiang et al., 2019; Oh et al., 2020; Samiei et al., 2020; Hasan et al., 2021; Perugachi-Diaz et al., 2021). Tree detection is also a critical step in remote sensing of forested landscapes (Onishi and Ise, 2021). Accurately identifying individual crowns in airborne imagery allows ecologists, foresters, and land managers to increase the number of sampling studies (compared with terrestrial surveys). Recently, Guo et al. (2020) proposed a framework that effectively eliminates weeds for individual plant segmentation and extracts complex traits such as volume and outline. They demonstrated the viability of their framework in the phenotyping of Helianthus tuberosus (Jerusalem artichoke), a herbaceous perennial plant species. Weinstein et al. (2020) proposed a single unified framework that can produce individual tree predictions in a variety of ecosystems. The robustness of the proposed approach is highlighted by the consistency of the results obtained on multiple forest-type datasets.

The detection of plant organs (leaves, stems, fruits, etc.) is another task in plant phenomics that has received considerable attention. Recent studies on the detection and counting of fruits (Afonso et al., 2020; Mu et al., 2020; Zhang et al., 2021), crop panicles (Desai et al., 2019; Ghosal et al., 2019; Chandra et al., 2020b; David et al., 2020), and leaves (Aich and Stavness, 2017; Itzhaky et al., 2018; Buzzy et al., 2020; Tu et al., 2021) have shown a trend toward improved detection models that are specifically tailored to plant phenomics tasks. Several datasets and methodologies have contributed to improved object detection in plant phenomics (Table 18.1). We also refer readers to a recent survey (Lu and Young, 2020) of publicly available, easy-to-access, vision-based plant datasets.

Plant diseases are not only a threat to food security on a global scale, but also have disastrous consequences for smallholder farmers whose livelihood depends on healthy crops. Various studies have addressed this problem by focusing on plant stress detection (Esgario et al., 2020), disease detection (Fuentes et al., 2017; Li et al., 2020; Zhang et al., 2020; Wang et al., 2021b), and insect and pest detection (Fuentes et al., 2017; Shen et al., 2018; Liu et al., 2019; Kim et al., 2020; Wang et al., 2021c). The availability of datasets (Table 18.1) has helped the research community build better disease-detection algorithms.

Table 18.1. Publicly available datasets for plant phenomics-related research.
Name | Target | Reference | Link
DeepFruits | Apple, avocado, capsicum, mango, orange, rockmelon, strawberry | Sa et al., 2016 | http://goo.gl/9LmmOU
ACFR Orchard Fruit Dataset | Almond, apple, mango | Bargoti and Underwood, 2017 | https://data.acfr.usyd.edu.au/ag/treecrops/2016-multifruit
MangoYOLO | Mango | Koirala et al., 2019 | http://hdl.cqu.edu.au/10018/1261224
MinneApple | Apple | Hani et al., 2020 | http://rsn.cs.umn.edu/index.php/MinneApple
Rosette plant or Arabidopsis | Rosette plants | Scharr et al., 2014 | https://www.plant-phenotyping.org/datasets-overview
KFuji RGB-DS | Apples | Gené-Mola et al., 2019 | http://www.grap.udl.cat/publicacions/datasets.html
Open Plant Phenotyping Database | 47 plant species | Madsen et al., 2020 | https://vision.eng.au.dk/open-plantphenotyping-database/
Global Wheat Head Detection (GWHD) | Wheat heads | David et al., 2020 | http://www.global-wheat.com/
PlantVillage | Diseases of multiple crops | Hughes and Salathé, 2015 | https://www.kaggle.com/emmarex/plantdisease
Rice Leaf Disease Dataset | Diseases of rice | Prajapati et al., 2017 | https://www.kaggle.com/vbookshelf/rice-leaf-diseases
Maize Disease Dataset | Diseases of maize | Wiesner-Hanks et al., 2018 | https://osf.io/p67rz/
Rice Leaf Disease Image Samples | Diseases of rice | Sethy et al., 2020 | https://data.mendeley.com/datasets/fwci7stb8r/1
PlantDoc | Diseases of multiple crops | Singh et al., 2019 | https://github.com/pratikkayal/PlantDoc-Dataset
Eden Library | Diseases, weeds, pests of multiple crops | Mylonas et al., 2022 | https://edenlibrary.ai/

18.2.2 Image segmentation and its applications

Prior to 2014, plant researchers used popular image processing tools, such as edge detection, thresholding, clustering, and graph partitioning, to segment plant organs (Gwo and Wei, 2013; Mizushima and Lu, 2013; Pereira et al., 2021). With the success of deep neural networks, end-to-end learnable intelligent segmentation models have become increasingly prevalent in this research field. Image segmentation for plant disease enables localization of disease-infected organs (Ma et al., 2019; Lakshmi and Nickolas, 2016; Akanksha et al., 2021; Chouhan et al., 2021). To this end, Zhong et al. (2021) provided a multi-modal plant disease image dataset (PDID) with classification, detection, and segmentation labels. The authors also proposed a triple stream segmentation network for robust plant disease segmentation.

Seed features such as shape and size are indicators of environmental stress and can predict germination rates and the subsequent development of plants. Traditionally, these features have been measured using Vernier calipers, manual annotation, or complex image-processing machinery. Automatic seed segmentation reduces this labor-intensive task to computer vision tasks that are often learnable. Makanza et al. (2018) proposed a robust ear phenotyping and kernel weight estimation method for maize using high-resolution images. Toda et al. (2020) achieved high-throughput quantification of seed morphology in real-world barley seed images by employing deep neural networks and synthetic seed images.

Although plant organ segmentation is a complex task, it is an unavoidable prerequisite for the extraction of a wide range of complex phenotypic traits (Scharr et al., 2014; Bargoti and Underwood, 2017; Hani et al., 2020; Kang and Chen, 2020; Ni et al., 2020). Various studies have incorporated leaf counting using a leaf segmentation map as model input (Aich and Stavness, 2017; Xu et al., 2018; Kumar and Domnic, 2019; Gomes and Zheng, 2020).


Itakura and Hosoi (2018) and Lüling et al. (2021) used segmentation maps to quantify leaf area and inclination angle. Choudhury et al. (2021) introduced an algorithm that uses plant segmentation masks to compute the stem angle, which is a potential measure of a plant’s susceptibility to lodging.
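As a simple illustration of how traits can be read from a segmentation map, the sketch below counts leaves and estimates per-leaf area from an instance label map. It assumes a top-down image with a known, uniform scale (mm_per_pixel) and a label map in which 0 is background and each positive integer marks one leaf. The function name and the toy map are hypothetical and do not reproduce the methods of the studies cited above; traits such as inclination or stem angle require additional geometric information.

```python
import numpy as np


def leaf_traits(instance_map, mm_per_pixel):
    """Leaf count and projected area per leaf from an instance label map.

    instance_map: 2-D integer array, 0 = background, k > 0 = pixels of leaf k
    mm_per_pixel: side length of one pixel in millimeters (assumed uniform)
    """
    ids, pixel_counts = np.unique(instance_map[instance_map > 0],
                                  return_counts=True)
    areas = pixel_counts * mm_per_pixel ** 2      # pixel count -> square mm
    return {"leaf_count": int(ids.size),
            "area_mm2": dict(zip(ids.tolist(), areas.tolist()))}


# Toy example: two leaves on a 3 x 3 grid imaged at 0.5 mm per pixel.
toy_map = np.array([[0, 1, 1],
                    [0, 2, 2],
                    [0, 2, 0]])
traits = leaf_traits(toy_map, mm_per_pixel=0.5)
```

A semantic (class-only) mask would give the total leaf area in the same way, but separating touching leaves for counting requires instance-level output.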

18.3 Current Challenges of Object Detection and Image Segmentation for Plant Phenomics

Although object detection and image segmentation tasks (coupled with deep learning) are crucial for efficient plant phenotyping, they have certain unavoidable drawbacks. Most state-of-the-art deep learning algorithms generalize well only to large-scale datasets. However, phenotyping tasks are often subjected to specific environmental conditions and target species; therefore, acquiring, annotating, and maintaining large datasets under such conditions is difficult and often expensive.

18.3.1 Data annotation cost

Image labeling is typically a laborious process. Su et al. (2012) reported that the median of total time taken by annotators to label a single bounding box (inclusive of quality and coverage checks) is approximately 42 s. In the case of segmentation, Cordts et al. (2020) stated that annotators required approximately 90 min to label a single image in their dataset. Severe additional increases in labeling costs occur in the context of plant phenotyping datasets, where object counts are often high (David et al., 2020). This problem is further complicated by the demands and costs of domain experts who must guide the labeling process first-hand to ensure accurate and reliable plant phenotyping datasets. Plant researchers have been considering active learning, generative modeling, and synthetic data to address this problem.

Active learning, which is a subfield of machine learning, attempts to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In object detection and image segmentation tasks, it allows the trained model to select the best samples for labeling in the next round. This “active” approach to labeling has been shown to reduce labeled data requirements while obtaining models with high generalizability (Settles, 2009; Gal et al., 2017; Sener and Savarese, 2018). The efficiency of applying active learning to plant datasets has been demonstrated in various studies. Desai et al. (2019) proposed a method that allows object detection models to query either weak labels (a single-center dot representing an object) or strong labels (a full bounding box surrounding an object). This active switching between weak and strong labels allowed for a reduction in annotation times when applied to a wheat-ear detection dataset. The same authors proposed another weak supervision-based approach (Chandra et al., 2020b) particularly designed for plant datasets. In this active learning approach, the model constantly interacts with a human annotator by iteratively querying strong labels for only the most informative images and using weak labels for others. This allowed for a consistent reduction of over 50% in the annotation times for the sorghum head and wheat spike detection datasets. More recently, Nagasubramanian et al. (2020) emphasized the effectiveness of active learning when applied to two popular plant phenotype classification datasets, the soybean stress dataset (Ghosal et al., 2018) and weed species dataset (Olsen et al., 2019).

18.3.2 Generalization capability of current deep learning models

Most deep learning-based plant organ detection and image segmentation studies have been performed with the help of task-specific, fully supervised plant datasets. This has resulted in trained models that work properly on the given datasets, but do not transfer correctly to other domains, other plant species, or even other datasets of the same plant species. Even a slight change in the data distribution could lead to incorrect inferences from the model. Data augmentation is a common and highly effective technique used to increase the size of a training dataset and to make the model more generalizable and robust.
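A minimal sketch of such simple augmentations for a segmentation dataset is given below. It assumes square images stored as floating-point arrays in [0, 1] with a matching integer mask, and it applies a random horizontal flip, a random multiple-of-90-degree rotation, and a brightness gain. The function name and the parameter ranges are hypothetical choices for illustration only.

```python
import numpy as np


def augment(image, mask, rng):
    """Apply the same random geometric change to image and mask, then jitter
    the image brightness.

    image: (H, W, 3) float array with values in [0, 1]
    mask:  (H, W) integer label map aligned with the image
    rng:   numpy random Generator
    """
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                  # 0, 90, 180, or 270 degrees
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k)
    gain = rng.uniform(0.8, 1.2)                 # brightness jitter, image only
    image = np.clip(image * gain, 0.0, 1.0)
    return image.copy(), mask.copy()


# Toy example on random data standing in for a plant image and its mask.
rng = np.random.default_rng(0)
toy_image = rng.random((8, 8, 3))
toy_mask = (rng.random((8, 8)) > 0.5).astype(int)
aug_image, aug_mask = augment(toy_image, toy_mask, rng)
```

The geometric operations are applied identically to the image and the mask so that the labels stay aligned, whereas the color change touches only the image.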

However, simple transformations such as rotation, scaling, and color jitter are limited by the variability of the available training set; that is, plant attributes that are absent in the training set are also absent in the augmentations. To overcome this limitation, several studies have proposed data augmentation techniques that exploit recent advances in image generation. Giuffrida et al. (2017) used generative adversarial networks (GANs) (Goodfellow et al., 2014) and proposed the ARIGAN model to generate Arabidopsis plant images with specific traits (e.g., plants with more than seven leaves) that occur less frequently in the original dataset. Recent studies (Zhu et al., 2018; Kuznichov et al., 2019) have used advanced variants of GANs to generate realistic plant images together with the corresponding leaf segmentation. This procedure increased the accuracy of the leaf counting models. Cap et al. (2020) proposed LeafGAN, an image-to-image translation model that generates leaf images with various plant diseases and substantially boosts diagnostic performance. Similarly, Shete et al. (2020) proposed a modified generator architecture called TasselGAN for maize tassel generation. In a more extreme case, Cicco et al. (2017) generated super-realistic sugar beet crop/weed detection datasets by creating an entirely virtual 3D environment. The authors procedurally generated large synthetic training datasets by randomizing the key features of the target environment (i.e., crop and weed species, soil type, and light conditions) that could be used directly for training and inference on a real-world dataset.

In the context of plant datasets, a distribution change or a domain shift between two domains can degrade performance owing to different factors (e.g., illumination, pose, image quality, and other environmental factors).
This problem forces plant researchers to acquire, maintain, and label datasets for each domain or species, and for each plant phenotyping task. However, this approach is time-consuming, labor-intensive, and undesirable. Ideally, a learned model should adapt easily to new, unseen datasets. Domain adaptation is a specific type of transfer learning that aims to achieve this goal. Transfer learning refers to learning problems in which the tasks and/or domains may change between the source and target, whereas in domain adaptation, only the domains can differ, and the tasks remain unchanged. Giuffrida et al. (2019) first used domain adaptation to perform leaf counting in the target domain without the need for explicit annotations. This objective was achieved by minimizing the distance between the counting distributions on the source domain (where labels were available) and the target domain (where labels were unavailable). Similarly, Ayalew et al. (2020) proposed a domain-adversarial neural network that adapts models trained on an indoor wheat dataset, ACID (Pound et al., 2017), to the task of spikelet counting in the Global Wheat Head Detection (GWHD) dataset (David et al., 2020) and the CropQuant dataset (Zhou et al., 2017), where the camera angle, background, and illumination were completely different.

Although model adaptation is desirable, translating plant images from one domain to another can also produce satisfactory results, and it yields a new dataset at almost no labeling cost because the labels can often be transferred along with the images. Owing to advancements in image translation technology, plant researchers can now generate realistic translated datasets. Gogoll et al. (2020) used the CycleGAN network and created a framework that allowed the translation of images from a source domain to new environments, different value crops, and other farm robots. Their system yielded high segmentation performance in the target domain by exploiting labels only from the source domain. Zhang et al. (2021) also used the CycleGAN network to translate orange orchard images into unlabeled apple and tomato image datasets. This line of research has considerable scope, because new domains or species can be generated with little to no label information.
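As a rough illustration of what "minimizing the distance between distributions" on the source and target domains involves, the sketch below measures such a distance between two sets of feature vectors. It is not the adversarial objective used in the studies cited above. A kernel-based measure, the maximum mean discrepancy (MMD), is swapped in because it can be written in a few lines. The function names, the Gaussian toy features, and the kernel width `gamma` are all assumptions made for this example.

```python
import math
import random


def rbf(x, y, gamma):
    """RBF kernel between two feature vectors (sequences of floats)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(source, target, gamma=0.5):
    """Biased estimate of the squared MMD between two samples.

    Values near zero suggest similar feature distributions;
    larger values suggest a domain shift.
    """
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


def sample(n, dim, mean, rng):
    """Hypothetical stand-in for image features: n Gaussian vectors."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    indoor_a = sample(100, 4, 0.0, rng)   # source domain
    indoor_b = sample(100, 4, 0.0, rng)   # same conditions, new images
    field = sample(100, 4, 1.5, rng)      # shifted conditions
    print("same domain   :", mmd2(indoor_a, indoor_b))
    print("shifted domain:", mmd2(indoor_a, field))
```

In a domain adaptation setting, a quantity of this kind would be computed on features produced by the network and added to the training loss, so that the source and target features are pushed toward each other while the task loss is computed only on the labeled source data.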

18.4 Conclusion and Future Perspective

Object detection and image segmentation in conjunction with deep learning have contributed successfully to the development of high-throughput plant phenotyping methods. With the growing demand for efficient plant phenotyping methods, object detection and image segmentation tasks in plant science are being channeled in various directions, such as active learning, generation of synthetic data, domain adaptation, and image translation. Further growth in computer vision technology will allow plant researchers to swiftly adopt state-of-the-art methods in either a “plug and play” or a “modify and adapt” manner. These factors will lead to further advances in plant phenomics. However, one problem not covered in this chapter is that current applications of deep learning accept its black-box nature, so the internal logic of the networks is not easy to understand. This lack of explainability must be overcome in cases such as decision-making.

References

Afonso, M., Fonteijn, H., Fiorentin, F.S., Lensink, D., Mooij, M. et al. (2020) Tomato fruit detection and counting in greenhouses using deep learning. Frontiers in Plant Science 11, 1759. DOI: 10.3389/fpls.2020.571299.
Aich, S. and Stavness, I. (2017) Leaf counting with deep convolutional and deconvolutional networks. In: 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 2080–2089. DOI: 10.1109/ICCVW.2017.244.
Akanksha, E., Sharma, N. and Gulati, K. (2021) OPNN: Optimized Probabilistic Neural Network based automatic detection of maize plant disease detection. In: 2021 6th International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, pp. 1322–1328. DOI: 10.1109/ICICT50816.2021.9358763.
Ayalew, T., Ubbens, J. and Stavness, I. (2020) Unsupervised domain adaptation for plant organ counting. In: Bartoli, A. and Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. Springer, Cham, Switzerland, pp. 330–346. DOI: 10.1007/978-3-030-65414-6_23.
Bargoti, S. and Underwood, J. (2017) Deep fruit detection in orchards. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, Singapore, pp. 3626–3633. DOI: 10.1109/ICRA.2017.7989417.
Bochkovskiy, A., Wang, C.-Y. and Liao, H. (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv 2004.10934.
Buzzy, M., Thesma, V., Davoodi, M. and Mohammadpour Velni, J. (2020) Real-time plant leaf counting using deep object detection networks. Sensors (Basel, Switzerland) 20(23), 6896. DOI: 10.3390/s20236896.
Cap, Q.H., Uga, H., Kagiwada, S. and Iyatomi, H. (2020) LeafGAN: an effective data augmentation method for practical plant disease diagnosis. arXiv 2002.10100.
Chandra, A.L., Desai, S.V., Guo, W. and Balasubramanian, V.N. (2020a) Computer vision with deep learning for plant phenotyping in agriculture: a survey. arXiv 2006.11391.
Chandra, A.L., Desai, S.V., Balasubramanian, V.N., Ninomiya, S. and Guo, W. (2020b) Active learning with point supervision for cost-effective panicle detection in cereal crops. Plant Methods 16, 1–16. DOI: 10.1186/s13007-020-00575-8.
Chen, L., Zhu, Y., Papandreou, G., Schroff, F. and Adam, H. (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv 1802.02611.
Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S. et al. (2020) Panoptic-DeepLab: a simple, strong, and fast baseline for bottom-up panoptic segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 12472–12482. DOI: 10.1109/CVPR42600.2020.01249.
Choudhury, S.D., Goswami, S., Bashyam, S., Awada, T. and Samal, A. (2021) Automated stem angle determination for temporal plant phenotyping analysis. In: 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 2022–2029. DOI: 10.1109/ICCVW.2017.237.
Chouhan, S.S., Singh, U., Sharma, U. and Jain, S. (2021) Leaf disease segmentation and classification of Jatropha curcas L. and Pongamia pinnata L. biofuel plants using computer vision based approaches. Measurement 171, 108796. DOI: 10.1016/j.measurement.2020.108796.
Cicco, M.D., Potena, C., Grisetti, G. and Pretto, A. (2017) Automatic model based dataset generation for fast and accurate crop and weeds detection. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 5188–5195. DOI: 10.1109/IROS.2017.8206408.


Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M. et al. (2020) The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 3213–3223. DOI: 10.1109/CVPR.2016.350.
Dalal, N. and Triggs, B. (2020) Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, pp. 886–893. DOI: 10.1109/CVPR.2005.177.
David, E., Madec, S., Sadeghi-Tehran, P., Aasen, H., Zheng, B. et al. (2020) Global Wheat Head Detection (GWHD) Dataset: a large and diverse dataset of high-resolution RGB-labelled images to develop and benchmark wheat head detection methods. Plant Phenomics (Washington, D.C.) 2020, 3521852. DOI: 10.34133/2020/3521852.
Desai, S.V., Chandra, A.L., Guo, W., Ninomiya, S. and Balasubramanian, V.N. (2019) An adaptive supervision framework for active learning in object detection. arXiv 1908.02454.
Esgario, J.G.M., Krohling, R.A. and Ventura, J.A. (2020) Deep learning for classification and severity estimation of coffee leaf biotic stress. Computers and Electronics in Agriculture 169, 105162. DOI: 10.1016/j.compag.2019.105162.
Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A. and Ramanan, D. (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), 1627–1645. DOI: 10.1109/TPAMI.2009.167.
Fuentes, A., Yoon, S., Kim, S.C. and Park, D.S. (2017) A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors (Basel, Switzerland) 17(9), E2022. DOI: 10.3390/s17092022.
Gal, Y., Islam, R. and Ghahramani, Z. (2017) Deep Bayesian active learning with image data. arXiv 1703.02910.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-Gonzalez, P. et al. (2018) A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing 70, 41–65. DOI: 10.1016/j.asoc.2018.05.018.
Gené-Mola, J., Vilaplana, V., Rosell-Polo, J.R., Morros, J.-R., Ruiz-Hidalgo, J. et al. (2019) KFuji RGB-DS database: Fuji apple multi-modal images for fruit detection with color, depth and range-corrected IR data. Data in Brief 25, 104289. DOI: 10.1016/j.dib.2019.104289.
Ghosal, S., Blystone, D., Singh, A.K., Ganapathysubramanian, B., Singh, A. et al. (2018) An explainable deep machine vision framework for plant stress phenotyping. Proceedings of the National Academy of Sciences 115(18), 4613–4618. DOI: 10.1073/pnas.1716999115.
Ghosal, S., Zheng, B., Chapman, S.C., Potgieter, A.B., Jordan, D.R. et al. (2019) A weakly supervised deep learning framework for sorghum head detection and counting. Plant Phenomics (Washington, D.C.) 2019, 1525874. DOI: 10.34133/2019/1525874.
Girshick, R.B. (2015) Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 1440–1448. DOI: 10.1109/ICCV.2015.169.
Giuffrida, M.V., Scharr, H. and Tsaftaris, S.A. (2017) ARIGAN: synthetic Arabidopsis plants using generative adversarial network. In: 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, Italy, pp. 2064–2071. DOI: 10.1109/ICCVW.2017.242.
Giuffrida, M.V., Dobrescu, A., Doerner, P. and Tsaftaris, S.A. (2019) Leaf counting without annotations using adversarial unsupervised domain adaptation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, pp. 2590–2599. DOI: 10.1109/CVPRW.2019.00315.
Gogoll, D., Lottes, P., Weyler, J., Petrinic, N. and Stachniss, C. (2020) Unsupervised domain adaptation for transferring plant classification systems to new field environments, crops, and robots. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, pp. 2636–2642. DOI: 10.1109/IROS45743.2020.9341277.
Gomes, D. and Zheng, L. (2020) Leaf segmentation and counting with deep learning: on model certainty, test-time augmentation, trade-offs. arXiv 2012.11486.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D. et al. (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27, 2672–2680.
Guo, W., Fukano, Y., Noshita, K. and Ninomiya, S. (2020) Field-based individual plant phenotyping of herbaceous species by unmanned aerial vehicle. Ecology and Evolution 10(21), 12318–12326. DOI: 10.1002/ece3.6861.


W. Guo and A.L. Chandra

Gwo, C.-Y. and Wei, C.-H. (2013) Plant identification through images: using feature extraction of key points on leaf contours. Applications in Plant Sciences 1(11), apps.1200005. DOI: 10.3732/apps.1200005.
Hani, N., Roy, P. and Isler, V. (2020) MinneApple: a benchmark dataset for apple detection and segmentation. IEEE Robotics and Automation Letters 5(2), 852–858. DOI: 10.1109/LRA.2020.2965061.
Hao, S., Zhou, Y. and Guo, Y. (2020) A brief survey on semantic segmentation with deep learning. Neurocomputing 406, 302–321. DOI: 10.1016/j.neucom.2019.11.118.
Hasan, A.S.M.M., Sohel, F., Diepeveen, D., Laga, H. and Jones, M.G.K. (2021) A survey of deep learning techniques for weed detection from images. Computers and Electronics in Agriculture 184, 106067. DOI: 10.1016/j.compag.2021.106067.
He, K., Gkioxari, G., Dollar, P. and Girshick, R. (2017) Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2980–2988. DOI: 10.1109/ICCV.2017.322.
Hughes, D.P. and Salathé, M. (2015) An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. arXiv 1511.08060.
Itakura, K. and Hosoi, F. (2018) Automatic leaf segmentation for estimating leaf area and leaf inclination angle in 3D plant images. Sensors (Basel, Switzerland) 18(10), 3576. DOI: 10.3390/s18103576.
Itzhaky, Y., Farjon, G., Khoroshevsky, F., Shpigler, A. and Bar-Hillel, A. (2018) Leaf counting: multiple scale regression and detection using deep CNNs. In: Proceedings of the BMVC, Newcastle, UK, p. 328.
Jiang, Y. and Li, C. (2020) Convolutional neural networks for image-based high-throughput plant phenotyping: a review. Plant Phenomics (Washington, D.C.) 2020, 4152816. DOI: 10.34133/2020/4152816.
Jiang, Y., Li, C., Paterson, A.H. and Robertson, J.S. (2019) DeepSeedling: deep convolutional network and Kalman filter for plant seedling detection and counting in the field. Plant Methods 15, 141. DOI: 10.1186/s13007-019-0528-3.
Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L. et al. (2019) A survey of deep learning-based object detection. IEEE Access: Practical Innovations, Open Solutions 7, 128837–128868. DOI: 10.1109/ACCESS.2019.2939201.
Kang, H. and Chen, C. (2020) Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Computers and Electronics in Agriculture 171, 105302. DOI: 10.1016/j.compag.2020.105302.
Ke, L., Tai, Y.W. and Tang, C.K. (2020) Deep occlusion-aware instance segmentation with overlapping biLayers. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 4018–4027. DOI: 10.1109/CVPR46437.2021.00401.
Kim, W.-S., Lee, D.-H. and Kim, Y.-J. (2020) Machine vision-based automatic disease symptom detection of onion downy mildew. Computers and Electronics in Agriculture 168, 105099. DOI: 10.1016/j.compag.2019.105099.
Koirala, A., Walsh, K.B., Wang, Z. and McCarthy, C. (2019) Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of ‘MangoYOLO’. Precision Agriculture 20(6), 1107–1135. DOI: 10.1007/s11119-019-09642-0.
Kumar, J.P. and Domnic, S. (2019) Image based leaf segmentation and counting in rosette plants. Information Processing in Agriculture 6(2), 233–246. DOI: 10.1016/j.inpa.2018.09.005.
Kuznichov, D., Zvirin, A., Honen, Y. and Kimmel, R. (2019) Data augmentation for leaf segmentation and counting tasks in rosette plants. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, pp. 2580–2589. DOI: 10.1109/CVPRW.2019.00314.
Lakshmi, R.K. and Nickolas, S. (2016) Deep learning based betelvine leaf disease detection (Piper Betle L.). In: 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, pp. 215–219. DOI: 10.1109/ICCCA49541.2020.9250911.
Li, D., Wang, R., Xie, C., Liu, L., Zhang, J. et al. (2020) A recognition method for rice plant diseases and pests video detection based on deep convolutional neural network. Sensors 20(3), 578. DOI: 10.3390/s20030578.
Li, Y., Dua, A. and Ren, F. (2019) Light-weight retinanet for object detection on edge devices. In: 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA. DOI: 10.1109/WF-IoT48130.2020.9221150.
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J. et al. (2020) Deep learning for generic object detection: a survey. International Journal of Computer Vision 128(2), 261–318. DOI: 10.1007/s11263-019-01247-4.


Liu, L., Wang, R., Xie, C., Yang, P., Wang, F. et al. (2019) PestNet: an end-to-end deep learning approach for large-scale multi-class pest detection and classification. IEEE Access: Practical Innovations, Open Solutions 7, 45301–45312. DOI: 10.1109/ACCESS.2019.2909522.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E. et al. (2016) SSD: Single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N. and Welling, M. (eds) Computer Vision – ECCV 2016. Springer, Cham, Switzerland, pp. 21–37. DOI: 10.1007/978-3-319-46448-0_2.
Lüling, N., Reiser, D. and Griepentrog, H.W. (2021) 86. Volume and leaf area calculation of cabbage with a neural network-based instance segmentation. In: Stafford, J.V. (ed.), 13th European Conference on Precision Agriculture, Wageningen Academic Publishers, Budapest, Hungary, pp. 719–726. DOI: 10.3920/978-90-8686-916-9_86.
Lu, Y. and Young, S. (2020) A survey of public datasets for computer vision tasks in precision agriculture. Computers and Electronics in Agriculture 178, 105760. DOI: 10.1016/j.compag.2020.105760.
Ma, J., Du, K., Zheng, F., Zhang, L. and Sun, Z. (2019) A segmentation method for processing greenhouse vegetable foliar disease symptom images. Information Processing in Agriculture 6(2), 216–223. DOI: 10.1016/j.inpa.2018.08.010.
Madsen, S.L., Mathiassen, S.K., Dyrmann, M., Laursen, M.S., Paz, L.-C. et al. (2020) Open plant phenotype database of common weeds in Denmark. Remote Sensing 12(8), 1246. DOI: 10.3390/rs12081246.
Makanza, R., Zaman-Allah, M., Cairns, J.E., Eyre, J., Burgueño, J. et al. (2018) High-throughput method for ear phenotyping and kernel weight estimation in maize using ear digital imaging. Plant Methods 14, 49. DOI: 10.1186/s13007-018-0317-4.
Minaee, S., Boykov, Y.Y., Porikli, F., Plaza, A.J., Kehtarnavaz, N. et al. (2021) Image segmentation using deep learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1–1. DOI: 10.1109/TPAMI.2021.3059968.
Mizushima, A. and Lu, R. (2013) An image segmentation method for apple sorting and grading using support vector machine and Otsu’s method. Computers and Electronics in Agriculture 94, 29–37. DOI: 10.1016/j.compag.2013.02.009.
Mu, Y., Chen, T.S., Ninomiya, S. and Guo, W. (2020) Intact detection of highly occluded immature tomatoes on plants using deep learning techniques. Sensors (Basel, Switzerland) 20(10), 2984. DOI: 10.3390/s20102984.
Mylonas, N., Malounas, I., Mouseti, S., Vali, E., Espejo-Garcia, B. et al. (2022) Eden Library: A long-term database for storing agricultural multi-sensor datasets from UAV and proximal platforms. Smart Agricultural Technology 2, 100028. DOI: 10.1016/j.atech.2021.100028.
Nagasubramanian, K., Jubery, T., Fotouhi Ardakani, F., Mirnezami, S.V., Singh, A.K. et al. (2020) How useful is active learning for image-based plant phenotyping? The Plant Phenome Journal 4(1). DOI: 10.1002/ppj2.20020.
Ni, X., Li, C., Jiang, H. and Takeda, F. (2020) Deep learning image segmentation and extraction of blueberry fruit traits associated with harvestability and yield. Horticulture Research 7, 110. DOI: 10.1038/s41438-020-0323-3.
Oh, S., Chang, A., Ashapure, A., Jung, J., Dube, N. et al. (2020) Plant counting of cotton from UAS imagery using deep learning-based object detection framework. Remote Sensing 12(18), 2981. DOI: 10.3390/rs12182981.
Olsen, A., Konovalov, D.A., Philippa, B., Ridd, P., Wood, J.C. et al. (2019) DeepWeeds: a multiclass weed species image dataset for deep learning. Scientific Reports 9(1), 1–12. DOI: 10.1038/s41598-018-38343-3.
Onishi, M. and Ise, T. (2021) Explainable identification and mapping of trees using UAV RGB image and deep learning. Scientific Reports 11, 903. DOI: 10.1038/s41598-020-79653-9.
Pereira, C.S., Morais, R. and Reis, M.J.C.S. (2021) Recent advances in image processing techniques for automated harvesting purposes: A review. In: 2017 Intelligent Systems Conference (IntelliSys), London, pp. 566–575. DOI: 10.1109/IntelliSys.2017.8324352.
Perugachi-Diaz, Y., Tomczak, J.M. and Bhulai, S. (2021) Deep learning for white cabbage seedling prediction. Computers and Electronics in Agriculture 184, 106059. DOI: 10.1016/j.compag.2021.106059.
Pound, M.P., Atkinson, J.A., Wells, D.M., Pridmore, T.P. and French, A.P. (2017) Deep learning for multi-task plant phenotyping. BioRxiv. DOI: 10.1101/204552.
Prajapati, H.B., Shah, J.P. and Dabhi, V.K. (2017) Detection and classification of rice plant diseases. Intelligent Decision Technologies 11(3), 357–373. DOI: 10.3233/IDT-170301.
Redmon, J. and Farhadi, A. (2018) YOLOv3: an incremental improvement. arXiv 1804.02767.


Redmon, J., Divvala, S., Girshick, R.B. and Farhadi, A. (2016) You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 779–788. DOI: 10.1109/CVPR.2016.91.
Ren, S., He, K., Girshick, R.B. and Sun, J. (2015) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149. DOI: 10.1109/TPAMI.2016.2577031.
Ronneberger, O., Fischer, P. and Brox, T. (2015) U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W. and Frangi, A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer, Cham, Switzerland, pp. 234–241. DOI: 10.1007/978-3-319-24574-4_28.
Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T. et al. (2016) DeepFruits: a fruit detection system using deep neural networks. Sensors 16(8), 1222. DOI: 10.3390/s16081222.
Samiei, S., Rasti, P., Ly Vu, J., Buitink, J. and Rousseau, D. (2020) Deep learning-based detection of seedling development. Plant Methods 16, 103. DOI: 10.1186/s13007-020-00647-9.
Scharr, H., Minervini, M., Fischbach, A. and Tsaftaris, S. (2014) Annotated image datasets of rosette plants. Available at: http://juser.fz-juelich.de/record/154525 (accessed 26 May 2022).
Sener, O. and Savarese, S. (2018) Active learning for convolutional neural networks: a core-set approach. arXiv 1708.00489.
Sethy, P.K., Barpanda, N.K., Rath, A.K. and Behera, S.K. (2020) Deep feature based rice leaf disease identification using support vector machine. Computers and Electronics in Agriculture 175, 105527. DOI: 10.1016/j.compag.2020.105527.
Settles, B. (2009) Active Learning Literature Survey. Available at: http://digital.library.wisc.edu/1793/60660 (accessed 26 May 2022).
Shen, Y., Zhou, H., Li, J., Jian, F. and Jayas, D.S. (2018) Detection of stored-grain insects using deep learning. Computers and Electronics in Agriculture 145, 319–325. DOI: 10.1016/j.compag.2017.11.039.
Shete, S., Srinivasan, S. and Gonsalves, T.A. (2020) TasselGAN: an application of the generative adversarial model for creating field-based maize tassel data. Plant Phenomics (Washington, D.C.) 2020, 8309605. DOI: 10.34133/2020/8309605.
Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S. et al. (2019) PlantDoc: a dataset for visual plant disease detection. arXiv 1911.10317. DOI: 10.1145/3371158.3371196.
Stockman, G. and Shapiro, L. (2001) Computer Vision, 1st edn. Prentice Hall PTR, Hoboken, New Jersey.
Su, H., Deng, J. and Fei-Fei, L. (2012) Crowdsourcing annotations for visual object detection. In: Human Computation: Papers from the 2012 AAAI Workshop, Technical Report. Vol. WS-12-08, pp. 40–46.
Takikawa, T., Acuna, D., Jampani, V. and Fidler, S. (2019) Gated-SCNN: gated Shape CNNs for semantic segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 5228–5237. DOI: 10.1109/ICCV.2019.00533.
Toda, Y., Okura, F., Ito, J., Okada, S., Kinoshita, T. et al. (2020) Training instance segmentation neural network with synthetic datasets for crop seed phenotyping. Communications Biology 3(1), 173. DOI: 10.1038/s42003-020-0905-5.
Tu, Y.-L., Lin, W.-Y. and Lin, Y.-C. (2021) Automatic leaf counting using improved YOLOv3. In: 2020 International Symposium on Computer, Consumer and Control (IS3C), Taichung City, Taiwan, pp. 197–200. DOI: 10.1109/IS3C50286.2020.00058.
Viola, P.A. and Jones, M.J. (2021) Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA. DOI: 10.1109/CVPR.2001.990517.
Wang, C., Yeh, I. and Liao, H. (2021a) You only learn one representation: unified network for multiple tasks. arXiv 2105.04206.
Wang, J., Yang, J., Yu, L., Dong, H., Yun, K. et al. (2021b) DBA_SSD: a novel end-to-end object detection using deep attention module for helping smart device with vegetable and fruit leaf plant disease detection. [In Review]. DOI: 10.21203/rs.3.rs-166579/v1.
Wang, R., Liu, L., Xie, C., Yang, P., Li, R. et al. (2021c) AgriPest: a large-scale domain-specific benchmark dataset for practical agricultural pest detection in the wild. Sensors 21(5), 1601. DOI: 10.3390/s21051601.
Wang, Z. and Ji, S. (2021) Smoothed dilated convolutions for improved dense prediction. Data Mining and Knowledge Discovery 35(4), 1470–1496. DOI: 10.1007/s10618-021-00765-5.


Weinstein, B.G., Marconi, S., Bohlman, S.A., Zare, A. and White, E.P. (2020) Cross-site learning in deep learning RGB tree crown detection. Ecological Informatics 56, 101061. DOI: 10.1016/j.ecoinf.2020.101061.
Wiesner-Hanks, T., Stewart, E.L., Kaczmar, N., DeChant, C., Wu, H. et al. (2018) Image set for deep learning: field images of maize annotated with disease symptoms. BMC Research Notes 11(1), 440. DOI: 10.1186/s13104-018-3548-6.
Xu, L., Li, Y., Sun, Y., Song, L. and Jin, S. (2018) Leaf instance segmentation and counting based on deep object detection and segmentation networks. In: 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), Toyama, Japan, pp. 180–185. DOI: 10.1109/SCIS-ISIS.2018.00038.
Zhang, W., Chen, K., Wang, J., Shi, Y. and Guo, W. (2021) Easy domain adaptation method for filling the species gap in deep learning-based fruit detection. Horticulture Research 8, 119. DOI: 10.1038/s41438-021-00553-8.
Zhang, Y., Song, C. and Zhang, D. (2020) Deep learning-based object detection improvement for tomato disease. IEEE Access: Practical Innovations, Open Solutions 8, 56607–56614. DOI: 10.1109/ACCESS.2020.2982456.
Zhong, C., Hu, Z., Yang, X., Li, H., Liu, F. et al. (2021) Triple stream segmentation network for plant disease segmentation. In: 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, pp. 496–501. DOI: 10.1109/IAEAC50856.2021.9390933.
Zhou, J., Reynolds, D., Cornu, T.L., Websdale, D., Orford, S. et al. (2017) CropQuant: an automated and scalable field phenotyping platform for crop monitoring and trait measurements to facilitate breeding and digital agriculture. [bioRxiv]. DOI: 10.1101/161547.
Zhu, Y., Aoun, M., Krijn, M., Vanschoren, J. and Campus, H.T. (2018) Data augmentation using conditional generative adversarial networks for leaf counting in Arabidopsis plants. In: Computer Vision Problems in Plant Phenotyping (CVPPP2018). Available at: http://bmvc2018.org/contents/workshops/cvppp2018/0014.pdf (accessed 25 June 2022).

19

Plant Experimental Resources

Masatomo Kobayashi* RIKEN BioResource Research Center, Tsukuba, Ibaraki, Japan

Abstract

Experimental resources of various plant species are provided from the core facilities of the National BioResource Project (NBRP) of Japan. Among them, Arabidopsis thaliana is a well-known model plant widely used in the international research community. Arabidopsis resources are maintained and distributed through collaboration with centers in the USA and UK. The project also preserves and distributes biological materials of crop species that are indispensable for human beings, such as rice, wheat, tomato, and legumes. The aim of the NBRP is to promote life science research by providing resources of the highest quality level to ensure the reproducibility of research. The NBRP contributes to the promotion of plant science, including omics studies.

19.1 Introduction

Every day, vast amounts of data are produced by transcriptome, proteome, metabolome, and phenome analyses. By characterizing these data, researchers discover novel gene functions, new signal transduction pathways, and new bioactive compounds that regulate plant growth. However, because growth conditions can affect the results, and even microflora can affect plant growth (reviewed by Singh et al., 2004; Barea et al., 2005), all plant materials for omics studies should be prepared under strictly controlled conditions. In addition to environmental factors, researchers must consider the genetic background of plant materials. Genetic characterization of the mouse C57BL/6 sub-strain serves as a good example (Mekada et al., 2009; Benavides et al., 2020): although C57BL/6 is widely used, its genetic background differs among suppliers, extending even to deletions of the Nnt1, Snca, and Crb1 genes. Thus, the use of genetically defined animals is required for studies using laboratory rodents. In the case of Arabidopsis, Shao et al. (2016) reported the introgression of chromosome segments from Wassilewskija-2 into Columbia-0 seed stock provided by a private company. Quality assurance of genetic resources is indispensable for acquiring reliable data, and public repositories are working to ensure it for their resources.

The Japanese government founded the National BioResource Project (NBRP) in 2002 to collect, preserve, and distribute biological materials indispensable for life science research. Through the project, biological resources are distributed to research communities at a minimal cost. The government now plans to develop the information infrastructure for data-driven research, including genome data and biological resources. The NBRP plans to improve the quality of available resources as well as to promote the strategic collection and utilization of bioresources (NBRP, 2021). In this chapter, plant resources available from the Core Facilities of NBRP are described.

*masatomo.kobayashi@riken.jp

© CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0019


Fig. 19.1. Arabidopsis at flowering stage (left) and juvenile plants in a Petri dish (right).

Table 19.1. Public sources of Arabidopsis resources and information. (URLs accessed 1 September 2022.)

Name of institute                                Location   Type of resource                  URL
Arabidopsis Biological Resource Center (ABRC)    USA        Seed, cultured cells, DNA         https://abrc.osu.edu
Nottingham Arabidopsis Stock Centre (NASC)       UK         Seed                              https://arabidopsis.info
RIKEN BioResource Research Center (RIKEN BRC)    Japan      Seed, cultured cells, DNA         https://epd.brc.riken.jp/en/
The Arabidopsis Information Resource (TAIR)ᵃ     USA        Genomic and genetic information   https://www.arabidopsis.org

ᵃ Subscription required. All websites accessed in September 2022.

19.2 Overview of Arabidopsis Resources

The model plant Arabidopsis thaliana (L.) Heynh. (Fig. 19.1) is the most suitable plant species for genetic and phenotypic research, including omics analyses (reviewed by Carneiro et al., 2015). Arabidopsis can grow under laboratory conditions, even aseptically in a Petri dish (Fig. 19.1). Its small size and short life cycle minimize the time and cost of preparation for analysis. Since the entire genome sequence of the standard line Columbia-0 was published (Arabidopsis Genome Initiative, 2000), Arabidopsis has become the most useful model plant in various research fields of plant science. There are three public resource centers for Arabidopsis: (i) the Arabidopsis Biological Resource Center (ABRC) in the USA; (ii) the Nottingham Arabidopsis Stock Centre (NASC) in the UK; and (iii) the RIKEN BioResource Research Center (RIKEN BRC) in Japan (Table 19.1). The centers provide wild-type and mutant Arabidopsis seeds to the world research community. The Arabidopsis Information Resource (TAIR) in the USA provides genomic and genetic information, with links to the resources available from these centers, and useful protocols such as growth methods.

19.2.1 Arabidopsis seed resources for omics analysis

Arabidopsis seed resources can be classified into two major categories: natural accessions (ecotypes) and insertion mutant lines (Table 19.2). Natural accessions were collected from various locations worldwide. Seeds are amplified by bulk methods or single-seed descent methods. The ecotypes are used to study responses to environmental factors. It is worth noting that the 1001 Genomes Project for A. thaliana has completed resequencing


M. Kobayashi

Table 19.2. Seed lines of Arabidopsis.

| Type of resource | Name of resource | Feature | Provider |
| --- | --- | --- | --- |
| Wild typea | Natural accession (ecotype) | Non-GMb | ABRC, NASC, RIKEN BRC |
| T-DNA insertion mutants | SALK, GABI, Wisconsin DsLox, SAIL | GMb | ABRC, NASC |
| Transposon insertion mutants | RATM | GMb | RIKEN BRC |
| | IMA, JIC SM | | ABRC, NASC |

a Includes Columbia-0 (Col-0).
b GM = genetically modified.

of the ecotypes, which has facilitated genetic studies such as genome-wide association studies (GWAS) (1001 Genomes Consortium, 2016). Moreover, genome assembly of selected accessions is ongoing (Weigel and Mott, 2009; 1001 Genomes Project, 2010). Soon after the Arabidopsis reference genome was sequenced, the Multinational Coordinated Arabidopsis thaliana Functional Genomics Project, led by the Multinational Arabidopsis Steering Committee (MASC), was launched. The goal of this ambitious project is the functional annotation of most of the genes (MASC, 2010). Use of a reverse genetics approach (Østergaard and Yanofsky, 2004) has accelerated the sequencing of T-DNA insertion sites and transposon insertion mutants developed by the research community. Seeds of insertion mutants covering approximately 96% of Arabidopsis genes are now available from the resource centers. Among the mutant lines, SALK lines developed by the Salk Institute Genomic Analysis Laboratory (SIGnAL) constitute the largest collection of gene knockout mutants and are used worldwide (Alonso et al., 2003). Although omics analyses of insertion mutants have the potential to become a powerful tool for elucidating gene functions, several challenges remain. Transformation with T-DNA vectors is a conventional method for genetic engineering of plants, but about half of the SALK lines have multiple insertions (SIGnAL, 2001), and T-DNA-associated duplications/translocations have been reported (Tax and Vernon, 2001; Forsbach et al., 2003). Usually, transgenic mutant lines are preserved and provided as a mixture of homozygous, heterozygous, and wild-type seeds, and users are responsible for selecting homozygous plants. In addition, selection with antibiotics may affect the results. To avoid these problems, seed lines homozygous at the insertion site have been established for approximately 60% of Arabidopsis genes.
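When a segregating seed stock is genotyped to pick out homozygous plants, it can be useful to check whether the observed genotype counts look like simple Mendelian segregation. The following is a minimal sketch, not a procedure taken from the resource centers: it assumes the seeds are the selfed progeny of a single heterozygous plant carrying one insertion, so that a 1:2:1 ratio (homozygous : heterozygous : wild type) is expected. The function names and the example counts are hypothetical.

```python
# Critical value of the chi-square distribution with 2 degrees of freedom
# at a significance level of 0.05.
CHI2_CRIT_DF2_P05 = 5.991


def chi_square_1_2_1(n_hom, n_het, n_wt):
    """Chi-square statistic against a 1:2:1 (hom : het : wild-type) ratio.

    Assumption: the genotyped plants are selfed progeny of one heterozygous
    parent carrying a single insertion.
    """
    total = n_hom + n_het + n_wt
    if total == 0:
        raise ValueError("no genotyped plants")
    observed = (n_hom, n_het, n_wt)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_mendelian(n_hom, n_het, n_wt):
    """True if the counts do not deviate significantly from 1:2:1."""
    return chi_square_1_2_1(n_hom, n_het, n_wt) < CHI2_CRIT_DF2_P05


if __name__ == "__main__":
    # Hypothetical genotyping result for 100 plants.
    counts = (22, 55, 23)
    print("chi-square:", round(chi_square_1_2_1(*counts), 2))
    print("consistent with 1:2:1:", fits_mendelian(*counts))
```

A strong deviation from 1:2:1 does not by itself identify the cause; multiple insertions, reduced viability of homozygotes, or a stock that is not the direct progeny of a heterozygote would all need to be considered.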

19.2.2 Arabidopsis DNA resources

Full-length cDNA clones are a useful tool in molecular biology. Researchers use the clones for creating transgenic plants as well as for producing proteins to elucidate gene functions. RIKEN Arabidopsis full-length cDNA (RAFL) clones, the largest library of Arabidopsis cDNA in the world, were established in the RIKEN Genomic Sciences Center and are distributed from the RIKEN BRC to the world. The RIKEN BRC also distributes ORF clones of Arabidopsis transcription factors (TF clones) and Arabidopsis genomic DNA clones (TAC clones).

19.3 Overview of Experimental Plant Resources for Crop Research

In the 21st century, human society faces the global problems of climate change and an ever-growing population. Agricultural research is indispensable to address these challenges. The core facilities of NBRP (Table 19.3) preserve and distribute grain crops such as rice and wheat, as well as vegetable crops (namely tomato and legumes), to the crop research community.

19.3.1 Rice resources

Rice (Oryza sativa L.) is the most important grain-crop species in eastern Asia, including Japan.


Table 19.3. Core facilities of NBRP plant resources.

| Resource | Core facility | Resource type | URL (accessed 1 September 2022) |
| --- | --- | --- | --- |
| Arabidopsis/cultured plant cells, genes | RIKEN BRC | Seed, cell, DNA | https://nbrp.jp/en/resource/arabidopsis-culture-gene-en/ |
| Rice | National Institute of Genetics | Seed | https://nbrp.jp/en/resource/rice-en/ |
| Wheat | Kyoto University | Seed | https://nbrp.jp/en/resource/wheat-en/ https://shigen.nig.ac.jp/wheat/komugi/ |
| Barley | Okayama University | Seed, DNA | https://nbrp.jp/en/resource/barley-en/ |
| Lotus/glycine | Miyazaki University | Seed, DNA, tissue culture | https://nbrp.jp/en/resource/lotus-glycine-en/ |
| Tomato | Tsukuba University | Seed, DNA | https://nbrp.jp/en/resource/tomato-en/ |
| Chrysanthemum | Hiroshima University | Seed | https://nbrp.jp/en/resource/chrysanthemum-en/ |
| Morning glory | Kyushu University | Seed, DNA | https://nbrp.jp/en/resource/morning-glory-en/ |

The high-quality rice genome sequence (subsp. japonica cultivar Nipponbare) was first published by the International Rice Genome Sequencing Project (2005). The progeny seeds of the line (JP No. 229579) used for the sequencing project are available from the Genebank Project, National Agriculture and Food Research Organization (NARO) (https://www.gene.affrc.go.jp/distribution-plant_en.php, accessed September 2022). Unfortunately, the distribution service of the rice TOS17 mutant and full-length cDNA clone from the Rice Genome Resource Center was terminated as of 31 March 2021. The National Institute of Genetics operates the NBRP-RICE project (reviewed by Sato et al., 2021) to collect wild-type rice resources that include 1700 accessions from 21 species of the genus Oryza. Among them, 217 accessions from 19 species have been subjected to next-generation genome sequencing studies, and the results are stored in the database OryzaGenome 2.1 (Kajiya-Kanegae et al., 2021). Information on the origin, species, genotype, and growth habit is available from Oryzabase (Kurata and Yamazaki, 2006; Ohyanagi et al., 2016). The project also creates recombinant inbred lines (RILs) and chromosome segment substitution lines (CSSLs) for studies of quantitative trait loci (QTLs). The RILs were created by crossing japonica rice with indica rice. The CSSLs were established from six accessions of Oryza species that have the AA genome. It is worth noting that NBRP-RICE preserves 14,000 mutant strains. Among them, approximately 12,000 strains were generated by N-methyl-N-nitrosourea (NMU) treatment of fertilized eggs of the strains Kinmaze and Taichung 65 (Suzuki et al., 2008; Satoh et al., 2010). During propagation, the phenotypes of each mutant line were observed and are provided in Oryzabase. The aim is to obtain the genomic sequence of the mutant lines, in the expectation that the combined genomic and phenomic datasets will greatly improve the value of the NMU mutants for studies using GWAS technology.

19.3.2 Wheat resources

Wheat (Triticum aestivum L.) is another important grain species. In 2018, the International Wheat Genome Sequencing Consortium (IWGSC) published the first reference genome of wheat, cultivar Chinese Spring (IWGSC, 2018). The success was followed by the International Wheat 10+ Genome Project, which enables comparative genome analyses of hexaploid wheat varieties through the de novo assembly of 15 global wheat varieties (Walkowiak et al., 2020). A Japanese representative cultivar, Norin 61, was the only Asian variety used in the project (Shimizu et al., 2021). As the core facility of NBRP-Wheat, Kyoto University collects, preserves, and distributes a wide variety of wheat (approximately 6700 accessions, including Norin 61). However, the distribution of cDNA clones was terminated recently. Their core collection consists of 188 accessions of Triticum aestivum, T. spelta, T. compactum, T. sphaerococcum, T. macha, and T. vavilovii (Takenaka et al., 2018). The experimental lines of NBRP-Wheat, such as isogenic lines, recombinant inbred lines, intervarietal chromosomal substitution lines, and synthetic polyploids, enable researchers to investigate the relationship between phenotype and genotype. It is worth noting that they established nested association mapping (NAM) populations by crossing the East Asian core collection of hexaploid wheat with Norin 61 as a paternal parent (Shimizu et al., 2021). They plan to characterize the phenotype of each NAM line for mining new alleles useful in breeding research.

19.3.3 Tomato resources

The Solanaceae family includes economically important crops, notably tomato, eggplant, bell pepper, potato, and tobacco. Among them, tomato (Solanum lycopersicum L.) was among the first solanaceous plant species to be fully sequenced (Tomato Genome Consortium, 2012). Tomato is regarded as a useful model for studying fruit ripening and the production of specific components such as carotenoids. Micro-Tom is a dwarf tomato that can be maintained under laboratory conditions (Meissner et al., 1997). The whole-genome sequence of the standard line TOMJPF00001 is publicly available (Kobayashi et al., 2014). Micro-Tom EMS mutant lines and gamma ray-irradiated lines are distributed by NBRP Tomato at Tsukuba University (NBRP, 2020b). They also provide full-length cDNA clones.

19.3.4 Legume resources

Legumes are widespread crops whose seeds are rich in lipids and protein. Symbiotic interactions between legumes and rhizobium bacteria are an important research subject in the field of plant science. Lotus japonicus (Regel) is used as a model legume in symbiosis research (Handberg and Stougaard, 1992). The whole-genome sequence of the standard line Gifu B-129 was published (Kamal et al., 2020), and its seeds are distributed by NBRP Legume at Miyazaki University (NBRP, 2020a). They also provide natural accessions, recombinant inbred lines (RILs), EMS mutant lines, retrotransposon (LORE1 insertion) tag lines, cDNA clones, and genomic DNA clones.

19.4 Conclusion and Perspective

Reproducibility of research data is essential in science. Plant genome sequencing projects have revealed non-negligible differences in the genotypes of plant strains with the same name that are maintained in different laboratories. Thus, researchers are strongly advised to use plant materials maintained at public resource centers and to state the origin in reports. With the development of science, unexpected findings that should be considered during the preparation of plant materials may arise. Note that microflora growing with plants can affect plant phenotype (reviewed by Morelli et al., 2020). For example, endophytic microbes can improve plant growth and tolerance to biotic and abiotic stresses (Fan et al., 2020). Some endophytes may affect plant growth by modulating plant hormone signaling (Venneman et al., 2020). In the future, the control and measurement of microflora during the preparation of plant materials could be required in studies using omics analysis. In conclusion, the use of reliable resources prepared under a well-controlled environment is essential for research.


References

1001 Genomes Consortium (2016) 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166(2), 481–491. DOI: 10.1016/j.cell.2016.05.063.
1001 Genomes Project (2010) 1001 Genomes – A catalogue of Arabidopsis thaliana genetic variation. Available at: http://1001genomes.org/index.html (accessed 2 January 2021).
Alonso, J.M., Stepanova, A.N., Leisse, T.J., Kim, C.J., Chen, H. et al. (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301(5633), 653–657. DOI: 10.1126/science.1086391.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.
Barea, J.M., Pozo, M.J., Azcón, R. and Azcón-Aguilar, C. (2005) Microbial co-operation in the rhizosphere. Journal of Experimental Botany 56(417), 1761–1778. DOI: 10.1093/jxb/eri197.
Benavides, F., Rülicke, T., Prins, J.B., Bussell, J., Scavizzi, F. et al. (2020) Genetic quality assurance and genetic monitoring of laboratory mice and rats: FELASA Working Group Report. Laboratory Animals 54, 135–148. DOI: 10.1177/0023677219867719.
Carneiro, J.M.T., Madrid, K.C., Maciel, B.C.M. and Arruda, M.A.Z. (2015) Arabidopsis thaliana and omics approaches: a review. Journal of Integrated OMICS 5(1), 1–15. DOI: 10.5584/jiomics.v5i1.179.
Fan, D., Subramanian, S. and Smith, D.L. (2020) Plant endophytes promote growth and alleviate salt stress in Arabidopsis thaliana. Scientific Reports 10, 12740. DOI: 10.1038/s41598-020-69713-5.
Forsbach, A., Schubert, D., Lechtenberg, B., Gils, M. and Schmidt, R. (2003) A comprehensive characterization of single-copy T-DNA insertions in the Arabidopsis thaliana genome. Plant Molecular Biology 52, 161–176. DOI: 10.1023/a:1023929630687.
Handberg, K. and Stougaard, J. (1992) Lotus japonicus, an autogamous, diploid legume species for classical and molecular genetics. The Plant Journal 2(4), 487–496. DOI: 10.1111/j.1365-313X.1992.00487.x.
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800. DOI: 10.1038/nature03895.
International Wheat Genome Sequencing Consortium (IWGSC) (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361(6403), eaar7191. DOI: 10.1126/science.aar7191.
Kajiya-Kanegae, H., Ohyanagi, H., Ebata, T., Tanizawa, Y., Onogi, A. et al. (2021) OryzaGenome2.1: database of diverse genotypes in wild Oryza species. Rice 14(1), 24. DOI: 10.1186/s12284-021-00468-x.
Kamal, N., Mun, T., Reid, D., Lin, J.-S., Akyol, T.Y. et al. (2020) Insights into the evolution of symbiosis gene copy number and distribution from a chromosome-scale Lotus japonicus Gifu genome sequence. DNA Research 27(3), dsaa015. DOI: 10.1093/dnares/dsaa015.
Kobayashi, M., Nagasaki, H., Garcia, V., Just, D., Bres, C. et al. (2014) Genome-wide analysis of intraspecific DNA polymorphism in “Micro-Tom”, a model cultivar of tomato (Solanum lycopersicum). Plant & Cell Physiology 55(2), 445–454. DOI: 10.1093/pcp/pct181.
Kurata, N. and Yamazaki, Y. (2006) Oryzabase. An integrated biological and genome information database for rice. Plant Physiology 140(1), 12–17. DOI: 10.1104/pp.105.063008.
MASC (2010) Multinational coordinated Arabidopsis thaliana functional genomics project annual report 2010. Available at: http://www.arabidopsis.info/static/info/masc_2010.pdf (accessed 3 January 2021).
Meissner, R., Jacobson, Y., Melamed, S., Levyatuv, S., Shalev, G. et al. (1997) A new model system for tomato genetics. The Plant Journal 12(6), 1465–1472. DOI: 10.1046/j.1365-313x.1997.12061465.x.
Mekada, K., Abe, K., Murakami, A., Nakamura, S., Nakata, H. et al. (2009) Genetic differences among C57BL/6 substrains. Experimental Animals 58(2), 141–149. DOI: 10.1538/expanim.58.141.
Morelli, M., Bahar, O., Papadopoulou, K.K., Hopkins, D.L. and Obradović, A. (2020) Editorial: role of endophytes in plant health and defense against pathogens. Frontiers in Plant Science 11, 1312. DOI: 10.3389/fpls.2020.01312.
NBRP (2020a) Lotus/glycine. Available at: https://nbrp.jp/en/resource/lotus-glycine-en/ (accessed 14 June 2021).
NBRP (2020b) Tomato. Available at: https://nbrp.jp/en/resource/tomato-en/ (accessed 14 June 2021).
NBRP (2021) About the NBRP. Available at: https://nbrp.jp/en/about-en/ (accessed 10 October 2021).
Ohyanagi, H., Ebata, T., Huang, X., Gong, H., Fujita, M. et al. (2016) OryzaGenome: genome diversity database of wild Oryza species. Plant & Cell Physiology 57(1), e1. DOI: 10.1093/pcp/pcv171.
Østergaard, L. and Yanofsky, M.F. (2004) Establishing gene function by mutagenesis in Arabidopsis thaliana. The Plant Journal 39(5), 682–696. DOI: 10.1111/j.1365-313X.2004.02149.x.
Sato, Y., Tsuda, K., Yamagata, Y., Matsusaka, H., Kajiya-Kanegae, H. et al. (2021) Collection, preservation and distribution of Oryza genetic resources by the National Bioresource Project RICE (NBRP-RICE). Breeding Science 71(3), 291–298. DOI: 10.1270/jsbbs.21005.
Satoh, H., Matsusaka, H. and Kumamaru, T. (2010) Use of N-methyl-N-nitrosourea treatment of fertilized egg cells for saturation mutagenesis of rice. Breeding Science 60(5), 475–485. DOI: 10.1270/jsbbs.60.475.
Shao, M.R., Shedge, V., Kundariya, H., Lehle, F.R. and Mackenzie, S.A. (2016) Ws-2 introgression in a proportion of Arabidopsis thaliana Col-0 stock seed produces specific phenotypes and highlights the importance of routine genetic verification. The Plant Cell 28(3), 603–605. DOI: 10.1105/tpc.16.00053.
Shimizu, K.K., Copetti, D., Okada, M., Wicker, T., Tameshige, T. et al. (2021) De novo genome assembly of the Japanese wheat cultivar Norin 61 highlights functional variation in flowering time and fusarium-resistant genes in east Asian genotypes. Plant & Cell Physiology 62(1), 8–27. DOI: 10.1093/pcp/pcaa152.
SIGnAL (2001) Salk Institute Genomic Analysis Laboratory Arabidopsis sequence indexed T-DNA insertion. Available at: http://signal.salk.edu/tdna_FAQs.html (accessed 3 January 2021).
Singh, B.K., Millard, P., Whiteley, A.S. and Murrell, J.C. (2004) Unravelling rhizosphere-microbial interactions: opportunities and limitations. Trends in Microbiology 12(8), 386–393. DOI: 10.1016/j.tim.2004.06.008.
Suzuki, T., Eiguchi, M., Kumamaru, T., Satoh, H., Matsusaka, H. et al. (2008) MNU-induced mutant pools and high performance TILLING enable finding of any gene mutation in rice. Molecular Genetics and Genomics 279(3), 213–223. DOI: 10.1007/s00438-007-0293-2.
Takenaka, S., Nitta, M. and Nasuda, S. (2018) Population structure and association analyses of the core collection of hexaploid accessions conserved ex situ in the Japanese gene bank NBRP-wheat. Genes & Genetic Systems 93(6), 237–254. DOI: 10.1266/ggs.18-00041.
Tax, F.E. and Vernon, D.M. (2001) T-DNA-associated duplication/translocations in Arabidopsis. Implications for mutant analysis and functional genomics. Plant Physiology 126(4), 1527–1538. DOI: 10.1104/pp.126.4.1527.
Tomato Genome Consortium (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485(7400), 635–641. DOI: 10.1038/nature11119.
Venneman, J., Vandermeersch, L., Walgraeve, C., Audenaert, K., Ameye, M. et al. (2020) Respiratory CO2 combined with a blend of volatiles emitted by endophytic Serendipita strains strongly stimulate growth of Arabidopsis implicating auxin and cytokinin signaling. Frontiers in Plant Science 11, 544435. DOI: 10.3389/fpls.2020.544435.
Walkowiak, S., Gao, L., Monat, C., Haberer, G., Kassa, M.T. et al. (2020) Multiple wheat genomes reveal global variation in modern breeding. Nature 588(7837), 277–283. DOI: 10.1038/s41586-020-2961-x.
Weigel, D. and Mott, R. (2009) The 1001 genomes project for Arabidopsis thaliana. Genome Biology 10(5), 107. DOI: 10.1186/gb-2009-10-5-107.

20 Plant Omics Databases: an Online Resource Guide

Feng Li1*, Yingtian Deng1, Eiji Yamamoto2 and Zhenya Liu1
1Key Laboratory of Horticultural Plant Biology, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei, China; 2Graduate School of Agriculture, Meiji University, Kawasaki, Japan

Abstract

In the past two decades, technological advancements in DNA sequencing and mass spectrometry-based protein sequencing have made large-scale genome, transcriptome, and proteome analyses accessible to individual researchers. Consequently, many omics databases have been constructed for various organisms, which has facilitated a diversity of studies on genetics, functional genomics, molecular biology, and other research areas. In this chapter, we introduce key plant omics databases together with some useful online analysis tools that enable users without any bioinformatics expertise to access omics data. In the early part of this chapter, we focus on databases for the model plant Arabidopsis, along with a summary of the related milestone studies for genome, epigenome, transcriptome, and proteome. Subsequently, we introduce omics databases for major crops, including rice, wheat, maize, soybean, tomato, and pepper, which have contributed not only to functional genomics, but also to agronomic studies. We also introduce omics databases for bryophytes, a basal lineage of land plants, which have contributed to plant evolutionary studies. Moreover, we introduce plant omics portals that facilitate integrated omics data analysis at a species-wide level. Thus, this chapter provides a useful guide for researchers to obtain and exploit ample online resources to promote their own studies.

20.1 Introduction

Omics-based online databases and bioinformatics platforms have played crucial roles in promoting plant research. These databases provide baseline knowledge for building hypotheses, designing experiments, and inferring the molecular and/or genetic mechanisms underlying biological events (Mochida and Shinozaki, 2010). In the past two decades, advances in DNA sequencing technology (i.e., next-generation sequencing (NGS)) have revolutionized plant research related to omics (Metzker, 2010). For example, most of the reference genomes for plant species other than Arabidopsis and rice were generated within a decade owing to NGS. Even in Arabidopsis and rice, whole-genome analysis has extended from sequencing of single or few strains to resequencing thousands of accessions. Similarly, NGS facilitated rapid growth of transcriptome and epigenome data, which are useful for studying the molecular mechanisms of a gene of interest. In addition to DNA sequencing technology, mass spectrometry-based proteomic analysis methods have developed rapidly and have greatly promoted high-throughput quantification of proteome and protein modifications (Aebersold and Mann, 2016).

*Corresponding author: chdlifeng@mail.hzau.edu.cn © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.) DOI: 10.1079/9781789247534.0020


Feng Li et al.


However, handling large volumes of omics data usually requires sophisticated bioinformatics expertise, which limits their potential to promote biological research. For improving their use, many online databases have retrieved and integrated omics data and provided analytic tools that allow users to explore such databases without advanced bioinformatics expertise. In the early part of this chapter, we have used Arabidopsis to navigate through online resources for its genomics, epigenomics, transcriptomics, and proteomics databases. We then introduce databases for other major plant species, as well as some important portals for plant omics studies.

20.2 Arabidopsis Omics Databases

Arabidopsis thaliana is the best-studied model plant, and therefore Arabidopsis databases are substantial in both quality and quantity compared with those of other plants. The wealth of information on Arabidopsis has contributed to research not only in Arabidopsis but also in other plants. In this section, we introduce the Arabidopsis omics databases in more detail than for other plant species.

20.2.1 Arabidopsis genome databases

The genome of A. thaliana (approximately 135 Mb) is relatively small compared with that of other plants, and it was the first plant genome to be fully sequenced (Arabidopsis Genome Initiative, 2000). The Arabidopsis Information Resource (TAIR) (Table 20.1) is a centralized database for the Arabidopsis research community that provides not only the reference genome sequences and annotations but also other valuable information, such as availability of mutant stock and external links for other omics databases (Lamesch et al., 2012). The small genome size of Arabidopsis has enabled further comprehensive genomic analyses. In the A. thaliana 1001 Genomes Project (Alonso-Blanco et al., 2016), the genomes of 1135 natural accessions were sequenced, and the database (Table 20.1) provides not only the sequences but also additional information, such as phenotypes of the accessions and the results of genome-wide association studies (GWAS). The data from these natural accessions are valuable resources for ecological and evolutionary studies, such as those involving genetic and phenotypic adaptation to the environment.

Table 20.1. List of Arabidopsis omics databases.

| Category | Database | URL |
| --- | --- | --- |
| Genome | TAIR | http://www.arabidopsis.org |
| Genome | The A. thaliana 1001 Genomes | http://1001genomes.org |
| Epigenome | The A. thaliana 1001 Epigenomes | http://signal.salk.edu/1001.php |
| Epigenome | ReMap 2020 (Chèneby et al., 2020) | http://remap.univ-amu.fr |
| Transcriptome | ARS | http://ipf.sustech.edu.cn/pub/athrna/ |
| Transcriptome | PLncDB | http://bis.zju.edu.cn/PlncRNADB/ |
| Transcriptome | The Arabidopsis Next-Gen Sequence DBs | https://mpss.danforthcenter.org |
| Proteome | ProteomicsDB | https://www.proteomicsdb.org/ |
| Proteome | ATHENA | http://athena.proteomics.wzw.tum.de:5002/master_arabidopsisshiny/ |
| Proteome | SIGnAL (AI-1, PPIN-1, TF-NAPPA) | http://signal.salk.edu/interactome/index.html |
| Proteome | CCSB Interactome Database | http://interactome.dfci.harvard.edu |
| Proteome | FAT-PTM | https://bioinformatics.cse.unr.edu/fat-ptm/ |
| Proteome | STRING | https://string-db.org/ |

All URLs accessed in September 2022.

Plant Omics Databases

20.2.2 Arabidopsis epigenome databases

High research activity in Arabidopsis has resulted in the accumulation of not only genome sequences but also a large amount of epigenomic data, such as DNA methylation and histone modification. For example, Cokus et al. (2008) performed single-base pair resolution DNA methylome analysis using bisulfite treatment and high-throughput sequencing (BS-seq). The aim of this study was to reveal functional differences between MET1, DRM1/2, and CMT3 in cytosine methylation (Cokus et al., 2008). In addition, to enable other researchers to exploit the BS-seq data for their own research activities, the authors developed a database in which users can search for the DNA methylation state of a specific genome region through a genome browser (http://epigenomics.mcdb.ucla.edu/BS-Seq, accessed September 2022). As with the genome sequences, a comprehensive methylome analysis for natural accessions was performed in the A. thaliana 1001 Epigenomes Project (Kawakatsu et al., 2016). High-resolution methylome maps of 1107 natural accessions are available in the database (Table 20.1), and these data are useful for analyzing the effect of epigenetic variations on various Arabidopsis phenotypes. To investigate histone modification profiles, chromatin immunoprecipitation sequencing (ChIP-seq) with antibodies specific to different types of histone modifications has been conducted. The data were accumulated in NCBI GEO databases, but were not arranged until recently. Chèneby et al. (2020) retrieved the ChIP-seq data in NCBI GEO databases, and constructed a catalog of histone modification sites from the analysis. This catalog is now available in ReMap 2020 (Table 20.1).
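The single-base methylation states that BS-seq databases display can be illustrated with a small calculation. Bisulfite treatment converts unmethylated cytosines so that they are read as T, whereas methylated cytosines are still read as C; the methylation level at a site is therefore commonly estimated as the fraction of C reads among all C and T reads covering it. In plants, cytosines are usually grouped by sequence context (CG, CHG, and CHH, where H is A, C, or T). The sketch below is only an illustration under these general assumptions; the function names are hypothetical and it does not reproduce the pipeline of any database mentioned above.

```python
def methylation_level(c_reads, t_reads):
    """Estimated methylation level at one cytosine from BS-seq read counts.

    c_reads: reads showing C (cytosine protected from conversion, i.e. methylated)
    t_reads: reads showing T (cytosine converted, i.e. unmethylated)
    Returns None when the site has no coverage.
    """
    depth = c_reads + t_reads
    if depth == 0:
        return None
    return c_reads / depth


def cytosine_context(seq, i):
    """Classify the cytosine at seq[i] as 'CG', 'CHG' or 'CHH'.

    Assumes an upper-case forward-strand sequence of A, C, G, T.
    Returns None if the sequence ends before the context can be determined.
    """
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    if i + 1 >= len(seq):
        return None
    if seq[i + 1] == "G":
        return "CG"
    if i + 2 >= len(seq):
        return None
    return "CHG" if seq[i + 2] == "G" else "CHH"


if __name__ == "__main__":
    # Hypothetical site: 8 reads report C and 2 report T.
    print(methylation_level(8, 2))
    print(cytosine_context("TTCGA", 2), cytosine_context("TTCAG", 2),
          cytosine_context("TTCAT", 2))
```

Real analyses additionally correct for incomplete bisulfite conversion and handle the reverse strand; both are omitted here for brevity.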

20.2.3 Arabidopsis transcriptome databases

RNA transcripts can be divided into three categories, based on their function: (i) messenger RNAs (mRNAs); (ii) noncoding RNAs (ncRNAs); and (iii) small RNAs (see Chapters 2 and 6 for detailed discussions on their functions). Currently, most transcriptome data are obtained from RNA-seq. mRNAs are transcripts from protein-coding genes and are detectable by poly-A selected RNA-seq (Mortazavi et al., 2008). ncRNAs consist of ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), and long non-coding RNAs (lncRNAs). None of these RNAs carries a poly-A tail, except for some lncRNAs, which occur both with and without poly-A tails (Di et al., 2014). Therefore, mRNA transcriptome analysis is conducted with only poly-A selected RNA-seq, while ncRNA transcriptome analysis requires both poly-A selected and ribo-minus RNA-seq approaches, the latter of which requires depletion of rRNAs (Chen and Duan, 2011). The Arabidopsis RNA-seq database (ARS) retrieved more than 20,000 RNA-seq datasets from public repositories; it provides the arranged information together with various analytics tools (Zhang et al., 2020). The ARS provides a user-friendly interface that enables gene expression analysis without bioinformatics expertise (Table 20.1). In the A. thaliana 1001 Epigenomes Project (Kawakatsu et al., 2016), RNA-seq data from 998 natural accessions are also provided, which enables the investigation of genetic and epigenetic mechanisms of gene expression in such natural accessions (Table 20.1). The Plant long non-coding RNA database (PLncDB) specializes in plant lncRNAs (Jin et al., 2013). It provides information on more than 5000 lncRNAs from 4 plant species, including Arabidopsis, and users can access their expression profiles in different tissues and experimental conditions (Table 20.1). Another important feature of PLncDB is that it provides information on RNA-binding proteins that potentially interact with lncRNAs in the form of an interaction network. Small RNAs are also ncRNAs, but they are often treated as a separate category because of their specific functions. Small RNAs include microRNAs (miRNAs) and small interfering RNAs (siRNAs), and act to induce sequence-specific gene silencing through association with Argonaute proteins (Chen, 2009). Transcriptome-wide analysis of small RNAs was performed by sequencing 18–30 nt RNA fractions that were retrieved by gel purification (Lu et al., 2005). The Arabidopsis Next-Gen Sequence DBs (Table 20.1) provide a comprehensive set of Arabidopsis small RNA sequencing data, as well as online bioinformatics tools to analyze and visualize them (Nakano et al., 2020). For example, to investigate the possibility of small RNA-mediated regulation of gene expression, users can utilize the gene ID to retrieve small RNA and/or degradome RNA reads that were derived from a certain gene.
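The 18–30 nt size window used for gel purification has a simple computational counterpart: after adapter trimming, small RNA reads are often filtered to the same length range and their length distribution is inspected. The following is a minimal sketch of that step, assuming the reads are already adapter-trimmed plain sequence strings; the function names and the toy reads are hypothetical and are not taken from any of the databases above.

```python
from collections import Counter


def size_select(reads, min_len=18, max_len=30):
    """Keep reads whose length lies within [min_len, max_len] nucleotides."""
    return [read for read in reads if min_len <= len(read) <= max_len]


def length_distribution(reads):
    """Count how many reads occur at each length."""
    return Counter(len(read) for read in reads)


if __name__ == "__main__":
    # Toy reads of 17, 21, 24 and 35 nt; only the 21- and 24-nt reads pass.
    toy_reads = ["A" * 17, "ACGU" * 5 + "A", "ACGU" * 6, "A" * 35]
    kept = size_select(toy_reads)
    for length, count in sorted(length_distribution(kept).items()):
        print(length, count)
```

Keeping the bounds as parameters makes it easy to narrow the window when only one size class of small RNA is of interest.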

20.2.4 Arabidopsis proteome databases

Proteome-wide studies include analysis of: (i) protein accumulation patterns; (ii) post-translational modifications (PTM); and (iii) protein–protein interactions (PPI) (see Chapter 4 for the details). While the first two have been performed using mass spectrometry, PPI analysis has been performed by library-to-library yeast two-hybrid (Y2H) screening and nucleic acid programmable protein arrays (NAPPA) (Ramachandran et al., 2008). Mergner et al. (2020) conducted mass spectrometry-based proteome and phosphoproteome analyses for 30 tissues in Arabidopsis, which resulted in the detection of over 18,000 proteins and over 43,000 phosphorylation sites. The data were arranged and made accessible in ProteomicsDB and ATHENA (Table 20.1). Cruz et al. (2019) retrieved PTM data from various proteomics studies in Arabidopsis and provided tools to analyze eight different PTMs of over 49,000 sites in the FAT-PTM database (Table 20.1). The first proteome-wide PPI study was published by the Arabidopsis Interactome Mapping Consortium (2011). The Arabidopsis interactome map (AI-1) was constructed using a large-scale Y2H system. AI-1 consists of 6200 interactions between approximately 2700 proteins. Mukhtar et al. (2011) took a similar approach but focused on plant-pathogen protein–protein immune networks (PPIN-1). The resulting interactome map consists of 3148 interactions from 926 proteins, including 83 pathogen effectors. Yazaki et al. (2016) developed a high-density NAPPA with 12,000 Arabidopsis open reading frames (ORFs). Analysis of interactions between the ORFs and a set of 38 transcription factors resulted in the construction of a transcriptional regulatory network (TF-NAPPA); it consists of 3580 interactions among 2238 proteins. AI-1, PPIN-1, and TF-NAPPA are available in SIGnAL (Table 20.1); the first two are also available in the CCSB Interactome Database (Table 20.1). The aforementioned PPI databases were constructed based on their own experiments. Szklarczyk et al. (2021) mined and integrated publicly available PPIs from more than 5000 organisms, including Arabidopsis. The results are provided in the STRING database. STRING provides not only experimentally validated PPIs but also predicted functional interactions.
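Interactome maps such as AI-1 are typically summarized as a number of interactions among a number of proteins, which corresponds to the edges and nodes of an undirected network. The sketch below shows one way to derive these summaries, and the most connected proteins, from a list of interacting pairs. It assumes a generic two-column pair list; it does not reflect the download format of SIGnAL, the CCSB Interactome Database, or STRING, and the function names and placeholder protein labels are hypothetical.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected interaction network as {protein: set of partners}."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network


def count_interactions(pairs):
    """Number of distinct interactions, ignoring order and duplicates."""
    return len({frozenset(pair) for pair in pairs})


def top_hubs(network, n=5):
    """The n proteins with the most distinct partners, as (protein, degree)."""
    degrees = ((protein, len(partners)) for protein, partners in network.items())
    return sorted(degrees, key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Placeholder pairs; ("B", "A") duplicates ("A", "B") in reverse order.
    toy_pairs = [("A", "B"), ("B", "A"), ("A", "C"), ("A", "D"), ("C", "D")]
    net = build_network(toy_pairs)
    print("proteins:", len(net))
    print("interactions:", count_interactions(toy_pairs))
    print("top hub:", top_hubs(net, 1))
```

Storing partners in sets keeps the network free of duplicate edges, which matters when the same pair is reported in both orientations or by more than one screen.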

20.3 Omics Databases for Crop Plants

Many plant species are used as crops. Because of their agronomic and economic importance, many omics studies have been performed on them, even though experiments and data analysis face species-specific difficulties such as large genome sizes, polyploidy, and complicated genome structures (see below). This section provides information on databases for some major crops that have been the target of intensive study.

20.3.1 Rice (Oryza sativa L.)

Rice is a staple food crop that feeds more than half of the world’s population. Oryza sativa consists of two major subspecies: japonica and indica. The first rice draft genomes were generated from two cultivars, O. sativa L. subsp. japonica cv. “Nipponbare” and subsp. indica cv. “93-11”, using shotgun Sanger sequencing (Goff et al., 2002; Yu et al., 2002). However, the reference genome in current use was generated from “Nipponbare” by map-based genome sequencing (International Rice Genome Sequencing Project and Sasaki, 2005; Kawahara et al., 2013). Both the reference genome sequence and gene annotation information are provided in the Rice Annotation Project Database (RAP-DB) (Sakai et al., 2013) and the Rice Genome Annotation Project (RGAP)

Plant Omics Databases


Table 20.2. List of omics databases for crop plants.

Species | Category | Database | URL
Rice | Genome | RAP-DB | https://rapdb.dna.affrc.go.jp/
Rice | Genome | RGAP | http://rice.plantbiology.msu.edu/
Rice | Genome | RIGW | http://rice.hzau.edu.cn/rice_rs3/
Rice | Genome | Oryza_glaberrima_V1 at Gramene | http://ensembl.gramene.org/Oryza_glaberrima/Info/Index
Rice | Genome | RiceHap3 | http://server.ncgr.ac.cn/RiceHap3/index.php
Rice | Genome | RicePanGenome | http://db.ncgr.ac.cn/RicePanGenome/
Rice | Genome | RPAN | https://cgm.sjtu.edu.cn/3kricedb/
Rice | Transcriptome | RiceXPro | https://ricexpro.dna.affrc.go.jp/
Wheat | Genome | MBKBASE | http://mbkbase.org/Tu/
Wheat | Genome | GrainGenes | https://wheat.pw.usda.gov/GG3/
Wheat | Transcriptome | WheatExp | https://wheat.pw.usda.gov/WheatExp/
Wheat | Transcriptome | Wheat Expression Browser | http://www.wheat-expression.com/
Wheat | Proteome | Wheat Proteome | http://wheatproteome.org
Maize | Genome & transcriptome | MaizeGDB | https://www.maizegdb.org/
Soybean | Genome & transcriptome | SoyBase | https://www.soybase.org/
Tomato | Genome & transcriptome | SGN | https://solgenomics.net/
Pepper | Genome & transcriptome | PepperHub | http://www.hnivr.org/pepperhub

All URLs accessed in September 2022.

(Table 20.2). Given that “Nipponbare” is a japonica cultivar, Zhang et al. (2016) generated high-quality reference sequences of two indica cultivars, “Zhenshan 97” and “Minghui 63”, by using PacBio SMRT technology. The genome sequences and gene annotation information are available in the Rice Information GateWay (RIGW) (Song et al., 2018; Table 20.2). O. sativa, the Asian cultivated rice, was domesticated from O. rufipogon, whereas O. glaberrima Steud., the African cultivated rice, was domesticated from O. barthii. Wang et al. (2014) generated chromosome-level genome sequences of O. glaberrima, and the resulting sequences and gene annotations are available in Gramene (Table 20.2). Huang et al. (2012) resequenced 1083 O. sativa cultivars and 446 O. rufipogon strains. Interspecific sequence comparisons suggested that japonica was first domesticated from O. rufipogon from the Pearl River area in southern China, and indica was subsequently derived from crosses between japonica and O. rufipogon in Southeast and South Asia. Genome annotation, genotype data, and a catalog of GWAS results are available in the Rice Haplotype Map Project database (RiceHap3) (Table 20.2). The 3000 Rice Genome Project aims to uncover the whole genetic variation within Asian cultivated rice (i.e., O. sativa). In the first stage, sequencing of 3024 accessions was conducted (Sun et al., 2017), and subsequently, 3010 accessions, covering 74.6–98.7% of the “Nipponbare” genome, were selected after quality control (Wang et al., 2018). Over 32 million nucleotide variations were identified; the information is provided in the Rice Pan-genome Browser (RPAN) (Table 20.2). Because of their agronomic importance and the great efforts made in their breeding, rice germplasms are rich in genetic diversity, and therefore, a few reference


Feng Li et al.

genomes would not include the entire set of genes in the germplasms. A rice pan-genome project conducted de novo genome assembly of 66 accessions that included both cultivars and O. rufipogon strains (Zhao et al., 2018). The genome sequences and annotation information, including 10,872 newly identified genes, are available in the RicePanGenome database (Table 20.2). Although genomic analyses in rice are comparable to or even more extensive than those in Arabidopsis, transcriptome databases for rice appear scarce compared with those for Arabidopsis. However, RiceXPro (Table 20.2) provides comprehensive microarray-based transcriptome profiles with a user-friendly graphical interface. An important feature of RiceXPro is that the data were obtained from various organs/tissues under natural field conditions (Sato et al., 2010). This enables users to investigate gene expression profiles in the context of agronomic performance. RiceXPro is linked to another database, RiceFREND, which provides co-expression analysis functions based on the same transcriptome data (Sato et al., 2013).
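Co-expression analysis of the kind mentioned above is commonly based on the correlation between expression profiles measured across the same samples. The sketch below illustrates that general idea with a plain Pearson correlation; it is not a description of how RiceFREND itself computes co-expression, and the gene names, expression values, and threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def coexpressed_partners(query, profiles, threshold=0.9):
    """Return (gene, r) pairs whose correlation with `query` is >= threshold."""
    target = profiles[query]
    hits = []
    for gene, values in profiles.items():
        if gene == query:
            continue
        r = pearson(target, values)
        if r >= threshold:
            hits.append((gene, r))
    # Strongest correlations first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical expression values across five tissues (not real RiceXPro data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises together with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],  # opposite trend to geneA
}
partners = coexpressed_partners("geneA", profiles)
print(partners)
```

In practice, the choice of correlation measure and threshold strongly affects which gene pairs are reported, so both should be treated as tunable parameters.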

20.3.2 Wheat (Triticum aestivum L.)

Among wheat species (genus Triticum), durum wheat (T. turgidum subsp. durum, 2n = 4x = AABB) and bread wheat (T. aestivum L., 2n = 6x = AABBDD) are cultivated as food sources. Durum wheat was derived from the hybridization of T. urartu (2n = 2x; AA) and Aegilops speltoides (2n = 2x; BB). Subsequently, hybridization between T. turgidum and Ae. tauschii (2n = 2x; DD) resulted in the formation of hexaploid bread wheat. The wheat genome is extremely large and structurally complicated compared with those of other plants. Each subgenome is approximately 5.5 Gb, of which more than 80% consists of repetitive sequences. Therefore, wheat genome projects started by sequencing each subgenome separately. The reference AA genome was generated by sequencing an accession of T. urartu (Ling et al., 2013, 2018). Gene annotation and small RNA information derived from RNA-seq data is available on the page dedicated to wheat in MBKBASE (Table 20.2). The reference DD genome was generated from an accession of Ae. tauschii subsp. strangulata (Jia et al., 2013); its quality has been improved by using the BioNano genome map technology (Luo et al., 2017). More recently, genome sequencing of polyploid wheat has also been conducted. The AABB genome was assembled for the durum wheat cultivar “Svevo” (Maccaferri et al., 2019) and an accession of emmer wheat (T. turgidum subsp. dicoccon) (Scott et al., 2019). Finally, a chromosome-level assembly of the AABBDD genome was generated from the bread wheat cultivar “Chinese Spring” (Appels et al., 2018). Thus, there are many wheat reference genomes derived from different subgenomes and ploidy levels. These wheat genomes are now collected in GrainGenes (Table 20.2) (Blake et al., 2019). From the genome browser, users can access annotation information for genes, ncRNAs, and transposable elements. It should be noted that GrainGenes includes genomic information not only for wheat but also for other cereal crops, such as barley, oat, and rye.

Transcriptome analysis in polyploid species such as tetraploid and hexaploid wheat is difficult because of the existence of homeologous genes: the high sequence similarity between homeologs makes it hard to quantify differences in their expression. However, recent technical advances in RNA-seq analysis have enabled the quantification of homeologous gene expression. Pearce et al. (2015) developed an analytic pipeline to distinguish between homeologous genes and performed transcriptome analysis of publicly available wheat RNA-seq data. The results are provided in WheatExp (Table 20.2). Ramírez-González et al. (2018) collected RNA-seq data from various tissues, developmental stages, strains, and environmental conditions and revealed that about 30% of hexaploid wheat homeologous genes showed non-balanced expression patterns. The genome browser-based interface for the data is provided by the Wheat Expression Browser (Table 20.2) in expVIP (Borrill et al., 2016). Given that wheat is an important protein source, Duncan et al. (2017) performed mass spectrometry-based proteomic analysis of 24 samples and developmental stages. The data are provided in Wheat Proteome (Table 20.2). This database also provides tools to compare the proteome and RNA-seq data.
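The notion of balanced versus non-balanced homeolog expression mentioned above can be made concrete with a small sketch. The code below normalizes the expression of the A, B, and D copies of a gene to relative contributions and assigns the nearest of seven idealized patterns by Euclidean distance. The pattern names, the idealized values, and the nearest-pattern rule are assumptions chosen for illustration, and are not presented as the pipeline of any of the studies cited here.

```python
from math import dist

# Idealized relative contributions of the (A, B, D) homeologs.
# Names and values are assumptions made for this sketch.
IDEAL_PATTERNS = {
    "balanced": (1 / 3, 1 / 3, 1 / 3),
    "A dominant": (1.0, 0.0, 0.0),
    "B dominant": (0.0, 1.0, 0.0),
    "D dominant": (0.0, 0.0, 1.0),
    "A suppressed": (0.0, 0.5, 0.5),
    "B suppressed": (0.5, 0.0, 0.5),
    "D suppressed": (0.5, 0.5, 0.0),
}


def relative_contributions(a, b, d):
    """Scale non-negative expression values so that they sum to 1."""
    if min(a, b, d) < 0:
        raise ValueError("expression values must be non-negative")
    total = a + b + d
    if total == 0:
        raise ValueError("at least one homeolog must be expressed")
    return (a / total, b / total, d / total)


def classify_triad(a, b, d):
    """Return the name of the idealized pattern closest to the observed one."""
    observed = relative_contributions(a, b, d)
    return min(IDEAL_PATTERNS,
               key=lambda name: dist(observed, IDEAL_PATTERNS[name]))


# Hypothetical expression values (e.g. TPM) for three homeolog triads.
print(classify_triad(10.0, 10.0, 10.0))
print(classify_triad(30.0, 1.0, 1.0))
print(classify_triad(0.5, 10.0, 10.0))
```

Working with relative contributions rather than raw values lets triads with very different overall expression levels be compared on the same scale.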

20.3.3 Maize (Zea mays)

Maize is an important crop for both humans and livestock, and has the largest production among cereal crops. The maize genome is about 2.5 Gb, of which about 80% consists of repetitive elements. Its large size and complicated structure have made maize genome sequencing difficult. In the first maize draft genome, only gene-rich regions were sequenced using methylation filtration (Palmer et al., 2003; Whitelaw et al., 2003). The first chromosome-level maize reference genome was constructed using the BAC-by-BAC approach, similar to the method used for constructing the rice reference genome (Schnable et al., 2009). The current version of the reference genome was improved using PacBio SMRT technology (Jiao et al., 2017). The maize reference genome was generated from “B73”, an inbred line that has been used in maize breeding. However, the maize genome is known for its extreme genetic diversity, and it has been suggested that the genome of “B73” contains only 63–74% of the genes in the maize genetic pool. Therefore, a pan-genome project was also conducted for comprehensive coverage of maize (Hufford et al., 2021). The maize pan-genome project sequenced 25 inbred lines that were used as founders of the Nested Association Mapping population, a community resource for genetic mapping (Gage et al., 2020). The genome sequences and annotation information from “B73” and the pan-genome project are available at the Maize Genetics and Genomics Database (MaizeGDB), the centralized database for the maize research community (Table 20.2). Although information on genomic and genetic resources has been the main feature of the database, the current version of MaizeGDB has incorporated transcriptomic and epigenomic data from various maize studies (Portwood et al., 2019). Users can survey the transcriptome and epigenome states of a specific genome region through the JBrowse graphical interface of MaizeGDB.
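The statement that a single reference line holds only part of the species' gene pool can be illustrated with simple set operations. In this sketch the pan-genome is the union of the gene sets of all lines, the core genome is their intersection, and the reference fraction is the share of the pan-genome found in the reference line. The line names and gene identifiers are toy values invented for illustration; real pan-genome analyses also have to decide when genes from different assemblies count as the same gene, which is not addressed here.

```python
def summarize_pangenome(gene_sets, reference):
    """Summarize gene presence/absence across lines.

    gene_sets: dict mapping line name -> set of gene identifiers.
    reference: name of the line used as the reference genome.
    """
    if reference not in gene_sets:
        raise KeyError(f"unknown reference line: {reference}")
    pan = set().union(*gene_sets.values())        # genes found in any line
    core = set.intersection(*gene_sets.values())  # genes found in every line
    return {
        "pan_genome_size": len(pan),
        "core_size": len(core),
        "dispensable_size": len(pan - core),
        "reference_fraction": len(gene_sets[reference]) / len(pan),
    }


# Toy gene sets (hypothetical identifiers, not real maize data).
gene_sets = {
    "reference_line": {"g1", "g2", "g3", "g4", "g5", "g6", "g7"},
    "line_2": {"g1", "g2", "g3", "g4", "g8", "g9"},
    "line_3": {"g1", "g2", "g3", "g5", "g9", "g10"},
}
summary = summarize_pangenome(gene_sets, "reference_line")
print(summary)
```

In this toy case the reference line carries 7 of the 10 genes in the pan-genome, so adding further lines is the only way to recover the remaining genes.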


20.3.4 Soybean (Glycine max)

Soybean is an important crop not only as a source of seed oil and protein, but also because it fixes atmospheric nitrogen through symbiosis with soil-borne microorganisms. In 2010, the first chromosome-scale soybean reference genome was generated from “Williams 82” using a whole-genome shotgun approach (Schmutz et al., 2010). The soybean genome size is approximately 1.1 Gb, and because of its ancient polyploidization, 75% of the genes are present in multiple copies. The reference genome sequence has been incorporated into SoyBase (Table 20.2), which is a comprehensive repository for genetics, genomics, and related data resources for soybean (Grant et al., 2010). SoyBase provides annotation information on the reference genome and curated quantitative trait locus (QTL) information retrieved from QTL mapping and GWAS (Brown et al., 2021). In recent years, SoyBase has expanded to incorporate sequence data from five cultivars/species other than “Williams 82”, as well as resequencing and chip-based variant information from thousands of accessions (Brown et al., 2021).

20.3.5 Tomato (Solanum lycopersicum)

The Solanaceae family includes plants of agronomic importance, such as potato, eggplant, pepper, tobacco, and tomato. In addition to its economic importance, tomato is considered a model fruiting crop and has been the subject of extensive research, including functional gene characterization. The tomato genome is diploid, approximately 900 Mb in size, and organized into 12 chromosomes. The tomato reference genome was generated from the inbred cultivar “Heinz 1706” (Tomato Genome Consortium, 2012). Because many Solanaceae species share not only the chromosome number (i.e., n = 12) but also genomic synteny, the tomato reference genome provides a framework for genomic analysis and information for molecular breeding of other Solanaceae species. The annual SOL genomics workshop began after a meeting in 2003 to initiate an international collaboration entitled the International Solanaceae Genome Project. As a result, the SOL Genomics Network (SGN) (Table 20.2) was created to house various genomic information and analysis tools for studying Solanaceae species (Mueller et al., 2005). Currently, SGN hosts not only the genomes of tomato cultivars and wild relatives, but also those of other species, including potato, pepper, eggplant, petunia, and several tobacco species, as well as coffee, which belongs to the Rubiaceae rather than the Solanaceae. Moreover, the database stores loci information and phenotype data, which are provided and edited by individual researchers through interfaces on the website. To take full advantage of the increasing genomic and phenotypic data available, SGN is now implementing new web-based breeding tools, such as genomic selection (Fernandez-Pozo et al., 2015).

20.3.6 Pepper (Capsicum annuum)

Pepper is a member of the Solanaceae family; it is one of the oldest domesticated crops in the Americas, and is now grown all over the world. Pepper has the same number of chromosomes (n = 12) as other members of the Solanaceae family, such as tomato and eggplant, but its genome is much larger, at roughly 3.5 Gb (Kim et al., 2014; Qin et al., 2014). PepperHub (Table 20.2) provides an integrative public platform for sharing and analyzing research data (Liu et al., 2017). As an online tool to facilitate pepper research, this website contains a genome module, which hosts the reference genomes of “Zunla-1” and “CM334” and provides GBrowse and BLAST functions. GBrowse allows users to browse the gene structure and sequence information, while BLAST allows users to query and retrieve the genomic DNA, cDNA, and protein sequences from both “Zunla-1” and “CM334” (Kim et al., 2014; Qin et al., 2014). In addition, the transcriptome module in PepperHub includes a large volume of new transcriptome data obtained by high-throughput mRNA sequencing of 188 samples, covering different organs/tissues at successive developmental stages and stressed plants at consecutive time points (Wang et al., 2019). PepperHub may accelerate research in pepper functional genomics and serve as a valuable resource for studying fruit developmental biology and stress responses in pepper as well as other plant species.

20.4 Databases for Bryophytes

At an early stage of plant omics, comprehensive omics analyses were limited to the model plant (i.e., Arabidopsis) and some major crops. Recent advances in high-throughput analytic tools have facilitated systematic multi-omics characterization at reasonable cost and effort. Therefore, omics studies have now expanded across land plants. Because of their importance in plant molecular evolution, intensive research has recently been conducted on bryophytes, which are a basal lineage of land plants. Marchantia polymorpha is a well-studied bryophyte, and scaffold-level genome sequences were generated by Bowman et al. (2017). The authors also performed RNA-seq on samples from various tissues to analyze the transcriptome profiles and to improve gene annotations. Subsequently, Montgomery et al. (2020) improved the genome assembly to the chromosome level using long-read sequencing and Hi-C. In addition, they performed a comprehensive ChIP-seq analysis to explore the epigenomic landscape. These genome, transcriptome, and epigenome data are accessible through the genome browser on MarpolBase (https://marchantia.info/, accessed September 2022).

Physcomitrium patens is another bryophyte for which omics studies have advanced remarkably. The chromosome-level reference genome sequence was generated by Lang et al. (2018). The authors used RNA-seq data from public repositories to confirm the validity of their gene annotations. In addition, the epigenomic landscape of the genome was analyzed using ChIP-seq data from public repositories and newly obtained BS-seq data. These results are provided on a page in Phytozome 13 (https://phytozome-next.jgi.doe.gov/info/Ppatens_v3_3, accessed September 2022) (see also section 20.6.3 of this chapter). Regarding the transcriptome profile of P. patens, Ortiz-Ramírez et al. (2016) developed another gene expression atlas composed of various organs/tissues from most of this bryophyte’s life cycle by using a microarray-based transcriptome analysis. The data are accessible through the electronic fluorescent pictographic browser of The Bio-Analytic Resource for Plant Biology (BAR) (http://bar.utoronto.ca/efp_physcomitrella/cgi-bin/efpWeb.cgi, accessed September 2022) (see also section 20.6.1 of this chapter). A comparison of these omics data with those of other plant species may accelerate molecular evolution studies.

20.5 Databases for Other Plant Species

Currently, multi-omics databases are available for many other plant species. Table 20.3 lists databases that are useful for plant omics researchers but are not described in detail in the main text.

20.6 Portals for Plant Omics Databases

Extensive efforts in omics research on various plant species have resulted in the accumulation of rich resources that are available online. Many portals for plant omics databases have been developed; they provide convenient tools for integrated analysis of certain types of omics data across different plant species, or for exploring different types of omics data for certain genes in a particular species. These portals have become an essential part of plant research for surveying a research area, generating in silico hypotheses, and designing experiments.

20.6.1 Bio-Analytic Resource for Plant Biology (BAR)

BAR (http://bar.utoronto.ca/, accessed September 2022) was formerly named “The Botany Array Resource” (Toufighi et al., 2005). BAR hosts large sets of gene expression microarray data from Arabidopsis and provides a set of web-based analytic tools by which users can perform various analyses, including exploration of co-regulated genes, identification of cis-elements in co-regulated genes, and “electronic Northern blot.” To represent gene expression data in an intuitive way, an electronic Fluorescent Pictograph (eFP) browser was developed; it can visualize gene expression levels in different organs and tissues, and protein accumulation in different subcellular locations (Winter et al., 2007). eFP was originally developed for Arabidopsis gene expression data and has been adapted not only for many other plant species but also for different omics data types. BAR now contains more than 150 million gene expression measurements, 100,000 protein–protein interactions, and 70,000 protein structures. The average number of BAR uses per month was approximately 60,000 in 2015 (Waese and Provart, 2017).

20.6.2 Gramene

Gramene (https://www.gramene.org/, accessed September 2022) was originally built as a community resource for rice and as a comparative genome mapping database for monocots (Ware et al., 2002a, b). The current release hosts 93 reference genomes from rice and other monocots, dicots, and non-vascular plants (Tello-Ruiz et al., 2021). Gramene provides curated and integrated information on several features, such as genetic maps, markers, mutants, genes, and publications. One of the remarkable features of Gramene, compared with other databases, is Plant Reactome, a plant pathway platform that visualizes various types of molecular events, reactions, and interactions in the context of the subcellular architecture of the plant cell (Naithani et al., 2020). Thus, Gramene now supports a wide range of plant researchers, including genomics researchers, molecular biologists, geneticists, and breeders.

20.6.3 Phytozome

Phytozome (https://phytozome-next.jgi.doe.gov/, accessed September 2022) is a centralized hub for accessing, visualizing, and analyzing plant genomes that have been sequenced by the


Table 20.3. List of databases without descriptions in the main text.

Species | Category | Database | URL | Reference
Abies sachalinensis | Transcriptome | TodoFirGene | http://plantomics.mind.meiji.ac.jp/todomatsu/ | Ueno et al., 2018
Banana | Genome | The Banana Genome Hub | https://banana-genome-hub.southgreen.fr/ | Droc et al., 2013
Brassica | Genome | Brassica Genome | http://www.brassicagenome.net/details.php | Wang et al., 2011
Buckwheat | Genome | BGDB | http://buckwheat.kazusa.or.jp/ | Yasui et al., 2016
Carnation | Genome | Carnation DB | http://carnation.kazusa.or.jp/index.html | Yagi et al., 2014
Cassava | Transcriptome | Cassava Online Archive | http://cassava.psc.riken.jp/ver.1/ | Sakurai et al., 2013
Cherry | Genome | DBcherry | http://cherry.kazusa.or.jp/ | Shirasawa et al., 2017
Chickpea | Transcriptome | CTDB | http://www.nipgr.ac.in/ctdb.html | Singh et al., 2013
Chrysanthemum | Transcriptome | Chrysanthemum Transcriptome Database | http://www.icugi.org/chrysanthemum/ | Xu et al., 2013
Chrysanthemum | Genome | Mum GARDEN | http://mum-garden.kazusa.or.jp/ | Hirakawa et al., 2019
Citrus | Genome | MiGD | https://mikan.dna.affrc.go.jp/ | Kawahara et al., 2020
Clover | Genome | Clover Garden | http://clovergarden.jp/ | Hirakawa et al., 2016
Cucurbits | Genome & transcriptome | CuGenDB | http://cucurbitgenomics.org/ | Zheng et al., 2019
Eggplant | Genome | Eggplant Genome Database | http://eggplant.kazusa.or.jp/index.html | Hirakawa et al., 2014a
Ginseng | Genome & transcriptome | Ginseng Genome Database | http://ginsengdb.snu.ac.kr/ | Jayakodi et al., 2018
Japanese pear | Transcriptome | TRANSNAP | http://plantomics.mind.meiji.ac.jp/nashi/ | Koshimizu et al., 2019
Jatropha | Genome | Jatropha Genome Database | http://www.kazusa.or.jp/jatropha/ | Sato et al., 2011
Kiwifruit | Genome & transcriptome | Kiwifruit Genome Database | http://kiwifruitgenome.org/ | Yue et al., 2020
Legumes | Genome & transcriptome | LegumeIP V3 | http://plantgrn.noble.org/LegumeIP/gdp/ | Dai et al., 2021
Lotus japonicus | Genome | miyakogusa.jp | http://www.kazusa.or.jp/lotus/ | Sato et al., 2008
Lygodium japonicum | Transcriptome | Lygodium japonicum Transcriptome Database | http://bioinf.mind.meiji.ac.jp/kanikusa/ | Aya et al., 2015
Medicago truncatula | Genome | Medicago truncatula Genome Database | https://www.jcvi.org/research/medicago-truncatula-genome-database | Tang et al., 2014
Muskmelon | Genome & transcriptome | Melonet DB | https://melonet-db.dna.affrc.go.jp/ap/top | Yano et al., 2020
Onion | Transcriptome | AlliumTDB | http://alliumtdb.kazusa.or.jp/index.html | Abdelrahman et al., 2017
Physcomitrella patens | Genome | PHYSCObase | https://moss.nibb.ac.jp/ | Nishiyama et al., 2003
Radish | Genome | Raphanus sativus Genome DataBase | http://radish.kazusa.or.jp/ | Shirasawa et al., 2020
River red gum | Genome | Eucalyptus camaldulensis Genome Database | http://www.kazusa.or.jp/eucaly/ | Hirakawa et al., 2011
Rosaceae | Genome | GDR | https://www.rosaceae.org/ | Jung et al., 2019
Rubber | Genome & transcriptome | Rubber Genome & Transcriptome Database | http://matsui-lab.riken.jp/rubber/home.html | Lau et al., 2016
Strawberry | Genome | Strawberry GARDEN | http://strawberry-garden.kazusa.or.jp/ | Hirakawa et al., 2014b
Sweetpotato | Genome | Sweetpotato GARDEN | http://sweetpotato-garden.kazusa.or.jp/ | Hirakawa et al., 2015
Woody plant species | Genome & transcriptome | Hardwood Genomics Project | https://www.hardwoodgenomics.org/ |
Zoysia grass | Genome | Zoysia Genome Database | http://zoysia.kazusa.or.jp/ | Tanaka et al., 2016

All URLs accessed in September 2022.

US Department of Energy Joint Genome Institute (JGI) (Goodstein et al., 2012). One of the most important features of Phytozome, compared with other plant portals, is its wider species coverage; the most recent release of Phytozome (v13) hosts 249 assembled and annotated genomes that include both JGI and non-JGI plant genomes. All gene sets in Phytozome have been annotated with “KOG”, “KEGG”, “ENZYME”, “Pathway”, and “InterPro”. The large amount of sequence data and consistent annotation information in Phytozome facilitate comprehensive comparative genomics studies.

20.7 Future Perspectives

Because of recent advances in omics technologies, information resources for many plant species are accumulating daily. In this chapter, we focused especially on databases for Arabidopsis, which show the greatest advancement in genome, epigenome, transcriptome, and proteome analyses. However, as briefly explained in this chapter, similar databases are also available for other plant species, and their quality and quantity are quickly approaching the level of those for Arabidopsis. These plant omics databases provide user-friendly web-based interfaces for various omics analytic tools, enabling experimental biologists to analyze genes of interest without in-depth knowledge of biostatistics and bioinformatics. Thus, there is no doubt that omics databases have contributed to a large number of biological studies.

Meanwhile, the proliferation of databases in recent years has caused new problems. For example, the databases for a given plant species are often scattered across different locations. Because the categories of data (e.g., genomes and transcriptomes) included in such databases often differ from each other, it is difficult for non-bioinformaticians to gather multi-omics data for a species. It is also a problem that databases holding the same category of data for different species use different interfaces and analytic functions. This interferes with interspecies comparisons of multi-omics data. Therefore, in the era of multi-omics data analysis, it will be necessary to set a standard data format across databases and to develop a central portal for different plant species.
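To make the idea of a shared data format more tangible, the sketch below shows one hypothetical way to describe database entries with a common record type and to index them by species and data category. The `DatabaseEntry` fields and the helper functions are assumptions made for illustration and do not correspond to any existing standard; the example entries are taken from Table 20.2.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseEntry:
    """Hypothetical common record describing one omics database."""
    species: str
    category: str  # e.g. "Genome", "Transcriptome", "Proteome"
    name: str
    url: str


def index_by_species(entries):
    """Group entries as species -> category -> list of database names."""
    index = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        index[entry.species.lower()][entry.category.lower()].append(entry.name)
    return {species: dict(cats) for species, cats in index.items()}


def missing_categories(index, species,
                       wanted=("genome", "transcriptome", "proteome")):
    """List the wanted categories with no entry for `species` in `index`."""
    available = index.get(species.lower(), {})
    return [category for category in wanted if category not in available]


# A few entries from Table 20.2, expressed in the common record type.
entries = [
    DatabaseEntry("Rice", "Genome", "RAP-DB", "https://rapdb.dna.affrc.go.jp/"),
    DatabaseEntry("Rice", "Transcriptome", "RiceXPro",
                  "https://ricexpro.dna.affrc.go.jp/"),
    DatabaseEntry("Wheat", "Genome", "GrainGenes",
                  "https://wheat.pw.usda.gov/GG3/"),
    DatabaseEntry("Wheat", "Transcriptome", "WheatExp",
                  "https://wheat.pw.usda.gov/WheatExp/"),
    DatabaseEntry("Wheat", "Proteome", "Wheat Proteome",
                  "http://wheatproteome.org"),
]
index = index_by_species(entries)
print(missing_categories(index, "Rice"))
print(missing_categories(index, "Wheat"))
```

The result reflects only the entries supplied to the index, so a category reported as missing means that it is absent from this list, not that no such database exists.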

References

Abdelrahman, M., El-Sayed, M., Sato, S., Hirakawa, H., Ito, S.I. et al. (2017) RNA-sequencing-based transcriptome and biochemical analyses of steroidal saponin pathway in a complete set of Allium fistulosum–A. cepa monosomic addition lines. PLoS ONE 12, e0181784. DOI: 10.1371/journal.pone.0181784.
Aebersold, R. and Mann, M. (2016) Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355. DOI: 10.1038/nature19949.
Alonso-Blanco, C., Andrade, J., Becker, C., Bemm, F., Bergelson, J. et al. (2016) 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491. DOI: 10.1016/j.cell.2016.05.063.
Appels, R., Eversole, K., Stein, N., Feuillet, C., Keller, B. et al. (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191. DOI: 10.1126/science.aar7191.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. DOI: 10.1038/35048692.
Arabidopsis Interactome Mapping Consortium (2011) Evidence for network evolution in an Arabidopsis interactome map. Science 333(6042), 601–607. DOI: 10.1126/science.1203877.
Aya, K., Kobayashi, M., Tanaka, J., Ohyanagi, H., Suzuki, T. et al. (2015) De novo transcriptome assembly of a fern, Lygodium japonicum, and a web resource database, Ljtrans DB. Plant & Cell Physiology 56(1), e5. DOI: 10.1093/pcp/pcu184.
Blake, V.C., Woodhouse, M.R., Lazo, G.R., Odell, S.G., Wight, C.P. et al. (2019) GrainGenes: centralized small grain resources and digital platform for geneticists and breeders. Database 2019, baz065. DOI: 10.1093/database/baz065.
Borrill, P., Ramirez-Gonzalez, R. and Uauy, C. (2016) expVIP: a customizable RNA-seq data analysis and visualization platform. Plant Physiology 170(4), 2172–2186. DOI: 10.1104/pp.15.01667.
Bowman, J.L., Kohchi, T., Yamato, K.T., Jenkins, J., Shu, S. et al. (2017) Insights into land plant evolution garnered from the Marchantia polymorpha genome. Cell 171(2), 287–304. DOI: 10.1016/j.cell.2017.09.030.
Brown, A.V., Conners, S.I., Huang, W., Wilkey, A.P., Grant, D. et al. (2021) A new decade and new data at SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Research 49(D1), D1496–D1501. DOI: 10.1093/nar/gkaa1107.
Chen, X. (2009) Small RNAs and their roles in plant development. Annual Review of Cell and Developmental Biology 25, 21–44. DOI: 10.1146/annurev.cellbio.042308.113417.
Chen, Z. and Duan, X. (2011) Ribosomal RNA depletion for massively parallel bacterial RNA-sequencing applications. Methods in Molecular Biology (Clifton, N.J.) 733, 93–103. DOI: 10.1007/978-1-61779-089-8_7.


Chèneby, J., Ménétrier, Z., Mestdagh, M., Rosnet, T., Douida, A. et al. (2020) ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Research 48(D1), D180–D188. DOI: 10.1093/nar/gkz945.
Cokus, S.J., Feng, S., Zhang, X., Chen, Z., Merriman, B. et al. (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452(7184), 215–219. DOI: 10.1038/nature06745.
Cruz, E.R., Nguyen, H., Nguyen, T. and Wallace, I.S. (2019) Functional analysis tools for post-translational modification: a post-translational modification database for analysis of proteins and metabolic pathways. The Plant Journal 99(5), 1003–1013. DOI: 10.1111/tpj.14372.
Dai, X., Zhuang, Z., Boschiero, C., Dong, Y. and Zhao, P.X. (2021) LegumeIP V3: from models to crops – an integrative gene discovery platform for translational genomics in legumes. Nucleic Acids Research 49(D1), D1472–D1479. DOI: 10.1093/nar/gkaa976.
Di, C., Yuan, J., Wu, Y., Li, J., Lin, H. et al. (2014) Characterization of stress-responsive lncRNAs in Arabidopsis thaliana by integrating expression, epigenetic and structural features. The Plant Journal 80(5), 848–861. DOI: 10.1111/tpj.12679.
Droc, G., Larivière, D., Guignon, V., Yahiaoui, N., This, D. et al. (2013) The banana genome hub. Database 2013, bat035. DOI: 10.1093/database/bat035.
Duncan, O., Trösch, J., Fenske, R., Taylor, N.L. and Millar, A.H. (2017) Resource: Mapping the Triticum aestivum proteome. The Plant Journal 89(3), 601–616. DOI: 10.1111/tpj.13402.
Fernandez-Pozo, N., Menda, N., Edwards, J.D., Saha, S., Tecle, I.Y. et al. (2015) The Sol Genomics Network (SGN) – from genotype to phenotype to breeding. Nucleic Acids Research 43(Database issue), D1036–D1041. DOI: 10.1093/nar/gku1195.
Gage, J.L., Monier, B., Giri, A. and Buckler, E.S. (2020) Ten years of the maize nested association mapping population: impact, limitations, and future directions. The Plant Cell 32(7), 2083–2093. DOI: 10.1105/tpc.19.00951.
Goff, S.A., Ricke, D., Lan, T.-H., Presting, G., Wang, R. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296(5565), 92–100. DOI: 10.1126/science.1068275.
Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D. et al. (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Research 40(Database issue), D1178–D1186. DOI: 10.1093/nar/gkr944.
Grant, D., Nelson, R.T., Cannon, S.B. and Shoemaker, R.C. (2010) SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Research 38(Database issue), D843–D846. DOI: 10.1093/nar/gkp798.
Hirakawa, H., Nakamura, Y., Kaneko, T., Isobe, S., Sakai, H. et al. (2011) Survey of the genetic information carried in the genome of Eucalyptus camaldulensis. Plant Biotechnology 28(5), 471–480. DOI: 10.5511/plantbiotechnology.11.1027b.
Hirakawa, H., Shirasawa, K., Miyatake, K., Nunome, T., Negoro, S. et al. (2014a) Draft genome sequence of eggplant (Solanum melongena L.): the representative Solanum species indigenous to the old world. DNA Research 21, 649–660. DOI: 10.1093/dnares/dsu027.
Hirakawa, H., Shirasawa, K., Kosugi, S., Tashiro, K., Nakayama, S. et al. (2014b) Dissection of the octoploid strawberry genome by deep sequencing of the genomes of Fragaria species. DNA Research 21, 169–181. DOI: 10.1093/dnares/dst049.
Hirakawa, H., Okada, Y., Tabuchi, H., Shirasawa, K., Watanabe, A. et al. (2015) Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don. DNA Research 22(2), 171–179. DOI: 10.1093/dnares/dsv002.
Hirakawa, H., Kaur, P., Shirasawa, K., Nichols, P., Nagano, S. et al. (2016) Draft genome sequence of subterranean clover, a reference for genus Trifolium. Scientific Reports 6, 30358. DOI: 10.1038/srep30358.
Hirakawa, H., Sumitomo, K., Hisamatsu, T., Nagano, S., Shirasawa, K. et al. (2019) De novo whole-genome assembly in Chrysanthemum seticuspe, a model species of Chrysanthemums, and its application to genetic and gene discovery analysis. DNA Research 26, 195–203. DOI: 10.1093/dnares/dsy048.
Huang, X., Kurata, N., Wang, Z.X., Wang, A., Zhao, Q. et al. (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490, 497–501. DOI: 10.1038/nature11532.
Hufford, M.B., Seetharam, A.S., Woodhouse, M.R., Chougule, K.M., Ou, S. et al. (2021) De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Genomics. DOI: 10.1101/2021.01.14.426684.

Feng Li et al.

International Rice Genome Sequencing Project and Sasaki, T. (2005) The map-based sequence of the rice genome. Nature 436(7052), 793–800. DOI: 10.1038/nature03895.
Jayakodi, M., Choi, B.S., Lee, S.C., Kim, N.H., Park, J.Y. et al. (2018) Ginseng Genome Database: an open-access platform for genomics of Panax ginseng. BMC Plant Biology 18, 62. DOI: 10.1186/s12870-018-1282-9.
Jia, J., Zhao, S., Kong, X., Li, Y., Zhao, G. et al. (2013) Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature 496(7443), 91–95. DOI: 10.1038/nature12028.
Jiao, Y., Peluso, P., Shi, J., Liang, T., Stitzer, M.C. et al. (2017) Improved maize reference genome with single-molecule technologies. Nature 546(7659), 524–527. DOI: 10.1038/nature22971.
Jin, J., Liu, J., Wang, H., Wong, L. and Chua, N.H. (2013) PLncDB: plant long non-coding RNA database. Bioinformatics (Oxford, England) 29(8), 1068–1071. DOI: 10.1093/bioinformatics/btt107.
Jung, S., Lee, T., Cheng, C.-H., Buble, K., Zheng, P. et al. (2019) 15 years of GDR: new data and functionality in the genome database for Rosaceae. Nucleic Acids Research 47(D1), D1137–D1145. DOI: 10.1093/nar/gky1000.
Kawahara, Y., de la Bastide, M., Hamilton, J.P., Kanamori, H., McCombie, W.R. et al. (2013) Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice (New York, N.Y.) 6(1), 4. DOI: 10.1186/1939-8433-6-4.
Kawahara, Y., Endo, T., Omura, M., Teramoto, Y., Itoh, T. et al. (2020) Mikan Genome Database (MiGD): integrated database of genome annotation, genomic diversity, and CAPS marker information for mandarin molecular breeding. Breeding Science 70(2), 200–211. DOI: 10.1270/jsbbs.19097.
Kawakatsu, T., Huang, S.-S.C., Jupe, F., Sasaki, E., Schmitz, R.J. et al. (2016) Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell 166(2), 492–505. DOI: 10.1016/j.cell.2016.06.044.
Kim, S., Park, M., Yeom, S.-I., Kim, Y.-M., Lee, J.M. et al. (2014) Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nature Genetics 46(3), 270–278. DOI: 10.1038/ng.2877.
Koshimizu, S., Nakamura, Y., Nishitani, C., Kobayashi, M., Ohyanagi, H. et al. (2019) TRANSNAP: a web database providing comprehensive information on Japanese pear transcriptome. Scientific Reports 9(1), 18922. DOI: 10.1038/s41598-019-55287-4.
Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C. et al. (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research 40(Database issue), D1202–D1210. DOI: 10.1093/nar/gkr1090.
Lang, D., Ullrich, K.K., Murat, F., Fuchs, J., Jenkins, J. et al. (2018) The Physcomitrella patens chromosome-scale assembly reveals moss genome structure and evolution. The Plant Journal 93, 515–533. DOI: 10.1111/tpj.13801.
Lau, N.S., Makita, Y., Kawashima, M., Taylor, T.D., Kondo, S. et al. (2016) The rubber tree genome shows expansion of gene family associated with rubber biosynthesis. Scientific Reports 6, 28594. DOI: 10.1038/srep28594.
Ling, H.Q., Zhao, S., Liu, D., Wang, J., Sun, H. et al. (2013) Draft genome of the wheat A-genome progenitor Triticum urartu. Nature 496(7443), 87–90. DOI: 10.1038/nature11997.
Ling, H.Q., Ma, B., Shi, X., Liu, H., Dong, L. et al. (2018) Genome sequence of the progenitor of wheat A subgenome Triticum urartu. Nature 557, 424–428. DOI: 10.1038/s41586-018-0108-0.
Liu, F., Yu, H., Deng, Y., Zheng, J., Liu, M. et al. (2017) PepperHub, an informatics hub for the chili pepper research community. Molecular Plant 10(8), 1129–1132. DOI: 10.1016/j.molp.2017.03.005.
Lu, C., Tej, S.S., Luo, S., Haudenschild, C.D., Meyers, B.C. et al. (2005) Elucidation of the small RNA component of the transcriptome. Science 309(5740), 1567–1569. DOI: 10.1126/science.1114112.
Luo, M.-C., Gu, Y.Q., Puiu, D., Wang, H., Twardziok, S.O. et al. (2017) Genome sequence of the progenitor of the wheat D genome Aegilops tauschii. Nature 551(7681), 498–502. DOI: 10.1038/nature24486.
Maccaferri, M., Harris, N.S., Twardziok, S.O., Pasam, R.K., Gundlach, H. et al. (2019) Durum wheat genome highlights past domestication signatures and future improvement targets. Nature Genetics 51(5), 885–895. DOI: 10.1038/s41588-019-0381-3.
Mergner, J., Frejno, M., List, M., Papacek, M., Chen, X. et al. (2020) Mass-spectrometry-based draft of the Arabidopsis proteome. Nature 579(7799), 409–414. DOI: 10.1038/s41586-020-2094-2.
Metzker, M.L. (2010) Sequencing technologies – the next generation. Nature Reviews Genetics 11(1), 31–46. DOI: 10.1038/nrg2626.
Mochida, K. and Shinozaki, K. (2010) Genomics and bioinformatics resources for crop improvement. Plant & Cell Physiology 51(4), 497–523. DOI: 10.1093/pcp/pcq027.

Plant Omics Databases

Montgomery, S.A., Tanizawa, Y., Galik, B., Wang, N., Ito, T. et al. (2020) Chromatin organization in early land plants reveals an ancestral association between H3K27me3, transposons, and constitutive heterochromatin. Current Biology 30(4), 573–588. DOI: 10.1016/j.cub.2019.12.015.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. and Wold, B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7), 621–628. DOI: 10.1038/nmeth.1226.
Mueller, L.A., Solow, T.H., Taylor, N., Skwarecki, B., Buels, R. et al. (2005) The SOL Genomics Network: A comparative resource for Solanaceae biology and beyond. Plant Physiology 138(3), 1310–1317. DOI: 10.1104/pp.105.060707.
Mukhtar, M.S., Carvunis, A.-R., Dreze, M., Epple, P., Steinbrenner, J. et al. (2011) Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 333(6042), 596–601. DOI: 10.1126/science.1203659.
Naithani, S., Gupta, P., Preece, J., D’Eustachio, P., Elser, J.L. et al. (2020) Plant Reactome: a knowledgebase and resource for comparative pathway analysis. Nucleic Acids Research 48(D1), D1093–D1103. DOI: 10.1093/nar/gkz996.
Nakano, M., McCormick, K., Demirci, C., Demirci, F., Gurazada, S.G.R. et al. (2020) Next-generation sequence databases: RNA and genomic informatics resources for plants. Plant Physiology 182(1), 136–146. DOI: 10.1104/pp.19.00957.
Nishiyama, T., Fujita, T., Shin-I, T., Seki, M., Nishide, H. et al. (2003) Comparative genomics of Physcomitrella patens gametophytic transcriptome and Arabidopsis thaliana: implication for land plant evolution. Proceedings of the National Academy of Sciences 100(13), 8007–8012. DOI: 10.1073/pnas.0932694100.
Ortiz-Ramírez, C., Hernandez-Coronado, M., Thamm, A., Catarino, B., Wang, M. et al. (2016) A transcriptome atlas of Physcomitrella patens provides insights into the evolution and development of land plants. Molecular Plant 9(2), 205–220. DOI: 10.1016/j.molp.2015.12.002.
Palmer, L.E., Rabinowicz, P.D., O’Shaughnessy, A.L., Balija, V.S., Nascimento, L.U. et al. (2003) Maize genome sequencing by methylation filtration. Science 302(5653), 2115–2117. DOI: 10.1126/science.1091265.
Pearce, S., Vazquez-Gross, H., Herin, S.Y., Hane, D., Wang, Y. et al. (2015) WheatExp: an RNA-seq expression database for polyploid wheat. BMC Plant Biology 15, 299. DOI: 10.1186/s12870-015-0692-1.
Portwood, J.L., Woodhouse, M.R., Cannon, E.K., Gardiner, J.M., Harper, L.C. et al. (2019) MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Research 47, D1146–D1154. DOI: 10.1093/nar/gky1046.
Qin, C., Yu, C., Shen, Y., Fang, X., Chen, L. et al. (2014) Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization. Proceedings of the National Academy of Sciences 111, 5135–5140. DOI: 10.1073/pnas.1400975111.
Ramachandran, N., Raphael, J.V., Hainsworth, E., Demirkan, G., Fuentes, M.G. et al. (2008) Next-generation high-density self-assembling functional protein arrays. Nature Methods 5, 535–538. DOI: 10.1038/nmeth.1210.
Ramírez-González, R.H., Borrill, P., Lang, D., Harrington, S.A., Brinton, J. et al. (2018) The transcriptional landscape of polyploid wheat. Science 361, eaar6089. DOI: 10.1126/science.aar6089.
Sakai, H., Lee, S.S., Tanaka, T., Numa, H., Kim, J. et al. (2013) Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics. Plant & Cell Physiology 54(2), e6. DOI: 10.1093/pcp/pcs183.
Sakurai, T., Mochida, K., Yoshida, T., Akiyama, K., Ishitani, M. et al. (2013) Genome-wide discovery and information resource development of DNA polymorphisms in cassava. PLoS ONE 8(9), e74056. DOI: 10.1371/journal.pone.0074056.
Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Kato, T. et al. (2008) Genome structure of the legume, Lotus japonicus. DNA Research 15(4), 227–239. DOI: 10.1093/dnares/dsn008.
Sato, S., Hirakawa, H., Isobe, S., Fukai, E., Watanabe, A. et al. (2011) Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas L. DNA Research 18(1), 65–76. DOI: 10.1093/dnares/dsq030.
Sato, Y., Antonio, B.A., Namiki, N., Takehisa, H., Minami, H. et al. (2010) RiceXPro: a platform for monitoring gene expression in japonica rice grown under natural field conditions. Nucleic Acids Research 39(Database issue), D1141–D1148. DOI: 10.1093/nar/gkq1085.
Sato, Y., Namiki, N., Takehisa, H., Kamatsuki, K., Minami, H. et al. (2013) RiceFREND: a platform for retrieving coexpressed gene networks in rice. Nucleic Acids Research 41, D1214–D1221. DOI: 10.1093/nar/gks1122.

Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183. DOI: 10.1038/nature08670.
Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F. et al. (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115. DOI: 10.1126/science.1178534.
Scott, M.F., Botigué, L.R., Brace, S., Stevens, C.J., Mullin, V.E. et al. (2019) A 3,000-year-old Egyptian emmer wheat genome reveals dispersal and domestication history. Nature Plants 5, 1120–1128. DOI: 10.1038/s41477-019-0534-5.
Shirasawa, K., Isuzugawa, K., Ikenaga, M., Saito, Y., Yamamoto, T. et al. (2017) The genome sequence of sweet cherry (Prunus avium) for use in genomics-assisted breeding. DNA Research 24, 499–508. DOI: 10.1093/dnares/dsx020.
Shirasawa, K., Hirakawa, H., Fukino, N., Kitashiba, H. and Isobe, S. (2020) Genome sequence and analysis of a Japanese radish (Raphanus sativus) cultivar named “Sakurajima Daikon” possessing giant root. DNA Research 27(2), dsaa010. DOI: 10.1093/dnares/dsaa010.
Singh, V.K., Garg, R. and Jain, M. (2013) A global view of transcriptome dynamics during flower development in chickpea by deep sequencing. Plant Biotechnology Journal 11(6), 691–701. DOI: 10.1111/pbi.12059.
Song, J.-M., Lei, Y., Shu, C.-C., Ding, Y., Xing, F. et al. (2018) Rice Information GateWay: a comprehensive bioinformatics platform for Indica rice genomes. Molecular Plant 11(3), 505–507. DOI: 10.1016/j.molp.2017.10.003.
Sun, C., Hu, Z., Zheng, T., Lu, K., Zhao, Y. et al. (2017) RPAN: rice pan-genome browser for ∼3000 rice genomes. Nucleic Acids Research 45(2), 597–605. DOI: 10.1093/nar/gkw958.
Szklarczyk, D., Gable, A.L., Nastou, K.C., Lyon, D., Kirsch, R. et al. (2021) The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Research 49(D1), D605–D612. DOI: 10.1093/nar/gkaa1074.
Tanaka, H., Hirakawa, H., Kosugi, S., Nakayama, S., Ono, A. et al. (2016) Sequencing and comparative analyses of the genomes of zoysiagrasses. DNA Research 23, 171–180. DOI: 10.1093/dnares/dsw006.
Tang, H., Krishnakumar, V., Bidwell, S., Rosen, B., Chan, A. et al. (2014) An improved genome release (version Mt4.0) for the model legume Medicago truncatula. BMC Genomics 15, 312. DOI: 10.1186/1471-2164-15-312.
Tello-Ruiz, M.K., Naithani, S., Gupta, P., Olson, A., Wei, S. et al. (2021) Gramene 2021: harnessing the power of comparative genomics and pathways for plant research. Nucleic Acids Research 49(D1), D1452–D1463. DOI: 10.1093/nar/gkaa979.
Tomato Genome Consortium (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641. DOI: 10.1038/nature11119.
Toufighi, K., Brady, S.M., Austin, R., Ly, E. and Provart, N.J. (2005) The Botany Array Resource: e-Northerns, expression angling, and promoter analyses. The Plant Journal 43, 153–163. DOI: 10.1111/j.1365-313X.2005.02437.x.
Ueno, S., Nakamura, Y., Kobayashi, M., Terashima, S., Ishizuka, W. et al. (2018) TodoFirGene: developing transcriptome resources for genetic analysis of Abies sachalinensis. Plant & Cell Physiology 59(6), 1276–1284. DOI: 10.1093/pcp/pcy058.
Waese, J. and Provart, N.J. (2017) The bio-analytic resource for plant biology. In: van Dijk, A. (ed.) Plant Genomics Databases. Methods in Molecular Biology, Vol. 1533. Humana Press, New York, NY, pp. 119–148. DOI: 10.1007/978-1-4939-6658-5_6.
Wang, J., Deng, Y., Zhou, Y., Liu, D., Yu, H. et al. (2019) Full-length mRNA sequencing and gene expression profiling reveal broad involvement of natural antisense transcript gene pairs in pepper development and response to stresses. The Plant Journal 99, 763–783. DOI: 10.1111/tpj.14351.
Wang, M., Yu, Y., Haberer, G., Marri, P.R., Fan, C. et al. (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature Genetics 46, 982–988. DOI: 10.1038/ng.3044.
Wang, W., Mauleon, R., Hu, Z., Chebotarov, D., Tai, S. et al. (2018) Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557, 43–49. DOI: 10.1038/s41586-018-0063-9.
Wang, X., Wang, H., Wang, J., Sun, R., Wu, J. et al. (2011) The genome of the mesopolyploid crop species Brassica rapa. Nature Genetics 43, 1035–1039. DOI: 10.1038/ng.919.
Ware, D., Jaiswal, P., Ni, J., Pan, X., Chang, K. et al. (2002a) Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103–105. DOI: 10.1093/nar/30.1.103.

Ware, D.H., Jaiswal, P., Ni, J., Yap, I.V., Pan, X. et al. (2002b) Gramene, a tool for grass genomics. Plant Physiology 130, 1606–1613. DOI: 10.1104/pp.015248.
Whitelaw, C.A., Barbazuk, W.B., Pertea, G., Chan, A.P., Cheung, F. et al. (2003) Enrichment of gene-coding sequences in maize by genome filtration. Science 302, 2118–2120. DOI: 10.1126/science.1090047.
Winter, D., Vinegar, B., Nahal, H., Ammar, R. and Wilson, G.V. (2007) An “electronic fluorescent pictograph” browser for exploring and analyzing large-scale biological data sets. PLoS ONE 2, e718. DOI: 10.1371/journal.pone.0000718.
Xu, Y., Gao, S., Yang, Y., Huang, M., Cheng, L. et al. (2013) Transcriptome sequencing and whole genome expression profiling of chrysanthemum under dehydration stress. BMC Genomics 14, 662. DOI: 10.1186/1471-2164-14-662.
Yagi, M., Kosugi, S., Hirakawa, H., Ohmiya, A., Tanase, K. et al. (2014) Sequence analysis of the genome of carnation (Dianthus caryophyllus L.). DNA Research 21(3), 231–241. DOI: 10.1093/dnares/dst053.
Yano, R., Ariizumi, T., Nonaka, S., Kawazu, Y., Zhong, S. et al. (2020) Comparative genomics of muskmelon reveals a potential role for retrotransposons in the modification of gene expression. Communications Biology 3(1), 432. DOI: 10.1038/s42003-020-01172-0.
Yasui, Y., Hirakawa, H., Ueno, M., Matsui, K., Katsube-Tanaka, T. et al. (2016) Assembly of the draft genome of buckwheat and its applications in identifying agronomically useful genes. DNA Research 23(3), 215–224. DOI: 10.1093/dnares/dsw012.
Yazaki, J., Galli, M., Kim, A.Y., Nito, K., Aleman, F. et al. (2016) Mapping transcription factor interactome networks using HaloTag protein arrays. Proceedings of the National Academy of Sciences 113(29), E4238–47. DOI: 10.1073/pnas.1603229113.
Yu, J., Hu, S., Wang, J., Wong, G.K.-S., Li, S. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296(5565), 79–92. DOI: 10.1126/science.1068037.
Yue, J., Liu, J., Tang, W., Wu, Y.Q., Tang, X. et al. (2020) Kiwifruit Genome Database (KGD): a comprehensive resource for kiwifruit genomics. Horticulture Research 7, 117. DOI: 10.1038/s41438-020-0338-9.
Zhang, H., Zhang, F., Yu, Y., Feng, L., Jia, J. et al. (2020) A comprehensive online database for exploring ∼20,000 public Arabidopsis RNA-seq libraries. Molecular Plant 13(9), 1231–1233. DOI: 10.1016/j.molp.2020.08.001.
Zhang, J., Chen, L.-L., Xing, F., Kudrna, D.A., Yao, W. et al. (2016) Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63. Proceedings of the National Academy of Sciences 113(35), E5163–E5171. DOI: 10.1073/pnas.1611012113.
Zhao, Q., Feng, Q., Lu, H., Li, Y., Wang, A. et al. (2018) Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nature Genetics 50(2), 278–284. DOI: 10.1038/s41588-018-0041-z.
Zheng, Y., Wu, S., Bai, Y., Sun, H., Jiao, C. et al. (2019) Cucurbit Genomics Database (CuGenDB): a central portal for comparative and functional genomics of cucurbit crops. Nucleic Acids Research 47(D1), D1128–D1136. DOI: 10.1093/nar/gky944.

Index

Note: Page numbers in bold type refer to figures. Page numbers in italic type refer to tables.

A2K (article to knowledge) approach 148 ABC transporter, peroxisomal 113 abiotic stress 30, 33–36, 38, 86 conditions 22 crop plants 40 plant acclimation and tolerance 40 signaling mechanism under 42–43 ABI’s SOLiD 16 abscisic acid 40 acetylation 39, 40, 43, 85 activation mark accumulation 99 active learning 238 adaptive transcriptomic responses 11 adenine deaminase-based (ABE) DNA editor 208 AE (Auto-Encoder) 219 AGRIS 129, 130 AI 116, 217 X- (explainable AI) 224, 225, 230, 231 AI-1 256 alcohol dehydrogenase 41 AlexNet 224 from scratch 226 alfalfa (Medicago sativa) 4 algae 112 algorithms 229 DL 217 reinforcement learning 217 supervised learning 217–218 Transformer 220 alleles, rare 187 AlphaFold algorithm 229 AlphaFold2 231

AlphaGO (DeepMind) 219 AlphaGO Zero (DeepMind) 219 Alternaria alternata, citrus 42 aluminium oxide nanoparticles 39 amino acid sequences 98, 224, 229 amyloplasts 114 ANN (artificial neural network) 217 answer labels 218 anthocyanin transport, vacuoles 112 antibiotics, selection with 248 APCI (atmospheric pressure chemical ionization) 56 APOLO 87–88 apple black rot disease 227 leaf diseases 226 Arabidopsis 84 active demethylation 102 AGO proteins 83 AGRIS 129, 130 AtR8 89 circRNAs 88 cis-NAT expression 85 databases epigenome 255 genome 254 omics 254–256 proteome 256 transcriptome 255–256 DNA resources 248 ENOD40 and ASCO expression 82, 87 flowering stage 247 genome editing 210

Arabidopsis (continued) genome PPRs 115 genomes 185 GWAS resources 185–186 immune proteins 174 interactome map (AI-1) 256 juvenile plants 247 lncRNAs 88, 89 metabolomic analysis, of peroxisome-defective mutants 113 mutants 57, 101 ORFs 256 peroxisome studies 113 public resource centers 246, 247 public sources of resources and information 247 resources 247–248, 247, 248 RNA-seq runs 147 rosette leaf numbers 227 RPP1 of 176 seed resources natural accessions (ecotypes) and insertion mutant lines 247, 248 for omics analysis 247–248 TF genes 143 Arabidopsis Interactome Mapping Consortium 256 Arabidopsis Next-Gen Sequence DBs 254, 256 Arabidopsis thaliana 1, 42, 52, 160, 163, 246, 247 AthaMAP 129, 130 bioactive GAs 153, 154 centromeric repeats, unique subset 100 DNA methylation 101–102 GA deficiency 155 metabolism and signaling genes 156, 157 GA receptors, biochemical analysis 155 genome 98, 155 information 1 mutants 57 1001 Genomes Project 185, 247–248, 254, 254, 255 orthologs, clades 156 Plant Cistrome Database 129, 130 repetitive elements 102 tissue-specific development 99 transcriptome 18–19, 18, 82 Arabidopsis–effector protein interactions 175 ARIGAN model 239 ARS (Arabidopsis RNA-seq database) 255 artificial intelligence (AI) explainable (X-AI) 217, 224, 225 see also DL (deep learning); machine learning ASCO (alternative splicing competitor) 87 Atari 2600 games 219 atga20ox1/atga20ox2 160–161 AthaMAP 129, 130 ATHENA 254, 256

AtNSRs 87 AtR8 89 ATTED-II database 148 automated sorting systems, based on images, DL for 224, 227–228 automated taxonomic classification 225–226 of olives 228 automatic curation 148 AutoML (automatic machine learning) 220–221 frameworks 220–221, 221 autophagy 112 Avr 175 and NLR interaction 175–176

back splicing 88 backpack-based platform 72 bacteria, plant-pathogenic 42 bacterial infection 38 BAR (Bio-Analytic Resource for Plant Biology) 261 barley (Hordeum vulgare) lines, comparative structural analysis 5–6 nuclear proteome 109 pan-genome project 5 reference genome 5, 6 reference-quality genome assemblies 5 seed images 237 base editing 208, 208, 211 batch-effects 60 Bayesian alphabet 195 Bayesian LASSO 194, 195 Bayesian regularized regression methods 195 Benjamini-Hochberg (BH) method 183 BERT (Bidirectional Encoder Representations from Transformers) 220 bioactive GAs 153–155, 154 biochemical analyses, of peroxisomes 113 bioinformatic technology 43 bioinformatics methods 43 platforms 253 biological discoveries, enhancement 60 biological functions, enrichment analysis of 130 biological processes, multiple 30 biotic stress 30, 32, 36–37, 38, 41 BLAST 144, 260 in-house 145 blueberry 227 BnaGA2ox family 162–163, 165 BnaGA3ox family 162 BnaGA20ox genes 160–162, 165 in ZS11 160, 161, 161 BnaSLY2 164 BnTIP website 156 body parts, plant, selection for input images for CNN models 218 Bonferroni correction 183 Botany Array Source 261 Brassica

oleracea 151, 156, 157, 162, 164, 165 rapa 102, 155, 156, 157, 164, 165 var. perviridis 54 Brassica napus (oilseed rape) 151, 152–153, 162, 164–165 accessions DELLA and GID1 gene families 163 GA biosynthesis and catabolism genes- gene duplications and losses 159 GID2/SLY 163 GA metabolism and signaling genes 165 CDS 156, 157 gene families, phylogenetic trees 156 genes 153 genomes 153 resynthesized, expression of GA-related genes 164–165 Brassicaceae 102, 153 see also Arabidopsis; Arabidopsis thaliana bread wheat 5, 258 hexaploid 258 breeding 190 Brassica napus 152–153 efficiency 192 F1 hybrid 198 GS implementation 197–198, 197 lines, selection 191, 192 multi-omics-based 199 selection 190 bridge PCR 17 BRs (brassinosteroids) 152 bryophytes 260–261 databases 260–261 BS-seq 255 BUSCO (Benchmarking Universal Single-Copy Orthologs) 22, 23

CA (correspondence analysis) 141–142, 142 Caenorhabditis elegans 82 calcium homeostasis 42 camera-based platforms 72 canola 164 canopy structure 71 carotenoids 55 Cas, proteins, new types 209, 210–212 Cas9 engineered 209–212 gene 201 orthologs 210 protein 206 variants 209 near-PAMless engineered 210 Cas9SG 210 Cas10d 211 Cas13 (C2c2) 210 Cascade (CRISPR associated complex for antiviral defence) 211

CASP (Critical Assessment of Protein Structure) 229 CasX, genome editing 210 catabolism 154, 155 genes, gene duplications and losses in B. napus accessions 159 causal variant 191 CCS (circular consensus sequencing) 3 CCSB Database Interactome 254, 256 cDNA micro-array 173 cDNAs 16–17 clones 248, 250 CDS (coding sequences) 18, 156, 158 for GA metabolism and signaling genes 156, 157 cell wall 32–38 metabolism, under flooding stress 38–39 proteomics 38 cell-level phenotyping 70 cells abiotic stress responses 42 compartmentalization 32 death signals 40 diverse response strategies 32 ER 110 fractionation technique 30 cellular functions, molecular basis 31 CENH3 100 central hub TFs, identification 130–132 central metabolites 52, 54 centromere 100 cereal crops genomes 5 natural variation 5 semi-dwarf mutants 156 sorghum 229 see also barley; maize; rice; wheat ceRNAs 88 CE–MS 54 CG sites 84, 101 chaperones, ER 110 Chara braunii 112 chemical derivatization 54 CHG sites 84, 101 CHH sites 84, 101–102 chickpea 4 ChIP-seq (chromatin immunoprecipitation sequencing) 125, 127, 255 databases 130 chloroform-methanol extraction 31 chlorophyll fluorescence imaging 70 meters 72 chlorophyll meters 72 chlorophylls 55 Chloroplast 2010 57 chloroplasts 38, 114 proteomic analysis 38 chromatin 3D, structural features 109–110

chromatin (continued) landscape at Arabidopsis nuclear periphery 109 modification, of FLC locus 86–87 regions 132 regulation of genes 98 chromatin immunoprecipitation (ChIP) 99 chromatographic techniques 51 chromosomal inversions study 6 chromosomes conformation capture techniques 132 structures construction 100 circRNAs 88–89 cis 25 CIS-BP 129, 130 cis-elements 124, 143, 147 prediction 127 related databases 129–130 and TFs, large-scale interaction techniques 125–127, 127 cis-lncNAT 86 cis-NATPH01-2 86 cis-NATs 85–86 citrus 227 Alternaria alternata 42 classification task, for supervised learning 218 classifications 224 by DNNs 225, 226 tasks 230 taxonomic 224 automated 225–226 CLF (CURLY LEAF) 86–87 climate changes 195 clones, cDNA 248, 250 cluster bean 4 CMT2 101 CMT2 gene 102 CMT3 (CHROMOMETHYLASE) 101 CNN (convolutional neural network) 218, 224–225, 228 guides for beginners 229–231 models 224–225, 225 co-expression analysis 130 network 138 Co-IP (co-immunoprecipitation) assay 173 coding sequences see CDS COLDAIR 86–87 COLDWRAP 87 common bean, pan-genome information 4 comparative proteome analysis 114 comparative structural analysis, barley lines 5–6 compound annotations 56 computer vision 234–236 technology 240 confocal microscopy 69 COOLAIR 87 copy genes (paralogs) 143 core gene set, cereal crops 5

correspondence analysis (CA) 141–142, 142 cortical ER 110 cost-effective methodologies 103 cotton 70, 89 cotton diseases, prediction, sites in India 218–219 CowPeaPan 4 CPM (counts per million) 24 CRISPR associated complex for antiviral defence (Cascade) 211 CRISPR-Cas system Class 1 type 1 211 Type V, functional diversity 210 CRISPR-Cas type I-D (TiD) system 211, 211, 212 CRISPR-Cas9 206, 206 genome editing using 207 off-target effects 207 system 212 applicability 209 tools 209, 209, 212 CRISPR-Cas13 210 CRISPR-Cas14 210 CRISPR-CasΦ 211 crop plants abiotic stress 40 omics databases for 256–260 protein phosphorylation studies in 39 crops automated sorting 227–228 management 236 model values 199 modeling 198–199 research, experimental plant resources overview 248–250, 248 yield modeling 218–219 CT 71 micro- 69–70 curators 145 CV (cross-validation) 195, 196 CV2, CV1, CV0, CV00 195–196, 196 K-fold 195, 196 CycleGAN network 239 Cytoscape 146 cytosine, methylation 84, 100–101 cytosolic calcium 42, 43

DAP-seq (DNA affinity purification sequencing) 125, 127 DART (direct analysis in real time) ionization 56 data annotation cost 238 augmentation 238–239 challenges, in field phenotyping platforms 73 genotype 192 high-throughput 132 metabolomics 50, 60–62 metabolite identification/ annotation 56 sharing importance 56–60

multi-omics 264 organelle-specific omics 108 phenotype 182–183 reliability 56 repositories, metabolome 56, 57 reproducibility 250 RNA-seq 138, 147 sequencing, online 147 training 220 for GS 191, 192 preparation 193 data analysis metabolomics 50, 60 workflow 57 pipeline 67 databases access problems 263 Arabidopsis epigenome 255 genome 254 proteome 256 transcriptome 255–256 Arabidopsis RNA-seq 255 bryophytes 260–261 co-expression analysis 130 image-based 115, 116 images/movies, for organelles 115–116 metabolome analyses 56, 57 metabolomics 51–52 NCBI GEO 255 omics 253 for crop plants 256–260 data of plant–pathogen interactions 176 portals 261 public metabolome, assessor software tools 61 regulatory factors information 143 TF-related 129 without text 262–263 datasets active learning application efficiency 238 disease 236, 237 domain adaptation 239 open-source 226 image 226 plant 239 realistic translated 239 systematic analytical variation 60 date fruits 228 DBDs (DNA-binding domains) 129 DCA (distance from correspondence analysis) 140, 141–143 index 141 dCas9 protein 207–208, 209 de novo assembly 3, 12, 18, 19, 20 preparation 22 tools and data subsets 22 type-I and type-II errors 20–21 circumvention 20–21

de novo methylation 84, 101–102 deep neural networks see DNNs DeepBind 228 DeepMind 219 defence, against pathogens 172–173 dehydration-responsive gene regions 99 deletions 212 DELLA gene family 163–164 proteins 154, 155 demethylase REF6 99 dephosphorylated proteins 40–41 development, plant 30, 99 (DI)-MS (direct infusion) 51 dimensionality reduction task 219 diploid analysis 182 disease 172–173 datasets 236, 237 diagnosis, based on images (DL for) 218, 224, 226 image segmentation for 236–237 prediction, cotton 218–219 segmentation, triple-stream network 237 DL (deep learning) 116, 217 algorithms for object detection 234 one-stage detectors 234–235 two-sector detector 234 frameworks 224 generalization capability 238–239 for graphs 220 for plant images 225–228 practical application, tips and cautions for beginners 224 predictions 224 technologies, concepts as applied to images 224 DME 102 DNA deaminases 208 marker-assisted selection (MAS) 190–191, 191 markers 191, 192, 193 selection for GS model construction 199 methylome analysis 255 motifs 228 mutations 132 resources, Arabidopsis 248 sequences, DL for 224, 228–229 structures 132 DNA binding patterns, TF, resources 129–130, 130 DNA editor, adenine deaminase-based (ABE) 208 DNA methylation 84, 100–103 Arabidopsis thaliana 101–102 de novo methylation 85, 101–102 genome-wide, in plant genomes 102–103 global variation patterns 102–103 maintenance methylation 101

DNA sequencing advances 253 technologies 1 DNA-binding specificity, TFs 129 DNNs (deep neural networks) 217, 224 classification by 225, 226 quantification by 227 to predict invisible characters in plant organs 227 typical regression 227 domain-adversarial neural network 239 dosage-sharing hypothesis 165 DQN (Deep Q-Network) 219 DRM2 (DOMAINS REARRANGED METHYLTRANSFERASE 2) 101 drought stress 40 soybean 42–43 dry-lab procedures 127–129, 128 workflow 127, 128 DSBs (double strand breaks) 205, 207 dsRNAs 84, 86 durum wheat 258 dwarf mutants, severe 155 dwarfism 152, 164

e-infrastructure, efficient, for metabolomics 57 EC (Enzyme Commission) numbers 146 ecotypes 247 effectors 173–174 eFP (electronic Fluorescent Pictograph) 261 EI (electron ionization) method 55 Elaeis guineensis 103 elicitors 41 EM-seq (enzymatic methyl-seq) 103 emulsion PCR 17 endophytic microbes 250 endoplasmic reticulum (ER), rough (rER) 110 energy consumption 39 production 41 ENOD40 87 enrichment analysis, of biological functions 130 Ensembl Plants 145 environmental conditions changes 33–37, 41–43 and organelles function changes 114 and GS modelling 198 environmental stress 30–31, 39, 71 indicators, seed features 237 enzymatic reactions 97 enzymes 146 GA biosynthesis 156, 158 rate-limiting 165 epigenetic marks, removal 102 epigenetic regulation 97 epigenetic silencing 99 epigenetics, definition 97

epigenome databases, Arabidopsis 255 reprogramming, animals/plants 100 ER (endoplasmic reticulum) 42–43, 110–111 chaperones 110 luminal proteins 110 markers 110 proteins 110 discovery 110 proteomics 38, 39 retention signal 110 rough (rER) 110 smooth (sER) 110 targeting signal peptide 110 ERAD (ER-associated protein degradation) 110 ERD2 (ER retention signal receptor) 111 Escherichia coli 210 ESI (electrospray ionization) method 55–56 eSpCas9 210 ethylene 40 ETI (effector-triggered immunity) 173 eukaryotes 88, 97 glycosylation responses 40 eukaryotic organelle, nucleus 38, 39, 108–110 expression abundance of genes, prediction 228–229 activation/suppression 229 patterns 224 see also gene expression; GENs Expression Atlas 148

F1 hybrid breeding 198 faba bean 4 Fabaceae 3–4, 250 genome 4 FASTA file 144, 145 FAT-PTM database 254, 256 FDR (false discovery rate) 183 feature extraction, from images 218 feature maps 230 feature visualization methods 225 female gametogenesis 85 field phenotyping 71–73 platforms 71–72 ground-based 72, 73 limitations 72–73 satellite 71–72 UAVs 72 target traits 71, 71 FLC locus 87, 99 chromatin modification 86–87 flooding responsive maps 38 stress 38, 42–43 soybean 38–39 floral sterility 103 floral-dip method 207

fluorescence imaging 69 chlorophyll 70 fluorescence spectroscopy data 227 FokI nuclease domain 206 forest landscapes, remote sensing 236 FPKM/RPKM (fragments/reads per kilobase of transcript per million) 24 fractionation distillation 55 Fruit360 dataset 226 functional diversity, in Type V CRISPR-Cas system 210 fungi, phytopathogenic 41–42 FWA gene 209

GA biosynthesis early 156–160 enzymes 156, 158 gene duplications and losses in B. napus accessions 159 genes early 160, 165 gene families and developmental expression families 158 phases 153, 154 GA (gibberellin) 151, 152, 153–156 bioactive 153–155, 154 deficiency 155 signaling 155 genes 163–164 GA metabolism 152, 153–155 and signaling genes 165 in B. napus 156–164, 157 coding sequences for 156, 157 GA-auxotroph, and response mutants 155–156 GA2ox (GA2-oxidase) 160, 162 classes 155 gametogenesis, female 85 GAN (generative adversarial network) 219, 227, 239 Gangan genome 163 GBLUP-RR 194 GBrowse 260 GBS (genotyping-by-sequencing) 193 GCNs (Graph Convolutional Networks) 220 GC–MS technique 54–55 GC–MS-based metabolomics 54 GDT-TS (global distance test - total score) 229 GEBV (genomic estimated breeding value) 192 gel electrophoresis, two-dimensional 31 gel-based proteomics analysis 31, 32 gel-free proteomics analysis 31, 32 label-free techniques, LC–MS/MS 31 labeled techniques chemical in vitro 31 metabolic in vivo 31 gene expression 164 abundance, prediction 228–229 common regulatory mechanisms 143 fluxes 11

identical profiles 137 levels fluctuation 137 matrix 140, 141, 142 example 141 and polyploidy stress 164–165 profiles comparing 137–138 prediction of gene functions 139, 140 reciprocity 139–140, 140, 142, 142 similarity 137, 139 similarity and/or reciprocity 139–143, 139, 142 regulation 85, 137 repression 87–88 see also GENs (gene expression networks) gene-centered approaches 126 gene-for-gene theory 172 gene(s) activation 208–209 annotations 147 biological functions 138 body, methylation 101 chromatin regulation 98 duplications 11, 143, 159, 160, 165 function, prediction 224 functional annotation 139, 144 haplotypes 187 histone 98 homologous 160, 163–164, 164–165 knockout mutants 248 losses 165 pairs 139, 146, 156–157 homologous 163–164 regulation 132 studies 131, 132 by TFs 126 unverified 130 sets 138 silencing 85 spatio-temporal profiles 12 target 124 genetic background, of plant materials 246 genetic diversity 5, 6, 12, 20, 167, 181, 182, 257, 259 polyploid crops 164 genetic improvement 197–198, 197 genetic model species 13 genetic relatedness 19 genetic resources, quality assurance 246 genetic studies, model species 12 genetic traits 165 genome editing 205, 210 multiplex 207 technologies 205–206, 206, 211, 212 using CRISPR-Cas9 207 genomes analysis 1 Arabidopsis 185 Brassica napus 153

genomes (continued) comparison 3 construction 100 databases, Arabidopsis 254 engineering 212 genome-wide DNA methylation in 102–103 maize 259 manipulation, using CRISPR-dCas9-based system without DSB induction 207–209 mitochondrial 114–115 organization, 3-D 109 repetitive elements 100–101, 102 sequenced 1, 2 sequencing, polyploid wheat 258 sequencing technologies advances 1–3 long read 1 size and repetitive elements 102 structure variant identification 3 tomato 259 transposable elements 100–101 wheat 258 see also GWAS; reference genomes genomic information 190 genomic modification 190 genomics 173, 190 see also GS (genomic selection) genotypes data 192 progeny prediction 198 genotype–phenotype association, marker 187 genotype–phenotype relationships 181 GENs (gene expression networks) 137, 138, 141 analysis 138 with computational annotation of genes 144–145 construction 141, 146, 147 method 143–144 example 139 GEN extension 148 knowledge-based information and ontology terms 145–146 with metabolic pathway information 146 nodes and edges 138, 139, 146–147 types of relationships 138 user-generated 148 germination 112 germlines 4 germplasms, rice 257 GID1 gene family 163–164 GID2/SLY gene family 163–164 Github 226 Global Natural Product Social (GNPS) 57 global proteomics approach 30 global-scope transcriptome projects 12 Glycine max see soybean soja 4

glycoproteins 39, 41 glycosylation 40, 41, 43 GNNs (Graph neural networks) 220 GNPSMassIVE 58 GO (gene ontology) terms 220 Golgi apparatus 111 cisternae, cis-, middle and trans- 111 Golm Metabolome Database (GMD) 57, 58 Golovinomyces orontii, and Arabidopsis, interactome study 174 Google, Transformer 220 GoogleNet, with transfer learning 226 GPT-3 (Generative Pre-trained Transformer 3) model (OpenAI) 220 Grad-Cam++ 226 Grad-Cam 230 GrainGenes 257, 258 Gramene 257, 257, 261 graphs, deep learning for 220 Green Revolution 152 greenhouse phenotyping 70 gRNAs 206 multiple 208 GRNs (gene regulatory networks) 124, 138 advanced analysis 130–132 ground-based phenotyping platforms 72 ground-based sensing systems 71 growth chamber 69–70 growth hormones 152 GS (genomic selection) 190, 192–193 advanced topics 198–199 combination with other omics 199 DNA marker selection for model construction 199 model incorporating G x E effects 198–199 core processes 191, 192–197 for F1 hybrid breeding 198 implementation in practical breeding 197–198, 197 model 191, 192, 193–197 accuracy 192, 193, 195 prediction accuracy 195 model construction 192, 193–197, 194 DNA marker selection 199 statistical modelling 193–195, 194 GT (gene targeting) 207 guided backpropagation 230 Guided Grad-CAM 227, 230–231 GWAS (genome-wide association study) 103, 181, 228 case studies 185–186 Arabidopsis 185–186 rice 186 core processes 182–184 analyzing statistical difference 183–184 associating genotypic and phenotypic variations 182

checking phenotype data 182–183 mixed linear model 183 preparing populations 182 software 184 graphical representation of results 184–185 panels 182 problems with 186–187 functional validation of results 186–187 rare alleles 187 spurious association 187 software 184, 184

H2A.Z 102 H3K4me3 98, 99 H3K9ac 98, 99 H3K9me2 99, 100, 101 H3K9me3 99 H3K27me3 98–99 enrichment 99 H4Ac (active mark) 99 ‘Habataki’ rice 152 haplotypes 187 gene 187 HDR (homology-directed repair) pathway 205, 207 heat shock proteins, overexpression 40 heat stress 85 height control, and hormones 152 Helianthus tuberosus 236 helper NLRs 175 herbivorous insects 42 heterochromatic regions 102 heterochromatin constitutive 101 silencing 99 Hi-C library 3 Hi-C technique 109 HID1 87–88 HiFi reads 3 high-throughput data 132 high-throughput methods 173 high-throughput sequences 124 HiSeq X 17 histone 40, 255 acetylation 85 modifications 97–100, 255 chromatin-mediated 97, 98 genome-wide distribution and responsiveness 99–100 targets 97 proteins 98 variants 100, 102 homolog 148 exchanges 165 interactions 165 homologous genes 160, 163 expression 164–165 pairs 163–165

homopolymer 17 Hordeum vulgare see barley hormones balance 152 genes 152 and height control 152 host–pathogen interplay 173–174 housekeeping genes 24, 25 HR (hypersensitive response) 173 HDR (homology-directed repair) 205 HSI (hyperspectral images) data 227 HTP (high-throughput-phenotyping) platforms, advantages 199 Hyaloperonospora arabidopsidis, plant–pathogen network 174 hybrid assembly 21 hybridization, B. napus 153 hyperspectral imaging 70, 72–73, 227

IBM1 (INCREASE IN BONSAI METHYLATION 1) genes 102 ibm1 mutants 102 identified metabolites 56 Illumina HiSeq 2 Illumina HiSeq/NovaSeq system 16, 17 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 224 image-based databases 115, 116 ImageCLEF 218 imagery, satellite 72 images DL technologies, concepts applied to 224 analysis, in field phenotyping data 73 automated taxonomic classifications for 225–226 data, non-RGB 227 databases, for organelles 115–116 datasets, open-source 226 input, selecting color space 218 labeling, and costs 238 non-invasive prediction, DL for 226–227 plant 224 DL for 225–228 quantification, DL for 227 RGB 226 segmentation 234, 235–236, 235, 239–240 applications in phenomics 236–238 challenges 238–239 labeling costs 238 real-world applications 235–236 semantic/instance/panoptic 235 stress/disease diagnosis, DL for 224, 226 imaging hyperspectral 70, 72–73, 227 sensors 69 techniques 108 immune proteins, Arabidopsis 174 immunity, pathogen 173

in vitro labeling methods 55 in vivo labeling methods 55 in-house BLAST 145 indels (insertions and deletions) 17, 212 India, cotton diseases prediction 218–219 indoor phenotyping platforms 69–71, 73 infrared thermometer 72 INSDC (International Nucleotide Sequence Database Collaboration) 145, 147 insect 42 herbivorous 42 proteins 42 vectors 42 insertion mutants, omics analyses 248 instance segmentation 235 interactome, plant–microbe 174–175 InterPro 145 InterProScan 145 interspecies interactome 174 Ion Torrent 16, 17 IPS1 (phosphate starvation 1) 88 Iso-Seq approach 17 isoelectric focusing 31 isolated peroxisomes 113 isotope labeling methods, stable 55 iTRAQ labeling 32 IWGSC (International Wheat Genome Sequencing Consortium) 249–250

Japan Kyoto University 250 NARO (National Agriculture and Food Research Organization) 186 National Institute of Genetics, NBRP-RICE Project 249 NBRP (National BioResource Project) 246, 247, 248 JASPAR 129, 130 Jerusalem artichoke (Helianthus tuberosus) 236 Jupyter notebooks 57 k-mers 22 K27me3 99 Kaggle database 226 KARMA 103 KEGG 146 Keras 229 KNApSAcK 50–51 knowledge-based annotations of genes 147 knowledge-bases 146, 147–148 for RNA-seq data, expression data and GENs 147–148 Kyoto University (Japan) 250 labeling 32 ‘active’ approach 238

image 238 methods in vitro 55 in vivo 55 labels, answer 218 laboratory 69–70 Landsat-8 72 Laodelphax striatellus (brown planthopper) 42 laser ablation scanning 69 laser scanning platform, pole-based terrestrial 72 LC PDA (liquid chromatography photodiode array detector) 55 LC–MS (liquid chromatography-mass spectrometry) 31, 53, 54 LD (linkage disequilibrium) 193 lead stress 32 leaf counting, using segmentation map 237–238 greenhouse phenotyping 70 images 218 LeafGAN 239 legumes resources 250 species nitrogen-fixing symbiosis model 4 re-sequencing-level pan-genome information 4 see also soybean (Glycine max) leguminous plants, genome analysis 4 Lepidium sativum 153 LiDAR imaging 70 UAV 72 LiDAR sensor, backpack-based 72 lima bean 4 linker histone genes 98 lipids 55 body 113–114 metabolism 133 Litchi chinensis 228 lncRNAs (long noncoding RNAs) 82, 84, 86–87, 88, 255 antisense 86 classification 82 ENOD40 87 Pol V-derived 84 positional relationship to protein-coding genes 82–83 long-read sequencing 11 plant pathogenic (viroids) 89 platforms 3 technologies 1, 6 ultra- 3 long-reads 20–21 LOOCV (leave-one-out cross-validation) 195 Lotus japonicus 4 (Regel) 250 LRP (layer-wise relevance propagation) 227, 230 variants 230–231 LSTM (Long Short-Term Memory) model 218, 220

luminal proteins, ER 110 lychee (Litchi chinensis) 228 lysis 31 buffer formulations 31

machine learning 217 methods 195 macromolecules 38 Magnaporthe oryzae–rice interactions 174, 176 maize (Zea mays) 32, 102, 199, 229, 259 ear phenotyping and kernel weight estimation method 237 genome 259 pan-genome project 259 Plant Cistrome Database 129, 130 reference genomes 259 MaizeGDB (Maize Genetics and Genomics Database) 257, 259 Manhattan plot 184, 185 manual curation 145, 147, 148 mapping analyses, genetic 190–191 raw sequence reads 23–24 short sequence reads against reference genome 14, 19 Marchantia polymorpha 260 markers DNA 191, 192, 193 ER 110 genetic 187 genotype–phenotype association 187 molecular 182 MAS (marker-assisted selection) 190, 191 MASC (Multinational Arabidopsis Steering Committee) 247 mass spectrometry see MS MassBase 59 MassIVE 57 matrix proteins, peroxisome 113 MBKBASE 257, 258 mCG 101, 102 mCHH 102 Medicago sativa 4 truncatula 4, 87 membrane carriers, into peroxisomes 113 proteins, oil body 114 MeRy-B 57, 58 MET1 (METHYLTRANSFERASE 1) 101 metabolic editing 62 metabolic pathways information, GENs with 146 nodes describing names 139, 146 peroxisomal 112–113 plastid 114 Metabolomics Workbench 57, 58 MetaboLights 57, 58

Metabolite Profiling Database for Knock-Out Mutants (Me-KO) 57 metabolites 50–51 annotation 51, 56 central 52, 54 characterization 51 detection under current analytical techniques 52, 53 identification 51, 56 confidence 50, 51 identified 56 non-polar 55 primary 50, 54 profiling, 1H-NMR-based 54 secondary/specialized 50, 52, 54–55 (semi-)quantification 55 metabolome 50, 52 analyses, databases 56, 57 data analysis, reproducible 57–60 repositories 57, 58 public (meta)databases, assessor software tools 61 MetabolomeExpress 57 MetabolomeXChange 57 metabolomics analysis 112 analytical methods 55–56 analytical targets 52–55 physicochemical properties 52 strategies 52, 52 data 50, 60–62 identification confidence metrics 52, 53 methods, identification confidence 52, 56 Metabolomics Standards Initiative 50, 51, 56, 62 Metabolonote 57, 58 metadata metabolomic 50 shared 57 methylase, Dnmt3 class de novo 84 methylation asymmetric CHH 84 of cytosine 84 de novo 84 DNA 84, 100–103 non-CG 101 polymorphisms 103 methylome analysis, DNA 255 maps 255 mice 86, 246 Micro CT 69–70 microarray 16 microbes endophytic 250 uncultivated 210 microbial pathogens 172 microflora 250 microsomes 110

minirhizotrons 70, 71 miR399 88 miRbase 83 miRNA/miRNA* duplex 83 miRNAs (microRNAs) 82, 83, 255 target 88 missing genes 159, 162 missing proteins 43 mitochondria 38, 39, 114–115 functions, transcription and translation machineries 115 mitochondrial genomes 114–115 intergenic regions 115 size variation 114–115 mitoribosome footprints 115 mitoribosomes 115 MLM (mixed linear model) 181, 183, 187 model species 116 entire exonic regions 15 for genetic studies 12 legume 4 M. truncatula 4, 87 well-annotated reference sequences, mapping to 14, 19 see also Arabidopsis thaliana modifications histone 97–100, 98, 255 post-translational 39–41, 43 synthesized proteins 110 molecular biotechnology 127 molecular markers 182 molecular mechanisms, of plants 31–32 monogenic traits 191 motif scanning/scanners 128 mouse C57BL/sub-strain 246 movie databases, for organelles 115–116 MR (mutual rank) 140 MRCNN 228 MRI (magnetic resonance imaging) 56, 71 mRNA transcriptome analysis 255 mRNAs 13, 255 sequence similarities 143–144 MS (mass spectrometry) 43, 52, 53, 114, 174 -based hyphenated technologies 51 proteomic analysis methods 253 ultrahigh-resolution techniques 56 MS (mass spectrum) 51 MSI (mass spectrometry imaging) 51, 56 multi-cellularity 10 multi-genome comparisons, cereal crops 5 multi-mapping 16 multi-omics data 264 multi-omics-based breeding 199 Multinational Coordinated A. thaliana Genomics Project 248 multiple affinity-based techniques 103 multiple gRNAs 208 multiple reference genomes 5 multiplex genome editing 207

mutagenesis, tomato 211 mutants Arabidopsis 57, 101 BR-deficient 152 gene knockout 248 ibm1 102 insertion 248 knockout 248 peroxisome-defective 113 plant, affected in mitoribosomes 115 semi-dwarf 155–156 severe dwarf 155 sly1/sly2 155 mutation analysis 211 DNA 132 dwarfism 164 MuZero (DeepMind) 219

N-glycosylation 40 NAM (nested association mapping) 250 names, cross-linking 147 NAPPA 256 NARO (National Agriculture and Food Research Organization of Japan) 186 NAT pairs, sense–antisense 85 NATs (natural antisense transcripts) 85 natural accessions 247 NBRP (National BioResource Project) of Japan 246, 247, 248 NBRP-RICE 249 NBRP-Wheat 250 NCBI (National Center for Biotechnology Information) 144–145 GEO, databases 255 PubMed 148 ncRNAs (noncoding RNAs) 89, 255 antisense 86 biological roles 79 classification 82–83 first (based on size) 82 second (based on polyadenylation) 82 common mechanisms and complexity 79 identification 82 with known functions 79, 80–81 loci 12 long 82 long intergenic 88 molecular functions 80–81, 83–89 research history 82 RNA POL III-derived 89 third (based on positional relationship of lncRNAs to protein-coding genes) 82–83 types 79 nematodes 82 Nested Association Mapping Population 259

neural networks deep see DNNs domain-adversarial 239 NGS (next-generation sequencing) 67, 82, 116, 147, 253 experiments 143 platforms 2 classification categories 2–3 technologies 1–3, 6, 11, 15, 89 NHEJ (nonhomologous end-joining) pathway 205, 207 Nicotiana benthamiana 175, 210 NIST Standard Reference Material 60 nitrogen 152 nitrogen-fixing symbiosis 3–4 NlaIII 173 NLP (Natural Language Processing) 145–146, 148, 219, 220 technology 220 NLRome 175 NLRs (nucleotide-binding leucine-rich repeats) 173, 175 activation mechanisms 175 and AvR interaction 175–176 genes 175 helper 175 sensor 175 singleton, pair or network 176 structure 175 TNLs and CNLs 175 NMR (nuclear magnetic resonance)-based metabolomics 56 nmrML 57 non-invasive prediction 224 of images, DL for 226–227 non-linear effects, capture 195 non-model species novel genes identification from 21–22 transcriptome profiling 20–23 non-parametric methods 195 non-polar metabolites 55 novel gene-coding loci 12 novel genes, identification from non-model species 21–22 nuclear bodies 109 nuclear proteins 43 nuclear proteome, barley 109 nuclear proteomics 109 nucleoplasm 109 nucleosome 97 nucleotide-binding proteins 228, 228 nucleus 38, 39, 108–110

object detection 234, 235, 239–240 applications in phenomics 236 challenges 238–239 deep learning 235–236 algorithms 234 models, strong or weak labels switching 238

in plant phenomics 237 real-world applications 235–236 oil body 113–114 membrane proteins 114 oil palm (Elaeis guineensis) 103 oil seeds 113 oilseed rape see Brassica napus oleosin 114 oligogenic traits 191 olives, automated taxonomic classification 228 omics analysis Arabidopsis seed resources for 247–248 of insertion mutants 248 approach 231 to plant–pathogen interactions 172 data, of plant–pathogen interactions 176 databases 253 Arabidopsis 254–256 contribution 263 for crop plants 256–260 portals 261 Omni-C 3 1001 Epigenomes Project 255 1001 Genomes Project 185, 247–248, 254, 254 one-hot arrays 228, 228 1H-NMR-based metabolite profiling 53, 54, 56 1H-NMR-based metabolomics techniques 52, 53 online sequencing data 147 ontology terms 147 defined 145 OpenAI, Generative Pre-trained Transformer 3 (GPT-3) model 220 OpenAIRE 176 optical mapping 3 organ-level phenotyping 70 organelle-specific omics analysis 112 organelles 31, 32, 108 dynamics, databases for images/movies 115–116 fractionation experiments 110 function 38 changes and environmental conditions 114 molecular mechanisms 108 nucleus 38 subcellular 38 organs detection, and phenomics 236 orthologous gene expression, in B. napus 164–165, 166 orthologs, Cas9 210 Oryza barthii 257 glaberrima 257 nivara 20 officinalis 19–20 rufipogon 20, 257, 258 sativa see rice

osmotic stress 40 Oxford Nanopore Technologies sequencers 3, 20

PAC (precursor-accumulating) vesicles 110–111 proteomic analysis 111 PacBio 20 Sequel II 3 paired mapping 19 PAM compatibility 209 PAMP (pathogen-associated molecular patterns) 173 pan-genome projects 5 barley 5 maize 259 pan-genomes 4, 6 Pan-NLRome 175 panoptic segmentation 235 parallel phenotyping, for multiple traits 72 paralogs (copy genes) 143 parametric methods 195 pathogen effectors 173 pathogens 175 immunity 173 microbial 172 plant 41 defence 172–173 PBMs (protein-binding microarrays) 125, 127 PCBase 130 PCC (Pearson correlation coefficient) 140–141 PCR duplication 18 PDID (plant disease image dataset), multi-modal 236–237 pea 4 peanut 4 Pearson correlation coefficient (PCC) 140–141 PECO (Plant Experimental Conditions Ontology) 147 pepper (Capsicum annuum) 257, 260 PepperHub 260 Peptide Atlas 43 performance, plant 153 pericentromeric regions 100, 101, 101 permutation test 184 peroxisomal ABC transporter 113 peroxisomal metabolic pathways 112–113 peroxisome 112–113 biochemical analyses 113 isolation 113 matrix proteins 113 membrane carriers 113 metabolism processes 113 persimmon, internal calyx-cracking and seeds 227 PH01-2 86 pha-siRNAs 84 PhenoMeter (PM) 113 phenomics 67 object detection 237 and image segmentation use 236–239 and plant organs detection 236 specific topic reviews 68, 68

phenomobiles 72 phenotype 67 data 182–183 gene identification and loci 181 mutants 156 GA-related 167 prediction models 190 phenotyping costs 187 infrastructure 73 of Jerusalem artichoke (Helianthus tuberosus) 236 pipelines, object detection application 236 platforms 67, 68–69 field 69, 71–73 indoor 69–71, 73 technologies 68–69 conventional manual 67, 68, 68 high-throughput 68–69, 68 phosphate homeostasis 88 phosphoproteins 41 phosphorylated stress 40 phosphorylation 39, 40, 43 photorespiration 112 photosynthesis 70 phylogenetic relationships 2 Physcomitrium patens 260–261 phytopathogenic fungi 41–42 Phytophthora capsici 32 Phytozome 261–263 13 260 pigeon pea 4 Plant Cistrome Database 129, 130 Plant and Microbial Metabolomics Resource (PMR) 57 plant-pathogenic bacteria 42 PlantcircBase 3.0 88 PlantCircNet 88 PlantCLEF 218 2016 dataset, Inception ResNet v2 CNN model 225–226 PlantPAN3.0 129–130, 130 PlantRegMap 130 PlantTFDB 129 PlantVillage 218, 226 plant–fungus interaction and pathogenicity 41–42 plant–microbe interaction 174–175 guardee model 174 plant–pathogen interaction 172, 174 transcriptomes 173–174 plasma membrane 38 flooding-induced proteins 38–39 plasticity, plant 11 plastid 114 differentiation 114 genes, transcriptional and translational regulation 114 genome, proteins encoded in 114 metabolic pathways 114 PLncDB (Plant long noncoding RNA database) 255

PM (PhenoMeter) 113 PmiREN 83 PO (Plant Ontology) 147 Poaceae 4–5 PODC (Plant Omics Data Center) 146, 147 Pol IV transcripts 84 Pol V transcripts 84 pollen siRNAs 85 polyadenylation 82 polygenic traits 191–192 polyploid 165 analysis 182 crops, genetic diversity 164 wheat species, transcriptome analysis 258 polyploidy stress, and gene expression 164–165 polysaccharide chains 111 poplar 38 post-translational modifications 39–41 large-scale identification and characterization 40 protein 39–41 and stress response 43 PPI (protein–protein interactions) 174, 256 PPIN-1 (plant pathogen protein-protein immune networks) 256 PPRs (pentatricopeptide repeat proteins) 115 pre-mRNA 87 pre-processing 18, 19 prediction DL 224 non-invasive 224 pri-miRNAs 83 primary metabolites 50, 54 prime editing 211, 212 progeny genotype prediction 198 promoter deletion analysis 124 proplastids 114 protein accumulation patterns 256 acetylation 39 activity modulation 31 post-translational modifications 31 complex extracts, profiling 30 degradation system, chloroplast-associated 38 DELLA 154, 155 dephosphorylated 40–41 dynamics, qualitative and quantitative information 30 ER 110 extraction 31 contaminants removal from sample 31 method 31 glycosylation 40 histone, in plants 8 insect 42 knowledge and dynamics 30 localization, to vacuole 112 matrix, peroxisome 113

membrane, oil body 114 modification 111 in plastid genome 114 post-translational modifications 39–41 prediction 23 quantification, gel-based method 31 secretory 111 soybean, under flooding stress 38 storage, ER for 110 synthesized 110 vacuolar Golgi apparatus–vacuole transport 112 soluble and transmembrane 112 protein folding prediction 229 structures 224 protein-coding genes 85 and lncRNAs 82–83 RNAs 86 protein–DNA interaction 124 protein–protein interactions (PPI) 174, 256 proteome analysis 114 databases, Arabidopsis 256 proteomic analysis gel-based 31, 32 gel-free 31, 32 of PAC vesicles 111 proteomic approaches 43 to understand nuclear organization and function 109 proteomic technology 33–37 proteomics 174–175 definition 30 research goals 174 ProteomicsDB 256 Pseudomonas syringae 42, 176 plant–pathogen network 174 PSTVd (potato spindle tuber viroid) 89 PTI (PAMP-triggered immunity) 173 PTM (post-translational modifications) 256 (PTR)-MS (proton transfer reaction) 55 published reports 145, 147 ‘putatively’ identified compounds 56 PWM (position weight matrix) 127–128 pyruvate decarboxylase 42 QQ (quantile-quantile) plot 184–185, 185 QTL (quantitative trait loci) 25, 103 linkage mapping 181, 182 quality assurance, genetic resources 246 quantification, by DNNs 227 quantitative traits 25 R markdown files 57 R packages 196–197, 197

Ralstonia pseudosolanacearum 175 random forest 195 RAP-DB (Rice Annotation Project Database) 256 RdDM (RNA-mediated DNA methylation) 84–85, 102 pathway 101 real data 226 reference genomes 12, 19, 253 barley 5, 6 bread wheat 5 maize 259 mapping with short sequence reads 14, 19 multiple 5 rice 256 sequences, selection 19 soybean 259 tomato 259 wheat 6, 258 reference transcriptome sequences 24 reference-based mapping 12 reference-guided assembly 21 reference-guided transcriptome analysis 13 reference-quality genome assemblies, cereal crop 5 regression 224 DL for 227 analysis, with CNNs 230 models 227 regularized methods 194–195 task 218–219 reinforcement learning 217, 219–220 algorithm 217, 219–220 applications 219–220 remote sensing systems 71 repeat-rich regions 3 repositories metabolome data 57 metabolomics 51–52 research, omics, Transformer-based 221 ResNet 230 ResNet50 224 resources, Arabidopsis 247–248, 248 restriction enzyme-mediated chromatin immunoprecipitation 109 retention signal, ER 110 reverse genetics approach 248 RGAP (Rice Genome Annotation Project) 256 RGB imaging, UAV 72 RGB-based images 70, 227 rhizobia 4 rhizotrons 70 ribosome 110 profiling 115 rice bean 4 rice genome 22 collection 5 rice (Oryza sativa) 19–20, 85–86, 88, 248–249 circRNAs 88 endogenous genes, targeted knockdown 210 genome editing 210

genome sequence, subsp. japonica cv. Nipponbare 4–5, 20, 21, 249, 256 genome-wide analysis 211 germplasms 257 GWAS case study 186 lncRNAs 88 lysine acetylation 39 NARO 186 omics databases 256–258 pan-genome project 257–258 reference genome 256 resources 248–249 semi-dwarf varieties 152 subsp. aus 20 subsp. indica 20, 256–257 subsp. japonica 256–257 cv. Nipponbare genome sequence 4–5, 20, 21, 249, 256 3K Genomes Project 186 3000 Rice Genome Project 25 transcriptome database 258 Rice PanGenomeDatabase 257, 258 rice stripe virus 42 RiceFREND 258 RiceHap3 (Rice Haplotype Map Project) database 257, 257 RiceXPro 257, 258 rice–Magnaporthe oryzae interactions 174, 176 ricinosomes/KVs 111 ridge regression (RR) 194 RIGW (Rice Information GateWay) 257 RIKEN BioResource Research Center (BRC) 247, 248 Plant Metabolome Metadatabase (PMM) 59, 60, 248 RKHS (reproducing kernel Hilbert spaces) regression 195 RLKs (receptor-like kinases) 173 RLPs (receptor-like proteins) 173 RNA 173 aberrant 86 closed-loop 88 decoys 88 degradation 18 editing 115 PPR-dependent 115 fragments original genomic loci 15–16 transcriptome reconstruction 16 interfering events induced by cis-NATs 85 without siRNAs 85 POL III 89 protein-coding 86 slicing factors 87 small 85 sponge 88 transcription 13 background noise 13, 15

bulk sampling approach 13–15 sampling strategy and background conditions 13 transcripts, categories 255 viruses, engineered RNA-guided immunity against 210 RNA integrity number (RIN) 19 RNA-seq 173, 174 analysis 147 data 138, 147 database, Arabidopsis 255 technology 11, 16 disadvantage 16 RNA-seq-based transcriptome analysis 12 standard workflow 13, 14 RNA-seq-based transcriptome profiling 12, 13–15, 14 RNA–DNA hybrids 87–88, 88–89 RNL (RPW8-NB-ARC-LRR) 175 RNNs (Recurrent Neural Networks) 218 Roche 454 16, 17 root phenotyping 70, 71, 73 symbiotic nodules, organogenesis 87 ROS (reactive oxygen species) 41 ROS1 102 RPAN (Rice Pan-Genome Browser) 257 rRNAs 13 rRPMM (R package) 60 RT (retention time) 51 RWGCNA 146

S-nitrosylation 40, 41 SAGE (serial analysis of gene expression) 173, 174 SALK lines 248 satellites 71–72 imagery 72 SDS-polyacrylamide gel electrophoresis 31 secondary/specialized metabolites 50, 52, 54–55 secretory proteins 111 seed Arabidopsis, for omics analysis 247–248 features, environmental stress indicators 237 germination 155 segmentation, automatic 237 segmentation images see images models, end-to-end learnable intelligent 236 semantic 235 SEIPIN 114 selection with antibiotics 248 of genetic traits 165 genomic see GS of images, for CNN models 218 marker-assisted (MAS) 190, 191 SELEX (systematic evolution of ligands by exponential enrichment) 125, 127

semantic segmentation 235 semi-dwarf cultivars 152 semi-dwarf mutants 155–156 sensor 70, 71 NLRs 175 technology 67 in UAVs 72 separation technique 55 separation-based analytical platforms 52 separation-free techniques 51 sequence reads 11–12 bias screening 18 error rates 18 k-mers 22 raw output 17–18 mapping 23–24 sequences amino acids 98, 224, 229 DNA, DL for 224, 228–229 genetic 224 high-throughput 124 similarities, mRNAs 143–144 sequencing data, online 147 DNA 1, 253 libraries 14, 15–16, 18–19, 18, 22–24 artefacts 17–18 efficiency 19 platforms 16–17, 16, 18 long-read 3 ultra-long-read 3 whole-genome sequencing (WGS) 4, 187 see also long-read sequencing; NGS (next-generation sequencing); short-read sequencing SGN (SOL Genomics Network) 257, 260 short sequence reads 20 mapping against reference genome 14, 19 short-read sequencing 11 platforms 3, 16–17, 16 signal molecules (hormones) 151 peptide, ER 110 transduction pathways 39 SIGnAL (Salk Institute Genomic Analysis Laboratory) 248, 254, 256 signaling GA 155 genes 152 coding sequences for 156, 157 GA 163–164 and GA metabolism 165 mechanism, under abiotic stress 42–43 single reference genomes 5 siRNAs (small-interfering RNAs) 82, 83–84, 101–102, 255 pollen 85 production 86 sly1/sly2 mutants 155

SNP (single nucleotide polymorphism) 182, 187 genotyping arrays 193 snRNAs (small nuclear RNAs) 255 software, GWAS 184, 184 Sogatella furcifera (white-backed planthopper) 42 Solanaceae 250, 259–260 see also tomato Sorghum bicolor (sorghum) 229 southern rice black-streaked dwarf virus 42 SoyBase 257, 259 soybean (Glycine max) 4, 70, 82, 259 germline sets 4 proteins, affected by flooding stress 38 reference genomes 259 seed 38–39 protein extraction 31 under flooding stress, post-translational modifications-mediated response 40, 41 spatial metabolomic techniques 50 spatial metabolomics 56 SpCas9 209 protein, engineered 209–210 variants 209–210 SpCas9-HF1 210 specialized/secondary metabolites 50, 52, 54–55 with neutral and polar physicochemical properties 55 spectral sensing systems 72 spectroradiometers 72 spermatogenesis 99 statistical significance, analyzing 183–184 stem angle, plant segmentation masks algorithm 238 diameter 70 elongation 153 greenhouse phenotyping 70 stomatal density (index) 69 stomatal phenotyping, direct 69 Streptococcus pyogenes 206 stress 22, 24 abiotic 33–36, 38, 86 plant acclimation and tolerance 40 signaling mechanism under 42–43 abiotic and biotic 30, 38, 41 biotic 32, 36–37, 38, 41 conditions 31, 38 diagnosis, based on images 224, 226 drought 40 soybean response 42–43 environmental 30–31, 39 and seed features 237 flooding 38 soybean 38–39, 42–43 heat 85 hormones 152 lead 32 osmotic 40

phosphorylated 40 polyploidy, and gene expression 164–165 response 32–38, 42 factors 40 stress-induced responses 39 stressors, environmental, of field-grown plants 71 STRING database 256 sub-cellular proteomics 31, 32–39, 43 analysis 38 subcellular structures, microscopic view 108, 109 sugar beet, crop/weed detection datasets 239 sugar-acid ratios 227 SUGAR-DEPENDENT 1 113–114 sugars 54 SuperSAGE 173, 173–174 SuperTranscripts construction 22–23 supervised learning 217–219 algorithm 217–218 classification task 218 regression task 218–219 synthesis, with cyclic reversible termination 17 synthesized proteins, modifications 110

ta-siRNAs 83–84 TADs (topologically associating domains) 132 TAG (triacylglycerol) lipase 113–114 TAIR (Arabidopsis Information Resource) 247, 254, 254 TALEN (transcription activator-like effector nuclease) 205–206, 206 target genes 83, 124, 207 target proteins, guardees 173–174 targets, AutoML 221 TASSEL 184, 184 TassleGAN 239 taxonomic classification 224 automated, for images 225–226 10+ wheat genome project 5 TEs 102 Text-to-Text Transfer Transformer (T5) model (Google) 220 TF-centered approaches 125–126 TF-related databases 129 TFFM (TF Flexible Model) 128 TFs (transcription factors) 124, 143, 148 central hub, identification of 130–132 and cis-elements, large-scale interaction techniques 125–127, 127 DNA binding patterns, resources 129–130, 130 DNA-binding specificity 129 gene regulation by 126 TF–DNA interactions methods to infer 125–129 dry-lab approaches 127–129 wet-lab approaches 125–127 thermal imaging 69 thermal sensing systems 72


thermal sensors, on UAVs 72 thermometer, infrared 72 3000 Rice Genome Project 257 TIC (total ion current) chromatogram 51 TiD (CRISPR-Cas type I-D) system 211, 211, 212 TMM (trimmed mean of M values) 24, 25 tobacco, genome editing 210 tobacco mosaic virus 42 (TOF)-MS (time of flight) 56 tomato (Solanum lycopersicum) 38, 70, 250, 259–260 cells, induction of small indels and large deletions 212 domestication 207 genome-wide analysis 211 genomes 259 off-target effects 211–212 GS 198 Micro-Tom 250 mutagenesis 211 reference genomes 259 resources 250 SOL genomics workshop 259–260 stresses and diseases 226 tonoplast-localized transporters, vacuoles 112 TPM (transcripts per million) 24 training data 220 for GS 191, 192 preparation 193 population 191, 192, 193 updated 198 what and how to phenotype 193 traits 25, 67 genetic 165 improved 205 internal 226 monogenic 191 multiple, parallel phenotyping 72 oligogenic 191 polygenic 191–192 target, field phenotyping 71, 71 trans 25 transcriptional control 211 by CRISPR-dCas9 208, 208 transcriptome 199 Arabidopsis thaliana 18–19, 18, 82 assembly 20–23 see also de novo assembly data of ZS11 156 databases Arabidopsis 255–256, 258 rice 258 dynamic nature 11 epigenetic regulation contribution 12 isoforms 12 of plant–pathogen interactions 173–174 profiles 10 profiling

non-model species 20–23 RNA-seq-based 12, 13–15, 14 robust biological replicates 15 sampling time-point 15 technology and approaches 11 projects, global-scope 12 regulatory elements 12 roadmap or atlas 12 spatio-temporal fluxes 11 studies, resolution and biological interpretability 12–13 transcriptome analysis 10–11, 231 genome-wide, ncRNA identification 82 mRNA 255 polyploid wheat species 258 reference-guided 13 RNA-seq-based 12 transcriptome-wide analysis, of small RNAs 255–256 transcriptomic responses adaptive 11 multiple layers of regulation 11 transcriptomics 173 transcripts 82 transfer learning 239 Transformer (Google) 220, 221 accuracy 220 Attention Mechanism 220 Text-to-Text Transfer (T5) 220 transgenic mutant lines 248 tree detection 236 trichloroacetic acid/acetone precipitation 31 tRNAs 13 truncated genes 159, 160, 162, 165

UAVs 68, 72 ubiquitin-mediated proteolysis 41 ubiquitination 40, 41, 43 ultra-long-read sequencing 3 UniProt 129, 145 unknown gene 144 unsupervised learning 217, 219 dimensionality reduction task 219 generation task 219 unverified gene regulation 130

vacuoles 38, 110, 111–112 isolated 112 roles 111–112 tonoplast-localized transporters 112 VAE (Variational Auto-Encoder) 219 variation, natural, cereal crops 5 vesicle-mediated transport, vacuoles 112 VGG series 226 VGG16 227, 228 Vigna, pan-genome 4 viroids 89


viruses plant 42 receptors 42 RNA 210 Vision Transformer 220 applications 220 vitellogenin 42 VOCs (volatile organic compounds) 54–55 aliphatic 55 methyl-branched 55

Web-BLAST 144, 145 weeds management 236 wet experiments 230 wet-lab approaches 125–127, 127 WGBS (whole-genome bisulfite sequencing) 103 WGS (whole-genome sequencing) 4, 187 Wheat Disease DataBase 226 Wheat Expression Browser 257, 258 Wheat Proteome 257, 258–259 wheat (Triticum aestivum) 258–259 bread 258 hexaploid 258 cell wall proteomics 38 ‘Chinese Spring’ 249–250 durum 258 genome 258 projects 258 lncRNAs 87 lysine acetylation 39 ‘Norin 61’ 250 polyploid, genome sequencing 258 reference genomes 6, 258


reference-quality genome assemblies 5 resources 249–250 semi-dwarf varieties 152 10+ wheat genome project 5 WheatExp 257, 258 white lupin 4 whole-genome analysis 253 whole-genome assembly, and structure variant identification 3 word2vec 219 WorldView-3 72

X-AI (explainable AI) 224, 225, 230 applications 231 X-ray Micro CT 70 Xanthomonas campestris 175 xCas9 210 XCMS Online 57

Y1H (yeast one-hybrid) assay 126–127, 127 Y2H (yeast two-hybrid) assay 173, 174 YOLO 235

ZAR1 protein 175 single NLR 175 Zea mays see maize ZFN (zinc-finger nuclease) 206, 206 Zigzag model 173 ZS11 GA signaling genes in 161, 163 transcriptome 161, 161

CABI – who we are and what we do This book is published by CABI, an international not-for-profit organisation that improves people’s lives worldwide by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI is also a global publisher producing key scientific publications, including world renowned databases, as well as compendia, books, ebooks and full text electronic resources. We publish content in a wide range of subject areas including: agriculture and crop science / animal and veterinary sciences / ecology and conservation / environmental science / horticulture and plant sciences / human health, food science and nutrition / international development / leisure and tourism. The profits from CABI’s publishing activities enable us to work with farming communities around the world, supporting them as they battle with poor soil, invasive species and pests and diseases, to improve their livelihoods and help provide food for an ever growing population. CABI is an international intergovernmental organisation, and we gratefully acknowledge the core financial support from our member countries (and lead agencies) including: Ministry of Agriculture People’s Republic of China

Discover more To read more about CABI’s work, please visit: www.cabi.org Browse our books at: www.cabi.org/bookshop, or explore our online products at: www.cabi.org/publishing-products Interested in writing for CABI? Find our author guidelines here: www.cabi.org/publishing-products/information-for-authors/


CABI BIOTECHNOLOGY SERIES

Plant Omics: Advances in Big Data Biology

EDITED BY HAJIME OHYANAGI, EIJI YAMAMOTO, AI KITAZUMI AND KENTARO YANO

This book provides a comprehensive overview of plant omics and big data in the fields of plant and crop biology. It discusses each omics layer individually, including genomics, transcriptomics and proteomics, and covers model and nonmodel species. In a section on advanced topics, it considers developments in each specialized domain, including genome editing and enhanced breeding strategies (such as genomic selection and high-throughput phenotyping), with the aim of providing tools to help tackle global food security issues. The importance of online resources in big data biology is highlighted in a section summarizing both wet- and dry-biological portals. This section introduces biological resources, datasets, online bioinformatics tools and approaches that are in the public domain.

This title:
• Reviews each omics layer individually.
• Focuses on new advanced research domains and technology.
• Summarizes publicly available experimental and informatics resources.

This book is for students, engineers, researchers and academics in plant biology, genetics, biotechnology and bioinformatics.
