English Pages 384 [369] Year 2022
Methods in Molecular Biology 2481
Davoud Torkamaneh François Belzile Editors
Genome-Wide Association Studies
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Genome-Wide Association Studies Edited by
Davoud Torkamaneh and François Belzile Département de Phytologie, Université Laval, Quebec City, QC, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
Editors
Davoud Torkamaneh, Département de Phytologie, Université Laval, Quebec City, QC, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
François Belzile, Département de Phytologie, Université Laval, Quebec City, QC, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-2236-0 ISBN 978-1-0716-2237-7 (eBook) https://doi.org/10.1007/978-1-0716-2237-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface Genome-wide association studies (GWAS) have revolutionized the investigation of complex traits over the past decade and have unveiled numerous useful genotype–phenotype associations in plants. In this book, we describe the key concepts and methods underlying GWAS, including the genetic architecture underlying variation for phenotypic traits, the structure of genetic variation in plants, technologies for capturing genetic information, study designs, and the statistical models and bioinformatics tools used for data analysis. This book provides an extremely valuable resource for the plant research community by rendering GWAS analysis less challenging and more accessible to a broader group of users. We have divided the contents of the book into six main parts. In a first part (Part I; Chapters 1–5), we provide general overviews of the entire GWAS process as well as of the main components of GWAS (phenotypic data, genotypic data, association models, and interpretation of outputs). These first chapters are written in the form of review articles, providing an overall view of the current status of the science in these areas. In the rest of the volume (Chapters 6–20), each individual chapter focuses on a specific topic relevant to performing a GWAS or presents an example or case study for a specific context. In these chapters, the authors provide a hands-on, step-by-step guide on how to perform various analyses that contribute to a GWAS. In Part II, comprised of Chapters 6–8, we focus on the analysis of phenotypic data. Chapter 6 covers the analysis of multi-environment phenotypic data (typically from field trials) that must be integrated to provide a single phenotypic value for each trait and each accession in the association panel. In Chapter 7, the authors describe the analysis of data obtained from high-throughput phenotyping platforms. 
While some of these traits may be in common with those described in the previous chapter, most often the traits are measured in a controlled environment and compound traits (e.g., yield) are broken down into component traits (number of tillers, number of seed per head, size of seed). Finally, in Chapter 8, the special case of "omics"-derived traits is presented. In this case, the phenotypic datasets are typically very large. In a transcriptomic analysis, for example, one will seek to capture the expression of all genes of a crop in each accession under a particular set of conditions. Similarly, for traits derived from proteomic or metabolomic characterization of accessions, a vast number of traits need to be handled. Chapters 9–12 comprise Part III of the book and cover various topics related to the production and handling of genotypic data. The first of these chapters (Chapter 9) covers the most common case where genotypic data in the form of SNPs must be produced and curated to ensure an adequate coverage of the genome. In Chapter 10, we present the special case of structural variants (SVs), another type of variant that has started to be assessed when performing GWAS in plants but that presents unique challenges. The two remaining chapters in this part deal with specific use cases for the genotypic data. In Chapter 11, the authors explain the rationale and methods used in integrating different datasets. This obviously covers the subject of data imputation as different genotypic datasets will oftentimes not be perfectly overlapping and also the case of meta-analysis, where entirely different
association panels are used to characterize the traits of interest. The last chapter in this part (Chapter 12) relates to the specific task of using genotypic data to characterize the underlying population structure and relatedness among accessions that make up the association panel. Such a characterization is key to minimizing confounding effects that can greatly increase the number of false-positive associations in a GWAS. Once phenotypic and genotypic data have been obtained and properly curated in view of their joint analysis, the next step in GWAS is to perform the association analysis per se, and this is covered in Part IV of this book. Although there are a multitude of association models and software tools available to identify phenotype–genotype associations, we have chosen to present two common cases. In Chapter 13, the authors describe the use of GAPIT, one of the most widely used software platforms on which GWAS is conducted in plants and the one which is probably the most user-friendly thanks to its graphic interface with pull-down menus. Chapter 14 presents a guide to the use of rMVP, another widely used software tool that offers to run a number of association models, but this time in R. The latter tool offers to run some multilocus models which have become more and more prevalent in recent years. In Part V, we focus on the analyses that can be conducted after a GWAS has been completed to either further delve into the results or to provide independent validation. As described in Chapter 15, the identification of candidate genes is one of the most common and important post-GWAS analyses to be conducted. At the same time, this is an area in which, in our view, there is often a lack of rigor and consistency in the approaches used, and we hope this chapter will provide useful guidance. In Chapter 16, the authors describe how the results of a GWAS, namely specific marker–trait associations, can be validated through the use of biparental populations. 
Another frequent type of follow-up to a GWAS is to use the most highly associated SNPs to develop genetic markers that can be easily assayed in breeding populations. Chapter 17 provides guidance on how to develop KASP markers from GWAS-derived SNPs. In the sixth and final part of the book, we have sought to provide use cases of GWAS that capture different types of traits in various crop species with different genome sizes and ploidy. In Chapter 18, the authors provide guidance in performing GWAS to identify components of disease resistance, one of the most commonly assayed traits in GWAS in crop plants, in soybean, a diploid crop with a medium-sized genome. In Chapter 19, the authors provide numerous examples of traits that have been subjected to GWAS (some fairly simple, others much more complex) in wheat, a crop that has both a very large genome and one that is composed of three sub-genomes. Finally, in Chapter 20, the study of the microbiome as a trait is presented and discussed. We hope that the sum of these contributions will allow novice users to rapidly gain a good command of what are currently the best practices for conducting a GWAS. This is something that is always challenging for new users to gain from reading the literature, as studies published even as little as 5 years ago may no longer present approaches or tools that are considered current. We also hope that even more experienced researchers, already familiar with at least some uses of GWAS, will find this book a source of useful information in contrasting with their own practices and facilitate their venturing into more novel areas, be it in the choice of analysis (e.g., meta-analysis), to the choice of traits (e.g., the microbiome) or the choice of analytical models (multi-locus vs single locus).
In closing, we want to extend our thanks to the authors who so generously accepted our invitation to share their knowledge and expertise by contributing a chapter to this book. The particularities of such a type of published work allow researchers to much more readily share the sometimes hard lessons learned through experience and to therefore help new users to steer clear of some of the pitfalls inherent in venturing into a new area or adopting a new approach. Quebec City, QC, Canada
Davoud Torkamaneh
François Belzile
Contents
Preface
Contributors
PART I REVIEWS AND OVERVIEWS
1 Designing a Genome-Wide Association Study: Main Steps and Critical Decisions
   François Belzile and Davoud Torkamaneh
2 Preparation and Curation of Phenotypic Datasets
   Santiago Alvarez Prado, Fernando Hernández, Ana Laura Achilli, and Agustina Amelong
3 Genotyping Platforms for Genome-Wide Association Studies: Options and Practical Considerations
   David L. Hyten
4 Genome-Wide Association Study Statistical Models: A Review
   Mohsen Yoosefzadeh-Najafabadi, Milad Eskandari, François Belzile, and Davoud Torkamaneh
5 Interpretation of Manhattan Plots and Other Outputs of Genome-Wide Association Studies
   Jiabo Wang, Jianming Yu, Alexander E. Lipka, and Zhiwu Zhang
PART II PHENOTYPIC DATA FOR GWAS
6 Preparation and Curation of Multiyear, Multilocation, Multitrait Datasets
   Amina Abed and Zakaria Kehel
7 Development, Preparation, and Curation of High-Throughput Phenotypic Data for Genome-Wide Association Studies: A Sample Pipeline in R
   Pasquale Tripodi
8 Preparation and Curation of Omics Data for Genome-Wide Association Studies
   Feng Zhu, Alisdair R. Fernie, and Federico Scossa
PART III GENOTYPIC DATA AND ASSESSMENT OF POPULATION STRUCTURE
9 Producing High-Quality Single Nucleotide Polymorphism Data for Genome-Wide Association Studies
   Philipp E. Bayer, Mitchell Gill, Monica F. Danilevicz, and David Edwards
10 A Practical Guide to Using Structural Variants for Genome-Wide Association Studies
   Marc-André Lemay and Sidiki Malle
11 Data Integration, Imputation, and Meta-analysis for Genome-Wide Association Studies
   Reem Joukhadar and Hans D. Daetwyler
12 Population Structure and Relatedness for Genome-Wide Association Studies
   Sidiki Malle
PART IV BIOINFORMATICS TOOLS FOR GWAS
13 Performing Genome-Wide Association Studies with Multiple Models Using GAPIT
   Jiabo Wang, You Tang, and Zhiwu Zhang
14 Performing Genome-Wide Association Studies Using rMVP
   Xiaolei Liu, Lilin Yin, Haohao Zhang, Xinyun Li, and Shuhong Zhao
PART V IDENTIFICATION OF CANDIDATE GENES AND VALIDATION
15 Identification and Validation of Candidate Genes from Genome-Wide Association Studies
   Elise Albert and Christopher Sauvage
16 Biparental Crossing and QTL Mapping for Validation of Genome-Wide Association Studies
   Pawan L. Kulwal and Ravinder Singh
17 Development of Breeder-Friendly KASP Markers from Genome-Wide Association Studies Results
   Manar Makhoul and Christian Obermeier
PART VI CASE STUDIES
18 Mapping Major Disease Resistance Genes in Soybean by Genome-Wide Association Studies
   Everton Geraldo Capote Ferreira and Francismar Corrêa Marcelino-Guimarães
19 GWAS Case Studies in Wheat
   Deepmala Sehgal and Susanne Dreisigacker
20 Plant Microbiome-Based Genome-Wide Association Studies
   Siwen Deng, Michael A. Meier, Daniel Caddell, Jinliang Yang, and Devin Coleman-Derr
Index
Contributors
AMINA ABED • Consortium de recherche sur la pomme de terre du Québec (CRPTQ), Québec, QC, Canada
ANA LAURA ACHILLI • Centro de Recursos Naturales Renovables de la Zona Semiárida (CERZOS–CONICET), Bahía Blanca, Argentina; Departamento de Agronomía, Universidad Nacional del Sur (UNS), Bahía Blanca, Argentina
ELISE ALBERT • Syngenta SAS France, Saint Sauveur, France
SANTIAGO ALVAREZ PRADO • IFEVA-CONICET, Ciudad de Buenos Aires, Argentina; Departamento de Producción Vegetal, Facultad de Agronomía, Universidad de Buenos Aires, Ciudad de Buenos Aires, Argentina
AGUSTINA AMELONG • Cátedra de Sistemas de Cultivos Extensivos-GIMUCE, Facultad de Ciencias Agrarias, Universidad Nacional de Rosario, Zavalla, Argentina
PHILIPP E. BAYER • Applied Bioinformatics Group, School of Biological Sciences, The University of Western Australia, Perth, WA, Australia
FRANÇOIS BELZILE • Département de Phytologie, Université Laval, Quebec City, QC, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
DANIEL CADDELL • Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA; Plant Gene Expression Center, USDA-ARS, Albany, CA, USA
DEVIN COLEMAN-DERR • Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA; Plant Gene Expression Center, USDA-ARS, Albany, CA, USA
HANS D. DAETWYLER • Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
MONICA F. DANILEVICZ • Applied Bioinformatics Group, School of Biological Sciences, The University of Western Australia, Perth, WA, Australia
SIWEN DENG • Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA; Plant Gene Expression Center, USDA-ARS, Albany, CA, USA
SUSANNE DREISIGACKER • International Maize and Wheat Improvement Center (CIMMYT), Carretera Mex-Veracruz, CP, Mexico
DAVID EDWARDS • Applied Bioinformatics Group, School of Biological Sciences, The University of Western Australia, Perth, WA, Australia
MILAD ESKANDARI • Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
ALISDAIR R. FERNIE • Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm, Germany
EVERTON GERALDO CAPOTE FERREIRA • Londrina State University (UEL), Londrina, PR, Brazil
MITCHELL GILL • Applied Bioinformatics Group, School of Biological Sciences, The University of Western Australia, Perth, WA, Australia
FERNANDO HERNÁNDEZ • Centro de Recursos Naturales Renovables de la Zona Semiárida (CERZOS–CONICET), Bahía Blanca, Argentina; Departamento de Agronomía, Universidad Nacional del Sur (UNS), Bahía Blanca, Argentina
DAVID L. HYTEN • Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, USA
REEM JOUKHADAR • Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, Australia
ZAKARIA KEHEL • International Center for Agricultural Research in the Dry Areas (ICARDA), Rabat, Morocco
PAWAN L. KULWAL • State Level Biotechnology Centre, Mahatma Phule Krishi Vidyapeeth, Rahuri, Ahmednagar, Maharashtra, India
MARC-ANDRÉ LEMAY • Département de phytologie and Institut de biologie intégrative et des systèmes, Université Laval, Quebec City, QC, Canada
XINYUN LI • Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education & College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
ALEXANDER E. LIPKA • Department of Crop Sciences, University of Illinois, Urbana, IL, USA
XIAOLEI LIU • Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education & College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
MANAR MAKHOUL • Department of Plant Breeding, Justus Liebig University, Giessen, Germany
SIDIKI MALLE • Assistant professor at Institut Polytechnique Rural de Formation et de Recherche Appliquée (IPR/IFRA) de Katibougou, Koulikoro, Mali
FRANCISMAR CORRÊA MARCELINO-GUIMARÃES • Brazilian Agricultural Research Corporation – National Soybean Research Center (Embrapa Soja), Londrina, PR, Brazil
MICHAEL A. MEIER • Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, USA; Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, USA
CHRISTIAN OBERMEIER • Department of Plant Breeding, Justus Liebig University, Giessen, Germany
CHRISTOPHER SAUVAGE • Syngenta SAS France, Saint Sauveur, France
FEDERICO SCOSSA • Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm, Germany; Council for Agricultural Research and Economics (CREA), Research Centre for Genomics and Bioinformatics (CREA-GB), Rome, Italy
DEEPMALA SEHGAL • International Maize and Wheat Improvement Center (CIMMYT), Carretera Mex-Veracruz, CP, Mexico
RAVINDER SINGH • School of Biotechnology, Sher-e-Kashmir University of Agricultural Sciences and Technology of Jammu, Jammu, Jammu and Kashmir, India
YOU TANG • Jilin Agricultural Science and Technology University, Jilin, Jilin, China
DAVOUD TORKAMANEH • Département de Phytologie, Université Laval, Quebec City, QC, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
PASQUALE TRIPODI • CREA Research Centre for Vegetable and Ornamental Crops, Pontecagnano Faiano, Italy
JIABO WANG • Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Sichuan Province and Ministry of Education, Southwest Minzu University, Chengdu, Sichuan, China
JINLIANG YANG • Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, USA; Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, USA
LILIN YIN • Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education & College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
MOHSEN YOOSEFZADEH-NAJAFABADI • Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
JIANMING YU • Department of Agronomy, Iowa State University, Ames, IA, USA
HAOHAO ZHANG • School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
ZHIWU ZHANG • Department of Crop and Soil Sciences, Washington State University, Pullman, WA, USA
SHUHONG ZHAO • Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction, Ministry of Education & College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China; Hubei Hongshan Laboratory, Wuhan, China
FENG ZHU • National R&D Center for Citrus Preservation, Key Laboratory of Horticultural Plant Biology, Ministry of Education, Huazhong Agricultural University, Wuhan, China; Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm, Germany
Part I Reviews and Overviews
Chapter 1
Designing a Genome-Wide Association Study: Main Steps and Critical Decisions
François Belzile and Davoud Torkamaneh
Abstract
In this introductory chapter, we seek to provide the reader with a high-level overview of what goes into designing a genome-wide association study (GWAS) in the context of crop plants. After introducing some general concepts regarding GWAS, we divide the contents of this overview into four main sections that reflect the key components of a GWAS: assembly and phenotyping of an association panel, genotyping, association analysis, and candidate gene identification. These sections largely reflect the structure of the chapters which follow later in the book, and which provide detailed discussions of these various steps. In each section, in addition to providing external references from the literature, we also often refer the reader to the appropriate chapters in this book in which they can further explore a topic. We close by summarizing some of the key questions that a prospective user of GWAS should answer prior to undertaking this type of experiment.
Key words Genome-wide association studies, GWAS, Association panel, Genotyping, Statistical models, Candidate genes
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction
Genome-wide association studies (GWAS) have become one of the most powerful tools available to geneticists to identify the genomic loci controlling traits of interest. At the time of writing, when one does a simple search using the term “genome-wide association study” in Google Scholar, over two million hits are returned, indicating that this approach for identifying and mapping the loci underlying heritable traits is widely used in a very broad range of species, from model species such as yeast and Arabidopsis to humans and crops. While fewer than 2000 hits are obtained when doing a search for GWAS in plants prior to the year 2000, in 2020 alone, close to 25,000 hits are obtained.
The wide adoption of GWAS closely tracked the advent of high-throughput genotyping technologies that came about from the development of SNP arrays as well as sequencing-based genotyping approaches [1]. This is no coincidence, as a very dense marker coverage of the genome of interest is a key requisite for performing GWAS, in stark contrast with biparental QTL mapping. In the latter case, a marker every 10–20 cM is sufficient, thus typically requiring only a few dozen to a few hundred markers for most crops, as genetic maps range from as little as a few hundred to a few thousand centimorgans. Biparental populations used in QTL mapping are only a few generations removed from the initial cross, thereby allowing for only a limited number of recombination events to occur between loci located on the same chromosome. For this reason, linkage disequilibrium (LD) between loci extends over large physical distances, often measurable in megabases (Mb). In contrast, the unrelated individuals that typically comprise an association panel trace back to a common ancestor in a more distant and less uniform fashion, some individuals being more closely related than others. A direct consequence of this is that a greater number of recombination events have occurred in the evolutionary path leading to the association panel and LD can decline much more rapidly, with LD extending only across a few thousand base pairs (bp) in some cases. Consequently, the more rapid the decline in LD, the greater the marker coverage required. Indeed, as each marker is in high LD (and can thus inform us of the presence of a causal variant in its proximity) over a very limited physical distance, a greater number of markers are needed to ensure exhaustive coverage of the genome. Thus, until genotyping platforms capable of delivering markers in the many thousands to millions became available to crop geneticists, GWAS was not truly feasible or was severely underpowered.
As a reward for this increased requirement for marker coverage, we can expect a much greater mapping resolution through GWAS.
Indeed, as a causal variant needs to be in high LD to be picked up via the association it maintains with a genotyped marker, the window within which we can hope to find this causal variant or a candidate gene is typically limited to a few kilobases to a few hundred kilobases in most cases. To achieve a similar outcome via high-resolution biparental mapping (often termed “chromosome landing”), it was often necessary to develop extremely large segregating progenies (many thousands of individuals) [2, 3]. As one can imagine, the phenotypic and genotypic characterization of such large populations was an extremely onerous endeavor. Thanks to the increased feasibility of high-density genotyping in most if not all crop species, it is no surprise that GWAS has essentially become the tool of choice for identifying and mapping the loci controlling traits of interest. This is true both for cases where the goal is to gain a complete and detailed understanding of the genetic underpinnings of a heritable trait and in breeding, where one may be content to identify a few major loci that explain a large portion of the phenotypic variability within a breeding program
and thus enable marker-assisted selection (MAS) approaches to improve the trait of interest. Another important difference between biparental QTL mapping and GWAS is that the former can only identify two alleles at any given locus, assuming that the two parental lines are fixed and are segregating for this locus. In contrast, GWAS offers the possibility of identifying a multitude of allelic variants at a locus associated with the phenotype of interest. Obviously, there are limits to the number of alleles that can be identified in this fashion since, with the increase in the diversity of alleles at a locus comes a decrease in the number of accessions sharing this allele and a concomitant decrease in the power to identify rare alleles. In this introductory chapter, we seek to provide the reader with a general overview of the key steps needed to design and perform a GWAS. While there are general principles that need to be adhered to, a wide range of possibilities exists depending on the specific goals being pursued. Throughout the following sections focusing on the four main steps of GWAS (assembly and phenotyping of an association panel, genotyping, association analysis and candidate gene identification), we will try to illustrate the wide range of valid choices open to the investigator. At each step along the way, we will point the reader to relevant chapters exploring these facets of GWAS in greater detail.
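The link drawn earlier in this introduction between the rate of LD decay and the number of markers required can be made concrete with a rough back-of-the-envelope calculation. The sketch below is only illustrative: it assumes that one evenly spaced marker is needed per stretch of genome over which LD remains high, and the function name `markers_needed`, the 1-Gb genome size, and the two LD extents are hypothetical values chosen for the example, not figures taken from this chapter.

```python
import math


def markers_needed(genome_size_bp: int, ld_extent_bp: int) -> int:
    """Rough lower bound on marker number: one marker per stretch of
    genome over which LD stays high (assumes evenly spaced markers)."""
    if genome_size_bp <= 0 or ld_extent_bp <= 0:
        raise ValueError("genome size and LD extent must be positive")
    return math.ceil(genome_size_bp / ld_extent_bp)


# Hypothetical 1-Gb genome under two contrasting LD regimes.
genome = 1_000_000_000
slow_decay = markers_needed(genome, 1_000_000)  # LD extends over ~1 Mb
fast_decay = markers_needed(genome, 5_000)      # LD extends over ~5 kb

print(f"LD ~1 Mb: at least {slow_decay:,} markers")
print(f"LD ~5 kb: at least {fast_decay:,} markers")
```

Under these assumed values, the requirement differs by a factor of 200 between the two regimes. Real marker sets are unevenly distributed and LD varies along the genome, so such a figure is best read as a floor rather than a target.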
2 Main Steps in a GWAS and Critical Decisions
2.1 Assembly and Phenotyping of an Association Panel
The foundational step in any GWAS is the selection of the accessions that will form the association panel. This has two important components: the number of accessions and their degree of relationship, or how wide (genetically diverse) the panel should be. As we will illustrate below, there can be no “one-size-fits-all” approach to these questions, and a careful definition of the goals of the study is needed. As mentioned earlier, different researchers pursue different goals in performing GWAS and this will necessarily impact the decisions made at each key step, but none more so than in defining the size and composition of the association panel.
For the purpose of illustrating contrasting cases, we will consider two rather common uses of GWAS. In the first case, which we will call the “basic science” scenario, one may be interested in gaining a very detailed and exhaustive understanding of genetic variation for a trait at the species level (or even wider). This means identifying all the loci and as complete as possible a catalogue of alleles that explain phenotypic variation for a trait across a crop species and possibly its close relatives. In the second case (termed the “applied scenario”), which we will encounter most often in breeding, the aim is more limited. Typically, one will be interested in identifying only the loci and alleles that are the main contributors to the more limited phenotypic variation present within a restricted germplasm (the one relevant for a specific breeding program). If the aim is to identify SNP markers that can be used for marker-assisted selection, only those with a large effect will be of interest, as the cost of genotyping small-effect loci in breeding populations cannot be justified by their limited contribution to the outcome of selection. In cases where no large-effect loci are identified, genomic selection approaches relying on genome-wide coverage will be better suited than MAS [4].
As is the case with biparental mapping, simulation studies have amply illustrated that there is a positive relationship between the size of an association panel and its power to identify the loci controlling a trait [5]. In practice, however, all researchers are constrained to various degrees in the resources that can be allocated to a GWAS. Compromises are inevitable. In the “basic science” scenario, there is less scope for such compromises as one is pursuing a more ambitious goal. For such work, many hundreds to a few thousand accessions are typically used to capture the genetic architecture of a trait and a good range of alleles at these loci [6–8]. Under current conditions, these limitations are more often due to the challenges associated with the phenotypic characterization of a large collection of accessions (ensuring both precision and uniformity in this assessment) than to genotyping, an aspect which has become considerably easier and cheaper with time. In contrast, in the “applied scenario,” the number of accessions that adequately capture the main loci and alleles responsible for variation within a breeding program can be much more limited.
For traits with a monogenic to oligogenic basis, panels of 100 or so accessions have successfully led to the identification of loci of interest and promising candidate genes [9], even though for breeding there is not always a need to achieve this level of resolution if a tightly associated marker can meet the requirements of MAS. Within a breeding program, typically composed of a few hundred lines that form the germplasm used to initiate crosses, it can be a good idea to perform low-density genotyping (as few as a few hundred markers can suffice in many cases) to identify genetic relationships among the germplasm and select a subset (say 100–300) that captures this diversity for the purpose of assembling an association panel. As for the composition of the panel, it is paramount to select accessions that will cover the full range of phenotypic variation. For example, when mapping loci associated with resistance to a pathogen, it is just as important to include susceptible accessions as it is to include resistant ones. Only in this fashion will it be possible to detect significant phenotypic differences between the two alleles at a SNP locus that is associated with this trait. Similarly, for traits presenting continuous variation, one should select accessions that will generate a normal distribution for the phenotype of interest.
While it is possible to focus on phenotypic extremes, this will likely lead to the identification of only a subset of relevant loci. In the “basic science” scenario, the full range of phenotypic variation may greatly exceed that seen within the improved materials of a crop species itself. The exploration of a broader germplasm, such as landraces, could lead to the identification of novel loci/alleles that would eventually be of interest both for understanding the genetic architecture of the trait and for enriching the cultivated germplasm [10]. Extending further, it is possible to perform GWAS on wild relatives of crops [11]. In the “applied scenario,” the selection of accessions to include in the panel is typically more straightforward, as more detailed knowledge of the range of variation and of the phenotypes of individual accessions is often available.

Once a panel is assembled, it then needs to be phenotyped. This is often one of the most difficult aspects, as ensuring uniform testing of large numbers of accessions is not trivial. As detailed in Chapters 2 and 6–8, the phenotypic characterization of traits of interest can span a multitude of cases, with some of the more challenging reviewed in the latter three chapters (multienvironment trials, high-throughput phenotyping and omics-related traits). Even in cases where traits can be measured in fairly simple experiments, there are a few considerations that are important to keep in mind when performing GWAS. Perhaps foremost among these is the nature of the underlying assessment. In biparental mapping and other research endeavors, one seeks to characterize the phenotype of individual accessions, be they different breeding lines or progeny of a biparental cross. The phenotype measured at the level of an individual is of interest. In GWAS, however, these individual measurements are pooled together on the basis of the genotype at each individual SNP.
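To make this pooling concrete, the following minimal sketch groups accessions by the allele they carry at a single SNP and contrasts the two group means with Welch's t statistic. It is an illustration under simplifying assumptions, not the model used in an actual GWAS: the function name, the data and the choice of Welch's t are ours, the accessions are assumed to be inbred lines (one allele each), and no correction for population structure or kinship is applied.

```python
from math import sqrt
from statistics import mean, variance


def allele_group_contrast(phenotypes, alleles):
    """Compare mean phenotypes of accessions grouped by allele at one SNP.

    phenotypes: one value per accession (hypothetical data).
    alleles: allele carried by each accession at the SNP, e.g. "A" or "G"
             (inbred lines assumed, so one allele per accession).
    Returns group sizes, group means, and Welch's t statistic.
    """
    groups = {}
    for value, allele in zip(phenotypes, alleles):
        groups.setdefault(allele, []).append(value)
    if len(groups) != 2:
        raise ValueError("expected a biallelic SNP")
    (a1, g1), (a2, g2) = sorted(groups.items())
    # Welch's t: difference in means divided by its standard error,
    # without assuming equal variances in the two allele groups.
    se = sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
    t = (mean(g1) - mean(g2)) / se
    return {
        "n": {a1: len(g1), a2: len(g2)},
        "mean": {a1: mean(g1), a2: mean(g2)},
        "welch_t": t,
    }


# Hypothetical seed protein values (%) for eight inbred accessions.
protein = [40.1, 41.3, 39.8, 40.6, 43.9, 44.4, 43.2, 44.1]
snp = ["A", "A", "A", "A", "G", "G", "G", "G"]
result = allele_group_contrast(protein, snp)
print(result["n"], result["mean"], round(result["welch_t"], 2))
```

Each accession acts as one replicate of its allele, which is why the precision of the group mean depends on the number of accessions per allele as much as on the precision of each individual measurement.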
When performing GWAS, we seek to determine if the mean phenotype of all accessions sharing one allele of a SNP is significantly different from the mean phenotype of the accessions sharing the other allele at the same SNP. This conceptual difference is an important one when planning and executing the phenotyping. For example, some traits may require sophisticated analytical methods and a large number of replicates to accurately assess the phenotype of individual accessions. When one considers all accessions sharing the same allele at a SNP as being, in effect, a form of replication of the effect of this allele (albeit in different genetic backgrounds), it becomes possible to use somewhat less precise approaches that are more suitable to large-scale use. One such example is the compositional analysis of seeds or other parts of the plant. Traditionally, sophisticated and costly methods (chemical analysis, atomic absorption, HPLC, etc.) have been regarded as the “gold standard” for the measurement of specific metabolites. Thanks to the inherently large degree of replication of the phenotypic effect of an allele (shared by dozens if not hundreds of accessions), it becomes possible to fruitfully resort to technically less-demanding tools, such as near infrared spectroscopy (NIR). In examples of such work, Malle and coworkers demonstrated how alternative phenotyping tools such as NIR and X-ray spectroscopy could be used to characterize seed composition traits in soybean [12, 13].

2.2 Genotyping the Association Panel
As described in detail in Chapter 3, there are now a multitude of attractive options for genotyping very large numbers of markers on sizeable association panels. These extend across a wide range of technical approaches, from genotyping arrays, through the sequencing of a subset of the genome, all the way to genome-wide sequencing at various depths. In all cases, it is possible to secure genotypic data of very high quality. This quality, however, cannot be taken for granted and it is always important to assess both the reproducibility and the accuracy of the data; examples of such quality control tests are provided in Chapter 9. When producing a set of genotypic data, various filters can be used to retain markers on the basis of different metrics such as the amount of missing data (prior to possible imputation), the proportion of heterozygous genotypes in the case of fixed lines (as these are often due to paralogy or undiscovered copy number variation) or the minor allele frequency (MAF). In the latter case, it has been customary practice to remove markers with rare alleles, most often defined as having a MAF < 0.05. The justification for this is that it will be difficult to obtain a valid estimate of the mean phenotypic value associated with a rare allele if too few individuals in the association panel carry it. This is a rule of thumb, but some thought should go into choosing an appropriate threshold for each GWAS. As outlined by Tibbs Cortes et al. [14], blindly applying this rule could prevent the discovery of rare alleles that nonetheless have a large effect, ones that could be very useful in breeding. For example, in a small association panel (e.g., 150 accessions), a threshold MAF of 0.05 means that 7 or fewer accessions share the rare allele and that the mean phenotype will be estimated on a very small number of accessions.
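The filters just described can be sketched for a single marker as follows. This is a minimal illustration, not a recommended pipeline: the function name, the 0/1/2 genotype coding and the default thresholds for missing data and heterozygosity are assumptions made for the example. The sketch also reports how many accessions carry the minor allele, since that count, rather than the MAF alone, determines how well the allele's effect can be estimated.

```python
def marker_qc(genotypes, max_missing=0.2, max_het=0.1, min_maf=0.05):
    """Summarise one SNP and decide whether it passes three common filters.

    genotypes: one call per accession, coded as the number of copies of the
               alternate allele (0, 1, 2) or None if missing.
    The default thresholds are illustrative assumptions, not recommendations.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return {"missing_rate": 1.0, "keep": False}
    het_rate = sum(g == 1 for g in called) / len(called)
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    # Count accessions carrying at least one copy of the minor allele.
    minor_is_alt = alt_freq <= 0.5
    carriers = sum((g > 0) if minor_is_alt else (g < 2) for g in called)
    keep = (missing_rate <= max_missing
            and het_rate <= max_het
            and maf >= min_maf)
    return {
        "missing_rate": missing_rate,
        "het_rate": het_rate,
        "maf": maf,
        "minor_allele_carriers": carriers,
        "keep": keep,
    }


# Hypothetical panel of 150 inbred lines, 7 of which carry the rare allele.
snp = [2] * 7 + [0] * 143
summary = marker_qc(snp)
print(summary)
```

Lowering `min_maf` for a trait that can be measured reliably, or raising it for a noisy one, is one simple way to act on the advice that the threshold be chosen for each study rather than applied blindly.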
In contrast, in a larger panel of say 500 accessions, up to 25 accessions could contribute to the estimation of the effect of this allele. In addition, not all phenotypes are equal, with some being inherently more challenging to measure reliably. It is our view that researchers must use their knowledge of the phenotype at hand, when available, to judge how many accessions sharing a “rare” allele can yield a reliable estimate of the allele's effect on the phenotype.

To date, GWAS in crops has essentially been conducted with SNP markers despite the fact that these represent only one type of genetic variation that can differentiate the accessions that compose the association panel. As described in Chapter 10, structural variants (SVs) represent an important type of genetic variation, but one that unfortunately is much more difficult to genotype than SNPs. Nonetheless, there is growing evidence that GWAS including such variants can uncover additional loci that would otherwise have gone undetected. For example, in maize, Hufford et al. [15] showed that there is often a high degree of overlap between the genomic loci uncovered using SNPs and SVs (over 93%), as oftentimes even a causal SV will be in high LD with flanking SNPs. Nonetheless, in close to 7% of cases, only SVs allowed the identification of a locus of interest and, generally speaking, SVs explained a larger proportion of the variance for disease resistance traits than for a plethora of other traits.

Another area in which great strides have been made is in the use of imputation methods to enrich genotypic datasets. As detailed in Chapter 11, various imputation strategies and tools enable researchers to achieve high-density marker coverage at relatively low cost by combining the use of low- to medium-density genotyping with data imputation based on denser marker sets captured in reference panels. Based on the observed haplotypes, genotypes at a large number of ungenotyped markers (missing loci) can be imputed. In all cases, however, it remains highly important to carefully assess the accuracy of the resulting imputed data.

Finally, there is a growing interest in considering SNP haplotypes (multilocus genotypes) rather than individual SNPs when performing GWAS [16]. In principle, the biallelic nature of SNP markers means that they are inherently limited in their ability to capture the full diversity of alleles potentially present at a causal locus. Where multiple alleles of a gene are the source of the variation, any single SNP will necessarily pool multiple alleles (possibly differing in their impact on the phenotype) under a single allele at the SNP. The consideration of multiple neighboring SNP markers as a combined (multilocus) genotype creates the opportunity to better capture the true underlying allelic diversity at a locus.

2.3 Testing for Marker–Trait Association
Once sets of phenotypic and genotypic data have been assembled for an association panel, we arrive at the focal point of any GWAS: identifying the markers that show a statistically significant association with the trait of interest and, through this, identifying the genomic loci (and possibly candidate genes) that contribute to the observed phenotypic variation. But prior to doing this, no matter the approach chosen (single-locus, multi-locus), great care has to be taken to control for false-positive associations. As detailed in Chapter 12, numerous tools are available to capture the confounding factors responsible (such as population structure and relatedness among accessions) and to factor them into the statistical models that seek to identify significant associations. As reviewed in Chapter 4 and illustrated in Chapters 13 and 14, a plethora of models and statistical approaches have been developed to pick out markers that are truly associated with the trait of interest. As with any analysis, a critical assessment of the outputs of any model should be performed and Chapter 5 reviews ways in which the user can interpret some of these outputs to assess their credibility. Anomalies (e.g., in the Q-Q plot) should not simply be brushed under the carpet as these may indicate an underlying problem in the data or the analysis that will greatly undermine the reliability of the reported associations.

A common approach has been to explore different models and to retain the loci that are detected jointly by all or a set fraction of the models [17]. Alternatively, other authors choose to report all loci detected by the different models [18]. As a user of GWAS, it is always reassuring to see reproducibility in the identification of associated markers, which, in most cases, likely points toward loci that are truly associated. Despite this, one has to keep in mind that, as different models may be better suited to uncover associations under different assumptions, the fact that only a single model identifies an association does not necessarily mean that this association is spurious or doubtful. Given the inherent challenges in detecting all associated loci, one should view GWAS as providing important leads that will call for validation through further experimentation, be it via confirming a locus' effect in a biparental cross (as explained in Chapter 16) or through functional validation using gene knockouts or genome editing technologies (see Chapter 6).

2.4 Identification of Candidate Genes
As was highlighted above, not all users of GWAS need be interested in identifying candidate genes. For many a breeder, the purpose of a GWAS is achieved when tightly linked markers that can be exploited in MAS are obtained (see Chapter 17 for more details on this). For many users, however, the purpose of performing a GWAS is to gain a deep understanding of the genes/alleles that explain the variation seen in the trait of interest. As described above, thanks to the typically high resolution afforded by GWAS, it has become a reasonable endeavor to try to identify genes exhibiting features that would make them plausible candidates. The strategies for identifying and validating such candidate genes are explored in more detail in Chapter 15.

3 Summary of Critical Questions/Decisions When Designing a GWAS
(a) What are the specific goals being pursued?
(b) What is the relevant germplasm for achieving this goal?
(c) How many and which accessions need to be part of the association panel?
(d) How should the panel be phenotyped and how difficult is it to measure the trait(s) of interest?
(e) Which genotyping approach is best suited to providing the extent and comprehensiveness needed?
(f) Which statistical models will be used and how will differences in their outputs be handled?
(g) Do the outputs of these statistical models seem credible and how consistent are the results across models?
(h) How can these associated markers be validated?
(i) Do I want/need to identify candidate genes and, if so, how?
4 Conclusion
In crops, GWAS has become the method of choice for identifying and mapping loci that contribute to the observed phenotypic variation in a trait of interest. It is an extremely powerful approach when the experiment is well designed and subjected to adequate quality checks at each step along the way. It provides researchers with extremely useful information for applied purposes such as breeding and good leads for more basic investigations. It remains, however, an exploratory approach whose results require follow-up work in the form of validation either of the association itself or of any candidate genes that were identified.
References
1. Rasheed A et al (2017) Crop breeding chips and genotyping platforms: progress, challenges, and perspectives. Mol Plant 10(8):1047–1064
2. Tanksley SD et al (1995) Chromosome landing: a paradigm for map-based gene cloning in plants with large genomes. Trends Genet 11(2):63–68
3. Jaganathan D et al (2020) Fine mapping and gene cloning in the post-NGS era: advances and prospects. Theor Appl Genet 133:1791–1810
4. Cerrudo D et al (2018) Genomic selection outperforms marker assisted selection for grain yield and physiological traits in a maize doubled haploid population across water treatments. Front Plant Sci 9:00366
5. Wang SB et al (2016) Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci Rep 6:19444
6. Liu YC et al (2020) Pan-genome of wild and cultivated soybeans. Cell 182(1):162–176
7. Romero Navarro J et al (2017) A study of allelic diversity underlying flowering-time adaptation in maize landraces. Nat Genet 49:476–480
8. Wang W et al (2018) Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557:43–49
9. Seck W et al (2020) Comprehensive genome-wide association analysis reveals the genetic basis of root system architecture in soybean. Front Plant Sci 11:590740
10. Li Y-H et al (2020) Identification of loci controlling adaptation in Chinese soya bean landraces via a combination of conventional and bioclimatic GWAS. Plant Biotechnol J 18(2):389–401
11. Zhang H et al (2017) Genetic architecture of wild soybean (Glycine soja) response to soybean cyst nematode (Heterodera glycines). Mol Genet Genomics 292:1257–1265
12. Malle S et al (2020) Identification of loci controlling mineral element concentration in soybean seeds. BMC Plant Biol 20:419
13. Malle S et al (2020) Genome-wide association identifies several QTLs controlling cysteine and methionine content in soybean seed including some promising candidate genes. Sci Rep 10:21812
14. Tibbs Cortes L et al (2021) Status and prospects of genome-wide association studies in plants. Plant Genome 14(1):20077
15. Hufford MB et al (2021) De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373(6555):655–662. https://doi.org/10.1126/science.abg5289
16. Maldonado C et al (2019) SNP- and haplotype-based GWAS of flowering-related traits in maize with network-assisted gene prioritization. Agronomy 9(11):725
17. Cui Y-R et al (2018) The application of multilocus GWAS for the detection of salt-tolerance loci in rice. Front Plant Sci 9:1464
18. Zhong H et al (2021) Uncovering the genetic mechanisms regulating panicle architecture in rice with GPWAS and GWAS. BMC Genomics 22:86
Chapter 2
Preparation and Curation of Phenotypic Datasets
Santiago Alvarez Prado, Fernando Hernández, Ana Laura Achilli, and Agustina Amelong

Abstract
Based on case studies, in this chapter we discuss the extent to which the number and identity of quantitative trait loci (QTL) identified from genome-wide association studies (GWAS) are affected by curation and analysis of phenotypic data. The chapter demonstrates through examples the impact of (1) cleaning of outliers, and of (2) the choice of statistical method for estimating genotypic mean values of phenotypic inputs in GWAS. No cleaning of outliers resulted in the highest number of dubious QTL, especially at loci with highly unbalanced allelic frequencies. A trade-off was identified between the risk of false positives and the risk of missing interesting, yet rare alleles. The choice of the statistical method to estimate genotypic mean values also affected the output of GWAS analysis, with reduced QTL overlap between methods. Using mixed models that capture spatial trends, among other features, increased the narrow-sense heritability of traits, the number of identified QTL and the overall power of GWAS analysis. Cleaning and choosing robust statistical models for estimating genotypic mean values should be included in GWAS pipelines to decrease both false positive and false negative rates of QTL detection.

Key words Outliers, Statistical models, False QTL, Statistical power
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-1-0716-2237-7_2.
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction
The causal relationship between genetic polymorphism and the observed phenotypic differences within species is of fundamental biological interest. Genome-wide association studies (GWAS), which usually combine thousands of markers and hundreds of accessions, are a major tool for identifying these relationships. The common scheme consists of combining molecular and phenotypic data for a certain number of accessions within a statistical model, which takes into account the population structure and/or a kinship matrix in order to identify genomic regions linked to a phenotypic trait (Fig. 1). Despite the fact that this pipeline is widely accepted by
the scientific community, some of its components have not received the same amount of attention, contributing to potential flaws in the final output. Major advances in DNA marker assays and sequencing technologies have led to high-throughput genotyping at a relatively low cost, allowing a high proportion of genome coverage [1]. Important breakthroughs have also been made in statistical analysis, evolving from the analysis of variance (ANOVA) to mixed models, which dramatically decreased the false positive rates, improving estimations of genotype–phenotype associations [2–4]. More limited advances are observed in the generation of high-throughput and valuable phenotypic information. To be sure, over the past decade impressive progress has been made on phenotyping technologies, such as the development of novel sensors and imaging techniques for a wide range of traits, organs and situations [5, 6]. However, data handling and processing remain major challenges when translating sensor information into knowledge [7]. In this regard, data curation and preparation for GWAS analysis have received less attention than other parts of the pipeline (Fig. 1).
Fig. 1 Schematic representation of the GWAS pipeline from the plant to the QTL (plants → phenotyping → raw dataset → data curation → clean dataset → data analysis → genotypic mean values, combined with molecular markers → GWAS → QTL). Three different bottlenecks were marked at different steps of the phenotypic data process. Red bottlenecks are addressed in this chapter
Numerous QTL and GWAS studies have been published on a wide range of species and phenotypic traits, such as grain yield determination [8, 9], drought tolerance [10, 11], and many other traits of interest. However, after 30 years of research there is only a limited number of publications demonstrating results in plant breeding programs. The inconsistency of these results can be related to different causes, such as the specificity of the mechanism under study and its dependence on specific environmental conditions [12], or something even simpler, such as absent or inaccurate data curation and/or the use of an inadequate statistical method for genotypic mean estimation. In this chapter we aim to shed light on a basic but key subject: phenotypic data curation and preparation, which is an essential step within the GWAS pipeline but has not always received the same attention as other steps of the process. To demonstrate its importance, we present data analyses that highlight the consequences, for the identification of QTL, of (1) cleaning phenotypic data before genotypic mean estimation and GWAS analysis, and of (2) estimating genotypic means with different methods.
2 Data Curation
Data curation is the process of managing and promoting the use of data from their point of origin, to ensure that they are fit for the intended objectives and available for discovery and reuse. For dynamic datasets this may mean continuous enrichment or updating to be adequate for different purposes. Higher levels of curation will also involve maintaining links with annotation and other published materials. With the advent of high-throughput phenotyping technologies, massive datasets involving millions of images from experiments performed in the field and in controlled or semi-controlled conditions, concerning hundreds of accessions at different phenological stages, have become automatically available [5, 13, 14]. However, the incorporation of new technologies to solve different problems has unintended consequences, creating new drawbacks and opportunities and consequently contributing to the dynamics of continuous innovation [15]. After data are generated and transferred to a computing facility, subsequent processing and management bring further challenges due to the limitations of current analytical techniques [7, 16]. The development of new methods to address the needs generated by phenotyping technologies requires the identification of possible drawbacks that may appear along the data analysis process.

Daily automatic collection of data by different sensors can lead to the presence of outliers. An outlier is usually defined as an observation considered to be inconsistent with the distribution of values in the analyzed dataset [17]. Observations may be time points [18] or whole dynamic courses of one or more variables [19]. For example, in phenotyping platforms, the most common sources of error detected within datasets are related to noisy images (Fig. 2a), technical issues (Fig. 2b), and human errors (Fig. 2c). These types of outliers are easily detected and excluded from the database by different methods [17, 18, 20] before estimating genotypic mean values for GWAS.

Fig. 2 Examples of dubious points detected within growth curves. An example of the three typical sources of error: (a) noisy image (leaf area in m² vs. DOY), (b) mechanical error (panel title: technical issue; leaf area in m² vs. DOY) and (c) human error (plant height in mm vs. DOY). The orange circles highlight the outlier (dubious point) produced, in each case, by the three different sources of error

In plant science, an outlier is a biological replicate (an observation, a plant or a plot) that deviates from the overall distribution of plants or plots on a multi-criteria basis, regardless of the quality of the measurements. In phenotyping platforms with hundreds of accessions, and also in many other experiments in controlled conditions, the experimental unit (the smallest entity to which a treatment can be applied; http://www.miappe.org) is frequently an individual plant with 3–10 replicates per accession, so one or more outlier plants may have a high impact on genotypic means [21]. In field experiments, the experimental unit is usually the plot, which contains tens of plants [22]. Although outlier plots have a smaller impact on genotypic mean estimation than outlier plants, they can have a marked effect on the detection of a QTL [23].

Regarding data curation, the detection and removal of outliers and the method selected for this purpose represent key decisions in GWAS analysis. These actions are not always clear or even declared and can have an important impact on the final output [23, 24]. Successful identification and removal of true outliers from data remains a very challenging task to date, especially in the high-throughput data typically used in GWAS analysis and genomic prediction, and may even have undesirable consequences [25]. In a previous study, GWAS analysis was carried out in a maize association panel for leaf area under different situations [23].
The analysis was performed over the original data set, without outlier detection, and the data sets resulting from three different cleaning methods. The choice of the cleaning method showed a large effect on the amount and distribution of allelic effects at QTL identified in all the tested datasets (Fig. 3). The variation in final output is partially explained by the unbalanced allelic frequency observed in certain markers, where the minor allele frequency usually ranged from 1 to 5%, typical of GWAS analyses [23]. Thus, detecting an outlier within an accession carrying the minor allele could have an impact on the significance of a QTL provided that the size of the population is small (i.e., 4).

The narrow-sense heritability was assessed from the variance component estimates of the mixed models implemented in the GAPIT R package [44, 46]. In both datasets, genotypic mean values estimated using MEANS were very similar to those obtained using BLUPs or BLUEs (Table 1). Similar results were obtained in previous work comparing REML with ANOVA methods [24]. The authors concluded that important differences appear with unbalanced data sets, when part of the phenotypic information is missing.

Table 1 Pearson correlation coefficients between MEANS, BLUEs, and BLUPs for the maize and wheat datasets under each environment

Species  Trait     Comparison        Env1   Env2   Env3   Env4
Maize    Yield     MEANS vs. BLUEs   1      0.97   0.979  0.987
Maize    Yield     MEANS vs. BLUPs   1      0.974  0.981  0.986
Maize    Yield     BLUEs vs. BLUPs   1      0.996  0.999  0.999
Maize    Anthesis  MEANS vs. BLUEs   0.91   0.979  0.996  0.997
Maize    Anthesis  MEANS vs. BLUPs   0.911  0.978  0.996  0.996
Maize    Anthesis  BLUEs vs. BLUPs   1      0.999  0.999  0.999
Wheat    Yield     MEANS vs. BLUEs   1      0.998  0.95   –
Wheat    Yield     MEANS vs. BLUPs   1      0.999  0.944  –
Wheat    Yield     BLUEs vs. BLUPs   1      1      0.988  –
Wheat    Anthesis  MEANS vs. BLUEs   0.995  0.993  0.993  –
Wheat    Anthesis  MEANS vs. BLUPs   0.998  0.992  0.994  –
Wheat    Anthesis  BLUEs vs. BLUPs   0.999  0.999  1      –

Despite the fact that differences in genotypic mean values were almost negligible, important differences appeared when comparing the outputs of GWAS analysis. For grain yield in maize, using MEANS, BLUEs or BLUPs as input of GWAS largely altered the number and identity of identified QTL (Fig. 4). The number of QTL was higher for BLUEs (68), and similar for MEANS (49) and BLUPs (46). A larger overlap was observed for BLUEs and BLUPs than for MEANS and either BLUPs or BLUEs (Fig. 4). Out of 105 QTL, only 20 (19%) were shared between the three methods (Fig. 4). In wheat, the number of QTL for grain yield was higher for BLUPs (15), followed by BLUEs (12) and MEANS (9). Out of 15 QTL, 9 (60%) were shared between the three methods (Fig. 4), three QTL were identified only by BLUPs and no unique QTL were identified by either BLUEs or MEANS (Fig. 4). These results indicate that the statistical method selected for estimating genotypic mean values alters the outcome of GWAS in both the maize and wheat datasets, and consequently the conclusions.

For time to anthesis in maize, the number of QTL was similar for the three methods (Fig. 4). Out of 102 QTL, only 16 (16%) were shared between the three methods (Fig. 4), and a larger overlap was observed for BLUEs and BLUPs than for MEANS and either BLUPs or BLUEs (Fig. 4). Similar patterns were observed for time to anthesis in wheat: 13 QTL were found across environments, but only three (23%) were shared by the three methods (Fig. 4).

Fig. 4 Venn diagram showing the number of QTL detected for MEANS, BLUEs and BLUPs methods for grain yield and anthesis in maize and wheat datasets

Narrow-sense heritability (h²) captures the proportion of phenotypic variation that is due to additive genetic effects. In plant breeding, the magnitude of the response to artificial selection depends on the h² of the target trait and the intensity of selection [26]. Therefore, increasing h² is key to maximizing genetic gains. To investigate whether statistical methods to estimate genotypic mean values affect the heritability of the trait, h² was compared between MEANS, BLUEs, and BLUPs. In maize, h² was larger for BLUEs (0.89 and 0.88 for grain yield and anthesis) and BLUPs (0.90 and 0.87 for grain yield and anthesis) than for MEANS (0.85 and 0.83 for grain yield and anthesis), indicating that statistical methods that take into account spatial variation (to estimate BLUPs or BLUEs) actually captured more heritable variation than MEANS. On the other hand, very low h² values were observed for both grain yield and anthesis in wheat (h² < 0.02), making comparisons impossible. The choice of the method for estimating genotypic mean values was shown to be associated with the number of detected QTL.
To test whether it also affects the ability to detect a QTL (i.e., the power of GWAS), we compared the -log10(P) values for each environment and trait in the maize dataset (Fig. 5a–h). We observed an overall increase of -log10(P) values for BLUPs and BLUEs relative to MEANS (Fig. 5a–h), indicating that the use of BLUEs and BLUPs increased the statistical power of GWAS. Moreover, we compared, within each statistical method, SNPs identified with all three methods (common SNPs) with the rest of the significant SNPs (other SNPs) for two features: -log10(P) and minor allele frequency. Common SNPs showed significantly higher -log10(P) than the other SNPs in five out of six comparisons (Table S1), while no differences were observed in minor allele frequency between these two groups (Table S1). The increased statistical power directly affects the number of QTL identified at different significance thresholds. For example, for time to anthesis in maize at Env1 (Fig. 5b), increasing the significance threshold from -log10(P) = 4 to 6 affected the number of identified QTL differently for each method: significant QTL decreased from 11 to 7 for BLUPs, from 6 to 4 for BLUEs, and from 10 to 1 for MEANS (Fig. 5i–k). Therefore, even for regions identified by all methods, the ability to detect them is greater when genotypic mean values are estimated with statistical methods that consider spatial trends and genotypic effects than with MEANS.
Fig. 5 Impact of the statistical method for estimating genotypic mean values on the power of GWAS; (a–h) relationship between the significance level of every SNP using MEANS, BLUEs or BLUPs methods for grain yield and anthesis under the four environments of the maize dataset: (a, b) Env1, (c, d) Env2, (e, f) Env3, (g, h) Env4; (i–k) significant associations for time to anthesis at Env1 using MEANS, BLUEs, BLUPs under different significance thresholds
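The effect of the threshold on the number of detected signals can be illustrated with a minimal sketch. The P-values below are hypothetical (they are not the chapter's data), and the function name `count_significant` is an assumption made for illustration. The sketch counts SNPs rather than QTL, since grouping SNPs into QTL would additionally require an LD or distance rule.

```python
import math


def count_significant(p_values, threshold):
    """Count SNPs whose -log10(P) is at or above the given threshold."""
    return sum(1 for p in p_values if -math.log10(p) >= threshold)


# Hypothetical P-values for the same five SNPs under two estimation methods.
p_means = [1e-3, 5e-5, 2e-5, 2e-4, 3e-7]
p_blups = [3e-5, 5e-7, 2e-7, 2e-5, 1e-8]

for threshold in (4, 6):
    n_means = count_significant(p_means, threshold)
    n_blups = count_significant(p_blups, threshold)
    print(f"threshold={threshold}: MEANS={n_means}, BLUPs={n_blups}")
```

With these made-up values, the set with generally smaller P-values retains more SNPs at the stricter threshold, which mirrors the pattern reported above for BLUPs relative to MEANS.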
4 Concluding Remarks
GWAS analysis is a powerful tool for identifying genomic regions responsible for phenotypic variation, involving different stages of analysis regarding genomic and phenotypic data. In this chapter we focused on those aspects that often receive less attention but can strongly influence the final output. Throughout this chapter we showed and discussed the relevance of data curation and analysis of phenotypic data in order to obtain repeatable results. The effect of inappropriate decisions regarding both issues was demonstrated through the analysis of real data for maize and wheat. Some points we would like to highlight from this chapter are:
• The method for outlier identification is an essential criterion in GWAS analyses. The choice of one method or another thus depends on an optimization of criteria and on strategic decisions for genetic analyses.
• The lack of attention to data cleaning has a direct impact on the number and effect of detected QTL and can blur final conclusions.
• Datasets should be organized and stored in such a way that they can be reanalyzed. This requires that detected outliers are identified as such but are not deleted from the information system, and that the rules for outlier detection are kept as metadata of the GWAS analysis.
• Genotypic mean values for GWAS analyses must be estimated by means of robust models that take into account the spatial variation of the experiment, such as mixed models.
• The use of mixed models that take into account spatial trends increased the number of significant QTL, likely as a result of increasing the narrow-sense heritability of traits and the statistical power of GWAS.
• The choice of the method for estimating genotypic mean values impacts the statistical power of GWAS. Thus, the consequences of the chosen method may be counteracted by the choice of a higher (or lower) threshold for GWAS analysis.
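The recommendation to flag outliers without deleting them, and to keep the detection rule as metadata, could be sketched as follows. The median ± k·MAD rule, the field names, and the yield values are all hypothetical choices made for illustration; they are not the outlier methods evaluated in this chapter.

```python
import statistics


def flag_outliers(records, trait, k=3.0):
    """Return copies of the records with an 'outlier' flag, plus the rule used.

    Illustrative rule (an assumption): a value is flagged when it lies more
    than k median absolute deviations (MAD) away from the median.
    No record is removed.
    """
    values = [r[trait] for r in records]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    flagged = []
    for r in records:
        is_outlier = mad > 0 and abs(r[trait] - med) > k * mad
        flagged.append({**r, "outlier": is_outlier})
    # Metadata describing how the flags were produced, stored with the data.
    rule = {"trait": trait, "method": "median +/- k*MAD", "k": k,
            "median": med, "mad": mad}
    return flagged, rule


# Hypothetical plot-level grain yield values (t/ha).
plots = [
    {"plot": "P1", "yield_t_ha": 9.8},
    {"plot": "P2", "yield_t_ha": 10.1},
    {"plot": "P3", "yield_t_ha": 10.4},
    {"plot": "P4", "yield_t_ha": 9.9},
    {"plot": "P5", "yield_t_ha": 2.3},
]

flagged, rule = flag_outliers(plots, "yield_t_ha")
print([r["plot"] for r in flagged if r["outlier"]])
print(rule)
```

Because the flagged records and the rule are returned together, a later reanalysis can apply a different rule to the same untouched raw values.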
The inclusion of these bullet points within a GWAS pipeline will improve the precision and repeatability of GWAS outputs. Here, we provide tools (methods and codes) to implement data curation and statistical methods to estimate genotypic mean values in future association studies. Otherwise, the gap between detected and validated QTL will remain large.
References
1. Eathington SR, Crosbie TM, Edwards MD, Reiter RS, Bull JK (2007) Molecular markers in a commercial breeding program. Crop Sci 47. https://doi.org/10.2135/cropsci2007.04.0015IPBS
2. Huang M, Liu X, Zhou Y, Summers RM, Zhang Z (2019) BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience 8(2):giy154
3. Tibbs Cortes L, Zhang Z, Yu J (2021) Status and prospects of genome-wide association studies in plants. Plant Genome 14(1):e20077
4. Van Eeuwijk FA, Bustos-Korts DV, Malosetti M (2016) What should students in plant breeding know about the statistical aspects of genotype × environment interactions? Crop Sci 56(5):2119–2140. https://doi.org/10.2135/cropsci2015.06.0375
5. Yang W, Feng H, Zhang X et al (2020) Crop phenomics and high-throughput phenotyping: past decades, current challenges, and future perspectives. Mol Plant 13:187–214
6. Zhao C, Zhang Y, Du J et al (2019) Crop phenomics: current status and perspectives. Front Plant Sci 10:714
7. Tardieu F, Cabrera-Bosquet L, Pridmore T, Bennett M (2017) Plant phenomics, from sensors to knowledge. Curr Biol 27:R770–R783. https://doi.org/10.1016/j.cub.2017.05.055
8. Bezant J, Laurie D, Pratchett N, Chojecki J, Kearsey M (1997) Mapping QTL controlling yield and yield components in a spring barley (Hordeum vulgare L.) cross using marker regression. Mol Breed 3(1):29–38
9. Li D, Pfeiffer TW, Cornelius PL (2008) Soybean QTL for yield and yield components associated with Glycine soja alleles. Crop Sci 48(2):571–581
10. Agrama HAS, Moussa ME (1996) Mapping QTLs in breeding for drought tolerance in maize (Zea mays L.). Euphytica 91(1):89–97
11. Specht JE, Chase K, Macrander M, Graef GL, Chung J, Markwell JP, Germann M, Orf JH, Lark KG (2001) Soybean response to water: a QTL analysis of drought tolerance. Crop Sci 41(2):493–509
12. Tardieu F (2011) Any trait or trait-related allele can confer drought tolerance: just design the right drought scenario. J Exp Bot 63:25–31
13. Furbank RT, Tester M (2011) Phenomics – technologies to relieve the phenotyping bottleneck. Trends Plant Sci 16:635–644. https://doi.org/10.1016/j.tplants.2011.09.005
14. Neveu P, Tireau A, Hilgert N, Nègre V, Mineau-Cesari J, Brichet N, Chapuis R, Sanchez I, Pommier C, Charnomordic B, Tardieu F, Cabrera-Bosquet L (2019) Dealing with multi-source and multi-scale information in plant phenomics: the ontology-driven phenotyping hybrid information system. New Phytol 221(1):588–601. https://doi.org/10.1111/nph.15385
15. Sadras VO (2020) Agricultural technology is unavoidable, directional, combinatory, disruptive, unpredictable and has unintended consequences. Outlook Agric 49(4):293–297. https://doi.org/10.1177/0030727020960493
16. Rahaman MM, Chen D, Gillani Z, Klukas C, Chen M (2015) Advanced phenotyping and phenotype data analysis for the plant growth and development study. Front Plant Sci 6:619. https://doi.org/10.3389/fpls.2015.00619
17. Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
18. Grubbs FE (1950) Sample criteria for testing outlying observations. Ann Math Statist 21:27–58. https://doi.org/10.1214/aoms/1177729885
19. Hubert M, Rousseeuw PJ, Segaert P (2015) Multivariate functional outlier detection. Stat Methods Appl 24:177–202. https://doi.org/10.1007/s10260-015-0297-8
20. Rousseeuw PJ, Hubert M (2011) Robust statistics for outlier detection. Wiley Interdiscip Rev Data Min Knowl Discov 1:73–79. https://doi.org/10.1002/widm.2
21. Estaghvirou SBO, Ougutu JO, Piepho HP (2014) Influence of outliers on accuracy estimation in genomic prediction in plant breeding. G3 Genes Genomes Genet 4:2317–2328
22. Tollenaar M, Muldoon JF, Daynard TB (1984) Differences in rates of leaf appearance among maize hybrids and phases of development. Can J Plant Sci 64:759–763. https://doi.org/10.4141/cjps84-104
23. Alvarez Prado S, Sanchez I, Cabrera-Bosquet L, Grau A, Welcker C, Tardieu F, Hilgert N (2019) To clean or not to clean phenotypic datasets for outlier plants in genetic analyses? J Exp Bot 70(15):3693. https://doi.org/10.1093/jxb/erz191
24. Bernal-Vasquez AM, Utz HF, Piepho HP (2016) Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML. Theor Appl Genet 129:787–804. https://doi.org/10.1007/s00122-016-2666-6
25. Cerioli A, Farcomeni A (2011) Error rates for multivariate outlier detection. Comput Stat Data Anal 55(1):544–553. https://doi.org/10.1016/j.csda.2010.05.021
26. Bernardo R (2008) Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci 48:1649–1664
27. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE (2016) Comment: the FAIR guiding principles for scientific data management and stewardship. Sci Data 3:160018
28. Barabaschi D, Tondelli A, Desiderio F, Volante A, Vaccino P, Valè G, Cattivelli L (2016) Next generation breeding. Plant Sci 242:3–13
29. Liu H, Yan J (2019) Crop genome-wide association study: a harvest of biological relevance. Plant J 97(1):8–18
30. Varshney RK, Terauchi R, McCouch SR (2014) Harvesting the promising fruits of genomics: applying genome sequencing technologies to crop breeding. PLoS Biol 12(6):e1001883
31. Xiao Y, Liu H, Wu L, Warburton M, Yan J (2017) Genome-wide association studies in maize: praise and stargaze. Mol Plant 10(3):359–374
32. Huang X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42(11):961
33. Zhang N, Gibon Y, Wallace JG, Lepak N, Li P, Dedow L, Chen C, So Y-S, Kremling K, Bradbury PJ, Brutnell T, Stitt M, Buckler ES (2015) Genome-wide association of carbon and nitrogen metabolism in the maize nested association mapping population. Plant Physiol 168:575–583
34. Millet EJ, Kruijer W, Coupel-Ledru A, Alvarez Prado S, Cabrera-Bosquet L, Lacube S, Charcosset A, Welcker C, van Eeuwijk F, Tardieu F (2019a) Genomic prediction of maize yield across European environmental conditions. Nat Genet 51(6). https://doi.org/10.1038/s41588-019-0414-y
35. Millet EJ, Welcker C, Kruijer W, Negro S, Coupel-Ledru A, Nicolas SD, Laborde J, Bauland C, Praud S, Ranc N, Presterl T, Tuberosa R, Bedo Z, Draye X, Usadel B, Charcosset A, Van Eeuwijk F, Tardieu F (2016) Genome-wide analysis of yield in Europe: allelic effects vary with drought and heat scenarios. Plant Physiol 172:749–764. https://doi.org/10.1104/pp.16.00621
36. Cullis BR, Smith AB, Coombes NE (2006) On the design of early generation variety trials with correlated data. J Agric Biol Environ Stat 11:381. https://doi.org/10.1198/108571106x154443
37. Velazco JG, Rodríguez-Álvarez MX, Boer MP, Jordan DR, Eilers PHC, Malosetti M, van Eeuwijk FA (2017) Modelling spatial trends in sorghum breeding field trials using a two-dimensional P-spline mixed model. Theor Appl Genet 130(7):1375–1392. https://doi.org/10.1007/s00122-017-2894-4
38. Williams ER, John JA, Whitaker D (2014) Construction of more flexible and efficient P-rep designs. Aust New Zeal J Stat 56(1):89–96. https://doi.org/10.1111/anzs.12068
39. Millet EJ, Pommier C, Buy M et al (2019b) A multi-site experiment in a network of European fields for assessing the maize yield response to environmental scenarios. Portail Data INRAE. https://doi.org/10.15454/iasstn
40. Sukumaran S, Crossa J, Jarquin D, Lopes M, Reynolds MP (2017) Genomic prediction with pedigree and genotype × environment interaction in spring wheat grown in South and West Asia, North Africa, and Mexico. G3 Genes Genomes Genet 7(2):481–495
41. Sukumaran S, Dreisigacker S, Lopes M, Chavez P, Reynolds MP (2015) Genome-wide association study for grain yield and related traits in an elite spring wheat population grown in temperate irrigated environments. Theor Appl Genet 128(2):353–363
42. Rodríguez-Álvarez MX, Boer MP, Eilers PHC, van Eeuwijk FA (2018) SpATS: spatial analysis of field trials with splines. R package version 1.0–8
43. Bates D, Mächler M, Bolker B, Walker S (2014) Fitting linear mixed-effects models using lme4. arXiv Prepr arXiv14065823
44. Lipka AE, Tian F, Wang Q, Peiffer J, Li M, Bradbury PJ, Gore MA, Buckler ES, Zhang Z (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics 28(18):2397–2399
45. Gao X, Becker LC, Becker DM, Starmer JD, Province MA (2010) Avoiding the high Bonferroni penalty in genome-wide association studies. Genet Epidemiol 34(1):100–105
46. Tang Y, Liu X, Wang J, Li M, Wang Q, Tian F, Su Z, Pan Y, Liu D, Lipka AE (2016) GAPIT version 2: an enhanced integrated tool for genomic association and prediction. Plant Genome 9(2). https://doi.org/10.3835/plantgenome2015.11.0120
Chapter 3
Genotyping Platforms for Genome-Wide Association Studies: Options and Practical Considerations
David L. Hyten
Abstract
Genome-wide association studies (GWAS) in crops require genotyping platforms that are capable of producing accurate, high-density genotyping data on hundreds of plants in a cost-effective manner. Currently there are multiple commercial platforms available that are being used effectively across crops. These platforms include genotyping arrays, such as the Illumina Infinium arrays and the Applied Biosystems Axiom arrays, along with a variety of resequencing methods. These methods are being used to genotype from tens of thousands up to millions of markers on GWAS panels, in crops ranging from those with simple genomes to those with very complex, large, polyploid genomes. Depending on the crop and the goal of the GWAS, there are several options and practical considerations to take into account when selecting a genotyping technology to ensure that the right coverage, accuracy, and cost for the study are achieved.
Keywords Genotyping platforms, Genotyping-by-sequencing, Genotyping arrays, Genome-wide association studies
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction
When setting up genotyping for a genome-wide association study (GWAS), a researcher has several options and practical considerations to weigh when selecting a genotyping technology. Genotyping technologies are continually evolving and include many readily available commercial genotyping arrays as well as many different sequencing techniques for genotyping GWAS populations. Many of these options are available as a service or as techniques researchers can perform on their own. When selecting a platform for a GWAS, the main factors to consider are cost, throughput, and the quality of data the technology will provide for the study, but several other practical aspects also need to be considered. Fortunately, the wide range of genotyping options available lets the research team pick the one that addresses everything that must be taken into consideration to make a GWAS successful.
One key aspect of enabling GWAS is obtaining enough genome-wide molecular markers on enough individuals to have sufficient power to detect significant associations. GWAS in plants require, at minimum, thousands of markers screened on hundreds of individuals. The number of variants needed to adequately cover the genome depends on the amount of linkage disequilibrium (LD) present within the GWAS panel. LD is the nonrandom association of alleles at different loci within a population. LD varies greatly between species and even within species, depending on the panel assembled for the GWAS. Depending on the genotyping platform chosen by the research team and the number of markers genotyped, the study may miss a significant proportion of the variation segregating in the panel. This could require the research team to come back and regenotype with a larger number of markers if the first platform fails to identify any positive GWAS hits that explain a significant portion of the phenotypic variance.
In addition to varying between GWAS panels, LD also varies greatly within the genome. Not only does this variation affect the resolution of GWAS, it also affects the density of markers needed to tag the haplotype variation in a given region. In general, heterochromatic regions have much higher LD and require fewer markers, while euchromatic regions have significantly lower LD and require a higher density of markers.
When molecular markers such as restriction fragment length polymorphisms (RFLP), simple sequence repeat (SSR) markers, or amplified fragment length polymorphism (AFLP) markers were first being used, they were low throughput and did not provide a sufficiently high density of marker coverage across the genome. Studies produced at the time mapped only a few hundred markers in the populations being studied.
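As a rough illustration of LD as a nonrandom association of alleles, the sketch below computes the commonly used r² statistic between two biallelic markers. It assumes phased or fully inbred genotypes coded 0/1 per line, which is a simplification; the function name `ld_r2` and the toy genotypes are hypothetical.

```python
def ld_r2(snp_a, snp_b):
    """Squared correlation (r^2) between two biallelic loci coded 0/1.

    D = p_AB - p_A * p_B, and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("both markers need the same, non-zero number of lines")
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic marker")
    return d * d / denom


# Toy genotypes for eight inbred lines (hypothetical).
marker_1 = [0, 0, 1, 1, 0, 1, 1, 0]
marker_2 = [0, 0, 1, 1, 0, 1, 1, 0]  # identical pattern: complete LD
marker_3 = [0, 1, 0, 1, 0, 1, 0, 1]  # alleles combine at random with marker_1

print(ld_r2(marker_1, marker_2))
print(ld_r2(marker_1, marker_3))
```

In a region where neighboring markers behave like `marker_1` and `marker_2`, one marker tags the other and fewer markers are needed; where they behave like `marker_1` and `marker_3`, each marker carries independent information and a higher density is required.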
While SSR markers are multiallelic and easy to amplify and run on gels, their multiplex capabilities and throughput are limited. In contrast, there are millions of single nucleotide polymorphisms (SNPs) segregating in populations within species, and technologies were quickly developed that could multiplex thousands of assays into one reaction. Two of the early technologies still in use today are the Affymetrix Axiom technology (now the Applied Biosystems Axiom Genotyping Array) [1] and the Illumina Infinium assay [2, 3]. These microarray-based technologies are capable of genotyping hundreds of thousands of SNPs in a single reaction. The Affymetrix Axiom and Illumina Infinium arrays, along with other technologies of the time, enabled researchers to perform GWAS in crops with a sufficiently high density of SNPs.
As the array technologies were starting to be used for GWAS, another key technology was being developed that would
also prove enabling for large-scale GWAS: next-generation sequencing (NGS). Three NGS technologies appeared in quick succession, starting in 2005 with the 454 sequencer, followed the next year by the Solexa Genome Analyzer (now the basis of the Illumina sequencing technology), and then the ABI SOLiD sequencer [4]. These three technologies started a revolution that saw the cost of sequencing drop dramatically while sequencing throughput rose significantly. For GWAS, the Illumina sequencing technology has been the foundation for genotyping due to its exceptional throughput, generating hundreds of millions of sequence reads at a cost researchers can afford for typical GWAS. This has enabled several techniques that take advantage of NGS to genotype GWAS populations by sequencing, often referred to as genotyping-by-sequencing (GBS). These techniques typically involve treating DNA with a restriction enzyme and size selecting, selectively capturing and enriching portions of the genome with capture probes, or resequencing the whole genome of the GWAS population.
Whichever technology is selected, there are many aspects to consider, including cost, throughput, and quality of the data obtained. Since technology is evolving rapidly, the specific technologies reviewed here are only a snapshot in time. The practical considerations raised here are also likely to apply to any new technology developed in the future, and they should be examined for any new technology beyond the genotyping platforms presented in this review.
2 Current Genotyping Platforms Commonly Used for GWAS
2.1 Genotyping Arrays
Genotyping arrays were the first technologies that enabled researchers to routinely perform GWAS in plants, and they remain a popular choice of genotyping platform today. They are a mature technology that provides highly accurate SNP calls. A number of commercial providers offer genotyping services for several commercially available arrays and for custom arrays. The number of SNPs on these arrays falls in the moderate- to high-density range, depending on the commercial array available for the crop or on whether the researcher wants the added expense of designing a custom array. They have a high sample throughput and a relatively low to moderate cost per sample, and data management is straightforward. Genotyping arrays generally have very low missing data rates, which is advantageous for GWAS. As mentioned previously, the two array technologies commonly used in crops for GWAS are the Illumina Infinium and the Applied Biosystems Axiom arrays [5]. They provide marker densities from a
few thousand markers, which is suitable for genomic selection, to hundreds of thousands of markers. The available arrays with SNP densities above 10,000 SNPs have been regularly used for GWAS.
2.2 Illumina Infinium Assay
Illumina Infinium arrays support genotyping from hundreds of markers up to hundreds of thousands of markers with as little as 200 ng of input DNA. These arrays can genotype many variant types, including copy number variants (CNVs), insertions/deletions (INDELs), SNPs, and structural variants (SVs). For most plant species the arrays are designed with SNPs as the predominant marker. Due to cost, most plant Infinium arrays have been designed to carry tens of thousands of SNPs or fewer. The typical plant Infinium chips range from 6000 SNPs up to the wheat 90,000 SNP chip [5–9]. They have been used to genotype simple genomes, large genomes such as rye, and even larger polyploid genomes such as wheat [10–12]. Another attractive aspect of the Illumina arrays is their throughput of thousands of samples per week, which allows most GWAS panels to be genotyped easily.
Several commercial labs offer services for running samples on Illumina’s Infinium arrays, saving researchers from having to buy expensive equipment. In addition, Illumina offers many “off the shelf” commercial arrays for different crops that can be immediately ordered and run in these labs. Currently, Illumina offers consortium-designed arrays for 16 different plant species, including maize, barley, cotton, wheat, pepper, soybean, brassica, rice, and cowpea (Margaret Mueller, personal communication). Illumina can also create new arrays or add-on content for these existing arrays. The advantage of these consortium arrays is that they give each crop community the opportunity to lower the per-sample cost of genotyping through the combined purchasing power of the research community. The disadvantage is that individual researchers are limited to the SNPs preselected by the consortium. The resulting ascertainment bias may leave too few SNPs, or the wrong SNPs, for the population a researcher wishes to genotype for their specific GWAS. The SNPs selected for the consortium arrays were derived from a specific set of plant germplasm and selected using specific criteria decided on by the consortium [13]. This could cause the consortium’s array to miss significant variation when used to genotype the GWAS panel. Before using any of these predesigned arrays, the researcher should understand how the consortium designed the array, the germplasm used to select the SNPs, and how the SNPs were selected to populate the array. This will help determine whether these SNPs are appropriate for genotyping the researcher’s GWAS population. If the germplasm used to select the SNPs for the array is too distantly related to the GWAS population the researchers want to
genotype, then the researchers run the risk of not having enough SNPs segregating at high enough allele frequencies in their GWAS population. The money saved by running a consortium array may be lost if too few SNPs segregate in the GWAS population to perform a proper association study, forcing the researchers to go back and regenotype with another technology or design a custom Infinium array. Many of the arrays do offer the option to add additional SNPs for an additional charge, which may help mitigate ascertainment bias and provide enough moderate- to high-allele-frequency SNPs suitable for the GWAS.
Even if the Infinium array seems appropriate for genotyping, it may still carry many SNPs that will be uninformative for the GWAS. The number of usable SNPs should be expected to be lower than the raw number of SNPs on the array. This can vary greatly depending on the array selected and the crop. For example, in rye 8942 SNPs from the 15 k rye Infinium array were used for GWAS, approximately 60% of the SNPs on the array [10]. In pea it was 11,366 SNPs from the 13.2 K SNP array (86%) [14]. Jin et al. [15] obtained 31,846 informative SNPs from their 60 k Brassica Infinium array (53%). In soybean, there were 36,000 SNPs from the 50 k Soy Infinium array (72%) [16]. Wheat, due to its complex genome, has the biggest discrepancies reported with its 90 k Infinium array. Recently published GWAS studies report using 15,737 SNPs (17%) [17], 31,000 SNPs (34%) [18], 16,425 SNPs (18%) [19], 13,006 SNPs (14%) [20], 20,853 SNPs (23%) [21], and 42,001 SNPs (47%) [22].
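The percentages quoted above follow from dividing the number of SNPs used by the nominal array size. The short sketch below reproduces a few of them; it assumes the nominal sizes can be taken at face value (for example, "15 k" as 15,000), whereas the exact number of assays on each array may differ slightly. The function name `informative_percent` is hypothetical.

```python
def informative_percent(snps_used, nominal_array_size):
    """Percentage of the nominal array content that was used for GWAS."""
    return 100.0 * snps_used / nominal_array_size


# (label, SNPs used in the cited study, nominal array size assumed here)
examples = [
    ("rye 15 k", 8942, 15_000),
    ("pea 13.2 K", 11_366, 13_200),
    ("Brassica 60 k", 31_846, 60_000),
    ("soybean 50 k", 36_000, 50_000),
    ("wheat 90 k, lowest reported", 13_006, 90_000),
    ("wheat 90 k, highest reported", 42_001, 90_000),
]

for label, used, nominal in examples:
    print(f"{label}: {informative_percent(used, nominal):.0f}%")
```

Running the same calculation on a pilot subset of a panel, before committing to a platform, gives a rough idea of how many markers an array is likely to deliver for that germplasm.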
To improve the wheat chip and remove many of the uninformative SNPs, a 15 k Infinium array was designed on which a higher percentage of markers has been informative, as demonstrated by Arif and Borner [23], who reported using 11,139 SNPs from the 15 k Infinium array (74%).
If a GWAS is going to use DNA pooled from multiple plants to represent a sample or accession, instead of DNA from a single plant, then heterogeneity of the accession needs to be taken into consideration. Some of the technologies can be sensitive to the copy number of an allele. Unequal allele dosage can reduce the number of accurate SNP calls below what would be expected. For example, Illumina reports 99.99% accurate genotype calling for its Infinium assay. When the SoySNP50K was used to genotype the soybean germplasm collection, three plants per accession were used for the DNA extraction [24]. While these accessions are highly inbred, some heterogeneity is still present within them. Due to the unequal representation of alleles, the heterozygous cluster of the genotype calls becomes extremely difficult to distinguish from the homozygous clusters, compared with sampling from single plants (Fig. 1). Depending on the drift of a heterogeneous allele within each accession, even sampling 20 plants may not produce equal dosage of each allele, so these issues may persist whenever the two alleles are not at equal frequency within the line.
Fig. 1 (a) Soybean inbred lines genotyped with the SoySNP50k Infinium array. DNA was extracted from three seeds per inbred line. (b) Soybean plant genotyped with the SoySNP50k Infinium array. DNA was extracted from the single F5 plant from which the soybean inbred lines were derived
2.3 Applied Biosystems Axiom Arrays
Applied Biosystems Axiom genotyping arrays are another high-density genotyping array option widely used as a genotyping platform for GWAS [5]. It is also a very mature technology that produces high-quality data [25–28]. The Axiom array technology starts with an input of 200 ng of genomic DNA that is then whole-genome amplified. The amplified DNA is hybridized to the array, followed by a two-color, ligation-based enzymatic assay that is read on an Applied Biosystems GeneTitan Multi-Channel instrument. The Axiom arrays can be designed to have as few as 1500 SNPs up to 2.6 million SNPs. Most plant Axiom arrays have been designed with fewer than 700,000 SNPs. Depending on the array layout, the throughput per week can be between 96 and 6144 samples [29]. Axiom currently has 17 arrays already designed for 12 different crops and offers the ability to design custom arrays. These arrays range from the strawberry 35,000 SNP array up to the wheat two-array set that genotypes up to 817,000 SNPs [29]. As these available arrays demonstrate, the technology can successfully genotype complex polyploid genomes such as strawberry and wheat with a moderate to high density of markers. The number of markers genotyped from the arrays will be lower than the number on the arrays. For example, in Persian walnut, 609,658 SNPs were genotyped out of the designed 700 k array (87%) [25]; in soybean, 114,090 SNPs were genotyped from the 180 k SoyaSNP array (63%) [28]; and in durum wheat, 7652 polymorphic SNPs were obtained from the 35 K Axiom array (22%) [26]. Part of the reason for the low number of SNPs genotyped in the durum wheat is ascertainment bias, as mentioned previously for Illumina arrays. The 35 K Axiom array that Gupta et al. [26] used was the 35 K Axiom wheat breeders’ array. The germplasm used to populate the breeders’ array is genetically
distant enough from the durum wheat panel that a significantly lower proportion of SNPs were polymorphic in their GWAS panel. Designing arrays with a higher density of SNPs, discovered in a larger panel of lines, can help mitigate ascertainment bias to some degree [13]. This helps make the consortium arrays more flexible for a larger diversity of GWAS panels.
2.4 Sequencing
Even though array technology remains popular in GWAS, the volume of studies that use sequencing as their main method of genotyping continues to grow. Numerous methods have been developed that use sequencing at their core for genotyping. These range from methods that reduce the genome to a smaller fraction, to save on the cost of sequencing, up to resequencing the entire genome. Most of the current methods still focus on calling only SNPs and INDELs, but as sequencing technology improves, methods capable of genotyping a wider range of genetic variation on a GWAS population will become readily available. These technologies will begin to genotype other genetic variations in GWAS panels, such as structural variation, repetitive elements, and methylation status.
2.5 Restriction Enzyme Genotyping by Sequencing
It was not long after next-generation sequencing started to become widely available that it was used for genotyping. The earliest uses of a sequencer for genotyping, or genotyping-by-sequencing (GBS), involved reducing the genome to a smaller fraction that could then be resequenced across multiple samples. This reduced fraction of the genome could then be aligned to a reference genome for either SNP discovery [30, 31] or genotyping [32, 33]. These methods were so successful that they are still used today, along with several variations of the original methods. One of the earliest methods developed was RAD-Seq [32]. This method performs a restriction digestion on the DNA followed by the ligation of adaptors to the fragments. The samples are then multiplexed, and the DNA fragments are sheared, size selected, and end repaired, followed by A-tailing and ligation of Y-adaptors. PCR is then performed, followed by sequencing. This method was soon simplified and adapted to plants for GBS [33]. The use of restriction enzymes to easily add adaptors and barcodes to DNA makes all the developed restriction enzyme GBS (reGBS) methods very attractive genotyping platforms for GWAS. The method developed by Elshire et al. [33] (Elshire-GBS) and diversity array technology sequencing (DArT-seq) [34] are two very popular reGBS methods used for GWAS in a variety of crops [35–41]. For more complex genomes, such as wheat and barley, two restriction enzymes were used to further reduce the complexity of the genome [42].
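The digestion and size-selection steps that reGBS methods rely on can be mimicked in silico. The sketch below is a toy illustration only: it uses the well-known EcoRI recognition site (GAATTC, cut after the G) rather than any enzyme used in the cited protocols, scans a single strand, and ignores methylation sensitivity, adaptor ligation, and sequencing. The function names and the toy sequence are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Cut a sequence at every occurrence of a recognition site.

    cut_offset is the position within the site after which the enzyme cuts
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[i:j] for i, j in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls within the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


SITE = "GAATTC"  # EcoRI recognition site, used purely as an example
toy_genome = "AA" + SITE + "CCCC" + SITE + "TTTTTTTT" + SITE + "AA"

fragments = digest(toy_genome, SITE, cut_offset=1)
selected = size_select(fragments, min_len=8, max_len=12)
print([len(f) for f in fragments])
print(selected)
```

Changing the recognition site or the size window changes how many fragments survive, which is the lever that determines what fraction of the genome is sequenced and, ultimately, how many SNPs can be genotyped.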
The simple protocol of reGBS makes library construction very low cost and high throughput. The number of SNPs genotyped varies greatly between species and even within species. While many of the studies have only moderate SNP numbers for GWAS, this is still enough to find significant associations. The numbers of SNPs used in GWAS from DArT-seq or from the Elshire-GBS method vary greatly depending on the restriction enzyme chosen and the diversity of the crop. Examples for Elshire-GBS include 15,681 SNPs genotyped in watermelon [41], 5394 SNPs in common bean [40], 20,956 SNPs in cassava [43], 265,487 SNPs in sorghum [44], and 182,252 SNPs [45] and 337,110 SNPs [39] in maize. For DArT-seq, the markers genotyped are often filtered down very significantly to fewer high-quality markers. For example, in maize 47,440 markers were filtered down to 7224 high-quality markers [46], in wheat 58,378 markers were reduced to 6887 [35], and in maize 34,509 SNPs were filtered to 28,919 unique SNPs [47]. For other crops DArT-seq produced similar numbers of markers, with sweet potato yielding 33,068 informative SNPs [48], mung bean 5288 high-quality markers [36], and walnut 33,519 high-quality markers [37].
When using reGBS there is a tradeoff between coverage and cost. If researchers choose to save money and do not sequence their samples deeply enough, they will end up with significant missing data in their dataset. This leads to the need for significant imputation of the GBS data [49]. Low sequence coverage will also lead to heterozygotes not being called accurately in the GWAS population [50]. The accuracy of SNP calling will vary with the resequencing method and the bioinformatic pipeline used to call SNPs and quality control those SNP calls for each GWAS [51].
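The link between sequencing depth, missing data, and heterozygote calling can be made concrete with a simple model. The sketch below assumes that the number of reads covering a site follows a Poisson distribution and that the two alleles of a heterozygote are sampled with equal probability and without sequencing error. These are strong simplifications (real reGBS read depth is far more uneven), and the function names are hypothetical.

```python
import math


def expected_missing_rate(mean_depth):
    """Probability that a site receives zero reads under a Poisson model."""
    return math.exp(-mean_depth)


def het_undercall_prob(depth):
    """Probability that all reads at a heterozygous site show the same allele.

    With `depth` reads drawn evenly from two alleles, this is
    2 * 0.5 ** depth = 0.5 ** (depth - 1).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1 read")
    return 0.5 ** (depth - 1)


for depth in (1, 2, 5, 10):
    print(
        f"mean depth {depth}: "
        f"missing ~{expected_missing_rate(depth):.3f}, "
        f"heterozygote seen as homozygote at exactly {depth} reads "
        f"~{het_undercall_prob(depth):.3f}"
    )
```

Under these assumptions both quantities fall quickly as depth increases, which is why shallow sequencing shifts cost from the sequencer to imputation and to uncertainty in heterozygous calls.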
Depending on the accuracy obtained by the method, the power to detect significant associations can be decreased significantly, especially for SNPs with a lower minor allele frequency and for smaller GWAS populations [52]. Low accuracy could also cause false positives to be detected. While low accuracy can also lead to type II errors (missed associations), any positive SNP from a GWAS study should be confirmed. This confirmation will likely be done in a second population and will help eliminate any false positives that could be due to inaccurate genotyping from resequencing data. If the restriction enzyme is not optimized for the crop, then the SNPs obtained in the study may tag only a small proportion of the haplotype variation segregating in the GWAS panel. This is especially important when phenotyping the population is very time consuming and expensive. Choosing a genotyping technology that does not adequately cover the entire genome could cause the researcher to miss a significant allele when a different technology could have tagged a larger proportion of the
population’s segregating haplotypes. For example, when reGBS was performed in soybean using the ApeKI restriction enzyme, it produced 12,193 SNPs, of which only 7864 had a minor allele frequency (MAF) > 0.05 [53]. In a separate population the same enzyme produced 79,000 SNPs, of which only 34,428 had a MAF > 0.05 [54]. Genotyping 50,000 SNPs in soybean captures less than 50% of the haplotype diversity in diverse germplasm [24]. This leaves a significant portion of the genome’s diversity not covered by markers. An advantage of reGBS methods is that the number of SNPs genotyped can be changed depending on the restriction enzyme(s) used. This can help in obtaining enough SNPs to tag a larger proportion of the population’s segregating haplotypes for GWAS studies. This was recently demonstrated with an HD-GBS method that uses an optimized combination of restriction enzymes for soybean and can genotype up to 875,000 SNPs in a population at a cost that makes it practical for a GWAS study [55]. Using the DepthFinder tool, Torkamaneh et al. [56] were able to test different enzymes in silico to find one that produced enough fragments in the defined size range. This ideal number of fragments would potentially provide the desired number of SNPs given the diversity present within the species. While this will produce more SNPs, the amount of sequencing required will be significantly higher than for a reGBS method that only tags 30,000 SNPs. But as sequencing costs continue to fall, sequencing at this high SNP density has been demonstrated to be feasible and is necessary to adequately cover most of the haplotype variation in crops.
2.6 Targeted Genotyping by Sequencing
Several methods have been developed that target and amplify specific SNPs in the genome, which can then be sequenced using next-generation sequencers. Commercial examples include AgriSeq (Thermo Fisher, Waltham, MA), PlexSeq (AgriPlex Genomics, Cleveland, OH), and SeqSNP (Biosearch Technologies, Madison, WI), all of which are some form of amplicon-based or probe-based sequencing. Most of these methods target a low density of SNPs, in the range of a few thousand up to 10,000 SNPs. The main application for these genotyping platforms is genomic selection. Because of the low number of SNPs they genotype, these platforms would not be recommended for use in GWAS studies.
2.7 Whole-Genome Sequencing
Using whole-genome sequencing (WGS) as a method of genotyping for GWAS was initially only practical for species with relatively small genomes or for GWAS panels that would be used as a community resource [57, 58]. As the cost of sequencing has decreased and the throughput has increased, the practical use of WGS as a method for GBS is starting to increase and will potentially become
the standard in GWAS studies. In addition, computational methods such as imputation are also making WGS more practical as a method for genotyping by allowing lower sequencing coverage in inbred crops, with the missing data filled in by imputation [52, 59]. Currently the main platform for performing WGS for a GWAS panel is the Illumina sequencing platform. This platform allows researchers to perform either a deep resequencing or a skim sequencing of their GWAS panel. While the definitions of deep and skim can vary, for the purpose of this review a deep resequencing will be considered an average depth of 8x or above. This allows the majority of SNPs to be genotyped with very little imputation needed. Even with the low cost of the current sequencing platforms, this type of deep sequencing can be very costly for large GWAS panels. To reduce cost, imputation has been demonstrated to provide accurate genotype calls when the sequence coverage is lowered, sometimes to levels as low as 0.3x coverage when a deep-sequenced reference panel is used to aid imputation [52]. This has enabled GWAS mapping at an economical cost while still maintaining the benefits of the whole-genome coverage of the SNPs from WGS [60]. The high coverage of SNPs from WGS allows the majority of common haplotypes to be tagged by the genotyping. Depending on the species, this can produce datasets with millions of SNPs. Because LD is not constant across the genome, the coverage afforded by WGS data reveals the local resolution, that is, the size of the haplotype on which a positive association is located. This greatly helps delimit the region containing potential candidate genes for the positive associations found in the GWAS study. One challenge is that this large number of SNPs can become computationally demanding. 
To assist in the computation of the GWAS study, SNPs in high LD with each other are often reduced to a single tag SNP, which is then used for the analysis. Once a positive association is discovered with a tag SNP, all the SNPs in high LD with the tag SNP can be used to determine the haplotype on which the candidate gene is likely present. Third-generation sequencing platforms such as Oxford Nanopore or the Pacific Biosciences Sequel IIe SMRT sequencing are rapidly improving to the point where one of them, or a future generation of these platforms, could regularly be applied to genotype GWAS panels. These platforms have the advantage of producing long reads that are able to detect structural variation throughout the genome. As sequencing technology moves forward, GWAS will move from using SNPs as the only variation for discovering associations with a phenotypic trait to using more complex variations. This technology will allow whole-genome assemblies of the entire GWAS panel to be used within the study to compare
not only SNPs but other forms of variation such as CNVs, INDELs, SSRs, transposable elements, and methylation status. If these other forms of variation are not in high LD, then the GWAS using this new technology will find new associations previously missed by today’s technology.
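The tag-SNP reduction described in this section can be sketched in a few lines. The Python example below is a minimal illustration under stated assumptions, not the procedure of any particular GWAS package: genotypes are complete (already imputed) and coded as 0/1/2 copies of one allele, LD is approximated by the squared correlation between genotype vectors, and the threshold of 0.8 and the window of 50 SNPs are arbitrary illustrative defaults. The function names are hypothetical.

```python
def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    return cov * cov / (var_a * var_b)


def greedy_tag_snps(genotypes, r2_threshold=0.8, window=50):
    """Greedily pick tag SNPs along the chromosome.

    genotypes: list of SNPs in map order, each a list of 0/1/2 calls.
    Returns a dict mapping each tag SNP index to the indices it represents
    (the tag itself included). A SNP joins the first existing tag within
    `window` positions whose r^2 with it reaches the threshold; otherwise
    it becomes a new tag.
    """
    tags = {}
    for i, snp in enumerate(genotypes):
        for tag in tags:
            close_enough = i - tag <= window
            if close_enough and r_squared(genotypes[tag], snp) >= r2_threshold:
                tags[tag].append(i)
                break
        else:
            tags[i] = [i]
    return tags


if __name__ == "__main__":
    panel = [
        [0, 0, 2, 2, 0, 2],  # SNP 0
        [0, 0, 2, 2, 0, 2],  # SNP 1: identical to SNP 0
        [2, 2, 0, 0, 2, 0],  # SNP 2: same haplotypes, alleles swapped
        [0, 2, 0, 2, 0, 2],  # SNP 3: weakly correlated with SNP 0
    ]
    print(greedy_tag_snps(panel))
```

The returned mapping keeps track of which SNPs each tag represents, so that once a tag SNP shows an association, the SNPs in high LD with it can be retrieved to delimit the haplotype.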
3 Regenotyping GWAS Populations
Due to quickly evolving technology, current genotyping technologies will soon be obsolete, yet the phenotype data collected on GWAS populations may still be valuable. Germplasm collections or GWAS panels may be genotyped with a current technology and phenotyped, only to be revisited in a few years and regenotyped when the technology advances far enough to justify it. One example is the Maize HapMap, which was sequenced over several different years using a variety of sequencing instruments and protocols as the technology evolved [58]. Another example is the soybean germplasm collection, which comprises 20,000 accessions and was genotyped with a 50K Illumina Infinium array. This array tags less than 50% of the haplotypes [61]. Even so, these genotyped accessions have been phenotyped by numerous research groups for the purpose of performing GWAS. Currently there is an effort to regenotype this same collection with WGS + imputation. Since the original DNA was not saved, it has to be reextracted for this genotyping. As long-read technology continues to improve, it will not be surprising if at some point this collection is regenotyped a third time using a different technology to genotype the repetitive sequences and to provide long INDEL information, structural variation, and so on. The DNA extraction method currently used might not be sufficient for the new technology, but there is a chance that it will be, and this should be a consideration when taking on the large task of genotyping large collections for GWAS. A different option may be the long-term storage of seed or leaf tissue for a second extraction when a second genotyping of a GWAS population becomes practical because of the low marker coverage of the first genotyping. This could be especially important when methylation information becomes standard as part of the sequence data obtained during a sequencing run. 
This also creates the issue that genotyping data for GWAS may not always come from a single genotyping platform. The genotype data may have been collected over time or through the efforts of multiple groups. In such cases the GWAS data may need to be combined into one data set. This can introduce the complication of the same SNP being called on different strands. For example, one group may genotype a SNP using an Illumina genotyping platform that calls an A/G SNP on the top strand. Another group may use whole-genome sequencing and genotype the same SNP but call it on the bottom strand as a T/C SNP. In this case the error is easy to spot. But if the SNP is an A/T SNP, then one technology is calling an A/T SNP while the other is calling a T/A SNP. If the two datasets are combined, the same SNP will be called with zero concordance. For large data sets these strand differences can be very difficult to curate and can cause significant errors when combining GWAS datasets.
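The strand problem described above can be screened for automatically before datasets are merged. The sketch below is a hypothetical helper, not part of any platform's software: it assumes biallelic SNPs whose alleles are reported as single bases, and it only classifies each SNP. Deciding the strand of ambiguous A/T and C/G SNPs needs extra information (for example flanking sequence or allele frequencies) that is outside this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_call(call: str) -> str:
    """Complement every base of a genotype call, e.g. 'AG' -> 'TC'."""
    return "".join(COMPLEMENT[base] for base in call)


def classify_snp(alleles_a, alleles_b) -> str:
    """Compare the allele pairs reported for one SNP by two platforms.

    Returns:
      'same'      - identical alleles, no action needed
      'flip'      - alleles match after complementing (opposite strands)
      'ambiguous' - A/T or C/G SNP, strand cannot be told from alleles alone
      'mismatch'  - alleles are incompatible even after complementing
    """
    a = frozenset(alleles_a)
    b = frozenset(alleles_b)
    a_flipped = frozenset(COMPLEMENT[base] for base in a)
    if a != b and a_flipped != b:
        return "mismatch"
    if a == a_flipped:
        return "ambiguous"  # palindromic SNP: A/T or C/G
    return "same" if a == b else "flip"


if __name__ == "__main__":
    print(classify_snp(("A", "G"), ("T", "C")))  # opposite strands
    print(classify_snp(("A", "T"), ("T", "A")))  # strand cannot be resolved
    print(flip_call("AG"))
```

SNPs classified as `flip` can be harmonized by applying `flip_call` to one dataset, while `ambiguous` SNPs are the ones that need manual curation or removal before the datasets are combined.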
4 Conclusion
Array technologies such as the Infinium and Axiom, and sequencing technology platforms such as restriction enzyme GBS and whole-genome sequencing, have enabled the genotyping of thousands to millions of markers across hundreds to thousands of individuals for most if not all crops. These platforms produce high-quality data and have been cost effective for researchers genotyping their GWAS panels. With several commercial providers available for these technology platforms, researchers are able to carefully select the option that is best for their crop and the research goals of their study.
References
1. Kennedy GC et al (2003) Large-scale genotyping of complex DNA. Nat Biotechnol 21(10):1233–1237
2. Gunderson KL et al (2005) A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 37(5):549–554
3. Steemers FJ et al (2006) Whole-genome genotyping with the single-base extension assay. Nat Methods 3(1):31–33
4. Liu L et al (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:251364
5. Rasheed A et al (2017) Crop breeding chips and genotyping platforms: progress, challenges, and perspectives. Mol Plant 10(8):1047–1064
6. Song Q et al (2020) Soybean BARCSoySNP6K: an assay for soybean genetics and breeding research. Plant J 104(3):800–811
7. Wang S et al (2014) Characterization of polyploid wheat genomic diversity using a high-density 90 000 single nucleotide polymorphism array. Plant Biotechnol J 12(6):787–796
8. Sim S-C et al (2012) High-density SNP genotyping of tomato (Solanum lycopersicum L.) reveals patterns of genetic variation due to breeding. PLoS One 7(9):e45520
9. Bayer MM et al (2017) Development and evaluation of a Barley 50k iSelect SNP array. Front Plant Sci 8:1792
10. Gaikpa DS et al (2020) Genome-wide association mapping and genomic prediction of Fusarium head blight resistance, heading stage and plant height in winter rye (Secale cereale). Plant Breed 139(3):508–520
11. Kang YC et al (2020) Genome-wide association mapping for adult resistance to powdery mildew in common wheat. Mol Biol Rep 47(2):1241–1256
12. Haile TA et al (2021) Genomic prediction of agronomic traits in wheat using different models and cross-validation designs. Theor Appl Genet 134(1):381–398
13. Bianco L et al (2016) Development and validation of the Axiom® Apple480K SNP genotyping array. Plant J 86(1):62–74
14. Beji S et al (2020) Genome-wide association study identifies favorable SNP alleles and candidate genes for frost tolerance in pea. BMC Genomics 21(1):536
15. Jin SR et al (2020) A combination of genome-wide association study and transcriptome analysis in leaf epidermis identifies candidate genes involved in cuticular wax biosynthesis in Brassica napus. BMC Plant Biol 20(1):458
16. Assefa T et al (2020) Deconstructing the genetic architecture of iron deficiency chlorosis in soybean using genome-wide approaches. BMC Plant Biol 20(1):42
17. Abou-Elwafa SF, Shehzad T (2021) Genetic diversity, GWAS and prediction for drought and terminal heat stress tolerance in bread wheat (Triticum aestivum L.). Genet Resour Crop Evol 68(2):711–728
18. Ahmed HGMD et al (2020) Genome-wide association mapping through 90K SNP array for quality and yield attributes in bread wheat against water-deficit conditions. Agriculture 10(9)
19. Anuarbek S et al (2020) Quantitative trait loci for agronomic traits in tetraploid wheat for enhancing grain yield in Kazakhstan environments. PLoS One 15(6):e0234863
20. Begum H et al (2020) Genetic dissection of bread wheat diversity and identification of adaptive loci in response to elevated tropospheric ozone. Plant Cell Environ 43(11):2650–2665
21. Bin Safdar L et al (2020) Genome-wide association study and QTL meta-analysis identified novel genomic loci controlling potassium use efficiency and agronomic traits in bread wheat. Front Plant Sci 11:70
22. Cheng B et al (2020) Genome-wide association analysis of stripe rust resistance loci in wheat accessions from southwestern China. J Appl Genet 61(1):37–50
23. Arif MAR, Borner A (2020) An SNP based GWAS analysis of seed longevity in wheat. Cereal Res Commun 48:149
24. Song QJ et al (2015) Fingerprinting soybean germplasm and its utility in genomic research. G3 Genes Genomes Genet 5(10):1999–2006
25. Bernard A et al (2020) Association and linkage mapping to unravel genetic architecture of phenological traits and lateral bearing in Persian walnut (Juglans regia L.). BMC Genomics 21(1):203
26. Gupta P et al (2020) Genomic regions associated with the control of flowering time in durum wheat. Plants 9(12):1628
27. Hu DZ et al (2020) Genetic dissection of yield-related traits via genome-wide association analysis across multiple environments in wild soybean (Glycine soja Sieb. and Zucc.). Planta 251(2):39
28. Kim KH et al (2020) Genome-wide association and epistatic interactions of flowering time in soybean cultivar. PLoS One 15(1):e0228114
29. Thermo Fisher Scientific (2021) Axiom genotyping arrays for agrigenomics. [5/6/2021]. https://assets.thermofisher.com/TFS-Assets/GSD/brochures/axiom-genotyping-arrays-agrigenomics-brochure.pdf
30. Van Tassell CP et al (2008) SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods 5(3):247–252
31. Hyten DL et al (2010) High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics 11:38
32. Baird NA et al (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3(10):e3376
33. Elshire RJ et al (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6(5):e19379
34. Cruz VMV, Kilian A, Dierig DA (2013) Development of DArT marker platforms and genetic diversity assessment of the U.S. collection of the new oilseed crop Lesquerella and related species. PLoS One 8(5):e64062
35. Akram S, Arif MAR, Hameed A (2021) A GBS-based GWAS analysis of adaptability and yield traits in bread wheat (Triticum aestivum L.). J Appl Genet 62(1):27–41
36. Breria CM et al (2020) A SNP-based genome-wide association study to mine genetic loci associated to salinity tolerance in Mungbean (Vigna radiata L.). Genes 11(7):759
37. Bukucu SB et al (2020) Major QTL with pleiotropic effects controlling time of leaf budburst and flowering-related traits in walnut (Juglans regia L.). Sci Rep 10(1):15207
38. Hou L et al (2020) Genome-wide association studies of fruit quality traits in jujube germplasm collections using genotyping-by-sequencing. Plant Genome 13(3):e20036
39. Kibe M et al (2020) Genetic dissection of resistance to gray leaf spot by combining genome-wide association, linkage mapping, and genomic prediction in tropical maize germplasm. Front Plant Sci 11:572027
40. Campa A, Garcia-Fernandez C, Ferreira JJ (2020) Genome-wide association study (GWAS) for resistance to Sclerotinia sclerotiorum in common bean. Genes 11(12):1496
41. Aguado E et al (2020) Mapping a partial Andromonoecy locus in Citrullus lanatus using BSA-Seq and GWAS approaches. Front Plant Sci 11:1243
42. Poland JA et al (2012) Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS One 7(2):e32253
43. do Carmo CD et al (2020) Genome-wide association studies for waxy starch in cassava. Euphytica 216(5):82
44. Elango D, Xue WY, Chopra S (2020) Genome wide association mapping of epi-cuticular wax genes in Sorghum bicolor. Physiol Mol Biol Plants 26(8):1727–1737
45. Ertiro BT et al (2020) Genetic dissection of nitrogen use efficiency in tropical maize through genome-wide association and genomic prediction. Front Plant Sci 11:474
46. Adewale SA et al (2020) Genome-wide association study of Striga resistance in early maturing white tropical maize inbred lines. BMC Plant Biol 20(1):203
47. Badji A et al (2020) Genetic basis of maize resistance to multiple insect pests: integrated genome-wide comparative mapping and candidate gene prioritization. Genes 11(6):689
48. Bararyenya A et al (2020) Genome-wide association study identified candidate genes controlling continuous storage root formation and bulking in hexaploid sweetpotato. BMC Plant Biol 20(1):3
49. Chan AW, Hamblin MT, Jannink J-L (2016) Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data. PLoS One 11(8):e0160733
50. Nazzicari N et al (2016) Marker imputation efficiency for genotyping-by-sequencing data in rice (Oryza sativa) and alfalfa (Medicago sativa). Mol Breed 36(6):69
51. Wickland DP et al (2017) A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy. BMC Bioinformatics 18(1):586
52. Happ MM et al (2019) Generating high density, low cost genotype data in soybean [Glycine max (L.) Merr.]. G3 (Bethesda) 9(7):2153–2160
53. Bastien M, Sonah H, Belzile F (2014) Genome wide association mapping of Sclerotinia sclerotiorum resistance in soybean with a genotyping-by-sequencing approach. Plant Genome 7(1):849
54. Mamidi S et al (2014) Genome-wide association studies identifies seven major regions responsible for iron deficiency chlorosis in soybean (Glycine max). PLoS One 9(9):e107469
55. Torkamaneh D et al (2021) A bumper crop of SNPs in soybean through high-density genotyping-by-sequencing (HD-GBS). Plant Biotechnol J 19:860
56. Torkamaneh D et al (2019) DepthFinder: a tool to determine the optimal read depth for reduced-representation sequencing. Bioinformatics 36(1):26–32
57. Wang W et al (2018) Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557(7703):43–49
58. Bukowski R et al (2017) Construction of the third-generation Zea mays haplotype map. GigaScience 7(4):1
59. Huang BE et al (2014) Efficient imputation of missing markers in low-coverage genotyping-by-sequencing data from multiparental crosses. Genetics 197(1):401–404
60. Happ MM et al (2021) Comparing a mixed model approach to traditional stability estimators for mapping genotype by environment interactions and yield stability in soybean [Glycine max (L.) Merr.]. Front Plant Sci 12(542):630175
61. Song Q et al (2015) Fingerprinting soybean germplasm and its utility in genomic research. G3 (Bethesda) 5(10):1999–2006
Chapter 4
Genome-Wide Association Study Statistical Models: A Review
Mohsen Yoosefzadeh-Najafabadi, Milad Eskandari, François Belzile, and Davoud Torkamaneh
Abstract
Statistical models are at the core of the genome-wide association study (GWAS). In this chapter, we provide an overview of single- and multilocus statistical models, Bayesian, and machine learning approaches for association studies in plants. These models are discussed based on their basic methodology, the cofactor adjustments accounted for, statistical power, and computational efficiency. New statistical models and machine learning algorithms are both showing improved performance in detecting missed signals and rare mutations and in prioritizing causal genetic variants; nevertheless, further optimization and validation studies are required to maximize the power of GWAS.
Key words GWAS, Statistical models, Statistical power, Computational efficiency, Significance threshold
1 Introduction
Understanding the genetic architecture of traits is a fundamental step to understanding the biology of traits and improving the performance of plants. Association analysis (the so-called genome-wide association study (GWAS)), with high-density genotypic data, is a powerful tool being effectively and efficiently used to decipher the genetic control of traits of interest. This method essentially evaluates statistical associations between the alleles at a given locus and the observed phenotype [1]. GWAS was first applied to human genetics in 1993 [2], and the first application of GWAS in plant species, published in 2001 [3], dissected the genetic basis of flowering time in maize. Over the last two decades of GWAS analyses, fueled by advances in genomic technology (i.e., next-generation sequencing (NGS)), tens of thousands of genetic variants underlying complex traits in plants have been discovered [4–8]. Advances in genomic technologies such as reference genome
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
assembly [9] and the development of cost-effective high-throughput genotyping methods [10], along with the development of robust phenotyping protocols and platforms [11] and powerful statistical methodology [12], have transformed GWAS from a promising tool to a universal, efficient, and powerful method for investigating the genetic basis of complex traits in plants. The robustness and effectiveness of GWAS in the dissection of complex traits are determined by multiple factors that should be considered when performing GWAS. These factors are briefly described below. (1) Phenotypic variation: GWAS should only be performed on phenotypic data with normal or approximately normal distributions. Outliers and nonnormality of a data set can significantly inflate type I error in a GWAS study and consequently reduce the statistical power. Moreover, the heritability of the trait and genotype × environment interaction can also impact the power of GWAS to detect associations. (2) Population size: the ratio of genotypic to phenotypic variation that can be captured depends on the population size. In plants, although populations with 100–500 unrelated individuals are recommended for GWAS, larger populations will improve the power of the analyses [13]. (3) Population structure: the composition of an association population can be affected by selection criteria such as geographic regions or growth habits, which can split the population into subpopulations with more closely related genetic backgrounds, so-called population structure [14]. Subpopulations can be inferred using principal component analysis (PCA) or STRUCTURE analysis [15]. Furthermore, at the genetic level, individuals within a population are not equally distantly related, and this can result in cryptic relationships (also known as kinship) [16]. The presence of population structure and cryptic relationships among individuals within a population can increase the type I error or false discovery rate in GWAS. 
(4) Linkage disequilibrium (LD): the coefficient of LD is an indication of association between two loci that have a shared history of mutation and recombination and can be used for estimating the number of required markers to cover the genome. It is well documented that the extent of LD is highly variable in different chromosomal regions and populations [17]. As most association studies use low-density genotypic data, the power to detect associations in a GWAS depends critically on a high level of LD between the causal factor and the markers that are queried. Thus, LD is the key element to define haplotypic blocks when searching for a candidate causal gene/allele in detected regions [18]. On the other hand, since the presence of long-range LD increases type I error rate, it is essential to calculate LD prior to performing GWAS. Finally, the fast LD decay between loci can be advantageous to fine map a candidate causal gene/allele only if sufficient marker density is attained. (5) Marker density: the minimum number of markers that are required for successful GWAS is
highly dependent on the genome size and the rate of LD decay. A marker density ensuring that each haplotype block is covered by at least one marker is needed for a successful GWAS in which no association is missed [19]. (6) Allele frequency: most GWAS studies focus only on common variants (i.e., variants with an allele frequency ≥ 5%) and ignore rare variants with a minor allele frequency (MAF) < 5%. However, such rare alleles might explain an important part of phenotypic variation, but their low frequency limits the power to detect them [20]. (7) Significance threshold: to classify associations as significant and reduce false-positive and false-negative rates, multiple-testing strategies have been developed. In the early years of GWAS, a Bonferroni correction was proposed and widely used to control the family-wise error rate (FWER) [21]. Although this approach is remarkably successful in limiting false-positive rates, it is often overly strict and leads to an increased false-negative rate. Indeed, with high-density marker data, many markers are in high LD and do not represent independent tests, which makes the Bonferroni correction too conservative. Alternative approaches use adjusted P-values to control the false discovery rate (FDR) or use posterior probabilities to control the Bayesian FDR [22]. The determination of significance levels in GWAS is mostly dependent on the study and population size. (8) Statistical model: in principle, GWAS uses statistical models to find genetic markers that are significantly associated with variation in phenotype. Basically, the concept of GWAS is derived from a simple statistical test (e.g., ANOVA) where the model tests each marker individually for a potential association [23]. A major problem of this “naïve” approach is the high rate of false positives due to the significance threshold and population structure. 
In the last decade, several statistical procedures were developed to control both type I and type II error rates by incorporating structure (Q matrix) and kinship (K matrix) as cofactors in the models [12]. However, overcorrection for population structure can lead to an increased rate of false negatives. On the other hand, the generation of whole-genome sequencing (WGS) data from up to thousands of samples, coupled with detailed multidimensional phenotypic records, can cause computational burdens in GWAS analyses. Therefore, over the years, multiple methods have been developed in which different cofactors are incorporated and the computing time is reduced. Below, we provide an overview of available statistical models for GWAS and potential approaches for the next generation of models.
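The contrast between the Bonferroni correction and FDR control discussed under the significance threshold can be illustrated with a short example. This is a sketch in plain Python that assumes a list of per-marker P-values is already available. The Benjamini-Hochberg step-up procedure is used here as one common way to control the FDR, which the chapter does not name specifically, and the function names are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Indices of tests passing the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


def benjamini_hochberg_significant(p_values, fdr=0.05):
    """Indices of tests declared significant by the BH step-up procedure.

    Sort the P-values, find the largest rank k with p_(k) <= k * fdr / m,
    and declare the k smallest P-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Toy P-values for ten markers (illustrative numbers only).
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni:", bonferroni_significant(p))
    print("Benjamini-Hochberg:", benjamini_hochberg_significant(p))
```

In this toy example the Bonferroni threshold keeps only the first marker, while the step-up procedure also keeps the second, which reflects the more conservative behavior of Bonferroni noted above. Neither procedure accounts for LD between markers.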
2 Current Models Commonly Used for GWAS
2.1 GWAS Using Single-Locus Models
In variant-trait association studies, single-locus GWAS models are commonly used to test one genetic marker at a time, accounting for relatedness among genotypes through random polygenic effects or through principal components fitted as fixed covariates [24]. Most single-locus models rely on predefined genetic algorithms to ease the comparative analysis of inheritance effects among genotypes [25, 26]. Some of the important genetic algorithms are based on dominant, recessive, and codominant or additive models, and the choice of model for a given dataset may lead to different results [27, 28]. Screening SNPs based on a codominant or additive model is the first recommendation when choosing an appropriate genetic algorithm [29]. After selecting the appropriate genetic algorithm, different statistical tests such as analysis of variance (ANOVA), t-test, Fisher’s exact test, chi-squared test, Cochran–Armitage’s trend test, and odds ratio test can be implemented for single-locus tests [25, 26, 29]. The chi-squared test measures the significance of the deviation of the observed genotype/allele counts from the expected counts in a contingency table [28]. Fisher’s exact test serves the same purpose as the chi-squared test, with improved accuracy in datasets with small sample sizes [27, 29]. Cochran–Armitage’s trend test was found to be more powerful than the chi-squared test when SNPs show some level of deviation from Hardy–Weinberg equilibrium, for example, because of population stratification [27]. This method is based on incorporating assumed ordinal genotype effects into the chi-squared test [27]. The odds ratio test is based on comparing the odds ratio derived from the effect sizes of mutated alleles/genotypes [28]. ANOVA and t-tests are usually implemented in datasets with quantitative phenotypes to test the significance of the difference between genotypes and alleles, respectively [30]. 
However, the structure of SNPs in a given dataset plays an important role in the efficiency of the ANOVA and t-test models [25, 30]. Choosing appropriate genetic algorithms and statistical tests is pivotal to avoid or minimize the false discovery of associated genomic regions [25, 27]. Linear models can be implemented in single-locus tests without determining the genetic algorithm [31, 32]. The inheritance of genotype effects can be assessed after detecting a significant association between specific genomic regions and traits of interest [33]. The very first GWAS model developed was the general linear model (GLM), including genomic control, structured association, and family-based tests of association, which took population structure into account to reduce overfitting and false discoveries in genomic regions [34–36]. However, GLM-based methods are poorly suited to account for the unequal relatedness among genotypes (i.e., kinship) [34, 36, 37].
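As an illustration of the single-locus tests listed above, the following sketch computes the Cochran–Armitage trend test for one SNP and a binary trait (for example resistant versus susceptible). It assumes additive scores of 0, 1, and 2 for the three genotype classes, uses the standard large-sample chi-squared approximation with one degree of freedom, and makes no adjustment for population structure or kinship. The function name and the example counts are hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k genotype table.

    cases, controls: counts per genotype class (e.g. AA, AB, BB).
    scores: ordinal weights assigned to the genotype classes.
    Returns (chi2, p_value), with the P-value taken from a chi-squared
    distribution with one degree of freedom.
    """
    if not (len(cases) == len(controls) == len(scores)):
        raise ValueError("cases, controls, and scores must have equal length")
    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    class_totals = [r + s for r, s in zip(cases, controls)]

    sum_t_r = sum(t * r for t, r in zip(scores, cases))
    sum_t_n = sum(t * n for t, n in zip(scores, class_totals))
    sum_t2_n = sum(t * t * n for t, n in zip(scores, class_totals))

    trend = n_total * sum_t_r - n_cases * sum_t_n
    spread = n_total * sum_t2_n - sum_t_n ** 2
    denominator = n_cases * n_controls * spread
    if denominator == 0:
        raise ValueError("test undefined: one group or genotype class only")

    chi2 = n_total * trend ** 2 / denominator
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts for genotypes AA, AB, BB.
    chi2, p = cochran_armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
    print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

With these toy counts the proportion of cases rises steadily with the number of B alleles, so the trend statistic is large. Like the GLM-based tests described above, this test treats all individuals as equally related and does not account for kinship.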
Genome-Wide Association Study Statistical Models: A Review
The inability of GLM-based methods to account for kinship led to the development of new GWAS methods based on a mixed linear model (MLM) that simultaneously incorporates population structure and kinship in the analyses [37]. In MLM models, population structure or principal components (PCs) are fitted as fixed effects and kinship is fitted as a random variance–covariance structure among genotypes [38, 39]. Later, in order to increase the statistical power of QTL detection and reduce the computing time of MLM, two MLM-related models, namely, compressed MLM (CMLM) and enriched CMLM (ECMLM), were developed [40, 41]. CMLM clusters genotypes into groups of related individuals using the unweighted pair group method with arithmetic mean (UPGMA), one of the most common clustering approaches [37]. Clustering genotypes into groups significantly increased the statistical power for detecting relevant QTL [33, 37]. CMLM employs different parameters such as population structure, random genetic effect, ratio or variance, clustering, and the number of genotyping groups [37, 42]. The ECMLM method is built upon CMLM with an additional parameter, which explores alternatives to the average of pairwise individual kinships as the measure of kinship between groups of genotypes [41, 43]. In ECMLM, the computing time highly depends on optimizing population parameters [43, 44]. However, ECMLM has great potential to increase the statistical power of detecting relevant QTL associated with traits of interest by finding the optimal combination of clustering algorithm, kinship, and compression level [37, 41, 45]. One of the major problems with MLM-based models is the substantial computing time needed to analyze a large number of genotypes, because the unknown variance components are estimated over several iterations, the computation grows as a “cubic function” of the number of genotypes, and several parameters must be handled simultaneously [39, 44].
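The group-level kinship used by the compressed models can be illustrated by summarizing pairwise individual kinships within and between groups. The sketch below is only an illustration under stated assumptions: the group labels are taken as given (the clustering step, for example UPGMA, is not implemented), the default summary is the mean, and the `summary` argument is a hypothetical hook for trying other summaries.

```python
import statistics


def group_kinship(kinship, groups, summary=statistics.mean):
    """Compress an individual-level kinship matrix to group level.

    kinship: square list of lists of pairwise kinship coefficients.
    groups:  group label of each individual, assumed to come from a prior
             clustering step (not implemented here).
    summary: function applied to the pairwise values of each group pair.
    """
    labels = sorted(set(groups))
    members = {g: [i for i, label in enumerate(groups) if label == g] for g in labels}
    compressed = [
        [summary([kinship[i][j] for i in members[a] for j in members[b]]) for b in labels]
        for a in labels
    ]
    return labels, compressed


# Hypothetical kinship matrix for four genotypes assigned to two groups.
kinship = [
    [1.0, 0.5, 0.1, 0.2],
    [0.5, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.6],
    [0.2, 0.0, 0.6, 1.0],
]
groups = ["A", "A", "B", "B"]
print(group_kinship(kinship, groups))        # average pairwise kinship
print(group_kinship(kinship, groups, max))   # one possible alternative summary
```

Replacing the 4 × 4 matrix by a 2 × 2 matrix is what reduces the size of the random-effect structure; the price is that the result depends on how the groups were formed.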
To overcome the long computing time of MLM-based models, several GWAS models have been proposed, including efficient mixed-model association (EMMA), efficient mixed-model association expedited (EMMAX), population parameter previously determined (P3D), genome-wide efficient mixed-model association (GEMMA), factored spectrally transformed linear mixed model (FaST-LMM), and settlement of mixed linear model under progressively exclusive relationship (SUPER) [37, 42, 44, 46]. The EMMA model reduces the computing time by converting the two-dimensional optimization of residual and genetic variance components into a one-dimensional optimization, calculating the likelihood as a function of their ratio [46, 47]. EMMAX and P3D estimate the variance components (or their ratio) once, without iterations, and then reuse these estimates when testing markers [40, 43]. The GEMMA model has the same computing strategy as EMMAX and P3D but calculates the population parameters for each tested genetic marker [37, 48]. GRAMMAR-Gamma is another type of GEMMA model that has lower computational efficiency but higher statistical
Mohsen Yoosefzadeh-Najafabadi et al.
[Fig. 1 image. Left panel: number of genetic markers/QTNs versus computational time. Right panel: statistical power versus computational time. Legend: single-locus models, multi-locus models, Bayesian models, machine learning.]
Fig. 1 Different analytical models for GWAS. Left, distribution of models based on number of QTNs that they take into account and computational time. Right, distribution of models based on statistical power and computational speed
power in comparison with the GEMMA model [48, 49]. In order to reduce the computing time, FaST-LMM was proposed based on partitioning the cubic function into the number of genotypes and the square of relationship ranking among genotypes [44, 50, 51]. However, the cubic function still exists in the number of genotypes and, therefore, this GWAS model is most efficient when the number of genetic markers is smaller than the number of genotypes [50]. One of the LMM-based models is FaST-LMM-Select, which builds on FaST-LMM with better efficiency in addressing confounding from rare variants and spatial structure [44, 52]. This method seems to be more reliable in decreasing the false discovery rate and increasing the statistical power of detecting rare variants in comparison with other GWAS models [52]. However, as with FaST-LMM, FaST-LMM-Select is intended for use with a small subset of markers [37, 52]. Similar to FaST-LMM-Select, the SUPER model was proposed to deal with rare variants and spatial structure but is more efficient in datasets with a larger number of genetic markers than genotypes [44]. The SUPER model is based upon extracting a small subset of genetic markers and using them in FaST-LMM to minimize the computing time and increase the statistical power when the entire set of genetic markers is used (Fig. 1) [44, 53].
2.2 GWAS Using Multilocus Models
One of the major impediments of using single-locus models is their inefficiency in dealing with complex phenotypes that are under the control of a group of alleles with major and minor effects [25]. In single-locus models, each genetic marker is tested in a one-dimensional genome scan, individually, which fails to estimate the joint effects of multiple alleles [54]. Also, in order to set the
significance threshold, multiple-test corrections such as the Bonferroni correction and false discovery rate (FDR) should be considered [25, 42], which may result in failing to identify QTL because of their conservative nature in setting up the significance threshold [53]. With the advent of multilocus tests, there is no requirement to apply multiple-test corrections [25, 42]. Also, multilocus tests are able to deal with multidimensional genome scans in which the effects of all genetic markers are considered simultaneously [53, 55] and can be suitable for detecting relevant QTL associated with complex traits [42]. Nevertheless, the main problem of multilocus tests arises when working on datasets in which the number of genetic markers is larger than the number of genotypes [25, 42, 54]. This “large p, small n” problem exists in many GWAS methods and, to avoid it, the number of genetic markers fitted simultaneously must be smaller than the number of genotypes due to the presence of LD [42]. LD is usually used to detect genetic markers that carry the same information and have overlapping effects on a given trait [54]. Several methods, such as marker tagging, which can be done using stepwise or penalized regression [56, 57], have been developed to alleviate this problem. Marker tagging is the process of choosing a subset of genetic markers that captures a high portion of the information in the full set of genetic markers based on their LD values [25, 58]. After identification, the tag markers are iteratively added to the GWAS model, updating the model training performance at each step, until all genetic markers have been included [58]. Thereafter, it is possible to remove genetic markers if they do not improve the performance of the overall GWAS model [59]. While different regression methods have been proposed, based on the level of LD between two genetic markers, the power of selecting similar genetic markers using different regression methods is still low.
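A minimal sketch of LD-based marker tagging follows. It assumes genotypes coded as 0/1/2 allele counts, uses the squared Pearson correlation between two markers as a simple stand-in for LD (r²), and applies a greedy rule with a hypothetical threshold of 0.8. None of these choices is prescribed by the text, and the regression-based tagging methods cited above are not reproduced here.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: no correlation can be computed
    return sxy * sxy / (sxx * syy)


def select_tags(genotypes, threshold=0.8):
    """Greedily keep a marker only if it is not in high LD with a kept tag."""
    tags = []
    for name, calls in genotypes.items():
        if all(r_squared(calls, genotypes[tag]) < threshold for tag in tags):
            tags.append(name)
    return tags


# Hypothetical genotype calls (0/1/2) for three markers in six genotypes.
genotypes = {
    "m1": [0, 1, 2, 0, 1, 2],
    "m2": [0, 1, 2, 0, 1, 2],  # carries the same information as m1
    "m3": [2, 0, 1, 1, 2, 0],
}
print(select_tags(genotypes))
```

In this toy example m2 is dropped because it duplicates m1, while m3 is kept because its r² with m1 is only 0.25. The greedy order matters: a different marker order can give a different tag set.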
Given the limited power of regression methods to select similar genetic markers, the regression method should be chosen carefully for different plant species [58]. One of the common regression methods for marker tagging is stepwise regression, which is easy to use but prone to overfitting [25, 59]. As a result, heritability estimates obtained through stepwise regression methods are biased and need to be checked with other statistical methods [59]. As an alternative, penalized regression can be implemented to add a regularization or penalty term to the linear model and, therefore, shrink the coefficients of irrelevant genetic markers toward zero [60]. Optimizing and estimating the required parameters in penalized regression, using training and testing datasets, can balance the “goodness of fit” against model complexity and reduce possible overfitting [60]. However, most penalized regression methods, such as elastic net, LASSO, normal exponential gamma, and ridge regression, require a large dataset to obtain good performance [61].
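The penalty term can be written explicitly. In one widely used elastic-net parameterization (an assumption here, since no formula is given in the text), with y the vector of phenotypes for n genotypes, X the n × p matrix of marker scores, β the vector of marker effects, λ ≥ 0 the overall penalty strength, and α ∈ [0, 1] a mixing weight, the estimate is:

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

Setting α = 1 gives the LASSO penalty and α = 0 gives the ridge penalty. The parameter λ is the one typically tuned on held-out data to balance goodness of fit against model complexity.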
Most of the abovementioned regression methods are not practical in GWAS analyses due to their high computational requirements and their inability to directly address population structure [42, 62]. In order to consider population structure in the analyses, the multilocus mixed-model (MLMM) method was developed based on reestimating the variance of the model at each step of a forward–backward stepwise regression [42]. Genetic markers in this GWAS method are treated as fixed-effect cofactors [63]. Although MLMM has an acceptable performance with increased statistical power, the possibility of losing valuable genetic markers still exists because of the greedy forward–backward stepwise regression [42, 63]. Fast multilocus random-SNP-effect EMMA (FASTmrEMMA) and genome-wide composite interval mapping (GCIM) are multilocus GWAS methods developed under the MLM framework that treat genetic marker effects as random and construct a new and fast matrix transformation [40, 64]. These methods are useful for whitening the covariance matrix of the polygenic K matrix and reducing the noise caused by the environment [40]. However, FASTmrEMMA requires an intensive computational effort in comparison with other proposed multilocus models [64, 65]. An improved version of FASTmrEMMA has been proposed recently to reduce the computing time by using matrix identities, specifically the Woodbury matrix identity [65]. As a result, in the improved version of FASTmrEMMA there is no need to calculate the eigenvectors, which substantially reduces the time associated with analyzing each genetic marker [65]. Another multilocus model that is currently implemented in a wide range of studies is the fixed and random model circulating probability unification (FarmCPU), which divides the MLMM model into a random-effect model (REM) and a fixed-effect model (FEM) and then employs them iteratively to achieve the best result in a given dataset [66].
In this GWAS model, (1) a single-locus test is used in the FEM to remove the confounding between the tested genetic markers and kinship, (2) pseudo-quantitative trait nucleotides (QTNs) are included in the FEM to control false positives, (3) the pseudo-QTNs are detected with the maximum likelihood method in the REM to reduce overfitting and incorporate the map of genetic markers, and (4) the p-values calculated for the pseudo-QTNs are merged with those from the tests on other genetic markers [66]. This method relies on the assumption that QTNs are uniformly distributed throughout the genome [67]. However, one of the major impediments of FarmCPU is the high computational effort required by the REM [66, 67]. To increase the computational efficiency of the REM, the Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) was developed to replace the REM with
a FEM using different criteria derived from Bayesian information [67]. The BLINK method can increase the true positive rate and decrease the type I error in comparison with other proposed GWAS models [67, 68]. Different types of BLINK such as BLINK-C can capture more information from the genetic markers used, resulting in the detection of more relevant QTL in a given dataset (Fig. 1) [67]. Although the abovementioned GWAS models considerably increase the power of detecting relevant QTL and decrease the possible error associated with population structure and kinship, most of them still suffer from long computing times when analyzing large datasets [69]. To solve this problem, BOLT-LMM was developed to increase the computational efficiency of analyzing big datasets with large numbers of genetic markers and genotypes [69]. This GWAS model is recommended for large datasets with more than 5000 samples [70]. It can successfully control for population structure and alleviate the effect of population stratification [70].
2.3 GWAS Using Bayesian Models
Bayesian models such as Bayesian penalized regressions and Bayesian variable selections are commonly used for investigating the joint effect of multiple genetic markers [71–74]. These models pave the way for solving the “large p, small n” problem by simultaneously considering more genetic markers than the number of genotypes [74]. Since Bayesian models are free from the multiple testing issue, they are sensitive to genetic markers with small effects [71, 75]. Several studies have reported that Bayesian variable selection methods detect more relevant QTL associated with the trait of interest than single-locus GWAS models [71, 76]. In most GWAS models, the false positive rate is adjusted by limiting the probability of one or more false-positive results (genome-wise error rate) in the final calculation [72, 73]. This technique may be useful in dealing with a small dataset; however, eliminating false positive results from the final outcome in large datasets, in which more false positive results are produced, may decrease the statistical power of GWAS models [74, 76]. In Bayesian models, false positives are controlled by limiting the proportion of false positive results among all positives, which maximizes the power of the GWAS method [71, 74]. However, implementation of Bayesian models requires prior specifications, such as the distribution of the trait of interest and the probability of associated genetic markers [75]. Usually, the analysis of Bayesian models is complex and time-consuming due to the use of approaches such as Markov Chain Monte Carlo (MCMC) sampling for inferring the distribution [76]. Different Bayesian models have been developed to detect genomic regions associated with the trait of interest [76]. Most of the developed
Bayesian models differ in their prior specification assumptions [77, 78]. Penalized logistic regression models were the first Bayesian GWAS models to be developed; they impose a penalty on the genetic markers in order to select the most relevant QTL associated with a trait of interest [79, 80]. Some of the important penalized regression models are ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and elastic net regression [81–83]. Ridge regression assigns near-zero coefficients to variables with minor effects but includes all variables in the final model [83, 84]. LASSO regression forces variables with minor effects to have coefficients of exactly zero, so that only significant variables are included in the final model [82, 85]. Elastic net regression is a combination of LASSO and ridge regression, which shrinks some of the coefficients exactly to zero and some toward zero [86]. Ridge regression best linear unbiased prediction (BLUP), one of the first Bayesian GWAS models, treats genetic marker effects as random and assumes that they follow a normal distribution with a null mean [49]. The Bayesian LASSO regression model employs hierarchical models, with a normal distribution for genetic effects and an exponential prior distribution for their variances, in an MCMC algorithm to detect the most relevant QTL associated with the trait of interest [78, 79, 87]. One of the major drawbacks of the Bayesian LASSO model is that LASSO uses the same shrinkage criterion for genetic markers with large and small effects [88, 89]. This results in excessive shrinking of large effects and insufficient shrinking of small effects [89, 90]. This problem can be largely solved by the adaptive mixed LASSO (AM-LASSO) Bayesian model, in which appropriate weighting methods are used to apply a different penalty to each coefficient derived from the tested genetic markers [90].
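The different shrinkage behaviors described above can be compared numerically. The sketch below assumes the textbook special case of an orthonormal design, in which each penalized estimate is a simple function of the unpenalized estimate b. The adaptive variant uses the generic adaptive-LASSO weight 1/|b| purely as an illustration and is not the AM-LASSO Bayesian model itself; function names and values are hypothetical.

```python
def soft_threshold(b, lam):
    """Move b toward zero by lam and set it to exactly zero if |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge(b, lam):
    """Ridge: proportional shrinkage, never exactly zero for b != 0."""
    return b / (1.0 + lam)


def lasso(b, lam):
    """LASSO: the same absolute shrinkage for every coefficient."""
    return soft_threshold(b, lam)


def elastic_net(b, lam1, lam2):
    """Elastic net: LASSO thresholding followed by ridge-type shrinkage."""
    return soft_threshold(b, lam1) / (1.0 + lam2)


def adaptive_lasso(b, lam):
    """Adaptive weighting: penalty lam / |b|, so large effects shrink less."""
    if b == 0:
        return 0.0
    return soft_threshold(b, lam / abs(b))


# Hypothetical unpenalized marker effects: one large, one moderate, one tiny.
effects = [2.0, 0.3, -0.05]
for name, rule in (("ridge", ridge), ("lasso", lasso), ("adaptive", adaptive_lasso)):
    print(name, [round(rule(b, 0.2), 3) for b in effects])
print("elastic net", [round(elastic_net(b, 0.2, 0.2), 3) for b in effects])
```

With a common penalty of 0.2, ridge keeps all three effects nonzero, LASSO removes only the tiny one and subtracts 0.2 from the large one, and the adaptive weighting shrinks the large effect by only 0.1 while removing both smaller effects.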
Since most linear regression methods are sensitive to multicollinearity among genetic markers, the Elastic-Net (EN) model was developed by Cho et al. [91] to detect the joint effect of multiple genetic markers simultaneously. EN has been reported to be a more powerful model than other penalized regression models, since it benefits from the advantages of both LASSO and ridge regression by using them at the same time in a given dataset [91, 92]. BayesA, another Bayesian GWAS model, considers independent and identical univariate distributions for genetic marker effects with a null mean [93]. This means that the genetic marker effect of locus “A” has an independent Student’s t-distribution with a null mean and unknown locus-specific variance [49, 94, 95]. In BayesB, the prior specifications are that genetic marker effects have no effect with the probability of “π” and independent and identical Student’s t-distributions with a null mean with the probability of “1 − π” [76, 78, 96]. Many genetic markers in the BayesB model have zero effects and few genetic markers are assumed to have Student’s t-distributions [94]. Thus, BayesA is a special form of BayesB with π = 0 [76]. BayesC also employs the same prior specifications as BayesB but assumes equal unknown variances for all the tested genetic markers [97]. BayesCπ was developed for the case where a proportion of genetic markers (π) are considered to have zero effect and the rest of the genetic markers follow a normal distribution in a given dataset [76, 98]. One of the major challenges among all the Bayesian GWAS models is that they are usually computationally demanding despite the great improvement in MCMC simulation methods [99, 100]. Therefore, significant efforts are required to boost computational speed in Bayesian GWAS models (Fig. 1). One of these efforts resulted in the development of the fast empirical Bayesian LASSO (EBLASSO), in which the variance components are estimated as closely as in other Bayesian models but in a more accurate and computationally efficient manner [99]. Empirical Bayes (EB) was developed by using a selected prior distribution to shrink genetic markers toward zero [99]. This approach increases the computational efficiency in comparison with fully Bayesian models, which rely on MCMC simulation [99, 101]. It is well documented that EB has the highest efficiency when dealing with a large number of genetic markers simultaneously, with lower computational needs [99, 102]. However, its efficiency highly depends on numerical optimization methods, such as the simplex algorithm, that estimate the variance components of the tested genetic markers [99, 102, 103]. One of the components of EB is the relevance vector machine (RVM), which is considered a machine learning algorithm [102]. The first assumption for using this algorithm is the uniform distribution of the variance components [102].
This assumption removes the flexibility of EB to adjust the shrinkage level for a specific dataset [99, 102]. Therefore, EBLASSO was developed based on the Bayesian LASSO model with an exponential prior distribution for the variance components, instead of the inverse chi-square distribution for the variance components considered by the EB model [99, 101].
2.4 GWAS Using Machine Learning (ML)
One of the most recent trends in agricultural science is the use of diverse machine learning (ML) algorithms in solving regression and classification problems [104]. To put it simply, ML algorithms can learn from data without being explicitly programmed thanks to automated analytical algorithm building [105, 106]. The use of ML algorithms has been widely investigated in phenomics [107], genomics [25], metabolomics [108], genetic engineering [109], and proteomics [110]. In genomics, ML algorithms have the ability to interpret large genomic datasets and annotate different genomic sequence factors [25, 111]. In GWAS, ML algorithms can prioritize
and detect genetic markers with small effect sizes, thus resulting in an increased efficiency of detecting relevant QTL associated with the trait of interest [112, 113]. Throughout the ML analytical process, data are split into training, validation, and test subsets [114]. The training dataset is usually employed to train the ML algorithms using observed data [115]. The validation dataset is used to tune the hyperparameters of the tested ML algorithm [114, 115]. The testing dataset is employed to assess the performance of the optimal ML algorithm [115]. Alternatively, the cross-validation technique can be used when the dataset is small and it is not possible to split the base dataset into three subsets [115]. Cross-validation splits the dataset into training and testing datasets, and the training subset is then divided into two distinct subsets, one for training the algorithm and another for tuning the hyperparameters [116]. The partitioning of the base dataset should be performed several times to minimize overfitting and variability rates [116]. In GWAS analysis, partitioning the base dataset can be complicated by the existence of interactions between genetic markers [117]. To solve this issue, Piette and Moore [117] proposed an alternative technique for cross-validation that can keep the original structure of the data. Most of the ML algorithms used in genomics are categorized as supervised or unsupervised learning algorithms [113]. Supervised ML algorithms use labeled data to make predictions for unlabeled data, while unsupervised algorithms find patterns in the dataset without the use of labeled data [118]. Supervised ML algorithms should be implemented to find the genetic markers associated with the labeled data, which is the trait of interest [118]. Several supervised ML algorithms have been proposed recently to increase the accuracy and prediction efficiency in a given dataset [119].
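The two partitioning schemes described above can be sketched with the Python standard library. The split proportions, the seed, and the function names are hypothetical choices made for illustration; the structure-preserving technique of Piette and Moore [117] is not implemented here.

```python
import random


def three_way_split(n, train=0.6, valid=0.2, seed=0):
    """Shuffle genotype indices and split them into training, validation and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train)
    n_valid = round(n * valid)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])


def k_fold(n, k=5, seed=0):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out


train_idx, valid_idx, test_idx = three_way_split(20)
print(len(train_idx), len(valid_idx), len(test_idx))
for training, held_out in k_fold(20, k=5):
    print(len(training), sorted(held_out))
```

To obtain the inner split used for hyperparameter tuning, `k_fold` can be applied again to the training indices of each outer fold. Repeating the procedure with different seeds corresponds to partitioning the base dataset several times.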
Five of the most common ML algorithms are penalized regressions, support vector machine (SVM), artificial neural network (ANN), random forest (RF), and Bayesian network (BN) [113, 118, 119]. Most ML algorithms have high computational efficiency in dealing with large datasets; however, each has its own merits and demerits [120]. The rationale behind SVM algorithms is to find a hyperplane in the genetic markers that separates them based on the trait of interest [121, 122]. The advantage of this algorithm is its ability to incorporate a high level of interaction using nonlinear kernels without the need to predefine model parameters [121, 123]. However, the efficiency of the SVM algorithm drops significantly when the number of genetic markers exceeds the number of genotypes [124]. Therefore, prescreening of genetic markers is required to maintain the efficiency of the SVM [25]. ANN is another ML algorithm that has been used successfully in different fields of study, such as phenomics [125] and plant tissue culture [126]. Regardless of the type, most ANN algorithms include input and output layers
with one or more hidden layers in between [127]. The computational efficiency of ANN is, in general, lower than other ML algorithms and the possibility of overfitting might be augmented when the number of hidden layers is not chosen wisely [128]. As the most common decision tree algorithm, RF is reported to be efficient in dealing with high-dimensional data [129]. The RF algorithm also has higher power to detect epistasis in comparison with other ML algorithms [129]. However, as the number of genetic markers increases in a given population, the prediction power of RF will decrease, and the marginal effects will be captured instead of the effects of epistasis [130]. The BN algorithms are probabilistic graphical algorithms, which use a directed acyclic graph to model the causation and conditional dependency between genetic markers and the trait of interest [131]. This algorithm was shown to offer high performance in identifying interactions in a large dataset in comparison with other ML algorithms such as the penalized regressions or RF algorithms (Fig. 1) [132]. Variable selection methods are usually implemented in ML algorithms to detect the association between input and output variables [133, 134]. In GWAS, epistasis and genetic marker identification can be done by different variable selection methods such as filtering and wrapper methods [25]. The variable selection method is reported to be the most effective way to deal with the “large p, small n” problem [135]. Also, the interaction between genetic markers can be easily integrated into ML algorithms by using variable selection methods [133]. 
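A filter-type variable selection step can be sketched as follows. The example scores each marker by its mutual information with a binary trait and ranks the markers by that score; the genotype coding (0/1/2), the marker names, and the use of mutual information rather than another filter score are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete vectors."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    score = 0.0
    for (a, b), joint in count_xy.items():
        score += (joint / n) * math.log((joint * n) / (count_x[a] * count_y[b]))
    return score


def rank_markers(genotypes, trait):
    """Rank markers from most to least informative about the trait."""
    scores = {name: mutual_information(calls, trait) for name, calls in genotypes.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical data: six genotypes, a binary trait, and two markers.
trait = [0, 0, 0, 1, 1, 1]
genotypes = {
    "m_null": [0, 1, 2, 0, 1, 2],   # unrelated to the trait
    "m_assoc": [0, 0, 0, 2, 2, 2],  # perfectly associated with the trait
}
ranking, scores = rank_markers(genotypes, trait)
print(ranking, scores)
```

Because each marker is scored on its own, a filter of this kind is fast but cannot see joint effects of several markers.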
Using variable selection methods, the three following steps should be considered: (1) select the smallest set of genetic markers and their interactions to construct the best algorithm that identifies genetic markers with strong causal effects among the tested genome sequences, (2) select the smallest set of genetic markers and their interactions to construct the most accurate algorithm that includes genetic markers with smaller effects, and (3) select the subset of genetic markers and their interactions to understand the biological mechanisms underlying each genetic marker [25]. For the third step, the biological pathways and functional annotations can be incorporated into the variable selection process to provide insightful information on biological mechanisms [25]. Filtering methods such as mutual information, ANOVA, and chi-squared score are the most common variable selection methods, and these rely on the intrinsic nature of features [136, 137]. Wrapper methods such as recursive feature elimination and genetic algorithms are able to consider the joint interaction effects of multiple genetic markers [133]. However, they mostly offer a lower computational efficiency and have some level of overfitting in detecting the most important genetic markers associated with the trait of interest [138]. Besides, most developed supervised ML algorithms are species-specific and need to be developed for each
plant species [113, 118]. Additionally, some ML algorithms such as RF provide an importance score for genetic markers that can be used to prioritize and identify relevant genetic markers [139, 140]. However, the abovementioned selection methods are often subjective and arbitrary, and the efficiency of variable selection methods is strongly dependent on the accuracy of the algorithm and parameters used, which is often unstable [141]. As one of the remedies for this issue, two-step ML-statistical algorithms have been developed for GWAS in which the performance of the tested algorithms is evaluated in a first step and the variable importance is then estimated in a second step [141, 142]. REGENIE, the most recently developed GWAS model, is based on a two-step machine learning paradigm designed to increase the efficiency of QTL detection in large populations [143]. In the first step, the genetic markers are divided into consecutive groups of N genetic markers and a small set of K ridge regression predictions is generated from each group [143]. These ridge regression predictions use different shrinkage parameters to capture the unknown size and number of associated QTL [143]. The results of the first step are used as a covariate in the second step, where each genetic marker is tested. The REGENIE model is reported to be faster and more efficient than other GWAS models [143]. However, further investigations need to be done to confirm its efficiency more broadly.
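The first step summarized above, block-wise ridge predictions with several shrinkage parameters, can be illustrated with a small pure-Python sketch. It is not the REGENIE implementation: it covers only the generation of predictions per block, omits how those predictions are combined and used in the second step, and fits no intercept or covariates. All function names, the tiny dataset, and the shrinkage values are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def ridge_predictions(block, phenotype, shrinkage):
    """Fit ridge regression on one block of markers and return the fitted values."""
    p = len(block[0])
    xtx = [[sum(row[a] * row[b] for row in block) + (shrinkage if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * y for row, y in zip(block, phenotype)) for a in range(p)]
    beta = solve(xtx, xty)
    return [sum(x * b for x, b in zip(row, beta)) for row in block]


def step_one(genotypes, phenotype, block_size, shrinkages):
    """Return one prediction vector per (block, shrinkage) pair."""
    n_markers = len(genotypes[0])
    predictors = []
    for start in range(0, n_markers, block_size):
        block = [row[start:start + block_size] for row in genotypes]
        for shrinkage in shrinkages:  # shrinkage values must be positive
            predictors.append(ridge_predictions(block, phenotype, shrinkage))
    return predictors


# Hypothetical data: three genotypes (rows) scored at two markers (columns).
genotypes = [[1, 0], [0, 1], [1, 1]]
phenotype = [1.0, 2.0, 3.0]
print(step_one(genotypes, phenotype, block_size=2, shrinkages=[1.0]))
print(len(step_one(genotypes, phenotype, block_size=1, shrinkages=[0.5, 1.0])))
```

With blocks of one marker and two shrinkage values, the two markers are reduced to four prediction vectors; in a realistic setting the number of predictions is far smaller than the number of markers, which is the point of the first step.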
3 Conclusion
Identifying genomic regions associated with important traits in plants is a promising approach for the sustainable development of superior cultivars. Considering the transformative impact of GWAS on characterizing quantitative characteristics, the statistical power of this approach can be improved by exploiting more appropriate and powerful computational procedures. In general, no matter which GWAS method is used, the completeness of genetic markers, the choice of genetic algorithm, population structure, cryptic relatedness, and the determination of the significance threshold play a pivotal role in identifying genomic regions controlling traits of interest accurately and efficiently. Unlike single-locus models, multilocus models were developed to estimate the joint effects of multiple alleles; however, their major impediment is dealing with “large p, small n” datasets. Although Bayesian methods can handle datasets with a large number of genetic markers with no requirement for Bonferroni correction, they are complex and time-consuming. Machine learning algorithms offer a broad range of solutions to address the limitations associated with conventional GWAS methods, but they are still in their early stages for GWAS analysis.
There is no doubt that advances in developing new analytical models will help to accelerate and improve analytical procedures, but it is important to note that GWAS is a concerted effort, and its outcome is directly dependent on the assembly of the association panel, experimental design, genotyping and genome coverage, phenotyping, analytical model, and postanalysis interpretation and validation.
References
1. Ersoz ES, Yu J, Buckler ES (2007) Applications of linkage disequilibrium and association mapping in crop plants. In: Genomics-assisted crop improvement. Springer, New York, pp 97–119
2. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, Small G, Roses A, Haines J, Pericak-Vance MA (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer’s disease in late onset families. Science 261:921–923
3. Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D, Buckler ES (2001) Dwarf8 polymorphisms associate with variation in flowering time. Nat Genet 28:286–289
4. Torkamaneh D, Chalifour F-P, Beauchamp CJ, Agrama H, Boahen S, Maaroufi H, Rajcan I, Belzile F (2020) Genome-wide association analyses reveal the genetic basis of biomass accumulation under symbiotic nitrogen fixation in African soybean. Theor Appl Genet 133:665–676
5. Barbinta-Patrascu ME, Badea N, Ungureanu C, Iordache SM, Constantin M, Purcar V, Rau I, Pirvu C (2017) Ecobiophysical aspects on nanosilver biogenerated from Citrus reticulata peels, as potential biopesticide for controlling pathogens and wetland plants in aquatic media. J Nanomater 2017:4214017. https://doi.org/10.1155/2017/4214017
6. Bruce RW, Torkamaneh D, Grainger CM, Belzile F, Eskandari M, Rajcan I (2020) Haplotype diversity underlying quantitative traits in Canadian soybean breeding germplasm. Theor Appl Genet 133:1967
7. Xiao Y, Liu H, Wu L, Warburton M, Yan J (2017) Genome-wide association studies in maize: praise and stargaze. Mol Plant 10:359–374
8. Tian D, Wang P, Tang B, Teng X, Li C, Liu X, Zou D, Song S, Zhang Z (2020) GWAS atlas: a curated resource of genome-wide variant-trait associations in plants and animals. Nucleic Acids Res 48:D927–D932
9. Chen F, Dong W, Zhang J, Guo X, Chen J, Wang Z, Lin Z, Tang H, Zhang L (2018) The sequenced angiosperm genomes and genome databases. Front Plant Sci 9:418
10. Torkamaneh D, Boyle B, Belzile F (2018) Efficient genome-wide genotyping strategies and data integration in crop plants. Theor Appl Genet 131:499–511
11. Yang W, Guo Z, Huang C, Duan L, Chen G, Jiang N, Fang W, Feng H, Xie W, Lian X (2014) Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nat Commun 5:1–9
12. Tibbs Cortes L, Zhang Z, Yu J (2021) Status and prospects of genome-wide association studies in plants. Plant Genome 14:e20077
13. Kumar J, Pratap A, Solanki R, Gupta D, Goyal A, Chaturvedi S, Nadarajan N, Kumar S (2012) Genomic resources for improving food legume crops. J Agric Sci 150:289–318
14. Astle W, Balding DJ (2009) Population structure and cryptic relatedness in genetic association studies. Stat Sci 24:451–471
15. Mulford AJ, Wing C, Dolan ME, Wheeler HE (2021) Genetically regulated expression underlies cellular sensitivity to chemotherapy in diverse populations. Hum Mol Genet 30:305–317. https://doi.org/10.1093/hmg/ddab029
16. Sun L, Dimitromanolakis A (2012) Identifying cryptic relationships. In: Statistical human genetics. Springer, New York, pp 47–57
17. Ke X, Hunt S, Tapper W, Lawrence R, Stavrides G, Ghori J, Whittaker P, Collins A, Morris AP, Bentley D (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum Mol Genet 13:577–588
18. Joiret MM, John JM, Gusareva ES, Van Steen K (2019) Confounding of linkage disequilibrium patterns in large scale DNA based gene-gene interaction studies. BioData Min 12:199–197
19. Gao Y, Liu Z, Faris JD, Richards J, Brueggeman RS, Li X, Oliver RP, McDonald BA, Friesen TL (2016) Validation of genome-wide association studies as a tool to identify virulence factors in Parastagonospora nodorum. Phytopathology 106:1177–1185
20. Soto-Cerda BJ, Cloutier S (2012) Association mapping in plant genomes. In: Genetic diversity in plants. InTech Open, London, pp 29–54
21. Maurer A, Draba V, Pillen K (2016) Genomic dissection of plant development and its impact on thousand grain weight in barley through nested association mapping. J Exp Bot 67:2507–2518
22. Chen Z, Boehnke M, Wen X, Mukherjee B (2021) Revisiting the genome-wide significance threshold for common variant GWAS. G3 11:jkaa056
23. Bush WS, Moore JH (2012) Chapter 11: genome-wide association studies. PLoS Comput Biol 8:e1002822
24. Ding R, Yang M, Quan J, Li S, Zhuang Z, Zhou S, Zheng E, Hong L, Li Z, Cai G (2019) Single-locus and multi-locus genome-wide association studies for intramuscular fat in Duroc pigs. Front Genet 10:619
25. Sun S, Dong B, Zou Q (2021) Revisiting genome-wide association studies from statistical modelling to machine learning. Brief Bioinform 22:bbaa263
26. Nakaoka H, Inoue I (2009) Meta-analysis of genetic association studies: methodologies, between-study heterogeneity and winner’s curse. J Hum Genet 54:615–623
27. Emily M (2018) Power comparison of Cochran-Armitage trend test against allelic and genotypic tests in large-scale case-control genetic association studies. Stat Methods Med Res 27:2657–2673
28. Bush WS, Moore JH (2012) Genome-wide association studies. PLoS Comput Biol 8:e1002822
29. Manolio TA (2013) Bringing genome-wide association findings into clinical use. Nat Rev Genet 14:549–558
30. Armitage P (1955) Tests for linear trends in proportions and frequencies. Biometrics 11:375–386
31. Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B (2005) Analysis of single-locus tests to detect gene/disease associations. Genet Epidemiol 28:207–219
32. Kaler AS, Gillman JD, Beissinger T, Purcell LC (2020) Comparing different statistical models and multiple testing corrections for association mapping in soybean and maize. Front Plant Sci 10:1794
33. Li C, Fu Y, Sun R, Wang Y, Wang Q (2018) Single-locus and multi-locus genome-wide
association studies in the genetic dissection of fiber quality traits in upland cotton (Gossypium hirsutum L). Front Plant Sci 9:1083 34. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959 35. Hoffman GE, Logsdon BA, Mezey JG (2013) PUMA: a unified framework for penalized multiple regression analysis of GWAS data. PLoS Comput Biol 9:e1003101 36. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997–1004 37. Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM (2010) Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42:355–360 38. Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 38:203–208 39. Zhao K, Aranzana MJ, Kim S, Lister C, Shindo C, Tang C, Toomajian C, Zheng H, Dean C, Marjoram P (2007) An Arabidopsis example of association mapping in structured samples. PLoS Genet 3:e4 40. Wen Y-J, Zhang H, Ni Y-L, Huang B, Zhang J, Feng J-Y, Wang S-B, Dunwell JM, Zhang Y-M, Wu R (2018) Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief Bioinform 19:700–712 41. Li M, Liu X, Bradbury P, Yu J, Zhang Y-M, Todhunter RJ, Buckler ES, Zhang Z (2014) Enrichment of statistical power for genomewide association studies. BMC Biol 12:1–10 42. Segura V, Vilhja´lmsson BJ, Platt A, Korte A, € Long Q, Nordborg M (2012) An Seren U, efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet 44:825 43. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S-Y, Freimer NB, Sabatti C, Eskin E (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42:348–354 44. 
Wang Q, Tian F, Pan Y, Buckler ES, Zhang Z (2014) A SUPER powerful method for genome wide association study. PLoS One 9: e107684 45. Gupta PK, Kulwal PL, Jaiswal V (2019) Association mapping in plants in the post-GWAS genomics era. Adv Genet 104:75–154 46. Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E (2008)
Genome-Wide Association Study Statistical Models: A Review Efficient control of population structure in model organism association mapping. Genetics 178:1709–1723 47. Wen Y-J, Zhang H, Zhang J, Feng J-Y, Huang B, Dunwell JM, Zhang Y-M, Wu R (2016) A fast multi-locus random-SNP-effect EMMA for genome-wide association studies. bioRxiv 077404 48. Zhou X, Stephens M (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44:821 49. Spindel J, Begum H, Akdemir D, Collard B, ˜ a E, Jannink J, McCouch S (2016) Redon Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity 116: 395–408 50. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D (2011) FaST linear mixed models for genome-wide association studies. Nat Methods 8:833–835 51. Tucker G, Price AL, Berger B (2014) Improving the power of GWAS and avoiding confounding from population stratification with PC-select. Genetics 197:1045–1049 52. Listgarten J, Lippert C, Kadie CM, Davidson RI, Eskin E, Heckerman D (2012) Improved linear mixed models for genome-wide association studies. Nat Methods 9:525–526 53. Cui Y, Zhang F, Zhou Y (2018) The application of multi-locus GWAS for the detection of salt-tolerance loci in rice. Front Plant Sci 9: 1464 54. Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44:369–375 55. Giglio C, Brown SD (2018) Using elastic net regression to perform spectrally relevant variable selection. J Chemom 32:e3034 56. Sun S, Wang C, Ding H, Zou Q (2020) Machine learning and its applications in plant molecular studies. Brief Funct Genomics 19:40–48 57. 
Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, Steinthorsdottir V, Strawbridge RJ, Khan H, Grallert H, Mahajan A (2012) Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44:981 58. Ding K, Kullo IJ (2007) Methods for the selection of tagging SNPs: a comparison of
59
tagging efficiency and performance. Eur J Hum Genet 15:228–236 59. Harrell FE Jr (2015) Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer 60. Cherlin S, Howey R, Cordell HJ (2018) Using penalized regression to predict phenotype from SNP data. BMC proceedings 12 (Suppl 9):38.https://doi.org/10.1186/ s12919-018-0149-2 61. Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet 4:e1000130 62. Ayers KL, Cordell HJ (2010) SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol 34:879–891 63. Mihalyov PD, Nichols VA, Bulli P, Rouse MN, Pumphrey MO (2017) Multi-locus mixed model analysis of stem rust resistance in winter wheat. Plant Genome 10. https:// doi.org/10.3835/plantgenome2017.01. 0001 64. Wen Y-J, Zhang Y-W, Zhang J, Feng J-Y, Dunwell JM, Zhang Y-M (2019) An efficient multi-locus mixed model framework for the detection of small and linked QTLs in F2. Brief Bioinform 20:1913–1924 65. Wen Y, Zhang Y, Zhang J, Feng J, Zhang Y (2020) The improved FASTmrEMMA and GCIM algorithms for genome-wide association and linkage studies in large mapping populations. Crop J 8:723–732 66. Liu X, Huang M, Fan B, Buckler ES, Zhang Z (2016) Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet 12:e1005767 67. Huang M, Liu X, Zhou Y, Summers RM, Zhang Z (2019) BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions. GigaScience 8:giy154 68. Zhong H, Liu S, Meng X, Sun T, Deng Y, Kong W, Peng Z, Li Y (2021) Uncovering the genetic mechanisms regulating panicle architecture in rice with GPWAS and GWAS. BMC Genomics 22:1–13 69. 
Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhja´lmsson BJ, Finucane HK, Salem RM, Chasman DI, Ridker PM, Neale BM, Berger B et al (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet 47:284–290. https:// doi.org/10.1038/ng.3190
60
Mohsen Yoosefzadeh-Najafabadi et al.
70. Loh P-R, Kichaev G, Gazal S, Schoech AP, Price AL (2018) Mixed-model association for biobank-scale datasets. Nat Genet 50: 906–908. https://doi.org/10.1038/ s41588-018-0144-6 71. Zhao Y, Zhu H, Lu Z, Knickmeyer RC, Zou F (2019) Structured genome-wide association studies with Bayesian hierarchical variable selection. Genetics 212:397–415 72. Armero C, Cabras S, Castellanos ME, Quiro´s A (2019) Two-stage Bayesian approach for GWAS with known genealogy. J Comput Graph Stat 28:197–204. https://doi.org/ 10.1080/10618600.2018.1483828 73. Banerjee S, Zeng L, Schunkert H, So¨ding J (2018) Bayesian multiple logistic regression for case-control GWAS. PLoS Genet 14: e1007856 74. Banerjee S, Zeng L, Schunkert H, So¨ding J (2019) Bayesian multiple logistic regression for case-control GWAS. PLoS Genet 14: e1007856. https://doi.org/10.1371/jour nal.pgen.1007856 75. Stephens M, Balding DJ (2009) Bayesian statistical methods for genetic association studies. Nat Rev Genet 10:681–690 76. Fernando RL, Garrick D (2013) Bayesian methods applied to GWAS. In: Gondro C, van der Werf J, Hayes B (eds) Genome-wide association studies and genomic prediction. Humana Press, Totowa, pp 237–274. https://doi.org/10.1007/978-1-62703447-0_10 77. E Silva FF, Viana JMS, Faria VR, de Resende MDV (2013) Bayesian inference of mixed models in quantitative genetics of crop species. Theor Appl Genet 126:1749–1761. https://doi.org/10.1007/s00122-0132089-6 78. Sorensen D, Gianola D (2007) Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer, New York 79. Papachristou C, Ober C, Abney M A LASSO penalized regression approach for genomewide association analyses using related individuals: application to the Genetic Analysis Workshop 19 simulated data. In Proceedings of BMC proceedings; pp. 221–226 80. Wang Y, Sha N, Fang Y (2009) Analysis of genome-wide association data by large-scale Bayesian logistic regression. BMC Proc 3: S16. https://doi.org/10.1186/1753-65613-S7-S16 81. 
Park MY, Hastie T (2008) Penalized logistic regression for detecting gene interactions. Biostatistics 9:30–50
82. Wu TT, Chen YF, Hastie T, Sobel E, Lange K (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25:714–721 83. Fort G, Lambert-Lacroix S (2005) Classification using partial least squares with penalized logistic regression. Bioinformatics 21: 1104–1111 84. Hoerl AE, Kannard RW, Baldwin KF (1975) Ridge regression: some simulations. Commun Stat Theory Methods 4:105–123 85. Hans C (2009) Bayesian lasso regression. Biometrika 96:835–845 86. Hans C (2011) Elastic net regression modeling with the orthant normal prior. J Am Stat Assoc 106:1383–1393 87. Li J, Das K, Fu G, Li R, Wu R (2011) The Bayesian lasso for genome-wide association studies. Bioinformatics 27:516–523 88. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101: 1418–1429 89. Zhang HH, Lu W (2007) Adaptive Lasso for Cox’s proportional hazards model. Biometrika 94:691–703 90. Wang D, Eskridge KM, Crossa J (2011) Identifying QTLs and epistasis in structured plant populations using adaptive mixed LASSO. J Agric Biol Environ Stat 16:170–184 91. Cho S, Kim K, Kim YJ, Lee JK, Cho YS, Lee JY, Han BG, Kim H, Ott J, Park T (2010) Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann Hum Genet 74:416–428 92. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67:301–320 93. Garrick DJ, Fernando RL (2013) Implementing a QTL detection study (GWAS) using genomic prediction methodology. In: Genome-wide association studies and genomic prediction. Springer, New York, pp 275–298 94. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829 95. Chen C, Steibel JP, Tempelman RJ (2017) Genome-wide association analyses based on broadly different specifications for prior distributions, genomic windows, and estimation methods. Genetics 206:1791–1806 96. 
Vallejo RL, Cheng H, Fragomeni BO, Shewbridge KL, Gao G, MacMillan JR, Towner R, Palti Y (2019) Genome-wide association analysis and accuracy of genome-enabled breeding
Genome-Wide Association Study Statistical Models: A Review value predictions for resistance to infectious hematopoietic necrosis virus in a commercial rainbow trout breeding population. Genet Sel Evol 51:1–14 97. Wolc A, Arango J, Settar P, Fulton JE, O’Sullivan NP, Dekkers JC, Fernando R, Garrick DJ (2016) Mixture models detect large effect QTL better than GBLUP and result in more accurate and persistent predictions. J Anim Sci Biotechnol 7:1–6 98. Habier D, Fernando RL, Kizilkaya K, Garrick DJ (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12:1–12 99. Cai X, Huang A, Xu S (2011) Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping. BMC Bioinformatics 12:1–13 100. Robert C, Casella G (2013) Monte Carlo statistical methods. Springer, New York 101. Xu S (2007) An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63:513–521 102. Xu S (2010) An expectation–maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity 105: 483–494 103. Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7: 308–313 104. Liakos KG, Busato P, Moshou D, Pearson S, Bochtis D (2018) Machine learning in agriculture: a review. Sensors 18:2674 105. McQueen RJ, Garner SR, Nevill-Manning CG, Witten IH (1995) Applying machine learning to agricultural data. Comput Electron Agric 12:275–293 106. Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A (2020) A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res 119:104926 107. Yoosefzadeh-Najafabadi M, Tulpan D, Eskandari M (2021) Using hybrid artificial intelligence and evolutionary optimization algorithms for estimating soybean yield and fresh biomass using hyperspectral vegetation indices. Remote Sens 13:2555 108. 
Chetnik K, Petrick L, Pandey G (2020) MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC–MS metabolomics data. Metabolomics 16:1–13 109. Hesami M, Yoosefzadeh Najafabadi M, Adamek K, Torkamaneh D, Jones AMP (2021) Synergizing off-target predictions for in silico insights of CENH3 knockout in
61
cannabis through CRISPR/Cas. Molecules 26:2053 110. Wen B, Zeng WF, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B (2020) Deep learning in proteomics. Proteomics 20:1900335 111. Peng GC, Alber M, Tepole AB, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton WW, Perdikaris P (2021) Multiscale modeling meets machine learning: what can we learn? Arch Comput Methods Eng 28:1017–1037 112. Leal LG, David A, Jarvelin M-R, Sebert S, M€annikko¨ M, Karhunen V, Seaby E, Hoggart C, Sternberg MJ (2019) Identification of disease-associated loci using machine learning for genotype and network data integration. Bioinformatics 35:5182–5190 113. Libbrecht MW, Noble WS (2015) Machine learning applications in genetics and genomics. Nat Rev Genet 16:321–332 114. Reitermanova´, Z (2010) Data splitting. WDS’10 Proceedings of Contributed Papers, Part I, 31–36 ˜ onero-Candela, J.; Sugiyama, M.; Lawr115. Quin ence, N.D.; Schwaighofer, A. Dataset shift in machine learning; Mit Press: Cambridge 2009 116. Schaffer C (1993) Selecting a classification method by cross-validation. Mach Learn 13: 135–143 117. Piette ER, Moore JH (2018) Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Mining 11: 1–11 118. Schrider DR, Kern AD (2018) Supervised machine learning for population genetics: a new paradigm. Trends Genet 34:301–312 119. Williams AM, Liu Y, Regner KR, Jotterand F, Liu P, Liang M (2018) Artificial intelligence, physiological genomics, and precision medicine. Physiol Genomics 50:237–243 120. Wuest T, Weimer D, Irgens C, Thoben K-D (2016) Machine learning in manufacturing: advantages, challenges, and applications. Prod Manuf Res 4:23–45 121. Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W (2018) Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics 15: 41–51 122. 
Najafabadi MY, Torabi S, Torkamaneh D, Tulpan D, Rajcan I, Eskandari M (2021) Machine learning based genome-wide association studies for uncovering QTL underlying soybean yield and its components. bioRxiv
62
Mohsen Yoosefzadeh-Najafabadi et al.
123. Yu G-X, Ostrouchov G, Geist A, Samatova NF (2003) An SVM-based algorithm for identification of photosynthesis-specific genome features. In: Proceedings of the 2003 IEEE bioinformatics conference. CSB2003, pp 235–243 124. Sonnenburg S, R€atsch G, Scho¨lkopf B (2005) Large scale genomic sequence SVM classifiers. Proceedings of the 22nd international conference on machine learning 848–855. https:// doi.org/10.1145/1102351.1102458 125. Yoosefzadeh-Najafabadi M, Tulpan D, Eskandari M (2021) Application of machine learning and genetic optimization algorithms for modeling and optimizing soybean yield using its component traits. PLoS One 16: e0250665 126. Hesami M, Condori-Apfata JA, Valderrama Valencia M, Mohammadi M (2020) Application of artificial neural network for modeling and studying in vitro genotype-independent shoot regeneration in wheat. Appl Sci 10: 5370 127. Hesami M, Jones AMP (2020) Application of artificial intelligence models and optimization algorithms in plant cell and tissue culture. Appl Microbiol Biotechnol 104:1–37 128. Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26 129. Calle ML, Urrea V, Boulesteix A-L, Malats N (2011) AUC-RF: a new strategy for genomic profiling with random forest. Hum Hered 72: 121–132 130. Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM (2012) SNP interaction detection with random forests in high-dimensional genetic data. BMC Bioinformatics 13:1–13 131. Zhang L, Pan Q, Wang Y, Wu X, Shi X (2017) Bayesian network construction and genotypephenotype inference using GWAS statistics. IEEE/ACM Trans Comput Biol Bioinform 16:475–489 132. Jiang X, Neapolitan RE (2015) Evaluation of a two-stage framework for prediction using big genomic data. Brief Bioinform 16: 912–921 133. 
Pahikkala T, Okser S, Airola A, Salakoski T, Aittokallio T (2012) Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithms Mol Biol 7:1–15
134. Yoosefzadeh-Najafabadi M, Earl HJ, Tulpan D, Sulik J, Eskandari M (2021) Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Front Plant Sci 11. https://doi.org/10.3389/fpls.2020.624273 135. Chong I-G, Jun C-H (2005) Performance of some variable selection methods when multicollinearity is present. Chemom Intell Lab Syst 78:103–112 136. Han B, Park M, Chen XW (2010) A Markov blanket-based method for detecting causal SNPs in GWAS. BMC Bioinformatics 11 (Suppl 3):S5. https://doi.org/10.1186/ 1471-2105-11-S3-S5. PMID: 20438652; PMCID: PMC2863064 137. Guo H, Yu Z, An J, Han G, Ma Y, Tang R (2020) A two-stage mutual information based Bayesian Lasso algorithm for multilocus genome-wide association studies. Entropy 22:329 138. Alzubi R, Ramzan N, Alzoubi H (2017) Hybrid feature selection method for autism spectrum disorder SNPs. In: Proceedings of 2017 IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB), pp 1–7 139. Yuan H-Y, Chiou J-J, Tseng W-H, Liu C-H, Liu C-K, Lin Y-J, Wang H-H, Yao A, Chen Y-T, Hsu C-N (2006) FASTSNP: an always up-to-date and extendable service for SNP function analysis and prioritization. Nucleic Acids Res 34:W635–W641 140. Strobl C, Boulesteix A-L, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8:1–21 141. Szymczak S, Holzinger E, Dasgupta A, Malley JD, Molloy AM, Mills JL, Brody LC, Stambolian D, Bailey-Wilson JE (2016) r2VIM: a new variable selection method for random forests in genome-wide association studies. BioData Mining 9:1–15 142. Vitsios D, Petrovski S (2019) Stochastic semisupervised learning to prioritise genes from high-throughput genomic screens. bioRxiv 655449 143. 
Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, Benner C, O’Dushlaine C, Barber M, Boutkov B et al (2021) Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. https://doi. org/10.1038/s41588-021-00870-7
Chapter 5

Interpretation of Manhattan Plots and Other Outputs of Genome-Wide Association Studies

Jiabo Wang, Jianming Yu, Alexander E. Lipka, and Zhiwu Zhang

Abstract

With increasing marker density, estimation of recombination rate between a marker and a causal mutation using linkage analysis becomes less important. Instead, linkage disequilibrium (LD) becomes the major indicator for gene mapping through genome-wide association studies (GWAS). In addition to the linkage between the marker and the causal mutation, many other factors may contribute to the LD, including population structure and cryptic relationships among individuals. As statistical methods and software evolve to improve statistical power and computing speed in GWAS, the corresponding outputs must also evolve to facilitate the interpretation of input data, the analytical process, and final association results. In this chapter, our descriptions focus on (1) considerations in creating a Manhattan plot displaying the strength of LD and locations of markers across a genome; (2) criteria for genome-wide significance threshold and the different appearance of Manhattan plots in single-locus and multiple-locus models; (3) exploration of population structure and kinship among individuals; (4) quantile–quantile (QQ) plot; (5) LD decay across the genome and LD between the associated markers and their neighbors; (6) exploration of individual and marker information on Manhattan and QQ plots via interactive visualization using HTML. The ultimate objective of this chapter is to help users to connect input data to GWAS outputs to balance power and false positives, and connect GWAS outputs to the selection of candidate genes using LD extent.

Key words GWAS, Linkage disequilibrium, Population structure, Kinship, False positive rate, Mixed linear model
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

The genome-wide association study (GWAS) is an important tool to map the genes underlying complex traits in animals and plants, as well as genetic diseases in humans [1–6]. The most common genetic markers, single-nucleotide polymorphisms (SNPs), are used to capture the linkage disequilibrium (LD) with quantitative trait loci (QTL) [7–10]. The detection ability of GWAS depends on many factors, including population size, heritability, marker density, minor allele frequency (MAF), statistical model, and genetic architecture of the trait [11–15]. With the development of sequencing technology, deep sequencing and large populations have been employed for GWAS in more and more species [16–22]. Although the associated markers explain only a small proportion of heritability, the proportion is expected to increase with additional discoveries [23–27].

Multiple software packages (e.g., TASSEL [28], GCTA [29], PLINK [30], and GAPIT [31]) have been developed to conduct GWAS. The models used by these packages differ in statistical power, computing speed, and output. The two major outputs of GWAS are Manhattan and quantile–quantile (QQ) plots [32–34]. In this chapter, we first describe the various forms of these plots, including subtle differences in presentation due to the selected models (single-locus vs. multiple-locus models). Second, we describe the plots that are helpful for examining genotypes, population structure, and familial relatedness (i.e., kinship). Finally, we describe interactive visualization based on HTML.
2 Generating Manhattan Plots

The Manhattan plot is used to summarize and visualize marker-trait relationships across the whole genome. It takes its name from its resemblance to the Manhattan skyline: a profile of skyscrapers towering above lower "buildings" that vary around a lower height [3]. Generally, the x-axis denotes the physical positions of the markers on the chromosomes, and the y-axis shows the negative log10 of the P-values. These P-values correspond to the hypothesis test of H0: there is no association between the tested marker and the studied trait. The tall "buildings" of a Manhattan plot represent SNPs with strong statistical association with the tested phenotype. A horizontal line, called the cutoff line, typically spans the plot from left to right and marks the threshold above which markers are considered significantly associated with the trait (Fig. 1). Ideally, this threshold is calculated by controlling for multiple testing on a genome-wide scale. Significant points and continuous peaks of markers are of interest because they may indicate nearby QTLs [35, 36]. If no markers rise above the cutoff line, there may be several reasons, such as a limited population size, low sequencing depth, or a statistical method with insufficient power.
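The quantities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Bonferroni correction is used here as one common way to control for multiple testing and is not necessarily the threshold used for Fig. 1, and the function names and example P-values are hypothetical.

```python
import math

def neg_log10(p_values):
    """Transform P-values to the -log10 scale used on the y-axis."""
    return [-math.log10(p) for p in p_values]

def bonferroni_cutoff(alpha, n_markers):
    """Height of the cutoff line on the -log10 scale.

    Assumption: a Bonferroni correction (alpha divided by the number
    of tested markers); other genome-wide thresholds are possible.
    """
    return -math.log10(alpha / n_markers)

def significant_markers(p_values, alpha=0.05):
    """Indices of markers whose -log10(P) rises above the cutoff line."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [i for i, y in enumerate(neg_log10(p_values)) if y > cutoff]

# Hypothetical P-values for five markers
p = [0.5, 1e-8, 0.03, 0.2, 1e-3]
print(significant_markers(p))  # [1, 4]
```

Under this assumed correction, a dataset of 3093 SNPs tested at alpha = 0.05 would give a cutoff of about 4.79 on the −log10 scale.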
2.1 P-values and Their Negative Log Transformation
The null hypothesis of an association test is that there is no linkage disequilibrium between a genetic marker and a trait of interest. GWAS tests this hypothesis for all genetic markers across the genome, either by testing the markers one at a time in a single-locus model [37] or by testing all the markers simultaneously [38–40]. Between these two extremes lies another type of test, the multiple-locus models [41–43]. Besides testing markers one at a time, multiple-locus models also include additional markers as cofactors. Typical statistical tests include the F-test and t-test, which generate a probability (P-value) for each marker. A high P-value indicates that the data are consistent with the null hypothesis; the null hypothesis is rejected when the P-value falls below a threshold. To visualize the association across the genome, the P-values are plotted on a negative logarithmic scale with base 10. Therefore, the associated markers appear at the top and the nonassociated markers at the bottom, forming the well-known Manhattan plot [44].

[Figure: Manhattan plot; panel title "FarmCPU dpoll"; x-axis: chromosomes 1–10; y-axis: −log10(p)]

Fig. 1 Manhattan plot of genome-wide association study conducted using FarmCPU. The data comprise 281 maize inbreds phenotyped for flowering time (days to pollination) and genotyped with 3093 SNPs

2.2 Chromosome and Position Visualization
The physical location of a marker on a chromosome can be used to indicate its relative position. It is therefore easy to draw a "chromosome-wise" Manhattan plot for each chromosome. However, the physical locations of markers are not continuous from one chromosome to the next, so they must be converted to continuous numeric relative positions for a whole-genome Manhattan plot. Many software tools and R packages, such as GAPIT, rMVP [45], qqman [46], and Haploview [47], set the relative position of each marker on a chromosome to its physical position on that chromosome plus the maximum relative position of the markers on the previous chromosome. For a polyploid genome, such as allopolyploid wheat, the chromosome names are coded as consecutive numbers to show the relative position.
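The conversion to continuous positions can be illustrated with a short Python sketch. It is not the code of GAPIT, rMVP, qqman, or Haploview; the function name and the toy marker list are hypothetical, and chromosomes are assumed to be coded as integers already.

```python
def cumulative_positions(markers):
    """Convert per-chromosome positions to whole-genome x-axis positions.

    markers: list of (chromosome_number, physical_position) tuples.
    Each chromosome is shifted by the maximum relative position
    reached on the previous chromosome.
    """
    chromosomes = sorted({chrom for chrom, _ in markers})
    offsets = {}
    offset = 0
    for chrom in chromosomes:
        offsets[chrom] = offset
        # Largest physical position observed on this chromosome
        max_pos = max(pos for c, pos in markers if c == chrom)
        # The next chromosome starts after this chromosome's last marker
        offset += max_pos
    return [offsets[chrom] + pos for chrom, pos in markers]

# Hypothetical markers on three chromosomes
markers = [(1, 100), (1, 250), (2, 50), (2, 300), (3, 10)]
print(cumulative_positions(markers))  # [100, 250, 300, 550, 560]
```

In this toy example the first marker on chromosome 2 is placed at 250 + 50 = 300, and the marker on chromosome 3 at 550 + 10 = 560.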
2.3 File Size Reduction
Following the development of sequencing technology and the reduction of sequencing cost, genome sequencing depth and marker density have improved. Increasingly large genotype datasets are being used to analyze associations with traits [48–52]. As markers are displayed on a log scale based on their P-values, most of the markers lie at the bottom of the plot on top of each other. Displaying all the markers at the bottom is not only unnecessary, but also creates problems for storage and display. For example, two million markers will result in a PDF file larger than 200 Mb. Therefore, an effective reduction of the markers to display is necessary. By default, GAPIT only displays 5000 markers. The algorithm removes very few markers at the top of the Manhattan plot. The markers at the bottom are randomly selected for display by sampling uniformly on the log P-value scale. Consequently, the reduced plot is visually almost identical to a plot of all markers.
2.4 Circle vs. Cartesian
Compared with linkage analysis, which is based on genetic information, GWAS is based on statistical information. Markers that are associated across correlated traits provide partial evidence for their linkage with a causal pleiotropic mutation. Although causes other than linkage (e.g., population structure) cannot be excluded, the risk of other causes, including phenotypes with outliers and genotypes with rare variants, can be dramatically reduced. Researchers are also interested in comparing different analyses, such as those using different statistical models. Therefore, it is beneficial to organize multiple Manhattan plots together and demonstrate the overlaps among them. Multiple Manhattan plots can be combined in two formats: Circle and Cartesian (Fig. 2). The Circle and Cartesian Manhattan plots are available through the software GAPIT [31, 53], rMVP [45], and CMplot [45]. In GAPIT, a marker is indicated by a vertical dashed line if it appears as an associated marker in exactly two single Manhattan plots, or by a solid line if it appears in three or more single Manhattan plots.
3 Determining a Significance Threshold to Declare Association
In a statistical test, the type I error rate is the probability of rejecting a true null hypothesis. In GWAS, the null hypothesis is that the tested marker is not associated with the trait. If the P-value for the test statistic at a given marker is less than a threshold (α), we declare this test significant at this threshold [54–56]. Conventionally, a threshold of 0.05 is regarded as significant and 0.01 as very significant for rejecting the null hypothesis.
3.1 Multiple Test Correction
There are millions of markers in the genome, which means millions of tests. Using a comparison-wise type I error rate of α = 0.05, the probability of making at least one type I error across these millions of markers will be substantially larger than 0.05. Such a nonstringent threshold brings a huge risk of detecting false-positive candidate genes [34, 57–59]. The more markers are tested in GWAS, the more erroneous signals are likely to occur. Thus, there is a critical need to adjust for multiple testing. Several commonly used, statistically rigorous approaches to adjust for multiple testing are given below.
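To make the growth of the family-wise error rate concrete, the following Python sketch computes the probability of at least one type I error across m tests. It assumes that the tests are independent, which markers in linkage disequilibrium are not, so the numbers are only indicative; the function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one type I error in m tests at level alpha.

    Assumption for illustration only: the m tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** m


# The chance of at least one false positive approaches 1 as m grows.
for m in (1, 10, 100, 1000):
    print(m, round(familywise_error_rate(0.05, m), 4))
```

Even with only 100 independent tests at α = 0.05, at least one false positive is almost certain, which motivates the corrections that follow.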
[Fig. 2 image: Circle (a) and Cartesian (b) Manhattan plots with tracks GLM.dpoll, MLM.dpoll, FarmCPU.dpoll, and Blink.dpoll; chromosomes 1 to 10 on the x-axis, -log10(p) on the y-axis, and a marker density legend]
Fig. 2 Manhattan plots of genome-wide association studies using multiple methods. The Manhattan plots are displayed in two formats: Circle (a) and Cartesian (b). The data consists of 281 maize inbreds phenotyped on flowering time (days to pollination) and genotyped with 3093 SNPs. Four methods were used: BLINK, FarmCPU, MLM, and GLM. Multiple traits can be displayed similarly
3.2 Bonferroni Cutoff
In 1936, Italian mathematician Carlo Bonferroni developed a correction for multiple comparisons based on the number of tests and the type I error of each individual hypothesis [60, 61]. When a single statistical test is repeated multiple times (m), its risk of error accumulates. In order to retain a prescribed family-wise error rate (FWER) in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than α. If each of m tests is performed with a type I error rate of α/m, the total error rate will not exceed α. The Bonferroni correction can be overconservative for GWAS because it counts every marker as a separate test. Markers within a chromosome are in linkage disequilibrium (LD), so the effective number of independent tests is smaller than m [6, 62].
3.3 False Discovery Rate (FDR) Cutoff
With Bonferroni's bound, testing each of m hypotheses at a type I error threshold α gives a family-wise error rate of at most mα. In 1986, Simes proposed an alternative for multiple dependent tests: to maintain the same family-wise error rate α, k null hypotheses can be rejected when there are k P-values smaller than $k\alpha/m$ [63]. In 1995, Benjamini and Hochberg introduced the term false discovery rate [64]. The false discovery rate is the expected proportion of rejected null hypotheses that are actually true. In the Benjamini and Hochberg procedure, hypotheses are sorted by their P-values in ascending order. The hypotheses and corresponding P-values are denoted as $H_{(i)}$ and $P_{(i)}$, respectively, for $i = 1, \ldots, m$. Let k be the largest i for which $P_{(i)} \le \frac{i}{m}\alpha$; hypotheses $H_{(1)}, \ldots, H_{(k)}$ are then rejected. The FDR controlling procedures have been shown to have greater power at the cost of an increased number of type I errors [14, 64, 65]. In GAPIT, two thresholds at the 0.05 level are drawn on the Manhattan plot. The solid and dashed green lines stand for the P-values corresponding to the Bonferroni and FDR multiple test corrections, respectively.
3.4 Permutation Cutoff
The Bonferroni cutoff is overconservative when markers are in LD across the genome, and the FDR cutoff may not be restrictive enough for some populations or traits. Often referred to as the gold standard for GWAS cutoffs, the permutation cutoff uses random reassignment of phenotypes to genotypes to derive an empirical cutoff [56, 66, 67]. GWAS is conducted on phenotypes that are randomly shuffled to break their connections with the genotypes. The smallest P-value is recorded for each shuffled replicate. Multiple replicates (usually more than 100) are required to derive the distribution of the smallest P-value. The P-value at the α quantile of this distribution is the empirical threshold for a family-wise error rate of α. Because a separate permutation procedure is conducted for each trait, an individual cutoff is established for each trait. The disadvantage of this approach is that it requires substantial computing time for each trait in a large population.
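The FDR and permutation cutoffs described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the GAPIT implementation: the single-marker test (`marker_pvalue`, a correlation test using Fisher's z approximation), the function names, and the simulated data are all assumptions chosen to keep the example self-contained.

```python
import math
import random
from statistics import NormalDist


def benjamini_hochberg_cutoff(pvalues, alpha=0.05):
    """Return the largest sorted P-value P(k) with P(k) <= k * alpha / m, or None."""
    m = len(pvalues)
    cutoff = None
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * alpha / m:
            cutoff = p  # keep the last (largest) P-value that passes
    return cutoff


def marker_pvalue(genotype, phenotype):
    """Toy single-marker test (assumption): correlation test via Fisher's z."""
    n = len(phenotype)
    mean_g = sum(genotype) / n
    mean_y = sum(phenotype) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotype, phenotype))
    sxx = sum((g - mean_g) ** 2 for g in genotype)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    if sxx == 0 or syy == 0:
        return 1.0  # monomorphic marker or constant phenotype
    r = max(min(sxy / math.sqrt(sxx * syy), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def permutation_cutoff(genotypes, phenotype, alpha=0.05, n_perm=200, seed=1):
    """Empirical alpha quantile of the smallest P-value over shuffled phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    smallest = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype connection
        smallest.append(min(marker_pvalue(g, shuffled) for g in genotypes))
    smallest.sort()
    index = max(0, int(alpha * n_perm) - 1)
    return smallest[index]


# Simulated example: 30 markers coded 0/1/2 on 60 individuals, no true signal.
rng = random.Random(0)
genotypes = [[rng.choice((0, 1, 2)) for _ in range(60)] for _ in range(30)]
phenotype = [rng.gauss(0.0, 1.0) for _ in range(60)]
perm_cutoff = permutation_cutoff(genotypes, phenotype)
print("permutation cutoff:", perm_cutoff)
print("Bonferroni cutoff: ", 0.05 / len(genotypes))
```

In practice, `marker_pvalue` would be replaced by the GWAS model actually used for the trait, since the permutation cutoff is only meaningful for the same test that produced the observed P-values.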
4 Highlighting Associated Loci
Additional information is usually added to Manhattan plots to assist interpretation. This includes, but is not limited to, the color, type, and size of dots, the relationship between the most significant marker and neighboring markers, the display of known genes or simulated quantitative trait nucleotides (QTNs), and supplemental LD information. Here we introduce several of these indicators.
4.1 Dot Size and Fill Percentage
Usually, the dot color and type are used to distinguish markers on different chromosomes. Special colors are often used to mark the significant signals. The dots are normally the same size, but for large datasets or high-density markers, the dot sizes can be scaled from small to large according to the markers’ P-values. This makes the significant markers more conspicuous in the whole Manhattan plot. The opacity of the colors can also be used to show marker density. In some simulation studies, the positions of the true QTNs are known, so the markers are usually drawn as open circles and the known QTNs are added as solid points.
4.2 Linkage Disequilibrium with Neighboring Markers
The marker with the strongest association with a trait of interest may not lie within a gene. Researchers usually need to inspect the region upstream and downstream of an associated marker to identify candidate genes. Linkage disequilibrium (LD) between the most associated marker and its neighbors is helpful for determining the candidate genes [16, 26, 68–70]. The chromosome view of the Manhattan plot is used to show not only the details of the association, but also the LD among markers. A heat map is used to indicate the LD between the most significant marker and its neighbors (Fig. 3).
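The quantity behind such a heat map can be sketched as follows. The Python example computes the squared correlation (r square) between the genotype codes of the most significant marker and every marker in the region. Genotypes are assumed to be coded 0, 1, or 2, and the function names and toy data are hypothetical; this is an illustration, not the routine used by any particular package.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return sxy * sxy / (sxx * syy)


def ld_with_top_marker(genotypes, pvalues):
    """Return the index of the most significant marker and its r square with all markers."""
    top = min(range(len(pvalues)), key=pvalues.__getitem__)
    return top, [r_squared(genotypes[top], g) for g in genotypes]


# Four markers (rows) genotyped on six individuals (columns).
toy_genotypes = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to the first marker
    [2, 1, 0, 1, 2, 0],  # allele coding reversed
    [0, 0, 1, 2, 2, 1],  # uncorrelated with the first marker
]
toy_pvalues = [1e-6, 1e-4, 0.2, 0.5]
print(ld_with_top_marker(toy_genotypes, toy_pvalues))
```

Markers with r square close to 1 would be drawn in the strongest color of the heat map, marking the LD block in which candidate genes are sought.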
Fig. 3 Display of linkage disequilibrium on chromosome-wise Manhattan plot. The genome-wide association study was conducted on a simulated trait controlled by 20 genes with heritability of 0.75 in a mouse population with 1940 individuals genotyped with 12,000 SNPs. Data for chromosome 7 is shown to demonstrate the linkage disequilibrium (R square) between the most significant marker and its neighboring markers
4.3 Manhattans in New York City (NYC) vs. Kansas
The name of the Manhattan plot is derived from the skyline of NYC. The concentrated distribution of the markers’ P-values looks like tall buildings. However, there is also a small town in Kansas named Manhattan. In the skyline of this small town, most of the buildings are not tall. The highest human-made object may be a helicopter in the sky. That is similar to some Manhattan plots created by multiple-loci models, which reduce the markers to bins and only present the marker with the lowest P-value in each bin. There are now several multistep GWAS methods (such as the multiple loci mixed linear model, FarmCPU, and BLINK) that use likelihood values or Bayesian information criteria to filter the most significant markers (named “pseudo QTNs”). The continuous peaks of NYC-type Manhattan plots indicate the presence of large QTLs. In the simulation study (Fig. 4), these multiple loci or multistep strategies have greater power than a simpler method, and for some real traits they are also reported to produce more credible results [71, 72]. These Kansas-type Manhattan plots are also very useful for distinguishing two or more closely located QTLs.
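The binning idea mentioned above, keeping only the marker with the lowest P-value in each bin, can be illustrated with the Python sketch below. It covers only that reduction step and does not reproduce the model fitting of FarmCPU or BLINK; fixed-width bins, the function name `best_marker_per_bin`, and the toy values are assumptions for this example.

```python
def best_marker_per_bin(markers, bin_size):
    """Keep the marker with the lowest P-value in each (chromosome, bin).

    markers: list of (chromosome, position, pvalue) tuples.
    bin_size: bin width in the same units as position (assumed fixed width).
    """
    best = {}
    for chrom, pos, pvalue in markers:
        key = (chrom, pos // bin_size)
        if key not in best or pvalue < best[key][2]:
            best[key] = (chrom, pos, pvalue)
    # Report the representatives in genome order.
    return [best[key] for key in sorted(best)]


toy_markers = [
    (1, 100, 0.2),
    (1, 900, 1e-5),
    (1, 1500, 0.03),
    (2, 200, 0.4),
    (2, 300, 0.01),
]
print(best_marker_per_bin(toy_markers, bin_size=1000))
# [(1, 900, 1e-05), (1, 1500, 0.03), (2, 300, 0.01)]
```

Because each bin contributes a single point, an associated region appears as an isolated dot rather than a spike, which is the Kansas-type appearance.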
5 Examining QQ Plots
The quantile–quantile (QQ) plot compares two probability distributions by plotting their quantiles against each other [73]. The points are always nondecreasing from bottom left to upper right. In GWAS, the QQ plot helps to identify inflation of P-values and the markers exceeding the expectation. Under the null hypothesis, P-values follow a uniform [0, 1] distribution. In GWAS, the expectation is that most of the markers are not associated with the trait and only a small proportion are. Therefore, most markers’ dots will lie on the diagonal line in the QQ plot, and a few markers will deviate from the diagonal (Fig. 5).
5.1 Expectation
To compare the test results with the null hypothesis, an expected P-value (EP-value) is calculated for each observed P-value. After sorting all P-values in ascending order, they are denoted $P_k$ ($k = 1, 2, \ldots, n$) [74], where n is the total number of markers tested in the GWAS. Under the null hypothesis, the P-values follow a uniform distribution, so the expectation of $P_k$ ($EP_k$) can be calculated as $k/(n + 1)$. Like P-values, EP-values are also converted to negative logarithms.
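The calculation of the QQ plot coordinates can be written as a short Python sketch. The function name `qq_points` is hypothetical, and all P-values are assumed to be greater than zero so that the logarithm is defined.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale for a QQ plot."""
    n = len(pvalues)
    points = []
    for k, p in enumerate(sorted(pvalues), start=1):
        expected = k / (n + 1)  # expectation of the k-th smallest P-value
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# P-values that match their expectation exactly fall on the diagonal.
for expected, observed in qq_points([0.5, 0.25, 0.75]):
    print(round(expected, 3), round(observed, 3))
```

Plotting the second coordinate against the first gives the QQ plot; markers above the diagonal are more significant than expected under the null hypothesis.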
5.2 Confidence Interval
A confidence interval gives the range within which a statistic can deviate from its expectation [75]. In GWAS, under the null hypothesis that markers are not associated with the trait, the k-th smallest of n P-values follows a beta distribution, Beta(k, n − k + 1) [76]. After log transformation, the confidence interval of small P-values is wider than that of large P-values. Thus, in the QQ plot, the confidence band widens from left (large P-values) to right (small P-values).
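The widening of the band can be checked numerically. The Python sketch below does not evaluate beta quantiles; as a stand-in that needs only the standard library, it approximates the band by Monte Carlo simulation of uniform P-values under the null hypothesis. The function name, the number of simulations, and the seed are assumptions for illustration.

```python
import math
import random


def simulated_qq_band(n_markers, n_sim=500, level=0.95, seed=1):
    """Approximate the confidence band of sorted null P-values by simulation.

    Returns one (lower, upper) pair on the -log10 scale per rank,
    ordered from the smallest P-value to the largest.
    """
    rng = random.Random(seed)
    per_rank = [[] for _ in range(n_markers)]
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log10(0).
        simulated = sorted(1.0 - rng.random() for _ in range(n_markers))
        for rank, p in enumerate(simulated):
            per_rank[rank].append(p)

    low_index = int((1.0 - level) / 2.0 * n_sim)
    high_index = int((1.0 + level) / 2.0 * n_sim) - 1
    band = []
    for values in per_rank:
        values.sort()
        # A small P-value gives a large -log10 value, so the bounds swap.
        band.append((-math.log10(values[high_index]), -math.log10(values[low_index])))
    return band


band = simulated_qq_band(n_markers=100, n_sim=300)
print("width at smallest P-value:", band[0][1] - band[0][0])
print("width at largest P-value: ", band[-1][1] - band[-1][0])
```

The first width is much larger than the second, in agreement with the shaded area of a QQ plot that is narrow at the left and wide at the right.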
Fig. 4 Two types of Manhattan plots resulting from single-locus and multiple-loci models. The genome-wide association studies were conducted using a single-locus model (MLM) and multiple-loci model (FarmCPU) implemented in GAPIT. The population contains 1940 mice genotyped with 12,000 SNPs. The trait was simulated with 0.75 heritability and 20 genes. The associated markers in a linkage disequilibrium (LD) block appear as a spike (a). The spikes look like the skyscrapers in Manhattan of New York City. In the multiple-loci model, only one marker in an LD block can have a significant P-value (b). The scattered associated markers appear like helicopters flying over Manhattan in Kansas where there are no skyscrapers
5.3 Inflation and Deflation
The median of the observed P-values is expected to equal the median of the expected P-values. Let λ denote the ratio between the median of the observed test statistics (e.g., the chi-square statistics corresponding to the observed P-values) and the median expected under the null hypothesis. Two outcomes can be observed: inflation (λ > 1.1) or deflation (λ < 0.9) [26, 77]. Inflation indicates that most tests are systematically more significant than the expected distribution. Inflation occurs when population stratification and cryptic family relationships are not controlled. Deflation indicates that most of the points are systematically less significant than the expected distribution. A common cause of deflation is that markers are assumed to be independent of each other when they actually are not, which is common when a linkage mapping population is used for GWAS.
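A common way to compute λ is sketched below in Python. The sketch assumes that each P-value comes from a two-sided test equivalent to a chi-square statistic with one degree of freedom, which may not hold for every GWAS model; the function names and the classification labels are hypothetical, while the 1.1 and 0.9 limits are the ones given above.

```python
from statistics import NormalDist, median


def genomic_inflation_factor(pvalues):
    """Median observed chi-square (1 df) divided by its null median.

    Assumption: each P-value corresponds to a two-sided test that is
    equivalent to a chi-square statistic with one degree of freedom.
    """
    normal = NormalDist()
    # The squared normal quantile of p / 2 is the chi-square value for p.
    chi_square = [normal.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    null_median = normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi_square) / null_median


def classify_lambda(lam):
    """Apply the limits used in the text: > 1.1 inflation, < 0.9 deflation."""
    if lam > 1.1:
        return "inflation"
    if lam < 0.9:
        return "deflation"
    return "as expected"


n = 999
null_pvalues = [k / (n + 1) for k in range(1, n + 1)]
for label, pvalues in (
    ("uniform", null_pvalues),
    ("too small", [p ** 2 for p in null_pvalues]),
    ("too large", [p ** 0.5 for p in null_pvalues]),
):
    lam = genomic_inflation_factor(pvalues)
    print(label, round(lam, 3), classify_lambda(lam))
```

The uniform P-values give λ = 1, whereas systematically smaller or larger P-values move λ above 1.1 or below 0.9, respectively.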
6 Other Outputs
Although Manhattan and QQ plots are the major graphs used to present GWAS results, genotype distribution, estimated genetic parameters, phenotype analysis, and population structure also provide necessary and complementary information for interpreting the data and results (Fig. 6).
Fig. 5 Quantile-quantile (QQ) plot of genome-wide association study. The analysis was conducted using FarmCPU on 281 maize inbreds phenotyped for flowering time (days to pollination) and genotyped with 3093 SNPs. The shaded area indicates the 95% confidence interval
6.1 Genotype Analysis
Rare SNPs with low minor allele frequency (MAF) often cause false positives, especially in small populations and when phenotypes do not follow a normal distribution [57, 78]. However, many causal genetic variants are rare [78], so the MAF distribution should be examined. When the MAF of markers is plotted against their P-values, particular attention should be paid to markers with small P-values and small MAF. The frequency of heterozygosity can be calculated for both individuals and markers. A high level of heterozygosity in a few inbred lines may indicate contamination during their development. A high level of heterozygosity across all inbred lines may suggest a problem with genotype calling (Fig. 6a, b).
6.2 Distribution of Missing Genotypes
Similar to heterozygosity, missing genotypes can be analyzed in two directions: individual-wise and marker-wise. Missing genotypes can be imputed using specialized software, such as FILLIN [79], Impute [80], or Beagle [81]. Candidate genes supported only by imputed markers should be selected with caution in a GWAS result, because the genotypes of these markers were inferred from other genotype information.
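The marker-wise and individual-wise summaries of Subheadings 6.1 and 6.2 can be computed as in the following Python sketch. It assumes a small genotype matrix with individuals in rows and markers in columns, coded as 0, 1, or 2 copies of one allele and `None` for a missing call; the function names and the toy matrix are hypothetical.

```python
def marker_summary(genotypes):
    """MAF, heterozygosity, and missing rate for each marker (column)."""
    n_individuals = len(genotypes)
    n_markers = len(genotypes[0])
    summary = []
    for j in range(n_markers):
        calls = [row[j] for row in genotypes if row[j] is not None]
        missing = 1.0 - len(calls) / n_individuals
        if calls:
            allele_freq = sum(calls) / (2 * len(calls))
            maf = min(allele_freq, 1.0 - allele_freq)
            het = sum(1 for c in calls if c == 1) / len(calls)
        else:
            maf = het = None  # nothing to summarize for an all-missing marker
        summary.append({"maf": maf, "het": het, "missing": missing})
    return summary


def individual_summary(genotypes):
    """Heterozygosity and missing rate for each individual (row)."""
    summary = []
    for row in genotypes:
        calls = [c for c in row if c is not None]
        missing = 1.0 - len(calls) / len(row)
        het = sum(1 for c in calls if c == 1) / len(calls) if calls else None
        summary.append({"het": het, "missing": missing})
    return summary


toy_matrix = [
    [0, 1, 2, None],
    [0, 0, 2, 1],
    [2, 1, None, 1],
    [0, 0, 2, 1],
]
for j, stats in enumerate(marker_summary(toy_matrix)):
    print("marker", j, stats)
for i, stats in enumerate(individual_summary(toy_matrix)):
    print("individual", i, stats)
```

Histograms of these values correspond to the kind of heterozygosity distributions shown in Fig. 6a, b, and the missing rates show which individuals or markers depend most heavily on imputation.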
6.3 Kinship
The kinship matrix is visualized via a heat map (Fig. 6c). The x- and y-axes both list the individual taxa in the same order [31, 82]. The shade of color in the squares indicates the level of relationship between each pair of individuals. Some software packages also provide a distribution plot of the kinship values, which helps users to evaluate the clustered relationships in the whole population.
Fig. 6 The other output results in the GWAS. The demonstration data include 1940 individual mice and 12K markers. The genotype heterozygosity of individuals (a) and markers (b) are used to show the heterozygosity distribution. The kinship (c) and principal component (PC) (d) plots are used to reveal the relationship and population structure. (e) is the pairwise LD plot between the most significant marker and its neighboring markers. The heritability plot (f) is estimated by MLM
6.4 Population Structure
Population structure and cryptic relatedness are important sources of spurious association. Population structure is usually fitted as a covariate in GWAS. It can be quantified from genetic markers by two major approaches: STRUCTURE [83] and principal component analysis (PCA) [84]. Different types of graphs are produced by these two methods. The Q matrix from the STRUCTURE software [85] indicates the membership proportions of each individual in the different subpopulations. PCA uses a dimensionality reduction strategy to extract the eigenvalues and eigenvectors from all genetic markers. Pairwise plots of principal components, such as the first against the second principal component (Fig. 6d), are used to display the population structure. The clustering pattern indicates the relative population structure.
6.5 Linkage Disequilibrium Decay
Usually, LD is measured as the r² or D value for pairs of markers in a user-selected segment. The plot shows the distances between pairs of markers in the window and their squared correlation coefficients (Fig. 6e). Another presentation of LD is to plot LD between markers against their distance to show the LD decay over distance. The moving average of adjacent markers is usually calculated using sliding windows whose size can be set. The decreasing trend of the moving average shows the speed of LD decay in the population and is used to estimate the linkage distance. The LD decay can also be used to assess relative evolutionary relationships. Slow decay indicates more closely related species or individuals.
6.6 Heritability Estimation and Phenotype Distribution
Heritability is an important factor in genetic analyses, including GWAS. Low heritability means that only a small proportion of the variability in the trait is explained by genetics, so a GWAS is expected to have low statistical power [86, 87]. Therefore, it is necessary to understand the heritability level before performing GWAS. Heritability is defined as the ratio of genetic variance to total phenotypic variance (Fig. 6f). The phenotype distribution is another factor influencing the statistical power of GWAS. Most GWAS methods assume that the residuals in the entire sample population follow a normal distribution. Illustrating the phenotype distribution helps to identify data structure, outliers, and relationships between traits.
7 Interactive Outputs Using HTML
Compared to static outputs in formats such as PDF or JPG, interactive outputs using HTML provide the opportunity for users to gain additional information. The information on an interactive Manhattan plot (Fig. 7) includes minor allele frequency (MAF), estimated effect, names of neighboring genes, and the proportion of phenotypic variance explained by the marker. This information helps researchers to select candidate genes more efficiently for downstream confirmatory experiments. An interactive Manhattan plot allows the user to select a chromosome, zoom in and out on the whole plot, or filter markers based on a cutoff.
7.1 HTML
Hypertext Markup Language (HTML) is the most widely used markup language designed for web browsers. HTML provides an approach to create structured multimedia documents that can contain images, text, and other interactive objects, which are sub-elements within individual tags. Importantly, HTML can present information upon mouseover of the tags, making documents more responsive and interactive for users. Some R packages (such as lattice [88], rgl, scatterplot3d, plot3d, and pca3d) have used this technology to present 3D PCA plots, making it possible to rotate and zoom in on the whole figure to visualize the internal structure.
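The mouseover mechanism can be demonstrated without any plotting library. The Python sketch below writes a small HTML page with an inline SVG in which every marker is a circle carrying a `<title>` child element, which browsers display as a tooltip when the cursor rests on the point. This is only a stand-in for the idea and is not how GAPIT or plotly generate their output; the function name, the dictionary keys, and the marker values are hypothetical.

```python
import html
import math


def manhattan_html(markers, width=600, height=300):
    """Build a minimal HTML page with hover tooltips for each marker.

    markers: list of dicts with the keys name, x (0 to 1), pvalue, maf, effect.
    """
    # "or 1.0" avoids division by zero when every P-value equals 1.
    max_y = max(-math.log10(m["pvalue"]) for m in markers) or 1.0
    circles = []
    for m in markers:
        y = -math.log10(m["pvalue"])
        cx = 20 + m["x"] * (width - 40)
        cy = height - 20 - y / max_y * (height - 40)
        tip = html.escape(
            f'{m["name"]}: P={m["pvalue"]:.2e}, MAF={m["maf"]}, effect={m["effect"]}'
        )
        circles.append(
            f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="4"><title>{tip}</title></circle>'
        )
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(circles)
        + "</svg>"
    )
    return "<!DOCTYPE html><html><body>" + svg + "</body></html>"


toy_markers = [
    {"name": "SNP_1", "x": 0.2, "pvalue": 1e-6, "maf": 0.31, "effect": 1.8},
    {"name": "SNP_2", "x": 0.7, "pvalue": 0.04, "maf": 0.12, "effect": -0.4},
]
page = manhattan_html(toy_markers)
print(page[:80])
```

Saving `page` to a file with an `.html` extension and opening it in a browser shows two points whose details appear on hover.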
Fig. 7 The interactive Manhattan plot generated by GAPIT. The genome-wide association study was conducted on a trait simulated from 281 maize inbreds genotyped with 3093 SNPs. The trait was controlled by ten genes with heritability of 0.75. The plot is displayed using HTML format. When the cursor is near a point, the corresponding information is displayed for the SNP
7.2 R Library
The R package “plotly” provides a route from R to HTML [89]. Each marker in the Manhattan plot is given a pop-up window containing all of its information. When the cursor hovers over a marker, information such as MAF, P-value, estimated effect, explained variance, and gene name is presented in the pop-up window. Each data type can occupy an individual row in the information block. The same types of data can be incorporated into a QQ plot. Using the plotly R package, these interactive Manhattan and QQ plots have been implemented in the GAPIT software. The plots can be displayed in web browsers, which requires a supporting folder named “library.”
8 Final Remarks
Manhattan plots are the most common output to visually display the associations of genetic markers with traits of interest. Stacking multiple Manhattan plots in the Circle or Cartesian format helps to demonstrate pleiotropy among multiple traits or the overlap among different models. QQ plots are essential to assess the quality and power of the GWAS by displaying the inflation or deflation of P-values and the markers that exceed the expectation. Additional information such as MAF, marker effect estimates, and detailed locations can be displayed as tables, static graphic output, or interactive Manhattan and QQ plots using HTML. Finally, it is critical to visualize the properties of phenotypes and genotypes to assess statistical power and identify sources of spurious association. These properties include marker density, LD decay, MAF, phenotypic clusters and outliers, population structure, heritability, heterozygosity, and missing rate.
Acknowledgments
This project was partially funded by the National Science Foundation of the United States (Award # DBI 1661348 and ISO 2029933), the United States Department of Agriculture - National Institute of Food and Agriculture (Hatch project 1014919, Award #s 2018-70005-28792, 2019-67013-29171, and 2020-67021-32460), the Washington Grain Commission, the United States (Endowment and Award #s 126593 and 134574), the Program of Chinese National Beef Cattle and Yak Industrial Technology System, China (Award # CARS-37), Fundamental Research Funds for the Central Universities, China (Southwest Minzu University, Award # 2020NQN26), and Sichuan Science and Technology Program, China (Award #s 2021YJ0269 and 2021YJ0266).
Author Contributions
Jiabo Wang: software, data curation, writing (original draft preparation), visualization, investigation. Alexander E. Lipka: revision of the manuscript. Jianming Yu: revision of the manuscript. Zhiwu Zhang: conceptualization and revision of the manuscript.
References
1. Dimensionality M, Pan Q, Hu T et al (2013) Genome-wide association studies and genomic prediction, vol 1019. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-447-0
2. Yano K, Yamamoto E, Aya K et al (2016) Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice. Nat Genet 48:927–934. https://doi.org/10.1038/ng.3596
3. Gibson G (2010) Hints of hidden heritability in GWAS. Nat Genet 42:558–560. https://doi.org/10.1038/ng0710-558
4. Lee C-Y, Kim T-S, Lee S et al (2015) Concept of genome-wide association studies. In: Current technologies in plant molecular breeding. Springer, New York, NY, pp 175–204. https://doi.org/10.1007/978-94-017-9996-6_6
5. Bush WS, Moore JH (2012) Chapter 11: Genome-wide association studies. PLoS Comput Biol 8:e1002822. https://doi.org/10.1371/journal.pcbi.1002822
6. Wang MH, Cordell HJ, Van Steen K (2019) Statistical methods for genome-wide association studies. Semin Cancer Biol 55:53–60. https://doi.org/10.1016/j.semcancer.2018.04.008
7. Chen K, Baxter T, Muir WM et al (2007) Genetic resources, genome mapping and evolutionary genomics of the pig (Sus scrofa). Int J Biol Sci 3:153–165
8. Benson AK, Kelly SA, Legge R et al (2010) Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proc Natl Acad Sci 107:18933–18938. https://doi.org/10.1073/pnas.1007028107
9. Andreescu C, Avendano S, Brown SR et al (2007) Linkage disequilibrium in related breeding lines of chickens. Genetics 177:2161–2169. https://doi.org/10.1534/genetics.107.082206
10. Zhu XM, Shao XY, Pei YH et al (2018) Genetic diversity and genome-wide association study of major ear quantitative traits using high-density SNPs in maize. Front Plant Sci 9:1–16. https://doi.org/10.3389/fpls.2018.00966
11. Zhang H, Yin L, Wang M et al (2019) Factors affecting the accuracy of genomic selection for agricultural economic traits in maize, cattle, and pig populations. Front Genet 10:1–10. https://doi.org/10.3389/fgene.2019.00189
12. Pereira HD, Marcelo J, Viana S et al (2018) Relevance of genetic relationship in GWAS and genomic prediction. J Appl Genet 59:1. https://doi.org/10.1007/s13353-017-0417-2
13. Stich B, Melchinger AE (2009) Comparison of mixed-model approaches for association mapping in rapeseed, potato, sugar beet, maize, and Arabidopsis. BMC Genomics 10:94. https://doi.org/10.1186/1471-2164-10-94
14. Benjamini Y, Yekutieli D (2005) Quantitative trait loci analysis using the false discovery rate. Genetics 171:783–790. https://doi.org/10.1534/genetics.104.036699
15. Schork AJ, Thompson WK, Pham P et al (2013) All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs. PLoS Genet 9:e1003449. https://doi.org/10.1371/journal.pgen.1003449
16. Yan G, Qiao R, Zhang F et al (2017) Imputation-based whole-genome sequence association study rediscovered the missing QTL for lumbar number in Sutai pigs. Sci Rep 47:615. https://doi.org/10.1038/s41598-017-00729-0
17. Zhang Z, Ober U, Erbe M et al (2014) Improving the accuracy of whole genome prediction for complex traits using the results of genome wide association studies. PLoS One 9:e0093017. https://doi.org/10.1371/journal.pone.0093017
18. Han Y, Zhao X, Cao G et al (2015) Genetic characteristics of soybean resistance to HG type 0 and HG type 1.2.3.5.7 of the cyst nematode analyzed by genome-wide association mapping. BMC Genomics 16:1–11. https://doi.org/10.1186/s12864-015-1800-1
19. Sukumaran S, Reynolds MP, Sansaloni CP (2018) Genome-wide association analyses identify QTL hotspots for yield and component traits in durum wheat grown under yield potential, drought, and heat stress environments. Front Plant Sci 9:81. https://doi.org/10.3389/fpls.2018.00081
20. Martinez SA, Godoy J, Huang M et al (2018) Genome-wide association mapping for tolerance to preharvest sprouting and low falling numbers in wheat. Front Plant Sci 9:1–16. https://doi.org/10.3389/fpls.2018.00141
21. Saatchi M, Schnabel RD, Taylor JF et al (2014) Large-effect pleiotropic or closely linked QTL segregate within and across ten US cattle breeds. BMC Genomics 15:442. https://doi.org/10.1186/1471-2164-15-442
22. Tan B, Ingvarsson PK (2019) Integrating genome-wide association mapping of additive and dominance genetic effects to improve genomic prediction accuracy in Eucalyptus. BioRxiv 2019:841049. https://doi.org/10.1101/841049
23. Wei X, Zhang J (2016) The genomic architecture of interactions between natural genetic polymorphisms and environments in yeast growth. Genetics 205:genetics.116.195487. https://doi.org/10.1534/genetics.116.195487
24. Vinkhuyzen AAE, Pedersen NL, Yang J et al (2012) Common SNPs explain some of the variation in the personality dimensions of neuroticism and extraversion. Transl Psychiatry 2:e125. https://doi.org/10.1038/tp.2012.49
25. Chen CY, Misztal I, Aguilar I et al (2011) Genome-wide marker-assisted selection combining all pedigree phenotypic information with genotypic data in one step: an example using broiler chickens. J Anim Sci 89:23–28. https://doi.org/10.2527/jas.2010-3071
26. Gusev A, Bhatia G, Zaitlen N et al (2013) Quantifying missing heritability at known GWAS loci. PLoS Genet 9:e1003993. https://doi.org/10.1371/journal.pgen.1003993
27. Eichler EE, Flint J, Gibson G et al (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11:446–450. https://doi.org/10.1038/nrg2809
28. Bradbury PJ, Zhang Z, Kroon DE et al (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633–2635. https://doi.org/10.1093/bioinformatics/btm308
29. Yang J, Lee SH, Goddard ME et al (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88:76–82. https://doi.org/10.1016/j.ajhg.2010.11.011
30. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575
31. Tang Y, Liu X, Wang J et al (2016) GAPIT Version 2: an enhanced integrated tool for genomic association and prediction. Plant Genome 9(2):1. https://doi.org/10.3835/plantgenome2015.11.0120
32. Kaur S, Zhang X, Mohan A et al (2017) Genome-wide association study reveals novel genes associated with culm cellulose content in bread wheat (Triticum aestivum, L.). Front Plant Sci 8:1–7. https://doi.org/10.3389/fpls.2017.01913
33. Hickey JM (2013) Genome-wide association studies and genomic prediction, vol 1019. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-447-0
34. Hayes B (2013) Genome-wide association studies and genomic prediction, vol 1019. Humana Press, Totowa, NJ, pp 149–169. https://doi.org/10.1007/978-1-62703-447-0
35. Ziegler A, König IR, Thompson JR (2008) Biostatistical aspects of genome-wide association studies. Biom J 50:8. https://doi.org/10.1002/bimj.200710398
36. Almli LM, Duncan R, Feng H et al (2014) Correcting systematic inflation in genetic association tests that consider interaction effects: application to a genome-wide association study of posttraumatic stress disorder. JAMA Psychiatry 71:1392–1399. https://doi.org/10.1001/jamapsychiatry.2014.1339
37. Yu J, Buckler ES (2006) Genetic association mapping and genome organization of maize. Curr Opin Biotechnol 17:155–160. https://doi.org/10.1016/j.copbio.2006.02.003
38. Gianola D, De Los CG, Hill WG et al (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183:347–363. https://doi.org/10.1534/genetics.109.103952
39. Evangelou E, Ioannidis JPA (2013) Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet 14:379–389. https://doi.org/10.1038/nrg3472
40. González-Camacho JM, de Los CG, Pérez P et al (2012) Genome-enabled prediction of genetic values using radial basis function neural networks. Theor Appl Genet 125:759–771. https://doi.org/10.1007/s00122-012-1868-9
41. Huang M, Liu X, Zhou Y et al (2018) BLINK: a package for the next level of genome-wide association studies with both individuals and markers. Gigascience 8:1–12. https://doi.org/10.1093/gigascience/giy154
42. Liu X, Huang M, Fan B et al (2016) Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet 12:e1005767. https://doi.org/10.1371/journal.pgen.1005767
43. Segura V, Vilhjálmsson BJ, Platt A et al (2012) An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet 44:825–830. https://doi.org/10.1038/ng.2314
44. Kammerer S, Roth RB, Reneland R et al (2004) Large-scale association study identifies ICAM gene region as breast and prostate cancer susceptibility locus. Cancer Res 64:8906–8910. https://doi.org/10.1158/0008-5472.CAN-04-1788
45. Yin L, Zhang H, Tang Z et al (2020) rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. BioRxiv
46. Turner S (2018) qqman: an R package for visualizing GWAS results using Q-Q and Manhattan plots. J Open Source Softw 3:371. https://doi.org/10.1101/005165
47. Barrett JC, Fry B, Maller J et al (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263. https://doi.org/10.1093/bioinformatics/bth457
48. Brown PJ, Upadyayula N, Mahone GS et al (2011) Distinct genetic architectures for male and female inflorescence traits of maize. PLoS Genet 7:e1002383. https://doi.org/10.1371/journal.pgen.1002383
49. Ma J, Iannuccelli N, Duan Y et al (2010) Recombinational landscape of porcine X chromosome and individual variation in female meiotic recombination associated with haplotypes of Chinese pigs. BMC Genomics 11:159. https://doi.org/10.1186/1471-2164-11-159
50. Kover PX, Valdar W, Trakalo J et al (2009) A multiparent advanced generation inter-cross to fine-map quantitative traits in Arabidopsis thaliana. PLoS Genet 5:e1000551. https://doi.org/10.1371/journal.pgen.1000551
51. Buckler ES, Holland JB, Bradbury PJ et al (2009) The genetic architecture of maize flowering time. Science 325:714–718. https://doi.org/10.1126/science.1174276
52. Tian F, Bradbury PJ, Brown PJ et al (2011) Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat Genet 43:159–162. https://doi.org/10.1038/ng.746
53. Lipka AE, Tian F, Wang Q et al (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics 28:2397–2399. https://doi.org/10.1093/bioinformatics/bts444
54. Piepho HP (2001) A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics 157:425–432
55. Connolly S, Heron EA (2014) Review of statistical methodologies for the detection of parent-of-origin effects in family trio genome-wide association data with binary disease traits. Brief Bioinform 16:429–448. https://doi.org/10.1093/bib/bbu017
GWAS Outputs 56. Churchill GA, Doerge RW (2008) Naive application of permutation testing leads to inflated type I error rates. Genetics 178:609–610. https://doi.org/10.1534/genetics.107. 074609 57. de Bakker PIW, Yelensky R, Pe´er I et al (2005) Efficiency and power in genetic association studies. Nat Genet 37:1217–1223. https:// doi.org/10.1038/ng1669 58. Ganjgahi H, Winkler AM, Glahn DC et al (2018) Fast and powerful genome wide association of dense genetic data with high dimensional imaging phenotypes. Nat Commun 9: 3254. https://doi.org/10.1038/s41467018-05444-6 59. Chen CW, Yang HC (2019) OPATs: omnibus P-value association tests. Brief Bioinform 20: 1–14. https://doi.org/10.1093/bib/bbx068 60. Bonferroni CE (1936) Teoria statistica delle classi e calcolo delle probabilita`. Pubbl Del R Ist Super Di Sci Econ e Commer Di Firenze 61. Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56:52. https://doi. org/10.2307/2282330 62. Ingvordsen CH, Backes G, Lyngkjær MF et al (2015) Genome-wide association study of production and stability traits in barley cultivated under future climate scenarios. Mol Breed 35: 84. https://doi.org/10.1007/s11032-0150283-8 63. Simes RJ (1986) A improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751–754 64. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300. https://doi.org/10.2307/ 2346101 65. Zhao F, McParland S, Kearney F et al (2015) Detection of selection signatures in dairy and beef cattle using high-density genomic information. Genet Sel Evol 47:49. https://doi. org/10.1186/s12711-015-0127-3 66. Doerge RW, Churchill GA (1996) Permutation tests for multiple loci affecting a quantitative character. Genetics 142:285–294. https:// doi.org/10.1111/j.1369-7625.2010. 00632.x 67. 
Phipson B, Smyth GK (2010) Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat Appl Genet Mol Biol 9:39. https://doi.org/10.2202/1544-6115.1585 68. Wall JD, Pritchard JK (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet 4:587–597. https:// doi.org/10.1038/nrg1123
79
69. De La Vega FM, Isaac H, Collins A et al (2005) The linkage disequilibrium maps of three human chromosomes across four populations reflect their demographic history and a common underlying recombination pattern. Genome Res 15:454–462. https://doi.org/ 10.1101/gr.3241705 70. Lipka AE, Kandianis CB, Hudson ME et al (2015) From association to prediction: statistical methods for the dissection and selection of complex traits in plants. Curr Opin Plant Biol 24:110–118. https://doi.org/10.1016/j.pbi. 2015.02.010 71. Jernigan KL, Godoy JV, Huang M et al (2018) Genetic dissection of end-use quality traits in adapted soft white winter wheat. Front Plant Sci 9:271. https://doi.org/10.3389/fpls. 2018.00271 72. Hu G, Li Z, Lu Y et al (2017) Genome-wide association study identified multiple genetic loci on chilling resistance during germination in maize. Sci Rep 7:1–11. https://doi.org/10. 1038/s41598-017-11318-6 73. Pleil JD (2016) QQ-plots for assessing distributions of biomarker measurements and generating defensible summary statistics. J Breath Res 10:035001. https://doi.org/10.1088/ 1752-7155/10/3/035001 74. Wilk MB, Gnanadesikan R (1968) Probability plotting methods for the analysis of data. Biometrika 55:1. https://doi.org/10.1093/bio met/55.1.1 75. Neyman J (1937) Outline of a theory of statistical estimation based on the classical theory of probability. Phil Trans R Soc London Ser A Math Phys Sci 236:333. https://doi.org/10. 1098/rsta.1937.0005 76. Robinson GK (1975) Some counterexamples to the theory of confidence intervals. Biometrika 62:155. https://doi.org/10.2307/ 2334498 77. Holland D, Fan CC, Frei O et al (2017) Estimating inflation in GWAS summary statistics due to variance distortion from cryptic relatedness. BioRxiv. https://doi.org/10.1101/ 164939 78. Lee S, Abecasis GR, Boehnke M et al (2014) Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 95:5. https://doi.org/10.1016/j.ajhg.2014.06.009 79. 
Swarts K, Li H, Romero Navarro JA et al (2014) Novel methods to optimize genotypic imputation for low‐coverage, next‐generation sequence data in crop plants. Plant Genome 7: 1 – 1 2 . h t t p s : // d o i . o r g / 1 0 . 3 8 3 5 / p l a n tgenome2014.05.0023
80
Jiabo Wang et al.
80. Howie B, Fuchsberger C, Stephens M et al (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44:955–959. https:// doi.org/10.1038/ng.2354 81. Ayres DL, Darling A, Zwickl DJ et al (2012) BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst Biol 61:170. https://doi.org/10.1093/sysbio/syr100 82. Zhang Z, Buckler ES, Casstevens TM et al (2009) Software engineering the mixed model for genome-wide association studies on large samples. Brief Bioinform 10:664–675. https:// doi.org/10.1093/bib/bbp050 83. Raj A, Stephens M, Pritchard JK (2014) FastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197:573. https://doi.org/10.1534/genetics. 114.164350 84. Duan F, Ogden D, Xu L et al (2013) Principal component analysis of canine hip dysplasia phenotypes and their statistical power for genomewide association mapping. J Appl Stat 40: 2 3 5 – 2 5 1 . h t t p s : // d o i . o r g / 1 0 . 1 0 8 0 / 02664763.2012.740617
85. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959. https://doi.org/10.1111/j.1471-8286.2007. 01758.x 86. Saatchi M, Miraei-Ashtiani SR, Nejati Javaremi A et al (2010) The impact of information quantity and strength of relationship between training set and validation set on accuracy of genomic estimated breeding values. African. J Biotechnol 9:438–442. https://doi.org/10. 5897/AJB09.1024 87. Daetwyler HD, Pong-Wong R, Villanueva B et al (2010) The impact of genetic architecture on genome-wide evaluation methods. Genetics 185:1021–1031. https://doi.org/10.1534/ genetics.110.116855. genetics.110.116855 [pii] 88. Cheshire J (2009) Lattice: multivariate data visualization with R. J Stat Softw Bk Rev 25(2). https://doi.org/10.1111/j.1467985x.2009.00624_12.x 89. Carson S, Chris P, Toby H, et al (2016) plotly: create interactive web graphics via “plotly. js.” R Packag Version
Part II Phenotypic Data for GWAS
Chapter 6

Preparation and Curation of Multiyear, Multilocation, Multitrait Datasets

Amina Abed and Zakaria Kehel

Abstract

Genome-wide association studies (GWAS) are a powerful approach to dissect genotype-phenotype associations and identify causative regions. However, this power is highly influenced by the accuracy of the phenotypic data. To obtain accurate phenotypic values, the phenotyping should be achieved through multienvironment trials (METs). In order to avoid any technical errors, the required time needs to be spent on exploring, understanding, curating and adjusting the phenotypic data in each trial before combining them using an appropriate linear mixed model (LMM). The LMM is chosen to minimize as much as possible any effect that can lead to misestimation of the phenotypic values. The purpose of this chapter is to explain a series of important steps to explore and analyze data from METs used to characterize an association panel. Two datasets are used to illustrate two different scenarios.

Key words Genotype–phenotype association, Multienvironment trials, Experimental design, Descriptive statistics, Design diagnostics, Analysis of residuals, Outliers, Linear mixed model, Raw phenotype per trial, Adjusted phenotype per trial, Combined phenotype across trials, Genotype environment
1 Introduction

The power and the accuracy of genome-wide association studies (GWAS) in detecting true phenotype-genotype associations are highly dependent on the quality of the phenotypic data and the statistical model used to analyze them. Phenotypic trials are routinely executed in experimental designs where lines (genotypes) are evaluated in multiple or single replicated plots. Most of the time, those trials are repeated across a range of locations and years, known as multienvironment trials (METs). Different experimental designs are used for different trials depending on the availability of seed, human and material resources. This can lead to unbalanced datasets and missing data throughout the experiment. The availability of accurate phenotypic data is critical, and their preparation is not a one-person mission and should be done in
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
consultation between biostatisticians, geneticists, and breeders. However, as GWAS often involve a relatively large and diverse population, the collection of phenotypic data across multiple environments is challenging and some technical errors can arise [1]. Therefore, before the data are integrated into a representative phenotypic value per line, each trial (environment) is analyzed individually and its quality is assessed in order to detect any inconsistencies and, should it be necessary, to curate the data. Sufficient time needs to be spent exploring and understanding the structure of the data and the nature of the traits in order to choose a suitable statistical approach. Errors resulting in outliers must be distinguished from true variation, which needs to be preserved. The phenotypic data should be filtered for outliers that can shift them away from a normal distribution, a key prerequisite for GWAS statistical models [2]. An appropriate linear mixed model (LMM) is fitted to dissect, as much as possible, the factors explaining the phenotypic variation [3].

The objective of METs is to evaluate the genetic value of lines and distinguish genetic effects from environmental variability [4]. Two kinds of variation could explain the observed environmental variability: microenvironmental and macroenvironmental variability [5]. Microenvironmental variability arises from factors such as soil heterogeneity, agricultural practices, and intraenvironmental conditions affecting the performance of a line, and it contributes to spatial variability [6, 7]. The choice of an appropriate experimental design and spatial modeling are the best approaches to control microenvironmental variability [8, 9]. Macroenvironmental variability exists when the environmental conditions (soil type, climate, and management practices) are heterogeneous. This macroenvironmental variability can give rise to genotype by environment interaction (GxE). In the absence of GxE, the relative performance of a line is the same in all environments, whereas GxE is present when the performance of particular lines varies depending on environmental conditions [10, 11]. GxE should be considered in order to increase the accuracy of the estimated phenotypic value of a line. An appropriate LMM is the best approach to control macro- but also microenvironmental variability.

A stage-wise procedure for the integration of the phenotypic data from METs can be adopted [12, 13]. From a theoretical standpoint, the best analytical approach is a single-stage analysis, where the curated phenotypic data by trial are directly fitted in a sophisticated GWAS model in order to be adjusted, combined, and associated with the marker information. However, this approach can be time consuming, computationally demanding, and sometimes impossible. In contrast, multistage approaches are simpler and more computationally efficient, where the curated data are first adjusted and combined
across trials before their use as the phenotypic information in the GWAS model.

The purpose of this chapter is to guide researchers running GWAS in the preparation and curation of multienvironment datasets. We also aim to provide guidance in the estimation of phenotypic values per line and per trait, which are then used in the association model. However, this chapter does not provide an exhaustive description of statistical procedures. The first part of Subheading 2 summarizes steps to explore the raw data on a per-trial basis, to facilitate the detection and curation of errors and inconsistencies. The second part describes some major approaches, from the simplest to the most complex, for combining multienvironment phenotypic data for GWAS. The third part briefly describes how to deal with multitrait data.

The phenotypic data used in GWAS can come from two scenarios. Phenotypic trials can be conducted for the specific purpose of characterizing an association panel in different environments. Alternatively, many studies have retrieved phenotypic data from large and dense historical datasets originally obtained in the context of regional registration and/or breeding trials. Two example datasets for grain yield, one for each of these two scenarios, are used to illustrate some important aspects addressed in this chapter. The first scenario is represented by data from the International Center for Agricultural Research in the Dry Areas (ICARDA), where a barley collection of 320 lines was tested in eight environments in an alpha-lattice design with two blocks. The data in the second scenario come from a breeding program at Université Laval (UL) and comprise 120 barley lines for which data were extracted from registration trials carried out in ten environments in a Randomized Complete Block Design (RCBD) with two blocks. Some details about the different sections are provided in Subheading 3.
2 Methods
2.1 Exploration, Curation, and Adjustment of the Raw Phenotypes

2.1.1 Descriptive Statistics and Design Diagnostics
Descriptive statistics are used to summarize, visualize, and explore the phenotypic data per trial. The main descriptive statistics include the mean, median, range (maximum/minimum), standard deviation (SD), and coefficient of variation (CV). All of these can be computed using basic functions in R [14]: mean, median, summary, sd. These statistics allow us to assess the quality of the data and to detect problematic trials. A low mean and a narrow range of values can indicate a failed trial, for example unsuccessful artificial inoculation in a pathogen resistance trial. A divergent mean and median can indicate a nonnormal distribution of the data. The CV must be interpreted with care because it is highly dependent on the level of the mean and SD. Trials affected by uncontrolled factors often show high CV values because of a low
Table 1 Major descriptive statistics of grain yield (t/ha) for two datasets, one obtained from a series of orthogonal trials performed for the purpose of GWAS (ICARDA) while the other is derived from previous trials originally performed in the context of line testing or registration (UL). Trials exhibiting a relatively high CV (>30%), marked with an asterisk, should be regarded with caution

Trial    Number of lines   Minimum   Mean   Median   Maximum   SD    CV (%)
ICARDA dataset
AN18     318               0.8       2.6    2.6      4.2       0.6   24.9
AN19     320               0.5       1.2    1.1      2.3       0.3   25.6
AN20     320               1.0       3.4    3.5      5.0       0.7   19.7
MC18     319               0.7       2.3    2.3      3.6       0.5   19.7
MC19     320               0.3       2.6    2.7      4.2       0.6   24.3
MC20     320               1.7       3.7    3.7      5.4       0.7   18.2
SE18     318               1.8       2.5    2.6      3.3       0.3   11.4
TE18*    318               0.4       2.6    2.6      5.1       0.9   38.2
UL dataset
LP12     46                2.1       3.2    3.2      4.0       0.4   13.2
SA11     26                2.5       4.3    4.3      6.0       0.9   20.1
SA12*    48                1.2       4.3    4.3      6.8       1.3   31.1
SA13     74                1.7       4.5    4.7      7.0       1.2   26.2
SF12     116               3.2       5.3    5.3      7.2       0.9   17.7
SF13     36                1.2       2.7    2.9      4.0       0.7   27.9
SH07     36                3.0       4.5    4.6      5.7       0.7   16.0
SH12     112               3.3       5.1    5.1      6.6       0.7   13.9
SH13     74                1.7       3.3    3.2      5.5       0.7   22.1
SR07     36                2.1       4.0    4.0      5.7       0.9   23.4
mean. A very high CV can suggest a high occurrence of errors that exaggerate the SD. Descriptive statistics from the UL and ICARDA datasets are displayed in Table 1. No clear evidence of a problematic pattern arises from these statistics. Nonetheless, some trials exhibited a relatively low mean or a high CV (>30%) and should be regarded with caution (TE18 in the ICARDA dataset and SA12 in the UL dataset). The visual exploration of the ICARDA and UL datasets, as displayed in Figs. 1 and 2, allows the identification of strange or problematic patterns across environments/lines and of obvious outliers (details in Subheading 2.1.5) before starting the GxE analysis. The UL dataset is highly unbalanced and unconnected;
[Figure 1. Panels: ICARDA dataset, UL dataset. x-axis: Environments; y-axis: Grain yield (t/ha).]
Fig. 1 Distribution of grain yield (t/ha) per trial in the ICARDA and the UL datasets. The ICARDA dataset was obtained from a series of orthogonal trials performed in view of doing a GWAS while the UL dataset is derived from performance trials previously conducted to compare cultivars and advanced breeding lines
[Figure 2. Panels: ICARDA dataset, UL dataset. x-axis: Lines; y-axis: Grain yield (t/ha).]
Fig. 2 Distribution of grain yield (t/ha) per line in the ICARDA and the UL datasets. The ICARDA dataset was obtained from a series of orthogonal trials performed in view of doing a GWAS while the UL dataset is derived from performance trials previously conducted to compare cultivars and advanced breeding lines
not all the lines were tested in all environments. In comparison, the dataset from ICARDA is balanced and GxE can be assessed in detail (as addressed in Subheading 2.2). This visual exploration indicates that the phenotypic variability between the trials varies widely in both datasets (Fig. 1). The variability between the lines seems larger in the UL trials than in the ICARDA trials, as illustrated in Fig. 2, which probably reflects the nature of the materials in each dataset and the unbalanced nature of the UL dataset.
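The per-trial statistics discussed above can be computed with base R alone. The following is a minimal sketch on simulated data; the data frame `pheno`, its columns `trial` and `yield`, the trial names, and the helper `describe` are all hypothetical names introduced for illustration, not objects from the ICARDA or UL datasets.

```r
# Toy data: two hypothetical trials (T1, T2) with simulated grain yield (t/ha)
set.seed(42)
pheno <- data.frame(
  trial = rep(c("T1", "T2"), each = 50),
  yield = c(rnorm(50, mean = 2.6, sd = 0.6), rnorm(50, mean = 4.3, sd = 1.3))
)

# Summary statistics for one numeric vector (missing values are dropped)
describe <- function(x) {
  x <- x[!is.na(x)]
  c(n      = length(x),
    min    = min(x),
    mean   = mean(x),
    median = median(x),
    max    = max(x),
    sd     = sd(x),
    cv_pct = 100 * sd(x) / mean(x))
}

# One row per trial, one column per statistic
desc <- t(sapply(split(pheno$yield, pheno$trial), describe))
print(round(desc, 2))

# Trials whose CV exceeds 30% deserve a closer look
flagged <- rownames(desc)[desc[, "cv_pct"] > 30]
```

A call such as `boxplot(yield ~ trial, data = pheno)` would then give a per-trial view of the distributions comparable in spirit to Fig. 1.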
[Figure 3. Panels: MC18, AN18. Axes: Rows, Columns.]
Fig. 3 Field map of two contrasted trials from the ICARDA dataset (MC18 and AN18). The ICARDA dataset consisted of a series of orthogonal trials performed to characterize an association panel of 320 lines in an alpha-lattice design with two blocks. The numbers are the tested lines (from 1 to 320) and the letters are the checks. The rows and the columns represent the field layout, where rows from 1 to 14 represent block 1 and rows from 15 to 28 form block 2
Fig. 4 Heatmap of the pairwise connectivity (number of lines tested jointly in a pair of trials) for trials in the ICARDA and UL datasets. The ICARDA dataset was obtained from a series of orthogonal trials performed for the purpose of GWAS and the UL dataset is derived from previous trials performed for line testing
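Pairwise connectivity as defined in the caption of Fig. 4 (the number of lines tested jointly in a pair of trials) can be obtained from a lines-by-trials incidence matrix. The sketch below uses base R only; the data frame `obs` and its line and trial labels are made up for illustration.

```r
# Toy records: which (hypothetical) lines were tested in which trials
obs <- data.frame(
  line  = c("L1", "L2", "L3", "L1", "L2", "L1", "L4", "L1"),
  trial = c("E1", "E1", "E1", "E2", "E2", "E3", "E3", "E1")
)

# Count matrix (lines x trials), stripped of its "table" class
tab <- unclass(table(obs$line, obs$trial))

# Presence/absence: a line replicated within a trial counts once
inc <- (tab > 0) * 1

# Trials x trials matrix: off-diagonal cells hold the number of shared lines,
# diagonal cells the number of lines tested in each trial
conn <- crossprod(inc)
print(conn)
```

Passing `conn` to `heatmap(conn, Rowv = NA, Colv = NA, scale = "none")` from the "stats" package is one way to draw a connectivity heatmap; low off-diagonal counts point to poorly connected pairs of trials.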
The design diagnostic is an important step in analyzing data from METs. It permits the validation of the design and the investigation of the design components, such as replication, blocks, and incomplete blocks, of the layout of the experiment in the field, and of the distribution of the phenotype across the field. Additionally, it assesses any potential postblocking due to trial management (irrigation, fertilization), harvest patterns, or noncontrolled effects (e.g., animal interactions with the trial). The field maps of two contrasting trials are presented in Fig. 3. The field maps display the randomization pattern of the lines across the two blocks (block 1 from rows 1 to 14 and block 2 from rows 15 to 28). In the MC18 trial, the grain yield distribution shows a strong block effect, as block 2 has much lower values than block 1. There is also a high probability of identifying outliers in block 1. In the field map of the AN18 trial, grain yield seems well distributed along blocks, rows, and columns, and there is no indication of outliers.

Assessing the connectivity between trials is a crucial step for a better evaluation of GxE, as the ability to model GxE depends on connectivity. The connectivity between trials is displayed as the number of lines jointly tested in two trials. The ICARDA dataset is highly connected and balanced, which means that the majority of the lines were assessed in the eight environments (Fig. 4), so that the GxE variation can be fully assessed. This is not the case for the UL dataset, which is typically unbalanced with very low connectivity between the environments. The historical data from UL are far from optimal for assessing GxE. A corresponding model cannot be fitted or, if it is, it may lead to a false estimate of GxE. Lines would then receive an incorrect yield estimate because they were not tested in all environments.

2.1.2 Normality of the Data Distribution
In association studies, we use a population of unrelated lines and usually we analyze one or more quantitative traits, which is the case for most important traits in plant breeding. Therefore, the distribution of the phenotype is expected to follow a normal distribution. Additionally, when we are fitting a linear model such as in GWAS, there is an implicit assumption that the normality of the distribution needs to be respected; otherwise the statistical power may be reduced [2]. However, several distributions are now handled by LMM or GLM (generalized linear model) in many programs (e.g., the “GENESIS” R package [15]). Qualitative traits and binary traits (0/1) have a nonnormal distribution and it is possible to analyze them with logistic mixed models. The distribution of the data needs to be investigated and we may need to correct for any violation of this assumption if the program is designed to handle normal distributions only. Histograms provide an indication of potential asymmetry of the
distribution curve (hist function in the “graphics” package [14] and skewness function [16]). The Shapiro–Wilk test (shapiro.test function in the “stats” package [14]) provides a statistical test of the normality of the data distribution (see Note 1). A normal probability plot or quantile-quantile (QQ) plot (qqnorm/qqline functions in the “stats” package [14]) provides information on the distribution of data by comparing two probability distributions (expected and observed) (see Note 2).

When normality is assumed, a deviation from this assumption needs to be corrected. We first need to address the reasons for the lack of normality. The presence of outliers can result in a distorted distribution and removing them can normalize the distribution of the data (the outlier analysis will be addressed in detail below). The design diagnostic can help identify issues with trials or postblocking parameters that could have created a bimodal distribution of the data that seems nonnormal; fitting an appropriate LMM should handle this problem. When no reason can explain a nonnormal distribution, one can resort to mathematical transformations of the raw data. The appropriate transformation depends on the direction of the asymmetry and the nature of the data (logarithmic, square root, angular, reciprocal, by power scale, by rank and logit), and many of these transformations can be performed in R [14]. The Box–Cox procedure automatically identifies the best transformation for positive values (boxcox function in the R package “MASS” [17] or BoxCox.lambda function in the R package “forecast” [18]). The normality of transformed data also needs to be checked. However, it is recommended to use a model that can handle different distributions rather than transform the data. Mathematical transformations complicate the analysis and the interpretation because the relationship between the scale of transformed and raw data is not always obvious.
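As a rough illustration of the checks described above, the sketch below applies the Shapiro–Wilk test and the Box–Cox procedure to simulated, right-skewed data. The vector `y` is hypothetical, and the lambda grid passed to `boxcox` is an arbitrary choice made for this example.

```r
library(MASS)  # provides boxcox()

set.seed(7)
# Simulated right-skewed, strictly positive trait values (hypothetical data)
y <- rlnorm(200, meanlog = 1, sdlog = 0.5)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw_raw <- shapiro.test(y)

# Box-Cox profile likelihood over a grid of lambda values, without plotting
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Apply the power transformation (natural log when lambda is zero)
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat

# The transformed data must be checked again
sw_bc <- shapiro.test(y_bc)
print(c(lambda = lambda_hat, p_raw = sw_raw$p.value, p_bc = sw_bc$p.value))
```

For the graphical checks, `hist(y)` and `qqnorm(y); qqline(y)` draw the histogram and the normal QQ plot of the same vector.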
The distribution of the data from the ICARDA and UL datasets was investigated, and no transformation was needed to correct the data. An example of these data distributions is presented in Fig. 5; it displays the case of a skewed distribution in the trial MC18 due to the presence of a single outlier and a normal distribution in the trial AN18.

2.1.3 Fitting a Linear Mixed Model per Trial and Estimating the Adjusted Phenotypic Values
An LMM describes the relationship between a response variable and explanatory variables. It allows the combination of fixed and random effects, correlation, and heterogeneous variance between the line observations. It is used to model the phenotypic data and to separate the genetic effect from the nongenetic effects due to the design and random residuals [19]. Experiments are randomized into blocks and incomplete blocks (strata) and arranged in the field as rows and columns (grid). We need to model the strata arising from the design and randomization process. When the
[Figure 5. Panels: MC18, AN18. Histograms: Grain Yield (x-axis) vs. Frequency (y-axis). Normal QQ plots: Theoretical quantiles (x-axis) vs. Observed quantiles (y-axis).]
Fig. 5 Histogram and Normal QQ plot of two different trials from ICARDA dataset (MC18 and AN18). The ICARDA dataset consisted of a series of orthogonal trials performed to characterize an association panel
grid information is available, spatial modeling is performed by fitting a spatial covariance structure together with the other design terms [7]. In our examples, the response variable, which is grain yield, is explained by a sum of explanatory variables such as the genotype (line) effects and the block/incomplete block effects (other categorical variables can be added as trial effects: rows, columns, experimenter), and the remaining unaccounted-for variation is the residual effect. The basic models for an alpha-lattice design and RCBD can be formulated, respectively, as:
$$Y_{ijk} = \mu + \mathrm{Line}_i + \mathrm{Block}_j + (\mathrm{IB}_k)\mathrm{Block}_j + e_{ijk}$$

$$Y_{ij} = \mu + \mathrm{Line}_i + \mathrm{Block}_j + e_{ij}$$

where $Y_{ijk}$ is the observed phenotype of the $i$th genotype in the $k$th incomplete block within the $j$th block, and $Y_{ij}$ is the observed phenotype of the $i$th genotype in the $j$th block, whose effect is considered random. The term $\mu$ is the grand mean of the trial, $\mathrm{Line}_i$ is the effect of the $i$th genotype, considered fixed or random depending on the purpose of the analysis and the nature of the population, and $e_{ijk}$/$e_{ij}$ are the random residual effects, assumed to be normally distributed around zero (see Note 3).

An analysis of the different sources of variance is used to produce a meaningful model that best fits the data (i.e., minimizes the residual effect). Importantly, it can be determined whether each explanatory variable should be retained in the model, and the phenotype can be adjusted for trial effects and field heterogeneity. The LMM can be fitted with many R-based programs such as “ASReml-R” [20], “lme4” [21], “sommer” [22], and “nlme” [23]. “ASReml-R” is a commercial but powerful statistical tool that runs LMM analyses for breeding trials; it allows large and complex datasets to be analyzed efficiently [20]. The “MrBean” tool is an easy-to-use R-Shiny web app suitable for the analysis of METs [24]. The variance components and significance levels are examined, and a trial should be eliminated if the effect of lines is not significant (P > 0.05) or if the broad-sense heritability (H2) is very low ([...] 0.05 or H2 < 0.20) that should be regarded with caution
Trial   P-value genotype   Variance genotype   Variance residual   Variance block   rRow   rCol   H2
ICARDA dataset
AN18
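To make the RCBD model of Subheading 2.1.3 concrete, the following sketch fits it to a simulated trial with the “nlme” package, one of the programs named in that subheading. This is only an illustration under stated assumptions: the data frame `d`, the simulated effect sizes, and the choice to treat lines as fixed and blocks as random are made for this example and do not reproduce the authors' analysis.

```r
library(nlme)  # provides lme(), fixef(), VarCorr()

set.seed(123)
# Simulated balanced RCBD: 20 hypothetical lines in 2 blocks
d <- expand.grid(line = factor(1:20), block = factor(1:2))
line_eff  <- rnorm(20, sd = 0.5)   # simulated genotype effects
block_eff <- c(-0.3, 0.3)          # simulated block effects
d$yield <- 3 + line_eff[as.integer(d$line)] +
  block_eff[as.integer(d$block)] + rnorm(nrow(d), sd = 0.2)

# Y_ij = mu + Line_i + Block_j + e_ij with lines fixed and blocks random;
# "0 + line" returns one adjusted mean per line instead of contrasts
fit <- lme(yield ~ 0 + line, random = ~ 1 | block, data = d)

adj_means <- fixef(fit)   # adjusted phenotypic value per line
print(VarCorr(fit))       # block and residual variance components
res <- resid(fit)         # residuals, for later diagnostic checks
```

In this balanced toy design the adjusted means coincide with the raw line means; the two differ once the design is incomplete or unbalanced, which is where the mixed model earns its keep.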
> dotplot(data1$perimeter)
> plot(data1$perimeter)
3.2.3 Detecting Outliers Using Percentiles
With the percentile method, all values lying outside the interval formed by a lower and an upper percentile are considered potential outliers. The percentiles used to construct the interval can be 2.5/97.5, 1/99, or 5/95. As an example, a 5/95 interval is used:
> lower_perc <- quantile(data1$perimeter, 0.05)
> lower_perc
5%
 2
> upper_perc <- quantile(data1$perimeter, 0.95)
> upper_perc
  95%
14.75
All values below 2 and above 14.75 will be considered as putative outliers:
> outlier_geno <- which(data1$perimeter < lower_perc | data1$perimeter > upper_perc)
> outlier_geno
[1] 16 29 46
The values correspond to rows 16, 29, and 46.
> data1[outlier_geno, "perimeter"]
# A tibble: 3 x 1
1 15
2 15
3 60

With this command it is possible to print the values.
Fig. 3 Representation of outliers using histogram and dot plots. (a) Histogram showing the outlier bar on the right; (b) Dot Plot and (c) scatterplot emphasizing the entry numbers and the outlier
3.2.4 Detecting Outliers Using Statistics
There are also statistical methods to detect outliers. The most common are Grubbs’s, Dixon’s, and Rosner’s tests [10]. This section does not aim to provide details of each method, since the literature on the topic is extensive. As an example, we illustrate Rosner’s test, which is more suitable for detecting several outliers at once.
> install.packages("EnvStats")
> library(EnvStats)
> test <- rosnerTest(data1$perimeter, k = 3)
> test
Several basic statistical properties of the dataset are stored in the $all.stats component of the result, which holds the main table of results.
> test$all.stats
  i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
1 0   7.89 8.83    60      46  5.89       3.09    TRUE
2 1   6.73 4.08    15      16  2.02       3.08   FALSE
3 2   6.54 3.93    15      29  2.15       3.07   FALSE
In this case, observation no. 46 (value 60) is confirmed as an outlier, whereas the other two values previously detected (see Subheading 3.2.3) cannot be considered outliers.
3.3 Data Analysis
3.3.1 ANOVA
Once the clean dataset is established, it is possible to proceed with statistical analysis. Here we describe the main analyses rather than all possible options, for which further reading is suggested. ANOVA is the first test to perform: it compares the means of two or more groups by partitioning the variance and evaluates whether the differences are statistically significant. The main assumptions of the ANOVA are:
1. The variances of the subpopulations from which the groups are drawn must be homogeneous (homoscedasticity).
2. The variable must have a normal distribution in all populations corresponding to the sampled groups (normality).
3. Samples must be random and independent (randomness and independence).
In addition, groups should have approximately the same number of observations. The example dataset, named "shape", is composed of two columns: one with genotype names and another with trait values (Table 1).
Pasquale Tripodi
Table 1 Example dataset. Values of area (either leaf, fruit, or main root) for three genotypes are considered. The first three entries of each genotype are reported; the additional 28 observations (for a total of 31 per genotype) are indicated with dots
Genotype   Area
Geno1      127.23
Geno1      127.55
Geno1      126.41
.          .
.          .
Geno2      96.479
Geno2      96.044
Geno2      97.412
.          .
.          .
Geno3      147.353
Geno3      147.104
Geno3      148.080
.          .
.          .
Before computing the ANOVA, a boxplot gives a first general overview of the samples (Fig. 4):
> boxplot(Area ~ Genotype, data = shape, col = rainbow(4))
Install and load the following packages:
> install.packages(c("ggplot2", "tidyverse"))
> library(ggplot2)
> library(tidyverse)
To perform a one-way ANOVA:
> oneway <- aov(Area ~ Genotype, data = shape)
> oneway
Source     DF   Sum of Squares   Mean Square   F Value      Pr > F
Genotype    2   39871.2789       19935.6395    51633.2342   1.976e-138
Error      90   34.7490832       0.38610092
C. Total   92   39906.028
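As a cross-check of the table above, the mean squares, the F value and its p-value can be recomputed from the sums of squares and degrees of freedom alone. The sketch below uses only the numbers printed in the table; the variable names are chosen for the example.

```r
# Values copied from the ANOVA table
ss_genotype <- 39871.2789
df_genotype <- 2
ss_error    <- 34.7490832
df_error    <- 90

# Mean square = sum of squares / degrees of freedom
ms_genotype <- ss_genotype / df_genotype
ms_error    <- ss_error / df_error

# F = mean square of genotype / mean square of residuals
f_value <- ms_genotype / ms_error

# Pr > F: upper tail of the F distribution with (2, 90) degrees of freedom
p_value <- pf(f_value, df_genotype, df_error, lower.tail = FALSE)

c(MS_genotype = ms_genotype, MS_error = ms_error, F = f_value, p = p_value)
```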
Fig. 4 Boxplots providing a general overview of the three considered genotypes
The DF column indicates the degrees of freedom for the genotypes (the independent variable), equal to the number of genotypes minus 1, and the degrees of freedom for the residuals, equal to the total number of observations minus one and minus the genotype DF (93 - 1 - 2 = 90). The Sum of Squares column indicates the total variation between the group means and the overall mean. The Mean Square column is obtained by dividing the Sum of Squares by the DF. The F value column is the test statistic of the F test, obtained as the mean square of the genotype divided by the mean square of the residuals. A high F value indicates that the variation among genotypes is large relative to the residual variation and is unlikely to be due to chance. The Pr > F column indicates the significance of the differences among means, showing how likely it is that such an F value would have occurred if the null hypothesis (no difference among group means) were true. The level of significance is expected, given the boxplot in Fig. 4. A post hoc test can be applied to find out which pairs of samples differ. Tukey's HSD test can be run with the following code:
> TukeyHSD(oneway)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = Area ~ Genotype, data = shape)
$Genotype
                 diff       lwr       upr  p adj
Geno2-Geno1 -29.86556 -30.24168 -29.48944  0.00...
Geno3-Geno1  20.56780  20.19168  20.94393  0.00...
Geno3-Geno2  50.43337  50.05725  50.80949  0.00...
The Tukey HSD test shows the pairwise differences in area among the three genotypes at a 5% level of significance. The level can be changed to 1% by specifying "conf.level = 0.99" in the function call. The "upr" and "lwr" columns provide the upper and lower 95% confidence bounds, respectively, and the "p adj" column provides the p-values adjusted for the number of comparisons made. The results show that the three genotypes are significantly different from each other. It is also possible to plot the results of Tukey's HSD comparison with the plot function, as follows (Fig. 5):
> plot(TukeyHSD(oneway, conf.level = 0.95), col = "green")
Fig. 5 Tukey HSD graph at 95% confidence level
3.3.2 Model Assumption Checking
Another important aspect to consider is the diagnostic checking of model assumptions. The first step is to assess the homoscedasticity of the variables to determine whether the variance of the groups is equal:
> par(mfrow = c(2, 2))
> plot(oneway)
> par(mfrow = c(1, 1))
Three graphs are plotted (Fig. 6), providing information on the model. The Residuals vs. Fitted plot shows that, for the considered values, the mean of the residuals is close to zero and the trend line is horizontal. The Q-Q plot shows the quantiles of the theoretical residuals expected if the data were normally distributed against the quantiles of the actual residuals. The assumption of normally distributed residuals is met, since the points fall mostly on the straight line (slope near 1). In this case, there are no outliers affecting the model and there is no deviation between theoretical and actual residuals; thus the model fits the assumption of homoscedasticity. The Scale-Location plot shows the square root of the absolute standardized residuals plotted against the fitted (predicted) values. The almost flat line indicates a low chance of failure to meet the assumption of homoscedasticity.
Fig. 6 Diagnostic plots for homoscedasticity (details are provided in the text)
The normality of the overall residuals can also be assessed by the Shapiro–Wilk test, whose null hypothesis is that the observations are drawn from a normally distributed population:
> shapiro.test(resid(oneway))
        Shapiro-Wilk normality test
data:  resid(oneway)
W = 0.99155, p-value = 0.8236
In this case, the p-value is above the level of significance; therefore, the null hypothesis cannot be rejected, and there is no evidence that the residuals deviate from normality.
3.3.3 Correlations
Additional correlation analysis estimates the level of association between the traits scored, providing useful hints for predictive relationships to be used for several experimental and practical purposes (e.g., selection). To that end, the new dataset for the analysis, named "corr", includes seven traits and 60 observations per trait (Table 2).
Table 2 Dataset for correlation analysis. Seven traits are considered; the first three entries are reported, and the additional 57 observations (for a total of 60) are indicated with dots
Tr1     Tr2       Tr3      Tr4       Tr5      Tr6     Tr7
24.13   949.32    758.73   1708.05   219.01   72.56   37.08
17.13   1027.89   805.02   1832.91   95.72    86.12   41.85
20.63   988.61    781.88   1770.48   157.37   79.34   39.46
.       .         .        .         .        .       .
.       .         .        .         .        .       .
Before beginning, install and load the libraries RColorBrewer and corrplot:
> install.packages(c("RColorBrewer", "corrplot"))
> library(RColorBrewer)
> library(corrplot)
To calculate correlations:
> cor(corr, method = c("pearson")) # see Note 7
      Tr1   Tr2   Tr3   Tr4   Tr5   Tr6   Tr7
Tr1  1.00  0.23  0.25  0.23  0.30 -0.12 -0.14
Tr2  0.23  1.00  0.96  0.99  0.23 -0.29 -0.40
Tr3  0.25  0.96  1.00  0.98  0.27 -0.35 -0.48
Tr4  0.23  0.99  0.98  1.00  0.25 -0.32 -0.44
Tr5  0.30  0.23  0.27  0.25  1.00  0.08 -0.10
Tr6 -0.12 -0.29 -0.35 -0.32  0.08  1.00  0.94
Tr7 -0.14 -0.40 -0.48 -0.44 -0.10  0.94  1.00
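A correlation matrix of this kind, together with a p-value for each pair of traits, can be obtained with base R alone. The sketch below is an illustration on simulated traits; the data frame `corr_demo` and its columns are invented for the example and are unrelated to the "corr" dataset.

```r
# Simulated traits (hypothetical values), 60 observations each
set.seed(42)
n   <- 60
tr2 <- rnorm(n, mean = 1000, sd = 40)
tr3 <- 0.8 * tr2 + rnorm(n, sd = 10)   # built to correlate with tr2
tr4 <- tr2 + tr3                       # sum of the two previous traits
tr6 <- rnorm(n, mean = 80, sd = 7)     # independent of the others
corr_demo <- data.frame(Tr2 = tr2, Tr3 = tr3, Tr4 = tr4, Tr6 = tr6)

# Pearson correlation matrix
r_mat <- cor(corr_demo, method = "pearson")

# Matrix of p-values, one cor.test() per pair of traits
k     <- ncol(corr_demo)
p_mat <- matrix(NA_real_, k, k, dimnames = dimnames(r_mat))
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    if (i != j) {
      p_mat[i, j] <- cor.test(corr_demo[[i]], corr_demo[[j]],
                              method = "pearson")$p.value
    }
  }
}

round(r_mat, 2)
signif(p_mat, 2)
```

Pairs with a p-value below the chosen threshold (e.g., 0.05) would be the ones retained as significant when plotting.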
For calculating the threshold of significance:
> cor.mtest
> fviz_pca_biplot(pca1, repel = TRUE, col.var = "#00AFBB", col.ind = "#E7B800")
Eigenvalues (Fig. 8d):
> fviz_eig(pca1)
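The percentage of variance explained by each principal component, which is what an eigenvalue (scree) plot displays, can be computed directly from a PCA object. The sketch below uses base R's `prcomp` on simulated traits; the data and object names are assumptions for the example and do not reproduce the `pca1` object of the text.

```r
# Simulated dataset: four traits driven by two hypothetical latent factors
set.seed(7)
n        <- 60
latent_a <- rnorm(n)
latent_b <- rnorm(n)
traits <- data.frame(
  Tr1 = latent_a + rnorm(n, sd = 0.3),
  Tr2 = latent_a + rnorm(n, sd = 0.3),
  Tr3 = latent_b + rnorm(n, sd = 0.3),
  Tr4 = latent_b + rnorm(n, sd = 0.3)
)

# PCA on centered and scaled traits
pca_demo <- prcomp(traits, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca_demo$sdev^2
explained   <- 100 * eigenvalues / sum(eigenvalues)

round(explained, 1)          # percentage of variance per component
round(cumsum(explained), 1)  # cumulative percentage
```

Scaling matters when traits are measured in different units; with `scale. = TRUE` the eigenvalues sum to the number of traits.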
Fig. 8 Principal component analysis (PCA) graphs. (a) PCA biplot with the distribution of individuals; (b) PCA biplot with trait (Tr) contribution; (c) PCA biplot combining individuals and traits in the same graph; (d) Percentage of explained variance by the first eight principal components (PC). The first and the second PCs account for over 70% of the whole variation
4 Notes
1. Aquaponics and aeroponics, in which plants grow without soil or an aggregate medium, can be valid alternatives that leave roots growing free of any external interference.
2. The number of collected data points is directly proportional to the number of samples surveyed. In any case, it is important to choose the appropriate growth stage of the plant since, particularly for fruits, dimensions can change on different trusses.
3. Avoid as much as possible generating heterogeneity when cutting different samples.
4. The dark background prevents any shadows of the objects that may interfere with the analysis. Furthermore, this allows the scanning of multiple objects close to each other on the scanner.
5. In general, 300 dpi is optimal for an object size of 1–8 cm. For greater and smaller sizes, use 100 dpi (e.g., big fruits) and 600 dpi (e.g., seeds), respectively.
6. Details of traits can be found in the specific software manuals: https://vanderknaaplab.uga.edu/files/Tomato_Analyzer_3.0_Manual.pdf; http://morpholeaf.versailles.inra.fr/download/MorphoLeaf_guide_Biot_et_al.pdf.
7. Use "method = c("spearman")" or "method = c("kendall")" to calculate correlations using the other two methods.
8. By changing the numerical parameters in the scripts, it is possible to obtain a different type of plot. Furthermore, several additional graphical options are available and can be viewed in the vignettes related to the package (https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html).
9. The command is for the Windows operating system; for macOS and Linux, use install.packages("stringi") directly.
References
1. Brachi B, Morris GP, Borevitz JO (2011) Genome-wide association studies in plants: the missing heritability is in the field. Genome Biol 12:232. https://doi.org/10.1186/gb-2011-12-10-232
2. Esposito S, Carputo D, Cardi T, Tripodi P (2020) Applications and trends of machine learning in genomics and phenomics for next-generation breeding. Plants 9(1):34. https://doi.org/10.3390/plants9010034
3. Casci T (2010) Plants are not humans. Nat Rev Genet 11:315. https://doi.org/10.1038/nrg2788
4. Korte A, Farlow A (2013) The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9:29. https://doi.org/10.1186/1746-4811-9-29
5. European Plant Phenotyping Network (EPPN) https://www.plant-phenotyping-network.eu/. Accessed 20 Jun 2021
6. International Plant Phenotyping Network (IPPN) https://www.plant-phenotyping.org/IPPN_home. Accessed 20 Jun 2021
7. North American Plant Phenotyping Network (NAPPN) https://nappn.plant-phenotyping.org/. Accessed 20 Jun 2021
8. Image Software Tools https://www.quantitative-plant.org/software. Accessed 20 Jun 2021
9. R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
10. Coucke W, China B, Delattre I, Lenga Y, Van Blerk M, Van Campenhout C, Van de Walle P, Vernelen K, Albert A (2012) Comparison of different approaches to evaluate external quality assessment data. Clin Chim Acta 413:582–586. https://doi.org/10.1016/j.cca.2011.11.030
11. O'Connor LJ, Price AL (2018) Distinguishing correlation from causation using genome-wide association studies. arXiv:1811.08803
12. Tripodi P, Soler S, Campanelli G, Díez MJ et al (2021) Genome wide association mapping for agronomic, fruit quality, and root architectural traits in tomato under organic farming conditions. BMC Plant Biol 21:481. https://doi.org/10.1186/s12870-021-03271-4
13. Fernandes SB, Zhang KS, Jamann TM, Lipka AE (2021) How well can multivariate and univariate GWAS distinguish between true and spurious pleiotropy? Front Plant Sci 11:1–11. https://doi.org/10.3389/fgene.2020.602526
14. Colonna V, D'Agostino N, Garrison E, Albrechtsen A, Meisner J, Facchiano A, Cardi T, Tripodi P (2019) Genomic diversity and novel genome-wide association with fruit morphology in Capsicum, from 746k polymorphic sites. Sci Rep 9:10067. https://doi.org/10.1038/s41598-019-46136-5
15. Lever J, Krzywinski M, Altman N (2017) Principal component analysis. Nat Methods 14:641–642. https://doi.org/10.1038/nmeth.4346
Chapter 8
Preparation and Curation of Omics Data for Genome-Wide Association Studies
Feng Zhu, Alisdair R. Fernie, and Federico Scossa
Abstract
With the development of large-scale molecular phenotyping platforms, genome-wide association studies have greatly developed, being no longer limited to the analysis of classical agronomic traits, such as yield or flowering time, but also embracing the dissection of the genetic basis of molecular traits. Data generated by omics platforms, however, pose some technical and statistical challenges to the classical methodology and assumptions of an association study. Although genotyping data are subject to strict filtering procedures, and several advanced statistical approaches are now available to adjust for population structure, less attention has been devoted to the preparation of omics data prior to GWAS. In the present chapter, we briefly present the methods to acquire profiling data from transcripts, proteins, and small molecules, and discuss the tools and possibilities to clean, normalize, and remove the unwanted variation from large datasets of molecular phenotypic traits prior to their use in GWAS.
Key words GWAS, Plants, Transcriptomics, Proteomics, Metabolomics, Normalization
Abbreviations
2-DE   Two-dimensional electrophoresis
BLUP   Best linear unbiased predictor
GWAS   Genome-wide association study
LD     Linkage disequilibrium
LM     Linear model
LMM    Linear mixed model
MS     Mass spectrometry
QTL    Quantitative trait locus
SIS    Surrogate internal standard
Davoud Torkamaneh and François Belzile (eds.), Genome-Wide Association Studies, Methods in Molecular Biology, vol. 2481, https://doi.org/10.1007/978-1-0716-2237-7_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction
Association mapping is a genetic approach to find links between genotypic and phenotypic variation in quantitative traits. Through statistical models, association studies aim to identify the specific genomic variants (e.g., SNPs) that are related to the variation of phenotypic traits across a collection of unrelated individuals (accessions, varieties, species). First applied in medical genetics, GWAS was successful in identifying many loci underlying risk variants of several important diseases, including cancer, type-2 diabetes, depression, and metabolic disorders [1]. In its essence, conducting a GWAS is simple: once both phenotypic and genotypic data are collected from the population under study (diversity panel), statistical models evaluate the significance and the magnitude of the association between the genetic polymorphisms and the phenotypic trait(s). From the list of these significant associations, and through a classical candidate gene approach, it is then possible to fine map the polymorphisms held responsible for the observed variation in the trait [2]. The possibility to establish causal links between genetic variation, estimated on a genome-wide scale, and variation of quantitative traits is of paramount importance also to plant breeding, given that many traits of agricultural importance (yield, flowering time, morphology, and stress resistance, among others) are characterized by a complex genetic architecture (i.e., a high number of genetic variants influence the phenotype, each with small- or medium-effect sizes). In these cases, and especially when polygenic inheritance is further complicated by gene-gene and gene-by-environment interaction, GWAS usually provides higher genetic resolution than biparental mapping.
Traditional QTL approaches are in fact based on the cross between two parental genotypes, and thus their genetic resolution relies entirely on the few recombination events occurring during the generation of the population. GWAS, on the other hand, by the inclusion of a large diversity panel of (theoretically) unrelated genotypes, efficiently captures all the historical recombinations that occurred along the evolution of the various accessions (see Note 1, [2–4]). The first GWA studies in plants date back to around 10 years ago, when the genetic association of a few, relatively simple, morphological, developmental, and life-history traits (e.g., flowering time, trichome density) was studied in populations of Arabidopsis [5–7]. Since then, GWA studies have also been applied in a wide range of crops (maize, rice, soybean, cotton, tomato, and peach, among others), embracing the study of the genetic basis of molecular traits collected from large-scale analyses (e.g., transcripts, proteins, metabolites). In transcriptome-wide association studies [8], for example, specific SNPs are associated with gene expression variation to identify eQTL (expression QTL). By considering gene expression as a trait, these studies uncover both local and distant eQTL, showing, in maize for example, that local eQTL generally exerted larger effects in the kernel transcriptome [8]. Similarly, in cotton (Gossypium spp.), GWAS on gene expression levels uncovered some eQTL related to the transcriptional regulation of secondary wall formation [9]. Several MS-based proteomics studies have also been combined with genetic polymorphism data in cereals and legumes [10–13]. Along the same lines, GWAS has also explored the genetic basis of a multitude of plant metabolic traits, efficiently combining the deep genotyping afforded by NGS with metabolic profiling of primary, secondary, and volatile metabolites [14–16].
GWA studies, and especially those conducted on large collections of quantitative phenotypic traits, need to adhere to strict quality control measures and must meet some specific assumptions about the properties of the phenotypic data. In the end, the more precise the phenotyping, the higher the probability of detecting strong statistical support for a genetic variant held responsible for trait variation. Although a large body of literature has addressed key issues in the statistical pipelines of GWAS, and how to overcome them (e.g., population structure and cryptic relatedness, rate of false positive discovery, relationship between sample size, effect size, selection differential and percentage of explained variance, quality control checks for SNP calling, missing data imputation, spurious associations due to genetic heterogeneity [17–25]), we have found that the general preprocessing strategies for molecular phenotypic datasets have received relatively less attention.
The proper treatment of transcriptomics, proteomics, and metabolomics data is, however, of great importance when these large datasets are included in a GWA study, as failure to account for the various sources of unwanted variation that typically arise in these large datasets may negatively impact the power of GWAS [2, 26]. In this method paper, we first present, for each category of these omic approaches, the initial requirements and the protocols to produce datasets from large transcript, protein, or small molecule profiling. We then provide some guidelines about how these data should be preprocessed and normalized prior to their use in combination with genetic polymorphisms in a GWA study (see Fig. 1).
2 Preparation of Omic-Datasets for GWAS
Sample preparation is a critical process for multiomics studies. RNA, proteins, and metabolites can be easily degraded by different enzymes. Therefore, during harvesting, samples should be immediately frozen in liquid nitrogen and then directly transferred to −80 °C to avoid degradation. Care should also be taken when preparing aliquots for individual extractions, as samples should not be allowed to thaw before the addition of extraction buffer.
[Figure 1 panels: diversity panel; genotyping (SNP matrix with chromosome, position, and accessions Acc1–Acc3; SNP filtering, population structure, cryptic relatedness); molecular phenotyping (gene expression, proteomics, small molecules; data cleaning, normalization, transformation); SNP-trait association]
Fig. 1 Diagram of a general GWA study. An association study is based on the collection of genotypic and phenotypic data from a group of (possibly) unrelated accessions (although some statistical methods can efficiently take into account the presence of relatedness within the population). Molecular traits (e.g., gene expression, protein and metabolite abundances) can be obtained from various analytical platforms through large-scale profiling studies (RNAseq, proteomics, and metabolomics). Both genotypic and phenotypic datasets need to undergo some form of preprocessing and statistical pipelines to remove sources of unwanted variation which might bias or reduce the power of GWAS. Given that phenotypic traits used in GWAS need to be normally distributed, large-scale molecular datasets are usually normalized following some sort of data transformation (see the text for details)
Required General Lab Equipment
1. Equipment for tissue grinding (e.g., Retsch Mill MM400).
2. Water bath or heating block.
3. Vortex.
4. Superspeed centrifuge (ultracentrifuge required in case of organelle isolation through sedimentation in sucrose gradients, [27]).
5. Spectrophotometer (e.g., Nanodrop, for measuring RNA and protein concentration).
6. nanoHPLC and/or SE-HPLC for purification and separation of protein fractions.
7. Single quadrupole or ToF GC-MS for analysis of primary metabolites (carbohydrates, amino acids, organic acids, amines, polyols).
8. LC-MS for analysis of proteins and/or secondary metabolites (e.g., alkaloids, polyphenols, terpenoids). The type and technical specifications of the mass spectrometers depend on the throughput, depth, coverage, and size of the target proteins or metabolites. In the case of metabolic GWAS, for example, an HPLC or UPLC with a PDA detector is theoretically sufficient for those metabolites that can be reliably identified on the basis of their UV-Vis absorbance spectra (e.g., carotenoids), whereas wider metabolic profiling that requires the identification of unknown masses necessitates mass spectrometers with sufficient resolution and speed and with MSn capabilities [28]. Similar considerations can also be made for the MS instruments needed for protein-based GWAS.
9. Access to a high-performance computing cluster for intensive computational tasks (e.g., RNAseq read filtering and mapping, analysis of peptide spectra matches, omics-data normalization, and calculation of downstream genome-wide associations).
2.1 Materials and Methods for RNA Extraction Prior to RNAseq
2.1.1 Reagents
1. Liquid nitrogen.
2. RNA extraction kit (Qiagen or equivalent) or TRIzol-based reagent (for standard, multistep RNA extraction protocols).
3. RNase decontamination solution (to avoid RNA degradation during extraction).
4. Ethanol.
5. DNase (to remove contamination from genomic DNA).
6. RNase-free water.
2.1.2 Protocol for RNA Extraction (See Note 2)
1. Weigh 50 mg of frozen pulverized tissue and add it to a tube containing 500 μL of guanidine hydrochloride buffer (8 M guanidine hydrochloride, 20 mM MES (morpholine ethanesulfonic acid), 20 mM EDTA, and 50 mM β-mercaptoethanol, pH 7.5). β-mercaptoethanol can be substituted by 20 mM DTT (in any case, the reducing agent has to be added just before extraction). As an alternative, the guanidine hydrochloride buffer can be substituted by an equal volume of TRIzol.
2. Vortex extensively and place the sample on ice until all samples are processed. Homogenates from green tissues become brownish.
3. Centrifuge for 10 min at 10,000 RPM (11,200 × g) at 4 °C and transfer the liquid phase to a new tube.
4. If guanidine buffer has been used in step 1, proceed immediately to step 5. Otherwise, if TRIzol has been used in step 1, add 100 μL of chloroform, vortex, and centrifuge again for 10 min at 10,000 RPM (11,200 × g) at 4 °C. Transfer the upper liquid phase to a new tube.
5. Add 1 volume of phenol–chloroform–isoamyl alcohol 25:24:1 to remove proteins, and vortex.
6. Centrifuge again for 10 min at 10,000 RPM (11,200 × g) at 4 °C to separate phases. The interphase should be clean from tissue particles. Transfer the upper liquid phase to a new tube.
7. (Optional) Reextract with 1 volume of chloroform to eliminate all traces of residual phenol.
8. Centrifuge again for 10 min at 10,000 RPM (11,200 × g) at 4 °C to separate phases. The interphase should be clean. Transfer the upper liquid phase (around 200 μL) to a new tube.
9. Precipitate RNA with 1 volume of cold isopropanol for at least 1 h at −20 °C.
10. Centrifuge for 10 min at 10,000 RPM (11,200 × g) at 4 °C and discard the supernatant.
11. Wash pellets with cold 75% ethanol (made in RNase-free water).
12. Resuspend pellets in 200 μL RNase-free water.
13. Precipitate RNA with 100 μL of 8 M LiCl for at least 3 h at 4 °C.
14. Centrifuge for 30 min at 13,000 RPM (19,000 × g) at 4 °C and discard the supernatant (at this stage pellets are often invisible).
15. Wash pellets with 75% ethanol.
16. Centrifuge for 5 min at 13,000 RPM (19,000 × g) and discard the supernatant.
17. Dry pellets for 1–2 min at room temperature.
18. Resuspend in 30–50 μL RNase-free water.
19. Measure RNA concentration at the spectrophotometer.
20. Check integrity of RNA on a 1% agarose gel (in 1× TAE).
21. Store RNA at −80 °C.
2.2 Analysis and Curation of RNA-seq Data (See Note 3)
In the past decades, a number of best practices and many software packages for RNA-seq have been developed [29–31]. The core workflow for RNAseq analysis, however, is divided into three steps: sequence data quality trimming and adapter cutting, read alignment, and gene expression quantification and normalization.
1. Sequence data quality trimming and adapter cutting
After the sequencing process, the adapter sequences need to be removed from the reads to avoid errors in the downstream alignment process. Several tools can be used to remove the adapters, but we focus here on Trimmomatic, one of the most commonly used tools for trimming Illumina paired-end and single-end data [32]. Trimmomatic includes several steps (ILLUMINACLIP, SLIDINGWINDOW, LEADING, TRAILING, CROP, HEADCROP, MINLEN, TOPHRED33, TOPHRED64), and the associated parameters of each step can be specified through the command line. Here are the sample code snippets:
for paired end
java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
for single end
java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
After the trimming process, the cleaned data can be run through FastQC to confirm that the low-quality bases and adapters have been trimmed off [33].
2. Read alignment
This step is required to map the trimmed reads to the reference genome. The process can be carried out by tools such as STAR [34] and HISAT2 (the successor of another widely used tool, TopHat2 [35, 36]). For large-scale transcriptome data analysis, HISAT2 has low memory requirements and can carry out alignment both at a local and at a whole-genome level [36]. Full details of the scripts for RNAseq data analysis with HISAT can be found in [37].
3. Gene expression quantification and normalization
After the alignment of the reads to the genome, the reads are counted to quantify the gene expression level. Read-count-based tools, such as HTSeq, are used to produce the matrix containing the count values across different samples [38]. Owing to differences in read depth, expression patterns, and technical biases, the count matrix is usually subjected to some data preprocessing. Raw counts are in fact transformed to a scale that can account for library size differences. Several options are available: RPKM (reads per kilobase of transcript per million mapped reads), FPKM (fragments per kilobase of transcript per million), CPM (counts per million), and log2-CPM (log2-counts per million). Unlike RPKM and FPKM, however, CPM and log2-CPM do not account for differences in gene length. CPM and log2-CPM are implemented in edgeR (along with other alternatives, [39, 40]) and can be called using the function cpm:
cpm