Methods in Molecular Biology 2467
Nourollah Ahmadi Jérôme Bartholomé Editors
Genomic Prediction of Complex Traits Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Genomic Prediction of Complex Traits Methods and Protocols
Edited by
Nourollah Ahmadi and Jérôme Bartholomé CIRAD, UMR AGAP Institut, Montpellier, France
Editors
Nourollah Ahmadi, CIRAD, UMR AGAP Institut, Montpellier, France
Jérôme Bartholomé, CIRAD, UMR AGAP Institut, Montpellier, France
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-2204-9 ISBN 978-1-0716-2205-6 (eBook) https://doi.org/10.1007/978-1-0716-2205-6 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022, Corrected Publication 2022 Chapters 3, 9, 13, 14 and 21 are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC part of Springer Nature. 
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
Over the last three decades, steady progress in molecular genetics and quantitative genetics has led to two major breakthroughs: genetic mapping and genomic prediction of complex traits whose biological bases are imperfectly known. Both have immense potential impact in the applied fields of human health and agriculture. The prospect of predicting, with reasonably high precision, human disease risks and complex traits of agronomic interest that are difficult to evaluate and/or subject to numerous regulations and interactions has stimulated public and private research. The aims included refining the theoretical bases of genomic prediction, extending its areas of application, fine-tuning the experimental designs for specific cases in different species of interest, and developing dedicated tools. The development of methods for genomic prediction has also been caught up in the excitement surrounding machine learning and artificial intelligence technologies. Results from various simulation studies and empirical applications indicate that the quality of prediction depends on the interplay between a large number of factors, including relatedness between the reference and the candidate populations, trait architecture, marker density, prediction methods, and, in the case of traits of agronomic interest, the overall organization of the plant and animal breeding programs. As the concepts and methods of genomic prediction of complex traits mature and are about to become the mainstream approach for addressing an increasing number of health issues and objectives of animal and plant breeding, it is timely to assemble the methodological achievements of the field in a volume of the series Methods in Molecular Biology.
This volume is composed of five sections. The first section (Chapter 1) is a reminder of the evolution of the conceptual frameworks for genotype-phenotype relationship analysis and of the molecular genetics approaches intended to predict phenotypic variations. The second section (Chapters 2–4) covers the principles of genomic prediction of complex traits, gives an overview of the factors that affect its reliability, and provides an extensive review of the characteristics of the most influential factors and of methods to optimize those characteristics. The third section (Chapters 5–10) describes single-trait, single-environment genomic prediction methods and the associated assumptions on the variance of marker effects and on the architecture of the target trait, and presents an overview of major computer packages for genomic prediction. It then describes genomic prediction approaches that deal with more complex biological contexts, such as nonadditive genetic effects, genotype-by-environment interaction, and correlation between phenotypic traits, and presents the associated computer packages. The fourth section (Chapters 11–14) provides examples of the incorporation into genomic prediction models of new knowledge from molecular genetics and ecophysiology, such as trait-specific genetic information and “omics” data, and of the coupling of genomic models with crop growth models. The last section (Chapters 15–22) is dedicated to lessons learned from a number of applications of genomic prediction in the fields of human health, animal breeding, and plant breeding, and to methods for analysing the economic effectiveness of genomic selection relative to conventional breeding approaches.
As the objective of the volume is to provide a reference resource for students, teachers, practitioners, and for further methodological research, the chapters are designed to include practical examples, while reporting the latest state of knowledge in the field.
Montpellier, France
Nourollah Ahmadi
Jérôme Bartholomé
Contents
Preface
Contributors
1 Genetic Bases of Complex Traits: From Quantitative Trait Loci to Prediction
   Nourollah Ahmadi
2 Genomic Prediction of Complex Traits, Principles, Overview of Factors Affecting the Reliability of Genomic Prediction, and Algebra of the Reliability
   Jean-Michel Elsen
3 Building a Calibration Set for Genomic Prediction, Characteristics to Be Considered, and Optimization Approaches
   Simon Rio, Alain Charcosset, Tristan Mary-Huard, Laurence Moreau, and Renaud Rincent
4 Genotyping, the Usefulness of Imputation to Increase SNP Density, and Imputation Methods and Tools
   Florence Phocas
5 Overview of Genomic Prediction Methods and the Associated Assumptions on the Variance of Marker Effect, and on the Architecture of the Target Trait
   Réka Howard, Diego Jarquin, and José Crossa
6 Overview of Major Computer Packages for Genomic Prediction of Complex Traits
   Giovanny Covarrubias-Pazaran
7 Genome-Enabled Prediction Methods Based on Machine Learning
   Edgar L. Reinoso-Peláez, Daniel Gianola, and Oscar González-Recio
8 Genomic Prediction Methods Accounting for Nonadditive Genetic Effects
   Luis Varona, Andres Legarra, Miguel A. Toro, and Zulma G. Vitezica
9 Genome and Environment Based Prediction Models and Methods of Complex Traits Incorporating Genotype × Environment Interaction
   José Crossa, Osval Antonio Montesinos-López, Paulino Pérez-Rodríguez, Germano Costa-Neto, Roberto Fritsche-Neto, Rodomiro Ortiz, Johannes W. R. Martini, Morten Lillemo, Abelardo Montesinos-López, Diego Jarquin, Flavio Breseghello, Jaime Cuevas, and Renaud Rincent
10 Accounting for Correlation Between Traits in Genomic Prediction
   Osval Antonio Montesinos-López, Abelardo Montesinos-López, Brandon A. Mosqueda-Gonzalez, José Cricelio Montesinos-López, and José Crossa
11 Incorporation of Trait-Specific Genetic Information into Genomic Prediction Models
   Shaolei Shi, Zhe Zhang, Bingjie Li, Shengli Zhang, and Lingzhao Fang
12 Incorporating Omics Data in Genomic Prediction
   Johannes W. R. Martini, Ning Gao, and José Crossa
13 Integration of Crop Growth Models and Genomic Prediction
   Akio Onogi
14 Phenomic Selection: A New and Efficient Alternative to Genomic Selection
   Pauline Robert, Charlotte Brault, Renaud Rincent, and Vincent Segura
15 From Genotype to Phenotype: Polygenic Prediction of Complex Human Traits
   Timothy G. Raben, Louis Lello, Erik Widen, and Stephen D. H. Hsu
16 Genomic Prediction of Complex Traits in Animal Breeding with Long Breeding History, the Dairy Cattle Case
   Joel Ira Weller
17 Genomic Selection in Aquaculture Species
   François Allal and Nguyen Hong Nguyen
18 Genomic Prediction of Complex Traits in Perennial Plants: A Case for Forest Trees
   Fikret Isik
19 Genomic Prediction of Complex Traits in Forage Plants Species: Perennial Grasses Case
   Philippe Barre, Torben Asp, Stephen Byrne, Michael Casler, Marty Faville, Odd Arne Rognli, Isabel Roldan-Ruiz, Leif Skøt, and Marc Ghesquière
20 Genomic Prediction of Complex Traits in an Allogamous Annual Crop: The Case of Maize Single-Cross Hybrids
   Isadora Cristina Martins Oliveira, Arthur Bernardeli, José Henrique Soler Guilhen, and Maria Marta Pastina
21 Genomic Prediction: Progress and Perspectives for Rice Improvement
   Jérôme Bartholomé, Parthiban Thathapalli Prakash, and Joshua N. Cobb
22 Analyzing the Economic Effectiveness of Genomic Selection Relative to Conventional Breeding Approaches
   Aline Fugeray-Scarbel, Sarah Ben-Sadoun, Sophie Bouchet, and Stéphane Lemarié
Correction to: Genomic Prediction Methods Accounting for Nonadditive Genetic Effects
Index
Contributors
NOUROLLAH AHMADI • CIRAD, UMR AGAP Institut, Montpellier, France
FRANÇOIS ALLAL • MARBEC, Université de Montpellier, CNRS, Ifremer, IRD, Palavas-les-Flots, France
TORBEN ASP • Center for Quantitative Genetics and Genomics, Aarhus University, Slagelse, Denmark
PHILIPPE BARRE • INRAE, UR P3F, Lusignan, France
JÉRÔME BARTHOLOMÉ • CIRAD, UMR AGAP Institut, Montpellier, France
SARAH BEN-SADOUN • INRAE - UCA UMR1095, Genetics Diversity and Ecophysiology of Cereals, Clermont-Ferrand, France
ARTHUR BERNARDELI • Department of Agronomy, Universidade Federal de Viçosa, Viçosa-MG, Brazil
SOPHIE BOUCHET • INRAE - UCA UMR1095, Genetics Diversity and Ecophysiology of Cereals, Clermont-Ferrand, France
CHARLOTTE BRAULT • UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, Montpellier, France; Institut Français de la Vigne et du Vin, Montpellier, France; UMT Geno-Vigne®, IFV-INRAE-Institut Agro, Montpellier, France
FLAVIO BRESEGHELLO • Embrapa Arroz e Feijão, Santo Antônio de Goiás, GO, Brazil
STEPHEN BYRNE • Teagasc, Crop Science Department, Oak Park, Carlow, Ireland
MICHAEL CASLER • U.S. Dairy Forage Research Center, USDA-ARS, Madison, WI, USA
ALAIN CHARCOSSET • GQE - Le Moulon, INRAE, University Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, Gif-sur-Yvette, France
JOSHUA N. COBB • RiceTec Inc, Alvin, TX, USA
GERMANO COSTA-NETO • Departamento de Genética, Escola Superior de Agricultura “Luiz de Queiroz” (ESALQ/USP), São Paulo, Brazil
GIOVANNY COVARRUBIAS-PAZARAN • Centro Internacional de Mejoramiento de Maiz y Trigo (CIMMYT), Texcoco, Mexico; Excellence in Breeding Platform (EiB), Texcoco, Mexico
JOSÉ CROSSA • Colegio de Postgraduados, Montecillos, Mexico; Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz, Mexico
JAIME CUEVAS • Universidad de Quintana Roo, Chetumal, Quintana Roo, Mexico
JEAN-MICHEL ELSEN • GenPhySE, Université de Toulouse, INRAE, ENVT, Castanet Tolosan, France
LINGZHAO FANG • MRC Human Genetics Unit at the Institute of Genetics and Cancer, The University of Edinburgh, Edinburgh, UK
MARTY FAVILLE • AgResearch Ltd, Grasslands Research Centre, Palmerston North, New Zealand
ROBERTO FRITSCHE-NETO • Departamento de Genética, Escola Superior de Agricultura “Luiz de Queiroz” (ESALQ/USP), São Paulo, Brazil
ALINE FUGERAY-SCARBEL • Univ. Grenoble Alpes, INRAE, CNRS, Grenoble INP, GAEL, Grenoble, France
NING GAO • School of Life Sciences, Sun Yat-Sen University, Guangzhou, China
MARC GHESQUIÈRE • INRAE, UR P3F, Lusignan, France
DANIEL GIANOLA • Department of Animal and Dairy Sciences, University of Wisconsin–Madison, Madison, WI, USA
OSCAR GONZÁLEZ-RECIO • Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria. Ctra. de La Coruña, Madrid, Spain
RÉKA HOWARD • University of Nebraska-Lincoln, Lincoln, NE, USA
STEPHEN D. H. HSU • Michigan State University, East Lansing, MI, USA; Genomic Prediction, North Brunswick, NJ, USA
FIKRET ISIK • North Carolina State University, Raleigh, NC, USA
DIEGO JARQUIN • University of Nebraska-Lincoln, Lincoln, NE, USA
ANDRES LEGARRA • INRA, GenPhySE, Castanet-Tolosan, France
LOUIS LELLO • Michigan State University, East Lansing, MI, USA; Genomic Prediction, North Brunswick, NJ, USA
STÉPHANE LEMARIÉ • Univ. Grenoble Alpes, INRAE, CNRS, Grenoble INP, GAEL, Grenoble, France
BINGJIE LI • The Roslin Institute Building, Scotland’s Rural College, Edinburgh, UK
MORTEN LILLEMO • Department of Plant Sciences, Norwegian University of Life Sciences, IHA/CIGENE, Ås, Norway
JOHANNES W. R. MARTINI • International Maize and Wheat Improvement Center (CIMMYT), Carretera México-Veracruz, Mexico
ISADORA CRISTINA MARTINS OLIVEIRA • Embrapa Milho e Sorgo, Sete Lagoas-MG, Brazil
TRISTAN MARY-HUARD • GQE - Le Moulon, INRAE, University Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, Gif-sur-Yvette, France
ABELARDO MONTESINOS-LÓPEZ • Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, Mexico
JOSÉ CRICELIO MONTESINOS-LÓPEZ • Departamento de Estadística, Centro de Investigación en Matemáticas, Guanajuato, Mexico
OSVAL ANTONIO MONTESINOS-LÓPEZ • Facultad de Telemática, Universidad de Colima, Colima, Mexico
LAURENCE MOREAU • GQE - Le Moulon, INRAE, University Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, Gif-sur-Yvette, France
BRANDON A. MOSQUEDA-GONZALEZ • Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Esq. Miguel Othón de Mendizábal, Mexico city, Mexico
NGUYEN HONG NGUYEN • School of Science, Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD, Australia
AKIO ONOGI • Department of Plant Life Science, Faculty of Agriculture, Ryukoku University, Otsu, Shiga, Japan
RODOMIRO ORTIZ • Department of Plant Breeding, Swedish University of Agricultural Sciences (SLU), Alnarp, Sweden
MARIA MARTA PASTINA • Embrapa Milho e Sorgo, Sete Lagoas-MG, Brazil
PAULINO PÉREZ-RODRÍGUEZ • Colegio de Postgraduados, Montecillos, Mexico
FLORENCE PHOCAS • Université Paris-Saclay, INRAE, AgroParisTech, GABI, Jouy-en-Josas, France
PARTHIBAN THATHAPALLI PRAKASH • Rice Breeding Platform, International Rice Research Institute, Manila, Philippines
TIMOTHY G. RABEN • Michigan State University, East Lansing, MI, USA
EDGAR L. REINOSO-PELÁEZ • Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria. Ctra. de La Coruña, Madrid, Spain
RENAUD RINCENT • Université Paris-Saclay, INRAE, CNRS, AgroParisTech, Génétique Quantitative et Evolution - Le Moulon, Gif-sur-Yvette, France; INRAE - Université Clermont-Auvergne, UMR1095, GDEC, Clermont-Ferrand, France
SIMON RIO • CIRAD, UMR AGAP Institut, Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, Montpellier, France
PAULINE ROBERT • INRAE - Université Clermont-Auvergne, UMR1095, GDEC, Clermont-Ferrand, France; Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, Gif-sur-Yvette, France
ODD ARNE ROGNLI • Department of Plant Sciences, Faculty of Biosciences, Norwegian University of Life Sciences (NMBU), Ås, Norway
ISABEL ROLDAN-RUIZ • Flanders Research Institute for Agriculture, Fisheries and Food (ILVO) - Plant Sciences Unit, Melle, Belgium
VINCENT SEGURA • UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, Montpellier, France; UMT Geno-Vigne®, IFV-INRAE-Institut Agro, Montpellier, France
SHAOLEI SHI • College of Animal Science and Technology, China Agricultural University, Beijing, China
LEIF SKØT • IBERS, Aberystwyth University, Ceredigion, UK
JOSÉ HENRIQUE SOLER GUILHEN • Embrapa Milho e Sorgo, Sete Lagoas-MG, Brazil
MIGUEL A. TORO • Dpto. Producción Agraria, ETS Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Madrid, Spain
LUIS VARONA • Departamento de Anatomía, Embriología y Genética Animal, Universidad de Zaragoza, Zaragoza, Spain; Instituto Agroalimentario de Aragón (IA2), Zaragoza, Spain
ZULMA G. VITEZICA • Université de Toulouse, Castanet-Tolosan, France
JOEL IRA WELLER • Agricultural Research Organization, The Volcani Center, Rishon LeZion, Israel; Israel Cattle Breeders’ Association, Caesarea Industrial Park, Israel
ERIK WIDEN • Michigan State University, East Lansing, MI, USA
SHENGLI ZHANG • College of Animal Science and Technology, China Agricultural University, Beijing, China
ZHE ZHANG • Department of Animal Breeding and Genetics, College of Animal Science, South China Agricultural University (SCAU), Guangzhou, China
Chapter 1
Genetic Bases of Complex Traits: From Quantitative Trait Loci to Prediction
Nourollah Ahmadi
Abstract
Conceived as a general introduction to the book, this chapter is a reminder of the core concepts of genetic mapping and molecular marker-based prediction. It provides an overview of the principles and the evolution of methods for mapping the variation of complex traits, and methods for QTL-based prediction of human disease risk and animal and plant breeding value. The principles of linkage-based and linkage disequilibrium–based QTL mapping methods are described in the context of the simplest, single-marker, methods. Methodological evolutions are analysed in relation with their ability to account for the complexity of the genotype–phenotype relations. Main characteristics of the genetic architecture of complex traits, drawn from QTL mapping works using large populations of unrelated individuals, are presented. Methods combining marker–QTL association data into polygenic risk score that captures part of an individual’s susceptibility to complex diseases are reviewed. Principles of best linear mixed model-based prediction of breeding value in animal- and plant-breeding programs using phenotypic and pedigree data, are summarized and methods for moving from BLUP to marker–QTL BLUP are presented. Factors influencing the additional genetic progress achieved by using molecular data and rules for their optimization are discussed.
Key words QTL mapping, Architecture of complex traits, Marker-assisted selection
Abbreviations
AUC    Area under the receiver-operating characteristic (ROC) curves
BLUE   Best linear unbiased estimate
BLUP   Best linear unbiased prediction
BV     Breeding value
CIM    Composite interval mapping
EM     Expectation maximization
GE     Genotype by environment
GWAS   Genome-wide association study
IBD    Identity by descent
IM     Interval mapping
Lasso  Least absolute shrinkage and selection operator
LD     Linkage disequilibrium
LE     Linkage equilibrium
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
LR     Linear regression
MAF    Minor allele frequency
MAS    Marker assisted selection
MIM    Multiple interval mapping
ML     Maximum Likelihood
MLM    Mixed linear model
QTL    Quantitative trait loci
RR     Ridge regression
SM     Single marker
SNP    Single nucleotide polymorphism

1 Introduction
The term “complex trait” refers to any phenotype that does not exhibit Mendelian recessive or dominant inheritance attributable to a single gene locus. Such breakdown of simple correspondence between genotype and phenotype can arise from different phenomena including incomplete penetrance, genetic heterogeneity, or polygenic inheritance [1]. Here we will focus on “complex quantitative traits” that are controlled by many genes and by environmental factors, and that present a continuous quantitative variation. Analysis of inheritance of quantitative traits goes back to Galton [2] who described the distribution of such traits among progenies as Gaussian with a mean intermediate between the parents’ trait values, and a variance that is independent of those values. Fisher [3] established the first conceptual framework to interpret the biometrical genetics findings of Galton within the Mendelian schemes of inheritance. It stipulates that (1) the genetic variance in a population is due to a large number of Mendelian factors, each making a small additive contribution to a particular phenotype; (2) the variance among offspring depends on the relatedness of the parents, which can be predicted from the pedigree. Later, this framework was enriched with additional explanatory variables, such as dominance and epistasis, and referred to as the “infinitesimal model.” Providing a complete description of the short-term evolution of quantitative traits within a population, and an estimate of the breeding value (BV) of the individuals that compose the population, the infinitesimal model soon became the theoretical backbone of the animal- and plant-breeding programs [4, 5]. The infinitesimal model also provides a framework to estimate some of the parameters of the genetic architecture of quantitative traits.
For a given population, the phenotypic variance could be partitioned into additive, dominance, and epistatic genetic variance, as well as the variances of genotype–environment interactions and environmental effects. The different variance components are estimated from correlations between relatives and thus rely on pedigree information.
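The partition described above can be written compactly. The following is a sketch in notation that is an assumption of this note rather than the chapter's own symbols: $\sigma^2_P$ is the phenotypic variance, and the terms on the right are the additive, dominance, epistatic, genotype-by-environment, and environmental variances. The purely additive form assumes that the components are uncorrelated, so that no covariance terms appear.

```latex
\sigma^2_P = \sigma^2_A + \sigma^2_D + \sigma^2_I + \sigma^2_{G \times E} + \sigma^2_E
```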
The earliest attempt to count and locate the genes responsible for the variation of a quantitative trait goes back to Sax’s [6] work on seed size in dry beans. Building on Morgan’s [7] concepts of linkage and genetic distance between loci, and on Fisher’s [3] infinitesimal model of inheritance of quantitative traits, Sax [6] demonstrated that discrete markers can be used to quantify the number, the chromosomal location, and the individual effects of the Mendelian factors that influence quantitative traits. The basic assumptions in mapping such “quantitative trait loci” (QTLs) were that (1) physical proximity between markers and QTLs results in the cosegregation of their alleles within families, and (2) the persistence of such cosegregating haplotypes is a function of the distance between the two loci, which determines the probability of a recombination between them. Later, such inference about cosegregation of marker locus alleles and trait-influencing locus alleles was termed “linkage analysis” [8, 9]. The concept of linkage initiated the search for experimental populations suitable for linkage mapping and for markers that could be used to predict disease status within human families and traits of interest in animal and plant breeding. By 1960, blood group genetics had been worked out for cattle and poultry breeding [10]. By the early 1980s, enough polymorphic isozyme loci had been identified in crops such as tomato [11] and maize [12] to allow linkage mapping of QTLs for complex traits. Later, the advent of recombinant DNA technology and the associated abundance of polymorphic molecular markers led to the application of the concept to a large number of plant and animal species. The first report of QTL mapping using DNA polymorphism markers goes back to the 1988 work of Paterson et al. [13] on tomato. The first human genome-wide screen for QTLs using molecular markers was reported in the mid-1990s [14].
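The second mapping assumption above, that cosegregation of marker and QTL alleles depends on the distance between the two loci, can be made concrete with a small Python sketch. It relies on Haldane's mapping function, a standard textbook choice that is not named in this chapter, and the function names are hypothetical.

```python
import math


def haldane_recombination_fraction(distance_morgans):
    """Recombination fraction between two loci under Haldane's mapping
    function (assumes crossovers occur without interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def cosegregation_probability(distance_morgans, n_meioses=1):
    """Probability that a marker allele and a QTL allele on the same
    haplotype are transmitted together, without recombination, in each
    of n successive meioses."""
    r = haldane_recombination_fraction(distance_morgans)
    return (1.0 - r) ** n_meioses


if __name__ == "__main__":
    # Illustrative distances only: a tightly linked and a loosely linked pair.
    for d in (0.01, 0.10, 0.50):
        print(d, haldane_recombination_fraction(d), cosegregation_probability(d, 5))
```

Under these assumptions, a marker 0.10 Morgan from a QTL recombines with it in roughly 9% of meioses, whereas the recombination fraction of distant loci approaches 0.5, which corresponds to independent segregation.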
The prospect of exploiting the statistical association between alleles at the molecular marker loci and at some polymorphic sites of the QTLs, to track the presence of the favourable allele at the causative loci of the trait of interest (quantitative trait nucleotides), generated high expectations for accelerating the pace of genetic gain in animal and plant breeding programs and for predicting human disease risk. This also stimulated methodological research to increase the accuracy of QTL mapping and the efficiency of marker-assisted selection (MAS). Concern about the accuracy and robustness of linkage between the marker and the quantitative trait nucleotide led first to the development of novel types of family-based experimental mapping populations and novel methods for ascertaining marker-QTL linkage and, later, to the detection of population-level association between marker and causative polymorphism, using large diversity panels. Methodological developments in the area of MAS explored two major directions: (1) the somewhat idealistic concept of “breeding by design,” which assumed that it is possible to map and track all allelic variations for all traits of interest [15]; and (2) the more pragmatic approach of integrating the most significant marker-QTL information into the existing methods of prediction of BV relying on the infinitesimal model, that is, the simultaneous evaluation of the fixed effects at the mapped QTLs and the additive effects due to the remaining unmapped QTLs involved in the phenotypic variation [16].
Despite impressive subsequent improvements in QTL mapping methods, in terms of marker density, type of mapping population, and statistical models, in almost all mapping studies the total effects of the detected QTLs remained much below the total additive genetic variation as measured by the population-based estimates of the heritability of the studied traits [17]. One of the explanations for this phenomenon, known as “missing heritability,” is the high stringency of the tests of statistical significance applied to marker-QTL associations: only QTLs with rather large effects can be detected [18]. This shortcoming of the QTL mapping approaches, the associated limitations of MAS targeting mapped QTLs, and the availability of a larger number of markers gave birth to the idea of omitting the significance test and simply estimating the effects of all markers or marker haplotypes simultaneously, whatever the size of the effects of individual haplotypes [19]. It is assumed that, with a dense marker coverage of the genome, some markers will be very close to the QTLs and probably in linkage disequilibrium with them. Chromosome segments that contain the same rare marker haplotypes are likely to be identical by descent and hence carry the same QTL allele.
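The idea of estimating all marker effects simultaneously, without a significance test, can be illustrated with a minimal Python sketch. Ridge regression is used here only as one possible shrinkage estimator; it is an assumption of this example, not necessarily the method of the work cited above, and the function names, genotype coding (0/1/2), and shrinkage value are all hypothetical.

```python
import numpy as np


def ridge_marker_effects(genotypes, phenotype, shrinkage):
    """Estimate the effects of all markers jointly by ridge regression.

    genotypes: (n_individuals, n_markers) array of allele counts (0/1/2).
    phenotype: (n_individuals,) array of trait values.
    shrinkage: positive penalty; it keeps the system solvable even when
               there are more markers than individuals.
    """
    x = genotypes - genotypes.mean(axis=0)   # centre each marker
    y = phenotype - phenotype.mean()         # centre the trait
    n_markers = x.shape[1]
    lhs = x.T @ x + shrinkage * np.eye(n_markers)
    rhs = x.T @ y
    return np.linalg.solve(lhs, rhs)


def predict_trait(new_genotypes, ref_genotypes, ref_phenotype, effects):
    """Predict trait values of new individuals from estimated marker effects."""
    centred = new_genotypes - ref_genotypes.mean(axis=0)
    return ref_phenotype.mean() + centred @ effects


if __name__ == "__main__":
    # Simulated data with arbitrary, illustrative dimensions.
    rng = np.random.default_rng(0)
    geno = rng.binomial(2, 0.5, size=(200, 1000)).astype(float)
    true_effects = np.zeros(1000)
    true_effects[:20] = rng.normal(0.0, 0.5, size=20)   # 20 hypothetical QTLs
    pheno = geno @ true_effects + rng.normal(0.0, 1.0, size=200)
    effects = ridge_marker_effects(geno[:150], pheno[:150], shrinkage=100.0)
    predicted = predict_trait(geno[150:], geno[:150], pheno[:150], effects)
    print(np.corrcoef(predicted, pheno[150:])[0, 1])
```

No marker is declared significant or discarded: every marker receives a shrunken effect estimate, and prediction uses the sum of all of them.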
Figure 1 illustrates the evolution of methods for analysing the genetic bases of complex traits and for using QTL information for MAS and for prediction of human disease risk, in relation to the evolution of molecular marker technologies. The methodological development that followed the seminal idea of Meuwissen et al. [20], later known as “genome-wide prediction of complex quantitative traits” or “genomic selection,” and examples of applications of the concept in medical science and in animal and plant breeding, are given in Chapters 2–22 of the present book. Here, we present an overview of the concepts and tools of molecular genetics and prediction of quantitative phenotypic variations that preceded the advent of the genome-wide prediction approach. The first part of the overview is dedicated to methods for mapping complex traits, the second part to the genetic architecture of quantitative traits as revealed by QTL mapping studies, and the third part to methods of QTL-based prediction of complex traits.
Fig. 1 Schematic representation of the evolution of methods to analyse the genetic bases of complex traits and to use the QTL information for marker-assisted selection (MAS) and to predict human disease risk, in relation to the evolution of molecular marker technologies. RFLP: Restriction fragment length polymorphism; SSR: Simple sequence repeats; DArT: Diversity array technology; SNP: Single nucleotide polymorphism; NGS: Next generation sequencing; BC: Backcross; RIL: Recombinant inbred line; NIL: Near isogenic line; CSSL: Chromosome segment substitution line; MAGIC: Multiparent advanced generation intercrosses of inbred lines; NAM: Nested association-mapping population; QTL: Quantitative trait loci; LD: Linkage disequilibrium; MABC: Marker-assisted backcrossing; MARS: Marker-assisted recurrent selection; PRS: Polygenic risk score; GS: Genomic selection. (Adapted from [22])
2 Methods for Mapping Quantitative Trait Loci
Conceptually, QTL mapping is a straightforward three-step process [21]: (1) scanning the genome with as large a number of genetic markers as possible, (2) calculating a linkage statistic between the allelic variation at marker loci and the phenotypic variation of the target trait, and (3) identifying the chromosomal regions in which the linkage statistic shows significant deviation from its expected value under the independence hypothesis, that is, under linkage equilibrium (LE). However, the last two steps of the mapping
Nourollah Ahmadi
process raise several methodological issues. The choice of the “linkage statistic” needs to take into account factors, other than simple physical proximity between the marker loci and the QTL, that can cause the breakdown of marker–QTL independence. These include the nature of the mapping population and the associated properties of linkage disequilibrium (LD), and hypotheses on the genetic architecture of the trait. Likewise, the decision on “significant deviation” must find the right balance between the risk of picking false linkages and the risk of rejecting true ones [21]. Given the central role of the pattern of LD, that is, the nonrandom association between alleles at different loci, QTL mapping methods are often subdivided into two categories: “linkage mapping,” which relies on within-family LD, and “association mapping,” which mainly relies on population-wide LD. Another line of subdivision is whether or not the method accounts for epistatic and QTL by environment interactions. Major differences between the two categories are summarized in Table 1, adapted from [22].
Table 1 Main characteristics of the two categories of QTL mapping approaches

| Criteria | Family-based QTL mapping | Population-based association mapping |
| --- | --- | --- |
| Populations | Few hundred progenies of biparental crosses (mainly in plants) or members of extended pedigrees (few parents or grandparents with large offspring, mainly in animals) | Several hundreds to thousands of individuals with unknown or mixed relationships |
| Number of markers | Few: few hundreds | Many: thousands to millions |
| QTL analysis | Simple (single marker) to sophisticated (multiple intervals, accounting for epistasis and G × E) mapping, minimizing ghost QTLs and increasing mapping precision | Simple (single marker) to sophisticated (multiple regressions), accounting for population structure and providing precise estimates of each QTL’s effects |
| QTL detection depends on | QTL segregating and marker–trait linkage within family(ies) | QTL segregating and marker–trait LD in mapping population |
| Mapping precision | Limited (0.1–15 cM). QTL regions may contain many candidate genes | Can be excellent (100 kb), depending on population LD and marker density |
| Variation detected | Limited to the share of variations segregating in the sampled family(ies) | Larger share, depending on the genetic diversity of the population |
| Extrapolation | Poor (other families not segregating QTLs, potential changes in marker phases, etc.) | Good to excellent (although not all QTLs will segregate in all populations) |
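Since both categories of methods in Table 1 are defined by the pattern of LD they exploit, a small numerical illustration of LD may help. The sketch below computes two standard pairwise LD measures, the coefficient D and the squared correlation r², from phased haplotypes at two biallelic loci. The function name `ld_stats` and the 0/1 allele coding are assumptions made for illustration; they do not come from the chapter.

```python
import numpy as np

def ld_stats(haplotypes):
    """Pairwise LD between two biallelic loci.

    haplotypes: array-like of shape (n, 2); each row is one phased haplotype,
    each column one locus, alleles coded 0/1 (hypothetical coding).
    Returns (D, r2).
    """
    h = np.asarray(haplotypes)
    p_a = h[:, 0].mean()                                 # frequency of allele 1 at locus A
    p_b = h[:, 1].mean()                                 # frequency of allele 1 at locus B
    p_ab = np.mean((h[:, 0] == 1) & (h[:, 1] == 1))      # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                 # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d ** 2 / denom if denom > 0 else float("nan")   # undefined if a locus is monomorphic
    return d, r2
```

For example, `ld_stats([[1, 1], [1, 1], [0, 0], [0, 0]])` gives D = 0.25 and r² = 1 (complete LD), whereas four haplotypes carrying each allelic combination once give D = 0 and r² = 0 (linkage equilibrium).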
2.1 Linkage Mapping Methods
Linkage mapping methods rely on knowledge of the degree of relatedness between members of the mapping population, which makes it possible to estimate the frequency of recombination between pairs of loci. The most efficient way of acquiring pedigree data and, thereby, obtaining an accurate estimate of the recombination frequency is the development of an experimental mapping population from a small number of diverse founders. The characteristics of such mapping populations include the number of founders, their genetic structure (inbred animal strains/plant lines, or heterozygous individuals from outcrossing animal/plant species), the intercrossing scheme, the number of recombination cycles, the size of the final mapping population and the marker density [23]. Among these characteristics, the genetic structure of the founders is often the most decisive factor. For instance, in an experimental population derived from a cross between two inbred animal strains or plant lines, all individuals are informative for both marker and QTL segregation, as there are theoretically only two marker alleles at each segregating locus, and the linkage phase between marker and QTL alleles is known. By contrast, in a population derived from a cross between two noninbred founders, which is often the case in outcrossing species, only parents that are heterozygous for both markers and linked QTL provide linkage information, and individuals may differ in QTL–marker linkage phase [24]. Moreover, in both inbred and noninbred founder cases, not all QTLs affecting the trait might segregate in all families and, when pooled across families, different possible haplotypes for a given marker–QTL linkage can conceal each other’s phenotypic effect [24].
When the development of experimental populations encounters practical limits (in some animal species) or ethical obstacles (in humans and animals), linkage mapping relies on extended pedigrees founded by multiple individuals and on naturally existing relatives such as sibling pairs, sibships, or groups of sibships [24, 25]. For instance, the linkage map developed for mapping the QTL associated with milk production in Holstein dairy cattle included 1794 sons of 14 sires, each of the 14 half-sib families being represented by an average of 128 sons. The phenotypic unit of measurement was the average milk production of the daughters of each son, adjusted for environmental effects and the additive genetic value of the daughters’ dams [26]. Likewise, QTLs affecting human obesity in the Caucasian population of the USA were investigated in 2209 individuals distributed over 507 families of predominantly northern European ancestry. The pairwise relationships within this cohort represented 5399 relative pairs, the majority being nuclear family relationships (2177 parent-to-offspring and 2198 sibling). The 2209 individuals were selected from a population of 60,000 respondents to a nine-page questionnaire that collected information on family structure, health and behaviour status, and detailed family and personal history of obesity and its health complications [27].
Fig. 2 Linkage-based methods for mapping quantitative trait loci (QTL). LR: Linear regression; ML: Maximum likelihood
Concomitant with the diversification of the types of mapping populations and the availability of dense linkage maps has been the development of statistical methods for QTL mapping, relying on frequentist or Bayesian approaches (Fig. 2). The pioneering linkage mapping work of Sax [6] used simple linear regressions (LR) of the phenotype on genotype, with biallelic predictors. The linkage statistic was the F-statistic of a one-way analysis of variance. This “single marker” (SM) method was extended to multiple marker loci through multiple linear regression analysis [28] and presented as a least-squares method for linkage mapping in half-sib populations [29]. In order to account for multiple tests, a Bonferroni correction is applied to the significance threshold, or the distribution of the test statistic is determined via permutation [30]. The parameters of the linear regression can also be estimated through the method of maximum likelihood (ML), these estimates being the values that maximize the probability that the observed data would have occurred. The likelihood of the acquired data under the assumption of the model $H_1$, $L(\text{data} \mid H_1)$ (here, a QTL segregates as expected from the pedigree), is compared to the likelihood of the acquired data under the assumption of a null hypothesis model $H_0$, $L(\text{data} \mid H_0)$ (here, no QTL segregates). The evidence for a QTL is then summarized by the base-10 logarithm of the likelihood ratio, or LOD score, that is, $\log_{10}[L(\text{data} \mid H_1)/L(\text{data} \mid H_0)]$. This approach allows estimating the QTL position and its effect separately, which is not possible in LR. However, it considers only biallelic QTLs, and the distribution of the phenotype is a mixture of normal distributions with different means corresponding to the QTL genotypes, while in LR the phenotype has a single (normal) distribution with a mean equal to the average of probabilities of QTL alleles, with no limit on the number of alleles.

Lander and Botstein [31] developed the concept of interval mapping (IM), which better exploits the availability of a complete genetic linkage map by considering intervals between markers, or intervals of any size decided by the experimenter, as the putative location of a QTL. Dealing with a linear regression problem where the independent variables (genotypes) are missing or known only in the form of a probability distribution, the authors resorted to the maximum likelihood approach, which allows such regressions to be resolved. The approach was implemented with the expectation-maximization (EM) algorithm [32], which maximizes the likelihood function with missing data. The whole linkage map is scanned, at intervals decided by the experimenter, for biallelic QTLs with additive effects, and a LOD score profile is produced along the genome that informs on both the positions and the effects of the QTLs. The confidence interval of each putative QTL is given as the chromosomal segment in which the likelihood is within a factor of 10 of its maximum, that is, in which the LOD score is within one unit of its peak. Haley and Knott [33] proposed a regression version to approximate IM. However, because each phenotype–genotype regression is resolved separately, neither maximum likelihood nor LR-based IM is effective in separating linked QTLs, in analysing interactions between QTLs, or in detecting the presence of QTLs of small effect in addition to QTLs of large effect [34].
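The single-marker regression, the LOD score, the permutation threshold and the one-LOD support interval described above can be sketched in a few lines. This is a simplified illustration, not an implementation of interval mapping: it tests only at marker positions, and it relies on the fact that, for a normal linear model, the likelihood ratio reduces to a ratio of residual sums of squares, so that LOD = (n/2) log10(RSS0/RSS1). The function names, the numeric genotype coding and the assumptions that all markers are polymorphic and ordered along one chromosome are hypothetical.

```python
import numpy as np

def lod_scan(genotypes, phenotype):
    """LOD score of a single-marker regression at each marker.

    genotypes: (n, m) array of numeric marker codes (e.g. 0/1 in a backcross);
    every marker is assumed polymorphic. phenotype: length-n array.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = y.size
    yc = y - y.mean()
    gc = g - g.mean(axis=0)
    rss0 = yc @ yc                                   # residual SS under H0 (no QTL)
    rss1 = rss0 - (gc.T @ yc) ** 2 / (gc ** 2).sum(axis=0)  # residual SS under H1
    return (n / 2.0) * np.log10(rss0 / rss1)

def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide LOD threshold from the permutation distribution of the maximum LOD."""
    rng = np.random.default_rng(seed)
    max_lods = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(phenotype)        # breaks any marker-trait link
        max_lods[i] = lod_scan(genotypes, shuffled).max()
    return float(np.quantile(max_lods, 1.0 - alpha))

def one_lod_interval(lod):
    """Indices (lo, hi) of the contiguous markers within one LOD unit of the peak."""
    lod = np.asarray(lod, dtype=float)
    peak = int(np.argmax(lod))
    cutoff = lod[peak] - 1.0
    lo = hi = peak
    while lo > 0 and lod[lo - 1] >= cutoff:
        lo -= 1
    while hi < lod.size - 1 and lod[hi + 1] >= cutoff:
        hi += 1
    return lo, hi
```

A typical use would be to call `lod_scan` on the observed data, compare the profile with the value returned by `permutation_threshold`, and report `one_lod_interval` around each significant peak.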
The application of the interval mapping concept was extended to multiple QTL models where the search for a QTL in a given interval is enhanced by taking into account the existence of QTLs outside the target interval [35, 36]. The so-called composite interval mapping (CIM) [37] is a hybrid of interval mapping and multiple regression on marker genotypes. The EM algorithm simultaneously estimates the QTL effect at the target interval and the effects of QTLs outside the target interval, the latter being estimated with the regression coefficients of the marker variables selected for this purpose. It is expected that the reduction of the residual variance will improve the QTL detection power and the accuracy of the estimate of the QTLs’ effect size. The key issue with CIM is thus the choice of the set of markers used as cofactors. Using cofactors that are not linked to QTLs can increase the variance of the LOD score and thus decrease the power of QTL detection. Another issue is the detection of linked QTLs that, when in coupling phase, can be detected as a single QTL and, when in repulsion phase, may go
undetected. The marker part of CIM can be used in conjunction with any method of analysis, whether it is LR analysis, ML analysis, variance components analysis, or Bayesian analysis, as long as the corresponding software permits the fitting and selection of additional fixed effects to be included in the analysis. The CIM method was later extended to multiple interval mapping (MIM) through model selection approaches where, rather than fitting prespecified models to the observed data (as is the case in CIM), the subset of models that is best supported by the data is identified from the set of potential models. Kao et al. [38] first applied the model selection approach for MIM. In their method, an EM algorithm calculates LOD scores under explicit multiple QTL model, that also accounts for two-way epistasis. A forward or stepwise selection procedure, using the Bonferroni argument for the likelihood ratio test, is implemented to add QTLs and identify the best QTL model. Various other frequentist model selection methods for multiple QTL mapping were proposed calling on genetic algorithm [39] or using different model selection criteria to choose variables for multiple regression [40]. The most comprehensive frequentist method proposed for multiple QTL mapping is probably the restricted maximum likelihood analysis based on the linear mixed model [41]. The mixed model with random QTL effects allows fitting different models of QTL variations (biallelic and multiallelic) and incorporating complex pedigrees even in outcrossing populations. The issue of MIM was also approached from the Bayesian perspective in human [42], animal [43], and plant [44]. MIM in a Bayesian framework comprises three steps. 
(1) Setting up a joint probability distribution, which models the prior beliefs on marker effects or relationships between the observed data (phenotypes, and marker data, that provide information about the segregation of alleles at various genomic positions in a mapping population) and unknown quantities (QTLs’ genotype, their genomic position and the size of their genetic effects). (2) Computing the posterior distribution, the conditional distribution of the unobserved QTLs’ genotypes and the prior distribution of the size of genetic effects. (3) Including individual loci in the model, while adjusting for effects of all other possible loci, to estimate separately the posterior probabilities and corresponding Bayes factors of their main effects and/or interactions with other loci (epistasis) or environmental effects. Bayesian methods for multiple QTL mapping differ in the ways they deal with the unobserved genotypes or pseudomarkers and the number of loci included in the model (reviewed in [45]). Options proposed for dealing with pseudomarkers include (1) the replacement of all missing genotypes by their expected values conditional to the observed marker data, and thus essentially removing QTL genotypes from the list of unknowns [46]; (2) the sampling of genotypes, several times for each locus,
from the conditional probability, and averaging the results [34]. Options proposed for the number of loci to consider include small numbers (1–5), numbers as large as the number of QTLs, larger numbers of loci associated with model selection methods, and numbers of loci equal to the upper bound of detectable QTLs, associated with the removal of small effects from the model (reviewed in [47]). The Bayesian approach allows taking into account the uncertainty on multilocus marker–QTL genotypes and on the number of QTL, fitting different models of QTL variation, and incorporating complex pedigrees. The latest statistical approach for multiple QTL mapping calls on statistical machine learning. The approach combines ridge regression, recursive feature elimination, and estimation of generalization performance and marker effects using bootstrap resampling [48]. The dataset is split into independent training and testing subsets that are used for model induction and evaluation, respectively. A linear model containing all markers is fitted to the phenotype. The model is then reduced in size by recursively eliminating the least useful markers and refitting the model until only a single marker is left, which is similar to recursive feature elimination support vector machines [49]. After each elimination, change in variance explained is measured in the test set and attributed to the marker that was removed. The entire process is then repeated numerous times to derive an unbiased bootstrap estimate of the predictive power of each marker. Compared to CIM, Bayesian MIM and SM on a synthetic dataset and a multitrait and multienvironment barley dataset, the machine-learning algorithm showed QTL detection power similar to Bayesian MIM, produced consistently better estimates of QTL effects, and QTL resolution (peak definition) than the other three methods. 
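The machine-learning procedure described above (ridge regression, recursive elimination of the least useful marker, and attribution of the change in test-set variance explained to the removed marker, repeated over bootstrap samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [48]: the "least useful" marker is taken to be the one with the smallest absolute ridge coefficient, the out-of-bag individuals of each bootstrap sample serve as the test set, the variance explained by the last remaining marker is credited to that marker, and the penalty `lam` is fixed rather than tuned. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Ridge regression with an unpenalised intercept; returns (intercept, coefficients)."""
    mu_x, mu_y = x.mean(axis=0), y.mean()
    xc = x - mu_x
    beta = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ (y - mu_y))
    return mu_y - mu_x @ beta, beta

def r_squared(x, y, intercept, beta):
    """Proportion of variance of y explained by the fitted model on (x, y)."""
    resid = y - (intercept + x @ beta)
    centred = y - y.mean()
    return 1.0 - (resid @ resid) / (centred @ centred)

def rfe_importance(x, y, lam=1.0, n_boot=20, seed=0):
    """Average test-set variance explained attributed to each marker."""
    rng = np.random.default_rng(seed)
    n, m = x.shape
    importance = np.zeros(m)
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)             # bootstrap sample (with replacement)
        test = np.setdiff1d(np.arange(n), train)       # out-of-bag individuals
        active = list(range(m))
        b0, b = ridge_fit(x[train][:, active], y[train], lam)
        score = r_squared(x[test][:, active], y[test], b0, b)
        while len(active) > 1:
            removed = active.pop(int(np.argmin(np.abs(b))))  # drop the weakest marker
            b0, b = ridge_fit(x[train][:, active], y[train], lam)
            new_score = r_squared(x[test][:, active], y[test], b0, b)
            importance[removed] += score - new_score   # loss attributed to the removed marker
            score = new_score
        importance[active[0]] += score                 # assumption: remainder goes to the last marker
    return importance / n_boot
```

Ranking markers by the absolute coefficient presupposes that all markers are on a comparable scale (for instance, the same 0/1/2 coding); with heterogeneous coding, standardising the columns first would be a reasonable precaution.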
Concomitant with the refinement of MIM, growing awareness of the importance of epistasis and genotype by environment (G × E) interactions led to the development of mapping methods specifically focusing on the detection of epistatic QTLs (e.g., [50–52]) and of genotype by environment interactions in phenotypic data involving multiple environments [53]. Likewise, methods for mapping QTLs for multiple correlated traits were developed, adopting a variety of approaches including multiple regression (e.g., [54]), maximum likelihood [55], generalized estimating equations [56], dimension-reduction techniques involving principal component analysis [57], and Bayesian frameworks [58]. The advantages expected from multitrait approaches are an increased power of QTL detection and an increased accuracy of parameter estimation. Efforts to narrow down the confidence interval of QTLs have also led to the development of methods for meta-QTL analysis [59, 60], which make it possible to combine the large body of QTL information available for major crops and animal species. Another approach to increase the resolution of QTL mapping was the development of
multiparental experimental populations that provide a higher number of informative crossovers and the possibility to interrogate multiple alleles, with limited population structure effects [61]. The strategy was first implemented in mouse genetics with heterogeneous stocks [62] and then adapted to plant genetics via multiparent advanced generation intercrosses of inbred lines (MAGIC, [63]) and interconnected populations via diallel schemes or star (nested) designs [64]. Despite the incremental improvements reviewed above, linkage mapping methods could not overcome an intrinsic limitation: the correct statistical model to estimate QTL effects would require the true number of QTLs and their precise positions to be known. Consequently, all approaches for QTL analysis share the problem of model selection, which generally results in an overestimation of the QTL effects and of the proportion of explained genotypic variance, especially for small sample sizes [65]. This limitation and the feasibility of genotyping large numbers of individuals for a very large number of loci prompted the development of methods for LD-based QTL mapping in natural populations.
2.2 Linkage Disequilibrium (LD) Mapping
LD-based QTL mapping, also known as association mapping or genome-wide association study (GWAS), detects and locates QTLs based on the nonrandom correlation of alleles at separate loci in unrelated, presumably distant, individuals for which the transmission of a phenotype over generations cannot be traced through pedigree information. These nonrandom correlations are either the result of evolutionary processes such as mutation, recombination, natural selection, and admixture between populations with different gene frequencies [66], or the result of residual kinship between members of the population. Although the resolution power of GWAS is the highest in natural population-based samples, it can also be implemented in family-based samples. Beyond the type of population, the resolution power of GWAS depends on the heritability and the genetic architecture of the trait, the characteristics of the population (reproduction regime of the species, size, the extent of LD, and degree of subdivision or structure), and the marker density relative to the extent of LD. Similarly to linkage mapping, methods for LD mapping first addressed the model assuming multiple independent QTLs with additive fixed effects. Single marker regressions are fitted, a Bonferroni correction or the Churchill and Doerge [30] permutation method is used to account for multiple tests, and a confidence interval for the detected QTL is estimated using cross-validation. When the GWAS experiment involves case–control samples, as is often the case in human health related studies, the disease risk $\pi$ is tested using logistic regression. The transformation $\operatorname{logit}(\pi) = \log\frac{\pi}{1-\pi}$ is applied to the individual disease risk $\pi$. The value of $\operatorname{logit}(\pi_i)$ is equated to either $\beta_0$, $\beta_1$ or $\beta_2$, according to the genotype of
individual $i$ ($\beta_1$ for heterozygotes). The likelihood-ratio test of this general model, against the null hypothesis $\beta_0 = \beta_1 = \beta_2$, has two degrees of freedom, and for large sample sizes, it is equivalent to the Pearson 2-df test [67]. The power of such GWAS depends on the LD between marker and QTL, the share of total phenotypic variance explained by each QTL, the size of the population, the frequency of the marker’s alleles, and the choice of the significance threshold. One can thus approximate, ex ante, the power of a given GWAS experiment to detect a QTL of a given effect size, at least in an unstructured population [68]. Subsequent research to improve the power of single marker regression GWAS was aimed at the ascertainment of LD between the marker and the QTL alleles, and also at accounting for population structure. In fact, methods for LD ascertainment consist in ascertaining the identity by descent (IBD) of the marker (often a single nucleotide polymorphism or SNP) and the QTL alleles. This is done either by using haplotypes of markers instead of individual markers (haplotype-based GWAS), or by using the marker haplotype information to infer the probability that two individuals carry the same QTL allele at a putative QTL position (IBD approach). In the latter case, the effect of the putative QTL is modelled for each individual as the effects of the QTL alleles carried by its two parents. However, the choice of the size of the haplotype window and of the method to calculate IBD are not trivial issues, and the comparative advantage of these approaches over the single marker approach is controversial [69]. Methods to account for population structure try to reduce deviations from the assumption of a simple relationship between strength of correlation and meiotic distance, deviations that might result in false positive associations.
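The 2-df genotype test for case–control data described at the beginning of this passage can be illustrated with a short sketch. For the general (saturated) genotype model, the fitted risks are simply the observed case proportions within each genotype class, so the likelihood-ratio statistic can be computed directly from the 2 × 3 table of counts, alongside the Pearson statistic. The function names and the order of the genotype classes are assumptions for illustration; all cell counts are assumed to be positive when the logits are computed.

```python
import numpy as np

def genotype_tests(cases, controls):
    """Pearson and likelihood-ratio 2-df tests on a 2 x 3 genotype table.

    cases, controls: counts for the three genotype classes, in the same order.
    Returns (pearson, p_pearson, lrt, p_lrt).
    """
    obs = np.array([cases, controls], dtype=float)
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    pearson = float(((obs - expected) ** 2 / expected).sum())
    nz = obs > 0                                   # empty cells contribute zero to the LRT
    lrt = float(2.0 * (obs[nz] * np.log(obs[nz] / expected[nz])).sum())

    def p_value(stat):
        # Survival function of the chi-square distribution with 2 df is exp(-x / 2).
        return float(np.exp(-stat / 2.0))

    return pearson, p_value(pearson), lrt, p_value(lrt)

def genotype_logits(cases, controls):
    """Maximum-likelihood logit of the case proportion within each genotype class."""
    return np.log(np.asarray(cases, dtype=float) / np.asarray(controls, dtype=float))
```

With cases `[20, 50, 30]` and controls `[30, 50, 20]`, the Pearson statistic is 4.0 and the likelihood-ratio statistic about 4.03, which illustrates how close the two tests are even at moderate sample sizes. Under case–control sampling, the differences between the three logits are interpretable as log odds ratios, whereas their absolute level depends on the sampling ratio of cases to controls.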
When the population structure arises from the sampling process (family-based human and animal populations, animal and plant populations bred for specific traits, etc.), the actual marker–QTL association can be distinguished from spurious background associations through the transmission disequilibrium test first implemented in human GWAS [70]. The test requires the parental genotypic data of the individuals present in the population to check for both linkage between the marker allele and the phenotype at the putative QTL, and LD between the marker and the QTL alleles across the population. Effects of the family-based structure can also be accounted for, directly in the regression model. A relationship matrix is built from the pedigree of the population and its effects, considered as random, are included in a linear mixed model, while the marker effect remains fixed. In the absence of pedigree data, a genomic relationship matrix can be computed using the marker data [71].
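As an illustration of how a genomic relationship matrix can be derived from marker data alone, the sketch below implements one widely used estimator, in which genotypes coded 0/1/2 are centred by twice the allele frequency and the cross-product is scaled by 2Σp(1 − p). Whether this matches the estimator used in [71] is not verified here; the function name and the coding are assumptions.

```python
import numpy as np

def genomic_relationship(genotypes):
    """Marker-based relationship matrix.

    genotypes: (n, p) array of allele counts coded 0/1/2; at least one marker
    must be polymorphic so that the scaling factor is positive.
    """
    m = np.asarray(genotypes, dtype=float)
    freq = m.mean(axis=0) / 2.0                  # allele frequency at each marker
    z = m - 2.0 * freq                           # centre each marker column
    return z @ z.T / (2.0 * np.sum(freq * (1.0 - freq)))
```

The resulting n × n matrix can take the place of the pedigree-based relationship matrix as the covariance structure of the random effect in the linear mixed model.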
The most common methods to control false positives caused by population structure arising from complex background-related processes (migration, admixture, etc.) are “genomic control” and “structured association.” Both methods include a detection step using a set of markers distributed randomly across the genome. In genomic control, the level of spurious background-related associations is first estimated empirically from the distribution of the test statistic for association, established with the set of genome-wide distributed markers. Then, the hypothesis of marker–QTL association is tested against the null hypothesis of no association above the level of background association [72]. In structured association, the set of genome-wide distributed markers serves for a structure analysis that allocates the individuals of the population to a number of parental populations (or subpopulations), so as to minimize the LD within subpopulations. Then, the null hypothesis that the subpopulations’ allele frequencies at the candidate locus are independent of the phenotype is tested against the alternative hypothesis that the subpopulations’ allele frequencies at the candidate locus depend on the phenotype [73]. However, accounting only for subpopulation membership may lead to insufficient control of false positives owing to familial relatedness. To overcome this problem, Yu et al. [74] proposed a mixed linear model (MLM) method where the marker additive effect and the subpopulation membership or “admixture” (Q) effect are fitted as fixed effects, whereas kinship (K) among individuals is incorporated as the variance–covariance structure of the random effect for the individuals. The MLM is solved using a restricted maximum likelihood algorithm. Later, several statistical methods were proposed to reduce the computational burden of solving the MLM with numerous individuals and markers to be tested, each marker needing dedicated iterations to estimate the population parameters (Fig. 3).
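The genomic control step described above is commonly operationalised through an inflation factor, computed as the ratio of the median of the observed 1-df association statistics at the genome-wide markers to the median of the chi-square distribution with one degree of freedom (about 0.455). The sketch below follows that common convention, which may differ in detail from the procedure in [72]; the function name and the choice to floor the factor at 1 are assumptions.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364   # median of the chi-square distribution with 1 df

def genomic_control(chi2_stats):
    """Return the inflation factor and the deflated 1-df association statistics."""
    stats = np.asarray(chi2_stats, dtype=float)
    # Floored at 1 so that the correction never makes the tests more liberal.
    inflation = max(float(np.median(stats)) / CHI2_1DF_MEDIAN, 1.0)
    return inflation, stats / inflation
```

Candidate-marker statistics divided by the inflation factor are then compared with the usual chi-square critical values.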
Fig. 3 Linkage disequilibrium (LD) based methods for mapping quantitative trait loci (QTL). MLM: Mixed linear model; RR: Ridge regression; LASSO: Least absolute shrinkage and selection operator; MLMM: Multilocus mixed model; MTMM: Multitrait mixed model; MTAG: Multitrait analysis of GWAS; EMMA: Efficient mixed-model association; GEMMA: Genome-wide efficient mixed-model association; P3D: Population parameters previously determined; EMMAX: Efficient mixed-model association eXpedited

One option consists in improving the speed of iteration with a dedicated algorithm: the Efficient Mixed-Model Association algorithm (EMMA) turns the two-dimensional optimization of the genetic and residual variance components into a one-dimensional optimization by deriving the likelihood as a function of their ratio [75]. Another option consists in limiting or avoiding the iteration for each marker tested. The iteration-limiting method consists in clustering the individuals into groups based on kinship, to reduce the size of the random genetic effect, and then fitting the groups as a random effect. Subsequently, the marker effect and the random genetic effect are estimated for each marker (Compressed-MLM, [76]). The iteration-avoiding methods consist in estimating the variance components (or their ratio) only once and then fixing them to test the genetic markers. The Population Parameters Previously Determined (P3D) [76] and the Efficient Mixed-Model Association eXpedited (EMMAX) [77] methods belong to this category. Later, an exact Genome-wide Efficient Mixed-Model Association (GEMMA) method was developed to estimate the population parameters for each tested marker with a computational efficiency similar to that of P3D or EMMAX [78].

The next improvement in GWAS came from the implementation of multiple regression approaches expected to result, notably, in a higher share of genetic variance explained by the detected QTLs, and in more accurate estimates of individual QTLs’ effects and positions, especially in populations with extensive LD. As, in most GWAS experiments, the number of markers (p) greatly exceeds the number of individuals (n) of the population (some GWAS targeting human diseases involve several million SNPs and several hundred thousand individuals), the marker effects cannot be estimated simultaneously by conventional least-squares-based multiple regression analysis. Resolution of such large-p-with-small-n regressions requires modelling all SNP effects as random, associated with shrinkage-based inference or with model selection procedures that search for the optimal model among all possible models, assuming that most markers have no influence on the phenotype.
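The two computational ideas described above for the mixed linear model, namely the one-dimensional optimization over the ratio of variance components (EMMA) and the strategy of estimating that ratio once and keeping it fixed while testing markers (P3D, EMMAX), can be sketched as follows. This is a simplified illustration and not the published algorithms: it maximizes the ordinary (not restricted) likelihood over a coarse grid of the ratio δ = σ²e/σ²g, and it reports a Wald statistic per marker. The names `fit_null` and `mlm_scan`, the grid, and the choice of statistic are assumptions.

```python
import numpy as np

def _weighted_ls(y_r, x_r, w):
    """Weighted least squares; returns (beta, (X'WX)^-1, weighted residual SS)."""
    xtw = x_r.T * w
    xtwx_inv = np.linalg.inv(xtw @ x_r)
    beta = xtwx_inv @ (xtw @ y_r)
    resid = y_r - x_r @ beta
    return beta, xtwx_inv, float(np.sum(w * resid ** 2))

def fit_null(y, covariates, kinship, log_delta_grid=None):
    """Estimate delta = sigma2_e / sigma2_g once, under the model without marker."""
    if log_delta_grid is None:
        log_delta_grid = np.linspace(-5.0, 5.0, 101)
    eigval, eigvec = np.linalg.eigh(kinship)        # K = U diag(s) U'
    eigval = np.clip(eigval, 0.0, None)             # guard against tiny negative values
    y_r, x_r = eigvec.T @ y, eigvec.T @ covariates  # rotation makes the errors independent
    n = y.size
    best_loglik, best_delta = -np.inf, None
    for log_delta in log_delta_grid:
        delta = float(np.exp(log_delta))
        _, _, rss = _weighted_ls(y_r, x_r, 1.0 / (eigval + delta))
        # Profile log-likelihood as a function of the single parameter delta.
        loglik = -0.5 * (n * np.log(2.0 * np.pi * rss / n)
                         + np.sum(np.log(eigval + delta)) + n)
        if loglik > best_loglik:
            best_loglik, best_delta = loglik, delta
    return best_delta, eigval, eigvec

def mlm_scan(y, covariates, genotypes, kinship):
    """Wald statistic of each marker, with delta fixed at its null-model estimate."""
    delta, eigval, eigvec = fit_null(y, covariates, kinship)
    w = 1.0 / (eigval + delta)
    y_r, x_r, g_r = eigvec.T @ y, eigvec.T @ covariates, eigvec.T @ genotypes
    n, q = covariates.shape
    stats = np.empty(genotypes.shape[1])
    for j in range(genotypes.shape[1]):
        design = np.column_stack([x_r, g_r[:, j]])
        beta, cov_unscaled, rss = _weighted_ls(y_r, design, w)
        sigma2 = rss / (n - q - 1)
        stats[j] = beta[-1] ** 2 / (sigma2 * cov_unscaled[-1, -1])
    return delta, stats
```

Here `covariates` is an n × q matrix of fixed effects (at least a column of ones, possibly extended with subpopulation membership), and `kinship` is any positive semi-definite relationship matrix. Because the eigendecomposition and the estimate of δ are computed only once, the cost per marker reduces to a small weighted regression.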
Approaches for fitting large-p-with-small-n regressions include ridge regression (RR), the least absolute shrinkage and selection operator (LASSO), the elastic net, and stepwise MLM [79]. Like regular least-squares methods, RR minimizes a loss function that includes the sum of squared regression residuals. However, as opposed to least squares, the loss function also includes a term consisting of a positive penalty parameter $\lambda$ times the model complexity, measured by the sum of squared regression weights. When $\lambda = 0$, RR corresponds to regular least-squares methods. Likewise, when $\lambda = \sigma_e^2/\sigma_g^2$ ($\sigma_e^2$ being the residual variance and $\sigma_g^2$ the genetic variance due to a single SNP in the linear mixed model), the RR estimator of the marker effects is equal to the best linear unbiased prediction (BLUP) of the marker effects obtained with the mixed model. RR is expected to perform particularly well under the hypothesis of a large number of QTLs with small individual effects. The LASSO method also uses a loss function including a penalty term. Its most distinctive property is its ability to perform variable (i.e., marker) selection so that, for a sufficiently large $\lambda$, many SNPs’ coefficients will be zero. Thus, the higher the penalty parameter $\lambda$, the lower the number of SNPs with non-null effects. The stepwise MLM first uses the above-described MLM GWAS, scanning the genome one SNP at a time with a rather loose significance threshold (no correction for multiple tests, for instance), to choose a set of putative SNPs associated with the phenotypic variation. It then includes these putative SNPs in a multidimensional model to detect the truly significant SNPs. When the number of putative significant SNPs is small or not too high, the parameters of the mixed model can be estimated with an exact method or with maximum likelihood, restricted maximum likelihood, and EM algorithms.
When the number of putative SNPs is several times higher than the number of individuals in the population, a mixed model algorithm that approximates the estimation of the SNPs’ variance is implemented [79]. In the above-described shrinkage or posterior-based inference, SNP effects are treated as independent and identically distributed variables drawn from a Gaussian distribution. The resulting shrinkage can be too strong for SNPs with small or moderate effects, especially in experiments with a very large number of SNPs, and the resulting test of the random effects too conservative. An alternative is the use of sparser assumptions for the SNP effects, through the Bayesian approach. Several multilocus GWAS models with a nonnormal prior distribution of SNP effects were proposed. These include the Student’s t-distribution, which allows SNPs to have a moderate to large effect (Bayes A method), the double-exponential distribution (Bayesian LASSO), and two-component mixtures with a point mass at zero and a slab that can be either the Student’s t-distribution (Bayes SSVS or Bayes B) or Gaussian (Bayes C) [80]. These methods can be extended further with priors consisting of multiple normal distributions (Bayes R), so that SNPs can be classified, according to their posterior probabilities of belonging to each distribution, as having zero effect, small effect, or moderate effect [69].
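The ridge regression estimator discussed earlier in this section, and its link with the BLUP of marker effects, can be made concrete with a short sketch. The penalised least-squares solution can be written in a "primal" form, which inverts a p × p matrix, or in an algebraically identical "dual" form, which inverts an n × n matrix and coincides with the mixed-model expression of the BLUP of marker effects when λ = σ²e/σ²g. The sketch assumes that the phenotype and marker columns are centred and that there are no other fixed effects; the function names are hypothetical.

```python
import numpy as np

def ridge_primal(x, y, lam):
    """Minimise ||y - x b||^2 + lam * ||b||^2 via the p x p normal equations."""
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def ridge_dual(x, y, lam):
    """Same estimator through an n x n system (requires lam > 0).

    With lam = sigma2_e / sigma2_g this is the mixed-model expression of the
    BLUP of marker effects: b = X' (X X' + lam I)^-1 y.
    """
    n = x.shape[0]
    return x.T @ np.linalg.solve(x @ x.T + lam * np.eye(n), y)
```

When p greatly exceeds n, as in most GWAS, the dual form is much cheaper because the system to solve has the dimension of the number of individuals. Setting `lam=0.0` in `ridge_primal` recovers ordinary least squares, which is only well defined when the markers are fewer than the individuals and not collinear.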
The penalized regression approach was also used to detect epistatic interactions, especially when the strategy consists in first detecting a limited number of SNPs with significant additive effects and then searching for interactions between these SNPs. Another strategy is to consider the pairwise interactions of all SNPs with each of the significant SNPs [81]. Korte et al. [82] first proposed a multitrait mixed model (MTMM), which extends the MLM approach to multiple correlated traits, permitting the identification of both interactions and pleiotropic loci while correcting for population structure. However, owing to its computational complexity, the method had limited applications. The availability of large bodies of genotypic and phenotypic data generated in individual GWAS experiments, targeting major human health issues and some traits of interest for animal and plant breeding, has given rise to the development of meta-analysis approaches. These approaches either use the summary statistics (i.e., the effect sizes and their standard errors, or the p-values of SNPs) from multiple independent GWAS [83], or combine the data collected for individual studies to increase sample size [84]. The first approach is implemented mainly for the validation of individual GWAS results. The second approach has the potential to detect new rare QTLs, with a low (below 1%) minor allele frequency in the individual GWAS populations, provided that the heterogeneity (effect sizes of SNPs, LD between marker allele and QTL allele, etc.) arising from differences in the genetic background of the individual GWAS populations and from the environmental conditions of each experiment is accounted for. Another avenue for mining published GWAS experiments is the joint analysis of summary statistics from GWAS of different traits, possibly from overlapping samples (MTAG, [85]).
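To illustrate how effect sizes and standard errors from independent GWAS can be combined for a single SNP, the sketch below implements the standard inverse-variance weighted fixed-effect meta-analysis. It is given as a generic example of summary-statistic meta-analysis, not as the specific method of [83]; it assumes independent studies, a common underlying effect and a consistent reference allele across studies. The function name is hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted combination of per-study effects for one SNP.

    betas: effect estimates from each study; ses: their standard errors.
    Returns (pooled effect, pooled standard error, z score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    return beta, se, z, p_value
```

For two studies reporting effects of 0.1 and 0.3 with a standard error of 0.1 each, the pooled effect is 0.2 with a standard error of about 0.071, so that a signal that is weak in one study alone becomes clearer in the combined analysis. Heterogeneity between studies, mentioned above, would call for a random-effects extension that is not shown here.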
The wealth of data on the genetic variation of complex human diseases and of animal or plant traits has also led to the development of new computational approaches that deal with complexity while making far fewer assumptions, about the functional form of the model and the effects to be modelled, than the biostatistical approaches reviewed above. These data mining and machine learning methods are expected to overcome the impracticability of genome-wide testing of epistatic and G × E interactions within the framework of biostatistical approaches. They include combinatorial partitioning and machine learning methods such as random forests and multifactor dimensionality reduction [82, 86, 87]. A complementary approach is GWAS pathway analysis, which integrates the results of a GWAS with the genes in a known molecular pathway to test whether the pathway is associated with the trait [88, 89]. Specific analysis methods and tools were developed for exhaustive genome-wide epistasis search [90] and for the detection of G × E interactions [91] in large datasets from case-control designs.
Nourollah Ahmadi
2.3 Combination of LD and Linkage Mapping
Efforts to narrow down the confidence interval of QTLs have also led to mapping methods that combine linkage mapping with LD mapping in experimental populations. By using information from all recombination events, both the recent ones that occurred during the development of the mapping population and the historical ones that occurred since the mutations took place in the ancestral population, the approach increases the precision of the estimates of QTL positions (reviewed by [92]). The approach was first implemented by Meuwissen and Goddard [93] in a frequentist framework, using LD with closely linked marker loci (haplotypes). The method compares the expected covariances between haplotype effects, given a postulated QTL position, to the covariances found in the data. The expected covariances between the haplotype effects are proportional to the probability that the QTL position is identical by descent given the marker haplotype information. Simulation results with a very dense marker map (markers spaced at intervals of 0.25 cM) showed that a QTL could be correctly positioned within a region of 0.75 cM. By comparison, with a similarly high-resolution map but using linkage mapping alone, Long et al. [94] were able to map QTLs only to 5–10 cM regions. Meuwissen and Goddard [93] proposed a general approach for identifying QTLs, in which several stages of LD mapping are used with increasingly dense marker spacing. The combined LD and linkage mapping using haplotypes was also implemented in a Bayesian framework, with fewer assumptions about population demography than the frequentist methods. These combinations of LD and linkage mapping should not be confused with the ones used in human genetics, where linkage information is used conditional on LD information or the other way round [92].
3 Genetic Architecture of Complex Traits
For a long time, quantitative genetics, though explicitly modelling the effects of individual QTLs, their pairwise and higher-order interactions and their interactions with the environment, was restricted to estimating the variance components of the phenotypic variation for a given population. Inference of the number and the genomic position of the QTLs, of their individual allelic effects and of the effects of their interactions with other QTLs and with the environment was not accessible. The advent of molecular markers and the development of GWAS, which aims to identify the causal polymorphisms or, at least, to map polymorphisms that are in strong LD with QTLs, provide new opportunities to go beyond the statistical description of the genetic architecture of quantitative traits.
3.1 The Number and the Effect Size of QTLs
Reviews of linkage mapping studies in animal species [95] and in plants [96] indicate large variations between experiments in the numbers and effects of QTLs. In several cases, a few QTLs of large effect explain most of the phenotypic variance, whereas in other cases a large number of QTLs of small effect are involved. For instance, in livestock, for the traits analysed in [97], approximately 20 of the largest QTLs would explain 90% of the genetic variance of a typical trait and approximately 5 QTLs would explain 50%. Moreover, there are no obvious rules, for example according to the kind of quantitative trait, that would allow predictions to be made about the genetic basis of a given trait in a given experimental population. QTL cloning studies yielded contrasting results regarding the exact nature of the underlying sequence variation. In several cases, the fine-mapping process revealed that the QTLs in fact represented the combined effects of several linked polymorphisms [98]. In other cases, the QTLs corresponded to individual genes with large effects. It should be mentioned that, owing to various artefacts, in a typical linkage mapping experiment (heritability 40%, 300 F2 individuals), no more than 12 significant QTLs are ever likely to be detected [99], and the effects of these QTLs are systematically overestimated [65]. Thus, despite enormous efforts in QTL mapping for human susceptibility to diseases and for traits of interest for animal and plant breeding, results are often inconsistent with the infinitesimal model of quantitative genetic variation [5], which posits extremely large numbers of loci with very small effects. Rather, the allelic effects followed an exponential distribution, as proposed by Robertson [100], with a few loci with moderate to large effects, and increasingly larger numbers of loci with increasingly smaller effects [101].
GWAS, using large populations of unrelated individuals, provided sharper insight into the genetic architecture of complex traits, especially the number of QTLs involved and their effect sizes. They identified hundreds of genetic variants associated with quantitative traits important in medicine and agriculture [102]. However, in most cases, the published SNPs reliably associated with a trait explain only a small proportion of the known genetic variance. A well-documented illustration of this phenomenon is the results of GWAS for height in humans. While the narrow-sense heritability of this typical complex trait is estimated at 80%, and while the proportion of its variance explained by the combined effects of 295,000 SNPs genotyped on 3925 unrelated individuals was 45%, several GWAS experiments, testing each SNP individually, have detected about 50 QTLs that in total accounted for only 10% of the phenotypic variance [103]. The largest proportion of variance explained by any one SNP is about 0.3%. The explanatory hypotheses for this discrepancy, called “missing heritability,” are (1) the stringency of the significance threshold to account for
multiple tests, which causes many real associations to be missed, especially if individual SNPs have a small effect on the trait; and (2) the exclusion of SNPs of low minor allele frequency (MAF < 0.01) from conventional GWASs. In a complementary study, Yang et al. [104] showed that the proportion of the variance of human height explained by the combined effects of 17 M SNPs genotyped on 44,126 unrelated individuals was 56%, and that SNPs with 0.001 < MAF ≤ 0.01 and MAF ≤ 0.001 explained 8.4% and 3.8% of the total variance, respectively. They concluded that the missing heritability of about 25% might be due either to the large number of SNPs of extremely low MAF or to an overestimation of the heritability in family studies. Similar conclusions were drawn from the estimation of the distribution of the effect sizes of loci involved in human susceptibility to several complex diseases [105]. Some rare variants can also contribute significantly to genetic variance, especially for diseases with major fitness consequences [106]. Interestingly, when data from [103] were used to compare the predictive ability of whole-genome prediction models with that of models based on a few selected SNPs, as is often the case in the prediction of human disease risk, the former yielded higher accuracy [107]. Estimating the number of causal variants affecting complex traits and the distribution of their effects by approximating the distribution of effect sizes with a mixture of normal distributions (Bayes R), Goddard et al. [102] confirmed the hypothesis of a very large number of QTLs, each with very small effects. They found that thousands of SNPs, with effect sizes drawn from a distribution with variance of $0.0001\,\sigma_g^2$, and a few with variance of $0.01\,\sigma_g^2$, affected many traits in both humans and cattle.
On the other hand, they found that most of the variance is due to QTLs that have only slightly lower MAF on average than neutral SNPs, suggesting that most of the variance is due to common SNPs (MAF > 0.01). Likewise, comparing the genetic architecture of quantitative traits in laboratory mice, fruit flies and humans, [101] showed that the model of large numbers of loci, each of small effect, holds for the three species and that the distribution of effect sizes of common SNPs is the same for all phenotypes regardless of species. Thus, it appears that, in accordance with Fisher’s infinitesimal model, a typical complex trait involves thousands of QTLs with effects varying from large to very small, most of them explaining less than 1% of the variance. An implication of this finding is the need for large population sizes to mitigate the stringency of the significance threshold applied to account for multiple tests in GWASs [105].
3.2 Epistatic Interactions
In Mendelian genetics, epistasis between loci with alleles of large effect refers to the masking of genotypic effects at one locus by genotypes of another locus [108]. Fisher [3] defined epistasis, in a statistical manner, as an explanation for deviation from additivity in a linear model. It refers to any statistical interaction between
genotypes at two or several loci [109]. Epistasis can be synergistic, whereby the phenotype of one locus is enhanced by genotypes at another locus; antagonistic, whereby the difference between genotypes at a locus is suppressed in the presence of genotypes at the second locus; or it can produce novel phenotypes [110]. Epistasis was recognized as the main genetic component of human diseases [111]. Epistatic interactions are difficult to detect in QTL mapping studies, because the large number of pairwise tests for marker-marker interactions requires large mapping populations and imposes a stringent experiment-wise significance threshold. Furthermore, segregation of other QTLs can interfere with epistasis between the target pair of loci [101]. Most published studies on epistatic effects of interacting QTLs have focused on plants [112]. For instance, in rice, conventional interval mapping using a doubled haploid mapping population had mapped a QTL involved in partial resistance to rice yellow mottle virus (RYMV) on chromosome 12 [113]. A subsequent search for interactions between this QTL and the rest of the genome detected a complementary epistasis between the resistance QTL on chromosome 12 (QTL12) and a QTL on chromosome 7 (QTL7) involved in plant architecture and tillering [114]. The percentage of variance explained by this interaction (36.5%) was much higher than that explained by QTL12 alone (21%). Comparison of resistance to RYMV between near-isogenic lines bearing or not bearing the favorable alleles at markers linked to the two QTLs confirmed the existence of the epistatic relationship [115]. Similar epistatic interactions between QTLs were also reported in laboratory animals [116, 117] and in farm animals such as poultry [118] and swine [119]. In livestock species, while the variance of almost all complex traits cannot be explained by additive genetic effects alone [120], experimental demonstrations of QTL-QTL interactions are rare.
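The statistical definition of epistasis as a deviation from additivity can be made concrete with a minimal sketch for two loci in a doubled haploid population, where each locus has only two (homozygous) genotypes. The coding of genotypes as -1 and +1, the assumption of four equally frequent genotype classes, and all phenotypic means are hypothetical choices made for illustration; they are not the data of the rice study cited above.

```python
def two_locus_effects(means):
    """Decompose four genotype-class means into mean, additive and
    additive-by-additive (epistatic) effects.

    means[(g1, g2)] is the mean phenotype of the class with genotype g1
    at locus 1 and g2 at locus 2, genotypes coded -1 / +1.
    Assumes the four classes are equally frequent (balanced population).
    """
    m = means
    mu = (m[1, 1] + m[1, -1] + m[-1, 1] + m[-1, -1]) / 4
    a1 = (m[1, 1] + m[1, -1] - m[-1, 1] - m[-1, -1]) / 4
    a2 = (m[1, 1] - m[1, -1] + m[-1, 1] - m[-1, -1]) / 4
    aa = (m[1, 1] - m[1, -1] - m[-1, 1] + m[-1, -1]) / 4
    return mu, a1, a2, aa

# Hypothetical complementary epistasis: the phenotype is high only when
# the favorable allele (+1) is present at both loci.
complementary_means = {(1, 1): 10, (1, -1): 2, (-1, 1): 2, (-1, -1): 2}
mu, a1, a2, aa = two_locus_effects(complementary_means)

# With balanced +/-1 coding the three contrasts are orthogonal, so the
# genetic variance splits into a1^2 + a2^2 + aa^2.
interaction_share = aa ** 2 / (a1 ** 2 + a2 ** 2 + aa ** 2)

# Hypothetical purely additive case: the interaction term vanishes.
additive_means = {(1, 1): 8, (1, -1): 6, (-1, 1): 4, (-1, -1): 2}
print(mu, a1, a2, aa, round(interaction_share, 3))
print(two_locus_effects(additive_means))
```

In the complementary case the fitted model mu + a1·x1 + a2·x2 + aa·x1·x2 reproduces the four class means only when the interaction term is included, which is the sense in which epistasis is a departure from the additive linear model.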
One such demonstration was reported for two QTLs of small phenotypic effect affecting meat tenderness in cattle [121]. It involved phenotypic data from some 7000 individuals belonging to seven different breeds, and associated genotypic data for the putative causative mutations in the two genes underlying the two QTLs. The estimates of the epistatic components were tested with 10,000 permutations. In LD-based QTL mapping, given the large number of associations detected and the small size of the individual QTLs’ main effects, the detection of more complex features such as epistatic and QTL × E interactions will require astronomically large numbers of individuals [101]. However, the difficulty in detecting epistatic and QTL × E interactions in GWAS experiments should not be interpreted as evidence that they do not exist. For instance, a search for epistasis exploiting large databases for human Crohn’s disease (18,000 cases + 34,000 controls) and ulcerative colitis (14,000 cases + 34,000 controls), and dense marker information associated with
IBD information, detected abundant genome-wide significant ($p < 1 \times 10^{-13}$) epistatic signals [122]. These signals were reduced substantially when conditioned on the additive background, but nine pairs still remained significant ($p < 1.1 \times 10^{-8}$). Each epistatic signal explained 0.15% of the phenotypic variance. The same study also reported multiple but relatively weak interactions independent of the additive effects. Similar epistatic interactions in the context of GWAS were reported in animals (e.g., [123]), plants (e.g., [124]), and model organisms (e.g., [125]). However, these findings need to be considered cautiously because, in this context, incomplete LD can generate the illusion of epistasis even when the underlying genetic architecture is purely additive. De los Campos et al. [126] described the necessary conditions for such “phantom epistasis” to emerge. From the biological perspective, the ubiquity of molecular interactions in gene regulation and metabolic systems suggests that the developmental or biochemical pathways leading to the expression of quantitative traits are necessarily networks of loci that interact at the genetic and molecular levels, and that variation in the traits is necessarily connected to allelic variation at some of these loci. Although there is still no general understanding as to when or how epistasis arises in biological systems, it is widely accepted that it determines the outcome of a variety of evolutionary processes. These processes include canalization or mechanisms of stabilizing selection (i.e., the continuous nature of phenotypic variation that ensures stability against mutational and environmental perturbations) and inbreeding depression [127]. Evidence for epistasis was also found in artificial selection experiments. Phenotypic variation often does not decrease as rapidly as expected during long-term selection.
For instance, after more than sixty generations of selection for chicken growth rate and proportion of breast weight, genetic progress continues with no indication of a plateau. The explanatory model proposed is “selection-induced genetic variation,” in which the genetic variation required for the long-continued response results from sequential changes in the genetic background: selection brings new cohorts of genes into play through epistatic interactions, in a programmed manner [128]. Recent network analyses indicate that the effects of gene-gene interactions are commonplace in the variation of human, animal, and plant complex traits [129–131]. Likewise, examples of direct relationships between significant statistical interactions and biological interactions between genes are emerging from studies related to human health [132]. Noting that for many complex diseases association signals tend to spread across most of the genome, including the vicinity of genes without obvious connection to disease, [133] extended the conventional gene-gene interactions toward an “omnigenic” model. They proposed that gene regulatory networks are sufficiently interconnected so that all genes expressed in disease-relevant cells are liable to affect the functions of
core disease-related genes and that most heritability can be explained by effects on genes outside the core pathways. This may be another way of naming pleiotropy, which is an important feature of the genetic architecture of any quantitative trait [101]. Most loci involved in development participate in multiple developmental pathways. It might thus not be possible to achieve a complete understanding of the genetic architecture of many quantitative traits even if the whole genome of the whole population of a species were sequenced. From a practical perspective, the architecture of complex traits described above raises questions about how genetic variations produce phenotypes and how the genetic and phenotypic variations of complex traits evolve under artificial selection. The response of traits to the evolutionary forces of natural selection [134] suggests that, when selection decisions rely on the infinitesimal model (i.e., pedigree and phenotypic selection, or regression of traits on the whole genome sequence), they result in rapid changes in the mean value of the breeding population without major changes in the probability distribution of any particular allele frequency. In contrast, selection decisions based on alleles at a small number of QTLs of rather large effect, as is the case in conventional marker-assisted selection, will favour individuals with the least environmental sensitivity and reduce genetic variance by pushing for the fixation of those alleles.
3.3 Genotype by Environment Interactions (G × E)
Differential expression of a phenotypic trait by genotypes across environments, or G × E interaction, is an old problem of primary importance for quantitative genetics and its applications in breeding, as well as in conservation biology, the theory of evolution and human genetics [135]. The profile of phenotypes produced by a genotype across environments is the “norm of reaction”; the extent to which the environment modifies the phenotype is termed “phenotypic plasticity” [136]. G × E was also conceptualized as a measurement of the relative plasticity of genotypes in terms of the expression of specific phenotypes in the context of variable environments [137]. A large body of approaches was developed to explore G × E at the phenotypic level. These include linear and linear mixed models [5], graphical approaches [138], stability regression [139], genetic covariance functions [140], and multivariate mixed models [141]. It is generally accepted that the effect of a QTL should be considered in the context of the genetic and environmental background in which it is observed. In addition to testing the hypothesis of QTL × E interaction, the simultaneous treatment of data from multiple environments provides a significant increase in the statistical power of QTL detection and in the accuracy of the estimates of QTL position and effect, provided the interaction effect is of sufficiently strong magnitude [142]. Specific linkage mapping methods were
proposed allowing for QTL × E interaction across a large number of discrete environments, or in continuous environments, without the necessity for a corresponding increase in the number of parameters to be estimated [143, 144]. QTL × E interactions were observed in model organisms, in agricultural species (whenever QTLs have been mapped in different environments) and in humans. In mice, QTL × E interaction effects occur for all traits, and often make as large a contribution to the phenotypic variation as the environmental effects, or a larger one [145]. This is also the case for flies, where QTL × E translates mostly into different levels of expression of QTL alleles, with few cases of opposite effects in different environments [146]. In Arabidopsis, important G × E interactions were observed in the contexts of both linkage mapping and LD mapping [147]. Implications of these interactions were analysed in terms of plant fitness trade-offs and evolutionary ecology [148]. Human studies were probably the earliest to identify variants involved in QTL × E interactions for both behavioural phenotypes and disease risk [149, 150]. In farm animals, a well-known example of QTL × E interaction is the case of two QTLs involved in milk yield in dairy cattle [151]. In crop species and natural plant populations, a meta-analysis of 37 studies, assessing the QTL × E effects of over 900 QTLs for a variety of traits, revealed that G × E is common and is often caused by changes in the magnitude of genetic effects in response to the environment, rather than by changes in the direction of the effects [152]. Nearly every study included in the meta-analysis reported environment-specific QTLs (either true conditionally neutral effects or censored), with 57% of QTLs lacking a significant effect in at least one studied environment. Nearly 60% of QTLs exhibited G × E caused by antagonistic pleiotropy or environment-specific effects.
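The categories of QTL × E used above (change in magnitude, change in direction or antagonistic pleiotropy, and environment-specific effects) can be illustrated with a minimal sketch that labels a QTL from its estimated allelic effect in two environments. The decision rule, a z-test at a nominal threshold of 1.96 with estimates assumed independent between environments, is an assumption made for illustration and is not the procedure of the cited meta-analysis; all names and values are hypothetical.

```python
import math

def classify_qtl_by_env(effect_1, se_1, effect_2, se_2, z_crit=1.96):
    """Label the QTL x E pattern of one QTL measured in two environments.

    effect_k, se_k: estimated allelic effect and its standard error in
    environment k. Estimates are assumed independent across environments.
    """
    sig_1 = abs(effect_1 / se_1) > z_crit
    sig_2 = abs(effect_2 / se_2) > z_crit
    if not (sig_1 or sig_2):
        return "no detected effect"
    if sig_1 != sig_2:
        # Either true conditional neutrality or a lack of power
        # ("censored" effect) in one of the environments.
        return "environment-specific effect"
    if effect_1 * effect_2 < 0:
        return "antagonistic pleiotropy (change in direction)"
    # Same sign in both environments: test whether magnitudes differ.
    z_diff = (effect_1 - effect_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    if abs(z_diff) > z_crit:
        return "change in magnitude"
    return "consistent effect (no QTL x E detected)"

# Hypothetical estimates (effect, SE) in two environments
print(classify_qtl_by_env(0.8, 0.2, 0.1, 0.2))
print(classify_qtl_by_env(0.6, 0.15, -0.5, 0.15))
print(classify_qtl_by_env(1.0, 0.1, 0.4, 0.1))
print(classify_qtl_by_env(0.5, 0.1, 0.45, 0.1))
```

The comment on censoring reflects the caveat in the text: a QTL lacking a significant effect in one environment is not necessarily neutral there.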
G × E interactions were also investigated from the perspective of phenotypic plasticity and local adaptation. Phenotypic plasticity, the ability of an individual to modulate the phenotypic expression of its genotype in response to the environment, is a major adaptive mechanism [153]. Analysing data for a large set of traits in barley, Lacaze et al. [154] mapped QTLs specific to the plasticity of each of the traits considered. All plasticity QTLs colocated with loci showing QTL × E interaction for the trait, and there was no QTL that affected plasticity only. Similar colocalization of phenotypic plasticity QTLs with loci expressing QTL × E interaction was reported for several fitness-related traits in the nematode Caenorhabditis elegans [155]. Likewise, evidence of polymorphisms that control rice phenotypic plasticity in response to increased environmental resources, such as light and CO2, was established through GWAS [156]. In another respect, some QTLs involved in phenotypic plasticity were also reported to be related to domestication. For instance, in maize, plants homozygous for the teosinte (the wild ancestor of maize) allele at a QTL involved in plant
architecture showed greater phenotypic plasticity across environments than plants homozygous for the maize allele [157]. Additionally, allelic interactions within a locus have been suggested to affect phenotypic plasticity. The presence of more than one allele at a locus is directly associated with phenotypic plasticity in outbred species, or in the comparison of pure lines and hybrids involving self-pollinated species [158]. Regarding local adaptation, theory suggests that it results from genetic trade-offs at individual loci (i.e., antagonistic pleiotropy) and that adaptation to one set of environmental conditions results in a cost to fitness in alternative environments [159]. However, QTL mapping across 10 sites, covering 17° of latitude, in a mapping population of Panicum virgatum, revealed that beneficial biomass (fitness) QTLs generally incur minimal costs when confronted with different environments [160]. These weak trade-offs for local adaptation suggest that locally advantageous alleles could potentially be combined across multiple loci through breeding to create high-yielding, regionally adapted cultivars. Given the limited power of QTL studies to detect small-effect QTL × E interactions, and the extremely time-consuming process of cloning such QTLs, too few G × E genes have been cloned so far to draw conclusions on the type of causative molecular variations. The most extensively studied case of G × E interaction is that of genes involved in regulating the transition to flowering in plants in response to environmental factors such as temperature and light. The flowering loci FT and TFL1 encode a pair of flowering regulators with homology to phosphatidylethanolamine-binding proteins, which function in diverse signalling pathways involved in growth and differentiation in bacteria, animals, and plants [161]. A gene category often reported for cloned genes driving G × E in plants in response to the abiotic environment (photoperiod, cold tolerance, vernalization, root growth, etc.) is that of transcription factors [152]. Additional opportunities for identifying G × E patterns and molecular mechanisms across a diversity of phenotypes, species and environments were provided by transcriptomic analyses. Gene families involved in G × E are very diverse and include the ones coding for proteins responsible for sensing or transducing environmental signals [162]. Thus, G × E is a very common process, often caused by changes in the magnitude of genetic effects in response to the environment. It is associated with diverse genetic factors and molecular variants. The diversity of phenotypic traits subject to G × E interaction, and the adaptive role it plays, call for its meticulous modeling when analyzing and predicting complex traits, especially as the pattern of QTL plasticity for fitness components, or for yield in crops, suggests that recombination could effectively generate individuals with high performance across environments by bringing together conditionally neutral alleles that have weak trade-offs.
4 Prediction of Complex Traits Based on Mapped QTLs
In the previous two sections, we reviewed how molecular genetics is used to dissect the genetic architecture of complex traits and presented an overview of the main characteristics of this architecture. In this section, we present an overview of how this information has been used to enhance the long-existing practices of prediction of human disease risk and genetic improvement of agriculturally important species. The emphasis is on cases exploiting the statistical association between alleles at the molecular marker loci and at the causative quantitative trait nucleotides.
4.1 Polygenetic Disease Risk Score for Human Complex Diseases
Early identification and treatment of individuals at risk for major human diseases has been an important contributor to the reduction in mortality associated with these diseases. For a long time, risk prediction for complex diseases was based on pedigree analysis, other predisposing factors and/or biochemical markers. A logistic regression was used to estimate the sensitivity, Se = True positives / (True positives + False negatives), and the specificity, Sp = True negatives / (False positives + True negatives), by defining the dependent variable as the dichotomous result of the diagnostic test, where Y = 1 if the disease is presumed to be present and Y = 0 otherwise. Two additional performance criteria were also computed, the positive predictive value (PPV) and the negative predictive value (NPV), where π denotes the prevalence of the disease:
$$\mathrm{PPV} = \frac{Se \times \pi}{(Se \times \pi) + (1 - Sp)(1 - \pi)}, \qquad \mathrm{NPV} = \frac{Sp \times (1 - \pi)}{Sp \times (1 - \pi) + (1 - Se) \times \pi}$$
When several explanatory variables were involved, the presence or absence of the disease was included in the logistic regression as a binary explanatory variable ($X_i$), along with the variables used to define the subgroups of interest. The log odds of presumptive disease was modelled as a linear function of the explanatory variables (1, ..., n), one of which corresponded to the binary results of the “gold standard” (i.e., reliable evidence of presence or absence of disease), along with their β coefficients [163]:
$$\operatorname{logit} \Pr(Y = 1 \mid X_1, \ldots, X_n) = \alpha + \sum_{i=1}^{n} \beta_i X_i$$
where α represents the intercept. Sensitivity and specificity were computed as
$$Se = \frac{1}{1 + \exp\left[-\left(\alpha + \sum_{i=1}^{n} \beta_i X_i\right)\right]} \quad \text{and} \quad Sp = 1 - \left\{ \frac{1}{1 + \exp\left[-\left(\alpha + \sum_{i=1}^{n} \beta_i X_i\right)\right]} \right\}$$
respectively, with the gold-standard variable set to 1 (disease present) for Se and to 0 (disease absent) for Sp.
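As a worked illustration of the four criteria defined above, the following minimal sketch computes them from the counts of a diagnostic table and a prevalence value. The function name and the counts are hypothetical.

```python
def diagnostic_metrics(tp, fn, fp, tn, prevalence):
    """Sensitivity, specificity, PPV and NPV of a diagnostic test.

    tp, fn, fp, tn: true positives, false negatives, false positives and
    true negatives; prevalence: disease prevalence in the target population.
    """
    se = tp / (tp + fn)
    sp = tn / (fp + tn)
    ppv = (se * prevalence) / (se * prevalence + (1 - sp) * (1 - prevalence))
    npv = (sp * (1 - prevalence)) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return se, sp, ppv, npv

# Hypothetical table: 100 diseased and 900 non-diseased individuals
se, sp, ppv, npv = diagnostic_metrics(tp=90, fn=10, fp=50, tn=850, prevalence=0.10)
print(f"Se={se:.3f} Sp={sp:.3f} PPV={ppv:.3f} NPV={npv:.3f}")

# The same test applied where the disease is ten times rarer
_, _, ppv_rare, npv_rare = diagnostic_metrics(90, 10, 50, 850, prevalence=0.01)
print(f"PPV={ppv_rare:.3f} NPV={npv_rare:.3f}")
```

Sensitivity and specificity are properties of the test, whereas the predictive values change with the prevalence: with unchanged Se and Sp, the PPV drops sharply when the disease is rarer.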
The advent of molecular markers and the associated understanding of the genetic etiology of common multifactorial diseases have opened new perspectives for predicting a healthy person’s probability of developing a disease, and for personalized medicine, that is, preventive and therapeutic interventions for complex diseases that are tailored to individuals based on their genetic profiles [164]. Statistical methods for defining risk groups at the population level, and for predicting risk at the individual level, have evolved with the increase in the throughput of genotyping technology and in the number of loci potentially useful for the prediction of disease risk. Early works on methods to predict an individual patient’s risk of developing a multifactorial disease, using allelic data at multiple disease-susceptibility loci, were based on the use of case-control data in a logistic regression to estimate a likelihood ratio [165]. Assuming n biallelic loci are involved in a multifactorial disease, each individual is described by the vector $G = (g_1, \ldots, g_n)$, where $g_i = 1$ for a positive test and $g_i = 0$ for a negative test at locus i. At the population level, there would be $2^n$ theoretical combinations of test results. If D represents the diseased population and H the nondiseased population, one can define the likelihood ratio for any observed value of G as $LR(G) = P(G \mid D) / P(G \mid H)$, where $P(G \mid D)$ represents the probability of G given the presence of disease and $P(G \mid H)$ the probability of G given the absence of disease. The likelihood ratio will be higher for combinations of test results that more clearly distinguish people with the disease from those without the disease. This approach permits the simultaneous use of information from many different genetic tests as well as from environmental risk factors such as age, personal medical history, and family history. When all such information is taken into account, the estimated likelihood ratio can easily be converted into the posterior probability of developing the disease. However, the estimate of the posterior probability is highly dependent on the prevalence of the disease in the population to which the tested individual belongs. When the gene-discovery phase relies on hyperselected cases and control individuals, it may lead to an overestimation of the effect sizes and thus to an overestimation of the predictive value. It is thus necessary for predictive testing in preventive medicine to be investigated in cohort studies consisting of individuals who do not have the disease of interest, and for predictive testing for diagnosis and therapy response to be evaluated prospectively in clinically relevant patient series. Other limitations of the approach include the uniformity of the effect size for all test loci and the fact that the ability to predict a disease will decrease with the number of “at-risk” alleles carried by a given individual, especially in tests based on a small number of risk loci [166].
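The conversion of a likelihood ratio into a posterior probability can be sketched as follows. The sketch assumes that test results at the different loci are conditionally independent given disease status, so that LR(G) is the product of per-locus likelihood ratios; this simplifying assumption, the function names and all probabilities are hypothetical and are not taken from the cited work.

```python
def combined_likelihood_ratio(genotype, p_pos_diseased, p_pos_healthy):
    """LR(G) = P(G | D) / P(G | H) for a vector of binary test results.

    genotype: list of 0/1 test results (g_1, ..., g_n).
    p_pos_diseased[i], p_pos_healthy[i]: probability of a positive test at
    locus i among diseased (D) and nondiseased (H) individuals.
    Assumes conditional independence between loci.
    """
    lr = 1.0
    for g, p_d, p_h in zip(genotype, p_pos_diseased, p_pos_healthy):
        lr *= (p_d / p_h) if g == 1 else ((1 - p_d) / (1 - p_h))
    return lr

def posterior_probability(prevalence, likelihood_ratio):
    """Posterior probability of disease from prior odds times LR."""
    prior_odds = prevalence / (1 - prevalence)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical individual: positive at loci 1 and 2, negative at locus 3
lr = combined_likelihood_ratio([1, 1, 0], [0.3, 0.5, 0.2], [0.2, 0.4, 0.1])
for prevalence in (0.01, 0.05):
    print(prevalence, round(posterior_probability(prevalence, lr), 4))
```

The same likelihood ratio yields very different posterior probabilities at the two prevalence values, which illustrates the dependence on prevalence noted above.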
The first investigations on the combined predictive ability of loci significantly associated with common diseases in GWAS using cohort-type populations (i.e., not enriched with patients with a positive family history or early disease onset) involved a small number of SNPs (fewer than 50), validated in several independent GWASs. A risk allele score, obtained by counting the number of risk alleles, was used as a simple proxy for the combined effect of multiple polymorphisms. The combined predictive values of the SNPs were examined in a test population of about 5000 individuals, alone or in addition to clinical characteristics, using logistic and Cox regression analyses. The discriminative accuracy of the prediction models was assessed by the area under the receiver-operating characteristic (ROC) curve (AUC). ROC space is the plot of sensitivity versus (1 − specificity), each prediction result representing one point in the ROC space and the point with coordinates (0, 1) being the perfect classification. The AUC is equal to the probability that the test ranks a randomly chosen positive individual higher than a randomly chosen negative individual. It can range from 0.5 (total lack of discrimination) to 1.0 (perfect discrimination) [167]. In the context of genetic profiling, the maximum value of the AUC is determined by the heritability and the prevalence of the disease. The AUCs of these early attempts to use GWAS results were not significantly higher than the AUCs obtained with conventional predictors such as pedigree, age, and gender [168]. The next step in the improvement of methods for the establishment of a polygenetic disease risk score (PRS) was the use of GWAS results with less stringent significance thresholds (thus increasing the number of SNPs considered for risk evaluation), and the weighting of the contribution of each SNP based on its effect size [169].
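The unweighted risk allele score and the probabilistic interpretation of the AUC given above can be sketched in a few lines. The genotypes, coded as the number of risk alleles (0, 1 or 2) at each SNP, are hypothetical, and ties between a case and a control are counted as one half, a common convention assumed here.

```python
def risk_allele_score(genotype):
    """Unweighted score: total number of risk alleles across SNPs."""
    return sum(genotype)

def auc(scores_cases, scores_controls):
    """Probability that a random case scores higher than a random control.

    Computed by comparing every case with every control; ties count 0.5.
    """
    wins = 0.0
    for s_case in scores_cases:
        for s_control in scores_controls:
            if s_case > s_control:
                wins += 1.0
            elif s_case == s_control:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))

# Hypothetical risk allele counts at three SNPs
cases = [[2, 1, 2], [1, 2, 1], [2, 2, 0], [0, 1, 1]]
controls = [[1, 1, 1], [0, 1, 1], [1, 0, 0], [2, 1, 1]]

case_scores = [risk_allele_score(g) for g in cases]
control_scores = [risk_allele_score(g) for g in controls]
print(case_scores, control_scores, auc(case_scores, control_scores))
```

The pairwise comparison is quadratic in the number of individuals and is meant only to expose the definition; rank-based formulas give the same value more efficiently on large cohorts.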
The optimal p-value threshold is searched for through parameter optimization or variable selection methods that also adjust the effect size of the selected SNPs. Another issue in using GWAS summary statistics is the redundancy of information from SNPs that are in strong LD. This is particularly the case for single-regression based GWASs. This issue is dealt with by SNP clumping based on LD, so that the SNPs retained are independent. However, the choice and the application of a unique LD threshold may be tricky. Once a PRS is established at the population level, its performance in predicting individual risk needs to be evaluated with the AUC described above. It may also need to be standardized, that is, shifted from the unit of the GWAS effect size to a standard normal distribution [169]. The efficiency of the first PRS in providing sufficient risk stratification for clinical utility (i.e., the percentage of a population at a given rate (say, twofold) of increased risk, relative to the rest of the population) was hampered by the small size of the initial GWAS datasets, by the computational methods, and by the lack of datasets to validate the selected risk loci and alleles [166, 170]. Overcoming these
limitations, the predictive ability of the latest PRS developed for the main common diseases is approaching that of the prediction of monogenic diseases [169]. For instance, a PRS developed for the evaluation of coronary artery disease identified 8% of the population at greater than three-fold increased risk for this disease, 20-fold more people than found by conventional risk evaluation methods [171]. The development of this PRS was based on summary statistics from a GWAS involving 185,000 cases and control individuals, genotyped with 6.7 M common SNPs (MAF > 0.05) and 2.7 M low-frequency SNPs (0.005 < MAF < 0.05), analysed under an additive genetic model. First, several sets of risk SNPs and alleles, to be used for computing candidate PRS, were selected by submitting the GWAS summary data to thresholding and clumping, or to a shrinkage technique, under several parameter-setting options (see below). Second, the candidate PRSs were computed in a validation population of some 120,000 individuals, by multiplying the genotype dosage of each risk allele for each SNP by its respective weight, and then summing across all selected SNPs. Third, the discriminative ability of the candidate PRSs was compared through their AUC. Finally, the performance of the PRS with the highest AUC was assessed in an independent testing cohort comprising more than 285,000 individuals. The clumping algorithm clumps around SNPs with association p-values below a given threshold [171]. Each clump contained all SNPs within 250 kb of the index SNP that were also in LD with the index SNP, as determined by a reference $r^2$ threshold. The p-value thresholds tested were 1, 0.5, 0.05, $5 \times 10^{-4}$, $5 \times 10^{-6}$, and $5 \times 10^{-8}$. The $r^2$ thresholds tested were, for example, 0.2, 0.4, 0.6, and 0.8. 
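The thresholding, clumping, and scoring steps described above can be sketched in a few lines of Python. This is a simplified greedy version written for illustration, not the algorithm used in [171]: it assumes a single chromosome, a precomputed pairwise $r^2$ matrix from a reference panel, and hypothetical function and argument names.

```python
import numpy as np


def clump_and_threshold(pvalues, positions, r2, p_threshold, r2_threshold,
                        window_bp=250_000):
    """Greedy clumping sketch: return indices of the retained index SNPs.

    SNPs passing the p-value threshold are visited from most to least
    significant. Each visited SNP becomes an index SNP and removes from
    further consideration every SNP within `window_bp` that is in LD with
    it (r2 >= r2_threshold).
    """
    pvalues = np.asarray(pvalues, dtype=float)
    positions = np.asarray(positions, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    available = pvalues <= p_threshold
    index_snps = []
    for i in np.argsort(pvalues, kind="stable"):
        if not available[i]:
            continue
        index_snps.append(int(i))
        in_clump = (np.abs(positions - positions[i]) <= window_bp) & (r2[i] >= r2_threshold)
        available[in_clump] = False
        available[i] = False
    return index_snps


def polygenic_score(dosages, weights, selected):
    """Sum over the selected SNPs of risk-allele dosage times SNP weight."""
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return dosages[:, selected] @ weights[selected]
```

In a workflow like the one described, each combination of p-value and $r^2$ thresholds would yield one candidate set of SNPs, hence one candidate PRS whose AUC is then compared in the validation population.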
The Bayesian approach implemented calculates a posterior mean effect size for each SNP based on a prior and subsequent shrinkage based on the LD between the SNP and other similarly associated SNPs [172]. The underlying Gaussian distribution additionally considers the fraction of causal loci (i.e., loci with nonzero effect sizes) via a tuning parameter, ρ. The values of ρ tested were 1, 0.3, 0.1, 0.03, 0.01, 0.003, and 0.001. It is noteworthy that the process of development of this PRS is quite similar to that of the whole-genome regression based prediction (i.e., genomic prediction) methods. The main difference lies in the fact that the variable selection or shrinkage procedures used for the calibration of the prediction model were fed with actual (“prior”) information on the contribution of individual loci to the trait, obtained from a GWAS implemented with the same dataset. In a conventional genomic prediction model, markers’ effects are sampled from a hypothetical distribution. A second difference lies in the fact that, due to multiple testing, the marker effects obtained through GWAS are overestimated. This does not occur in genomic prediction because all markers’ effects are estimated simultaneously.
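For readers who prefer a formula, the prior with a causal fraction ρ described above is commonly written as a point-normal mixture. The notation below is an assumption introduced for illustration and does not appear in the chapter: $M$ is the number of SNPs, $N$ the GWAS sample size, $h^2$ the heritability captured by the SNPs, and genotypes and phenotype are taken to be standardized.

```latex
% Point-normal prior on the effect of SNP i
\beta_i \sim
\begin{cases}
  \mathcal{N}\!\left(0,\ \dfrac{h^2}{M\rho}\right) & \text{with probability } \rho,\\[6pt]
  0 & \text{with probability } 1-\rho .
\end{cases}

% Special case \rho = 1 (all SNPs causal) with unlinked SNPs:
% the posterior mean is a uniform shrinkage of the GWAS estimate
E\!\left(\beta_i \mid \hat{\beta}_i\right)
  = \frac{h^2}{h^2 + M/N}\,\hat{\beta}_i
```

Under these assumptions, the shrinkage factor approaches 1 as the sample size $N$ grows relative to $M$. When ρ < 1 or SNPs are in LD, no such closed form is available in general and the posterior mean has to be approximated numerically.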
Interestingly, some PRS building methods come even closer to genomic prediction procedures. Though SNPs are first selected based on GWAS summary data, their effect size is estimated using different types of prior. For instance, Hu et al. [173] developed a polygenic risk prediction procedure with GWAS summary statistics modelled in a Bayesian framework where the empirical prior of the SNPs’ effect size is established based on several genomic and epigenetic functional annotations of the human genome available in public databases. They reported improved accuracy of prediction in both simulation studies and real data for several diseases, including breast cancer and type-II diabetes. The Bayesian framework was also used to model jointly GWAS summary statistics from multiple genetically correlated diseases [174]. In this case, when two diseases were considered, the effect sizes of a SNP were modelled as a mixture distribution: two independent normal distributions when the SNP is causal in both diseases, a joint normal and point mass when the SNP is causal in only one disease, and a joint point mass when the SNP is not causal in either disease. The posterior expectation of the effect sizes was sampled from the posterior distribution of the effect sizes. The latest methods of establishing PRS for human diseases bypass the GWAS step and directly apply a shrinkage/regularization technique or a machine-learning approach to all SNPs. Espousing the paradigm of genomic prediction of complex traits, they are reviewed in Chapter 15 of the present book.
4.2 Marker Assisted Prediction of Breeding Value for Animal and Plant Breeding
Prediction of the breeding value (BV) for complex traits, that is, the part of an individual’s genotypic value that is due to additive, transmissible effects, has been a constitutive part of deliberate animal and plant breeding. It was first formalized in the framework of the selection index [175], which combines, in a linear regression, the phenotypic information available on individuals and their relatives to provide the best linear prediction of individual breeding values, I. Suppose $y_1$, $y_2$, and $y_3$ are the phenotypic values for individual i and its parents. The index for this individual would be:
$$I_i = b_1 (y_1 - \mu_1) + b_2 (y_2 - \mu_2) + b_3 (y_3 - \mu_3)$$
where $\mu_1$, $\mu_2$, and $\mu_3$ are average phenotypes for the tested individuals and their parents, respectively, and $b_1$, $b_2$, and $b_3$ are the factors by which each measurement is weighted. Individuals’ records need to be preadjusted for fixed factors, such as environmental factors, which may not be known. The b values are computed by the procedure employed in obtaining the regression coefficients in multiple linear regression. Henderson [176] developed the BLUP method to estimate the individual breeding values (u), as a random variable of a linear mixed model. The basic equation is:
$$Y = X\beta + Zu + e$$
where Y is an n-dimensional vector of observations, X and Z are known $n \times p$ and $n \times q$ matrices, respectively, β is the $p \times 1$ vector of fixed effects, u is the $q \times 1$ vector of random effects, and e is a vector of random residual effects. The expectations of the variables are: $E(Y) = X\beta$; $E(u) = E(e) = 0$. It is assumed that residual effects, which include random environmental and nonadditive genetic effects, are independently distributed with variance $\sigma_e^2$, so that $\mathrm{Var}(e) = I\sigma_e^2 = R$. The variance–covariance matrix of the random variable u is obtained as $G = A\sigma_u^2$, where $\sigma_u^2$ is the additive genetic variance component, and A is the numerator relationship matrix, computed from the coefficient of coancestry (pedigree, kinship) among individuals. The best linear unbiased estimation (BLUE) of the fixed effects (β) and the BLUP of the random effects (u) can be computed by solving the mixed model equations given by Henderson [177, 178]:
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix} \begin{pmatrix} \hat{\beta} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}Y \\ Z'R^{-1}Y \end{pmatrix}$$
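As an illustration, the mixed model equations can be assembled and solved directly for a small dataset. The sketch below assumes that the variance components $\sigma_u^2$ and $\sigma_e^2$ are known and that A is invertible; the function name is hypothetical, and real applications would estimate the variance components (e.g., by REML) and use sparse solvers.

```python
import numpy as np


def solve_mme(X, Z, y, A, sigma2_u, sigma2_e):
    """Solve Henderson's mixed model equations for y = X beta + Z u + e.

    Assumes R = I * sigma2_e and G = A * sigma2_u with known variance
    components. Returns (BLUE of beta, BLUP of u).
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.asarray(A, dtype=float)
    n = y.shape[0]
    R_inv = np.eye(n) / sigma2_e
    G_inv = np.linalg.inv(A) / sigma2_u
    # Left-hand side: the 2 x 2 block coefficient matrix.
    lhs = np.block([
        [X.T @ R_inv @ X, X.T @ R_inv @ Z],
        [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + G_inv],
    ])
    # Right-hand side: X'R^-1 y stacked on Z'R^-1 y.
    rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])
    solution = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return solution[:p], solution[p:]


# Toy example (hypothetical data): two unrelated parents and their offspring,
# one record each, with an overall mean as the only fixed effect.
A_toy = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
beta_hat, u_hat = solve_mme(np.ones((3, 1)), np.eye(3), [10.0, 12.0, 14.0],
                            A_toy, sigma2_u=2.0, sigma2_e=4.0)
```

The solutions coincide with the generalized least squares estimate of β and with $GZ'V^{-1}(Y - X\hat{\beta})$, where $V = ZGZ' + R$, but the mixed model equations avoid inverting the $n \times n$ matrix V.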
By enriching the prediction of individuals’ BV with pedigree information, the BLUP method rapidly found widespread application in animal breeding. Increases in computing power extended its application from the simple sire model to more complex models, such as the animal, maternal, and multitrait models and models including nonadditive effects [179]. In plants, the unavailability of deep pedigree information and the availability of large amounts of phenotypic data per genotype (thus, BLUE and BLUP often do not provide grossly different results [180]) have long limited the application of the BLUP method to some forestry and perennial crops. In most annual crops, BV for quantitative traits is estimated through multienvironment phenotypic evaluations. Furthermore, in autogamous crops, pedigree-based BLUP prediction for lines developed from a single biparental population is ineffective, as they possess the same pedigree. Thus, its use is limited to the prediction of crosses’ performances, using an A matrix developed from historical records [180]. In allogamous annual crops, the BLUP method has served mainly to predict hybrid performances, for instance, to predict general and specific combining ability effects in crosses of inbred lines from two heterotic pools in maize [181]. Fernando and Grossman [182] extended the BLUP procedure to the prediction of individual BV from known relationships, phenotypic data, and molecular marker information, that is, to solve a mixed linear model that includes several genetic effects: additive haplotype effects of maternal and paternal origin at markers closely linked with known QTLs, plus an infinitesimal effect u of the remaining QTLs:
$$y_i = X_i\beta + v_i^m + v_i^p + u_i + e_i$$
with $v_i^m$ and $v_i^p$ the additive effects of the maternal and paternal haplotypes at the known QTLs, and $u_i$ the polygenic effect, that is, the effect of the remaining QTLs, unlinked to haplotypes at the known QTLs. It is assumed that every individual has two unique QTL alleles and their effects are estimated from an infinite allele model, that is, the allelic effects are considered as random effects. The information on the resemblance of a parent and an offspring is accounted for by the degree of correlation between the effects of the parent and offspring alleles. It is assumed that the probability of identity-by-descent (IBD) between any two alleles, calculated from pedigree and marker data, is equal to the correlation between their effects. Thus $\mathrm{Var}(v) = G_v \sigma_v^2$, where $G_v$ is the IBD matrix between all the alleles at the markers linked to the known QTLs and $\sigma_v^2$ is the variance due to these markers. Fernando and Grossman [182] proposed a recursive algorithm to obtain $G_v$ and a simple method to obtain its inverse. Regarding the polygenic effects u, as described above, $\mathrm{Var}(u) = A\sigma_u^2$, where A is the numerator relationship matrix between individuals and $\sigma_u^2$ is the polygenic variance. A is obtained using the algorithm given by Henderson [178]. Given the covariance matrices $G_v\sigma_v^2$ and $A\sigma_u^2$, BLUP values of both $v_i$ and $u_i$, which are random effects, can be obtained using the mixed model equations [177]. Using the marker BLUP approach, many animal breeding programs have included marker information for QTLs of large effects, and often rather large confidence intervals, in their models of BV estimation (e.g., [183]), in the framework of their recurrent selection strategy. Response to selection attributable to MAS is limited for traits that can be measured in all selection candidates before selection (e.g., growth rate, food intake). 
Response to MAS is the highest (up to 65% in the first cycle of selection) when phenotypic information is limited because of low heritability or inability to record the phenotype on all selection candidates before selection (e.g., fertility traits, milk production, carcass traits in animals, end-use related traits in immature perennial plants). The paradox is that the ability to detect QTLs, which also requires phenotypic data, is also limited in such cases. Thus, unless different resources or strategies are used for QTL detection, the greatest opportunities for MAS might exist for traits with moderate rather than low heritability. Goddard [184] extended the method of Fernando and Grossman [182] to handle information on QTLs bracketed by two markers. Several simulation studies have analysed factors that influence the additional genetic progress achieved in animal breeding using molecular data, and have drawn optimization rules [185–189]. The gain in genetic progress attributable to MAS is also reasoned in terms of decreases in the length of the generation interval. For instance, in
dairy cattle selected for milk production, MAS reduces the length of the generation interval when it replaces conventional progeny testing systems [187]. Response to MAS also depends on the scale and structure of the breeding programs. Only large-scale breeding programs endowed with in-house QTL detection capability involving large families, and continuous reestimation of marker–QTL phase, would be able to take full advantage of MAS using linkage or LD-based QTL information. Smaller breeding programs would have to make do with markers that are direct tests for the QTL alleles, at the whole species or at the breed level, and would not need reestimation along the selection cycles [184]. In plant breeding, the availability of QTL information first served to speed up the conventional backcross breeding activities. Incorporation of marker-assisted estimation of BV in major plant breeding strategies, such as recurrent selection and evaluation of hybrid performances, took more time. Simulation studies [190, 191] provided a general framework for optimizing foreground and background selection in backcross breeding programs aimed at the introgression of one or several QTLs. However, straightforward QTL introgression has at least two limitations: (i) The expression of the expected effect size of the QTL in the new genetic background represented by the recipient line is not guaranteed, unless the QTL was mapped in a population involving the recipient line as a parent. (ii) It is the most conservative of breeding methods, as it improves current cultivars or animals only at a few loci at a time and does not produce the new combinations of alleles that are needed to generate improvements in multiple quantitative traits. The inability to accurately infer QTL effects from one breeding cross to another has led to the development of forward breeding methods where QTLs are mapped in the breeding population. 
The breeding population, often composed of the progeny of one biparental cross between elite germplasms, is phenotyped and genotyped in an early generation (F2 or F3) to map QTLs for the target traits. Then, this marker–QTL information serves to compute BV in the marker-assisted recurrent selection strategies. At least two such strategies were implemented: ideotype building and increasing the frequency of favourable alleles. The former strategy, known as marker-assisted recurrent selection (MARS), aims at assembling the mosaic of favourable chromosomal segments from the two parental lines [192–194]. The approach was widely implemented in major commercial breeding programs to develop commercial maize and soybean lines [195]. Response to MAS depends upon the feasibility of early selection (preflowering), necessary to implement the optimal mating scheme. The practicability of the approach is highly dependent on the number of QTLs involved. In the latter strategy, the frequency of favourable alleles is increased by several cycles of random mating of the selected progenies. The thus enriched base
population then serves as the source for the development of commercial lines. In both random-mating and directed-mating based recurrent selection strategies, selection is based on molecular score alone, on molecular score followed by selection on phenotype, or on an index combining molecular score and phenotypic BLUP. Selection on molecular score alone will result in less genetic improvement than combined selection on molecular score and phenotype, unless the molecular score captures all genetic variation [196–198]. The gain in genetic progress attributable to MAS is the highest when phenotypic selection requires progeny testing, as it decreases the length of the generation interval. This is the case in perennial plants and for some annual crops targeting the prediction of combining ability for hybrid production. Molecular markers have also been used to predict performances of biparental crosses in terms of the progeny’s line value for autogamous species [194], and in terms of combining ability as hybrid parent for allogamous species (e.g., [199]). The efficiency of such predictions, when based on the allelic frequency of anonymous markers, depends on the existence of pedigree relatedness, or membership of a common ancestral population, among the parental lines. When the prediction is based on marker data associated with QTLs, the correlation between the true and predicted performance of untested single crosses decreases with an increase in the number of QTLs involved [200]. Early attempts to incorporate MAS into animal and plant breeding programs achieved limited genetic gain compared to phenotypic selection, as these programs were designed to predict breeding value accurately from phenotypic data. It was recognized that taking full advantage of molecular information requires a redesign of the breeding programs so as to integrate the collection and analysis of phenotypic data for QTL detection with the use of this information for MAS [201, 202]. 
In the case of plant breeding programs, this may involve creating large reference populations composed of the pool of donors and key ancestral lines involved in the program, mapping QTLs in this population through association analysis, and combining this association information with the linkage mapping-based QTL information from individual crosses, via a haplotype-sharing approach [203]. However, these improvements will not correct the fundamental drawback of MAS based on alleles at a small number of QTLs of rather large effect: it rapidly exhausts the genetic variance by pushing for fixation of those alleles. This drawback might be mitigated by increasing the number of target QTLs. Pushed to its limit, such an increase in the number of target QTLs would lead to MAS based on the regression of traits on the whole genome sequence, that is, genomic prediction. Applications of this approach to plant and animal breeding are presented in Chapters 16–21 of the present book.
5 General Conclusion
Complex or quantitative traits are important in medicine and agriculture. They are controlled by several genes, along with environmental factors, each having a quantitative effect on the phenotype. Fisher’s [3] infinitesimal model provided a solid framework for the study of phenotypic variation of such traits. It also provided a framework for predicting BV in animal and plant genetic improvement programs that have contributed enormously to the increase of livestock and crop productivity during the last century. Breeding values of selection candidates were estimated by combining pedigree and phenotypic data in mixed model equations and computed as BLUE and/or BLUP. Likewise, risk for complex diseases was predicted by combining pedigree information, other predisposing factors, and/or biochemical markers in a logistic regression. Until the 1970s, these predictions were conducted without any knowledge of the genetic architecture of the traits of interest. Since then, the development of molecular genetics has allowed a better understanding of the genetic nature of quantitative traits by identifying genes or chromosomal regions (i.e., QTLs) that affect the traits, and has provided practitioners with a new type of data (molecular markers) for predicting BV and disease risk. Molecular genetic analyses of quantitative traits have led to the identification of a few causal mutations and, mostly, of a rather large number of markers associated with the polymorphisms that do affect the trait (quantitative trait nucleotides). Early QTL mapping works, conducted in experimental family-based populations, relied on identity by descent between markers and the quantitative trait nucleotide and, by construction, detected a limited number of QTLs and overestimated their effects. 
Genome-wide association mapping methods, conducted in large populations of nonrelated individuals and relying on linkage disequilibrium between markers and causative loci, have provided a more refined view of the architecture of complex traits, for which several hundred thousand loci may be involved, but they have not solved the issue of accurate estimation of the effect of each of these loci. QTL mapping works have also confirmed the nonnegligible contribution of epistatic and QTL × E interactions to the genetic variance of complex traits. Practitioners of animal and plant breeding and of the prediction of human disease risk have updated their prediction methodology to take advantage of this better understanding of the genetics of quantitative traits. This is increasing the accuracy of predictions and accelerating the prediction process. Roughly, they included the effects of the most statistically significant QTLs in their prediction models, alongside the polygenic effects that were not explained by the QTLs. However, the actual gains in accuracy of BV and of prediction of disease risk were often limited for reasons related to QTL
mapping methodology and to the design of breeding programs and risk prediction experiments. The former included the often-limited fraction of the genetic variance explained by the marked QTLs, the almost systematic overestimation of the effects of these QTLs, the sometimes limited transferability of the QTL information from one population to another, and the erosion of the marker–QTL association over generations. The latter included the fact that conventional breeding programs and risk prediction protocols are designed to have high accuracy when using phenotypic data, and need to be redesigned to take advantage of marker data. For instance, progeny testing provides high accuracy of prediction of BV, though it lengthens the breeding cycle. In this context, the contribution of MAS to increasing the rate of genetic gain will depend on the feasibility of selection without any phenotyping to shorten the generation interval. However, this in turn can reduce the accuracy of prediction if markers explain only a limited share of the genetic variance. The central issue of the limited fraction of genetic variance explained by QTLs of “statistically significant effects” gave birth to the idea (later termed “genomic prediction”) of omitting the significance test, which limits the number of QTLs and leads to overestimation of their effects, and simply estimating the effects of all possible QTLs simultaneously. Now, the number of genes in the human genome is estimated at about 30,000 and the number of genes in the rice genome at between 40,000 and 60,000. If one wants to estimate the effects of a minimum of two alleles for each of these possible contributors to a quantitative trait of interest, within the usual prediction experiment comprising a few hundred or a few thousand individuals, one will face the large p and small n problem that cannot be solved in the framework of classical statistics. 
The decreasing cost of genotyping and the development of high throughput phenotyping technology, especially in plants, can help increase the size of the populations that serve to build the prediction models. However, as we do not know the genes involved, we still need a larger number of markers to finely tag the genome and to estimate the effects of all the tagged segments simultaneously. The methodological developments to solve this problem, to optimize factors that influence the accuracy of genomic prediction, and to integrate novel contributions of molecular genetics to our understanding of the genetic architecture of complex traits, are presented in the following chapters of this book.
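The large p and small n problem mentioned above can be made concrete with a short Python sketch. With more markers than individuals, the ordinary least squares normal equations are rank deficient, whereas adding a penalty makes the simultaneous estimation of all marker effects well defined. Ridge regression is used here only as one illustrative shrinkage method; the simulated genotypes, the penalty value, and the function name are assumptions made for the example.

```python
import numpy as np


def ridge_marker_effects(M, y, lam):
    """Ridge estimates of all marker effects, computed in the n x n dual form.

    Equivalent to solving (M'M + lam * I_p) b = M'y, but only an n x n
    system is solved, which is convenient when p >> n.
    """
    M = np.asarray(M, dtype=float)
    y = np.asarray(y, dtype=float)
    n = M.shape[0]
    return M.T @ np.linalg.solve(M @ M.T + lam * np.eye(n), y)


# Simulated example (hypothetical): 20 individuals, 200 markers coded 0/1/2.
rng = np.random.default_rng(0)
n_ind, n_markers = 20, 200
genotypes = rng.integers(0, 3, size=(n_ind, n_markers)).astype(float)
genotypes -= genotypes.mean(axis=0)  # centre each marker
phenotypes = rng.normal(size=n_ind)

# M'M is 200 x 200 but its rank cannot exceed n - 1 after centring,
# so ordinary least squares has no unique solution.
rank_of_normal_equations = np.linalg.matrix_rank(genotypes.T @ genotypes)

effects = ridge_marker_effects(genotypes, phenotypes, lam=1.0)
```

The penalty plays the role of the prior information on marker effects; how strongly to shrink, and with which distribution, is one of the questions the methods presented later in the book address.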
References 1. Lander ES, Schork NJ (1994) Genetic dissection of complex traits. Science 265: 2037–2048 2. Galton F (1877) Typical laws of heredity. Nature 15:492–495, 512–514, 532–533 3. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Proc Royal Soc Edinburgh 52: 399–433 4. Henderson CR (1953) Estimation of variance and covariance components. Biometrics 9: 226–252 5. Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Cambridge University Press, p 980 6. Sax K (1923) The association of size differences with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8:552 7. Morgan TH, Sturtevant AH, Muller HJ, Bridges CB (1915) The mechanism of Mendelian heredity. Henry Holt, New York 8. Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 20: 277–318 9. Botstein D, Whit RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–331 10. Rendel J (1961) Relationships between blood groups and the fat percentage of the milk in cattle. Nature 189:408–409 11. Tanksley SD, Medina-Filho R, Rick DM (1982) Use of naturally-occurring enzyme variation to detect and map genes controlling quantitative traits in an interspecific backcross of tomato. Heredity 49:11 12. Kahler AL, Wherhahn CF (1986) Associations between quantitative traits and enzyme loci in the F2 population of a maize hybrid. Theor Appl Genet 72:15 13. Paterson AH, Lander ES, Hewiit JD, Peterson S et al (1988) Resolution of quantitative traits into Mendelian factors by using a complete RFLP linkage map. Nature 335: 721–726 14. Cardon LR, Smith SD, Fulker DW, Kimberling WJ et al (1994) Quantitative trait locus for reading disability on chromosome 6. Science 266(5183):276–279 15. Peleman JD, van der Voort JR (2003) Breeding by design. Trends Plant Sci 8(7):330–334 16. 
van Arendonk JAM, Tied B, Kinghorn BP (1994) Use of multiple genetic markers in
prediction of breeding values. Genetics 137: 319–329 17. MacArthur J, Bowler E, Cerezo M, Gil L et al (2017) The new NHGRI-EBI catalog of published genome-wide association studies. Nucleic Acids Res 45(1):896–901 18. De los Campos G, Sorensen D, Gianola D (2015) Genomic heritability: what is it? PLoS Genet 11:e1005048 19. Miedaner T, Galiano-Carneiro Boeven AL, Gaikpa DS, Kistner MB, Grote CP (2020) Genomics-assisted breeding for quantitative disease resistances in small-grain cereals and maize. Int J Mol Sci 21:9717 20. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829 21. Lander ES, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11:241–247 22. Bohra A, Pandey MK, Jha UC, Singh B et al (2014) · genomics-assisted breeding in four major pulse crops of developing countries: present status and prospects. Theor Appl Genet 127:1263–1291. https://doi.org/10. 1007/s00122-014-2301-3 23. Van Ooijen J, Jansen J (2013) Genetic mapping in experimental populations. Cambridge University Press, Cambridge, p 155 24. Soller M (1990) Genetic mapping of the bovine genome using deoxyribonucleic acidlevel markers to identify loci affecting quantitative traits of economic importance. Dairy Sci 73:2628–2646 25. Fulker DW, Cardon LR (1994) A sib-pair approach to interval mapping of quantitative trait loci. Am J Hum Genet 54:1092–1103 26. Zhang Q, Boichard D, Hoeschele I, Ernst C et al (1998) Mapping quantitative trait loci for milk production and health of dairy cattle in a large outbred pedigree. Genetics 149(4): 1959–1973 27. Kissebah AH, Sonnenberg GE, Myklebust J, Goldstein M et al (2000) Quantitative trait loci on chromosomes 3 and 17 influence phenotypes of the metabolic syndrome. Proc Natl Acad Sci U S A 97(26):14478–14483 28. 
Knott SA, Elsen JM, Haley CS (1996) Multiple marker mapping of quantitative trait loci in half-sib populations. Theor Appl Genet 93: 71–80 29. Uimari P, Zhan Q, Grignolia FG, Hoeschelaned I, Thaller G (1996)
Granddaughter design data using leastsquares, residual maximum likelihood and Bayesian methods for QTL analysis. J Agri Genomics 2:1–20 30. Chuechill G, Doerge R (1994) Empirical threshold values for quantitative trait mapping. Genetics 138(3):963–971 31. Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199 32. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc 39:1–38 33. Haley CS, Knott SA (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:315–324 34. Sen S, Churchill GA (2001) A statistical framework for quantitative trait mapping. Genetics 159:371–387 35. Jansen RC (1993) Interval mapping of multiple quantitative trait loci. Genetics 135: 205–211 36. Zeng ZB (1993) Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci U S A 90:10972–10976 37. Zeng ZB (1994) Precision mapping of quantitative trait loci. Genetics 136:1457–1468 38. Kao CH, Zeng ZB, Teasdale RD (1999) Multiple interval mapping for quantitative trait loci. Genetics 152:1203–1216 39. Carlborg O, Andersson L, Kinghorn B (2000) The use of a genetic algorithm for simultaneous mapping of multiple interacting quantitative trait loci. Genetics 155: 2003–2010 40. Baierl A, Bogdan M, Frommlet F, Futschik A (2006) On locating multiple interacting quantitative trait loci in intercross designs. Genetics 173:1693–1703 41. Grignola FE, Zhang Q, Hoeschele I (1997) Mapping linked quantitative trait loci via residual maximum likelihood. Genet Sel Evol 29:529–544 42. Thomas C, Cortessis V (1992) A Gibbs sampling approach to linkage analysis. Hum Hered 42:63–76 43. Hoeschele I, VanRaden P (1993) Bayesian analysis of linkage between genetic markers and quantitative trait loci. II. 
Combining prior knowledge with experimental evidence. Theor Appl Genet 85:946–952
44. Satagopan JM, Yandell BS, Newton MA, Osborn TC (1996) Markov chain Monte Carlo approach to detect polygene loci for complex traits. Genetics 144:805–816 45. Banerjee S, Yandell BS, Yi NJ (2008) Bayesian quantitative trait loci mapping for multiple traits. Genetics 179:2275–2289 46. Haley CS, Knott SA, Elsen JM (1994) Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics 136:1195–1207 47. Chen M, Kendziorski C (2007) A statistical framework for expression quantitative trait loci mapping. Genetics 117:761–771 48. Bedo J, Wenzl P, Kowalczyk A, Kilian A (2008) Precision-mapping and statistical validation of quantitative trait loci by machine learning. BMC Genet 9:35 49. Guyon I (2003) An introduction to variable and feature selection. J Mach Learn Res 3: 1156–1182 50. Jannink JL, Jansen RC (2001) Mapping epistatic quantitative trait loci with one-dimensional genome searches. Genetics 157:445–454 51. Zhang YM, Xu S (2005) A penalized maximum likelihood method for estimating epistatic effects of QTL. Heredity 95:96–104 52. Manichaikul A, Moon JY, Sen S, Yandell BS, Broman KW (2009) A model selection approach for the identification of quantitative trait loci in experimental crosses, allowing epistasis. Genetics 181:1077–1086 53. Wang DL, Zhu J, Li ZKL, Paterson AH (1999) Mapping QTLs with epistatic effects and QTL environment interactions by mixed linear model approaches. Theor Appl Genet 99:1255–1264 54. Knott SA, Haley CS (2000) Multitrait least squares for quantitative trait loci detection. Genetics 156:899–911 55. Jiang C, Zeng ZB (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140:1111–1127 56. Lange C, Whittaker JC (2001) Mapping quantitative trait loci using generalized estimating equations. Genetics 159:1325–1337 57. 
Weller JI, Wiggans GR, Vanraden PM, Ron M (1996) Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in a multitrait experiment. Theor Appl Genet 92:998–1002 58. Liu J, Liu Y, Liu X, Deng HW (2007) Bayesian mapping of quantitative trait loci for multiple complex traits with the use of variance components. Am J Hum Genet 81:304–320
From Quantitative Trait Loci to Prediction 59. Goffinet B, Gerber S (2000) Quantitative trait loci: a meta-analysis. Genetics 155: 463–473 60. Veyrieras JB, Goffinet B, Charcosset A (2007) MetaQTL: a package of new computational methods for the meta-analysis of QTL mapping experiments. BMC Bioinformatics 8(49):1–16 61. Rebai A, Goffinet B (2000) More about quantitative trait locus mapping with diallel designs. Genet Res 75:243–247 62. Demarest K, Koyner J, McCaughran J, Cipp L, Hitzzeman R (2001) Further characterisation and high-resolution mapping of quantitative trait loci for ethanol induced locomotor activity. Behave Genet 31:79–91 63. Cavanagh C, Morell M, Mackay I, Powel W (2008) From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr Opin Plant Biol 1:215–221 64. Yu J, Holland JB, McMullen MD, Buckler ES (2008) Genetic design and statistical power of nested association mapping in maize. Genetics 178:539–551 65. Beavis WD (1998) QTL analyses: power, precision, and accuracy. In: Paterson AH (ed) Molecular dissection of complex traits. CRC Press, Boca Raton, FL, pp 145–162 66. Hill WG, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38:226–131 67. Balding D (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7:781–791 68. Visscher PM, Wray NR, Zhang Q, Sklar P et al (2017) 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 101:5–22 69. Hayes B (2014) Overview of Statistical methods for genome-wide association studies (GWAS). In: Gondro C, van der Werf J, Hayes B (eds) Genome-wide association Studies and genomic prediction. Springer, Berlin/Heidelberg, pp 149–169 70. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulindependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516 71. 
Nejati-Javaremi A, Smith C, Gibson JP (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci 75:1738–1745 72. Devlin B, Bacanu SA, Roeder K (2004) Genomic control in the extreme. Nat Genet 36: 1129–1130
39
73. Pritchard JK et al (2000) Association mapping in structured populations. Am J Hum Genet 67:170–181 74. Yu J, Pressoir G, Briggs WH, Bi IV et al (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 38: 203–208 75. Kang HM, Zaitlen NA, Wade CM, Kirby A et al (2008) Efficient control of population structure in model organism association mapping. Genetics 178:1709–1723 76. Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ et al (2010) Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42:355–360 77. Aulchenko YS, Koning DJ, Haley C (2007) Genome wide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177:577–585 78. Zhou X, Stephens M (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44:821–824 79. Wen YJ, Zhang H, Ni YL, Huang B et al (2018) Methodological implementation of mixed linear models in multi-locus genomewide association studies. Brief Bioinform 19(4):700–712 80. Fernando RA, Toosi A, Wolc D, Garrick N, Dekkers J (2017) Application of wholegenome prediction methods for genomewide association studies: a Bayesian approach. J Agric Biol Environ Stat 22:172–193 81. Pan Q, Hu T, Moore JH (2014) Epistasis, complexity, and multifactor dimensionality reduction. In: Gondro C et al (eds) Genome-wide association studies and genomic prediction. Springer, Berlin/Heidelberg, pp 465–478 82. Korte A, Vilhja´lmsson BJ, Segura V, Platt P et al (2012) A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat Genet 44:1066–1071 83. Zhu X, Stephens M (2017) Bayesian largescale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat 11(3):1561–1592 84. 
Wang T, Zhou B, Guo T, Bidlingmaier M et al (2014) A robust method for genome-wide association meta-analysis with the application to circulating insulin-like growth factor I concentrations. Genet Epidemiol 38(2):162–171 85. Turley P, Walters RK, Maghzian O, Okbay A et al (2018) Multi-trait analysis of genome-
40
Nourollah Ahmadi
wide association summary statistics using MTAG. Nat Genet 50(2):229–237 86. Cordell HJ (2009) Detecting gene–gene interactions that underlie human diseases. Nature Rev. Genet 10:393–404 87. Wu M, Ma S (2019) Robust genetic interaction analysis. Brief Bioinform 20(2):624–637 88. Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet 86:6–22 89. Qin X, Ma S, Wu M (2020) Gene-gene interaction analysis incorporating network information via a structured Bayesian approach. arXiv:2010.10960 90. Gaya´n J, Gonza´lez-Pe´rez A, Bermudo F, Sa´ez ME et al (2008) A method for detecting epistasis in genome-wide studies using casecontrol multi-locus association analysis. BMC Genomics 9:360 91. Han SS, Chatterjee N (2018) Review of statistical methods for gene-environment interaction analysis. Curr Epidemiol Rep 5:39–45 92. Ott J, Kamatani Y, Lathrop M (2011) Familybased designs for genome-wide association studies. Nat Rev Genet 12:465–474 93. Meuwissen THE, Goddard ME (2000) Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci. Genetics 155:421–430 94. Long AD, Mullaney SL, Reid LA, Fry JD et al (1995) High resolution mapping of genetic factors affecting abdominal bristle number in Drosophila melanogaster. Genetics 139: 1273–1291 95. Brut DW (2002) A comprehensive review on the analysis of QTL in animals. Trends Genet 18(9):488 96. Kearsey MJ, Farquhar AGL (1998) QTL analysis in plants; where are we now? Heredity 80: 137–142 97. Hayes B, Goddard ME (2001) The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol 33:209 98. Barton NH, Keightley PD (2002) Understanding quantitative genetic variation. Nat Rev Genet 3:11–21 99. Hyne V, Kearsey MJ (1995) QTL analysis further uses of marker regression. Theor Appl Genet 91:471–476 100. Robertson A (1967) The nature of quantitative genetic variation. 
In: Brink RA, Styles ED (eds) Heritage from Mendel. University of Wisconsin, Madison, WI, pp 265–280
101. Flint J, Mackay TFC (2009) Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res 19:723–733 102. Goddard ME, Kemper KE, MacLeod IM, Chamberlain AJ, Hayes BJ (2016) Genetics of complex traits: prediction of phenotype, identification of causal polymorphisms and genetic architecture. Proc R Soc B 283: 20160569 103. Yang J, Benyamin B, McEvoy BP, Gordon S et al (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42:565–569 104. Yang J, Bakshi A, Zhu Z, Hemani G et al (2015) Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet 47:1114–1120 105. Park JH, Wacholder S, Gail MH, Peters U et al (2010) Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42:570–575 106. Simons YB, Turchin MC, Pritchard JK, Sella G (2014) The deleterious mutation load is insensitive to recent population history. Nat Genet 46:220–224 107. Makowsky R, Pajewski NM, Klimentidis YC, Vazquez AI et al (2011) Beyond missing heritability: prediction of complex traits. PLoS Genet 7(4):e1002051 108. Phillips PC (1998) The language of gene interaction. Genetics 149:1167–1171 109. Cheverud JM, Routman EJ (1995) Epistasis and its contribution to genetic variance components. Genetics 139:1455–1461 110. Mackay TFC (2001) The genetic architecture of quantitative traits. Ann Rev Genet 35: 303–339 111. Nagel RL (2005) Epistasis and the genetics of human diseases. C R Biol 328:606–615 112. Jannink JL, Moreau L, Charmet G, Charcosset A (2008) Overview of QTL detection in plants and tests for synergistic epistatic interactions. Genitica 136(2):225–236 113. Albar L, Lorieux M, Ahmadi N, Rimbault I et al (1998) Genetic basis and mapping of the resistance to rice yellow mottle virus. I. QTLs identification and relationship between resistance and plant morphology. 
Theor Appl Genet 97:1145–1154 114. Pressoir G, Albar L, Ahmadi N, Rimbault I et al (1998) Genetic basis and mapping of the resistance to the rice yellow mottle virus. II. Evidence of a complementary epistasis between two QTLs. Theor Appl Genet 97: 1155–1161
From Quantitative Trait Loci to Prediction 115. Ahmadi N, Albar L, Pressoir G, Pinel A et al (2001) Genetic basis and mapping of the resistance to Rice yellow mottle virus. III. Analysis of QTLs efficiency in introgressed progenies confirmed the hypothesis of complementary epistasis between two resistance QTLs. Theor Appl Genet 103:1084–1092 116. Yi N, Zinniel DK, Kim K, Eisen EJ et al (2006) Bayesian analyses of multiple epistatic QTL models for body weight and body composition in mice. Genet Res 87:45–60 117. Mackay TFC, Roshina NV, Leips JW, Pasyukova EG (2006) Complex genetic architecture of drosophila longevity. In: Masaro EJ, Austad SN (eds) Handbook of the biology of aging. Elsevier, Academic Press, San Diego, CA, pp 181–216 118. Carlborg O, Hocking PM, Burt DW, Haley CS (2004) Simultaneous mapping of epistatic QTL in chickens reveals clusters of QTL pairs with similar genetic effects on growth. Genet Res 83:197–209 119. Grobe-Brinkhaus C, Jonas E, Buschbell H, Phatsara C et al (2010) Epistatic QTL pairs associated with meat quality and carcass composition traits in a porcine Duroc Pietrain population. Genet Sel Evol 42:39 120. Carlborg O, Haley CS (2004) Epistasis: too often neglected in complex trait studies? Nat Rev Genet 5:618–U614 121. Barendse W, Harrison BE, Hawken RJ, Ferguson DM (2007) Epistasis between Calpain1 and its inhibitor Calpastatin within breeds of cattle. Genetics 176(4):2601–2610 122. Zhang J, Wei Z, Cardinale CJ, Gusareva LS et al (2019) Multiple epistasis interactions within MHC are associated with ulcerative colitis. Front Genet 10:257 123. Strange T, Ask B, Nielsen B (2013) Genetic parameters of the piglet mortality traits stillborn, weak at birth, starvation, crushing, and miscellaneous in crossbred pigs. J Anim Sci 91:1562–1569 124. Huang A, Xu S, Cai X (2014) Whole-genome quantitative trait locus mapping reveals major role of epistasis on yield of Rice. PLoS One 9: e87330 125. 
Mackay TFC (2014) Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat Rev Genet 15(1):22–33 126. de los Campos G, Sorensen DA, Toro MA (2019) Imperfect linkage disequilibrium generates phantom epistasis (& Perils of Big Data). G3 (Bethesda) 9:1429–1436
41
127. Burch CL, Chao L (2004) Epistasis and its relationship to canalization in the RNA virus 6. Genetics 167:559–567 128. Eitan Y, Soller M (2004) Selection induced genetic variation. In: Wasser SP (ed) Evolutionary theory and processes: modern horizons. Springer, Berlin/Heidelberg, pp 153–176 129. Sonawane AR, Weiss ST, Glass K, Sharma A (2019) Network medicine in the age of biomedical big data. Front Genet 10:294 130. Amaral AJ, Bressan MC, Almeida J, Bettencourt C et al (2019) Combining genomewide association analyses and gene interaction networks to reveal new genes associated with carcass traits, meat quality and fatty acid profiles in pigs. Livest Sci 220:180–189 131. Ko DK, Brandizzi F (2020) Network-based approaches for understanding gene regulation and function in plants. Plant J 104: 302–317 132. Ebbert MTW, Ridge PG, Kauwe JSK (2015) Bridging the gap between statistical and biological epistasis in Alzheimer’s disease. Biomed Res Int 2015:870123. https://doi. org/10.1155/2015/870123 133. Boyle EA, Li YI, Pritchard JK (2017) An expanded view of complex traits: from polygenic to omnigenic. Cell 169:1177–1186 134. Field Y, Boyle EA, Telis N, Gao Z et al (2016) Detection of human adaptation during the past 2000 years. Science 354:760–764 135. Via S, Lande R (1987) Evolution of genetic variability in a spatially heterogeneous environment: effects of genotype-environment interaction. Genet Res 49:147–156 136. Bradshaw AD (1965) Evolutionary significance of phenotypic plasticity in plants. Adv Genet 13:115–155 137. de Leon N, Jannink JL, Edwards JW, Kaeppler SM (2016) Introduction to a special issue on genotype by environment interaction. Crop Sci 56:2081–2089 138. Gauch HG, Zobel RW (1996) AMMI analysis of yield trials. In: Kang MS, Gauch HG (eds) Genotype-by-environment interaction. CRC Press, Boca Raton, FL, pp 85–122 139. Eberhart SA, Russell WA (1996) Stability parameters for comparing varieties. Crop Sci 6:36–40 140. 
Stinchcombe JR, Function-valued Traits Work. Group, Kirkpatrick M (2012) Genetics and evolution of function-valued traits: understanding environmentally responsive phenotypes. Trends Ecol Evol 27:637–647 141. Robinson MR, Beckerman AP (2013) Quantifying multivariate plasticity: genetic
42
Nourollah Ahmadi
variation in resource acquisition drives plasticity in resource allocation to components of life history. Ecol Lett 16:281–290 142. Jansen RC, Van Ooijen JM, Stam P, Lister C, Dean C (1995) Genotype-by-environment interaction in genetic mapping of multiple quantitative trait loci. Theor Appl Genet 91: 33–37 143. Korol AB, Ronin YI, Ne E (1998) Approximate analysis of QTL-environment interaction with no limits on the number of environments. Genetics 148:2015–2028 144. Hayes BJ, Daetwyler HD, Goddard ME (2016) Models for genome x environment interaction: examples in livestock. Crop Sci 56(5):2251–2259 145. Valdar W, Solberg LC, Gauguier D, Cookson WO et al (2006b) Genetic and environmental effects on complex traits in mice. Genetics 174:959–984 146. Vieira C, Pasyukova EG, Zeng ZB, Hackett JB et al (2000) Genotype-environment interaction for quantitative trait loci affecting life span in Drosophila melanogaster. Genetics 154:213–227 147. El-Soda M, Malosetti M, Zwaan BJ, Koornneef M, Aarts MGM (2014) G E interaction QTL mapping in plants: lessons from Arabidopsis. Trends Plant Sci 19(6): 390–398 148. El-Soda M, Kruijer W, Malosetti M, Koornneef M, Aarts MGM (2015) Quantitative trait loci and candidate genes underlying genotype by environment interaction in the response of Arabidopsis thaliana to drought. Plant Cell Environ 38:585–599 149. MacMahon B (1968) Gene-environment interaction in human disease. J Psychiatr Res 6:393–402 150. Hunter DJ (2005) Gene-environment interaction in Humain diseases. Nat Rev Genet 6: 287–298 151. Lillehammer M, Goddard ME, Nilsen H, Sehested E et al (2008) Quantitative trait locus-by-environment interaction for Milk yield traits on Bos taurus autosome 6. Genetics 179:1539–1546 152. Des Marais DL, Hernandez KM, Juenger TE (2013) Genotype-by-environment interaction and plasticity: exploring genomic responses of plants to the abiotic environment. Annu Rev Ecol Evol Syst 44:5–29 153. 
Via S, Gomulkiewicz R, De Jong G, Scheiner SM et al (1995) Adaptive phenotypic plasticity: consensus and controversy. Trends Ecol Evol 10:212–217
154. Lacaze X, Hayes MP, Koro A (2009) Genetics of phenotypic plasticity: QTL analysis in barley, Hordeum vulgare. Heredity 102: 163–173 155. Gutteling EW, Riksen JAG, Bakker J, Kammenga JE (2007) Mapping phenotypic plasticity and genotype-environment interactions affecting life-history traits in Caenorhabditis elegans. Heredity 98(1):28–37 156. Kikuchi S, Bheemanahalli R, Jagadish KSV, Kumagai E et al (2017) Genome-wide association mapping for phenotypic plasticity in rice. Plant Cell Environ 40(8):1565–1575 157. Lukens LE, Doebley J (1999) Epistatic and environmental interactions for quantitative trait loci involved in maize evolution. Genet Res 74:291–302 158. Liu N, Du Y, Warburton ML, Xiao Y, Yan J (2020) Phenotypic plasticity contributes to maize adaptation and Heterosis. Mol Biol Evol 38(4):1262–1275. https://doi.org/10. 1093/molbev/msaa283 159. Kawecki TJ, Ebert D (2004) Conceptual issues in local adaptation. Ecol Lett 7: 1225–1241 160. Lowry DB, Lovell JT, Zhang L, Bonnette J et al (2019) QTL environment interactions underlie adaptive divergence in switchgrass across a large latitudinal gradient. Proc Natl Acad Sci U S A 116(26):12933–12941 161. Wickland DP, Hanzawa Y (2015) The flowering locus t/terminal flower 1 gene family: functional evolution and molecular mechanisms. Mol Plant 8(7):983–997 162. Dash S, Van Hemert J, Hong L, Wise RP, Dickerson JA (2012) PLEXdb: gene expression resources for plants and plant pathogens. Nucleic Acids Res 40:D1194–D1201 163. Coughlin SS, Trock B, Criqui MH, Pickle LW et al (1992) The logistic modeling of sensitivity, specificity, and predictive value of a diagnostic test. J Clin Epidemiol 45(1):l–7 164. Brand A, Brand H, in den B€aumen TS (2008) The impact of genetics and genomics on public health. Eur J Hum Genet 16:5–13 165. Yang Q, Khoury MJ, Botto L, Friedman JM, Flanders WD (2003) Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. Am J Hum Genet 72:636–649 166. 
Dudbridge F (2013) Power and predictive accuracy of polygenic risk scores. PLoS Genet 9(3):e1003348 167. Mason SJ, Graham NE (2002) Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves:
From Quantitative Trait Loci to Prediction statistical significance and interpretation. Q J R Meteorol Soc 128(584):2145–2166 168. Janssens ACJW, van Duijn CM (2009) Genome-based prediction of common diseases: Ethodological considerations for future research. Genome Med 1(2):20 169. Choi SW, Mak TS-H, O’Reilly PF (2020) Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc 15:2759–2772 170. Dreyfuss JM, Levner D, Galagan GE, Church GM, Ramoni MF (2012) How accurate can genetic predictions be? BMC Genomics 13: 340 171. Khera AV, Chaffin M, Aragam KG, Haas ME et al (2018) Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet 50(9):1219–1224 172. Vilhja´lmsson BJ, Yang J, Finucane HK (2015) Modelling linkage disequilibrium increases accuracy of polygenic risk scores. Am J Hum Genet 97:576–592 173. Hu Y, Lu Q, Powles R, Yao X et al (2017) Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput Biol 13:e1005589 174. Hu Y, Lu Q, Liu W, Zhang Y et al (2017) Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet 13:e1006836 175. Hazel LN (1943) The genetic basis for constructing selection indices. Genetics 38: 476–490 176. Henderson CR (1949) Estimation of changes in herd environment. J Dairy Sci 32:709 177. Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447 178. Henderson CR (1976) A simple method for computing the inverse of a numerator relationship matrix used in predicting of breeding values. Biometrics 32:69–83 179. Mrode RA (2005) Linear models for the prediction of animal breeding values. CAB Int:344 180. Piepho HP, Mo¨hring J, Melchinger AE, Bu¨chse A (2007) BLUP for phenotypic selection in plant breeding and variety testing. Euphytica 161:209–228 181. 
Bernardo R (1996) Best linear unbiased prediction of maize single-cross performance. Crop Sci 36(1):50–56
43
182. Fernando RL, Grossman M (1989) Markerassisted selection using best linear unbiased prediction. Genet Sel Evol 21:467–477 183. Guimara˜es EP, Ruane J, Scherf BD, Sonnino A, Dargie JD (2007) Marker-assisted selection. Current status and future perspectives in crops, livestock, forestry and fish, Rome, ISBN 978-92-5-105717-9, p 471 184. Goddard ME (1992) A mixed model for the analyses of data on multiple genetic markers. Theor Appl Genet 83:878–886 185. Goddard ME, Hayes BJ (2002) Optimisation of response using molecular data. 7th world congress on genetics applied to livestock production, Montpellier, France, Communication no. 22–01 186. Hayes BJ, Goddard ME (2003) Evaluation of marker assisted selection in pig enterprises. Livest Prod Sci 81(2–3):197–121 187. Boichard D, Fritz S, Rossignol MN, Guillaume F, et al. (2006) Implementation of marker-assisted selection: practical lessons from dairy cattle. 8th world congress on genetics applied to livestock production, Belo Horizonte, MG, Brasil 188. Hayes BJ, Chamberlain J, Mcpartlan H, Macleod I et al (2007) Accuracy of markerassisted selection with single markers and marker haplotypes in cattle. Genet Res 89: 215–220 189. Dekkers JCM, van Arendonk JAM (1998) Optimizing selection for quantitative traits with information on an identified locus in outbred populations. Genet Res 71:257–275 190. Hospital F, Charcosset A (1997) Markerassisted introgression of quantitative trait loci. Genetics 147:1469–1485 191. Bouchez A, Hospital F, Causse M, Gallais A, Charcosset A (2002) Marker-assisted introgression of favorable alleles at quantitative trait loci between maize elite lines. Genetics 162:1945–1959 192. Peleman JD, van der Voort JR (2003) Breeding by design. Trends Plant Sci 7:330–334 193. van Berloo R, Stam P (1999) Comparison between marker-assisted selection and phenotypical selection in a set of Arabidopsis thaliana recombinant inbred lines. Theor Appl Genet 98:113–118 194. 
Charmet G, Robert N, Perretant MR, Gay G et al (1999) Marker-assisted recurrent selection for cumulating additive and interactive QTL’s in recombinant inbred lines. Theor Appl Genet 99:1143–1148
44
Nourollah Ahmadi
195. Ragot M, Lee M (2007) Marker-assisted selection in maize: current status, potential, limitations and perspectives from the private and public sectors. In: Guiamaraes et al (eds) Marker assisted selection. FAO, Rome, pp 117–150 196. Dekkers JCM, Hospital F (2002) The use of molecular genetics in the improvement of agricultural populations. Nat Rev Genet 3: 22–32 197. Moreau L, Charcosset A, Hospital F, Gallais A (1998) Marker-assisted selection efficiency in populations of finite size. Genetics 148: 1353–1365 198. Moreau L, Charcosset A, Gallais A (2004) Experimental evaluation of several cycles of marker-assisted selection in maize. Euphytica 137:111–118
199. Melchinger AE (1999) Genetic Diversity and Heterosis. In: Coors JG, Pandey S (eds) Genetics and exploitation of Heterosis in crops. ASA, CSSA, and SSSA Books. Wiley, Hoboken, New Jersey, pp 99–118 200. Bernardo R (1999) Marker-assisted best linear unbiased prediction of single cross performance. Crop Sci 39:1277–1282 201. Spelman RJ, Garrick DJ (1998) Genetic and economic responses for within-family markerassisted selection in dairy cattle breeding schemes. J Dairy Sci 81:2942–2950 202. Ribaut JM, Hoisington D (1998) Markerassisted selection: new tools and strategies. Trends Plant Sci 3:236–239 203. Jansen RC, Jannink J-L, Beavis WD (2003) Mapping quantitative trait loci in plant breeding populations: use of parental haplotype sharing. Crop Sci 43:829–834
Chapter 2
Genomic Prediction of Complex Traits, Principles, Overview of Factors Affecting the Reliability of Genomic Prediction, and Algebra of the Reliability
Jean-Michel Elsen
Abstract
The quality of the predictions of genetic values based on the genotyping of neutral markers (GEBVs) is key information for deciding whether or not to implement genomic selection. This quality depends on the share of the genetic variability captured by the markers and on the precision with which their effects are estimated. Selection index theory provides the framework for evaluating the accuracy of GEBVs once the information has been gathered, with the genomic relationship matrix (GRM) playing a central role. When this accuracy must be known a priori, quantitative genetics theory gives clues for calculating the expectation of this GRM. This chapter makes a critical inventory of the methods developed to calculate these accuracies a posteriori and a priori. The most significant factors affecting this accuracy are described (size of the reference population, number of markers, linkage disequilibrium, heritability).
Key words Genomic prediction, Genomic relationship matrix, Accuracy of predictions, Effective number of loci
1
Introduction
Since the observation that part of the variability of traits is genetic, a part that is the source of the resemblance between relatives and is measured by it, constant efforts were made throughout the twentieth century to make the best use of family information for predicting individual values. In medicine, the challenge is to provide information on the risk of a particular disease, that is, to predict the patient’s future phenotype. In agronomy, the goal is to choose the best breeding individuals from batches of candidates, and hence to predict their genetic values, which are the expected average phenotypes of their potential descendants. The generally accepted predictor is the expected genetic value conditional on the information available. Thus, in the Gaussian framework, the predictions are based on linear regression, with the phenotypes of individuals related to the subject whose value one wishes to predict as predictor variables and, as coefficients, functions of the kinship coefficients between this subject and its relatives. In all cases, these predictors are random variables, and the predictions carry some uncertainty. Knowing this accuracy is essential both a priori, especially when optimizing selection plans, and a posteriori, when evaluating future breeding individuals.
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
The possibility of knowing, by molecular biology methods, the DNA sequences of many individuals at a large number of polymorphic loci has opened the way to an entirely different approach to predicting the value of these individuals. The original idea was that the scarcity of meiotic recombinations between two close loci, and therefore the strength of the linkage disequilibrium (LD) (see Note 1) that unites them, gives the genotype at a possibly neutral locus (without effect on the trait of interest) an informative value about the genotypes at nearby loci whose polymorphism explains part of the trait variability (the so-called QTL, for quantitative trait loci). The statistical estimation of the effects of a priori neutral DNA markers on this trait, in a population whose members are both phenotyped for the trait of interest and genotyped for the set of DNA markers, goes beyond description and becomes predictive: the values of other subjects, which were not involved in the estimation of these effects, can be predicted from their genotypes at these same marker loci.
In 20 years, the genomic prediction of individual values has flourished, to the point that in certain cases (especially large dairy cattle breeds) it has become the norm. The factors determining its effectiveness have gradually become clearer; their description is the subject of this chapter.
But it should first be noted that the initial hope of getting closer to the ideal prediction model, in which we would “read” the value of individuals by genotyping markers very strongly linked to the genes that control the traits, has not materialized. Genomic prediction is ultimately not that different from the methods of the past (e.g., [1]). Much of the information provided by molecular markers is related to the similarity between relatives, now supported by the similarity observed over large chromosomal segments, whereas in earlier nongenomic predictions it was only predicted in the form of probabilities of identity of genotypes conditional on pedigree information. Before going through the list of factors affecting the accuracy of genomic predictions, we must first review how to measure this accuracy. The wording used to describe the quality (precision, accuracy, reliability, etc.) of a statistic is often confusing (see Note 2). The word “accuracy” will be used in this chapter because it is the one most often used in the corresponding literature. Most of what will be described comes from the field of agronomy, and more specifically animal genetics, although the concepts have a broader value.
The diversity of approaches that have been proposed, explored, and applied since the seminal work of Meuwissen et al. [2] will be described at length in the following chapters. This diversity is manifold: in the hypotheses on the distribution of QTL effects (Gaussian, exponential, mixtures of Gaussians and Dirac, etc.), in the information mobilized (purely genomic, or combining molecular and pedigree data), in the conceptual statistical framework (frequentist or Bayesian), and in the resolution algorithms (derived from the classical mixed model and based on the manipulation of large matrices, or using MCMC-type sampling methods). The quality of genomic predictions (accuracy, bias, computational burden, versatility) varies between these approaches, with differences depending on the structure of the populations (defined by their size and diversity), the quantity of information available (number of DNA markers), and the “genetic architecture” of the trait of interest (number and effects of QTLs). These elements will be detailed in the rest of the book. It should nevertheless be remembered that the numerous published comparisons show only small differences in genomic prediction accuracy between models (i.e., hypotheses about the genetic architecture) and statistical approaches. The remainder of this chapter will therefore focus on the simplest and most widely used of these approaches, the “GBLUP” (Genomic Best Linear Unbiased Prediction) and its equivalent, the “SNPBLUP” (Single Nucleotide Polymorphism BLUP, also called RR-BLUP). To understand the theory, the measurement tools, and the results concerning the accuracy of these genomic predictions, we must first go back to a few principles and define the notation.
2
Genomic Prediction Methods
Historically, two approaches have been proposed; we will see that they converge.
2.1
SNPBLUP
The first approach, SNPBLUP, foreshadowed in [3], was in the line of marker-assisted selection methods (“MAS”). This MAS was initially confined to the possibility of enriching within-family transmission information using DNA markers, in the linkage analysis framework (see Note 3). Despite the initial enthusiasm, this MAS has, however, been little used in practice (we can however cite the case of French dairy cattle as a positive counterexample [4]). When the density of markers in the genetic maps increased, the idea of MAS exploiting a strong LD common to all families between neutral markers and polymorphic causal genes, and therefore to overcome the limitations of within-family selection, became realistic. Then, the availability of very high-density SNP microarrays further paved the way for the exploitation of population level LD, and genomic selection was reformulated, with a seminal article [2]
48
Jean-Michel Elsen
in which several statistical tools were proposed for the evaluation of the SNPs effects on the trait of interest. In the simplest and most used version of these methods, phenotypic performances ($y_i$ for individual $i$) are modeled as the sum of apparent effects of $M$ markers (induced by their LD with one or more causal loci, see Note 4), some possible nongenetic effects ($\mu$) and a random factor ($e_i$):

$$y_i = \mu + \sum_k w_{ik}\beta_k + e_i$$

where $w_{ik}$ is the genotype at locus $k$ (for example {0, 1, 2} depending on the number of alleles) and $\beta_k$ the apparent effect of this allele. We will see later that several codifications of genotypes (other than {0, 1, 2}) are possible. In matrix terms, this equation is written $y = \mu + W\beta + e$. Statistically the $\beta$ effects are given as random and we obtain a prediction of their value using the usual techniques (the BLUP, see [5]):

$$\hat{\beta} = (W'W + \lambda_\beta I)^{-1} W'y$$

(where $\lambda_\beta$ is the ratio between the variances of the nuisance effects $e$ and of the $\beta$ effects). The model has been called SNPBLUP [6]. Once the effects of the markers are predicted, the genomic value (GEBV) of an individual who has no phenotype but whose markers genotypes (codified $w$) are known, $u = \sum_k w_k\beta_k$, can be predicted by $\hat{u} = \sum_k w_k\hat{\beta}_k = w'\hat{\beta}$.
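The SNPBLUP equations above can be sketched numerically as follows. This is a minimal illustration on simulated data, not the implementation used in the cited works: the function name `snpblup`, the population sizes, the variances, and the choice of centring genotypes on the reference column means and estimating $\mu$ by the phenotypic mean are all assumptions made for the example.

```python
import numpy as np

def snpblup(W, y, lam_beta):
    """Solve (W'W + lam_beta I) beta = W'(y - mean(y)); return (mu_hat, beta_hat)."""
    mu_hat = y.mean()  # assumption: the only nongenetic effect is an overall mean
    n_markers = W.shape[1]
    lhs = W.T @ W + lam_beta * np.eye(n_markers)
    beta_hat = np.linalg.solve(lhs, W.T @ (y - mu_hat))
    return mu_hat, beta_hat

rng = np.random.default_rng(0)
n_ref, n_markers = 200, 50                       # hypothetical sizes
p = rng.uniform(0.1, 0.9, size=n_markers)        # simulated allele frequencies
geno = rng.binomial(2, p, size=(n_ref + 1, n_markers)).astype(float)  # {0, 1, 2}
coded = geno - geno[:n_ref].mean(axis=0)         # centre on reference column means
W, w = coded[:n_ref], coded[n_ref]               # reference matrix, one candidate

sigma_beta, sigma_e = 0.1, 1.0                   # hypothetical standard deviations
beta = rng.normal(0.0, sigma_beta, size=n_markers)
y = 10.0 + W @ beta + rng.normal(0.0, sigma_e, size=n_ref)

lam_beta = sigma_e**2 / sigma_beta**2            # ratio of nuisance to marker-effect variances
mu_hat, beta_hat = snpblup(W, y, lam_beta)
u_hat = w @ beta_hat                             # GEBV of the nonphenotyped candidate
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience only; the quantity computed is the same $\hat{\beta}$ as in the text, applied here to mean-corrected phenotypes.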
2.2
GBLUP
The second approach, GBLUP, is in line with the methods of predicting genetic values from pedigrees, and in particular the "animal model BLUP" [5]. These methods are based on the relationship matrix ($\mathbf{A}$) formed by the probabilities of IBD (Identity By Descent) at any locus between phenotyped relatives ($A$) and between the candidate (phenotyped or not) and its relatives ($a$):

$$\mathbf{A} = \begin{pmatrix} 1 & a' \\ a & A \end{pmatrix}$$

(see Note 5). This matrix can be established from pedigree information. The prediction of the candidate's genetic value is defined as the expectation of its genetic value given the phenotypes of its relatives (the $y$) and the kinship relationships summarized in $\mathbf{A}$. We thus have a linear model $y = \mu + g + e$, where $g$ is the vector of additive genetic values whose covariance matrix is $A\sigma_g^2$. In the Gaussian framework, this is the classical regression: $E[g \mid y, a, A] = \hat{g} = \mathrm{cov}(g, y)\,\mathrm{var}(y)^{-1}\,y$, that is

$$\hat{g} = a'\sigma_g^2\left(A\sigma_g^2 + I\sigma_e^2\right)^{-1} y = a'\left(A + I\lambda_g\right)^{-1} y$$

(see Note 6). Nejati-Javaremi et al. [7] were probably the first to propose enriching this pedigree information with genotypic data. Their idea was that (1) the coefficients constituting the $a$ and $A$ vectorial elements are the expectations established from the pedigree and the IBD states between individuals, around which there is a variation coming from the Mendelian sampling, and (2) in the ideal
Prediction of GEBV Accuracy
situation where all the causal loci (denoted QTL, for Quantitative Trait Loci, in the following) are known and genotyped, the observation at such a locus of the states of identity or not of the alleles carried by two individuals would account for this variation, which makes it possible to specify these states. Thus the average IBD of two full sibs is 0.5 but, at a biallelic locus, these sibs can be, for example, both AA or both BB, and therefore totally similar (we will give a value of 1 to this state), or on the contrary one can be AA and the other BB, and therefore totally different (we will give a value of 0 to this state). The lower the number of QTLs, the more the mean of the states (across the QTLs) observed for this pair of full sibs may differ from 0.5. In the proposition of Nejati-Javaremi et al. [7], the relationship matrix $A$ is replaced by a genomic relationship matrix (GRM) at QTLs, $G$, whose elements are defined by

$$G_{ij} = \frac{1}{Q}\sum_{q=1}^{Q}\frac{\sum_{s=1}^{2}\sum_{t=1}^{2} I_{ijqst}}{2}$$

where $I_{ijqst} = 1$ if, at locus $q$, the $s$th allele carried by individual $i$ is equal to the $t$th allele carried by individual $j$. In this ideal situation of known QTLs, the prediction of a candidate's value becomes $\hat{g} = c'\left(G + I\lambda_g\right)^{-1} y$ (with, as in the case of a pedigree matrix, $\mathbf{G} = \begin{pmatrix} 1 & c' \\ c & G \end{pmatrix}$). This model was subsequently qualified as GBLUP [8].

2.3
Equivalence Between GBLUP and SNPBLUP, Definitions of Genomic Matrices
Under the same assumption of a perfect knowledge of QTLs, phenotypic performances were modeled as $y = \mu + Z\alpha + e$ [8], the value of the nonphenotyped candidate being estimated by $\hat{u} = z\hat{\alpha}$ (here the elements $z$, $Z$, and $\alpha$ are relative to the QTLs whereas $w$, $W$, and $\beta$ given previously were relative to markers) with $z_{iq} = 0 - 2p_q$, $z_{iq} = 1 - 2p_q$ or $z_{iq} = 2 - 2p_q$, depending on whether the individual $i$ is of genotype AA, AB, or BB at the $q$ locus, $p_q$ being the frequency of the B allele (see Note 7). The authors stated that this prediction, the result of an SNPBLUP, can also be interpreted as that of a GBLUP [8]. To understand this, two elements are needed:

(1) The Householder equation [9] (very frequently used in BLUP developments), which allows us to write that

$$\hat{\alpha} = (Z'Z + \lambda_\alpha I)^{-1} Z'y = Z'(ZZ' + \lambda_\alpha I)^{-1} y.$$

(2) The relationships between the genetic variance $\sigma_g^2$ and the variance of the effects of QTLs $\sigma_\alpha^2$, which are described in detail in [10]. In basic quantitative genetics, the genetic value of an individual is given by $g = \sum_q z_q\alpha_q = z\alpha$ and the $\alpha_q$ gene effects are fixed, which gives

$$\mathrm{var}\left(g \mid \{\alpha_q\}\right) = \sum_q\sum_{q'} \alpha_q\alpha_{q'}\,\mathrm{cov}\left(z_q, z_{q'}\right)$$

and, under the assumption of Linkage Equilibrium (LE) between all loci,

$$\mathrm{var}\left(g \mid \{\alpha_q\}\right) = \sum_q \mathrm{var}(z_q)\,\alpha_q^2 = \sum_q 2p_q(1 - p_q)\,\alpha_q^2.$$

But we can also consider that the effects of QTLs ($\alpha_q$) are random variables whose distribution laws are independent with a variance $\sigma_\alpha^2$. This variance has two interpretations: in the Bayesian framework [2], it reflects the uncertainty about the value of the effects and, in the frequentist framework, the idea of a conceptual sampling of the effects, distributed with this variance and zero expectation. The genetic variance $\sigma_g^2$ is the expectation of $\mathrm{var}(g \mid \{\alpha_q\})$, which simplifies (see Note 8) to $\sum_q 2p_q(1 - p_q)\,\sigma_\alpha^2$ under the assumption of independence between effects and allelic frequencies. So,

$$\sigma_\alpha^2 = \frac{\sigma_g^2}{\sum_q 2p_q(1 - p_q)}.$$

With those elements, it can be seen that $\hat{u} = z\hat{\alpha} = zZ'(ZZ' + \lambda_\alpha I)^{-1} y$: the genetic value predicted by SNPBLUP is also the result of a GBLUP using as GRM

$$\mathbf{G} = \begin{pmatrix} zz' & zZ' \\ Zz' & ZZ' \end{pmatrix}\frac{1}{\sum_q 2p_q(1 - p_q)},$$

thus approaching the proposal of Nejati-Javaremi et al. [7] (see Note 9). The codification $z_{iq} = \left(\{0, 1, 2\} - 2p_q\right)/\sqrt{2p_q(1 - p_q)}$, by which the genotypes have standardized distributions, is another possibility [11]. In this case, equivalence between SNPBLUP and GBLUP requires defining the genomic matrix as $G = ZZ'/Q$, with $Q$ the number of QTLs. This duality was also put forward for more realistic situations in which the QTLs are not known but markers are available (the basic elements are the $w_{ik}$ and not the $z_{iq}$, and the genetic value $g_i = \sum_q z_{iq}\alpha_q$ has to be distinguished from the genomic value $u_i = \sum_k w_{ik}\beta_k$). It should not be forgotten that these genomic matrices $G_m$, based on the genotypes at markers, are only approximations of the genomic matrix $G$ as designed by Nejati-Javaremi et al. [7]. We will come back to this approximation later.

2.4
Definitions of Genomic Matrices
Several genomic matrices have been proposed for the GBLUP, differing by the way genotypes are coded (Table 1). To ensure equivalence with the SNPBLUP model, each of those genomic matrices is based on a specific hypothesis about the relationship between $\sigma_\alpha^2$ or $\sigma_\beta^2$ and $\sigma_g^2$. Depending on the codification chosen, bias may occur, and proposals were made to solve these difficulties. Elements of $G$ or $G_m$ may depend on allele frequencies (the $p_k$), making them sensitive to the quality of their estimates. Moreover, the pedigree matrix $A$ and the GRM, $G$ or $G_m$, are difficult to compare because the elements of $A$ come from information on the pedigree, which by definition is of variable depth and always limited, while the genomic matrices $G$ or $G_m$ depend on allele frequencies, which should be those of the population of the founders of the pedigree in order to make these matrices comparable. A modification of the diagonal elements of $G_m$, such that their variances are independent of the $p_k$, was also proposed [11]. Another proposition is to use the $\{-1, 0, 1\}$ coding and to regress the matrix obtained on $A$: $\hat{G} = A + b\,(G_m - A)$ [12]; assimilating the regression coefficient $b$ to the part of the genetic variance explained by the markers gives a track for its estimation [8]. We will come back to its calculation later.

Table 1 Coding of genotypes at markers

| Codification (AA/AB/BB) | Authors |
| --- | --- |
| $\{-1, 0, 1\}$ | Legarra A, Misztal I (2008) [41]; VanRaden PM (2008) [12] |
| $\{0, 1, 2\}$ | Habier D, Fernando RL, Dekkers JC (2007) [31] |
| $\{0, 1, 2\} - 2p_k$ | Goddard ME, Hayes BJ, Meuwissen THE (2011) [8]; VanRaden PM (2008) [12] |
| $\left(\{0, 1, 2\} - 2p_k\right)/\sqrt{2p_k(1 - p_k)}$ | Yang J, Benyamin B, McEvoy BP et al. (2010) [11]; VanRaden PM (2008) [12]; Speed D, Hemani G, Johnson MR, Balding DJ (2012) [14]; Zou F, Lee S, Knowles MR, Wright FA (2010) [13]; Stranden I, Garrick DJ (2009) [64] |

The unavoidable and desired presence of LD between causal and/or marker loci modifies the simple interpretation of the matrices. The variability of LD in the genome can lead to bias in the calculation of heritability based on these genomic matrices [13, 14]: in areas with higher LD the same QTL will be "tagged" by more markers and its weight in estimating its contribution to genetic variability will be overestimated. It was suggested to weight the $W_k$ column by $\frac{1}{1 + \sum_{l \neq k}\rho_{kl}^2}$, where $\rho_{kl}^2$ is the measure of the LD between loci $k$ and $l$ [13]. If, for instance, $n$ loci are in complete LD with each other and in LE with all other loci, their weight will be $1/n$. A second, more elaborate proposal [14] requires many more calculations for its implementation. We will find this idea of weighting the contributions of SNPs to the genomic matrix according to the structure of the LD again in the estimation of the number of independent genomic segments.

2.5
Conclusion, Implementation, Usefulness of Knowing the Accuracy of Predictions
The value of an individual is therefore predicted by an equation of the type $\hat{u} = w'\left(W'W + \lambda_\beta I\right)^{-1} W'y$ (SNPBLUP) or $\hat{u} = c'\left(G_m + \lambda_g I\right)^{-1} y$ (GBLUP), in which $W$ and $w$ are the codified genotypes of the phenotyped individuals (reference population) and of the nonphenotyped individual (candidate for selection), $G_m$ is the genomic matrix between referents and $c$ is the vector of genomic relationships between the candidate and the referents, the matrix $G_m$ being proportional to $WW'$. These $\hat{u}$ are qualified as GEBV and predict the sum of the apparent effects of the markers, $u = \sum_k w_k\beta_k$. Conversely, the genetic values, TBV (True Breeding Values), are the sum of the effects of the causal genes (QTLs), $g = \sum_q z_q\alpha_q$.

These quantities are not at all the same [8, 15, 16]. The genetic value $g$ is the sum of the effects captured by the markers ($u$) and of polygenic effects ($r$) not captured (due to partial LD or for QTLs independent of all markers), such that $\mathrm{var}(g) = \mathrm{var}(u) + \mathrm{var}(r)$. The ratio $b = \mathrm{var}(u)/\mathrm{var}(g)$, the part of the genetic variance captured by the markers, is an important element to consider when measuring GEBV accuracy. We will largely come back to this value later. In the early 2000s, genomic selection was most often described in a framework comprising a phenotyped and genotyped reference or calibration population, which served as a support for establishing prediction equations (in practice for estimating the effects of markers with a SNPBLUP type model), a validation population on which the results of these equations were challenged against classical information about genetic values, and a selection or target population whose individuals were sorted using the validated prediction equations. For the most advanced species or breeds, the success of genomic selection has erased this structuring, with individuals from the same population intervening in turn in the three components (calibration, validation, and target): in cattle, for instance, young males selected on their GEBV become later in their life, when their daughters are phenotyped, part of the reference population.
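The equivalence between the two prediction equations of this section can be checked numerically. The sketch below is an illustration under stated assumptions: genotypes are simulated, the centered coding $\{0, 1, 2\} - 2p_k$ is used, $G_m = WW'/\sum_k 2p_k(1 - p_k)$, and the ridge parameters are linked by $\lambda_g = \lambda_\beta/\sum_k 2p_k(1 - p_k)$, which follows from $\sigma_\beta^2 = \sigma_g^2/\sum_k 2p_k(1 - p_k)$. The sizes and the value of `lam_beta` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ref, n_markers = 60, 300                      # hypothetical sizes (more markers than animals)
p = rng.uniform(0.05, 0.95, size=n_markers)     # simulated allele frequencies
geno = rng.binomial(2, p, size=(n_ref + 1, n_markers)).astype(float)
coded = geno - 2.0 * p                          # centered coding {0, 1, 2} - 2 p_k
W, w = coded[:n_ref], coded[n_ref]              # reference population, one candidate
y = rng.normal(size=n_ref)                      # any (mean-corrected) phenotype vector will do

lam_beta = 150.0                                # arbitrary ratio sigma_e^2 / sigma_beta^2
scale = np.sum(2.0 * p * (1.0 - p))             # sum_k 2 p_k (1 - p_k)

# SNPBLUP: u_hat = w'(W'W + lam_beta I)^{-1} W'y  (system of size n_markers)
beta_hat = np.linalg.solve(W.T @ W + lam_beta * np.eye(n_markers), W.T @ y)
u_snpblup = w @ beta_hat

# GBLUP: u_hat = c'(G_m + lam_g I)^{-1} y  (system of size n_ref)
G_m = W @ W.T / scale
c = W @ w / scale
lam_g = lam_beta / scale
u_gblup = c @ np.linalg.solve(G_m + lam_g * np.eye(n_ref), y)
```

Both routes return the same GEBV up to rounding error; in practice the choice between them is mostly a matter of which linear system is smaller, the one indexed by markers or the one indexed by individuals.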
3
Observed Accuracy of Genomic Predictions

The accuracy of genomic predictions is important information in several respects: routinely, to inform breeders' choice of reproducers; before the implementation of genomic selection, when it is necessary to assess the usefulness of investing in genotyping (e.g., [17–19]); to plan the process of collecting phenotypic performances and evaluating genetic values (e.g., [20]); to compare the performances of statistical prediction methods (e.g., [21]); and to assess, a posteriori, the efficiency of genomic selection (e.g., [22]).
3.1
Accuracy Measurement Methods

3.1.1
A Posteriori Accuracy of GEBV in the Ideal Case

The ideal case to obtain a high accuracy for GEBVs is the one where all the assumptions underlying the GEBV equations are satisfied. They are numerous and concern the coverage of genetic variability at QTLs by markers, which depends on their number and positioning on the genome; the quality of the estimation of marker effects, which depends on allele frequencies and LD with QTLs and can therefore vary between populations, in particular between reference and selection populations; the absence of nonadditive effects; and so on. In this ideal situation, the accuracy could be quantified by the squared correlation between GEBVs and TBVs. With the notations proposed here, this accuracy is

$$r^2_{g,\hat{u}} = \frac{\mathrm{cov}(g, \hat{u})^2}{\mathrm{var}(g)\,\mathrm{var}(\hat{u})}.$$

A close measure, a "reliability" defined as the ratio $\mathrm{var}(\hat{u})/\mathrm{var}(g)$, was given by $c'(G_m + \lambda_g I)^{-1}c$ [12] (see Note 10). We must not forget that $\hat{u}$ is a predictor of $u$ and not of $g$: equality between the reliability $\mathrm{var}(\hat{u})/\mathrm{var}(g)$ and the correlation $r^2_{g,\hat{u}}$ is only true when $\mathrm{cov}(g, \hat{u}) = \mathrm{cov}(u, \hat{u}) = \mathrm{var}(\hat{u})$. Under this assumption, the proposed measure is also $b\,r^2_{u,\hat{u}}$. This definition of GEBV accuracy ($r^2_{g,\hat{u}}$, generally decomposed into $b\,r^2_{u,\hat{u}}$) is omnipresent in the literature on the efficiency of genomic selection. Measures of accuracy when the information on the candidate value combines GEBV and phenotypes of the candidate (or its relatives) were also derived [15]. This accuracy measurement was extended to the case of a multitrait assessment [15, 23], and used to assess the effect of the information source (allele frequencies, LD pattern, haplotypes, chromosomes, family) on the accuracy of genomic prediction [24]. The situation of a multibreed assessment was also developed [25–27]. The decision to switch from conventional methods to genomic selection can be informed by the observation, on a sample of the candidate population, of the result of such a change of organization on the quality of breeding animals.

3.1.2
Experimental Procedures for an A Priori Estimate of the Accuracy
The measure of the utility (of a new organization of the selection, of a new method of calculating GEBV, of the current evaluation and selection organization under various assumptions such as the demographic structure of the population or the genetic architecture of the trait) will be based on the accuracy of the genomic prediction, which should, if necessary, be integrated into a calculation of the overall efficiency and profitability of the planned selection scheme. Considering the equations giving the prediction accuracy $r^2_{g,\hat{u}}$ presented above as perfect, it would suffice to compare them with the accuracies of the alternative methods when such formulas are available for the latter. Alternatively, the accuracy of the genomic predictions under study can be approximated by the correlations between these predictions and a "gold standard."
In experiments aimed at evaluating new approaches, the gold standard will be an already established method for predicting genetic values. In demonstrations prior to the implementation of genomic selection, this standard will most often be the animal model BLUP. A typical question is to choose between a classic prediction obtained after a progeny test, therefore at the cost of long generation intervals, and a genomic prediction available from an early age, without measuring the phenotypes of the candidate or its progeny. In other words, does the effect on the annual genetic gain of the acceleration of reproductive cycles outweigh that of the loss of precision in the evaluation of genetic values (e.g., [17])? Measuring the accuracy of predictions by correlation with a standard is not, however, fully satisfactory. A weak correlation does not mean that the new prediction method predicts true genetic values (TGV) worse than the old one, only that it predicts them differently. The best solution would be to correlate the prediction results with those unknown TGV. Returning to the principles defining the genetic value of an individual as the expected phenotype of its progeny, we can approach these genetic values by the average over a large number of progeny, corrected for the nuisance effects also considered in prediction models. Thus, the correlation of GEBVs with these corrected means can be used as a measure of the accuracy of genomic predictions. An excellent review on genomic evaluation methods and measures of their accuracy [21] cites three "predictants" to which the GEBV must be correlated: the corrected phenotype, the offspring mean (typically the "Daughter Yield Deviation" used in dairy cattle genetics) and EBVs, which must be "deregressed." In all three cases, for the predictor to be the best possible approach to the true genetic value, it is suggested to divide its value by the square root of the part of its variance which is of genetic origin (i.e., $\sqrt{h^2}$ for a phenotype).
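A small simulation can illustrate why such a scaling is useful. One common reading of the correction, assumed here, is that the correlation between GEBVs and corrected phenotypes is divided by $\sqrt{h^2}$ to approximate the correlation with true genetic values. In this sketch the "GEBV" is a stand-in (true value shrunk and perturbed by independent error), and the heritability, sample size, and noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000                                    # hypothetical number of validation individuals
h2 = 0.4                                     # hypothetical heritability of the phenotype

g = rng.normal(0.0, np.sqrt(h2), size=n)                 # true genetic values, variance h2
y = g + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)       # corrected phenotypes, variance 1
# Stand-in for a genomic prediction: shrunken true value plus independent prediction error.
gebv = 0.6 * g + rng.normal(0.0, 0.3, size=n)

r_true = np.corrcoef(gebv, g)[0, 1]          # accuracy, observable only in simulation
r_pheno = np.corrcoef(gebv, y)[0, 1]         # what real validation data provide
r_scaled = r_pheno / np.sqrt(h2)             # scaled estimate of the accuracy
```

The raw correlation with phenotypes understates the accuracy because the phenotype contains environmental noise that no genetic predictor can explain; the scaling removes that attenuation as long as the prediction errors are independent of the environmental effects.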
The observation of genomic predictions accuracy is based on the above-mentioned subdivision of a set of genotyped and phenotyped individuals into a reference population and a validation or selection population. Several decisions have to be made regarding the constitution of these populations: relative and absolute numbers, genealogical structuring, and genetic distance between populations. The solutions adopted depend on the initial question, demonstration or experimentation of the value of a given innovation. A solution frequently adopted to estimate the accuracy of the evaluations consists, according to [28], in randomly distributing the population across several (k) subsets (fold) and successively considering each of them as a validation population the others forming the reference. An extensive description of those methods applied to genomic prediction is given in [29]. The accuracy of
GEBVs is given by their correlation with the corresponding phenotypes. The calculation is repeated k times, which makes it possible to estimate the variance of the correlation. More specifically, the distribution between reference and validation populations can be imposed by the nature of the question. We can take as an example the measurement of the quality of the predictions in a breed that does not have a reference population (or has a reference population of a priori too small size), by using the "prediction equations" established in a better-endowed breed.

3.1.3
Accuracy of Genomic Predictions Obtained by Simulations
The approach which has just been described has the merit of not being based on a simplification of the biological reality, but suffers on the one hand from the lack of available data, in particular for organisms starting genomic selection, and on the other hand from the impossibility of a completely objective measurement of the accuracy when defined as the correlation between true and estimated genetic values. The use of data simulations is a partial answer to these drawbacks. Many examples will be given in the following Subheading 3.2. In these simulations, each individual of the simulated population is described by a vector of genotypes at a large number of loci. These loci correspond to SNPs, some of which are QTLs, the polymorphism of which contributes to the genetic variability of the trait of interest. In a purely additive framework, we have seen that the genetic value of an individual (identified $i$) is the sum of the effects of these QTLs, $g_i = \sum_q z_{iq}\alpha_q$. When the individual expresses a phenotype corresponding to this genetic value, it is $p_i = g_i + e_i$, where $e_i$ is the effect of the environment. The relationships between genotypes at QTLs and the variables $g_i$ and $p_i$ become more complicated when considering nonadditive QTL effects (dominance and epistasis), genotype × environment interactions, phenotypes repeated over time, multitrait prediction, and so on. In a simulation experiment, starting from a set of virtual founders to whom $z_{iq}$ genotypes are assigned, several cycles of creation of new individuals and elimination of old ones are carried out under the effect of selection or mortality. The new individuals are generated according to a mating plan between available reproducers, each new individual being the union of gametes produced by its parents and created randomly according to chosen rules (for example the absence of interference) governing the numbers and positions of crossing-overs. A realistic simulation must start by generating a group of founders whose genomic characteristics are those of the real populations to be simulated, in particular in terms of the distribution of allele frequencies and of LD between loci according to their physical distance. There are two solutions to this problem: (1) starting, for each locus, from identical and independent allele frequency distribution laws and simulating a very large number of generations without selection, with bottleneck and expansion
phases, until we obtain the desired structure; (2) exploiting, when it exists, data from a large enough sample (typically a few hundred) of "chromosomes," that is to say the vectors of genotypes for a set of polymorphic loci of real individuals. In this case, the genomes of the founders are randomly chosen from this sample of chromosomes. For each simulation cycle we obtain data tables of genotypes and phenotypes similar in all respects to real-world data files. Software for phasing, genomic prediction, mating optimization, and so on, developed for real data, can be applied to these data. However, unlike analyses of real data, the consequences of the technical choices thus made are directly measurable on the evolution of the distributions of genotypes at QTLs and on the distribution of the "true" genetic values of individuals. The correlation between these values and their prediction is an indicator of the quality of genomic predictions. This simulation approach has the considerable advantage of flexibility. We will see later that it is possible, without difficulty, to test the effect of the sizes of populations (in particular the reference population) and of their pedigree structure, the density of markers and the distribution of their LD, the method of genomic prediction and, to a lesser extent, the genetic architecture of the trait. The extreme complexity of cases of interactions between genes makes their modeling very hypothetical.

3.2
Main Factors Affecting the Accuracy
Thanks to the above described analytical methods, data have accumulated on the factors influencing the accuracy of genomic evaluations. The references to the elements described below do not claim to be exhaustive of a very abundant literature on this subject, but aim to illustrate the issues.
3.2.1 Relative Accuracies of the Different Genetic and Genomic Predictions
The very first observation is that genomic evaluations are very generally more precise than traditional evaluations. Thus, following a 2009 report on the effectiveness of these methods in dairy cattle [30], the conclusions of six articles give an almost always positive gain in accuracy, from −0.1 to +0.31, compared with parental information alone (see Note 11). Daetwyler et al. [21] compared GEBV and classical evaluations in two situations: real data on pine and wheat, and simulated data based on genomic structures similar to those of Holstein cattle. For real data, the quality of the evaluations was estimated by ten-fold cross validation, while for simulations it was given by the correlation between the GEBVs and the TBVs. The authors observed a clear superiority of genomic evaluations over traditional evaluations, for both real and simulated data. This advantage of GEBV was confirmed in many other studies (e.g., [31] with simulated data and [32] with real data; [33] with both approaches).
The second observation is that the accuracies of the different genomic prediction methods are very similar, especially when the trait is dependent on a large number of genes with small effects (infinitesimal model). Methods which support the possibility of coexistence of groups of QTLs with very different effects (Bayesian methods such as Bayes A, B, C, and so on (see the Bayesian alphabet [10]), Partial Least Squares, Least Absolute Shrinkage and Selection Operator) can be a little more precise when nature (real or simulated) satisfies this hypothesis [21, 34]. The accuracy is higher if the predictant (in the sense of [21]) is closer to the true genetic value, and in particular, in the case of individual phenotypes, if the heritability is higher [8, 35]. This observation conforms to the theory of selection indices.

3.2.2
Effect of the Marker Panel Characteristics
The characteristics of the set of markers used for the construction of the GRM strongly influence the accuracy of the evaluations. The accuracy increases with the number of loci tested (e.g., from the correlation between simulated GEBV and TBV [2, 8, 30]). This increase obeys the law of diminishing returns, especially when the reference and the selection populations are genealogically close [31, 36]. For farm animals, the gain becomes minimal beyond a few tens of thousands of markers (e.g., [37]). Conversely, genomic predictions concerning Humans, for which recruited individuals are very rarely related, require very large sets of markers to obtain sufficient precisions (cf. [11] for a work on hidden heritability). The effect of the number of markers interacts with that of the distribution of the LD between those markers and, even more, between the markers and the QTLs. This point particularly caught the attention of researchers because it is at the heart of the principles of genomic selection: the founding idea was that, thanks to the LD, the markers have an apparent effect that comes from the QTLs linked to them and that this effect persists in individuals of the selection population. In fact, it has gradually appeared that the efficiency of genomic selection does not result only from the phenomenon of the capture of QTL variability by markers. Part of the predictive power of genomic information is the possibility of finding kinship relationships by estimating them by rates of identity by state (IBS) in the genomic matrix rather than IBD in the nongenomic setting [10, 13]. 
From simulations allowing their effects to be separated, it was demonstrated [1] that three phenomena explain this efficiency: the LD between very close loci, fairly stable over the generations (and corresponding to the initial idea put forward by [2]), the kinship estimated using the IBS levels of the markers, and the cosegregation of alleles from linked loci (QTL and markers) that can be identified within the family, as in the case of MAS, the latter phenomenon mentioned by previous authors [8, 38].
3.2.3 Effect of the Reference Population Structure
The structure of the reference population, defined by its size and the relationships between the individuals constituting it, also affects accuracy. This accuracy increases with the size of the reference population [8, 30, 35, 37, 39]. It was clearly shown in simulations that the estimates are best when the reference population is diverse, with little kinship between its members [40]. Conversely, it is now well established that genomic evaluations are all the more precise as the genetic distance between the reference and selection populations is low [21, 24, 31, 33, 36, 40, 41]. One of the major consequences of this observation is the need to regularly "refresh" the reference population so that its average relationship with the selection population is not too weak, which is a difficulty when the traits to be improved are not routinely recorded. Moreover, this observation is particularly clear when these populations belong to different animal breeds. The disadvantage is considerable because, while for some well-endowed breeds, in particular the Holstein in dairy cattle, the reference populations are very large (the assembly of information within the framework of international consortia having greatly facilitated their construction), resources are limited for many others. The idea arose of compensating for this deficit by drawing information from the abundant references (e.g., the Holstein breed) for the improvement of small breeds (e.g., the Jersey breed). It has given rise to numerous works aimed at estimating the usefulness of such multibreed references [25, 36, 42, 43] and at extending the equations for predicting accuracy to these particular situations [26, 27, 44].
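The dependence of accuracy on the size of the reference population, noted at the start of this subheading, can be illustrated with a toy simulation in the spirit of Subheading 3.1.3. The sketch below rests on strong simplifying assumptions that are not those of the cited studies: individuals are unrelated, loci are independent (LE), every locus is both a marker and a QTL (so that all the genetic variance is captured), and the variance ratio used in the SNPBLUP is the true one. The function name and all numerical settings are hypothetical.

```python
import numpy as np

def simulated_accuracy(n_ref, n_cand=500, n_loci=200, h2=0.5, seed=0):
    """Correlation between GEBV and true genetic value for simulated candidates."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.1, 0.9, size=n_loci)                    # allele frequencies
    geno = rng.binomial(2, p, size=(n_ref + n_cand, n_loci)).astype(float)
    Z = geno - 2.0 * p                                        # centered coding
    scale = np.sum(2.0 * p * (1.0 - p))
    sigma2_alpha = h2 / scale                                 # so that var(g) is close to h2
    alpha = rng.normal(0.0, np.sqrt(sigma2_alpha), size=n_loci)
    g = Z @ alpha                                             # true genetic values
    y_ref = g[:n_ref] + rng.normal(0.0, np.sqrt(1.0 - h2), size=n_ref)

    lam = (1.0 - h2) / sigma2_alpha                           # sigma_e^2 / sigma_alpha^2
    Z_ref, Z_cand = Z[:n_ref], Z[n_ref:]
    lhs = Z_ref.T @ Z_ref + lam * np.eye(n_loci)
    alpha_hat = np.linalg.solve(lhs, Z_ref.T @ (y_ref - y_ref.mean()))
    gebv = Z_cand @ alpha_hat                                 # candidates have no phenotype
    return np.corrcoef(gebv, g[n_ref:])[0, 1]

sizes = (100, 400, 1600)
accuracies = [simulated_accuracy(n) for n in sizes]
```

Under these assumptions the correlation rises with the number of phenotyped referents. With real data, relatedness, LD structure, and incomplete capture of QTL variance would all change the picture, which is precisely what the factors reviewed in this section describe.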
A similar field of research concerns sectors using crosses, for which the selection is made in pure breeds (or lines) while the trait of interest is expressed in crossbred individuals, and therefore for which the question of the choice of reference population (purebred or crossbred individuals) arises (e.g., [45, 46]). To conclude this inventory of factors affecting GEBV accuracy, it must be underlined that GEBV accuracy depends on the ability of the set of markers used to capture information on QTLs. The strength of the LD between two loci depends on their physical distance (denser genotyping decreases the average distance between each QTL and the marker closest to it) and on the number of recombination events since the mutation that created this QTL, a number proportional to the age of the mutation and the size of the population. However, by increasing the number of markers to increase precision, we over-parameterize the model that describes their apparent effects on the trait and therefore decrease the quality of their estimation. This over-parameterization can hardly be compensated by an increase in the size of the reference population. Even if the predictions of the individual effects of markers are poor, the predictions of the effects of their estimable linear combinations, corresponding to haplotypes present in the reference population, are correct and make it possible to predict genetic values of
nonphenotyped individuals. However, this requires that those nonphenotyped individuals do not carry many haplotypes absent from the reference, a situation that will be encountered if the two populations (reference and selection) are genetically distant. In the case of between-breed evaluations, there is also the possibility that a QTL mutation is present in only one of the populations and that interactions between the genotypes at QTLs and the genetic background reverse its apparent effects.
4
Algebra of the Genomic Predictions Accuracy

Beyond the scope of this inventory, the computing times necessary for the repetition of simulated cases for statistical analysis are always practical limits. This is for example the case for the numerical optimization of the organization of genomic selection, which results from a maximization, often under nonlinear constraints, of a complex function (e.g., the expectation of cumulative genetic progress over several generations) of decision variables such as the size of the reference population, its distribution according to age groups, the number of markers on the chip, etc. Very generally, the ideal is to have algebraic formulas summarizing the variability of the elements of this function (here the accuracy of the GEBV) as a function of these variables to be optimized and of parameters such as heritability. These formulas, equations for predicting the accuracy of GEBVs, must account for the factors affecting the accuracy of GEBVs presented in the previous Subheading 3.2, real world observations, and trends predicted by intensive simulations. They are designed for a priori use, before setting up or reorganizing the selection plan. We must specify the spaces in which we want to estimate this measure of accuracy because several elements are variable: genotypes at the QTLs (the $z_q$) and at the markers (the $w_k$) in the reference population and in the selection population, effects of the markers ($\beta_k$), nuisance effects in the expression of phenotypes of the reference population ($e$). Depending on the position taken by the user, some elements are fixed, others random.
For example, following the approach of [10], the genetic variance of a trait can be formulated (1) conditional on the effects of QTLs ($\alpha_q$) for an already existing population of which we know the genotypes (the $z_{iq}$), (2) for a set of unborn individuals with known allele frequencies (the $p_q$), (3) without knowing the effects of the QTLs but given the distribution of the effects (for example a Gaussian of variance $\sigma^2_\alpha$), (4) without knowing QTL allele frequencies but having a prior on their distribution (for example a Beta distribution with parameters $\Phi_a$ and $\Phi_b$). For the simple
Jean-Michel Elsen
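case of a trait governed by a single QTL, the formulae for the variance would then be:
for (1): $\sum_i (z_{iq}\alpha_q)^2/N - \left(\sum_i z_{iq}\alpha_q/N\right)^2$,
for (2): $2p_q(1-p_q)\,\alpha_q^2$,
for (3): $2p_q(1-p_q)\,\sigma^2_\alpha$,
for (4): the expectation of (3) over the Beta prior on $p_q$, that is, $\frac{2\,\Phi_a\Phi_b}{(\Phi_a+\Phi_b)(\Phi_a+\Phi_b+1)}\,\sigma^2_\alpha$.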
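The passage from case (3) to case (4) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that case (4) is the expectation of $2p_q(1-p_q)\sigma^2_\alpha$ over a Beta($\Phi_a$, $\Phi_b$) prior, compared with a plain midpoint-rule integration.

```python
import math

def var_known_frequency(p, sigma2_alpha):
    """Case (3): 2 p (1 - p) sigma2_alpha for a known allele frequency p."""
    return 2.0 * p * (1.0 - p) * sigma2_alpha

def var_beta_prior(phi_a, phi_b, sigma2_alpha):
    """Case (4), assumed here to be E[2 p (1 - p)] * sigma2_alpha with p ~ Beta(phi_a, phi_b)."""
    s = phi_a + phi_b
    return 2.0 * phi_a * phi_b / (s * (s + 1.0)) * sigma2_alpha

def var_beta_prior_numeric(phi_a, phi_b, sigma2_alpha, steps=100000):
    """Midpoint-rule integration of case (3) against the Beta density (check only)."""
    log_norm = math.lgamma(phi_a + phi_b) - math.lgamma(phi_a) - math.lgamma(phi_b)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        log_density = log_norm + (phi_a - 1.0) * math.log(p) + (phi_b - 1.0) * math.log(1.0 - p)
        total += var_known_frequency(p, sigma2_alpha) * math.exp(log_density) * h
    return total

closed_form = var_beta_prior(2.0, 3.0, 1.0)
numeric = var_beta_prior_numeric(2.0, 3.0, 1.0)
print(closed_form, numeric)
```

The two values should agree closely for prior parameters of at least 1; for strongly U-shaped priors (parameters below 1) the simple midpoint rule used here converges slowly.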
Thus, while in the formulations of a posteriori accuracy (notably [12]) the variance considered, $\mathrm{var}(\hat{u}) = \mathrm{var}\left(c'(G_m+\lambda_g I)^{-1}y\right) = c'(G_m+\lambda_g I)^{-1}c$, uses the realized GRM, $G_m$, it will be necessary, when taking an a priori position, also to consider $G_m$ as random and therefore to calculate $\mathrm{var}(G_m)$. In the appendix of [10], Gianola et al. gave the expression of this variance in the complete case where the loci are in LD and the individuals of the population are related. We have seen that the prediction accuracy is higher when the part of the genetic variability captured by the markers (quantified by $b = \mathrm{var}(u)/\mathrm{var}(g)$) is greater and when the estimation of their effects (quantified by $r^2_{u,\hat{u}}$) is more precise. We have also seen that (under assumptions not formulated by the authors) the accuracy $r^2_{g,\hat{u}}$ can be predicted by their product $b\, r^2_{u,\hat{u}}$.
4.1 Proportion of Genetic Variability Captured by Markers
Under the hypotheses that each QTL is linked to a marker (or a group of markers) and that the QTL-marker pairs are all independent [38], the proportion of the genetic variance captured by the markers is the weighted average of the $r^2(z_q, w_q)$ (measure of the LD between the QTLs and their markers), the weights being the relative proportions of genetic variance explained by each QTL:
$$b = \frac{v(w\beta)}{v(z\alpha)} = \frac{\sum_q \alpha_q^2\, r^2(z_q, w_q)\, v(z_q)}{\sum_q \alpha_q^2\, v(z_q)}$$
But $z_q$ and $\alpha_q$ are unknown. To get around this difficulty, at the cost of an apparent lack of generality, it has been suggested to estimate the ratio b on a sample of genotyped and phenotyped individuals from the population under analysis [47]: the mean accuracy of the GEBV is evaluated by cross validation. The distribution of these accuracies is assumed to be normal, with an expectation in accordance with one of the algebraic formulas of accuracy (we will see them later) and therefore depending on b; this parameter can therefore be estimated by maximizing the likelihood of the accuracies assessed. The need for an ad hoc estimate of the b ratio looks similar to that of estimating the heritability of a trait, which varies between populations according to the polymorphism of QTLs. However, it does not have the same generality because it depends on the number of markers used for GEBV. A very different approach to determining the value of b is possible [8, 38]. Stressing that genomic information makes it possible to differentiate individuals who are otherwise identical in their kinship relations with the rest of the population (for example full brothers
Prediction of GEBV Accuracy
without descendants), and that these differences are estimated with errors by the use of a small number of markers (M), b may be assimilated to the part of the variability captured by the markers which is actually generated by the hazards of meiosis at QTL. The variabilities in question here are those of the genomic matrix at QTL around its expectation ($\mathrm{var}(G - A)$) and of the genomic matrix at markers $G_m$ around $G$ ($\mathrm{var}(G_m - G)$). Be careful, the $\mathrm{var}(G - A)$ notation can be misleading: it is about the variance of the elements of the matrix (see Note 12). The authors proposed that $b = \frac{\mathrm{var}(G-A)}{\mathrm{var}(G-A)+\mathrm{var}(G_m-G)}$, specifying that these variances are those of the elements of these matrices (therefore scalars). In the general case (as described for $G$ in [38]) these quantities are impossible to obtain. This issue can be solved at the cost of two hypotheses [8]: that all individuals are unrelated, and that the genome is made of pairs of QTL and marker loci in perfect LD within pairs and in linkage equilibrium (LE) between pairs. The first simplification is unsatisfactory, since the magnitude of the variability, due to the hazards of meiosis, of the genetic covariances between individuals i and j ($\mathrm{cov}(G_{ij} - A_{ij})$) is known to depend on the level of kinship linking these individuals [48]. The second is tempered by the introduction of an effective number of loci $M_e$, defined as the number of loci in LE for which we have the same value of $\mathrm{var}(G - A)$ as in the real case (in Goddard [38], the "number of loci" refers to the number of marker-QTL pairs, or number of independent genome segments). We will see later how to obtain this effective number, and will show the relation $\mathrm{var}(G - A) = \sigma^4_g/M_e$ given in [38]. The second term, $\mathrm{var}(G_m - G)$, depends on the number of markers used to build the matrix $G_m$. In a work on height heritability in humans, it was found numerically that $\mathrm{var}(G_m - G) = \sigma^4_g/M$ [11].
From these observations an approximation of the b ratio widely used in the literature was proposed [8]:
$$b = \frac{\mathrm{var}(u)}{\mathrm{var}(g)} = \frac{M}{M + M_e}$$
4.2 Accuracy of the Estimation of Marker Effects
4.2.1 Basic Equations
A first formula for computing this a priori accuracy was published in 2008 by Daetwyler et al. [49]. In their demonstration, the marker effects were treated alternately as fixed or random effects, which made it confusing. A simpler version was given later [26]. Several hypotheses are made: all the genetic variability is captured by the markers ($\mathrm{var}(g) = \mathrm{var}(u)$); we consider Q = M pairs of markers/QTLs in LE; the individuals are unrelated; the allelic frequencies are the same in the reference and candidate populations; all QTLs contribute in the same way to the total genetic variance: $2p_k(1-p_k)\sigma^2_{\beta_k} = v(g)/M$. The authors state that under the hypothesis of independence due to the absence of LD the $\beta$ effects can be estimated one by one: the model $y = \mu + W\beta + e$ is simplified
to $y = \mu + W_k\beta_k + e_k$, with $W_k$ a single column corresponding to the kth marker. The estimate simplifies to $\hat{\beta}_k = \left(\sum_i w_{ik}^2 + \lambda_{\beta_k}\right)^{-1}\sum_i w_{ik}y_i$ where $\lambda_{\beta_k} = \sigma^2_{e_k}/\sigma^2_{\beta_k}$, whose accuracy, conditional on the genotypes (the $w_{ik}$), is
$$r^2_{\beta_k,\hat{\beta}_k} = \frac{\mathrm{var}(\hat{\beta}_k)}{\mathrm{var}(\beta_k)} = \frac{\sum_i w_{ik}^2}{\sum_i w_{ik}^2 + \lambda_{\beta_k}}$$
(see Note 13). When considering the situation a priori, $r^2_{\beta_k,\hat{\beta}_k}$ must be integrated on the domain of variation of the genotypes. The expectation of a ratio is close to the ratio of expectations, and as $E[w_{ik}^2] = \mathrm{var}(w_{ik}) = 2p_k(1-p_k)$, we finally have
$$E_{W_k}\left[r^2_{\beta_k,\hat{\beta}_k}\right] \approx \frac{N\,2p_k(1-p_k)}{N\,2p_k(1-p_k) + \lambda_{\beta_k}}.$$
To this set of assumptions and simplifications, the idea was added [26] that the effect of a single marker on the variance is small and therefore that $\sigma^2_{e_k} \approx \sigma^2_y$, giving $E_{W_k}\left[r^2_{\beta_k,\hat{\beta}_k}\right] = \frac{Nh^2}{Nh^2+M}$.
The GEBV of an individual is then $\hat{u} = \sum_k w_k\hat{\beta}_k = w'\hat{\beta}$, and its TBV (by confusing once again g and u) $u = w'\beta$. Therefore the desired accuracy is $r^2_{u,\hat{u}} = \frac{\mathrm{var}(\hat{u})}{\mathrm{var}(u)} = \frac{w'\,\mathrm{var}(\hat{\beta})\,w}{w'\,\mathrm{var}(\beta)\,w}$. The matrix $\mathrm{var}(\beta)$ is diagonal (the effects are independent) with identical terms, $\mathrm{var}(\beta) = I\sigma^2_\beta$. It can be assumed that it is the same for $\mathrm{var}(\hat{\beta})$, giving
$$E_W\left[r^2_{u,\hat{u}}\right] = \frac{Nh^2}{Nh^2 + M} \quad (1)$$
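As an illustration of how Eq. 1 is typically used a priori, the short sketch below evaluates it and inverts it to obtain the reference population size needed for a target accuracy. The function names and the numerical values are hypothetical and chosen only for the example.

```python
def accuracy_eq1(n, h2, m):
    """Eq. 1: expected r^2 = N h^2 / (N h^2 + M)."""
    return n * h2 / (n * h2 + m)

def reference_size_for_target(r2_target, h2, m):
    """Invert Eq. 1 for N: N = r^2 M / (h^2 (1 - r^2))."""
    return r2_target * m / (h2 * (1.0 - r2_target))

# Hypothetical scenario: 1000 reference animals, h^2 = 0.5, 1000 independent loci.
r2 = accuracy_eq1(1000, 0.5, 1000)
n_needed = reference_size_for_target(0.5, 0.5, 1000)
print(r2, n_needed)
```

Under these assumptions, doubling the heritability has the same effect on the predicted accuracy as doubling the size of the reference population, since both enter only through the product $Nh^2$.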
This equation, quoted, tested, and used in numerous publications, has been the subject of gradual improvements. In the following, to simplify the presentation, this expectation of accuracy on the domain of variation of the genotypes will be denoted by $r^2$. The assumption of uniqueness of the prior variance of the marker effects, $\sigma^2_\beta = v(g)/\sum_k 2p_k(1-p_k)$, is another option [38]. In this situation the elements of the $\mathrm{var}(\hat{\beta})$ matrix, still assumed to be diagonal, are not identical, and the accuracies are written
$$E_{W_k}\left[r^2_{\beta_k,\hat{\beta}_k}\right] \approx \frac{N\,2p_k(1-p_k)\,\sigma^2_\beta}{N\,2p_k(1-p_k)\,\sigma^2_\beta + \sigma^2_y}$$
and
$$r^2 = \frac{\sum_k 2p_k(1-p_k)\,\dfrac{N\,2p_k(1-p_k)\,\sigma^2_\beta}{N\,2p_k(1-p_k)\,\sigma^2_\beta + \sigma^2_y}}{\sum_k 2p_k(1-p_k)} = \frac{\sum_k \mathrm{Num}_k}{\sum_k \mathrm{Den}_k}.$$
To use this formulation, it is therefore necessary to have good estimates of the allele frequencies $p_k$. A step further is to integrate these accuracies on the domain of variation of these frequencies [38]. The question is shifted by one level (like the fourth estimate of genetic variance mentioned in the introduction to this part), and a law of frequencies must be given. Goddard [38] chose the hypothesis of a U-shaped distribution of density $\frac{\kappa}{2p(1-p)}$ (see Note 14). Using the law of large numbers, and approximating $r^2$ by $E_p[\mathrm{Num}_k]/E_p[\mathrm{Den}_k]$, after a classical algebraic expansion of the conditional expectations he found (with $a = 1 + 2\lambda/N$):
$$E_p\left[r^2\right] = 1 - \frac{\lambda}{N}\,\frac{1}{2\sqrt{a}}\,\log\frac{1 + 2\sqrt{a} + a}{1 - 2\sqrt{a} + a} \quad (2)$$
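Equation 2 can be cross-checked against the integral it comes from. The sketch below is illustrative: it assumes that $\lambda$ is the ratio $\sigma^2_y/\sigma^2_\beta$ implied by the per-marker accuracy given above, that the frequency density is proportional to $1/(2p(1-p))$ on (0, 1), and the function names are hypothetical. Under these assumptions $E_p[\mathrm{Num}_k]/E_p[\mathrm{Den}_k] = 1 - \lambda\int_0^1 dp/(2Np(1-p)+\lambda)$, which the code integrates numerically.

```python
import math

def accuracy_eq2(n, lam):
    """Closed form of Eq. 2 with a = 1 + 2*lam/n.

    The log term is written as 2*log((sqrt(a)+1)/(sqrt(a)-1)), which is
    algebraically equal to log((1+2*sqrt(a)+a)/(1-2*sqrt(a)+a)).
    """
    a = 1.0 + 2.0 * lam / n
    sa = math.sqrt(a)
    return 1.0 - lam / (2.0 * n * sa) * 2.0 * math.log((sa + 1.0) / (sa - 1.0))

def accuracy_eq2_numeric(n, lam, steps=100000):
    """Midpoint-rule evaluation of 1 - lam * integral_0^1 dp / (2 n p (1-p) + lam)."""
    h = 1.0 / steps
    integral = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        integral += h / (2.0 * n * p * (1.0 - p) + lam)
    return 1.0 - lam * integral

# Hypothetical values: N = 1000 reference individuals, lambda = 500.
print(accuracy_eq2(1000, 500.0), accuracy_eq2_numeric(1000, 500.0))
```

The agreement between the two functions is a check on the algebra only; it says nothing about whether the U-shaped frequency law is appropriate for a given population.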
These accuracy formulae are based on unsatisfactory assumptions. The following paragraphs show how less constrained situations were explored.
4.2.2 Multilocus Approach
The previous formulae [26, 38, 49] approximate $\sigma^2_{e_k} \approx \sigma^2_y$ and neglect the fact that the marker effects are in practice estimated simultaneously. For situations where $N \gg M$, a correction factor can be derived, based on the assertion that a better approximation of $\sigma^2_{e_k}$ would be $\sigma^2_y - \sigma^2_g\, r^2_{u,\hat{u}}$ (taking into account the assumptions, this should be the residual variance in the model $y = \hat{u} + \varepsilon$) [49]. A quadratic equation in $E_W\left[r^2_{u,\hat{u}}\right]$ follows:
$$r^2 = \frac{Nh^2}{Nh^2 + M\left(1 - h^2 r^2\right)} \quad (3a)$$
whose root lying between 0 and 1 is
$$r^2 = \frac{2Nh^2}{Nh^2 + M + \sqrt{\left(Nh^2 + M\right)^2 - 4NMh^4}} \quad (3b)$$
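The relation between the implicit form (3a) and its explicit solution can be verified with a few lines of code. This is a sketch under the assumption that the relevant solution is the root of the quadratic lying between 0 and 1, written here as $2Nh^2/(Nh^2+M+\sqrt{(Nh^2+M)^2-4NMh^4})$; the function names are hypothetical.

```python
import math

def accuracy_eq3_closed(n, h2, m):
    """Root in [0, 1] of M h^2 x^2 - (N h^2 + M) x + N h^2 = 0 (Eq. 3a rearranged)."""
    s = n * h2 + m
    d = math.sqrt(s * s - 4.0 * n * m * h2 * h2)
    return 2.0 * n * h2 / (s + d)

def accuracy_eq3_iterative(n, h2, m, tol=1e-13, max_iter=10000):
    """Fixed-point iteration on Eq. 3a, started from the value given by Eq. 1."""
    r2 = n * h2 / (n * h2 + m)
    for _ in range(max_iter):
        new_r2 = n * h2 / (n * h2 + m * (1.0 - h2 * r2))
        if abs(new_r2 - r2) < tol:
            return new_r2
        r2 = new_r2
    return r2

# Hypothetical scenario: N = 1000, h^2 = 0.5, M = 1000.
print(accuracy_eq3_closed(1000, 0.5, 1000), accuracy_eq3_iterative(1000, 0.5, 1000))
```

Because the residual variance is reduced by the part of the genetic variance already explained, the corrected value is never below the one given by Eq. 1 for the same N, $h^2$ and M.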
Daetwyler et al. [49] approximated Eqs. 3a, 3b by applying the multiplicative factor $1 + \frac{1}{2} r^4 M/N$ (see Note 15), giving
$$r^2 = \frac{Nbh^2}{Nbh^2 + M}\left(1 + \frac{1}{2}\, r^4 M/N\right) \quad (4)$$
Alternatively a full model $y = \mu + W\beta + e$ would consider all markers simultaneously. If matrices $B$, $F$ and $\Lambda$ are defined as diagonal with terms $\sigma^2_{\beta_k}$, $2p_k(1-p_k)$ and $\sigma^2_e/\sigma^2_{\beta_k}$, the matrix of the variances of the marker effect estimates comes to be $\mathrm{var}(\hat{\beta}) = B - \sigma^2_e\left(W'W + \Lambda\right)^{-1}$ [50]. Under the hypothesis of iso-contribution of markers/QTL to the total genetic variance ($2p_k(1-p_k)\sigma^2_{\beta_k} = v(g)/M$) it is found that
$$E_W\left[r^2_{u,\hat{u}}\right] = 1 - \frac{1-h^2}{h^2}\,\mathrm{trace}\left(F\,E_W\left[\left(W'W + \Lambda\right)^{-1}\right]\right).$$
By using limited matrix expansions, approximations in increasing order of accuracy are possible:
$$r^2_{(1)} = \frac{Nh^2}{Nh^2 + M\left(1 - h^2\right)} \quad (5)$$
$$r^2_{(2)} = r^2_{(1)} - \frac{1-h^2}{h^2}\,\frac{NM}{\left(N + M\frac{1-h^2}{h^2}\right)^3}\left[(M-2) + \frac{1}{M}\sum_k \frac{1}{2p_k(1-p_k)}\right] \quad (6)$$
The quantity $\sum_k \frac{1}{2p_k(1-p_k)}$ can be calculated from the observed marker allele frequencies, or estimated by integration when the prior distribution law of the frequencies is known. From a numerical point of view, formulas (1), (3a, 3b) and (6) give very similar results.
4.2.3 Taking into Account Linkage Disequilibrium
The general hypothesis of LE between causal loci (or pairs of causal loci and markers) is unrealistic as soon as the number of loci becomes large. Several approaches have been proposed to get closer to reality.
The Approach by the Effective Number of Loci
If we imagine a set of loci in perfect LE, therefore respecting the assumptions of the formulae previously described, we can ask ourselves what its size must be (we will denote it $M_r$) to obtain the same accuracy as the set of loci actually used, for which there is LD. The idea of an effective number of loci $M_e$ was mentioned above for the calculation of the variance of covariances between relatives around an expectation estimable from pedigree information ($\mathrm{var}(G - A)$). The underlying idea was that the increased accuracy provided by genomic information beyond pedigree information (which is the basis of predictions of classical genetic values) increases with the variance of covariances between genomic values: in the absence of genomic data two full sibs without personal phenotypic information will be classified identically, while genomics, which captures the differences between gametes of the same parental pair, makes it possible to exploit this variability and differentiates the sibs. When none of the individuals forming the reference and selection populations are related, the classical EBVs of the candidates are all identical (and their classification impossible). Thus, the accuracy of GEBVs comes only from the LD between markers and QTL. The larger the elements of $\mathrm{var}(G - A)$, the higher our ability to differentiate candidates. It seems that Goddard [38] relies on this idea to claim that $M_e$ and $M_r$ are the same (but no proof is given). Moreover, the number that matters is that of QTLs, which remains unknown. It is therefore approximated by the number of markers, from the variance of $G_m$ rather than of $G$.
First formulas giving $M_e$ were proposed in [8, 38]. The covariance between the $u_i = w_i'\beta$ and $u_j = w_j'\beta$ of two unrelated individuals is $\sum_k w_{ik}w_{jk}\sigma^2_\beta$. We are looking for $\mathrm{var}_w(\mathrm{cov}(u_i, u_j))$. This quantity depends on the allelic frequencies ($p_k$, $p_l$) and local LD ($D_{kl}$). The quantity $\mathrm{var}_w(\mathrm{cov}(u_i, u_j))$ was developed [38] when loci are in LE and when they are in LD, and, after a number of approximations (see Note 16), it was proposed that
$$M_e \approx \frac{1}{\sum_k\sum_l \rho^2_{kl}/M^2} \quad \text{with} \quad \rho^2_{kl} = \frac{4D^2_{kl}}{2p_k(1-p_k)\,2p_l(1-p_l)}.$$
This formula was also found in a very different approach [51] based on the theory of correlation matrices proposed in [14]. Correlation coefficients between loci (the $\rho^2_{kl}$) are needed to compute $M_e$. Again, Goddard [38] proposed to integrate these elements over their variation domains. This idea is taken up by others [8, 51]. To achieve this, they use the formulae for the expectation of the LD between two loci (k and l) as a function of the effective size of the population $N_e$ and the distance $c_{kl}$ between these loci [48, 52–54] (see Note 17). For illustration, in [38] the formula of Sved [52] ($E[\rho^2_{kl}] = 1/(1 + 4N_e c_{kl})$) is used, the mean $\sum_k\sum_l \rho^2_{kl}/M^2$ being calculated by formal integration on the domain of variation of $c_{kl} = |x_k - x_l|$ ($\int\!\!\int \frac{1}{1 + 4N_e|x_k - x_l|}\,dx_k\,dx_l \approx \log(4N_eL)/(2N_eL)$), giving approximately
$$M_e = \frac{2N_eL}{\log(4N_eL)}$$
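The quality of the approximation $\log(4N_eL)/(2N_eL)$ can be examined numerically. The sketch below is illustrative: it assumes that positions $x_k$ and $x_l$ are uniformly distributed on a chromosome of length L (in Morgans), so that the double integral is divided by $L^2$ to give a mean, and that the logarithm is natural; the function names are hypothetical.

```python
import math

def mean_rho2_sved(ne, length, steps=200000):
    """Mean of 1/(1 + 4 Ne |x - y|) for x, y uniform on [0, L].

    Uses the identity (1/L^2) * double integral = (2/L^2) * int_0^L (L - d)/(1 + 4 Ne d) dd,
    evaluated with the midpoint rule.
    """
    h = length / steps
    total = 0.0
    for i in range(steps):
        d = (i + 0.5) * h
        total += (length - d) / (1.0 + 4.0 * ne * d) * h
    return 2.0 * total / (length * length)

def me_goddard_2009(ne, length):
    """Approximation Me = 2 Ne L / log(4 Ne L)."""
    return 2.0 * ne * length / math.log(4.0 * ne * length)

# Hypothetical values: Ne = 100, one chromosome of 1 Morgan.
me_numeric = 1.0 / mean_rho2_sved(100, 1.0)
me_approx = me_goddard_2009(100, 1.0)
print(me_numeric, me_approx)
```

For these values the closed-form approximation is of the right order of magnitude but not exact, which is consistent with the text presenting it as approximate.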
Depending on the hypotheses, several formulas have therefore been proposed (Table 2).
Table 2 Effective number of loci $M_e$ as a function of effective population size ($N_e$), length of a chromosome ($L$) and number of chromosomes ($K$)

Authors | $M_e$ estimation
Stam (1980) | $4N_eL$
Goddard (2009) | $6N_e$
Goddard (2009) | $2N_eL/\log(4N_eL)$
Hayes et al. (2009) | $2N_eL$
Goddard et al. (2011) | $2N_eLK/\log(N_eL)$ (mutations allowed)
Goddard et al. (2011) | $2N_eLK/\log(2N_eL)$ (no mutations)
Lee et al. (2017) | $4N_e^2L^2/\left(\log(2N_eL+1) + 2N_eL\left(\log(2N_eL+1) - 1\right)\right)$
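To give a feeling for how much the entries of Table 2 differ, the following sketch evaluates them side by side. It is an illustration only: the dictionary keys and the function name are hypothetical, natural logarithms are assumed, the formulas are applied exactly as listed in the table (including the fact that only two of them involve K), and the parameter values are arbitrary.

```python
import math

def effective_loci_table2(ne, length, n_chrom):
    """Evaluate the Me formulas of Table 2 for given Ne, L (Morgans) and K."""
    x = 2.0 * ne * length
    return {
        "Stam (1980)": 4.0 * ne * length,
        "Goddard (2009), 6Ne": 6.0 * ne,
        "Goddard (2009), log form": x / math.log(4.0 * ne * length),
        "Hayes et al. (2009)": x,
        "Goddard et al. (2011), mutations": x * n_chrom / math.log(ne * length),
        "Goddard et al. (2011), no mutations": x * n_chrom / math.log(x),
        "Lee et al. (2017)": 4.0 * ne ** 2 * length ** 2
        / (math.log(x + 1.0) + x * (math.log(x + 1.0) - 1.0)),
    }

# Hypothetical values: Ne = 100, chromosomes of 1 Morgan, K = 30 chromosomes.
estimates = effective_loci_table2(100, 1.0, 30)
for author, me in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{author}: {me:.1f}")
```

With these arbitrary inputs the smallest and largest values differ by more than an order of magnitude, which illustrates why the choice of the $M_e$ equation weighs so heavily on any accuracy prediction.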
A Posteriori Calculation of the Effective Number of Loci
The preceding formulas were derived from developments in population genetics and are based on a set of assumptions that are difficult to verify. In [24], the effective number of loci was estimated empirically (see below) in two populations (Holstein and Brown Swiss) with the same genome length (LK) and similar effective size ($N_e$), for which the $M_e$ estimates should therefore be equal. An empirical $M_e$ of the order of 1000 was found for the first population against 200 for the second: additional unknown parameters should be part of the equations to account for these divergences. A reasonable alternative to the formulas listed in Table 2 is to estimate $M_e$ empirically from simulated data or from data collected in the research population. One approach is to use estimates of the accuracy $r^2_{u,\hat{u}}$ obtained by simulation to deduce $M_e$ by inverting the equation [24, 26]. Alternatively, reversing the arguments of [8], the effective number $M_e$ can be estimated by $1/\mathrm{var}(G_m - A)$ observed in the real population [24]. In practice, the details of the procedure are not given, but one can imagine that the variance of the off-diagonal elements of $G_m$ should gather estimates of this variance by slices of kinship relations deduced from the pedigree. This approach was generalized to predictions between populations (breeds, lines), the elements to be considered in the calculation of the covariance being all pairs of individuals with one belonging to the reference population and the other to the selection population [25, 26]. Starting from the idea that $\mathrm{var}(G_m - A)$ measures the ability of the set of markers to predict relations at QTLs, which can be approached by its ability to predict relations at other markers, the set of markers may be subdivided into two subsets, the corresponding matrices $G_{m1}$ and $G_{m2}$ built, and $\mathrm{var}(G_m - A)$ approached by the covariance between the nondiagonal elements of $G_{m1}$ and $G_{m2}$ [8].
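The first empirical route, deducing $M_e$ from an accuracy obtained by simulation or cross validation, amounts to solving an accuracy equation for its number of loci. A minimal sketch follows, assuming that the equation being inverted is Eq. 1 with M replaced by $M_e$ (the text does not specify which equation was used); the function names and numbers are hypothetical.

```python
def accuracy_from_me(n, h2, me):
    """Eq. 1 with the number of loci replaced by Me."""
    return n * h2 / (n * h2 + me)

def me_from_accuracy(r2_observed, n, h2):
    """Solve r^2 = N h^2 / (N h^2 + Me) for Me: Me = N h^2 (1 - r^2) / r^2."""
    if not 0.0 < r2_observed < 1.0:
        raise ValueError("r2_observed must lie strictly between 0 and 1")
    return n * h2 * (1.0 - r2_observed) / r2_observed

# Hypothetical observation: r^2 = 0.4 with N = 5000 reference individuals and h^2 = 0.3.
me_hat = me_from_accuracy(0.4, 5000, 0.3)
print(me_hat, accuracy_from_me(5000, 0.3, me_hat))
```

An estimate obtained this way inherits every assumption of the inverted equation, so two populations with the same $N_e$ and genome length can yield different values, as reported above.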
As we saw in Subheading 4.1, the parameter b and the effective number $M_e$ can be estimated by maximum likelihood [47], the data being mean accuracies of GEBV evaluated by cross validation and the model a Gaussian distribution centered on a prediction of $r^2_{u,\hat{u}}$ (using Eqs. (1) or (3a, 3b)). The effective number of loci was approached from a very different angle by Misztal et al. [55, 56] in line with the "APY" method ("Algorithm for Proven and Young animals"). This method aims to reduce the dimension of the GEBV equation system by inverting only a part of the $G_m$ matrix (which concerns a core, initially formed of "proven animals," but extended in later publications to other possibilities). The genomic values $u = W\beta$ are simplified by applying a singular value decomposition to the matrix $W$: $W = U\Delta V'$ where $U'U = I$, $V'V = I$ and $\Delta$ is the diagonal matrix of the singular values $\{\delta_{ij}\}$. The authors then simplify $U\Delta V'\beta \approx U\Delta_s V'\beta$, with $\Delta_s$ obtained from $\Delta$ by setting the small $\delta_{ij}$ values to 0, and propose that the number of nonzero elements of $\Delta_s$ be called the "number of independent chromosome segments," an alternative concept to the effective number of loci.
4.2.4 Consideration of Kinship Relationships
The hypothesis of independence between individuals was necessary for the algebraic expansions giving the accuracy prediction equations, $E_W\left[r^2_{u,\hat{u}}\right]$. While this hypothesis is realistic for works on
humans, it is never the case for agronomic populations. The definitions of an effective number of loci [38] as well as a number of independent chromosome segments [55] are somewhat elusive. By comparing these concepts with that of estimable linear combinations, which correspond in practice to haplotypes found in individuals of the selection population that are also found in the reference population, and whose effects are therefore estimable, it may be understood why genomic evaluations are more precise when the reference population is genetically more diverse (there are more haplotypes to estimate) and the selection population is genetically closer to the first (limiting the recombination events within the haplotypes). One consequence is that the effective number of loci (measured from the variance of the covariances between genomic values) is lower when the kinship relationships are more important. As an example, a value of the order of 70 was given for cattle [12], and 85 for a set of human nuclear families [57], to be compared with the thousands of "independent" loci found in very unstructured populations (e.g., [39]). Therefore, it would be very important to take this pedigree structure into account. When such relationships exist, in the expression of the variance of the covariances between genomic values, $\mathrm{var}_w(\mathrm{cov}(u_i, u_j))$, some expectations (the $E[w_{ik}w_{il}w_{jk}w_{jl}]$ terms) can no longer be factorized (see Note 18). In the special case of general LE,
$$\mathrm{var}_w\left(\mathrm{cov}(u_i, u_j)\right) = \sum_k E\left[w_{ik}^2 w_{jk}^2\right] + \sum_k\sum_{l \neq k} E\left[w_{ik}w_{jk}\right]E\left[w_{il}w_{jl}\right].$$
As shown in [58] all those expectations are simple functions of the identity coefficients of Gillois [59] and Jacquard [60]. As for considering the LD between marker loci, it should be possible to define an effective number of loci by comparing the variances of covariances between genomic values depending on whether or not the kinship relationships are considered. Exploring this idea, Elsen [58] did not use the concept of effective number of loci, but proposed an approximation of $E_W\left[r^2_{u,\hat{u}}\right]$ based on matrix Taylor expansions. Using the example of a population of candidates having both of their parents in the reference population, he showed that the second order approximation is correct for heritabilities less than 0.7. These equations are unfortunately very cumbersome and have not yet been applied. In the simplest cases, the W matrix of marker genotypes can be deduced from the pedigree. As a basic example, by assuming the markers in LE, and a population made of independent families of N full sibs distributed in equal numbers according to the parental
alleles received (see Note 19), it was shown that $\frac{\mathrm{var}(\hat{u})}{\mathrm{var}(u)} = \frac{Nh^2}{Nh^2 + 4M}$ [35]. In fact, to resume the analysis of Habier [1], this particular prediction $\hat{u}$ corresponds to the monitoring of the cosegregation of alleles of linked loci (QTL and markers) identifiable within the family. However, in essence, genomic selection uses information beyond that provided by the relationships between individuals in the reference population and those in the selection population. It would therefore be advisable to also consider the similarities between "unrelated" individuals owing to LD between very close loci. A framework for gathering this information was provided in [51]: the reference population is described as formed of several subsets, each made of individuals with a similar degree of kinship with the selection candidates (from very weak to very strong according to the subsets); the expectation of accuracy is calculated for each subset; the whole is then gathered in a general value using, as in the theory of selection indices, the usual formulas of linear regression.
4.3 Combining the Proportion of Variability Captured by the Marker and the Accuracy of the Estimation of Their Effect
The formulae presented in Subheading 4.2 were developed by confusing the genetic variance $\mathrm{var}(g)$ and the variance captured by the markers $\mathrm{var}(u)$. We have seen in Subheading 4.1 how the ratio $b = \mathrm{var}(u)/\mathrm{var}(g)$ can be approximated. If b < 1, which must be the general case, the $r^2$ formulas should be adjusted for this difference. This factor has to be introduced at two levels: using the relation described in Subheading 3.1.1, $r^2_{g,\hat{u}} = b\, r^2_{u,\hat{u}}$, and replacing the heritability $h^2$ by $bh^2$, that is, the part of the phenotypic variance explained by the markers, $\mathrm{var}(u)/\mathrm{var}(p)$. For instance, formula (1) was modified in [8] to
$$r^2 = \frac{Nbh^2}{Nbh^2 + M} \quad (7)$$
Meuwissen et al. [61] modified formula (3a, 3b) to
$$r^2 = b\,\frac{Nbh^2}{Nbh^2 + M\left(1 - h^2 r^2\right)} \quad (8)$$
Brard and Ricard [62] used a modification of (3a, 3b), written here in the form that lies between 0 and 1, as
$$r^2 = \frac{2Nbh^2}{Nbh^2 + M + \sqrt{\left(Nbh^2 + M\right)^2 - 4NMb^2h^4}} \quad (9)$$
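The b-corrected formulas can be compared numerically with their uncorrected counterparts. The sketch below is illustrative only: the function names are hypothetical, b is computed as $M/(M+M_e)$ as in Subheading 4.1, Eq. 8 is solved by fixed-point iteration, and Eq. 9 is taken in the form lying between 0 and 1, which is the explicit solution of Eq. 3a when $h^2$ is replaced by $bh^2$.

```python
import math

def proportion_captured(m, me):
    """b = M / (M + Me), the approximation given in Subheading 4.1."""
    return m / (m + me)

def accuracy_eq7(n, h2, m, b):
    """Eq. 7: N b h^2 / (N b h^2 + M)."""
    return n * b * h2 / (n * b * h2 + m)

def accuracy_eq8(n, h2, m, b, tol=1e-13, max_iter=10000):
    """Eq. 8 solved by fixed-point iteration, started from b times Eq. 7."""
    r2 = b * accuracy_eq7(n, h2, m, b)
    for _ in range(max_iter):
        new_r2 = b * n * b * h2 / (n * b * h2 + m * (1.0 - h2 * r2))
        if abs(new_r2 - r2) < tol:
            return new_r2
        r2 = new_r2
    return r2

def accuracy_eq9(n, h2, m, b):
    """Eq. 9 written as 2 N b h^2 / (S + sqrt(S^2 - 4 N M b^2 h^4)), S = N b h^2 + M."""
    s = n * b * h2 + m
    d = math.sqrt(s * s - 4.0 * n * m * b * b * h2 * h2)
    return 2.0 * n * b * h2 / (s + d)

# Hypothetical scenario: N = 1000, h^2 = 0.5, M = 1000 loci, b = 0.8.
b = 0.8
print(accuracy_eq7(1000, 0.5, 1000, b), accuracy_eq8(1000, 0.5, 1000, b), accuracy_eq9(1000, 0.5, 1000, b))
```

Setting b = 1 in `accuracy_eq7` and `accuracy_eq9` recovers Eqs. 1 and 3b, which is a convenient sanity check when implementing these formulas.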
Table 3 assembles the different equations which were proposed with or without the correction of the factor b.
4.4 Testing the Prediction Equations
Most of the time, presentation of new or enriched existing formulae is accompanied by a numerical assessment of their contribution to the existing one. As indicated in Subheadings 3.1.2 and 3.1.3 these assessments can be based on simulations or the analysis of real data.
Table 3 Accuracy of genomic prediction as a function of reference population size $N$, number of markers $M$ and heritability $h^2$

No. | Expression of $r^2$ | Authors
1 | $\frac{Nh^2}{Nh^2+M}$ | Daetwyler et al. (2008) [49]; Wientjes et al. (2016) [26]
2 | $1 - \frac{\lambda}{N}\frac{1}{2\sqrt{a}}\log\frac{1+2\sqrt{a}+a}{1-2\sqrt{a}+a}$ | Goddard (2009) [38]
3 | $\frac{2Nh^2}{Nh^2+M+\sqrt{(Nh^2+M)^2-4NMh^4}}$ | Daetwyler et al. (2008) [49]
4 | $\frac{Nbh^2}{Nbh^2+M}\left(1+\frac{1}{2}r^4M/N\right)$ | Daetwyler et al. (2008) [49]
4 | $r^2_{(1)} = \frac{Nh^2}{Nh^2+M(1-h^2)}$ | Elsen (2017) [50]
5 | $r^2_{(2)} = r^2_{(1)} - \frac{1-h^2}{h^2}\frac{NM}{\left(N+M\frac{1-h^2}{h^2}\right)^3}\left[(M-2)+\frac{1}{M}\sum_k\frac{1}{2p_k(1-p_k)}\right]$ | Elsen (2017) [50]
6 | $b\,\frac{Nbh^2}{Nbh^2+M}$ | Goddard et al. (2011) [8]
7 | $\frac{Nbh^2}{Nbh^2+M}$ | Goddard et al. (2011) [8]
8 | $b\,\frac{Nbh^2}{Nbh^2+M(1-h^2r^2)}$ | Meuwissen et al. (2013) [61]
9 | $\frac{2Nbh^2}{Nbh^2+M+\sqrt{(Nbh^2+M)^2-4NMb^2h^4}}$ | Brard and Ricard (2015) [62]

$\lambda = \frac{v(u)}{v(y)}$; $a = 1 + 2\frac{\lambda}{N}$; $b = \frac{v(u)}{v(g)}$
As a first example, for traits measured in Holstein, a slight but systematic overestimation of the accuracy predicted by the equations of the mixed model in comparison with the correlation between GEBV and deregressed EBV was observed in [36]. In Jersey, the results differed in different ways depending on the trait. Hayes et al. [35], comparing the predicted and the observed accuracies (on simulations) in the case of populations with a strong family structure (set of independent families of full sibs, half sibs, cousins), concluded that their approach is perfectly adequate. This optimistic result must be tempered by the very artificial nature of the situations modeled. On the other hand, in the case of populations without a family structure, the simulations (ending with 8 generations of random mating to break the family relations) always gave accuracies lower than the forecasts, which was attributed, in part, to recombinations occurring during the 8 generations of mixing.
A good agreement was found between the accuracies calculated according to the equations of Goddard and colleagues [8] and that given by the mixed model (formula of [12]), as soon as the genomic matrix used is regressed on the pedigree matrix ($\hat{G} = A + b(G_m - A)$ of Subheading 2.4). Erbe et al. [47] compared the accuracy obtained by cross validation on real data (in dairy cattle), the accuracy predictions formulated as (3a, 3b) and (4), and their own formulas (which include the maximum likelihood estimation of parameters b and $M_e$), and concluded that their proposition is excellent. In the case of between-breed predictions, Wientjes et al. [25] compared correlations between GEBV and TBV of simulated data with their deterministic predictions of accuracy, derived either from index theory (according to [12]) or from population parameters (following [8]). The agreement was very good. Lee et al. [51] found a good agreement, for body size and body mass index in a human data set ("Framingham heart study"), between the accuracies predicted by their equations and those obtained by cross validation. Questioning the ability of formulae to predict the accuracy of genomic selection, a detailed analysis of the results of 145 prediction estimates found in 13 publications, some based on simulations, others on real data, was published in 2015 [62]. The comparisons relate to four formulas ((1), (2), (6), (8)) and five equations giving the effective number of loci (Goddard [38]; Goddard et al. [8] with or without mutation hypothesis; Daetwyler et al. [49]; Stam [63], see Table 2). The formulas are tested in domains of probable variation of five parameters ($h^2$, $N$, $M$, $N_e$, $M_e$). The marginal effects of these parameters on the value of the accuracy prediction are very consistent between formulas. The differences between observed and predicted accuracies were the subject of an analysis of variance including as effects the type of data (real or simulated), the formulas and the equations for $M_e$.
The authors’ conclusions are that none of the predictions is universally better than the others, and that the major difficulty is the estimation of the effective number of loci (which varies by a factor of 10 between the tested equations).
5 Conclusions
Large-scale genotyping of DNA markers has made possible a more accurate "genomic" evaluation of breeding animals than older evaluations, based on kinship relationships detectable in pedigrees. Numerous observations demonstrate this and the enthusiasm for these technologies continues. However, there are a number of situations in which the adoption of genomic selection schemes is difficult for technical reasons (impossibility of having a sufficiently large reference population) or budgetary reasons (investment in equipment and skills, unit cost of the genotyping). It is essential
to predict the effectiveness and profitability of these schemes before they are put into operation. The accuracy of the evaluations is a key element which depends on the demographic characteristics of the population to be improved (number of reproducers, rate of turnover), on its genomic structure (allelic polymorphism, extent of linkage disequilibrium), and on the genetic architecture of the target traits (heritability, distribution of gene effects). The core of genomic selection programs is the distribution of individuals into a reference population, whose members are genotyped and phenotyped, which makes it possible to make the link between these two types of information (what is the effect of a marker on the trait?), and a selection population whose members are only genotyped, their genetic value being estimated using the genotype-phenotype links available. The determinants of the GEBV accuracy are known, many analyses of real or simulated data giving very consistent results. The evaluation is all the more precise when the reference population is large, the trait heritable, and the general LD (and therefore in particular the LD between markers and QTLs) strong. The classical theory of selection indices (which is none other than an application in genetics of linear regression) gives the equations to predict this accuracy. The essential elements of these predictions are the genotyping tables, each line of which describes, for an individual, the list of its coded genotypes. From these tables are constructed matrices of genomic relationships (GRM) which enter into the equations of accuracy. The difficulty is that before the initiation of genomic selection such arrays are not available. Based on quantitative genetics theories, one can, at the cost of strong assumptions, express the expectation of these accuracy predictions. Very simple formulas, involving their key determinants, have been established and gradually enriched.
They are imperfect and progress in bringing them closer to reality is yet to come. Among the important elements of these derivations is the "effective number of loci" which, depending on the point of view, is intimately linked either to a diversity captured by the genomic information whereas it is invisible in the classical case, or to the estimability of combinations of marker effects in populations of finite size. A theory that brings these points of view together in a coherent way is essential.
6 Notes
1. Linkage disequilibrium between two loci denotes a statistical dependence between the genotypes observed at those loci. In the simple case of two biallelic loci (A1 and A2 alleles at the first, B1 and B2 at the second locus) there is linkage equilibrium if the probability p(A1B1) that a gamete carries (for example) the alleles A1 and B1 is the product of the probabilities that it carries A1 and that it carries B1 (p(A1) × p(B1)).
2. The wording used to describe the precision of a statistic is often confusing. In the general sense the precision of a statistic refers to the variability of its results when applied in similar conditions, while its accuracy is the proximity of those results to the corresponding true value, that is, it informs about possible estimation bias. In genetics, reliability is a term largely used to describe the quality of predictions. Two reliabilities are described, the reliability of selection (which is the squared correlation between the true genetic value, say $g_i$, and its prediction $\hat{g}_i$: $r^2_{g_i,\hat{g}_i}$), and the "base PEV reliability" given by $1 - \frac{\mathrm{var}(g_i - \hat{g}_i)}{\mathrm{var}(g_i)}$ where $\mathrm{var}(g_i - \hat{g}_i)$ is the prediction error variance (PEV). These concepts were developed in detail in [65], showing the equivalence of these two reliabilities. More recently this equivalence was shown to be untrue when the population is selected [66]. More confusing, sometimes $r^2_{g_i,\hat{g}_i}$ is called the determination coefficient, and $r_{g_i,\hat{g}_i}$ the accuracy.
3. The apparent effect, on the trait to be improved, of alleles (A or B) that a heterozygous AB parent transmits is statistically evaluated on part of its progeny, which makes it possible to enrich the prediction of the value of future progeny or grand progeny, depending on the allele they will inherit from the same parent.
4. For a unique marker–QTL pair we have $\beta_k = \alpha_k D_k / (p_k(1 - p_k))$, where $D_k$ is the LD between the two and $p_k$ the allelic frequency at the marker.
5. The elements of the A matrix are twice the coefficients of coancestry as defined for example by Malécot [67].
6. For $n$ related individuals, $\mathrm{cov}(g, y)' = (\mathrm{cov}(g, y_1), \ldots, \mathrm{cov}(g, y_n))$ with $\mathrm{cov}(g, y_i) = \mathrm{cov}(g, g_i + e_i) = \mathrm{cov}(g, g_i) = a_i \sigma^2_g$, where $a_i$ is the coefficient of kinship between the candidate and its $i$th relative.
7. Under Hardy–Weinberg equilibrium, $E[z_{iq}] = (1 - p_q)^2 (0 - 2p_q) + 2p_q(1 - p_q)(1 - 2p_q) + p_q^2 (2 - 2p_q) = 0$ and $\mathrm{var}(z_{iq}) = 2p_q(1 - p_q)$.
8. $\mathrm{var}(g) = E_{\{\alpha_q\}}[\mathrm{var}(g \mid \alpha_q)] + \mathrm{var}_{\{\alpha_q\}}[E(g \mid \alpha_q)]$, with $E(g \mid \alpha_q) = E\left(\sum_q z_q \alpha_q\right) = 0$.
9. The matrices proposed by Nejati-Javaremi et al. [7] and by Goddard et al. [8] are close without being identical: the term corresponding to the pair $(i, j)$ was
$$\frac{1}{2}\sum_{q=1}^{Q}\sum_{s=1}^{2}\sum_{t=1}^{2} I_{ijqst} = \{ZZ'\}_{ij} + \sum_{q=1}^{Q}(2p_q - 1)^2 + \sum_{q=1}^{Q}(2p_q - 1)(z_{iq} + z_{jq}) + Q.$$
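10. $\hat{u} = c'(G_m + \lambda_g I)^{-1} y$ and $\mathrm{var}(y) = \mathrm{var}(g)(G_m + \lambda_g I)$.
11. It is not a question here of correlations with true or approximate genetic values, but of accuracy calculations given by the formulas of the theory of indices (the $r^2_{u,\hat{u}}$ described by VanRaden).
12. $x_{ij} = \sum_k w_{ik} w_{jk} - a_{ij}$, then
$$\mathrm{var}(G_m - A) = \frac{\sum_i \sum_j x_{ij}^2}{N^2} - \left(\frac{\sum_i \sum_j x_{ij}}{N^2}\right)^2.$$
13. $r^2_{\beta_k,\hat{\beta}_k} = \sigma^2_{\beta_k}\left(\sum_i w_{ik}^2\right)^2 / \left(W_k'\,\mathrm{var}(y)\,W_k\right)$ and $W_k'\,\mathrm{var}(y)\,W_k = W_k'\left(W_k W_k' \sigma^2_{\beta_k} + I\sigma^2_e\right)W_k = \sigma^2_{\beta_k} \sum_i w_{ik}^2 \left(\sum_i w_{ik}^2 + \lambda_{\beta_k}\right)$, giving
$$r^2_{\beta_k,\hat{\beta}_k} = \frac{\sum_i w_{ik}^2}{\sum_i w_{ik}^2 + \lambda_{\beta_k}}.$$
14. As we must have $\int_{1/N_e}^{1 - 1/N_e} f(f)\,df = 1$, it comes that $\kappa = 1/\log(2N_e)$, $N_e$ being the effective size of the population.
15. $N h^2 + M\left(1 - h^2 r^2_{u,\hat{u}}\right) = \left(N h^2 + M\right)\left(1 - r^4_{u,\hat{u}}\, M/N\right)$ and $\frac{1}{2}\left(1 + \sqrt{1 - 4 r^4_{u,\hat{u}}\, M/N}\right) \cong 1 - r^4_{u,\hat{u}}\, M/N$.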
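16. $\mathrm{var}_w\left[\mathrm{cov}(u_i, u_j)\right] = \sigma^4_\beta\, E\left[\left(\sum_k w_{ik} w_{jk}\right)^2\right] = \sigma^4_\beta \left(\sum_k E[w_{ik}^2]\,E[w_{jk}^2] + \sum_k \sum_{l \neq k} E[w_{ik} w_{il}]\,E[w_{jk} w_{jl}]\right)$. The variance of the genomic values for $M_e$ independent loci is $\sigma^2_u = \sigma^2_\beta \sum_k 2p_k(1 - p_k)$, so $\sigma^4_u = \sigma^4_\beta \left(\sum_k 2p_k(1 - p_k)\right)^2 \cong 4\sigma^4_\beta M_e \sum_k \left(p_k(1 - p_k)\right)^2$. In the more general case, with $M$ QTL–marker pairs in LD, we find $E[w_{ik} w_{il}] = 2D_{kl}$, with $D_{kl}$ the LD coefficient between the $k$ and $l$ loci. Thus $\mathrm{var}_w\left[\mathrm{cov}(u_i, u_j)\right] = \sigma^4_\beta\, 4\left\{\sum_k \left(p_k(1 - p_k)\right)^2 + \sum_k \sum_l D_{kl}^2\right\}$, which Goddard simplifies to $\sigma^4_\beta\, 4 \sum_k \sum_l D_{kl}^2$. In the absence of LD we have $\mathrm{var}_w\left[\mathrm{cov}(u_i, u_j)\right] = \sigma^4_\beta\, 4 \sum_k \left(p_k(1 - p_k)\right)^2 \cong \sigma^4_u / M_e$. The “effective number” of loci is deduced from this:
$$M_e = M \frac{\sum_k \left(2p_k(1 - p_k)\right)^2}{\sum_k \left(2p_k(1 - p_k)\right)^2 + 4\sum_k \sum_l D_{kl}^2} \cong M \frac{\sum_k \left(p_k(1 - p_k)\right)^2}{\sum_k \sum_l D_{kl}^2}.$$
In Goddard et al. (2012) it is assumed that $4\sum_k \sum_l D_{kl}^2 = \sum_k \sum_l 2p_k(1 - p_k)\, 2p_l(1 - p_l)\, r^2_{kl}$ is little different from $\left(\sum_k 2p_k(1 - p_k)\right)^2 \sum_k \sum_l r^2_{kl} / (M(M - 1))$, with $r^2_{kl} = 4D_{kl}^2 / \left(2p_k(1 - p_k)\, 2p_l(1 - p_l)\right)$.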
17. $E[r^2_{kl}] = \frac{1}{\alpha + kN c_{kl}} + \frac{1}{n}$, where $\alpha = 1$ in the absence of mutation, $\alpha = 2$ if mutations are taken into account, $k = 4$ for the autosomes and $k = 2$ for the X chromosome, $N$ the effective population size, and $n$ the size of the sample from which the $c_{kl}$ are estimated.
18. $\mathrm{var}_w\left[\mathrm{cov}(u_i, u_j)\right] = \sigma^4_\beta \left(\sum_k E[w_{ik}^2 w_{jk}^2] + \sum_k \sum_{l \neq k} E[w_{ik} w_{il} w_{jk} w_{jl}]\right)$ with $E[w_{ik} w_{il} w_{jk} w_{jl}] \neq E[w_{ik} w_{il}]\,E[w_{jk} w_{jl}]$.
19. $W = (W_S\ W_D)$. $W_S$ (sire alleles) is block diagonal, each block corresponding to a family and therefore of size $N \times M$, each column of these blocks comprising as many 0s as 1s ($W_{S,sik} = 0$ (resp. 1) if the $i$th descendant of the $s$th sire received the first (resp. second) allele from the father at locus $k$). It follows (by neglecting the nondiagonal terms) that the matrices $W_S' W_S$ and $W_D' W_D$ both reduce to $\frac{N}{2} I$.
References
1. Habier D, Fernando RL, Garrick DJ (2013) Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194(3):597–607
2. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
3. Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3):743–756
4. Guillaume F, Fritz S, Boichard D et al (2008) Estimation by simulation of the efficiency of the French marker-assisted selection program in dairy cattle. Genet Sel Evol 40:91. https://doi.org/10.1186/1297-9686-40-1-91
5. Henderson CR (1973) Sire evaluation and genetic trends. In: Proceedings of the animal breeding and genetics symposium in honor of Dr J L Lush, Blacksburg, Virginia, August 1972. American Society of Animal Science, Champaign, Illinois, pp 10–41
6. Koivula M, Stranden I, Su G, Mäntysaari EA (2012) Different methods to calculate genomic predictions - comparisons of BLUP at the single nucleotide polymorphism level (SNP-BLUP), BLUP at the individual level (G-BLUP), and the one-step approach (H-BLUP). J Dairy Sci 95:4065–4073
7. Nejati-Javaremi A, Smith C, Gibson J (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci 75:1738–1745
8. Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128:409–421
9. Householder AS (1953) Principles of numerical analysis. McGraw-Hill, New York
10. Gianola D, de los Campos G, Hill WG et al (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183(1):347–363. https://doi.org/10.1534/genetics.109.103952
11. Yang J, Benyamin B, McEvoy BP et al (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42:565–569
12. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423
13. Zou F, Lee S, Knowles MR, Wright FA (2010) Quantification of population structure using correlated SNPs by shrinkage principal components. Hum Hered 70:9–22
14. Speed D, Hemani G, Johnson MR, Balding DJ (2012) Improved heritability estimation from genome-wide SNPs. Am J Hum Genet 91:1011–1021
15. Dekkers JC (2007) Prediction of response to marker assisted and genomic selection using selection index theory. J Anim Breed Genet 124:331–341
16. Goddard ME, Hayes BJ (2009) Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet 10:381–391
17. Schaeffer LR (2006) Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet 123:218–223
18. Shumbusho F, Raoul J, Astruc JM et al (2013) Potential benefits of genomic selection on genetic gain of small ruminant breeding programs. J Anim Sci 91:3644–3657
19. de Roos APW, Schrooten C, Veerkamp RF, van Arendonk JAM (2011) Effects of genomic selection on genetic improvement, inbreeding, and merit of young versus proven bulls. J Dairy Sci 94:1559–1567
20. Raoul J, Swan AA, Elsen JM (2017) Using a very low-density SNP panel for genomic selection in a breeding program for sheep. Genet Sel Evol 49(1):76. https://doi.org/10.1186/s12711-017-0351-0
21. Daetwyler HD, Calus MPL, Pong-Wong R et al (2013) Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 193:347–365
22. Hayes BJ, Bowman PJ, Chamberlain AJ et al (2009c) Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci 92:433–443
23. Wientjes YC, Bijma P, Vandenplas J, Calus MP (2017) Multi-population genomic relationships for estimating current genetic variances within and genetic correlations between populations. Genetics 207:503–515
24. Wientjes YC, Veerkamp RF, Calus MP (2013) The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193(2):621–631
25. Wientjes YC, Veerkamp RF, Calus MP (2015) Using selection index theory to estimate consistency of multi-locus linkage disequilibrium across populations. BMC Genet 16:87
26. Wientjes YC, Bijma P, Veerkamp RF, Calus MP (2016) An equation to predict the accuracy of genomic values by combining data from multiple traits, populations, or environments. Genetics 202:799–823
27. Raymond B, Wientjes YCJ, Bouwman AC et al (2020) A deterministic equation to predict the accuracy of multi-population genomic prediction with multiple genomic relationship matrices. Genet Sel Evol 52:21. https://doi.org/10.1186/s12711-020-00540-y
28. Efron B, Gong G (1983) A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat 37:36–48
29. Gianola D, Schön C (2016) Cross-validation without doing cross-validation in genome-enabled prediction. G3 (Bethesda) 6(10):3107–3128. https://doi.org/10.1534/g3.116.033381
30. Calus MPL (2010) Genomic breeding value prediction: methods and procedures. Animal 4:157–164
31. Habier D, Fernando RL, Dekkers JC (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397
32. Legarra A, Robert-Granie C, Manfredi E, Elsen JM (2008) Performance of genomic selection in mice. Genetics 180(1):611–618
33. Clark SA, Hickey JM, Daetwyler HD, van der Werf JHJ (2012) The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol 44:4
34. Clark S, Hickey J, van der Werf J (2011) Different models of genetic variation and their effect on genomic evaluation. Genet Sel Evol 43:18
35. Hayes BJ, Visscher P, Goddard ME (2009b) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res 91:47–60
36. Hayes BJ, Bowman PJ, Chamberlain AC et al (2009a) Accuracy of genomic breeding values in multi-breed populations. Genet Sel Evol 41:51
37. Erbe M, Hayes BJ, Matukumalli LK et al (2012) Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95:4114–4129
38. Goddard ME (2009) Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136:245–257
39. Lee SH, Clark S, van der Werf JHJ (2017) Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS One 12(12):e0189775
40. Pszczola M, Strabel T, Mulder HA, Calus MP (2012) Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci 95(1):389–400
41. Legarra A, Misztal I (2008) Technical note: computing strategies in genome-wide selection. J Dairy Sci 91(1):360–366. https://doi.org/10.3168/jds.2007-0403
42. Daetwyler HD, Kemper KE, van der Werf JHJ, Hayes B (2012) Components of the accuracy of genomic prediction in a multi-breed sheep population. J Anim Sci 90:3375–3384
43. Karoui S, Carabano MJ, Diaz C, Legarra A (2012) A joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet Sel Evol 44:39
44. Raymond B, Bouwman AC, Wientjes YCJ et al (2018) Genomic prediction for numerically small breeds, using models with pre-selected and differentially weighted markers. Genet Sel Evol 50:49. https://doi.org/10.1186/s12711-018-0419-5
45. Toosi A, Fernando RL, Dekkers JCM (2010) Genomic selection in admixed and crossbred populations. J Anim Sci 88:32–46
46. Kizilkaya K, Fernando RL, Garrick DJ (2010) Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. J Anim Sci 88:544–555
47. Erbe M, Gredler B, Seefried FR et al (2013) A function accounting for training set size and marker density to model the average accuracy of genomic prediction. PLoS One 8(12):e81046. https://doi.org/10.1371/journal.pone.0081046
48. Hill WG, Weir BS (2011) Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res (Camb) 93(1):47–64. https://doi.org/10.1017/S0016672310000480
49. Daetwyler HD, Villanueva B, Woolliams JA, Weedon MN (2008) Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3:e3395. https://doi.org/10.1371/journal.pone.0003395
50. Elsen JM (2017) An analytical framework to derive the expected precision of genomic selection. Genet Sel Evol 49(1):95. https://doi.org/10.1186/s12711-017-0366-6
51. Lee SH, Weerasinghe WMSP, Wray N et al (2017) Using information of relatives in genomic prediction to apply effective stratified medicine. Sci Rep 7:42091
52. Sved JA (1971) Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor Popul Biol 2:125–141
53. Hill WG (1975) Linkage disequilibrium among multiple neutral alleles produced by mutation in finite population. Theor Popul Biol 8:117–126
54. Tenesa A, Navarro P, Hayes BJ et al (2007) Recent human effective population size estimated from linkage disequilibrium. Genome Res 17:520–526
55. Misztal I (2016) Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics 202(2):401–409. https://doi.org/10.1534/genetics.115.182089
56. Misztal I, Legarra A, Aguilar I (2014) Using recursion to compute the inverse of the genomic relationship matrix. J Dairy Sci 97(6):3943–3952. https://doi.org/10.3168/jds.2013-7752
57. Visscher PM, Medland SE, Ferreira MAR et al (2006) Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet 2(3):e41
58. Elsen JM (2016) Approximated prediction of genomic selection accuracy when reference and candidate populations are related. Genet Sel Evol 48:18. https://doi.org/10.1186/s12711-016-0183-3
59. Gillois M (1964) Relation d’identité en génétique. Ann Inst Henri Poincaré B2 1–94:36
60. Jacquard A (1966) Logique du calcul des coefficients d’identité entre deux individus. Population 21:751–776
61. Meuwissen T, Hayes B, Goddard M (2013) Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci 1:221–237
62. Brard S, Ricard A (2015) Is the use of formulae a reliable way to predict the accuracy of genomic selection? J Anim Breed Genet 132(3):207–217
63. Stam P (1980) The distribution of the fraction of the genome identical by descent in finite populations. Genet Res 35:131–155
64. Stranden I, Garrick DJ (2009) Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci 92:2971–2975
65. Laloe D (1993) Precision and information in linear models of genetic evaluations. Genet Sel Evol 25:557–576
66. Gorjanc G, Bijma P, Hickey JM (2015) Reliability of pedigree based and genomic evaluations in selected populations. Genet Sel Evol 47(1):65. https://doi.org/10.1186/s12711-015-0145-1
67. Malécot G (1948) Les Mathématiques de l’Hérédité. Masson, Paris
Chapter 3
Building a Calibration Set for Genomic Prediction, Characteristics to Be Considered, and Optimization Approaches
Simon Rio, Alain Charcosset, Tristan Mary-Huard, Laurence Moreau, and Renaud Rincent
Abstract
The efficiency of genomic selection strongly depends on the prediction accuracy of the genetic merit of candidates. Numerous papers have shown that the composition of the calibration set is a key contributor to prediction accuracy. A poorly defined calibration set can result in low accuracies, whereas an optimized one can considerably increase accuracy compared to random sampling, for the same size. Alternatively, optimizing the calibration set can be a way of decreasing the costs of phenotyping by enabling similar levels of accuracy compared to random sampling but with fewer phenotypic units. We present here the different factors that have to be considered when designing a calibration set, and review the different criteria proposed in the literature. We classified these criteria into two groups: model-free criteria based on relatedness, and criteria derived from the linear mixed model. We introduce criteria targeting specific prediction objectives including the prediction of highly diverse panels, biparental families, or hybrids. We also review different ways of updating the calibration set, and different procedures for optimizing phenotyping experimental designs.
Key words Calibration population, Optimization, Prediction accuracy, Genomic selection, CDmean, PEVmean
Laurence Moreau and Renaud Rincent are co-last authors.
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_3, © The Author(s) 2022
1 Introduction
Several factors affect the accuracy of genomic prediction including (1) trait-specific characteristics like the heritability and the genetic architecture of the trait, (2) population-specific characteristics like the level of linkage disequilibrium (LD) between markers and quantitative trait loci (QTLs) and the number of effective chromosome segments (Me) segregating in the population, (3) the statistical method used to make predictions, and (4) experiment-specific
characteristics such as the marker density, the size of the calibration set (CS) and the degree of genetic relationship between the CS and the predicted set (PS). The choice of the CS, i.e., the reference individuals and their genotypic and phenotypic data used to calibrate the prediction model, is therefore crucial, especially when predicted traits are difficult or expensive to phenotype. In animal breeding, the pedigree-based BLUP model has been used routinely for several generations to predict the genetic value of candidates since the pioneering work of Henderson [1]. The use of genomic selection has modified the way the relationship between individuals is estimated by adding marker genotypes to pedigree information. In dairy cattle, it had little impact on the phenotypic data used to calibrate predictions for traits that were already evaluated routinely, but it has opened the way to considering many new traits such as disease resistance, which cannot be evaluated directly for all animals [2]. In major crop species, the main focus is to select the best inbred lines that can be produced after generations of selfing within large biparental families, either for their direct use as varieties or as parents of single-cross hybrid varieties. The number of candidate lines per family is often larger than the phenotyping capacities and only a small set of them can be evaluated in different environments to evaluate their adaptation to various field conditions. In this context, pedigree information is not useful to identify the best individuals [3] within a given family and pedigree-BLUP based on historical data has therefore not been broadly used in crop breeding. With genomic selection, the differences in genetic covariance between pairs of individuals from a biparental family can be accounted for in the model, unlike with pedigree data.
Phenotypes are no longer used exclusively as proxies of the genetic values of candidate individuals but also to train a predictive model involving molecular markers as predictors, which potentially modifies phenotyping strategies. The advent of genomic selection clearly opens new possibilities for improving the breeding efficiency of both animal and plant species but raises the key question of how to define the best CS, especially in plants. In the first part of this chapter, we provide some general guidelines to be considered for this purpose. These guidelines are illustrated with examples and their biological bases are discussed. In a second part, we present the different approaches that have been proposed to optimize the reference population. In the last part we show some applications of reference population optimization, depending on the prediction objective. Even if genomic predictions are also used in human genetics, this chapter focuses on the application of genomic predictions for breeding objectives in animal and plant species, with more emphasis on plants where the issue of the optimization of the reference population has been the most extensively considered.
Optimization of Genomic Prediction Calibration Set
2 Impact of the Composition of the Calibration Set on the Accuracy of Genomic Prediction
2.1 Calibration and Predicted Individuals Ideally Originate from the Same Population
Most genomic prediction models, such as GBLUP or the various Bayesian models [4–6], assume that CS and PS individuals are drawn from the same population. As described in the examples below, this hypothesis is often violated, at the cost of a reduced accuracy. The first case is the application of genomic prediction in a population stratified into genetic groups. This scenario has been the subject of several studies investigating (1) to what extent a generic CS can be efficient to predict over a wide range of genetic diversity, or (2) to what extent genetic groups with limited resources can benefit from the data originating from other genetic groups with more resources. In general, when a genomic prediction model is trained on one genetic group to perform predictions in another genetic group, the accuracy tends to be lower than what can be achieved within each group. This phenomenon has been illustrated in many animal and plant species including dairy cattle [7–9], sheep [10], maize [11, 12], soybean [13], barley [14], oat [15], or rice [16]. This case can also be extended to the prediction between families, as shown in mice [17], maize [18, 19], wheat [20], barley [21], or triticale [22]. In the worst case, the addition to the CS of individuals that are genetically distant from the PS can lead to a deterioration in the accuracy, as shown for instance in barley [23] and maize [24]. Beyond genetic groups and families, differences in the type of genetic material (e.g., purebred and crossbred) between the CS and PS can lead to reduced accuracies compared to what can be achieved when the CS and the PS are of the same type of genetic material. Examples are the prediction of crossbred individuals using a purebred parental CS, as shown in pig [25], the prediction of a maize admixed population between two heterotic groups using one of the parental populations as CS [26], or the prediction of interspecific hybrids, as shown in Miscanthus [27].
Finally, the CS and PS may be drawn from the same population but different breeding generations. This scenario is very common when cycles of selection are done solely based on predicted genetic values to shorten breeding cycles or when candidates are preselected to reduce their number and limit phenotyping costs. Decrease in accuracies can generally be observed over cycles when prediction models are not updated with data from the selected generation, as shown using simulations [28], or real data in pig [29], sugar beet [30], alfalfa [31], maize [32], wheat [33], barley [34], and rye [35]. Note that genotype-by-environment interactions can also contribute to the drop in accuracy when the CS and PS are not evaluated in the same environments.
As illustrated with these examples, a decrease in accuracies can be expected when CS and PS individuals come from different populations or breeding generations, and this decrease may result from different factors that are presented in the following subsections.
2.1.1 LD Between Markers and QTLs Can Be Different Between Populations
Since the early development of genomic prediction, the LD between molecular markers and QTLs has been identified as a major factor of accuracy [4]. It can be defined as the nonindependence between alleles at different loci on the same gamete. In general, LD between markers and QTLs is assumed to be homogeneous within a population, but it may vary when the population is stratified, which affects the success of across-breed genomic prediction [7, 36]. Differences in LD between markers and QTLs may indeed lead to differences in effects estimated at markers, impacting the accuracy when one population is used to predict another. LD between two loci is a function of recombination rate, minor allele frequency (MAF) at both loci and effective population size [37]. Differences in MAF and effective population size are very common whenever a population is stratified into groups [38], but differences in recombination rate can also be observed between genetic groups, as shown in maize [39]. Differences in LD extent estimated with markers have been observed among populations in dairy and beef cattle [40, 41], pig [42], chicken [43], maize [44, 45], or wheat [46]. Differences in the sign of the correlation between the allelic state of loci pairs can also be found and are referred to as differences in the linkage phase [40, 45]. In presence of dense genotyping, the effect of differences in LD between populations on the accuracy is expected to be minimized [40, 47], as most QTLs are expected to be in high LD with at least one shared marker in both populations. The ideal situation is that only causal loci are captured by the genotyping.
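The notion of LD and of linkage phase discussed above can be made concrete with a small calculation. The following is a minimal sketch, assuming phased gametes at two biallelic loci coded 0/1; the function name `ld_stats` and the two toy populations are hypothetical and only serve as an illustration. It computes the LD coefficient D = p(A1B1) − p(A1)p(B1) and the squared correlation r² between the two loci.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, one per gamete, alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n        # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n        # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                           # LD coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Two hypothetical populations with identical allele frequencies (0.5 at both loci)
pop1 = [(1, 1)] * 45 + [(0, 0)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5
pop2 = [(1, 0)] * 45 + [(0, 1)] * 45 + [(1, 1)] * 5 + [(0, 0)] * 5

d1, r2_1 = ld_stats(pop1)
d2, r2_2 = ld_stats(pop2)
print(d1, r2_1)
print(d2, r2_2)
```

In this toy case both populations show the same LD extent (same r²) but D has opposite signs, which corresponds to a difference in linkage phase: a marker effect estimated in one population would point in the wrong direction in the other.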
2.1.2 QTL Allele Frequencies Can Be Different Between Populations
In addition to differences in LD, genomic prediction accuracy across populations can be affected by differences in QTL allele frequencies. The most extreme scenario consists of a QTL for which an allele is fixed in the CS but is segregating in the PS. The effect of such a QTL cannot be estimated using the CS, and the genetic variance that it explains will not be accounted for in the prediction [48, 49]. In the context of prediction across biparental populations, Schopp et al. [50] proposed to adapt the formula of Daetwyler et al. [48], used to forecast the accuracy, by including a new term: the proportion of markers that segregates in both the CS and PS relative to the total number of markers segregating in the PS. Based on simulations, they showed that this criterion computed using markers is a good approximation of the equivalent criterion based on QTLs when the marker density is sufficiently large, and is critical for the accuracy of predictions across families. In more
complex populations, one can estimate the genetic differentiation (FST) using markers, as an indication of how QTL allele frequencies differ between the CS and PS, which was shown to be negatively related to the accuracy of genomic prediction [51]. The consistency of allele frequencies between the CS and PS can be extended to the consistency of the frequencies of genotypic states at QTLs (i.e., the two homozygous states and the heterozygous state for a biallelic QTL). One example consists of dominance effects at QTLs that can only be accounted for if some individuals present a heterozygous state in the CS [52]. This phenomenon can explain the decrease in accuracy observed when predicting crossbred individuals using purebred individuals for traits with substantial dominance effects [25].
2.1.3 QTL Allele Effects Can Be Different Between Populations
In most genomic prediction models, the effects of QTLs are assumed to be consistent between the CS and PS. But this assumption can be violated when the CS and the PS are drawn from different populations. “Statistical” additive effects reflect the average effect of substituting an allele with the alternative allele in the population and are implicitly or explicitly taken into account in most models, including GBLUP. “Functional” dominance and epistatic effects at QTLs contribute to the statistical additive effect along with the functional additive effect, but, unlike the latter, their contributions depend on allele frequencies [52–54]. From this phenomenon emerges the concept of genetic correlation between populations that aims at quantifying this difference in statistical additive effects [55, 56]. Practically, note that genetic correlations are often estimated using markers and then also include the heterogeneity generated by differences in LD between markers and QTLs.
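The dependence of statistical additive effects on allele frequencies can be illustrated with the classical one-locus result of quantitative genetics, which is not taken from this chapter: for a biallelic QTL with functional additive effect a and dominance effect d, the average effect of allele substitution is α = a + d(q − p), with p the frequency of the favorable allele and q = 1 − p. The sketch below uses hypothetical values for a, d, and the allele frequencies in the CS and PS.

```python
def substitution_effect(a, d, p):
    """Average effect of allele substitution, alpha = a + d * (q - p).

    a: functional additive effect, d: functional dominance effect,
    p: frequency of the favorable allele in the population considered.
    """
    q = 1.0 - p
    return a + d * (q - p)


# Same functional effects at the QTL, hypothetical allele frequencies
a, d = 1.0, 0.5
alpha_cs = substitution_effect(a, d, p=0.2)   # calibration set
alpha_ps = substitution_effect(a, d, p=0.8)   # predicted set
alpha_no_dominance = substitution_effect(a, 0.0, p=0.2)

print(alpha_cs, alpha_ps, alpha_no_dominance)
```

With these illustrative values, the statistical additive effect is about 1.3 in the CS and 0.7 in the PS although the functional effects are identical, whereas without dominance (d = 0) it equals a whatever the allele frequency.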
2.2 Genetic Relationships Between Calibration and Predicted Individuals Are Needed
In the present chapter, we define genetic relationships as standardized covariances between individuals relative to the genetic components of traits. In this context, they are defined at the level of QTLs (i.e., at causal loci) and reflect the sharing of alleles at these loci. Genetic relationships notably include additive genetic relationships (AGRs) that describe relationships between individuals for additive allele effects. As causal loci are generally unknown, AGRs must be estimated based on the pedigree or by using markers. From a pedigree perspective, the sharing of alleles at QTLs is considered to result from their inheritance from a common ancestor. Those alleles are characterized as identical-by-descent (IBD, see Thompson [57] for a review), with IBD being defined relative to a founder population as a reference starting point of the pedigree. In this context, the coefficients of the pedigree relationship matrix (PRM) consist of expected AGRs conditional on pedigree information. Since the advent of molecular markers, AGRs can be estimated using the genomic relationship matrix (GRM) in GBLUP, often allowing better estimates of additive genetic variances compared to those obtained using the PRM (see Speed and Balding [58] for a review).
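As an illustration of how a GRM can be obtained from markers, the following is a minimal sketch assuming a VanRaden-type estimator (genotypes coded as 0/1/2 allele counts, centered by twice the allele frequency and scaled by the sum of 2p(1 − p) over markers). This particular estimator, the use of sample allele frequencies, the function name, and the toy genotypes are assumptions made for the example; the chapter does not commit to a specific formula at this point.

```python
def genomic_relationship_matrix(genotypes):
    """VanRaden-type GRM (assumed estimator) from 0/1/2 allele counts.

    genotypes: list of rows (one per individual), each a list of marker
    allele counts. Allele frequencies are estimated from the sample itself.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2 * n) for k in range(m)]
    scale = sum(2 * p * (1 - p) for p in freqs)
    # Center each genotype by twice the allele frequency of the marker
    z = [[row[k] - 2 * freqs[k] for k in range(m)] for row in genotypes]
    return [
        [sum(z[i][k] * z[j][k] for k in range(m)) / scale for j in range(n)]
        for i in range(n)
    ]


# Hypothetical genotypes: individuals 0 and 1 are identical at all markers
genos = [
    [2, 0, 1],
    [2, 0, 1],
    [0, 2, 1],
    [1, 1, 0],
]
G = genomic_relationship_matrix(genos)
for row in G:
    print([round(x, 3) for x in row])
```

Because allele frequencies are estimated from the same individuals, each row of this matrix sums to zero, so relationships are expressed relative to the average of the sample; individuals sharing more marker alleles than average obtain positive coefficients, the others negative ones.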
In the early developments of genomic prediction, the ancestral LD between markers and QTLs was suspected of contributing alone to the genomic prediction accuracy [4]. Ancestral LD can be defined as statistical dependencies between loci that already existed within the population founders of the pedigree, which were generated by ancestral evolutionary forces. Thus, it does not characterize additional dependencies between loci that arise from the pedigree relationships between individuals. When individuals are not related by pedigree, the GRM can only capture AGRs through ancestral LD, as illustrated in Fig. 1A. However, when CS and PS individuals are related by pedigree, the GRM captures AGRs even in absence of ancestral LD between markers and QTLs [5, 59–61], as illustrated in Fig. 1B. In this scenario, the GRM describes IBD at markers and can be considered as a proxy for the PRM. It explains why nonnull accuracies can be obtained when applying genomic predictions with only few markers. The contribution of ancestral LD and pedigree relationships to the accuracy depends on the genomic prediction model. Variable selection approaches like LASSO, or Bayesian approaches like Bayes-B tend to better exploit ancestral LD than GBLUP to make predictions [60, 62, 63]. In addition, for a given pedigree structure, the relative contribution of ancestral LD and pedigree relationships to the prediction accuracy depends on population size: pedigree relationships tend to have a greater effect than ancestral LD on the accuracy for CS of small size, and conversely for CS of large size [63– 65]. This contribution of ancestral LD relative to pedigree relationships is an important parameter to consider when applying genomic prediction between genetically distant CS and PS, as the accuracy due to pedigree relationships will drop more quickly than that due to ancestral LD with decreasing relatedness [59, 62]. 
In addition to pedigree relationships and ancestral LD between markers and QTLs, Habier et al. [61] also demonstrated that the GRM captures cosegregations between markers and QTLs. Cosegregations characterize the nonrandom association of alleles between linked loci that can be observed within the individuals of a given family of the pedigree. It can be distinguished from ancestral LD that characterizes the nonrandom association between alleles of different loci that were already established in the founders of the pedigree. In the absence of ancestral LD between markers and QTLs, marker alleles will, nevertheless, cosegregate with QTL alleles to which they are physically linked when new individuals are generated, as illustrated in Fig. 1C. This information will be accounted for in the GRM and will contribute to the genomic prediction accuracy. Note that cosegregations help to describe genetic covariances between individuals of the same family. When several families are pooled into a common CS, differences in linkage phase can be observed and may considerably limit the contribution of cosegregations to the accuracy [24].
Optimization of Genomic Prediction Calibration Set
Fig. 1 Hypothetical scenarios adapted from Habier et al. [61] illustrating different types of information captured by the genomic relationship matrix (GRM): (A) Ancestral LD only, (B) Pedigree only, and (C) Pedigree + cosegregations (CoS). For each scenario, 50 QTLs and 50 markers are considered with minor allele frequency of 0.5. In scenario (A), QTLs and markers are assigned in pairs to chromosomes (a single pair per chromosome with each locus pair being in LD = 0.8). In scenario (B), QTLs and markers are assigned to different chromosomes (a single locus per chromosome). In scenario (C), QTLs and markers are assigned in pairs to chromosomes (a single pair per chromosome with each locus pair being genetically distant by 5 cM but not in LD). In scenario (A), gametes are generated independently from different founders of the population, while they are generated from individuals resulting from the crossing of founders for scenarios (B) and (C). The additive genetic relationship (AGR) between gametes is computed by applying the formula of [124] to QTLs using simulated allele frequencies. The pedigree relationship matrix (PRM) between gametes is constructed by assigning a coefficient of 0.5 between gametes originating from the same individuals, and 0 otherwise. The GRM is calculated like the AGR but using markers. In scenario (A), the GRM can estimate the AGR using ancestral LD, even in the absence of pedigree relationship between gametes. In scenario (B), the GRM can estimate the AGR by tracing pedigree relationships between gametes (like the PRM), even in the absence of ancestral LD between markers and QTLs. In scenario (C), the GRM can better estimate the AGR within a family of gametes compared to the GRM in scenario (A) and the PRM, thanks to cosegregations between QTLs and marker alleles that are physically linked on chromosomes. Note that we considered haploid gametes to simplify the schematic representation, but those concepts can be generalized to relationships between diploid individuals.
Simon Rio et al.
Several studies have shown that the accuracy of genomic prediction is linked to the AGRs between CS and PS individuals. Based on simulation, Pszczola et al. [66] established the link between the deterministic reliability of genomic prediction and the average squared genomic relationship coefficient between CS and PS individuals. Habier et al. [60] illustrated that the accuracy of genomic prediction increased with increasing a-max in dairy cattle, where a-max is defined as the maximum pedigree relationship coefficient between CS and PS individuals. This result was confirmed in maize [67] and oil palm [68]. More generally, the need for close pedigree relationships between CS and PS individuals has been illustrated in several species, such as mice [17]. In addition to AGRs, other types of genetic relationships can be modeled to improve genomic prediction accuracies, such as dominance and epistatic genetic relationships. Like AGRs, these other relationships directly reflect the sharing of alleles at QTLs and can be estimated using the pedigree [57] or using markers [52, 69]. However, they are often not accounted for in genomic prediction models, as they generally have a limited contribution to the overall genetic variance, except for specific applications such as the prediction of hybrids (see Subheading 4.3).
2.3 Calibration Set Should Be As Large as Possible
When building a CS, increasing the number of individuals is generally beneficial. The importance of CS size has been shown theoretically using deterministic equations of the accuracy of genomic prediction [48, 70–74]. These equations show that the population size should be large enough to properly estimate the effect of each of the effective chromosome segments that segregate in the population (quantified by their number Me), in particular for low heritability traits. The effect of the CS size on genomic prediction accuracy has been illustrated experimentally in plants [62, 75–77] and animal species [78, 79], as well as in humans [80]. However, one should keep in mind that increasing the number of individuals should be done with caution if additional individuals are genetically distant from the PS individuals, as mentioned in the previous subsection. There is also a compromise to be found between the number of phenotyped individuals and the accuracy of the phenotyping, which can be increased in plants by increasing the number of observations per individual (see discussion in Subheading 4.4).
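As a rough illustration of how CS size, heritability, and Me interact, the sketch below evaluates one widely cited deterministic approximation, r = sqrt(N h² / (N h² + Me)), usually attributed to Daetwyler and colleagues. We assume it here as a representative of the deterministic equations cited above; the function name and the example values of h² and Me are illustrative only and are not taken from this chapter.

```python
import math


def expected_accuracy(n_cs: int, h2: float, me: float) -> float:
    """Approximate accuracy r = sqrt(N*h2 / (N*h2 + Me)).

    n_cs: number of individuals in the calibration set (N)
    h2:   trait heritability (between 0 and 1)
    me:   number of effective chromosome segments (Me)
    """
    if n_cs < 0 or not 0.0 <= h2 <= 1.0 or me <= 0:
        raise ValueError("need n_cs >= 0, 0 <= h2 <= 1 and me > 0")
    return math.sqrt(n_cs * h2 / (n_cs * h2 + me))


# Hypothetical values: a low and a high heritability trait, Me = 1000.
for h2 in (0.2, 0.8):
    row = [round(expected_accuracy(n, h2, 1000), 2) for n in (250, 1000, 4000)]
    print(f"h2={h2}: accuracy for N=250, 1000, 4000 -> {row}")
```

Under this approximation, a low heritability trait needs a much larger CS than a high heritability trait to reach the same accuracy, which is the point made in the paragraph above.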
2.4 Genetic Relationships Between CS Individuals Should Be Limited
Finally, it is generally admitted that genetic relationships among individuals should be limited within the CS. This idea is related to the common assumption that, in genomic prediction, experimental designs should aim at replicating alleles rather than individuals [81]. Because individuals with high degrees of genetic relationship can be considered as partial replicates and somewhat redundant, including them all may not be the best allocation of resources regarding genomic prediction accuracy. Based on simulations,
Pszczola et al. [28, 66] have shown that average reliabilities of genomic prediction decreased with increasing genetic relationships within the CS. These results were confirmed in dairy cattle [82]. However, limiting the genetic relationships between CS individuals is not sufficient to maximize the accuracy, as maximizing genetic relationships between CS and PS individuals is also important.
3 Methods to Optimize the Composition of the Calibration Set
Considering all the factors affecting predictive ability mentioned above, optimizing the composition of CS is not a simple task. We can, however, suppose that when CS and PS cover the same genetic space, they will have similar LD patterns, same segregating QTLs and a high genetic relationship, which are the main drivers of predictability. Numerous criteria have been proposed in the literature to optimize the composition of the CS. They can be grouped into two classes. The first class of approaches consists in identifying optimized CS based on model-free relatedness criteria. The second class of approaches is directly based on the genomic selection (GS) statistical model. They mostly rely on GBLUP, which is one of the reference GS models. They consist in defining CS by optimizing some criteria derived from the linear mixed model: the (generalized) Prediction Error Variance (PEV) and Coefficient of Determination (CD), or the expected Pearson correlation between predicted and observed values (r). Each criterion has advantages and drawbacks related to its efficiency to maximize predictive ability, its computational demand, and its ability to optimize the CS prior to phenotyping. A brief description of the methods/criteria (including references and, if available, scripts or tools to implement them) is presented in Table 1. In this part, we will review the two classes of criteria. The specific questions of predicting biparental families, optimizing or updating the CS when phenotypes are available, optimizing CS for hybrids, and optimizing experimental designs will be reviewed in Subheading 4.
Table 1 Main methods and criteria to optimize the composition of the calibration set in genomic selection. Methods are grouped in two categories (Type) depending on whether they rely on a statistical model (“model based”) or not (“model free”) and depending on the target objective (partition a genotyped population into CS/PS (A), optimize a CS for a given PS (B), or subsample historic data to predict a given PS (C))

| Name/criterion | Type | Description | Objective | Specificities | Main references | Scripts/packages |
|---|---|---|---|---|---|---|
| Amean and Amax | Model-free | Minimize the average or the maximum of the relationship coefficients within the CS | A | | [83] | Upon request |
| Uniform coverage of the target genetic space | Model-free | A geometric approach ensuring that no close relatives are included in the CS | A | | [84] | |
| Partitioning around Medoids (PAM) | Model-free | Individuals are grouped into clusters and representatives of each group are identified | A | | [85] | https://github.com/TingtingGuo0722/OptimalDesign |
| Fast and unique representative subset selection (FURS) | Model-free | Method based on graphical networks, the principle of which is to identify nodes (individuals) with highest degree centrality | A | | [85] | https://github.com/TingtingGuo0722/OptimalDesign |
| Gmean, Avg_GRM, and Crit_Kin | Model-free | Maximize a criterion based on the genetic similarity between the CS and the PS | A or B | | [23, 86, 87] | |
| Gmax | Model-free | The objective is that each predicted individual has a close relative in the CS | A or B | | Unpublished | |
| Stratified sampling | Model-free | It ensures that each group of structure is well represented in the CS | A | Considers population structure | [84, 87, 90–93] | |
| PEVmean | Model-based | Minimize the average generalized prediction error variance (PEV) of the PS | A or B | | [83, 99, 145, 122^b] | Upon request [83], STPGA [99] |
| CDmean | Model-based | Maximize the average generalized expected reliability (CD) of the PS | A or B | | [83, 91, 99, 145, 98^b] | Upon request [83] |
| CDmin | Model-based | Maximize the minimal expected reliability (CD) of the PS | A or B | | [121] | TrainSel^a, https://github.com/TheRocinante-lab/TrainSel |
| CDpop | Model-based | Maximize the within population generalized expected reliability | A or B | Considers population structure/multiparental populations | [87] | Upon request |
| CDmulti | Model-based | Maximize the generalized expected reliability for multitrait or multienvironment models | A or B | Multitrait, multienvironment | [116] | Upon request |
| Expected predictive ability or accuracy (r) | Model-based | Maximize the expected predictive ability or accuracy | A or B | | [101] | TSDFGS |
| EthAcc | Model-based | Maximize the expected accuracy for a given genetic architecture | C | Takes QTL information into account | [118, 119] | Supplementary materials |
| PEVmean1 | Model-based | A CS is specifically designed for each PS individual by minimizing its individual PEV | C | Defines a CS specific to each predicted individual | [103] | |
| Sparse selection index | Model-based | Optimization based on a sparse selection index | C | Defines a CS specific to each predicted individual | [136] | https://github.com/MarcooLopez/SFSI |

^a TrainSel allows different types of criteria to be implemented (optimization function can be provided by users) beyond the built-in CDmin criterion presented in the publication
^b Individual CD or PEV considered instead of the generalized ones

3.1 Model-Free Optimization Criteria Based on Genetic Distances Between Individuals
As mentioned above, one of the main objectives when designing a CS is to ensure that the genetic space covered by the PS is well captured by the CS. Relatedness being a key contributor to prediction accuracy, different relatedness-based criteria were proposed to optimize the composition of the CS.
3.1.1 Optimization Based on Genetic Diversity Within the CS
A first way of optimizing the composition of the CS using relatedness is to minimize genetic similarity within the CS. The underlying idea is that genetic similarity within the CS can be seen as partial redundancy, and so CS including related individuals would be less informative than more diverse CS (see Subheading 1). This can be for instance done by minimizing the average or the maximum of the relationship coefficients within the CS [83]. A similar approach was proposed by Bustos-Korts et al. [84] to design a CS leading to a uniform coverage of the target genetic space. This is based on a geometric approach that ensures that no close relatives are included in the CS. Guo et al. [85] proposed a method called Partitioning around Medoids (PAM), in which individuals are grouped into clusters and representatives of each group are identified. They also proposed Fast and Unique Representative Subset Selection (FURS), which is a sampling method based on graphical networks. The principle of FURS is to identify nodes (individuals) with highest degree of centrality. These criteria are best adapted to identify a CS among a set of candidates that will be used to predict the performance of the remaining candidates (like the scenario in Fig. 2A). They cannot be used to optimize a CS for an independent PS. CS optimization based on these criteria has led to a higher prediction accuracy than random sampling [84, 85], but they do not directly consider the genetic relatedness between the CS and the PS. If most of the PS individuals are present in a small part of the genetic space, it is important to have many CS individuals in this part, even if it leads to a low diversity in the CS. In other words, it is important to weigh the different parts of the genetic space according to the distribution of the PS individuals, the optimal CS being not necessarily the one with the highest genetic diversity.
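The within-CS diversity criteria can be made concrete with a small sketch. The code below computes the average and the maximum relationship coefficient within a candidate CS, and grows a CS with a simple greedy rule that keeps the average low. The greedy search, the function names, and the toy relationship matrix are our own assumptions for illustration; they are not the procedure used in [83].

```python
from itertools import combinations


def amean_amax(K, cs):
    """Average and maximum relationship coefficient among CS individuals."""
    coefs = [K[i][j] for i, j in combinations(cs, 2)]
    return sum(coefs) / len(coefs), max(coefs)


def greedy_diverse_cs(K, n_cs):
    """Greedy heuristic (an assumption, not a published algorithm):
    start from the least related pair, then repeatedly add the candidate
    that gives the lowest average relationship within the CS."""
    n = len(K)
    cs = list(min(combinations(range(n), 2), key=lambda p: K[p[0]][p[1]]))
    while len(cs) < n_cs:
        rest = [i for i in range(n) if i not in cs]
        cs.append(min(rest, key=lambda i: amean_amax(K, cs + [i])[0]))
    return sorted(cs)


# Toy relationship matrix: individuals 0-1 and 2-3 are pairs of close
# relatives, individual 4 is weakly related to 0 and 1.
K = [
    [1.0, 0.5, 0.0, 0.0, 0.1],
    [0.5, 1.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 1.0],
]
cs = greedy_diverse_cs(K, 3)
amean, amax = amean_amax(K, cs)
print(cs, amean, amax)
```

In this toy example the selected CS avoids sampling both members of a pair of close relatives, which is the redundancy argument made above. Note that nothing in this criterion looks at the PS.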
Fig. 2 Comparison of two standard scenarios that can be considered for the optimization of the calibration set (CS) based on PEVmean and CDmean. In (A) a single population is split into a CS and a predicted set (PS) with the objective of optimizing the CS for best predicting non phenotyped individuals (PS), while in (B) the set of CS candidates and the PS (both genotyped) are distinct. In both scenarios, the PEVmean and the CDmean criteria are computed directly for the PS individuals
3.1.2 Optimization Based on Genetic Relatedness Between the CS and the PS
The genetic relatedness between the CS and the PS is taken into account by criteria Gmean [23], Avg_GRM [86], and Crit_Kin [87]. For Gmean and Avg_GRM, candidates to the CS are ranked according to their average genetic relatedness to the PS. The individuals with the highest average are included in the CS. Roth et al. [88] and Berro et al. [89] proposed similar approaches in which individuals are ranked according to their maximum or median genetic relatedness to the PS. For Crit_Kin the average of the relationship coefficients between the CS and the PS is maximized. These criteria generally resulted in higher predictive ability than random sampling [23, 86–88, 90]. Contrary to the previous criteria (Subheading 3.1.1 above), Gmean, Avg_GRM and Crit_Kin take into account genetic relatedness between CS and PS, but do not consider redundancy within the CS. To design an optimal CS, it seems important to balance the two aspects. Another criterion that has, to our knowledge, not yet been tested in the literature and that could help reaching this balance, is Gmax, the maximum relatedness coefficient between a given PS individual and the CS individuals. As prediction accuracy decreases with the genetic distance between the CS and the PS (see above), it seems interesting to ensure that each PS individual has a close relative in the CS, as illustrated in Clark et al. [63] and Habier et al. [59]. One criterion could be to identify a CS maximizing the average of the Gmax of all PS individuals. This criterion seems promising, as its maximization will result in similar genetic dispersion for CS and PS. We can indeed think that the optimal CS is the PS itself, and Gmax will help identify a CS as close as possible to this optimum.
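To make the difference between these criteria tangible, the following sketch ranks candidates by their average relatedness to the PS (in the spirit of Gmean and Avg_GRM), evaluates the average CS–PS relationship of a given CS (in the spirit of Crit_Kin), and computes the average Gmax. All function names and the toy relationship coefficients are hypothetical, and the exhaustive search over pairs is only feasible for such a tiny example.

```python
from itertools import combinations


def rank_by_gmean(K, candidates, ps, n_cs):
    """Keep the n_cs candidates with the highest mean relatedness to the PS."""
    score = {i: sum(K[i][j] for j in ps) / len(ps) for i in candidates}
    return sorted(candidates, key=lambda i: score[i], reverse=True)[:n_cs]


def crit_kin(K, cs, ps):
    """Average relationship coefficient between CS and PS individuals."""
    return sum(K[i][j] for i in cs for j in ps) / (len(cs) * len(ps))


def mean_gmax(K, cs, ps):
    """Average over PS individuals of their closest relationship in the CS."""
    return sum(max(K[i][j] for i in cs) for j in ps) / len(ps)


# Toy data: candidates 0-3, PS individuals 4-5. Candidates 0 and 1 are both
# close to PS individual 4; only candidate 2 is close to PS individual 5.
n = 6
K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
cross = {(0, 4): 0.5, (1, 4): 0.45, (3, 4): 0.1, (2, 5): 0.4, (3, 5): 0.1}
for (i, j), value in cross.items():
    K[i][j] = K[j][i] = value

candidates, ps = [0, 1, 2, 3], [4, 5]
cs_gmean = rank_by_gmean(K, candidates, ps, 2)
cs_gmax = max(combinations(candidates, 2), key=lambda cs: mean_gmax(K, cs, ps))
print("Gmean ranking:", cs_gmean, "mean Gmax:", mean_gmax(K, cs_gmean, ps))
print("best mean Gmax:", list(cs_gmax), mean_gmax(K, cs_gmax, ps))
```

In this toy case, ranking on average relatedness retains two candidates that are both related to the same PS individual, whereas maximizing the average Gmax gives each PS individual a close relative in the CS.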
3.1.3 Taking Population Structure into Account
In case of population structure, all the abovementioned criteria derived from relatedness coefficients could also be useful as population structure is partially captured by genetic relatedness. But in the case of strong structure, it may be necessary to directly take it into account. It was proposed in numerous studies to define the CS by stratified sampling. These algorithms ensure that each group is well represented in the CS, possibly taking the size of each group into account [84, 87, 90–93]. The efficiency of these approaches to increase predictive ability was disappointing, as they were often not better (sometimes worse) than random sampling, and most of the time not as good as relatedness-based criteria not taking structure into account. This is probably due to the fact that they only rely on population structure and do not consider relatedness within the groups. Their efficiency, however, increases with the importance of the structuration [91]. Optimizing CS by combining information on population structure and relatedness seems an interesting alternative strategy to achieve higher accuracy. The specific and extreme case of predicting a population stratified into biparental populations will be discussed below.
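Stratified sampling can be sketched as follows, assuming that group membership is already known (for instance from a clustering of the genotypes) and that the CS is allocated to groups in proportion to their size. The proportional allocation with largest remainders, the function name, and the group labels are assumptions made for this example; the cited studies use several variants.

```python
import random
from collections import Counter, defaultdict


def stratified_cs(group_of, n_cs, seed=0):
    """Sample n_cs individuals, each group contributing in proportion to its size.

    group_of: dict mapping each candidate individual to its group label.
    """
    members = defaultdict(list)
    for ind, grp in group_of.items():
        members[grp].append(ind)
    n_total = len(group_of)
    # Ideal (fractional) number of CS individuals per group.
    quota = {g: n_cs * len(m) / n_total for g, m in members.items()}
    alloc = {g: int(q) for g, q in quota.items()}
    # Hand the remaining slots to the groups with the largest remainders.
    leftover = n_cs - sum(alloc.values())
    by_remainder = sorted(quota, key=lambda g: quota[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    rng = random.Random(seed)
    cs = []
    for g, m in members.items():
        cs.extend(rng.sample(m, alloc[g]))  # random sampling within each group
    return cs


# Hypothetical population with three groups of 60, 25 and 15 individuals.
group_of = {f"A{i}": "A" for i in range(60)}
group_of.update({f"B{i}": "B" for i in range(25)})
group_of.update({f"C{i}": "C" for i in range(15)})
cs = stratified_cs(group_of, 12)
counts = Counter(group_of[ind] for ind in cs)
print(dict(counts))
```

Because individuals are drawn at random within each group, this sketch ignores relatedness within groups, which is the limitation discussed above.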
3.2 Optimization Using “Model-Based” Criteria Derived from the Mixed Model Theory (PEV, CD, r)
The abovementioned criteria are model-free in the sense that they do not rely on any genomic prediction model. We can, nevertheless, think that if a GS model results in accurate predictions, working with theoretical criteria derived from the same model could be valuable. One of the reference and most efficient GS models is the GBLUP mixed model. It is particularly adapted to polygenic traits because it relies on the infinitesimal model. Analytical developments were made within the mixed model theory to derive criteria related to the expected predictive ability of the model before any phenotyping. In this section we introduce these criteria and how they were used to optimize CS.
3.2.1 CS Optimization Using the Prediction Error Variance (PEV) or the Coefficient of Determination (CD)
Rincent et al. [83] proposed to use the generalized Prediction Error Variance (PEV(c)) or the generalized expected reliability (generalized Coefficient of Determination, CD(c)) of contrast c to optimize the composition of CS in genomic prediction. In animal breeding, PEV(c) or CD(c) were first proposed to track disconnectedness in experimental designs [94, 95]. The contrast c indicates which comparison is of interest. If one wants to consider the comparison between the prediction of individual 1 and the prediction of individual 2 in a set of four individuals, then $c' = [1\ \ {-1}\ \ 0\ \ 0]$. If one wants to compare a group of individuals (1 and 2) with another group of individuals (3 and 4), then $c' = [1\ \ 1\ \ {-1}\ \ {-1}]$. The sum of the contrast elements always has to be null. Contrary to plants, animals cannot be replicated in different environments, and so the comparison of animals of different years or different herds can be a problem. The genetic relatedness between individuals obtained from the pedigree can be used to connect the different management units. Taking into account this connectivity is important to ensure that the comparison between animals is reliable. PEV(c) and CD(c) were initially used to optimize experimental designs (repartition of animals in different herds) to make the comparisons reliable [94, 95]. More recently, they were applied to models relying on realized relationship matrices based on marker information [96, 97], possibly in the presence of nonadditive effects [98]. The generalized PEV and CD are derived from the GBLUP model, with: $y = X\beta + Zu + e$, where y is a vector of phenotypes, β is a vector of fixed effects, u is a vector of random genetic values (polygenic effect) and e is the vector of errors. X and Z are design matrices. The variance of the random effect u is: $\mathrm{var}(u) = A\sigma_g^2$, where A is the relationship matrix (realized relationship matrix in the context of genomic prediction) and $\sigma_g^2$ is the additive genetic variance in the population. The variance of the errors e is: $\mathrm{var}(e) = I\sigma_e^2$, where I is the identity matrix. The PEV of any contrast c of predicted genetic values can be equivalently calculated as:
$$PEV(c) = \frac{\mathrm{var}(c'\hat{u} - c'u)}{c'c},$$

$$PEV(c) = \frac{c'\left(Z'MZ + \lambda A^{-1}\right)^{-1}c}{c'c}\,\sigma_e^2,$$

$$PEV(c) = \frac{c'\left(A - AZ'M_2ZA\right)c}{c'c}\,\sigma_g^2,$$

where c is a contrast, $\hat{u}$ is the BLUP of u, M is an orthogonal projector on the subspace orthogonal to that spanned by the columns of X: $M = I - X(X'X)^{-}X'$ and $(X'X)^{-}$ is a generalized inverse of $X'X$ [94], $M_2 = \tilde{\Sigma}^{-1} - \tilde{\Sigma}^{-1}X(X'\tilde{\Sigma}^{-1}X)^{-}X'\tilde{\Sigma}^{-1}$, $\tilde{\Sigma} = ZAZ' + \lambda I$ is the phenotypic covariance matrix scaled to some variance ratio and $\lambda = \sigma_e^2/\sigma_g^2$. The last expression of PEV(c) has the advantage of being computationally much more efficient when the size of the CS is small in comparison to the total number of individuals considered. PEV(c) is influenced by the genetic distance between the compared individuals and by the expected amount of information brought by the experiment on the compared individuals. A low PEV of a contrast between two individuals can be due to their close genetic similarity, or to the important amount of information brought by the experiment on the given comparison (e.g., the two individuals are related to many CS individuals meaning that their predictions will be precise). The generalized CD [94] is defined as the squared correlation between the true and the predicted contrast of genetic values, and is computed as:

$$CD(c) = \mathrm{cor}(c'u, c'\hat{u})^2,$$

$$CD(c) = \frac{c'\left(A - \lambda\left(Z'MZ + \lambda A^{-1}\right)^{-1}\right)c}{c'Ac},$$

$$CD(c) = \frac{c'\left(AZ'M_2ZA\right)c}{c'Ac}.$$
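The two computational routes to PEV(c) and CD(c) can be checked numerically. The sketch below is a minimal NumPy illustration under stated assumptions: a toy positive definite relationship matrix built from random numbers, an intercept as the only fixed effect, four phenotyped individuals out of six, and a contrast between the two nonphenotyped individuals. The function and variable names are ours.

```python
import numpy as np


def pev_cd(A, Z, X, c, lam, sigma2_g=1.0):
    """Return (PEV, CD) of contrast c computed with both expressions."""
    n_obs = Z.shape[0]
    sigma2_e = lam * sigma2_g
    # Route 1: coefficient matrix of the mixed model equations.
    M = np.eye(n_obs) - X @ np.linalg.pinv(X.T @ X) @ X.T
    C = np.linalg.inv(Z.T @ M @ Z + lam * np.linalg.inv(A))
    pev_a = (c @ C @ c) * sigma2_e / (c @ c)
    cd_a = (c @ (A - lam * C) @ c) / (c @ A @ c)
    # Route 2: scaled phenotypic covariance (matrix of size n_obs to invert).
    S_inv = np.linalg.inv(Z @ A @ Z.T + lam * np.eye(n_obs))
    M2 = S_inv - S_inv @ X @ np.linalg.pinv(X.T @ S_inv @ X) @ X.T @ S_inv
    B = A @ Z.T @ M2 @ Z @ A
    pev_b = (c @ (A - B) @ c) * sigma2_g / (c @ c)
    cd_b = (c @ B @ c) / (c @ A @ c)
    return (pev_a, cd_a), (pev_b, cd_b)


# Toy data (hypothetical): 6 individuals, the first 4 are phenotyped (CS).
rng = np.random.default_rng(1)
n, m = 6, 50
W = rng.standard_normal((n, m))
A = W @ W.T / m + 0.01 * np.eye(n)   # positive definite relationship matrix
Z = np.eye(n)[:4]                    # links the 4 records to individuals 0-3
X = np.ones((4, 1))                  # intercept only
c = np.array([0.0, 0.0, 0.0, 0.0, 1.0, -1.0])  # contrast between PS 4 and 5

(pev_a, cd_a), (pev_b, cd_b) = pev_cd(A, Z, X, c, lam=1.0)
print(f"PEV: {pev_a:.4f} vs {pev_b:.4f}; CD: {cd_a:.4f} vs {cd_b:.4f}")
```

Both routes should agree up to numerical error. The second one only inverts matrices of the size of the number of records, which is why it is preferred when the CS is small relative to the total number of individuals.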
As for PEV(c), the last expression of CD(c) is computationally more efficient, because of the reduced size of the matrix to be inverted when the number of observations is smaller than the total number of individuals. The CD(c) is equivalent to the expected reliability of the contrast. It takes values between 0 and 1: a CD(c) close to 0 means that the prediction of the contrast is not reliable, whereas a CD(c) close to 1 means that the prediction is highly reliable. The generalized CD(c) is equal to

$$CD(c) = 1 - \frac{PEV(c)}{c'Ac\,\sigma_g^2}$$

for a contrast scaled such that $c'c = 1$ (otherwise, PEV(c) as defined above has to be multiplied by $c'c$ in this expression). As a result, the CD(c) increases with diminishing PEV(c) and with increasing genetic distance between individuals involved in the targeted contrast. An increase of the genetic distance will indeed increase the genetic variance of the contrast. Note that if c is replaced by a vector of 0 and a single 1, the resulting CD is no longer the generalized CD of a contrast but the individual CD of the corresponding individual.
Rincent et al. [83] first proposed to use PEV(c) and CD(c) to optimize the composition of the CS in genomic prediction. As CD(c) is the expected reliability of a given contrast c, it is a criterion of choice to maximize prediction accuracy by optimizing the composition of the CS. The main aim of genomic selection is indeed to discriminate between individuals based on their predicted breeding values. As shown above, the computation of these criteria only requires the kinship matrix and the ratio of the error and genetic variances (λ) that can be chosen based on prior knowledge. No phenotypic information is required, so the optimization of the CS can be done prior to any phenotyping. The optimization was only marginally affected by λ in Rincent et al. [83] and Akdemir et al. [99], which means that the CS optimizing PEV(c) or CD(c) is supposed to be efficient for any polygenic trait. Rincent et al. [83] have proposed the criteria $PEVmean = \frac{1}{N_{PS}}\sum_{i=1}^{N_{PS}} PEV(c_i)$ and $CDmean = \frac{1}{N_{PS}}\sum_{i=1}^{N_{PS}} CD(c_i)$, where $c_i$ is the contrast between PS individual i and the mean of the population, and $N_{PS}$ is the size of the PS. CDmean (PEVmean) is the average of the CD(c) (PEV(c)) of the individuals in the PS considering a given CS. CDmean is expected to be better than PEVmean for improving GS accuracy, as illustrated in Rincent et al. [83] and Isidro et al. [91], since the CD(c) is related to the ability to discriminate individuals. By maximizing CDmean of the PS, we define a CS able to discriminate each predicted individual from the average population, so that we are able to reliably identify the best (or the worst) individuals. Using two maize diversity panels, Rincent et al.
[83] considered a case in which only part of a population could be phenotyped, so that the CS was optimized to predict the nonphenotyped individuals (PS), and a case in which the CS was optimized to predict a predetermined PS (Fig. 2). They showed that a considerable increase of prediction accuracy could be reached by optimizing the CS with PEVmean, and even more with CDmean, in comparison to randomly sampled CS. From another perspective, CS based on PEVmean and CDmean enabled the same prediction accuracy as random CS with half as many phenotyped individuals. One key point with these criteria is that they take into account kinship between all individuals (CS and PS), and therefore result in the sampling of an optimized CS specific to a given PS. As a result, it is highly recommended to optimize the PEVmean or CDmean of the predicted individuals [83, 87, 99, 100] rather than those of the individuals composing the CS [91, 101]. These criteria have been tested and validated in different species such as maize [83, 86, 87, 93], palm tree [68], wheat [102–104], barley [90], oat [15], cassava [105, 106], miscanthus [27], Arabidopsis
[99], apple tree [88], and peas [107] in populations of various levels of relatedness. CDmean led to prediction accuracies at least as good as those obtained with model-free criteria [83, 86, 87, 91, 93] with some exceptions [88–90, 108]. Note that the contrasts are flexible and can be adapted to address specific prediction objectives. For instance, in the context of biparental families, different contrasts have to be defined if one is interested in comparing families or individuals within families (see criterion CDpop below). In case of strong population structure, it can be necessary to adapt these criteria [87, 91, 101]. Isidro et al. [91] have proposed the stratified CDmean maximizing the CDmean within each group. This criterion did not improve prediction accuracy in comparison to CDmean. This may be explained by the fact that CDmean takes population structure into account as long as it is captured by the kinship matrix. One of the strengths of PEV(c) and CD(c) is that they can be adapted to address specific prediction objectives (e.g., scenarios A and B in Fig. 2) by adapting the contrasts. They can be used to optimize a CS for a given PS (Fig. 2B), or to select the best CS within a population that can only be partially phenotyped, the remaining individuals being predicted (Fig. 2A). Rincent et al. [87] proposed to adapt the contrasts to take population structure into account. In this study based on connected biparental populations, new criteria were proposed to maximize prediction accuracy within each population (CDpop), or the global accuracy not taking population structure into account (CDmean). They showed that the definition of the contrasts could be adapted to specifically address each prediction objective (see below). Examples of CS optimized with CDmean or CDpop are presented in Fig. 3.

Fig. 3 Networks representing examples of calibration set (CS) optimized with generalized CD criteria (A: CDmean and B: CDpop). The green dots indicate the individuals to be predicted (PS), the red squares indicate the 30 individuals composing the CS optimized with CD criteria. Individuals are connected with an edge when their genetic relationship is above a given threshold. In (A) we considered a highly diverse panel in which the objective was to sample a CS optimal for the prediction of the remaining individuals. In (B), we considered different biparental populations of a Nested Association Mapping (NAM) design, with the objective of predicting one given biparental family by sampling an optimal CS from the other biparental families. The contrasts were adapted to answer these two prediction objectives and correspond to the criteria CDmean (A) and CDpop (B), see Rincent et al. [83, 87]. In (A), the network indicates that CDmean selects key individuals related to many others. In (B), the network illustrates that CDpop samples individuals the most representative of the PS, mostly belonging to biparental populations strongly related to the PS

3.2.2 Multitrait CS Optimization with CDmulti
Genomic prediction models can be adapted to take into account multiple traits and multiple environments in the same statistical model. This was shown to increase prediction accuracy, in particular when a low-cost secondary trait is measured on the PS, i.e., trait-assisted prediction [109–111], or when all PS individuals are phenotyped in at least one environment in a multienvironment trial, i.e., sparse testing [112–116]. In these situations, the partition between CS and PS is not as clear as in the previous paragraphs, as some of the PS individuals are partially observed (phenotyped for a secondary trait, and/or in some of the environments). The optimization is more complex, as the experimental design involves more than one trait or environment. The underlying model is $y = X\beta + Zu + e$, in which y is a vector of phenotypes concatenating the different traits, u is the corresponding vector of multitrait polygenic effects, and e is the vector of errors, $\mathrm{var}(u) = \Sigma_a \otimes A$ and $\mathrm{var}(e) = \Sigma_\varepsilon \otimes I$, with $\Sigma_a$ the matrix of genetic variance/covariance between traits, and $\Sigma_\varepsilon$ the matrix of error variance/covariance between traits. Generalized CD can be derived from this model [117] to compute the expected reliability for each individual–trait combination. This is a generalization of the single trait CD, in which the genetic and error covariances are adapted to the multitrait context. The computation of this criterion (CDmulti) is as follows:

$$CDmulti(c) = \frac{c'\left((\Sigma_a \otimes A) - \left(Z'M_3Z + (\Sigma_a^{-1} \otimes A^{-1})\right)^{-1}\right)c}{c'(\Sigma_a \otimes A)c},$$

with $\otimes$ the Kronecker product, and

$$M_3 = (\Sigma_\varepsilon^{-1} \otimes I) - (\Sigma_\varepsilon^{-1} \otimes I)X\left(X'(\Sigma_\varepsilon^{-1} \otimes I)X\right)^{-1}X'(\Sigma_\varepsilon^{-1} \otimes I).$$
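The CDmulti expression can be evaluated with a few Kronecker products. The following NumPy sketch assumes a balanced two-trait design in which every phenotyped individual is recorded for all traits, records stacked trait by trait, and one mean per trait as the only fixed effects; these simplifications, the function name, and the toy variance components are ours and do not reproduce the unbalanced designs studied in [117].

```python
import numpy as np


def cd_multi(A, Z1, sigma_a, sigma_e, c):
    """Generalized CD of contrast c in a multitrait GBLUP (balanced design).

    A:       relationship matrix (n x n)
    Z1:      single-trait design matrix linking records to individuals
    sigma_a: genetic covariance matrix between traits (t x t)
    sigma_e: error covariance matrix between traits (t x t)
    c:       contrast of length t*n, ordered trait by trait
    """
    n_rec = Z1.shape[0]
    t = sigma_a.shape[0]
    Z = np.kron(np.eye(t), Z1)
    X = np.kron(np.eye(t), np.ones((n_rec, 1)))      # one mean per trait
    R_inv = np.kron(np.linalg.inv(sigma_e), np.eye(n_rec))
    M3 = R_inv - R_inv @ X @ np.linalg.inv(X.T @ R_inv @ X) @ X.T @ R_inv
    G = np.kron(sigma_a, A)
    G_inv = np.kron(np.linalg.inv(sigma_a), np.linalg.inv(A))
    C = np.linalg.inv(Z.T @ M3 @ Z + G_inv)
    return float(c @ (G - C) @ c / (c @ G @ c))


# Toy data (hypothetical): 6 individuals, the first 4 phenotyped for 2 traits.
rng = np.random.default_rng(2)
n = 6
W = rng.standard_normal((n, 50))
A = W @ W.T / 50 + 0.01 * np.eye(n)
Z1 = np.eye(n)[:4]
c = np.zeros(2 * n)
c[4], c[5] = 1.0, -1.0      # contrast between PS individuals 4 and 5, trait 1

sigma_a = np.array([[1.0, 0.6], [0.6, 1.0]])   # assumed genetic covariances
sigma_e = np.array([[2.0, 0.3], [0.3, 1.0]])   # assumed error covariances
cd_corr = cd_multi(A, Z1, sigma_a, sigma_e, c)
cd_diag = cd_multi(A, Z1, np.diag([1.0, 1.0]), np.diag([2.0, 1.0]), c)
print(f"CDmulti with covariances: {cd_corr:.4f}; without: {cd_diag:.4f}")
```

When the covariance matrices are diagonal, the traits do not inform each other and the value for a trait 1 contrast reduces to the single trait CD(c) with λ equal to the error to genetic variance ratio of that trait, which provides a simple check of the implementation.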
Computing CDmulti requires prior knowledge on genetic covariances between traits (genetic and error covariance matrices between traits), and so the optimized multitrait design is specific to a set of traits or environments. In CDmulti, each individual–trait combination is characterized by a CD value (using the corresponding contrast). Ben-Sadoun et al. [117] considered a trait-assisted prediction scenario with a target trait and a secondary trait easy and inexpensive to phenotype and correlated to the target trait. The goal was to identify which individuals should be phenotyped for the target trait, for the secondary trait or for both, to maximize prediction accuracy of the PS for the target trait with
budget constraints. They showed that phenotyping strategies optimized with CDmulti resulted in a slight but systematic increase of prediction accuracy in comparison to random sampling. In a multienvironment context, one can expect some levels of GxE and different phenotyping costs associated to each environment. In this situation, CDmulti could help determine which individuals should be phenotyped in each environment.
3.2.3 CS Optimization Using the Expected Predictive Ability Or Accuracy (r)
More recently, Ou and Liao [101] proposed to derive the expected Pearson correlation between phenotypes and predicted breeding values (r) in the PS, often referred to as predictive ability. This optimization criterion is also derived from the GBLUP model and can be computed without any phenotypic data. This criterion is interesting as it directly targets the predictive ability, which is related to genetic progress. The authors showed that it resulted in higher predictive ability than other criteria derived from the GBLUP model (PEV, CD) and stratified sampling. This conclusion could, however, partly be due to the fact that the CD criteria were computed within the CS (the genotypes of the PS individuals were not considered). The main limitation common to all aforementioned criteria is that they rely on genome-wide relatedness through the use of a GBLUP model, which means that they are only adapted to polygenic traits. This is not a problem for most productivity traits, but they are not adapted to traits influenced by major genes such as some disease resistances or phenology. Theoretical developments could be proposed in the future to adapt these criteria to trait specific genetic architecture, in particular to the presence of major genes. A new criterion (EthAcc) targeting the expected prediction accuracy ($r(u, \hat{u})$) was proposed to better take genetic architecture into account using the results of genome wide association studies obtained with historic data and genotypic information of the PS [118, 119]. The objective here is to determine an optimal CS from existing phenotyped and genotyped individuals. This is a common situation in plant breeding, as breeders accumulate such data year after year. This criterion was efficient to determine the optimal size and composition of the CS, but the search algorithms were unable to identify the optimal CS without using phenotypic information from the PS.
This approach implies that the CS is specific to a given trait and requires the identification of QTLs prior to CS optimization.
3.3 Search Algorithms for Optimal CS and Corresponding Packages
For most of the abovementioned criteria, it is not possible to directly determine the CS with the optimal value for the chosen criterion. For instance, there is no analytical way to determine the CS with the best CD, PEV or r value. Different iterative optimization algorithms were proposed based on exchanges of individuals between the CS and the remaining individuals to improve step by
step the criterion computed for a combination CS/PS. These algorithms can be simple exchange algorithms [83, 87, 117], genetic algorithms [99, 101, 120, 121] or differential exchange algorithms [121] (see Table 1 for the list of scripts available implementing different algorithms). Such iterative algorithms do not guarantee convergence toward the global optimum, and have to be run with different starting values and with sufficient iterations to reach a better CS than the initial one. One of the main limits of these criteria is that the search algorithm is computationally demanding for large datasets composed of thousands of individuals or beyond [99]. Approaches based on approximation of the PEV [123–125], including principal component analysis on the genotypic data [99], can reduce computational time. It would be interesting to include contrasts in this approach to optimize more specific prediction objectives.
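As a minimal sketch of the simple exchange algorithms mentioned above, the following Python code (NumPy assumed to be available) swaps one individual at a time between the CS and the remaining individuals and keeps a swap only when the criterion improves. The criterion is a simplified mean CD that ignores fixed effects, CD_i = G[i, CS] (G[CS, CS] + λI)^-1 G[CS, i] / G[i, i] with λ = σ²e/σ²g. This simplification, the function names, and the simulated genotypes are assumptions made for illustration; the code does not reproduce any of the scripts listed in Table 1.

```python
import numpy as np

def mean_cd(G, cs, lam):
    """Mean CD of the individuals outside `cs` (the PS), predicted from `cs`.

    Simplified GBLUP without fixed effects (an assumption of this sketch):
    CD_i = G[i, cs] (G[cs, cs] + lam * I)^-1 G[cs, i] / G[i, i],
    with lam = sigma2_e / sigma2_g.
    """
    ps = np.setdiff1d(np.arange(G.shape[0]), cs)
    G_cc = G[np.ix_(cs, cs)]
    G_pc = G[np.ix_(ps, cs)]
    sol = np.linalg.solve(G_cc + lam * np.eye(len(cs)), G_pc.T)
    var_hat = np.einsum("ij,ji->i", G_pc, sol)  # diagonal of G_pc @ sol
    return float(np.mean(var_hat / np.diag(G)[ps]))

def exchange_search(G, n_cs, lam, n_iter=300, seed=0):
    """Random single-swap exchange; returns (cs, starting value, final value)."""
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    cs = rng.choice(n, size=n_cs, replace=False)
    start = best = mean_cd(G, cs, lam)
    for _ in range(n_iter):
        candidate = cs.copy()
        pool = np.setdiff1d(np.arange(n), cs)
        # Replace one randomly chosen CS member by one non-CS individual.
        candidate[rng.integers(n_cs)] = rng.choice(pool)
        value = mean_cd(G, candidate, lam)
        if value > best:  # accept only improving exchanges
            cs, best = candidate, value
    return np.sort(cs), start, best

def simulated_grm(n=60, m=300, seed=1):
    """Toy genomic relationship matrix from random biallelic markers."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(2, 0.5, size=(n, m)).astype(float)
    W = X - X.mean(axis=0)
    return W @ W.T / m

G = simulated_grm()
cs, start, best = exchange_search(G, n_cs=15, lam=1.0)
print(f"mean CD: start={start:.3f}, after exchanges={best:.3f}")
```

Because such a search can only be expected to reach a local optimum, one would typically rerun it with several seeds and keep the best CS, in line with the remark above about different starting values.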
4 Focus on Some Specific Applications of CS Optimization
4.1 CS Optimization for Predicting Biparental Populations
Plant breeders mainly work with full-sib families, which is a specific case of population structure. Optimizing a CS is particularly challenging in this case because of the different LD phases and QTLs segregating between families. Considering a single family, the optimization of the CS can be done with the criteria based on genetic relatedness presented above. However, in Marulanda et al. [126] where CS optimization was applied within each family, all the tested criteria failed to optimize the CS. In this scenario, due to strong relatedness between full sibs, the improvement associated with CS optimization is expected to be limited in comparison to what can be observed with more diverse material. Apart from these simple within-family scenarios or the situations in which the parents involved in the different crosses are genetically close, the identification of families highly predictive of a target family is challenging [87, 127] even when the phenotypic variance and heritability of each family are known [128]. It is common that unrelated families result in negative prediction accuracy [19, 127], and so it is important to remove such families from the calibration set. To identify the best predictive families Schopp et al. [50] proposed criteria such as the proportion of shared segregating SNPs in the CS and the PS families (θ), the linkage phase similarity [40], or the simple matching coefficient [129]. θ was efficient to predict the accuracy when averaged over many traits, but was much less efficient when considering a given trait because of trait-specific genetic architecture. Brauner et al. [127] concluded that it was too risky to add unrelated families to the CS with regard to the potential gain in predictive ability, and so recommended to include only full and half sibs.
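To make two of the family-level criteria above more concrete, the short Python sketch below computes a proportion of shared segregating SNPs and a simple matching coefficient for biparental families derived from inbred parents coded 0/1. The text only describes θ as "the proportion of shared segregating SNP in the CS and the PS families"; taking the SNPs segregating in the PS family as the denominator is an assumption of this sketch, as are the function names and the toy genotypes.

```python
def segregating(parent_a, parent_b):
    """Indices of SNPs at which two inbred parents carry different alleles."""
    return {i for i, (a, b) in enumerate(zip(parent_a, parent_b)) if a != b}

def theta(cs_parents, ps_parents):
    """Share of the SNPs segregating in the PS family that also segregate
    in the CS family (assumed definition, see lead-in)."""
    seg_cs = segregating(*cs_parents)
    seg_ps = segregating(*ps_parents)
    return len(seg_cs & seg_ps) / len(seg_ps) if seg_ps else 0.0

def simple_matching(line_a, line_b):
    """Proportion of SNPs at which two inbred lines carry the same allele."""
    return sum(a == b for a, b in zip(line_a, line_b)) / len(line_a)

# Toy example: parents of the PS family (p1 x p2) and of a CS family (p3 x p4).
p1 = [0, 0, 1, 1, 0, 1]
p2 = [1, 0, 0, 1, 1, 1]
p3 = [0, 1, 1, 0, 0, 1]
p4 = [1, 1, 0, 0, 0, 1]

print(theta((p3, p4), (p1, p2)))   # 2 of the 3 PS-segregating SNPs are shared
print(simple_matching(p1, p3))     # 4 of 6 SNPs match
```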
Rincent et al. [87] proposed a criterion (CDpop) derived from the generalized CD to predict the prediction accuracy within a given family when using as CS individuals sampled from one or several other families. This criterion was able to predict the observed prediction accuracy quite accurately, and was efficient to optimize CS specifically designed to predict a given family. The prediction accuracies were on average much higher with CDpop than with random sampling. However, this study was based on families of half-sibs (NAM, [39]) and CDpop has not been tested yet on unrelated families.
4.2 CS Optimization or Update When Phenotypes Are Already Available
The criteria introduced in the previous parts were mostly proposed to optimize the composition of the CS prior to any phenotyping. Breeders are, however, facing situations in which some individuals have already been phenotyped, for instance when the CS has to be selected from previous breeding cycles. In these situations, the information provided by the phenotypes may be used to improve the composition of the CS. This would be valuable in two situations: the regular update of the CS along breeding cycles, or the selection of phenotypes from historical data.
4.2.1 Updating the CS
Prediction accuracy decreases over time in successive breeding cycles because of the lower genetic similarity and increased discrepancy of segregating QTLs between the CS and the PS [28–35]. This makes it necessary to regularly update the CS by phenotyping additional individuals. The new individuals to include in the CS can be selected with the abovementioned criteria, but the phenotypic data collected in previous cycles could also help update the CS. Neyhart et al. [130] and Brandariz and Bernardo [131] proposed to update the CS with the individuals with the best and worst GEBV in the previous generation(s). Simulations showed that it resulted in higher prediction accuracy than random sampling, PEVmean or CDmean. The efficiency of this approach was illustrated in various experimental studies [132–135]. The efficiency of this strategy is presumably due to the maximization of the number of segregating QTLs in the CS.
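The "best and worst GEBV" update can be sketched in a few lines of Python. The function name, the toy GEBV values, and the equal split between the two tails are assumptions of this illustration; the text does not specify how many individuals are taken from each tail.

```python
def best_and_worst(gebv, n_add):
    """Pick n_add individuals from the two tails of the GEBV distribution.

    gebv: dict mapping individual ID -> GEBV from the previous generation(s).
    Half of the picks come from the lowest GEBV and the rest from the highest
    (the equal split is an assumption of this sketch).
    """
    if n_add > len(gebv):
        raise ValueError("n_add exceeds the number of candidates")
    ranked = sorted(gebv, key=gebv.get)      # ascending GEBV
    n_low = n_add // 2
    n_high = n_add - n_low
    return ranked[:n_low] + ranked[len(ranked) - n_high:]

gebv = {"L1": 0.4, "L2": -1.2, "L3": 2.1, "L4": 0.0, "L5": -0.3, "L6": 1.5}
print(best_and_worst(gebv, 4))   # ['L2', 'L5', 'L6', 'L3']
```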
4.2.2 Subsampling Historical Phenotypic Records
Breeders have access to important phenotypic data collected year after year that can be used to calibrate the GS model. It was, however, shown that subsampling part of the available phenotypic data can improve the predictive ability in comparison to using the full dataset. The presence of genetically distant individuals can indeed decrease predictive ability [23]. This subsampling can be done with classical criteria such as PEVmean, CDmean, or r derived from the GBLUP, but they cannot be used to determine the optimal CS size as they always improve when adding additional individuals. They can, however, be used to determine the
composition and size of the CS after which the criterion only marginally improves [101]. Criteria such as EthAcc [119] do not present the same limitation, but their use in practice is hindered by the poor ability of the search algorithm to identify the optimal CS without including the PS phenotypes. Another option is to determine a CS specific to each predicted individual by selecting its most related individuals [23] or by optimizing criteria based on PEV(c) or CD(c) (PEVmean1, [103]). With PEVmean1, a CS is specifically designed for each PS individual by minimizing its individual PEV. Predictive abilities obtained with PEVmean1 were generally similar to those obtained with PEVmean, but higher for small CS. De los Campos and Lopez-Cruz [136] have formalized an approach in which a penalty is used to set to zero the contribution of some individuals to the prediction. They showed that it could significantly increase predictive ability when the penalty coefficient is well determined.
4.2.3 Optimizing the Choice of Individuals to Be Genotyped
In all the optimization approaches presented above, it was supposed that genotypic information was available for all CS individuals. It can, however, happen that only part of the individuals with historical phenotypic data have been genotyped, and in this case it could be valuable to genotype some additional key individuals to improve the predictions. This selection can be guided by the phenotypic data or the pedigree. Boligon et al. [133] and Michel et al. [134] have proposed to apply the “best and worst individuals” sampling strategy to identify the individuals that should be preferentially genotyped. Maenhout et al. [137] have used the generalized CD (computed with pedigree) to improve the subsampling of historical data by taking into account the balance (number of replicates of each variety) and the connectedness between individuals (disconnectedness can be present when unrelated individuals are evaluated in distinct trials). Bartholomé et al. [138] proposed a two-step strategy involving pedigree information and simulations.
4.3 Optimization of the Calibration Set in the Context of Hybrid Breeding
For many plant and animal species, commercial products are hybrids between individuals from different genetic groups (different breeds or heterotic groups). In animal species such as pigs or poultry, even if the commercial products are hybrids, the conventional selection is often done at the purebred level and hybrid performances are seldom considered. With the advent of GS, several studies investigated the interest of accounting for crossbred performances in the CS in addition to or instead of purebred performances. Recently, Wientjes et al. [139] explored how to optimize the CS in this context using simulations but focused mainly on the crossing design used to generate the crossbred individuals from the purebreds and not on the composition of the crossbred CS itself. For allogamous plant species such as maize or sunflower, the breeding objective is to produce single-cross hybrid varieties from two
inbred lines, each selected in complementary groups. In this context, the total number of potential single-cross hybrids is very large (N1 × N2, if N1 and N2 are the numbers of inbred lines in group 1 and 2 respectively) and all of them cannot be evaluated. Classically, the genetic value of a hybrid is decomposed as the sum of the general combining abilities (GCAs) of each of its parental lines (i.e., the average performances of the hybrid progeny generated by crossing one parental line to the lines of the other group) and the Specific Combining Ability (SCA) of the cross (i.e., the complementarity between the two parental lines). In 1994, Bernardo [140] proposed to use molecular markers to compute covariances between the GCAs of parental lines in each group and between SCAs of intergroup hybrids to predict performances of nonphenotyped hybrids from phenotyped ones. It was the first application of GS in plants. Genomic selection is particularly interesting in this context since the genotypes of all potential hybrids can be derived from the genotypes of inbred lines. This offers the possibility to use genotypes of inbred lines to (1) predict the GCA of each candidate line, evaluated or not as hybrid, and (2) directly predict all potential single-cross hybrid values (GCAs+SCA) to identify the most promising varieties. First optimization approaches of the CS based on empirical data highlighted that the qualities of prediction of new hybrids were higher when the CS and PS hybrids shared common parental lines, that is when the new hybrids derived from parental lines that contributed to the CS hybrids [141–143]. However, there is a trade-off between the number of hybrid contributions per candidate line and the total number of lines that can contribute to the CS [142].
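The GCA/SCA decomposition described above can be illustrated on a toy, fully observed factorial. The sketch below assumes a complete and balanced N1 × N2 table of hybrid values and computes GCAs as deviations of row and column means from the grand mean, with the SCA as the remaining interaction term. This fixed-effect arithmetic, the function name, and the numbers are assumptions for illustration only; in genomic prediction the GCAs and SCAs are predicted as random effects with marker-based covariances, which is not shown here.

```python
def gca_sca(table):
    """Decompose a complete N1 x N2 table of hybrid values.

    table[i][j] is the value of the hybrid between line i of group 1 and
    line j of group 2. Returns (mu, gca1, gca2, sca) such that
    table[i][j] == mu + gca1[i] + gca2[j] + sca[i][j].
    """
    n1, n2 = len(table), len(table[0])
    mu = sum(map(sum, table)) / (n1 * n2)
    gca1 = [sum(row) / n2 - mu for row in table]
    gca2 = [sum(table[i][j] for i in range(n1)) / n1 - mu for j in range(n2)]
    sca = [[table[i][j] - mu - gca1[i] - gca2[j] for j in range(n2)]
           for i in range(n1)]
    return mu, gca1, gca2, sca

# Two lines per group give N1 x N2 = 4 potential single-cross hybrids.
hybrids = [[10.0, 12.0],
           [14.0, 20.0]]
mu, gca1, gca2, sca = gca_sca(hybrids)
print(mu, gca1, gca2, sca)
```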
This trade-off depends on the proportion of SCA relative to the GCA, the total number of hybrids that can be evaluated and on the prediction objective: the prediction of new hybrid combinations between new lines (T0 hybrids) or the prediction of new hybrids between lines that contributed to the CS (T1 or T2 hybrids when respectively one or two of the parental lines are parents of some CS hybrids) [144]. Studies based on real [142], and simulated data [144] showed that increasing the number of lines contributing to the CS at the expense of the number of hybrids evaluated per line is beneficial for better predicting T0 hybrids. However, doing so decreases the total number of T0 hybrids among the whole set of potential hybrids, so the optimal solution over all categories of hybrids depends on the percentage of hybrids that can be phenotyped. This advantage is also reduced when the percentage of SCA is high since the accuracy of SCA prediction decreases when inbred lines are only evaluated in one single CS hybrid. When the objective is to predict the hybrid values in the next generations, increasing the number of lines in the CS at the expense of their contribution is generally the best solution (unless a large percentage of the variance is due to SCA). Recently, Guo et al.
[85] proposed a strategy called MaxCD (Maximization of Connectedness and Diversity). In this strategy, a representative subset of parental lines is first selected from patterns detected in the inbred line genomic relationship matrix. From these lines, a set of hybrids with nonoverlapping parental lines is defined and combined with a set of hybrids issued from pairs of inbred lines most distant from each other. The idea is to represent in the CS the expected diversity of the whole set of hybrids. Besides these empirical optimizations, other criteria such as those based on PEV and CD were proposed recently. Momen and Morota [98] extended the CD and PEV to include nonadditive effects. In a model including additive and dominance effects they proposed to use a multikernel approach for the predictions and to use as K matrix in the CD and PEV, a linear combination of the additive and dominance relationship matrices (G and D) each weighted by the proportion of variance associated with these variance components, that is,
$$K = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_D^2}\, G + \frac{\sigma_D^2}{\sigma_A^2 + \sigma_D^2}\, D$$
They evaluated the link between the CD and the genomic prediction accuracies in an animal breeding context using simulations and real pig data. Based on their results they proposed to use the CD for optimizing the CS. Note, however, that they did not consider a hybrid design between unrelated populations and therefore assumed in their prediction model that there was only a single additive variance component and a dominance variance component, which does not correspond to the decomposition of hybrid value in terms of GCA and SCA commonly used for factorial designs. Fritsche Neto et al. [145] used the same formalism to evaluate the interest of genomic selection in different maize hybrid designs and optimized the CS using PEV. They used historical data of variance component estimation to weigh the proportion of additive and dominance variance in PEV computation and also considered, as a benchmark, PEV based on additivity only. Their results showed the interest in using PEV to optimize the hybrid CS, but not the interest of considering dominance for its computation. In agreement with empirical optimization, they found that an optimal hybrid CS should involve as many parental lines as possible. More recently, Heslot and Feoktistov [122] also confirmed on sunflower data the interest of optimizing the hybrid CS using PEV based on a single additive variance. Kadam et al. [93] used an individual CD to identify, among all potential hybrids that could be produced from segregating families, those to be phenotyped and included in the CS. They confirmed the interest in using these criteria (individual CD or PEV) for optimizing the CS compared to the use of stratified sampling. Akdemir et al. [121] proposed to
choose wheat hybrids to be included in the CS to best predict the remaining hybrids by maximizing the worst individual CD of the PS (CDmin) and showed its interest relative to random sampling. To our knowledge, no optimization study has been based so far on CD or PEV of contrasts (CDmean and PEVmean) and questions remain on the extension of these criteria using a GCA / SCA formalism.
4.4 Optimization of the Phenotypic Evaluation of the Calibration Set
In terms of optimization of the CS, beyond its composition, a key question is the optimization of the experimental design for its evaluation in upcoming experiments or, if the CS is based on historical data, the choice of the phenotypic data that should be included in the model calibration process. Optimization of the phenotyping design is a classical question in plant breeding as a compromise must be found between the number of individuals to phenotype, which has a direct impact on the selection intensity, and the phenotyping effort: number of traits measured, number of replicates within each field trial, and the number of field trials [146]. Marker-assisted selection (MAS), allowing selecting nonphenotyped individuals using marker-based predictions, leads to a different optimal resource allocation compared to phenotypic selection. In MAS, phenotypes are mostly used to estimate marker effects and detect QTLs. As population size plays a major role in determining the power of QTL detection, optimal resource allocation strategies for QTL-based MAS are to phenotype a larger number of individuals but with a lower number of replications per individual compared to phenotypic selection [147]. The first attempts to optimize the experimental design for phenotyping the CS focused on selection within a given biparental population. Those approaches were based on simulations [81] and/or a deterministic formula of the expected accuracy of GS adapted from Daetwyler et al. [48]:
$$r(g, \hat{g}) = \sqrt{\frac{N h^2}{N h^2 + M_e}}$$
where $N$ is the size of the CS, $h^2$ is the trait heritability at the design level (which depends on the individual plot heritability, the number of plots and the GxE variance component) and $M_e$ corresponds to the number of independent loci segregating in the population. This formula assumes that the accuracy of prediction does not depend on CS composition.
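The deterministic formula above lends itself to a quick numerical exploration of the trade-off between the number of individuals and the number of replicates. In the Python sketch below, `expected_accuracy` implements the formula as stated, while `design_heritability` uses the usual entry-mean expression σ²g / (σ²g + σ²ge/n_env + σ²e/(n_env × n_rep)); that expression, the variance components, and the plot budget are assumptions of this example and are not taken from the cited studies. The value of 30 for the number of independent loci is the order of magnitude quoted in this section for a single biparental maize family.

```python
import math

def design_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Entry-mean heritability for n_env trials with n_rep plots per trial
    (standard expression, assumed here as the design-level h2)."""
    return var_g / (var_g + var_ge / n_env + var_e / (n_env * n_rep))

def expected_accuracy(n_cs, h2, m_e):
    """Expected accuracy sqrt(N h2 / (N h2 + Me)) of genomic prediction."""
    return math.sqrt(n_cs * h2 / (n_cs * h2 + m_e))

# Fixed budget of 400 plots over 2 environments: many individuals with one
# plot per environment, or half as many individuals with two plots each.
for n_cs, n_rep in [(200, 1), (100, 2)]:
    h2 = design_heritability(var_g=1.0, var_ge=1.0, var_e=2.0,
                             n_env=2, n_rep=n_rep)
    r = expected_accuracy(n_cs, h2, m_e=30)
    print(f"N={n_cs}, reps={n_rep}: h2={h2:.2f}, expected r={r:.2f}")
```

For instance, with N = 100, h2 = 0.5 and Me = 30, the formula gives sqrt(50/80), that is, an expected accuracy of about 0.79.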
When considering a segregating population where the LD is only due to cosegregation, Me can be approximated from the number of chromosomes and the expected number of recombination events along chromosomes. Both Lorenz [81] and Riedelsheimer and Melchinger [148] therefore considered an Me value around 30 for a single biparental segregating family of
maize. Endelman et al. [149] estimated Me on two real data sets of barley and maize and used this estimate to derive expected accuracies for the optimization process. In GS, phenotypic data are used to calibrate prediction equations with little concern on the accuracy of each marker effect estimation compared to MAS. So even if the prediction accuracy of untested individuals increases with the CS size, it plateaus more quickly than for MAS giving more flexibility in terms of design in the trade-off between the number of individuals evaluated and the number of replicates. Riedelsheimer and Melchinger [148] extended the approach by (1) considering the prediction accuracy of untested individuals but also of the tested individuals included in the CS to predict the genetic gain and (2) by taking into account GxE when optimizing the number of environments in which the CS is evaluated. Endelman et al. [149] showed that an efficient strategy is to combine GS and sparse designs in which different subsets of CS individuals are phenotyped in each trial, reducing the total number of plots needed without reducing the number of phenotyped individuals nor the number of locations. Other optimization approaches [150, 151] also studied optimal resource allocation for the phenotyping of the CS using deterministic simulations but instead of studying the impact of the resource allocation on the GS accuracy, they considered it as an entry parameter. Jarquin et al. [115] using maize experimental data confirmed the interest in using genomic prediction models including GxE effects with sparse designs in which most genotypes are evaluated in only one trial. They, nevertheless, recommended having a small percentage of individuals common to the different trials. All the abovementioned approaches aim at optimizing the phenotyping for a next to come population of candidates considering that part of them will be phenotyped to predict the remaining ones. 
They did not consider the genotypic information of the candidates when choosing among them which individuals should be included in the CS at a fixed CS size. More recently, Atanda et al. [86] extended the use of the CDmean proposed by Rincent et al. [83] to this purpose in a maize data set composed of segregating families. They considered two different phenotypic designs: sparse testing (ST) design where all candidates of the targeted family are evaluated but each in only one trial and another strategy where only half of the candidates of the targeted family (HFS) are evaluated in all field trials. In both cases, they showed that CDmean efficiently selects the subset of individuals to be evaluated in each trial in ST designs and which individuals should be evaluated in the targeted family to predict the remaining ones in the HFS design. Extensions of this approach, considering phenotypes in different trials as correlated traits, showed the interest of using multitrait CD to optimize the allocation of CS individuals to different field trials [116, 121]. This opens the way to combine optimization of CS with optimal resource allocation.
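To make the sparse testing layout discussed above concrete, the sketch below allocates candidates to trials so that each is evaluated in exactly one trial, apart from a small set of individuals common to all trials. The allocation is purely random: it is the kind of baseline that a criterion such as CDmean or a multitrait CD would replace with an optimized choice. The function name, the number of common individuals, and the candidate labels are assumptions of this example.

```python
import random

def sparse_allocation(candidates, n_trials, n_common=0, seed=0):
    """Randomly allocate candidates to trials in a sparse testing design.

    n_common individuals are evaluated in every trial; each remaining
    individual is evaluated in exactly one trial (round-robin after shuffling).
    """
    rng = random.Random(seed)
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    common, rest = shuffled[:n_common], shuffled[n_common:]
    trials = [list(common) for _ in range(n_trials)]
    for k, individual in enumerate(rest):
        trials[k % n_trials].append(individual)
    return trials

candidates = [f"G{i}" for i in range(20)]
trials = sparse_allocation(candidates, n_trials=4, n_common=2)
plots = sum(len(t) for t in trials)
print(f"{plots} plots instead of {len(candidates) * 4} for a complete design")
```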
A step forward in the optimization would be to fully integrate the optimization of the CS with the optimization of the experimental design up to the plot allocation of individuals in each field trial. Recently, Cullis et al. [152] showed by simulation that partially repeated field trial designs, optimized using “model-based design” and considering genetic relatedness between genotypes based on pedigree, increased the prediction accuracy of their genetic values. The optimization was based on a sum of the PEV of all pairwise contrasts between the genetic values of the individuals which ensured an efficient comparison between all of them. Ideally, it would be interesting to extend this approach for optimizing experimental design and CS composition for the prediction of individuals in the PS. This would require efficient optimization processes to jointly address these two issues.
5 Conclusion and Prospects
The practical implementation of a new tool in breeding mainly depends on the balance between costs and benefits. In this regard, the optimization of the experimental designs and in particular the optimization of the calibration set in genomic prediction is essential because it can reduce costs and increase benefits [153]. CS composition optimized with the criteria presented here most of the time resulted in higher prediction accuracy than random CS. The choice of the appropriate criterion depends on many factors including the prediction objectives, the population structure, the genetic architecture of the trait and the type of data available (e.g., PS individuals genotyped or not). In any case, there is no universal CS that would be optimal for any genetic material and any trait. We emphasize that it is fundamental to take the genotypic information of the PS into account when available to optimize the CS. Criteria such as CD, PEV, or r should be further investigated to address other questions such as the optimization of the CS for predicting hybrids or crosses that have not been produced yet [93, 122]. Another application in a plant breeding context would be to optimize jointly the CS size, its composition as well as the phenotypic design for each individual (we can suppose that it might be beneficial to phenotype more deeply key individuals). Another issue that should be taken care of is the effect of the composition of the CS on the loss of diversity in the breeding population. Eynard et al. [154] have indeed shown that the way of updating the CS affected the genetic diversity of the breeding population along cycles, maybe because reducing the diversity within the CS can result in fixing some of the QTLs. The effect of CS optimized with the abovementioned criteria on this potential loss of diversity has not been studied yet. A CS constrained optimization procedure that combines both objectives by maximizing
predictive ability while controlling the loss of diversity would be valuable; this has not yet been addressed in the literature.
References
1. Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447. https://doi.org/10.2307/2529430
2. Tsairidou S, Woolliams JA, Allen AR, Skuce RA, McBride SH, Wright DM, Bermingham ML, Pong-Wong R, Matika O, McDowell SWJ, Glass EJ, Bishop SC (2014) Genomic prediction for tuberculosis resistance in dairy cattle. PLoS One 9:e96728. https://doi.org/10.1371/journal.pone.0096728
3. Daetwyler HD, Villanueva B, Bijma P, Woolliams JA (2007) Inbreeding in genome-wide selection. J Anim Breed Genet 124:369–376. https://doi.org/10.1111/j.1439-0388.2007.00693.x
4. Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829. https://doi.org/10.1093/genetics/157.4.1819
5. Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183:347–363. https://doi.org/10.1534/genetics.109.103952
6. Gianola D (2013) Priors in whole-genome regression: the Bayesian alphabet returns. Genetics 194:573–596. https://doi.org/10.1534/genetics.113.151753
7. Hayes BJ, Bowman PJ, Chamberlain AC, Verbyla K, Goddard ME (2009) Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet Sel Evol 41:51. https://doi.org/10.1186/1297-9686-41-51
8. Pryce JE, Gredler B, Bolormaa S, Bowman PJ, Egger-Danner C, Fuerst C, Emmerling R, Sölkner J, Goddard ME, Hayes BJ (2011) Short communication: genomic selection using a multi-breed, across-country reference population. J Dairy Sci 94:2625–2630. https://doi.org/10.3168/jds.2010-3719
9. Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME (2012) Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95:4114–4129. https://doi.org/10.3168/jds.2011-5019
10. Daetwyler HD, Kemper KE, van der Werf JHJ, Hayes BJ (2012) Components of the accuracy of genomic prediction in a multibreed sheep population. J Anim Sci 90:3375–3384. https://doi.org/10.2527/jas.2011-4557
11. Windhausen VS, Atlin GN, Hickey JM, Crossa J, Jannink J-L, Sorrells ME, Raman B, Cairns JE, Tarekegne A, Semagn K, Beyene Y, Grudloyma P, Technow F, Riedelsheimer C, Melchinger AE (2012) Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 (Bethesda) 2:1427–1436. https://doi.org/10.1534/g3.112.003699
12. Rio S, Mary-Huard T, Moreau L, Charcosset A (2019) Genomic selection efficiency and a priori estimation of accuracy in a structured dent maize panel. Theor Appl Genet 132:81–96. https://doi.org/10.1007/s00122-018-3196-1
13. Duhnen A, Gras A, Teyssèdre S, Romestant M, Claustres B, Daydé J, Mangin B (2017) Genomic selection for yield and seed protein content in soybean: a study of breeding program data and assessment of prediction accuracy. Crop Sci 57:1325–1337. https://doi.org/10.2135/cropsci2016.06.0496
14. Lorenz AJ, Smith KP, Jannink J-L (2012) Potential and optimization of genomic selection for fusarium head blight resistance in six-row barley. Crop Sci 52:1609–1621. https://doi.org/10.2135/cropsci2011.09.0503
15. Rio S, Gallego-Sánchez L, Montilla-Bascón G, Canales FJ, Isidro y Sánchez J, Prats E (2021) Genomic prediction and training set optimization in a structured Mediterranean oat population. Theor Appl Genet 134:3595–3609. https://doi.org/10.1007/s00122-021-03916-w
16. Guo Z, Tucker DM, Basten CJ, Gandhi H, Ersoz E, Guo B, Xu Z, Wang D, Gay G (2014) The impact of population structure on genomic prediction in stratified populations. Theor Appl Genet 127:749–762. https://doi.org/10.1007/s00122-013-2255-x
17. Legarra A, Robert-Granié C, Manfredi E, Elsen J-M (2008) Performance of genomic selection in mice. Genetics 180:611–618. https://doi.org/10.1534/genetics.108.088575
18. Albrecht T, Wimmer V, Auinger H-J, Erbe M, Knaak C, Ouzunova M, Simianer H, Schön C-C (2011) Genome-based prediction of testcross values in maize. Theor Appl Genet 123:339. https://doi.org/10.1007/s00122-011-1587-7
19. Lehermeier C, Krämer N, Bauer E, Bauland C, Camisan C, Campo L, Flament P, Melchinger AE, Menz M, Meyer N, Moreau L, Moreno-González J, Ouzunova M, Pausch H, Ranc N, Schipprack W, Schönleben M, Walter H, Charcosset A, Schön C-C (2014) Usefulness of multiparental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198:3–16. https://doi.org/10.1534/genetics.114.161943
20. Herter CP, Ebmeyer E, Kollers S, Korzun V, Würschum T, Miedaner T (2019) Accuracy of within- and among-family genomic prediction for fusarium head blight and Septoria tritici blotch in winter wheat. Theor Appl Genet 132:1121–1135. https://doi.org/10.1007/s00122-018-3264-6
21. Nielsen NH, Jahoor A, Jensen JD, Orabi J, Cericola F, Edriss V, Jensen J (2016) Genomic prediction of seed quality traits using advanced barley breeding lines. PLoS One 11:e0164494. https://doi.org/10.1371/journal.pone.0164494
22. Würschum T, Maurer HP, Weissmann S, Hahn V, Leiser WL (2017) Accuracy of within- and among-family genomic prediction in triticale. Plant Breed 136:230–236. https://doi.org/10.1111/pbr.12465
23. Lorenz AJ, Smith KP (2015) Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci 55:2657–2667. https://doi.org/10.2135/cropsci2014.12.0827
24. Riedelsheimer C, Endelman JB, Stange M, Sorrells ME, Jannink J-L, Melchinger AE (2013) Genomic predictability of interconnected biparental maize populations. Genetics 194:493–503. https://doi.org/10.1534/genetics.113.150227
25. Hidalgo AM, Bastiaansen JWM, Lopes MS, Harlizius B, Groenen MAM, de Koning D-J (2015) Accuracy of predicted genomic breeding values in purebred and crossbred pigs. G3 (Bethesda) 5:1575–1583. https://doi.org/10.1534/g3.115.018119
26. Rio S, Moreau L, Charcosset A, Mary-Huard T (2020) Accounting for group-specific allele effects and admixture in genomic predictions: theory and experimental evaluation in maize. Genetics 216:27–41. https://doi.org/10.1534/genetics.120.303278
27. Olatoye MO, Clark LV, Labonte NR, Dong H, Dwiyanti MS, Anzoua KG, Brummer JE, Ghimire BK, Dzyubenko E, Dzyubenko N, Bagmet L, Sabitov A, Chebukin P, Głowacka K, Heo K, Jin X, Nagano H, Peng J, Yu CY, Yoo JH, Zhao H, Long SP, Yamada T, Sacks EJ, Lipka AE (2020) Training population optimization for genomic selection in Miscanthus. G3 (Bethesda) 10:2465–2476. https://doi.org/10.1534/g3.120.401402
28. Pszczola M, Calus MPL (2016) Updating the reference population to achieve constant genomic prediction reliability across generations. Animal 10:1018–1024. https://doi.org/10.1017/S1751731115002785
29. Castro Dias Cuyabano B, Wackel H, Shin D, Gondro C (2019) A study of genomic prediction across generations of two Korean pig populations. Animals 9:672. https://doi.org/10.3390/ani9090672
30. Hofheinz N, Borchardt D, Weissleder K, Frisch M (2012) Genome-based prediction of test cross performance in two subsequent breeding cycles. Theor Appl Genet 125:1639–1645. https://doi.org/10.1007/s00122-012-1940-5
31. Li X, Wei Y, Acharya A, Hansen JL, Crawford JL, Viands DR, Michaud R, Claessens A, Brummer EC (2015) Genomic prediction of biomass yield in two selection cycles of a tetraploid alfalfa breeding population. Plant Genome 8:plantgenome2014.12.0090. https://doi.org/10.3835/plantgenome2014.12.0090
32. Wang N, Wang H, Zhang A, Liu Y, Yu D, Hao Z, Ilut D, Glaubitz JC, Gao Y, Jones E, Olsen M, Li X, San Vicente F, Prasanna BM, Crossa J, Pérez-Rodríguez P, Zhang X (2020) Genomic prediction across years in a maize doubled haploid breeding program to accelerate early-stage testcross testing. Theor Appl Genet 133:2869–2879. https://doi.org/10.1007/s00122-020-03638-5
33. Michel S, Ametz C, Gungor H, Epure D, Grausgruber H, Löschenberger F, Buerstmayr H (2016) Genomic selection across multiple breeding cycles in applied bread wheat breeding. Theor Appl Genet 129:1179–1189. https://doi.org/10.1007/s00122-016-2694-2
34. Sallam AH, Endelman JB, Jannink J-L, Smith KP (2015) Assessing genomic selection prediction accuracy in a dynamic barley breeding population. Plant Genome 8:plantgenome2014.05.0020. https://doi.org/10.3835/plantgenome2014.05.0020
35. Auinger H-J, Schönleben M, Lehermeier C, Schmidt M, Korzun V, Geiger HH, Piepho H-P, Gordillo A, Wilde P, Bauer E, Schön C-C (2016) Model training across multiple breeding cycles significantly improves genomic prediction accuracy in rye (Secale cereale L.). Theor Appl Genet 129:2043–2053. https://doi.org/10.1007/s00122-016-2756-5
36. de Roos APW, Hayes BJ, Goddard ME (2009) Reliability of genomic predictions across multiple populations. Genetics 183:1545–1553. https://doi.org/10.1534/genetics.109.104935
37. Hill WG, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38:226–231. https://doi.org/10.1007/BF01245622
38. Wright S (1949) The genetical structure of populations. Ann Eugenics 15:323–354. https://doi.org/10.1111/j.1469-1809.1949.tb02451.x
39. Bauer E, Falque M, Walter H, Bauland C, Camisan C, Campo L, Meyer N, Ranc N, Rincent R, Schipprack W, Altmann T, Flament P, Melchinger AE, Menz M, Moreno-González J, Ouzunova M, Revilla P, Charcosset A, Martin OC, Schön C-C (2013) Intraspecific variation of recombination rate in maize. Genome Biol 14:R103. https://doi.org/10.1186/gb-2013-14-9-r103
40. de Roos APW, Hayes BJ, Spelman RJ, Goddard ME (2008) Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179:1503–1512. https://doi.org/10.1534/genetics.107.084301
41. Porto-Neto LR, Kijas JW, Reverter A (2014) The extent of linkage disequilibrium in beef cattle breeds using high-density SNP genotypes. Genet Sel Evol 46:22. https://doi.org/10.1186/1297-9686-46-22
42. Badke YM, Bates RO, Ernst CW, Schwab C, Steibel JP (2012) Estimation of linkage disequilibrium in four US pig breeds. BMC Genomics 13:24. https://doi.org/10.1186/1471-2164-13-24
43.
Heifetz EM, Fulton JE, O’Sullivan N, Zhao H, Dekkers JCM, Soller M (2005) Extent and consistency across generations of linkage disequilibrium in commercial layer chicken breeding populations. Genetics 171: 1173–1181. https://doi.org/10.1534/ genetics.105.040782
44. Van Inghelandt D, Reif JC, Dhillon BS, Flament P, Melchinger AE (2011) Extent and genome-wide distribution of linkage disequilibrium in commercial maize germplasm. Theor Appl Genet 123:11–20. https://doi. org/10.1007/s00122-011-1562-3 45. Technow F, Riedelsheimer C, Schrag TA, Melchinger AE (2012) Genomic prediction of hybrid performance in maize with models incorporating dominance and population specific marker effects. Theor Appl Genet 125: 1181–1194. https://doi.org/10.1007/ s00122-012-1905-8 46. Hao C, Wang L, Ge H, Dong Y, Zhang X (2011) Genetic diversity and linkage disequilibrium in Chinese bread wheat (Triticum aestivum L.) revealed by SSR markers. PLoS One 6:e17279. https://doi.org/10.1371/jour nal.pone.0017279 47. Iba´ne˜z-Escriche N, Fernando RL, Toosi A, Dekkers JC (2009) Genomic selection of purebreds for crossbred performance. Genet Sel Evol 41:12. https://doi.org/10.1186/ 1297-9686-41-12 48. Daetwyler HD, Villanueva B, Woolliams JA (2008) Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3:e3395. https://doi.org/10. 1371/journal.pone.0003395 49. Wientjes YC, Calus MP, Goddard ME, Hayes BJ (2015) Impact of QTL properties on the accuracy of multi-breed genomic prediction. Genet Sel Evol 47:42. https://doi.org/10. 1186/s12711-015-0124-6 50. Schopp P, Mu¨ller D, Wientjes YCJ, Melchinger AE (2017) Genomic prediction within and across Biparental families: means and variances of prediction accuracy and usefulness of deterministic equations. G3 (Bethesda) 7: 3571–3586. https://doi.org/10.1534/g3. 117.300076 51. Scutari M, Mackay I, Balding D (2016) Using genetic distance to infer the accuracy of genomic prediction. PLoS Genet 12:e1006288. https://doi.org/10.1371/journal.pgen. 1006288 52. Varona L, Legarra A, Toro MA, Vitezica ZG (2018) Non-additive effects in genomic selection. Front Genet 9:78. https://doi.org/10. 3389/fgene.2018.00078 53. 
Hill WG, Goddard ME, Visscher PM (2008) Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet 4:e1000008. https://doi.org/10. 1371/journal.pgen.1000008 54. Vitezica ZG, Varona L, Legarra A (2013) On the additive and dominant variance and
Optimization of Genomic Prediction Calibration Set covariance of individuals within the genomic selection scope. Genetics 195:1223–1230. https://doi.org/10.1534/genetics.113. 155176 55. Wientjes YCJ, Bijma P, Vandenplas J, Calus MPL (2017) Multi-population genomic relationships for estimating current genetic variances within and genetic correlations between populations. Genetics 207: 503–515. https://doi.org/10.1534/genet ics.117.300152 56. Wientjes YCJ, Calus MPL, Duenk P, Bijma P (2018) Required properties for markers used to calculate unbiased estimates of the genetic correlation between populations. Genet Sel Evol 50:65. https://doi.org/10.1186/ s12711-018-0434-6 57. Thompson EA (2013) Identity by descent: variation in meiosis, across genomes, and in populations. Genetics 194:301–326. https:// doi.org/10.1534/genetics.112.148825 58. Speed D, Balding DJ (2015) Relatedness in the post-genomic era: is it still useful? Nat Rev Genet 16:33–44. https://doi.org/10.1038/ nrg3821 59. Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397. https:// doi.org/10.1534/genetics.107.081190 60. Habier D, Tetens J, Seefried F-R, Lichtner P, Thaller G (2010) The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet Sel Evol 42:5. https://doi.org/10.1186/12979686-42-5 61. Habier D, Fernando RL, Garrick DJ (2013) Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194:597–607. https://doi.org/10.1534/ genetics.113.152207 62. Zhong S, Dekkers JCM, Fernando RL, Jannink J-L (2009) Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182:355–364. https:// doi.org/10.1534/genetics.108.098277 63. Jannink J-L, Lorenz AJ, Iwata H (2010) Genomic selection in plant breeding: from theory to practice. Brief Funct Genomics 9: 166–177. 
https://doi.org/10.1093/bfgp/ elq001 64. Clark SA, Hickey JM, Daetwyler HD, van der Werf JH (2012) The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock
107
breeding schemes. Genet Sel Evol 44:4. https://doi.org/10.1186/1297-9686-44-4 65. Wientjes YCJ, Veerkamp RF, Calus MPL (2013) The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193:621–631. https://doi.org/10.1534/genetics.112. 146290 66. Pszczola M, Strabel T, Mulder HA, Calus MPL (2012) Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci 95:389–400. https://doi.org/10. 3168/jds.2011-4338 67. Albrecht T, Auinger H-J, Wimmer V, Ogutu JO, Knaak C, Ouzunova M, Piepho H-P, Scho¨n C-C (2014) Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor Appl Genet 127:1375–1386. https://doi. org/10.1007/s00122-014-2305-z 68. Cros D, Denis M, Sa´nchez L, Cochard B, Flori A, Durand-Gasselin T, Nouy B, Omore´ A, Pomie`s V, Riou V, Suryana E, Bouvet J-M (2015) Genomic selection prediction accuracy in a perennial crop: case study of oil palm (Elaeis guineensis Jacq.). Theor Appl Genet 128:397–410. https://doi.org/10. 1007/s00122-014-2439-z 69. Vitezica ZG, Legarra A, Toro MA, Varona L (2017) Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics 206:1297–1307. https://doi.org/10.1534/genetics.116. 199406 70. Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res 91:47–60. https://doi.org/10.1017/ S0016672308009981 71. Goddard M (2009) Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136:245–257. https://doi.org/10.1007/s10709-0089308-0 72. Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128:409–421. https:// doi.org/10.1111/j.1439-0388.2011. 00964.x 73. 
Elsen J-M (2016) Approximated prediction of genomic selection accuracy when reference and candidate populations are related. Genet Sel Evol 48:18. https://doi.org/10.1186/ s12711-016-0183-3
108
Simon Rio et al.
74. Elsen J-M (2017) An analytical framework to derive the expected precision of genomic selection. Genet Sel Evol 49:95. https://doi. org/10.1186/s12711-017-0366-6 75. Heffner EL, Jannink J-L, Iwata H, Souza E, Sorrells ME (2011) Genomic selection accuracy for grain quality traits in Biparental wheat populations. Crop Sci 51:2597–2606. https://doi.org/10.2135/cropsci2011.05. 0253 ˜ o J, 76. Crossa J, Pe´rez P, Hickey J, Burguen Ornella L, Cero´n-Rojas J, Zhang X, Dreisigacker S, Babu R, Li Y, Bonnett D, Mathews K (2014) Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity 112:48–60. https://doi. org/10.1038/hdy.2013.16 77. Norman A, Taylor J, Edwards J, Kuchel H (2018) Optimising genomic selection in wheat: effect of marker density, population size and population structure on prediction accuracy. G3 (Bethesda) 8:2889–2899. https://doi.org/10.1534/g3.118.200311 78. Habier D, Fernando RL, Kizilkaya K, Garrick DJ (2011) Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics 12:186. https://doi.org/10.1186/14712105-12-186 79. Dehnavi E, Mahyari SA, Schenkel FS, Sargolzaei M (2018) The effect of using cow genomic information on accuracy and bias of genomic breeding values in a simulated Holstein dairy cattle population. J Dairy Sci 101: 5166–5176. https://doi.org/10.3168/jds. 2017-12999 80. Lello L, Raben TG, Yong SY, Tellier LCAM, Hsu SDH (2019) Genomic prediction of 16 complex disease risks including heart attack, diabetes, breast and prostate cancer. Sci Rep 9:15286. https://doi.org/10.1038/ s41598-019-51258-x 81. Lorenz AJ (2013) Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: a simulation experiment. G3 Bethesda 3: 481–491. https://doi.org/10.1534/g3.112. 004911 82. Wu X, Lund MS, Sun D, Zhang Q, Su G (2015) Impact of relationships between test and training animals and among training animals on reliability of genomic prediction. J Anim Breed Genet 132:366–375. 
https:// doi.org/10.1111/jbg.12165 83. Rincent R, Laloe D, Nicolas S, Altmann T, Brunel D, Revilla P, Rodriguez VM, MorenoGonzalez J, Melchinger A, Bauer E, Schoen C-C, Meyer N, Giauffret C, Bauland C, Jamin P, Laborde J, Monod H, Flament P,
Charcosset A, Moreau L (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize Inbreds (Zea mays L.). Genetics 192:715–728. https://doi.org/10. 1534/genetics.112.141473 84. Bustos-Korts D, Malosetti M, Chapman S, Biddulph B, van Eeuwijk F (2016) Improvement of predictive ability by uniform coverage of the target genetic space. G3 (Bethesda) 6(11):3733–3747. https://doi.org/10. 1534/g3.116.035410 85. Guo T, Yu X, Li X, Zhang H, Zhu C, FlintGarcia S, McMullen MD, Holland JB, Szalma SJ, Wisser RJ, Yu J (2019) Optimal designs for genomic selection in hybrid crops. Mol Plant 12:390–401. https://doi.org/10. 1016/j.molp.2018.12.022 ˜ o J, Crossa J, 86. Atanda SA, Olsen M, Burguen Dzidzienyo D, Beyene Y, Gowda M, Dreher K, Zhang X, Prasanna BM, Tongoona P, Danquah EY, Olaoye G, Robbins KR (2021) Maximizing efficiency of genomic selection in CIMMYT’s tropical maize breeding program. Theor Appl Genet 134:279–294. https://doi.org/10.1007/ s00122-020-03696-9 87. Rincent R, Charcosset A, Moreau L (2017) Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theor Appl Genet 130:2231–2247. https://doi.org/10.1007/s00122-0172956-7 88. Roth M, Muranty H, Di Guardo M, Guerra W, Patocchi A, Costa F (2020) Genomic prediction of fruit texture and training population optimization towards the application of genomic selection in apple. Hortic Res 7:1–14. https://doi.org/10.1038/s41438020-00370-5 89. Berro I, Lado B, Nalin RS, Quincke M, Gutie´rrez L (2019) Training population optimization for genomic selection. Plant Genome 12:190028. https://doi.org/10. 3835/plantgenome2019.04.0028 90. Tiede T, Smith KP (2018) Evaluation and retrospective optimization of genomic selection for yield and disease resistance in spring barley. Mol Breed 38:55. https://doi.org/10. 1007/s11032-018-0820-3 91. 
Isidro J, Jannink J-L, Akdemir D, Poland J, Heslot N, Sorrells ME (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128: 145–158. https://doi.org/10.1007/ s00122-014-2418-4
Optimization of Genomic Prediction Calibration Set 92. Adeyemo E, Bajgain P, Conley E, Sallam AH, Anderson JA (2020) Optimizing training population size and content to improve prediction accuracy of FHB-related traits in wheat. Agronomy 10:543. https://doi.org/ 10.3390/agronomy10040543 93. Kadam DC, Rodriguez OR, Lorenz AJ (2021) Optimization of training sets for genomic prediction of early-stage single crosses in maize. Theor Appl Genet 134(2): 687–699. https://doi.org/10.1007/ s00122-020-03722-w 94. Laloe¨ D (1993) Precision and information in linear models of genetic evaluation. Genet Sel Evol 25:557. https://doi.org/10.1186/ 1297-9686-25-6-557 95. Laloe¨ D, Phocas F, Me´nissier F (1996) Considerations on measures of precision and connectedness in mixed linear models of genetic evaluation. Genet Sel Evol 28:359. https:// doi.org/10.1186/1297-9686-28-4-359 96. Yu H, Spangler ML, Lewis RM, Morota G (2018) Do stronger measures of genomic connectedness enhance prediction accuracies across management units?1. J Anim Sci 96: 4490–4500. https://doi.org/10.1093/jas/ sky316 97. Zhang S-Y, Olasege BS, Liu D-Y, Wang Q-S, Pan Y-C, Ma P-P (2018) The genetic connectedness calculated from genomic information and its effect on the accuracy of genomic prediction. PLoS One 13:e0201400. https:// doi.org/10.1371/journal.pone.0201400 98. Momen M, Morota G (2018) Quantifying genomic connectedness and prediction accuracy from additive and non-additive gene actions. Genet Sel Evol 50:45. https://doi. org/10.1186/s12711-018-0415-9 99. Akdemir D, Sanchez JI, Jannink J-L (2015) Optimization of genomic selection training populations with a genetic algorithm. Genet Sel Evol 47:38. https://doi.org/10.1186/ s12711-015-0116-6 100. Akdemir D, Isidro-Sa´nchez J (2019) Design of training populations for selective phenotyping in genomic prediction. Sci Rep 9: 1446. https://doi.org/10.1038/s41598018-38081-6 101. Ou J-H, Liao C-T (2019) Training set determination for genomic selection. 
Theor Appl Genet 132:2781–2792. https://doi.org/10. 1007/s00122-019-03387-0 102. Rutkoski J, Singh RP, Huerta-Espino J, Bhavani S, Poland J, Jannink JL, Sorrells ME (2015) Efficient use of historical data for genomic selection: a case study of stem rust resistance in wheat. Plant Genome 8:
109
eplantgenome2014.09.0046. https://doi. org/10.3835/plantgenome2014.09.0046 103. Sarinelli JM, Murphy JP, Tyagi P, Holland JB, Johnson JW, Mergoum M, Mason RE, Babar A, Harrison S, Sutton R, Griffey CA, Brown-Guedira G (2019) Training population selection and use of fixed effects to optimize genomic predictions in a historical USA winter wheat panel. Theor Appl Genet 132: 1247–1261. https://doi.org/10.1007/ s00122-019-03276-6 104. Charmet G, Tran L-G, Auzanneau J, Rincent R, Bouchet S (2020) BWGS: a R package for genomic selection and its application to a wheat breeding programme. PLoS One 15:e0222733. https://doi.org/10. 1371/journal.pone.0222733 105. Wolfe MD, Del Carpio DP, Alabi O, Ezenwaka LC, Ikeogu UN, Kayondo IS, Lozano R, Okeke UG, Ozimati AA, Williams E, Egesi C, Kawuki RS, Kulakow P, Rabbi IY, Jannink J-L (2017) Prospects for genomic selection in cassava breeding. Plant Genome 10. https://doi.org/10.3835/plan tgenome2017.03.0015 106. Ozimati A, Kawuki R, Esuma W, Kayondo IS, Wolfe M, Lozano R, Rabbi I, Kulakow P, Jannink J-L (2018) Training population optimization for prediction of cassava Brown streak disease resistance in west African clones. G3 (Bethesda) 8:3903–3913. https://doi.org/ 10.1534/g3.118.200710 107. Tayeh N, Klein A, Le Paslier M-C, Jacquin F, Houtin H, Rond C, Chabert-Martinello M, Magnin-Robert J-B, Marget P, Aubert G, Burstin J (2015) Genomic prediction in pea: effect of marker density and training population size and composition on prediction accuracy. Front Plant Sci 6:941. https://doi.org/ 10.3389/fpls.2015.00941 108. Keep T, Sampoux J-P, Blanco-Pastor JL, Dehmer KJ, Hegarty MJ, Ledauphin T, Litrico I, Muylle H, Rolda´n-Ruiz I, Roschanski AM, Ruttink T, Surault F, Willner E, Barre P (2020) High-throughput genome-wide genotyping to optimize the use of natural genetic resources in the grassland species perennial ryegrass (Lolium perenne L.). G3 (Bethesda) 10:3347–3364. https://doi.org/10.1534/ g3.120.401491 109. 
Calus MP, Veerkamp RF (2011) Accuracy of multi-trait genomic selection using different methods. Genet Sel Evol 43:26. https://doi. org/10.1186/1297-9686-43-26 110. Jia Y, Jannink J-L (2012) Multiple-trait genomic selection methods increase genetic value prediction accuracy. Genetics 192:
110
Simon Rio et al.
1513–1522. https://doi.org/10.1534/ genetics.112.144246 111. Robert P, Le Gouis J, Consortium TB, Rincent R (2020) Combining crop growth modeling with trait-assisted prediction improved the prediction of genotype by environment interactions. Front Plant Sci 11:827. https://doi.org/10.3389/fpls.2020.00827 ˜o J, Crossa J, Fuentes 112. Saint Pierre C, Burguen Da´vila G, Figueroa Lo´pez P, Solı´s Moya E, Ireta Moreno J, Herna´ndez Muela VM, Zamora Villa VM, Vikram P, Mathews K, Sansaloni C, Sehgal D, Jarquin D, Wenzl P, Singh S (2016) Genomic prediction models for grain yield of spring bread wheat in diverse agro-ecological zones. Sci Rep 6:27312. https://doi.org/10.1038/srep27312 113. Ly D, Chenu K, Gauffreteau A, Rincent R, Huet S, Gouache D, Martre P, Bordes J, Charmet G (2017) Nitrogen nutrition index predicted by a crop model improves the genomic prediction of grain number for a bread wheat core collection. Field Crops Res 214: 331–340. https://doi.org/10.1016/j.fcr. 2017.09.024 114. Rincent R, Malosetti M, Ababaei B, Touzy G, Mini A, Bogard M, Martre P, Le Gouis J, van Eeuwijk F (2019) Using crop growth model stress covariates and AMMI decomposition to better predict genotype-by-environment interactions. Theor Appl Genet 132: 3399–3411. https://doi.org/10.1007/ s00122-019-03432-y 115. Jarquin D, Howard R, Crossa J, Beyene Y, Gowda M, Martini JWR, Covarrubias ˜ o J, Pacheco A, Pazaran G, Burguen Grondona M, Wimmer V, Prasanna BM (2020) Genomic prediction enhanced sparse testing for multi-environment trials. G3 (Bethesda) 10:2725–2739. https://doi.org/ 10.1534/g3.120.401349 116. Rio S, Akdemir D, Carvalho T, Isidro y Sa´nchez J (2021) Assessment of genomic prediction reliability and optimization of experimental designs in multi-environment trials. Theor Appl Genet. https://doi.org/ 10.1007/s00122-021-03972-2 117. 
Ben-Sadoun S, Rincent R, Auzanneau J, Oury FX, Rolland B, Heumez E, Ravel C, Charmet G, Bouchet S (2020) Economical optimization of a breeding scheme by selective phenotyping of the calibration set in a multi-trait context: application to bread making quality. Theor Appl Genet 133: 2197–2212. https://doi.org/10.1007/ s00122-020-03590-4 118. Rabier C-E, Barre P, Asp T, Charmet G, Mangin B (2016) On the accuracy of genomic
selection. PLoS One 11:e0156086. https:// doi.org/10.1371/journal.pone.0156086 119. Mangin B, Rincent R, Rabier C-E, Moreau L, Goudemand-Dugue E (2019) Training set optimization of genomic prediction by means of EthAcc. PLoS One 14:e0205629. https://doi.org/10.1371/journal.pone. 0205629 120. Akdemir D (2017) Selection of training populations (and other subset selection problems) with an accelerated genetic algorithm (STPGA: an R-package for selection of training populations with a genetic algorithm). ArXiv170208088 Cs Q-bio stat 121. Akdemir D, Rio S, Isidro y Sa´nchez J (2021) TrainSel: an R package for selection of training populations. Front Genet 12:655287. https://doi.org/10.3389/fgene.2021. 655287 122. Heslot N, Feoktistov V (2020) Optimization of selective phenotyping and population Design for Genomic Prediction. J Agric Biol Environ Stat 25:579–600. https://doi.org/ 10.1007/s13253-020-00415-1 123. Misztal I, Wiggans GR (1988) Approximation of prediction error variance in largescale animal models. J Dairy Sci 71:27–32. https://doi.org/10.1016/S0022-0302(88) 79976-2 124. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423. https://doi.org/10.3168/ jds.2007-0980 125. Hickey JM, Veerkamp RF, Calus MP, Mulder HA, Thompson R (2009) Estimation of prediction error variances via Monte Carlo sampling methods using different formulations of the prediction error variance. Genet Sel Evol 41:23. https://doi.org/10.1186/12979686-41-23 126. Marulanda JJ, Melchinger AE, Wu¨rschum T (2015) Genomic selection in biparental populations: assessment of parameters for optimum estimation set design. Plant Breed 134: 623–630. https://doi.org/10.1111/pbr. 12317 127. Brauner PC, Mu¨ller D, Molenaar WS, Melchinger AE (2020) Genomic prediction with multiple biparental families. Theor Appl Genet 133:133–147. https://doi.org/10. 1007/s00122-019-03445-7 128. 
Edwards SM, Buntjer JB, Jackson R, Bentley AR, Lage J, Byrne E, Burt C, Jack P, Berry S, Flatman E, Poupard B, Smith S, Hayes C, Gaynor RC, Gorjanc G, Howell P, Ober E, Mackay IJ, Hickey JM (2019) The effects of training population design on genomic
Optimization of Genomic Prediction Calibration Set prediction accuracy in wheat. Theor Appl Genet 132:1943–1952. https://doi.org/10. 1007/s00122-019-03327-y 129. Sneath PHA, Sneath PHA, Sokal RR, Sokal URR (1973) Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, New York 130. Neyhart JL, Tiede T, Lorenz AJ, Smith KP (2017) Evaluating methods of updating training data in Long-term Genomewide selection. G3 (Bethesda) 7:1499–1510. https:// doi.org/10.1534/g3.117.040550 131. Brandariz SP, Bernardo R (2018) Maintaining the accuracy of Genomewide predictions when selection has occurred in the training population. Crop Sci 58:1226–1231. https://doi.org/10.2135/cropsci2017.11. 0682 132. Jimenez-Montero JA, Gonzalez-Recio O, Alenda R (2012) Genotyping strategies for genomic selection in small dairy cattle populations. Animal 6:1216–1224. https://doi. org/10.1017/S1751731112000341 133. Boligon AA, Long N, Albuquerque LG, Weigel KA, Gianola D, Rosa GJM (2012) Comparison of selective genotyping strategies for prediction of breeding values in a population undergoing selection. J Anim Sci 90: 4716–4722. https://doi.org/10.2527/jas. 2012-4857 134. Michel S, Ametz C, Gungor H, Akgo¨l B, Epure D, Grausgruber H, Lo¨schenberger F, Buerstmayr H (2017) Genomic assisted selection for enhancing line breeding: merging genomic and phenotypic selection in winter wheat breeding programs with preliminary yield trials. Theor Appl Genet 130:363–376. https://doi.org/10.1007/s00122-0162818-8 135. Hu X, Carver BF, Powers C, Yan L, Zhu L, Chen C (2019) Effectiveness of genomic selection by response to selection for winter wheat variety improvement. Plant Genome 12:180090. https://doi.org/10.3835/plan tgenome2018.11.0090 136. Lopez-Cruz M, de los Campos G (2021) Optimal breeding-value prediction using a sparse selection index. Genetics 218: iyab030. https://doi.org/10.1093/genet ics/iyab030 137. 
Maenhout S, De Baets B, Haesaert G (2010) Graph-based data selection for the construction of genomic prediction models. Genetics 185:1463–1475. https://doi.org/10.1534/ genetics.110.116426 138. Bartholome´ J, Van Heerwaarden J, Isik F, Boury C, Vidal M, Plomion C, Bouffier L
111
(2016) Performance of genomic prediction within and across generations in maritime pine. BMC Genomics 17:604. https://doi. org/10.1186/s12864-016-2879-8 139. Wientjes YCJ, Bijma P, Calus MPL (2020) Optimizing genomic reference populations to improve crossbred performance. Genet Sel Evol 52:65. https://doi.org/10.1186/ s12711-020-00573-3 140. Bernardo R (1994) Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci 34:cropsci19940011183X003400010003x. https://doi.org/10.2135/cropsci1994. 0011183X003400010003x 141. Massman JM, Gordillo A, Lorenzana RE, Bernardo R (2013) Genomewide predictions from maize single-cross data. Theor Appl Genet 126:13–22. https://doi.org/10. 1007/s00122-012-1955-y 142. Technow F, Schrag TA, Schipprack W, Bauer E, Simianer H, Melchinger AE (2014) Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197: 1343–1355. https://doi.org/10.1534/ genetics.114.165860 143. Kadam DC, Potts SM, Bohn MO, Lipka AE, Lorenz AJ (2016) Genomic prediction of single crosses in the early stages of a maize hybrid breeding pipeline. G3 (Bethesda) 6: 3443–3453. https://doi.org/10.1534/g3. 116.031286 144. Seye AI, Bauland C, Charcosset A, Moreau L (2020) Revisiting hybrid breeding designs using genomic predictions: simulations highlight the superiority of incomplete factorials between segregating families over topcross designs. Theor Appl Genet 133:1995–2010. https://doi.org/10.1007/s00122-02003573-5 145. Fristche-Neto R, Akdemir D, Jannink J-L (2018) Accuracy of genomic selection to predict maize single-crosses obtained through different mating designs. Theor Appl Genet 131:1153–1162. https://doi.org/10.1007/ s00122-018-3068-8 146. Gauch HG, Zobel RW (1996) Optimal replication in selection experiments. Crop Sci 36: cropsci1996.0011183X003600040002x. https://doi.org/10.2135/cropsci1996. 0011183X003600040002x 147. 
Moreau L, Lemarie´ S, Charcosset A, Gallais A (2000) Economic efficiency of one cycle of marker-assisted selection. Crop Sci 40: 329–337. https://doi.org/10.2135/ cropsci2000.402329x
112
Simon Rio et al.
148. Riedelsheimer C, Melchinger AE (2013) Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor Appl Genet 126:2835–2848. https:// doi.org/10.1007/s00122-013-2175-9 149. Endelman JB, Atlin GN, Beyene Y, Semagn K, Zhang X, Sorrells ME, Jannink J-L (2014) Optimal Design of Preliminary Yield Trials with genome-wide markers. Crop Sci 54:48–59. https://doi.org/10. 2135/cropsci2013.03.0154 150. Longin CFH, Mi X, Melchinger AE, Reif JC, Wu¨rschum T (2014) Optimum allocation of test resources and comparison of breeding strategies for hybrid wheat. Theor Appl Genet 127:2117–2126. https://doi.org/10. 1007/s00122-014-2365-0 151. Longin CFH, Mi X, Wu¨rschum T (2015) Genomic selection in wheat: optimum allocation of test resources and comparison of breeding strategies for line and hybrid breeding. Theor Appl Genet 128:1297–1306.
https://doi.org/10.1007/s00122-0152505-1 152. Cullis BR, Smith AB, Cocks NA, Butler DG (2020) The design of early-stage plant breeding trials using genetic relatedness. J Agric Biol Environ Stat 25:553–578. https://doi. org/10.1007/s13253-020-00403-5 153. Lorenz A, Nice L (2017) Training population design and resource allocation for genomic selection in plant breeding. In: Varshney RK, Roorkiwal M, Sorrells ME (eds) Genomic selection for crop improvement: new molecular breeding strategies for crop improvement. Springer International Publishing, Cham, pp 7–22 154. Eynard SE, Croiseau P, Laloe¨ D, Fritz S, Calus MPL, Restoux G (2018) Which individuals to choose to update the reference population? minimizing the loss of genetic diversity in animal genomic selection programs. G3 (Bethesda) 8:113–121. https:// doi.org/10.1534/g3.117.1117
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 4

Genotyping, the Usefulness of Imputation to Increase SNP Density, and Imputation Methods and Tools

Florence Phocas

Abstract

Imputation has become a standard practice in modern genetic research to increase genome coverage and improve the accuracy of genomic selection and genome-wide association studies: a large number of samples can be genotyped at lower density (and lower cost) and imputed up to denser marker panels or to sequence level, using information from a limited reference population. Most genotype imputation algorithms use information from relatives and population linkage disequilibrium. A number of imputation software packages were originally developed for human genetics and, more recently, for animal and plant genetics, taking into account pedigree information and very sparse SNP arrays or genotyping-by-sequencing data. In comparison to human populations, the population structures of farmed species and their limited effective sizes allow accurate imputation of high-density genotypes or sequences from very low-density SNP panels and a limited set of reference individuals. Whatever the imputation method, imputation accuracy, measured by the correct imputation rate or the correlation between true and imputed genotypes, increases with the relatedness of the individual to be imputed to its more densely genotyped ancestors and with its own genotype density. Increasing imputation accuracy increases genomic selection accuracy, whatever the genomic evaluation method. For given marker densities, the most important factors affecting imputation accuracy are the size of the reference population and the relationship between individuals in the reference and target populations.

Key words Imputation accuracy, Imputation error rate, Phasing, Haplotype, Low density, High density, SNP array, Genotyping-by-sequencing, Sequence
1 Introduction

A major challenge in genome-wide association studies (GWAS) and genomic selection (GS) programs in animal and plant species is the cost of genotyping. Indeed, accurate results require large numbers of densely genotyped individuals, because a high SNP density along the genome ensures strong linkage disequilibrium between SNPs and causative mutations [1, 2]. An appealing strategy is to use a cheaper, reduced-density SNP chip with markers optimized for imputation. Imputation is a term that
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
denotes a statistical procedure that replaces the missing values in a data set by some plausible values. Genotype imputation describes the process of predicting genotypes that are not directly assayed in a sample of individuals. While it traditionally refers to the procedure of inferring the sporadic missing genotypes in an assay, it now commonly refers to the process of predicting untyped loci in a study sample genotyped for a low density panel (LDP) of markers, using observed genotypes in a reference population that has been genotyped for a greater number of loci with a high density panel (HDP) of markers [3, 4]. Genotype imputation is a crucial step in many genomic studies as all existing genotyping methods result in some missing data. Missing genotypes can be imputed in order to reach a 100% genotype call rate in a single assay. Imputation is also applied to combine sample sets genotyped with different marker panels, provided enough overlap exists between panels, to allow simple integration of data and/or the meta-analysis of various study results by standardizing the set of targeted markers. Imputation has become a standard practice in modern genetic research to increase genome coverage and improve GS accuracy and GWAS resolution as a large number of samples can be genotyped at lower density (and lower cost) and imputed up to denser marker panels or to sequence level, using information from a limited reference population. These low-cost genotyping strategies enable increased intensity of selection through the genotyping of large numbers of selection candidates or increased accuracy of estimated breeding values by expanding the training population [5]. Current applications of GS are typically based on genotypes called from high and low-density SNP array data. 
However, many plant and animal species cannot afford the development of such expensive genomic tools and genotyping-by-sequencing (GBS) has been proposed as an attractive and low-cost alternative to SNP arrays [6, 7], where restriction enzymes are used to focus sequencing resources on a limited number of cut sites. Because GBS makes possible the coverage of large portions of the genome, it may have some potential advantages for GS and GWAS in animal and plant breeding [2, 8, 9]. GBS also helps to avoid the ascertainment bias that occurs with SNP array data when marker data are not obtained from a random sample of the polymorphisms in the population of interest. Low-coverage GBS followed by imputation has also been proposed as a cost-effective genotyping approach for human disease and population genetics studies. The theoretical sequencing coverage (or depth) is the average number of times (for instance tenfold, referred to as 10x) that each nucleotide is expected to be sequenced given a certain number of reads of a given length and the assumption that reads are randomly distributed across the reconstructed genome [10]. In a proof-of-concept study, Pasaniuc et al. [11] demonstrated that very low coverage in DNA sequencing (at 0.1–1x), followed by
Genotype Imputation to Increase Genomic Prediction Accuracy
imputation using genotypic data from a reference population (the map of human genome variation established in the framework of the 1000 genomes project), captures almost as much of the common and low-frequency (minor allele frequency between 1% and 5%) variation as SNP arrays, and argued that this paradigm could become cost-effective for GWAS as sample preparation and sequencing costs would continue to fall. However, GBS data suffer from a large proportion of missing or incorrect genotype calls, in particular for low-coverage data. With GBS data, genotypes must be called from observed sequence reads, the number of which varies between loci and individuals. It is then challenging to accurately call an individual’s genotype when (almost) no reads are generated at a particular locus. Genotype calling accuracy can be increased by imputation, considering the haplotypes of other individuals in the population and detecting shared haplotype segments between individuals [12, 13]. Several methods and efficient software for genotype imputation have been developed over the last decade. Most imputation methods use a reference population (RP) that is distinct from the target population (TP), although it is preferable that the two populations have a similar genetic background. In this case, two categories of methods are used to predict untyped loci for members of the TP, depending on whether haplotypes are inferred only from linkage disequilibrium (LD) information between SNP (known as “population-based” imputation), or are inferred using both LD and pedigree information (known as “family-based” imputation). A third category of imputation methods (known as “free reference panel-based” imputation) does not require a reference population and is useful for animal and plant species that have less genomic data and tools than the main farmed species and that rely on GBS strategies.
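To make the notion of theoretical sequencing coverage concrete, the following minimal sketch computes the expected coverage from a number of reads, a read length, and the size of the sequenced target, and then shows why genotype calling is hard at low depth. It assumes, as in the definition given earlier, that reads fall at random, so that the read count at a locus is approximately Poisson distributed; the function names and the Poisson approximation are illustrative assumptions and are not taken from any of the cited software. For GBS, the target size would be the fraction of the genome near cut sites rather than the whole genome.

```python
import math


def expected_coverage(n_reads, read_length, target_size):
    """Average number of times each nucleotide is expected to be sequenced."""
    return n_reads * read_length / target_size


def prob_locus_unread(coverage):
    """P(no read covers a locus), assuming Poisson-distributed read counts."""
    return math.exp(-coverage)


def prob_both_alleles_read(coverage):
    """P(both alleles of a heterozygote are each seen at least once).

    Assumes every read samples one of the two alleles with probability 1/2,
    so reads per allele are independent Poisson(coverage / 2) counts.
    """
    return (1.0 - math.exp(-coverage / 2.0)) ** 2


# Depths discussed in the text: 0.1x to 1x (very low), then 2x, 4x and 10x.
for depth in (0.1, 1.0, 2.0, 4.0, 10.0):
    print(depth,
          round(prob_locus_unread(depth), 3),
          round(prob_both_alleles_read(depth), 3))
```

Under these assumptions, a heterozygous genotype can only be called directly when both alleles have been read, which is rarely the case at very low depth; this is the gap that imputation from shared haplotypes is meant to fill.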
Imputation from lower density toward higher density genotype (or sequence) may be thought of as a cost-effective strategy to get accurate GS and GWAS, but the accuracy of SNP imputation needs to be assessed by comparing imputed genotypes with true genotypes. Imputation accuracy is measured at the population level as the genotype correct rate (also called concordance rate) or the Pearson correlation between true and imputed genotypes in the TP. Several factors affect imputation accuracy, including the choice of the imputation method, the size of the RP, the degree of relatedness between the reference and the target populations, and the minor allele frequency (MAF) of the SNP being imputed. All these factors, as well as the choice of the genomic evaluation model in relation to the number and importance of the quantitative trait loci (QTL), affect the GS accuracy.
The first objective of this review is to give an overview of the imputation methods and the advantages and drawbacks of the associated tools. The second objective is to shed light on how and under which circumstances marker density affects the imputation accuracy and thereby the genomic prediction quality.
2 Imputation Methods and Tools: Advantages and Drawbacks
Imputation requires haplotype reconstruction (known as phasing) from genotype data. Haplotype phasing is the result of a statistical inference procedure exploiting patterns of LD between SNPs by modeling haplotype frequencies and local haplotype sharing between individuals to estimate haplotype phases for a number of samples together, often augmented by a reference panel of previously estimated haplotypes [3, 14–16]. Haplotypes are needed for individuals in both the TP and the RP for imputation methods that require a reference population. In that case, the dense genotypes of the RP members are used to build a reference panel of haplotypes that exhibit high LD over a region of tightly linked markers, and these haplotypes are used to fill in untyped SNP for target individuals genotyped at LDP (Fig. 1). The tag SNP that are common to both RP and TP serve as anchors for guiding genotype imputation of unobserved haplotypes within the LD block. Prephasing of genotypes in TP has been suggested to speed up the imputation process [17]. To this end, haplotypes are constructed once and stored so they can be used for subsequent imputations. The quality of the phasing in RP is the most important factor for the accuracy of TP haplotypes [18]. Some accurate phasing tools can be used such as SHAPEIT2 for common variants [18, 19] or its extension SHAPEITR for achieving greater accuracy for rare variants [20]. Most of the widely used phasing methods iteratively update each individual’s haplotype estimates conditional upon the current haplotype estimates of all other individuals.
When a new RP with larger numbers of variants and haplotypes is made available, the TP needs to be reimputed, and the computational cost of this can be considerably reduced if target individuals can be “prephased.” Indeed, imputation given the resulting haplotypes is considerably faster, without appreciable loss of downstream accuracy, when RP and TP are unrelated, as is often the case for human genomic studies [18]. Because prephasing can only be effectively implemented in situations where individuals newly genotyped with the high density panel are not closely related to the target individuals, it is not well suited for animal and plant applications where the markers in the LDP are sparse and the genotypes of parents of young individuals are continually added to RP. In such a case, the use of prephased haplotypes will not lead to optimal imputation accuracy for the target individuals [20].
Fig. 1 Imputation process based on a set of haplotypes in a reference population
Over the last decade, the increase in the size of RP and in the density of marker panels, on the one hand, and the development of GBS technology, on the other hand, have motivated the development of many new computational methods and the optimization of the oldest ones (Table 1). Current imputation methods make use of a rich palette of computational techniques, including the use of prephasing to reduce computational complexity [17], the use of identity-by-descent (IBD) [22, 23], haplotype clustering [24, 25] and linear interpolation [25] to reduce the state space in haplotype models, and the use of specific reference file formats to reduce size and memory needs [24, 26, 27]. For instance, it is now possible to provide imputation using RP with tens of thousands of individuals as a free web service [24]. Due to this recent and tremendous development of computational strategies, the different imputation algorithms may strongly differ in accuracy (especially for rare variants), computing speed and memory requirement [21, 23, 27, 28]. When using a reference population, imputation methods can be broadly divided into population-based methods, which use population LD information [29], and pedigree-based methods, which use linkage information from close relatives.
2.1 Population-Based Methods Requiring a Reference Panel
Population-based methods assume that individuals are unrelated. They do not make use of close relationships directly. However, they can still capture close relationships between individuals by finding long shared haplotypes [30, 31]. Long haplotype blocks of individuals in the TP can be phased and imputed using a group of surrogate parents (individuals sharing IBD regions with the target
Table 1 List of the main genotype imputation methods and their main software versions

Software name | Current version | Referenced versions

Population-based imputation methods requiring a reference panel
BEAGLE | v5.1 | v3.3 [25], v4.1 [26], v5.0 [28]
fastPHASE | v1.4 | [15]
GeneImp | v1.3 | [35]
GLIMPSE | v1 | [36]
IMPUTE | v5 named IMPUTE5 | [27]
IMPUTE2 | IMPUTE v2 | [23]
MINIMAC | v4 named MINIMAC4 | v1 named MINIMAC [17], v2 named MINIMAC2 [34], v3 named MINIMAC3 [24]
PLINK | v2 named PLINK2 | [32]

Pedigree-based imputation methods requiring a reference panel
AlphaImpute | v1.9 | [40]
Findhap | v4 | v1 [50], v4 [51]
FImpute | v3 | [21]
GIGI | v1.06 | [39]
MERLIN | v1.1 | [37], [38]

Free reference panel-based imputation methods
AlphaFamImpute | v1 | [61]
LinkImpute | v1 | [59]
LinkImputeR | v1 | [60]
magicImpute | v1 | [57]
SCDA | v1 | [52]
STITCH | v1.6 | [12]
individuals) instead of true parents [30]. Population-based methods are highly accurate if both the number of markers and the number of reference individuals are high enough, but they are computationally intensive. In general, population-based imputation methods use a hidden Markov model (HMM) of the full set of typed and untyped loci for each target sample to infer missing genotypes by maximum likelihood optimization, considering that
each reference haplotype represents a hidden state path of the HMM [4]. Additionally, SNP tagging-based imputation approaches such as the one proposed in PLINK [32] carry out genotype imputation using LD information on tag SNP. Specifically, for each SNP to be imputed, the reference haplotypes are used to search for a small set of tag SNPs in the flanking region that forms a local haplotype background in high LD with the target SNP to be imputed. The most popular imputation algorithms, Beagle [33], IMPUTE2 [23], and MaCH [3], were initially developed for applications in human genetics. Beagle’s first two versions (released in 2006–2007) were only dedicated to haplotype phasing and sporadic missing data inference in unrelated individuals [33]. In late 2008, the major release of version 3.0 added phasing of parent–offspring trios and imputation of ungenotyped markers that have been genotyped in a reference panel [25]. The Beagle imputation method constructs a tree of haplotypes and summarizes it in a directed acyclic graph model by joining nodes of the tree based on haplotype similarity in order to cluster haplotypes at each marker. Then Beagle uses an HMM to find the most likely haplotype pairs, based on the individual’s known genotypes. It works iteratively by fitting the model to the current set of estimated haplotypes and then resampling new estimated haplotypes for each individual using the fitted model. Beagle predicts the most likely genotype at missing SNP from the model that is fitted at the final iteration. The three popular imputation algorithms, Beagle, IMPUTE, and MaCH, are currently in their fifth major version (Table 1). These methods are all based on an HMM-based pedigree-free imputation approach and have been compared to each other in several studies [4, 24, 27, 28]. Generally speaking, they give similar results in terms of accuracy, but computation times and memory requirements vary strongly depending on the versions of the algorithms.
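The HMM idea shared by these tools can be illustrated with a deliberately simplified sketch. It treats a single, already phased target haplotype as an imperfect mosaic of reference haplotypes: the hidden state at each marker is the reference haplotype being copied, states switch with a small probability `rho` between adjacent markers, and typed alleles may disagree with the copied haplotype with probability `eps`. A forward-backward pass then yields, at every marker, the posterior expected allele (dosage). The function name, the constant `rho` and `eps` values, and the use of posterior dosages are assumptions chosen for illustration; this is not the algorithm implemented in Beagle, IMPUTE, or Minimac, which add haplotype clustering, state-space reduction, and genetic-map-based switch rates.

```python
def impute_haplotype(ref, target, rho=0.05, eps=0.01):
    """Posterior allele dosage at every marker of one target haplotype.

    ref    : list of reference haplotypes, each a list of 0/1 alleles
    target : list with 0/1 at typed markers and None at untyped markers
    rho    : assumed probability of switching copied haplotype per interval
    eps    : assumed probability that a typed allele mismatches its template
    """
    n_hap, n_mark = len(ref), len(ref[0])

    def emission(m):
        # Untyped markers carry no information about the hidden state.
        if target[m] is None:
            return [1.0] * n_hap
        return [1.0 - eps if ref[k][m] == target[m] else eps
                for k in range(n_hap)]

    def step(vec):
        # Symmetric transition: stay with prob. 1 - rho, else jump uniformly.
        total = sum(vec)
        return [(1.0 - rho) * v + rho * total / n_hap for v in vec]

    def normalise(vec):
        total = sum(vec)
        return [v / total for v in vec]

    # Forward pass (rescaled at each marker to avoid underflow).
    fwd = [normalise(emission(0))]
    for m in range(1, n_mark):
        pred = step(fwd[-1])
        fwd.append(normalise([p * e for p, e in zip(pred, emission(m))]))

    # Backward pass.
    bwd = [[1.0] * n_hap for _ in range(n_mark)]
    for m in range(n_mark - 2, -1, -1):
        nxt = [b * e for b, e in zip(bwd[m + 1], emission(m + 1))]
        bwd[m] = normalise(step(nxt))

    # Posterior over copied haplotypes, turned into an expected allele.
    dosages = []
    for m in range(n_mark):
        post = normalise([f * b for f, b in zip(fwd[m], bwd[m])])
        dosages.append(sum(p * ref[k][m] for k, p in enumerate(post)))
    return dosages


# Toy panel: three reference haplotypes over five markers; the target is
# typed only at markers 0, 2 and 4 (the "tag SNP" of a low-density panel).
reference = [[0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1],
             [0, 1, 0, 1, 0]]
target = [1, None, 1, None, 1]
dosages = impute_haplotype(reference, target)
print([round(d, 3) for d in dosages])
```

In this toy case only the second reference haplotype agrees with the target at all typed markers, so the untyped markers receive a dosage close to 1. A diploid genotype would be obtained by summing the dosages of the two phased haplotypes of an individual.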
In general, the RP in human studies includes a sample of representative individuals that are unrelated to the target individuals. Genotype imputation must be performed using the largest available RP because the number of accurately imputed variants increases with the RP size. However, one impediment to using larger RP is the increased computational cost of imputation. Therefore, the latest versions of the imputation algorithms are less memory-intensive and more computationally efficient implementations of the original ones, with comparable imputation accuracy. For instance, Minimac4 is the latest version in a series of genotype imputation software, preceded by Minimac3 [24], Minimac2 [34], Minimac [17], and MaCH [3]. Das et al. [24] showed that Minimac3 was twice as fast as Beagle 4.1 and about 30 times faster than IMPUTE2 or Minimac2 when considering 100 individuals in the target sample and about 30,000 sequenced individuals in the RP. In addition, increasing the number of sequenced
individuals about 30-fold (from ~1000 to 30,000) increased the memory requirement of Minimac3 only sixfold, while the memory requirements of Beagle 4.1, Minimac2, and IMPUTE2 increased almost linearly with the size of the RP. Browning et al. [28] showed that the computational cost of imputation from large reference panels is drastically reduced with Beagle 5.0 compared to Beagle 4.1, IMPUTE4, and Minimac4, when considering 1000 phased individuals in the TP and 10k, 100k, 1M, and 10M individuals in the RP, although all methods produce nearly identical accuracy. In addition, Beagle 5.0 has the best scaling of computation time with increasing reference panel size: it is 3x (10k), 12x (100k), 43x (1M), and 533x (10M) faster than the fastest alternative method. Recently, a new version, IMPUTE5 [27], has been developed from the initial IMPUTE2 algorithm [23]; it can also scale to RP with millions of samples and appears to be even faster than Beagle 5.1 for such large RP sizes. IMPUTE5 assumes that both the reference and target samples are phased and contain no missing alleles at any site. This method continues to refine the observation made in IMPUTE2 that imputation accuracy is optimized via the use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting the best matching haplotypes using the Positional Burrows-Wheeler Transform. The method then uses the selected haplotypes as conditioning states within the IMPUTE HMM. Using a reference panel with 65,000 sequenced haplotypes, the authors of [27] showed that IMPUTE5 was up to 30x faster than Minimac4 and up to 3x faster than Beagle 5.1, and used less memory than both these methods. They also showed that IMPUTE5 scales sublinearly with reference panel size: less than twice the initial computation time is required for an increase from 10,000 to 1 million reference haplotypes, because IMPUTE5 is able to utilize a smaller number of reference haplotypes.
Therefore, at the end of 2020, IMPUTE5 appeared to be the most computationally efficient software for population-based imputation handling large RP with millions of haplotypes, including ones with unphased and incomplete genotypes. Finally, we mention in this section two other software, GeneImp [35] and GLIMPSE [36], that perform genotype imputation to a dense RP given genotype likelihoods computed from low coverage (2x) were spent on the offspring, even if the parents were not sequenced. In addition, the computational costs were strongly decreased: when imputing 100 full-sib families with 100 offspring each, AlphaFamImpute took less than 1 min for 1000 loci on one chromosome while Beagle 4.0 took 11 h for similar memory needs [61].
3 Factors Affecting Imputation Accuracy and Subsequent Genomic Prediction Quality
Empirical evidence from various animal and plant breeding populations [53, 62–68] suggests that imputation of low density to higher density genotypes can be highly accurate and that the estimated breeding values (EBV) derived from imputed genotypes can reach levels of accuracy similar to those derived from high density genotypes. Nevertheless, the accuracy of EBV increases when the imputation error rate decreases [68, 69]. It is therefore important to identify the most influential factors affecting the imputation accuracy and, when possible, methods to optimize those characteristics. Both the imputation and GS accuracies depend on: (a) the imputation method; (b) the characteristics of the low-density marker panel with respect to the MAF of the SNP, their number, localization, spacing and linkage between adjacent SNP; (c) the characteristics of the RP, including its size, its relationship with the TP, and the proportion of genotyped SNPs it has in common with the TP; (d) the genomic evaluation method linked to the genetic architecture of the evaluated trait.
3.1 Choice of the Imputation Method
An optimal imputation strategy for application in animal and plant breeding programs must: (a) allow both ungenotyped and low-density genotyped individuals to be imputed; (b) function well in small and large datasets of moderately related individuals; (c) use information from close and distant relatives and from close and distant SNP loci; (d) accurately impute genotypes for all individuals in the pedigree for all SNP (including rare variants) and whatever the position of high-density genotyped individuals in the pedigree; (e) have efficient computing time and memory usage when routine genomic evaluations are required. Imputation accuracy can be measured as the allele correct rate, the genotype correct rate (also called concordance rate) or the Pearson correlation between true and imputed genotypes in the target population. Genotype error (i.e., 1 − concordance rate) is the proportion of genotypes called incorrectly and allele error is the proportion of alleles called incorrectly. Those two rates give similar results although allele error is approximately half of genotype error, because methods that are likely to impute one allele correctly are unlikely to impute both alleles incorrectly. These statistics of imputation quality can also be derived at the SNP level. The allele/genotype correct rates are allele-frequency dependent. With a naive imputation procedure based on the most frequent genotypes, the proportion of genotypes correctly imputed approaches 100% as allele frequencies approach zero or one [40]. When considering rare alleles, it is therefore recommended to look at the correlation between imputed and true genotypes
rather than at the rate of correct allele/genotype, as the latter will always be high when the MAF are low despite the fact that the rare alleles will not be well predicted [40, 43]. Browning and Browning [25] proposed the squared correlation between the allele dosage (number of minor alleles) of the most likely imputed genotype and the allele dosage of the true genotype as a metric of imputation accuracy at the marker level. They called this quantity the allelic R². Its interpretation does not depend on allele frequency. Allelic R² measures the loss of power when the most likely imputed genotypes are used in place of the true genotypes for a marker. Browning and Browning [25] also showed that allelic R² can be estimated from the imputed posterior genotype probabilities without knowledge of the true genotypes, which is an important feature because the true genotype is generally unknown. This internal quality metric of imputation is given by software such as Beagle or Minimac. Another internal quality metric, the INFO score, is proposed in IMPUTE2. Both imputation quality scores were shown to give highly correlated results [26]. Their values range from 0 to 1, where a higher value indicates increased quality of an imputed SNP. The allelic R² and the INFO score can be used for identifying or excluding markers with poor imputation accuracy prior to downstream analysis. In a study conducted independently of the authors of the imputation algorithms, Ma et al. [43] compared the five most popular imputation algorithms in animal and plant breeding (Beagle 3.3, IMPUTE2, findhap, AlphaImpute, and FImpute), using SNP array data. Two dairy cattle datasets with low (3K), medium (54K), and high (777K) density SNP panels were used to investigate imputation accuracy, considering about 30% of individuals in the RP and relatedness between target and reference individuals.
Results demonstrated that the accuracy was always high (allele correct rate > 93%), but lower when imputing from 3K to 54K (93–97%) than from 54K to 777K (97–99%). IMPUTE2 and Beagle 3.3 resulted in higher accuracies and were more robust under various conditions than the other three methods when imputing from 3K to 54K. The accuracy of imputation using FImpute was similar to the ones of Beagle and IMPUTE2 when imputing from 54K to 777K, and higher than findhap and AlphaImpute. Considering computing time and memory usage, FImpute was proposed as a relevant alternative tool to IMPUTE2 and Beagle 3.3. He et al. [70] also investigated the imputation accuracies for dairy cattle when the RP, genotyped with 50 K SNP panel, contained sires or halfsibs, or both sires and halfsibs of the individuals in the TP genotyped with a low density panel using three imputation software (FImpute, findhap and Beagle 3.3). They showed that FImpute performed the best in all cases, with correlations between true and imputed genotypes from 0.92 to 0.98 when imputing from sires to their daughters or between
halfsibs. Recently, a study compared Beagle 4.1 and FImpute for phasing quality [71]. Although similar phasing quality was observed when at least one parent was genotyped and pedigree information was considered for FImpute, the authors concluded that, since in most actual breeding programs there will be a certain number of individuals without genotyped parents and progeny, Beagle 4.1 was the most robust and recommendable option for phasing quality, despite a 29 times longer computing time compared to FImpute for their poultry dataset [71]. Currently, efficient algorithms for imputation of missing genotypes in GBS data are still in their earliest steps of development, especially with regard to very low sequencing read depth ( 0.03) were investigated. The results showed that imputation accuracies were all low (r < 0.90) for GBS at 2x, but FImpute had a slightly lower imputation accuracy than Beagle and IMPUTE2 at this depth. The three algorithms had similar imputation accuracy of r > 0.95 when the sequencing read depth was 4x. As the depth increased to 10x, the prediction accuracies approached those using true genotypes in the GBS loci. The authors also analyzed the reliability of genomic prediction with the different imputation hypotheses. They concluded that retaining more SNPs with no MAF limit resulted in higher reliability of genomic prediction. To sum up, there is nowadays a rich palette of imputation methods and algorithms useful for either low-density SNP array or low-coverage GBS data, although none of them appears to be efficient for all situations in terms of both genomic resources (reference genome assembly, density of SNP panels, RP size) and TP structure. In most cases, Beagle and FImpute performed better than other methods. An obvious advantage of FImpute over Beagle is that it uses much less computing time. However, comparisons have only been performed with early versions of Beagle.
Due to the computational efforts made in the latest version of Beagle (v5.1) and the recent development of specific software for GBS data in plant and animal breeding, new comparison studies of imputation quality and computational costs are needed to help users in choosing the relevant imputation software according to the characteristics of their genotyping datasets.
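The accuracy statistics used throughout this section can be written down in a few lines. The sketch below, given as an illustration rather than as the implementation of any cited software, computes the genotype concordance rate, the allele correct rate, and the Pearson correlation for one SNP, with genotypes coded as minor-allele dosages (0, 1, 2). The function names and the toy data are hypothetical. The example reproduces the situation described above for rare alleles: a naive procedure that always imputes the most frequent genotype reaches a high concordance while carrying no information on who carries the rare allele.

```python
import math


def genotype_concordance(true, imputed):
    """Proportion of genotypes imputed correctly (concordance rate)."""
    return sum(t == i for t, i in zip(true, imputed)) / len(true)


def allele_correct_rate(true, imputed):
    """Proportion of alleles imputed correctly.

    With unphased dosages, |true - imputed| counts the wrong alleles
    of one genotype (0, 1 or 2 out of 2).
    """
    wrong = sum(abs(t - i) for t, i in zip(true, imputed))
    return 1.0 - wrong / (2 * len(true))


def pearson_r(x, y):
    """Pearson correlation; NaN when one of the vectors has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def allelic_r2(true, best_guess):
    """Squared correlation between true and most likely imputed dosages."""
    return pearson_r(true, best_guess) ** 2


# Hypothetical rare SNP: 96 non-carriers and 4 heterozygotes (MAF = 0.02).
true_geno = [0] * 96 + [1] * 4
naive = [0] * 100                      # always impute the major genotype
informed = [0] * 96 + [1, 1, 1, 0]     # recovers three of the four carriers

for name, imp in (("naive", naive), ("informed", informed)):
    print(name,
          genotype_concordance(true_geno, imp),
          allele_correct_rate(true_geno, imp),
          pearson_r(true_geno, imp))
```

Here the naive procedure has a genotype error of 4% and an allele error of 2%, consistent with allele error being about half of genotype error, but its correlation with the true genotypes is undefined because the imputed dosages do not vary. This is why the correlation, or the allelic R², is preferred to correct rates for low-MAF SNP.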
3.2 Characteristics of the Low-Density Panel and its Optimized Choice
3.2.1 Characteristics of LDP Influencing the Imputation Accuracy
For all species and study populations, a limit exists beyond which increasing the number of SNP in the array used for GS will not induce higher prediction accuracy [73–76]. The upper bound of GS accuracy is the proportion of the genetic variance which is captured by the array and is determined by the LD between the markers and the causative mutations affecting the trait. Thus this upper limit depends on the genetic architecture of the traits. In wheat, Nyne et al. [52] hypothesized that the limit will be reached at a lower density level for monogenic traits than for polygenic traits, for which imputed SNP increased the chances of capturing most of the QTL linked to these traits. While the major factor affecting the imputation quality of a low-density panel is the number of SNP it is composed of, in relation to the existing LD between adjacent SNP [5, 63, 65, 67], imputation quality and GS accuracy also depend on the MAF and location of tag SNP in the low-density panel. The individual SNP imputation accuracy is strongly dependent on the MAF, as reported in maize [40], sheep [77], cattle [43], pig [66] or salmon [78]. This is especially the case for SNP with MAF below 10%, which are difficult to impute correctly unless the tag SNP density is sufficient and the size of RP is large [79]. Regarding localization along the chromosomes, lower accuracies are generally observed for SNP located at the two ends of the chromosomes, in centromeres and more generally in regions with high similarity or high recombination rates (such as HLA/MHC in humans). The telomeres have very long patterns of repeats, which generate problems in read mapping and imputation. Another explanation for the low imputation accuracy is that SNP imputation relies on surrounding markers, but for SNP at the very end of the telomere, surrounding information is only available on one side of the chromosome [80].
An additional explanation is the fact that recombination is higher around the telomeres, which may decrease the precision of haplotype reconstruction and imputation accuracy [62, 67, 80]. Therefore it is often recommended to increase the number of SNP at the chromosome extremities [81]. Sun et al. [82] observed that imputation accuracy was positively associated with chromosome size due to the fact that longer chromosomes harbor more markers, and hence provide more information for inferring unknown haplotypes. In longer chromosomes, the problem of low imputation accuracy at the beginning and end of the chromosomes is relatively less important than in shorter chromosomes. Low imputation accuracies have also been observed in some centromere regions [62], which might be attributed to an incorrect order of markers on the reference genome in regions that are difficult to assemble [83]. By contrast, in other studies the imputation accuracy of SNP in centromere regions was close to 1 [66, 80].
3.2.2 Optimization of the Low-Density Panel
Several avenues are possible to optimize the design of the low-density chips. In animal and plant breeding, the choice of SNP for low-density arrays is often based on the selection of markers that are uniformly distributed along the genome (equidistant spacing based on physical position along the genome) and that have high MAF to ensure segregation [81, 84]. This strategy was shown to be more relevant than choosing the SNP at random [75], especially for traits with large-effect QTL for which prediction accuracy crucially depends on capturing specific regions that explain a high proportion of the phenotypic variance. While the optimal choice of SNP in a LDP chip is crucial for the accuracy of genomic prediction based only on low-density genotypes, it also significantly impacts the accuracy of genomic prediction based on high-density imputed genotypes, as SNP in the LDP are the only ones that are not subject to imputation errors. However, it has also been shown that a LD-based strategy could allow more accurate imputation [85, 86] and that densification of markers at recombination hot spots and telomeres improves accuracy [65, 87]. A mixed strategy combining LD and physical distance has also been proposed to design low-density chips [76, 88]. It consists of LD-based marker pruning in user-defined sliding windows. An alternative strategy is to choose the markers for their effects on the important traits to be improved [89–91]. Results suggest that a low density panel comprising SNP with the largest effects has the potential to preserve the accuracy of genomic prediction from higher density panels [92]. However, this strategy limits the interest of the genotyping tool to a single population and a limited number of traits with similar genetic architectures [84]. While arrays with at least 3000 SNP must be used in dairy cattle to obtain mean allelic imputation error rates below 5% [67, 90, 93], very low density SNP ( 0 and p0 ∈ (0, 1).
Combining all these assumptions, the resulting marginal prior density is comprised of an independent and identically distributed (IID) mixture of scaled-t distributions with a point mass at zero. There are also other Bayesian models that have been derived and are part of the so-called Bayesian alphabet [27]. The difference between these models lies in the prior distribution assigned to the marker effects. For example, the Bayes C model uses a common variance for all of the SNPs while the Bayes D model uses an estimated scale parameter for the scaled inverse chi-square distribution. Also, in the case of the Bayes Cπ and Bayes Dπ models, the probability of having a zero effect SNP, π, is estimated.
2.2.4 Bayesian Least Absolute Shrinkage and Selection Operator (LASSO)
This is another option that overcomes the limitations of the least squares method. It was first introduced by Tibshirani [22], and first implemented by Usai et al. [28] for GP. The configuration of this model is similar to the Bayes A model; however, the marker-specific prior variances are assumed to be IID exponential such that $\sigma^2_{\beta_j} \sim \mathrm{Exp}(\sigma^2_{\beta_j} \mid \lambda^2)$, with λ being a regularization parameter that controls the shape of the exponential distribution. The resulting marginal prior distribution of the marker effects becomes a double exponential distribution. All the above presented prediction methods assume additive gene action. The G-BLUP method is attractive because it assumes an infinitesimal genetic model and its implementation is
Prediction Methods and the Associated Assumptions
straightforward using existing REML+BLUP software. The Bayesian methods are generally chosen because they allow departures from the infinitesimal model while still assuming an additive mode of gene action. However, these linear regression-based methods are not well suited to modeling nonadditive genetic effects (dominance, epistasis, etc.) and the associated complex interactions existing in dense marker data.
2.3 Nonparametric and Semiparametric Methods for GS
The difficulties of parametric models in coping with high-dimensional interactions such as epistasis led to the development of methods that exploit the existence of interactions for predictive purposes without modeling them explicitly. Gianola et al. [29] first proposed a nonparametric treatment of the effects of SNPs. They focused on kernel-based methods such as reproducing kernel Hilbert spaces (RKHS) regression, which combines a classical additive genetic model with a kernel function. A kernel function converts predictor variables to a set of distances among observations to produce an n × n positive definite matrix to be used in a linear model. Because kernel methods are very flexible and make no assumption of linearity, they may be better able to capture nonadditive effects of all kinds simultaneously. The general form of the kernel density estimator, $\hat{f}(x)$, can be written as
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),$$
where n is the number of individuals, h is the smoothing parameter (which can be estimated using cross-validation techniques), K is the kernel function, which is a true density function, x is the focal point, and $X_i$ is the ith row of the incidence matrix corresponding to the ith individual. Typically, the kernel function is chosen to be a symmetric, unimodal density (e.g., a Gaussian kernel). In the process, several focal points are considered in the estimation of f(x), and the observations closer to the focal point receive a higher weight in the calculation of the fitted value of E(y|x). Gianola et al. [30] developed the mixed model formulation of the RKHS model and described its theoretical implications, which were supported by its implementation on a real data set [31]. Their model includes random marker effects and an unknown function of the marker data such that $y_i = w_i'\beta + z_i'u + g(x_i) + e_i$, where $w_i$ and $z_i$ are the ith rows of the incidence matrices corresponding to the fixed and random model effect terms, respectively, and $g(x_i)$ is the unknown function of the SNP data. Another nonparametric method that has been adopted to conduct GS studies without ignoring nonadditive genetic effects is support vector machine (SVM) regression [32, 33]. SVM is a machine learning technique that is traditionally applied to classification, but it can also be used for prediction. It was first proposed by Vapnik [34] and discussed by Cortes and Vapnik [35]. The main idea of SVM for GS is that the relationship between the phenotypes and genotypes can be modeled with a linear or nonlinear mapping
Réka Howard et al.
function that maps the predictor space to an abstract, multidimensional feature space [36]. Briefly, the model describing the relationship between the phenotypes and genotypes can be written as $y = b + K(x, x^T) + e$, where b is a vector of unknown constants and K(·) is a kernel function that can take different forms (Gaussian kernel with radial basis function, linear kernel, polynomial kernel, sigmoid kernel, Laplace radial basis function, etc.). SVM is essentially an optimization problem in which the objective function to be minimized can be written as
$$\lambda \sum_{i=1}^{n} e_i + \frac{1}{2}\lVert w \rVert^2,$$
where λ is the regularization/penalty parameter that quantifies the trade-off between the sparsity and the complexity of the model, $e_i$ is the regression loss for the ith individual/line, and $\lVert w \rVert$ is the norm of the vector w, which is a function of $K(x, x^T)$. The hyperparameters can be optimized with different methods, but the most convenient and frequently used is grid search, which simply evaluates every combination of candidate parameter values and chooses the best one [37].
2.3.1 Gaussian Kernel (GK) Method
The genomic relationship matrix (G) shown earlier varies with models [38]. Unlike in G-BLUP, in the GK model the matrix is calculated as
$$K(x_j, x_{j'}) = \exp\!\left(-\frac{h\, d_{jj'}}{q}\right),$$
where $x_j = (x_{j1}, \ldots, x_{jp})$ and $x_{j'} = (x_{j'1}, \ldots, x_{j'p})$ are the input vectors of genetic markers in lines j and j′, h is the bandwidth parameter, q is the quantile used to scale the squared distance, and $d_{jj'} = \lVert x_j - x_{j'} \rVert^2$, where $\lVert x_j - x_{j'} \rVert$ is the Euclidean norm of the difference between the input vectors. $K(x_j, x_{j'})$ is a symmetric and positive semidefinite matrix that reflects the correlation between the genotypes and is controlled by the bandwidth parameter. Cuevas et al. [38] proposed using the 0.05 quantile (q0.05) to avoid extremely high or low values of the bandwidth parameter, which may negatively affect the performance of the model.
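The Gaussian kernel above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by the cited authors: the function names, the toy marker matrix, and the choice to take the quantile over the off-diagonal squared distances with linear interpolation are all assumptions made for the example.

```python
import math

def squared_distances(X):
    """Squared Euclidean distance d_jj' between every pair of marker vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(xj, xk)) for xk in X] for xj in X]

def quantile(values, q):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def gaussian_kernel(X, h=1.0, q=0.05):
    """K(x_j, x_j') = exp(-h * d_jj' / quantile_q), following the formula in the text."""
    D = squared_distances(X)
    n = len(X)
    off_diagonal = [D[j][k] for j in range(n) for k in range(j + 1, n)]
    scale = quantile(off_diagonal, q)
    if scale <= 0.0:
        raise ValueError("the scaling quantile must be positive")
    return [[math.exp(-h * D[j][k] / scale) for k in range(n)] for j in range(n)]

# Hypothetical toy data: 3 lines genotyped at 4 markers coded 0/1/2.
X = [[0, 1, 2, 1],
     [1, 1, 0, 2],
     [2, 0, 1, 1]]
K = gaussian_kernel(X, h=1.0, q=0.05)
```

Larger values of h make the kernel decay faster with genetic distance, so K approaches an identity matrix; smaller values make all lines look alike.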
2.4 Deep Learning (DL) Method
A DL model generally has an input layer, an output layer, and several hidden layers [38]. Montesinos-López et al. [39] provide a detailed description of its implementation in plant breeding studies. A DL model with one hidden layer is a conventional neural network. In the network, information flows in one direction, from the input units to the output units through the hidden layers. The units in the input layer represent the independent variables, and those in the output layer correspond to the response or target
variable. The units located in the hidden layers perform complex nonlinear transformations of the input layer. The units of different layers, also called neurons or nodes of the network, are connected with each other, and the strengths of these connections are termed weights. The weights in a network depend on how the network learns from the training data. In a DL model, the number of epochs and the batch size need to be specified, as does the activation function. One epoch corresponds to one forward pass and one backward pass of the whole training data set through the network; because passing the whole data set at once can exceed the memory of the computer, the data are divided into small batches. The batch size is the number of training samples in one forward/backward pass, and the number of iterations is the number of batches needed to complete one epoch. Therefore, with 1000 training data points and a batch size of 500, it takes 2 iterations to complete one epoch. The number of epochs determines how many times the weights change, and the optimal number of epochs depends on the data. A number of epochs lower or higher than the optimum can make the model prone to underfitting or overfitting, respectively. As long as both the testing and training errors are dropping, the number of epochs can be increased; more than one epoch is therefore recommended to avoid underfitting. To avoid overfitting of the DL model, constraints are placed on the complexity of the network by making the weights small and more regularly distributed. To regulate the weights in the network, a term is added to the loss function that penalizes the model for larger values of the weights. There are several ways to regulate the weights; for example, [39] considered the dropout regularization method [40].
Under dropout regularization, a random subset of the neurons or nodes, together with their connections, is temporarily removed with a given probability (generally 20–50%) during the training of the network. When the selected neurons are dropped, the other neurons have to compensate by representing the features required for prediction, which in turn reduces the sensitivity of the model to specific weights and makes the model less likely to overfit. Hyperparameters such as the number of hidden layers, the dropout rate, the number of units, and the type of activation can greatly affect the learning ability of the DL model, and thus its performance. Manual tuning of a DL model can be computationally difficult and depends on user expertise. The most common approaches to hyperparameter tuning are (a) grid search, (b) random search, (c) Latin hypercube sampling, and (d) optimization (see [38] for details on each approach). In an attempt to reduce the complexity of tuning a large number of hyperparameters, other models based on kernel methods have been proposed.
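Two of the ideas described above, the relation between batch size and iterations per epoch and the dropout mask, can be sketched in plain Python. The function names are hypothetical, and the rescaling of the surviving units by 1/(1 - rate) (so-called inverted dropout) is an assumption of this sketch rather than something stated in the text.

```python
import math
import random

def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass the whole training set once."""
    return math.ceil(n_samples / batch_size)

def dropout(activations, rate, rng):
    """Zero each unit with probability `rate` during training.

    Surviving units are divided by (1 - rate) so that the expected
    activation is unchanged (assumed 'inverted dropout' convention).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

# The example from the text: 1000 training points, batches of 500.
n_iterations = iterations_per_epoch(1000, 500)

# Hypothetical hidden-layer activations with a 50% dropout rate.
rng = random.Random(2022)
hidden = [0.3, 1.2, 0.7, 2.0]
dropped = dropout(hidden, rate=0.5, rng=rng)
```

At prediction time no units are dropped; with the inverted convention assumed here, no further rescaling is needed then.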
2.5 Arc-Cosine Kernel (AK) Method
The AK model emulates the DL model but does not require the complex tuning of hyperparameters. Instead, an important part of the model is the angle (θ) between two vectors of marker genotypes, $x_j$ and $x_{j'}$. The angle can be calculated [41] as follows:
$$\theta_{j,j'} = \cos^{-1}\!\left(\frac{x_j \cdot x_{j'}}{\lVert x_j \rVert\, \lVert x_{j'} \rVert}\right)$$
where · is the inner product and $\lVert x_j \rVert$ is the norm of line j. The kernel shown below is then symmetric, positive definite, and related to an artificial neural network with one hidden layer and a ramp activation function [42]:
$$AK_1(x_j, x_{j'}) = \frac{1}{\pi}\lVert x_j \rVert\, \lVert x_{j'} \rVert\, J(\theta_{j,j'}),$$
where π is the constant pi and $J(\theta_{j,j'})$ captures the angular dependence and is equal to $\sin\theta_{j,j'} + (\pi - \theta_{j,j'})\cos\theta_{j,j'}$.
The diagonal elements of the matrix (AK1) are not identical; rather, they reflect heterogeneous genomic variances. An AK model that emulates an ANN model with more than one hidden layer (l) is shown below [41, 42]:
$$AK^{(l+1)}(x_j, x_{j'}) = \frac{1}{\pi}\left[AK^{(l)}(x_j, x_j)\, AK^{(l)}(x_{j'}, x_{j'})\right]^{1/2} J\!\left(\theta^{(l)}_{j,j'}\right),$$
where $\theta^{(l)}_{j,j'} = \cos^{-1}\!\left\{AK^{(l)}(x_j, x_{j'})\left[AK^{(l)}(x_j, x_j)\, AK^{(l)}(x_{j'}, x_{j'})\right]^{-1/2}\right\}$. Therefore, the computation of the AK(l+1) matrix needs information from the previous layer (l). Cuevas et al. [43] provided a maximum marginal likelihood method to determine the number of hidden layers for the AK method. Other models have emerged that consider not only the genetic component but also other sources of information. The list of such models has grown considerably in recent years, adopting different frameworks and adapting them to the availability of different (nongenetic) data types.
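The arc-cosine recursion can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: the function names and the toy genotype vectors are hypothetical, the matrix of inner products is used as the starting point so that one recursion step reproduces AK1, and no genotype vector is allowed to be all zeros (its norm would be zero).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angular_term(theta):
    """J(theta) = sin(theta) + (pi - theta) * cos(theta)."""
    return math.sin(theta) + (math.pi - theta) * math.cos(theta)

def next_layer(K):
    """One recursion step: builds AK(l+1) from AK(l)."""
    n = len(K)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            scale = math.sqrt(K[j][j] * K[k][k])
            # Clamp to [-1, 1] to guard against floating-point rounding.
            cosine = max(-1.0, min(1.0, K[j][k] / scale))
            out[j][k] = scale * angular_term(math.acos(cosine)) / math.pi
    return out

def arc_cosine_kernel(X, layers=1):
    """Start from the inner products x_j . x_j' and apply `layers` steps."""
    K = [[dot(xj, xk) for xk in X] for xj in X]
    for _ in range(layers):
        K = next_layer(K)
    return K

# Hypothetical toy genotype vectors for three lines.
X = [[1, 0], [0, 1], [1, 1]]
AK1 = arc_cosine_kernel(X, layers=1)
AK2 = arc_cosine_kernel(X, layers=2)
```

In this sketch the diagonal of every layer equals the squared norm of each genotype vector, which is why the diagonal elements differ between lines.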
3 Genomic Models Including G×E
Models incorporating genome × environment (G×E) interaction have been proposed (see Chapter 9 for more details) in an attempt to improve accuracy when predicting the breeding values of complex continuous traits (e.g., grain yield) of individuals in different environments (site–year combinations) [44–50]. However, different statistical models are required for assessing the genomic-enabled prediction accuracy of noncontinuous categorical
response variables (e.g., ordinal disease ratings, count data, etc.), and for this purpose conventional Genomic Best Linear Unbiased Predictors (GBLUP) and deep learning artificial neural networks (DL) have been developed [32, 39]. Since the early days of genomic selection, several genetic and statistical factors have been pointed out as complications for the application of GS and GP [6]. Genetic difficulties arise when deciding the size of the training population and in relation to the heritability of the traits to be predicted. Statistical challenges are related to the high dimensionality of marker data, with the number of markers (p) being much larger than the number of observations (n) (p ≫ n), the multicollinearity among markers, and the curse of dimensionality. One important genetic-statistical complexity of GS and GP models arises when predicting unphenotyped individuals in specific environments (e.g., planting date–site–management combinations) by incorporating G×E interaction into the genomic-based statistical models. The genomic complexity related to G×E interactions for multiple traits is also important, as these interactions require statistical-genetic models that exploit the complex multivariate relations due to the multitrait and multienvironment variance–covariance structure, as well as the genetic correlations between environments, between traits, and between traits and environments simultaneously. The abovementioned problem of p ≫ n results in a matrix of predictors that is rank-deficient and a model that is not likelihood-identified and thus prone to overfitting. The use of penalized regression, variable selection, and dimensionality reduction offers solutions to some of these problems.
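The rank deficiency that arises when p > n, and the way a ridge penalty removes it, can be illustrated with a tiny example. The sketch below is plain Python with a hand-written Gaussian elimination; the function names, the 2 × 3 marker matrix, the phenotypes, and the penalty value are all hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def normal_equations(X, y, lam):
    """Return (X'X + lam * I, X'y) for marker matrix X and phenotypes y."""
    Xt = transpose(X)
    p = len(Xt)
    A = [[sum(a * b for a, b in zip(Xt[i], Xt[j])) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    rhs = [sum(a * b for a, b in zip(Xt[i], y)) for i in range(p)]
    return A, rhs

def ridge(X, y, lam):
    """Ridge estimate of marker effects; lam = 0 gives ordinary least squares."""
    A, rhs = normal_equations(X, y, lam)
    return solve(A, rhs)

# Hypothetical data: n = 2 individuals, p = 3 markers, so X'X has rank <= 2.
X = [[0, 1, 2],
     [2, 1, 0]]
y = [1.0, 3.0]

try:
    ridge(X, y, 0.0)          # least squares: X'X is singular
    ols_failed = False
except ValueError:
    ols_failed = True

beta = ridge(X, y, 1.0)       # the penalty makes the system solvable
```

With the penalty set to zero the normal equations have no unique solution; any positive penalty makes X'X + λI full rank, which is the sense in which penalized regression addresses the p ≫ n problem.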
4 Important Factors Affecting Prediction Accuracy
Several studies have analyzed the factors that most affect predictive ability in genomic selection applications. Among these are the relative sizes and compositions of the calibration and testing sets [16, 51–54] and the genetic and environmental relationships between and within individuals in the calibration and testing sets [55–59]. Another important factor to consider is the genetic architecture of the targeted trait, as well as its heritability [51, 60]. A trait controlled by a few major genes is expected to return a high predictive ability; for complex traits, however, a high heritability (proportion of phenotypic variability attributable to genetic factors) is desired. Other very important factors that have been studied are the choice of the prediction model and the imputation of and quality control on marker genotypes. Several authors [20, 51, 61, 62] have conducted comparison studies to contrast the levels of predictive ability that can be reached
using different prediction models in the parametric (penalized, Bayesian) and nonparametric (machine learning-based) contexts. In general, for complex traits such as grain yield, these authors found no significant differences between the different prediction models and approaches. Hence, none of the models was clearly superior to the others, and the choice of model should be guided by other considerations, such as the computational burden. To our knowledge, the most frequently implemented and convenient prediction model is the Genomic BLUP (G-BLUP), because it deals with matrices of low dimension (n × n) and does not require a very elaborate statistical framework. Owing to limitations of the sequencing technologies (imperfect data), in many cases data curation is required before conducting the analyses, in order to fill the gaps by imputing missing marker genotypes. Several studies have assessed the effects on predictive ability of imputing unordered [51, 55, 63] and ordered (by physical position) markers [51, 64]. Other, more elaborate implementations combine pedigree data and high-density marker data of parents to impute low-density platforms of progeny [65–67]. The list of studies conducted in this regard is very extensive; however, with real datasets, none of these methods (random forest, Beagle, FILLIN) proved significantly superior to conventional naïve imputation (using the arithmetic mean of each marker after discarding its missing values). Another aspect to consider when dealing with imperfect data is the removal of marker SNPs with (1) a considerable percentage of missing values (PMV), (2) a low minor allele frequency (MAF), or a combination of both.
Although there is no written rule regarding the thresholds to apply for quality control on marker SNPs, several studies have considered or suggested the removal of SNPs with more than 50% of missing values and those with a MAF smaller than 0.03 or 0.05 [51, 57]. One exhaustive study considered 200 replicates of all combinations of 27 different values for PMV and 12 values for MAF (324 = 27 × 12) for three traits (grain yield, plant height, and days to maturity) and two imputation methods (naïve and random forest) in soybean. The authors showed that there is no single best strategy (combination of PMV and MAF) that systematically returns the highest correlation between predicted and observed values, and that the best strategy depends mostly on the imputation method.
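The quality control and naïve imputation steps described above can be sketched in plain Python. The function names, the 0/1/2 allele-dosage coding with `None` for missing calls, the toy genotype matrix, and the default thresholds (PMV at most 0.5, MAF at least 0.05, taken from the range mentioned in the text) are assumptions of this illustration, not a prescribed protocol.

```python
def marker_stats(column):
    """Return (PMV, MAF) for one marker coded as 0/1/2 allele dosages."""
    observed = [g for g in column if g is not None]
    pmv = 1.0 - len(observed) / len(column)
    if not observed:
        return pmv, 0.0
    freq = sum(observed) / (2.0 * len(observed))  # frequency of the counted allele
    return pmv, min(freq, 1.0 - freq)

def quality_control(genotypes, max_pmv=0.5, min_maf=0.05):
    """Indices of markers that pass both the PMV and the MAF filters."""
    keep = []
    for j, column in enumerate(zip(*genotypes)):
        pmv, maf = marker_stats(column)
        if pmv <= max_pmv and maf >= min_maf:
            keep.append(j)
    return keep

def naive_impute(genotypes, keep):
    """Replace missing calls by the marker mean computed from observed values."""
    columns = list(zip(*genotypes))
    means = {}
    for j in keep:
        observed = [g for g in columns[j] if g is not None]
        means[j] = sum(observed) / len(observed)
    return [[row[j] if row[j] is not None else means[j] for j in keep]
            for row in genotypes]

# Hypothetical data: 4 lines (rows) by 4 markers (columns), None = missing.
genotypes = [
    [0,    2,    None, 0],
    [1,    2,    None, 0],
    [2,    None, None, 0],
    [None, 0,    1,    0],
]
keep = quality_control(genotypes)        # marker 2: PMV 0.75; marker 3: MAF 0
imputed = naive_impute(genotypes, keep)
```

Filtering before imputing, as done here, avoids imputing markers that would be discarded anyway; whether that order is best for a given data set is one of the choices the cited studies examine.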
5 Concluding Remarks
The application of powerful, relatively new statistical methods to the problem of high-dimensional marker data for GS has been
nearly as important to the development of GS as the creation of high-density marker platforms and greater computing power. The methods can be classified by the type of genetic architecture they try to capture. Somewhat surprisingly, G-BLUP, which makes the ostensibly unrealistic assumption that genetic effects are uniformly spread across the genome, often performs as well as more sophisticated models. Exceptions do exist, though, and there is abundant evidence that BayesB is superior for traits with strong QTL effects. Additionally, since BayesB identifies markers in strong LD with QTL better than RR-BLUP does, it maintains accuracy for more generations. Finally, the question of whether or not to model epistasis remains open. If epistasis is important for a particular trait in a particular population, kernel methods and machine-learning techniques such as SVM may be preferable. It is important for practitioners to consider such issues, or to test different categories of methods on their own data, before making a final decision.
References
1. Henderson CR (1963) Selection index and expected genetic advance. In: Hanson WD, Robinson HF (eds) Statistical genetics and plant breeding. National Academy of Sciences-National Research Council, Washington, pp 141–163
2. Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447
3. Henderson CR (1984) Applications of linear models in animal breeding. University of Guelph, Guelph
4. de Roos AP, Hayes BJ, Goddard ME (2009) Reliability of genomic predictions across multiple populations. Genetics 183:1545–1553
5. Beaulieu J, Doerksen TK, MacKay J, Rainville A, Bousquet J (2014) Genomic selection accuracies within and between environments and small breeding groups in white spruce. BMC Genomics 15:1048
6. de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345
7. Fernando RL, Grossman M (1989) Marker assisted selection using best linear unbiased prediction. Genet Sel Evol 21:467
8. Soller M, Plotkin-Hazan J (1977) The use of marker alleles for the introgression of linked quantitative alleles. Theor Appl Genet 51:133–137
9. Soller M (1978) The use of loci associated with quantitative effects in dairy cattle improvement. Anim Prod 27:133–139
10. Mohan M, Nair S, Bhagwat A, Krishna TG, Yano M et al (1997) Genome mapping, molecular markers and marker-assisted selection in crop plants. Mol Breed 13:87–103. https://doi.org/10.1023/A:1009651919792
11. Bernardo R (2008) Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci 48:1649–1664
12. Beavis WD (1994) The power and deceit of QTL experiments: lessons from comparative QTL studies. In: Proceedings of the Forty-Ninth Annual Corn & Sorghum Industry Research Conference. American Seed Trade Association, Washington, DC, pp 250–266
13. Dekkers JC (2004) Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J Anim Sci 82:E313–E328
14. Bernardo R (1994) Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci 34:20–25
15. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
16. Crossa J, Beyene Y, Kassa S, Pérez P, Hickey JM et al (2013) Genomic prediction in maize breeding populations with genotyping-by-sequencing. G3 (Bethesda) 3:1903–1926. https://doi.org/10.1534/g3.113.008227
17. Zhang X, Pérez-Rodríguez P, Burgueño J, Olsen M, Buckler E et al (2017) Rapid cycling genomic selection in a multiparental tropical maize population. G3 (Bethesda) 7:2315–2326. https://doi.org/10.1534/g3.117.043141
18. He S, Schulthess AW, Mirdita V, Zhao Y, Korzun V et al (2016) Genomic selection in a commercial winter wheat population. Theor Appl Genet 129:641–651
19. Wolc A, Arango J, Settar P, Fulton JE, O'Sullivan NP et al (2011) Persistence of accuracy of genomic estimated breeding values over generations in layer chickens. Genet Sel Evol 43:23
20. Howard R, Carriquiry AL, Beavis WD (2014) Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures. G3 (Bethesda) 4(6):1027–1046. https://doi.org/10.1534/g3.114.010298
21. Bernardo RN (2010) Breeding for quantitative traits in plants, 2nd edn. Stemma Press, Woodbury
22. Kimeldorf GS, Wahba G (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Statist 41:495–502. Autotune: A Derivative-free Optimization Framework for Hyperparameter Tuning
23. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 58:267–288
24. Hoerl AE, Kennard R (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
25. Habier D, Fernando RL, Dekkers JC (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397
26. Andrews DF, Mallows CL (1974) Scale mixtures of normal distributions. J R Stat Soc Series B Stat Methodol 36:99–102
27. Habier D, Fernando R, Kizilkaya K, Garrick D (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12:186
28. Usai MG, Goddard ME, Hayes BJ (2009) LASSO with cross-validation for genomic selection. Genet Res 91:427–436. https://doi.org/10.1017/S0016672309990334
29. Gianola D, Fernando RL, Stella A (2006) Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173:1761–1776
30. Gianola D, van Kaam JBCM (2008) Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 4:2289–2303. https://doi.org/10.1534/genetics.107.084285
31. González-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendaño S (2008) Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics 178:2305–2313. https://doi.org/10.1534/genetics.107.084293
32. Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM et al (2019) A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 (Bethesda) 9(2):601–618. https://doi.org/10.1534/g3.118.200998
33. Zhao W, Lai X, Liu D, Zhang Z, Ma P et al (2020) Applications of support vector machine in genomic prediction in pig and maize populations. Front Genet 11:1537. https://doi.org/10.3389/fgene.2020.598318
34. Vapnik V (1995) The nature of statistical learning theory. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-2440-0
35. Wang X, Li L, Yang Z, Zheng X, Yu S et al (2017) Predicting rice hybrid performance using univariate and multivariate GBLUP models based on North Carolina mating design II. Heredity 118:302–310. https://doi.org/10.1038/hdy.2016.87
36. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
37. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York, NY
38. Pérez-Elizalde S, Cuevas J, Pérez-Rodríguez P, Crossa J (2015) Selection of the bandwidth parameter in a Bayesian kernel regression model for genomic-enabled prediction. J Agric Biol Environ Stat 5(4):512–532
39. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
40. Cuevas J, Crossa J, Soberanis V, Pérez-Elizalde S, Pérez-Rodríguez P et al (2016) Genomic prediction of genotype × environment interaction kernel regression models. Plant Genome 9. https://doi.org/10.3835/plantgenome2016.03.0024
41. Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM (2018) Multi-environment genomic prediction of plant traits using deep learners with a dense architecture. G3 (Bethesda) 8(12):3813–3828. https://doi.org/10.1534/g3.118.200740
42. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
43. Crossa J, Martini JWR, Gianola D, Pérez-Rodríguez P, Jarquin D et al (2019) Deep kernel and deep learning for genome-based prediction of single traits in multienvironment breeding trials. Front Genet 10:1168. https://doi.org/10.3389/fgene.2019.01168
44. Cho Y, Saul L (2009) Kernel methods for deep learning. Advances in Neural Information Processing Systems 22: Proceedings of the 2009 Conference, pp 342–350
45. Cuevas J, Montesinos-López OA, Juliana P, Guzmán C, Pérez-Rodríguez P, González-Bucio J et al (2019) Deep kernel for genomic and near infrared prediction in multienvironments breeding trials. G3 (Bethesda) 9:2913–2924. https://doi.org/10.1534/g3.119.400493
46. Burgueño J, de los Campos G, Weigel K, Crossa J (2012) Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci 52:707–719
47. Jarquin D, Crossa J, Lacaze X, Pérez P, Cheyron PD et al (2014) A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor Appl Genet 127(3):595–607
48. Heslot N, Akdemir D, Sorrells ME, Jannink JL (2013) Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theor Appl Genet 127(2):463–480
49. Lopez-Cruz M, Crossa J, Bonnett D, Dreisigacker S, Poland J et al (2015) Increased prediction accuracy in wheat breeding trials using a marker × environment interaction genomic selection model. G3 (Bethesda) 5(4):569–582. https://doi.org/10.1534/g3.114.016097
50. Crossa J, Maccaferri M, Tuberosa R, Burgueño J, Pérez-Rodríguez P (2016) Extending the marker × environment interaction model for genomic-enabled prediction and genome-wide association analyses in durum wheat. Crop Sci 56:1–17
51. Cuevas J, Crossa J, Montesinos-López OA, Burgueño J, Pérez-Rodríguez P, de los Campos G (2017) Bayesian genomic prediction with genotype × environment interaction kernel models. G3 (Bethesda) 7(1):41–53. https://doi.org/10.1534/g3.116.035584
52. Rincent R, Malosetti M, Ababaei B, Touzy G, Mini A et al (2019) Using crop growth model stress covariates and AMMI decomposition to better predict genotype-by-environment interactions. Theor Appl Genet 132:3399–3411. https://doi.org/10.1007/s00122-019-03432-y
53. Jarquín D, Kocak K, Posadas L, Hyma K, Jedlicka J et al (2014) Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15:740
54. Zhang J, Song Q, Cregan PB, Jiang G-L (2016) Genome-wide association study, genomic prediction and marker-assisted selection for seed weight in soybean (Glycine max). Theor Appl Genet 129:117–130. https://doi.org/10.1007/s00122-015-2614-x
55. Duangjit J, Causse M, Sauvage C (2016) Efficiency of genomic selection for tomato fruit quality. Mol Breed 36:29. https://doi.org/10.1007/s11032-016-0453-3
56. Berro IM, Lado B, Nalin RS, Quincke M, Gutierrez L (2019) Training population optimization for genomic selection. Plant Genome 12:190028. https://doi.org/10.3835/plantgenome2019.04.0028
57. Hickey JM, Crossa J, Babu R, de los Campos G (2012a) Factors affecting the accuracy of genotype imputation in populations from several maize breeding programs. Crop Sci 52:654–663
58. Jarquin D, Specht J, Lorenz AJ (2016) Prospects of genomic prediction in the USDA soybean germplasm collection: historical data creates robust models for enhancing selection of accessions. G3 (Bethesda) 6(8):2329–2341. https://doi.org/10.1534/g3.116.031443
59. Jarquín D, Howard R, Graef G, Lorenz A (2019) Response surface analysis of genomic prediction accuracy values using quality control covariates in soybean. Evol Bioinforma 15:1–7
60. Thorwarth P, Ahlemeyer J, Bochard A-M, Krumnacker K, Blümel H et al (2017) Genomic prediction ability for yield-related traits in German winter barley elite material. Theor Appl Genet 130:1669–1683. https://doi.org/10.1007/s00122-017-2917-1
61. Jarquin D, de Leon N, Romay C, Bohn M, Buckler ES et al (2021) Utility of climatic information via combining ability models to improve genomic prediction for yield within the maize genomes to fields project. Front Genet 11:592769. https://doi.org/10.3389/fgene.2020.592769
62. Lozada DN, Carter AH (2019) Accuracy of single and multi-trait genomic prediction models for grain yield in US pacific northwest winter wheat. Crop Breeding Genes Genomics 1:e190012. https://doi.org/10.20900/cbgg20190012
63. Heslot N, Yang H-P, Sorrells ME, Jannink J-L (2012) Genomic selection in plant breeding: a comparison of models. Crop Sci 52:146. https://doi.org/10.2135/cropsci2011.06.0297
64. Xavier A, Muir WM, Craig B (2016) Walking through the statistical black boxes of plant breeding. Theor Appl Genet 129:1933–1949
65. Rutkoski JE, Poland J, Jannink J-L, Sorrells ME (2013) Imputation of unordered markers and the impact on genomic selection accuracy. G3 (Bethesda) 3(3):427–439. https://doi.org/10.1534/g3.112.005363
66. Hayes BJ, Bowman PJ, Daetwyler HD, Kijas JW, van der Werf JHJ (2012) Accuracy of genotype imputation in sheep breeds. Anim Genet 43:72–80
67. Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA (2012b) A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol 44(1):9. https://doi.org/10.1186/1297-9686-44-9
Chapter 6
Overview of Major Computer Packages for Genomic Prediction of Complex Traits
Giovanny Covarrubias-Pazaran
Abstract
Genomic prediction models are showing their power to increase the rate of genetic gain by boosting all the elements of the breeder's equation. Insight into the factors associated with the successful implementation of this prediction model is increasing with time but the technology has reached a stage of acceptance. Most genomic prediction models require specialized computer packages based mainly on linear models and related methods. The number of computer packages has exploded in recent years given the interest in this technology. In this chapter, we explore the main computer packages available to fit these models; we also review the special features, strengths, and weaknesses of the methods behind the most popular computer packages.
Key words Prediction model, Computer packages, Linear models, REML
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Prediction Models in Breeding
1.1 The Beginnings of Prediction Models in Breeding
The use of covariates to predict phenotypes dates back to the early twentieth century, when phenotypic markers could be used to infer other relevant traits [1]. This could be expressed in mathematical terms with models as simple as
$$y_i = x_i\beta + e_i \qquad (1)$$
where the level of trait y for the ith individual can be inferred from the level of trait x for the same individual and a residual error e. A good example was the association in maize between the color of the stem and resistance to low temperature [1]. Later, protein-based and genetic markers came to replace phenotypic markers in many species and were also widely used for genetic-linkage research [2]. Of particular interest is the history of using single markers linked to major-effect genes to build prediction models for important traits (e.g., disease resistance), something that came to be
known as marker-assisted selection (MAS). This method showed its value and power to predict simple or oligogenic traits and accelerate backcrossing [2], but failed miserably—as expected under the quantitative genetics theory—when breeders attempted to use it to predict complex or polygenic traits [2, 3] to the point of discouraging generations of breeders from implementing new predictive methods. In the simplest inception of MAS, after a genetic study determining the association of a given marker allele or alleles with a phenotype [e.g., disease resistant and susceptible for a locus with dominance action (R ¼ Resistant ¼ dosage of 1 or 2 alleles, Susceptible ¼ S ¼ dosage of 0 desired alleles)], the predictive model is the same as in Eq. 1. A dummy example of this phenotype– marker model could look like this: 2 3 2 3 0 S 6R7 6 2 7 6 7 6 7 6 7¼6 7 4⋮5 4⋮5 1 R where any discrepancy between the marker allele and the phenotype is due to genotyping errors, phenotyping errors, or breakage of linkage between the QTL and the marker. This, of course, is not different from the phenotypic marker approach, other than changing the manner in which the independent or input variable information is obtained; here it is a genetic marker instead of a phenotype, opening the door to new applications [2]. Few additional markers can be incorporated into the model to explain phenotypes based on more than one gene, known as quantitative trait loci (QTL), and the response variable changes from categorical to numerical. Another more popular and highly effective predictive model developed before the MAS era, is the animal model. 
Initially proposed by Henderson in 1949 and 1959 with the statistical support of Shayle Searle [4, 5], in this model the individuals (that can be seen as levels of a random categorical variable) are not assumed to be independent but to have a relationship that can be calculated using coefficients of coancestry to obtain a covariance matrix [5] that can be the input of the following model:
$$y_i = x_{i,:}\,\beta + z_{i,:}\,u + e_i \tag{2}$$

with

$$u \sim \mathrm{MVN}\left(0, A\sigma_a^2\right) \tag{3}$$
where $y_i$ refers to the ith record or phenotype, $x_{i,:}$ refers to the ith row of a fixed indicator matrix associating records with nongenetic effects (β), and $z_{i,:}$ refers to the ith row of an indicator matrix linking the ith record to the genetic effects (u). The distribution of u allows for covariance among its levels (the A matrix in Eq. 3), which makes it possible to predict unseen individuals. The development of this model led to many advances in the area of linear mixed models and set the foundations of many genomic prediction models used today. These models are discussed further in Subheadings 1.2 and 1.3 to help the reader understand what happens behind the scenes in the most popular computer packages for genomic prediction.

1.2 The Genomic Prediction Model for Complex Traits (Many Sides of the Same Dice)
It was at the beginning of the twenty-first century that the low cost of genotyping and the greater availability of genetic markers led Meuwissen et al. [3] to propose a new way of approaching the prediction of complex traits and to abandon the QTL paradigm. They proposed expanding the simple MAS model to a more robust model where phenotypes could be explained as a function of the sum of small additive effects associated with the thousands or more markers (covariates) available, closely linked to the infinitesimal model from Fisher [6]:

$$y = X\beta + e \quad \text{or} \quad y_i = x_{i,\mathrm{gene}_1}\beta_{\mathrm{gene}_1} + x_{i,\mathrm{gene}_2}\beta_{\mathrm{gene}_2} + \dots + x_{i,\mathrm{gene}_p}\beta_{\mathrm{gene}_p} + e_i \tag{4}$$
where y (n × 1) is the response variable or phenotype, i refers to the ith row of any indicator matrix linking observations to the vector of effects, X (n × p) represents the allele calls denoting the allele dosage (e.g., 0, 1, 2 or −1, 0, 1) from the pth gene associated with the trait or its surrogate (genetic marker; in column vectors) in the nth individual (in row vectors), and β is the (p × 1) vector of marker effects. A dummy example of this type of model could look like this:

$$\begin{bmatrix} 5 \\ 3 \\ \vdots \\ 7 \end{bmatrix} = \begin{bmatrix} 0 & 2 & \cdots & 1 & 1 \\ 1 & 0 & \cdots & 2 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 2 & 1 & \cdots & 0 & 2 \end{bmatrix} \begin{bmatrix} 0.001 \\ 0.012 \\ \vdots \\ 0.004 \\ 0.003 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$

where a phenotype is a function of the marker scores multiplied by the marker effects plus some random error. The proposal of Meuwissen et al. [3] was to treat the marker matrix as a random effect to control for the collinearity of variables (repeated information) and for reduced-rank models (lack of degrees of freedom to estimate all effects), making it possible to estimate an effect for each marker (p) even when the number of observations (n) is smaller than the number of markers. In the statistics literature, these were first described as special features and cases of random-effects mixed models, commonly referred to as penalized models (i.e., ridge, lasso, and elastic net regressions, etc.), which aim to identify the most relevant variables when the number of covariates is too big (variable selection) [7]. The proposal of Meuwissen et al. [3] to build a genomic prediction model able to predict the breeding or genetic value of an unobserved individual was possible thanks to the development of these statistical methods and the availability of high-density markers. They used the BayesA and BayesB methods, characterized by specifying a prior distribution of the marker effects [e.g., $\chi^{-2}(v, S)$, where S is a scale parameter and v is the number of degrees of freedom] that, when combined with information from the data, produces a posterior distribution that can be sampled using a Gibbs sampler conditional on all other effects, and hence be used for the estimation of effects and variances [3]. The different Bayesian methods (A, B, C, π, etc.) differ in their distribution assumptions. From the frequentist point of view, for example, BayesA is equivalent to a regular ridge regression, commonly called rrBLUP, where the distribution of the variance parameter follows a normal (Gaussian) distribution. A closer look at Bayesian methods is taken in Subheading 3.

Some years earlier, in 1995, Bernardo [8] was using the animal model to predict the performance of single-cross hybrids. The main difference from the classical animal model was that, instead of deriving the additive relationships among the hybrid parents using the pedigree, Bernardo used RFLP genetic markers to build the matrix of relationships. The formation of this matrix would later be proposed by VanRaden [9]. This model from Bernardo was an extension of the animal model that turned out to be equivalent to the rrBLUP/SNP-BLUP model and closely linked to the BayesA model proposed by Meuwissen et al. [3]. The relationship was not explicitly mentioned by the research groups at the time of the parallel developments; it was Habier et al. [10] who derived the equivalence of these methods some years later.
The small differences that can be found when using these two models are related only to numerical issues in the estimation of variance components. Given their equivalence, the GBLUP form is preferred for its computational efficiency [9], and the marker effects can be derived afterwards as:

$$G = MM' \tag{5}$$

$$P = M'G^{-1} \tag{6}$$

$$u_{\mathrm{rrBLUP}} = P\,u_{\mathrm{GBLUP}} \tag{7}$$
where M is the marker matrix, G is a relationship matrix based on markers to be used in the GBLUP model, P is the partitioned matrix that allows the recovery of marker effects from the rrBLUP model using the BLUPs (u) from individuals in the GBLUP model. This model is known as the partitioned model [11].
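The equivalence in Eqs. 5–7 can be checked numerically. The following is a minimal sketch, not the implementation of any particular package: the marker matrix, the phenotypes, and the variance ratio `lam` are simulated or assumed values, no fixed effects are fitted, and G is assumed to be invertible (which holds here because there are fewer records than markers and the simulated rows are linearly independent).

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 6, 40                                        # fewer records than markers
M = rng.integers(0, 3, size=(n, p)).astype(float)   # hypothetical 0/1/2 allele dosages
y = rng.normal(size=n)                              # hypothetical phenotypes
lam = 2.0                                           # assumed residual-to-marker variance ratio

# rrBLUP (ridge regression): one effect per marker, p x p system
beta_rr = np.linalg.solve(M.T @ M + lam * np.eye(p), M.T @ y)

# GBLUP: one genetic value per individual, n x n system
G = M @ M.T                                         # Eq. 5
u_gblup = G @ np.linalg.solve(G + lam * np.eye(n), y)

# Partitioned model: recover the marker effects from the GBLUP solutions
P = M.T @ np.linalg.inv(G)                          # Eq. 6
beta_from_gblup = P @ u_gblup                       # Eq. 7

print(np.allclose(beta_rr, beta_from_gblup))        # marker effects agree
print(np.allclose(M @ beta_rr, u_gblup))            # genetic values agree
```

In this sketch the GBLUP route only solves an n × n system, which is the computational advantage mentioned above when p is much larger than n.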
1.3 Linear and Nonlinear Genomic Prediction Models

Linear models have a long history of being useful for explaining the quantitative genetics theory developed by many important figures such as Fisher, Haldane, and Wright. These models have been used to explain abstract concepts like heritability and allelic substitution, among others [5]. It was Fisher who came up with the analysis of variance (ANOVA) to partition the variation observed in the data into different sources of genetic and nongenetic variation [12]. The models explained in Subheading 1.2 (i.e., Eq. 4) and used by Fisher [12] depend on estimating few coefficients, given the small number of markers. In that case, when the residual matrix is an identity, there is a closed solution to estimate the effects through the most classical approaches of linear models, where the fixed coefficients have the following solution:

$$\beta = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y \tag{8}$$

$$V = R = \Sigma\sigma^2 \tag{9}$$

$$\sigma^2 = \frac{y'Py}{n - p} \tag{10}$$

$$P = V^{-1} - V^{-1}X\left(X'V^{-1}X\right)^{-1}X'V^{-1} \tag{11}$$
where X is the incidence matrix connecting observations with fixed effects, Σ is a symmetric positive definite matrix (diagonal or nondiagonal), σ² is the error variance, n is the number of observations, p is the number of coefficients to be estimated, and P is the projection matrix: the full expression, $P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$, transforms observations into residuals, while its second portion, $V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$, projects observations into fitted values. This model requires the inverse of V (or R) to obtain the solution for fixed effects; if R is a diagonal matrix, no matrix inversion is required, leading directly to the solution in Eq. 8.

To utilize additive (and sometimes nonadditive) effects, genetic improvement requires identifying the individuals with the highest breeding or genetic value and selecting them, and then replacing them as better ones come along. A major intellectual step into prediction of breeding values came from C. R. Henderson, a student of Lush and Hazel. He recognized that there were "fixed" effects (e.g., herd or season), which had to be included but not estimated, and "random" effects, which were samples from a distribution [13]. The methodology was formalized as BLUP by Henderson and first appeared in his Ph.D. thesis [14]. Linear mixed models were taken to the next level by Searle and Henderson to accommodate the theory of random effects [15]. To estimate the variance parameters of the random effects present in the mixed model, maximum likelihood (ML) and the modified method of restricted maximum likelihood (REML) proposed by Patterson and Thompson [16] became the standard approach. An extensive review of the history of its development can be found in Adams's compiled book in honor of John Nelder and in a review by Thompson et al. [17, 18]. The typical mixed models developed by Searle and Henderson that we will refer to in the next sections are of the form:

$$y \mid u, e = X\beta + Zu + e \tag{12}$$
where y is the response vector conditional on the estimated effects u and e, X and Z are the incidence matrices connecting the observations with the fixed and random effects, respectively, β and u are the vectors of fixed and random effects, and e is the vector of residuals. When there is more than one fixed or random effect, β and u can be partitioned as $\beta = [\beta_1' \dots \beta_i']$ and $u = [u_1' \dots u_j']$, with X and Z partitioned conformably as $X = [X_1 \dots X_i]$ and $Z = [Z_1 \dots Z_j]$, and the residuals e form one long vector. The distributions of the variables are as follows:

$$\begin{bmatrix} y \\ u \\ e \end{bmatrix} \sim \mathrm{MVN}\left( \begin{bmatrix} X\beta \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} V & ZG & R \\ GZ' & G & 0 \\ R & 0 & R \end{bmatrix} \right) \tag{13}$$

with

$$V = ZGZ' + R \tag{14}$$
In the simplest case of one random effect and a single residual variance component, $G = A\sigma_a^2$ and $R = \Sigma\sigma_e^2$, where A and Σ are covariance matrices for the levels of the random variables (e.g., an additive relationship matrix). In a more general case with more than one random effect:

$$G = \bigoplus_{i=1}^{b'} G_i = \begin{bmatrix} G_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & G_{b'} \end{bmatrix} \tag{15}$$

and

$$R = \bigoplus_{j=1}^{s'} R_j = \begin{bmatrix} R_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & R_{s'} \end{bmatrix} \tag{16}$$
The default assumption is that each random model term generates one component of this direct sum. This means that the random effects from any two distinct model (random) terms are uncorrelated. However, in some models, one component of G may apply across several model terms, for example, in random coefficient regression where the random intercepts and slopes for subjects are correlated [17–19].
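The direct sum in Eq. 15 and its role in Eq. 14 can be illustrated with a small numerical sketch. The helper `direct_sum`, the two random terms, their incidence matrices, and the variance values below are all hypothetical and chosen only for illustration.

```python
import numpy as np

def direct_sum(*blocks):
    """Block-diagonal (direct sum) arrangement of square matrices, as in Eq. 15."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out

# Hypothetical first random term: 2 related levels, assumed variance 0.8
A1 = np.array([[1.0, 0.5],
               [0.5, 1.0]])
G1 = 0.8 * A1
# Hypothetical second random term: 3 independent levels, assumed variance 0.3
G2 = 0.3 * np.eye(3)

G = direct_sum(G1, G2)                 # terms are uncorrelated with each other
R = 1.5 * np.eye(4)                    # single residual variance component

# Incidence matrices linking 4 records to the levels of each term
Z1 = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Z2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
Z = np.hstack([Z1, Z2])

V = Z @ G @ Z.T + R                    # Eq. 14
print(G)
print(V)
```

Because the off-diagonal blocks of G are zero, V is simply the sum of the contributions of each random term plus R.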
In the case of a more complex variance–covariance structure (distribution of the random variable), the formulation can be expanded using Kronecker products. For example, the $G_1$ matrix for that random effect would become:

$$G_1 = \begin{bmatrix} \sigma^2_{a_{l_1}} & \cdots & \sigma_{a_{l_1} a_{l_i}} \\ \vdots & \ddots & \vdots \\ \sigma_{a_{l_1} a_{l_i}} & \cdots & \sigma^2_{a_{l_i}} \end{bmatrix} \otimes A_1 \tag{17}$$

where $l_i$ is the ith level of a second variable that defines the covariance structure for the random effect a, and would replace the $G_1$ partition in the original distribution and equations. If the variance–covariance parameters were known, β and u could be estimated easily as follows. With the direct-inversion method (problem dimensions n × n):

$$\beta = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y \tag{18}$$

$$u = GZ'V^{-1}(y - X\beta) \tag{19}$$

where V was defined in Eqs. 14, 15, and 16. Or, with the mixed-model-equation or Henderson's method (problem dimensions p × p), the following equations can be formed:

$$C\eta = \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ u \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix} \tag{20}$$

The first matrix on the left side of Eq. 20 is the coefficient matrix C, and η = [β u]'. It can then be rearranged to obtain β and u as:

$$\begin{bmatrix} \beta \\ u \end{bmatrix} = \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix}^{-1} \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix} \tag{21}$$

The estimation of β and u requires the variance–covariance parameters to be known, which is usually not the case. The next section explores some solutions to this problem.

1.4 REML-Based Estimation Methods Used in Software for Genomic Prediction
A key point when using either of the two methods referred to above for estimating β and u is that the variance–covariance parameters are usually not known. There is no closed solution for calculating all these variance–covariance parameters. In practice, an iterative approach is used to obtain estimates of the parameters. Maximum likelihood (ML) and restricted maximum likelihood (REML) were proposed to obtain these parameters [18, 19]. The first step is to define the likelihood function of the observed data. Because of its mathematical properties (i.e., it changes products into sums), the log of the REML likelihood is used instead:

$$L = \frac{1}{2}\left[\log\det\left(X'V^{-1}X\right) + \log\det(V) + v\log\sigma^2 + \frac{y'Py}{\sigma^2}\right] \tag{22}$$

$$L = \frac{1}{2}\left[\log\det(C) + \log\det(R) + \log\det(G) + v\log\sigma^2 + \frac{y'Py}{\sigma^2}\right] \tag{23}$$
Depending on the estimation method, direct inversion or mixed-model-equation based [14], Eq. 22 or Eq. 23 is used, where X and V were defined in Eqs. 12 and 14 and v = n − t, with n being the number of observations, t the number of fixed coefficients to be estimated, σ² the error variance, C the coefficient matrix, and P the so-called projection matrix, which, again depending on the estimation method, can be obtained as:

$$P = V^{-1} - V^{-1}X\left(X'V^{-1}X\right)^{-1}X'V^{-1} \tag{24}$$

$$P = R^{-1} - R^{-1}WC^{-1}W'R^{-1} \tag{25}$$
where W = [X Z]. As previously mentioned, there are two ways to solve the linear system. One depends on directly inverting V = ZGZ' + R, which is a symmetric matrix of size n × n, n being the number of records (called the direct-inversion algorithm from now on). The other depends on inverting the coefficient matrix C of dimensions p × p, p being the number of coefficients (fixed plus random) to be estimated (called the mixed-model-equation-based algorithm from now on) [20]. Once the likelihood function has been defined, different methods have been proposed to estimate the variance parameters based on maximum likelihood; they can be classified as derivative-free [21] and derivative-based algorithms [18]. The derivative-based algorithms are more commonly used, and they include algorithms based on first derivatives (e.g., Expectation–Maximization) [22] and algorithms based on first and second derivatives (e.g., Fisher-Scoring-type algorithms) [23, 24]. The algorithms based on first and second derivatives, like Expected-Information, Observed-Information, and Average-Information [23], have come to dominate in genomic prediction models given their ability to converge faster. The reader can check the canonical references for the methods based on derivative-free algorithms [21] and for the ones based only on first derivatives like Expectation–Maximization [22–24]. Here we focus on methods based on the first and second derivatives, given their wider use in genomic prediction models.

1.5 The Algorithm for Genomic Prediction Programs Based on First and Second Derivatives

In this section we summarize the steps for estimating the variance parameters for the univariate case under two approaches: when it is more convenient to work with a model of dimensions n × n (direct-inversion algorithm), n being the number of records, and when it is more convenient to work with a model of dimensions p × p [mixed-model-equation (MME)-based algorithm], p being the number of fixed and random effects to be estimated. The iterative algorithms use the typical Newton method:

$$\Theta^{(k+1)} = \Theta^{(k)} + H^{-1}\left.\frac{\partial L}{\partial\Theta}\right|_{\Theta^{(k)}} \tag{26}$$
The vector of variance–covariance parameters is $\Theta = [\theta_1 \dots \theta_i\ \gamma_1 \dots \gamma_j]'$, H is a matrix of second derivatives, and $\left.\partial L/\partial\Theta\right|_{\Theta^{(k)}}$ is the vector of first derivatives of the likelihood with respect to the variance parameters in the current iteration. The matrix of second derivatives can be the Hessian matrix (Newton–Raphson), the minus expected value of the Hessian matrix (Fisher Scoring), or the average of the last two matrices (Average Information) [23]. When the direct-inversion algorithm is used, the first derivatives are obtained as follows:

$$\frac{\partial L}{\partial\theta_i} = \frac{1}{2}\left[\mathrm{tr}\left(P\frac{\partial V}{\partial\theta_i}\right) - y'P\frac{\partial V}{\partial\theta_i}Py\right] \tag{27}$$

with

$$\frac{\partial V}{\partial\theta_i} = Z_iA_iZ_i' \quad \text{and} \quad \frac{\partial V}{\partial\gamma_j} = \Sigma \tag{28}$$
Most of the time Σ = I, an identity matrix, but it can be a dense matrix as well, and $A_i$ is typically a covariance matrix among levels of the ith random effect (e.g., a relationship matrix). Notice that when there are complex covariance structures in the mixed model, the variance and covariance components can be specified as independent random effects by changing the composition of $Z_i$ and $A_i$. For example, in an unstructured model, $Z_i = [Z_{i(j)}]$ for a variance component and $Z_i = [Z_{i(j)}\ Z_{i(k)}]$ for a covariance component, with j and k being the jth row and kth column in the variance structure for the ith random effect; likewise, $A_i = [A_i]$ for a variance component and $A_i = [A_i \ddots A_i]$ for a covariance component, where $\ddots$ refers to the diagonal binding of matrices. When the MME-based algorithm is used, the first derivatives are obtained as follows:

$$\frac{\partial L}{\partial\theta_i} = q_i\theta_i^{-1} - \theta_i^{-1}\left[T_i + S_i\right]\theta_i^{-1} \tag{29}$$

$$t_{i(jk)} = \mathrm{tr}\left(A_i^{-1}C_i^{u(j)u(k)}\right) \tag{30}$$

$$s_{i(jk)} = u_{i(j)}'A_i^{-1}u_{i(k)} \tag{31}$$

$$\hat{U}_i = \left[u_{(j)} \dots u_{(k)}\right]\theta_i^{-1} \tag{32}$$

$$\theta_i^{-1}S_i\theta_i^{-1} = \hat{U}_i'A_i^{-1}\hat{U}_i \tag{33}$$

So

$$\frac{\partial L}{\partial\theta_i} = q_i\theta_i^{-1} - \theta_i^{-1}T_i\theta_i^{-1} - \hat{U}_i'A_i^{-1}\hat{U}_i \tag{34}$$
For a random effect, $\theta_i$ refers to the variance–covariance matrix for the ith random effect with j rows and k columns (a matrix in the case of complex covariance structures), $t_{i(jk)}$ refers to the jth row and kth column of the matrix of traces $T_i$ of the ith random effect covariance structure, and $s_{i(jk)}$ refers to the jth row and kth column of the matrix of sums of squares of the BLUPs of the ith random effect covariance structure. In addition, $q_i$ is the number of coefficients for the ith random effect, $C_i^{u(j)u(k)}$ is the jkth partition of the inverse of the coefficient matrix mapping to the ith random effect, $A_i^{-1}$ is the inverse of the covariance matrix among the levels of the ith random effect, and $u_{(j)}$ and $u_{(k)}$ are the jth and kth BLUPs of the ith random effect. For a residual effect, the first derivatives are obtained as follows:

$$\frac{\partial L}{\partial\gamma_i} = \mathrm{tr}\left(R_{jk}R^{-1}\right) - \mathrm{tr}\left(C^{-1}W'R^{-1}R_{jk}R^{-1}W\right) - e'R^{-1}R_{jk}R^{-1}e \tag{35}$$

$$e = y - X\beta - Zu \tag{36}$$

$$R = \sum_{j}\sum_{k} R_{jk}\,\gamma_{R(jk)} \tag{37}$$
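where W = [X Z] is the column binding of X and the Z's, $R^{-1}$ is the inverse of the complete residual matrix as defined in Eq. 16, $R_{jk}$ is an indicator matrix defined in Eq. 37, and γ is the variance–covariance component for the residual term. When the direct-inversion algorithm is used, the second derivatives are (with the residual term being diagonal) as follows:

$$H = \frac{\partial^2 L}{\partial\gamma_i\,\partial\theta_j} = \frac{1}{2}\begin{bmatrix} \mathrm{tr}(PP) - 2y'PPPy & \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_j}P\right) - 2y'P\frac{\partial V}{\partial\theta_j}PPy \\ \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_j}P\right) - 2y'P\frac{\partial V}{\partial\theta_j}PPy & \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_i}P\frac{\partial V}{\partial\theta_j}\right) - 2y'P\frac{\partial V}{\partial\theta_i}P\frac{\partial V}{\partial\theta_j}Py \end{bmatrix} \tag{38}$$

or

$$F = -E\left[\frac{\partial^2 L}{\partial\gamma_i\,\partial\theta_j}\right] = \frac{1}{2}\begin{bmatrix} \mathrm{tr}(PP) & \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_j}P\right) \\ \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_j}P\right) & \mathrm{tr}\left(P\frac{\partial V}{\partial\theta_i}P\frac{\partial V}{\partial\theta_j}\right) \end{bmatrix} \tag{39}$$

or

$$AI = \frac{(-H + F)}{2} = \frac{1}{2}\begin{bmatrix} y'PPPy & y'P\frac{\partial V}{\partial\theta_j}PPy \\ y'P\frac{\partial V}{\partial\theta_j}PPy & y'P\frac{\partial V}{\partial\theta_i}P\frac{\partial V}{\partial\theta_j}Py \end{bmatrix} \tag{40}$$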
depending on whether the Hessian, Fisher, or Average Information matrix is used, respectively. When the Average Information matrix is used in the MME-based algorithm, the second derivatives can be obtained either by performing absorption (Gaussian elimination) or by the Cholesky factorization/decomposition of the mixed model equations expanded by the response and the working variates, which are defined below [24, 25]. First, the extended mixed model equations are formed as follows:

$$M = \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z & X'R^{-1}y_j \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} & Z'R^{-1}y_j \\ y_i'R^{-1}X & y_i'R^{-1}Z & y_i'R^{-1}y_j \end{bmatrix} \tag{41}$$

After performing the Cholesky factorization, the square of the last diagonal element of the factor, in the position of $y_i'R^{-1}y_j$, equals y'Py if $y_i$ and $y_j$ are both the response variable [25]. This trick allows us to obtain all the elements of the second derivatives by replacing $y_i$ and $y_j$ with the different working variates, which are obtained as follows:

$$y\left(\theta_{i(jk)}\right) = \frac{\partial V}{\partial\theta_i}Py = Z_{i(j)}\hat{U}_{i(k)} + Z_{i(k)}\hat{U}_{i(j)} \tag{42}$$

If there is no covariance component, the working variate reduces to

$$y\left(\theta_{i(jk)}\right) = \frac{\partial V}{\partial\theta_{i(jk)}}Py = Z_{i(j)}\hat{U}_{i(j)} \tag{43}$$

And for the residual working variates:

$$y(\gamma_i) = \frac{\partial V}{\partial\gamma_{i(jk)}}Py = R_{jk}R^{-1}(y - X\beta - Zu) \tag{44}$$
These are defined for the jth and kth partitions of the ith random effect in the incidence matrix Z, and $\hat{U}$ was defined in Eq. 32. The working variates can be seen as a fitted value for the ith random effect divided by the ith variance–covariance matrix. Each element of the average information matrix can be obtained in the same way, by forming M with the right working variates or by forming a multivariate response with all working variates, $y(\Theta) = [y(\theta_1) \dots y(\gamma_i)]$, and performing absorption of M into the last "i" rows and columns [23]. For example, consider the element y'PPPy in the AI matrix for a residual term (Eq. 40), that is,

$$y'PPPy = y(\gamma_1)'\,P\,y(\gamma_1) \tag{45}$$

For a cross product in the AI matrix, the M matrix is formed as follows:

$$M = \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z & X'R^{-1}y(\gamma_j) \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} & Z'R^{-1}y(\gamma_j) \\ y(\theta_i)'R^{-1}X & y(\theta_i)'R^{-1}Z & y(\theta_i)'R^{-1}y(\gamma_j) \end{bmatrix} \tag{46}$$

and after the Cholesky factorization, the square of the last diagonal element will be $y'P\frac{\partial V}{\partial\theta_i}PPy$. When $y_i$ and $y_j$ are equal to the response variable, twice the sum of the logarithms of the first n − 1 diagonal elements of $M_{Chol}$ (L from the L'L factorization) is equal to the log(|C|) to be used in the log-likelihood calculation [25]:

$$\log(|C|) = 2\sum_{k=1}^{n-1}\log(l_k) \tag{47}$$
The iterative procedure to estimate the variance–covariance parameters can be summarized as follows in Table 1.
Table 1 Steps for estimating variance–covariance components using two different REML algorithms

| Step | MME-based algorithm (p × p) | Direct-inversion-based algorithm (n × n) |
|---|---|---|
| 1 | Construct the extended mixed model equations (M) using initial values of the variance–covariance parameters (Θ) and perform Cholesky factorization ($M_{chol}$). | Build the phenotypic variance matrix (V) and its inverse ($V^{-1}$). |
| 2 | Estimate β and $u_i$'s by back-solving using $M_{chol}$; get y'Py from $M_{chol(n \times n)}^2$, log(\|C\|), and the residuals e. | Calculate the projection matrix (P). |
| 3 | Calculate the working variates [$y(\theta_i)$]. | Calculate the log-likelihood. |
| 4 | Calculate the second derivatives by absorption (Gaussian elimination) or Cholesky factorization in the extended M matrix by the multivariate working variate. | Calculate the first derivatives. |
| 5 | Calculate the log-likelihood. | Calculate the second derivatives. |
| 6 | Evaluate sufficiently $C^{-1}$ to obtain tr($C^{ii}$) to calculate the first derivatives. | Update the vector of variance–covariance parameters. |
| 7 | Update the vector of variance–covariance parameters. | Repeat steps 1–6 until convergence. |
| 8 | Repeat steps 2–7 until convergence. | Estimate β and $u_i$'s. |
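The direct-inversion column of Table 1 can be sketched in a few lines. This is a simplified illustration under several assumptions of ours, not the code of any package: V is taken to be linear in the parameters (V = Σ Θ_k ∂V/∂Θ_k), L is taken as the REML log-likelihood (so the update adds the AI-scaled score), the step-halving rule used to stay inside the parameter space is a crude heuristic, and the data are simulated. The function names `projection` and `ai_reml` are hypothetical.

```python
import numpy as np

def projection(X, V):
    """Projection matrix P (Eq. 24) and X'V^-1 X for a given V."""
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    return P, XtViX

def ai_reml(y, X, dV, start, max_iter=50, tol=1e-8):
    """Average Information REML for V = sum_k params[k] * dV[k]."""
    params = np.asarray(start, dtype=float)
    loglik = None
    for _ in range(max_iter):
        V = sum(t * D for t, D in zip(params, dV))        # step 1: build V
        P, XtViX = projection(X, V)                       # step 2: projection matrix
        Py = P @ y
        loglik = -0.5 * (np.linalg.slogdet(V)[1]          # step 3: REML log-likelihood
                         + np.linalg.slogdet(XtViX)[1] + y @ Py)
        # step 4: first derivatives of the log-likelihood
        score = np.array([-0.5 * (np.trace(P @ D) - Py @ D @ Py) for D in dV])
        # step 5: average information matrix
        AI = 0.5 * np.array([[Py @ Di @ P @ Dj @ Py for Dj in dV] for Di in dV])
        step = np.linalg.solve(AI, score)                 # step 6: Newton-type update
        while np.any(params + step <= 0):                 # heuristic: halve the step
            step = step / 2
        params = params + step
        if np.max(np.abs(step)) < tol:                    # step 7: convergence check
            break
    return params, loglik

# Simulated balanced example: q unrelated levels with `reps` records each
rng = np.random.default_rng(7)
q, reps = 40, 5
n = q * reps
Z = np.kron(np.eye(q), np.ones((reps, 1)))
X = np.ones((n, 1))
A = np.eye(q)                                             # assumption: unrelated levels
u = rng.normal(scale=np.sqrt(2.0), size=q)
y = 10 + Z @ u + rng.normal(scale=1.0, size=n)

est, ll = ai_reml(y, X, [Z @ A @ Z.T, np.eye(n)], start=[1.0, 1.0])
print(est, ll)
```

For balanced data such as these, the REML estimates should coincide with the classical ANOVA estimators whenever the latter are positive, which gives a simple way to check the sketch.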
Table 2 Performance of maize single crosses published by Bernardo [26]

| Set | ID | Male | Female | Yield |
|---|---|---|---|---|
| 1 | SC1 | B73 | Mo17 | 7.85 |
| 1 | SC2 | H123 | Mo17 | 7.36 |
| 1 | SC3 | B84 | N197 | 5.61 |
| 2 | SC2 | H123 | Mo17 | 7.47 |
| 2 | SC3 | B84 | N197 | 5.96 |
1.6 A Minimal Example of Estimation of Variance–Covariance Parameters

Consider five records from three single-cross hybrids (Table 2) with known pedigree, published by Bernardo [26] and used by him to demonstrate the expectation–maximization (EM) algorithm. With these data we can simulate one iteration of the Average Information algorithm. The information in this data frame can be used to form the mixed model with "Set" as a fixed factor and Male and Female as a single overlaid random effect:

$$y \mid u, e = X\beta + Zu + e$$

$$\begin{bmatrix} 7.85 \\ 7.36 \\ 5.61 \\ 7.47 \\ 5.96 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_{B73} \\ u_{B84} \\ u_{H123} \\ u_{Mo17} \\ u_{N197} \end{bmatrix} + e$$

for which we need to estimate the variance components and the fixed (β) and random (u) effects, with:

$$G = A\theta_1 = \begin{bmatrix} 1 & 0.265 & 0.750 & 0 & 0 \\ 0.265 & 1 & 0.19875 & 0 & 0 \\ 0.750 & 0.19875 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0.75 \\ 0 & 0 & 0 & 0.75 & 1 \end{bmatrix}\theta_1$$

and

$$G^{-1} = A^{-1}\theta_1^{-1} = \begin{bmatrix} 2.361 & -0.285 & -1.714 & 0 & 0 \\ -0.285 & 1.075 & 0 & 0 & 0 \\ -1.714 & 0 & 2.285 & 0 & 0 \\ 0 & 0 & 0 & 2.285 & -1.714 \\ 0 & 0 & 0 & -1.714 & 2.285 \end{bmatrix}\theta_1^{-1}$$

$$R = I_5\,\gamma_1 = \mathrm{diag}(0.1, 0.1, 0.1, 0.1, 0.1)$$

$$R^{-1} = I_5\,\gamma_1^{-1} = \mathrm{diag}(10, 10, 10, 10, 10)$$
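The matrices above can be assembled directly from the records in Table 2. The sketch below is only one possible way of doing so: the overlay of Male and Female is implemented by letting both parents of a hybrid add to the same set of columns, the parents are ordered alphabetically (which happens to match the order used in the chapter), and the A matrix is typed in from the text rather than derived from a pedigree.

```python
import numpy as np

# Records of Table 2: (set, hybrid, male, female, yield)
records = [
    (1, "SC1", "B73", "Mo17", 7.85),
    (1, "SC2", "H123", "Mo17", 7.36),
    (1, "SC3", "B84", "N197", 5.61),
    (2, "SC2", "H123", "Mo17", 7.47),
    (2, "SC3", "B84", "N197", 5.96),
]
y = np.array([r[4] for r in records])
sets = sorted({r[0] for r in records})
parents = sorted({name for r in records for name in r[2:4]})

X = np.zeros((len(records), len(sets)))
Z = np.zeros((len(records), len(parents)))
for i, (s, _hybrid, male, female, _yield) in enumerate(records):
    X[i, sets.index(s)] = 1.0
    Z[i, parents.index(male)] += 1.0      # overlay: both parents share
    Z[i, parents.index(female)] += 1.0    # the same set of columns

# Relationship matrix among parents, copied from the text
A = np.array([
    [1.000, 0.265, 0.75000, 0.00, 0.00],
    [0.265, 1.000, 0.19875, 0.00, 0.00],
    [0.750, 0.19875, 1.000, 0.00, 0.00],
    [0.000, 0.000, 0.00000, 1.00, 0.75],
    [0.000, 0.000, 0.00000, 0.75, 1.00],
])
theta1, gamma1 = 0.1, 0.1                 # initial values used in the example
G_inv = np.linalg.inv(A) / theta1
R_inv = np.eye(len(y)) / gamma1

print(parents)
print(Z)
print(np.round(G_inv, 3))
```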
The initial values for the variance components are (assuming a scaled response with mean 0 and variance 1):

$$\Theta = [\theta_1\ \ \gamma_1] = [0.1\ \ 0.1]$$

The extended mixed model equations, $M = [X\ Z\ y]'R^{-1}[X\ Z\ y]$ (with $G^{-1}$ added in the portion corresponding to $Z_i$ in $C_{ii}$), are:

$$M^{y,y} = \begin{bmatrix} 30 & 0 & 10 & 10 & 10 & 20 & 10 & 208.2 \\ 0 & 20 & 0 & 10 & 10 & 10 & 10 & 134.3 \\ 10 & 0 & 33.612 & -2.850 & -17.143 & 10 & 0 & 78.5 \\ 10 & 10 & -2.850 & 30.755 & 0 & 0 & 20 & 115.7 \\ 10 & 10 & -17.143 & 0 & 42.857 & 20 & 0 & 148.3 \\ 20 & 10 & 10 & 0 & 20 & 52.857 & -17.143 & 226.8 \\ 10 & 10 & 0 & 20 & 0 & -17.143 & 42.857 & 115.7 \\ 208.2 & 134.3 & 78.5 & 115.7 & 148.3 & 226.8 & 115.7 & 2385.867 \end{bmatrix}$$
where the coefficient matrix (C) is this matrix without the last row and last column. We then perform the Cholesky factorization:

$$M^{y,y}_{Chol} = L = \begin{bmatrix} 5.477 & 0 & 1.825 & 1.825 & 1.825 & 3.651 & 1.825 & 38.011 \\ 0 & 4.472 & 0 & 2.236 & 2.236 & 2.236 & 2.236 & 30.030 \\ 0 & 0 & 5.502 & -1.123 & -3.721 & 0.605 & -0.605 & 1.653 \\ 0 & 0 & 0 & 4.599 & -2.720 & -2.388 & 2.388 & -4.128 \\ 0 & 0 & 0 & 0 & 3.643 & 1.122 & -1.122 & 1.830 \\ 0 & 0 & 0 & 0 & 0 & 5.214 & -4.118 & 1.521 \\ 0 & 0 & 0 & 0 & 0 & 0 & 3.198 & -0.521 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3.662 \end{bmatrix}$$

We back-solve using the $M_{Chol}$ matrix without the last row to obtain β and u, whereas y'Py is the square of the last diagonal element of $M_{Chol}$ and e = y − Xβ − Zu:

$$[LHS \mid RHS] = \left[\begin{array}{ccccccc|c} 5.477 & 0 & 1.825 & 1.825 & 1.825 & 3.651 & 1.825 & 38.011 \\ 0 & 4.472 & 0 & 2.236 & 2.236 & 2.236 & 2.236 & 30.030 \\ 0 & 0 & 5.502 & -1.123 & -3.721 & 0.605 & -0.605 & 1.653 \\ 0 & 0 & 0 & 4.599 & -2.720 & -2.388 & 2.388 & -4.128 \\ 0 & 0 & 0 & 0 & 3.643 & 1.122 & -1.122 & 1.830 \\ 0 & 0 & 0 & 0 & 0 & 5.214 & -4.118 & 1.521 \\ 0 & 0 & 0 & 0 & 0 & 0 & 3.198 & -0.521 \end{array}\right]$$

$$[\beta \quad u] = [\,6.769 \quad 6.759 \quad 0.436 \quad {-0.490} \quad 0.402 \quad 0.162 \quad {-0.162}\,]$$

$$y'Py = M^2_{Chol(n,n)} = 3.662^2 = 13.415$$

$$\log(|C|) = 2\sum_{k=1}^{n-1}\log(l_k) = 2\,(\log(5.477) + \dots + \log(3.198)) = 21.073$$

$$e = y - X\beta - Zu = \begin{bmatrix} 7.85 \\ 7.36 \\ 5.61 \\ 7.47 \\ 5.96 \end{bmatrix} - \begin{bmatrix} 6.769 \\ 6.769 \\ 6.769 \\ 6.759 \\ 6.759 \end{bmatrix} - \begin{bmatrix} 0.599 \\ 0.565 \\ -0.653 \\ 0.565 \\ -0.653 \end{bmatrix} = \begin{bmatrix} 0.480 \\ 0.025 \\ -0.506 \\ 0.145 \\ -0.145 \end{bmatrix}$$

We calculate the working variates:

$$\hat{U}_i = \left[u_{(j)} \dots u_{(k)}\right]\theta_i^{-1} = [\,0.436 \quad {-0.490} \quad 0.402 \quad 0.162 \quad {-0.162}\,]\,[0.1]^{-1} = [\,4.364 \quad {-4.904} \quad 4.021 \quad 1.629 \quad {-1.629}\,]$$
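The Cholesky and back-solving steps above can be reproduced with a short script. This is a sketch under our own conventions: NumPy returns the lower factor, so it is transposed to obtain the upper factor shown in the text, and the direct-inversion solutions of Eqs. 18 and 19 are added only as a cross-check. Printed values may differ from the text in the last digit because of rounding.

```python
import numpy as np

y = np.array([7.85, 7.36, 5.61, 7.47, 5.96])
X = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z = np.array([[1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1]], dtype=float)
A = np.array([[1.000, 0.265, 0.75000, 0.00, 0.00],
              [0.265, 1.000, 0.19875, 0.00, 0.00],
              [0.750, 0.19875, 1.000, 0.00, 0.00],
              [0.000, 0.000, 0.00000, 1.00, 0.75],
              [0.000, 0.000, 0.00000, 0.75, 1.00]])
theta, gamma = 0.1, 0.1                        # initial variance components

# Extended mixed model equations M = [X Z y]' R^-1 [X Z y], plus G^-1 on the Z block
Wy = np.column_stack([X, Z, y])
M = Wy.T @ Wy / gamma
M[2:7, 2:7] += np.linalg.inv(A) / theta

U = np.linalg.cholesky(M).T                    # upper factor, M = U'U
sol = np.linalg.solve(U[:7, :7], U[:7, 7])     # back-solve without the last row
beta, u = sol[:2], sol[2:]
yPy = U[7, 7] ** 2                             # square of the last diagonal element
logdetC = 2 * np.sum(np.log(np.diag(U)[:7]))   # Eq. 47
e = y - X @ beta - Z @ u

# Cross-check with the direct-inversion solutions (Eqs. 18 and 19)
V = theta * Z @ A @ Z.T + gamma * np.eye(5)
Vi = np.linalg.inv(V)
beta_direct = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
u_direct = theta * A @ Z.T @ Vi @ (y - X @ beta_direct)

print(np.round(beta, 3), np.round(u, 3))
print(round(yPy, 3), round(logdetC, 3))
print(np.round(e, 3))
```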
$$y\left(\theta_{i(jk)}\right) = \frac{\partial V}{\partial\theta_{i(jk)}}Py = Z_{i(j)}\hat{U}_{i(j)} = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 4.364 \\ -4.904 \\ 4.021 \\ 1.629 \\ -1.629 \end{bmatrix} = \begin{bmatrix} 5.993 \\ 5.650 \\ -6.534 \\ 5.650 \\ -6.534 \end{bmatrix}$$

$$y(\gamma_i) = \frac{\partial V}{\partial\gamma_{i(jk)}}Py = R_{jk}R^{-1}(y - X\beta - Zu) = I_5 \cdot 10\,I_5 \begin{bmatrix} 0.480 \\ 0.025 \\ -0.506 \\ 0.145 \\ -0.145 \end{bmatrix} = \begin{bmatrix} 4.809 \\ 0.252 \\ -5.061 \\ 1.457 \\ -1.457 \end{bmatrix}$$
The log-likelihood is:

$$L = \frac{1}{2}\left[\log(\det(C)) + q_i\log(\det(\theta_i)) + \log(\det(A_i)) + N\log(\det(\gamma_j)) + y'Py\right] = \frac{1}{2}\left[(21.073) + (-11.512) + (-1.726) + (-11.512) + (13.415)\right] = 4.868$$

We obtain the second derivatives by performing the Cholesky factorization of the M matrix formed using the pertinent working variates instead of the regular response y. For example, for the off-diagonal element of the AI matrix, M is formed as in Eq. 46 with both working variates appended:

$$M^{y(\theta_i),y(\gamma_j)} = [X\ Z\ y(\theta_i)\ y(\gamma_j)]'\,R^{-1}\,[X\ Z\ y(\theta_i)\ y(\gamma_j)] + \begin{bmatrix} 0 & 0 & 0 \\ 0 & G^{-1} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

with

$$[X\ Z\ y(\theta_i)\ y(\gamma_j)] = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1 & 0 & 5.993 & 4.809 \\ 1 & 0 & 0 & 0 & 1 & 1 & 0 & 5.650 & 0.252 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & -6.534 & -5.061 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 5.650 & 1.457 \\ 0 & 1 & 0 & 1 & 0 & 0 & 1 & -6.534 & -1.457 \end{bmatrix}, \quad R^{-1} = 10\,I_5$$

$$G^{-1} = \begin{bmatrix} 23.612 & -2.850 & -17.143 & 0 & 0 \\ -2.850 & 10.755 & 0 & 0 & 0 \\ -17.143 & 0 & 22.857 & 0 & 0 \\ 0 & 0 & 0 & 22.857 & -17.143 \\ 0 & 0 & 0 & -17.143 & 22.857 \end{bmatrix}$$

which gives

$$M^{y(\theta_i),y(\gamma_j)} = \begin{bmatrix} 30 & 0 & 10 & 10 & 10 & 20 & 10 & 51.102 & 0 \\ 0 & 20 & 0 & 10 & 10 & 10 & 10 & -8.837 & 0 \\ 10 & 0 & 33.612 & -2.850 & -17.143 & 10 & 0 & 59.940 & 48.095 \\ 10 & 10 & -2.850 & 30.755 & 0 & 0 & 20 & -130.693 & -65.192 \\ 10 & 10 & -17.143 & 0 & 42.857 & 20 & 0 & 113.018 & 17.097 \\ 20 & 10 & 10 & 0 & 20 & 52.857 & -17.143 & 172.958 & 65.192 \\ 10 & 10 & 0 & 20 & 0 & -17.143 & 42.857 & -130.693 & -65.192 \\ 51.102 & -8.837 & 59.940 & -130.693 & 113.018 & 172.958 & -130.693 & 1851.963 & 810.895 \\ 0 & 0 & 48.095 & -65.192 & 17.097 & 65.192 & -65.192 & 810.895 & 530.649 \end{bmatrix}$$

$$M^{y(\sigma_a^2),y(\sigma_e^2)}_{Chol} = \begin{bmatrix} 5.477 & 0 & 1.825 & 1.825 & 1.825 & 3.651 & 1.825 & 9.329 & 0 \\ 0 & 4.472 & 0 & 2.236 & 2.236 & 2.236 & 2.236 & -1.976 & 0 \\ 0 & 0 & 5.502 & -1.123 & -3.721 & 0.605 & -0.605 & 7.797 & 8.740 \\ 0 & 0 & 0 & 4.599 & -2.720 & -2.388 & 2.388 & -29.249 & -12.037 \\ 0 & 0 & 0 & 0 & 3.643 & 1.122 & -1.122 & 13.678 & 4.631 \\ 0 & 0 & 0 & 0 & 0 & 5.214 & -4.118 & 10.232 & 4.977 \\ 0 & 0 & 0 & 0 & 0 & 0 & 3.198 & -3.507 & -1.705 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 23.248 & 11.631 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 11.179 \end{bmatrix}$$

and do the same to obtain all terms in the AI matrix:
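$$AI = \frac{1}{2}\begin{bmatrix} y'PPPy & y'P\frac{\partial V}{\partial\theta_j}PPy \\ y'P\frac{\partial V}{\partial\theta_j}PPy & y'P\frac{\partial V}{\partial\theta_j}P\frac{\partial V}{\partial\theta_j}Py \end{bmatrix} = \begin{bmatrix} 23.248^2 & 11.631^2 \\ 11.631^2 & 11.179^2 \end{bmatrix} = \begin{bmatrix} 540.493 & 135.279 \\ 135.279 & 124.967 \end{bmatrix}$$

$$AI^{-1} = \begin{bmatrix} 540.493 & 135.279 \\ 135.279 & 124.967 \end{bmatrix}^{-1} = \begin{bmatrix} 0.0025 & -0.0027 \\ -0.0027 & 0.0109 \end{bmatrix}$$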
and the first derivatives:

$$
t_{i(jk)} = \mathrm{tr}\left(A_i^{-1} C_i^{u(j)u(k)}\right)
= \mathrm{tr}\left(
\begin{bmatrix}
2.361 & -0.285 & -1.714 & 0 & 0 \\
-0.285 & 1.075 & 0 & 0 & 0 \\
-1.714 & 0 & 2.285 & 0 & 0 \\
0 & 0 & 0 & 2.285 & -1.714 \\
0 & 0 & 0 & -1.714 & 2.285
\end{bmatrix}
\begin{bmatrix}
0.085 & 0.042 & 0.063 & -0.005 & 0.005 \\
0.042 & 0.078 & 0.040 & 0.007 & -0.007 \\
0.063 & 0.040 & 0.079 & -0.006 & 0.006 \\
-0.005 & 0.007 & -0.006 & 0.098 & 0.077 \\
0.005 & -0.007 & 0.006 & 0.077 & 0.098
\end{bmatrix}
\right) = [0.408]
$$

$$
s_{i(jk)} = \hat{u}_{i(j)}' A_i^{-1} \hat{u}_{i(k)}
=
\begin{bmatrix} 4.364 & -4.904 & 4.021 & 1.629 & -1.629 \end{bmatrix}
\begin{bmatrix}
2.361 & -0.285 & -1.714 & 0 & 0 \\
-0.285 & 1.075 & 0 & 0 & 0 \\
-1.714 & 0 & 2.285 & 0 & 0 \\
0 & 0 & 0 & 2.285 & -1.714 \\
0 & 0 & 0 & -1.714 & 2.285
\end{bmatrix}
\begin{bmatrix} 4.364 \\ -4.904 \\ 4.021 \\ 1.629 \\ -1.629 \end{bmatrix}
= [81.089]
$$

$$
\frac{\partial L}{\partial\theta_i}
= -\left( q_i\theta_i^{-1} - \theta_i^{-1} T_i \theta_i^{-1} - \hat{U}_i' A_i^{-1} \hat{U}_i \right)
= -\left( 5[10] - [10][0.408][10] - [81.089] \right) = 71.958
$$

$$
\frac{\partial L}{\partial\gamma_i}
= -\left[ \mathrm{tr}\left(R_{jk}R^{-1}\right) - \mathrm{tr}\left(C^{-1}W'R^{-1}R_{jk}R^{-1}W\right) - e'R^{-1}R_{jk}R^{-1}e \right]
= -\left[ \mathrm{tr}\left(I_5\,[10\,I_5]\right) - \mathrm{tr}\left(C^{-1}W'\,[100\,I_5]\,W\right) - \hat{e}'\,[100\,I_5]\,\hat{e} \right] = [32.195]
$$

with

$$
C^{-1} =
\begin{bmatrix}
0.180 & 0.145 & -0.062 & -0.056 & -0.059 & -0.089 & -0.086 \\
0.145 & 0.197 & -0.052 & -0.059 & -0.060 & -0.088 & -0.087 \\
-0.062 & -0.052 & 0.085 & 0.042 & 0.063 & -0.005 & 0.005 \\
-0.056 & -0.059 & 0.042 & 0.078 & 0.040 & 0.007 & -0.007 \\
-0.059 & -0.060 & 0.063 & 0.040 & 0.079 & -0.006 & 0.006 \\
-0.089 & -0.088 & -0.005 & 0.007 & -0.006 & 0.098 & 0.077 \\
-0.086 & -0.087 & 0.005 & -0.007 & 0.006 & 0.077 & 0.098
\end{bmatrix},
\quad
W =
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1
\end{bmatrix},
\quad
\hat{e} =
\begin{bmatrix} 0.480 \\ 0.025 \\ -0.506 \\ 0.145 \\ -0.145 \end{bmatrix}
$$
and update the variance parameters:

$$
\Theta^{(k+1)} = \Theta^{(k)} + H^{-1}\left.\frac{\partial L}{\partial\Theta}\right|_{\Theta^{(k)}}
= \begin{bmatrix} 0.1 & 0.1 \end{bmatrix} +
\begin{bmatrix} 0.0025 & -0.0027 \\ -0.0027 & 0.0109 \end{bmatrix}
\begin{bmatrix} 71.958 \\ 32.195 \end{bmatrix}
= \begin{bmatrix} 0.194 & 0.255 \end{bmatrix}
$$

The procedure is repeated until the log-likelihood converges. Then the inverse of the coefficient matrix provides the variance and prediction error variance (PEV) of the estimates. Some heuristics are required to deal with convergence issues, given the tendency of second-derivative methods to step outside the parameter space.
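The update step above can be reproduced numerically with a short R sketch. The matrix, the first derivatives, and the starting values are copied from the worked example; the function name `ai_update` and its step-halving rule are hypothetical, added only to illustrate one possible heuristic for keeping the variance parameters inside the parameter space, and are not the rule used by any particular package.

```r
# Values copied from the worked example in the text.
AI <- matrix(c(540.493, 135.279,
               135.279, 124.967), nrow = 2, byrow = TRUE)
score <- c(71.958, 32.195)   # first derivatives at the current values
theta <- c(0.1, 0.1)         # current variance parameters

# Hypothetical heuristic (an assumption, not a package routine):
# halve the step until every variance parameter stays positive.
ai_update <- function(theta, AI, score, max_halvings = 10) {
  step <- solve(AI, score)           # AI^{-1} %*% score without forming the inverse
  for (h in 0:max_halvings) {
    candidate <- theta + step / 2^h
    if (all(candidate > 0)) return(candidate)
  }
  theta                              # give up and keep the current values
}

theta_new <- ai_update(theta, AI, score)
print(round(solve(AI), 4))           # inverse of the AI matrix
print(round(theta_new, 3))           # close to the updated values in the text
```

In this example the full step already stays inside the parameter space, so no halving is needed; the halving branch only matters when a step would make a variance component negative.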
2 Most Common Software Used for Genomic Prediction Models
In the previous section we explored the REML methods used to estimate variance–covariance parameters and fixed and random effects, which underlie the most common software used for genomic prediction in plant and animal breeding. Table 3 summarizes the computer packages that use REML as their core algorithm, together with some relevant features. Instead of reviewing one program after another, we explain the basis of computation behind the most popular REML software used in genomic prediction models, so that their strengths and weaknesses can be understood, and we provide some additional details on their use.
2.1 Algorithm Types
Programs like ASReml [23] (license required), REMLF90 [27, 28], WOMBAT [29], Mix99 [30] (license required), HPMIXED-SAS [31] (license required), and the most recent one, Echidna [32], focus on the n > p problem, in which the number of observations is greater than the number of coefficients to be estimated, whereas GCTA [33], rrBLUP [34], MTG2 [35], and sommer [36] focus on the p > n problem, in which the number of coefficients is equal to or greater than the number of observations, or in which the mixed model equations become too dense (e.g., with genomic relationship matrices) [20]. The first scenario is common in multienvironment trial analyses in plant breeding and in single-step genomic prediction approaches [37], whereas the second is more common in genomic selection models with dense relationship matrices and in two-step genomic prediction approaches [20, 31], as well as in models with a large number of markers or spatial models requiring the estimation of many effects [38, 39].
2.2 Modeling Strengths and Weaknesses
All the software presented can be used to fit the classical genomic models such as GBLUP and rrBLUP. However, rrBLUP [34] faces a limitation when more than one random effect has to be fitted, and the same occurs when multitrait models have to be fitted. The rest of the software can deal with multiple-trait and multiple-random-effect models (e.g., additive plus nonadditive genetic effects plus experimental design factors). The licensed software (ASReml, SAS, and Mix99) in general offers more correlation and covariance structures than the open-source and open-access software. The R packages (ASReml-R, sommer, breedR, rrBLUP) have an advantage with respect to the learning curve, since they are easier to learn than software written in other languages such as pure Fortran (ASReml-standalone, REMLF90, WOMBAT, GCTA, Echidna, HPMIXED, Mix99).
Table 3 The most popular REML software used for genomic prediction models and special features to be considered for its use
Software | Language | Algorithm | Focus | Multitrait | Multiple random effects | Parallelization(a) | Availability | Released
ASReml | Fortran & R | AI | n > p (MME) | Yes | Yes | Yes | License req. | Gilmour et al. [23]
BLUPF90 | Fortran 90/95 & R(b) | AI + EM | n > p (MME) | Yes | Yes | Not yet | Open source | Misztal [27]
WOMBAT(c) | Fortran 95 | AI + EM | n > p (MME) | Yes | Yes | Yes | Open access | Meyer [29]
GCTA | C++ | AI | p > n (DI) | No | Yes | Yes | Open source | Yang [33]
rrBLUP | R | Ridge | p > n (DI) | No | No | Not yet | Open source | Endelman [34]
MTG2 | Fortran 95 | AI + EM | p > n (DI) | Yes | Yes | Not yet | Open source | Lee [35]
sommer | C++ & R | AI + EM | p > n (DI) | Yes | Yes | Yes | Open source | Covarrubias-Pazaran [36]
Echidna | Fortran | AI + EM | n > p (MME) | Yes | Yes | Not yet | Open access | Gilmour [32]
Mix99 | Fortran | AI | n > p | Yes | Yes | Yes | License req. | Lidauer [30]
HPMIXED | Fortran and SAS | AI | n > p | Yes | Yes | Not known | License req. | SAS [40]
MME mixed model equation solution, DI direct inversion solution
(a) Multiprocessing (use of two or more central processing units)
(b) Through the breedR package available in GitHub [28]
(c) Has replaced DFREML
2.3 Language and Support
The software developed for the n > p scenario relies on the mixed model equations derived by Henderson and Searle [14, 15], whereas the software for the p > n scenario relies on the natural derivation of linear models [17], that is, on the direct inversion (DI) of the phenotypic variance matrix. The developers of these programs have used the Fortran and C++ languages to ensure efficient matrix operations [41]. Some have developed an interface to the R language to make the software easier to use. Only the rrBLUP package relies on pure R for matrix operations [34]. Levels of support and maintenance differ considerably among the available REML software. To our knowledge, all the open-source and open-access software is supported only by its developer, whereas ASReml (license required) is supported by a team of statisticians from the VSNi company and by the original developers of the average information algorithm, Arthur Gilmour, Robin Thompson, and Brian Cullis [19]; HPMIXED is supported by the SAS team, and Mix99 by the LUKE institute.
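The two computational routes named above can be compared on a toy example. The sketch below, in base R, solves Henderson's mixed model equations (MME) and, separately, uses direct inversion of the phenotypic variance matrix V; both give the same estimate of the fixed effect and the same BLUPs. The data, the relationship matrix `K`, and the variance components are invented for illustration and are assumed known, so no REML step is involved.

```r
# Toy data (all values are illustrative assumptions).
n <- 6; q <- 4
X <- matrix(1, n, 1)                    # intercept only
Z <- diag(q)[c(1, 1, 2, 3, 4, 4), ]     # n x q incidence matrix
K <- diag(q) * 0.5 + 0.5                # toy relationship matrix (positive definite)
y <- c(5.1, 4.8, 6.0, 5.5, 4.2, 4.4)
s2u <- 0.4; s2e <- 0.6                  # variance components, assumed known

# (1) Mixed model equations: system of size (number of fixed + random effects)
lambda <- s2e / s2u
lhs <- rbind(cbind(crossprod(X),    crossprod(X, Z)),
             cbind(crossprod(Z, X), crossprod(Z) + solve(K) * lambda))
rhs <- rbind(crossprod(X, y), crossprod(Z, y))
sol <- solve(lhs, rhs)
b_mme <- sol[1]
u_mme <- sol[-1]

# (2) Direct inversion: system of size (number of observations)
V <- s2u * Z %*% K %*% t(Z) + s2e * diag(n)
Vinv <- solve(V)
b_di <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
u_di <- s2u * K %*% t(Z) %*% Vinv %*% (y - X %*% b_di)

print(cbind(MME = u_mme, DI = as.vector(u_di)))
```

The MME system grows with the number of effects and the V-based system with the number of records, which is one way to see why each route suits a different n versus p scenario.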
2.4 Additional REML-Based Software
Other REML-based software for quantitative genetics includes GEMMA [42] and PLINK [43], but these programs are specialized for genome-wide association studies (GWAS) rather than being generic REML mixed model solvers for genomic prediction models.
3 Bayesian Approach for Genomic Prediction
As we saw in the previous section, solving the linear system of equations used to predict selection units in plant and animal breeding programs requires answering the question of how to estimate genetic parameters, such as variance–covariance components, in order to calculate the fixed and random effects used in the prediction equations. To illustrate the differences between the REML and Bayesian approaches, let us imagine we have observed realizations (y) from a standard normal distribution (i.e., μ = 0 and σ² = 1). The likelihood is then the probability of the data, given the parameters:

$$\Pr\left(y \mid \mu, \sigma^2\right) \qquad (48)$$

This is a conditional distribution, where the conditioning is on the model parameters, which are taken as fixed and known. We are, however, in a situation where we have already observed the data, but we do not know what the parameter values for μ and σ² are. In a Bayesian analysis, we evaluate the conditional probability of the model parameters given the observed data:

$$\Pr\left(\mu, \sigma^2 \mid y\right) \qquad (49)$$

This probability is proportional to

$$\Pr\left(y \mid \mu, \sigma^2\right)\Pr\left(\mu, \sigma^2\right) \qquad (50)$$
The first term is the likelihood, and the second term represents our prior belief about the values that the model parameters μ and σ² could take. The criticism of Bayesian analysis comes from the fact that the choice of prior is rarely justified, and this choice can make a difference in the final result [44]. For some researchers, Bayesian analysis can be interpreted as the ML/REML approach modified by our prior knowledge of the parameters. Using REML versus Bayesian Markov chain Monte Carlo (MCMC) methods has its pros and cons. The obvious advantages of REML are its speed and the simplicity with which models are specified without worrying too much about the assumptions. On the other hand, analytical results for non-Gaussian GLMM are generally not available, and REML-based procedures use approximate likelihood methods that may not work well [44]. In addition, REML is limited by its assumptions and often relies on approximations. MCMC procedures are also an approximation, but the accuracy of the approximation increases the longer the analysis is run and is exact in the limit. Bayesian software is more flexible when it comes to assumptions. On the other hand, a Bayesian analysis yields a distribution of answers that may change from run to run.
3.1 The Likelihood
We can generate seven observations from the standard normal distribution using, for example, the rnorm function in R to describe the concept of likelihood:
> x <- rnorm(7)
> x
[1] -0.6200897  0.4629847  2.278899  -0.5078287  0.7848663  -0.725449  1.249929
The likelihood of these data, conditioning on μ = 0 and σ² = 1, is proportional to the product of the densities:
> prod(dnorm(x, mean = 0, sd = sqrt(1)))
[1] 0.00002019764
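In practice the product of densities is usually replaced by the sum of log densities, which is numerically more stable for larger samples. The following R sketch, using the seven observations printed above, shows that both routes give the same likelihood; the function name `loglik` and its argument order are choices made for this illustration.

```r
# The seven observations printed in the text.
x <- c(-0.6200897, 0.4629847, 2.278899, -0.5078287,
       0.7848663, -0.725449, 1.249929)

# Log-likelihood of a normal sample for given mean and variance.
loglik <- function(mu, sigma2, y) {
  sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

lik_product  <- prod(dnorm(x, mean = 0, sd = sqrt(1)))  # product of densities
lik_from_log <- exp(loglik(0, 1, x))                    # same value via the log scale

print(c(product = lik_product, from_log = lik_from_log))  # both about 2.02e-05
```

With only seven observations the product is still representable, but for hundreds of records it would underflow to zero while the log-likelihood remains finite.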
If we assume that we do not know the true parameters (mean and variance), we may want to ask how probable the data would be if, for example, the parameters were instead μ = 0 and σ² = 0.5:
> prod(dnorm(x, mean = 0, sd = sqrt(0.5)))
[1] 0.00000286969
which is a lower likelihood than the one obtained with the true parameters. We should nevertheless expect the true parameters to give the lower likelihood from time to time, just because of random sampling and limited sample size. The likelihood of the data can be calculated for a grid of possible parameter values to produce a likelihood surface (x = μ, y = σ², z = likelihood). For example, in Fig. 1 we plotted the likelihood results using a grid of values for μ and σ² for two different sample sizes. The highest likelihood falls close to the true parameters when the sample size is large.
Fig. 1 Likelihood of N = 7 (left) and N = 100 (right) observations for different combinations of the parameters μ and σ². The larger the sample size, the more closely we approximate the true parameters, with greater accuracy and precision
3.2 Maximum Likelihood
The ML estimator is the combination of μ and σ² that makes the data most likely to occur. When there is an analytical (closed-form) solution to the problem, it should be used. When such a solution is too difficult to obtain, we can get a numerical solution using optimization routines that maximize the likelihood function. In the case above, a log-likelihood function (loglik) can be written in R and maximized numerically, which gives the following estimates (MLest):
      mean        var
-0.1329895  0.9695745
3.3 Prior Distributions
The typical linear model assumed in Bayesian software has the following form:

$$y = \eta + e \qquad (51)$$

where y = {y_1, ..., y_n}, η = {η_1, ..., η_n}, and e = {e_1, ..., e_n}. The linear predictor represents the conditional expectation function, and it is structured as follows:

$$\eta = 1\mu + \sum_{j=1}^{J} X_j\beta_j + \sum_{l=1}^{L} u_l \qquad (52)$$

where μ is an intercept, X_j are design matrices for predictors, X_j = {x_ijk}, β_j are vectors of effects associated with the columns of X_j, and u_l = {u_l1, ..., u_ln} are vectors of random effects [45]. The elements are user specified. Collecting the above assumptions, we have the following conditional distribution of the data:

$$p(y \mid \theta) = \prod_{i=1}^{n} N\left(y_i \,\Big|\, \mu + \sum_{j=1}^{J}\sum_{k=1}^{K_j} x_{ijk}\beta_{jk} + \sum_{l=1}^{L} u_{li},\; \sigma_\varepsilon^2 w_i^2\right) \qquad (53)$$
where θ represents the collection of unknowns, including the intercept, regression coefficients, other random effects, and the residual variance. One of the strengths of Bayesian software is that different priors can be specified for each set of coefficients of the linear predictor, {β_1, ..., β_J, u_1, u_2, ..., u_L}, giving the user great flexibility in building models for data analysis. The prior density is assumed to admit the following factorization [45]:

$$p(\theta) = p(\mu)\,p\left(\sigma_e^2\right)\prod_{j=1}^{J} p\left(\beta_j\right)\prod_{l=1}^{L} p\left(u_l\right) \qquad (54)$$

The coefficients can be assigned either uninformative (e.g., flat) or informative priors. The coefficients that are assigned flat priors, the so-called fixed effects, are estimated based on information contained solely in the likelihood. For the coefficients assigned informative priors, the choice of the prior determines the type of shrinkage induced in the estimated effects [44, 45]. Some of the most popular priors used in Bayesian software are the following: (1) the Gaussian prior, which induces shrinkage of estimates similar to that of ridge regression (RR) [7], where all effects are shrunk to a similar extent [Bayesian software sometimes refers to this model as Bayesian ridge regression (BRR)]; (2) the scaled-t density, used in model BayesA [3]; and (3) the double exponential (DE) or Laplace density. The scaled-t and DE densities have higher mass at zero and thicker tails than the normal density, and these priors induce a type of shrinkage of estimates that is size-of-effect dependent [46]. Some software also implements two finite mixture priors: a mixture of a point of mass at zero and a Gaussian slab, a model known as BayesC in the genomic prediction literature [47], and a mixture of a point of mass at zero and a scaled-t slab, a model known as BayesB [46]. By assigning a nonnull prior probability for the marker effect to be equal to zero, the priors used in BayesB and BayesC have the potential to induce variable selection [45]. We can, for example, write a function for calculating the (log) prior probability (logprior), specify the prior (prior), add the log-prior to the log-likelihood (loglikprior), and maximize the resulting function to obtain the posterior mode (Best).
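A minimal sketch of these steps in R is given below, under stated assumptions: the data are the seven observations from Subheading 3.1, the prior on μ is a very diffuse normal, and the prior on σ² is an inverse-gamma whose log density is written out by hand (`log_dinvgamma`) to avoid extra packages. The prior values, the helper names, and the use of `optim` with the L-BFGS-B method are illustrative choices, not necessarily those behind the results in this chapter.

```r
x <- c(-0.6200897, 0.4629847, 2.278899, -0.5078287,
       0.7848663, -0.725449, 1.249929)

# Log-likelihood; par = c(mean, variance).
loglik <- function(par, y) sum(dnorm(y, par[1], sqrt(par[2]), log = TRUE))

# Log density of an inverse-gamma(shape, scale) distribution (hand-written helper).
log_dinvgamma <- function(v, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(v) - scale / v
}

# Log-prior: diffuse normal on the mean, inverse-gamma on the variance (assumed values).
prior <- list(mu_mean = 0, mu_var = 1e8, shape = 0.001, scale = 0.001)
logprior <- function(par, prior) {
  dnorm(par[1], prior$mu_mean, sqrt(prior$mu_var), log = TRUE) +
    log_dinvgamma(par[2], prior$shape, prior$scale)
}

# Unnormalized log-posterior = log-likelihood + log-prior.
logpost <- function(par, y, prior) loglik(par, y) + logprior(par, prior)

# Posterior mode by numerical maximization (fnscale = -1 turns optim into a maximizer).
fit <- optim(c(mean = 0, var = 1), fn = logpost, y = x, prior = prior,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6),
             control = list(fnscale = -1))
post_mode <- fit$par

# ML estimates for comparison (closed form for the normal model).
ml <- c(mean = mean(x), var = mean((x - mean(x))^2))
print(rbind(posterior_mode = post_mode, ML = ml))
```

With such a diffuse prior on μ the posterior mode of the mean is essentially the sample mean, whereas the mode of the variance is pulled below the ML estimate because the inverse-gamma prior adds weight near small values.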
3.5 Markov Chain Monte Carlo (MCMC)
MCMC calculates the height of the posterior distribution at a particular set of parameter values, as we did in previous examples. Rather than going through every possible combination in the grid of μ and σ² and calculating the height of the distribution, MCMC moves stochastically through the parameter space (that is why it is called "Monte Carlo") [44]. We initiate the chain by specifying a set of values for the parameters (from which the chain can start moving through the parameter space). Ideally, we would like to pick a region of high probability. Although starting configurations can be set by the user through software arguments, in general the heuristic techniques used by the software tend to work well. We denote the parameter values of the starting configuration (time t = 0) as μ_{t=0} and σ²_{t=0}. There are several ways in which we can get the chain to move in the parameter space; different software may use Gibbs sampling, slice sampling, Metropolis–Hastings, or a combination of two or more of these to make the updates [48].
3.6 Metropolis–Hastings Updates
After initiating the chain, we need to decide where to go next, and this decision is based on two rules. First, we have to generate a candidate destination, and then we need to decide whether to go there or stay where we are. Commonly, a random set of coordinates is sampled from a multivariate normal distribution that is centered on the initial coordinates μ_{t=0} and σ²_{t=0}. The question that remains is whether to move to this new set of parameter values or to stay at the current ones. If the posterior probability for the new set of parameter values is greater than that for the current values, the chain moves to the new set of parameters and has successfully completed an iteration. If the new set of parameter values has a lower posterior probability than the current values, the chain may still move there, but not all the time. The probability that the chain will move to lower-lying areas is determined by the relative difference between the old and new posterior probabilities [44, 48]. Different software may use different rules to make such decisions, linked to the number of times the probability changes (up or down).
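The two rules described above can be sketched as a random-walk Metropolis–Hastings sampler in R. The data are the seven observations from Subheading 3.1; the priors inside `logpost`, the proposal standard deviation of 0.5, the chain length, and the seed are arbitrary assumptions made for this illustration, and the candidates are drawn from independent normals (a multivariate normal with diagonal covariance) centered on the current values.

```r
set.seed(42)
x <- c(-0.6200897, 0.4629847, 2.278899, -0.5078287,
       0.7848663, -0.725449, 1.249929)

# Unnormalized log-posterior; returns -Inf outside the parameter space.
logpost <- function(mu, sigma2, y) {
  if (sigma2 <= 0) return(-Inf)
  sum(dnorm(y, mu, sqrt(sigma2), log = TRUE)) +   # log-likelihood
    dnorm(mu, 0, 10, log = TRUE) +                # assumed prior on mu
    dexp(sigma2, rate = 1, log = TRUE)            # assumed prior on sigma2
}

n_iter <- 5000
chain <- matrix(NA_real_, n_iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
current <- c(0, 1)        # starting configuration (t = 0)
accepted <- 0

for (t in seq_len(n_iter)) {
  # Rule 1: generate a candidate centered on the current values.
  candidate <- current + rnorm(2, mean = 0, sd = 0.5)
  # Rule 2: always move uphill; move downhill with probability equal
  # to the ratio of posterior heights (compared on the log scale).
  log_ratio <- logpost(candidate[1], candidate[2], x) -
               logpost(current[1], current[2], x)
  if (log(runif(1)) < log_ratio) {
    current <- candidate
    accepted <- accepted + 1
  }
  chain[t, ] <- current
}

print(accepted / n_iter)                    # acceptance rate
print(colMeans(chain[-(1:1000), ]))         # posterior means after burn-in
```

Because the proposal is symmetric, the acceptance rule only needs the ratio of posterior heights; candidates with a negative variance are never accepted because their log-posterior is minus infinity.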
3.7 Gibbs Sampling
Gibbs sampling is a special case of Metropolis–Hastings updating [48]. In the previous Metropolis–Hastings example, the Markov chain was allowed to move in both directions of the parameter space simultaneously. An equally valid approach is to set up two Metropolis–Hastings schemes, in which the chain is first allowed to move along the μ axis given that σ² is equal to some fixed value, for example,

$$\Pr\left(\mu \mid \sigma^2 = 1, y\right)$$

and is then allowed to move along the σ² axis. When the conditional distribution of μ is known, we can use Gibbs sampling. Let us say the chain at a particular iteration is located at σ² = 1; if we updated μ using a Metropolis–Hastings algorithm, we would generate a candidate value and evaluate its relative probability compared to the old value. This procedure would take place within this slice of the posterior. However, because we know the actual equation for this slice, we can just generate a new value of μ directly. Gibbs sampling can be more efficient than Metropolis–Hastings updates when high-dimensional conditional distributions are known, as is typical in linear mixed models.
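For the normal model used throughout this section, both full conditional distributions have a known form when conjugate priors are chosen, so the chain can be updated by direct sampling. The R sketch below assumes a normal prior on μ (mean `m0`, variance `v0`) and an inverse-gamma prior on σ² (shape `a`, rate `b`); these prior values, the chain length, and the seed are illustrative assumptions.

```r
set.seed(7)
x <- c(-0.6200897, 0.4629847, 2.278899, -0.5078287,
       0.7848663, -0.725449, 1.249929)
n <- length(x)

# Assumed conjugate priors: mu ~ N(m0, v0), sigma2 ~ inverse-gamma(a, b).
m0 <- 0; v0 <- 100
a <- 2;  b <- 1

n_iter <- 5000
draws <- matrix(NA_real_, n_iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
sigma2 <- 1                                  # starting value

for (t in seq_len(n_iter)) {
  # Slice along the mu axis: mu | sigma2, y is normal, so sample it directly.
  prec <- n / sigma2 + 1 / v0
  mu <- rnorm(1, mean = (sum(x) / sigma2 + m0 / v0) / prec, sd = sqrt(1 / prec))
  # Slice along the sigma2 axis: sigma2 | mu, y is inverse-gamma.
  sigma2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + sum((x - mu)^2) / 2)
  draws[t, ] <- c(mu, sigma2)
}

print(colMeans(draws[-(1:500), ]))           # posterior means after burn-in
```

Every draw is accepted, which is the practical difference from the Metropolis–Hastings update: no candidate is ever rejected because each value comes from the exact conditional distribution.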
4 The Most Common Software Used for Bayesian Genomic Prediction Models
In the previous section, we briefly explored the basis of the Bayesian methods used to estimate variance–covariance parameters and fixed and random effects, which underlie the most common software used for genomic prediction in plant and animal breeding. Table 4 summarizes the computer packages that use Bayesian methods as their core algorithm, together with some relevant features. Among them are OpenBUGS [49], BLR [50], BGLR [51], and BMTME [52]. The list also includes the INLA package, which uses the integrated nested Laplace approximation, a deterministic approach to approximate Bayesian inference. Its authors argue that, when inference has to be performed within a reasonable time frame, INLA is in most cases both faster and more accurate than MCMC alternatives. A deeper review of this methodology can be found in Rue et al. [53].
5 Notes
The introduction of machine learning methods to build genomic prediction models is very recent and is the focus of Chapter 7. Readers should keep in mind that machine learning experts have developed their own terminology, which can give the impression that these models are completely different from currently used approaches. In fact, machine learning models can be seen as extensions of the generalized linear models of statistics, with special features (e.g., backward differentiation) that can make them especially powerful under particular circumstances, as explained in the next chapter.
Table 4 The most popular Bayesian software used for genomic prediction models and special features to be considered for its use
Software | Language | Availability | Released by
WinBUGS/OpenBUGS | Component Pascal & R | Open source | OpenBUGS Foundation [49]
BLR | Fortran, C++ & R | Open source | Perez et al. [50]
MCMCglmm | C++ & R | Open source | Hadfield [42]
BGLR | Fortran, C++ & R | Open source | Perez and de los Campos [51]
REMLF90 (GIBBSnF90) | Fortran 90/95 & R(a) | Open source | Misztal [27]
BMTME | C++ & R | Open source | Montesinos-López et al. [52]
INLA | C/C++ & R | Open source | Held et al. [50]
Prior columns (marked with X per software in the original layout): Gaussian, t-scaled, Double exp, Bayes (A, B, C), Inverse-Wishart
MME mixed model equation solution
(a) Through the breedR package [28] available in GitHub
References
1. Allard RW (1999) Principles of plant breeding. Wiley
2. Xu Y, Crouch JH (2008) Marker-assisted selection in plant breeding: from publications to practice. Crop Sci 48(2):391–407
3. Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829
4. Wells MT (2009) A conversation with Shayle R. Searle. Stat Sci 24(2):244–254
5. Schaeffer LR (1991) CR Henderson: contributions to predicting genetic merit. J Dairy Sci 74(11):4052–4066
6. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans Royal Soc Edinburgh 52:399–433
7. Marquardt DW, Snee RD (1975) Ridge regression in practice. Am Stat 29(1):3–20
8. Bernardo R (1994) Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci 34(1):20–25
9. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423
10. Habier D, Fernando RL, Dekkers JC (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177(4):2389–2397
11. Oakey H, Cullis B, Thompson R, Comadran J, Halpin C, Waugh R (2016) Genomic selection in multi-environment crop trials. G3 6(5):1313–1326
12. Fisher RA (1992) Statistical methods for research workers. In: Breakthroughs in statistics. Springer, New York, NY, pp 66–70
13. Hill WG (2014) Applications of population genetics to animal breeding, from Wright, Fisher and Lush to genomic prediction. Genetics 196(1):1–16
14. Henderson CR (1948) Estimation of general, specific and maternal combining abilities in crosses among inbred lines of swine. Iowa State University, Iowa, USA
15. Searle SR, Gruber MH (1971) Linear models, vol 10. Wiley, New York
16. Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58(3):545–554
17. Adams NM (2004) Methods and models in statistics: in honour of Professor John Nelder. Imperial College Press, 260 pp
18. Thompson R (2008) Estimation of quantitative genetic parameters. Proc R Soc B Biol Sci 275(1635):679–686
19. Butler DG, Cullis BR, Gilmour AR, Gogel BJ (2009) ASReml-R reference manual. The State of Queensland, Department of Primary Industries and Fisheries, Brisbane
20. Lee SH, Van Der Werf JH (2006) An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet Sel Evol 38(1):25
21. Meyer K (1989) Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm. Genet Sel Evol 21(3):1–24
22. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–22
23. Gilmour AR, Thompson R, Cullis BR (1995) Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics (12):1440–1450
24. Johnson DL, Thompson R (1995) Restricted maximum likelihood estimation of variance components for univariate animal models using sparse matrix techniques and average information. J Dairy Sci 78(2):449–456
25. Meyer K (1997) An 'average information' restricted maximum likelihood algorithm for estimating reduced rank genetic covariance matrices or covariance functions for animal models with equal design matrices. Genet Sel Evol 29(2):1–20
26. Bernardo R (2002) Breeding for quantitative traits in plants, vol 1. Stemma Press, Woodbury, MN, p 369
27. Misztal I, Tsuruta S, Strabel T, Auvray B, Druet T, Lee DH (2002) BLUPF90 and related programs (BGF90). In: Proceedings of the 7th world congress on genetics applied to livestock production, vol 33, pp 743–744
28. Munoz F, Rodriguez LS (1920) BreedR: statistical methods for forest genetic resources analysis. Trees for the future: plant material in a changing climate. Nov 2014, Tulln, Austria. 13 p. hal-02801127 (https://hal.inrae.fr/hal-02801127/document)
29. Meyer K (2007) WOMBAT: a tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (REML). J Zhejiang Univ Sci B 8(11):815–821
30. Lidauer M, Matilainen K, Mäntysaari E, Strandén I (2011) General program for solving large mixed model equations with preconditioned conjugate gradient method. In: Technical reference guide for MiX99 solver. MTT Agrifood Research Finland
31. Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O (1996) SAS system for mixed models
32. Gilmour AR (2018) Echidna mixed model software. In: Proceedings of the world congress on genetics applied to livestock production, volume methods and tools-software, Auckland, New Zealand, pp 11–16
33. Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88(1):76–82
34. Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4(3):250–255
35. Lee SH, Van der Werf JH (2016) MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics 32(9):1420–1422
36. Covarrubias-Pazaran G (2016) Genome-assisted prediction of quantitative traits using the R package sommer. PLoS One 11(6):e0156744
37. Smith A, Cullis B, Thompson R (2001) Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics 57(4):1138–1147
38. Lee DJ, Durbán M, Eilers P (2013) Efficient two-dimensional smoothing with P-spline ANOVA mixed models and nested bases. Comput Stat Data Anal 61:22–37
39. Velazco JG, Rodríguez-Álvarez MX, Boer MP, Jordan DR, Eilers PH, Malosetti M, van Eeuwijk FA (2017) Modelling spatial trends in sorghum breeding field trials using a two-dimensional P-spline mixed model. Theor Appl Genet 130(7):1375–1392
40. SAS Institute Inc (2013) SAS/ACCESS® 9.4 Interface to ADABAS: Reference. SAS Institute Inc., Cary, NC
41. Cary JR, Shasharina SG, Cummings JC, Reynders JV, Hinker PJ (1997) Comparison of C++ and Fortran 90 for object-oriented scientific programming. Comput Phys Commun 105(1):20–36
42. Zhou X (2014) Gemma user manual. Univ Chicago, Chicago
43. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575
44. Van Dongen S (2006) Prior specification in Bayesian statistics: three cautionary tales. J Theor Biol 242(1):90–100
45. Pérez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198(2):483–495
46. Gianola D (2013) Priors in whole-genome regression: the Bayesian alphabet returns. Genetics 194(3):573–596
47. Habier D, Fernando RL, Kizilkaya K, Garrick DJ (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12(1):186
48. Sorensen D, Gianola D (2007) Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer Science & Business Media
49. Surhone LM, Tennoe MT, Henssonow SF (2010) OpenBUGS
50. Pérez P, de los Campos G, Crossa J, Gianola D (2010) Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3(2):106–116
51. Pérez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198(2):483–495
52. Montesinos-López OA, Montesinos-López A, Luna-Vázquez FJ, Toledo FH, Pérez-Rodríguez P, Lillemo M, Crossa J (2019) An R package for Bayesian analysis of multi-environment and multi-trait multi-environment data for genome-based prediction. G3 9(5):1355–1369
53. Rue H, Riebler A, Sørbye SH, Illian JB, Simpson DP, Lindgren FK (2017) Bayesian computing with INLA: a review. Ann Rev Stat Appl 4:395–421
Chapter 7
Genome-Enabled Prediction Methods Based on Machine Learning
Edgar L. Reinoso-Peláez, Daniel Gianola, and Oscar González-Recio
Abstract
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers acquire knowledge from sets of experiences and provide forecasts or classifications. Many ML studies have been carried out in genome-wide based prediction (GWP). This chapter provides a description of the main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, in order to evaluate which algorithms have the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and the data distribution must be considered in order to choose the most appropriate model for a given problem.
Key words Machine learning, GWP, Neural networks, Kernel methods, Ensemble methods, Bayesian methods, Meta-analysis, Complex traits
Abbreviations
ANN  artificial neural network
BAGG  bagging
BNN  Bayesian neural network
BL  Bayes LASSO
BOOTS  gradient boosting
BRR  Bayesian ridge regression
CNN  convolutional neural network
DL  deep learning
DVR  developmental rate model
GBLUP  genomic best linear unbiased prediction
GWP  genome-wide prediction; genome-based prediction
KAML  kinship-adjusted-multiple-loci
KBMF  kernelized Bayesian matrix factorization
KNN  k-nearest neighbors
MCMC  Markov chain Monte Carlo
ML  machine learning
MLP  multilayer perceptron
NB  Naïve Bayes
NN  neural networks
NP-BOOST  nonparametric gradient boosting
OLS-BOOST  ordinary least squares gradient boosting
OOB  out of bag
PMSE  predicted mean squared error
QTNs  quantitative trait nucleotides
RF  random forest
RKHS  reproducing kernel Hilbert spaces
RR-BLUP  ridge regression BLUP
SSVS  stochastic search variable selection
SVM  support vector machines
wBSR  weighted Bayesian shrinkage regression

Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction
1.1 Overview
One of the goals of genomic selection is to predict the breeding value and future performance of animals and plants from phenotypes and molecular markers (generally SNPs) distributed throughout the entire genome. Ideally, these markers account for at least part of the phenotypic variation of target traits, such as some diseases [1]. Interest in genome-wide prediction (GWP) of complex traits in plant and animal breeding has increased in the last 20 years. As reviewed in previous chapters of this book, GWP uses linear regression on whole-genome markers, with a phenotype of interest as the target or response variable. Despite its great success, genome-wide prediction through linear models has some limitations that can be alleviated, at least in principle, with machine learning (ML) algorithms. One of the main limitations is "the large p small n problem": in most studies, the number of markers is greater than the number of genotyped individuals [2]. Another limitation is that linear regression usually displays a reasonable goodness of fit only when phenotypes follow a continuous distribution, for example, Gaussian. When the data are close to normally distributed, the predictions of linear models most often have high accuracy. In practice, however, distributions often deviate from normality, either because of the type of data or because of environmental factors that are difficult to control. In these situations, parametric models may be less effective than alternative procedures [1], which is a possible advantage of nonparametric over parametric models. In recent years, novel regression algorithms have been developed that are flexible enough to overcome the curse of dimensionality and are apt to capture nonlinear relationships (i.e., nonparametric models). These algorithms use some form of regularization and are computationally feasible.
Machine Learning in Genomic Prediction
The exponential growth in computing performance has revolutionized programming techniques. Some machine learning procedures show surprising learning and prediction capacity. For example, the AlphaZero computer program, built on a neural network algorithm, learned to play the board game Go at a superhuman level [3]. Likewise, the language translation software DeepL (http://www.deepl.com), which uses a convolutional neural network algorithm, is capable of translating 9 languages in 72 language combinations, standing out for its high speed, accuracy, and quality. These accelerated technological developments have had a powerful impact on statistical data analysis, and the field of genetics has taken advantage of this progress, for example, by combining genome-assisted prediction and association analysis with ML.
1.2 What is Machine Learning?
Machine learning is a branch of artificial intelligence and a subfield of computer science. It aims for computers to learn, through mathematical algorithms applied to a training data set, to classify or predict a yet-to-be-observed outcome measured on individuals drawn from a population whose data distribution is similar to that of the sample used to train the algorithm. In this chapter, ML models are conceptualized as those that do not assume a parametric hypothesis. These models differ from parametric linear regression models (e.g., GBLUP, RR-BLUP, LASSO) and Bayesian linear regressions (e.g., BAYES A, BAYES B, BLASSO) in that ML models do not require an explicit or implicit distribution of the data to be characterized. LASSO, for example, makes no explicit distributional assumption and could on that ground be regarded as an ML procedure; however, its loss function implies a double exponential error distribution, so we do not treat it as an ML procedure. Both nonparametric and semiparametric models are addressed in this chapter because of their flexibility with respect to the data, which gives them interesting potential to improve or complement GWP tools. ML models can be classified according to the learning algorithm or the type of prediction technique employed. By learning algorithm, they are either supervised or unsupervised. Supervised learning uses input data (e.g., SNPs and environmental covariates) and output data (phenotypes) to train a model, for example, random forest or support vector machines [2, 4]. Unsupervised learning works only with input data and must recognize patterns that allow the structure of the data to be interpreted and classified (e.g., principal components or multidimensional scaling). By prediction technique, ML methods are classified into neural networks, ensemble methods, kernel methods, and other procedures. Figure 1 summarizes our classification of the most popular methods; their specificities are described in detail later.
Edgar L. Reinoso-Peláez et al.
Fig. 1 Classification of most popular parametric and nonparametric (machine learning) regression models
2 Neural Networks
Neural networks (NN) were inspired by the human brain [5–7]. Their principle is based on laws that govern neuroscience, and the aim is to mimic the behavior of the brain through computing structures. Complex neural network models, for example, those consisting of many layers, are the cornerstone of what is known as deep learning (DL). Geoffrey Hinton, Yann LeCun, and Yoshua Bengio were recognized as the “fathers” of deep learning when they received the 2018 Turing Award (https://awards.acm.org/about/2018-turing), and they have contributed significantly to making NN a critical component of modern artificial intelligence. NN algorithms have revolutionized the methods of data science and have been useful for solving problems in computer vision, speech recognition, natural language processing, robotics, and GWP for complex traits. In genomic selection, three main models have received attention: multilayer perceptron, convolutional NN, and Bayesian NN.
2.1 Multilayer Perceptron
The multilayer perceptron (MLP), commonly known as artificial neural network (ANN), was first employed in GWP by Okut et al. [8] to predict body mass index in mice and evaluate individual marker effects on the trait. An MLP is a network of processing units (called neurons) organized in layers. The first layer (input layer), representing the predictor variables (e.g., SNPs), is connected, typically in a nonlinear way, to one or more hidden layers, and the last hidden layer is connected to an output layer. Each neuron in the hidden layers receives the values of all the neurons in the previous layer, forms their weighted sum using a set of weights w, and transforms this sum with an activation function (e.g., piecewise linear, sigmoid, or tanh) to produce its output [9–11]. The weights are the connection strengths between neurons; they are adjusted at each iteration of the algorithm so as to reduce a loss function, which represents the error between predicted and observed values (phenotypes). Decades ago, adjusting the weights through the loss function was a great challenge for neural networks because of their high-dimensional and nonlinear form, which made the models too difficult to train. The problem was alleviated by the backpropagation algorithm, which adjusts the weights using the gradient of the loss function, that is, through derivatives [12]. In summary, during model training the output values (predictions) are compared with the observed values, and the error (loss function) is propagated in the “reverse” direction (backpropagation) so that gradient descent can update the weights at each iteration [10]. For an ANN with one hidden layer and S neurons, the algorithm can be represented as follows:
$$g_i(\mathbf{x}_{g,i}) = \beta_0 + \sum_{s=1}^{S} w_s\, f_s(\mathbf{w}_s; y, \mathbf{x}_{g,i})$$
where $i$ indicates the individual; $\mathbf{x}_{g,i}$ is a vector of genomic inputs; $\beta_0$ is the intercept or “bias”; and $f_s(\mathbf{w}_s; y, \mathbf{x}_{g,i})$ is a nonlinear transformation for neuron $s$, where $s = 1, 2, \ldots, S$. Here, $\mathbf{w}_s$ is a vector of weight connections (regression coefficients). Note that the equation above represents a nonlinear, nonparametric regression function [11]. In GWP studies with ANN, some encouraging results have been found, but without the advantage over other regression models that might be expected given the complexity of the model. Several authors have employed ANN, and results differed across studies. Some authors [13, 14] obtained improvements in prediction accuracy for complex traits in Holstein and Nellore cattle, and higher GWP accuracy (relative to classical and Bayesian models) has also been reported in plants, for maize and eucalyptus [15, 16]. However, other authors reported that ANN was outperformed by gradient boosting and support vector machine algorithms [17, 18]. Some studies in plants found that the prediction
accuracy of ANN was lower when compared with other methods. For instance, Montesinos-López et al. [19–21] reported that ANN presented lower accuracies than GBLUP in GWP of complex traits in wheat and maize. Other authors [22, 23] have stated that, although the performance of ANN models varied according to the trait, it was not superior to that of classical or other nonparametric models. Although ANN has theoretical potential, it presents some difficulties: it needs very large data sets for training, it is prone to overfitting, and it depends on numerous hyperparameters (the number of hidden layers, the type of connection between them, and the activation function, among others) whose optimal values can be challenging to find [24–26]. There is therefore ongoing research on the optimization of these hyperparameters [27, 28], for example, the application of “differential evolution,” a population-based evolutionary heuristic that could serve as an efficient search over arbitrarily complex hyperparameter spaces in deep learning models [27]. In addition, the objective function can be multimodal, and some algorithms may converge to local minima or maxima, with parameter values conferring low predictive power. An interesting member of the neural network family is the Bayesian neural network (BNN), where the model is trained with a Bayes-based probabilistic approach. Methods for regularization can be interpreted in a Bayesian framework with a clear connection to Gaussian processes. The Bayesian model assumes a joint prior density function for the network weights, and the posterior distribution of these weights is estimated from the data.
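As a rough illustration of the one-hidden-layer network of Section 2.1 and of regularization on the weights, the following sketch trains a small MLP by backpropagation and gradient descent in plain Python. It is not a Bayesian neural network: the `weight_decay` term is only an L2 penalty, which corresponds to a Gaussian prior on the weights in the sense of a posterior mode. All names (`train_mlp`, `U`, `a`, `weight_decay`), the tanh activation, the learning rate, and the toy genotypes and phenotypes are assumptions made for the example; in the notation of the text, `b0` plays the role of β0, `w[s]` of the output weight w_s, and `U[s]` of the weight vector of neuron s.

```python
import math
import random


def train_mlp(X, y, n_hidden=3, lr=0.02, weight_decay=0.01, epochs=300, seed=1):
    """Fit g(x) = b0 + sum_s w[s] * tanh(a[s] + U[s] . x) by gradient descent.

    Returns the fitted predict function and the training MSE at each epoch.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    U = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n_hidden)]
    a = [0.0] * n_hidden                                    # hidden-layer biases
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]   # output weights
    b0 = 0.0                                                # intercept ("bias")

    def hidden(x):
        return [math.tanh(a[s] + sum(U[s][k] * x[k] for k in range(p)))
                for s in range(n_hidden)]

    def predict(x):
        h = hidden(x)
        return b0 + sum(w[s] * h[s] for s in range(n_hidden))

    losses = []
    for _ in range(epochs):
        # Accumulate the gradient of the mean squared error over all records.
        gU = [[0.0] * p for _ in range(n_hidden)]
        ga = [0.0] * n_hidden
        gw = [0.0] * n_hidden
        gb0 = 0.0
        sq_err = 0.0
        for x, target in zip(X, y):
            h = hidden(x)
            err = b0 + sum(w[s] * h[s] for s in range(n_hidden)) - target
            sq_err += err * err
            gb0 += 2 * err / n
            for s in range(n_hidden):
                gw[s] += 2 * err * h[s] / n
                # Backpropagation: chain rule through the tanh activation.
                back = 2 * err * w[s] * (1 - h[s] ** 2) / n
                ga[s] += back
                for k in range(p):
                    gU[s][k] += back * x[k]
        losses.append(sq_err / n)
        # Gradient descent step; the L2 penalty adds 2 * weight_decay * weight.
        b0 -= lr * gb0
        for s in range(n_hidden):
            w[s] -= lr * (gw[s] + 2 * weight_decay * w[s])
            a[s] -= lr * ga[s]
            for k in range(p):
                U[s][k] -= lr * (gU[s][k] + 2 * weight_decay * U[s][k])
    return predict, losses


# Toy data: 40 individuals, 4 SNPs coded 0/1/2, with a dominance-like effect at SNP 1.
data_rng = random.Random(0)
X = [[data_rng.choice([0, 1, 2]) for _ in range(4)] for _ in range(40)]
y = [0.5 * x[0] + (1.0 if x[1] == 1 else 0.0) + data_rng.gauss(0, 0.1) for x in X]

predict, losses = train_mlp(X, y)
print("training MSE: first epoch %.3f, last epoch %.3f" % (losses[0], losses[-1]))
```

With real marker data, the number of hidden neurons, the learning rate, and the penalty would normally be tuned by cross-validation, which is where the hyperparameter difficulties mentioned above arise.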
During each forward pass, the weights are sampled from the distribution and an attempt is made to learn the best mean and standard deviation values for each normal distribution. This method has been successfully applied in GWP studies [11, 29–32].
2.2 Convolutional Neural Network (CNN)
This type of NN was inspired by cognitive neuroscience and by the contributions of Hubel and Wiesel [33, 34] in their studies of the visual cortex of cats, where simple and complex neurons respond to small and large patterns, respectively. In a CNN model, the input data (e.g., images) are organized into multidimensional matrices with three channels for the colors. In GWP, the input data are a one-dimensional sequence with one channel per nucleotide. GWP regressions normally have a high dimensionality, which poses a computational challenge, in addition to the fact that the number of nucleotides frequently far exceeds the number of individuals with phenotypes. However, the CNN performs two optimization processes, as the data go through one or several convolutional layers and a pooling layer [10]. The convolutional layer consists of several maps of neurons, which function as filters that detect
characteristics from the high-dimensional data and create a feature map with a lower dimension. Consider that, in the first layer, $c'$ refers to the SNP index, $d$ to the window size for sliding over the loci, and $x$ to the genotype value (0, 1, 2). The CNN consists of one (or several) convolutional layers of the form $g = C_K(f)$ acting on a $t$-dimensional input $f(x) = (f_1(x), \ldots, f_t(x))$ by applying a set of filters called kernels $K = (\kappa_{c',c})$, for $c = 1, \ldots, d$ and $c' = 1, \ldots, t$, together with a nonlinear activation $\phi$ [35]:
$$g_c(x) = \phi\left(\sum_{c'=1}^{t} (f_{c'} \star \kappa_{c',c})(x)\right)$$
This produces a $d$-dimensional output $g(x) = (g_1(x), \ldots, g_d(x))$, which is called a feature map. In this case, each convolution is over discrete univariate index variables and becomes $(f \star \kappa)(x) = f_{c'}(x)\,\kappa_{c',c}$. Thus, in a discrete univariate convolution, each row of the matrix representing the convolution is constrained to be equal to the previous row shifted by one element. The pooling layer creates a further reduced map from the previous layer, where $g = T(f)$ replaces the output with a summary statistic of the neighborhood $V(x)$ of the convolution:
$$g_c(x) = T\left(\{ f_c(x') : x' \in V(x) \}\right)$$
where $x'$ ranges over the set of neighborhood values passed to the pooling function $T$. Several pooling functions exist. The most used is max pooling, which returns the maximum output value in the convolution neighborhood. Other pooling functions return a weighted average or some norm-regularized measure [35]. This step further reduces the parameter space, shrinking the initial sequence while aiming to preserve the SNPs that are relevant for predicting the trait, which also helps to prevent overfitting. Eventually, the data are flattened into a very long vector, which is connected to the hidden layers and later to the output layer, as in ANN [10]. The use of CNN in genomic selection is recent and mostly exploratory. Previous studies reported that CNN reduced the prediction error in GWP of the number of piglets born alive per litter [35]. Other studies reported that CNN had a higher predictive ability in situations where nonadditive genetic variance was sizable; however, CNN was not the best predictor among those examined [17]. In plants, CNN-based predictions have produced conflicting results. Some studies found favorable results for yield in wheat [36], soybean [37], and maize [38].
In the wheat case, it was suggested to use CNN as a complement to rr-BLUP in a framework of an ensemble learning approach. The soybean example suggested that the deep learning framework based on CNN could bypass the imputation of missing
genotypic data and achieve more accurate predictions than a number of conventional prediction methods (rr-BLUP, BRR, BL, Bayes A). Two studies found that deep learning (ANN and CNN) had lower predictive ability than other algorithms (classical and ML): the first in GWP of yield in maize, rice, sorghum, and soybean [5], and the second in strawberry and blueberry [38]. While CNN are potentially valuable, the picture is far from conclusive. More research is needed on their applicability to GWP, especially on CNN optimization, which depends greatly on the number of convolutional layers, the kernels in each layer, the type of nonlinear activation, and the type of pooling layer (max pooling, average pooling, etc.), as well as on factors related to the population, the training sample size, and the nature of the traits and their probability distributions.
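The convolution and pooling steps of Section 2.2 can be illustrated on a single genotype sequence. The sketch below is only a toy: it uses one input channel, one hand-picked filter (in a trained CNN the filter weights are learned), a ReLU activation in the role of ϕ, and non-overlapping max pooling. As in most deep learning software, the "convolution" is computed without flipping the kernel. The function names and all numbers are assumptions for the example.

```python
def conv1d(genotypes, kernel, activation=lambda v: max(0.0, v)):
    """Slide `kernel` along the SNP sequence (stride 1, no padding) and apply the activation."""
    d = len(kernel)
    return [activation(sum(genotypes[i + k] * kernel[k] for k in range(d)))
            for i in range(len(genotypes) - d + 1)]


def max_pool(feature_map, size):
    """Non-overlapping max pooling over windows of `size` consecutive values."""
    return [max(feature_map[i:i + size]) for i in range(0, len(feature_map), size)]


snps = [0, 1, 2, 2, 0, 1, 0, 2]      # toy genotype codes for one individual
kernel = [1.0, 0.0, -1.0]            # hypothetical filter with window size d = 3

feature_map = conv1d(snps, kernel)   # 8 SNPs -> 6 values
pooled = max_pool(feature_map, 2)    # 6 values -> 3 values

print(feature_map)  # [0.0, 0.0, 2.0, 1.0, 0.0, 0.0]
print(pooled)       # [0.0, 2.0, 0.0]
```

The sequence of 8 SNP codes is reduced to 3 values, which would then be flattened and passed to the fully connected layers.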
3 Ensemble Methods
Ensemble methods use many simple models that are organized to jointly solve a complex problem. A set of algorithms, generally nonparametric, is combined to improve the prediction or classification ability of the assembled model. Two of the most used ensemble models in GWP are random forest and boosting, which are detailed below.
3.1 Random Forest
Random forest (RF) was proposed by Breiman [40]. It is based on the bagging algorithm [41], an approach that generates T pseudo-training sets by bootstrapping from the original training sample, trains the model on each set to predict the outcome, and finally averages the results over the bootstrapped sets. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. RF is a classification or regression method based on the principle of defining a large number of geometric features (an ensemble of trees) and searching over a random selection of these for the best split at each node. RF randomly selects a subset of m of the available input variables (SNPs) at each split of a branch, and the SNP that minimizes the chosen loss function is used to split the observations into the new branches. The RF can be represented as:
$$y = \mu + \sum_{t=1}^{T} c_t\, h_t(y; W_{m,t})$$
where $y$ is the expected phenotype; $\mu$ is the population mean; $T$ is the number of trees in the forest, where each tree is built through bootstrap aggregation; $c_t$ is a factor that averages the regression trees; $h_t(y; W_{m,t})$ is a specific random tree for $t \in (1, T)$, where each
tree is constructed from n samples drawn from the original data set (at random, with replacement) and differs from every other tree in the ensemble; W contains the predictor variables (SNPs and environmental factors); and m is the set of variables selected for tree t [2, 42–44]. The model uses the out-of-bag (OOB) samples to measure prediction error under the chosen criterion (loss function), for example, the misclassification rate or the L2 loss. The latter is the most commonly used in GWP: it evaluates the discrepancy between predicted and observed values within each tree through the mean squared error, with smaller values indicating better prediction. In each random sample used to build a tree, a fraction of the data (the OOB data, on average roughly one third of the records when n records are drawn with replacement) is left out of the tree. After each tree is grown, it is used to predict the OOB data, and the prediction error is calculated from the predicted and observed phenotypes in the OOB data (through the L2 loss function). Additionally, the contribution of each feature (SNPs or other effects included in the model) to predictive performance can be estimated: the values of the mth predictor variable in the OOB data are permuted, and the prediction accuracy is calculated again. The difference between these two prediction accuracies (with the original and the permuted OOB data) estimates the contribution of the feature to the classification or regression in a given tree. This process is repeated for all the features within each tree, and the difference between prediction accuracies is averaged over all trees. Finally, a value is obtained for each predictor variable (SNP) that represents its “importance” [42, 45]. Random forest has proven to be a robust and powerful algorithm in association, classification, and regression analysis. For instance, some studies in animals have demonstrated its superiority in GWP for classification and regression problems [2, 45].
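A minimal sketch of bootstrap aggregation with out-of-bag error may help to fix these ideas. It is a deliberate simplification rather than a full random forest: each "tree" is a single-split stump, the random subset of m SNPs is drawn once per stump, and OOB predictions are averaged over all trees for which a record is out of bag instead of being scored tree by tree. Permutation importance is not shown. The function names, the thresholds 0.5 and 1.5 (splits between genotype codes 0, 1, and 2), and the simulated data are assumptions for the example.

```python
import random


def fit_stump(X, y, rows, features):
    """Best single split (feature, threshold) among `features`, by squared error."""
    best = None
    for f in features:
        for thr in (0.5, 1.5):
            left = [y[i] for i in rows if X[i][f] < thr]
            right = [y[i] for i in rows if X[i][f] >= thr]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, ml, mr)
    if best is None:                       # no valid split: predict the in-bag mean
        mean = sum(y[i] for i in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    f, thr, ml, mr = stump
    return ml if x[f] < thr else mr


def bagged_stumps(X, y, n_trees=200, m=2, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    oob_sum, oob_cnt = [0.0] * n, [0] * n
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]          # bootstrap sample of size n
        stump = fit_stump(X, y, rows, rng.sample(range(p), m))
        forest.append(stump)
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:                              # record i is out of bag
                oob_sum[i] += predict_stump(stump, X[i])
                oob_cnt[i] += 1
    used = [i for i in range(n) if oob_cnt[i] > 0]
    oob_mse = sum((y[i] - oob_sum[i] / oob_cnt[i]) ** 2 for i in used) / len(used)
    oob_fraction = sum(oob_cnt) / (n * n_trees)
    return forest, oob_mse, oob_fraction


# Simulated data: 60 individuals, 5 SNPs, only SNP 0 affects the phenotype.
data_rng = random.Random(42)
X = [[data_rng.choice([0, 1, 2]) for _ in range(5)] for _ in range(60)]
y = [2.0 * x[0] + data_rng.gauss(0, 0.3) for x in X]

forest, oob_mse, oob_fraction = bagged_stumps(X, y)
print("OOB fraction per tree: %.3f, OOB mean squared error: %.3f" % (oob_fraction, oob_mse))
```

The reported OOB fraction should be close to (1 - 1/n)^n, about 0.37, which is the expected share of records left out of a bootstrap sample of size n.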
However, other GWP studies performed in plants were not conclusive in favor of RF, often achieving accuracies similar to other methods [46–50].
3.2 Boosting
Gradient boosting (BOOST) is an ensemble method that can be interpreted as a generalization of the AdaBoost algorithm described by Freund and Schapire [51] and adapted to regression by Friedman [52]. BOOST applies multiple “weak” learners (predictors) sequentially while making use of shrinkage [53]. Each learner is trained on the residuals of the previous one, and the residuals are then updated for the next learner. Formally, the response is represented as follows:
$$\mathbf{y} = \mathbf{1}\mu + \sum_{z=1}^{Z} v\, h_z(\mathbf{y}; W)$$
where $\mathbf{y}$ is the vector of phenotypes; $\mathbf{1}$ is a vector of ones; $\mu$ is the mean; $h_z(\mathbf{y}; W)$ is the predictor; $v$ is a parameter; and $z \in (1, Z)$, where $Z$ is the number of weak learners (predictors). In addition, the number of partitions in each predictor, known as depth, can be tuned to control the complexity of the boosting model [42, 45]. An adaptation [54] of the boosting algorithm for regression problems in GWP was subsequently proposed as follows:
$$\mathbf{y} = \mu + g(\mathbf{x}_p) + \mathbf{e}$$
where $\mathbf{x}_p$ is the vector of SNP codes of the $n$ individuals at marker locus $p$; $g(\mathbf{x}_p)$ corresponds to an unknown expectation function $E(\mathbf{y} \mid \mathbf{x}_p)$; and $\mathbf{e}$ is a residual vector with independently distributed elements. Linear or nonlinear learners can be applied, for example, ordinary least squares (OLS-BOOST) or nonparametric (NP-BOOST) regressions, respectively [54]. The process involves a few steps. First, the function is initialized as $\hat{g}_0(x) = \bar{y}$, with residuals $\hat{e} = y - \hat{g}_0(x)$. Then, for $z$ in $\{1, \ldots, Z\}$, a learner is applied to one SNP at a time and the loss function
$$\sum_{i=1}^{n} L\left(\hat{e}_i,\; g_{z-1}(x_{p_z}) + h_z(y_i; x_{p_z})\right)$$
is calculated, where $x_{p_z}$ is the SNP that minimizes the loss function and $h_z(y_i; x_{p_z})$ is the prediction of observation $i$ from learner $h(\cdot)$ applied to SNP $p_z$. The function is then updated at iteration $z$ as
$$\hat{g}_z(x) = \hat{g}_{z-1}(x) + v\, h_z(y_i; x_{p_z})$$
with $v \in (0, 1)$ being a shrinkage factor. Residuals are updated as $\hat{e} = y - \hat{\mu} - \hat{g}_z(x)$, $z$ is increased by one, and the procedure is repeated Z times, until results stabilize [2, 54, 55]. Boosting has been used in both animals and plants and has shown a strong predictive ability. Several GWP studies [2, 17, 42, 54, 56] conducted in animals found a better predictive performance of boosting relative to other algorithms. In plants, some GWP studies of complex traits have also reported encouraging results for boosting [46, 57].
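The steps of the boosting procedure for GWP can be sketched with an ordinary least squares learner applied to one SNP at a time, in the spirit of OLS-BOOST. This is an illustrative reading of the algorithm, not the implementation of the cited software: the function name `ols_boost`, the fixed number of iterations, the shrinkage value v = 0.1, and the simulated data are assumptions.

```python
import random


def ols_boost(X, y, n_iter=50, v=0.1):
    """Componentwise L2 boosting: at each step, regress residuals on the single best SNP."""
    n, p = len(X), len(X[0])
    mu = sum(y) / n
    fitted = [mu] * n                      # initial function: the phenotypic mean
    steps = []                             # (SNP index, intercept, slope) per iteration
    for _ in range(n_iter):
        resid = [y[i] - fitted[i] for i in range(n)]
        rbar = sum(resid) / n
        best = None
        for j in range(p):
            xj = [X[i][j] for i in range(n)]
            xbar = sum(xj) / n
            sxx = sum((x - xbar) ** 2 for x in xj)
            if sxx == 0:                   # monomorphic SNP carries no information
                continue
            slope = sum((xj[i] - xbar) * (resid[i] - rbar) for i in range(n)) / sxx
            intercept = rbar - slope * xbar
            sse = sum((resid[i] - intercept - slope * xj[i]) ** 2 for i in range(n))
            if best is None or sse < best[0]:
                best = (sse, j, intercept, slope)
        if best is None:
            break
        _, j, intercept, slope = best
        for i in range(n):                 # shrunken update of the fitted function
            fitted[i] += v * (intercept + slope * X[i][j])
        steps.append((j, intercept, slope))
    return mu, steps, fitted


# Simulated data: 30 individuals, 4 SNPs, only SNP 2 affects the phenotype.
data_rng = random.Random(7)
X = [[data_rng.choice([0, 1, 2]) for _ in range(4)] for _ in range(30)]
y = [1.5 * x[2] + data_rng.gauss(0, 0.2) for x in X]

mu, steps, fitted = ols_boost(X, y)
mse_mean = sum((yi - mu) ** 2 for yi in y) / len(y)
mse_fit = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)
print("first SNP selected:", steps[0][0])
print("training MSE: mean only %.3f, after boosting %.3f" % (mse_mean, mse_fit))
```

In practice, the number of iterations Z and the shrinkage factor v would be chosen by monitoring prediction error in a validation set rather than the training error printed here.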
4 Kernel Methods
Kernel methods yield a solution that is linear in the feature space while being nonlinear in the input space. The first article proposing the prediction of targets in the presence of noise using reproducing kernel Hilbert spaces (RKHS) was by Parzen [58]. Years later, as research in the field of statistical learning advanced, a new algorithm called support vector machines (SVM) was developed [59]. Over the decades, these methods have had a high impact on classification and regression problems. Gianola et al. [60] and Gianola and Van Kaam [61] were the first to suggest RKHS for GWP. The two kernel-based methods most used in genomic selection are explained below.
4.1 Reproducing Kernel Hilbert Spaces
RKHS for GWP resembles the genomic best linear unbiased prediction (GBLUP) model. RKHS uses a kernel matrix, which is a more general structure of covariance between individuals than the genomic relationship matrix (G matrix) used to measure molecular similarities between individuals [62]. It was shown that both BLUP and GBLUP are special cases of RKHS regression [63]. Here, a kernel function K defines a structure of covariance between individuals [2, 63–65]. RKHS uses a kernel matrix to represent Euclidean distances between focal points in a Hilbert space. In this way, the most similar genotypes, or sequence pairs of individuals, have shorter distances than other pairs of individuals. This matrix can be modified according to objectives; for example, a kernel diffusion method [66] suitable for discrete input values (SNPs) has been used. Likewise, in another study [67], kernels were constructed to measure similarities between SNP methylation profiles in Arabidopsis. If $g(\mathbf{x}_i)$ is some nonparametric function (e.g., a function of a large number of markers) and $\mathbf{x}_i' = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ is the vector of $p$ markers observed in individual $i$, a semiparametric model can be formed as follows [2, 61]:
$$\mathbf{y} = W\boldsymbol{\theta} + \begin{bmatrix} g_1(\mathbf{x}_1) \\ g_2(\mathbf{x}_2) \\ \vdots \\ g_n(\mathbf{x}_n) \end{bmatrix} + \mathbf{e}$$
where $\mathbf{y}$ is the vector of phenotypes, $W$ is the incidence matrix of the parametric effects $\boldsymbol{\theta}$ on $\mathbf{y}$, and $\mathbf{e}$ is the residual vector. The penalized residual sum of squares is [2]:
$$J[g(\mathbf{x}_{.}) \mid \lambda] = \frac{1}{2}\,[\mathbf{y} - W\boldsymbol{\theta} - g(\mathbf{x}_{.})]'\, R^{-1}\, [\mathbf{y} - W\boldsymbol{\theta} - g(\mathbf{x}_{.})] + \frac{\lambda}{2}\, \|g(\mathbf{x}_{.})\|_{H}^{2}$$
where $R$ is the residual covariance matrix; $g(\mathbf{x}_{.})$ is the vector of function values evaluated at the genotypes (SNPs); and $\lambda$ is a regularization parameter, which together with $\|g(\mathbf{x}_{.})\|_{H}^{2}$ (the squared norm under a Hilbert space) acts as a penalty that controls the trade-off between goodness of fit and model complexity. It was shown [68] that the function can be represented as follows:
$$g(\mathbf{x}) = \alpha_0 + \sum_{i=1}^{n} \alpha_i K_h(\mathbf{x} - \mathbf{x}_i) = \alpha_0 + \mathbf{k}_h' \boldsymbol{\alpha},$$
where $\boldsymbol{\alpha} = [\alpha_0, \alpha_1, \ldots, \alpha_n]'$ is a vector of unknown regression coefficients and $\mathbf{k}_h'$ is a row of the kernel matrix $K$; $K_h(\mathbf{x} - \mathbf{x}_i)$ is a reproducing kernel that acts as a basis function, and it is regulated by the smoothing parameter $h$. Both $h$ and $\lambda$ are important to avoid
overfitting, and their values can be chosen via cross-validation or Bayesian inference. The choice of kernel is essential to obtain a model with good predictive ability [2]. Several studies have used RKHS-based GWP [2, 4, 64, 69–75] in plants and animals, often reporting good flexibility and top predictive performance.
4.2 Support Vector Machines
Support vector machines (SVM) can be interpreted as belonging to the class of RKHS methods [76]. SVM works especially well in classification problems but has also been used for regression. It maps the input space to a higher-dimensional feature space using kernel functions, so it is based on linear equations while embedding nonlinear regression models. The objective function to minimize in SVM is [2]:
$$J[g(\mathbf{x}) \mid \lambda] = \frac{1}{2}\, V(\mathbf{y}, g(\mathbf{x})) + \frac{\lambda}{2}\, \|g(\mathbf{x})\|_{H}^{2}$$
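where $V(\mathbf{y}, g(\mathbf{x}))$ is the epsilon-insensitive loss function (RKHS uses the quadratic error loss function instead), with the form
$$V(\mathbf{y}, g(\mathbf{x})) = \begin{cases} 0 & \text{if } |y_i - g(\mathbf{x}_i)| < \varepsilon \\ |y_i - g(\mathbf{x}_i)| - \varepsilon & \text{otherwise} \end{cases}$$
The nonparametric function can be expressed as a linear function:
$$g(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i^{*} - \alpha_i)\, K(\mathbf{x} - \mathbf{x}_i),$$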
where $K(\cdot)$ is a kernel of choice, and $\alpha_i^{*}$ and $\alpha_i$ are solutions to a nonlinear system of equations. Only a small fraction of these coefficients differ from zero (i.e., the solution is sparse), because the epsilon-insensitive loss function ignores residuals smaller than ε, which also helps to prevent overfitting. The data points associated with a nonzero coefficient are known as support vectors. As in RKHS, the balance between the loss function and the complexity of the fitted function is controlled by λ. These values can also be tuned by cross-validation, likelihood-based, or Bayesian methods [2]. Many studies have applied the SVM algorithm in animals [2, 18, 56, 77–80] and plants [4, 20, 46, 48, 69, 73, 74, 81–83], and several of them concluded that SVM is one of the most accurate algorithms for GWP.
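The two kernel methods of this section differ mainly in their loss function, which a small numerical sketch can make concrete. The code below builds a Gaussian kernel matrix, obtains the RKHS coefficients under quadratic loss with R taken as the identity matrix, so that α = (K + λI)⁻¹(y − mean), and evaluates the epsilon-insensitive loss used by SVM. Several simplifications are assumptions of the example: the Gaussian kernel with bandwidth h = 0.5, the use of the phenotypic mean in place of α0 and of the parametric effects Wθ, and the toy data. Fitting an actual SVM requires a quadratic programming solver and is not attempted here.

```python
import math


def gaussian_kernel(xi, xj, h=0.5):
    """K_h(x_i, x_j) = exp(-h * squared Euclidean distance)."""
    return math.exp(-h * sum((a - b) ** 2 for a, b in zip(xi, xj)))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def rkhs_fit(X, y, lam=0.1, h=0.5):
    """Quadratic-loss RKHS regression: alpha = (K + lam * I)^-1 (y - mean)."""
    n = len(X)
    mean = sum(y) / n
    K_pen = [[gaussian_kernel(X[i], X[j], h) + (lam if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    alpha = solve(K_pen, [v - mean for v in y])
    return mean, alpha


def rkhs_predict(x, X, mean, alpha, h=0.5):
    return mean + sum(a * gaussian_kernel(x, xi, h) for a, xi in zip(alpha, X))


def eps_insensitive(y_i, g_i, eps=0.1):
    """SVM loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_i - g_i) - eps)


# Toy data: five individuals, two SNPs coded 0/1/2.
X = [[0, 0], [0, 1], [1, 2], [2, 2], [2, 0]]
y = [1.0, 1.5, 3.0, 4.0, 2.5]

mean, alpha = rkhs_fit(X, y)
fitted = [rkhs_predict(x, X, mean, alpha) for x in X]
losses = [eps_insensitive(yi, gi) for yi, gi in zip(y, fitted)]
print("fitted values:", [round(v, 3) for v in fitted])
print("epsilon-insensitive losses:", [round(v, 3) for v in losses])
```

Larger values of `lam` shrink the fitted values toward the mean, and larger values of `h` make the kernel more local; both would normally be tuned by cross-validation, as noted in the text.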
5 Other Machine Learning Models
Although less frequently, other ML algorithms have also been used for GWP. For instance, k-nearest neighbors (KNN) is a nonparametric regression based on the distances to the k closest individuals [84]. The general formula is:
$$\hat{y}_j = \frac{1}{K} \sum_{k=1}^{K} y_k$$
where $\hat{y}_j$ is the predicted phenotype for individual $j$, $K$ is the number of nearest neighbors (smallest Euclidean distances), and $y_k$ is the phenotype of the $k$th neighbor selected from the training set. Given the size of the neighborhood, membership is determined from the Euclidean distances between individuals based on their SNPs:
$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{p=1}^{P} (x_{pi} - x_{pj})^2}$$
where $d(\mathbf{x}_i, \mathbf{x}_j)$ is the Euclidean distance between individuals $i$ and $j$ based on SNPs; $P$ is the number of SNPs; and $x_{pi}$ and $x_{pj}$ are the codes for SNP $p$ in individuals $i$ and $j$, respectively. The phenotypes of the $K$ individuals with the smallest distances are selected as neighbors. To define $K$, a grid search over different values of $K$ can be used, selecting the value with the lowest predicted mean squared error (PMSE) between observed and predicted values. KNN-based GWP in pigs [85], wheat, and ryegrass [82] led to prediction accuracies lower than those of other classical and machine learning models. Other, less cited GWP machine learning algorithms include the phenotype prediction machine “WhoGEM” [71] for disease resistance and functional traits in plants (Medicago truncatula). WhoGEM was found to be at least as good as other models. Another such algorithm is Kinship adjusted multi-loci (KAML) [86], a flexible model that generalizes the linear mixed model by incorporating pseudo quantitative trait nucleotides (QTNs). This algorithm performed better than linear models. Likewise, a model called daily developmental rate (DVR) was developed for predicting the biomass of rice using other traits as predictors [72]. This model works as a function of environmental factors; it was not better than LASSO regression. Naïve Bayes has also been used in machine learning as a probabilistic classification algorithm. Its predictive capacity is high, although it differs according to conditions and environment. A Naïve Bayes classification model was successfully employed to identify SNPs associated with mortality in chickens [87, 88]. A comparison of Naïve Bayes with random forest for predicting individual survival to second lactation in Holstein cattle concluded that Naïve Bayes performed better [89]. On the other hand, when applied to breed prediction in horses, although Naïve Bayes had a good predictive ability, it was inferior to that of other Bayesian methods [31].
Other methods have been studied, but to a lesser extent. For example, the kernelized Bayesian matrix factorization (KBMF) algorithm provided more accurate genomic predictions than GE-BLUP thanks to enhanced capture of the G × E interactions of barley, in a study that modeled G × E with historical weather information [90]. Likewise, instance-based ML algorithms such as IB1, IB5, and KSTAR were used for classification purposes but performed worse than other machine learning models or Bayesian prediction procedures [91].
In addition to the ML methods mentioned above for GWP, a directed learning strategy supervised by omics data has been developed, called multi-layered least absolute shrinkage and selection operator (MLLASSO). This method integrates genetic, transcriptomic, and metabolomic data within a model that learns interactively. MLLASSO has shown an improvement in predictive capacity over conventional methods [92].
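The k-nearest neighbors predictor described at the start of this section is simple enough to write out in full. The sketch below predicts a phenotype as the mean of the K closest training individuals in Euclidean distance over SNP codes, and picks K by a small grid search. Using leave-one-out error as the PMSE criterion is an assumption of the example (the text does not specify the validation scheme), as are the function names and the toy data.

```python
import math


def euclidean(xi, xj):
    """Euclidean distance between two individuals based on their SNP codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


def knn_predict(x_new, X_train, y_train, K):
    """Mean phenotype of the K training individuals closest to x_new."""
    order = sorted(range(len(X_train)), key=lambda i: euclidean(x_new, X_train[i]))
    return sum(y_train[i] for i in order[:K]) / K


def loo_pmse(X, y, K):
    """Leave-one-out predicted mean squared error for a given K."""
    errors = []
    for i in range(len(X)):
        X_rest, y_rest = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        errors.append((y[i] - knn_predict(X[i], X_rest, y_rest, K)) ** 2)
    return sum(errors) / len(errors)


# Toy training set: five individuals, three SNPs coded 0/1/2.
X_train = [[0, 0, 1], [0, 1, 1], [2, 2, 0], [2, 1, 0], [1, 1, 1]]
y_train = [1.0, 1.2, 3.1, 2.9, 2.0]

pred = knn_predict([0, 0, 2], X_train, y_train, K=2)   # neighbors are individuals 0 and 1
best_K = min(range(1, 5), key=lambda K: loo_pmse(X_train, y_train, K))
print("prediction with K = 2: %.2f; K chosen by grid search: %d" % (pred, best_K))
```

With thousands of SNPs, distances between individuals tend to become similar to one another, which may partly explain the modest accuracies reported for KNN in the studies cited above.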
6 The Computer Science Behind Machine Learning in GWP
Several software packages have been developed for GWP in different programming languages. These languages provide a wide range of options and are chosen according to the interests of the research group. Among the most popular are:
1. R (https://www.r-project.org/), a programming language distributed under the GNU General Public License (https://www.gnu.org/licenses/gpl-3.0.html) that focuses on statistical analysis and is probably the one with the most regression model packages. It has the advantage of a very large community and constantly updated software and packages.
2. Python (https://www.python.org), a freely distributed multiparadigm programming language (it supports object-oriented, imperative, and functional programming). Several machine learning packages have been developed in this environment, its best-known field being deep learning with the Keras (https://keras.io) and TensorFlow (https://www.tensorflow.org) platforms. The difference between R and Python is that R is specifically oriented to statistics and its language is more “friendly,” while Python is a multipurpose programming language, so it requires more programming knowledge and has fewer library options than R. Its advantage is that it can be computationally faster than R.
3. Java (https://www.java.com), a computing platform from Sun Microsystems and an object-oriented programming language; it can generate autonomous code fragments capable of interacting with other objects. Compared with the two previous languages, Java requires more programming knowledge, and GWP-oriented code is limited. Its advantage is that Java is computationally very fast for large-scale systems (accessed by multiple simultaneous users).
4. Fortran (https://www.fortran90.org), a programming language adapted to numerical and scientific computing, which makes it fast and efficient.
This language has been used to develop the main software for genetic selection
(e.g., BLUP), with later adaptations for genomic selection (e.g., GBLUP). However, packages written in Fortran for GWP are scarce.
In general, R offers a great variety of statistical packages. Python has fewer packages, but its programming capability is broader and it can be computationally more powerful than R. Java and Fortran are recommended for researchers with deep programming knowledge, as their packages are generally highly specialized and perform more specific computations, which allows them to be faster when analyzing massive data. Table 1 presents some of the most popular packages for GWP with ML, classical, and Bayesian models.
Table 1 Popular software for machine learning and conventional methods
Method | Package
BNN | BLNN R package [100]; PyMC3 Python package [101]
ANN | MXNet R package [102]; Keras Python package (https://keras.io/)
CNN | DeepGS R package [36]; Keras Python package (https://keras.io/)
BAGG | ipred R package [103]; scikit-learn Python package [104]; Weka Java package [105]
RF | Random forest R package [106]; scikit-learn Python package [104]; RanFog Java package [107]
BOOST | gbm R package [108]; scikit-learn Python package [104]; Weka Java package [105]; XGBoost package [109]; RanBoost Fortran package (https://github.com/ogrecio/RanBoost)
RKHS | rrblup R package [110]; BGLR R package [111]
SVM | kernlab R package [112]; scikit-learn Python package [104]
NB | Naïve Bayes R package [113]; Weka Java package [105]
KNN | FNN R package [114]; neighbr R package [115]
BAYES A | BGLR R package [111]
BAYES B | BGLR R package [111]
BAYES C | BGLR R package [111]
BL | BGLR R package [111]; BLasso Fortran package (https://github.com/ogrecio/BLasso)
RR | BLR R package [116]; rrblup R package [110]
LASSO | glmnet R package [117]
GBLUP | BGLR R package [111]; BLUP F90 Fortran package [118]
BNN Bayesian neural network; ANN Artificial neural network; CNN convolutional neural network; BAGG Bagging; RF Random forest; BOOST Boosting; RKHS reproducing kernel Hilbert space; SVM Support vector machine; NB Naïve Bayes; KNN k-nearest neighbor; BL Bayes LASSO; RR Ridge regression; r correlation
7 Predictive Ability of Machine Learning Methods: A Meta-Comparison
Many studies have compared various ML models with standard linear models. Results vary across studies, perhaps because of data specificities, environmental effects, species, and other factors. However, some algorithms have consistently shown greater accuracy. This section presents a comparative review of the predictive ability of ML and classical models. For this purpose, rankings from 205 comparisons in 34 studies published in the last decade in plants and animals were used to obtain guidance on the choice of machine learning algorithms for GWP.
7.1 Literature Review
To obtain a general orientation of the predictive ability of the models, we performed a comparative meta-analysis between machine learning and classical models through a systematic review of scientific articles on GWP in agriculture in the last decade. For the comparative analysis, a Thurstonian model (detailed below) was employed. We used a total of 34 papers published between 2010 and 2020 reporting comparative studies of GWP models in which at least one ML procedure was considered (Table 2). The studies evaluated predictive ability by cross-validation, and the correlation (r) or the predictive mean squared error (PMSE) between actual and predicted values was used to evaluate predictive accuracy. All papers compared prediction models for at least one complex trait using $n_i$ models (where $i$ is the study index), and several of these studies analyzed $m$ complex traits. In those cases, we made $m$ comparisons (one per trait) between the $n_i$ statistical models. For example, one study [22] compared nine regression models to predict height in six plant species: rice, spruce, soybean, sorghum, maize, and switchgrass. Each comparison yielded cross-validation results. A ranking was calculated within each comparison considering correlation, percentage of items classified correctly, and PMSE. For r and classification percentage, 205 comparisons and 19 models from 34 studies were selected, and only models appearing in at least two studies were considered. The highest and lowest scores were assigned to the first- and last-ranked models, respectively. For PMSE, 39 comparisons and 13 models from 10 studies were selected, and models with lower PMSE were considered better. These data sets were analyzed with a Thurstonian model [93, 94].
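The ranking step described above can be illustrated with a minimal sketch. It assumes that each comparison provides observed and cross-validated predicted values per model; the function names (`pearson_r`, `pmse`, `rank_models`), the model labels, and the numbers are hypothetical and are not taken from the reviewed studies. Rank 1 is used here for the best model, which is only one possible scoring convention.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def pmse(observed, predicted):
    """Mean squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rank_models(metric_by_model, higher_is_better):
    """Return {model: rank} within one comparison, with rank 1 for the best model."""
    ordered = sorted(metric_by_model, key=metric_by_model.get, reverse=higher_is_better)
    return {model: position for position, model in enumerate(ordered, start=1)}


# Hypothetical comparison: one trait, three models (illustrative values only).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = {
    "GBLUP": [1.1, 1.9, 3.2, 3.8, 5.1],
    "RF": [2.0, 2.5, 3.0, 3.5, 4.0],
    "KNN": [3.0, 1.0, 4.0, 2.0, 5.0],
}

r_by_model = {m: pearson_r(observed, p) for m, p in predictions.items()}
pmse_by_model = {m: pmse(observed, p) for m, p in predictions.items()}

rank_by_r = rank_models(r_by_model, higher_is_better=True)
rank_by_pmse = rank_models(pmse_by_model, higher_is_better=False)

print(rank_by_r)
print(rank_by_pmse)
```

In this toy comparison the hypothetical "RF" predictions are perfectly correlated with the observations but shrunken toward the mean, so the model ranks first by r and second by PMSE. This shows why a ranking based on r and one based on PMSE need not agree.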
Table 2 Comparative studies of machine learning algorithms (nonparametric and semiparametric) and classical linear regression models (parametric) in GWP studies with animals and plants. X represents the model used in the research, and ✓ indicates the metric used to evaluate the models through cross-validation results

Columns: Authors; TYPE; BNN; ANN; CNN; BAGG; RF; BOOST; RKHS; SVM; NB; KNN; BAYES A; BAYES B; BAYES C; BL; wBSR; SSVS; RR; LASSO; GBLUP; r; PMSE

Authors                           TYPE
Abdollahi-Arpanahi et al. [17]    ANIMALS
Abdollahi-Arpanahi et al. [119]   ANIMALS
González-Recio [42]               ANIMALS
González-Recio et al. [54]        ANIMALS
Jiménez-Montero et al. [120]      ANIMALS
Kim et al. [56]                   ANIMALS
Long et al. [79]                  ANIMALS
Putnová and Štohl [31]            ANIMALS
Shahinfar et al. [18]             ANIMALS
Silveira et al. [45]              ANIMALS
Van Bergen et al. [32]            ANIMALS
Van der Heide et al. [89]         ANIMALS
Waldmann [30]                     ANIMALS
Waldmann et al. [35]              ANIMALS
Azodi et al. [22]                 PLANTS
Blondel et al. [69]               PLANTS
Budhlakoti et al. [81]            PLANTS
Cuevas et al. [70]                PLANTS
Gentzbittel et al. [71]           PLANTS
Grinberg et al. [82]              PLANTS
Heslot et al. [47]                PLANTS
Grinberg et al. [57]              PLANTS
Kwong et al. [46]                 PLANTS
Liu et al. [37]                   PLANTS
Medina et al. [48]                PLANTS
Montesinos-López et al. [19]      PLANTS
Montesinos-López et al. [20]      PLANTS
Montesinos-López et al. [21]      PLANTS
Long et al. [79]                  PLANTS
Ma et al. [36]                    PLANTS
Sarkar et al. [50]                PLANTS
Schulz-Streeck et al. [121]       PLANTS
Toda et al. [72]                  PLANTS
Wang et al. [73]                  PLANTS
Wang et al. [49]                  PLANTS
Xavier et al. [122]               PLANTS
Xu et al. [74]                    PLANTS
Zingaretti et al. [39]            PLANTS

[The per-study X and ✓ entries and the "Total comparisons" and "Total number of studies" rows are not recoverable from the extracted table layout.]

BNN Bayesian neural network; ANN Artificial neural network; CNN Convolutional neural network; BAGG Bagging; RF Random forest; BOOST Boosting; RKHS Reproducing kernel Hilbert space; SVM Support vector machine; NB Naïve Bayes; KNN k-nearest neighbor; BL Bayes LASSO; wBSR Weighted Bayesian shrinkage regression; SSVS Stochastic search variable selection; RR Ridge regression; r correlation; Total comparisons refers to the total number of comparisons that each model had

7.2 Meta-Analysis
The Thurstonian model is a rank analysis [93, 95] that has been used especially for comparative analysis in competitions [96–99]. This is the first analysis using the Thurstonian method to compare prediction models in GWP, and this procedure was selected because of its similarity to rank competitions. This model assumes an underlying variable that is transformed into the ranking of the prediction model within each comparison. This allows the
comparison effect to be fixed, which makes it possible to correct the results according to the quality of each comparison. Our Thurstonian model assumes a simple univariate model with a liability ($l$) underlying the observed ranking in the comparisons. The liabilities of the models compared in event $m$ can be packed in an $n_m \times 1$ vector $\mathbf{l}_m$ with conditional distribution given by

$$\mathbf{l}_m = \mathbf{X}_{b_m}\mathbf{b} + \mathbf{Z}_{p_m}\mathbf{p} + \mathbf{e}_m,$$

where $\mathbf{l}_m$ corresponds to the liability; $\mathbf{b}$ was the effect of the type of study (plant or animal); $\mathbf{p}$ corresponds to the effect of the model employed, assumed to be distributed as $\mathbf{p} \sim N(0, \sigma_p^2)$, with $\sigma_p^2$ being the variance associated with the method effect; $\mathbf{e}_m \sim N(0, \sigma_e^2)$ was the residual, with residual variance $\sigma_e^2 = 1$; and $\mathbf{X}_{b_m}$ and $\mathbf{Z}_{p_m}$ were the corresponding incidence matrices [93]. The probability distribution of $l$ was Gaussian. Prior to the comparison, the liability was conditioned on $\mathbf{b}$ and $\mathbf{p}$ as

$$p(\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_M \mid \mathbf{b}, \mathbf{p}, \boldsymbol{\delta}) = \prod_{m=1}^{M} p(\mathbf{l}_m \mid \boldsymbol{\theta}, \boldsymbol{\delta}) = \prod_{m=1}^{M} \prod_{i=1}^{I} \left[ N(l_{im} \mid \mu_{im}, 1) \right]^{\delta_{im}},$$

where $\boldsymbol{\theta} = [\mathbf{b}', \mathbf{p}']'$; $\boldsymbol{\delta} = \{\delta_{im}\}$ was a set of indicator variables, with $\delta_{im} = 1$ if model $i$ is present in comparison $m$ and $\delta_{im} = 0$ otherwise; $N(l_{im} \mid \mu_{im}, 1)$ indicates the density of the normally distributed liability $l_{im}$; and $\mu_{im}$ was its mean, with variance 1 [93]. The software used was GIBBSTHUR [94], which uses a Bayesian MCMC approach with Gibbs sampling.
Prediction results differ between methods according to the study conditions. However, a general picture can be drawn from their rankings in each comparison. The most widely used models were GBLUP, RR, RF, SVM, and BL, with 132, 124, 118, 105, and 97 comparisons, respectively (Table 2), while BNN, NB, SSVS, BAGG, and KNN were used less frequently, with 4, 6, 10, 11, and 12 comparisons, respectively (Table 2).
Results of the ranking analysis based on the Thurstonian model are shown in Table 3 and Figs. 2 and 3. For r-based results, the model with the best performance effect in the Thurstonian analysis was RKHS, followed by NB and BNN; however, the results for the last two models were not reliable because of their large SD. These models were followed by Bayes B, Bayes A, wBSR, Bayes C, and BAGG, which had a relatively high rank for predictive accuracy. ANN, BL, GBLUP, RF, and SVM had intermediate scores (model effects). Finally, CNN, SSVS, LASSO, RR, BOOST, and KNN had the lowest model effects, of which SSVS and KNN had large SD (Fig. 2). The estimated performance of the methods presented large SD and must be viewed with caution.
For PMSE-based results, most models had large SD relative to the r-based results, as expected. BOOST showed the highest score
despite its unfavourable rank for predicting r (second to last), perhaps because of the small number of comparisons and studies carried out for PMSE (8 comparisons from 3 studies in animals); BNN had the second-highest score. In animals, BNN was the best predictor. For the BL, SVM, RF, CNN, Bayes B, GBLUP, LASSO, and Bayes C models, the large SD (Fig. 3) made it impossible to determine which of them gave the best predictions. Bayes A, RR, and ANN were generally the models with the lowest scores, but these estimates could also be subject to error.

Table 3 Performance by method estimated with the Thurstonian model. Values are effect estimates from the Thurstonian model for correlation (r) and mean squared error (PMSE), with their standard deviation (SD) in parentheses. Highest scores in bold face

Columns: Model; r (All, Animals, Plants); PMSE (All, Animals, Plants)
Rows: BNN; ANN; CNN; BAGG; RF; BOOST; RKHS; SVM; NB; KNN; Bayes A; Bayes B; Bayes C; BL; wBSR; SSVS; RR; LASSO; GBLUP

[Cell values are not recoverable from the extracted table layout.]

BNN Bayesian neural network; ANN Artificial neural network; CNN Convolutional neural network; BAGG Bagging; RF Random forest; BOOST Boosting; RKHS Reproducing kernel Hilbert space; SVM Support vector machine; NB Naïve Bayes; KNN k-nearest neighbor; BL Bayes LASSO; wBSR Weighted Bayesian shrinkage regression; SSVS Stochastic search variable selection; RR Ridge regression. Superscript letters rank the predictive performance of the models in alphabetical order, 'a' being the best performance and 's' the worst

Fig. 2 Score of regression models for GWP obtained from accuracy ranking based on correlation (or correctly classified values) from cross-validation. The classification (CLASS) of the models is color-coded. BNN Bayesian neural network; ANN Artificial neural network; CNN Convolutional neural network; BAGG Bagging; RF Random forest; BOOST Boosting; RKHS Reproducing kernel Hilbert space; SVM Support vector machine; NB Naïve Bayes; KNN k-nearest neighbor; BL Bayes LASSO; wBSR Weighted Bayesian shrinkage regression; SSVS Stochastic search variable selection; RR Ridge regression

In both r-based and PMSE-based studies, Bayesian and kernel methods had the highest prediction capacities in both animals and plants. Although BNN is a neural network algorithm, it can also be interpreted under a Bayesian approach [11]. Other neural network models did not perform as well as expected, although it must be considered that these are relatively new in the field of genomic selection and perhaps there is a learning aim effect (Table 2). The scores from ensemble methods ranged widely between studies, and
generally ranked near the middle, although they are viewed as robust and stable, with extensive bibliographic support.

Fig. 3 Score of regression models for GWP obtained from accuracy ranking based on PMSE from cross-validation. The classification (CLASS) of the models is color-coded. BNN Bayesian neural network; ANN Artificial neural network; CNN Convolutional neural network; RF Random forest; BOOST Boosting; SVM Support vector machine; BL Bayes LASSO; SSVS Stochastic search variable selection; RR Ridge regression

GBLUP has been the most cited procedure and delivered the best prediction results among classical parametric methods, confirming its good predictive ability. However, our meta-analysis found that other traditional (e.g., Bayes B, BL) and ML (e.g., RKHS, BNN) methods outperformed it in animals and plants. The number of times each method is represented in the comparisons must also be taken into account. Classical models such as GBLUP and RR were the most represented in the studies considered (132 and 124 comparisons, respectively), and they ranked in 10th and 17th position for correlation-based results (Fig. 2) and in 8th and 12th position for PMSE-based results (Fig. 3). In contrast, ML models such as BNN and NB were seldom used (4 and 6 times, respectively), yet they took second and third place in the correlation-based results (Fig. 2), and BNN took second place in the PMSE-based results (Fig. 3). This suggests that there may not be sufficient evidence to strongly support the position of some ML algorithms (several are relatively new in GWP). However, RKHS ranks first in the correlation-based results (Fig. 2) and has been widely compared
(72 comparisons), so a more reliable conclusion can be drawn for it. Therefore, the SD of the results must be considered for a sound interpretation.
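As a rough illustration of how liabilities translate into rankings in the model described in this section, the sketch below assumes independent liabilities with unit residual variance, so that the difference between two liabilities is normal with variance 2. The means, the model labels, and the function names (`prob_outranks`, `simulate_order`) are hypothetical. This is not the Gibbs sampler implemented in GIBBSTHUR; it only shows the liability-to-rank mechanism under those assumptions.

```python
import math
import random


def prob_outranks(mu_i, mu_j):
    """P(l_i > l_j) for independent liabilities l = mu + e, with e ~ N(0, 1).

    The difference l_i - l_j is N(mu_i - mu_j, 2), so the probability is
    the standard normal CDF evaluated at (mu_i - mu_j) / sqrt(2).
    """
    z = (mu_i - mu_j) / math.sqrt(2.0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_order(means, rng):
    """Draw one liability per model and return the models ordered best to worst."""
    liabilities = {model: rng.gauss(mu, 1.0) for model, mu in means.items()}
    return sorted(liabilities, key=liabilities.get, reverse=True)


# Hypothetical liability means for three models (not estimates from this chapter).
hypothetical_means = {"model_A": 1.0, "model_B": 0.0, "model_C": -0.5}

rng = random.Random(2022)
n_comparisons = 20000
a_beats_b = 0
for _ in range(n_comparisons):
    order = simulate_order(hypothetical_means, rng)
    if order.index("model_A") < order.index("model_B"):
        a_beats_b += 1

empirical = a_beats_b / n_comparisons
theoretical = prob_outranks(1.0, 0.0)
print(round(theoretical, 3), round(empirical, 3))
```

With a difference of one liability unit between two models, the better model is expected to outrank the other in about 76% of comparisons. Even a clear difference in model effect therefore leaves considerable overlap in individual rankings, which is one reason why the SD of the estimated effects matters.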
8 Conclusions
ML methods have been used with success in genomic sciences. Many studies have reported a large variety of methods, models, and algorithms with promising predictive skills. In our meta-analysis, some ML algorithms (semi- and nonparametric) were globally more accurate than classical (parametric) prediction models. The most prominent models for r were RKHS, BNN, Naïve Bayes, Bayes A, and Bayes B. The lowest PMSE values were obtained for BOOST, BNN, BL, SVM, and RF. However, each approach has particularities and qualities that make its performance depend on the type of problem addressed. In conclusion, there is no universally best prediction machine, and it is prudent to consider the data metrics, environmental factors, and the study objective before carrying out a prediction analysis. Nevertheless, RKHS and Bayesian methods seemed to outperform other procedures in animal and plant breeding.
References
1. Luaces O, Quevedo JR, Pérez-Enciso M et al (2010) Explaining the genetic basis of complex quantitative traits through prediction models. J Comput Biol 17:1711–1723. https://doi.org/10.1089/cmb.2009.0161
2. González-Recio O, Rosa GJM, Gianola D (2014) Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci 166:217–231. https://doi.org/10.1016/j.livsci.2014.05.036
3. Chen JX (2016) The evolution of computing: AlphaGo. Comput Sci Eng 18:4–7
4. González-Camacho JM, Ornella L, Pérez-Rodríguez P et al (2018) Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome 11:170104. https://doi.org/10.3835/plantgenome2017.11.0104
5. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133. https://doi.org/10.1007/BF02478259
6. Farley BG, Clark WA (1954) Simulation of self-organizing systems by digital computer. IRE Prof Group Inf Theory 4:76–84. https://doi.org/10.1109/TIT.1954.1057468
7. Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 65:386–408. https://doi.org/10.1037/h0042519
8. Okut H, Gianola D, Rosa GJM, Weigel KA (2011) Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genet Res (Camb) 93:189–201. https://doi.org/10.1017/S0016672310000662
9. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: JMLR Workshop and Conference Proceedings
10. Angermueller C, Pärnamaa T, Parts L, Stegle O (2016) Deep learning for computational biology. Mol Syst Biol 12:878. https://doi.org/10.15252/msb.20156651
11. Gianola D, Okut H, Weigel KA, Rosa GJM (2011) Predicting complex quantitative traits with Bayesian neural networks: A case study with Jersey cows and wheat. BMC Genet 12:87. https://doi.org/10.1186/1471-2156-12-87
12. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. https://doi.org/10.1038/323533a0
13. Ehret A, Hochstuhl D, Krattenmacher N et al (2015) Short communication: Use of genomic and metabolic information as well as milk performance records for prediction of subclinical ketosis risk via artificial neural networks. J Dairy Sci 98:322–329. https://doi.org/10.3168/jds.2014-8602
14. Brito Lopes F, Magnabosco CU, Passafaro TL et al (2020) Improving genomic prediction accuracy for meat tenderness in Nellore cattle using artificial neural networks. J Anim Breed Genet 137:438–448. https://doi.org/10.1111/jbg.12468
15. Maldonado C, Mora-Poblete F, Contreras-Soto RI et al (2020) Genome-wide prediction of complex traits in two outcrossing plant species through deep learning and Bayesian regularized neural network. Front Plant Sci 11:1808. https://doi.org/10.3389/fpls.2020.593897
16. Khaki S, Wang L (2019) Crop yield prediction using deep neural networks. Front Plant Sci 10:621. https://doi.org/10.3389/fpls.2019.00621
17. Abdollahi-Arpanahi R, Gianola D, Peñagaricano F (2020) Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol 52:12. https://doi.org/10.1186/s12711-020-00531-z
18. Shahinfar S, Al-Mamun HA, Park B et al (2020) Prediction of marbling score and carcass traits in Korean Hanwoo beef cattle using machine learning methods and synthetic minority oversampling technique. Meat Sci 161:107997. https://doi.org/10.1016/j.meatsci.2019.107997
19. Montesinos-López A, Montesinos-López OA, Gianola D et al (2018) Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 (Bethesda) 8:3813–3828. https://doi.org/10.1534/g3.118.200740
20. Montesinos-López OA, Martín-Vallejo J, Crossa J et al (2019) A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 (Bethesda) 9:601–618. https://doi.org/10.1534/g3.118.200998
21. Montesinos-López OA, Montesinos-López A, Crossa J et al (2018) Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 (Bethesda) 8:3829–3840. https://doi.org/10.1534/g3.118.200728
22. Azodi CB, Bolger E, McCarren A et al (2019) Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 (Bethesda) 9:3691–3702. https://doi.org/10.1534/g3.119.400498
23. Chateigner A, Lesage-Descauses MC, Rogier O et al (2020) Gene expression predictions and networks in natural populations supports the omnigenic theory. BMC Genomics 21:416. https://doi.org/10.1186/s12864-020-06809-2
24. Pérez-Enciso M, Zingaretti LM (2019) A guide for using deep learning for complex trait genomic prediction. Genes (Basel) 10:553. https://doi.org/10.3390/genes10070553
25. Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez P et al (2021) A review of deep learning applications for genomic selection. BMC Genomics 22:19. https://doi.org/10.1186/s12864-020-07319-x
26. Bellot P, de los Campos G, Pérez-Enciso M (2018) Can deep learning improve genomic prediction of complex human traits? Genetics 210:809–819. https://doi.org/10.1534/genetics.118.301298
27. Han J, Gondro C, Reid K, Steibel JP (2021) Heuristic hyperparameter optimization of deep learning models for genomic prediction. G3 (Bethesda) 11(7):jkab032. https://doi.org/10.1093/g3journal/jkab032
28. Peters SO, Sinecen M, Kizilkaya K, Thomas MG (2020) Genomic prediction with different heritability, QTL, and SNP panel scenarios using artificial neural network. IEEE Access 8:147995–148006. https://doi.org/10.1109/ACCESS.2020.3015814
29. Pérez-Rodríguez P, Gianola D, Weigel KA et al (2013) Technical Note: An R package for fitting Bayesian regularized neural networks with applications in animal breeding. J Anim Sci 91:3522–3531. https://doi.org/10.2527/jas.2012-6162
30. Waldmann P (2018) Approximate Bayesian neural networks in genomic prediction. Genet Sel Evol 50:70. https://doi.org/10.1186/s12711-018-0439-1
31. Putnová L, Štohl R (2019) Comparing assignment-based approaches to breed identification within a large set of horses. J Appl Genet 60:187–198. https://doi.org/10.1007/s13353-019-00495-x
32. Van Bergen GHH, Duenk P, Albers CA et al (2020) Bayesian neural networks with variable selection for prediction of genotypic values. Genet Sel Evol 52:26. https://doi.org/10.1186/s12711-020-00544-8
33. Hubel DH, Wiesel TN (1963) Shape and arrangement of columns in cat's striate cortex. J Physiol 165:559–568. https://doi.org/10.1113/jphysiol.1963.sp007079
34. Hubel DH, Wiesel TN (1970) The period of susceptibility to the physiological effects of unilateral eye closure in kittens. J Physiol 206:419–436. https://doi.org/10.1113/jphysiol.1970.sp009022
35. Waldmann P, Pfeiffer C, Mészáros G (2020) Sparse convolutional neural networks for genome-wide prediction. Front Genet 11:25. https://doi.org/10.3389/fgene.2020.00025
36. Ma W, Qiu Z, Song J et al (2018) A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248:1307–1318. https://doi.org/10.1007/s00425-018-2976-9
37. Liu Y, Wang D, He F et al (2019) Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean. Front Genet 10:1091. https://doi.org/10.3389/fgene.2019.01091
38. Pook T, Freudenthal J, Korte A, Simianer H (2020) Using local convolutional neural networks for genomic prediction. Front Genet 11:561497. https://doi.org/10.3389/fgene.2020.561497
39. Zingaretti LM, Gezan SA, Ferrão LFV et al (2020) Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species. Front Plant Sci 11:1. https://doi.org/10.3389/fpls.2020.00025
40. Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
41. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140. https://doi.org/10.1007/bf00058655
42. González-Recio O, Forni S (2011) Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet Sel Evol 43:7. https://doi.org/10.1186/1297-9686-43-7
43. Yao C, Spurlock DM, Armentano LE et al (2013) Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle. J Dairy Sci 96:6716–6729. https://doi.org/10.3168/jds.2012-6237
44. Hempstalk K, McParland S, Berry DP (2015) Machine learning algorithms for the prediction of conception success to a given insemination in lactating dairy cows. J Dairy Sci 98:5262–5273. https://doi.org/10.3168/jds.2014-8984
45. Silveira LS, Lima LP, Nascimento M et al (2020) Regression trees in genomic selection for carcass traits in pigs. Genet Mol Res 19:GMR18498. https://doi.org/10.4238/gmr18498
46. Kwong QB, Teh CK, Ong AL et al (2017) Evaluation of methods and marker systems in genomic selection of oil palm (Elaeis guineensis Jacq.). BMC Genet 18:107. https://doi.org/10.1186/s12863-017-0576-5
47. Heslot N, Yang H-P, Sorrells ME, Jannink J-L (2012) Genomic selection in plant breeding: a comparison of models. Crop Sci 52:146–160. https://doi.org/10.2135/cropsci2011.06.0297
48. Medina CA, Hawkins C, Liu X-P et al (2020) Genome-wide association and prediction of traits related to salt tolerance in autotetraploid alfalfa (Medicago sativa L.). Int J Mol Sci 21:3361. https://doi.org/10.3390/ijms21093361
49. Wang DR, Guadagno CR, Mao X et al (2019) A framework for genomics-informed ecophysiological modeling in plants. J Exp Bot 70:2561–2574. https://doi.org/10.1093/jxb/erz090
50. Sarkar R, Rao AR, Meher PK et al (2015) Evaluation of random forest regression for prediction of breeding value from genome-wide SNPs. J Genet 94:187–192. https://doi.org/10.1007/s12041-015-0501-5
51. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Saitta L (ed) Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco
52. Friedman JH (2001) Greedy function approximation: A gradient boosting machine. Ann Stat 29:1189–1232. https://doi.org/10.1214/aos/1013203451
53. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: Data mining, inference, and prediction, second edn. Springer, New York
54. González-Recio O, Weigel KA, Gianola D et al (2010) L2-Boosting algorithm applied to high-dimensional problems in genomic selection. Genet Res (Camb) 92:227–237. https://doi.org/10.1017/S0016672310000261
55. González-Recio O, Jiménez-Montero JA, Alenda R (2013) The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets. J Dairy Sci 96:614–624. https://doi.org/10.3168/jds.2012-5630
56. Kim K, Seo M, Kang H et al (2015) Application of LogitBoost classifier for traceability using SNP chip data. PLoS One 10:e0139685. https://doi.org/10.1371/journal.pone.0139685
57. Grinberg NF, Orhobor OI, King RD (2020) An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 109:251–277. https://doi.org/10.1007/s10994-019-05848-5
58. Parzen E (1962) Extraction and detection problems and reproducing kernel Hilbert spaces. J Soc Ind Appl Math Ser A Control 1:35–62. https://doi.org/10.1137/0301004
59. Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(5):988–999
60. Gianola D, Fernando RL, Stella A (2006) Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173:1761–1776. https://doi.org/10.1534/genetics.105.049510
61. Gianola D, Van Kaam JBCHM (2008) Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178:2289–2303. https://doi.org/10.1534/genetics.107.084285
62. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423. https://doi.org/10.3168/jds.2007-0980
63. de los Campos G, Gianola D, Rosa GJM (2009) Reproducing kernel Hilbert spaces regression: A general framework for genetic evaluation. J Anim Sci 87:1883–1887. https://doi.org/10.2527/jas.2008-1259
64. Morota G, Gianola D (2014) Kernel-based whole-genome prediction of complex traits: A review. Front Genet 5:363. https://doi.org/10.3389/fgene.2014.00363
65. de los Campos G, Gianola D, Rosa GJM et al (2010) Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res (Camb) 92(4):295–308
66. Morota G, Koyama M, Rosa GJM et al (2013) Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data. Genet Sel Evol 45:1–15. https://doi.org/10.1186/1297-9686-45-17
67. Hu Y, Morota G, Rosa GJ, Gianola D (2015) Prediction of plant height in Arabidopsis thaliana using DNA methylation data. Genetics 201:779–793. https://doi.org/10.1534/genetics.115.177204
68. Kimeldorf G, Wahba G (1971) Some results on Tchebycheffian spline functions. J Math Anal Appl 33:82–95. https://doi.org/10.1016/0022-247X(71)90184-3
69. Blondel M, Onogi A, Iwata H, Ueda N (2015) A ranking approach to genomic selection. PLoS One 10:e0128570. https://doi.org/10.1371/journal.pone.0128570
70. Cuevas J, Montesinos-López O, Juliana P et al (2019) Deep Kernel for genomic and near infrared predictions in multi-environment breeding trials. G3 (Bethesda) 9:2913–2924. https://doi.org/10.1534/g3.119.400493
71. Gentzbittel L, Ben C, Mazurier M et al (2019) WhoGEM: An admixture-based prediction machine accurately predicts quantitative functional traits in plants. Genome Biol 20:1–20. https://doi.org/10.1186/s13059-019-1697-0
72. Toda Y, Wakatsuki H, Aoike T et al (2020) Predicting biomass of rice with intermediate traits: Modeling method combining crop growth models and genomic prediction models. PLoS One 15:e0233951. https://doi.org/10.1371/journal.pone.0233951
73. Wang S, Wei J, Li R et al (2019) Identification of optimal prediction models using multi-omic data for selecting hybrid rice. Heredity (Edinb) 123:395–406. https://doi.org/10.1038/s41437-019-0210-6
74. Xu Y, Xu C, Xu S (2017) Prediction and association mapping of agronomic traits in maize using multiple omic data. Heredity (Edinb) 119:174–184. https://doi.org/10.1038/hdy.2017.27
75. González-Recio O, Gianola D, Long N et al (2008) Nonparametric methods for incorporating genomic information into genetic evaluations: An application to mortality in broilers. Genetics 178:2305–2313. https://doi.org/10.1534/genetics.107.084293
76. Wahba G (2002) Soft and hard classification by reproducing kernel Hilbert space methods. Proc Natl Acad Sci U S A 99:16524–16530. https://doi.org/10.1073/pnas.242574899
77. Aljouie A, Roshan U (2015) Prediction of continuous phenotypes in mouse, fly, and rice genome wide association studies with support vector regression SNPs and ridge regression classifier. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE, London, pp 1246–1250
Machine Learning in Genomic Prediction 78. Heuer C, Scheel C, Tetens J et al (2016) Genomic prediction of unordered categorical traits: an application to subpopulation assignment in German Warmblood horses. Genet Sel Evol 48:13. https://doi.org/10.1186/ s12711-016-0192-2 79. Long N, Gianola D, Rosa GJM, Weigel KA (2011) Application of support vector regression to genome-assisted prediction of quantitative traits. Theor Appl Genet 123: 1065–1074. https://doi.org/10.1007/ s00122-011-1648-y 80. Yao C, Zhu X, Weigel KA (2016) Semisupervised learning for genomic prediction of novel traits with small reference populations: an application to residual feed intake in dairy cattle. Genet Sel Evol 48:1–9. https://doi.org/10.1186/s12711-0160262-5 81. Budhlakoti N, Mishra DC, Rai A et al (2019) A Comparative study of single-trait and multitrait genomic selection. J Comput Biol 26: 1100–1112. https://doi.org/10.1089/cmb. 2019.0032 82. Grinberg NF, Lovatt A, Hegarty M et al (2016) Implementation of genomic prediction in lolium perenne (L.) breeding populations. Front. Plant Sci 7:133. https://doi. org/10.3389/fpls.2016.00133 83. Iwata H, Jannink J-L (2011) Accuracy of genomic selection prediction in barley breeding programs: a simulation study based on the real single nucleotide polymorphism data of barley breeding lines. Crop Sci 51: 1915–1927. https://doi.org/10.2135/ cropsci2010.12.0732 84. Fix E, Hodges JL (1989) Discriminatory analysis. nonparametric discrimination: consistency Properties. Int Stat Rev/Rev Int Stat 57:238. https://doi.org/10.2307/1403797 85. Romero JR, Roncallo PF, Akkiraju PC et al (2013) Using classification algorithms for predicting durum wheat yield in the province of Buenos Aires. Comput Electron Agric 96: 173–179. https://doi.org/10.1016/j.com pag.2013.05.006 86. Yin L, Yin L, Zhang H et al (2020) KAML: Improving genomic prediction accuracy of complex traits using machine learning determined parameters. Genome Biol 21:146. 
https://doi.org/10.1186/s13059-02002052-w 87. Long N, Gianola D, Rosa GJM et al (2007) Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. J Anim Breed Genet 124:377–389. https://doi.org/ 10.1111/j.1439-0388.2007.00694.x
98. da Gama MPM, de Arau´jo Neto FR, de Oliveira HN et al (2014) Genetic parameters for rank of dairy Gir cattle in agricultural shows using Thurstonian procedure. In: Paper presented at: WCGALP Permanent International Committee and American Society of Animal Science. Proceedings of the 10th World Congress On Genetics Applied To Livestock Production, Vancouver (Canada) 99. Go´mez MD, Varona L, Molina A, Valera M (2011) Genetic evaluation of racing performance in trotter horses by competitive models. Livest Sci 140:155–160. https://doi.org/ 10.1016/j.livsci.2011.03.024 100. Sharaf T, Williams T, Chehade A, Pokhrel K (2020) BLNN: An R package for training neural networks using Bayesian inference. SoftwareX 11:100432. https://doi.org/10. 1016/j.softx.2020.100432 101. Salvatier J, Wiecki TV, Fonnesbeck C (2016) Probabilistic programming in Python using PyMC3. PeerJ Comput Sci 2016:e55. https://doi.org/10.7717/peerj-cs.55 102. Chen T, Li M, Li Y, et al (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv Prepr 103. Peters A, Hothorn T, Ripley BD, et al (2019) Improved predictors. R package ipredVersion 0.9-9 104. Pedregosa F, Varoquaux G, Buitinck L et al (2015) Scikit-learn: machine learning in Python. J Mach Learn Res 12(19):29–33 105. Witten IH, Frank E, Geller J (2002) Weka: practical machine learning tools and techniques with Java implementations. SIGMOD Rec 31:76–77. https://doi.org/10.1145/ 507338.507355 106. Liaw A, Wiener M (2002) Classification and Regression by randomForest. Quant Biol 5(4):338–351 107. Gonza´lez-Recio O, Forni S (2010) RanFoG: Random Forest in a java package to analyze disease resistance using genomic information. Journal of Dairy Science 99:7261–7273 108. Greenwell B, Boehmke B, Cunningham J, GBM developers (2020) Generalized boosted regression models. R package gbm Version 2.1.8 109. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. 
Proc ACM SIGKDD Int Conf Knowl Discov Data Min 13–17-Augu:785–794. https://doi.org/10. 1145/2939672.2939785 110. Endelman JB (2011) Ridge regression and other kernels for genomic selection with R
Package rrBLUP. Plant Genome 4:250–255. h t t p s : // d o i . o r g / 1 0 . 3 8 3 5 / p l a n tgenome2011.08.0024 111. Pe´rez P, De Los Campos G (2014) Genomewide regression and prediction with the BGLR statistical package. Genetics 198: 483–495. https://doi.org/10.1534/genet ics.114.164442 112. Karatzoglou A, Smola A, Hornik K (2016) Kernlab: Kernel-based machine learning lab. R package kernlab. Version 0.9-29 113. Michal Majka (2019) High performance implementation of the Naive Bayes algorithm. R package naivebayes. Version 0.9.7 114. Beygelzimer A, Kakadet S, Langford J, Li S (2015) Fast nearest neighbor search algorithms and applications. R package FNN. Version 1.1.3 115. Bolotov D (2020) Classification, regression, clustering with K nearest neighbors. R package neighbr. Version 1.0.3 116. De Los CG, Pe´rez P, Vazquez AI, Crossa J (2013) Genome-enabled prediction using the BLR (Bayesian Linear Regression) R-package. Methods Mol Biol 1019:299–320. https:// doi.org/10.1007/978-1-62703-447-0_12 117. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22. https://doi.org/10.18637/jss. v033.i01 118. Misztal I, Tsuruta S, Lourenco D et al (2018) Manual for BLUPF90 family of programs, Athens, USA, p 142 119. Abdollahi-Arpanahi R, Morota G, Valente BD et al (2015) Assessment of bagging GBLUP for whole-genome prediction of broiler chicken traits. J Anim Breed Genet 132: 218–228. https://doi.org/10.1111/jbg. 12131 120. Jime´nez-Montero JA, Gonza´lez-Recio O, Alenda R (2013) Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle. J Dairy Sci 96:625–634. https://doi.org/10.3168/jds. 2012-5631 121. Schulz-Streeck T, Ogutu JO, Piepho HP (2013) Comparisons of single-stage and two-stage approaches to genomic selection. Theor Appl Genet 126:69–82. https://doi. org/10.1007/s00122-012-1960-1 122. 
Xavier A, Muir WM, Rainey KM (2016) Assessing predictive properties of genomewide selection in soybeans. G3 Genes. G3 (Bethesda) 6:2611–2616. https://doi.org/ 10.1534/g3.116.032268
Chapter 8

Genomic Prediction Methods Accounting for Nonadditive Genetic Effects

Luis Varona, Andres Legarra, Miguel A. Toro, and Zulma G. Vitezica

Abstract

The use of genomic information for prediction of future phenotypes or breeding values for the candidates to selection has become a standard over the last decade. However, most procedures for genomic prediction only consider the additive (or substitution) effects associated with polymorphic markers. Nevertheless, the implementation of models that consider nonadditive genetic variation may be interesting because they (1) may increase the ability of prediction, (2) can be used to define mate allocation procedures in plant and animal breeding schemes, and (3) can be used to benefit from nonadditive genetic variation in crossbreeding or purebred breeding schemes. This study reviews the available methods for incorporating nonadditive effects into genomic prediction procedures and their potential applications in predicting future phenotypic performance, mate allocation, and crossbred and purebred selection. Finally, a brief outline of some future research lines is also proposed.

Key words Genomic prediction, Dominance, Epistasis, Crossbreeding, Genetic evaluation, Genomic selection
The original version of this chapter was revised. The correction to this chapter is available at https://doi.org/10.1007/978-1-0716-2205-6_23

Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022, Corrected Publication 2022

1 Introduction

Nonadditive genetic variation is caused by interaction between the effects of alleles either at the same locus (dominance) or between the alleles of different loci (epistasis). The term “dominance” was introduced by Gregor Mendel [1] when he realized that some traits dominate others (in his case, “round peas” were dominant over “wrinkled peas”). He noted that “hybrids do not represent the form exactly intermediate between the parental strains . . . .” Some years after the rediscovery of Mendel’s rules, at the beginning of the twentieth century, Bateson [2] introduced the term “epistasis” to describe those situations where two or more genes interact, and when the individual action of genes does not explain the mode of inheritance. However, most human, animal, or plant traits are governed by a large number of genes. For this reason, Fisher [3] developed the infinitesimal model, which postulates that genetic variation is controlled by a large number of unlinked genes, and that its relative importance in relation to the phenotypic variation (the heritability) can be inferred from the resemblance between relatives. Initially, Fisher [3] proposed a pure additive model that was expanded to incorporate dominance [3, 4], as well as second and higher order epistasis [5, 6].

The main objective of animal and plant breeding is to predict the genetic merit (or breeding value) of the reproduction candidates [7], while in human genetics the aim is to predict the phenotype of the individuals (i.e., disease risk). In animal and plant breeding, these predictions were traditionally based solely on phenotypic and pedigree information, using methods such as the selection index [8] and the Best Linear Unbiased Predictor (BLUP) [9], based on the postulates of the infinitesimal model [3]. Nevertheless, the use of nonadditive genetic effects in these prediction procedures was negligible due to the complexity of the calculations involved and the absence of large full-sib families for accurately estimating nonadditive genetic effects [10].
2 Genomic Prediction

Since the structure of the genetic code was discovered [11], developments in molecular genetics have provided an increasing number of molecular markers, including microsatellites or SNPs (single-nucleotide polymorphisms), which have been used massively to detect QTL (quantitative trait loci) in livestock, plants and human beings. The ultimate goal of QTL studies is to identify polymorphic markers or genes associated with phenotypic variation, potentially for use in phenotypic prediction for traits governed by one or just a few loci, and in marker or gene assisted selection or prediction of polygenic traits [12]. Nevertheless, the sequencing of the human [13, 14], livestock [15, 16], and plant genomes [17, 18] opened up the possibility of utilizing a large number of SNPs in dense genotyping devices [19] or next generation sequencing techniques [20]. Through these new information sources, prediction entered a new era with the development of genomic selection (GS) or prediction models [21]. The starting point for GS is a reference population with phenotypes and genotypes used to develop the following linear model:

$$y_i = \mu + \sum_{j=1}^{n} t_{ij} a_j + e_i$$

which explains the phenotypic performance of the $i$th individual ($y_i$), for $i = 1, 2, \ldots, m$, by the sum of the effects associated with a very large number of SNPs ($a_j$), for $j = 1, 2, \ldots, n$. Moreover, $t_{ij}$ is the genotypic configuration of the $i$th individual for the $j$th SNP (0, 1, 2 for $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively), and $e_i$ is the residual. The prediction of the breeding values for the candidates to selection or the phenotypic performance of individuals without a phenotype can be calculated as $\sum_{j=1}^{n} k_{ij} \hat{a}_j$, where $k_{ij}$ is the genetic
configuration for the $j$th SNP marker of the individuals to predict.

The main limitation of the direct implementation of the model presented above is that, in many cases, the number of parameters (SNPs) is much greater than the number of phenotypic records, which leads to an extreme case of the large p, small n problem. Even if the number of genotyped individuals is large, it makes sense to use some prior information, for instance, that small effects are frequent and large effects are unlikely. In the statistical literature, this problem is often addressed using some type of regularization [22]. Several procedures have been proposed, ranging from a Gaussian regularization [21] to models that use t-shaped [21], double exponential [23] or mixtures of distributions [24]. In general, the consensus of most of the studies comparing regularization procedures is that more complex models may be useful for traits influenced by large-effect genes, but that they all perform similarly for traits whose genetic variation is governed by many small-effect genes [25, 26].

The assumption of a Gaussian regularization (Random Regression BLUP, RR-BLUP) has a very interesting property, since it can be reformulated in terms of individual effects rather than SNP effects by defining a genomic relationship matrix (G) [27] and using Henderson’s mixed model equations (Genomic BLUP or GBLUP) [9]. Furthermore, this approach can be extended to include nongenotyped individuals by combining the observed (G) and expected (A) additive relationship matrices in the single-step approach (SSGBLUP) [28, 29].

Regardless of the regularization method, genomic evaluation methods are based on the estimation of marker substitution effects that contribute to the breeding values and are expressed in the next generation.
Substitution effects are capable of capturing a large part of the dominance and epistatic effects [30–32], although they are not stable across generations because allelic frequencies change as a consequence of selection or random drift. Nevertheless, estimates of nonadditive genetic effects may be interesting for three main reasons. First, their inclusion may increase the accuracy of prediction [33–35]. Second, they can be used to define mate allocation procedures in plant and animal breeding programs [33, 36, 37]. Finally, taking into account nonadditive genetic variation in crossbreeding or purebred breeding schemes can increase the response to selection [36, 38, 39].
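The equivalence between the SNP-effect formulation with Gaussian regularization and the individual-effect (GBLUP) formulation mentioned above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the data are simulated, the overall mean is handled by simply centering the phenotypes, the variance ratio `lam` is an arbitrary hypothetical value rather than an estimate, and the usual scaling of the genomic relationship matrix is absorbed into `lam`.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 200                                     # individuals, SNPs (n >> m)
p = rng.uniform(0.1, 0.9, size=n)                  # hypothetical allele frequencies
T = rng.binomial(2, p, size=(m, n)).astype(float)  # genotypes coded 0, 1, 2
Z = T - T.mean(axis=0)                             # centre every SNP column

# Simulated phenotypes: small additive effects plus residual noise
y = Z @ rng.normal(0.0, 0.1, size=n) + rng.normal(0.0, 1.0, size=m)
yc = y - y.mean()                                  # overall mean removed by centering

lam = 50.0  # assumed ratio of residual variance to SNP-effect variance

# SNP-effect formulation: solve an n x n system for the SNP effects
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(n), Z.T @ yc)
g_snp = Z @ a_hat                                  # genomic values from SNP effects

# Individual-effect formulation: solve an m x m system instead
K = Z @ Z.T                                        # unscaled genomic relationships
g_gblup = K @ np.linalg.solve(K + lam * np.eye(m), yc)

print(np.max(np.abs(g_snp - g_gblup)))             # difference is at rounding level
```

Both routes give the same predicted genomic values; the individual-effect route only requires a system of the size of the number of individuals, which is the practical appeal when SNPs greatly outnumber records.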
Here we will outline models that consider nonadditive inheritance in the case of a population that is, ideally, in Hardy–Weinberg and linkage equilibria (LE), unless otherwise stated. Methods for other types of populations, such as hybrid crops, are extensions of these models [40, 41].

2.1 Genomic Prediction Models with Dominance

The most basic approach to extend genomic prediction models is to include an additional dominance effect for each SNP marker [33, 42]:

$$y_i = \mu + \sum_{j=1}^{n} t_{ij} a_j + \sum_{j=1}^{n} c_{ij} d_j + e_i$$
where $a_j$ and $d_j$ are the additive and dominance effects for the $j$th SNP marker. The covariates $t_{ij}$ and $c_{ij}$ are 0, 1, 2 and 0, 1, 0 for the genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively. It should be noted that with this model, $a_j$ is the “biological” additive effect and not the marker substitution effect. This model can therefore be used for prediction of future phenotypes (or total genetic values) but not breeding values. In fact, Vitezica et al. [43] reformulated the model in terms of breeding values and dominance deviations [7] as:

$$y_i = \mu + \sum_{j=1}^{n} w_{ij} \alpha_j + \sum_{j=1}^{n} g_{ij} d_j + e_i$$

where

$$w_{ij} = \begin{cases} 2 - 2p_j & A_1A_1 \\ 1 - 2p_j & A_1A_2 \\ -2p_j & A_2A_2 \end{cases} \qquad g_{ij} = \begin{cases} -2q_j^2 & A_1A_1 \\ 2p_jq_j & A_1A_2 \\ -2p_j^2 & A_2A_2 \end{cases}$$

and $\alpha_j = a_j + d_j(q_j - p_j)$ is the allelic substitution effect, and $p_j$ and $q_j$ are the allelic frequencies of $A_1$ and $A_2$ for the $j$th marker. The contribution of each SNP marker to the additive variance, under LE, is $\sigma^2_{A_j} = 2p_jq_j\alpha_j^2 = 2p_jq_j\left[a_j + d_j(q_j - p_j)\right]^2$, and to the dominance variance is $\sigma^2_{D_j} = (2p_jq_jd_j)^2$. Moreover, if we treat the additive and dominance effects of SNPs as random, assuming $a_j \sim N(0, \sigma^2_a)$, $d_j \sim N(0, \sigma^2_d)$ and $Cov(a_j, d_j) = 0$, the “biological” and the “statistical” approaches are two equivalent parameterizations of the same model, as proved by Vitezica et al. [43], who describe the following expressions for switching variance component estimates between “biological” ($\sigma^2_{A^*}$ and $\sigma^2_{D^*}$) and “statistical” ($\sigma^2_A$ and $\sigma^2_D$) models:

$$\sigma^2_A = \sum_{j=1}^{n} 2p_jq_j\,\sigma^2_a + \sum_{j=1}^{n} 2p_jq_j(q_j - p_j)^2\,\sigma^2_d$$

$$\sigma^2_{A^*} = \sum_{j=1}^{n} 2p_jq_j\,\sigma^2_a$$

$$\sigma^2_D = \sum_{j=1}^{n} (2p_jq_j)^2\,\sigma^2_d$$

$$\sigma^2_{D^*} = \sum_{j=1}^{n} 2p_jq_j(1 - 2p_jq_j)\,\sigma^2_d$$
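The conversion between “biological” and “statistical” variance components can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the conversion expressions of Vitezica et al. [43] summed over markers, and the allele frequencies, the SNP-effect variances (`var_a`, `var_d`) and the function name `variance_components` are hypothetical choices made for the example.

```python
import numpy as np

def variance_components(p, var_a, var_d):
    """Return 'statistical' and 'biological' additive/dominance variances
    for allele frequencies p and SNP-effect variances var_a and var_d."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    h = 2.0 * p * q                      # expected heterozygosity per marker
    return {
        "A_stat": h.sum() * var_a + (h * (q - p) ** 2).sum() * var_d,
        "D_stat": (h ** 2).sum() * var_d,
        "A_bio": h.sum() * var_a,
        "D_bio": (h * (1.0 - h)).sum() * var_d,
    }

rng = np.random.default_rng(7)
freqs = rng.uniform(0.05, 0.95, size=1000)          # hypothetical frequencies
vc = variance_components(freqs, var_a=0.01, var_d=0.004)

# Same calculation when every marker has p = q = 0.5
vc_half = variance_components(np.full(1000, 0.5), var_a=0.01, var_d=0.004)

print(vc["A_stat"] + vc["D_stat"], vc["A_bio"] + vc["D_bio"])  # equal totals
print(vc_half)                                                  # both sets coincide
```

The total genetic variance is the same in both parameterizations; only its partition into additive and dominance components changes, with part of the “biological” dominance variance moving into the “statistical” additive variance whenever allele frequencies differ from 0.5.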
Note that $\sigma^2_A + \sigma^2_D = \sigma^2_{A^*} + \sigma^2_{D^*}$, that “biological” and “statistical” variances are identical if $p = q = 0.5$, and that $\sigma^2_A = \sigma^2_{A^*}$ if $d_j = 0$. The approach of Vitezica et al. [43] can also be generalized to avoid the assumption of Hardy–Weinberg equilibrium by using the natural and orthogonal interactions algorithm, NOIA, described by Álvarez-Castro and Carlborg [44]. In both the “biological” and “statistical” models, the additive and dominance effects must be regularized. Under the assumption of a Gaussian prior distribution, the model can be easily transformed by constructing additive (G) and dominance (D) covariance matrices [43, 45] and, additionally, these can be integrated into an approximate single-step approach [46] that includes genotyped and nongenotyped individuals.

2.2 Dominance and Inbreeding Depression (or Heterosis)
The theory of quantitative genetics [7] attributes inbreeding depression (and heterosis) to the presence of directional dominance (i.e., more positive than negative dominance effects) or epistasis. However, all of the procedures described above assume a symmetrical prior distribution for dominance effects. This limitation can be overcome by assuming asymmetric prior distributions [47] or a nonzero mean of the dominance effects [48]. In this last approach, the standard model is reformulated as

$$y_i = \mu + \sum_{j=1}^{n} t_{ij} a_j + \sum_{j=1}^{n} c_{ij}\left[\mu_d + d^*_j\right] + e_i = \mu + \sum_{j=1}^{n} t_{ij} a_j + \sum_{j=1}^{n} c_{ij} d^*_j + \sum_{j=1}^{n} c_{ij}\mu_d + e_i$$

with $d^*_j = d_j - \mu_d$, $E(d) = \mu_d$ and $E(d^*) = 0$. Note that $\sum_{j=1}^{n} c_{ij}\mu_d$ is the average of the dominance effects in the $i$th individual. The inbreeding coefficients $f_i$ can therefore be calculated as follows:

$$f_i = 1 - \frac{\sum_{j=1}^{n} c_{ij}}{n},$$

where $\frac{\sum_{j=1}^{n} c_{ij}}{n}$ is the proportion of heterozygous loci, and

$$\sum_{j=1}^{n} c_{ij}\mu_d = (1 - f_i)\,n\mu_d = n\mu_d - f_i\,n\mu_d$$
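As a small worked example of the expressions above, the inbreeding coefficient of each individual can be obtained directly from the heterozygosity covariates. This is a sketch with a made-up genotype matrix; the function name `genomic_inbreeding` and the value of `mu_d` are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def genomic_inbreeding(T):
    """f_i = 1 - (number of heterozygous loci) / n for genotypes coded 0, 1, 2."""
    C = (np.asarray(T) == 1).astype(float)   # c_ij = 1 for heterozygotes, else 0
    return 1.0 - C.mean(axis=1)

# Three individuals genotyped at four SNPs (toy data)
T = np.array([[0, 1, 2, 1],
              [0, 0, 2, 2],
              [1, 1, 1, 1]])
n = T.shape[1]
f = genomic_inbreeding(T)                    # 0.5, 1.0 and 0.0

mu_d = 0.2                                   # hypothetical mean dominance effect
lhs = (T == 1).sum(axis=1) * mu_d            # sum over loci of c_ij * mu_d
rhs = n * mu_d - f * n * mu_d                # n*mu_d minus the inbreeding covariate
print(f, lhs, rhs)
```

The term `n * mu_d` is the same for every individual, so only `f * n * mu_d` varies between individuals, which is why a regression on `f` captures the effect of the mean dominance effect.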
The term $n\mu_d$ is absorbed into the overall mean, and $f_i n\mu_d$ is a covariate associated with inbreeding that can be interpreted as inbreeding depression. However, the same expected mean is assumed for the dominance effects of all the markers, and there is evidence in the literature of heterogeneity of inbreeding depression effects along the genome [49–51]. On the other hand, some authors [52] have found that inbreeding depression is fairly homogeneous across the genome and that there are few benefits of accounting for region-specific inbreeding depression.

It is also usually assumed that additive and dominance effects are independent, although there is evidence of a possible relationship between additive and dominance “biological” effects [53, 54]. In this sense, Wellmann and Bennewitz [55] proposed a set of new regularization procedures (Bayes D0 to D3), of which the last two modelled the relationship between additive and dominance effects through the prior variance of the dominance effects (Bayes D2) or through their prior mean and variance (Bayes D3). However, these approaches (Bayes D2 and D3) are computationally demanding and cannot be described as a covariance structure of additive and dominance effects in individuals, given the ambiguity when assigning the reference allele within a locus. To solve this, Xiang et al. [56] developed a procedure that sets the most frequent allele as the reference allele and develops the covariance structure between the additive and dominance effects, proving that the choice of the reference allele and the selection process determine the sign of the correlation between functional additive and dominance effects. However, the procedure is complex and assumes a selected trait, and its predictive ability was comparable to that of the orthogonal model suggested by Vitezica et al. [43].

Finally, it is worth mentioning that there is some decoupling between inbreeding depression (and heterosis) and nonadditive variance.
On the one hand, additive × additive variance does not generate inbreeding depression but could be involved in heterosis, and, on the other, inbreeding depression is possible under the infinitesimal model without dominance variance [41, 57, 58].

2.3 Imprinting
Genomic imprinting [59] is another source of nonadditive genetic variation, especially in mammals. It implies the total or partial inactivation of paternal or maternal alleles. Nishio and Satoh [60] proposed two genomic prediction models to include imprinting effects. The first extends the model with dominance as:

$$y_i = \mu + \sum_{j=1}^{n} w_{ij} \alpha_j + \sum_{j=1}^{n} g_{ij} d_j + \sum_{j=1}^{n} r_{ij} i_j + e_i$$

where

$$w_{ij} = \begin{cases} 2 - 2p_j & A_1A_1 \\ 1 - 2p_j & A_1A_2 / A_2A_1 \\ -2p_j & A_2A_2 \end{cases} \qquad g_{ij} = \begin{cases} -2q_j^2 & A_1A_1 \\ 2p_jq_j & A_1A_2 / A_2A_1 \\ -2p_j^2 & A_2A_2 \end{cases}$$

and

$$r_{ij} = \begin{cases} 0 & A_1A_1 \\ 1 & A_1A_2 \\ -1 & A_2A_1 \\ 0 & A_2A_2 \end{cases}$$

and $i_j$ is the imprinting effect associated with the $j$th marker. The second alternative divides the genetic effects into paternal ($fa_j$) and maternal ($ma_j$) gametic effects plus a dominance deviation:

$$y_i = \mu + \sum_{j=1}^{n} l_{ij}\, fa_j + \sum_{j=1}^{n} j_{ij}\, ma_j + \dots$$

where $l_{ij} = \dots$ and $j_{ij} = \dots$
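$$\dots = \begin{cases} \dfrac{-2p_{12j}\,p_{22j}}{p_{11j} + p_{22j} - (p_{11j} - p_{22j})^2} & A_1A_1 \\[2ex] \dfrac{4p_{11j}\,p_{22j}}{p_{11j} + p_{22j} - (p_{11j} - p_{22j})^2} & A_1A_2 \\[2ex] \dfrac{-2p_{11j}\,p_{12j}}{p_{11j} + p_{22j} - (p_{11j} - p_{22j})^2} & A_2A_2 \end{cases}$$

being $p_{11j}$, $p_{12j}$ and $p_{22j}$ the genotypic frequencies for $A_1A_1$, $A_1A_2$ and $A_2A_2$ for the $j$th marker, respectively. Under the assumption that the additive and dominance SNP marker effects follow a Gaussian distribution, the additive and dominance “genomic” (co)variance relationships can be calculated as:

$$Cov(g_A) = \frac{H_A H_A'}{tr(H_A H_A')/n}\,\sigma^2_A = G_A\,\sigma^2_A$$

$$Cov(g_D) = \frac{H_D H_D'}{tr(H_D H_D')/n}\,\sigma^2_D = G_D\,\sigma^2_D$$

For second order epistatic effects ($g_{AA}$, $g_{AD}$ and $g_{DD}$):

$$h_{AA_{ij}} = h_{A_i} \otimes h_{A_j} \qquad h_{AD_{ij}} = h_{A_i} \otimes h_{D_j} \qquad h_{DD_{ij}} = h_{D_i} \otimes h_{D_j}$$

the matrices $H_{AA}$, $H_{AD}$ and $H_{DD}$ are

$$H_{AA} = \begin{pmatrix} h_{A_1} \otimes h_{A_1} \\ \vdots \\ h_{A_k} \otimes h_{A_k} \end{pmatrix} \qquad H_{AD} = \begin{pmatrix} h_{A_1} \otimes h_{D_1} \\ \vdots \\ h_{A_k} \otimes h_{D_k} \end{pmatrix} \qquad H_{DD} = \begin{pmatrix} h_{D_1} \otimes h_{D_1} \\ \vdots \\ h_{D_k} \otimes h_{D_k} \end{pmatrix}$$

and the covariances of $g_{AA}$, $g_{AD}$ and $g_{DD}$ are:

$$Cov(g_{AA}) = \frac{H_{AA} H_{AA}'}{tr(H_{AA} H_{AA}')/n}\,\sigma^2_{AA} = G_{AA}\,\sigma^2_{AA}$$

$$Cov(g_{AD}) = \frac{H_{AD} H_{AD}'}{tr(H_{AD} H_{AD}')/n}\,\sigma^2_{AD} = G_{AD}\,\sigma^2_{AD}$$

$$Cov(g_{DD}) = \frac{H_{DD} H_{DD}'}{tr(H_{DD} H_{DD}')/n}\,\sigma^2_{DD} = G_{DD}\,\sigma^2_{DD}$$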
Calculating the $HH'$ cross-products is computationally very costly, but Vitezica et al. [65] proposed an algebraic shortcut that allows them to be calculated from the additive ($G_A$) and dominance ($G_D$) matrices as

$$Cov(g_{AA}) = \frac{G_A \circ G_A}{tr(G_A \circ G_A)/n}\,\sigma^2_{AA} = G_{AA}\,\sigma^2_{AA}$$

$$Cov(g_{AD}) = \frac{G_A \circ G_D}{tr(G_A \circ G_D)/n}\,\sigma^2_{AD} = G_{AD}\,\sigma^2_{AD}$$

$$Cov(g_{DD}) = \frac{G_D \circ G_D}{tr(G_D \circ G_D)/n}\,\sigma^2_{DD} = G_{DD}\,\sigma^2_{DD}$$
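The Hadamard shortcut amounts to an element-wise product followed by a rescaling by the average diagonal element. The sketch below illustrates this with simulated genotypes; it assumes Hardy–Weinberg coding of the additive and dominance covariates (the $w_{ij}$ and $g_{ij}$ covariates of the dominance model), and the helper name `scaled`, the sample sizes and the allele frequencies are all hypothetical.

```python
import numpy as np

def scaled(K):
    """Divide a relationship matrix by its average diagonal element, tr(K)/n."""
    return K / (np.trace(K) / K.shape[0])

rng = np.random.default_rng(3)
n_ind, n_snp = 20, 500
p = rng.uniform(0.1, 0.9, size=n_snp)              # frequency of A1 (hypothetical)
q = 1.0 - p
n_A1 = rng.binomial(2, p, size=(n_ind, n_snp))     # copies of A1 in each genotype

# Additive covariate: 2-2p, 1-2p, -2p for A1A1, A1A2, A2A2
H_A = n_A1 - 2.0 * p
# Dominance covariate: -2q^2, 2pq, -2p^2 for A1A1, A1A2, A2A2
H_D = np.where(n_A1 == 2, -2.0 * q**2,
               np.where(n_A1 == 1, 2.0 * p * q, -2.0 * p**2))

G_A = scaled(H_A @ H_A.T)                          # additive relationships
G_D = scaled(H_D @ H_D.T)                          # dominance relationships

# Second-order epistatic relationships from element-wise (Hadamard) products
G_AA = scaled(G_A * G_A)
G_AD = scaled(G_A * G_D)
G_DD = scaled(G_D * G_D)

print(np.trace(G_AA) / n_ind, np.trace(G_AD) / n_ind, np.trace(G_DD) / n_ind)
```

In NumPy, `*` between two arrays of the same shape is the element-wise product, so no explicit construction of the very wide epistatic covariate matrices is needed; the cost is that of multiplying two matrices of the size of the number of individuals element by element.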
Similarly, the covariance between higher order epistatic effects can be calculated as:

$$Cov(g_{ijk}) = \frac{G_i \circ G_j \circ G_k}{tr(G_i \circ G_j \circ G_k)/n}\,\sigma^2_{ijk} = G_{ijk}\,\sigma^2_{ijk}$$

However, regardless of whether orthogonal or nonorthogonal coding is used, Martini et al. [70] pointed out that this approach double counts interactions between loci and implicitly models some extra intralocus interactions. To avoid double counting, several authors [64, 69, 70] have proposed that the genomic relationship matrix for the additive × additive interaction should be:

$$G_{AA} = 0.5\,(G_A \circ G_A) - 0.5\,(H_A \circ H_A)(H_A \circ H_A)'$$

which converges to the same genomic relationship matrix as the Hadamard product $G_{AA} = G_A \circ G_A$ if the number of SNP markers is large enough. This procedure has been extended up to the fourth order of epistatic effects [70], but it is computationally demanding, although Jiang and Reif [71] have recently proposed an exact and computationally feasible solution. The Hadamard product approach will work reasonably well with a very large number of SNPs for lower orders of epistasis, and when the whole genome epistatic relationship matrix is considered. However, in some cases [72–74], the divergence between the Hadamard product of additive and dominance genomic relationship matrices and the exact approach may be greater.

2.6 Machine Learning Approaches
An alternative approach to developing predictive models that include nonadditive genetic effects is based on new developments in machine learning (ML) [75]. In genomic prediction, the objective of ML methods is to predict the performance of an individual through a function that maps genotypes to phenotypes. The starting point of ML is the development of a predictive function that must be calibrated in a training population and evaluated in a testing population. In recent decades, ML algorithms have explored several approaches. Among these, some strategies have gained popularity in genomic prediction, such as RKHS (Reproducing Kernel Hilbert Space) [76], ensemble [77], and, more recently, deep learning [78] methods.

The RKHS methods are based on the definition of a parametric or nonparametric function $g(\cdot)$ of a set of features ($X$), such as the SNP genotypes,

$$y = \mu + g(X) + e$$

and on defining a cost function to minimize

$$J = (y - g(X))'(y - g(X)) + \lambda\,\|g(X)\|^2_H$$

where the term $\|g(X)\|^2_H$ is a norm in a Hilbert space. Kimeldorf and Wahba [79] noticed that $g(X)$ can be reformulated as follows:

$$g(X) = \alpha_0 + \sum_{i=1}^{n} \alpha_i K(x - x_i)$$
where $K$ is a positive semidefinite matrix that meets the requisites of a Kernel Matrix. Its function is to define the similarity between individuals and it corresponds to a distance in a Hilbert space [80]. Therefore, the key step is the proper definition of $K$, which can be chosen from a very large number of options, such as, among others, Gaussian [81], Gaussian radial basis function [82], exponential [83], SNP Grid [84] or the average of several kernels [85]. Procedures consider similarity across individuals within loci or between pairs, triplets etc., of loci, as described by Jiang and Reif [69] and Martini et al. [64]. In the first case, the $K$ matrices are weighted sums of the genomic additive (G) and dominance covariance matrices (D), whereas in the second it may also consider epistatic interactions.

Ensemble methods for ML rely on the idea of combining the predictions for a set of “weak” predictive models or learners into a final, more accurate prediction [86]. Thus, in ensemble methods

$$g(X) = \sum_{i=1}^{M} w_i\, h_i(y, X)$$
where $w_i$ is the weight for the $i$th learner ($h_i(y, X)$). There is an enormous literature on how to generate and combine learners, ranging from using a predetermined number of learners to defining methods that generate new learners from a base learner using resampling techniques [86]. Some methods from this last category have been frequently used in genomic prediction, such as Boosting [87, 88], Bagging [89, 90], and Random Forest [91, 92]. The appeal of ensemble methods for genomic prediction is that a combination of “weak” learners may be able to capture hidden nonadditive interactions among SNP markers.
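The weighted combination of learners can be made concrete with bagging, where each learner is fitted to a bootstrap sample of the training population and all learners receive the same weight $w_i = 1/M$. The sketch below is only an illustration under stated assumptions: the base learner is a simple ridge regression on SNP genotypes (chosen for brevity, not because it is the learner used in the cited studies), the data are simulated, and the function names and the penalty `lam` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Base learner: ridge regression on SNP covariates with an intercept."""
    mu = y.mean()
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ (y - mu))
    return mu, beta

def bagging_predict(X_train, y_train, X_test, n_learners=25, lam=10.0, seed=0):
    """Average the predictions of learners fitted to bootstrap samples."""
    rng = np.random.default_rng(seed)
    m = X_train.shape[0]
    preds = np.zeros((n_learners, X_test.shape[0]))
    for b in range(n_learners):
        idx = rng.integers(0, m, size=m)            # bootstrap: sample with replacement
        mu, beta = fit_ridge(X_train[idx], y_train[idx], lam)
        preds[b] = mu + X_test @ beta               # prediction of the b-th learner
    weights = np.full(n_learners, 1.0 / n_learners) # equal weights w_i = 1/M
    return weights @ preds                          # g(X) = sum_i w_i h_i(y, X)

# Simulated training and testing populations (toy sizes)
rng = np.random.default_rng(11)
X = rng.binomial(2, 0.3, size=(120, 50)).astype(float)
y = X @ rng.normal(0.0, 0.2, size=50) + rng.normal(0.0, 1.0, size=120)
X_train, y_train, X_test = X[:100], y[:100], X[100:]

pred = bagging_predict(X_train, y_train, X_test)
print(pred.shape)
```

Boosting and Random Forest differ in how the learners are generated and weighted, but they fit the same general expression; replacing `fit_ridge` with a tree-based learner would be the usual route to capturing interactions among markers.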
Finally, the ML approach that has gained the greatest relevance in recent years involves deep learning [93, 94]. Deep learning methods are based on neural networks, which are founded on several layers of “perceptrons” [95]. Each “perceptron” receives inputs (either data or the output from previous layers) and transforms these into an output through an activation function. Thus, the first layer of “perceptrons” receives the initial data and transforms this into several outputs that feed the subsequent layers to generate the final output from the last layer. Neural networks have been used for genomic prediction [96, 97], but they generally contain only a few hidden layers, while deep networks can have a very large number of them. Moreover, deep learning methods can include a very large variety of complex structures such as convolutional or recurrent steps [98]. Among these, convolutional steps can be a very useful tool for analyzing input data that follows a spatial pattern, such as the locations of the SNP markers in the genome. Convolutions summarize the inputs within a neighborhood and can be used to capture local epistatic effects. Nevertheless, the implementation of deep learning procedures requires a very large data set (i.e., several thousands) and fine tuning of a set of hyperparameters [78, 98] that includes, among other things, an activation function, a learning rate and the architecture of the deep neural network. Deep learning has been shown to be a highly efficient tool in a wide range of applications in various fields [99] and has gained relevance in genomic analysis [100–102]. Additionally, it has recently been introduced into genomic prediction in plants [103, 104], animals [105], and humans [106] with mixed results, as it is not consistently better than GBLUP. Intensive research is therefore still needed to define the appropriate hyperparameters to implement deep learning efficiently in genomic prediction.
3 Applications of Genomic Prediction with Nonadditive Genetic Effects

3.1 Phenotype Prediction

The main application of genomic prediction is to estimate the continuous or categorical phenotypes of humans, animals, or plants before they manifest, regardless of whether gene action is additive, dominant, or epistatic. Thus, any parametric or nonparametric procedure that considers nonadditive effects should in principle improve phenotype prediction [33, 55]. Momen et al. [107] showed that when the true genetic architecture is largely or partially due to epistatic interactions, the additive model may not perform well, while models that account explicitly for interaction generally increase prediction accuracy. Moreover, apparent nonadditive interactions such as “phantom” epistasis may be generated under pure additive inheritance by imperfect linkage disequilibrium (LD) between markers and QTL [108]. Therefore, models that consider epistasis may have
a better predictive ability than additive models, especially for low density SNP panels [109], even in the absence of “biological” epistatic effects. From an alternative point of view, it should be noted that the “biological” epistatic effects that influence complex traits may be caused by the interaction among unobserved intermediate traits, whose phenotypic information can be obtained by “omics” techniques [110, 111], high-throughput phenotyping [112] or from the output of crop growth models [113]. If those intermediate traits are available, modelling them may be an interesting alternative for genomic prediction of the end-point traits [114].

In general, the models that best predict future phenotypes may not be associated with a readily understandable inference, such as that provided by parametric methods. Several studies have compared the performance of parametric and nonparametric methods [92, 115, 116]. The consensus is that parametric methods are generally more robust, although, when properly tuned, nonparametric methods can achieve better phenotype prediction. However, it has not been possible to identify a gold standard method for all traits and populations [116]. Nevertheless, machine learning and nonparametric methods may play a very important role in the prediction of phenotypic performance, as they can also be integrated more easily with the new and unstructured information provided by a broad range of new “omics” [117, 118]. Yet, in most cases, the prediction goal in plant and animal breeding is not the individual phenotype, but rather the breeding value, as this contributes to the phenotypic performance of the progeny. In this case, parametric methods with orthogonal approaches [65] that distinguish between additive and nonadditive “statistical” effects may provide more accurate breeding values.

3.2 Mate Allocation
In the genetic improvement of livestock and plants, breeders have sought to define the mating structure among the reproductive individuals of the population since the foundations of the discipline [119]. In hybrid crop breeding, predicting the yield of single crosses can have a great impact by reducing the number of crosses needed for training [120]. According to the postulates of quantitative genetics theory, mate allocation is only feasible when there is substantial variation caused by dominance deviations (or epistasis). Toro and Varona [33] showed that, under models that include dominance effects, the expected performance of a future mating ($G_{ij}$) between the ith and jth individuals can be predicted through:

$$E\left[G_{ij}\right] = \sum_{k=1}^{n} \left[ \mathrm{pr}_{ijk}(A_1A_1)\,\hat{a}_k + \mathrm{pr}_{ijk}(A_1A_2)\,\hat{d}_k - \mathrm{pr}_{ijk}(A_2A_2)\,\hat{a}_k \right]$$
Genomic Prediction with Nonadditive Effects
where $\mathrm{pr}_{ijk}(A_1A_1)$, $\mathrm{pr}_{ijk}(A_1A_2)$, and $\mathrm{pr}_{ijk}(A_2A_2)$ are the probabilities of the genotypes A1A1, A1A2, and A2A2 for the combination of the ith and jth individuals at the kth marker, $\hat{a}_k$ and $\hat{d}_k$ are the estimates of the additive and dominance effects of the kth marker, and n is the number of markers. Once the expected performance of all possible matings has been calculated, the optimum mating structure should be obtained through an optimization method, such as linear programming [121] or simulated annealing [122]. In a simulation study, Toro and Varona [33] showed that mate allocation can provide an additional response in the total genetic value of the expected progeny of up to 22% over random mating, with a narrow-sense heritability of 0.20 and a dominance variance ratio of 0.10. This was subsequently confirmed in dairy cattle [123–125], pigs [37], and rice [126]. However, Fernández et al. [127] showed that minimum coancestry matings (MCM) can outperform mate selection (MS) when the trait exhibits substantial inbreeding depression. This is because MCM leads to high levels of heterozygosity across the whole genome, whereas MS specifically promotes heterozygosity at those SNPs where a dominance effect has been detected, and if this effect is not correctly estimated it may lead to suboptimal results [127]. In models with more complex interactions, such as epistatic or nonparametric models, predicting the performance of a future mating is not as easy, since it must be calculated by integrating the predicted performance over all possible genotypic configurations of the expected progeny. Moreover, for effects that involve more than one locus, these genotypic configurations also depend on recombination fractions across the genome.
A possible alternative that avoids that complex integration by Monte Carlo is to generate “phantom individuals” under a dominance or epistatic single-step approach [46] in order to predict the additive, dominance, and epistatic effects of future potential individuals. Nevertheless, prediction accuracy for nonadditive effects in future individuals will generally be very low.
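The expected mating value and the allocation step described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: parental genotypes are taken as known and coded as the number of A1 alleles (0, 1, or 2), progeny genotype probabilities follow simple Mendelian segregation at each marker, and an exhaustive search over pairings stands in for the linear programming [121] or simulated annealing [122] that a realistic problem would need. All function names, genotypes, and effect estimates are hypothetical.

```python
from itertools import permutations


def transmit_prob(genotype):
    """Probability that a parent carrying `genotype` copies of A1 transmits A1."""
    return genotype / 2.0


def expected_mating_value(sire, dam, a_hat, d_hat):
    """Sum over markers of pr(A1A1)*a + pr(A1A2)*d - pr(A2A2)*a for one mating."""
    total = 0.0
    for g_s, g_d, a, d in zip(sire, dam, a_hat, d_hat):
        p_s, p_d = transmit_prob(g_s), transmit_prob(g_d)
        pr_11 = p_s * p_d                      # both parents transmit A1
        pr_22 = (1.0 - p_s) * (1.0 - p_d)      # both parents transmit A2
        pr_12 = 1.0 - pr_11 - pr_22            # heterozygous progeny
        total += pr_11 * a + pr_12 * d - pr_22 * a
    return total


def best_allocation(sires, dams, a_hat, d_hat):
    """Exhaustive search over one-to-one pairings (toy sizes only).

    Assumes there are at least as many dams as sires.
    """
    value = [[expected_mating_value(s, d, a_hat, d_hat) for d in dams]
             for s in sires]
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(len(dams)), len(sires)):
        total = sum(value[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return value, list(best_perm), best_total


# Hypothetical toy data: two sires, two dams, three markers.
sires = [[2, 0, 1], [0, 2, 1]]
dams = [[0, 2, 1], [2, 0, 1]]
a_hat = [0.5, 0.2, 0.1]   # assumed additive effect estimates
d_hat = [1.0, 0.8, 0.0]   # assumed dominance effect estimates

value, allocation, total = best_allocation(sires, dams, a_hat, d_hat)
print("expected values per mating:", value)
print("dam assigned to each sire:", allocation, "total:", total)
```

In this toy case the search pairs each sire with the dam that is homozygous for the opposite alleles at the two markers with dominance, which is the behaviour the formula rewards. The exhaustive search grows factorially with the number of candidates, which is why the optimization methods cited above are used in practice.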
4 Selection for Crossbreeding
Livestock and plant genetic improvement is often carried out through crossbreeding schemes. In animals, the most frequent structure involves two- or three-way crosses between populations that must be maintained and selected separately, while in hybrid crops the breeding objective is to identify the best cross among inbred lines and to select new lines from crosses of existing ones [128]. The aim of crossbreeding is to capture the complementarity between the lines and to obtain the benefits of heterosis, which, according to quantitative genetic postulates, is caused by dominance and epistatic interactions [7]. Therefore, in crossbreeding, the
breeding goal of purebred populations is to maximize phenotypic performance in the crossbred individuals, and the traditional approach to achieving this goal is Reciprocal Recurrent Selection (RRS) [129]. RRS proposes that selection in purebred populations be based on the crossbred performance of their progeny, but this implies an increase in the generation interval that compromises the overall response. In crop breeding, RRS is the norm [41], whereas, in livestock, most breeding schemes use purebred performance for selection, in the expectation of a high purebred/crossbred genetic correlation [130]. However, purebred and crossbred performances are not perfectly correlated for several reasons, including genotype-by-environment interactions, imperfect LD, and the existence of dominance, epistasis, or imprinting [131, 132]. Genomic data provide very useful information for improving the breeding value prediction of purebred individuals for crossbred performance without the need to wait for records on the crossbred progeny. This strategy has been termed Reciprocal Recurrent Genomic Selection (RRGS) [133, 134]. The first approach using GS for crossbred performance in livestock breeding was designed by Ibáñez-Escriche et al. [135], who defined a breed-specific genomic selection model (BSAM) as

$$y_i = \mu + \sum_{j=1}^{n} \left( t^{S}_{ijk}\,\alpha^{S}_{jk} + t^{D}_{ijl}\,\alpha^{D}_{jl} \right) + e_i$$
where $t^{S}_{ijk}$ is the SNP allele at the jth locus from breed k received from the sire of the ith individual, which can take the value 0 or 1, and $\alpha^{S}_{jk}$ is the breed-specific substitution effect for the jth locus and the kth breed. Similarly, $t^{D}_{ijl}$ and $\alpha^{D}_{jl}$ are defined for the alleles received from the dam and from the lth breed. The aim is to estimate allele substitution effects within each breed, which depend on the “biological” additive and dominance effects and on the allele frequencies in the purebred populations. This model was expanded by Sevillano et al. [136] to a three-way crossbreeding scheme by using a procedure to trace the breed of origin of the alleles [137, 138]. Stuber and Cockerham [139] showed that gene substitution effects can be defined within or across populations, and that both approaches are equivalent if all the nonadditive effects are accounted for. Therefore, Christensen et al. [140] proposed an alternative model called the “common genetic” approach, and described the genomic relationship matrix for genotyped and nongenotyped individuals following a single-step rationale [28, 29]. Later, Xiang et al. [48, 141] compared the “breed specific allele” and “common genetic” approaches and obtained similar results. Both the “breed specific allele” and “common genetic” approaches are based on allele substitution effects, but it is feasible to develop a model for crossbreeding data that includes the “biological” additive and dominance effects. In this sense, Zeng
et al. [38] developed a model incorporating the additive and dominance effects of the SNPs and compared it with the “specific allele” model, concluding from a simulation study that the dominance model has better predictive ability. Moreover, Vitezica et al. [142] and Christensen et al. [143] proved that dominance models with normally distributed SNP effects can be transformed into mixed models with animal effects within a crossbreeding scheme. In addition, these studies assumed that the “biological” additive and dominance effects might differ between purebred and crossbred populations as a consequence of genotype-by-environment interaction. They therefore proposed a multivariate genomic BLUP to account for the correlations between purebred and crossbred populations. In plants, models for hybrid prediction divide the genotypic value of a cross between two inbred lines (i and j) into General Combining Ability (GCA) and Specific Combining Ability (SCA) [139, 144] as follows:

$$G_{ij} = \mu + GCA_i + GCA_j + SCA_{ij}$$

GCA includes additive effects and all sorts of within-line additive epistatic interactions, while SCA includes dominance, across-line additive × additive epistasis, and all epistatic interactions involving dominance [139]. A first approach to genomic prediction of hybrid performance was developed by Technow et al. [40], who modelled GCA as within-line additive effects and SCA as across-line additive × additive interactions. Recently, González-Diéguez et al. [145] proposed an orthogonal model that allows GCA and SCA to be divided into all possible additive, dominance, and epistatic interactions. For breeding purposes, lines should be selected on GCA, but disentangling additive from epistatic effects (and variances) within each line is of interest because the additive variance is related to the genetic progress achievable when selecting new lines.
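To make the GCA/SCA partition concrete, the sketch below splits a complete table of hybrid values into a mean, two sets of GCA effects, and an SCA term, so that every cell satisfies the decomposition given above. It is the classical means-based partition for a balanced factorial between two groups of lines, not the genomic models of Technow et al. [40] or González-Diéguez et al. [145]; the function name and the yield figures are hypothetical.

```python
def combining_abilities(hybrid_values):
    """Split a complete table G[i][j] into mu, GCA (rows, columns) and SCA.

    Rows index the lines of one parental group and columns the lines of the
    other; every row-by-column cross is assumed to have been observed once.
    """
    n_rows, n_cols = len(hybrid_values), len(hybrid_values[0])
    mu = sum(sum(row) for row in hybrid_values) / (n_rows * n_cols)
    # GCA of a line: mean of its hybrids, expressed as a deviation from mu.
    gca_rows = [sum(row) / n_cols - mu for row in hybrid_values]
    gca_cols = [sum(hybrid_values[i][j] for i in range(n_rows)) / n_rows - mu
                for j in range(n_cols)]
    # SCA of a cross: what is left after removing mu and both GCA effects.
    sca = [[hybrid_values[i][j] - mu - gca_rows[i] - gca_cols[j]
            for j in range(n_cols)]
           for i in range(n_rows)]
    return mu, gca_rows, gca_cols, sca


# Hypothetical hybrid yields for a 2 x 2 factorial.
G = [[10.0, 12.0],
     [14.0, 12.0]]
mu, gca_rows, gca_cols, sca = combining_abilities(G)
print("mu:", mu)
print("GCA (row lines):", gca_rows, "GCA (column lines):", gca_cols)
print("SCA:", sca)
```

In this example the two column lines have identical GCA, yet the best single cross is still identified through its positive SCA, which is the part of hybrid performance that selection on GCA alone cannot capture.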
5 Selection in Purebred Populations
In outbred populations, the selection response in purebred populations depends on the magnitude of the additive variance and on the prediction accuracy of the additive breeding values [7]. Hence, the only expected advantage of including nonadditive effects in genomic prediction models is an increase in the accuracy of breeding values [33, 55], and it is not worth selecting individuals with the highest dominance values because their progeny will revert to the average as a result of random mating. However, Toro [146, 147] proposed two strategies to take advantage of dominance in a closed population.
The first [146] consists of performing two types of mating: (a) minimum coancestry matings, to obtain the progeny that will constitute the commercial population and will also be used for testing, and (b) maximum coancestry matings, from which the breeding population will be maintained. Toro’s second strategy [147] advocates selecting grandparental combinations. Both strategies are related to reciprocal recurrent selection [129], as they rely on the distinction between commercial and breeding populations. Nevertheless, they have only been tested through simulations using a reduced set of genes with known additive and dominance effects, and their efficiency has yet to be verified with a large number of SNP markers. The distinction between breeding and commercial populations is also implicit in the proposal of a two-part strategy [148] for the development of new crop inbred lines, which distinguishes between a breeding population that is systematically selected by recurrent genomic selection and a product development component that identifies the best new crossbred variety.
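The statement at the start of this section, that dominance values are not passed on under random mating, can be checked numerically with the standard single-locus results of quantitative genetics [7]. The sketch below assumes one biallelic locus in Hardy-Weinberg equilibrium with genotypic values a, d, and −a for A1A1, A1A2, and A2A2, and mates each parental genotype at random with the population. The function names and the values of p, a, and d are hypothetical.

```python
def average_effect(p, a, d):
    """Average effect of an allele substitution, alpha = a + d(q - p)."""
    q = 1.0 - p
    return a + d * (q - p)


def population_mean(p, a, d):
    """Mean genotypic value under Hardy-Weinberg proportions."""
    q = 1.0 - p
    return a * (p - q) + 2.0 * p * q * d


def breeding_value(a1_count, p, a, d):
    """Breeding value of a genotype carrying `a1_count` copies of A1."""
    return (a1_count - 2.0 * p) * average_effect(p, a, d)


def dominance_deviation(a1_count, p, a, d):
    """Genotypic value minus population mean minus breeding value."""
    genotypic_value = {2: a, 1: d, 0: -a}[a1_count]
    return (genotypic_value - population_mean(p, a, d)
            - breeding_value(a1_count, p, a, d))


def progeny_mean(a1_count, p, a, d):
    """Expected genotypic value of progeny when the parent is mated at random."""
    q = 1.0 - p
    t = a1_count / 2.0            # probability the parent transmits A1
    pr_11 = t * p
    pr_22 = (1.0 - t) * q
    pr_12 = 1.0 - pr_11 - pr_22
    return pr_11 * a + pr_12 * d - pr_22 * a


p, a, d = 0.3, 1.0, 0.5           # assumed allele frequency and effects
rows = []
for a1_count in (2, 1, 0):
    bv = breeding_value(a1_count, p, a, d)
    dd = dominance_deviation(a1_count, p, a, d)
    progeny_dev = progeny_mean(a1_count, p, a, d) - population_mean(p, a, d)
    rows.append((a1_count, bv, dd, progeny_dev))
    print(a1_count, round(bv, 4), round(dd, 4), round(progeny_dev, 4))
```

For every genotype the progeny deviation equals half the breeding value of the parent, whatever its dominance deviation. The heterozygote has the largest dominance deviation here, but none of it reaches its randomly mated progeny, which is why strategies that exploit dominance rely on planned matings rather than on selection of individuals.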
6 Final Remarks
In recent years, enormous efforts have been made to develop new statistical models for the use of nonadditive genetic effects in genomic selection or prediction, but several problems remain that must be resolved before those models can be implemented routinely. In livestock and plant breeding, the GBLUP procedure has become the method of choice, and the incorporation of nonadditive effects is possible [43, 65]. Although some approximate approaches have been developed [46], it is still necessary to define an appropriate procedure for incorporating nonadditive effects into a single-step approach [29] that allows the use of the phenotypic and pedigree information from nongenotyped individuals. Inbreeding depression is currently modeled through a covariate with the average individual heterozygosity [149], but this only considers the effects of directional dominance, and the consequences of epistatic interactions for inbreeding depression [57] have not yet been studied in depth. Further, the presence of dominance with inbreeding implies the existence of up to five variance components (additive, dominance between noninbred, dominance between inbred, covariance between additive and inbred dominance effects, and inbreeding depression) that have scarcely been studied in real populations [150, 151]. Research is still needed to describe their equivalence with the variance components captured by marker effects. Finally, parametric approaches that include epistatic interactions [65] cannot consider LD, and a general model to describe the effects of the genes and their interactions in populations under
LD is required. Nonparametric approaches for genomic prediction can capture very complex interactions even when there is LD, but they generally need to be tuned for each trait and each population, and a gold standard for routine implementation is far from ready. Nonetheless, although the nonadditive variance is expected to be low in outbred populations [31, 32], strong evidence of heterosis has been found in animals and plants. Therefore, models that consider nonadditive effects will play a relevant role in predicting the future performance of crossbred individuals or hybrid crops [41, 132]. Moreover, interactions among intermediate traits [113] and imperfect LD between markers and QTL can generate epistatic interactions [108] that can be captured for predictive purposes with both parametric and nonparametric models.

References
1. Mendel G (1866) Versuche über Pflanzen-Hybriden, vol 4. Brünn Im Verlage des Vereine, Brno, pp 3–47
2. Bateson W (1909) Mendel’s principles of heredity. Cambridge University Press, Cambridge
3. Fisher RA (1919) The correlation between relatives on the supposition of mendelian inheritance. Trans R Soc Edinburgh 52:399–433
4. Wright S (1921) Systems of mating. I. the biometric relations between parent and offspring. Genetics 6:111–123
5. Kempthorne O (1954) The correlation between relatives in a random mating population. Proc R Soc Lond B Biol Sci 143:102–113
6. Cockerham CC (1954) An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39:859–882
7. Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics. Longman Group, Harlow, UK
8. Hazel LN, Lush JL (1942) The efficiency of three methods of selection. J Hered 33:393–399. https://doi.org/10.1093/oxfordjournals.jhered.a105102
9. Henderson CR (1984) Applications of linear models in animal breeding. University of Guelph, Ontario, Canada
10. Misztal I, Varona L, Culbertson M et al (1998) Studies on the value of incorporating the effect of dominance in genetic evaluations of dairy cattle, beef cattle and swine. Biotechnol Agron Soc Environ 2:227–233
11. Watson JD, Crick FHC (1953) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171:737–738. https://doi.org/10.1038/171737a0
12. Dekkers JCM (2004) Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J Anim Sci 82:E313–E328. https://doi.org/10.2527/2004.8213_supplE313x
13. Lander ES, Linton LM, Birren B et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. https://doi.org/10.1038/35057062
14. Craig Venter J, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351. https://doi.org/10.1126/science.1058040
15. Elsik CG, Tellam RL, Worley KC et al (2009) The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324:522–528. https://doi.org/10.1126/science.1169588
16. Groenen MAM, Archibald AL, Uenishi H et al (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491:393–398. https://doi.org/10.1038/nature11622
17. Matsumoto T, Wu J, Kanamori H et al (2005) The map-based sequence of the rice genome. Nature 436:793–800. https://doi.org/10.1038/nature03895
18. Schnable PS, Ware D, Fulton RS et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115. https://doi.org/10.1126/science.1178534
19. Gunderson KL, Steemers FJ, Lee G et al (2005) A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 37:549–554. https://doi.org/10.1038/ng1547
20. Metzker ML (2010) Sequencing technologies — the next generation. Nat Rev Genet 11:31–46. https://doi.org/10.1038/nrg2626
21. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
22. Gianola D (2013) Priors in whole-genome regression: the bayesian alphabet returns. Genetics 194:573–596. https://doi.org/10.1534/genetics.113.151753
23. de los Campos G, Naya H, Gianola D et al (2009) Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182:375–385. https://doi.org/10.1534/genetics.109.101501
24. Erbe M, Hayes BJ, Matukumalli LK et al (2012) Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95:4114–4129. https://doi.org/10.3168/jds.2011-5019
25. de los Campos G, Hickey JM, Pong-Wong R et al (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345. https://doi.org/10.1534/genetics.112.143313
26. Wang X, Yang Z, Xu C (2015) A comparison of genomic selection methods for breeding value prediction. Sci Bull 60:925–935. https://doi.org/10.1007/s11434-015-0791-2
27. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423. https://doi.org/10.3168/JDS.2007-0980
28. Legarra A, Aguilar I, Misztal I (2009) A relationship matrix including full pedigree and genomic information. J Dairy Sci 92:4656–4663. https://doi.org/10.3168/jds.2009-2061
29. Aguilar I, Misztal I, Johnson DL et al (2010) Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci 93:743–752. https://doi.org/10.3168/jds.2009-2730
30. Hill WG, Goddard ME, Visscher PM (2008) Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet 4:e1000008. https://doi.org/10.1371/journal.pgen.1000008
31. Crow JF (2010) On epistasis: why it is unimportant in polygenic directional selection. Philos Trans R Soc B Biol Sci 365:1241–1244. https://doi.org/10.1098/rstb.2009.0275
32. Hill WG (2010) Understanding and using quantitative genetic variation. Philos Trans R Soc B Biol Sci 365:73–85. https://doi.org/10.1098/rstb.2009.0203
33. Toro MA, Varona L (2010) A note on mate allocation for dominance handling in genomic selection. Genet Sel Evol 42:33. https://doi.org/10.1186/1297-9686-42-33
34. Aliloo H, Pryce JE, González-Recio O et al (2016) Accounting for dominance to improve genomic evaluations of dairy cows for fertility and milk production traits. Genet Sel Evol 48:8. https://doi.org/10.1186/s12711-016-0186-0
35. Duenk P, Calus MPL, Wientjes YCJ, Bijma P (2017) Benefits of dominance over additive models for the estimation of average effects in the presence of dominance. G3 (Bethesda) 7:3405–3414. https://doi.org/10.1534/g3.117.300113
36. Mäki-Tanila A (2007) An overview on quantitative and genomic tools for utilising dominance genetic variation in improving animal production. Agric Food Sci 16:188–198. https://doi.org/10.2137/145960607782219337
37. González-Diéguez D, Tusell L, Carillier-Jacquin C et al (2019) SNP-based mate allocation strategies to maximize total genetic value in pigs. Genet Sel Evol 51:55. https://doi.org/10.1186/s12711-019-0498-y
38. Zeng J, Toosi A, Fernando RL et al (2013) Genomic selection of purebred animals for crossbred performance in the presence of dominant gene action. Genet Sel Evol 45:11. https://doi.org/10.1186/1297-9686-45-11
39. González-Diéguez D, Tusell L, Bouquet A et al (2020) Purebred and crossbred genomic evaluation and mate allocation strategies to exploit dominance in pig crossbreeding schemes. G3 (Bethesda) 10:2829–2841. https://doi.org/10.1534/g3.120.401376
40. Technow F, Schrag TA, Schipprack W et al (2014) Genome properties and prospects of genomic prediction of hybrid performance in a breeding program of maize. Genetics 197:1343–1355. https://doi.org/10.1534/genetics.114.165860
41. Labroo MR, Studer AJ, Rutkoski JE (2021) Heterosis and hybrid crop breeding: a multidisciplinary review. Front Genet 12:643761. https://doi.org/10.3389/fgene.2021.643761
42. Su G, Christensen OF, Ostersen T et al (2012) Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PLoS One 7:e45293. https://doi.org/10.1371/journal.pone.0045293
43. Vitezica ZG, Varona L, Legarra A (2013) On the additive and dominant variance and covariance of individuals within the genomic selection scope. Genetics 195:1223–1230. https://doi.org/10.1534/genetics.113.155176
44. Álvarez-Castro JM, Carlborg Ö (2007) A unified model for functional and statistical epistasis and its application in quantitative trait loci analysis. Genetics 176:1151–1161. https://doi.org/10.1534/genetics.106.067348
45. Nishio M, Satoh M (2014) Including dominance effects in the genomic BLUP method for genomic evaluation. PLoS One 9:e85792. https://doi.org/10.1371/journal.pone.0085792
46. Ertl J, Edel C, Pimentel ECG et al (2018) Considering dominance in reduced single-step genomic evaluations. J Anim Breed Genet 135:151–158. https://doi.org/10.1111/jbg.12323
47. Varona L, Legarra A, Herring W, Vitezica ZG (2018) Genomic selection models for directional dominance: an example for litter size in pigs. Genet Sel Evol 50:50. https://doi.org/10.1186/s12711-018-0374-1
48. Xiang T, Christensen OF, Vitezica ZG, Legarra A (2016) Genomic evaluation by including dominance effects and inbreeding depression for purebred and crossbred performance with an application in pigs. Genet Sel Evol 48:92. https://doi.org/10.1186/s12711-016-0271-4
49. Saura M, Fernández A, Varona L et al (2015) Detecting inbreeding depression for reproductive traits in Iberian pigs using genome-wide data. Genet Sel Evol 47:1. https://doi.org/10.1186/s12711-014-0081-5
50. Howard JT, Tiezzi F, Huang Y et al (2017) A heuristic method to identify runs of homozygosity associated with reduced performance in livestock. J Anim Sci 95:4318–4332. https://doi.org/10.2527/jas2017.1664
51. Martikainen K, Koivula M, Uimari P (2020) Identification of runs of homozygosity affecting female fertility and milk production traits in Finnish Ayrshire cattle. Sci Rep 10:3804. https://doi.org/10.1038/s41598-020-60830-9
52. Doekes HP, Bijma P, Veerkamp RF et al (2020) Inbreeding depression across the genome of Dutch Holstein Friesian dairy cattle. Genet Sel Evol 52:64. https://doi.org/10.1186/s12711-020-00583-1
53. Caballero A, Keightley PD (1994) A pleiotropic nonadditive model of variation in quantitative traits. Genetics 138:883–900
54. Bennewitz J, Meuwissen THE (2010) The distribution of QTL additive and dominance effects in porcine F2 crosses. J Anim Breed Genet 127:171–179. https://doi.org/10.1111/j.1439-0388.2009.00847.x
55. Wellmann R, Bennewitz J (2012) Bayesian models with dominance effects for genomic evaluation of quantitative traits. Genet Res (Camb) 94:21–37. https://doi.org/10.1017/S0016672312000018
56. Xiang T, Christensen OF, Vitezica ZG, Legarra A (2018) Genomic model with correlation between additive and dominance effects. Genetics 209:711–723. https://doi.org/10.1534/genetics.118.301015
57. Minvielle F (1987) Dominance is not necessary for heterosis: a two-locus model. Genet Res 49:245–247. https://doi.org/10.1017/S0016672300027142
58. Toro MA, Mäki-Tanila A (2018) Some intriguing questions on Fisher’s ideas about dominance. J Anim Breed Genet 135:149–150. https://doi.org/10.1111/jbg.12332
59. Reik W, Walter J (2001) Genomic imprinting: parental influence on the genome. Nat Rev Genet 2:21–32. https://doi.org/10.1038/35047554
60. Nishio M, Satoh M (2015) Genomic best linear unbiased prediction method including imprinting effects for genomic evaluation. Genet Sel Evol 47:32. https://doi.org/10.1186/s12711-015-0091-y
61. Hu Y, Rosa GJM, Gianola D (2016) Incorporating parent-of-origin effects in whole-genome prediction of complex traits. Genet Sel Evol 48:34. https://doi.org/10.1186/s12711-016-0213-1
62. Guo X, Christensen OF, Ostersen T et al (2016) Genomic prediction using models with dominance and imprinting effects for backfat thickness and average daily gain in Danish Duroc pigs. Genet Sel Evol 48:67. https://doi.org/10.1186/s12711-016-0245-6
63. Jiang J, Shen B, O’Connell JR et al (2017) Dissection of additive, dominance, and imprinting effects for production and reproduction traits in Holstein cattle. BMC Genomics 18:425. https://doi.org/10.1186/s12864-017-3821-4
64. Martini JWR, Wimmer V, Erbe M, Simianer H (2016) Epistasis and covariance: how gene interaction translates into genomic relationship. Theor Appl Genet 129:963–976. https://doi.org/10.1007/s00122-016-2675-5
65. Vitezica ZG, Legarra A, Toro MA, Varona L (2017) Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics 206:1297–1307. https://doi.org/10.1534/genetics.116.199406
66. González-Recio O, Rosa GJM, Gianola D (2014) Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci 166:217–231. https://doi.org/10.1016/j.livsci.2014.05.036
67. González-Camacho JM, Ornella L, Pérez-Rodríguez P et al (2018) Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome 11:170104. https://doi.org/10.3835/plantgenome2017.11.0104
68. Azodi CB, Tang J, Shiu S-H (2020) Opening the black box: interpretable machine learning for geneticists. Trends Genet 36:442–455. https://doi.org/10.1016/j.tig.2020.03.005
69. Jiang Y, Reif JC (2015) Modeling epistasis in genomic selection. Genetics 201:759–768. https://doi.org/10.1534/genetics.115.177907
70. Martini JWR, Toledo FH, Crossa J (2020) On the approximation of interaction effect models by Hadamard powers of the additive genomic relationship. Theor Popul Biol 132:16–23. https://doi.org/10.1016/j.tpb.2020.01.004
71. Jiang Y, Reif JC (2020) Efficient algorithms for calculating epistatic genomic relationship matrices. Genetics 216:651–669. https://doi.org/10.1534/genetics.120.303459
72. Akdemir D, Jannink JL (2015) Locally epistatic genomic relationship matrices for genomic association and prediction. Genetics 199:857–871. https://doi.org/10.1534/genetics.114.173658
73. Jiang Y, Schmidt RH, Reif JC (2018) Haplotype-based genome-wide prediction models exploit local epistatic interactions among markers. G3 (Bethesda) 8:1687–1699. https://doi.org/10.1534/g3.117.300548
74. Santantonio N, Jannink JL, Sorrells M (2019) A low resolution epistasis mapping approach to identify chromosome arm interactions in allohexaploid wheat. G3 (Bethesda) 9:675–684. https://doi.org/10.1534/g3.118.200646
75. Shalev-Shwartz S, Ben-David S (2013) Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge
76. Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36:1171–1220. https://doi.org/10.1214/009053607000000677
77. Dietterich TG (2000) Ensemble methods in machine learning. In: Proceedings of the First International Workshop on Multiple Classifier System. Springer-Verlag, Berlin, pp 1–15
78. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
79. Kimeldorf G, Wahba G (1971) Some results on Tchebycheffian spline functions. J Math Anal Appl 33:82–95. https://doi.org/10.1016/0022-247X(71)90184-3
80. Wootters WK (1981) Statistical distance and Hilbert space. Phys Rev D 23:357–362. https://doi.org/10.1103/PhysRevD.23.357
81. Gianola D, Fernando RL, Stella A (2006) Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173:1761–1776. https://doi.org/10.1534/genetics.105.049510
82. Long N, Gianola D, Rosa GJM et al (2010) Radial basis function regression methods for predicting quantitative traits using SNP markers. Genet Res (Camb) 92:209–225. https://doi.org/10.1017/S0016672310000157
83. Piepho HP (2009) Ridge regression and extensions for Genomewide selection in maize. Crop Sci 49:1165–1176. https://doi.org/10.2135/cropsci2008.10.0595
84. Morota G, Koyama M, Rosa GJM et al (2013) Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data. Genet Sel Evol 45:17. https://doi.org/10.1186/1297-9686-45-17
85. de los Campos G, Gianola D, Rosa GJM et al (2010) Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res (Camb) 92:295–308. https://doi.org/10.1017/S0016672310000285
86. Re M, Valentini G (2012) Ensemble methods: a review. In: Data mining and machine learning for astronomical applications. Chapman & Hall, London, pp 563–594
87. González-Recio O, Weigel KA, Gianola D et al (2010) L2-boosting algorithm applied to high-dimensional problems in genomic selection. Genet Res (Camb) 92:227–237. https://doi.org/10.1017/S0016672310000261
88. González-Recio O, Jiménez-Montero JA, Alenda R (2013) The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets. J Dairy Sci 96:614–624. https://doi.org/10.3168/jds.2012-5630
89. Mikshowsky AA, Gianola D, Weigel KA (2016) Improving reliability of genomic predictions for Jersey sires using bootstrap aggregation sampling. J Dairy Sci 99:3632–3645. https://doi.org/10.3168/jds.2015-10715
90. Mikshowsky AA, Gianola D, Weigel KA (2017) Assessing genomic prediction accuracy for Holstein sires using bootstrap aggregation sampling and leave-one-out cross validation. J Dairy Sci 100:453–464. https://doi.org/10.3168/jds.2016-11496
91. González-Recio O, Forni S (2011) Genome-wide prediction of discrete traits using bayesian regressions and machine learning. Genet Sel Evol 43:7. https://doi.org/10.1186/1297-9686-43-7
92. Heslot N, Yang HP, Sorrells ME, Jannink JL (2012) Genomic selection in plant breeding: a comparison of models. Crop Sci 52:146–160. https://doi.org/10.2135/cropsci2011.06.0297
93. Bellot P, de los Campos G, Pérez-Enciso M (2018) Can deep learning improve genomic prediction of complex human traits? Genetics 210:809–819. https://doi.org/10.1534/genetics.118.301298
94. Pérez-Enciso Z (2019) A guide for using deep learning for complex trait genomic prediction. Genes (Basel) 10:553. https://doi.org/10.3390/genes10070553
95. Rosenblatt F (1962) Principles of Neurodynamics: Perceptrons and the theory of brain mechanisms. Spartan Books, Washington, D.C.
96. Okut H, Gianola D, Rosa GJM, Weigel KA (2011) Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genet Res (Camb) 93:189–201. https://doi.org/10.1017/S0016672310000662
97. González-Camacho JM, Crossa J, Pérez-Rodríguez P et al (2016) Genome-enabled prediction using probabilistic neural network classifiers. BMC Genomics 17:208. https://doi.org/10.1186/s12864-016-2553-1
98. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117. https://doi.org/10.1016/j.neunet.2014.09.003
99. Khamparia A, Singh KM (2019) A systematic review on deep learning architectures and applications. Expert Syst 36:e12400. https://doi.org/10.1111/exsy.12400
100. Eraslan G, Avsec Ž, Gagneur J, Theis FJ (2019) Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 20:389–403. https://doi.org/10.1038/s41576-019-0122-6
101. Zou J, Huss M, Abid A et al (2019) A primer on deep learning in genomics. Nat Genet 51:12–18. https://doi.org/10.1038/s41588-018-0295-5
102. Kopp W, Monti R, Tamburrini A et al (2020) Deep learning for genomics using Janggu. Nat Commun 11:3488. https://doi.org/10.1038/s41467-020-17155-y
103. Montesinos-López OA, Martín-Vallejo J, Crossa J et al (2019) New deep learning genomic-based prediction model for multiple traits with binary, ordinal, and continuous phenotypes. G3 (Bethesda) 9:1545–1556. https://doi.org/10.1534/g3.119.300585
104. Montesinos-López OA, Montesinos-López JC, Singh P et al (2020) A multivariate Poisson deep learning model for genomic prediction of count data. G3 (Bethesda) 10:4177–4190. https://doi.org/10.1534/g3.120.401631
105. Waldmann P, Pfeiffer C, Mészáros G (2020) Sparse convolutional neural networks for genome-wide prediction. Front Genet 11:25. https://doi.org/10.3389/fgene.2020.00025
106. Wu Q, Boueiz A, Bozkurt A et al (2018) Deep learning methods for predicting disease status using genomic data. J Biom Biostat 9:417
107. Momen M, Mehrgardi AA, Sheikhi A et al (2018) Predictive ability of genome-assisted statistical models under various forms of gene action. Sci Rep 8:12309. https://doi.org/10.1038/s41598-018-30089-2
108. de los Campos G, Sorensen DA, Toro MA (2019) Imperfect linkage disequilibrium generates phantom epistasis (& perils of big data). G3 (Bethesda) 9:1429–1436. https://doi.org/10.1534/g3.119.400101
109. Schrauf MF, Martini JWR, Simianer H et al (2020) Phantom epistasis in genomic selection: on the predictive ability of epistatic models. G3 (Bethesda) 10:3137–3145. https://doi.org/10.1534/g3.120.401300
110. Fontanesi L (2016) Metabolomics and livestock genomics: insights into a phenotyping frontier and its applications in animal breeding. Anim Front 6:73–79. https://doi.org/10.2527/af.2016-0011
111. Scossa F, Alseekh S, Fernie AR (2021) Integrating multi-omics data for crop improvement. J Plant Physiol 257:153352. https://doi.org/10.1016/j.jplph.2020.153352
112. Shakoor N, Lee S, Mockler TC (2017) High throughput phenotyping to accelerate crop breeding and monitoring of diseases in the field. Curr Opin Plant Biol 38:184–192. https://doi.org/10.1016/j.pbi.2017.05.006
113. Voss-Fels KP, Cooper M, Hayes BJ (2019) Accelerating crop genetic gains with genomic selection. Theor Appl Genet 132:669–686. https://doi.org/10.1007/s00122-018-3270-8
114. Messina CD, Technow F, Tang T et al (2018) Leveraging biological insight and environmental variation to improve phenotypic prediction: integrating crop growth models (CGM) with whole genome prediction (WGP). Eur J Agron 100:151–162. https://doi.org/10.1016/j.eja.2018.01.007
115. Blondel M, Onogi A, Iwata H, Ueda N (2015) A ranking approach to genomic selection. PLoS One 10:e0128570. https://doi.org/10.1371/journal.pone.0128570
116. Azodi CB, Bolger E, McCarren A et al (2019) Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 (Bethesda) 9:3691–3702. https://doi.org/10.1534/g3.119.400498
117. Pérez-Enciso M (2017) Animal breeding learning from machine learning. J Anim Breed Genet 134:85–86. https://doi.org/10.1111/jbg.12263
118. Grapov D, Fahrmann J, Wanichthanarak K, Khoomrung S (2018) Rise of deep learning for genomic, proteomic, and metabolomic data integration in precision medicine.
Omi A J Integr Biol 22:630–636. https://doi. org/10.1089/omi.2018.0097 119. Lush JL (1943) Animal breeding plans. Iowa State College Press, Ames, Iowa (USA) 120. Kadam DC, Potts SM, Bohn MO et al (2016) Genomic prediction of single crosses in the
early stages of a maize hybrid breeding pipeline. G3 (Bethesda) 6:3443–3453. https:// doi.org/10.1534/g3.116.031286 121. Jansen GB, Wilton JW (1984) Linear programming in selection of livestock. J Dairy Sci 67:897–901. https://doi.org/10.3168/ jds.S0022-0302(84)81385-5 122. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680. https://doi.org/10.1126/sci ence.220.4598.671 123. Sun C, VanRaden PM, O’Connell JR et al (2013) Mating programs including genomic relationships and dominance effects. J Dairy Sci 96:8014–8023. https://doi.org/10. 3168/jds.2013-6969 124. Ertl J, Legarra A, Vitezica ZG et al (2014) Genomic analysis of dominance effects on milk production and conformation traits in Fleckvieh cattle. Genet Sel Evol 46:40. https://doi.org/10.1186/1297-968646-40 125. Aliloo H, Pryce JE, Gonza´lez-Recio O et al (2017) Including nonadditive genetic effects in mating programs to maximize dairy farm profitability. J Dairy Sci 100:1203–1222. https://doi.org/10.3168/jds.2016-11261 126. Wang X, Li L, Yang Z et al (2017) Predicting rice hybrid performance using univariate and multivariate GBLUP models based on North Carolina mating design II. Heredity (Edinb) 118:302–310. https://doi.org/10.1038/ hdy.2016.87 127. Ferna´ndez J, Villanueva B, Toro MA (2021) Optimum mating designs for exploiting dominance in genomic selection schemes for aquaculture species. Genet Sel Evol 53:14. https://doi.org/10.1186/s12711-02100610-9 128. Bernardo R (2014) Process of plant breeding. In: Essentials of plant breeding. Stemma Press, Saint Paul, MN, pp 9–13 129. Comstock RE, Robinson HF, Harvey PH (1949) A breeding procedure designed to make maximum use of both general and specific combining ability 1. Agron J 41: 360–367. https://doi.org/10.2134/ agronj1949.00021962004100080006x 130. Wientjes YCJ, Calus MPL (2017) BOARD INVITED REVIEW: the purebred-crossbred correlation in pigs: a review of theory, estimates, and implications1. 
J Anim Sci 95: 3467–3478. https://doi.org/10.2527/jas. 2017.1669 131. Duenk P, Bijma P, Calus MPL et al (2020) The impact of non-additive effects on the genetic correlation between populations. G3
Genomic Prediction with Nonadditive Effects (Bethesda) 10:783–795. https://doi.org/10. 1534/g3.119.400663 132. Stock J, Bennewitz J, Hinrichs D, Wellmann R (2020) A review of genomic models for the analysis of livestock crossbred data. Front Genet 11:568. https://doi.org/10.3389/ fgene.2020.00568 133. Kinghorn BP, Hickey JM, Van Der Werf JHJ (2010) Reciprocal recurrent genomic selection for Total genetic merit in crossbred individuals. In: Proceedings of the 9th World Congress on Genetics Applied to Livestock Production, Leipzig, p 36 134. Rembe M, Zhao Y, Jiang Y, Reif JC (2019) Reciprocal recurrent genomic selection: an attractive tool to leverage hybrid wheat breeding. Theor Appl Genet 132:687–698. https://doi.org/10.1007/s00122-0183244-x 135. Iba´nz-Escriche N, Fernando RL, Toosi A, Dekkers JC (2009) Genomic selection of purebreds for crossbred performance. Genet Sel Evol 41:12. https://doi.org/10.1186/ 1297-9686-41-12 136. Sevillano CA, Vandenplas J, Bastiaansen JWM et al (2017) Genomic evaluation for a threeway crossbreeding system considering breedof-origin of alleles. Genet Sel Evol 49:75. https://doi.org/10.1186/s12711-0170350-1 137. Sevillano CA, Vandenplas J, Bastiaansen JWM, Calus MPL (2016) Empirical determination of breed-of-origin of alleles in threebreed cross pigs. Genet Sel Evol 48:55. https://doi.org/10.1186/s12711-0160234-9 138. Vandenplas J, Calus MPL, Sevillano CA et al (2016) Assigning breed origin to alleles in crossbred animals. Genet Sel Evol 48:61. https://doi.org/10.1186/s12711-0160240-y 139. Stuber CW, Cockerham CC (1966) Gene effects and variances in hybrid populations. Genetics 54:1279–1286. https://doi.org/ 10.1093/genetics/54.6.1279 140. Christensen OF, Madsen P, Nielsen B, Su G (2014) Genomic evaluation of both purebred and crossbred performances. Genet Sel Evol 46:23. https://doi.org/10.1186/12979686-46-23 141. 
Xiang T, Christensen OF, Legarra A (2017) Technical note: genomic evaluation for crossbred performance in a single-step approach with metafounders. J Anim Sci 95: 1472–1480. https://doi.org/10.2527/ jas2016.1155
243
142. Vitezica ZG, Varona L, Elsen J-M et al (2016) Genomic BLUP including additive and dominant variation in purebreds and F1 crossbreds, with an application in pigs. Genet Sel Evol 48:6. https://doi.org/10.1186/ s12711-016-0185-1 143. Christensen OF, Nielsen B, Su G et al (2019) A bivariate genomic model with additive, dominance and inbreeding depression effects for sire line and three-way crossbred pigs. Genet Sel Evol 51:45. https://doi.org/10. 1186/s12711-019-0486-2 144. Sprague GF, Tatum LA (1942) General vs. specific combining ability in single crosses of corn 1. Agron J 34:923–932. https://doi.org/10.2134/agronj1942. 00021962003400100008x 145. Gonzalez-Dieguez D, Legarra A, Charcosset A et al (2021) Genomic prediction of hybrid crops allows disentangling dominance and epistasis. Genetics 18(1):iyab026. https:// doi.org/10.1093/genetics/iyab026 146. Toro MA (1993) A new method aimed at using the dominance variance in closed breeding populations. Genet Sel Evol 25:63–74. https://doi.org/10.1051/gse:19930104 147. Toro MA (1998) Selection of grandparental combinations as a procedure designed to make use of dominance genetic effects. Genet Sel Evol 30:339–349. https://doi. org/10.1051/gse:19980402 148. Gaynor RC, Gorjanc G, Bentley AR et al (2017) A two-part strategy for using genomic selection to develop inbred lines. Crop Sci 57: 2372–2386. https://doi.org/10.2135/ cropsci2016.09.0742 149. Xiang T, Nielsen B, Su G et al (2016) Application of single-step genomic evaluation for crossbred performance in pig1. J Anim Sci 94: 936–948. https://doi.org/10.2527/jas. 2015-9930 150. Shaw FH, Woolliams JA (1999) Variance component analysis of skin and weight data for sheep subjected to rapid inbreeding. Genet Sel Evol 31:43. https://doi.org/10. 1051/gse:19990103 151. 
Ferna´ndez EN, Legarra A, Martı´nez R et al (2017) Pedigree-based estimation of covariance between dominance deviations and additive genetic effects in closed rabbit lines considering inbreeding and using a computationally simpler equivalent model. J Anim Breed Genet 134:184–195. https://doi. org/10.1111/jbg.12267
Chapter 9

Genome and Environment Based Prediction Models and Methods of Complex Traits Incorporating Genotype × Environment Interaction

José Crossa, Osval Antonio Montesinos-López, Paulino Pérez-Rodríguez, Germano Costa-Neto, Roberto Fritsche-Neto, Rodomiro Ortiz, Johannes W. R. Martini, Morten Lillemo, Abelardo Montesinos-López, Diego Jarquin, Flavio Breseghello, Jaime Cuevas, and Renaud Rincent

Abstract

Genomic-enabled prediction models are of paramount importance for the successful implementation of genomic selection (GS) based on breeding values. As opposed to animal breeding, plant breeding includes extensive multienvironment and multiyear field trial data. Hence, genomic-enabled prediction models should include genotype × environment (G × E) interaction, which most of the time increases the prediction performance when the responses of lines differ from environment to environment. In this chapter, we describe a historical timeline, starting in 2012, of advances in GS models that take G × E interaction into account. We describe theoretical and practical aspects of those GS models, including the gains in prediction performance when including G × E structures for both complex continuous and categorical scale traits. We then detail and explain the main G × E genomic prediction models for complex traits measured on continuous and noncontinuous (categorical) scales. In relation to G × E interaction models, this review also examines the analysis of information generated from high-throughput phenotype data (phenomics) and the joint analysis of multitrait and multienvironment field trial data, which is also employed in the general assessment of multitrait G × E interaction. The use of nongenomic data to increase the accuracy and biological reliability of the G × E approach is also outlined. We show the recent advances in large-scale envirotyping (enviromics), and how mechanistic computational modeling can derive the crop growth and development aspects useful for predicting phenotypes and explaining G × E.

Key words Genome-enabled prediction, Genomic selection, Models with G × E interaction, Plant breeding
Nourollah Ahmadi and Jérôme Bartholomé (eds.), Genomic Prediction of Complex Traits: Methods and Protocols, Methods in Molecular Biology, vol. 2467, https://doi.org/10.1007/978-1-0716-2205-6_9, © The Author(s) 2022
1 Introduction

Selection in plant breeding is usually based on estimates of breeding values, which can be obtained with pedigree-based mixed models [1, 2]. In their multivariate formulation, these models can also accommodate G × E interaction [3, 4]. In the past, pedigree-based models have been successful for predicting breeding values of complex traits in plant and animal breeding by modeling the genetic covariance between any pair of related individuals ($j$ and $j'$), due to their additive genetic effects, as being equal to two times the coefficient of parentage ($2f_{jj'}$, the elements of $A$) times the additive genetic variance $\sigma_a^2$, that is, $A\sigma_a^2$. In self-pollinated species, $A\sigma_a^2$ is the variance–covariance matrix of the breeding values (additive genetic effects). Closely related individuals contribute more to the prediction of breeding values of their relatives than less closely related genotypes. Moreover, when data from one individual or one selection candidate are missing (either partially or totally), its breeding value can still be predicted from its relatives, albeit less efficiently than if the data were complete.

Pedigree-based models cannot account for Mendelian segregation, a term that, under an infinitesimal additive model [5, 6] and in the absence of inbreeding, explains one half of the genetic variation [7, 8]. However, molecular markers allow tracing Mendelian segregation at several positions of the genome, which gives them enormous potential in terms of increasing the accuracy of estimates of breeding and genetic values and the genetic progress attainable when these predictions are used for selection purposes [9]. GS [10] and genomic prediction of complex traits predict breeding values that comprise the parental average (half the sum of the breeding values of both parents) plus a deviation due to Mendelian sampling.
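As a minimal illustration of the pedigree-based covariance described above, the following sketch builds the additive relationship matrix (twice the coefficient of parentage) with the standard tabular method. It assumes a simple outcrossing pedigree in which parents are listed before their offspring; the four-individual pedigree, the function name `numerator_relationship`, and the variance value are hypothetical and only show the mechanics.

```python
import numpy as np

def numerator_relationship(sires, dams):
    """Tabular method for the additive relationship matrix A (A = 2f).

    sires[i] and dams[i] are the parent indices of individual i, or None
    when a parent is unknown. Parents must appear before their offspring.
    """
    n = len(sires)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sires[i], dams[i]
        # Diagonal: 1 plus the inbreeding coefficient of individual i.
        both_known = s is not None and d is not None
        A[i, i] = 1.0 + (0.5 * A[s, d] if both_known else 0.0)
        for j in range(i):
            # Off-diagonal: mean relationship of j with the parents of i.
            a_js = A[j, s] if s is not None else 0.0
            a_jd = A[j, d] if d is not None else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
    return A

# Hypothetical pedigree: 0 and 1 are unrelated founders; 2 and 3 are full sibs.
sires = [None, None, 0, 0]
dams = [None, None, 1, 1]
A = numerator_relationship(sires, dams)

sigma2_a = 2.0                 # assumed additive genetic variance
cov_breeding_values = A * sigma2_a
```

In this toy pedigree each offspring has a relationship of 0.5 with each parent and with its full sib, so a relative with records still carries information about an individual without records.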
In annual crops GS has been applied mainly in two different contexts; one approach focuses on predicting additive effects in early generations of a breeding program such that a rapid selection cycle with a short interval cycle (i.e., GS at the F2 level of a biparental cross) is achieved. Another approach consists of predicting the genotypic values of individuals where both additive and nonadditive effects determine the final commercial (genetic) value of the lines; here predicting lines established in multienvironment field evaluation is required. Various models for analyzing variation arising from quantitative trait loci (QTL) and marker-assisted selection, as well as for identifying molecular markers closely linked to QTL have been widely used in plant breeding to improve a few traits controlled by major genes. However, adoption of these models has been limited because the biparental populations used for mapping QTL are not easily used in breeding applications and because only limited marker information (a few markers) is used. On the other hand,
GS is an approach for improving quantitative complex traits that uses all available molecular markers across the genome to estimate breeding values for specific environments and across environments by adopting conventional single-environment or G × E interaction analyses [11–13]. Early plant and animal breeding data have shown that GS [10, 14, 15] significantly increases prediction accuracy over pedigree-based selection for complex traits [13, 16–31]. Reviews on optimizing genomic-enabled prediction and its application to annual and perennial plants were published early on [22, 32–35]. Since then, crop breeding programs worldwide have been studying and applying GS and, simultaneously, extensive research has been conducted on new statistical models for incorporating pedigree, genomic, and environmental covariates such as soil characteristics or weather data, among others. Genomic models for incorporating G × E interaction have been proposed in an attempt to improve accuracy when predicting the breeding values of complex traits (e.g., grain yield) of individuals in different environments (site–year combinations) [11, 23, 36]. However, statistical models other than the conventional genomic best linear unbiased predictor (GBLUP) are required for assessing the genomic-enabled prediction accuracy of noncontinuous categorical response variables (ordinal disease ratings, count data, etc.). Furthermore, deep learning artificial neural networks (DL) are also being developed for assessing multitrait, multienvironment genomic-enabled prediction [36–53].

Since the beginning of GS, several genetic and statistical factors have been pointed out as complications for the application of GS and genomic prediction. Genetic difficulties arise when deciding the size of the training population and considering the heritability of the traits to be predicted. Statistical challenges are related to the number of markers ($p$) being much larger than the number of observations ($n$) ($p \gg n$), the multicollinearity among markers, and the curse of dimensionality. One important genetic-statistical complexity of GS models arises when predicting unphenotyped individuals in specific environments (e.g., planting date–site–management combinations) by incorporating G × E interaction into the genomic-based statistical models. Moreover, the genomic complexity related to G × E interactions for multiple traits is important because these interactions require statistical-genetic models that exploit the complex multivariate relations due to multitrait and multienvironment variance–covariance, and the genetic correlations between environments, between traits, and between traits and environments. The abovementioned problem of $p \gg n$ results in a rank-deficient matrix of predictors and a likelihood that is not identified, which makes the models prone to overfitting. Penalized regression, variable selection, and dimensionality reduction offer solutions to some of these problems.
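The rank deficiency that arises when markers far outnumber observations, and the way a penalty resolves it, can be illustrated with a small numerical sketch. The data below are simulated; the dimensions, the random seed, and the penalty value `lam` are arbitrary assumptions chosen for illustration and are not taken from the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                        # far more markers than observations
X = rng.standard_normal((n, p))       # simulated stand-in for a marker matrix
y = rng.standard_normal(n)            # simulated phenotypes

# The rank of X (and hence of the p x p matrix X'X) cannot exceed n,
# so ordinary least squares has no unique solution when p > n.
rank_X = np.linalg.matrix_rank(X)

# Adding a ridge penalty lam > 0 to the diagonal makes the system
# full rank, so a unique vector of marker effects can be computed.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The penalty shrinks all marker effects toward zero; how strongly to shrink is a modeling choice that, in the mixed-model view used later in the chapter, corresponds to a ratio of variance components.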
Genomic-enabled prediction models are based on quantitative genetics theory, which considers two main structures of variation, one for the sum of genetic values (e.g., linear additive models), and a second for nongenetic residual noise [19]. Hence, most of the research on genomic prediction has focused on developing efficient parametric and nonparametric statistical and computational models to deal with those two main structures of variation, and many research articles show good prediction accuracy for complex traits such as grain yield. The use of relationship matrices based on genomics has also been expanded by developing and using linear and nonlinear kernels. Nonlinear genomic kernels can account for cryptic small-effect interactions between markers (e.g., epistasis). Furthermore, these kernels are more efficient than GBLUP in incorporating large-scale environmental data (enviromics) and G × E realized through enviromics-aided relatedness among field trials [45–54].

In this chapter, we explain and review the complexity of genomic-enabled prediction and describe models for assessing different forms of G × E interaction and marker × environment interaction. We also describe GS models for categorical and count traits that are not continuous and do not follow a normal distribution. Because they are intimately related to the study of G × E interaction, we briefly summarize the latest results on methods that include Bayesian multitrait multienvironment models as well as deep learning (DL) artificial neural networks, and ecophysiology-enriched approaches such as the use of crop growth models and enviromics. Furthermore, we extend the study of G × E interaction to the case where high-throughput phenotype data are available.
2 Historical Timeline of G × E Modeling in Genomic Prediction

Since the study of Meuwissen et al. [10], researchers have been devoted to using whole-genome markers and adapting statistical and computational tools to predict particular phenotypes. Figure 1 presents a short timeline of the types of statistical and computational regressions and kernel methods used in GS research in the context of G × E. This timeline starts with two genomic G × E interaction models: the first is related to environment-specific genomic prediction effects [11] and the second to specific marker effects across environments [12]. At this point, the model of Burgueño et al. [11] takes into account pedigree and molecular marker information, and in the following eight years it was updated, along with that of Schulz-Streeck et al. [12], with different statistical and computational processing methods. All models based on this source of data are highlighted in blue in Fig. 1.

Fig. 1 History of the main research involving genomic prediction and G × E interaction, from the first published paper in 2012 to the articles published in 2020. A blue box denotes works using only DNA markers or genomic information. A green box refers to models in which DNA markers are complemented by crop growth modeling (CGM) outputs, such as a stress index. A purple box refers to models in which DNA markers are complemented by the use of environmental covariates (EC), such as weather and soil information for the experimental trials

Green in Fig. 1 highlights aspects introduced by Heslot et al. [23] involving the use of environmental covariates (EC) over the marker effects. Models in green introduced the use of crop growth models (CGM). Briefly, the CGM is a mechanistic approach that aims to reproduce the main plant–environment relations through "crop-specific" parameters and environmental inputs. After running a CGM, it is possible to derive ECs that represent the plant–environment interactions, instead of using climatic information directly. This group also includes research involving the direct use of CGM with genomic prediction models, a concept introduced by Cooper et al. [55] as CGM-whole genomic prediction. This approach uses the marker effects to predict intermediate phenotypes over the mechanistic structure of a certain CGM. Then, the tuned CGM is used for phenotype prediction.

Lastly, the purple-colored models in Fig. 1 involve the use of environmental data to fit reaction-norm structures (e.g., a linear relation between phenotype and environmental variations). Since Jarquín et al. [36], there is a second interpretation of the so-called reaction-norm approach, which involves the use of environmental relatedness realized from ECs together with genomic kinships under whole-genome regressions or kernel methods. Recently, Resende et al. [56] and Costa-Neto et al. [54] introduced the concept of 'enviromics' to describe the core of possible environmental factors acting over a target population of environments (TPE). It is expected that this type of approach will popularize the use of environment data in training prediction models for selection,
which is especially useful for screening genotypes under novel growing conditions. In contrast to the models in green, the philosophy here ranges from using large-scale environmental information [36, 45, 54, 56–58] to using a very small number of ECs [26, 43, 44, 59, 60]. In the first case, the purpose is to shape a robust environmental relatedness, whereas the second relies on a two-stage analysis (e.g., factorial regression) that identifies a few key ECs explaining a large amount of the trait G × E for that germplasm and experimental network. Each model structure and concept is described in the next sections.

Below, we discuss the basic genomic model used to reproduce particular gene–phenotype variations using phenotypic data from single trials. These kinds of models are expected to capture specific environmental–phenotype covariances related to the particular growing conditions faced by each genotype in the same field. Because of that, this type of model is named the single-environment model. Then, we describe how single-environment models are fitted in terms of resemblance among relatives, captured by genomic or pedigree realized relationship kernels. We then present novel options to model G × E interactions among several field trials (multienvironment trials). Furthermore, we show Bayesian models for ordinal or count data, and finally describe models using climatic environmental data.
3 Genomic-Enabled Prediction Models Accounting for G × E

3.1 Basic Single-Environment Genomic Model

To explain how the multienvironment GS approaches were developed, it is indispensable to first understand how the baseline single-environment genomic model was conceived. The basic genetic model describes the response of the $j$th phenotype ($y_j$) as the sum of an intercept ($\mu$), a genetic value ($g_j$), and a residual $\varepsilon_j$: $y_j = \mu + g_j + \varepsilon_j$ ($j = 1, \ldots, n$ individuals). Thus, this model takes into account a certain genetic-informed structure within the $g_j$ effect, and considers that all nongenetic sources are split into a fixed main intercept plus the error variation. As the genes affecting a trait in the environment where the phenotyping was conducted are unknown, a complex function must be approximated by a regression of phenotype on marker genotypes with large numbers of markers $\{x_{j1}, \ldots, x_{jk}, \ldots, x_{jp}\}$ ($k = 1, \ldots, p$ markers) to predict the genetic value of the $j$th individual. This function can be expressed as $f(x_j) = f(x_{j1}, \ldots, x_{jp}; \beta)$ such that $y_j = \mu + f(x_{j1}, \ldots, x_{jp}; \beta) + \varepsilon_j$. Usually $f(x; \beta)$ is a parametric linear regression of the form $f(x_{j1}, \ldots, x_{jp}; \beta) = \sum_{k=1}^{p} x_{jk}\beta_k$, where $\beta_k$ is the substitution effect of the allele coded as 'one' at the $k$th marker. Then, the linear regression function on markers becomes $y_j = \mu + \sum_{k=1}^{p} x_{jk}\beta_k + \varepsilon_j$, or, in matrix notation,

$$y = 1_n \mu + X\beta + \varepsilon, \tag{1}$$

where $1_n$ is a vector of ones of order $n \times 1$, $X$ is the $n \times p$ matrix of centered and standardized markers, $\beta$ is the vector of unknown marker effects, and $\varepsilon$ is an $n \times 1$ vector of random errors, with $\varepsilon \sim N(0, I_n\sigma_\varepsilon^2)$, where $\sigma_\varepsilon^2$ is the random error variance component. When the vector of marker effects is assumed to follow $\beta \sim N(0, I_p\sigma_\beta^2)$, where $\sigma_\beta^2$ is the variance of marker effects, the model is called the ridge regression best linear unbiased predictor (rrBLUP).

3.1.1 Kernel Methods to Reproduce Genomic Relatedness Among Individuals
From the last subsection, we can go deeper into the modeling of $g$ effects using relatedness kernels. By letting $g = X\beta$, with a variance–covariance matrix proportional to the genomic relationship matrix $G = XX' / \left[2\sum_{k=1}^{p} q_k(1 - q_k)\right]$ (where $q_k$ is the frequency of allele "1" at the $k$th marker) and $g \sim N(0, G\sigma_g^2)$, one can define the GBLUP prediction model as [61, 62]:

$$y = 1_n \mu + g + \varepsilon. \tag{2}$$
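The relationship between the marker-effect model 1 and the kernel model 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: genotypes and phenotypes are simulated, the variance components are treated as known rather than estimated, markers are centered (not standardized), and the intercept is estimated by the sample mean, which coincides with the generalized least squares estimate when marker columns are centered and each individual has one record.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 300
q_true = rng.uniform(0.1, 0.9, size=p)                   # simulated allele frequencies
M = rng.binomial(2, q_true, size=(n, p)).astype(float)   # genotypes coded 0/1/2

q = M.mean(axis=0) / 2.0             # observed frequency of allele "1"
X = M - 2.0 * q                      # centered marker matrix
c = 2.0 * np.sum(q * (1.0 - q))      # scaling constant of the G matrix
G = X @ X.T / c                      # genomic relationship matrix

# Simulated phenotypes with assumed (known) variance components.
sigma2_beta, sigma2_e = 0.01, 1.0
beta = rng.normal(0.0, np.sqrt(sigma2_beta), size=p)
y = 10.0 + X @ beta + rng.normal(0.0, np.sqrt(sigma2_e), size=n)
mu_hat = y.mean()

# rrBLUP (model 1): solve for p marker effects, then form genetic values.
lam = sigma2_e / sigma2_beta
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ (y - mu_hat))
g_rrblup = X @ beta_hat

# GBLUP (model 2): solve an n x n system directly for the genetic values,
# with the genetic variance implied by the scaling of G.
sigma2_g = c * sigma2_beta
V = G * sigma2_g + np.eye(n) * sigma2_e
g_gblup = sigma2_g * (G @ np.linalg.solve(V, y - mu_hat))
```

Both routes return the same predicted genetic values, but the kernel form only requires solving a system of order $n$ instead of order $p$, which matters when $p$ is much larger than $n$.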
GBLUP and rrBLUP are equivalent if the genomic relationship matrix is computed accordingly. Model 2 is computationally much simpler than the rrBLUP, which makes kernel methods attractive for dealing with complex models involving G × E interactions. It should be noted that different kinds of kernels can be used in place of G, in principle any structure capable of reproducing a certain degree of relatedness among individuals, including kernels that take nonlinear effects into account. One of the most used nonlinear kernels is the so-called Gaussian kernel (GK). Results have consistently shown, for single-environment models as well as for multienvironment models with G × E interaction, that GK performs better than GBLUP in terms of genomic-enabled prediction accuracy [39–41, 45, 63]. A second nonlinear approach that has been used is the deep kernel (DK), which is implemented by the arc-cosine kernel (AK) function recently introduced in genomic prediction by Cuevas et al. [41]. This nonlinear DK is defined by a covariance matrix that emulates a deep learning model, but is based on one hidden layer and a large number of neurons. To implement it, a recursive formula is used for altering the covariance matrix in a stepwise process, in which, at each step, more hidden layers are added to the emulated deep neural network. In this function, the tuning parameter "number of layers" required for DK can be determined by a maximum marginal likelihood procedure [41]. Research involving near-infrared data [41], multiple G × E scenarios for several data sets [64], and modeling of additive and nonadditive genomic-by-enviromic sources [54] has shown that DK genomic-enabled prediction accuracy is similar to that of the
GK, but DK has advantages over GK because (a) it is computationally more straightforward, since no bandwidth parameter is required, while performing similarly or slightly better than GK; and (b) it is a data-driven kernel capable of linking genomic or enviromic kernels with empirical phenotypic covariance structures [41, 54, 64]. To implement DK in an R computational environment, Cuevas et al. [41] and Costa-Neto et al. [54] have provided codes and examples that are freely available. After the creation of a DK for each genomic or enviromic source, this kernel can be incorporated in diverse packages that implement genomic prediction, such as BGLR [65] and BGGE [66].

3.2 Basic Marker × Environment Interaction Models
Multienvironment trials for assessing G × E interactions play an important role in selecting high-performing and stable breeding lines across environments, or breeding lines adapted to local environmental constraints. A first way of modeling G × E is to allow environment-specific marker effects [12], or to model environment-specific genetic effects as proposed by Burgueño et al. [11]. A kernel model can be derived to allow environment-specific genetic effects as in model 2:

$$y = \mu + g + \varepsilon, \tag{3}$$

where

$$
y = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_m \end{bmatrix}, \quad
\mu = \begin{bmatrix} 1_{n_1}\mu_1 \\ \vdots \\ 1_{n_i}\mu_i \\ \vdots \\ 1_{n_m}\mu_m \end{bmatrix}, \quad
g = \begin{bmatrix} g_1 \\ \vdots \\ g_i \\ \vdots \\ g_m \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_m \end{bmatrix}
$$

are vectors with elements corresponding to each of the $m$ environments. The vector $\mu$ is equivalent to $Z_E\mu_m$, where $Z_E$ is an incidence matrix for the environments and $\mu_m$ is a vector of order $m$ that holds one mean for each environment. When there are many environments, it is recommended to consider the vector $\mu$ as a random effect, such that $\mu \sim N(0, Z_E Z_E' \sigma_e^2)$. Other fixed effects can also be included in the model. The random effects are the genetic effects $g \sim N(0, K_0)$ and the residuals $\varepsilon \sim N(0, R)$. When the number of observations is the same in all the environments, then $K_0 = U_E \otimes K$ and $R = \Sigma \otimes I$, where $\otimes$ denotes the Kronecker product and $K$ is a kernel matrix of relationships between the genotypes. Matrix $K_0$ is the product of one matrix with information between environments ($U_E$) and one kernel with information between individuals based on markers or pedigrees ($K$). An unstructured variance–covariance matrix of order $m \times m$ can be used for $U_E$, such that:
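$$
U_E = \begin{bmatrix}
\sigma_{g_1}^2 & \cdots & \sigma_{g_1 g_i} & \cdots & \sigma_{g_1 g_m} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sigma_{g_i g_1} & \cdots & \sigma_{g_i}^2 & \cdots & \sigma_{g_i g_m} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sigma_{g_m g_1} & \cdots & \sigma_{g_m g_i} & \cdots & \sigma_{g_m}^2
\end{bmatrix},
$$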
where the $i$th diagonal element is the genetic variance $\sigma_{g_i}^2$ within the $i$th environment, and the off-diagonal element $\sigma_{g_i g_{i'}}$ is the genetic covariance between the $i$th and $i'$th environments. Model 3 can be used with a linear kernel (GBLUP) [11], a Gaussian kernel [40], or a deep kernel [67]; the nonlinear kernels allow capturing small cryptic effects, such as epistasis. To account for variation between individuals that is not captured by $g$, a random component $f$, representing the genetic variability among individuals across environments, can be added to model 3 as:

$$y = \mu + g + f + \varepsilon, \tag{4}$$

where $f = (f_1', \ldots, f_i', \ldots, f_m')'$, with the random vector $f$ independent of $g$ and normally distributed, $f \sim N(0, Q)$. In general, when the number of individuals is not the same in all environments,

$$
Q = \begin{bmatrix}
\sigma_{f_1}^2 I_{n_1} & \cdots & \sigma_{f_1 f_i} I_{n_1} & \cdots & \sigma_{f_1 f_m} I_{n_1} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sigma_{f_i f_1} I_{n_i} & \cdots & \sigma_{f_i}^2 I_{n_i} & \cdots & \sigma_{f_i f_m} I_{n_i} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\sigma_{f_m f_1} I_{n_m} & \cdots & \sigma_{f_m f_i} I_{n_m} & \cdots & \sigma_{f_m}^2 I_{n_m}
\end{bmatrix},
$$

where $\sigma_{f_i}^2$ is the variance of the genetic effects in the $i$th environment not explained by the random genetic effect $g$, and $\sigma_{f_i f_{i'}}$ is the covariance of the genetic effects between two environments not explained by $g$. When the number of individuals is the same in all the environments, $Q = F_E \otimes I$. The matrix $F_E$ captures genetic variance–covariance effects between the individuals across environments that were not captured by the $U_E$ matrix.

3.3 Basic Genomic × Environment Interaction Models
Before introducing the reaction-norm model, we will first consider model 5, in which the response of the $j$th line in the $i$th environment ($y_{ij}$) is modeled by main random effects that account
for the environment ($E_i$), the genotypes ($L_j$), and the genetic values ($g_j$) of the lines, a component assumed to be stable across environments, plus the random effects of the interaction ($Eg_{ij}$) between the $i$th environment ($E_i$) and the genetic value of the $j$th line ($g_j$), representing deviations from the main effects:

$$y_{ij} = \mu + E_i + L_j + g_j + Eg_{ij} + \varepsilon_{ij}, \tag{5}$$

where $\mu$ is an intercept, $E_i \overset{iid}{\sim} N(0, \sigma_E^2)$ is the random effect of the $i$th environment, $L_j \overset{iid}{\sim} N(0, \sigma_L^2)$ is the random effect of the $j$th genotype, and $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma_\varepsilon^2)$ is a model residual. Here $N(\cdot, \cdot)$ stands for a normally distributed random variable and iid stands for independent and identically distributed. The vector of random effects $g = (g_1, \ldots, g_J)'$ follows $g \sim N(0, Z_g K Z_g' \sigma_g^2)$, with $Z_g$ being the incidence matrix for the effects of the genetic values of the genotypes and $\sigma_g^2$ the variance component of $g$; $K$ is a kernel or matrix of genetic relationships between the genotypes, as in GBLUP ($G$) [36]. The vector $Eg \sim N\left(0, \left[Z_g K Z_g'\right] \circ \left[Z_E Z_E'\right] \sigma_{Eg}^2\right)$ denotes the interaction between the genotypes and the environments, where $Z_E$ is the incidence matrix for environments and $\sigma_{Eg}^2$ is the variance component of $Eg$, with $\circ$ denoting the Hadamard (cell-by-cell) product. Note that the elements of $g \sim N(0, K\sigma_g^2)$ are correlated, such that model 5 allows borrowing information between the genotypes. Hence, prediction of genotype performance in environments where the genotypes were not observed is possible.

3.4 Illustrative Examples When Fitting Models 2–5 with Linear and Nonlinear Kernels
Examples of models 2–5 using a wheat data set comprising 599 wheat lines evaluated in four environments (E1–E4) were employed by Crossa et al. [19] and Cuevas et al. [29]. The data are available in the BGLR R package [65]. A total of 50 random partitions were generated, each with a training set composed of 70% of the wheat lines in each environment and a testing set composed of the remaining 30% of lines, which were observed in some environments but not in others. The predictive ability was calculated as the average Pearson's correlation between observed and predicted values of the lines. Table 1 lists the averages of these correlations for each of the four models in each environment and their standard deviations when using GBLUP or the Gaussian kernel (GK). Models 2 and 5 were fitted with the BGLR package [65], whereas models 3 and 4 were fitted using the multitrait model (MTM) package of de los Campos and Grüneberg [68]. For these data and the sampling used to fit the four models, the Gaussian kernel (GK) showed higher prediction accuracy than GBLUP in all four models. However, model 4 gave the best results. The differences in prediction accuracy between models 3 and 4 are greater with GBLUP; that is, component f captured effects that are retained for the genetic component g.

Table 1 Mean prediction accuracies (standard deviations in parentheses) for the four environments of the wheat breeding data with the GBLUP and GK methods, for a single-environment model (model 2) and three multienvironment models (models 3, 4, and 5)

Environment(a)  Method  Model 2       Model 3       Model 4       Model 5
E1              GBLUP   0.500 (0.06)  0.512 (0.04)  0.543 (0.04)  0.422 (0.07)
E1              GK      0.577 (0.04)  0.575 (0.04)  0.606 (0.04)  0.458 (0.06)
E2              GBLUP   0.474 (0.05)  0.635 (0.04)  0.720 (0.03)  0.626 (0.05)
E2              GK      0.477 (0.06)  0.685 (0.03)  0.713 (0.03)  0.626 (0.04)
E3              GBLUP   0.370 (0.06)  0.592 (0.05)  0.694 (0.03)  0.473 (0.06)
E3              GK      0.422 (0.05)  0.685 (0.03)  0.699 (0.03)  0.500 (0.04)
E4              GBLUP   0.447 (0.05)  0.501 (0.04)  0.525 (0.03)  0.501 (0.06)
E4              GK      0.511 (0.04)  0.555 (0.04)  0.572 (0.04)  0.525 (0.05)

(a) Empirical phenotypic correlation between environments: E1 vs. E2 = 0.020; E1 vs. E3 = 0.193; E1 vs. E4 = 0.123; E2 vs. E3 = 0.661; E2 vs. E4 = 0.411; E3 vs. E4 = 0.388

Models 3 and 4 allow capturing covariances between environments that are close to 0 or negative, but require a more intense computational effort than model 5. Model 5 allows estimating the main genetic effects and interactions and is very flexible for including other variables, such as environmental covariables. In addition, when the covariances between environments are positive, the prediction ability of model 5 is similar to that of model 4. For large data sets, fitting model 5 could present problems or require intensive computing. A good option for large data sets including large numbers of environments is to use the approximate kernel [69]. The G × E prediction models presented in this section can predict new individuals in an existing multienvironment trial using molecular or pedigree information, but they cannot predict new environments. In the next section, we discuss the use of ECs and the so-called enviromics in creating reaction-norm models for dealing with this concern.
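The covariance structures behind the linear and Gaussian kernels and the interaction term of model 5 can be illustrated on simulated data. The following is a minimal sketch in base R, not the code used for Table 1: the marker matrix is simulated, the object names (X, G, GK, Zg, ZE, K_main, K_int) are hypothetical, and the Gaussian kernel bandwidth (h = 1, with squared distances scaled by their median) is an assumption chosen only for illustration.

```r
set.seed(1)
n_lines <- 6; n_mark <- 50; n_env <- 2

# Simulated 0/1 markers (hypothetical data, for illustration only)
X <- matrix(rbinom(n_lines * n_mark, 1, 0.5), n_lines, n_mark)

# Drop monomorphic markers, which cannot be standardized
X <- X[, apply(X, 2, sd) > 0, drop = FALSE]
Xs <- scale(X, center = TRUE, scale = TRUE)

# Linear kernel (GBLUP-type genomic relationship matrix)
G <- tcrossprod(Xs) / ncol(Xs)

# Gaussian kernel exp(-h * d^2 / median(d^2)), with h = 1 assumed
D2 <- as.matrix(dist(Xs))^2
GK <- exp(-1 * D2 / median(D2[upper.tri(D2)]))

# Balanced long-format layout: every line observed in every environment
dat <- expand.grid(line = factor(1:n_lines), env = factor(1:n_env))
Zg <- model.matrix(~ line - 1, dat)   # incidence matrix of lines
ZE <- model.matrix(~ env - 1, dat)    # incidence matrix of environments

# Covariance of the main genetic effect and of the interaction in model 5
K_main <- Zg %*% G %*% t(Zg)
K_int  <- K_main * tcrossprod(ZE)     # "*" is the Hadamard product in R
```

In this sketch, K_int is zero for pairs of records from different environments and equals the genomic relationship for pairs within the same environment, which is how the interaction term lets each environment deviate from the main genetic effect.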
4 Genomic-Enabled Reaction-Norm Approaches for G × E Prediction

4.1 Basic Inclusion of Genome-Enabled Reaction Norms

Several researchers have modeled genotype-specific variation due to key environmental factors [70–77], which was later named the reaction norm; that is, the core of expressed phenotypes for a given genotype across a certain environmental gradient. From this concept, it is reasonable to expect that the core of the reaction norm for a certain breeding population or germplasm, evaluated for a certain range of environments, will be the main driver of the statistical phenomena interpreted as G × E interaction. The models presented in the previous section consider only the G effects realized from molecular marker data. Thus, it is also reasonable to assume that, in the same way that molecular data are used as a descriptor of genotype resemblance, environmental data can contribute to explaining a large amount of the nongenomic differences observed in phenotypic records from field trials. In the GS context, modeling the interaction between markers and environmental covariates can be a complex task due to the high dimensionality of the matrix of markers, the environmental covariates, or both. Jarquín et al. [36] proposed modeling this interaction using Gaussian processes where the associated variance–covariance matrix induces a reaction-norm model. The authors showed that, assuming normality for the terms involved in the interaction and also assuming that the interaction obtained using a first-order multiplicative model is normally distributed, the covariance function is the Hadamard product of two covariance structures, one describing the genetic information and the other describing the environmental effects. This approach was expanded by Morais-Júnior et al. [57] to account for an H matrix, based on genomic and pedigree-based data in field trials from different cycles of rice breeding in Brazil. Thereafter, Gillberg et al. [78] introduced the use of Kernelized Bayesian Matrix Factorization (KBMF) to account for the uncertainty of environmental covariates in environmental relatedness kernels. Finally, the use of the GK or DK to model both genomic and environmental relatedness was suggested by Costa-Neto et al. [45] and will be described in detail in later sections.
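The Hadamard-product covariance described above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the implementation of Jarquín et al. [36]: the environmental relationship matrix is assumed here to be proportional to WW', with W a centered and standardized matrix of environmental covariates, and all object names and dimensions (M, G, W, Omega, K_GE) are hypothetical.

```r
set.seed(2)
n_lines <- 5; n_env <- 3; n_mark <- 40; n_cov <- 4

# Simulated marker scores and the resulting genomic relationship matrix
M <- matrix(rnorm(n_lines * n_mark), n_lines, n_mark)
G <- tcrossprod(scale(M)) / n_mark

# Simulated environmental covariates, centered and standardized
W <- scale(matrix(rnorm(n_env * n_cov), n_env, n_cov))
Omega <- tcrossprod(W) / n_cov        # environmental relatedness (assumed WW'/q)

# Balanced layout with lines nested within environments
dat <- expand.grid(line = factor(1:n_lines), env = factor(1:n_env))
Zg <- model.matrix(~ line - 1, dat)
ZE <- model.matrix(~ env - 1, dat)

# Genetic and environmental covariances expanded to the record level
K_G <- Zg %*% G %*% t(Zg)
K_E <- ZE %*% Omega %*% t(ZE)

# Reaction-norm interaction covariance: cell-by-cell (Hadamard) product
K_GE <- K_G * K_E
```

For a balanced data set sorted by environment and then by line, as here, K_GE coincides with kronecker(Omega, G); the Hadamard form is more general because it also applies to unbalanced data.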
4.2 Modeling Reaction-Norm Effects Using Environmental Covariables (EC)

Jarquín et al. [36] considered model 5 but modeled the interaction $Eg_{ij}$ explicitly:

$$y_{ij} = \mu + E_i + L_j + g_j + Eg_{ij} + \varepsilon_{ij}, \qquad (6)$$

where all the components are defined as in model 5 except the interaction vector Eg, which is defined as $Eg \sim N(0, (Z_g K Z_g') \# (Z_E \Omega Z_E') \sigma^2_{Eg})$. The originality of this model is that the relationship matrix between environments, Ω, is estimated using environmental covariables and is proportional to WW', with W being a matrix with centered and standardized values of the environmental covariables. The construction of matrix Ω can also be guided by the phenotypic data of the calibration set together with the environmental covariables [44]. Heslot et al. [23] proposed an alternative method based on factorial regressions at the marker level, which increased prediction accuracy by 11.1% on average in a large winter wheat dataset. Ly et al. [43] proposed a similar model based on a single environmental covariate, which allows predicting the response of new varieties to a change in a given factor (e.g., temperature variations, drought stress). The novel approach of Heslot et al. [23] consisted of using crop models to predict the main developmental stages for a better characterization of the environmental conditions faced by the plants. Since then, several publications have shown that environmental covariates directly simulated by the crop model (nitrogen stress in Ly et al. [42], dry matter stress index in Rincent et al. [44]) captured more G × E than simple climatic covariates. Millet et al. [59] applied a factorial regression genomic model based on three environmental covariates, which resulted in promising prediction accuracy of maize yield at the European scale. It is important to note that the use of environmental covariates results in sharing information between environments. This means that these models can predict new environments as long as they are characterized by environmental covariates.

4.3 Inclusion of Dominance Effects in G × E and Reaction-Norm Modeling
The main product of most allogamous breeding programs is the development of highly adapted and productive hybrids (single-cross F1s). In maize, an important allogamous species, recent research suggests that the G × E variation is the end result of two main genomic-based sources: the additive × environment (A × E) plus dominance × environment (D × E) interactions [80–82]. Thus, for predicting single crosses across diverse contrasting environments, it is necessary to incorporate both genomic-related sources of variation in a computationally efficient and biologically accurate way. Costa-Neto et al. [45] tested five prediction models including D × E and enviromics (W) for predicting grain yield in two maize germplasms. All models were run with three different kernel methods (GBLUP, GK, and DK), and a consistent trend of increased accuracy for the D and A + D + W models was observed for all kernels. On average, for both data sets evaluated, for predicting novel genotypes under known growing conditions (the so-called random cross-validation CV1 scheme) using GBLUP, these authors found accuracy gains ranging from 22% to 169%, compared with the baseline additive GBLUP. These authors concluded that the inclusion of dominance effects is an important source for predicting novel environments in cross-pollinated crops. Rogers et al. [83] conducted an extensive multienvironment analysis involving 1918 hybrids across 65 environments, in which factor analytic (FA) structures were used both for defining clusters of environments and for finding patterns of genomic and enviromic relatedness. The use of FA has been common practice since the classic phenotypic-based G × E analysis. This means that the variance–covariance matrices are dissected into orthogonal factors, and these loadings are used as variance–covariance structures, as priors of any Bayesian approach [68], or for clustering genotypes or environments to target better adapted cultivars for certain environments [84]. FA was important to identify the environmental factors related to the main A and D effects and how they can boost accuracy for predicting these phenotypes [83]. Gathering results from Costa-Neto et al. [45] and Rogers et al. [83], it is possible to infer that the dominance-related factors are responsible for a sizeable proportion of the phenotypic variation for grain yield in hybrid maize. The inclusion of environmental covariates is important, but some key aspects should be taken into account: (a) the statistical structure used to model the A, D, and W effects, for which linear kernel GBLUP models might be limited and the use of FA or nonlinear kernels may overcome this limitation; (b) how the environmental data were processed and integrated in the genomic prediction model for modeling A × W and D × W; (c) the nature of G × E for each trait under prediction. Below we discuss some results related to the use of nonlinear kernels for those purposes.

4.4 Nonlinear Kernels and Enviromic Structures for Genomic Prediction
Three kernel methods were adopted for the genomic and enviromic sources: the nonlinear GK, the nonlinear arc-cosine kernel, referred to as the deep kernel (DK), and the linear GBLUP, used as the benchmark approach. It is important to highlight the differences in creating the environmental relatedness kernel, which in the study of Costa-Neto et al. [45] was designed from the percentile distributions of each environmental factor (e.g., soil, weather, management) across five key crop development stages; because of that, it takes into account a large number of environmental typologies as markers of relatedness (W). This bridges the gap between raw environmental data and what really happened in the field. To distinguish it from the reaction norm using quantitative covariables (e.g., factorial regression on ECs), Costa-Neto et al. [45, 54] named this model "envirotyping-informed" GS, because it takes into account the environments of the experimental network. A generalization of the enviromic-enriched genomic prediction model can be described in matrix form as follows:

$$y = 1\mu + X_f \beta + \sum_{s=1}^{k} g_s + \sum_{r=1}^{l} w_r + \varepsilon, \qquad (7)$$

where y is the vector combining the means of each genotype across each one of the q environments in the experimental network, in which $y = [y_1, y_2, \ldots, y_q]^T$. The scalar μ is the common intercept or overall mean, and 1 is a vector of ones. The matrix $X_f$ represents the design matrix associated with the vector of fixed effects β. In some cases, this vector is associated with environmental effects (treated as fixed effects). Random vectors for genomic effects ($g_s$) and enviromic-based effects ($w_r$) are assumed to be independent of other random effects, such as residual variation (ε). Equation 7 is a generalization of a reaction-norm model because, in some scenarios, the genomic effects may be divided into additive, dominance, and other sources (epistasis) and the G × E multiplicative effect. In addition, the envirotyping-informed data can be divided into several environmental kernels and subsequent genomic-by-enviromic (GW) reaction-norm kernels. The baseline genomic models assume $\sum_{s=1}^{k} g_s \neq 0$ and $\sum_{r=1}^{l} w_r = 0$, without any enviromic data. However, the enviromic-enriched models might assume $\sum_{s=1}^{k} g_s \neq 0$ and $\sum_{r=1}^{l} w_r \neq 0$, in which $\sum_{r=1}^{l} w_r$ can describe a main enviromic effect (given by W), analogous to a random environment effect but with a structured matrix from ECs (Ω), and a reaction-norm GW effect as a multiplicative effect, such as that described by Jarquín et al. [36]. Thus, some enviromic-enriched models can be more accurate with fewer parameters, depending on the way that the genomic and enviromic kernels are built. It has been observed that nonlinear kernels are more efficient than the reaction-norm GBLUP [36].

4.5 Genomic Prediction Accounting for G × E Under Uncertain Weather Conditions at Target Locations
In most crops, genetic and environmental factors interact in complex ways, giving rise to substantial and complex G × E. The combination of G × E and the uncertainty about future weather conditions makes agricultural research and plant breeding extremely challenging. In this context, de los Campos et al. [58] proposed that computer simulations leveraging field trial data, DNA sequences, and historical weather records can be used to predict the future performance of genotypes under largely uncertain weather conditions. The authors used field trial data linked to DNA sequences and environmental covariates in order to learn how genotypes react to specific environmental conditions. These patterns are then used, together with DNA sequences and historical weather records, to simulate the expected performance of genotypes at target locations. The approach of de los Campos et al. [58] uses Monte Carlo methods that integrate uncertainty about future weather conditions as well as model parameters. Using extensive maize data from 16 years and 242 locations in France, the authors demonstrated that it is possible to predict the performance distributions of genotypes at locations where the genotypes have had limited or no testing data. They also showed that predictions that incorporate historical weather records are more robust with respect to year-to-year variation in environmental conditions than the ones that can be derived using field trials only. Although the use of EC information can improve the accuracy of genomic prediction across multienvironment conditions, most research has focused on quantitative traits (e.g., grain yield) or traits with a simpler genomic architecture, such as days to heading [60] and flowering time [85], which are also measured on a continuous scale. Below we present Bayesian models for dealing with ordinal data, which is a complex problem, particularly for quality traits.
5 G × E Genome-Based Prediction Under Ordinal Variables and Big Data

5.1 Genomic-Enabled Prediction Models for Ordinal Data Including G × E Interaction

In this section, we present Bayesian genomic-enabled prediction models for ordinal data including G × E interaction. Several genomic-enabled prediction models have been developed for predicting complex traits in genomic-assisted animal and plant breeding. These models include linear, nonlinear, and nonparametric models, mostly for continuous responses and less frequently for categorical responses. Linear and nonlinear models used in GS can fit different types of responses (e.g., continuous, ordinal, binary). Several linear and nonlinear models are special cases of a more general family of statistical models known as artificial neural networks. Recently, Pérez-Rodríguez et al. [86] introduced a neural network that generalizes existing models for the prediction of ordinal responses. The authors proposed a Bayesian Regularized Neural Network (BRNNO) for modeling ordinal data. The proposed model was fitted in a Bayesian framework using a data augmentation algorithm to facilitate computations and was compared with the Bayesian Ordered Probit Model (BOPM). Results indicated that the BRNNO model performed better in terms of genomic-based prediction than the BOPM model. Results are consistent with the findings of previous research [47]. It should be pointed out that the BRNNO approach for modeling ordinal data could be applied not only in the GS context but also in the context of conventional phenotypic breeding for host plant resistance to pathogens and pests, and for many other ordinal traits. In general, models for nonnormal data are scarce in the context of genome-enabled prediction, since most of the models developed so far are linear mixed models (mixed models for Gaussian data). Statistical research has shown that using linear Gaussian models for ordinal and count data frequently produces poor parameter estimates, lower prediction accuracy, and lower power, while increasing the complexity of parameter interpretation when transformations are used [47, 49, 87, 88].
Few models for genome-enabled prediction of ordinal and count variables are available [46–49, 88, 89]. The ordinal probit model assumes that, conditional on $x_i$ (covariates of dimension p), $Y_i$ is a random variable that takes values 1, ..., C, with the following probabilities:

$$P(Y_i = c) = P(\gamma_{c-1} < l_i \le \gamma_c) = \Phi(\gamma_c + E_i + g_j + Eg_{ij}) - \Phi(\gamma_{c-1} + E_i + g_j + Eg_{ij}), \quad c = 1, \ldots, C, \qquad (8)$$

where $-\infty = \gamma_0 < \gamma_1 < \ldots < \gamma_C = \infty$ are threshold parameters. A Bayesian formulation of this model assumes the following independent priors for the parameters: a flat prior distribution for $\gamma = (\gamma_1, \ldots, \gamma_{C-1})$ ($f(\gamma) \propto 1$); a normal distribution for the beta coefficients, $g = (g_1, \ldots, g_J)'$, $g \sim N(0, Z_g K Z_g' \sigma^2_g)$; and a scaled inverse chi-squared distribution for $\sigma^2_g$, $\sigma^2_g \sim \chi^{-2}(v_g, S_g)$. The vector $Eg \sim N(0, (Z_g K Z_g') \# (Z_E Z_E') \sigma^2_{Eg})$ denotes the G × E interaction, where $Z_E$ is the incidence matrix for environments and $\sigma^2_{Eg}$ is the variance component of Eg, also with a scaled inverse chi-squared distribution, $\sigma^2_{Eg} \sim \chi^{-2}(v_{Eg}, S_{Eg})$. This threshold model assumes that the process that gives rise to the observed categories is an underlying or latent continuous normal random variable $l_i = -E_i - g_j - Eg_{ij} + \epsilon_i$, where $\epsilon_i$ is a normal random variable with mean 0 and variance 1, and the values of $l_i$ are called "liabilities." The ordinal categorical phenotypes in model 8 are generated from the underlying phenotypic values, $l_i$, as follows: $y_i = 1$ if $-\infty < l_i < \gamma_1$, $y_i = 2$ if $\gamma_1 < l_i < \gamma_2$, ..., and $y_i = C$ if $\gamma_{C-1} < l_i < \infty$.

5.2 Illustrative Application of Bayesian Genomic-Enabled Prediction Models Including G × E Interactions to Ordinal Variables
Gray leaf spot (GLS), caused by Cercospora zeae-maydis, is a foliar disease of global importance in maize production. The disease was evaluated using an ordinal scale [1 (no disease), 2 (low infection), 3 (moderate infection), 4 (high infection), and 5 (complete infection)] in three environments (in Colombia, Mexico, and Zimbabwe), in 240 maize lines. The 240 lines were genotyped using the 55k single-nucleotide polymorphism (SNP) Illumina platform. The final genotypic data contained 46,347 SNPs [46]. Table 2 gives the prediction performance for each environment of the GLS data set. The prediction performance is reported as an average Brier Score [90], which was computed as:

$$BS = n^{-1} \sum_{i=1}^{n} \sum_{c=1}^{C} (\hat{\pi}_{ic} - d_{ic})^2,$$

where $\hat{\pi}_{ic}$ is the estimated probability that individual i falls in category c, and $d_{ic}$ takes the value of 1 if the ordinal categorical response observed for individual i falls in category c; otherwise, $d_{ic} = 0$. The closer to zero, the better the prediction performance. The average Brier Score was computed with the testing sets of the 20 random partitions implemented. The best predictions were obtained with model E + G + GE, which takes into account the G × E interaction. Relative to models based on main effects only, the models that included G × E gave gains in prediction accuracy between 9% and 14% [46].
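The category probabilities of the threshold model (Eq. 8) and the Brier score can be computed directly with base R. The following is a minimal sketch under stated assumptions: the thresholds, linear predictors, and observed categories are invented for illustration, and the helper names probit_probs() and brier_score() are hypothetical and do not belong to any package.

```r
# Category probabilities of the ordinal probit model:
# P(Y = c) = Phi(gamma_c + eta) - Phi(gamma_{c-1} + eta),
# where eta stands for E_i + g_j + Eg_ij
probit_probs <- function(eta, gamma) {
  cuts <- c(-Inf, gamma, Inf)                      # gamma_0, ..., gamma_C
  cdf <- sapply(cuts, function(g) pnorm(g + eta))  # one column per threshold
  cdf <- matrix(cdf, nrow = length(eta))
  cdf[, -1, drop = FALSE] - cdf[, -ncol(cdf), drop = FALSE]
}

# Brier score: mean over individuals of the squared differences between
# predicted probabilities and the 0/1 indicators of the observed category
brier_score <- function(prob, y) {
  d <- outer(y, seq_len(ncol(prob)), "==") * 1
  sum((prob - d)^2) / nrow(prob)
}

gamma <- c(-1, 0, 1)      # hypothetical thresholds, so C = 4 categories
eta   <- c(-0.5, 0, 0.8)  # hypothetical linear predictors for 3 records
y     <- c(3, 2, 1)       # hypothetical observed categories

prob <- probit_probs(eta, gamma)
bs   <- brier_score(prob, y)
```

Each row of prob sums to one, and a prediction that puts all probability on the observed category gives a Brier score of zero, which matches the interpretation that smaller values indicate better prediction.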
Table 2 Brier scores (mean, minimum, and maximum; smaller indicates better prediction) evaluated for the validation samples. Model E + L contains in the predictor only the information of environment + lines without markers, model E + G contains in the predictor information of environment + genomic data, and model E + G + GE has the same information as E + G plus the genotype-by-environment (GE) interaction [46]

            Colombia               Zimbabwe               Mexico
Model       Mean   Min    Max      Mean   Min    Max      Mean   Min    Max
E + L       0.389  0.379  0.401    0.360  0.355  0.366    0.351  0.341  0.362
E + G       0.382  0.371  0.393    0.362  0.358  0.369    0.346  0.336  0.363
E + G + GE  0.329  0.315  0.347    0.333  0.323  0.344    0.320  0.304  0.335
5.3 Approximate Genomic-Enabled Kernel Models for Big Data

When the number of observations (n) is large (thousands or millions), there are computational difficulties in inverting and decomposing large genomic kernel relationship matrices. This problem increases when G × E and multitrait kernels are included in the model. Cuevas et al. [69] proposed selecting a small number of lines m (m < n) for constructing an approximate kernel of lower rank than the original, thus exponentially decreasing the required computing time. The method of approximate kernels starts from the original kernel matrix $K_{n,n}$ of order n × n, from which a smaller submatrix $K_{m,m}$ of order m × m is selected, with the restriction that m < n, with the aim of finding an approximate matrix Q of rank m, smaller than the rank of $K_{n,n}$, such that:

$$K \approx Q = K_{n,m} K_{m,m}^{-1} K_{n,m}'$$

where $K_{m,m}$ is a submatrix constructed with m selected individuals with p markers, and $K_{n,m}$ is a submatrix of K with the relationships between the total n lines and the m selected ones. Thus, Q is of smaller rank than K, and computational time is significantly reduced when performing the required spectral decomposition and/or inversion. Based on this approximation, Misztal et al. [91] and Misztal [92] employed recursive methods from the joint distribution of the random genetic effects in large animal production data sets. Cuevas et al. [69] described the full genomic method for a single environment (FGSE), with a covariance matrix (kernel) including all n lines. Then m lines (observations) are used to approximate the original kernel for the single-environment model (APSE model). Similarly, the full model including main effects and G × E (FGGE model) was approximated using m lines (APGE model). The authors compared the prediction performance and computing time for FGGE vs. APGE models and showed a competitive prediction performance of the approximate method with a significant reduction in computing time (Table 3).

Table 3 The models FGGE and APGE considering the size of m as 25% of the original training set: average correlation between predicted and observed values (Corr.), residual variance ($\hat{\sigma}^2_\varepsilon$), and average computing time required (Time). Training sets contain 8000 to 10,000 wheat lines per cycle

                                                              FGGE                                      APGE
Cycle      Training cycles                            Corr.  $\hat{\sigma}^2_\varepsilon$  Time      Corr.  $\hat{\sigma}^2_\varepsilon$  Time
2014_2015  2013_2014                                  0.222  0.317                         4.96 h    0.206  0.363                         0.68 h
2015_2016  2013_2014 2014_2015                        0.328  0.287                         11.10 h   0.347  0.309                         2.80 h
2016_2017  2013_2014 2014_2015 2015_2016              0.328  0.275                         23.72 h   0.321  0.29                          5.08 h
2017_2018  2013_2014 2014_2015 2015_2016 2016_2017    0.426  NA                            NA        0.427  0.301                         8.38 h

To predict the 2017–2018 cycle using the previous four cycles with the full genomic GE model (FGGE), it was necessary to manipulate large covariance matrices, one for the main effects of the genomic model and another for the interaction, of order 45,099 × 45,099. It was not possible to manage this matrix size with available laptops; therefore, the genomic-enabled prediction accuracy recently reported by Pérez-Rodríguez et al. [87] was used as a reference. The authors reported a genomic prediction accuracy of 0.426 for the 2017–2018 cycle using all the other cycles as a training set. Using the APGE model and only 25% of the total training set, matrices $K_{m,m}$ and $K_{n,m}$ are of manageable sizes, of order 9021 × 9021 and 45,099 × 9021, respectively, which gave a genomic prediction accuracy of 0.427; that is, there is no loss of prediction accuracy with respect to the full model FGGE. The computing time required, including the time for preparing the matrices for the approximation method, the time for the eigenvalue decomposition, and the 20,000 iterations, was 30,670 s or 8.5 h.
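The low-rank approximation $Q = K_{n,m} K_{m,m}^{-1} K_{n,m}'$ can be reproduced on a small simulated example. The following is a minimal sketch in base R, not the implementation of Cuevas et al. [69]: the marker data are simulated, the m lines are chosen by simple random sampling (an assumption made here for illustration), and the object names (K, idx, K_mm, K_nm, Q) are hypothetical.

```r
set.seed(3)
n <- 200; p <- 500; m <- 50

# Simulated standardized marker scores and the full n x n linear kernel
X <- scale(matrix(rnorm(n * p), n, p))
K <- tcrossprod(X) / p

# m lines chosen at random (assumption; other selection rules are possible)
idx  <- sample(n, m)
K_mm <- K[idx, idx]     # m x m kernel among the selected lines
K_nm <- K[, idx]        # n x m kernel between all lines and the selected ones

# Approximate kernel Q = K_nm K_mm^{-1} K_nm'
# solve(A, B) avoids forming the explicit inverse of K_mm
Q <- K_nm %*% solve(K_mm, t(K_nm))

# Numerical rank of Q and relative approximation error in Frobenius norm
rank_Q  <- qr(Q)$rank
rel_err <- norm(K - Q, "F") / norm(K, "F")
```

In this construction, the rows and columns of Q that correspond to the selected lines reproduce the original kernel exactly, and the quality of the approximation for the remaining lines depends on how well the selected lines represent the whole population.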
6 Open Source Software for Fitting Genomic Prediction Models Accounting for G × E

In this section, we present practical examples of the use of three software packages for genomic prediction accounting for multitrait and multienvironment data in plant breeding. We emphasize open-source packages developed under the R computational-statistical environment, due to its widespread use in plant breeding and quantitative genetics. Historically, the first open-source R software for genome-based prediction was developed by de los Campos et al. [16]. Thereafter, Pérez et al. [93] formally described the Bayesian linear regression (BLR) package, which allows fitting high-dimensional linear regression models including dense molecular markers, pedigree information, and several other covariates. Then, Endelman [94] presented the frequentist ridge-regression approach (RR), which also allowed the estimation of marker effects and other kernel models and helped to popularize GS in plants. This package was named rrBLUP because it runs an RR-BLUP approach, that is, a whole-genome regression of a certain phenotype on the molecular markers. The package rrBLUP is mostly used for single-environment studies or genome-wide association studies (GWAS). On the other hand, BLR from Pérez et al. [93] allows including not only markers but also pedigree data jointly. In the seminal work on BLR, Pérez et al. [93] explained the challenges that arise when evaluating the genomic-enabled prediction accuracy through random cross-validation and how to select the hyperparameters of the Bayesian models. Thus, to facilitate the use of such Bayesian models in genomic prediction, the Bayesian generalized linear regression package (BGLR) [65, 95] was introduced in 2014 as a generalization of the BLR package that implements several parametric and semiparametric regression models, including Bayesian Lasso, Bayesian ridge regression (BRR), BayesB, BayesCπ, and reproducing kernel Hilbert spaces (RKHS), for continuous and ordinal responses (either censored or not).
This approach opened the way for dealing with more complex structures of phenotypic records, especially concerning multienvironment data and the "black box" of the G × E interaction. After the works of Jarquín et al. [36], López-Cruz et al. [37], and Souza et al. [63], the use of kernel models including several structures for the main genotypic effect (MM model), MM plus a single G × E deviation (MDs), MDs with environment-specific variation (MDe), inclusion of random intercepts [41], and environmental relatedness kernels [36] became computationally challenging for large numbers of genotypes and environments (and thus large kernels). Granato et al. [66] presented the Bayesian genotype plus genotype-by-environment (BGGE) software, which takes advantage of a singular-value decomposition (SVD) of those kernels to speed up Gibbs sampling and mixed model solving. This software runs the same kernel models as BGLR (using multienvironment RKHS), but it is about five times faster without loss of accuracy.
Another important software package is ASReml-R (version 3.0) [96], a widely used non-open-source package. Briefly, the main advantage of this software is the possibility of easily fitting a wide range of structures for genomic relationship (G), environmental relatedness (E), and G × E, thus allowing explicit modeling of the variance–covariance matrices of G, E, and G × E in different ways, such as unstructured (UN) and factor analytic (FA) structures. Several publications show the benefits of using FA for modeling genomic and G × E sources [80, 82, 83], because this approach deals with the main patterns of variation in a more parsimonious and accurate manner. Another way to model multivariate structures is through the open-source software MTM [68]. It allows fitting a Bayesian multivariate Gaussian model with an arbitrary number of random effects using a Gibbs sampler, with several specifications for the (co)variance parameters (unstructured, diagonal, factor analytic, etc.). In this package, multienvironment structures can be interpreted as multitrait structures, where the phenotypic records for the same genotype in different environments are treated as different traits. This concept traces back to the idea of the phenotypic correlation across environments [97] and its putative structure for modeling G × E effects and measuring their importance in phenotypic variation. The MTM package is able to fit model 3 and to estimate matrices $U_E$ and Σ. Matrix $U_E$ can indeed be modeled with different levels of complexity, as illustrated by Malosetti et al. [79]. Another option for creating unstructured environmental relatedness matrices is the use of explicit environmental data [36, 57]. The use of environmental data for this purpose must follow a certain biological reality, because the covariates must represent in silico the growing conditions expected for a certain environment, in which "environment" means a certain time interval for a certain location under a certain crop management.
The package EnvRtype [54] was developed to help quantitative geneticists import environmental data and use it in genomic prediction. This package runs the BGGE routine developed by Granato et al. [66] and Cuevas et al. [41]. Although it is presented as software for genomic prediction, the main contribution of this package is the ease of importing, processing, and incorporating environmental information as a reliable source of variation. This package provides tools to implement the reaction-norm model and other enviromic-enriched structures (see Eq. 7). The Bayesian prediction is implemented using the structure of both the BGGE and BGLR packages. Other commonly used open-source packages are Solving Mixed Model Equations in R (sommer) [98], Bayesian multitrait multienvironment (BMTME) [99], and linear mixed models for millions of observations (MegaLMM) [100]. Here, we focus on practical examples for two software packages: BGLR and MTM.
7 Practical Examples for Fitting Single Environment and Multi-Environment Modeling G × E Interactions

7.1 Single Environment Models with BGLR
This section gives the R code to illustrate how to fit the RR and GBLUP models described before. We have adapted the code from previous publications. Here, we analyze the wheat dataset described in Crossa et al. [19]. The dataset includes phenotypic and genotypic information for 599 wheat lines. The response variable is grain yield, which was evaluated in four environments. Lines were genotyped using DArT markers, which were coded as 0 and 1. An additive relationship matrix (A) derived from the pedigree is also available. The R code in Box 1 shows how to load the BGLR package and how to load the data from an RData file. In this example, the RData file can be downloaded from the following link: http://genomics.cimmyt.org/BookChapter_Rincent/. Note that the grain yield data contained in this dataset differ from those included in the BGLR package, because here the response variable is not standardized within each environment. After loading the data, three objects are available in the R environment: (1) X, a matrix of markers with 599 rows (individuals) and 1279 columns (markers); (2) A, the additive relationship matrix derived from the pedigree; and (3) Pheno, a data frame with three columns: Yield (t/ha), Var (genotype), and Env (environment). The R code also shows how to generate a boxplot of grain yield in each environment (Fig. 2).
Box 1 Loading Bayesian Generalized Linear Regression (BGLR) and Wheat Data
library(BGLR)
load("wheat599.RData")
ls() # list objects
# Boxplot for grain yield
boxplot(Yield ~ Env, data = Pheno, xlab = "Environment", ylab = "Grain yield (t/ha)")
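Before fitting the models, it may help to see how the RR and GBLUP formulations relate to each other. The sketch below uses simulated 0/1 markers (not the wheat data) and base R only, with a fixed shrinkage parameter lambda chosen arbitrarily for illustration. It shows that, for the same variance ratio, ridge regression on standardized markers and GBLUP with G = ZZ'/p give the same predicted genetic values. In BGLR the variance components are sampled rather than fixed, so this is only a simplified analogue of the models fitted in this section.

```r
set.seed(123)
n <- 50    # simulated number of lines
p <- 200   # simulated number of markers

# Simulated presence/absence markers coded 0/1
X <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n, ncol = p)
X <- X[, apply(X, 2, var) > 0, drop = FALSE]   # drop monomorphic markers, if any
Z <- scale(X, center = TRUE, scale = TRUE)     # standardized markers
p <- ncol(Z)

# Genomic relationship matrix
G <- tcrossprod(Z) / p

# Simulated phenotype, centered so that no intercept is needed
b_true <- rnorm(p, sd = sqrt(1 / p))
y <- as.vector(Z %*% b_true + rnorm(n))
yc <- y - mean(y)

# Assumed (fixed) ratio of residual variance to marker-effect variance
lambda <- p

# Ridge regression: b_hat = (Z'Z + lambda I)^-1 Z'y, then g_hat = Z b_hat
b_hat <- solve(crossprod(Z) + diag(lambda, nrow = p), crossprod(Z, yc))
g_rr <- as.vector(Z %*% b_hat)

# GBLUP: g_hat = G (G + (lambda / p) I)^-1 y
g_gblup <- as.vector(G %*% solve(G + diag(lambda / p, nrow = n), yc))

# The two sets of predicted genetic values agree up to numerical error
print(all.equal(g_rr, g_gblup))
```

The practical consequence is that the choice between the two formulations is mostly computational: the marker-based form works with a p × p system, whereas the G-based form works with an n × n system.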
Box 2 includes R code to fit RR (model 1). We predict grain yield in environment 4 using the BGLR function. The number of iterations, the burn-in, and the thinning interval are parameters of the Gibbs sampler used to compute the posterior means of the parameters of interest. After the model is fitted, the estimated marker effects β̂ can be obtained for further processing (see Fig. 3). The structure of the output object returned by the BGLR function is described by Pérez and de los Campos [65] and in the corresponding package documentation.
Fig. 2 Distribution of grain yield in each environment
Box 2 Fitting Bayesian Ridge Regression (BRR)
# You need to run code in Box 1
# Specify linear predictor
EtaR