Heterogeneity in Statistical Genetics: How to Assess, Address, and Account for Mixtures in Association Studies 3030611205, 9783030611200

Heterogeneity, or mixtures, are ubiquitous in genetics. Even for data as simple as mono-genic diseases, populations are


English Pages 352 [366] Year 2021


Table of contents :
Foreword
References
Acknowledgements
Initial Comments and Technical Notes
References
Contents
1 Introduction to Heterogeneity in Statistical Genetics
1.1 Different Types of Heterogeneity
1.2 A Note on Definitions and Notation Throughout This Book
1.3 Hardy–Weinberg Equilibrium (HWE) Proportions and Their Importance in Gene-Mapping
1.4 Determination of Conditional Genotype Frequencies
1.4.1 Genetic Model-Free Approaches
1.4.2 Genetic Model-Based Approach Through the Use of Genotype Relative Risks
1.5 The Box (and Whiskers) Plot as a Tool for Visualizing Empirical Data Distributions
1.6 Power and Minimum Sample Size (MSSN) for Different Statistical Tests of Genetic Association
1.6.1 Contingency Table for Organizing Categorical Phenotype and Genomic-Data
1.7 The Expectation–Maximization (EM) Algorithm
1.7.1 Example Application
References
2 Overview of Genomic Heterogeneity in Statistical Genetics
2.1 Heterogeneity Due to SNP Genotype Misclassification
2.2 Examples of How Genotype Misclassification May Arise in Practice
2.3 Mathematical Models of Genotype Misclassification
2.4 Genotype Misclassification for Genomic Data with Three or More Categories
2.5 Effects of Misclassification on Statistical Tests
2.5.1 Non-differential Misclassification Error
2.5.2 Differential Misclassification Error
2.5.3 Non-differential Misclassification in Family-Based Tests of Association
2.6 Errors in Next-Generation Sequencing (NGS)
2.6.1 Definitions and Notation
2.6.2 Mathematical Model for NGS Data
2.6.3 Empirical Type I Error for Test Statistics Applied to NGS Data with Sequence Error—Simulation Results
2.7 Non-misclassification Forms of Heterogeneity
2.7.1 Mathematical Model for Heterogeneity
References
3 Phenotypic Heterogeneity
3.1 Phenotype Misclassification
3.2 How Phenotype Misclassification May Arise in Practice
3.2.1 Lack of Access to Gold-Standard Classification
3.2.2 Variability of Phenotype Expression over Time
3.2.3 Variable Age of Onset
3.2.4 Incomplete Knowledge of Gold-Standard Classifications
3.2.5 Model Misspecification
3.3 Effects of Misclassification on Statistical Tests
3.3.1 Non-differential Misclassification Error Example for Single-Stage Genetic Association
3.3.2 Why Do We Observe Such Large Power Loss/MSSN Increase for Phenotype Misclassification?
3.3.3 Multi-stage Phenotype Classification and Limits of Observed Genotype Frequencies
3.4 Non-misclassification Forms of Heterogeneity
3.5 Summary
References
4 Association Tests Allowing for Heterogeneity
4.1 Introduction
4.2 Statistical Tests that Use Genotype Data
4.2.1 Likelihood Ratio Test that Allows for Random Phenotype and Genotype Misclassification Error (LRTae)
4.2.2 Trend Statistic that Allows for Random Phenotype and Genotype Misclassification Error
4.2.3 Likelihood Ratio Statistic for Family-Based Association that Incorporates Genotype Misclassification Errors (TDTae)
4.3 Statistical Tests that Consider Heterogeneity Other Than Misclassification
4.3.1 Mixture Likelihood Ratio Test (MLRT) for Genetic Association in the Presence of Locus Heterogeneity
4.3.2 Transmission Disequilibrium Test that Allows for Locus Heterogeneity (TDT-HET)
4.3.3 Tests that Incorporate Phenotype Heterogeneity
4.4 Statistical Tests that Use Sequence Data
4.4.1 Single-Variant and Multiple Variant Tests of Trend for Genetic Association that Allows for Random and Differential NGS Error (LTTae,NGS)
4.4.2 Transmission Disequilibrium Test that Allows for Next-Generation Sequence Error (TDT1-NGS)
References
5 Designing Genetic Linkage and Association Studies that Maintain Desired Statistical Power in the Presence of Mixtures
5.1 Parameter Settings, for Example, Calculations
5.1.1 Example Parameter Settings to Compute Power for a Fixed Sample Size and Significance Level
5.1.2 Example Parameter Settings to Compute MSSN for a Fixed Power and Significance Level
5.2 Statistical Tests that Use Genotype Data
5.2.1 Power and MSSN for Population-Based Data in the Presence of Non-differential Genotype Misclassification
5.2.2 Power and MSSN for Population-Based Data in the Presence of Non-differential Phenotype Misclassification
5.2.3 Likelihood Ratio Test that Allows for Random Phenotype and Genotype Misclassification Error (LRTae)—Empirical Power
5.2.4 Trend Statistic that Allows for Random Phenotype and Genotype Misclassification Error
5.2.5 Family-Based Tests of Association—Analytic Solution to Increase in Rejection Rate for TDT in the Presence of Genotype Misclassification Errors
5.2.6 Family-Based Tests of Association—Analytic Solution to Increase in Rejection Rate for TDT in the Presence of Phenotype Misclassification Errors
5.3 Statistical Tests that Consider Heterogeneity Other Than Misclassification
5.3.1 Sample Size Calculations in the Presence of Locus Heterogeneity—Population-Based Tests of Genetic Association
5.3.2 Power and Sample Size Calculations for Chi-Square Tests of Independence on Allele and Genotype Data for Phenotype Heterogeneity
5.3.3 Family-Based Test of Linkage/Association
5.4 Power Calculations in the Presence of NGS Misclassification
5.4.1 Test of Trend Applied to Multiple NGS Data for SNP Loci
5.4.2 Increase in Interest for NGS Statistics
5.4.3 Empirical Null and Power Simulations for the LTTae,NGS Statistic
5.4.4 Factors that Most Significantly Affect Power for NGS-Based TDT
References
6 Threshold-Selected Quantitative Trait Loci and Pleiotropy
6.1 Quantitative Trait Locus with Single Phenotype
6.1.1 Notation
6.1.2 Conditional Genotype Frequencies for Threshold-Selected Phenotypes
6.1.3 Example Sample Size Calculation for Threshold-Selected Phenotypes
6.1.4 Why Use Threshold-Selected Dichotomous Phenotypes as Compared with Quantitative Phenotypes? Power Comparison with ANOVA
6.2 Quantitative Trait Locus with Multiple Phenotypes
6.2.1 Notation for Multivariate Quantitative Traits
6.2.2 Methods
6.2.3 Thresholds
6.2.4 Example MSSN Calculation
6.2.5 A Final Note on Advantages of the Threshold-Selected Approach
References
Bibliography
Index

Statistics for Biology and Health

Derek Gordon Stephen J. Finch Wonkuk Kim

Heterogeneity in Statistical Genetics: How to Assess, Address, and Account for Mixtures in Association Studies

Statistics for Biology and Health
Series Editors
Mitchell Gail, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD, USA
Jonathan M. Samet, Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore, MD, USA

More information about this series at http://www.springer.com/series/2848


Derek Gordon Department of Genetics Rutgers University Piscataway, NJ, USA

Stephen J. Finch Department of AMS Stony Brook University Stony Brook, NY, USA

Wonkuk Kim Department of Applied Statistics Chung-Ang University Seoul, Korea (Republic of)

ISSN 1431-8776
ISSN 2197-5671 (electronic)
Statistics for Biology and Health
ISBN 978-3-030-61120-0
ISBN 978-3-030-61121-7 (eBook)
https://doi.org/10.1007/978-3-030-61121-7

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To TH, JBF, and HYH with love

Foreword

The modern fields of mathematical statistics and statistical genetics started off as more or less the same discipline. Throughout the early twentieth century, quantitative problems in human genetics were catalysts for statistical innovation; and people we now think of as famous statisticians could equally well have been described as mathematical geneticists [1]. But this changed during the second half of the twentieth century, as statistical genetics evolved its own characteristic themes and methods. Pressures driving the idiosyncratic development of the field started with the focus on pedigrees, which meant that the unit of observation comprised a group of individuals whose phenotypes and genotypes were patterned in biologically structured ways. The focus on pedigrees also gave rise to the problem of ascertainment, the practice of recruiting families for study on the basis of part of the outcome, namely the phenotypes of some family members. These features of human genetic data necessitated specialized forms of likelihood and tailored methods of statistical modeling, and they posed unique computational challenges as well [2]. Also, throughout this period statistical genetics was the only branch of statistics, as far as I know, with an undercurrent of an “evidentialist”, or a purely likelihood ratio-based, approach to statistical inference [3–6]. For example, for decades it was standard practice to report the LOD score, a log likelihood ratio, rather than a p-value as the primary endpoint of analysis. But in recent times the field has moved away from the study of pedigrees (beyond the simplest structures, such as case-parent triads) and the particular modeling challenges they posed, and frequentist inference has overshadowed evidentialism. It would seem that a contemporary student of statistical genetics does not need to know so much about these historically important topics. 
Of course, understanding the biological principles of genetic inheritance, and familiarity with ever-evolving molecular genetic technologies, have always been and remain today drivers of the field, requiring specialized treatment. But from a purely statistical point of view, most of statistical genetics presents few distinctive challenges nowadays … with one exception.


This exception is the problem of heterogeneity, that is, the existence of mixtures of different types within our data, with no a priori basis for sorting the data into more homogeneous subsets (e.g., no known or measured covariates). The classic example is locus heterogeneity, a common feature of even the simple Mendelian genetic diseases, in which a mutation in any one of multiple genes can cause a single clinical condition. Allowing for locus heterogeneity is critical to the power of pedigree studies to discover genes, and to post hoc sorting of pedigrees into subsets corresponding to distinct underlying genetic causes, which in turn is essential for a precise molecular genetic understanding of genotype-phenotype relationships. It is also important in clinical practice, including genetic counseling. Other forms of heterogeneity confound statistical genetic modeling as well. Those curious about these other forms would do well to read this book. The ubiquitous presence of heterogeneity means that the core statistical genetic model is in its essence a mixture model, in which the underlying distribution is a mixture of distinct distributions, with neither the mixing proportions nor the parameters of the sub-distributions resolvable in advance. In many non-genetic situations, any underlying heterogeneity may be fine-grained enough to be treated as part of the background noise of random variation; or the statistician may be so fortunate as to be provided with measured covariates capable of discriminating subgroups. Accordingly, most biostatisticians can avoid mixture models altogether. That is very nice for them since mixture models are highly inconvenient!
Mixture models mask what may be key features of indistinguishable subsets of the data, confounding interpretation of results; they lead to greatly increased levels of variation, compared to non-mixture models, seriously compromising power and power to replicate; and even in large samples, they can violate standard regularity conditions in ways that produce irregular likelihoods, e.g., causing parameters to become unidentifiable (not uniquely estimable) or otherwise ill-behaved. In genetic settings, however, it is even more problematic to eschew mixture models than to deploy them, because to ignore heterogeneity is to get the biology wrong in a crucial way right from the start. Perhaps much of twentieth-century statistical genetics now appears archaic, but the need to grapple with heterogeneity remains, if for no other reason than that not to take heterogeneity seriously is to risk utterly misinterpreting statistical genetic results. (Of course, this last point might be applicable to some other areas of biostatistics as well.) I am, therefore, delighted that we finally have a book entirely devoted to this statistically complex and biologically crucial topic. As someone who has devoted a good deal of my own career to the implications of mixture models in statistical genetic applications, I hope this volume will become a standard on the bookshelves of all serious students of the field.

Columbus, Ohio, USA
2020

Veronica J. Vieland


References
1. Edwards, A.W.F.: R. A. Fisher — Twice professor of genetics: London and Cambridge, or ‘a Fairly Well-Known Geneticist’. The Stat. 52(Part 3), 311–318 (2003)
2. Elston, R.C., Thompson, E.A.: A century of biometrical genetics. Biometrics 56(3), 659–666 (2000)
3. Morton, N.E.: Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7(3), 277–318 (1955)
4. Vieland, V.J., Hodge, S.E.: Review of statistical evidence: a likelihood paradigm. Am. J. Hum. Genet. 63, 283–289 (1998)
5. Strug, L.J.: The evidential statistical paradigm in genetics. Genet. Epidemiol. 42, 590–607 (2018)
6. Edwards, A.W.F.: Likelihood, Expanded ed. The Johns Hopkins University Press, Baltimore (1992)

Acknowledgements

We gratefully acknowledge the following individuals in the creation and completion of this work:

Family and friends: (DG) Mrs. Marjorie Gordon and Dr. Joseph Gordon (of blessed memory), siblings David, Dan, Amy (Super Grover!) and Andy, Seth and Amy, wonderful nieces and nephews, Rabbi Louis Stein, and Mrs. Marcia Offsey, (SJF) My wife Jeanne and daughters Marcia and Glenda, (WK) My wife Hae Young and my mother, for their continual love and support;

Advisors, mentors, and teachers (in approximate chronological order): (DG) Professors John Harper, Anthony W. Knapp, Stephen J. Finch, Nancy Mendell, Jürg Ott, Marcella Devoto, Susan E. Hodge, David Axelrod, for imbuing us with their exceptional knowledge and wisdom;

Research colleagues (in approximate chronological order): (DG) Dr. Francis J. McMahon, Professor Angela Christiano, Dr. Amalia Martinez-Mir, Dr. Abraham Zlotogorski, Professor Rebecca Morris, Dr. Francisco De La Vega, Professor Yaning Yang, Dr. Abraham Brown, Mr. Chad Haynes, Professors Lei Yu, Steve Buyske, and Amrik Sahota, Dr. Carol Wise, and Professors Jinchuan Xing and Karen Schindler, for sharing their ideas and working together to produce quality science;

Students (in approximate chronological order): (DG) Dr. Brian J. Edwards, Dr. Mark Levenstien, Dr. Douglas Londono, Dr. Anthony Musolf, Ms. Payal Patel, Dr. Lisheng Zhou, Ms. Dana Godrich, Ms. Angelica Galianese, and Ms. Aliyah Khatri, (SJF) Dr. Sun Kang, Dr. Kwangmi Ahn, and Dr. Nathan Tintle, who teach us more than we teach them (as the adage goes).

A special note of thanks to Professors Jay A. Tischfield, Linda Bzustowicz, Tara Matise, Terry McGuire, Chris Rongo, and Gary Heiman. The leadership of these colleagues provided for an atmosphere of patience, support, and encouragement. This work could not have been created without them.

Additional thanks go to Dr. Judy Flax and Dr. Laurel Glaser, who conversed with us at length on the subject of research-quality phenotyping. An extra-special note of gratitude to Dr. Veronica Vieland, who graciously offered to write the Foreword for this book. Undoubtedly, we missed acknowledging individuals who deserve to be in this list. We regret any such oversights, and we hope they know how much we appreciate them.

Initial Comments and Technical Notes

If we were asked to reduce the contents of this book down to one word, we would respond “computation”. Virtually every section of every chapter contains equations or statistical analyses that address different types of heterogeneity. Why is this approach important? Consider the following. Most people who work in statistical genetics or in an allied field like genetic epidemiology, if asked about the effects that heterogeneity has on test statistics designed to find an association between phenotypes and categorical genomic data, like SNPs, would answer that heterogeneity reduces statistical power and/or inflates the false positive rate. The follow-up question is more challenging: how much? Quantify the amount of power loss, or equivalently, the increase in minimum sample size necessary to detect association. Also, given that power and/or sample size is a function of multiple genetic factors such as disease allele frequency, mode of inheritance, relative risk, and a heterogeneity parameter, which, if any, of the factors most significantly alter power or sample size? Can one determine a linear relationship among parameter values for the factors and outcomes like power or sample size? We seek to answer these and other questions in this work. The subtitle of this book refers to assessing, addressing, and accounting for heterogeneity in the design and analysis of genetic association. We assess heterogeneity by describing how it may arise in practice, and by developing methods to estimate heterogeneity parameters in actual data sets. We address heterogeneity by designing test statistics that produce unbiased estimates of genetic model parameters like allele or genotype frequencies in case- and control-populations. Finally, we account for heterogeneity in experimental design by deriving closed-form solutions of power and/or sample size for a given significance level in the presence of heterogeneity and, as a result, providing precise adjusted values for power or sample size given different values of heterogeneity parameters. In this way, one can ensure that studies are sufficiently powered to detect genetic association.
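To make the "how much?" question concrete, the following is a minimal sketch and not the closed-form solutions derived in the book. It assumes a hypothetical two-component mixture in which only a fraction h of cases carries the associated allele frequency while the remaining cases look like controls, and it uses a simple normal approximation for a two-sided comparison of allele frequencies. The function names, the parameter h, and all numeric values are illustrative assumptions.

```python
from statistics import NormalDist


def mixed_case_frequency(p_linked, p_control, h):
    """Allele frequency observed among cases when only a fraction h of them
    belongs to the associated subgroup (hypothetical two-component mixture)."""
    return h * p_linked + (1.0 - h) * p_control


def power_two_sided(p_case, p_control, n_case, n_control, alpha=0.05):
    """Normal-approximation power of a two-sided test comparing two allele
    frequencies; n_case and n_control are counted in alleles (assumption)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Unpooled standard error of the difference in sample frequencies.
    se = (p_case * (1.0 - p_case) / n_case
          + p_control * (1.0 - p_control) / n_control) ** 0.5
    shift = abs(p_case - p_control) / se
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def power_under_heterogeneity(h, p_linked=0.30, p_control=0.20,
                              n_case=1000, n_control=1000, alpha=0.05):
    """Power when the case sample is a mixture with mixing proportion h."""
    p_case = mixed_case_frequency(p_linked, p_control, h)
    return power_two_sided(p_case, p_control, n_case, n_control, alpha)


if __name__ == "__main__":
    for h in (1.0, 0.75, 0.5, 0.25):
        print(f"h = {h:.2f}  approximate power = {power_under_heterogeneity(h):.3f}")
```

Under these assumptions the approximate power falls steadily as h decreases, and at h = 0 it reduces to the significance level, which is the qualitative behavior the paragraph above describes.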


Some technical notes:
1. Target audience—This book is written for those who have advanced undergraduate knowledge in statistics. Some of the topics essential to an understanding of this work include parameter estimation, distribution theory, statistical hypothesis testing, and likelihood theory. While there are many excellent books written on such subjects, we are most familiar with the books by Hogg and Craig [2] and Edwards [1]. An excellent foundation for the genetics material may be found in books by Ott [3], Laird and Lange [4], and the laboratory manual edited by Al-Chalabi and Almasy [5]. It is highly recommended that people who study this book be familiar with the basics of terms like SNP, linkage disequilibrium, penetrance, genome-wide association studies (GWAS), and others.
2. All notation is local—While we have striven to ensure that notation be as general as possible, some of the methods are sufficiently different that novel notation is required. Still, for a given section, all the notation needed for a method is contained in that section.
3. While one way that researchers have chosen to address misclassification, particularly with genotypes, is through repeated sampling, we have chosen to focus on methods where a gold standard is known. Our colleague, Professor Nathan Tintle, has published extensively on the use of repeated sampling of genotypes in genetic association testing.

References
1. Edwards, A.W.F.: Likelihood, Expanded ed. The Johns Hopkins University Press, Baltimore (1992)
2. Hogg, R.V., Craig, A.T.: Introduction to Mathematical Statistics, Fourth ed. Macmillan, New York, NY
3. Ott, J.: Analysis of Human Genetic Linkage, Third ed. The Johns Hopkins University Press, Baltimore, MD (1999)
4. Laird, N.M., Lange, C.: The Fundamentals of Modern Statistical Genetics. Springer Publishing Company, Incorporated (2010)
5. Al-Chalabi, A., Almasy, L. (eds.): Genetics of Complex Human Diseases: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY (2009)

Contents

1 Introduction to Heterogeneity in Statistical Genetics . . . . . . . . . . . . . . . 1.1 Different Types of Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 A Note on Definitions and Notation Throughout This Book . . . . . . . 1.3 Hardy–Weinberg Equilibrium (HWE) Proportions and Their Importance in Gene-Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Determination of Conditional Genotype Frequencies . . . . . . . . . . . . 1.4.1 Genetic Model-Free Approaches . . . . . . . . . . . . . . . . . . . . . . . 1.4.1.1 Locus Genotype Frequencies Follow HWE Proportions in Both Populations . . . . . . . . . . . . . . . . 1.4.1.2 Locus Genotype Frequencies Follow HWE Proportions in One Population but Not Both . . . . . 1.4.1.3 Locus Genotype Frequencies Follow HWE Proportions in Neither Population . . . . . . . . . . . . . . 1.4.2 Genetic Model-Based Approach Through the Use of Genotype Relative Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2.1 Genetic Model-Based Approach Through the Use of Logistic Model . . . . . . . . . . . . . . . . . . . . . 1.5 The Box (and Whiskers) Plot as a Tool for Visualizing Empirical Data Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Power and Minimum Sample Size (MSSN) for Different Statistical Tests of Genetic Association . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1 Contingency Table for Organizing Categorical Phenotype and Genomic-Data . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1.1 Formula for Chi-Square Test of Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1.2 (Cochran-Armitage) Test of Trend . . . . . . . . . . . . . . 1.6.1.3 The Transmission Disequilibrium Test for Detecting Linkage in the Presence of Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1.4 Computing Power and MSSN for Tests of Genetic Association . . . . . . . . . . . . . . . . . . . . 
. . . . 1.7 The Expectation–Maximization (EM) Algorithm . . . . . . . . . . . . . . . .

1 2 7 8 9 9 10 11 12 12 18 19 19 21 22 23

24 25 28 xv

xvi

Contents

1.7.1 Example Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.1.1 Implementation of the Algorithm . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28 30 40

2 Overview of Genomic Heterogeneity in Statistical Genetics . . . . . . . . . 2.1 Heterogeneity Due to SNP Genotype Misclassification . . . . . . . . . . . 2.2 Examples of How Genotype Misclassification May Arise in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Mathematical Models of Genotype Misclassification . . . . . . . . . . . . . 2.4 Genotype Misclassification for Genomic Data with Three or More Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Effects of Misclassification on Statistical Tests . . . . . . . . . . . . . . . . . . 2.5.1 Non-differential Misclassification Error . . . . . . . . . . . . . . . . . 2.5.2 Differential Misclassification Error . . . . . . . . . . . . . . . . . . . . . 2.5.3 Non-differential Misclassification in Family-Based Tests of Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 Errors in Next-Generation Sequencing (NGS) . . . . . . . . . . . . . . . . . . 2.6.1 Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.1.1 What Are Estimated NGS Probabilities for Empirical Data? . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.2 Mathematical Model for NGS Data . . . . . . . . . . . . . . . . . . . . . 2.6.3 Empirical Type I Error for Test Statistics Applied to NGS Data with Sequence Error—Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Non-misclassification Forms of Heterogeneity . . . . . . . . . . . . . . . . . . 2.7.1 Mathematical Model for Heterogeneity . . . . . . . . . . . . . . . . . . 2.7.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7.1.2 Mathematical Model for Locus Heterogeneity—Equations . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . .

53 53

3 Phenotypic Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Phenotype Misclassification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 How Phenotype Misclassification May Arise in Practice . . . . . . . . . 3.2.1 Lack of Access to Gold-Standard Classification . . . . . . . . . . 3.2.2 Variability of Phenotype Expression over Time . . . . . . . . . . . 3.2.3 Variable Age of Onset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Incomplete Knowledge of Gold-Standard Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.5 Model Misspecification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Effects of Misclassification on Statistical Tests . . . . . . . . . . . . . . . . . . 3.3.1 Non-differential Misclassification Error Example for Single-Stage Genetic Association . . . . . . . . . . . . . . . . . . . 3.3.2 Why Do We Observe Such Large Power Loss/MSSN Increase for Phenotype Misclassification? . . . . . . . . . . . . . . .

55 57 58 59 59 61 69 71 72 76 82

83 86 86 87 88 88 99 99 101 101 102 102 103 103 104 104 106

Contents

3.3.3 Multi-stage Phenotype Classification and Limits of Observed Genotype Frequencies . . . . . . . . . . . . . . . . . . . . . 3.3.3.1 Conditional Genotype Frequencies in Presence of Conditionally Independent Phenotype Classification . . . . . . . . . . . . . . . . . . . . . . 3.3.3.2 Conditional Genotype Frequencies in the Presence of Biased Phenotype Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Non-misclassification Forms of Heterogeneity . . . . . . . . . . . . . . . . . . 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Association Tests Allowing for Heterogeneity . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Statistical Tests that Use Genotype Data . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Likelihood Ratio Test that Allows for Random Phenotype and Genotype Misclassification Error (LRTae ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . 4.2.1.2 Log-Likelihoods of the Observed Data . . . . . . . . . . 4.2.1.3 Test Statistic—Likelihood Ratio Test Allowing for Error (LRTae ) . . . . . . . . . . . . . . . . . . . . 4.2.1.4 Example Application . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Trend Statistic that Allows for Random Phenotype and Genotype Misclassification Error . . . . . . . . . . . . . . . . . . . 4.2.2.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . 4.2.2.2 Log-Likelihoods of the Observed Data . . . . . . . . . . 4.2.2.3 Test Statistic—(Linear) Test of Trend Allowing for Error (LTTae ) . . . . . . . . . . . . . . . . . . . . 
4.2.3 Likelihood Ratio Statistic for Family-Based Association that Incorporates Genotype Misclassification Errors (TDTae ) . . . . . . . . . . . . . . . . . . . . . . . 4.2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.3.2 Determination of the Bayesian Posterior Probabilities (BPPs) τ (r) (abc)(xyz) . . . . . . . . . . . . . . . . . 4.2.3.3 TDTae Parameter Estimates . . . . . . . . . . . . . . . . . . . . 4.2.3.4 Log-Likelihood of Observed Data . . . . . . . . . . . . . . 4.2.3.5 The TDTae Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Statistical Tests that Consider Heterogeneity Other Than Misclassification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Mixture Likelihood Ratio Test (MLRT) for Genetic Association in the Presence of Locus Heterogeneity . . . . . . . 4.3.1.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . 4.3.1.2 Log-Likelihoods of the Observed Data . . . . . . . . . . 4.3.1.3 Example Application . . . . . . . . . . . . . . . . . . . . . . . . .

xvii

109

110

115 116 117 117 129 129 130

130 130 133 137 138 144 145 146 151

151 151 157 157 160 160 160 161 161 163 165

xviii

Contents

4.3.1.4 Computing the MLRT Statistic for the Example Data
4.3.2 Transmission Disequilibrium Test that Allows for Locus Heterogeneity (TDT-HET)
4.3.2.1 Notation
4.3.2.2 Determination of the BPPs $\tau^{(r)}_{(m)(abc)}$
4.3.2.3 TDT-HET Parameter Estimates
4.3.2.4 Log-Likelihood of Observed Data
4.3.2.5 Computing the TDT-HET Statistic
4.3.2.6 Example Calculation
4.3.2.7 How TDT-HET Permutation p-Values Are Computed
4.3.2.8 A Proof of the Robustness of the TDT-HET Statistic’s Type I Error When Null Data Are Drawn from Multiple Sub-populations
4.3.3 Tests that Incorporate Phenotype Heterogeneity
4.3.3.1 Analysis of Data with R (Greater Than Two) Phenotypes and C (Greater Than One) Genomic Data Categories
4.3.3.2 Example Application of Chi-Square Test of Independence to Multiple Phenotypes
4.3.3.3 Does Modeling Phenotypic Heterogeneity Increase Power for Detecting Association? Results from Example Data
4.3.3.4 Other Methods for Addressing Phenotype Heterogeneity
4.3.3.5 Morton’s M-Test for Heterogeneity Applied to Different Groups of Phenotypes
4.4 Statistical Tests that Use Sequence Data
4.4.1 Single-Variant and Multiple Variant Tests of Trend for Genetic Association that Allows for Random and Differential NGS Error LTTae,NGS
4.4.1.1 Notation and Definitions
4.4.1.2 Log-Likelihood of the Observed Data
4.4.1.3 LTTae,NGS Parameter Estimates
4.4.1.4 LTTae,NGS Statistic
4.4.2 Transmission Disequilibrium Test that Allows for Next-Generation Sequence Error (TDT1-NGS)
4.4.2.1 Notation and Definitions
4.4.2.2 Log-Likelihood of the Observed Data
4.4.2.3 TDT1-NGS Parameter Estimates
4.4.2.4 TDT1-NGS Statistic
4.4.2.5 Example Calculations
References


5 Designing Genetic Linkage and Association Studies that Maintain Desired Statistical Power in the Presence of Mixtures
5.1 Parameter Settings, for Example, Calculations
5.1.1 Example Parameter Settings to Compute Power for a Fixed Sample Size and Significance Level
5.1.2 Example Parameter Settings to Compute MSSN for a Fixed Power and Significance Level
5.2 Statistical Tests that Use Genotype Data
5.2.1 Power and MSSN for Population-Based Data in the Presence of Non-differential Genotype Misclassification
5.2.1.1 Example Power Calculation
5.2.2 Power and MSSN for Population-Based Data in the Presence of Non-differential Phenotype Misclassification
5.2.2.1 Example MSSN Calculation
5.2.3 Likelihood Ratio Test that Allows for Random Phenotype and Genotype Misclassification Error (LRTae)—Empirical Power
5.2.3.1 Genetic Model Parameters Determined Using Two Loci
5.2.3.2 Conditional Two-Locus Genotype Frequencies Based on Affection Status
5.2.3.3 Results of Simulations for LRTae Under Two-Locus Model
5.2.4 Trend Statistic that Allows for Random Phenotype and Genotype Misclassification Error
5.2.5 Family-Based Tests of Association—Analytic Solution to Increase in Rejection Rate for TDT in the Presence of Genotype Misclassification Errors
5.2.5.1 Non-centrality Parameter and Inflation in Rejection Rate
5.2.6 Family-Based Tests of Association—Analytic Solution to Increase in Rejection Rate for TDT in the Presence of Phenotype Misclassification Errors
5.2.6.1 Example MSSN Calculations for TDT in the Presence of Phenotype Misclassification
5.3 Statistical Tests that Consider Heterogeneity Other Than Misclassification
5.3.1 Sample Size Calculations in the Presence of Locus Heterogeneity—Population-Based Tests of Genetic Association
5.3.1.1 Example Power Calculation—Test of Trend


5.3.1.2 Factors that Most Significantly Influence MSSN
5.3.2 Power and Sample Size Calculations for Chi-Square Tests of Independence on Allele and Genotype Data for Phenotype Heterogeneity
5.3.3 Family-Based Test of Linkage/Association
5.3.3.1 Example MSSN Calculations for TDT in the Presence of Locus Heterogeneity
5.4 Power Calculations in the Presence of NGS Misclassification
5.4.1 Test of Trend Applied to Multiple NGS Data for SNP Loci
5.4.2 Increase in Interest for NGS Statistics
5.4.3 Empirical Null and Power Simulations for the LTTae,NGS Statistic
5.4.3.1 Empirical Type I Error (Null) Results
5.4.3.2 Empirical Power Results
5.4.3.3 Additional Investigation of Three Factors
5.4.4 Factors that Most Significantly Affect Power for NGS-Based TDT
References
6 Threshold-Selected Quantitative Trait Loci and Pleiotropy
6.1 Quantitative Trait Locus with Single Phenotype
6.1.1 Notation
6.1.2 Conditional Genotype Frequencies for Threshold-Selected Phenotypes
6.1.3 Example Sample Size Calculation for Threshold-Selected Phenotypes
6.1.4 Why Use Threshold-Selected Dichotomous Phenotypes as Compared with Quantitative Phenotypes? Power Comparison with ANOVA
6.2 Quantitative Trait Locus with Multiple Phenotypes
6.2.1 Notation for Multivariate Quantitative Traits
6.2.2 Methods
6.2.3 Thresholds
6.2.4 Example MSSN Calculation
6.2.5 A Final Note on Advantages of the Threshold-Selected Approach
References


Bibliography
Index

Chapter 1

Introduction to Heterogeneity in Statistical Genetics

We start with the definition of heterogeneity and how this concept applies to genetic data. The remainder of the chapter focuses primarily on genetic modeling under homogeneity. We provide formulas for important genetic concepts like Hardy–Weinberg Equilibrium, haplotype and diplotype frequencies, genetic model-free and model-based methods for computing conditional genotype frequencies, threshold-selected methods for computing conditional genotype frequencies, test statistics considered in this book, non-centrality parameters, and an introduction to the EM algorithm.

The Merriam-Webster Dictionary defines heterogeneous as “consisting of dissimilar or diverse ingredients or constituents; mixed” [1]. The word is derived from the Greek word heterogenēs, where heteros means ‘other’ and genos ‘a kind’. For our purposes, heterogeneity means any kind of mixture. Data include, but are not limited to, alleles, genotypes, multi-locus genotypes, phenotypes, multiple phenotypes, quantitative traits, longitudinal data, copy number polymorphisms, or misclassified categories. Our only requirement is that the samples fall into two or more categories, or are drawn from two or more distributions. The categories/distributions may be known or unknown a priori; in most situations, however, the category or distribution assignments will be unknown.

Interestingly, the National Library of Medicine’s Genetics Home Reference [2] does not define the term heterogeneity directly. Rather, it defines allelic heterogeneity, locus heterogeneity, and genetic heterogeneity, and also provides a list of genes where some type of heterogeneity has been observed. Among the three terms presented, the most general definition is for genetic heterogeneity, defined as “A single disorder, trait, or pattern of traits caused by genetic factors in some cases and non-genetic factors in others” [3].
In what follows, we express each type of heterogeneity, or mixture, by indicating the unique categories into which each sampling unit (e.g., case, affected individual, family with affected individuals) is placed.

© Springer Nature Switzerland AG 2020 D. Gordon et al., Heterogeneity in Statistical Genetics, Statistics for Biology and Health, https://doi.org/10.1007/978-3-030-61121-7_1


1.1 Different Types of Heterogeneity

Locus heterogeneity This heterogeneity is the type where a genotype or genotypes in any of a number of genes will produce the same phenotype. The mixture categories are the different genes that produce the same disease phenotype. We illustrate our definition of locus heterogeneity in Fig. 1.1. Examples of locus heterogeneity include hereditary breast cancer [4], non-syndromic hearing loss [5], retinitis pigmentosa [6], and childhood epileptic encephalopathies [7], among several others [8–11]. For an exhaustive list, see Online Mendelian Inheritance in Man (OMIM) [12]. In Fig. 1.1, the disease phenotype is the presence of either breast or ovarian cancer. This phenomenon is an example of locus heterogeneity, since gene BRCA1 is located on Human Chromosome 17 [13], and BRCA2 is located on Human Chromosome 13 [14]. Yet, mutations in BRCA1 can cause the disease phenotype in one subset of families (1–3), while mutations in BRCA2 can cause the same disease phenotype in another subset of families (4–5) (Fig. 1.1). While it is possible that males may inherit a mutation and may show some disease phenotype, for simplicity, we restrict our attention to breast and ovarian cancer in the females in each family. If this were an actual study, simply reviewing these pedigrees would most likely not provide us with any information about locus heterogeneity. It is for this reason that it became critical to develop statistical methods that could detect locus heterogeneity from pedigrees genotyped at multiple markers. For examples of such methods, see Refs. [15–25], among others.

Fig. 1.1 Example of locus heterogeneity in fictitious pedigrees where affected females have either breast or ovarian cancer. We present pedigree drawings of five different families whose pedigree structures are fictitious (not based on any real data). Circles represent females and squares represent males. Those with black fill have either breast or ovarian cancer (the disease phenotype for our purposes), those with no fill do not have the disease phenotype, and those with gray fill have unknown phenotype status. Horizontal lines between a male and a female with a perpendicular vertical line below (forming a “T”) indicate a mating, and the vertical line connected to a horizontal line below it indicates the offspring of a mating. For example, in Family 1, Individuals 3 and 4 mated and produced Individuals 5, 6, 7, and 8 as offspring. Similarly, in Family 5, Individuals 4 and 5 mated and produced Individuals 6 and 7.

Phenotype heterogeneity In this situation, a genotype or genotypes in a single gene cause multiple different phenotypes. That is, genotypes in a single gene produce a mixture of phenotypes. The mixture categories are the unique phenotypes i, 1 ≤ i ≤ m. We present this concept pictorially in Fig. 1.2. This definition may be modified to consider a larger set of genes, but the key concept is that a fixed set of underlying genotypes in a set of genes leads to different phenotypes.

Fig. 1.2 Our model of phenotype heterogeneity

While precision dictates that we use terms like “phenotype category”, “genotype category”, or “quantitative trait value” for observed or true measurements of an individual, frequently we shall abbreviate with the terms “phenotype”, “genotype”, and “trait value”. We do this to be consistent with the common parlance of our field. When we anticipate that greater clarity is needed, we shall use fuller terms. Also, we shall use the term “true” in a mathematical/statistical context to mean that the values provided are the correct values, or in some situations, the most accurate values.

As stated by Paaby and Rockman [26], phenotype heterogeneity may also be referred to as pleiotropy. Lobo [27] reported that Gregor Mendel’s observations regarding the relationship among seed coat color and flower and axil pigmentation were examples of pleiotropy.
Definition of phenotype in this book Phenotype heterogeneity will always refer to at least three different groups; one of these groups may be the “unaffected” or “control” group, but there will always be at least two other “affected” or “case” groups. Given our definition above, this condition implies that m ≥ 3 in Fig. 1.2. If we did not require this condition, then many classic single-major gene disorders like cystic fibrosis, Tay-Sachs, or Huntington’s disease, for which individuals are traditionally classified as “affected” or “unaffected” [12, 17, 28, 29], would qualify as displaying phenotypic heterogeneity.

There have been several papers that consider phenotype heterogeneity in genetics, particularly in the last decade. If we restrict attention to human phenotypes, there have been at least 40 publications that include the term “phenotype heterogeneity” in the title or abstract. Chronologically, these publications consider phenotypes from hemochromatosis and iron overload to systemic lupus erythematosus [30–70]. When considering all species, the number of publications is considerably larger.

One of the key types of heterogeneity considered in this book is misclassification. One definition of the verb classify is “to consider (someone or something) as belonging to a particular group” [71]. The Merriam-Webster Online Dictionary defines misclassify as “to assign (someone or something) to an incorrect group or category; to classify wrongly” [72]. This definition implies that there is a correct or true category. We will make use of this concept throughout the book; notably, because the majority of the book deals with categorical data, errors will mean misclassification error.

Phenotype misclassification In this situation and in the genotype misclassification description below, we consider categorical data. Observed phenotypes are a mixture of correctly classified phenotypes (the observed phenotype classification is the same as the true phenotype classification) and misclassified phenotypes (the observed and true phenotype classifications are not the same). We present an example in Fig. 1.3. In Fig. 1.3, we see that Subjects 1 and 3 are correctly classified; Subject 1’s true phenotype is a yellow Labrador (lab) (darker fur coat), and its observed phenotype classification is also a yellow lab. The same holds true for Subject 3, although for that subject, the true phenotype is a white lab (lighter fur coat). Subjects 2 and 4 are incorrectly (or mis-)classified. Subject 2’s true phenotype is that of a white lab, but its observed phenotype classification is that of a yellow lab. For Subject 4, we have the reverse situation; the lab’s true phenotype is that of a yellow lab, but its observed phenotype classification is that of a white lab. While the American Kennel Club and other organizations worldwide recognize only three coat colors in the lab (black, chocolate, and yellow) [74], some authors have written on the “white Labrador” (e.g., [75]). Our example is for illustrative purposes.

Genotype misclassification Here, observed genotypes are a mixture of correctly and incorrectly classified genotypes. We present an example in Table 1.1. Studying Table 1.1, we see that Subjects 1 and 2 have correctly classified genotypes; their observed genotype calls are the same as their true genotypes. Subjects 3 and 4 have misclassified genotype calls. For Subject 3, the true genotype is AT (heterozygote), while the recorded (observed) genotype is TT (homozygote). Similarly, the true genotype for Subject 4 is AA (homozygote), while the recorded genotype is AT (heterozygote).

[Figure 1.3 image: Subject Numbers 1–4, each shown with an Observed Phenotype and a True Phenotype]

Fig. 1.3 Phenotype misclassification (fictitious example). We present example classifications of the Yellow and White Labrador Retriever coat colors (species Canis lupus familiaris, breed Labrador Retriever). The colors refer to the coat colors of the different retrievers. In this example, the observed phenotype is made by a human, whereas the true phenotype is the “true” coat color of the animal (that is, defined by the authors). The cartoon images were inspired by an online drawing [73]

Table 1.1 Example of genotype misclassification for fictitious data set

Subject   Observed genotype   True genotype
1         AA                  AA
2         AT                  AT
3*        TT                  AT
4*        AT                  AA

Here, we specify that there is a single-nucleotide polymorphism (SNP) locus with two alleles, A (adenine) and T (thymine), in the population under study. In this fictitious example, there are four subjects for whom the observed genotype “call” (i.e., classification) is made by a human, typically by means of a technology designed for calling genotypes. The true genotype is the one that the person has in their DNA. Rows marked with an asterisk are the ones where misclassification has taken place

It is important to note that we can extend the misclassification to more than two categories for phenotypes and more than three categories for genotypes. For phenotypes, it has long been a practice to subtype diseased individuals based on the severity of disease (see, e.g., Refs. [76–84] for a sampling of publications since 2014). For genotypes, Sobel et al. [85] and Douglas et al. [86] extended mathematical models of genotype misclassification to microsatellite markers (more in Chap. 2). Furthermore, Levenstien et al. [87] and others [88–92] considered misclassification of multi-locus haplotypes in tests of genetic association, and other authors considered the effects of misclassification with multi-locus genotypes on (i) classification of individuals into different sub-populations [93]; and (ii) robustness of statistical tests of association (where the multi-locus genotypes are treated as latent variables) [94, 95].

Wikipedia’s definition of “genetic association” is the scenario where “one or more genotypes within a population co-occur with a phenotypic trait more often than would be expected by chance occurrence” [96]. The primary purpose of this book is an investigation of statistics that test for genetic association.

Coding for single-nucleotide polymorphism (SNP) genotypes Throughout, we shall use the codes 0, 1, and 2 to refer to SNP genotypes with 0, 1, or 2 copies (respectively) of a pre-specified allele (sometimes referred to as a reference allele). For example, in Table 1.1, we have a SNP with two alleles in the population, A and T. If we choose A as the reference allele, then we may use the codes 0, 1, and 2 to refer to genotypes TT, AT, and AA, respectively. Also, we extend this coding to multi-locus genotypes. For example, if the SNP just mentioned is the first of a set of two SNPs, where the second SNP has alleles C and G, with G being the reference allele at the second SNP locus, then there are nine possible two-locus genotypes: (TT, CC), (TT, CG), (TT, GG), (AT, CC), (AT, CG), …, (AA, GG). These pairs are coded as (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), …, (2, 2).
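The coding scheme just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names `code_genotype` and `code_multilocus` are hypothetical, and genotypes are assumed to be two-character strings such as "AT".

```python
from itertools import product


def code_genotype(genotype: str, reference_allele: str) -> int:
    """Return the number of copies (0, 1, or 2) of the reference allele."""
    if len(genotype) != 2:
        raise ValueError("a SNP genotype must consist of exactly two alleles")
    return genotype.count(reference_allele)


def code_multilocus(genotypes, reference_alleles):
    """Code a multi-locus genotype locus by locus, e.g. ("AT", "CG") -> (1, 1)."""
    return tuple(code_genotype(g, r) for g, r in zip(genotypes, reference_alleles))


# Single-locus coding with A as the reference allele (alleles from Table 1.1).
single_locus = {g: code_genotype(g, "A") for g in ("TT", "AT", "AA")}

# Two-locus coding: reference allele A at the first SNP, G at the second SNP.
first_snp = ("TT", "AT", "AA")
second_snp = ("CC", "CG", "GG")
two_locus = [code_multilocus(pair, ("A", "G")) for pair in product(first_snp, second_snp)]
```

Counting copies of the reference allele, rather than looking genotypes up in a fixed table, means the same function also handles a heterozygote written in either allele order (for example "TA" as well as "AT").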
Mixture of quantitative trait distributions In this situation, we specify that, for a locus with two alleles (a disease or “increaser” allele (the reference allele) and a “null” allele) and thus with three genotypes, coded 0, 1, and 2 (corresponding to the number of copies of the disease allele in the respective genotype), the probability density function (pdf) of a given quantitative trait for individuals with genotype $j$, $0 \le j \le 2$, is given by $N(\mu_j, \sigma_R^2)$, the univariate normal pdf with mean $\mu_j$ and variance $\sigma_R^2$. Using our example alleles in Table 1.1, if we specify that the A allele is the increaser/reference allele and the T is the null allele, then the coded genotype correspondences are 0 = TT, 1 = AT, and 2 = AA. The means and variances may be computed through specification of other genetic-model parameters. We provide more information on this computation in Chap. 6. In Fig. 1.4, we provide an example of the mixture of the three pdfs corresponding to the three genotypes (the parameter settings are from no particular study; that is, they are fictitious). For this mixture of distributions to be a probability density function, the weighted sum of the component pdfs must integrate to 1.0. To achieve that, we weight each pdf by the corresponding genotype frequency. In this work, we use the respective genotype frequencies determined by Hardy–Weinberg Equilibrium (provided in Sect. 1.3) as weights.
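As a rough numerical check of the statement that the weighted mixture is itself a pdf, the sketch below builds the three-component mixture and integrates it with a simple trapezoid rule. The parameter values are the fictitious ones reported in the caption of Fig. 1.4; the function names and the integration limits are assumptions made for this illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Univariate normal density N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def hwe_weights(p):
    """HWE frequencies for genotypes 0, 1, 2 given increaser allele frequency p."""
    return ((1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2)


def mixture_pdf(x, means, sd, weights):
    """Weighted sum of the genotype-specific normal densities."""
    return sum(w * normal_pdf(x, m, sd) for w, m in zip(weights, means))


def trapezoid_integral(f, lower, upper, steps=20000):
    """Plain trapezoid rule; adequate for a smooth, rapidly decaying density."""
    h = (upper - lower) / steps
    total = 0.5 * (f(lower) + f(upper))
    for k in range(1, steps):
        total += f(lower + k * h)
    return total * h


means = (-1.07, 0.00, 1.35)   # mu_0, mu_1, mu_2 from the Fig. 1.4 caption
sigma_r = 0.50                # common standard deviation sigma_R
weights = hwe_weights(0.5)    # (0.25, 0.50, 0.25)

# Limits of -6 and 6 are an assumption; both lie more than nine standard
# deviations beyond the outermost component means.
area = trapezoid_integral(lambda x: mixture_pdf(x, means, sigma_r, weights), -6.0, 6.0)
```

Because the weights sum to 1 and each component integrates to 1, the computed area should agree with 1.0 up to numerical error.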


Fig. 1.4 Example histogram of QT distribution mixtures. In this figure, we provide the locations of the three means $\mu_j$ by way of arrows that point to their QT values on the horizontal axis. The means are $\mu_0 = -1.07$, $\mu_1 = 0.00$, $\mu_2 = 1.35$. Also, the length of the horizontal line segment (0.50) indicates the value of $\sigma_R$. This standard deviation is the same for all three distributions. Finally, the probabilities of observing an individual with a particular genotype (distributional weights) are $w_0 = (1 - 0.5)^2 = 0.25$, $w_1 = 2(0.5)(1 - 0.5) = 0.50$, and $w_2 = (0.5)^2 = 0.25$. These weights correspond to Hardy–Weinberg Equilibrium proportions, where the increaser (disease) allele frequency is $p = 0.5$. The three colored distributions indicate the QT pdfs for the respective genotypes (red for genotype 0, blue for genotype 1, teal for genotype 2)

1.2 A Note on Definitions and Notation Throughout This Book

Because we require a number of definitions for our work and make use of a fair amount of notation, we choose to keep the definitions and notation “local”. That is, for a given section, if we use a new definition or introduce some notation, we will place it in the section where we use it. We may need to use the same or similar notation in other sections of the book.


1.3 Hardy–Weinberg Equilibrium (HWE) Proportions and Their Importance in Gene-Mapping

The principle of Hardy–Weinberg Equilibrium (HWE) is one of the most important in genetics [97]. The conditions for a population to achieve HWE at a locus are summed up in a term called panmixia. In this work, we define a population (i.e., individuals who have the capacity to breed and produce offspring) to be in panmixia if it satisfies the following six conditions:

1. The population is infinitely large (in practice, very large).
2. There is no assortative mating (see below).
3. There is no genetic drift.
4. There is no migration among the population and other populations.
5. There is no mutation.
6. Individuals in each generation only mate with others in the same generation.

The Merriam-Webster Dictionary defines assortative mating [98] as “nonrandom mating, such as (a) mating between the more similar individuals of a population especially when regarded as a factor in evolutionary differentiation within a population; or (b) selective mating between individuals whose choice of marriage partners is determined by similarity of social environment”.

Here, we provide the formula we use throughout this work. Suppose we have a locus that has alleles $a_1, a_2, \ldots, a_n$ with respective population frequencies $p_1, p_2, \ldots, p_n$. We have $\sum_{i=1}^{n} p_i = 1$. Given the $n$ alleles, we have $n$ homozygous genotypes given by $a_1a_1, a_2a_2, \ldots, a_na_n$, and $\binom{n}{2} = \frac{n(n-1)}{2}$ heterozygous genotypes given by $a_1a_2, a_1a_3, \ldots, a_1a_n, a_2a_3, a_2a_4, \ldots, a_2a_n, \ldots, a_{n-1}a_n$. In total, we have $n + \frac{n(n-1)}{2} = \binom{n+1}{2}$ genotypes. The HWE law enables us to write the $\binom{n+1}{2}$ genotype frequencies in terms of $n - 1$ allele frequencies. The key to this simplification is the assumption that alleles in a given genotype are transmitted independently from one generation to the next. Mathematically, we can determine the genotype frequencies $g_{a_i a_j}$ for the $a_i a_j$ genotype, $1 \le i \le j \le n$, by expanding the term $(p_1 + p_2 + \cdots + p_n)^2$ and summing like terms. We have

$$(p_1 + p_2 + \cdots + p_n)^2 = \sum_{i=1}^{n} \sum_{j \ge i} c_{ij}\, p_i p_j,$$

where

$$c_{ij} = \begin{cases} 1, & i = j, \\ 2, & i < j. \end{cases}$$

The HWE principle states that we have the equality $g_{a_i a_j} = c_{ij}\, p_i p_j$ for $1 \le i \le j \le n$. Note that, for the situation we consider most frequently in this work, that of $n = 2$ alleles, we have

$$g_{a_1 a_1} = (p_1)^2, \quad g_{a_1 a_2} = 2 p_1 p_2, \quad g_{a_2 a_2} = (p_2)^2. \tag{1.1}$$
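The general formula translates directly into code. The following is a minimal sketch, assuming allele frequencies are supplied as a dictionary; the function name `hwe_genotype_frequencies` and the example frequencies are hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement
import math


def hwe_genotype_frequencies(allele_freqs):
    """Map each unordered genotype (a_i, a_j), i <= j, to c_ij * p_i * p_j."""
    if not math.isclose(sum(allele_freqs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    alleles = list(allele_freqs)
    frequencies = {}
    # combinations_with_replacement yields every pair with i <= j exactly once.
    for a, b in combinations_with_replacement(alleles, 2):
        c = 1 if a == b else 2  # homozygotes once, heterozygotes twice
        frequencies[(a, b)] = c * allele_freqs[a] * allele_freqs[b]
    return frequencies


# n = 2 alleles: reproduces the three terms of Eq. (1.1).
two_alleles = hwe_genotype_frequencies({"A": 0.3, "T": 0.7})

# n = 3 alleles: n(n + 1)/2 = 6 genotypes whose frequencies sum to 1.
three_alleles = hwe_genotype_frequencies({"a1": 0.5, "a2": 0.3, "a3": 0.2})
```

Enumerating pairs with `combinations_with_replacement` mirrors the double sum over $j \ge i$, so each heterozygote appears once and carries the factor of 2.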

One of the most important results regarding HWE in a population is that, after one generation of panmixia, the genotype frequencies for any locus genotyped in the population will have HWE proportions, even if the genotype frequencies for the locus did not have HWE proportions in the initial generation. Furthermore, as long as the population satisfies the conditions of panmixia, all of the loci individually will maintain HWE proportions. Because deviations from panmixia are less common in human genetics studies with large sample sizes (say, greater than 1,000 individuals), one does not expect to see deviations from HWE proportions too frequently among genotyped loci. In Chap. 2, we present a standard goodness-of-fit test to evaluate whether the estimated genotype frequencies from a sample of individuals are consistent with HWE proportions. Beyond its importance in population genetics, the HWE principle is critically important in statistical genetics [99]. Among its many applications are dimensionality-reduction for parameter specification (one may use allele frequencies rather than genotype frequencies) and genotype misclassification error checking [100–106].
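The one-generation result can be illustrated numerically for a di-allelic locus. The sketch below is an idealized calculation under the panmixia conditions (random mating, Mendelian transmission, no sampling noise), not a simulation from the book; the starting frequencies and function names are assumptions chosen to be far from HWE proportions.

```python
def allele_frequency(genotype_freqs):
    """Frequency of allele A given genotype frequencies (g_AA, g_AT, g_TT)."""
    g_aa, g_at, _ = genotype_freqs
    return g_aa + 0.5 * g_at


def next_generation(genotype_freqs):
    """Offspring genotype frequencies (AA, AT, TT) after one round of random mating."""
    # Probability that a parent with genotype AA, AT, or TT transmits allele A.
    transmit_a = (1.0, 0.5, 0.0)
    g_aa = g_at = g_tt = 0.0
    for f_mother, t_m in zip(genotype_freqs, transmit_a):
        for f_father, t_f in zip(genotype_freqs, transmit_a):
            pair = f_mother * f_father  # random mating: parental genotypes independent
            g_aa += pair * t_m * t_f
            g_at += pair * (t_m * (1.0 - t_f) + (1.0 - t_m) * t_f)
            g_tt += pair * (1.0 - t_m) * (1.0 - t_f)
    return (g_aa, g_at, g_tt)


start = (0.5, 0.0, 0.5)        # no heterozygotes at all: clearly not HWE
gen1 = next_generation(start)  # one generation of random mating
gen2 = next_generation(gen1)   # a further generation leaves the proportions unchanged
p = allele_frequency(start)    # allele frequency is preserved across generations
```

With these inputs the allele frequency is 0.5, so the first offspring generation should match the HWE proportions $(p^2, 2p(1-p), (1-p)^2)$ and stay there in the next generation.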

1.4 Determination of Conditional Genotype Frequencies

We list several ways to compute conditional genotype frequencies. These frequencies are of considerable importance, especially in the area of study design (e.g., Chap. 6).

1.4.1 Genetic Model-Free Approaches

In this book, we consider multiple ways of specifying conditional genotype frequencies in a model-free way. By “genetic model-free”, we mean that the genotype frequencies are specified directly, without reference to any genetic model parameters like mode of inheritance. There have been a number of review articles that consider the term “genetic model free” in both linkage [107–111] and association analyses [110–122].

Regarding linkage analysis, Knapp et al. [123] derived a correspondence between a one-parameter genetic model-free affected-sib-pair (ASP) test and the genetic model-based recessive LOD statistic. Huang and Vieland [124] extended that work to show a correspondence between a two-parameter ASP test and the recessive heterogeneity LOD (also called the “H-LOD” or “HLOD”) test (see page 165). These publications add insight into the genetic model-free/genetic model-based distinction. Of note, the work by Huang and Vieland shows that in the presence of locus heterogeneity, standard ASP tests show losses in power equivalent to what one would have if using a homogeneity LOD for a heterogeneous disease. See Vieland and Logue [125] for related work.

In what follows, we maintain our example (fictitious) locus with alleles A and T and corresponding frequencies $p_A$, $p_T$, and genotypes AA, AT, TT. We extend our notation for frequencies to case/affected and control/unaffected populations by using the subscript 0 for controls and 1 for cases. Thus, the A allele frequency in the control population is $p_{A0}$, and the T allele frequency is $p_{T0} = 1 - p_{A0}$. Likewise, the A and T allele frequencies in the case population are $p_{A1}$ and $p_{T1} = 1 - p_{A1}$, respectively.

From this point forward, we will use the notation $g_{ji}$ to indicate the conditional genotype frequencies in the case and control populations. Specifically,

$$g_{ji} = \frac{\Pr(\text{genotype} = j,\ \text{phenotype} = i)}{\Pr(\text{phenotype} = i)} = \Pr(\text{genotype} = j \mid \text{phenotype} = i). \tag{1.2}$$

For genotypes, $j$ = 0, 1, or 2 indicates the number of A alleles in the genotype in our example. The index $j = 0$ refers to the TT genotype, $j = 1$ refers to the AT genotype, and $j = 2$ refers to the AA genotype. We note here and reiterate below that for genotype coding, the reference allele (i.e., the one for which the homozygous genotype code is 2) is one of the following: the allele with the smaller frequency (the minor allele), the risk allele, or the increaser allele (see above: Sect. 1.1, Mixture of quantitative trait distributions). For phenotypes, as suggested above, $i = 0$ indicates the phenotype is control/unaffected, and $i = 1$ indicates the phenotype is case/affected.
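To make the indexing in Eq. (1.2) concrete, the sketch below applies the definition to a table of joint counts, treating sample proportions as stand-ins for the probabilities. The counts are invented for illustration and the function name is hypothetical; nothing here is taken from a real data set.

```python
def conditional_genotype_frequencies(counts):
    """Apply Eq. (1.2) to joint counts.

    counts[i][j] is the number of individuals with phenotype i and genotype j.
    The result is indexed as g[j][i] = Pr(genotype = j | phenotype = i),
    matching the subscript order g_ji used in the text.
    """
    total = sum(sum(row) for row in counts)
    n_genotypes = len(counts[0])
    g = [[0.0] * len(counts) for _ in range(n_genotypes)]
    for i, row in enumerate(counts):
        pr_phenotype = sum(row) / total       # Pr(phenotype = i)
        for j, n in enumerate(row):
            pr_joint = n / total              # Pr(genotype = j, phenotype = i)
            g[j][i] = pr_joint / pr_phenotype
    return g


# Hypothetical counts; columns are genotypes 0 (TT), 1 (AT), 2 (AA).
counts = [
    [490, 420, 90],    # i = 0: controls
    [360, 480, 160],   # i = 1: cases
]
g = conditional_genotype_frequencies(counts)
```

For each phenotype $i$, the three values $g_{0i}$, $g_{1i}$, $g_{2i}$ sum to 1, which is why only two of them need to be specified in the parameterizations that follow.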

1.4.1.1 Locus Genotype Frequencies Follow HWE Proportions in Both Populations

The first genetic model parameterization requires only two specified parameters for our example locus. They are $p_{A0}$ (the A allele frequency in the control population) and $p_{A1}$ (the A allele frequency in the case population). Given the condition that, in each population, the locus genotype frequencies follow HWE proportions, we have

$$g_{00} = (1 - p_{A0})^2, \quad g_{10} = 2 p_{A0}(1 - p_{A0}), \quad g_{20} = (p_{A0})^2, \tag{1.3a}$$

$$g_{01} = (1 - p_{A1})^2, \quad g_{11} = 2 p_{A1}(1 - p_{A1}), \quad g_{21} = (p_{A1})^2. \tag{1.3b}$$

Note that every conditional genotype frequency in Eqs. (1.3a) and (1.3b) may be written in terms of the parameters $p_{A0}$ and $p_{A1}$. A benefit of this model is its simplicity. For a SNP locus, only two parameters are needed. A potential limitation is that not every disease locus follows HWE proportions in each population, particularly the case population. For example, Wittke-Thompson et al. documented several possible disease models for which the case population does not follow HWE proportions [126].
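A minimal sketch of this two-parameter specification follows. The function names and the allele frequencies 0.30 and 0.40 are assumptions chosen for illustration; note that the returned structure is indexed by phenotype first, `g[i][j]`, whereas the text writes the subscripts as $g_{ji}$.

```python
def hwe_proportions(p_a):
    """HWE frequencies for genotypes 0 (TT), 1 (AT), 2 (AA) given the A allele frequency."""
    return ((1.0 - p_a) ** 2, 2.0 * p_a * (1.0 - p_a), p_a ** 2)


def conditional_frequencies_both_hwe(p_a0, p_a1):
    """Eqs. (1.3a) and (1.3b): key 0 holds controls, key 1 holds cases."""
    return {0: hwe_proportions(p_a0), 1: hwe_proportions(p_a1)}


# Hypothetical allele frequencies: 0.30 in controls, 0.40 in cases.
g = conditional_frequencies_both_hwe(0.30, 0.40)
```

Under these assumed inputs the control frequencies are approximately (0.49, 0.42, 0.09) and the case frequencies approximately (0.36, 0.48, 0.16).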

1.4.1.2

Locus Genotype Frequencies Follow HWE Proportions in One Population but Not Both

The second genetic model parameterization requires three specified parameters. While we may choose the population whose genotype frequencies follow HWE proportions (control or case), in this example, we select the control population. One reason is that, while there are diseases where the case genotype frequencies depart from HWE proportions, the control genotype frequencies tend to follow HWE proportions (e.g., [126]). Interestingly, genetic analysis programs like PLINK [127] and PEDSTATS [128] allow for HWE checks in both case- and control-only populations. In this way, the researcher may choose the population (case and/or control) for which the HWE constraint may be applied. The parameters for this model are pA0 (the A allele frequency in the control population), g01 (the TT genotype frequency in the case population), and g21 (the AA genotype frequency in the case population). We may choose any two of the three conditional genotype frequencies in the case population for our parameter specifications since the sum of all three frequencies is 1.0. We have

\[ g_{00} = (1 - p_{A0})^2, \quad g_{10} = 2 p_{A0}(1 - p_{A0}), \quad g_{20} = p_{A0}^2, \tag{1.4a} \]

\[ g_{01} = g_{01}, \quad g_{11} = 1 - g_{01} - g_{21}, \quad g_{21} = g_{21}. \tag{1.4b} \]


Note that every conditional genotype frequency in Eq. (1.4a) may be written in terms of the single parameter pA0. While this model requires more parameters than the model in Sect. 1.4.1.1, one potential benefit is that the frequencies specified here may be more realistic; case-genotype frequencies may not display HWE proportions.

1.4.1.3 Locus Genotype Frequencies Follow HWE Proportions in Neither Population

For this model, we consider only the conditional genotype frequencies. There are four parameters: g00 (the TT genotype frequency in the control population), g20 (the AA genotype frequency in the control population), g01 (the TT genotype frequency in the case population), and g21 (the AA genotype frequency in the case population). As before, the heterozygote frequencies are determined by the condition that the frequencies in each population sum to one:

\[ g_{10} = 1 - g_{00} - g_{20}, \quad g_{11} = 1 - g_{01} - g_{21}. \tag{1.5} \]
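The three model-free parameterizations of Sects. 1.4.1.1 to 1.4.1.3 can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names and the numerical values are assumptions chosen for the example, and genotypes are coded 0, 1, 2 as the number of A alleles.

```python
def hwe_genotype_freqs(p_a):
    """Return (g0, g1, g2), coded by the number of A alleles, under HWE."""
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return ((1.0 - p_a) ** 2, 2.0 * p_a * (1.0 - p_a), p_a ** 2)


def complete_genotype_freqs(g0, g2):
    """Fill in the heterozygote frequency from the two homozygote frequencies."""
    g1 = 1.0 - g0 - g2
    if min(g0, g1, g2) < 0.0:
        raise ValueError("frequencies must be non-negative and sum to 1")
    return (g0, g1, g2)


def model_free_freqs(control, case):
    """Build the conditional genotype frequencies for controls and cases.

    Each argument is either one allele frequency (HWE assumed in that
    population) or a pair (TT frequency, AA frequency) when HWE is not assumed.
    """
    def build(spec):
        if isinstance(spec, (int, float)):
            return hwe_genotype_freqs(float(spec))
        return complete_genotype_freqs(*spec)

    return {"control": build(control), "case": build(case)}


# Hypothetical parameter values, for illustration only.
both = model_free_freqs(0.20, 0.30)                      # Eqs. (1.3a), (1.3b)
one = model_free_freqs(0.20, (0.45, 0.15))               # Eqs. (1.4a), (1.4b)
neither = model_free_freqs((0.60, 0.05), (0.45, 0.15))   # Eq. (1.5)
```

The number of free parameters (two, three, or four) is visible directly in the arguments passed to `model_free_freqs`.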

1.4.2 Genetic Model-Based Approach Through the Use of Genotype Relative Risks

We begin this section by providing some notation (unless that notation was provided above). Consider the following parameters:

φi, i = 0, 1 = Population prevalence of the phenotype of interest; this value is the proportion of the population that displays the particular phenotype. Unless otherwise stated, the value i = 0 corresponds to the control or unaffected/healthy population, and i = 1 corresponds to the case or affected/disease population. Because every person is either affected or unaffected, we have φ0 = 1 − φ1.

pd = Disease (risk, trait) allele frequency; here, we specify that there is a di-allelic SNP locus whose genotypes determine the probability that an individual displays the disease.

p+ = 1 − pd = Wild-type allele frequency; because the SNP is di-allelic, the genotypes at the locus may be written as "++", "+d", and "dd".

fj, 0 ≤ j ≤ 2 = Penetrance for the jth genotype. Specifically, f0 = Pr(affected | ++ genotype), f1 = Pr(affected | +d genotype), f2 = Pr(affected | dd genotype). Here, the term "affected" means that an individual displays the disease phenotype. Also, we specify that every individual is either affected or unaffected, so that Pr(unaffected | ++ genotype) = 1 − f0, Pr(unaffected | +d genotype) = 1 − f1, Pr(unaffected | dd genotype) = 1 − f2. As with the term "affected", the term "unaffected" means that an individual does not display the disease phenotype.

As an application of the genotype coding convention above (Coding for SNP Genotypes), we use the coding "0, 1, or 2" to represent the genotypes ++, +d, and dd, respectively; that is, each integer is the number of disease (risk) alleles in the respective genotype. This coding can be extended to non-disease loci (by choosing a reference allele), to loci that have more than two alleles, and to multi-locus genotypes. What is required is that the risk or reference allele(s) at each locus are identified.

Rj = fj / f0 (j = 1, 2) = Heterozygote (R1) and homozygote (R2) genotype relative risks [129] for the trait locus.

We comment that the penetrance values may be written as functions of the genotype relative risks and the disease allele frequencies, using the HWE proportions. Given the definition of conditional probability and the law of total probability, we compute:

\[
\begin{aligned}
\phi_1 &= f_0 (p_+)^2 + 2 f_1 (p_+)(p_d) + f_2 (p_d)^2 \\
       &= f_0 (p_+)^2 + 2 R_1 f_0 (p_+)(p_d) + R_2 f_0 (p_d)^2.
\end{aligned}
\tag{1.6}
\]

It follows that

\[ f_0 = \frac{\phi_1}{(p_+)^2 + 2 R_1 (p_+)(p_d) + R_2 (p_d)^2}, \quad f_1 = R_1 f_0, \quad f_2 = R_2 f_0. \tag{1.7} \]

Note that the penetrance values in Eq. (1.7) may be written in terms of the trait allele frequency pd (since p+ = 1 − pd), the prevalence φ1, and the genotype relative risks R1 and R2. This relationship will become useful in later chapters, when we consider (statistical) power and sample size calculations. Among the more common genetic modes of inheritance defined with genotype relative risk constraints are

\[
\begin{aligned}
R_1 &= R_2 && \text{(dominant mode of inheritance)}, \\
(R_1)^2 &= R_2 && \text{(multiplicative mode of inheritance)}, \\
R_1 &= 1 && \text{(recessive mode of inheritance)}.
\end{aligned}
\tag{1.8}
\]
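As a small numerical companion to Eqs. (1.6) to (1.8), the following Python sketch recovers the three penetrances from the prevalence, the disease allele frequency, and the two genotype relative risks. The function name and the example values are illustrative assumptions, not part of the original text.

```python
def penetrances_from_grr(prevalence, p_d, r1, r2):
    """Return (f0, f1, f2) from Eq. (1.7), assuming HWE at the trait locus."""
    p_plus = 1.0 - p_d
    denom = p_plus ** 2 + 2.0 * r1 * p_plus * p_d + r2 * p_d ** 2
    f0 = prevalence / denom
    f = (f0, r1 * f0, r2 * f0)
    if max(f) > 1.0:
        # Not every (prevalence, allele frequency, relative risk) combination is valid.
        raise ValueError("parameters imply a penetrance greater than 1")
    return f


# Hypothetical multiplicative model: R2 = R1 ** 2, see Eq. (1.8).
f0, f1, f2 = penetrances_from_grr(prevalence=0.05, p_d=0.10, r1=2.0, r2=4.0)

# Eq. (1.6) recovers the prevalence from the penetrances.
recovered = f0 * 0.9 ** 2 + 2.0 * f1 * 0.9 * 0.1 + f2 * 0.1 ** 2
```

The validity check matters in practice: a high prevalence combined with large relative risks can imply f2 > 1, which is not a probability.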

Power and sample size may also be computed using a marker locus that is in linkage disequilibrium (LD) with a trait locus (see, e.g., [130–138]). The motivation for the use of such a locus is that we may not always have the disease locus genotyped (i.e., genotypes are not observed), but we will have observed genotypes at a marker locus in LD with the disease locus. To perform these calculations, we must provide an additional definition and some additional parameters. First, we define the term haplotype. According to Ott and Hoh [139], a haplotype is a sequence of alleles, at different loci, that are transmitted from a parent to an offspring in one gamete. In Fig. 1.5, we list all possible two-locus haplotypes, where the first locus is the disease locus and the second locus is a marker locus with alleles A and T.

Fig. 1.5 Pictorial representation of all two-locus haplotypes for the trait and SNP marker

Given the definition of haplotypes, we can define the concept of LD between the two loci. To do this, we present some additional definitions. As with the definition of haplotype, the A and T alleles mentioned below are the observed alleles at a given SNP marker locus.

pA = Frequency of the A allele in the population being studied.

pT = 1 − pA = Frequency of the T allele in the population being studied.

D = c min(pd pT, p+ pA) = Disequilibrium between the trait and the marker locus, where c is the coefficient of maximum linkage disequilibrium.

In this formula, we select the A allele to be in (positive) linkage disequilibrium with the trait d allele. That is, the haplotype with the d and A alleles will have a higher population frequency than the haplotype with the d and T alleles (see Table 1.2). The genotype codes are determined using the A allele as the reference allele. Using this notation, we can compute the frequencies of all two-locus haplotypes. They are provided in the table below.

Table 1.2 Two-locus haplotype frequencies for a trait and SNP marker locus

Haplotype    Frequency      Notation
+A           p+ pA − D      h1
+T           p+ pT + D      h2
dA           pd pA + D      h3
dT           pd pT − D      h4
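The entries of Table 1.2 follow directly from pd, pA, and c. The sketch below is an illustration under the definitions above; the function name and the example values are assumptions.

```python
def haplotype_freqs(p_d, p_a, c):
    """Return (h1, h2, h3, h4) for haplotypes +A, +T, dA, dT as in Table 1.2."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    p_plus, p_t = 1.0 - p_d, 1.0 - p_a
    # D is a fraction c of its maximum value, so every frequency stays non-negative.
    d = c * min(p_d * p_t, p_plus * p_a)
    h1 = p_plus * p_a - d   # +A
    h2 = p_plus * p_t + d   # +T
    h3 = p_d * p_a + d      # dA
    h4 = p_d * p_t - d      # dT
    return (h1, h2, h3, h4)


# Hypothetical marker in partial LD with the trait locus.
h = haplotype_freqs(p_d=0.10, p_a=0.30, c=0.8)

# Special case: the marker is the trait locus (pA = pd, c = 1).
h_same = haplotype_freqs(p_d=0.10, p_a=0.10, c=1.0)
```

In the special case, h1 and h4 vanish and the two remaining haplotype frequencies reduce to the allele frequencies p+ and pd.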

When the SNP locus is the disease locus

If we specify that the SNP locus is the disease locus, then the A allele and the d allele are one and the same, and so are the T and + alleles. Also, pA = pd, pT = p+, c = 1, and D = pA pT = p+ pd. Given this simplification, the haplotype frequencies in Table 1.2 reduce to h1 = h4 = 0, h2 = p+, h3 = pd. Typically, we cannot observe the disease locus genotypes because we do not know the location of the disease locus. In the presence of heterogeneity, the observed conditional genotype frequencies (gji; Eq. (1.2)) may not be equal to the true frequencies. We will see this in the following chapters.

Note: While we focus on SNP genotypes in this book, the definitions of penetrances and conditional genotype frequencies may be extended to any finite number of affection statuses and genotypes/categories.

To determine the values gji, we use Bayes' Rule and the Law of Total Probability. We perform the computation for each genotype and phenotype separately, beginning with g00:

\[ g_{00} = \Pr(\text{SNP genotype} = 0 \mid \text{phenotype} = 0) = \frac{\Pr(\text{SNP genotype} = 0, \text{phenotype} = 0)}{\Pr(\text{phenotype} = 0)}. \]

We focus on the numerator:

\[
\begin{aligned}
\Pr(\text{SNP genotype} = 0, \text{phenotype} = 0) &= \Pr(\text{phenotype} = 0, \text{SNP genotype} = 0) \\
&= \Pr(\text{phenotype} = 0 \mid \text{SNP genotype} = 0) \times \Pr(\text{SNP genotype} = 0).
\end{aligned}
\tag{1.9}
\]

Notice that the factor Pr(phenotype = 0 | SNP genotype = 0) is similar to a penetrance value, with the exception that we are conditioning on the SNP genotype (observed data) and not the disease genotype (unobserved). However, the penetrance is only defined for the disease genotype. To condition on the trait genotypes, we make use of the haplotypes in Fig. 1.5. First, we define the term diplotype. A diplotype is any pair of haplotypes (note the analogy of haplotype to allele and diplotype to genotype). We use the following notation for diplotypes:

\[ x_1 x_2 / y_1 y_2, \tag{1.10} \]

which denotes the diplotype comprised of haplotypes x1 x2 and y1 y2, where x1 and x2 are the alleles at the first and second loci, respectively, for the first haplotype, and y1 and y2 are the alleles at the first and second loci, respectively, for the second haplotype. Note that this notation may easily be extended to a haplotype with l loci, l > 2. Next, we consider all diplotypes that contain the TT genotype; that is, the diplotypes that contain the true 0 genotype for the marker locus. Studying Fig. 1.5, we see that any such diplotypes can only have haplotypes dT and/or +T (the second and fourth haplotypes in the figure). It is straightforward to confirm that the diplotypes containing the 0 (marker) genotype are dT/dT, dT/+T, +T/dT, and +T/+T. Note that the second and third diplotypes are equal as unordered pairs, analogous to heterozygous genotypes with the same set of alleles. Treating the four ordered diplotypes as mutually exclusive events, it follows that

\[
\begin{aligned}
\Pr(\text{SNP genotype} = 0) ={}& \Pr(\text{diplotype} = {+T}/{+T}) + \Pr(\text{diplotype} = {+T}/dT) \\
&+ \Pr(\text{diplotype} = dT/{+T}) + \Pr(\text{diplotype} = dT/dT).
\end{aligned}
\]

It follows that the numerator (1.9) may be rewritten as

\[
\begin{aligned}
\Pr(\text{phenotype} = 0, \text{genotype} = 0)
={}& \Pr(\text{phenotype} = 0, \text{diplotype} = {+T}/{+T}) + \Pr(\text{phenotype} = 0, \text{diplotype} = {+T}/dT) \\
&+ \Pr(\text{phenotype} = 0, \text{diplotype} = dT/{+T}) + \Pr(\text{phenotype} = 0, \text{diplotype} = dT/dT) \\
={}& \Pr(\text{phenotype} = 0 \mid \text{diplotype} = {+T}/{+T}) \times \Pr(\text{diplotype} = {+T}/{+T}) \\
&+ \Pr(\text{phenotype} = 0 \mid \text{diplotype} = {+T}/dT) \times \Pr(\text{diplotype} = {+T}/dT) \\
&+ \Pr(\text{phenotype} = 0 \mid \text{diplotype} = dT/{+T}) \times \Pr(\text{diplotype} = dT/{+T}) \\
&+ \Pr(\text{phenotype} = 0 \mid \text{diplotype} = dT/dT) \times \Pr(\text{diplotype} = dT/dT) \\
={}& (1 - f_0) \times \Pr(\text{diplotype} = {+T}/{+T}) + (1 - f_1) \times \Pr(\text{diplotype} = {+T}/dT) \\
&+ (1 - f_1) \times \Pr(\text{diplotype} = dT/{+T}) + (1 - f_2) \times \Pr(\text{diplotype} = dT/dT).
\end{aligned}
\tag{1.11}
\]

The last equality in Eq. (1.11) follows from the fact that phenotypes are determined by the trait locus genotypes only and not by the marker locus genotypes. To complete the computation of the numerator (1.9), we specify that the diplotype frequencies are determined from the haplotype frequencies using HWE proportions (see, e.g., [29, 140]). Using notation from Table 1.2, we have Pr(diplotype = +T/+T) = (h2)^2, Pr(diplotype = +T/dT) = Pr(diplotype = dT/+T) = (h2)(h4), and Pr(diplotype = dT/dT) = (h4)^2. It follows that we can rewrite Eq. (1.9) as

\[ \Pr(\text{phenotype} = 0, \text{genotype} = 0) = (1 - f_0)(h_2)^2 + 2(1 - f_1)(h_2)(h_4) + (1 - f_2)(h_4)^2. \tag{1.12} \]

Finally, the denominator of the genotype frequency g00 is Pr(phenotype = 0) = 1 − φ1 = φ0, so we have

\[ g_{00} = \frac{(1 - f_0)(h_2)^2 + 2(1 - f_1)(h_2)(h_4) + (1 - f_2)(h_4)^2}{\phi_0}. \tag{1.13} \]

Using the information provided above, we may compute the conditional genotype frequency as a function of the genotype relative risks, the trait (disease) allele frequency, the prevalence, one marker allele frequency, and the coefficient of maximum linkage disequilibrium, a total of six parameters. It is straightforward to check that the remaining five conditional genotype frequencies are computed in the same manner and that the equations are

\[
\begin{aligned}
g_{10} &= 2 \times \frac{(1 - f_0)(h_1)(h_2) + (1 - f_1)(h_1 h_4 + h_2 h_3) + (1 - f_2)(h_3)(h_4)}{\phi_0}, \\
g_{20} &= \frac{(1 - f_0)(h_1)^2 + 2(1 - f_1)(h_1)(h_3) + (1 - f_2)(h_3)^2}{\phi_0}, \\
g_{01} &= \frac{f_0 (h_2)^2 + 2 f_1 (h_2)(h_4) + f_2 (h_4)^2}{\phi_1}, \\
g_{11} &= 2 \times \frac{f_0 (h_1)(h_2) + f_1 (h_1 h_4 + h_2 h_3) + f_2 (h_3)(h_4)}{\phi_1}, \\
g_{21} &= \frac{f_0 (h_1)^2 + 2 f_1 (h_1)(h_3) + f_2 (h_3)^2}{\phi_1}.
\end{aligned}
\tag{1.13 continued}
\]
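The six frequencies in Eq. (1.13) can be computed from the six model parameters in one short routine. The following Python sketch is an illustration only; the function name, the intermediate table of joint probabilities, and the example values are assumptions made for this example.

```python
def conditional_genotype_freqs(prevalence, p_d, r1, r2, p_a, c):
    """Return ((g00, g10, g20), (g01, g11, g21)) following Eq. (1.13)."""
    p_plus, p_t = 1.0 - p_d, 1.0 - p_a
    # Penetrances from Eq. (1.7).
    f0 = prevalence / (p_plus ** 2 + 2.0 * r1 * p_plus * p_d + r2 * p_d ** 2)
    f = (f0, r1 * f0, r2 * f0)
    # Haplotype frequencies from Table 1.2.
    d = c * min(p_d * p_t, p_plus * p_a)
    h1, h2, h3, h4 = p_plus * p_a - d, p_plus * p_t + d, p_d * p_a + d, p_d * p_t - d
    # joint[j][t] = Pr(marker genotype j, trait genotype t) under HWE for diplotypes.
    joint = (
        (h2 * h2, 2.0 * h2 * h4, h4 * h4),                          # marker TT
        (2.0 * h1 * h2, 2.0 * (h1 * h4 + h2 * h3), 2.0 * h3 * h4),  # marker AT
        (h1 * h1, 2.0 * h1 * h3, h3 * h3),                          # marker AA
    )
    controls = tuple(
        sum((1.0 - f[t]) * joint[j][t] for t in range(3)) / (1.0 - prevalence)
        for j in range(3)
    )
    cases = tuple(
        sum(f[t] * joint[j][t] for t in range(3)) / prevalence for j in range(3)
    )
    return controls, cases


# Hypothetical parameter values.
controls, cases = conditional_genotype_freqs(0.05, 0.10, 2.0, 4.0, p_a=0.30, c=0.8)
# With c = 0 the marker carries no information, so cases and controls coincide.
null_controls, null_cases = conditional_genotype_freqs(0.05, 0.10, 2.0, 4.0, 0.30, 0.0)
```

Setting c = 0 gives a quick sanity check: both populations then show the marker's HWE proportions.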

To check our calculations, note that \( \sum_{k=1}^{4} h_k = 1 \). Also note (from Table 1.2) that h1 + h2 = p+ and h3 + h4 = pd, and thus the sums of conditional genotype frequencies, \( \sum_{j=0}^{2} g_{ji} \), equal one for each phenotype i, as they must.

Hardy-Weinberg Equilibrium revisited

Now that we have another type of genomic data, namely haplotypes, we provide another way that HWE proportions may be calculated for pairs of genomic data (e.g., alleles or haplotypes in a population). For the work here and elsewhere, the number of alleles, haplotypes, or other (categorical) genomic data that are transmitted from parent to offspring is two. Under HWE, the transmission is independent. Therefore, the probability of observing any of the \( \binom{a+1}{2} \) pairs (Sect. 1.3) may be computed using the multinomial distribution, where the number of "experiments" performed is two. Using the haplotype notation in Table 1.2, we have a = 4 haplotypes, indexed 1 ≤ k ≤ 4, with corresponding probabilities hk. The number of diplotypes is \( \binom{4+1}{2} = \binom{5}{2} = 10 \). For a diplotype hr/hs, 1 ≤ r, s ≤ 4 (notation from the term (1.10)), the probability that an individual in the population has this diplotype is

\[ \Pr(h_r / h_s) = \frac{2!}{n_1!\, n_2!\, n_3!\, n_4!} (h_1)^{n_1} (h_2)^{n_2} (h_3)^{n_3} (h_4)^{n_4}. \tag{1.14} \]

That is, Formula (1.14) is the multinomial distribution with two "experiments" performed. The values nk, 1 ≤ k ≤ 4, are the number of times the number k appears in the set of subscripts {r, s}, with \( \sum_{k=1}^{4} n_k = 2 \). For example, if r = 3 and s = 4, then n3 = n4 = 1, n1 = n2 = 0. If r = s = 2, then n2 = 2, n1 = n3 = n4 = 0.
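Formula (1.14) can be evaluated for every unordered diplotype with a short loop. The sketch below is illustrative; the function name and the haplotype frequencies are assumptions.

```python
from itertools import combinations_with_replacement
from math import factorial


def diplotype_probs(h):
    """Map each unordered pair (r, s), 1-based with r <= s, to Pr(h_r / h_s)."""
    probs = {}
    for r, s in combinations_with_replacement(range(len(h)), 2):
        counts = [0] * len(h)   # n_k: how often index k occurs in {r, s}
        counts[r] += 1
        counts[s] += 1
        prob = float(factorial(2))   # two "experiments"
        for h_k, n_k in zip(h, counts):
            prob = prob / factorial(n_k) * h_k ** n_k
        probs[(r + 1, s + 1)] = prob
    return probs


# Hypothetical haplotype frequencies h1..h4 (they must sum to 1).
probs = diplotype_probs((0.1, 0.2, 0.3, 0.4))
```

For a = 4 haplotypes the dictionary has ten entries, heterozygous diplotypes receive 2 hr hs, and homozygous diplotypes receive (hr)^2, matching the two worked examples in the text.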

1.4.2.1 Genetic Model-Based Approach Through the Use of Logistic Model

An additional way to determine conditional genotype frequencies is through the logistic model. This model is useful when considering a dosage model for disease genotypes. We define a dosage model as one where each additional copy of an allele at a specific locus increases the probability that an individual will show the phenotype. As documented by Gauderman [141] (Eq. 1 of his work), with a slight modification of his notation, we can write the penetrances of genotypes as

\[ f_j = \Pr(\text{Affected} \mid \text{genotype} = j) = \frac{e^{\alpha + \beta w_j}}{1 + e^{\alpha + \beta w_j}}, \tag{1.15} \]

where α is the baseline log-odds (the log-odds of being affected when the weight is zero), β is the log-odds ratio, and wj are the weights corresponding to the different coded genotypes (same as α, βg, and G, respectively, in Gauderman's Eq. 1). We specify the notation in Eq. (1.15) to be consistent with our previous work [142]. Our penetrance function differs from that of Gauderman in that we do not consider an environmental covariate; hence there is no interaction term. It is straightforward to check that

\[ 1 - f_j = \Pr(\text{Unaffected} \mid \text{genotype} = j) = \frac{1}{1 + e^{\alpha + \beta w_j}}. \tag{1.16} \]

Substituting the values (1.15) and (1.16) into Eq. (1.13), we obtain conditional genotype frequencies in a logistic model framework. Different weights are used for different conjectured genetic modes of inheritance (MOIs). For dominant, recessive, and multiplicative modes of inheritance, we set the weights to be (w0 = 0, w1 = w2 = 1), (w0 = w1 = 0, w2 = 1), and (w0 = 0, w1 = 1, w2 = 2), respectively. As an example, for recessive weights, we have

\[ f_0 = \frac{e^{\alpha}}{1 + e^{\alpha}}, \quad f_1 = \frac{e^{\alpha}}{1 + e^{\alpha}}, \quad f_2 = \frac{e^{\alpha + \beta}}{1 + e^{\alpha + \beta}}. \tag{1.17} \]

In Eq. (1.17), f 0 = f 1 , and f 2 is not equal to either value, unless β = 0. This is precisely the set of penetrances we would expect for a recessive mode of inheritance.
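The logistic penetrances of Eq. (1.15), with the three weight vectors listed above, can be sketched as follows. The function name, the dictionary of weights, and the values of α and β are assumptions made for this illustration.

```python
from math import exp

# Weights (w0, w1, w2) for the three modes of inheritance named in the text.
WEIGHTS = {
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
    "multiplicative": (0, 1, 2),
}


def logistic_penetrances(alpha, beta, mode):
    """Return (f0, f1, f2) from Eq. (1.15) for the chosen weight vector."""
    # e^x / (1 + e^x) is written as 1 / (1 + e^(-x)); the two forms are equal.
    return tuple(1.0 / (1.0 + exp(-(alpha + beta * w))) for w in WEIGHTS[mode])


# Hypothetical values: alpha is the baseline log-odds, beta the log-odds ratio.
recessive = logistic_penetrances(alpha=-3.0, beta=0.7, mode="recessive")
null_model = logistic_penetrances(alpha=-3.0, beta=0.0, mode="multiplicative")
```

With recessive weights the first two penetrances coincide and the third is larger whenever β > 0, as stated for Eq. (1.17); with β = 0 all three are equal.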

1.5 The Box (and Whiskers) Plot as a Tool for Visualizing Empirical Data Distributions

For empirical data that are continuous, there are several ways of presenting those data pictorially. One way is with the histogram or bar graph. We provided an example of a histogram in Fig. 1.4. Another is the box and whiskers plot [143]. Hereafter, we shall refer to this type of figure as a box plot. While the histogram can show the shape of the data, thereby providing an approximate shape for the probability density function, the box plot provides us with precise values for the minimum, maximum, and median values of the distribution. By marking outliers and suspected outliers, we may also identify those values in the distribution that seem unusual, in that they are a considerable distance from other values in the distribution. From the information provided in the legend of Fig. 1.6 and from the box plot itself, we see that the minimum, median, and maximum CAG Repeat Sizes are 32, 43, and 113, respectively. The mean value is 44.04, and the first and third quartile values are 42 and 46, respectively. The distribution is skewed, in the sense that the lower 50% of the CAG Repeat Sizes occur between the values of 32 and 43, while the upper 50% occur between 43 and 113. Finally, there is one suspected lower outlier size (32), two suspected upper outlier sizes (54 and 55), and five upper outlier sizes (68, 79, 93, 100, 113). We present box plots throughout the book, always with the notation and legend provided in Fig. 1.6. The value "CAG repeat size" is replaced by the quantitative value of interest.
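The fences given in the legend of Fig. 1.6 (1.5 and 3 inter-quartile ranges beyond the quartiles) translate into a simple classification rule. The sketch below is illustrative: the function names are assumptions, and the quartile convention used by `statistics.quantiles` is also an assumption, since the text does not say which convention produced the quartiles 42 and 46.

```python
import statistics


def classify(value, q1, q3):
    """Label a value using the fences defined in the legend of Fig. 1.6."""
    iqr = q3 - q1
    if q1 - 1.5 * iqr <= value <= q3 + 1.5 * iqr:
        return "inside"
    if q1 - 3.0 * iqr <= value <= q3 + 3.0 * iqr:
        return "suspected outlier"
    return "outlier"


def box_plot_summary(data):
    """Return the quantities a box plot displays for a list of numbers."""
    # Assumption: "inclusive" quartiles; other conventions can differ slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    labels = {x: classify(x, q1, q3) for x in set(data)}
    inside = [x for x in data if labels[x] == "inside"]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.fmean(data),
        "whiskers": (min(inside), max(inside)),
        "suspected": sorted(x for x in labels if labels[x] == "suspected outlier"),
        "outliers": sorted(x for x in labels if labels[x] == "outlier"),
    }
```

With the quartiles reported in the text (42 and 46), `classify` labels 32, 54, and 55 as suspected outliers and 68 as an outlier, in agreement with the description of Fig. 1.6.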

Fig. 1.6 Example box plot of CAG repeat lengths for a sample of individuals affected with Huntington's Disease. The data used to generate this box plot were modified from the work of Brinkman et al. [144], specifically Table 1 of that publication. We recorded the counts of affected individuals having CAG repeat sizes ranging from 29 to 50, a total of 668 individuals. Also, we randomly generated fictitious CAG repeat sizes ranging from 20 to 120 for ten individuals using the "=randbetween (20,120)" function in Excel. We included these data when creating the box plot to produce outlier values. The symbols in this figure may be defined as follows:
Mean marker (symbol not reproduced here) = Mean value of CAG repeat sizes.
Upper horizontal side of gray box = Third quartile (3Q) of values (75% of the CAG repeat sizes are less than the value corresponding to this line).
Black horizontal line inside gray box = Median value (50% of the CAG repeat sizes are less than the value corresponding to this line, and also, 50% are greater than the value).
Lower horizontal side of gray box = First quartile (1Q) of values (75% of the CAG repeat sizes are greater than the value corresponding to this line).
Upper line segment at top of "T" = Maximum value for the set of CAG repeat sizes x that satisfy the condition 1Q − 1.5δ ≤ x ≤ 1.5δ + 3Q, where δ = 3Q − 1Q = Inter-quartile range (IQR).
Lower line segment at bottom of inverted "T" = Minimum value for the set of CAG repeat sizes x that satisfy the inequality listed directly above.
Suspected outlier markers = Values y that satisfy either 1.5δ + 3Q < y ≤ 3δ + 3Q or 1Q − 3δ ≤ y < 1Q − 1.5δ.
Outlier markers = Values z that satisfy either 3δ + 3Q < z or 1Q − 3δ > z.

1.6 Power and Minimum Sample Size (MSSN) for Different Statistical Tests of Genetic Association

Under homogeneity, most (although not all) of the test statistics we consider in this book for gene mapping will follow a certain chi-square distribution under null and alternative hypotheses. Under the null hypothesis, the distribution will be a central chi-square distribution with degrees of freedom determined by the null hypothesis. Under the alternative, the distribution will be a non-central chi-square distribution with the same degrees of freedom as that for the null hypothesis, and with non-centrality parameter determined as a function of parameters (e.g., genotype frequencies).

Type I error and power for statistics incorporating heterogeneity

Given that the underlying model for parameters such as allele and genotype frequencies follows a finite mixture model, statistics that account for this mixture generally do not follow a known distribution under the null or alternative hypotheses (see, e.g., [145, 146]). Mixture models aside, when considering power and/or sample size calculations for test statistics that follow a central chi-square distribution under the null hypothesis, and a non-central chi-square distribution under the alternative, three critical parameters are (i) the significance level, usually denoted by α; (ii) the degrees of freedom (df) (noted above); and (iii) the non-centrality parameter (NCP), usually denoted by λ (also noted above). Knowing these three parameters, it is possible to compute asymptotic power and/or sample size for a given test statistic when the alternative hypothesis is true.

Significance levels compared to type I error rates

A crucial point to understand is that a significance level is an experimental design parameter, whereas a type I error rate is a probability. The significance level is set before analyses are performed. If the test statistic is working properly, the type I error rate is equal to the significance level. However, we shall see that, in the presence of different forms of heterogeneity, there will be situations where the type I error rate is not equal to the significance level. Now that we have the conditional genotype frequencies determined by a genetic model-free or model-based method (using up to six genetic model parameters) (Sect. 1.4), we can compute NCPs for different statistical tests using the frequencies. In this book, we shall focus on a relatively small number of test statistics and their corresponding distributions with NCPs. We place these formulas in this chapter because we will refer to them extensively in later chapters. Before computing NCPs for each of the statistics, we provide formulas on how to compute the value of the statistic given appropriate data. To do this, we provide a useful method for organizing the data, namely the contingency table.
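Given the three quantities named above (α, the degrees of freedom, and the NCP λ), asymptotic power is the probability that a non-central chi-square variable exceeds the central chi-square critical value. The following sketch uses only the Python standard library and is an illustration, not the book's software: the function names are assumptions, the incomplete gamma function is evaluated by its power series (adequate for the moderate critical values used in power calculations), and the non-central distribution is evaluated as a Poisson mixture of central chi-square distributions.

```python
from math import exp, lgamma, log


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) by its power series."""
    if x <= 0.0:
        return 0.0
    term = exp(a * log(x) - x - lgamma(a + 1.0))   # n = 0 term of the series
    total, n = term, 0
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= x / (a + n)
        total += term
    return min(total, 1.0)


def chi2_cdf(x, df):
    """Central chi-square distribution function."""
    return reg_lower_gamma(df / 2.0, x / 2.0)


def chi2_critical_value(alpha, df):
    """Value exceeded with probability alpha under the central chi-square (bisection)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ncx2_cdf(x, df, ncp):
    """Non-central chi-square distribution function as a Poisson mixture."""
    half = ncp / 2.0
    weight = exp(-half)          # Poisson probability of j = 0
    total, cum, j = 0.0, 0.0, 0
    while cum < 1.0 - 1e-14 and j < 100000:
        total += weight * chi2_cdf(x, df + 2 * j)
        cum += weight
        j += 1
        weight *= half / j
    return total


def power(alpha, df, ncp):
    """Asymptotic power at significance level alpha for the given df and NCP."""
    return 1.0 - ncx2_cdf(chi2_critical_value(alpha, df), df, ncp)
```

As a usage note, `power(0.05, 1, 0.0)` returns the significance level itself, because an NCP of zero reduces the non-central distribution to the central one. A library routine for the non-central chi-square distribution, where available, would be the more robust choice for very large NCPs.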

1.6.1 Contingency Table for Organizing Categorical Phenotype and Genomic-Data

Consider individuals whose phenotype falls into one of R (more than one) unique categories, and whose genomic data simultaneously may be placed into one of C (more than one) unique categories. For example, the phenotype may be affected or not affected with a disease, and genomic data may be the coded genotype at a SNP of interest. The number of phenotypes is two, and the number of genomic data categories is three (the coded genotypes). Each individual has a unique phenotype and SNP genotype. We may enter such data in a contingency table, presented immediately below (Table 1.3). There are two main reasons we start the row and column indices at 0. First, when R is two, as with cases and controls, the controls are typically indexed by the phenotype value of 0. Also, a number of computing languages, C, C++, and Pascal among them, start their for-loop indices at 0. Several of the statistics presented in this book have computer code written in one of these languages. Let gji be the set of conditional probabilities defined in Eq. (1.2), with 0 ≤ i ≤ R − 1, 0 ≤ j ≤ C − 1. These probabilities are estimated by nij / ni+.

Table 1.3 Notation for cell counts in an R × C table

                        Genomic data categories
Phenotype categories    Category 0   Category 1   …   Category j   …   Category C − 1   Row total
Phenotype 0             n00          n01          …   n0j          …   n0,C−1           n0+
Phenotype 1             n10          n11          …   n1j          …   n1,C−1           n1+
…                       …            …            …   …            …   …                …
Phenotype i             ni0          ni1          …   nij          …   ni,C−1           ni+
…                       …            …            …   …            …   …                …
Phenotype R − 1         nR−1,0       nR−1,1       …   nR−1,j       …   nR−1,C−1         nR−1,+
Column total            n+0          n+1          …   n+j          …   n+,C−1           N

The values nij, 0 ≤ i ≤ R − 1, 0 ≤ j ≤ C − 1, are the counts of subjects who are jointly in the ith phenotype category and the jth genomic data category. The row totals are defined as \( n_{i+} = \sum_{j=0}^{C-1} n_{ij} \), and the column totals are defined as \( n_{+j} = \sum_{i=0}^{R-1} n_{ij} \). The total number of subjects is \( N = \sum_{i=0}^{R-1} \sum_{j=0}^{C-1} n_{ij} \). Commas are inserted for legibility

1.6.1.1 Formula for Chi-Square Test of Independence

For this class of test statistic, we may write the null and alternative hypotheses as

i. H0: For each genomic-data column j, gj0 = gj1 = ··· = gj,R−1.
ii. H1: For at least two phenotypes, 0 ≤ i1 < i2 ≤ R − 1, and at least one genomic-data column j, gj,i1 ≠ gj,i2.

The chi-square test of independence [147] that tests for association under these hypotheses is written as

\[ X^2 = \sum_{i=0}^{R-1} \sum_{j=0}^{C-1} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}, \tag{1.18} \]

where eij, the expected count under independence, is given by

\[ e_{ij} = \frac{(n_{i+})(n_{+j})}{N}. \tag{1.19} \]

If eij is zero for the (i, j)th cell, the corresponding summand in Formula (1.18) is set to zero. This statistic has been widely used in genetics for cross-classified (categorical) data, especially when R is two (case, control) and C is two (the different alleles for a single SNP) or three (the three genotype categories for a SNP). It is implemented in genetic association programs like PLINK [127], CONTING [17], FINETTI [148], and GenABEL [149]. Fisher [150] documented that, under H0 stated above, the asymptotic distribution of the test statistic (1.18) is a central chi-square distribution with (R − 1) × (C − 1) degrees of freedom.

Special case: test of independence for alleles

Each individual has twice as many alleles (or haplotypes) as genotypes (or diplotypes). The chi-square test of independence for alleles tests for differences in allele frequencies, rather than genotype frequencies; that is, we count the number of times each allele occurs for each phenotype category. The sample size for each phenotype category is doubled since each genotype (excluding copy number variants and markers on the X chromosome) is made up of two alleles. The degrees of freedom change as well (see Table 1.5).
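Formulas (1.18) and (1.19), together with the allele-counting special case, can be sketched as follows. The function names and the example counts are assumptions made for illustration; rows are phenotype categories and columns are genomic data categories, as in Table 1.3.

```python
def chi_square_independence(table):
    """Return (X2, df) for an R x C table of counts, following Eqs. (1.18)-(1.19)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table contains no observations")
    stat = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            if e_ij > 0:   # cells with zero expected count contribute zero
                stat += (n_ij - e_ij) ** 2 / e_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def allele_table(genotype_table):
    """Convert counts of genotypes coded 0, 1, 2 into counts of the two alleles."""
    # A genotype coded j carries j reference alleles and 2 - j other alleles.
    return [[2 * n0 + n1, n1 + 2 * n2] for n0, n1, n2 in genotype_table]


# Hypothetical counts: row 0 = controls, row 1 = cases; columns = genotypes 0, 1, 2.
genotypes = [[50, 40, 10], [30, 50, 20]]
x2_genotype, df_genotype = chi_square_independence(genotypes)
x2_allele, df_allele = chi_square_independence(allele_table(genotypes))
```

The allele table has twice as many observations as the genotype table, and the degrees of freedom drop from 2 to 1 in this example.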

1.6.1.2 (Cochran-Armitage) Test of Trend

For this statistic, we restrict the number of phenotypes R to two (cases and controls, or affected individuals and unaffected individuals). While the number of genomic data categories C may exceed three, when computing power or sample size calculations, we typically restrict the number of categories to three (although, see Chap. 4, LTTae). We defined penetrances in Sect. 1.4.2. Here, we adjust the subscripts to be consistent with the values in Table 1.3. Specifically,

f1j = Pr(phenotype = 1 | genomic-data value = j), 0 ≤ j ≤ C − 1.

Typically, the phenotype category being 1 refers to being a case/affected. The null and alternative hypotheses are

i. H0: Penetrances satisfy the condition f10 = f11 = ··· = f1j = ··· = f1,C−1.
ii. H1: For at least two columns 0 ≤ j1 < j2 ≤ C − 1, f1,j1 ≠ f1,j2.

Said another way, the null hypothesis for this statistic is that the case penetrances are equal for all genomic data categories. In the example of a SNP with three genotype categories, the null hypothesis is that the case penetrances are equal for the coded genotypes 0, 1, and 2. This class of test statistics is parameterized by weights wj, 0 ≤ j ≤ C − 1. When the coded genotype refers to the number of copies of a reference allele in a SNP genotype, Slager and Schaid recommended the weights (w0, w1, w2) be (0, 0, 1), (0, 1, 1), and (0, 1, 2) for recessive, dominant, and multiplicative MOIs, respectively [151]. When the disease MOI is unknown, it is common practice to use the (0, 1, 2) weights [152, 153]. Consider the following terms, used in the definition of the Cochran-Armitage test of trend formulation:

\[
\begin{aligned}
U &= \sum_{j=0}^{C-1} w_j \left( \frac{n_{1+}}{N} n_{0j} - \frac{n_{0+}}{N} n_{1j} \right), \\
V &= \frac{(n_{0+})(n_{1+})}{N^3} \times \left[ N \sum_{j=0}^{C-1} w_j^2\, n_{+j} - \left( \sum_{j=0}^{C-1} w_j\, n_{+j} \right)^2 \right].
\end{aligned}
\tag{1.20}
\]

All notation comes from Table 1.3. Given these terms, we express the Cochran-Armitage test of trend as

\[ CATT = \frac{U^2}{V}. \tag{1.21} \]

Under the null hypothesis, when the genomic data are the genotypes for a single SNP, the CATT statistic follows a central chi-square distribution with 1 degree of freedom.
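Equations (1.20) and (1.21) can be evaluated directly from the two rows of a case-control table. The sketch below is an illustration; the function name, the default weights, and the counts are assumptions.

```python
def catt(controls, cases, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic U^2 / V from Eqs. (1.20)-(1.21).

    controls and cases hold the counts n_0j and n_1j for each genomic category j.
    """
    n0, n1 = sum(controls), sum(cases)
    n = n0 + n1
    col_totals = [a + b for a, b in zip(controls, cases)]
    u = sum(w * (n1 / n * a - n0 / n * b)
            for w, a, b in zip(weights, controls, cases))
    sum_w = sum(w * c for w, c in zip(weights, col_totals))
    sum_w2 = sum(w * w * c for w, c in zip(weights, col_totals))
    v = n0 * n1 / n ** 3 * (n * sum_w2 - sum_w ** 2)
    if v <= 0:
        raise ValueError("variance term is zero; the table has no variation in weights")
    return u * u / v


# Hypothetical counts for genotypes 0, 1, 2.
trend_multiplicative = catt([50, 40, 10], [30, 50, 20])
trend_recessive = catt([50, 40, 10], [30, 50, 20], weights=(0, 0, 1))
```

Because U enters only through its square, the sign convention in U does not affect the statistic.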

1.6.1.3 The Transmission Disequilibrium Test for Detecting Linkage in the Presence of Association

Here, we present a test statistic we consider for family-based data. It was titled the transmission disequilibrium test, or TDT, by Spielman et al. [154]. Suppose we have a SNP locus with alleles A and T. Further, suppose we have N AC children affected with a disease being studied, and that these children and their parents (trios) are all genotyped at the SNP locus. For these trios, there are four possibilities for any parent: (i) the parent is homozygous for the A allele; (ii) the parent is homozygous for the T allele; (iii) the parent is heterozygous and transmits the A allele and does not transmit the T allele to the affected child; and (iv) the parent is heterozygous and transmits the T allele and does not transmit the A allele to the affected child. When considering all trios, this information may be summarized in Table 1.4. In this table, the values nAA and nTT are the number of genotyped parents who are homozygous for the A and T alleles, respectively. The value nAT is the number of instances where a heterozygous parent transmits the A allele to their affected child and does not transmit the T allele. Finally, the value nTA is the number of instances where a heterozygous parent transmits the T allele to their affected child and does not transmit the A allele.

Table 1.4 Counts of transmitted and non-transmitted alleles in genotyped trios

                      Non-transmitted allele
Transmitted allele    A        T
A                     nAA      nAT
T                     nTA      nTT

The counts nxy, x, y ∈ {A, T}, are the number of instances in which a parent transmits the x allele and does not transmit the y allele to an affected child
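The section does not write out the TDT statistic itself. As a hedged illustration, the sketch below uses the standard McNemar-type form from Spielman et al., (nAT − nTA)^2 / (nAT + nTA), which depends only on transmissions from heterozygous parents; the function name and the counts are assumptions.

```python
def tdt_statistic(n_at, n_ta):
    """McNemar-type TDT statistic from the off-diagonal counts of Table 1.4.

    n_at: heterozygous parents transmitting A (and not T) to an affected child.
    n_ta: heterozygous parents transmitting T (and not A) to an affected child.
    Homozygous parents (n_AA, n_TT) are uninformative and do not enter.
    """
    total = n_at + n_ta
    if total == 0:
        raise ValueError("no transmissions from heterozygous parents")
    return (n_at - n_ta) ** 2 / total


# Hypothetical counts.
tdt = tdt_statistic(n_at=60, n_ta=40)
```

Under the null hypothesis this statistic is referred to a central chi-square distribution with 1 df, as listed in Table 1.5.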

Table 1.5 Asymptotic null and alternative distributions for test statistics studied in this book

Statistic                                        Probability distribution under H0     Probability distribution under H1         Non-centrality formula reference
Chi-square test of independence for genotypes    Central chi-square with (n − 1) df    Non-central chi-square with (n − 1) df    Formula (1.22)
Chi-square test of independence for alleles      Central chi-square with (a − 1) df    Non-central chi-square with (a − 1) df    Formula (1.23)
Cochran-Armitage test of trend                   Central chi-square with (a − 1) df    Non-central chi-square with (a − 1) df    Formula (1.24)
Transmission disequilibrium test (TDT)           Central chi-square with 1 df          Non-central chi-square with 1 df          Formula (1.25)

For each of these distributions, the non-centrality parameter differs and is given by formulas indicated in the fourth column. The value n is the number of genotypes at the locus (or loci) and the value a is the number of alleles at the locus. The values n and a are special cases of the value C in Table 1.3

For a putative disease locus with disease allele d and wild-type allele +, we can express the null hypothesis for the TDT statistic by making use of the information in Fig. 1.5 and Table 1.2. The null hypothesis is

i. The recombination fraction θ between the SNP locus and the disease locus is ½, or
ii. The disequilibrium parameter D is 0.

The alternative hypothesis is

iii. θ < ½ and D ≠ 0.

The SNP locus has the observed data (genotypes AA, AT, and TT for the affected children and their parents), while genotypes for the disease locus are unobserved unless the SNP locus is the disease locus.

1.6.1.4 Computing Power and MSSN for Tests of Genetic Association

In the previous section, we provided the null and alternative hypotheses for each of four types of test statistics. In this section, we discuss how we compute statistical power and minimum sample size necessary (MSSN) when the alternative hypothesis is true. In Table 1.5, we consider asymptotic null and alternative distributions. That is, the distributions are valid when conditions summarized by Cochran [155] are met. In situations where these conditions are not met, it is usually possible to determine correct significance by either exact methods, permutation, or by bootstrap (see, e.g., [156]).

For each statistic mentioned in Table 1.5, there may be multiple versions that have been published in the literature. For example, the TDT has over 900 versions for different situations (search keywords "transmission disequilibrium test AND genetics AND method" in PubMed). The statistic to which we refer in each row of the column labeled "Statistic" is (to our knowledge) the most widely-cited publication of a method. Thus, we refer to the TDT published by Spielman, McInnis, and Ewens [154] in Table 1.5. The publication we reference for each statistic's non-centrality parameter indicates the test statistic to which we refer. If the NCP λ of a distribution is zero, the distribution is identical to a distribution in the central family [157]. Here, we present formulas for the NCPs of the test statistics mentioned in Table 1.5. The Greek letter λ is the NCP. In all formulas,

NA = n1+ = Number of cases/affected individuals genotyped (notation from Table 1.3),

NU = n0+ = Number of controls/unaffected individuals genotyped (notation from Table 1.3).

The genotype frequencies are determined based on certain underlying models, including misclassification or other types of heterogeneity. In what follows, we provide two forms of each NCP. The second is derived from the first, with the definition k = NU / NA.

Chi-square test of independence for genotypes

\[ \lambda = N_A N_U \sum_{j=0}^{2} \frac{(g_{j1} - g_{j0})^2}{N_A g_{j1} + N_U g_{j0}}, \tag{1.22} \]

or

\[ \lambda = k N_A \sum_{j=0}^{2} \frac{(g_{j1} - g_{j0})^2}{g_{j1} + k g_{j0}}, \]

where $g_{j1}$ and $g_{j0}$ are the genotype frequencies in the case and control populations, respectively, for the jth genotype, 0 ≤ j ≤ 2.

Chi-square test of independence for alleles

$$\lambda = 2 N_A N_U \sum_{j=0}^{1} \frac{\left(p_{j1} - p_{j0}\right)^2}{N_A p_{j1} + N_U p_{j0}}, \tag{1.23}$$

or

$$\lambda = 2 k N_A \sum_{j=0}^{1} \frac{\left(p_{j1} - p_{j0}\right)^2}{p_{j1} + k p_{j0}},$$

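where $p_{j1}$ and $p_{j0}$ are the allele frequencies in the case and control populations, respectively, for the jth allele, 0 ≤ j ≤ 1.

Test of trend NCP

$$\lambda = \frac{N_A N_U \left[\sum_{j=0}^{2} w_j \left(g_{j1} - g_{j0}\right)\right]^2}{\sum_{j=0}^{2} w_j^2 \left(N_A g_{j1} + N_U g_{j0}\right) - \left[\sum_{j=0}^{2} w_j \left(N_A g_{j1} + N_U g_{j0}\right)\right]^2 / \left(N_A + N_U\right)}, \tag{1.24}$$

or

$$\lambda = \frac{k N_A \left[\sum_{j=0}^{2} w_j \left(g_{j1} - g_{j0}\right)\right]^2}{\sum_{j=0}^{2} w_j^2 \left(g_{j1} + k g_{j0}\right) - \left[\sum_{j=0}^{2} w_j \left(g_{j1} + k g_{j0}\right)\right]^2 / (1 + k)}.$$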

In Eq. (1.24), $w_j$ are the respective weights for the jth genotype, 0 ≤ j ≤ 2. This NCP may be extended to any number of genomic data categories C (Table 1.3), as long as weights for each category are provided.

Transmission disequilibrium test NCP when disease locus is marker locus

$$\lambda = \frac{\left[E_T - E_{NT}\right]^2}{E_T + E_{NT}}, \tag{1.25}$$

where

$p_d$ = The disease allele frequency.
$p_+$ = The wild-type allele frequency.
$\delta = \delta' p_d p_+$, where $0 \le \delta' \le 1$ is specified by the user. In Fig. 1.5, we refer to this value as D, with $\delta' = c$.
$N$ = The number of affected children in the data set for whom both parents are genotyped at a given marker locus (the child is also genotyped).
$f_j$, 0 ≤ j ≤ 2 = The disease locus penetrances defined in Eq. (1.7).
$C = p_d f_2 + (1 - 2p_d) f_1 - p_+ f_0$.
$\phi_1 = p_d^2 f_2 + 2 p_d p_+ f_1 + p_+^2 f_0$ (Disease Prevalence).
$E_T = 2N\left[p_d p_+ + \delta p_+ C / \phi_1\right]$ (we specify that the marker locus is the disease locus) is the expected number of heterozygous parents who transmit a pre-specified reference allele to an affected child.
$E_{NT} = 2N\left[p_d p_+ - \delta p_d C / \phi_1\right]$ (we specify that the marker locus is the disease locus) is the expected number of heterozygous parents who do not transmit a pre-specified reference allele to an affected child.
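The NCP formulas above translate directly into a few lines of code. The following is a minimal sketch, not code from the book: the function names, the additive trend weights (0, 1, 2), the biallelic assumption $p_+ = 1 - p_d$, and all numerical inputs are hypothetical choices made only for illustration.

```python
def ncp_genotype(n_a, n_u, g_case, g_ctrl):
    """First form of Eq. (1.22): genotype test of independence."""
    return n_a * n_u * sum(
        (g1 - g0) ** 2 / (n_a * g1 + n_u * g0)
        for g1, g0 in zip(g_case, g_ctrl)
    )


def ncp_genotype_k(n_a, k, g_case, g_ctrl):
    """Second form of Eq. (1.22), written with k = N_U / N_A."""
    return k * n_a * sum(
        (g1 - g0) ** 2 / (g1 + k * g0)
        for g1, g0 in zip(g_case, g_ctrl)
    )


def ncp_trend(n_a, n_u, g_case, g_ctrl, w=(0.0, 1.0, 2.0)):
    """First form of Eq. (1.24); the default weights are an assumption."""
    diff = sum(wj * (g1 - g0) for wj, g1, g0 in zip(w, g_case, g_ctrl))
    pooled = [n_a * g1 + n_u * g0 for g1, g0 in zip(g_case, g_ctrl)]
    sum_w2 = sum(wj ** 2 * m for wj, m in zip(w, pooled))
    sum_w = sum(wj * m for wj, m in zip(w, pooled))
    return n_a * n_u * diff ** 2 / (sum_w2 - sum_w ** 2 / (n_a + n_u))


def ncp_tdt(n, p_d, penetrances, delta_prime=1.0):
    """Eq. (1.25), assuming a biallelic locus so that p_+ = 1 - p_d."""
    f0, f1, f2 = penetrances
    p_plus = 1.0 - p_d
    delta = delta_prime * p_d * p_plus
    c = p_d * f2 + (1.0 - 2.0 * p_d) * f1 - p_plus * f0
    phi1 = p_d ** 2 * f2 + 2.0 * p_d * p_plus * f1 + p_plus ** 2 * f0
    e_t = 2.0 * n * (p_d * p_plus + delta * p_plus * c / phi1)
    e_nt = 2.0 * n * (p_d * p_plus - delta * p_d * c / phi1)
    return (e_t - e_nt) ** 2 / (e_t + e_nt)


# Hypothetical inputs, chosen only for illustration.
g_case = (0.50, 0.40, 0.10)
g_ctrl = (0.60, 0.33, 0.07)
lam_first = ncp_genotype(500, 1000, g_case, g_ctrl)
lam_second = ncp_genotype_k(500, 1000 / 500, g_case, g_ctrl)
lam_trend = ncp_trend(500, 1000, g_case, g_ctrl)
lam_tdt = ncp_tdt(200, 0.2, (0.01, 0.02, 0.04))
```

The two genotype forms agree up to floating-point error, and every NCP is zero when there is no effect (equal case and control frequencies, or equal penetrances for the TDT).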


1.7 The Expectation–Maximization (EM) Algorithm

We close out this chapter by providing a brief discussion of the EM algorithm and by performing an example calculation. We do so because (as mentioned below) we use the EM algorithm extensively to derive test statistics that are applied to heterogeneous data. Those who are interested in the application of such statistics and not necessarily in the derivation of the methods may skip this section.

In 1977, Dempster, Laird, and Rubin published a seminal paper on a general method (labeled “Expectation–Maximization algorithm” or “EM algorithm”) for finding (local) maximum likelihood estimates of the parameters of a likelihood function [158]. Dempster et al. noted that other authors previously published on the same topic under special circumstances (e.g., see [159–163]). Paraphrasing the text in the Wikipedia article on the EM algorithm, we comment that typically “the EM” is used in situations where the likelihood equations cannot be solved directly. Often such situations involve latent variables, unknown parameters, and known data observations. That is, either there are missing values among the data or the likelihood can be expressed more simply by assuming the existence of additional unobserved data points. For example, a mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point or latent variable that indicates the mixture component of which the data point is a member. It is this previous sentence that suggests why we use the EM algorithm extensively. For most of our heterogeneity models, we do not have a priori knowledge of the membership of all given data points (samples/individuals).

Also, the EM algorithm has two desirable properties: (i) for every iteration of the algorithm, we obtain estimates of all parameters; and (ii) for every iteration, our likelihood is non-decreasing when using the parameter estimates from (i) (under the appropriate mathematical conditions). In summary, the lion’s share of the statistics that we present in this book is derived using the EM algorithm. The parameter estimates corresponding to the maximum log-likelihood of the data are called the maximum likelihood estimates (MLEs). There are numerous published works on the theory and application of the EM algorithm (see, e.g., [158, 164, 165]). In the field of statistical genetics, publications have considered research ranging from haplotype reconstruction to population genetics and genetic association [142, 166–206].

1.7.1 Example Application

In the following example, we demonstrate how we may obtain unbiased estimates of genotype frequencies for a data set where individuals have been genotyped with a standard genotyping technology prone to some error, and the error process is random and independent. By errors, we mean that, for some individuals, the observed


genotype (determined by the standard technology) at a locus of interest is not their true genotype. In Sect. 1.1, we noted that genotype misclassification is a form of heterogeneity. In our current example, we specify that some individuals are double-sampled [207–209]: that is, they are genotyped by two independent technologies; a standard technology, already mentioned, and a “gold-standard” technology, where the genotyping error probabilities are much smaller, or zero. Thus, the double-sampled individuals have two genotype calls for the same locus. Tenenbein used the term “fallible” to refer to the classification that may contain errors (the call based on the standard genotyping technology in our case) and the term “infallible” to refer to the classification that has a much lower error rate (the gold-standard genotyping technology in our case) [207]. We shall use these definitions as well. It is important to mention that the term “observed” in “observed data” (used below in applying the EM algorithm) is not exactly the same as fallible. In fact, for people who are double-sampled the observed call is the infallible classification since it is the true call. We define Group 1 to be the set of all individuals who are double-sampled and Group 2 to be the set of individuals for whom there is only a fallible genotype call. The latent variable in this example is the true genotype for any individual who has not been double-sampled. We write out the steps of the EM algorithm and then demonstrate how we apply those steps to our example. Our log-likelihood calculations and MLEs of the genotype frequency parameters are determined using a simplification of mathematical work we published previously [210]. In that work, we considered hypothesis tests on cases and controls who are jointly phenotyped and genotyped. Here, the only data we consider are genotype calls on a sample of individuals for a single SNP locus (two alleles). 
For all the statistics below where the EM algorithm is utilized, we follow the same series of steps:

1. Describe the data for which the log-likelihood is computed, including the latent variable.
2. Provide notation, indicating the parameters that are used in the log-likelihood equations.
3. For the observed log-likelihood:
   a. State the null and alternate hypotheses (written as equation(s) of the parameters).
   b. Generate a set of starting parameter values for each hypothesis. We index the set by an integer s.
   c. Present the log-likelihood of the observed data as a function of the parameters; this function is based on an underlying mathematical model with relevance to genetic association.
4. Write out the complete log-likelihood $\ln(L_c)$ of the data.
   a. (Expectation (E)-step) Compute the expectation E of $\ln(L_c)$ conditional on the observed data (OD). This expectation is often written $E[\ln(L_c) \mid OD]$.
   b. (Maximization (M)-step) Compute the updated estimates of the non-constant parameters.
5. Determine the (max)imum log-likelihood of the observed data.
   a. Set a limit-distance ε so that M-step iterations are terminated after a finite number of steps, and compute the max log-likelihood of the data for the sth set of starting values.
   b. Determine the maximum log-likelihoods under each of the null and alternative hypotheses.
6. Compute the value of the test statistic.

Notation for iteration steps

When applying the EM algorithm, one of the first steps is providing a set of starting values for each of the parameters that are updated in each iteration. Let us denote this set by s. For all statistics in this work that use the EM algorithm, the rth (iteration) step estimate for the parameter is written as (parameter)$^{(r)}$, with the understanding that the estimate depends on the set of starting values. Also, the r = 0 step estimates, (parameter)$^{(0)}$, are the starting values drawn from the set s. We provide examples in Chap. 4.

1.7.1.1 Implementation of the Algorithm

Step 1: Describe the data. We provided this description above. The data consist of two groups of individuals who have been genotyped at a particular locus. In Group 1, the individuals have been genotyped twice: once by a standard method, and once using a gold-standard method that has a much lower (or zero) genotype misclassification rate. Individuals in Group 2 have been genotyped only once, with the standard method. It is possible that some of the genotypes for these individuals are incorrect; that is, for some individuals in Group 2, their recorded genotype is not the true genotype (see Table 1.1). The latent variable in this example is the true genotype of an individual.

Step 2: Provide notation, including terms that appear in the log-likelihood function. Throughout this example, we have 0 ≤ j, j′ ≤ 2; these values are the coded SNP genotypes.

$n^{(1)}_{j'j}$ = Number of individuals in Group 1 with true genotype j′ and observed genotype j. These individuals are double-sampled.
$n^{(1)}_{j'+} = \sum_{j} n^{(1)}_{j'j}$.
$N_1 = \sum_{j'} \sum_{j} n^{(1)}_{j'j}$ = Number of individuals in Group 1.
$n^{(2)}_{j}$ = Number of individuals in Group 2 with observed genotype j. These individuals are genotyped with the standard technology only.
$n^{(2)}_{j'}$ = Number of individuals in Group 2 with true genotype j′. These numbers are not known but are used in computation of the complete log-likelihood (below).
$N_2 = \sum_{j} n^{(2)}_{j}$ = Number of individuals in Group 2.
$N = N_1 + N_2$ = Total sample size.
$X$ = Random variable whose outcome is one of the genotypes 0, 1, 2; it is the standard (fallible) genotype call of an arbitrary individual.
$X^t$ = Random variable whose outcome is one of the genotypes 0, 1, 2; it is the true (infallible) genotype of an arbitrary individual.
$\theta_{j'j} = \Pr(X = j \mid X^t = j')$. When j′ ≠ j, these probabilities are referred to as misclassification parameters [130, 209, 211]. A fuller description of these parameters is provided in Chap. 2. As mentioned in our previous work, the MLE of each misclassification parameter is given by $\hat{\theta}_{j'j} = n^{(1)}_{j'j} / n^{(1)}_{j'+}$.
$g_{j'} = \Pr(X^t = j')$ = True frequency of genotype j′.
$g^{(r)}_{j'}$ = rth (iteration) step estimate of the genotype frequency parameter $g_{j'}$. When r is 0, the values $g^{(0)}_{j'}$ are referred to as the starting values, or starting vector, and are specified by the user. For r ≥ 1, the parameter estimates are determined through application of the EM algorithm.
$E[\,]$ = Expectation operator.
$I(\,)$ = Indicator function.
$\tau^{(r)}_{j'j} = \Pr(X^t = j' \text{ for an individual} \mid X = j \text{ for same individual})$ = rth (iteration) step estimate of the conditional probability. The probability Pr(true classification for an individual is j′ | observed classification is j) is referred to as the Bayesian Posterior Probability (BPP). In this example, it is the probability that the true genotype of an individual with observed genotype j is j′.
$\tau^{(r)}_{j'+} = \sum_{j} \tau^{(r)}_{j'j}$.

In this list, the terms that are used in the log-likelihood of the observed data are $n^{(1)}_{j'j}$, $n^{(2)}_{j}$, $\theta_{j'j}$, and $g^{(r)}_{j'}$. The first two terms are the observed data, while the third and the last terms are the parameters. The parameter θ has no superscript because it is estimated directly from the observed data and does not need to be updated.

Step 3a: State the null and alternate hypotheses (written as equation(s) of the parameters). For the majority of statistics discussed in this book that use the EM algorithm, the null hypothesis is that genomic data frequencies, such as genotype or allele frequencies, are equal in affected and unaffected groups. In the interests of presenting the simplest example, we do not consider such a situation here. However, if we were to extend our example to consider data for two groups, such as case (affected) and control (unaffected) individuals, we could consider the genotype frequencies in the case and control groups. The data and parameters for the log-likelihood would be $n^{(1)}_{ij'j}$, $n^{(2)}_{ij}$, $\theta_{j'j}$, and $g^{(r)}_{j'i}$, where i = 0 for a control, and i = 1 for a case. One null hypothesis (the most common one used in this book) is $H_0: g_{j'0} = g_{j'1}$ for all genotype categories j′; that is, for every genotype category, the case and control frequencies are equal. The most common alternative hypothesis is $H_1: g_{j'0} \ne g_{j'1}$ for at least one j′. We drop the (r) superscript because hypotheses refer to the true population values. The values $g^{(r)}_{j'i}$ are estimates of the true values, based on the data. An example where we perform hypothesis testing using double-sample data in a case-control design may be found in Chap. 4 ($LRT_{ae}$ statistic).
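The MLE of the misclassification parameters given in the notation of Step 2 uses only the double-sampled individuals (Group 1). A minimal sketch follows; the function name and the counts are hypothetical, and the sketch assumes every true genotype appears at least once in Group 1 so that no row total is zero.

```python
def estimate_theta(n1):
    """Return theta[jp][j] = n1[jp][j] / n1[jp][+].

    n1[jp][j] is the Group 1 count of individuals with true genotype jp
    (gold-standard call) and observed genotype j (standard call).
    """
    theta = []
    for row in n1:
        row_total = sum(row)  # n^{(1)}_{j'+}; assumed to be positive
        theta.append([count / row_total for count in row])
    return theta


# Hypothetical double-sample counts: rows are true genotypes 0, 1, 2.
n1 = [
    [96, 3, 1],
    [2, 57, 1],
    [0, 2, 18],
]
theta_hat = estimate_theta(n1)
```

Each row of `theta_hat` sums to 1, and the off-diagonal entries are the estimated misclassification probabilities.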


In what follows, all steps are performed for the log-likelihoods under the null and alternative hypotheses, with the understanding that the number of coordinates of starting vectors, the BPP formulas, and the updated parameter estimates will most likely differ under the two different hypotheses.

Step 3b: Generate a set of starting values $g^{(0)}_{j'}$ for the genotype frequencies. In practice, we obtain this list by calling a random number generator three times, where the random numbers are drawn from a U(0,1) distribution. Label the random values $u_0, u_1, u_2$. Let $u_+ = \sum_{j=0}^{2} u_j$. We set $g^{(0)}_{j} = u_j / u_+$. Using the notation above, we may write $s = \left(g^{(0)}_0, g^{(0)}_1, g^{(0)}_2\right)$, or $s = \left(g^{(0)}_0, g^{(0)}_1\right)$ since $g^{(0)}_2 = 1 - g^{(0)}_0 - g^{(0)}_1$.

Step 3c: Present the log-likelihood of the observed data as a function of the parameters. Without loss of generality, we specify that the first $N_1$ individuals are in Group 1 and the remaining individuals are in Group 2.

$$
\begin{aligned}
\ln(L) &= \sum_{a=1}^{N_1} \sum_{j=0}^{2} \sum_{j'=0}^{2} \ln\left[\Pr\left(X = j, X^t = j' \text{ for individual } a\right)\right] + \sum_{a=N_1+1}^{N} \sum_{j=0}^{2} \ln\left[\Pr\left(X = j \text{ for individual } a\right)\right], \\
&= \sum_{j=0}^{2} \sum_{j'=0}^{2} n^{(1)}_{j'j} \ln\left[\Pr\left(X = j, X^t = j'\right)\right] + \sum_{j=0}^{2} n^{(2)}_{j} \ln\left[\Pr(X = j)\right], \\
&= \sum_{j=0}^{2} \sum_{j'=0}^{2} n^{(1)}_{j'j} \ln\left(\theta_{j'j}\, g^{(r)}_{j'}\right) + \sum_{j=0}^{2} n^{(2)}_{j} \ln\left(\sum_{v'=0}^{2} \theta_{v'j}\, g^{(r)}_{v'}\right).
\end{aligned}
\tag{1.26}
$$

The second-to-last equation follows from the fact that the probability of an individual having an observed genotype j and a true genotype j′ is independent of the individual a. We make the same statement for the probability of an individual having an observed genotype j. The last equation follows from the second-to-last equation and the law of total probability. This last equation is important when discussing the EM algorithm, particularly when determining when to stop iterations.

Step 4: Write out the complete (log)-likelihood $L_C$ of the data. MacKenzie et al. [212] define the complete data likelihood (complete log-likelihood in our work, denoted $\ln(L_C)$) as that (log) likelihood that is constructed assuming that the value of the latent variable(s) is/are known. For our example, we have

$$\ln(L_C) = \sum_{a=1}^{N} \sum_{j'=0}^{2} I\left(X^t = j' \text{ for individual } a\right) \times \ln\left(g^{(r)}_{j'}\right). \tag{1.27}$$
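Steps 3b and 3c can be sketched in code as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the counts, and the seed are hypothetical, and θ is replaced by its Group 1 MLE.

```python
import math
import random


def random_start(rng):
    """Step 3b: draw u_0, u_1, u_2 from U(0,1) and normalize by u_+."""
    u = [rng.random() for _ in range(3)]
    u_plus = sum(u)
    return [u_j / u_plus for u_j in u]


def observed_loglik(n1, n2, theta, g):
    """Last line of Eq. (1.26): log-likelihood of the observed data."""
    ll = 0.0
    for jp in range(3):          # true genotype j'
        for j in range(3):       # observed genotype j
            if n1[jp][j] > 0:    # empty cells contribute nothing
                ll += n1[jp][j] * math.log(theta[jp][j] * g[jp])
    for j in range(3):
        if n2[j] > 0:
            pr_obs = sum(theta[v][j] * g[v] for v in range(3))  # Pr(X = j)
            ll += n2[j] * math.log(pr_obs)
    return ll


# Hypothetical data: Group 1 (rows = true genotype) and Group 2 counts.
n1 = [[96, 3, 1], [2, 57, 1], [0, 2, 18]]
n2 = [520, 390, 90]
theta = [[c / sum(row) for c in row] for row in n1]  # MLE from Group 1

g_start = random_start(random.Random(1))
ll_start = observed_loglik(n1, n2, theta, g_start)
```

Because every term is the logarithm of a probability, the returned value is negative; larger (less negative) values indicate genotype frequencies that fit the observed calls better.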

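Step 4a (Expectation (E)-step): Compute the expectation of $\ln(L_C)$ conditional on the observed data. The observed data are the genotype calls for Groups 1 and 2. If we abbreviate the term observed data by OD, we have

$$
\begin{aligned}
E[\ln(L_C) \mid OD] &= \sum_{a=1}^{N} \sum_{j'=0}^{2} E\left[I\left(X^t = j' \text{ for individual } a\right) \mid OD\right] \times \ln\left(g^{(r)}_{j'}\right) \\
&= \sum_{a=1}^{N_1} \sum_{j'=0}^{2} \sum_{j=0}^{2} \Pr\left(X^t = j' \text{ for ind } a \mid X^t = j', X = j \text{ for ind } a\right) \times \ln\left(g^{(r)}_{j'}\right) \\
&\quad + \sum_{a=N_1+1}^{N} \sum_{j'=0}^{2} \sum_{j=0}^{2} \Pr\left(X^t = j' \text{ for ind } a \mid X = j \text{ for ind } a\right) \times \ln\left(g^{(r)}_{j'}\right), \\
&= \sum_{j'=0}^{2} n^{(1)}_{j'+} \ln\left(g^{(r)}_{j'}\right) + \sum_{j'=0}^{2} \sum_{j=0}^{2} n^{(2)}_{j}\, \tau^{(r)}_{j'j} \ln\left(g^{(r)}_{j'}\right), \\
&= \sum_{j'=0}^{2} \left(n^{(1)}_{j'+} + \left(\sum_{j=0}^{2} n^{(2)}_{j}\, \tau^{(r)}_{j'j}\right)\right) \ln\left(g^{(r)}_{j'}\right).
\end{aligned}
\tag{1.28}
$$

In the set of equations leading to formula (1.28), we use the fact that the expectation of the indicator function is the probability of the event, and also that $\Pr\left(X^t = j' \text{ for ind } a \mid X^t = j', X = j \text{ for ind } a\right) = 1$. We may write out the expression for the BPPs $\tau^{(r)}_{j'j}$ using Bayes' Theorem:

$$
\begin{aligned}
\tau^{(r)}_{j'j} = \Pr\left(X^t = j' \mid X = j\right) &= \frac{\Pr\left(X^t = j', X = j\right)}{\Pr(X = j)}, \\
&= \frac{\Pr\left(X = j \mid X^t = j'\right) \Pr\left(X^t = j'\right)}{\sum_{v'} \Pr\left(X = j, X^t = v'\right)}, \\
&= \frac{\Pr\left(X = j \mid X^t = j'\right) \Pr\left(X^t = j'\right)}{\sum_{v'} \Pr\left(X = j \mid X^t = v'\right) \Pr\left(X^t = v'\right)}.
\end{aligned}
\tag{1.29}
$$

Making use of the notation above, we rewrite the quotient (1.29) as

$$\tau^{(r)}_{j'j} = \frac{\theta_{j'j}\, g^{(r)}_{j'}}{\sum_{v'} \theta_{v'j}\, g^{(r)}_{v'}}. \tag{1.30}$$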
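Equation (1.30) is a direct application of Bayes' Theorem and can be sketched as below. The function name and the numerical values of θ and g are hypothetical and serve only to illustrate the calculation.

```python
def posterior_probs(theta, g):
    """Eq. (1.30): tau[jp][j] = Pr(X^t = jp | X = j)."""
    tau = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        # Denominator of Eq. (1.30): Pr(X = j) by the law of total probability.
        pr_obs = sum(theta[v][j] * g[v] for v in range(3))
        for jp in range(3):
            tau[jp][j] = theta[jp][j] * g[jp] / pr_obs
    return tau


# Hypothetical misclassification matrix (rows = true genotype) and frequencies.
theta = [
    [0.96, 0.03, 0.01],
    [0.03, 0.95, 0.02],
    [0.00, 0.10, 0.90],
]
g = [0.50, 0.40, 0.10]
tau = posterior_probs(theta, g)
```

For each observed genotype j, the three posterior probabilities `tau[0][j]`, `tau[1][j]`, and `tau[2][j]` sum to 1. When θ is the identity matrix (no misclassification), τ is also the identity matrix.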


The conditional probabilities $\tau^{(r)}_{j'j}$ are functions of the genotype frequencies $g^{(r)}_{j'}$. Therefore, as the genotype frequencies get updated (M-step, below), the BPPs get updated as well.

Step 4b (M-step): Compute the updated estimates $g^{(r+1)}_{j'}$ of the non-constant parameters. Specifically, solve the system of equations

$$\left\{ \frac{\partial}{\partial g^{(r)}_{j'}} E[\ln(L_C) \mid OD] = 0 \right\}_{j' = 0, 1}$$

for the terms $g^{(r)}_{j'}$. The system of equations considered does not include j′ = 2; the reason is that $g^{(r)}_2 = 1 - g^{(r)}_0 - g^{(r)}_1$. Also, when computing the derivatives, one can drop the superscript (r). We compute:

$$\frac{\partial}{\partial g_{v'}} E[\ln(L_C) \mid OD] = \frac{\partial}{\partial g_{v'}} \left( \sum_{j'=0}^{2} c_{j'} \ln\left(g_{j'}\right) \right), \quad v' = 0, 1, \tag{1.31}$$

where we use the abbreviation:

$$c_{j'} = \left(n^{(1)}_{j'+} + \left(\sum_{j=0}^{2} n^{(2)}_{j}\, \tau_{j'j}\right)\right). \tag{1.32}$$

Computing the partial derivatives in Eq. (1.31) and setting the results to 0, we obtain the equations:

$$\frac{c_0}{g_0} - \frac{c_2}{g_2} = 0, \qquad \frac{c_1}{g_1} - \frac{c_2}{g_2} = 0.$$

From these equations, we determine that

$$g_0 = \frac{c_0\, g_2}{c_2}, \qquad g_1 = \frac{c_1\, g_2}{c_2}.$$

It follows:

$$1 = \sum_{j'=0}^{2} g_{j'} = \frac{g_2}{c_2} \sum_{j'=0}^{2} c_{j'}. \tag{1.33}$$

Given that $\sum_{j'=0}^{2} c_{j'} = N$, it follows from Eq. (1.33) that $g_2 = c_2 / N$. Substituting into the expressions above, we conclude:

$$g_{j'} = \frac{c_{j'}}{N}, \quad j' = 0, 1, 2.$$

Re-inserting the superscript (r), we may rewrite this equation as

$$g^{(r+1)}_{j'} = \frac{c^{(r)}_{j'}}{N}, \quad j' = 0, 1, 2. \tag{1.34}$$
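One full E-step and M-step (Eqs. (1.30), (1.32), and (1.34)) can be sketched as a single update function. The function name and the data are hypothetical; the example uses an identity θ so that the result can be checked by hand.

```python
def em_update(n1, n2, theta, g):
    """Return g^(r+1) from g^(r) using Eqs. (1.30), (1.32), and (1.34)."""
    n_total = sum(sum(row) for row in n1) + sum(n2)  # N = N_1 + N_2
    g_next = []
    for jp in range(3):
        n1_row = sum(n1[jp])  # n^{(1)}_{j'+}: true genotype known in Group 1
        expected_group2 = 0.0
        for j in range(3):
            pr_obs = sum(theta[v][j] * g[v] for v in range(3))
            tau = theta[jp][j] * g[jp] / pr_obs  # BPP, Eq. (1.30)
            expected_group2 += n2[j] * tau
        c_jp = n1_row + expected_group2          # Eq. (1.32)
        g_next.append(c_jp / n_total)            # Eq. (1.34)
    return g_next


# Hypothetical check with no misclassification (theta is the identity).
n1 = [[10, 0, 0], [0, 5, 0], [0, 0, 5]]
n2 = [40, 30, 10]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
g_next = em_update(n1, n2, identity, [1 / 3, 1 / 3, 1 / 3])
```

With an identity θ, the observed calls in Group 2 are the true genotypes, so a single update returns the pooled sample proportions (0.50, 0.35, 0.15) regardless of the starting values. In general, the updated frequencies sum to 1 because the $c_{j'}$ sum to N.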


In Eq. (1.34), we use the notation $c^{(r)}_{j'}$, since $c_{j'}$ from Eq. (1.32) is updated with each iteration step. Equation (1.34) indicates that the (r + 1)th-iteration step estimates of the genotype frequencies are functions of the (r)th-step estimates $\tau^{(r)}_{j'j}$.

Step 5a: Set a limit-distance ε so that M-step iterations are terminated after a finite number of steps. Compute the max(imum) log-likelihood of the data. For our example, with the calculations in Step 4b, we may update the genotype frequencies $g^{(r+1)}_{j'}$ ad infinitum (i.e., let r → ∞). The log-likelihood of the observed data is non-decreasing as r increases [158]; that is,

$$\left(\ln(L) \text{ using estimates } g^{(r+1)}_{j'}\right) \ge \left(\ln(L) \text{ using estimates } g^{(r)}_{j'}\right). \tag{1.35}$$

Recall from Eq. (1.26) that, for our example,

$$\ln(L) = \sum_{j=0}^{2} \sum_{j'=0}^{2} n^{(1)}_{j'j} \ln\left(\theta_{j'j}\, g^{(r)}_{j'}\right) + \sum_{j=0}^{2} n^{(2)}_{j} \ln\left(\sum_{v'=0}^{2} \theta_{v'j}\, g^{(r)}_{v'}\right).$$

There is a point of diminishing returns with the updating. As one increases the iteration r, the difference between the two terms in expression (1.35) typically becomes smaller. We declare that we have obtained our parameter MLEs for this example when the following condition is met:

$$\left(\ln(L) \text{ using estimates } g^{(r+1)}_0, g^{(r+1)}_1\right) - \left(\ln(L) \text{ using estimates } g^{(r)}_0, g^{(r)}_1\right) \le \varepsilon, \tag{1.36a}$$

for a pre-specified number ε > 0, defined as the limit-distance. The general form of the inequality is

$$\left(\ln(L) \text{ using estimates } P^{(r+1)}\right) - \left(\ln(L) \text{ using estimates } P^{(r)}\right) \le \varepsilon, \tag{1.36b}$$

with $P^{(r)} = \left(p^{(r)}_1, p^{(r)}_2, \ldots, p^{(r)}_K\right)$ being the K parameters that are updated in the log-likelihood. As an illustration, $P^{(r)} = \left(g^{(r)}_0, g^{(r)}_1\right)$ in the current example, and K = 2. Another example vector of parameters is given in the next section. We refer to inequality (1.36b) as the stopping condition. The log-likelihood ln(L) using the estimates $\left(g^{(r)}_0, g^{(r)}_1\right)$ is the max log-likelihood for our set of starting values generated in Step 3b. Iteration step r is the smallest number for which condition (1.36b) is met.

To reduce computational time in the evaluation of inequality (1.36a, 1.36b), we add a maximum number of iterations, I, over which the inequality is evaluated. For a given vector of starting values, if the inequality is not satisfied for any value r ≤ I, we declare that the maximum log-likelihood for the given vector is ln(L) using the Ith-iteration step parameter estimates. A generic example using I = 15 is presented in Table 1.6.

Step 5b: Determine the maximum log-likelihoods under the null and alternative hypotheses. Given that there are S starting vectors $\sigma_{s,h}$, 1 ≤ s ≤ S, for each hypothesis, where h = 0 (Null hypothesis) and h = 1 (Alternative hypothesis) (a total of 2S starting vectors), the maximum log-likelihood for each hypothesis h is the one that satisfies the condition:

$$\max_{1 \le s \le S} \left\{ \ln(L) \text{ in Eq. (1.36b) when starting vector is } \sigma_{s,h} \right\}. \tag{1.37}$$

Implicit in the definition of the value (1.37) is the fact that one must repeat Steps 3–5 for each new set of starting values. The number of parameters and the probabilities they represent are usually different under the null and alternative hypotheses. For example, in Sect. 4.2.1 of Chap. 4, for the $LRT_{ae}$ statistic when there are three categories of genotypes (single SNP), the parameters under the null and alternative hypotheses are

$$H_0: P^{(r)}_0 = \left(g^{(r)}_{0*}, g^{(r)}_{1*}, \phi^{(r)}_1\right) \quad (K = 3 \text{ parameters})$$

$$H_1: P^{(r)}_1 = \left(g^{(r)}_{00}, g^{(r)}_{10}, g^{(r)}_{01}, g^{(r)}_{11}, \phi^{(r)}_1\right) \quad (K = 5 \text{ parameters}).$$

The parameter estimates $\phi^{(r)}_1$ under the null and alternative hypotheses are likely to be different.

Step 6: Compute the test statistic. Let $\ln(L_h)$ be the values in Eq. (1.37) under the respective hypothesis (h = 0, 1). The likelihood ratio statistic, LRT, is defined as $LRT = 2[\ln(L_1) - \ln(L_0)]$. This statistic is used extensively throughout our book, especially in Chap. 4. Under certain mathematical conditions [213], the LRT is asymptotically distributed as a central chi-square statistic with varying degrees of freedom under the null hypothesis.

Step 7: Practical issues. To protect against potentially finding a local maximum rather than a global maximum log-likelihood, we apply the EM algorithm with multiple random starting vectors (Step 3b). Note that we do this for the log-likelihoods calculated under the null and alternate hypotheses. Our experience suggests that a minimum of S = 200–300 such vectors is sufficient to find a global maximum.

We repeat the statement we made in the Initial Comments section: all notation is local. While we try to maintain the same notation for each application of the EM algorithm, sometimes the notation differs for a different statistic. The reason for this is primarily historical. In a number of situations, we use the notation of the particular paper in which the statistic was published.
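Steps 3b through 5b, together with the multiple-start safeguard of Step 7, can be combined into a short routine for the single-group example. This is a minimal sketch under stated assumptions, not the authors' software: all function names, the counts, the seed, and the default values of ε, I, and S are hypothetical, and θ is fixed at its Group 1 MLE.

```python
import math
import random


def loglik(n1, n2, theta, g):
    """Observed-data log-likelihood, last line of Eq. (1.26)."""
    ll = 0.0
    for jp in range(3):
        for j in range(3):
            if n1[jp][j] > 0:
                ll += n1[jp][j] * math.log(theta[jp][j] * g[jp])
    for j in range(3):
        if n2[j] > 0:
            ll += n2[j] * math.log(sum(theta[v][j] * g[v] for v in range(3)))
    return ll


def em_update(n1, n2, theta, g):
    """One E-step and M-step: Eqs. (1.30), (1.32), and (1.34)."""
    n_total = sum(sum(row) for row in n1) + sum(n2)
    g_next = []
    for jp in range(3):
        c_jp = float(sum(n1[jp]))
        for j in range(3):
            pr_obs = sum(theta[v][j] * g[v] for v in range(3))
            c_jp += n2[j] * theta[jp][j] * g[jp] / pr_obs
        g_next.append(c_jp / n_total)
    return g_next


def run_em(n1, n2, theta, g_start, eps=0.0005, max_iter=15):
    """Iterate until stopping condition (1.36b) holds or max_iter is reached.

    Returns (estimates, log-likelihood, iteration step r).
    """
    g = list(g_start)
    ll = loglik(n1, n2, theta, g)
    for r in range(max_iter):
        g_next = em_update(n1, n2, theta, g)
        ll_next = loglik(n1, n2, theta, g_next)
        if ll_next - ll <= eps:      # condition (1.36b) met at step r
            return g, ll, r
        g, ll = g_next, ll_next
    return g, ll, max_iter           # fall back to the I-th step estimates


def multi_start_em(n1, n2, theta, n_starts=200, seed=0, **kwargs):
    """Eq. (1.37): keep the best result over S random starting vectors."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        u = [rng.random() for _ in range(3)]
        start = [x / sum(u) for x in u]  # Step 3b
        result = run_em(n1, n2, theta, start, **kwargs)
        if best is None or result[1] > best[1]:
            best = result
    return best


# Hypothetical data; theta is the Group 1 MLE.
n1 = [[96, 3, 1], [2, 57, 1], [0, 2, 18]]
n2 = [520, 390, 90]
theta = [[c / sum(row) for c in row] for row in n1]
best_g, best_ll, best_r = multi_start_em(n1, n2, theta, n_starts=20)
```

For a hypothesis test, the same routine would be run once under each hypothesis, with the constrained parameterization under the null, and the two maxima combined into the LRT of Step 6.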

Table 1.6 Log-likelihoods as a function of random starting vectors and iteration steps

| Iteration step (r) | s = 1, starting vector (0.508, 0.540): lnLike | s = 1: ΔlnLike | s = 2, starting vector (0.722, 0.223): lnLike | s = 2: ΔlnLike | s = 3, starting vector (0.656, 0.961): lnLike | s = 3: ΔlnLike | s = 4, starting vector (0.083, 0.549): lnLike | s = 4: ΔlnLike |
|---|---|---|---|---|---|---|---|---|
| 0 (starting value) | −13050.5842 | NA | −19901.8934 | NA | −11755.6283 | NA | −15823.4455 | NA |
| 1 | −8910.2822 | 4140.302 | −15205.0171 | 4696.8763 | −11628.443 | 127.1853 | −14241.1006 | 1582.3449 |
| 2 | −7936.2482 | 974.034 | −14065.4449 | 1139.5722 | −11626.7903 | 1.6527 | −14098.6895 | 142.4111 |
| 3 | −7655.0956 | 281.1526 | −13653.1383 | 412.3066 | **−11626.7741** | 0.0162 | −14084.5907 | 14.0988 |
| 4 | −7577.6863 | 77.4093 | −13476.2108 | 176.9275 | −11626.7739 | 0.0002 | −14083.1819 | 1.40880 |
| 5 | −7553.0445 | 24.6418 | −13391.0980 | 85.1128 | | | −14083.0407 | 0.1412 |
| 6 | −7545.0941 | 7.9504 | −13353.5063 | 37.5917 | | | −14083.0265 | 0.0142 |
| 7 | −7541.8974 | 3.1967 | −13333.4129 | 20.0934 | | | **−14083.0249** | 0.0016 |
| 8 | −7541.1071 | 0.7903 | −13323.9754 | 9.4375 | | | −14083.0245 | 0.0004 |
| 9 | −7540.7334 | 0.3737 | −13319.6529 | 4.3225 | | | | |
| 10 | −7540.6055 | 0.1279 | −13317.9015 | 1.7514 | | | | |
| 11 | −7540.5721 | 0.0334 | −13317.1269 | 0.7746 | | | | |
| 12 | −7540.5598 | 0.0123 | −13316.6125 | 0.5144 | | | | |
| 13 | −7540.5550 | 0.0048 | −13316.3638 | 0.2487 | | | | |
| 14 | **−7540.5538** | 0.0012 | −13316.2771 | 0.0867 | | | | |
| 15 | −7540.5534 | 0.0004 | **−13316.2388** | 0.0383 | | | | |

The log-likelihood values in this table are mock values that do not refer to any data set. They are provided for illustrative purposes. There are S = 4 starting vectors, and the vectors are two-dimensional. For the sth starting set, we write the parameters as $(p_{1s}, p_{2s})$. There are I = 15 iteration steps specified. Example starting vectors are $(p_{11}, p_{21})$ = (0.508, 0.540) for the first starting set (s = 1), and $(p_{12}, p_{22})$ = (0.722, 0.223) for the second starting set (s = 2). The terms “lnLike” and “ΔlnLike” refer to the log-likelihoods and the difference in consecutive log-likelihoods for a specific starting point/vector and iteration value. We use the term NA (Not Applicable) for each 0th-iteration step since the corresponding log-likelihood is the first log-likelihood for each starting vector. The terms in bold indicate the maximum log-likelihood for each starting vector. The limit-distance is ε = 0.0005.
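The stopping condition (1.36b) can be checked mechanically against the mock log-likelihoods of Table 1.6. The sketch below copies the lnLike columns from the table; the function name is a hypothetical choice, and ε = 0.0005 and I = 15 are the values stated with the table.

```python
def stopping_step(loglikes, eps, max_iter):
    """Smallest r with lnLike(r+1) - lnLike(r) <= eps; max_iter if none."""
    for r in range(min(max_iter, len(loglikes) - 1)):
        if loglikes[r + 1] - loglikes[r] <= eps:
            return r
    return max_iter


# lnLike columns of Table 1.6 (mock values), indexed by iteration step r.
table = {
    1: [-13050.5842, -8910.2822, -7936.2482, -7655.0956, -7577.6863,
        -7553.0445, -7545.0941, -7541.8974, -7541.1071, -7540.7334,
        -7540.6055, -7540.5721, -7540.5598, -7540.5550, -7540.5538,
        -7540.5534],
    2: [-19901.8934, -15205.0171, -14065.4449, -13653.1383, -13476.2108,
        -13391.0980, -13353.5063, -13333.4129, -13323.9754, -13319.6529,
        -13317.9015, -13317.1269, -13316.6125, -13316.3638, -13316.2771,
        -13316.2388],
    3: [-11755.6283, -11628.443, -11626.7903, -11626.7741, -11626.7739],
    4: [-15823.4455, -14241.1006, -14098.6895, -14084.5907, -14083.1819,
        -14083.0407, -14083.0265, -14083.0249, -14083.0245],
}

r_max = {s: stopping_step(ll, eps=0.0005, max_iter=15) for s, ll in table.items()}
max_loglik = {s: table[s][r_max[s]] for s in table}
overall_max = max(max_loglik.values())
```

The resulting stopping steps are 14, 15, 3, and 7 for s = 1, 2, 3, and 4, which match the bold entries of the table, and the overall maximum is the s = 1 value −7540.5538.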

In Table 1.6, we see that each starting vector produces a different iteration step for the maximum log-likelihood. For the first starting vector (s = 1), the smallest iteration step that satisfies inequality (1.36a, 1.36b) is r = 14 since

(ln(L) using (r + 1, s) = (15, 1) parameter estimates) − (ln(L) using (r, s) = (14, 1) parameter estimates) = −7540.5534 − (−7540.5538) = 0.0004 < 0.0005 = ε.

The ΔlnLike values for all iteration steps before 14 are greater than 0.0005. Said another way, $(r_{max}, s)$ = (14, 1) when s = 1. For the second starting vector (s = 2), there is no iteration step r that satisfies inequality (1.36a, 1.36b), so we set $r_{max}$ = I = 15 (see the paragraph after inequality (1.36a, 1.36b)), and conclude that the maximum log-likelihood for the second starting vector is −13316.2388. Given the same reasoning as for the first starting vector, $(r_{max}, s)$ = (3, 3) for the third starting vector and $(r_{max}, s)$ = (7, 4) for the fourth, with corresponding maximum log-likelihoods of −11626.7741 and −14083.0249. It follows that the maximum log-likelihood for the example values in Table 1.6 is −7540.5538.

Hypothesis testing using maximum log-likelihoods and the EM algorithm

Another benefit of likelihood theory is that we can perform statistical hypothesis testing. In all the examples we provide in this book where a likelihood function is stated, our alternative hypothesis is that none of the parameters are constrained; that is, the maximum log-likelihood under the alternative hypothesis is obtained as described above, by maximizing over the entire range of each parameter. The null hypothesis is given by constraining some of the parameters to be fixed values. To use an example from human linkage analysis with one parameter, the alternative hypothesis is that the recombination fraction between two loci, typically written as θ, can take on any value between 0 and ½. The null hypothesis is that θ is constrained to ½ [17]. For our work, let $\ln(L_1)$ be the value in the inequality (1.36a, 1.36b), maximized over the full range of all parameters. Let $\ln(L_0)$ be similarly defined, with the modification that a subset of the parameters in the vector θ are set to fixed values for all iterations in the EM algorithm. The test statistic, known as the likelihood ratio test, is given by

$$LRT = 2\left(\ln(L_1) - \ln(L_0)\right). \tag{1.39}$$

For sufficiently large sample sizes, the null distribution of the LRT statistic is a central chi-square distribution, where the degrees of freedom are given by the number of parameters that are fixed under the null hypothesis; in general, the degrees of freedom are the difference between the number of parameters fixed under the null


and the number fixed under the alternative. However, for our work in this book, the parameters under the alternative distribution will usually not be constrained. In some circumstances, the alternative distribution may be known, and it usually is a non-central chi-square distribution with a specified non-centrality parameter. However, when dealing with mixtures, determination of the alternative distribution can be more challenging. In Chaps. 4 and 5, we provide closed-form solutions of non-centrality parameters where possible.
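As an illustration of Eq. (1.39), the sketch below computes the LRT and an asymptotic p-value from a central chi-square distribution. To stay within the standard library it uses the closed-form upper tail that holds only for an even number of degrees of freedom, $\Pr(\chi^2_{2m} > x) = e^{-x/2} \sum_{i=0}^{m-1} (x/2)^i / i!$; that restriction, the function names, and the two log-likelihood values are assumptions made for this example and are not taken from the book.

```python
import math


def lrt_statistic(loglik_alt, loglik_null):
    """Eq. (1.39): LRT = 2 (ln(L_1) - ln(L_0))."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a central chi-square with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# Hypothetical maximum log-likelihoods under H1 and H0.
lrt = lrt_statistic(-7540.0, -7545.0)
# Example: 5 free parameters under H1 and 3 under H0 give df = 2.
p_value = chi2_sf_even_df(lrt, df=2)
```

For an odd number of degrees of freedom, or for power calculations under a non-central chi-square alternative, a statistical library would be needed instead of this closed form.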

References

1. Merriam-Webster.com: Heterogeneous (2015)
2. National library of medicine: genetics home reference [Internet]. https://ghr.nlm.nih.gov (2013)
3. Pagon, R.A., Adam, M.P., Ardinger, H.H., et al. (eds.): Genereviews® [Internet] Illustrated Glossary (1993–2015)
4. Tonin, P.N.: Genes implicated in hereditary breast cancer syndromes. Semin. Surg. Oncol. 18(4), 281–286 (2000)
5. Rehman, A.U., Santos-Cortez, R.L., Drummond, M.C., Shahzad, M., Lee, K., Morell, R.J., et al.: Challenges and solutions for gene identification in the presence of familial locus heterogeneity. Eur. J. Hum. Genet. 23(9), 1207–1215 (2015). https://doi.org/10.1038/ejhg.2014.266
6. Inglehearn, C.F., Tarttelin, E.E., Plant, C., Peacock, R.E., al-Maghtheh, M., Vithana, E., et al.: A linkage survey of 20 dominant retinitis pigmentosa families: frequencies of the nine known loci and evidence for further heterogeneity. J. Med. Genet. 35(1), 1–5 (1998)
7. Gonsales, M.C., Montenegro, M.A., Soler, C.V., Coan, A.C., Guerreiro, M.M., Lopes-Cendes, I.: Recent developments in the genetics of childhood epileptic encephalopathies: impact in clinical practice. Arq. Neuropsiquiatr. 73(11), 946–958 (2015). https://doi.org/10.1590/0004282x20150122
8. Allison, K.H., Sledge, G.W.: Heterogeneity and cancer. Oncology (Williston Park) 28(9), 772–778 (2014)
9. Geschwind, D.H., Flint, J.: Genetics and genomics of psychiatric disease. Science 349(6255), 1489–1494 (2015). https://doi.org/10.1126/science.aaa8954
10. Ringman, J.M., Goate, A., Masters, C.L., Cairns, N.J., Danek, A., Graff-Radford, N., et al.: Genetic heterogeneity in alzheimer disease and implications for treatment strategies. Curr. Neurol. Neurosci. Rep. 14(11), 499 (2014). https://doi.org/10.1007/s11910-014-0499-8
11. Sabatelli, M., Conte, A., Zollino, M.: Clinical and genetic heterogeneity of amyotrophic lateral sclerosis. Clin. Genet. 83(5), 408–416 (2013). https://doi.org/10.1111/cge.12117
12. Online Mendelian Inheritance in Man, Omim®. https://omim.org/. Accessed 30 Dec 2019
13. National Center for Biotechnology Information: Gene: Brca1. https://www.ncbi.nlm.nih.gov/gene/?term=BRCA1 (2017)
14. National Center for Biotechnology Information: Gene: Brca2. https://www.ncbi.nlm.nih.gov/gene/?term=BRCA2 (2017)
15. Smith, C.A.B.: Homogeneity test for linkage data. Proc. Sec. Int. Congr. Hum. Genet. 1, 212–213 (1961)
16. Morton, N.E.: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet. 8, 80–96 (1956)
17. Ott, J.: Analysis of Human Genetic Linkage, 3rd edn. The Johns Hopkins University Press, Baltimore, MD (1999)
18. Risch, N.: A new statistical test for linkage heterogeneity. Am. J. Hum. Genet. 42(2), 353–364 (1988)


19. Goldstein, D.R.: A combined test of linkage heterogeneity. Am. J. Hum. Genet. 55(4), 841–848 (1994) 20. Hodge, S.E., Anderson, C.E., Neiswanger, K., Sparkes, R.S., Rimoin, D.L.: The search for heterogeneity in insulin-dependent diabetes mellitus (Iddm): linkage studies, two-locus models, and genetic heterogeneity. Am. J. Hum. Genet. 35(6), 1139–1155 (1983) 21. Ott, J.: Linkage analysis and family classification under heterogeneity. Ann. Hum. Genet. 47(Pt 4), 311–320 (1983) 22. Risch, N., Baron, M.: X-linkage and genetic heterogeneity in bipolar-related major affective illness: reanalysis of linkage data. Ann. Hum. Genet. 46(Pt 2), 153–166 (1982) 23. Gao, H., Zhou, Y., Ma, W., Liu, H., Zhao, L.: An estimating function approach to linkage heterogeneity. J. Genet. 92(3), 413–421 (2013) 24. Talebizadeh, Z., Arking, D.E., Hu, V.W.: A novel stratification method in linkage studies to address inter- and intra-family heterogeneity in autism. PLoS ONE 8(6), e67569 (2013). https://doi.org/10.1371/journal.pone.0067569 25. Bautista, J.F., Kelly, J.A., Harley, J.B., Gray-McGuire, C.: Addressing genetic heterogeneity in complex disease: finding seizure genes in systemic lupus erythematosus. Epilepsia 49(3), 527–530 (2008). https://doi.org/10.1111/j.1528-1167.2007.01453.x 26. Paaby, A.B., Rockman, M.V.: The many faces of pleiotropy. Trends Genet. 29(2), 66–73 (2013). https://doi.org/10.1016/j.tig.2012.10.010 27. Lobo, I.: Pleiotropy: one gene can affect multiple traits. Nat. Edu. 1(1), 10 (2008) 28. Nussbaum, R.L., McInnes, R.R., Willard, H.F.: Thompson & Thompson Genetics in Medicine. Elsevier Health Sciences (2015) 29. Terwilliger, J.D., Ott, J.: Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore (1994) 30. Acton, R.T., Barton, J.C., Leiendecker-Foster, C., Zaun, C., McLaren, C.E., Eckfeldt, J.H.: Tumor necrosis factor-alpha promoter variants and iron phenotypes in 785 hemochromatosis and iron overload screening (Heirs) study participants. 
Blood Cells Mol. Dis. 44(4), 252–256 (2010). https://doi.org/10.1016/j.bcmd.2010.01.007 31. Kullo, I.J., Fan, J., Pathak, J., Savova, G.K., Ali, Z., Chute, C.G.: Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J. Am. Medi. Inf. Assoc. JAMIA 17(5), 568–574 (2010). https://doi.org/10.1136/jamia.2010.004366 32. Ruggieri, M., Pavone, P., Scapagnini, G., Romeo, L., Lombardo, I., Li Volti, G., et al.: The aristaless (Arx) gene: one gene for many “interneuronopathies”. Front. Biosci. (Elite edn) 2, 701–710 (2010). https://doi.org/10.2741/e130 33. Volpi, L., Ricci, G., Passino, C., Di Pierri, E., Alì, G., Maccherini, M., et al.: Prevalent cardiac phenotype resulting in heart transplantation in a novel Lmna gene duplication. Neuromus. Disord. NMD 20(8), 512–516 (2010). https://doi.org/10.1016/j.nmd.2010.03.016 34. Bennett, S.N., Caporaso, N., Fitzpatrick, A.L., Agrawal, A., Barnes, K., Boyd, H.A., et al.: Phenotype harmonization and cross-study collaboration in Gwas consortia: the geneva experience. Genet. Epidemiol. 35(3), 159–173 (2011). https://doi.org/10.1002/gepi.20564 35. Davies, P.F., Civelek, M.: Endoplasmic reticulum stress, redox, and a proinflammatory environment in athero-susceptible endothelium in vivo at sites of complex hemodynamic shear stress. Antioxid. Redox Signal. 15(5), 1427–1432 (2011). https://doi.org/10.1089/ars.2010. 3741 36. Sousa, A.G., Selvatici, L., Krieger, J.E., Pereira, A.C.: Association between genetics of diabetes, coronary artery disease, and macrovascular complications: exploring a common ground hypothesis. Rev. Diabetic Stud. RDS 8(2), 230–244 (2011). https://doi.org/10.1900/ RDS.2011.8.230 37. Arc, O.C., Arc, O.C., Zeggini, E., Panoutsopoulou, K., Southam, L., Rayner, N.W., et al.: Identification of new susceptibility loci for osteoarthritis (arcogen): a genome-wide association study. 
Lancet (London, England) 380(9844), 815–823 (2012). https://doi.org/10.1016/ S0140-6736(12)60681-3


38. Fang, Y., Davies, P.F.: Site-specific microrna-92a regulation of Kruppel-like factors 4 and 2 in atherosusceptible endothelium. Arterioscler. Thromb. Vasc. Biol. 32(4), 979–987 (2012). https://doi.org/10.1161/ATVBAHA.111.244053 39. Gourraud, J.B., Kyndt, F., Fouchard, S., Rendu, E., Jaafar, P., Gully, C., et al.: Identification of a strong genetic background for progressive cardiac conduction defect by epidemiological approach. Heart (British Cardiac Society) 98(17), 1305–1310 (2012). https://doi.org/10.1136/ heartjnl-2012-301872 40. Labbe, A., Liu, A., Atherton, J., Gizenko, N., Fortier, M.-È., Sengupta, S.M., Ridha, J.: Refining psychiatric phenotypes for response to treatment: contribution of Lphn3 in Adhd. Am. J. Medi. Genet. Part B, Neuropsy. Genet. Offi. Publi. Int. Soc. Psy. Genet. 159B(7), 776–785 (2012). https://doi.org/10.1002/ajmg.b.32083 41. Minucci, A., Canu, G., Tellone, E., Giardina, B., Zuppi, C., Capoluongo, E.: Phenotype heterogeneity of hyperbilirubinemia condition: the lesson by coinheritance of glucose-6phosphate dehydrogenase deficiency and Crigler-Najjar syndrome type Ii in an Italian patient. Blood Cells Mol. Dis. 49(2), 118–119 (2012). https://doi.org/10.1016/j.bcmd.2012.05.004 42. Silva Pinto, C., Fidalgo, T., Salvado, R., Marques, D., Gonçalves, E., Martinho, P., et al.: Molecular diagnosis of haemophilia a at Centro Hospitalar De Coimbra in Portugal: study of 103 families—15 new mutations. Haemophilia Offi. J. World Federat. Hemophilia 18(1), 129–138 (2012). https://doi.org/10.1111/j.1365-2516.2011.02570.x 43. Sinner, M.F., Porthan, K., Noseworthy, P.A., Havulinna, A.S., Tikkanen, J.T., MüllerNurasyid, M., et al.: A meta-analysis of genome-wide association studies of the electrocardiographic early repolarization pattern. Heart Rhythm 9(10), 1627–1634 (2012). https:// doi.org/10.1016/j.hrthm.2012.06.008 44. 
El Andalousi, J., Murawski, I.J., Capolicchio, J.P., El-Sherbiny, M., Jednak, R., Gupta, I.R.: A single-center cohort of Canadian children with Vur reveals renal phenotypes important for genetic studies. Pediatr. Nephrol. (Berlin, Germany) 28(9), 1813–1819 (2013). https://doi. org/10.1007/s00467-013-2440-9 45. Kim, M.J., Kim, S.J., Kim, J., Chae, H., Kim, M., Kim, Y.: Genotype and phenotype heterogeneity in Perrault syndrome. J. Pediatr. Adolesc. Gynecol. 26(1), e25–e27 (2013). https:// doi.org/10.1016/j.jpag.2012.10.008 46. Lim, B.C., Lee, S., Shin, J.Y., Hwang, H., Kim, K.J., Hwang, Y.S., et al.: Molecular diagnosis of congenital muscular dystrophies with defective glycosylation of alpha-dystroglycan using next-generation sequencing technology. Neuromus. Disord. NMD 23(4), 337–344 (2013). https://doi.org/10.1016/j.nmd.2013.01.007 47. Wu, W., Clark, E.A.S., Stoddard, G.J., Watkins, W.S., Esplin, M.S., Manuck, T.A., et al.: Effect of interleukin-6 polymorphism on risk of preterm birth within population strata: a meta-analysis. BMC Genet. 14, 30 (2013). https://doi.org/10.1186/1471-2156-14-30 48. Bagnall, R.D., Molloy, L.K., Kalman, J.M., Semsarian, C.: Exome sequencing identifies a mutation in the Actn2 gene in a family with idiopathic ventricular fibrillation, left ventricular noncompaction, and sudden death. BMC Med. Genet. 15, 99 (2014). https://doi.org/10.1186/ s12881-014-0099-0 49. Jalkh, N., Guissart, C., Chouery, E., Yammine, T., El Ali, N., Farah, H.A., Mégarbané, A.: Report of a novel mutation in Crb1 in a Lebanese family presenting retinal dystrophy. Ophthalmic Genet. 35(1), 57–62 (2014). https://doi.org/10.3109/13816810.2013.763995 50. Nowinska, A.K., Wylegala, E., Teper, S., Wróblewska-Czajka, E., Aragona, P., Roszkowska, A.M., et al.: Phenotype and genotype analysis in patients with macular corneal dystrophy. Brit. J. Ophthalmol. 98(11), 1514–1521 (2014). https://doi.org/10.1136/bjophthalmol-2014305098 51. 
Wakimoto, H., Tanaka, S., Curry, W.T., Loebel, F., Zhao, D., Tateishi, K., et al.: Targetable signaling pathway mutations are associated with malignant phenotype in Idh-mutant gliomas. Clini. Cancer Res. Offi. J. Am. Assoc. Cancer Res. 20(11), 2898–2909 (2014). https://doi. org/10.1158/1078-0432.CCR-13-3052 52. Guido, D., Morandi, G., Palluzzi, F., Borroni, B.: Telling the story of frontotemporal dementia by bibliometric analysis. J. Alzheimer’s Dis. JAD 48(3), 703–709 (2015). https://doi.org/10. 3233/JAD-150275


53. Padang, R., Bagnall, R.D., Tsoutsman, T., Bannon, P.G., Semsarian, C.: Comparative transcriptome profiling in human bicuspid aortic valve disease using Rna sequencing. Physiol. Genomics 47(3), 75–87 (2015). https://doi.org/10.1152/physiolgenomics.00115.2014 54. Rönnbäck, C., Nissen, C., Almind, G.J., Grønskov, K., Milea, D., Larsen, M.: Genotypephenotype heterogeneity of ganglion cell and inner plexiform layer deficit in autosomaldominant optic atrophy. Acta Ophthalmol. 93(8), 762–766 (2015). https://doi.org/10.1111/ aos.12835 55. Castaño-Betancourt, M.C., Evans, D.S., Ramos, Y.F.M., Boer, C.G., Metrustry, S., Liu, Y., et al.: Novel genetic variants for cartilage thickness and hip osteoarthritis. PLoS Genet. 12(10), e1006260–e1006260 (2016). https://doi.org/10.1371/journal.pgen.1006260 56. Roucher-Boulez, F., Mallet-Motak, D., Samara-Boustani, D., Jilani, H., Ladjouze, A., Souchon, P.F., et al.: Nnt mutations: a cause of primary adrenal insufficiency, oxidative stress and extra-adrenal defects. Eur. J. Endocrinol. 175(1), 73–84 (2016). https://doi.org/10.1530/ EJE-16-0056 57. Wang, L., Chen, Y., Chen, X., Sun, X.: Further evidence for P59l mutation in Gja3 associated with autosomal dominant congenital cataract. Indian J. Ophthalmol. 64(7), 508–512 (2016). https://doi.org/10.4103/0301-4738.190139 58. Zeng, B., Li, R., Hu, Y., Hu, B., Zhao, Q., Liu, H., et al.: A novel mutation and a known mutation in the Clcn7 gene associated with relatively stable infantile malignant osteopetrosis in a Chinese patient. Gene 576(1 Pt 1), 176–181 (2016). https://doi.org/10.1016/j.gene.2015. 10.021 59. Aterido, A., Julià, A., Carreira, P., Blanco, R., López-Longo, J.J., Venegas, J.J.P., et al.: Genome-wide pathway analysis identifies Vegf pathway association with oral ulceration in systemic lupus erythematosus. Arthrit. Res. Therapy 19(1), 138 (2017). https://doi.org/10. 1186/s13075-017-1345-6 60. 
Greni, F., Valenti, L., Mariani, R., Pelloni, I., Rametta, R., Busti, F., et al.: Gnpat Rs11558492 is not a major modifier of iron status: study of Italian hemochromatosis patients and blood donors. Ann. Hepatol. 16(3), 451–456 (2017). https://doi.org/10.5604/16652681.1235489 61. Lin, H.-C., Lin, C.-H., Chen, P.-L., Cheng, S.-J., Chen, P.-H.: Intrafamilial phenotypic heterogeneity in a Taiwanese family with a Mapt P.R5h mutation: a case report and literature review. BMC Neurol. 17(1), 186 (2017). https://doi.org/10.1186/s12883-017-0966-3 62. Molfetta, G.A., Zanette, D.L., Santos, J.E., Silva, W.A., Jr.: Mutational screening in the Ldlr gene among patients presenting familial hypercholesterolemia in the Southeast of Brazil. Genet. Molecul. Res. GMR 16(3) (2017). https://doi.org/10.4238/gmr16039226 63. Panoutsopoulou, K., Thiagarajah, S., Zengini, E., Day-Williams, A.G., Ramos, Y.F., Meessen, J.M., et al.: Radiographic endophenotyping in hip osteoarthritis improves the precision of genetic association analysis. Ann. Rheum. Dis. 76(7), 1199–1206 (2017). https://doi.org/10.1136/annrheumdis-2016-210373 64. Wędrychowicz, A., Tobór, E., Wilk, M., Ziółkowska-Ledwith, E., Rams, A., Wzorek, K., et al.: Phenotype heterogeneity in glucokinase-maturity-onset diabetes of the Young (Gck-Mody) patients. J. Clini. Res. Pediat. Endocrinol. 9(3), 246–252 (2017). https://doi.org/10.4274/jcrpe.4461 65. Zhang, G., Xie, Y., Wang, W., Feng, X., Jia, J.: Clinical characterization of an app mutation (V717i) in five Han Chinese families with early-onset Alzheimer’s disease. J. Neurol. Sci. 372, 379–386 (2017). https://doi.org/10.1016/j.jns.2016.10.039 66. Brichant, G., Nervo, P., Albert, A., Munaut, C., Foidart, J.M., Nisolle, M.: Heterogeneity of estrogen receptor A and progesterone receptor distribution in lesions of deep infiltrating endometriosis of untreated women or during exposure to various hormonal treatments. Gynecol. Endocrinol. Offi. J. Int. Soc. Gynecol. Endocrinol. 
34(8), 651–655 (2018). https:// doi.org/10.1080/09513590.2018.1433160 67. Khan, M.T.M., Naz, A., Ahmed, J., Shamsi, T., Ahmed, S., Ahmed, N., et al.: Mutation spectrum and genotype-phenotype analyses in a Pakistani cohort with hemophilia B. Clin. Appl. Throm./Hemo. Offi. J. Int. Acad. Clini. Appl. Throm./Hemo. 24(5), 741–748 (2018). https://doi.org/10.1177/1076029617721011


68. Kor, Y., Zou, M., Al-Rijjal, R.A., Monies, D., Meyer, B.F., Shi, Y.: Phenotype heterogeneity of congenital adrenal hyperplasia due to genetic mosaicism and concomitant nephrogenic diabetes insipidus in a sibling. BMC Med. Genet. 19(1), 115 (2018). https://doi.org/10.1186/s12881-018-0629-2 69. Kumar, S., Yadav, N., Pandey, S., Thelma, B.K.: Advances in the discovery of genetic risk factors for complex forms of neurodegenerative disorders: contemporary approaches, success, challenges and prospects. J. Genet. 97(3), 625–648 (2018) 70. Leffers, H.C.B., Lange, T., Collins, C., Ulff-Møller, C.J., Jacobsen, S.: The study of interactions between genome and exposome in the development of systemic lupus erythematosus. Autoimmun. Rev. 18(4), 382–392 (2019). https://doi.org/10.1016/j.autrev.2018.11.005 71. Classify. https://www.merriam-webster.com/dictionary/classify. Accessed 4 Jan 2020 72. Misclassify. https://www.merriam-webster.com/dictionary/misclassify. Accessed 4 Jan 2020 73. Cartoon Image of Golden Labrador. Accessed 2 June 2020 74. Little, C.C.: The Inheritance of Coat Color in Dogs. Comstock Pub. Associates (1957) 75. Mattinson, P.: Thelabradorsite. https://www.thelabradorsite.com/white-labradors/. Accessed 29 Dec 2019 76. Hong, Y.S., Sinn, D.H., Gwak, G.Y., Cho, J., Kang, D., Paik, Y.H., et al.: Characteristics and outcomes of chronic liver disease patients with acute deteriorated liver function by severity of underlying liver disease. World J. Gastroenterol. 22(14), 3785–3792 (2016). https://doi.org/10.3748/wjg.v22.i14.3785 77. Sha, J., Chen, X., Ren, Y., Chen, H., Wu, Z., Ying, D., et al.: Differences in the epidemiology and virology of mild, severe and fatal human infections with avian influenza a (H7n9) virus. Arch. Virol. 161(5), 1239–1259 (2016). https://doi.org/10.1007/s00705-016-2781-3 78. 
Kostanyan, T., Sung, K.R., Schuman, J.S., Ling, Y., Lucy, K.A., Bilonick, R.A., et al.: Glaucoma structural and functional progression in American and Korean cohorts. Ophthalmology 123(4), 783–788 (2016). https://doi.org/10.1016/j.ophtha.2015.12.010 79. Bourque, P.R., Pringle, C.E., Cameron, W., Cowan, J., Chardon, J.W.: Subcutaneous immunoglobulin therapy in the chronic management of myasthenia gravis: a retrospective cohort study. PLoS ONE 11(8), e0159993 (2016). https://doi.org/10.1371/journal.pone.015 9993 80. Marras, C.: Subtypes of Parkinson’s disease: state of the field and future directions. Curr. Opin. Neurol. 28(4), 382–386 (2015). https://doi.org/10.1097/wco.0000000000000219 81. Park, S., Cho, S.C., Kim, J.W., Shin, M.S., Yoo, H.J., Oh, S.M., et al.: Differential perinatal risk factors in children with attention-deficit/hyperactivity disorder by subtype. Psychiatry Res. 219(3), 609–616 (2014). https://doi.org/10.1016/j.psychres.2014.05.036 82. Melidou, A., Gioula, G., Exindari, M., Chatzidimitriou, D., Malisiovas, N.: Genetic analysis of post-pandemic 2010–2011 influenza a(H1n1)Pdm09 hemagglutinin virus variants that caused mild, severe, and fatal infections in Northern Greece. J. Med. Virol. 87(1), 57–67 (2015). https://doi.org/10.1002/jmv.23990 83. Fine, J.D., Bruckner-Tuderman, L., Eady, R.A., Bauer, E.A., Bauer, J.W., Has, C., et al.: Inherited epidermolysis bullosa: updated recommendations on diagnosis and classification. J. Am. Acad. Dermatol. 70(6), 1103–1126 (2014). https://doi.org/10.1016/j.jaad.2014.01.903 84. Allen, N., Robinson, A.C., Snowden, J., Davidson, Y.S., Mann, D.M.: Patterns of cerebral amyloid angiopathy define histopathological phenotypes In Alzheimer’s disease. Neuropathol. Appl. Neurobiol. 40(2), 136–148 (2014). https://doi.org/10.1111/nan.12070 85. Sobel, E., Papp, J.C., Lange, K.: Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet. 70(2), 496–508 (2002). https://doi.org/10.1086/338920 86. 
Douglas, J.A., Skol, A.D., Boehnke, M.: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am. J. Hum. Genet. 70(2), 487–495 (2002). https://doi.org/10.1086/338919 87. Levenstien, M.A., Ott, J., Gordon, D.: Are molecular haplotypes worth the time and expense? A cost-effective method for applying molecular haplotypes. PLoS Genet 2(8), e127 (2006). https://doi.org/10.1371/journal.pgen.0020127, 06-PLGE-RA-0080R2 [pii]


88. Lamina, C., Kuchenhoff, H., Chang-Claude, J., Paulweber, B., Wichmann, H.E., Illig, T., et al.: Haplotype misclassification resulting from statistical reconstruction and genotype error, and its impact on association estimates. Ann. Hum. Genet. 74(5), 452–462 (2010). https://doi.org/ 10.1111/j.1469-1809.2010.00593.x 89. Proudnikov, D., LaForge, K.S., Hofflich, H., Levenstien, M., Gordon, D., Barral, S., et al.: Association analysis of polymorphisms in serotonin 1b receptor (Htr1b) gene with heroin addiction: a comparison of molecular and statistically estimated haplotypes. Pharmacogenet. Genomics. 16(1), 25–36 (2006). 01213011-200601000-00004 [pii] 90. Marquard, V., Beckmann, L., Heid, I.M., Lamina, C., Chang-Claude, J.: Impact of genotyping errors on the type I error rate and the power of haplotype-based association methods. BMC Genet. 10, 3 (2009). https://doi.org/10.1186/1471-2156-10-3, 1471-2156-10-3 [pii] 91. Lamina, C., Bongardt, F., Kuchenhoff, H., Heid, I.M.: Haplotype reconstruction error as a classical misclassification problem: introducing sensitivity and specificity as error measures. PLoS ONE 3(3), e1853 (2008). https://doi.org/10.1371/journal.pone.0001853 92. Govindarajulu, U.S., Spiegelman, D., Miller, K.L., Kraft, P.: Quantifying bias due to allele misclassification in case-control studies of haplotypes. Genet. Epidemiol. 30(7), 590–601 (2006). https://doi.org/10.1002/gepi.20170 93. Tal, O.: The cumulative effect of genetic markers on classification performance: insights from simple models. J. Theor. Biol. 293, 206–218 (2012). https://doi.org/10.1016/j.jtbi.2011. 10.005 94. Gordon, D., Finch, S.J., De La Vega, F.M.: A New expectation-maximization statistical test for case-control association studies considering rare variants obtained by high-throughput sequencing. Hum. Hered. 71(2), 113–125 (2011). https://doi.org/10.1159/000325590 95. 
Kim, W., Londono, D., Zhou, L., Xing, J., Nato, A.Q., Musolf, A., et al.: Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Hum. Hered. 74(3–4), 172–183 (2012). https://doi.org/10.1159/ 000346824 96. Figure—Mapping Sequence Reads. https://en.wikipedia.org/wiki/DNA_sequencing#/media/ File:Mapping_Reads.png. Accessed 7 May 2020 97. Mayo, O.: A century of Hardy-Weinberg equilibrium. Twin Res. Hum. Genet. 11(3), 249–256 (2008). https://doi.org/10.1375/twin.11.3.249 98. Merriam-Webster.com: Assortative Mating (2015) 99. McCarthy, M.I., Abecasis, G.R., Cardon, L.R., Goldstein, D.B., Little, J., Ioannidis, J.P., Hirschhorn, J.N.: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9(5), 356–369 (2008). https://doi.org/10.1038/nrg2344 100. Cox, D.G., Kraft, P.: Quantification of the power of Hardy-Weinberg equilibrium testing to detect genotyping error. Hum. Hered. 61(1), 10–14 (2006). https://doi.org/10.1159/000 091787 101. Graffelman, J., Weir, B.S.: Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome. Heredity (Edinb) 116(6), 558–568 (2016). https://doi.org/ 10.1038/hdy.2016.20 102. Laurie, C.C., Doheny, K.F., Mirel, D.B., Pugh, E.W., Bierut, L.J., Bhangale, T., et al.: Quality control and quality assurance in genotypic data for genome-wide association studies. Genet. Epidemiol. 34(6), 591–602 (2010). https://doi.org/10.1002/gepi.20516 103. Leal, S.M.: Detection of genotyping errors and pseudo-snps via deviations from HardyWeinberg equilibrium. Genet. Epidemiol. 29(3), 204–214 (2005). https://doi.org/10.1002/ gepi.20086 104. Liu, N., Zhang, D., Zhao, H.: Genotyping error detection in samples of unrelated individuals without replicate genotyping. Hum. Hered. 67(3), 154–162 (2009). https://doi.org/10.1159/ 000181153 105. 
Morin, P.A., Leduc, R.G., Archer, F.I., Martien, K.K., Huebinger, R., Bickham, J.W., Taylor, B.L.: Significant deviations from Hardy-Weinberg equilibrium caused by low levels of microsatellite genotyping errors. Mol. Ecol. Resour. 9(2), 498–504 (2009). https://doi.org/ 10.1111/j.1755-0998.2008.02502.x


106. Moskvina, V., Schmidt, K.M.: Susceptibility of biallelic haplotype and genotype frequencies to genotyping error. Biometrics 62(4), 1116–1123 (2006). https://doi.org/10.1111/j.15410420.2006.00563.x 107. Dawn Teare, M., Barrett, J.H.: Genetic linkage studies. Lancet 366(9490), 1036–1044 (2005). https://doi.org/10.1016/s0140-6736(05)67382-5 108. Elston, R.C., Cordell, H.J.: Overview of model-free methods for linkage analysis. Adv. Genet. 42, 135–150 (2001) 109. Goldgar, D.E.: Major strengths and weaknesses of model-free methods. Adv. Genet. 42, 241–251 (2001) 110. Loughlin, J.: Genetic epidemiology of primary osteoarthritis. Curr. Opin. Rheumatol. 13(2), 111–116 (2001) 111. Schaid, D.J., Buetow, K., Weeks, D.E., Wijsman, E., Guo, S.W., Ott, J., Dahl, C.: Discovery of cancer susceptibility genes: study designs, analytic approaches, and trends in technology. J. Natl. Cancer Inst. Monogr. (26), 1–16 (1999) 112. Brooks, A.S., Oostra, B.A., Hofstra, R.M.: Studying the genetics of Hirschsprung’s disease: unraveling an oligogenic disorder. Clin. Genet. 67(1), 6–14 (2005). https://doi.org/10.1111/j. 1399-0004.2004.00319.x 113. Di Bona, D., Candore, G., Franceschi, C., Licastro, F., Colonna-Romano, G., Camma, C., et al.: Systematic review by meta-analyses on the possible role of Tnf-alpha polymorphisms in association with Alzheimer’s disease. Brain Res. Rev. 61(2), 60–68 (2009). https://doi.org/ 10.1016/j.brainresrev.2009.05.001 114. Di Bona, D., Rizzo, C., Bonaventura, G., Candore, G., Caruso, C.: Association between interleukin-10 polymorphisms and Alzheimer’s disease: a systematic review and metaanalysis. J. Alzheimers Dis. 29(4), 751–759 (2012). https://doi.org/10.3233/jad-2012-111838 115. Grigoras, C.A., Ziakas, P.D., Jayamani, E., Mylonakis, E.: Atg16l1 and Il23r variants and genetic susceptibility to Crohn’s disease: mode of inheritance based on meta-analysis of genetic association studies. Inflamm. Bowel Dis. 21(4), 768–776 (2015). https://doi.org/10. 
1097/mib.0000000000000305 116. Hinney, A., Remschmidt, H., Hebebrand, J.: Candidate gene polymorphisms in eating disorders. Eur. J. Pharmacol. 410(2–3), 147–159 (2000) 117. Kauffman, M.A., Moron, D.G., Consalvo, D., Bello, R., Kochen, S.: Association study between interleukin 1 beta gene and epileptic disorders: a huge review and meta-analysis. Genet. Med. 10(2), 83–88 (2008). https://doi.org/10.1097/GIM.0b013e318161317c 118. Minelli, C., Thompson, J.R., Abrams, K.R., Thakkinstian, A., Attia, J.: The choice of a genetic model in the meta-analysis of molecular association studies. Int. J. Epidemiol. 34(6), 1319–1328 (2005). https://doi.org/10.1093/ije/dyi169 119. Tang, L., Lu, X., Tao, Y., Zheng, J., Zhao, P., Li, K., Li, L.: Scn1a Rs3812718 polymorphism and susceptibility to epilepsy with febrile seizures: a meta-analysis. Gene 533(1), 26–31 (2014). https://doi.org/10.1016/j.gene.2013.09.071 120. Ziakas, P.D., Poulou, L.S., Pavlou, M., Zintzaras, E.: Thrombophilia and venous thromboembolism in pregnancy: a meta-analysis of genetic risk. Eur. J. Obstet. Gynecol. Reprod. Biol. 191, 106–111 (2015). https://doi.org/10.1016/j.ejogrb.2015.06.005 121. Ziakas, P.D., Poulou, L.S., Zintzaras, E.: Fcgammariia-H131r variant is associated with inferior response in diffuse large B cell lymphoma: a meta-analysis of genetic risk. J. Buon. 21(6), 1454–1458 (2016) 122. Zining, J., Lu, X., Caiyun, H., Yuan, Y.: Genetic polymorphisms of Mtor and cancer risk: a systematic review and updated meta-analysis. Oncotarget 7(35), 57464–57480 (2016). https:// doi.org/10.18632/oncotarget.10805 123. Knapp, M., Seuchter, S.A., Baur, M.P.: Linkage analysis in nuclear families. 2: Relationship between affected Sib-Pair tests and Lod score analysis. Hum. Hered. 44(1), 44–51 (1994). https://doi.org/10.1159/000154188 124. Huang, J., Vieland, V.J.: Comparison of ‘model-free’ and ‘model-based’ linkage statistics in the presence of locus heterogeneity: single data set and multiple data set applications. 
Hum. Hered. 51(4), 217–225 (2001). 53345 [pii]


125. Vieland, V.J., Logue, M.: Hlods, trait models, and ascertainment: implications of admixture for parameter estimation and linkage detection. Hum. Hered. 53(1), 23−35 (2002). 48601 [pii] 126. Wittke-Thompson, J.K., Pluzhnikov, A., Cox, N.J.: Rational inferences about departures from Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 76(6), 967–986 (2005). https://doi.org/10. 1086/430507, S0002-9297(07)62894-8 [pii] 127. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., et al.: Plink: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81(3), 559–575 (2007). https://doi.org/10.1086/519795, S0002-9297(07)61352-4 [pii] 128. Wigginton, J.E., Abecasis, G.R.: Pedstats: descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics 21(16), 3445–3447 (2005). https://doi.org/10. 1093/bioinformatics/bti529 129. Schaid, D.J., Sommer, S.S.: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am. J. Hum. Genet. 53(5), 1114–1126 (1993) 130. Gordon, D., Finch, S.J., Nothnagel, M., Ott, J.: Power and sample size calculations for casecontrol genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum. Hered. 54(1), 22–33 (2002). 66696 131. Purcell, S., Cherny, S.S., Sham, P.C.: Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19(1), 149–150 (2003) 132. Knight, J.: A survey of current software for genetic power calculations. Hum. Genom. 1(3), 225–227 (2004) 133. Gordon, D., Haynes, C., Blumenfeld, J., Finch, S.J.: Pawe-3d: visualizing power for association with error in case-control genetic studies of complex traits. Bioinformatics 21(20), 3935–3937 (2005). https://doi.org/10.1093/bioinformatics/bti643 134. 
Wessel, J., Schork, A.J., Tiwari, H.K., Schork, N.J.: Powerful designs for genetic association studies that consider twins and sibling pairs with discordant genotypes. Genet. Epidemiol. 31, 789–796 (2007) 135. Menashe, I., Rosenberg, P.S., Chen, B.E.: Pga: power calculator for case-control genetic association analyses. BMC Genet. 9, 36 (2008) 136. Feng, S., Wang, S., Chen, C.C., Lan, L.: Gwapower: a statistical power calculation software for genome-wide association studies with quantitative traits. BMC Genet. 12, 12 (2011) 137. Ball, R.D.: Experimental designs for robust detection of effects in genome-wide case-control studies. Genetics 189(4), 1497–1514 (2011) 138. Hong, E.P., Park, J.W.: Sample size and statistical power calculation in genetic association studies. Genom. Inf. 10(2), 117–122 (2012) 139. Ott, J., Hoh, J.: Statistical approaches to gene mapping. Am. J. Hum. Genet. 67(2), 289–294 (2000). https://doi.org/10.1086/303031, S0002-9297(07)62640-8 [pii] 140. Excoffier, L., Slatkin, M.: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12(5), 921–927 (1995) 141. Gauderman, W.J.: Sample size requirements for matched case-control studies of geneenvironment interaction. Stat. Med. 21(1), 35–50 (2002) 142. Gordon, D., Haynes, C., Yang, Y., Kramer, P.L., Finch, S.J.: Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet. Epidemiol. 31(8), 853–870 (2007). https://doi.org/10.1002/gepi.20246 143. Tukey, J.W.: Exploratory Data Analysis. Pearson Education—Addison Wesley, Upper Saddle River, NJ (1977) 144. Brinkman, R.R., Mezei, M.M., Theilmann, J., Almqvist, E., Hayden, M.R.: The likelihood of being affected with Huntington disease by a particular age, for a specific cag size. Am. J. Hum. Genet. 60(5), 1202–1210 (1997) 145. Zhou, H., Pan, W.: Binomial mixture model-based association tests under genetic heterogeneity. Ann. Hum. Genet. 
73(Pt 6), 614–630 (2009). https://doi.org/10.1111/j.1469-1809.2009.00542.x, AHG542 [pii] 146. Londono, D., Buyske, S., Finch, S.J., Sharma, S., Wise, C.A., Gordon, D.: Tdt-Het: a new transmission disequilibrium test that incorporates locus heterogeneity into the analysis of family-based association data. BMC Bioinf. 13, 13 (2012). https://doi.org/10.1186/1471-2105-13-13 147. Lrt.Stat: likelihood ratio test statistic for contingency tables. https://rdrr.io/cran/MADPop/man/LRT.stat.html. Accessed 20 Jan 2019 148. Wienker, T.F., Strom, T.M., Henschke, H.: De Finetti generator. https://ihg.gsf.de/cgi-bi. Accessed 6 Jan 2020 149. Aulchenko, Y.S., Ripke, S., Isaacs, A., van Duijn, C.M.: Genabel: an R library for genomewide association analysis. Bioinformatics 23(10), 1294–1296 (2007). https://doi.org/10.1093/bioinformatics/btm108 150. Fisher, R.A.: On the interpretation of chisquare from contingency tables, and the calculation of P. J. Roy Statist. Soc. B 85(1), 87–94 (1922) 151. Slager, S.L., Schaid, D.J.: Case-control studies of genetic markers: power and sample size approximations for Armitage’s test for trend. Hum. Hered. 52(3), 149–153 (2001) 152. Devlin, B., Roeder, K.: Genomic control for association studies. Biometrics 55(4), 997–1004 (1999) 153. Devlin, B., Roeder, K., Bacanu, S.A.: Unbiased methods for population-based association studies. Genet. Epidemiol. 21(4), 273–284 (2001) 154. Spielman, R.S., McGinnis, R.E., Ewens, W.J.: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (Iddm). Am. J. Hum. Genet. 52(3), 506–516 (1993) 155. Cochran, W.G.: Some methods for strengthening the common χ² tests. Biometrics 10(4), 417–451 (1954) 156. Agresti, A.: Categorical Data Analysis, 3rd edn. Wiley Series in Probability and Statistics. Wiley, Inc., Hoboken, NJ, USA (2013) 157. Edwards, A.W.F.: R. A. Fisher—twice Professor of Genetics: London and Cambridge, or ‘a fairly well-known geneticist’. Statistician 52(Part 3), 311–318 (2003) 158. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the Em algorithm. J. Roy. Stat. Soc. B 39(1), 1–38 (1977) 159. Sundberg, R.: Maximum likelihood theory for incomplete data from an exponential family. Scand. J. Stat. 1(2), 49–58 (1974) 160. Sundberg, R.: An iterative method for solution of the likelihood equations for incomplete data from exponential families. Commun. Stat. Simul. Comput. 5(1), 55–64 (1976) 161. Kulldorff, G.: Contributions to the Theory of Estimation from Grouped and Partially Grouped Samples, 1st edn. Wiley, Hoboken, New Jersey, NY, New York, USA (1962) 162. Martin-Lof, P.: The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. In: Foundational Questions in Statistical Inference, Aarhus, Denmark (1973) 163. Martin-Lof, P.: The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. Scand. J. Stat. 1(1), 3–18 (1974) 164. Laird, N.M.: The Em algorithm in genetics, genomics, and public health. Stat. Sci. 25(4), 450–457 (2010) 165. Lange, K.: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. Statistics for Biology and Health. Springer, NY, New York, USA, Berlin, Germany, Heidelberg, Germany (2002) 166. Abo, R., Wong, J., Thomas, A., Camp, N.J.: Haplotype association analyses in resources of mixed structure using Monte Carlo testing. BMC Bioinf. 11, 592 (2010). https://doi.org/10.1186/1471-2105-11-592 167. Bertrand, J., Comets, E., Chenel, M., Mentre, F.: Some alternatives to asymptotic tests for the analysis of pharmacogenetic data using nonlinear mixed effects models. Biometrics 68(1), 146–155 (2012). https://doi.org/10.1111/j.1541-0420.2011.01665.x 168. Biswas, S., Lin, S.: Evaluations of maximization procedures for estimating linkage parameters under heterogeneity. Genet. Epidemiol. 26(3), 206–217 (2004). https://doi.org/10.1002/gepi.10314

References

49

169. Chen, Z., Liu, J.: Mixture generalized linear models for multiple interval mapping of quantitative trait loci in experimental crosses. Biometrics 65(2), 470–477 (2009). https://doi.org/ 10.1111/j.1541-0420.2008.01100.x 170. Churchhouse, C., Marchini, J.: Multiway admixture deconvolution using phased or unphased ancestral panels. Genet. Epidemiol. 37(1), 1–12 (2013). https://doi.org/10.1002/gepi.21692 171. Clark, V.J., Metheny, N., Dean, M., Peterson, R.J.: Statistical estimation and pedigree analysis of Ccr2-Ccr5 haplotypes. Hum. Genet. 108(6), 484–493 (2001) 172. Cui, Y., Fu, W., Sun, K., Romero, R., Wu, R.: Mapping nucleotide sequences that encode complex binary disease traits with hapmap. Curr. Genomics. 8(5), 307–322 (2007). https:// doi.org/10.2174/138920207782446188 173. Dudbridge, F.: Pedigree disequilibrium tests for multilocus haplotypes. Genet. Epidemiol. 25(2), 115–121 (2003). https://doi.org/10.1002/gepi.10252 174. Eng, K.H., Hanlon, B.M.: Discrete mixture modeling to address genetic heterogeneity in time-to-event regression. Bioinformatics 30(12), 1690–1697 (2014). https://doi.org/10.1093/ bioinformatics/btu065 175. Foulkes, A.S., Yucel, R., Li, X.: A likelihood-based approach to mixed modeling with ambiguity in cluster identifiers. Biostatistics 9(4), 635–657 (2008). https://doi.org/10.1093/biosta tistics/kxm055 176. Giurcaneanu, C.D., Tabus, I., Astola, J., Ollila, J., Vihinen, M.: Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure. J. Comput. Biol. 11(4), 660–682 (2004). https://doi.org/10.1089/1066527041887285 177. Guo, Y., Farmen, M.W., Jin, Y., Lee, H.Y., Penny, M.A., Hillgren, K.M., Fossceco, S.L.: Deciphering Adme genetic data using an automated haplotype approach. Pharmaco. Genom. 24(6), 292–305 (2014). https://doi.org/10.1097/fpc.0000000000000047 178. 
Hayes, J.L., Tzika, A., Thygesen, H., Berri, S., Wood, H.M., Hewitt, S., et al.: Diagnosis of copy number variation by illumina next generation sequencing is comparable in performance to oligonucleotide array comparative genomic hybridisation. Genomics 102(3), 174–181 (2013). https://doi.org/10.1016/j.ygeno.2013.04.006 179. Jansen, R.C.: A general Monte Carlo method for mapping multiple quantitative trait loci. Genetics 142(1), 305–311 (1996) 180. Kang, H., Qin, Z.S., Niu, T., Liu, J.S.: Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. Am. J. Hum. Genet. 74(3), 495–510 (2004). https://doi.org/10.1086/382284 181. Kimmel, G., Shamir, R.: Gerbil: genotype resolution and block identification using likelihood. Proc. Natl. Acad. Sci. U S A 102(1), 158–162 (2005). https://doi.org/10.1073/pnas.040473 0102 182. Kuk, A.Y., Li, X., Xu, J.: An Em algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data. BMC Genet. 14, 82 (2013). https:// doi.org/10.1186/1471-2156-14-82 183. Lai, Y., Adam, B.L., Podolsky, R., She, J.X.: A mixture model approach to the tests of concordance and discordance between two large-scale experiments with two-sample groups. Bioinformatics 23(10), 1243–1250 (2007). https://doi.org/10.1093/bioinformatics/btm103 184. Lee, A., Caron, F., Doucet, A., Holmes, C.: Bayesian sparsity-path-analysis of genetic association signal using generalized T priors. Stat. Appl. Genet. Mol. Biol. 11(2) (2012). https:// doi.org/10.2202/1544-6115.1712 185. Li, C., Boehnke, M.: Haplotype association analysis for late onset diseases using nuclear family data. Genet. Epidemiol. 30(3), 220–230 (2006). https://doi.org/10.1002/gepi.20139 186. Liu, P.Y., Lu, Y., Deng, H.W.: Accurate haplotype inference for multiple linked singlenucleotide polymorphisms using sibship data. Genetics 174(1), 499–509 (2006). https://doi. org/10.1534/genetics.105.054213 187. 
Loewenstern, D., Yianilos, P.N.: Significantly lower entropy estimates for natural DNA sequences. J. Comput. Biol. 6(1), 125–142 (1999) 188. Long, J.C., Williams, R.C., Urbanek, M.: An E-M algorithm and testing strategy for multiplelocus haplotypes. Am. J. Hum. Genet. 56(3), 799–810 (1995)

50

1 Introduction to Heterogeneity in Statistical Genetics

189. Madbouly, A., Gragert, L., Freeman, J., Leahy, N., Gourraud, P.A., Hollenbach, J.A., et al.: validation of statistical imputation of allele-level multilocus phased genotypes from ambiguous Hla assignments. Tissue Antigens 84(3), 285–292 (2014). https://doi.org/10.1111/ tan.12390 190. Olshen, A.B., Gold, B., Lohmueller, K.E., Struewing, J.P., Satagopan, J., Stefanov, S.A., et al.: Analysis of genetic variation in Ashkenazi Jews by high density Snp genotyping. BMC Genet. 9, 14 (2008). https://doi.org/10.1186/1471-2156-9-14 191. Papachristou, C., Ober, C., Abney, M.: Genetic variance components estimation for binary traits using multiple related individuals. Genet. Epidemiol. 35(5), 291–302 (2011). https:// doi.org/10.1002/gepi.20577 192. Pirinen, M.: Estimating population haplotype frequencies from pooled Snp data using incomplete database information. Bioinformatics 25(24), 3296–3302 (2009). https://doi.org/10. 1093/bioinformatics/btp584 193. Poznik, G.D., Adamska, K., Xu, X., Krolewski, A.S., Rogus, J.J.: A novel framework for Sib pair linkage analysis. Am. J. Hum. Genet. 78(2), 222–230 (2006). https://doi.org/10.1086/ 499827 194. Schroeder, J.C., Weinberg, C.R.: Use of missing-data methods to correct bias and improve precision in case-control studies in which cases are subtyped but subtype information is incomplete. Am. J. Epidemiol. 154(10), 954–962 (2001) 195. Schroeter, P., Vesin, J.M., Langenberger, T., Meuli, R.: Robust parameter estimation of intensity distributions for brain magnetic resonance images. IEEE Trans. Med. Imaging 17(2), 172–186 (1998). https://doi.org/10.1109/42.700730 196. Spinka, C., Carroll, R.J., Chatterjee, N.: Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet. Epidemiol. 29(2), 108–127 (2005). https://doi.org/10.1002/gepi.20085 197. 
Thomas, D.C., Stram, D.O., Conti, D., Molitor, J., Marjoram, P.: Bayesian spatial modeling of haplotype associations. Hum. Hered. 56(1–3), 32–40 (2003). 73730 198. van der Maas, H.L., Raijmakers, M.E., Visser, I.: Inferring the structure of latent class models using a genetic algorithm. Behav. Res. Methods 37(2), 340–352 (2005) 199. Wang, T., Jacob, H., Ghosh, S., Wang, X., Zeng, Z.B.: A joint association test for multiple Snps in genetic case-control studies. Genet. Epidemiol. 33(2), 151–163 (2009). https://doi. org/10.1002/gepi.20368 200. Warde-Farley, D., Brudno, M., Morris, Q., Goldenberg, A.: Mixture model for subphenotyping in Gwas. Pac. Symp. Biocomput. 363–374 (2012) 201. Xu, S., Vogl, C.: Maximum likelihood analysis of quantitative trait loci under selective genotyping. Heredity (Edinb) 84(Pt 5), 525–537 (2000) 202. Yu, Z., Schaid, D.J.: Methods to impute missing genotypes for population data. Hum. Genet. 122(5), 495–504 (2007). https://doi.org/10.1007/s00439-007-0427-y 203. Zaykin, D.V., Meng, Z., Ehm, M.G.: Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am. J. Hum. Genet. 78(5), 737–746 (2006). https://doi.org/10.1086/503710 204. Zhang, J., Vingron, M., Hoehe, M.R.: Haplotype reconstruction for diploid populations. Hum. Hered. 59(3), 144–156 (2005). https://doi.org/10.1159/000085938 205. Zhang, K., Qin, Z.S., Liu, J.S., Chen, T., Waterman, M.S., Sun, F.: Haplotype block partitioning and tag Snp selection using genotype data and their applications to association studies. Genome Res. 14(5), 908–916 (2004). https://doi.org/10.1101/gr.1837404 206. Zhe, S., Xu, Z., Qi, Y., Yu, P.: Joint association discovery and diagnosis of Alzheimer’s disease by supervised heterogeneous multiview learning. Pac. Symp. Biocomput. 300–311 (2014) 207. Tenenbein, A.: A double sampling scheme for estimating from binomial data with misclassifications. J. Am. Stat. Assoc. 65(331), 1350–1361 (1970) 208. 
Tenenbein, A.: A double sampling scheme for estimating from binomial data with misclassifications: sample size determination. Biometrics 27, 935–944 (1971) 209. Tenenbein, A.: A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics 14(1), 187–202 (1972)

References

51

210. Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., Haroutunian, V.: Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat. Appl. Genet. Mol. Biol. 3, Article26 (2004). https:// doi.org/10.2202/1544-6115.1085 211. Mote, V.L., Anderson, R.L.: An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52, 95–109 (1965) 212. MacKenzie, D.I., Nichols, J.D., Royle, J.A., Pollock, K.H., Bailey, L.L., Hines, J.E.: Chapter 3—Fundamental principals of statistical inference. In: MacKenzie, D.I., Nichols, J.D., Royle, J.A., Pollock, K.H., Bailey, L.L., Hines, J.E. (eds.) Occupancy Estimation and Modeling, 2nd edn., pp. 71–111. Academic Press, Boston (2018) 213. Cox, D.R., Hinkley, D.V.: Theoretical statistics. Chapman and Hall/CRC, Boca Raton (1974)

Chapter 2

Overview of Genomic Heterogeneity in Statistical Genetics

2.1 Heterogeneity Due to SNP Genotype Misclassification

We provide definitions of misclassification errors for different types of genomic data, as well as examples of how they may arise. We discuss statistical methods to detect misclassification errors, and several mathematical models that specify the probability that one category of data (e.g., a SNP genotype) is misclassified as another category (e.g., another SNP genotype). We also comment on the difference between differential and non-differential misclassification and indicate the effects of each on population-based and family-based statistical tests of association. Finally, we present an example of genomic heterogeneity other than misclassification.

We introduced genomic misclassification in Chap. 1, Sects. 1.1 and 1.7.1. To refresh our definition, genotype misclassification occurs when the true category of an observation and the observed category of the observation (that is, the category into which the observation is placed) are not the same. For example, suppose we have a SNP with two alleles, A and T, in the population. The three possible genotypes are then AA, AT, and TT. If we genotype an individual whose true genotype (category) is AA, and we record the genotype of that individual as AT (category), we say that we have misclassified a true genotype AA as an observed genotype AT.

For a SNP with two alleles, there are three possible true genotypes and three possible observed genotypes (the observed genotype is sometimes referred to as the "phenotype" of the genotype). We select an allele to be the "reference allele" for the SNP. This selection is sometimes arbitrary and sometimes based on information about the alleles. For example, we may choose A to be the reference allele if its frequency is less than 0.5, suggesting that it may be a risk allele. We can specify a genotype by the number of copies of the reference allele it contains. That is, the genotype TT is coded as 0; AT as 1; and AA as 2.

The genotype misclassification matrix has entries that are the conditional probabilities:

$$\Pr(\text{observed genotype} = j \mid \text{true genotype} = j') = \theta_{j'j}, \quad 0 \le j', j \le 2 \tag{2.1}$$

© Springer Nature Switzerland AG 2020 D. Gordon et al., Heterogeneity in Statistical Genetics, Statistics for Biology and Health, https://doi.org/10.1007/978-3-030-61121-7_2


Table 2.1 Table of (mis)classification parameters for a SNP locus with alleles A and T

True genotype    Observed genotype                Row total
                 TT (0)     AT (1)     AA (2)
TT (0)           θ00        θ01        θ02        1
AT (1)           θ10        θ11        θ12        1
AA (2)           θ20        θ21        θ22        1

For the true and observed genotypes, the numbers in parentheses indicate the number of copies of the A (reference) allele in the genotype. Each parameter $\theta_{j'j}$, $0 \le j', j \le 2$, is defined as the conditional probability $\Pr(\text{observed genotype} = j \mid \text{true genotype} = j')$, where $j'$ and $j$ are the numbers of copies of the A allele in the true and observed genotypes, respectively. Note that, for any value $j'$, $\sum_{j} \theta_{j'j} = 1$; that is, the sum of each row's entries is one

In Eq. (2.1), the integers $0 \le j', j \le 2$ represent the number of copies of the reference allele in the genotype category. There are three conditional probabilities that reflect the correct classification of genotypes: θ00, θ11, and θ22. Since there are nine possible probabilities, there are six possible misclassification probabilities for a SNP with two alleles. We will refer to the full set of probabilities as the genotype (mis)classification matrix, and the terms $\theta_{j'j}$, $j' \ne j$, as the misclassification parameters or misclassification probabilities. Note that these misclassification parameters apply to any locus with any allele specifications (e.g., alleles A and T; alleles 1, 2, 3, and 4; alleles A, B, and O; etc.).

The law of total probability indicates that the population observed genotype frequency $g_j$ of genotype $j$, under misclassification with parameters from Table 2.1, is

$$g_j = \sum_{j'=0}^{2} \theta_{j'j}\, g_{j'}. \tag{2.2}$$

The term $g_{j'}$ in Eq. (2.2) represents the true frequency of genotype $j'$. That is, $g_{j'}$ is the frequency of genotype $j'$ in our population when there is no genotype misclassification.

Consider some examples using Table 2.1. Equation (2.2) sums over the true genotypes $j'$, that is, down the column of Table 2.1 for observed genotype $j$. Suppose, in the first column, θ00 = 0.95, θ10 = 0.05, and θ20 = 0.00. This means that the probabilities that true TT, AT, and AA genotypes, respectively, are observed as TT are 0.95, 0.05, and 0.00. The first is the correct classification; the last two are misclassifications. Further suppose that the true genotype frequencies for the TT, AT, and AA genotypes in the population (without error) are $g_{j'=0} = 0.25$, $g_{j'=1} = 0.45$, and $g_{j'=2} = 0.30$. Using Eq. (2.2), we determine that the observed genotype frequency for the TT genotype is

$$g_{j=0} = 0.95 \times 0.25 + 0.05 \times 0.45 + 0.00 \times 0.30 = 0.26. \tag{2.3}$$


In the presence of misclassification, the TT genotype frequency changes from a true value of 0.25 to an observed value of 0.26 (Eq. 2.3). As a second example, consider the third column of Table 2.1 and let θ02 = 0.001, θ12 = 0.01, and θ22 = 0.989. These values are the probabilities that true TT, AT, and AA genotypes, respectively, are observed as AA. In this instance, the third is the correct classification. Applying Eq. (2.2), we compute

$$g_{j=2} = 0.001 \times 0.25 + 0.01 \times 0.45 + 0.989 \times 0.30 = 0.3015. \tag{2.4}$$

In the presence of misclassification, the AA genotype frequency changes from a true value of 0.30 to an observed value of 0.3015 (Eq. 2.4). Without some gold-standard genotyping technology for which all off-diagonal probabilities in Table 2.1 are either zero or very close to zero, there is no way to know the true numbers of individuals with genotypes $j'$, $0 \le j' \le 2$. It follows that there is no way to determine unbiased estimates of the true genotype frequencies $g_{j'}$. We shall see, in Chap. 5 especially, that commonly used statistical tests applied to data with genotype misclassification may exhibit power loss for a fixed sample size, or an increase in the minimum sample size (MSSN) for a fixed power.

For some computer programs that perform genetic linkage and association analysis, it is possible to model genotype calls probabilistically, for example, with probabilities like those in Table 2.1. Matise et al. [1] used the FASTLINK version [2, 3] of the LINKAGE suite of programs [4, 5] to implement probabilistic error modeling with SNPs across the entire genome. A description of the modeling may be found in that paper in the section labeled "Genotype cleaning" [1].
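Eq. (2.2) can also be written in matrix form, which makes explicit that each observed frequency is obtained by summing down a column of Table 2.1. The bold symbols and the superscripts "obs" and "true" are notation introduced here only for illustration; they are not the book's notation.

$$
\mathbf{g}^{\text{obs}} = \Theta^{\top}\,\mathbf{g}^{\text{true}},
\qquad
\Theta = \left(\theta_{j'j}\right)_{0 \le j', j \le 2},
\qquad
g^{\text{obs}}_{j} = \sum_{j'=0}^{2} \theta_{j'j}\, g^{\text{true}}_{j'} .
$$

Because every row of $\Theta$ sums to one, the observed frequencies still sum to one:

$$
\sum_{j=0}^{2} g^{\text{obs}}_{j}
= \sum_{j'=0}^{2} g^{\text{true}}_{j'} \sum_{j=0}^{2} \theta_{j'j}
= \sum_{j'=0}^{2} g^{\text{true}}_{j'} = 1 .
$$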

2.2 Examples of How Genotype Misclassification May Arise in Practice

For several years, there have been SNP genotyping technologies that are highly automated. These high-throughput technologies are applied to "chips" containing hundreds of thousands or millions of SNPs. Blood or saliva samples are collected from individuals, and each individual's DNA (contained in the blood or saliva) is processed. Two types of fluorescent dyes are applied to each sample, and for each SNP on the chip, sample responses are measured and plotted as a point on a two-dimensional scatter plot. In this way, a scatter plot containing a point for each participant genotyped is created. Next, for each SNP, classification methods such as k-means clustering or proprietary methods are used to determine the genotype call.

To better understand how we may represent genotype misclassification errors mathematically, we present some example scatter plots (Fig. 2.1). In both panels, each data point represents the bivariate outcome for either a given individual's DNA at a given SNP or control material such as water. We denote true TT, AT, and AA genotypes by three distinct plotting symbols (◇ for AA). The symbol × represents control samples (with no DNA) and no-calls, where the machine failed to make a call.

In Fig. 2.1b, the genotype misclassifications are labeled (i), (ii), and (iii). Item (i) indicates a true AT genotype misclassified as a TT, item (ii) indicates a true TT genotype misclassified as an AT, and item (iii) represents a true AA genotype misclassified as an AT. Note that for label (ii) in Fig. 2.1b, there are three misclassified genotypes (TT genotype misclassified as AT genotype). Each misclassified genotype call is highlighted by a dashed ellipse that encircles the data point.

Fig. 2.1 Example two-dimensional scatter plots for fictitious SNP genotype samples. a Allelic discrimination plot for fictitious SNP data with alleles A and T in which no genotype misclassification occurred. b Allelic discrimination plot for fictitious SNP data in which example genotype misclassifications occurred

These bivariate outcomes are produced by applying a given genotyping technology (e.g., running DNA through a sequencer) to each individual's DNA. Genotype calls are made either through the use of clustering algorithms or by industry-proprietary software (e.g., [6–17]). The axes in Fig. 2.1 are based on names and values from a previously published work [18]. In that work, the clustering algorithm produces classifications in a similar fashion to the classifications we present in Fig. 2.1a, b; that is, classifications are determined by regions defined by linear boundaries (see Fig. 2 of that work [18]).

2.3 Mathematical Models of Genotype Misclassification

There have been several publications over the years that deal with the question of genotype misclassification error. Several are reviews that discuss the sources of error and how to address them in population- and statistical-genetic studies (e.g., [19–22]). Given the number of misclassification error rate parameters, even for SNPs, a number of authors have proposed mathematical models that reduce the number of parameters that need to be estimated. We discuss four such models.

Single parameter model. A number of authors (e.g., [23–25]) have proposed the model:

$$\theta_{j'j} = \frac{\varepsilon}{2}, \quad 0 \le j, j' \le 2, \; j \ne j'; \qquad \theta_{00} = \theta_{11} = \theta_{22} = 1 - \varepsilon, \tag{2.5}$$

and ε = 2 × Pr(any true genotype is misclassified as one of the other two SNP genotypes).

Two parameter model (no homozygote-to-homozygote error). Douglas et al. [26] proposed the model:

$$\theta_{02} = \theta_{20} = 0, \quad \theta_{01} = \theta_{21}, \quad \theta_{10} = \theta_{12}. \tag{2.6}$$

With this error model, there is no homozygote-to-homozygote error.

Three parameter model. Sobel et al. [27] proposed the model:

$$\theta_{02} = \theta_{20}, \quad \theta_{01} = \theta_{21}, \quad \theta_{10} = \theta_{12}. \tag{2.7}$$

Note that, in this model, it is possible to have homozygote-to-homozygote error (θ02 and θ20 may be non-zero). Also, this is the model with the largest number of parameters for which the matrix in Table 2.1 is symmetric ($\theta_{j'j} = \theta_{jj'}$).

Six parameter (full) model. With this model, there are no constraints on any parameters in Table 2.1 other than those imposed by the law of total probability, namely $\sum_{j} \theta_{j'j} = 1$, $0 \le j' \le 2$. This error model was presented by Mote and Anderson [28], among others. While this model is the most flexible in terms of accurately describing all misclassification errors, it has the limitations that estimating each parameter may be time-consuming, the standard errors of these estimates may be large, or the parameters may be non-identifiable, i.e., it may not be possible to estimate them.
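As a minimal sketch, the constrained models in Eqs. (2.5) to (2.7) can be written as small functions that build the 3 × 3 matrix of Table 2.1 (rows indexed by true genotype, columns by observed genotype). The function names, argument names, and numerical values are hypothetical and chosen only for illustration; the diagonal entries are filled in from the row-sum constraint.

```python
def three_parameter(hom_to_het, het_to_hom, hom_to_hom):
    """Eq. (2.7): theta01 = theta21, theta10 = theta12, theta02 = theta20."""
    a, b, c = hom_to_het, het_to_hom, hom_to_hom
    return [
        [1 - a - c, a, c],      # true genotype 0
        [b, 1 - 2 * b, b],      # true genotype 1 (heterozygote)
        [c, a, 1 - a - c],      # true genotype 2
    ]


def two_parameter(hom_to_het, het_to_hom):
    """Eq. (2.6): the three-parameter model with theta02 = theta20 = 0."""
    return three_parameter(hom_to_het, het_to_hom, 0.0)


def single_parameter(eps):
    """Eq. (2.5): off-diagonal entries eps/2, diagonal entries 1 - eps."""
    return [[1 - eps if j == jp else eps / 2 for j in range(3)]
            for jp in range(3)]


def is_valid(theta, tol=1e-12):
    """Check that entries are probabilities and that each row sums to one."""
    entries_ok = all(0.0 <= x <= 1.0 for row in theta for x in row)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in theta)
    return entries_ok and rows_ok


# Hypothetical error rates, for illustration only.
for name, theta in [
    ("single", single_parameter(0.02)),
    ("two", two_parameter(0.01, 0.02)),
    ("three", three_parameter(0.01, 0.02, 0.001)),
]:
    print(name, theta, is_valid(theta))
```

The full six-parameter model needs no constructor of its own: any 3 × 3 matrix that passes `is_valid` qualifies.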

2.4 Genotype Misclassification for Genomic Data with Three or More Categories

One can extend the discussion of genotype errors in SNPs to consider polymorphic sites where more than two alleles occur in the population (multi-allelic loci), or other scenarios where the genomic data consist of more than three categories. Examples of such data include multi-locus genotypes and haplotypes [29]. Mathematically, this extension amounts to increasing the upper limit of the summation in Eq. (2.2) to the value J (defined below).

An example involving a multi-allelic genotype misclassification matrix appears in a publication by Gordon et al. [30]. In that work, the authors computed the genotype misclassification matrix for the APOE locus (not a SNP), where there are three commonly observed alleles labeled ε2, ε3, and ε4. There are six possible genotypes and thirty possible misclassification parameters. If we use the code 0 = ε2ε2, 1 = ε2ε3, 2 = ε2ε4, 3 = ε3ε3, 4 = ε3ε4, 5 = ε4ε4, then the misclassification probability θ20 = Pr(observed genotype = 0 | true genotype = 2) is the probability that a true ε2ε4 genotype is misclassified as an ε2ε2 genotype. The misclassification matrix is a 6 × 6 matrix. There are 36 probability parameters $\theta_{j'j}$, of which 6 reflect correct classification: $\theta_{j'j'}$, $0 \le j' \le 5$. Hence, there are 30 parameters that represent misclassification of one genotype as another (as noted above).

When the number of alleles increases, the number of misclassification parameters increases. In general, for a locus with a alleles, we extend Eq. (2.1) to

$$\Pr(\text{observed genotype} = j \mid \text{true genotype} = j') = \theta_{j'j}, \quad 0 \le j', j \le J \tag{2.8}$$

with J = a(a + 1)/2 − 1 [25]. For the remainder of this chapter (and in other chapters as well), we will focus our attention on SNPs. If we use data other than SNPs, we will alert the reader.
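The counting argument above can be checked with a short sketch. The function name and the string labels for the APOE alleles are hypothetical; the enumeration order of unordered allele pairs happens to match the 0 to 5 coding used in the text.

```python
from itertools import combinations_with_replacement


def genotype_counts(a):
    """Return (number of genotypes, J, number of misclassification parameters)
    for a locus with `a` alleles, using J = a(a + 1)/2 - 1 from Eq. (2.8)."""
    n_genotypes = a * (a + 1) // 2
    big_j = n_genotypes - 1
    n_misclassification = n_genotypes * (n_genotypes - 1)  # off-diagonal cells
    return n_genotypes, big_j, n_misclassification


# Unordered genotypes for the three APOE alleles, in the order of the coding 0..5.
apoe_genotypes = list(combinations_with_replacement(["e2", "e3", "e4"], 2))
for code, genotype in enumerate(apoe_genotypes):
    print(code, genotype)

for a in (2, 3, 4):
    print(a, genotype_counts(a))
```

For a SNP (a = 2) this gives 3 genotypes and 6 misclassification parameters; for APOE (a = 3) it gives 6 genotypes and 30 parameters, in agreement with the text.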


2.5 Effects of Misclassification on Statistical Tests

One of the main reasons for studying genomic misclassification is that misclassification can alter the type I and type II error probabilities of statistical tests. Another way of stating this fact is that misclassification introduces bias. Such bias, when performing gene-mapping studies, can cause researchers to pursue regions of the genome that do not harbor a disease gene(s), or to miss true disease genes due to a loss of statistical power.

2.5.1 Non-differential Misclassification Error

Non-differential misclassification means that the genotype/genomic error rates are equal in the two (or more) groups being compared. That is, the misclassification parameter values in Table 2.1 are equal in the comparison groups, typically those with a disease or trait and those without. Individuals in the first group are called cases, while those in the second group are called controls. Sometimes, the term "affected" is used for case, and the term "unaffected" is used for control. In Table 2.2, we present an example of non-differential genotype misclassification using fictitious values for the different probabilities. The key point is that the matrices are identical.

Table 2.2 Example of non-differential genotype misclassification in case and control populations

Cases
True genotype    Observed genotype
                 AA (0)     AT (1)     TT (2)
AA (0)           0.980      0.010      0.010
AT (1)           0.020      0.965      0.015
TT (2)           0.020      0.020      0.960

Controls
True genotype    Observed genotype
                 AA (0)     AT (1)     TT (2)
AA (0)           0.980      0.010      0.010
AT (1)           0.020      0.965      0.015
TT (2)           0.020      0.020      0.960

In this table, we provide values for an example SNP with two alleles, A and T. The number in parentheses after each genotype is the number of copies of the T allele in the genotype. For the case and control groups, each cell is the misclassification probability $\Pr(\text{Observed genotype} = j \mid \text{True genotype} = j')$, $j', j \in \{0, 1, 2\}$. As noted above, the variables $j'$ and $j$ refer to the number of T alleles in the respective genotypes


What is the effect of non-differential misclassification on type I error rates? There is no effect; that is, the type I error rate is not altered in the presence of non-differential genotype misclassification when applying chi-square tests of independence for contingency tables (see, e.g., [21, 28]).

What is the effect of non-differential misclassification on statistical power? Typically, power is reduced. For test statistics whose probability density function is a non-central chi-square under the alternative hypothesis, power for a fixed sample size may be determined through the non-centrality parameter (NCP). The larger the NCP, the greater the power for a specified significance level.

Consider the following example. Suppose we have population-based case/control data, and we are testing for association between a phenotype and a SNP locus using the chi-square test of independence for genotypes. For that test, the null hypothesis is that the genotype frequencies are equal in cases and controls. Suppose the true case genotype frequencies at the SNP are Pr(AA) = 0.01, Pr(AT) = 0.18, Pr(TT) = 0.81, and the true control genotype frequencies are Pr(AA) = 0.01, Pr(AT) = 0.18, Pr(TT) = 0.81. Note that the genotype frequencies follow HWE proportions in both populations (Chap. 1, Sect. 1.4.1.1), and pA = 0.10, pT = 0.90 in both the case and control populations. Since the genotype frequencies are equal in the two populations, the null hypothesis is true. Also, let us introduce genotype misclassification into the population of genotypes, where the probabilities are given in Table 2.2. The effect of misclassification is that the observed genotype frequencies are altered away from their true values.

Mote and Anderson [28] and others [31–33] documented how we may incorporate the misclassification probabilities in Table 2.1 into the non-centrality parameter. With this information, we can compute power in the presence of genotype misclassification. We shall discuss this calculation further in Chap. 5.
We compute each of the observed conditional genotype frequencies in the case and control samples directly below.

Cases or Controls

$$\begin{aligned}
\Pr(\text{Obs AA}) &= \Pr(\text{Obs AA} \mid \text{True AA})\Pr(\text{True AA}) + \Pr(\text{Obs AA} \mid \text{True AT})\Pr(\text{True AT}) \\
&\quad + \Pr(\text{Obs AA} \mid \text{True TT})\Pr(\text{True TT}) \\
&= (0.98)(0.01) + (0.02)(0.18) + (0.02)(0.81) = 0.0296.
\end{aligned}$$

One can verify that the other probabilities are Pr(Obs AT) = 0.1900 and Pr(Obs TT) = 0.7804. The genotype frequencies, conditional on affection status, are equal. That is, this example illustrates that, under non-differential misclassification, the null hypothesis is still true.

However, we observe that the conditional genotype frequencies under misclassification differ from those under no misclassification. For example, the respective frequencies for the AA genotype are 0.01 and 0.0296, and similarly, the respective frequencies for the AT genotype are 0.18 and 0.19. These results hold for both the case and control populations. This example illustrates that the conditional frequencies in the presence of genotype misclassification are biased estimates of the true conditional frequencies, that is, of the frequencies obtained when all misclassification parameters are zero.
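The arithmetic in this example can be reproduced with a short sketch that applies Eq. (2.2) to the matrix in Table 2.2. The function and variable names are hypothetical; genotypes are ordered AA, AT, TT, as in Table 2.2.

```python
def hwe_frequencies(p_a):
    """HWE genotype frequencies (AA, AT, TT) for A-allele frequency p_a."""
    p_t = 1.0 - p_a
    return [p_a ** 2, 2 * p_a * p_t, p_t ** 2]


def observed_frequencies(theta, g_true):
    """Eq. (2.2): theta[jp][j] = Pr(observed genotype j | true genotype jp)."""
    n = len(g_true)
    return [sum(theta[jp][j] * g_true[jp] for jp in range(n)) for j in range(n)]


# Table 2.2: the same matrix applies to cases and to controls.
TABLE_2_2 = [
    [0.980, 0.010, 0.010],  # true AA
    [0.020, 0.965, 0.015],  # true AT
    [0.020, 0.020, 0.960],  # true TT
]

g_true = hwe_frequencies(0.10)  # 0.01, 0.18, 0.81
g_cases = observed_frequencies(TABLE_2_2, g_true)
g_controls = observed_frequencies(TABLE_2_2, g_true)

print([round(x, 4) for x in g_cases])     # [0.0296, 0.19, 0.7804]
print(g_cases == g_controls)              # True: the null hypothesis still holds
```

Because the same matrix and the same true frequencies are used for both groups, the observed frequencies are necessarily identical, even though they differ from the true ones.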

2.5.2 Differential Misclassification Error

The second kind of alteration that occurs is differential misclassification. In this instance, genotype/genomic error rates are different in the two (or more) groups that are being compared. In Table 2.3, we present an example of differential genotype misclassification using fictitious values for the different probabilities. The key point is that the matrices are not identical. As an example of differing misclassification rates, consider θ10 in the case and control populations. From Table 2.3, one can check that θ10 is 0.000 in the case population, while it is 0.010 in the control population.

Table 2.3 Example of differential genotype misclassification as indicated by different misclassification probabilities in cases and controls

Cases
True genotype    Observed genotype
                 AA (0)     AT (1)     TT (2)
AA (0)           0.950      0.050      0.000
AT (1)           0.000      0.980      0.020
TT (2)           0.020      0.000      0.980

Controls
True genotype    Observed genotype
                 AA (0)     AT (1)     TT (2)
AA (0)           0.990      0.010      0.000
AT (1)           0.010      0.980      0.010
TT (2)           0.000      0.015      0.985

The legend for this table is identical to that for Table 2.2

Why does differential misclassification cause inflation in type I error? The answer is that differential misclassification has the effect that observed genotype frequencies are different in case and control groups, even though the true genotype frequencies are the same in the two groups; this applies to allele frequencies as well [34, 35].

Consider the chi-square test of independence for SNP genotypes. The null hypothesis is that the conditional genotype frequencies are equal in cases and controls; that is, $g_{j'1} = g_{j'0}$, $0 \le j' \le 2$. The subscripts 1 and 0 refer to the case and control populations, respectively. Let the true genotype frequencies be $g_{01} = g_{00} = 0.0025$, $g_{11} = g_{10} = 0.095$, $g_{21} = g_{20} = 0.9025$. These genotype frequencies are determined

by considering allele frequencies Pr(A) = 0.05, Pr(T) = 0.95, and specifying HWE proportions. From what we have stated, the null hypothesis is true for the true genotype frequencies. From a genetics standpoint, we should not reject the null hypothesis that (in lay terms) there is no association between the disease phenotype and this SNP in 100 × (1 − α) percent of genetic association studies. Here, α is the pre-specified significance level. We compute the observed genotype frequencies using the misclassification parameters in Table 2.3.

Cases

$$\begin{aligned}
\Pr(\text{Obs AA}) &= \Pr(\text{Obs AA} \mid \text{True AA})\Pr(\text{True AA}) + \Pr(\text{Obs AA} \mid \text{True AT})\Pr(\text{True AT}) \\
&\quad + \Pr(\text{Obs AA} \mid \text{True TT})\Pr(\text{True TT}) \\
&= (0.95)(0.0025) + (0.00)(0.095) + (0.020)(0.9025) = 0.0204.
\end{aligned}$$

One can verify that the other probabilities in the case population are Pr(Obs AT) = 0.0932 and Pr(Obs TT) = 0.8864.

Controls

Using the same formula,

$$\Pr(\text{Obs AA}) = (0.99)(0.0025) + (0.01)(0.095) + (0.00)(0.9025) = 0.0034.$$

One can verify that the other probabilities in the control population are Pr(Obs AT) = 0.1067 and Pr(Obs TT) = 0.8899.

From these calculations, we see that the observed genotype frequencies in the case and control populations are no longer equal. That is, from the perspective of hypothesis testing on the observed genotype data, the null hypothesis is no longer true. With sufficient sample size (quantified by Moskvina et al. [35], Ahn et al. [34], and Mayer-Jochimsen et al. [36]), we will therefore reject the null hypothesis more frequently than 100 × α percent of the time, even though the true genotype frequencies are equal in cases and controls and the disease phenotype is not associated with the genotyped SNP under consideration.
Furthermore, the probability of rejecting the null hypothesis will increase with an increase in sample size and/or an increase in the differences between the individual misclassification probabilities in Table 2.3; that is, for a given cell, as the difference between the misclassification probabilities in cases and controls increases, so will the likelihood that we reject the null hypothesis.
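The calculations above can be sketched numerically. The first function applies Eq. (2.2) to the two matrices in Table 2.3. The second computes a non-centrality parameter for the 2 × 3 chi-square test of independence; its formula is an assumption made here, namely the standard form $\sum_j N_A N_U (p_{1j} - p_{0j})^2 / (N_A p_{1j} + N_U p_{0j})$, since Eq. (1.6) is not reproduced in this section. All function and variable names are hypothetical.

```python
def observed_frequencies(theta, g_true):
    """Eq. (2.2): theta[jp][j] = Pr(observed genotype j | true genotype jp)."""
    n = len(g_true)
    return [sum(theta[jp][j] * g_true[jp] for jp in range(n)) for j in range(n)]


def ncp(p_cases, p_controls, n_cases, n_controls):
    """Assumed NCP for the genotype test of independence (see lead-in)."""
    return sum(
        n_cases * n_controls * (p1 - p0) ** 2 / (n_cases * p1 + n_controls * p0)
        for p1, p0 in zip(p_cases, p_controls)
    )


# Table 2.3, rows = true genotype (AA, AT, TT), columns = observed genotype.
CASES = [
    [0.950, 0.050, 0.000],
    [0.000, 0.980, 0.020],
    [0.020, 0.000, 0.980],
]
CONTROLS = [
    [0.990, 0.010, 0.000],
    [0.010, 0.980, 0.010],
    [0.000, 0.015, 0.985],
]

g_true = [0.0025, 0.095, 0.9025]  # HWE with Pr(A) = 0.05, same in both groups
obs_cases = observed_frequencies(CASES, g_true)
obs_controls = observed_frequencies(CONTROLS, g_true)
print("cases:   ", obs_cases)
print("controls:", obs_controls)

# Under the assumed formula, the NCP grows linearly with a common sample size.
for n in (500, 1000, 2000, 4000):
    print(n, ncp(obs_cases, obs_controls, n, n))
```

Under this assumed formula, doubling the number of cases and controls doubles the NCP, which is one way to see why the rejection probability rises with sample size even though the true genotype frequencies are identical.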


To illustrate this point, consider a sample size of $N_A = N_U = 2000$. Then, with the observed genotype frequencies, we may use the NCP (Eq. 1.6) to determine with what probability we reject the null hypothesis for the true genotype frequencies in the presence of differential misclassification error. With the observed genotype frequencies in cases and controls after the introduction of errors, and with the sample sizes, we compute an NCP of 26.0558. Suppose we want to compute the asymptotic power for the chi-square test of independence for genotypes. The test has two degrees of freedom. Further, suppose that we set the significance level to 0.0001. We use the method implemented in the G*Power 3.1 power calculator with the following settings:

1. Test family = $X^2$ tests (software notation).
2. Statistical test = Generic $X^2$ test.
3. Type of power analysis = Post hoc: Compute power, given α and non-centrality parameter.
4. Non-centrality parameter = 26.0558.
5. α = 0.0001.

With these settings, our computed power is 0.8212. This result requires serious thought. It means that 82% of the time when we have null data into which differential misclassification errors have been introduced according to Table 2.3, we reject the null hypothesis, even though we specified that we should only reject the null hypothesis 0.01% of the time when the null hypothesis is true (as it is with the error-less data). This amounts to a more than 8,000-fold increase in the type I error rate relative to the data without error. As a result, researchers may follow up a region of the genome that has nothing to do with the disease.

To obtain a more global view of the effects of differential misclassification, we performed a simulation in which the SNP minor allele frequency ranged between 0.01 and 0.50 randomly for 5000 SNPs. We set the genotype frequencies to be in HWE, and randomly selected (true) genotypes from this distribution for 50,000 cases and 50,000 controls.
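As an aside on the power figure quoted above, the same post hoc calculation can be sketched outside G*Power. The snippet below is a minimal illustration that assumes SciPy is available; the function name `chi_square_power` is ours and is not part of G*Power or of the original analysis. It takes the NCP (26.0558), the degrees of freedom (2), and the significance level (0.0001) from the text, finds the rejection threshold under the central chi-square null distribution, and then evaluates the probability that a non-central chi-square variable exceeds that threshold.

```python
from scipy.stats import chi2, ncx2


def chi_square_power(ncp: float, df: int, alpha: float) -> float:
    """Asymptotic power of a chi-square test with non-centrality parameter ncp."""
    # Rejection threshold under the central (null) chi-square distribution.
    critical_value = chi2.ppf(1.0 - alpha, df)
    # Probability that the non-central chi-square statistic exceeds the threshold.
    return float(ncx2.sf(critical_value, df, ncp))


# Values taken from the text: NCP = 26.0558, 2 degrees of freedom, alpha = 0.0001.
alpha = 0.0001
power = chi_square_power(ncp=26.0558, df=2, alpha=alpha)
print(f"power = {power:.4f}")
print(f"rejection rate relative to alpha = {power / alpha:,.0f}-fold")
```

Under these assumptions the result should land close to the value of 0.8212 reported above; small differences in the last digits may arise from numerical details of the two implementations.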
These data were simulated under the null hypothesis that genotype frequencies are equal in the case and control populations. The sample sizes used are consistent with those being collected for diseases such as obesity, type 2 diabetes, and cancer at the time of this writing [37–39]. For each SNP, after assigning a genotype to a given individual, we created an “observed” genotype by using the probabilities $\theta_{01} = \theta_{10} = 0.005$ in cases and $\theta_{01} = \theta_{10} = 0.001$ in controls. More specifically, we use the error model published by Douglas et al. [26] (Eq. 2.3). For most genotypes, the observed and the true genotype assignments were the same. However, for a non-zero percentage, the observed and true genotypes were not equal.

What are the effects of this misclassification on the chi-square test of independence for genotypes applied to the observed data? We present the results in Fig. 2.2. In this figure, we present a Q-Q plot, where the horizontal axis represents the −log(p-values) for the chi-square test applied to the true data (without errors), and the vertical axis represents the −log(p-values) for the chi-square test applied to the observed data (with differential genotype error). The Q-Q plot is created by sorting each set’s −log(p-values) and creating the scatterplot based on the paired sorting. P-values are computed using a central chi-square distribution with 2 df as the null distribution (Table 1.5).

Fig. 2.2 Q-Q plot of −log(p-values) for chi-square test of independence for genotype data applied to data with and without differential genotype error. Here, we present a scatter plot of −log(p-values), where the p-values are computed by applying the chi-square test of independence for genotypes to simulated genotype data before (horizontal axis) and after (vertical axis) differential genotype misclassification errors are introduced. The sample sizes considered are 50,000 cases and 50,000 controls. Also, data are simulated for 5000 independent SNP markers, and the error rates, using the DSB error model, are $\theta_{01} = \theta_{10} = 0.005$ for cases, and $\theta_{01} = \theta_{10} = 0.001$ for controls. All SNPs for which the HWE goodness-of-fit test produces a p-value less than 0.0001 are removed from the analysis

The effects of misclassification are considerable. If the distributions of the two sets of p-values were the same, then the plot would fall along the line y = x. However, we see that most of the points fall above that line; this result indicates that the distributions for the −log(p-values) are not the same, and hence, the distributions of the chi-square statistics are not the same. A further examination of Fig. 2.2 shows that most of the p-values for the chi-square test applied to the data with errors are smaller than the p-values for data without errors, and this result remains even after filtering out SNPs whose HWE goodness-of-fit test results have p-values less than 0.0001.

Why are the results from Fig. 2.2 so problematic? The reason, as the figure illustrates, is that the null distribution of the chi-square test of independence for genotypes is unknown in the presence of differential misclassification. Consequently, one cannot specify a rejection threshold that attains a specified significance level.
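The design of this simulation can be sketched on a small scale. The code below is only an illustration under stated assumptions, not the authors' simulation code. In particular, the DSB error model (Eq. 2.3) is not reproduced here; as a stand-in we assume that each allele is misread as the other allele independently with a fixed probability, and we borrow the values 0.005 (cases) and 0.001 (controls) purely for illustration. We also use 200 SNPs rather than 5000 to keep the run short, take logarithms to base 10, and omit the HWE filter. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2021)


def allele_flip_matrix(e: float) -> np.ndarray:
    """Rows: true genotype (0, 1, 2 minor alleles); columns: observed genotype.

    ASSUMPTION: each allele is misread as the other allele independently with
    probability e. This is a stand-in, not the DSB model of Eq. 2.3.
    """
    return np.array([
        [(1 - e) ** 2, 2 * e * (1 - e), e ** 2],
        [e * (1 - e), (1 - e) ** 2 + e ** 2, e * (1 - e)],
        [e ** 2, 2 * e * (1 - e), (1 - e) ** 2],
    ])


def add_errors(true_counts: np.ndarray, e: float) -> np.ndarray:
    """Redistribute true genotype counts into observed counts under the assumed model."""
    matrix = allele_flip_matrix(e)
    observed = np.zeros(3, dtype=np.int64)
    for g in range(3):
        observed += rng.multinomial(true_counts[g], matrix[g])
    return observed


def genotype_chi_square(cases: np.ndarray, controls: np.ndarray) -> float:
    """Pearson chi-square statistic for the 2 x 3 case/control genotype table."""
    table = np.vstack([cases, controls]).astype(float)
    table = table[:, table.sum(axis=0) > 0]  # drop genotype columns with no counts
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


n_cases = n_controls = 50_000
n_snps = 200  # the text uses 5000 SNPs; fewer here to keep the sketch fast
stats_true, stats_obs = [], []
for maf in rng.uniform(0.01, 0.50, size=n_snps):
    hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]  # HWE genotype frequencies
    true_cases = rng.multinomial(n_cases, hwe)
    true_controls = rng.multinomial(n_controls, hwe)  # same frequencies: null is true
    stats_true.append(genotype_chi_square(true_cases, true_controls))
    stats_obs.append(
        genotype_chi_square(add_errors(true_cases, 0.005), add_errors(true_controls, 0.001))
    )

# Q-Q coordinates: sort each set of -log10(p-values) separately, then pair them.
qq_x = np.sort(-np.log10(chi2.sf(stats_true, df=2)))
qq_y = np.sort(-np.log10(chi2.sf(stats_obs, df=2)))
print(f"mean chi-square, true data:     {np.mean(stats_true):.2f}")
print(f"mean chi-square, observed data: {np.mean(stats_obs):.2f}")
```

Plotting `qq_y` against `qq_x` together with the line y = x gives a figure in the spirit of Fig. 2.2. Because the error model here is an assumption, the size of the inflation should not be expected to match the published figure.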


In recent years, a number of authors have worked on methods to assess and address differential genotype misclassification, either from a methodological standpoint or in applications of actual data [20, 21, 34, 40–54]. Given the large density of SNPs now available, it is relatively straightforward to find SNPs that appear to have good quality assurance values. This statement leads us to the next question: What are some methods used to test for genetic association when the SNPs may have high genotype misclassification rates?

There are several published methods to determine SNPs with high genotyping error rates. Among them are repeated sampling [55–60], in which subjects are genotyped more than one time with the same technology; double-sampling, in which a random subset of individuals is genotyped a second time with a gold-standard genotyping technology [42, 61–65] and values that deviate from the gold-standard classification are considered misclassified; checks for multiple recombinations among markers with very small inter-marker recombination fractions [66]; and deviations from Hardy–Weinberg Equilibrium (HWE) proportions [67]. We will provide more detail on the repeated sampling (specifically, for phenotypes) and gen methods in later chapters. The deviation from HWE method is by far the most commonly used method in genetic association studies to decide if a SNP has a large genotyping error rate. In fact, it is built directly into the quality assurance checks for genotyping in the PLINK software [68].

For a SNP with alleles A and T, if the A allele frequency in the sample population is p, and the T allele frequency in the sample population is q = 1 − p, then by HWE proportions, we mean that Pr(AA) = $p^2$, Pr(AT) = $2pq$, and Pr(TT) = $q^2$; that is, the genotype frequencies are the products of the allele frequencies (Chap. 1, Sect. 1.3).
Consider a sample of N individuals, in which $O_{AA}$ individuals have the AA genotype, $O_{AT}$ have the AT genotype, and $O_{TT}$ have the TT genotype (the letter O stands for “observed”). Note that $O_{AA} + O_{AT} + O_{TT} = N$. To apply the HWE goodness-of-fit test, first we compute the estimated allele frequencies. They are

$$p = \frac{O_{AA} + (0.5)(O_{AT})}{N}, \qquad q = \frac{O_{TT} + (0.5)(O_{AT})}{N}. \tag{2.9}$$

Next, we compute the expected genotype counts under HWE proportions. Using the allele frequency estimates in Eq. (2.9), they are

$$E_{AA} = N p^2, \qquad E_{AT} = N(2pq), \qquad E_{TT} = N q^2, \tag{2.10}$$

where $E_{AA}$, $E_{AT}$, and $E_{TT}$ are the expected counts for (i.e., expected number of individuals with) genotypes AA, AT, and TT, respectively.


After computing these values, we compute the chi-square goodness-of-fit test statistic:

$$X^2 = \frac{(O_{AA} - E_{AA})^2}{E_{AA}} + \frac{(O_{AT} - E_{AT})^2}{E_{AT}} + \frac{(O_{TT} - E_{TT})^2}{E_{TT}} = \sum_{G} \frac{(O_G - E_G)^2}{E_G}. \tag{2.11}$$

Under the null hypothesis that the observed genotype frequencies are consistent with HWE proportions, $X^2$ asymptotically follows a central chi-square distribution with one degree of freedom. Typically, p-values corresponding to the goodness-of-fit test below a small threshold (e.g., p < 0.0001; see [69]) are used to decide that a SNP has high genotype misclassification error and should not be used in the statistical analysis. While researchers used to check for departures from HWE proportions in both cases and controls, work by Wittke-Thompson et al. [70], among others, documented that, under certain modes of inheritance, case genotype proportions may deviate from HWE proportions even when no genotype error is present. For that reason, it is more common today to test for deviation from HWE proportions in controls only.

To demonstrate the effect of misclassification on the goodness-of-fit (GOF) test for HWE proportions (hereafter labeled the HWE-GOF test; Eq. (2.11)), let us consider an example. Using the notation in Eqs. (2.9) and (2.10) above, suppose we have allele frequencies p = 0.2, q = 0.8. Also, let Pr(AA) = $p^2$ = 0.04, Pr(AT) = $2pq$ = 0.32, and Pr(TT) = $q^2$ = 0.64. That is, the genotype frequencies are precisely in HWE proportions. Further, suppose we sample one thousand individuals from this population and genotype them. In this fictitious example, we obtain the following observed counts: $O_{AA}$ = 45, $O_{AT}$ = 310, $O_{TT}$ = 645. From Eq. (2.9), we obtain p = (45 + (0.5)(310))/1000 = 200/1000 = 0.20 and q = (645 + (0.5)(310))/1000 = 800/1000 = 0.80. It is straightforward to check that the expected genotype counts based on HWE proportions (Eq. (2.10)) are $E_{AA}$ = 40, $E_{AT}$ = 320, and $E_{TT}$ = 640. That is, the expected counts are close to the observed counts, so the observed counts are consistent with HWE proportions ($X^2$ = 0.9766, p-value = 0.3230; Table 2.5). Suppose now that we introduce genotyping error into the set of observed genotypes, using the genotype misclassification matrix in Table 2.4.
Table 2.4 Genotype misclassification matrix used for HWE-GOF test example

Observed genotype     True genotype
                      AA        AT        TT
AA                    0.99      0.02      0.05
AT                    0.01      0.97      0.01
TT                    0.00      0.01      0.94

We have

Pr(Obs AA) = Pr(Obs AA|True AA)Pr(True AA) + Pr(Obs AA|True AT)Pr(True AT) + Pr(Obs AA|True TT)Pr(True TT) = (0.99)(0.04) + (0.02)(0.32) + (0.05)(0.64) = 0.078.
Pr(Obs AT) = Pr(Obs AT|True AA)Pr(True AA) + Pr(Obs AT|True AT)Pr(True AT) + Pr(Obs AT|True TT)Pr(True TT) = (0.01)(0.04) + (0.97)(0.32) + (0.01)(0.64) = 0.3172.
Pr(Obs TT) = Pr(Obs TT|True AA)Pr(True AA) + Pr(Obs TT|True AT)Pr(True AT) + Pr(Obs TT|True TT)Pr(True TT) = (0.00)(0.04) + (0.01)(0.32) + (0.94)(0.64) = 0.6048.

To produce a representative sample of observed data after genotype misclassification, we compute each genotype’s expected number of observed counts, using the probabilities provided above. We have $O_{AA}$ = 1000 × 0.078 = 78; $O_{AT}$ = 1000 × 0.3172 = 317.2, rounded to 317; and $O_{TT}$ = 1000 × 0.6048 = 604.8, rounded to 605. Applying Eq. (2.9), we compute the sample allele frequency estimates to be p = (78 + (0.5)(317))/1000 = 0.2365 and q = (605 + (0.5)(317))/1000 = 0.7635. Using Eq. (2.10) with these allele frequencies and the total sample size N = 1000, one can check that the expected counts (after rounding to three decimals) are $E_{AA}$ = 55.932, $E_{AT}$ = 361.136, and $E_{TT}$ = 582.932. The HWE-GOF test statistic $X^2$ is computed using Eq. (2.11), and the p-value is computed using a central chi-square distribution with one degree of freedom (Table 2.5 reports $X^2$ = 14.9361 and a p-value of 1.1 × 10^−4). In general, the degrees of freedom for the HWE-GOF test are computed as (number of genotypes) − (number of alleles).

What does this example tell us? We observe that, in the presence of genotype misclassification error, even genotype counts that show no deviation from HWE proportions in the absence of misclassification (we do not reject the null hypothesis) can show marked deviation after the introduction of genotype misclassification errors. In fact, if our “cut-off” for eliminating SNPs is a HWE-GOF statistic p-value of 0.001 or less, we would remove this SNP from further consideration in our statistical association analysis.
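The worked example above can be checked numerically. The following sketch, which assumes NumPy and SciPy and uses a hypothetical helper named `hwe_gof`, applies the Table 2.4 misclassification matrix to the true genotype frequencies and then evaluates Eqs. (2.9) through (2.11) for the counts before and after misclassification.

```python
import numpy as np
from scipy.stats import chi2


def hwe_gof(observed_counts):
    """HWE goodness-of-fit test for genotype counts ordered (AA, AT, TT)."""
    o_aa, o_at, o_tt = observed_counts
    n = o_aa + o_at + o_tt
    p = (o_aa + 0.5 * o_at) / n  # Eq. (2.9), A allele frequency
    q = (o_tt + 0.5 * o_at) / n  # Eq. (2.9), T allele frequency
    expected = np.array([n * p ** 2, n * 2 * p * q, n * q ** 2])  # Eq. (2.10)
    observed = np.array(observed_counts, dtype=float)
    x2 = float(((observed - expected) ** 2 / expected).sum())  # Eq. (2.11)
    # Degrees of freedom: 3 genotypes - 2 alleles = 1.
    p_value = float(chi2.sf(x2, df=1))
    return x2, p_value


# Table 2.4: rows are observed genotypes, columns are true genotypes (AA, AT, TT).
misclassification = np.array([
    [0.99, 0.02, 0.05],
    [0.01, 0.97, 0.01],
    [0.00, 0.01, 0.94],
])
true_freqs = np.array([0.04, 0.32, 0.64])  # HWE proportions for p = 0.2, q = 0.8

# Law of total probability: Pr(Obs g) = sum over true genotypes of Pr(Obs g | True) Pr(True).
observed_freqs = misclassification @ true_freqs
counts_with_error = np.rint(1000 * observed_freqs).astype(int)

x2_before, p_before = hwe_gof([45, 310, 645])
x2_after, p_after = hwe_gof(counts_with_error)
print(f"no misclassification:  X2 = {x2_before:.4f}, p-value = {p_before:.4f}")
print(f"with misclassification: X2 = {x2_after:.4f}, p-value = {p_after:.2e}")
```

The two printed lines should agree with the $X^2$ and p-value rows of Table 2.5 up to rounding in the last reported digit.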
A second and equally important finding is that the two largest contributions to the HWE-GOF statistic value (third row from the bottom in Table 2.5) come from the AA and AT genotypes. The reason that these components are relatively large is that misclassification of the more common homozygote (TT in our example) as the less common homozygote produces an ever-increasing cost, in terms of MSSN for fixed statistical power of the chi-square test of independence for genotypes, as the misclassification error rate $\theta_{20}$ (Table 2.1) increases [33]. Similarly, misclassification

Table 2.5 Computation of steps to produce HWE-GOF test value for fictitious example before (no misclassification) and after (genotype misclassification) errors introduced into counts

No misclassification
Variables used                  Genotype (G)
                                AA            AT            TT
Observed counts (O_G)           45            310           645
Sample allele frequencies       p = 0.20, q = 0.80
Expected counts (E_G)           40            320           640
(O_G − E_G)                     5             −10           5
(O_G − E_G)^2                   25            100           25
(O_G − E_G)^2 / E_G             0.625         0.3125        0.0391
X^2                             0.9766
P-value                         0.3230

Genotype misclassification
Variables used                  Genotype (G)
                                AA            AT            TT
Observed counts (O_G)           78            317           605
Sample allele frequencies       p = 0.2365, q = 0.7635
Expected counts (E_G)           55.9323       361.1355      582.9323
(O_G − E_G)                     22.0678       −44.1355      22.0678
(O_G − E_G)^2                   486.9856      1947.9420     486.9856
(O_G − E_G)^2 / E_G             8.7067        5.3939        0.8354
X^2                             14.9361
P-value                         1.1 × 10^−4

In this table, the subscript G refers to the particular genotype {AA, AT, TT} for which a given value (in the rows under the heading “Variables used”) is provided. For example, let G equal AT. Then the Observed counts (O_G) row (third row under “Genotype misclassification”) has a value of 317 when the genotype G is equal to AT. Similarly, the Expected counts (E_G) row has a value of 361.1355, and the row (O_G − E_G) has a value of 317 − 361.1355 = −44.1355


of the more common homozygote (TT in our example) as the heterozygote produces an ever-increasing cost as the misclassification error rate $\theta_{21}$ (Table 2.1) increases. Note that in our example, we have both types of misclassification. In Table 2.4, we specify $\theta_{20} = 0.05$ and $\theta_{21} = 0.01$. If we study the table further, we see that the increase in the HWE-GOF statistic after misclassification occurs because of an increase in the observed AA genotype count $O_{AA}$ (from 45 under no misclassification to 78 under misclassification), and because of a change in the AT expected count $E_{AT}$ (from 320 under no misclassification to 361.1355 under misclassification). While the counts alone will not produce an increase in the HWE-GOF test statistic, because both genotypes contain at least one copy of the minor allele, the denominator in the term $(O_G - E_G)^2/E_G$ is likely to be smaller, producing a relatively larger contribution to the HWE-GOF test statistic. Leal et al. extended this work to the HWE-GOF test [71], and Ahn et al. [72] extended this work to the trend test for genetic association. In all instances, of the six possible misclassification probabilities (Table 2.1), the two listed above are the most deleterious in terms of creating a departure from HWE proportions and reducing statistical power for association.

2.5.3 Non-differential Misclassification in Family-Based Tests of Association

In the previous sections, we focused on population-based tests of association (cases and controls). In this section, we consider the effects of non-differential misclassification on the TDT, one type of family-based test of linkage/association. It is well documented that the TDT is a significant contribution to the statistical genetics literature (e.g., [73, 74]). Nonetheless, there are some limitations. Among these limitations is the robustness of the statistic to issues such as missing genotypes in one parent, and/or genotype misclassification error. We note that both issues are examples of heterogeneity, in that the observed trio data are a mixture of two types of data. In this work, we focus on the latter (genotype misclassification error).

For genotype misclassification error, data sets where some trios display genotype misclassification error (i.e., genotype errors are detected [26, 27, 67, 75–85]) may be separated into two subsets: the set of trios that displays Mendelian consistency for a typed SNP (e.g., [86]), and the subset that displays Mendelian inconsistency (presumably, genotype misclassification error). Cheung et al. [87] provide a clear and concise definition of Mendelian consistency/inconsistency and the consequences of such errors in genetic analyses. We quote from their work: “In pedigrees, genotyping errors are either Mendelian inconsistent (MI) or Mendelian consistent (MC). A MI genotyping error is an error that is detected because the observed genotypes are not consistent with the transmission pattern as specified


by the Mendel’s First Law. Specialized program such as PedCheck [88] and many programs to perform linkage analyses [27, 89, 90] can be used to detect MI errors. When a marker is flagged as Mendelian inconsistent (MI), this marker most likely has either a genotyping error or a de-novo mutation. The fraction of errors that escapes the Mendelian inconsistency check depends largely on a particular pedigree structure [91] and can be high, especially if a considerable numbers of subjects are unobserved for genotypes [26, 86]. Genotyping errors that are not MI are MC, and they cannot be detected by using MI checks.”

Cheung et al. [87] document an important fact. When genotype misclassification errors occur in SNPs in a data set, only a portion of the trios with errors can be detected through Mendelian inconsistency (MI). There have been a number of papers, including recent ones (at the time of this writing), studying the probability of detecting genotyping errors in family-based samples, and the effects of such errors on statistical tests of genetic association [80, 86, 92–101].

Fig. 2.3 a Example of genotype error where MC check flags error. In this drawing we use the convention that squares represent males, circles represent females, horizontal lines indicate matings, and vertical lines represent offspring. There are two alleles at the SNP locus, A and T. Also, filled-in shapes are individuals who are affected with a given disease (phenotype). The red letter indicates the error (allele A misclassified as a T). b Example of genotype error where MC check does not flag error. In this drawing we use the same conventions as those listed in Fig. 2.3a. Note that, in this example, the affected child is female and the red colored T is the same allele as observed in Fig. 2.3a


In Fig. 2.3a, b, we present examples of genotype errors that are detected and not detected, respectively, through Mendelian consistency (MC) checks. For the example in Fig. 2.3a, the error is flagged because (barring mutation) the only possible genotype for the child is AA. In Fig. 2.3b, the child’s genotype after misclassification is possible, since the father can transmit a T allele and the mother must transmit an A allele. For our examples with the TDT, we specify that each individual (father, mother, affected child) in the trio has an equal probability of having their genotypes misclassified and that the misclassification is independent of disease status; that is, it is non-differential. What is the consequence of removing trios with observed genotype misclassification errors, while using those, like the trio in Fig. 2.3b, that have errors but show Mendelian consistency? Heath and Ott documented that the TDT, when applied to such data, shows an inflation in type I error [102]. Gordon et al. [82] similarly documented this inflation, and they and other researchers [103–105] developed statistics that maintain the correct type I error rates even in the presence of genotype misclassification error. In Chap. 5, we present an analytic solution to the increase in type I error for the TDT statistic.

2.6 Errors in Next-Generation Sequencing (NGS)

There have been several recently developed methods to score genotypes. Here, we focus on mathematical models for errors that may arise in next-generation sequencing (NGS). To provide a background on NGS and its importance, we paraphrase portions of an article presented in Wikipedia [106].

“DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery. Next-generation sequencing applies to genome sequencing, genome resequencing, transcriptome profiling (RNA-Seq), DNA–protein interactions (ChIP-sequencing), and epigenome characterization [107]. Re-sequencing is necessary, because the genome of a single individual of a species will not indicate all of the genome variations among other individuals of the same species. The high demand for low-cost sequencing has driven the development of high-throughput sequencing (or NGS) technologies that parallelize the sequencing process, producing thousands or millions of sequences concurrently [108, 109]. High-throughput sequencing technologies are intended to lower the cost of DNA sequencing beyond what is possible with more standard methods [110]. In ultra-high-throughput sequencing as many as 500,000 sequencing-by-synthesis operations may be run in parallel [111–113].”


Fig. 2.4 Example of SNP calling for fictitious sequence reads aligned to the reference genome. The reference sequence is provided at the bottom of the figure and is labeled using an arrow. The row of digits below the reference sequence base-pairs enumerates the specific base-pairs (a total of 92). The numbers in parentheses underneath the digits indicate the ten’s digit for each single digit. For example, the first ten digits are the 1st through 10th base-pairs (1–10), the next ten digits are the 11th through 20th base-pairs (11–20), and so forth. Heterozygous SNPs are indicated by alternative alleles that appear in green. SNPs that are homozygous for the alternative allele are indicated in blue, and those sequence reads that are errors are indicated in red

Furthermore, NGS technologies can identify rare variants in the genome, and it has been hypothesized that these rare variants account for a significant portion of heretofore undetected disease risk [114, 115]. Before continuing, we provide definitions and notation that we use here and in later parts of the book. We use Fig. 2.4 to illustrate examples of several of the terms.

2.6.1 Definitions and Notation

Molecular location = A gene’s precise position on a chromosome. As documented by the researchers at the National Institutes of Health (US), “The Human Genome Project, an international research effort completed in 2003, determined the sequence of base pairs for each human chromosome. This sequence information allows researchers to provide a more specific address than the cytogenetic location for many genes. A gene’s molecular address pinpoints the location of that gene in terms of base pairs. It describes the gene’s precise position on a chromosome and indicates the size of the gene. Knowing the molecular location also allows researchers to determine exactly how far a gene is from other genes on the same chromosome” [116].

(Physical) Base-pair position = The molecular location for a SNP. Online genomics webtools such as NCBI’s dbSNP [117], the University of California at Santa Cruz’s Genome Browser [118, 119], and others provide the base-pair positions for SNPs. As the sequence maps (referred to as “reference assemblies”) have been updated over time (due to drops in DNA sequencing cost), the base-pair positions may have changed with each update [120].

Reference (Genomic) sequence = “A digital nucleic acid sequence database, assembled by scientists as a representative example of a species’ set of genes. As they are often assembled from the sequencing of DNA from a number of donors, reference genomes do not accurately represent the set of genes of any single person. Instead a reference provides a haploid mosaic of different DNA sequences from each donor”. This citation comes from the publication [121]. A fictitious reference sequence comprising 92 base-pairs from an individual is labeled in Fig. 2.4.

Sequence read = Li and Freudenberg write, “Shotgun and next-generation sequencing (NGS) involve shredding the genome into smaller fragments, and sequence either full or part of the fragments. The sequenced fragments are called [sequence] reads [122]”. A sequence read may consist of one set of nucleotides that appear consecutively in an individual’s genome (maternal or paternal strand; set referred to as a fragment), or it may consist of multiple, mutually exclusive fragments [123]. In Fig. 2.4, ignoring the reference sequence, each horizontal line consists of sequence reads for an individual. Each sequence read consists of a single fragment.
Examples of reference genomes for humans are provided in the UCSC Genome Browser [118, 119]. While the true realization of a reference genome is considerably more complicated, for the purposes of our work, one may think of a reference sequence as a subset of the entire reference genome, consisting of a string of letters from the set {A, T, C, G}, where repetition is possible. Using the quotation above, the reference sequence is a mosaic of the most common patterns observed in a set of individuals’ DNA sequence.

Reference allele = The allele in the reference genome sequence at a given base-pair position. For our example in Fig. 2.4, the reference alleles are the string of 92 letters at the bottom of the figure. Each allele has a unique position, from 1 to 92.

Alternate (or alternative) allele = The allele that differs from the reference allele at a given base-pair position. Examples of reference and alternate alleles are provided in Fig. 2.4. Alternate alleles appear in sequence reads at positions 11, 26, 40, 68, and 75. At position 11, the reference allele is C, and the alternate allele T (labeled in green) appears in the second, third, and sixth sequence reads (counting from the bottom up). At position 26, the reference allele is T, and the alternate allele G (labeled in red) appears in the fifth sequence read. At position 40, the reference allele is C, while the alternate allele is A (labeled in blue). This allele appears in all of the sequence reads. For position 68, the reference allele is A. The alternate allele, G (labeled in green), appears in the third through fifth and seventh through ninth sequence reads. Finally, the alternate allele A (in red) appears in the tenth sequence read at position 75, while the reference allele is C. The polymorphic SNPs in Fig. 2.4 occur at positions 11, 40, and 68. The other positions may appear to be SNPs, due to the sequence errors.

Multi-locus genotype (MLG) = The combination of specific genotypes across two or more loci for an individual k. For example, given SNP01 with alleles A and T in the population, SNP02 with alleles C and T in the population, and SNP03 with alleles A and G in the population, then one possible multi-locus genotype from the set of SNPs [SNP01, SNP02, SNP03] for individual k may be written in vector form as (AA, CT, GG). That is, the genotypes at the first through third loci are AA, CT, and GG, respectively. In mathematical notation, for M SNP loci, we write $G_k = (j_{1k}, \ldots, j_{Mk})$ to represent the MLG for individual k. The MLG consists of the true genotype $j_{mk}$ at the mth SNP locus, 1 ≤ m ≤ M. The genotype $j_{mk}$ may be displayed in a number of ways, from pairs of letters (e.g., $j_{1k}$ = AA, $j_{2k}$ = CT, $j_{3k}$ = GG for our example above) to an integer value (0, 1, or 2) representing the number of copies of a prespecified reference allele in the mth SNP locus. To use the numerical coding for genotypes, we need to choose a reference allele for each SNP. In our example, let A, C, and A be the reference alleles for SNP01 through SNP03, respectively. Then $j_{1k}$ = 2, $j_{2k}$ = 1, and $j_{3k}$ = 0, and we can write the MLG for individual k in vector form as (2, 1, 0). We observe that, for M loci, there are $3^M$ possible MLGs.

Sequencing coverage = “The number of reads that include a given nucleotide [allele] in the reconstructed sequence” [124, 125]. In a section of sequence where there are M > 1 SNPs, for the mth SNP of an individual indexed by k, we denote this quantity by $v_{mk}$.
To be consistent with the MLG notation above, we represent the coverage for a vector of SNPs by v k = (v1k , . . . , v Mk ). Observe that the individual coverage values need not be equal to one another. Here and elsewhere, we shall abbreviate this term by “coverage”. Some example coverages in Fig. 2.4 are seven for position 11, nine for position 26, and five for position 40. Observed reference allele count = The number of times that, for a given SNP position in our sequence, we observe the reference allele. For the mth SNP of individual k, we denote this quantity by xmk . As with other notation in this section, we represent the alternative allele counts for a vector of SNPs by x k = (x1k , ..., x Mk ). For the mth SNP of individual k’s sequence data, at the mth SNP position, we have the bounds 0 ≤ xmk ≤ vmk [47]. We comment that the allele counts are not necessarily the true number of alternative alleles in a reconstructed sequence for a given individual k. Some example observed reference allele counts in Fig. 2.4 are four for position 11 (base-pair C is the reference allele), eight for position 26 (base-pair T is the reference allele), and zero for position 40 (base-pair C is the reference allele). Observed alternative allele count = The number of times that, for a given SNP position in our sequence, we observe the alternative allele. For the mth SNP of individual k, we denote this quantity by xmk . As with other notation in this section, we represent the alternative allele counts for a vector of SNPs by x k = (x1k , ..., x Mk ).

2.6 Errors in Next-Generation Sequencing (NGS)

75

For the $m$th SNP of individual $k$’s sequence data, we have the bounds $0 \le x_{mk} \le v_{mk}$ [47]. We comment that the allele counts are not necessarily the true number of alternative alleles in a reconstructed sequence for a given individual $k$. Using the examples above, we can compute the observed alternative allele count as the difference of the coverage and the observed reference allele count. So, for position 11 in Fig. 2.4, the observed alternative allele count is seven minus four, or three. Similarly, the observed alternative allele counts at positions 26 and 40 are one and five, respectively. Another way to determine the observed alternative allele count for a position in Fig. 2.4 is to count the number of alleles at the position that are not black.

Sequencing misclassification probability for a given individual $k$ = $\varepsilon_{i_k}$ = The probability that individual $k$’s true sequenced nucleotide (allele) differs from the observed value, given that the individual’s true phenotype is $i_k$. These errors are for a given base-pair, not a sequence of nucleotides. Misclassifications need not occur only at positions where there are SNPs. In fact, sequence misclassification at base-pair positions that are not SNPs may cause one to think that a SNP is present. This phenomenon has been documented more generally by Robasky et al. [126]. In Fig. 2.4, positions 26 and 75 illustrate this phenomenon. In the supplement to their paper, Kim et al. [47] assign a symmetric constraint for the probability $\varepsilon_{i_k}$. That is, the error probability does not depend on the specific nucleotide. Written another way, given that there are only two nucleotides X and Y (from A, C, T, G) at a base-pair position in a population under study,

$$\Pr(\text{observed nucleotide} = X \mid \text{true nucleotide} = Y) = \Pr(\text{observed nucleotide} = Y \mid \text{true nucleotide} = X). \tag{2.12}$$
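The symmetric constraint in Eq. (2.12) can be illustrated with a small simulation. The following is a minimal sketch, not the procedure of Kim et al. [47]: the function names, the error probability 0.01, the nucleotides C and T, and the number of simulated reads are all assumptions chosen for illustration.

```python
import random


def observe_base(true_base, other_base, eps, rng):
    """Return the observed base: miscalled as other_base with probability eps.

    The same eps is used whichever base is the true one, which is the
    symmetric model of Eq. (2.12).
    """
    return other_base if rng.random() < eps else true_base


def empirical_error_rate(true_base, other_base, eps, n_reads, rng):
    """Fraction of n_reads simulated base calls that differ from true_base."""
    errors = sum(
        observe_base(true_base, other_base, eps, rng) != true_base
        for _ in range(n_reads)
    )
    return errors / n_reads


rng = random.Random(2021)  # fixed seed so the sketch is reproducible
eps = 0.01                 # hypothetical per-base error probability
n_reads = 200_000          # hypothetical number of simulated base calls

# True nucleotide Y = T observed as X = C, and the reverse direction.
rate_x_given_y = empirical_error_rate("T", "C", eps, n_reads, rng)
rate_y_given_x = empirical_error_rate("C", "T", eps, n_reads, rng)
print(rate_x_given_y, rate_y_given_x)
```

Under the symmetric model both empirical rates should be close to the single parameter `eps`; an asymmetric model would need one parameter per direction.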

This type of model is in contrast to the genotype misclassification error models presented above, where misclassification probabilities are dependent on the particular genotype (Sect. 2.3). It is possible to consider asymmetric models, such as

$$\Pr(\text{observed nucleotide} = X \mid \text{true nucleotide} = Y) \neq \Pr(\text{observed nucleotide} = Y \mid \text{true nucleotide} = X). \tag{2.13}$$

However, as pointed out in the supplement of the Kim et al. publication [47], if Y is a rare allele, e.g., with a frequency less than 1% in the population, the estimation of the probability Pr (observed nucleotide = X|true nucleotide = Y) is unstable precisely because Y has such a small frequency. Simulation results suggest that estimation is unstable for sample sizes as large as 10,00 individuals [47].

2.6.1.1 What Are Estimated NGS Probabilities for Empirical Data?

[We gratefully acknowledge Dr. Lisheng Zhou, who provided us with the material for this section. The work comes from her (yet to be published) thesis [127]. The main purpose of that work was the development of an “$LRT_{ae}$-like” statistic for NGS data. Output from the application of that statistical method includes estimated NGS error rates for controls $\varepsilon_0$ and cases $\varepsilon_1$ under the null hypothesis that MLG frequencies are equal in case and control populations, and under the alternative hypothesis that case- and control-MLG frequencies are not necessarily equal.]

To answer the question posed in this heading, Dr. Zhou applied her statistical method to NGS data from the 1000 Genomes Project [128–132]. We paraphrase from Dr. Zhou’s work. “Available exome sequencing data on Human Chromosome 20 were downloaded in BAM (Binary Sequence Alignment/Map) [133] format. The data came from 2,504 individuals sequenced for the 1000 Genomes Project Phase 3 Data Archive [134]. To make the process computationally efficient, only those regions on Human Chromosome 20, specifically from base-pair position 60,897,487 to position 60,908,969 (Build 37), were obtained for each individual [47]. After sorting and indexing the data on the selected region, the options ‘mpileup’ in SAMtools (version 1.3.1) [133, 135] and ‘call –m’ in BCFtools (version 1.3.1) [133, 136] were applied to obtain variant calls. Next, data were placed into Variant Calling Format (VCF) [137]. After all variants were called for every individual, we filtered out variants with QUAL (quality) lower than 100. QUAL is the Phred-scaled probability indicating the existence of a variant [137]”. Dr. Zhou selected three base-pair positions (all SNP loci) from the Chromosome 20 region provided above. The positions were 60,907,675, 60,908,964, and 60,908,969. Among all 2,504 individuals, she kept those individuals who were heterozygous or homozygous for the alternate allele at all three of the loci. 
A total of 1,314 individuals satisfied this condition. Her reason for excluding individuals who were homozygous for the reference allele at any locus is that, in the VCF file, no information was provided. For example, there was no distinction between a homozygote and a missing genotype. The alternate allele counts and sequencing coverages (the number of reads covering or bridging a given nucleotide position; labeled “DP” in VCF format) for each of the three loci were extracted using a computer program developed by Dr. Zhou. In the selected region, the alternate allele counts over all SNPs and individuals ranged from 4 to 84, with a mean count of 20.8 and a median count of 18. The corresponding DP values ranged from 5 to 158, with a mean coverage of 28.4 and a median coverage of 26 [127]. For each of the 1,314 individuals, the affection status was assigned randomly, with 657 affected (cases) and 657 unaffected (controls) individuals. Let us label this Replicate 1. Dr. Zhou applied her method to this data set to obtain MLEs for the sequencing error probabilities $\varepsilon_0$ and $\varepsilon_1$.
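The QUAL filter and the extraction of DP values described above can be sketched in a few lines. This is not Dr. Zhou's program, which is not reproduced in the text. It is a minimal illustration that assumes tab-separated VCF data lines in the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with a DP entry in the INFO column. The function names are hypothetical, and apart from the two positions taken from the text, every value in the example records is invented.

```python
def parse_info(info_field):
    """Split a VCF INFO string such as 'DP=28;MQ=60' into a dictionary."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value  # flag entries without '=' get an empty value
    return info


def keep_record(vcf_line, min_qual=100.0):
    """Return (position, DP) if the record's QUAL is at least min_qual, else None."""
    fields = vcf_line.rstrip("\n").split("\t")
    pos = int(fields[1])
    qual = float(fields[5])
    info = parse_info(fields[7])
    if qual < min_qual:
        return None
    return pos, int(info["DP"])


# Hypothetical records: only the positions come from the text.
records = [
    "20\t60907675\t.\tG\tA\t221.0\t.\tDP=28",
    "20\t60908964\t.\tC\tT\t35.5\t.\tDP=12",
]

kept = [r for r in map(keep_record, records) if r is not None]
print(kept)
```

The second record is dropped because its QUAL value is below 100, mirroring the filter quoted from Dr. Zhou's work.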


To obtain a distribution of sequencing error rates for the 1000 Genomes Project data set, Dr. Zhou permuted the (fictitious) case and control status for all 1,314 individuals an additional 499 times to obtain 500 null replicates. For each of these replicates, she computed the MLEs for $\varepsilon_0$ and $\varepsilon_1$ (as described in Chap. 1, Sect. 1.7). The MLEs were computed by applying the differential sequencing error model (i.e., $\varepsilon_0$ and $\varepsilon_1$ were not subject to the constraint equation $\varepsilon_0 = \varepsilon_1$). Given the number of replicates and the fact that we included MLEs for both the null and alternative hypotheses, we present four box plots using 500 values each in Fig. 2.5. In Chap. 1, Sect. 1.5, we described the different features of a box plot. In Fig. 2.5, we observe that the average (sample mean) error-MLEs are 0.133 and 0.134 in the fictitious controls and cases, respectively. The median MLE for both groups is

Fig. 2.5 Box plots of 2,000 sequencing error maximum likelihood estimates (MLEs), stratified by (fictitious) case-control status and hypothesis (null, alternative). Each box plot is generated using 500 sequencing error MLEs, where the MLEs are determined for a specific (fictitious) affection status with each replicate’s data being analyzed under a specific hypothesis (null or alternative)


Table 2.6 Summary statistics for simulation sequencing error MLEs

Statistic   Control-Null   Control-Alternative   Case-Null   Case-Alternative
Min         0.119          0.117                 0.087       0.085
Median      0.129          0.128                 0.099       0.098
Mean        0.129          0.128                 0.099       0.098
Max         0.138          0.137                 0.115       0.111

We present four of the five summary statistics from the box plot for the four scenarios where null sequencing data were generated using the 1000 Genomes Data Project as the framework. The only changes made were the alternate read counts. These were generated randomly using the method described in Dr. Zhou’s thesis [127]. The 1000 Genomes Project Data provided genotype values for all individuals and all SNPs

0.133. Of note, the minimum sequencing error MLEs in the control and case groups are 0.128 and 0.127, respectively (both suspected outliers). The mean, median, and minimum MLEs for both groups are considerably larger than previously published error rates for the NGS platforms Illumina HiSeq (0.0034), Ion Torrent PGM (0.019), and Complete Genomics (0.024) [138]. What accounts for this difference? One possibility considered was that Dr. Zhou’s statistical method did not estimate parameters correctly. To evaluate this, she performed another simulation study, in which the true value of $\varepsilon_0$ was set to 0.13, and the true value of $\varepsilon_1$ was set to 0.10. Dr. Zhou generated 500 replicates of case-control data under the null hypothesis (a full description of the simulation procedure is provided in the work [127]). The (min, median, mean, max) MLEs for the Control-Null, Control-Alternative, Case-Null, and Case-Alternative scenarios are presented in Table 2.6. We observe three things: the mean and median values are the same for each scenario; these values are practically equal to the generating error parameter values for each scenario (e.g., the difference between $\varepsilon_0$ (equal to 0.13) and the Control-Null median value is 0.001); and the distances of each median value to its corresponding min and max MLEs are approximately equal for a given scenario. The one exception occurs for the Case-Null statistics: the difference (Median − Min) is 0.012, and the difference (Max − Median) is 0.016. A study of these results suggests that the discrepancy between Dr. Zhou’s results for the 1000 Genomes Project data and the published results was not due to a flaw in our statistical method, at least not for error parameter values on the order of 0.10 and 0.13. Upon further investigation, Dr. Zhou discovered that there are two types of coverage provided in the 1000 Genomes Project data. These may be found through use of the “mpileup” function in SAMtools. 
The two types are labeled “DP” (mentioned above) and “DP4”. One can find the definitions for each on the webpage for the SAMtools reference [132]. The definitions are as follows. POS = The leftmost position of the variant/SNP (integer valued); DP = The number of reads covering or bridging the position POS;


DP4 = The sum of counts in each of four categories: (1) forward reference alleles; (2) reverse reference alleles; (3) forward non-reference alleles; and (4) reverse non-reference alleles. All categories of alleles are used in variant calling. The DP4 value may be smaller than the DP because low-quality bases are not counted. We determined an updated distribution of sequencing error rates for the 1000 Genomes Project data set, by performing the same steps as those that led to the results in Fig. 2.5, with the one difference being that we used the “DP4” values for coverage in lieu of the “DP” values. Like the steps applied above, we computed the case and control sequence error MLEs. Results are presented in Fig. 2.6. Comparing the results of Figs. 2.5 and 2.6, we see a substantial difference. Focusing on the results in Fig. 2.6, we see that the maximum MLE of any scenario (Control-Null, Control-Alt, Case-Null, Case-Alt) is less than 0.0055. Also, for each scenario, greater than 99% of the MLEs lie between approximately 0.0033 and 0.0051. The median and mean MLEs are between 0.004 and 0.0045. These values

Fig. 2.6 Box plots of sequencing error MLEs using DP4 values as the coverage values. The box plots in this figure are generated in precisely the same manner as those created for Fig. 2.5, with the exception that the coverage values for these box plots are DP4 values, rather than the DP values used in Fig. 2.5


are consistent with error rates reported for Illumina HiSeq (0.34%), and are slightly lower than those reported for Ion Torrent PGM (1.9%) and Complete Genomics (2.4%) [138]. All error rates come from this 2013 publication.

What can we learn from this example? First, it is critically important that the statistician or statisticians have a full understanding of the data they are analyzing. Using the data with higher error rates can lead to a loss in statistical power, an increase in the significance level, or both. We have seen these results for statistical tests when the input data are genotypes [19, 31–36, 42, 55, 72, 80, 104, 139–149]. This issue is of even greater importance in NGS studies because the genotypes are latent variables; that is, they are not observed but inferred through methods like EM. As a result, errors are magnified, meaning that power loss and/or increase in type I error is greater. Next, our results indicate consistency between our sequencing-error MLEs and those reported previously [138]. Also, our simulation results suggest that accurate estimates of sequencing error may be achieved even when the rates are quite low (1%). We conjecture that this accuracy in estimation stems from the fact that the coverage values are quite large, even for the DP4 category, and therefore the expected number of errors meets or exceeds those values suggested by Cochran [150]. Finally, we note that the range of the sequencing error MLEs is quite small (maximum range is approximately 0.0022 in Fig. 2.6, for Control-Null and Control-Alt MLE distributions). This observation speaks to the consistency of the parameter estimates for this 1000 Genomes Project data subset.

As noted by several review articles in the last decade (since 2006) [151–162], there are unique challenges regarding quality assurance when determining accurate individual genotypes among regions of the genome. In Fig. 2.7, we give another example of NGS error for a fictitious individual’s sequence data. This example differs from the example in Fig. 2.4 in that the sequence-read lengths are equal for every read. We provide this example to view NGS data in a more complete manner, and to illustrate some mathematical results. In Fig. 2.7, the sequence errors are denoted in bold font, and the position and alleles of the true SNP are indicated by italic font. If we use the approximation that each of an individual’s two sequence-reads (maternal and paternal) has a 50% probability of being selected, and that selection is random, then if base-pair position 6 in Fig. 2.7 were truly heterozygous, the probability of observing one T and six A alleles is given by the Binomial Formula:

$$\Pr(7, 0.5; 1) = \binom{7}{1}(0.5)^{1}(0.5)^{6} = \frac{7}{2^{7}} \approx 0.055.$$

While this number is small, we could not say with certainty that the locus at position 6 was truly homozygous AA with sequence error, or whether the locus was heterozygous with no sequence error. However, as we increase the coverage (that is, the number of sequence-reads sampled), then the probability of observing only one


Fig. 2.7 Example of sequence read error for fictitious sequence data. We generate seven sequence-reads of the same genomic sequence for a fictitious individual. The sequence-reads are indicated by the rightmost column labels. For example, the first sequence-read is labeled “S01”, the second “S02”, etc. For a given sequence-read, there are precisely 26 base-pairs; these are marked in steps of five by the numbers in the first row. For each sequence-read, the first base-pair under the number 1 is “A”; the fifth base-pair under the number 5 is “T”; and so forth. The two sequence errors are the 6th base-pair in Sequence-read S07, indicated in bold font as “T”, and the 15th base-pair in Sequence-read S04, indicated in bold font as “C”. By contrast, the twentieth base-pair is the site of a true SNP, with alleles “A” and “T”; that is, this person is heterozygous at this SNP location. We indicate the sequence reads at this position by italic font

copy of one allele (like T and C in positions 6 and 15), while all other sequence-reads contain the other allele, becomes much smaller. We show this calculation in Table 2.7. Studying this table, we can see that coverage needs to be at least 25 for us to confidently conclude that the one copy of the minor allele (base) is a sequence error and not the result of sequence-read selection for a heterozygous SNP locus.

Table 2.7 Probability of observing exactly one copy of a given allele and C − 1 copies of another allele when the coverage is C

Coverage C   Pr(observing one copy of a given allele)
2            0.500
5            0.156
10           0.010
25           $7.45 \times 10^{-7}$
50           $4.44 \times 10^{-14}$

Based on the comments above, each probability in this table is calculated using the Binomial Formula, $\Pr(C, 0.5; 1) = \binom{C}{1}\left(\frac{1}{2}\right)^{C} = \frac{C}{2^{C}}$. In this formula, we do not consider any sequencing error
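The entries of Table 2.7, and the value 7/2^7 for a coverage of seven, can be reproduced directly from the Binomial Formula. The sketch below is only a check of that arithmetic; the function name `prob_one_copy` is an assumption, and, as in the table, sequencing error is ignored.

```python
from math import comb


def prob_one_copy(coverage):
    """Pr(exactly one of `coverage` reads carries a given allele).

    Assumes a true heterozygote, each read drawn from either chromosome
    with probability 0.5, and no sequencing error.
    """
    return comb(coverage, 1) * 0.5 ** coverage


# Coverage 7 is the Fig. 2.7 example; the others are the rows of Table 2.7.
table = {c: prob_one_copy(c) for c in (2, 5, 7, 10, 25, 50)}
for c, p in table.items():
    print(f"C = {c:2d}   Pr = {p:.3g}")
```

Because the probability equals C/2^C, it falls roughly by half with each additional read, which is why the entries for coverages of 25 and 50 are so small.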


2.6.2 Mathematical Model for NGS Data

In general, we can determine the probability $\Pr\left(X_1, V_1^T, Y_1^T\right)$ of an individual’s phenotype $Y_1^T$ (this variable is 1 for cases, 0 for controls), his/her coverage value $V_1^T$ at a SNP locus, and his/her observed alternate allele (or rare-variant) count $X_1$ at that locus. Note: $0 \le X_1 \le V_1^T$. The superscript $T$ stands for the true value of a variable. Consider an example SNP, with alleles A and T, where A is the rare variant. We start by determining the conditional probability of $X_1$ given $V_1^T$, $Y_1^T$, and a given genotype $G_1^T$. Here, $0 \le G_1^T \le 2$ is the number of copies of the rare-variant A in the SNP genotype. The conditional probability may be expressed as

$$\mathrm{Bin}\left(X_1; V_1^T; p\left(Y_1^T, G_1^T\right)\right), \tag{2.14}$$

where

$$p\left(Y_1^T, G_1^T\right) = \left(\frac{2 - G_1^T}{2}\right)\varepsilon_{Y_1^T} + \frac{G_1^T}{2}\left(1 - \varepsilon_{Y_1^T}\right)$$

is the probability of a “success”, namely the probability of observing the rare variant A. In this formula, $\varepsilon_{Y_1^T}$ is the probability of misclassifying the sequence call A as T, or vice versa (the error rates are equal for misclassification in either direction; see Eq. (2.12)). To prove this, consider the value $G_1^T = 0$. Then the true genotype is TT. Any observed copies of the A allele in the sequence-reads are errors. The probability of observing $X_1$ such errors is $\mathrm{Bin}\left(X_1; V_1^T; \varepsilon_{Y_1^T}\right)$. Similarly, for the value $G_1^T = 2$, the true genotype is AA. Any observed copies of the A allele in the sequence-reads are correct, and the probability that the A allele is correctly classified is $1 - \varepsilon_{Y_1^T}$. There are $X_1$ observed A alleles in total. The probability of observing this number is $\mathrm{Bin}\left(X_1; V_1^T; 1 - \varepsilon_{Y_1^T}\right)$. When $G_1^T = 1$, the true genotype is AT. The probability of observing an A on a given sequence read is

$$\begin{aligned}
\Pr(\text{observing A allele on sequence read}) &= \Pr(\text{read} = A \mid \text{true allele} = T)\Pr(\text{true allele} = T) \\
&\quad + \Pr(\text{read} = A \mid \text{true allele} = A)\Pr(\text{true allele} = A) \\
&= \varepsilon_{Y_1^T}\left(\frac{1}{2}\right) + \left(1 - \varepsilon_{Y_1^T}\right)\left(\frac{1}{2}\right) \\
&= \varepsilon_{Y_1^T}\left(1 - \frac{1}{2}\right) + \frac{1}{2}\left(1 - \varepsilon_{Y_1^T}\right) \\
&= \frac{(2 - 1)}{2}\varepsilon_{Y_1^T} + \frac{1}{2}\left(1 - \varepsilon_{Y_1^T}\right) = p\left(Y_1^T, G_1^T = 1\right).
\end{aligned}$$

Thus, the probability of observing an A allele is completely determined by the formula $p\left(Y_1^T, G_1^T\right)$ for all genotypes.
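The success probability and the binomial probability of Eq. (2.14) translate directly into code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the error rate 0.01 and the counts (three rare-variant reads out of a coverage of seven) are illustrative values only, and the phenotype enters solely through the choice of error rate passed to the functions.

```python
from math import comb


def p_success(eps, g):
    """Probability that one read shows the rare variant A.

    eps is the sequencing error rate for the individual's phenotype group;
    g is the true number of copies of A in the genotype (0, 1, or 2).
    """
    return (2 - g) / 2 * eps + g / 2 * (1 - eps)


def binom_pmf(x, v, p):
    """Binomial probability of x successes in v trials with success probability p."""
    return comb(v, x) * p ** x * (1 - p) ** (v - x)


def read_count_prob(x, v, eps, g):
    """Eq. (2.14): probability of x observed A reads given coverage v and genotype g."""
    return binom_pmf(x, v, p_success(eps, g))


eps = 0.01  # hypothetical error rate
for g in (0, 1, 2):
    print(g, p_success(eps, g), read_count_prob(3, 7, eps, g))
```

For a heterozygote the success probability is 1/2 whatever the error rate, which matches the derivation above: under the symmetric model, errors in the two directions cancel.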


There are two things to note here. First, we allow for differential misclassification with the subscript $Y_1^T$ in the error parameters. Second, for the heterozygote calculation, we specify the equality Pr(true allele = A) = Pr(true allele = T) = 1/2.

Empirical type I error rate and empirical power

Since this section is the first one in which we use the terms empirical type I error rate and empirical power, and since we will use the terms through the remainder of the book, it behoves us to define the terms. The process for establishing empirical type I error rates is as follows. We specify a model (i.e., parameters) for the test statistic under a null hypothesis. We simulate $R_0$ replicates of data under the null model and compute the test statistic and its corresponding p-value for each replicate. The proportion of the $R_0$ replicates for which the test statistic p-value is less than some pre-specified significance level α is the empirical type I error rate at the significance level α.

For empirical power, we specify a model (i.e., parameters) for the test statistic under an alternative hypothesis. Identical to the empirical type I error process, we simulate $R_1$ replicates of data under the alternative model and apply the test statistic to each replicate’s data set. The proportion of the $R_1$ replicates for which the test statistic p-value is less than α is the empirical power at the significance level α.

The term “empirical” is sometimes replaced by “simulation”. For example, for the Fisher’s exact test implemented in R software, one can specify a number of replicates where data are simulated under the null hypothesis (with the option “simulate.p.value = TRUE”), and a “simulation p-value” is returned (see Sect. 5.4.3.3). The principle is the same as ours.
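The two definitions above can be made concrete with a toy simulation. The sketch below is not one of the tests studied in this book: it swaps in a simple pooled two-proportion z-test on allele-carrier counts, and the carrier frequencies, sample size, number of replicates, and function names are all assumptions chosen so that the example runs quickly.

```python
import math
import random


def two_proportion_p_value(x1, n1, x0, n0):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (x1 + x0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x0 / n0) / se
    return math.erfc(abs(z) / math.sqrt(2))


def empirical_rejection_rate(p_case, p_control, n, replicates, alpha, rng):
    """Proportion of simulated replicates whose p-value is less than alpha."""
    rejections = 0
    for _ in range(replicates):
        x1 = sum(rng.random() < p_case for _ in range(n))     # carriers among cases
        x0 = sum(rng.random() < p_control for _ in range(n))  # carriers among controls
        if two_proportion_p_value(x1, n, x0, n) < alpha:
            rejections += 1
    return rejections / replicates


rng = random.Random(1)  # fixed seed so the sketch is reproducible
alpha = 0.05

# Null model: equal carrier frequencies, so rejections are type I errors.
type_i = empirical_rejection_rate(0.2, 0.2, 200, 2000, alpha, rng)
# Alternative model: different carrier frequencies, so rejections measure power.
power = empirical_rejection_rate(0.3, 0.2, 200, 2000, alpha, rng)
print(type_i, power)
```

The same function serves both purposes; only the model under which the replicates are generated changes, which is the point of the two definitions.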

2.6.3 Empirical Type I Error for Test Statistics Applied to NGS Data with Sequence Error—Simulation Results

[Simulations like these with a larger number of parameter settings are presented in Chap. 5, Sect. 5.4.3.]

There is potential for differential misclassification in the design of association studies using NGS data. For example, researchers usually want to make sure that the classifications for cases are very accurate, whereas they may not be as concerned with the classifications for controls. Hence, in their study design, they may set much higher coverage values for case-sequence data than for control-sequence data when designing their association study. However, this may lead to differential misclassification, in that estimates of parameters may be much more accurate in cases than in controls, especially in the presence of sequence error. As a result, maximum likelihood estimates of parameters such as multi-locus genotype frequencies will appear to be different in the two different groups, and we are more likely to reject the null hypothesis, even if it is true. In this section, we address the issue. One of the important questions regarding sequence error for NGS data is how much does it matter when performing genetic


association testing? We performed simulations in which we generated fictitious NGS data according to the mathematical model above (Sect. 2.6.2). We used a $2^K$ factorial design [163], where K = 5; that is, there were 32 vectors of parameters used. The parameter settings we considered for the simulations are

Genetic Model-Based Settings
Disease prevalence $\phi_1$: 0.05, 0.50.
Disease allele frequency $p_d$: 0.05.
Genotype Relative Risk $R_1$ (heterozygote): 1.0.
Genotype Relative Risk $R_2$ (homozygote): 1.0.
Number of observed SNP loci for each replicate: 4.
Correlation among all SNPs: 0.1, 0.9.

Sequence Error Rates
$\varepsilon_1$ (cases): 0.001.
$\varepsilon_0$ (controls): 0.005, 0.02.
Number of cases $N_A$: 15,000, 30,000.
Ratio $k = N_U/N_A$: 1, 0.50.

EM Algorithm Parameters
Number of starting vectors for each replicate: 250.
Number of iteration-steps for each starting point: 100.
Stopping criterion for difference in log-likelihoods: $1 \times 10^{-6}$.

Simulation Parameters
Number of replicates for each vector of parameter settings: 100.
Significance level for each replicate: $5 \times 10^{-8}$ (pointwise correction for one million independent SNPs in a GWAS), $2.5 \times 10^{-6}$ (pointwise correction for 22,000 genes in Human Genome).
Test statistics: Cumulative Minor Allele Test (CMAT) [164], Sequence Kernel Association Test (SKAT) [165].

The method for simulating sequence data based on the genetic model parameters is described in the work by Kim et al. [47]. The key point is that each statistic is a test for genetic association that considers sequence data at multiple positions. We chose a relatively small number of 100 replicates because each vector of simulations required more than a week to run, even though we ran the programs in parallel. Also, we chose the sample sizes because it is now becoming more commonplace to perform genetic association studies with sample sizes on the order of 15,000 or 30,000 cases (and the same number of controls), or even greater [166–170].
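The 32 vectors of parameter settings can be enumerated mechanically. The sketch below reads the list above as having five two-level factors (prevalence, SNP correlation, control error rate, number of cases, and control-to-case ratio) with every other setting held fixed. That reading, the dictionary keys, and the derived number of controls are assumptions made for illustration, not a description of the authors' simulation program.

```python
from itertools import product

# The five settings that take two values each (K = 5 factors).
factors = {
    "prevalence": (0.05, 0.50),
    "snp_correlation": (0.1, 0.9),
    "control_error_rate": (0.005, 0.02),
    "n_cases": (15_000, 30_000),
    "control_case_ratio": (1.0, 0.50),
}

# Settings with a single value are shared by every vector.
fixed = {
    "disease_allele_freq": 0.05,
    "R1": 1.0,
    "R2": 1.0,
    "n_snps": 4,
    "case_error_rate": 0.001,
}

design = []
for levels in product(*factors.values()):
    setting = dict(zip(factors, levels), **fixed)
    # k = N_U / N_A, so the number of controls is k times the number of cases.
    setting["n_controls"] = int(setting["control_case_ratio"] * setting["n_cases"])
    design.append(setting)

print(len(design))
print(design[0])
```

Each element of `design` is one vector of settings; running a fixed number of replicates per vector then gives one empirical type I error rate per vector.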


Results of our simulations for GWAS and AllGenes significance levels are presented in Fig. 2.8. The empirical type I error rate is the proportion of replicates out of 100 for which the given statistic’s p-value was less than the particular significance level. A robust statistic is one for which the proportion of replicates is approximately equal to the significance level. The set of 32 such proportions, one for each vector of parameter settings, is presented in box plot form in Fig. 2.8. The first thing to mention is that the box plots for each statistic were identical for the GWAS and AllGenes significance levels. This result is almost certainly due to the relatively small number of replicates per vector setting. We present results using the GWAS term in Fig. 2.8. In the first box plot, with empirical type I error rates for CMAT at the $5 \times 10^{-8}$ significance level, the lowest outlier is the value 0.42. This result means that, in the presence of differential misclassification, when the null hypothesis that genotype relative risks $R_1 = R_2 = 1.0$ (the MLGs are not associated with the disease phenotype) is true, CMAT rejects the null hypothesis in 42% of the replicates. The 95% confidence interval based on the binomial proportions is (0.3220, 0.5229). We observe that this interval does not contain 0, the number closest to the expected value ($5 \times 10^{-8}$) ×

Fig. 2.8 Box plot of empirical type I error rates for genetic-association statistics in the presence of differential sequence error


100. The same result is true for the SKAT statistic, where the lowest empirical type I error rate is 0.46. Given that 0.42 and 0.46 are the lowest empirical type I error rates, we may conclude that the empirical type I error rates for both statistics are inflated in the presence of differential misclassification error, for all 32 vectors of parameter settings.
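A confidence interval for a binomial proportion such as 42 rejections in 100 replicates can be computed without special libraries. The sketch below assumes an exact (Clopper-Pearson) interval, which may or may not be the method used for the interval quoted above, and it finds the limits by bisection on the binomial distribution function. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iterations=60):
    """Find p in [0, 1] with f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, conf=0.95):
    """Exact two-sided confidence interval for k successes in n trials."""
    alpha = 1 - conf
    # Lower limit: Pr(X >= k | p) = alpha/2, i.e. Pr(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: Pr(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(42, 100)
print(round(lower, 4), round(upper, 4))
```

Whatever interval method is used, the lower limit for 42 out of 100 is far above the nominal significance level, which is the comparison the text relies on.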

2.7 Non-misclassification Forms of Heterogeneity

All of the work in Sect. 2.1 considered genomic data in which a subset of the data was misclassified. In this section, we consider heterogeneity where there is no misclassification. The type of heterogeneity is locus, or genetic, heterogeneity. The most general model for a mixture distribution of categorical data is

$$\Pr(X = j) = \sum_{k} \Pr(X = j, Y = k) = \sum_{k} \Pr(X = j \mid Y = k)\Pr(Y = k), \tag{2.15}$$

where $X$ is a categorical random variable, like allele frequencies for a given SNP in a population, $Y$ is a categorical random variable, like the non-overlapping subpopulations of individuals, and $j$ and $k$ are the possible values for the respective random variables. For Eq. (2.15) to be meaningful, the two random variables should not be mutually disjoint. For genetic heterogeneity, perhaps the most documented example is population stratification. As of this writing, there are more than 14,500 references on PubMed when using the keywords “population stratification”. We quote from the paper of Alexander et al. [171], who developed the software program ADMIXTURE for ancestry estimation in unrelated individuals. These authors extended a method developed by Pritchard et al. [172], who published a software program known as STRUCTURE. Quoting from Alexander et al., “The typical data set consists of genotypes at a large number J of single nucleotide polymorphisms (SNPs) from a large number I of unrelated individuals. These individuals are drawn from an admixed population with contributions from K postulated ancestral populations. Population k contributes a fraction $q_{ik}$ of individual i’s genome. Allele 1 [of two possible alleles] at SNP j has frequency $f_{kj}$ in population k”. Applying Alexander et al.’s notation to our Eq. (2.15), we have $J = I = 1$, $q_{ik} = q_{1k} = \Pr(Y = k)$, and $f_{kj} = f_{k1} = \Pr(X = 0 \mid Y = k)$, where the event $X = 0$ means an individual having Allele 1 in their SNP genotype.
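Equation (2.15), read with the ancestry fractions and allele frequencies of Alexander et al., is a weighted sum. The sketch below evaluates it for K = 3 hypothetical ancestral populations; the fractions, the allele frequencies, and the function name are invented for illustration.

```python
def mixture_probability(cond_probs, weights):
    """Eq. (2.15): Pr(X = j) = sum over k of Pr(X = j | Y = k) * Pr(Y = k)."""
    if len(cond_probs) != len(weights):
        raise ValueError("one conditional probability is needed per sub-population")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("sub-population probabilities must sum to 1")
    return sum(f * q for f, q in zip(cond_probs, weights))


# Hypothetical values for K = 3 ancestral populations.
q = [0.6, 0.3, 0.1]     # Pr(Y = k): fraction of the genome from population k
f = [0.10, 0.40, 0.80]  # Pr(X = 0 | Y = k): Allele 1 frequency in population k

allele1_freq = mixture_probability(f, q)
print(allele1_freq)
```

The mixed-population frequency always lies between the smallest and largest sub-population frequencies, since the weights are non-negative and sum to one.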

2.7.1 Mathematical Model for Heterogeneity

Our main mathematical model for genotypic heterogeneity is motivated by the work of Zhou and Pan. These authors developed a test statistic for case-control genetic


association that estimates a heterogeneity parameter in addition to an allele frequency parameter [173]. The statistic is known as the Mixture Likelihood Ratio Test, MLRT. This statistic is described in greater detail in Chaps. 4 and 5. Our specification for genotype frequencies in the presence of locus heterogeneity contains two parts. First, the control population is a homogeneous population for which there is no association with the phenotype of a disease of interest. Second, the case population is a mixture of two sub-populations: (i) non-associated cases whose SNP genotype frequencies are equal to the control genotype frequencies; and (ii) associated cases, whose SNP genotype frequencies are different from the control genotype frequencies. Sub-population (i) may be thought of as cases who are affected due to mutations/variants at SNPs/loci at different locations in the genome rather than those at the genotyped SNP [173]. Other possible factors for affection may be environmental exposures, where the penetrances Pr(affected|e,g) satisfy the equations Pr(affected|e,g) = Pr(affected|e), 0 ≤ g ≤ 2. In this equation, e is an environmental variable measurement, and g is a specific genotype at the typed SNP (see, e.g., [174]). We provide the full mathematical model immediately below. First, we provide notation.

2.7.1.1 Notation

While Zhou and Pan consider the one-SNP scenario and also multi-SNP scenarios, we confine our attention to the one-SNP scenario. We provide reasons below.

Group 0 = The combined populations of the controls and the cases who are not associated with the genotyped SNP. Cases in this population have the same genotype frequencies as controls in this population. Zhou and Pan use the index “2” to refer to this group. We choose “0” to be consistent with our notation throughout the book; that is, in this work, the index “0” refers to the control group/population.

Group 1 = The sub-population of cases who are associated with the genotyped SNP. Cases in this population have different genotype frequencies from those of the cases and controls in Group 0.

$\pi$ = The probability that a case/affected individual is “associated” (e.g., has the disease mutations/variants at the SNP locus). That is, this variable represents the probability that a case is in Group 1.

$g_{j0}$, $0 \le j \le 2$ = Conditional genotype frequency for genotype $j$ conditional on the individual being a case in Group 0.

$g^{t}_{j1}$, $0 \le j \le 2$ = Conditional genotype frequency for genotype $j$ conditional on the individual being a case in Group 1.

$g^{*}_{ji}$, $0 \le i \le 1$, $0 \le j \le 2$ = Conditional genotype frequency for genotype $j$ conditional on phenotype $i$ in the entire (mixed) population. The value $j$ is an individual’s coded genotype. Also, $i = 0$ means control, and $i = 1$ means case.

2.7.1.2 Mathematical Model for Locus Heterogeneity—Equations

For all individuals in Group 0, SNP genotype frequencies may be written as $g_{00}$, $g_{10}$, and $g_{20}$. These values may be computed either by genetic model-free or genetic model-based means (Sects. 1.4.1 and 1.4.2 of Chap. 1). For individuals in Group 1, we use the notation $g^{t}_{01}$, $g^{t}_{11}$, and $g^{t}_{21}$ to indicate the SNP genotype frequencies. These genotype frequencies may be computed by genetic model-free or genetic model-based methods as well. The genotype frequencies for the entire (mixed) population of cases and controls are

$$\begin{aligned}
g^{*}_{j1} &= \pi g^{t}_{j1} + (1 - \pi) g_{j0}, && \text{(Group 1)} \\
g^{*}_{j0} &= g_{j0}. && \text{(Group 0)}
\end{aligned} \tag{2.16}$$

For both sets of frequencies in Eq. (2.16), 0 ≤ j ≤ 2. The equation for the entire population of cases is determined using the definition of conditional probability. Zhou and Pan [173] reduced the number of parameters in Eq. (2.16) by using the genetic model-free specification that the genotype frequencies in Groups 0 and 1 follow HWE proportions (Sect. 1.4.1.1 of Chap. 1). There is precedent for using a single mixture parameter π. For linkage analysis, Ott used a single mixture parameter to group pedigrees into linked and unlinked groups [175]. Zhou and Pan (already discussed) and Londono et al. [176] used a single mixture parameter to group cases and genotyped trios, respectively, into associated and non-associated individuals or trios. At its core, the rationale for using this model is the idea that we are primarily interested in identifying subsets of associated individuals/families in the presence of heterogeneity. This model is considered extensively in Chaps. 4 and 5.
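Equation (2.16), combined with the HWE specification attributed above to Zhou and Pan, can be evaluated in a few lines. In the sketch below the allele frequencies (0.20 and 0.35), the mixture parameter (0.25), the function names, and the coding of genotype j as the number of copies of the allele with frequency p are all assumptions made for illustration.

```python
def hwe_frequencies(p):
    """HWE genotype frequencies for 0, 1, and 2 copies of an allele with frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def mixed_case_frequencies(pi, g_group1, g_group0):
    """Eq. (2.16) for cases: g*_j1 = pi * g^t_j1 + (1 - pi) * g_j0, for j = 0, 1, 2."""
    return [pi * g1 + (1 - pi) * g0 for g1, g0 in zip(g_group1, g_group0)]


# Hypothetical parameter values.
g0 = hwe_frequencies(0.20)  # Group 0: controls and non-associated cases
g1 = hwe_frequencies(0.35)  # Group 1: associated cases
pi = 0.25                   # probability that a case is in Group 1

g_star_cases = mixed_case_frequencies(pi, g1, g0)
g_star_controls = g0        # Eq. (2.16) for controls: g*_j0 = g_j0

print(g_star_cases, sum(g_star_cases))
```

With this parameterization the whole model is described by two allele frequencies and the single mixture parameter. Setting the mixture parameter to zero makes the case and control genotype frequencies identical, which corresponds to no association.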

References

1. Matise, T.C., Sachidanandam, R., Clark, A.G., Kruglyak, L., Wijsman, E., Kakol, J., et al.: A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am. J. Hum. Genet. 73(2), 271–284 (2003)
2. Cottingham, R.W., Jr., Idury, R.M., Schaffer, A.A.: Faster sequential genetic linkage computations. Am. J. Hum. Genet. 53(1), 252–263 (1993)
3. Schaffer, A.A., Gupta, S.K., Shriram, K., Cottingham, R.W., Jr.: Avoiding recomputation in linkage analysis. Hum. Hered. 44(4), 225–237 (1994)
4. Lathrop, G.M., Lalouel, J.M.: Easy calculations of lod scores and genetic risks on small computers. Am. J. Hum. Genet. 36(2), 460–465 (1984)
5. Lathrop, G.M., Lalouel, J.M., Julier, C., Ott, J.: Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am. J. Hum. Genet. 37(3), 482–498 (1985)
6. Liu, R., Dai, Z., Yeager, M., Irizarry, R.A., Ritchie, M.E.: KRLMM: an adaptive genotype calling method for common and low frequency variants. BMC Bioinform. 15, 158 (2014). https://doi.org/10.1186/1471-2105-15-158
7. Wang, Y., Lu, J., Yu, J., Gibbs, R.A., Yu, F.: An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 23(5), 833–842 (2013). https://doi.org/10.1101/gr.146084.112
8. Rippe, R.C., Meulman, J.J., Eilers, P.H.: Reliable single chip genotyping with semi-parametric log-concave mixtures. PLoS ONE 7(10), e46267 (2012). https://doi.org/10.1371/journal.pone.0046267
9. Bourgey, M., Lariviere, M., Richer, C., Sinnett, D.: ALG: automated genotype calling of Luminex assays. PLoS ONE 6(5), e19368 (2011). https://doi.org/10.1371/journal.pone.0019368
10. Wright, M.H., Tung, C.W., Zhao, K., Reynolds, A., McCouch, S.R., Bustamante, C.D.: Alchemy: a reliable method for automated SNP genotype calling for small batch sizes and highly homozygous populations. Bioinformatics 26(23), 2952–2960 (2010). https://doi.org/10.1093/bioinformatics/btq533
11. Bucasas, K.L., Pandya, G.A., Pradhan, S., Fleischmann, R.D., Peterson, S.N., Belmont, J.W.: Assessing the utility of whole-genome amplified serum DNA for array-based high throughput genotyping. BMC Genet. 10, 85 (2009). https://doi.org/10.1186/1471-2156-10-85
12. Giannoulatou, E., Yau, C., Colella, S., Ragoussis, J., Holmes, C.C.: GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics 24(19), 2209–2214 (2008). https://doi.org/10.1093/bioinformatics/btn386
13. Xiao, Y., Segal, M.R., Yang, Y.H., Yeh, R.F.: A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays. Bioinformatics 23(12), 1459–1467 (2007). https://doi.org/10.1093/bioinformatics/btm131
14. Wang, Y., Feng, E., Wang, R.: A clustering algorithm based on two distance functions for MEC model. Comput. Biol. Chem. 31(2), 148–150 (2007). https://doi.org/10.1016/j.compbiolchem.2007.02.001
15. Smith, E.M., Littrell, J., Olivier, M.: Automated SNP genotype clustering algorithm to improve data completeness in high-throughput SNP genotyping datasets from custom arrays. Genomics Proteomics Bioinform. 5(3–4), 256–259 (2007). https://doi.org/10.1016/S1672-0229(08)60014-5
16. Moorhead, M., Hardenbol, P., Siddiqui, F., Falkowski, M., Bruckner, C., Ireland, J., et al.: Optimal genotype determination in highly multiplexed SNP data. Eur. J. Hum. Genet. 14(2), 207–215 (2006). https://doi.org/10.1038/sj.ejhg.5201528
17. Huentelman, M.J., Craig, D.W., Shieh, A.D., Corneveaux, J.J., Hu-Lince, D., Pearson, J.V., Stephan, D.A.: Sniper: improved SNP genotype calling for Affymetrix 10K genechip microarray data. BMC Genomics 6, 149 (2005). https://doi.org/10.1186/1471-2164-6-149
18. Olivier, M., Chuang, L.M., Chang, M.S., Chen, Y.T., Pei, D., Ranade, K., et al.: High-throughput genotyping of single nucleotide polymorphisms using new biplex invader technology. Nucl. Acids Res. 30(12), e53 (2002). https://doi.org/10.1093/nar/gnf052
19. Pompanon, F., Bonin, A., Bellemain, E., Taberlet, P.: Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. 6(11), 847–859 (2005). https://doi.org/10.1038/nrg1707
20. Gordon, D., Finch, S.J.: Factors affecting statistical power in the detection of genetic association. J. Clin. Invest. 115(6), 1408–1418 (2005). https://doi.org/10.1172/JCI24756
21. Gordon, D., Finch, S.J.: Consequences of error. Encyclopedia of Genetics, Genomics, Proteomics, and Bioinformatics, 1, 1.4 (2006)
22. Anderson, C.A., Pettersson, F.H., Clarke, G.M., Cardon, L.R., Morris, A.P., Zondervan, K.T.: Data quality control in genetic case-control association studies. Nat. Protoc. 5(9), 1564–1573 (2010). https://doi.org/10.1038/nprot.2010.116
23. Edwards, A.W.F.: Likelihood, Expanded edn. The Johns Hopkins University Press, Baltimore (1992)
24. Hogg, R.V., Craig, A.T.: Introduction to Mathematical Statistics, 4th edn. Macmillan, New York, NY
25. Ott, J.: Analysis of Human Genetic Linkage, 3rd edn. The Johns Hopkins University Press, Baltimore, MD (1999)

90

2 Overview of Genomic Heterogeneity in Statistical Genetics

26. Douglas, J.A., Skol, A.D., Boehnke, M.: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am. J. Hum. Genet. 70(2), 487–495 (2002). https://doi.org/10.1086/338919 27. Sobel, E., Papp, J.C., Lange, K.: Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet. 70(2), 496–508 (2002). https://doi.org/10.1086/338920 28. Mote, V.L., Anderson, R.L.: An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52, 95–109 (1965) 29. Levenstien, M.A., Ott, J., Gordon, D.: Are molecular haplotypes worth the time and expense? A cost-effective method for applying molecular haplotypes. PLoS Genet. 2(8), e127 (2006). https://doi.org/10.1371/journal.pgen.0020127 30. Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., Haroutunian, V.: Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat. Appl. Genet. Mol. Biol. 3, Article 26 (2004). https:// doi.org/10.2202/1544-6115.1085 31. Gordon, D., Finch, S.J., Nothnagel, M., Ott, J.: Power and sample size calculations for casecontrol genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum. Hered. 54(1), 22–33 (2002). https://doi.org/10.1159/000066696 32. Kang, S.J., Finch, S.J., Haynes, C., Gordon, D.: Quantifying the percent increase in minimum sample size for SNP genotyping errors in genetic model-based association studies. Hum. Hered. 58(3–4), 139–144 (2004). https://doi.org/10.1159/000083540 33. Kang, S.J., Gordon, D., Finch, S.J.: What SNP genotyping errors are most costly for genetic association studies? Genet. Epidemiol. 26(2), 132–141 (2004). https://doi.org/10.1002/gepi. 10301 34. Ahn, K., Gordon, D., Finch, S.J.: Increase of rejection rate in case-control studies with the differential genotyping error rates. 
Stat. Appl. Genet. Mol. Biol. 8, Article 25 (2009). https:// doi.org/10.2202/1544-6115.1429 35. Moskvina, V., Craddock, N., Holmans, P., Owen, M.J., O’Donovan, M.C.: Effects of differential genotyping error rate on the type I error probability of case-control studies. Hum. Hered. 61(1), 55–64 (2006). https://doi.org/10.1159/000092553 36. Mayer-Jochimsen, M., Fast, S., Tintle, N.L.: Assessing the impact of differential genotyping errors on rare variant tests of association. PLoS ONE 8(3), e56626 (2013). https://doi.org/10. 1371/journal.pone.0056626 37. Lu, Y., Day, F.R., Gustafsson, S., Buchkovich, M.L., Na, J., Bataille, V.et al.: New loci for body fat percentage reveal link between adiposity and cardiometabolic disease risk. Nat. Commun. 7 (2016). https://doi.org/10.1038/ncomms10495 38. Liu, C.-T., Raghavan, S., Maruthur, N., Kabagambe, E.K., Hong, J., Ng, M.C.Y., et al.: Transethnic meta-analysis and functional annotation illuminates the genetic architecture of fasting glucose and insulin. Am. J. Hum. Genet. (2016). https://doi.org/10.1016/j.ajhg.2016.05.006 39. Fehringer, G., Kraft, P., Pharoah, P.D.P., Eeles, R.A., Chatterjee, N., Schumacher, F.R., et al.: Cross-cancer genome-wide analysis of lung, ovary, breast, prostate and colorectal cancer reveals novel pleiotropic associations. Cancer Res. (2016). https://doi.org/10.1158/ 0008-5472.can-15-2980 40. Clayton, D.G., Walker, N.M., Smyth, D.J., Pask, R., Cooper, J.D., Maier, L.M., et al.: Population structure, differential bias and genomic control in a large-scale case-control association study. Nat. Genet. 37(11), 1243–1246 (2005). https://doi.org/10.1038/ng1653 41. Plagnol, V., Cooper, J.D., Todd, J.A., Clayton, D.G.: A method to address differential bias in genotyping in large-scale association studies. PLoS Genet. 3(5), e74 (2007). https://doi.org/ 10.1371/journal.pgen.0030074 42. 
Londono, D., Haynes, C., De La Vega, F.M., Finch, S.J., Gordon, D.: A cost-effective statistical method to correct for differential genotype misclassification when performing case-control genetic association. Hum. Hered. 70(2), 102–108 (2010). https://doi.org/10.1159/000314470 43. Lash, T.L., Ahern, T.P.: Bias analysis to guide new data collection. Int. J. Biostat. 8(2) (2012). https://doi.org/10.2202/1557-4679.1345

References

91

44. Garner, C.: Confounded by sequencing depth in association studies of rare alleles. Genet. Epidemiol. 35(4), 261–268 (2011). https://doi.org/10.1002/gepi.20574 45. Kim, K.Z., Shin, A., Lee, Y.S., Kim, S.Y., Kim, Y., Lee, E.S.: Polymorphisms in adiposityrelated genes are associated with age at menarche and menopause in breast cancer patients and healthy women. Hum. Reprod. 27(7), 2193–2200 (2012). https://doi.org/10.1093/hum rep/des147 46. Dahabreh, I.J., Schmid, C.H., Lau, J., Varvarigou, V., Murray, S., Trikalinos, T.A.: Genotype misclassification in genetic association studies of the rs1042522 TP53 (Arg72pro) polymorphism: a systematic review of studies of breast, lung, colorectal, ovarian, and endometrial cancer. Am. J. Epidemiol. 177(12), 1317–1325 (2013). https://doi.org/10.1093/aje/kws394 47. Kim, W., Londono, D., Zhou, L., Xing, J., Nato, A.Q., Musolf, A., et al.: Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Hum. Hered. 74(3–4), 172–183 (2012). https://doi.org/10.1159/ 000346824 48. Gordon, D., Finch, S.J., De La Vega, F.M.: A new expectation-maximization statistical test for case-control association studies considering rare variants obtained by high-throughput sequencing. Hum. Hered. 71(2), 113–125 (2011). https://doi.org/10.1159/000325590 49. Ahti, T.M., Makivaara, L.A., Luukkaala, T., Hakama, M., Laurikka, J.O.: Effect of family history on the risk of varicose veins is affected by differential misclassification. J. Clin. Epidemiol. 63(6), 686–690 (2010). https://doi.org/10.1016/j.jclinepi.2009.10.003 50. Garcia-Closas, M., Thompson, W.D., Robins, J.M.: Differential misclassification and the assessment of gene-environment interactions in case-control studies. Am. J. Epidemiol. 147(5), 426–433 (1998) 51. Cheng, K.F., Lin, W.J.: The effects of misclassification in studies of gene-environment interactions. Hum. Hered. 67(2), 77–87 (2009). 
https://doi.org/10.1159/000179556 52. Leu, M., Czene, K., Reilly, M.: Bias correction of estimates of familial risk from populationbased cohort studies. Int. J. Epidemiol. 39(1), 80–88 (2010). https://doi.org/10.1093/ije/ dyp304 53. Szatmari, P., Jones, M.B.: Effects of misclassification on estimates of relative risk in family history studies. Genet. Epidemiol. 16(4), 368–381 (1999). https://doi.org/10.1002/ (SICI)1098-2272(1999)16:43.0.CO;2-A 54. Pearce, C.L., Van Den Berg, D.J., Makridakis, N., Reichardt, J.K.V., Ross, R.K., Pike, M.C., et al.: No association between the Srd5a2 gene A49t missense variant and prostate cancer risk: lessons learned. Hum Mol Genet 17(16), 2456–2461 (2008). https://doi.org/10.1093/ hmg/ddn145 55. Miller, C.R., Joyce, P., Waits, L.P.: Assessing allelic dropout and genotype reliability using maximum likelihood. Genetics 160(1), 357–366 (2002) 56. Borchers, B., Brown, M., McLellan, B., Bekmetjev, A., Tintle, N.L.: Incorporating duplicate genotype data into linear trend tests of genetic association: methods and cost-effectiveness. Stat. Appl. Genet. Mol. Biol. 8, Article 24 (2009). https://doi.org/10.2202/1544-6115.1433 57. Tintle, N., Gordon, D., Van Bruggen, D., Finch, S.: The cost effectiveness of duplicate genotyping for testing genetic association. Ann. Hum. Genet. 73(Pt 3), 370–378 (2009). https:// doi.org/10.1111/j.1469-1809.2009.00516.x 58. Tintle, N.L., Ahn, K., Mendell, N.R., Gordon, D., Finch, S.J.: Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and center for inherited disease research. BMC Genet. 6 (Suppl 1), S154 (2005). https://doi.org/10.1186/1471-21566-S1-S154 59. Tintle, N.L., Gordon, D., McMahon, F.J., Finch, S.J.: Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Stat. Appl. Genet. Mol. Biol. 6, Article 4 (2007). https://doi.org/10.2202/1544-6115.1251 60. 
Lai, R., Zhang, H., Yang, Y.: Repeated measurement sampling in genetic association analysis with genotyping errors. Genet. Epidemiol. 31(2), 143–153 (2007). https://doi.org/10.1002/ gepi.20197

92

2 Overview of Genomic Heterogeneity in Statistical Genetics

61. Gordon, D., Haynes, C., Yang, Y., Kramer, P.L., Finch, S.J.: Linear Trend Tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet. Epidemiol. 31(8), 853–870 (2007). https://doi.org/10.1002/gepi.20246 62. Tenenbein, A.: A double sampling scheme for estimating from binomial data with misclassifications. J. Am. Stat. Assoc. 65(331), 1350–1361 (1970) 63. Tenenbein, A.: A double sampling scheme for estimating from binomial data with misclassifications: sample size determination. Biometrics 27, 935–944 (1971) 64. Tenenbein, A.: A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics 14(1), 187–202 (1972) 65. Zhu, W., Kuk, A.Y., Guo, J.: Haplotype inference for population data with genotyping errors. Biom. J. 51(4), 644–658 (2009). https://doi.org/10.1002/bimj.200800215 66. Zou, G., Pan, D., Zhao, H.: Genotyping error detection through tightly linked markers. Genetics 164(3), 1161–1173 (2003) 67. Hosking, L., Lumsden, S., Lewis, K., Yeo, A., McCarthy, L., Bansal, A., et al.: Detection of genotyping errors by hardy-weinberg equilibrium testing. Eur. J. Hum. Genet. 12(5), 395–399 (2004). https://doi.org/10.1038/sj.ejhg.5201164 68. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., et al.: Plink: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81(3), 559–575 (2007). https://doi.org/10.1086/519795 69. Johnson, E.O., Hancock, D.B., Gaddis, N.C., Levy, J.L., Page, G., Novak, S.P., et al.: Novel genetic locus implicated for HIV-1 acquisition with putative regulatory links to HIV replication and infectivity: a genome-wide association study. PLoS ONE 10(3), e0118149 (2015). https:// doi.org/10.1371/journal.pone.0118149 70. Wittke-Thompson, J.K., Pluzhnikov, A., Cox, N.J.: Rational inferences about departures from Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 
76(6), 967–986 (2005). https://doi.org/10. 1086/430507 71. Leal, S.M.: Detection of genotyping errors and pseudo-SNPs via deviations from HardyWeinberg equilibrium. Genet. Epidemiol. 29(3), 204–214 (2005). https://doi.org/10.1002/ gepi.20086 72. Ahn, K., Haynes, C., Kim, W., Fleur, R.S., Gordon, D., Finch, S.J.: The Effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies. Ann. Hum. Genet. 71(Pt 2), 249–261 (2007). https://doi.org/10.1111/j. 1469-1809.2006.00318.x 73. Risch, N., Merikangas, K.: The future of genetic studies of complex human diseases. Science 273(5281), 1516–1517 (1996) 74. Gordon, D., Devoto, M.: Advances in family-based association analysis. Introduction. Hum. Hered. 66(2), 65–66 (2008). https://doi.org/10.1159/000119106 75. Ott, J.: Linkage analysis with misclassification at one locus. Clin. Genet. 12(2), 119–124 (1977) 76. Stringham, H.M., Boehnke, M.: Identifying marker typing incompatibilities in linkage analysis. Am. J. Hum. Genet. 59(4), 946–950 (1996) 77. Broman, K.W.: Cleaning genotype data. Genet. Epidemiol. 17(Suppl 1), S79-83 (1999) 78. Douglas, J.A., Boehnke, M., Lange, K.: A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet. 66(4), 1287–1297 (2000) 79. Goring, H.H., Terwilliger, J.D.: Linkage analysis in the presence of errors II: marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am. J. Hum. Genet. 66(3), 1107–1118 (2000) 80. Abecasis, G.R., Cherny, S.S., Cardon, L.R.: The impact of genotyping error on family-based analysis of quantitative traits. Eur. J. Hum. Genet. 9(2), 130–134 (2001). https://doi.org/10. 1038/sj.ejhg.5200594 81. Akey, J.M., Zhang, K., Xiong, M., Doris, P., Jin, L.: The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am. J. Hum. Genet. 68(6), 1447–1456 (2001)

References

93

82. Gordon, D., Heath, S.C., Liu, X., Ott, J.: A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am. J. Hum. Genet. 69(2), 371–380 (2001). https://doi.org/10.1086/321981 83. Geller, F., Ziegler, A.: Detection rates for genotyping errors in SNPs using the trio design. Hum. Hered. 54(3), 111–117 (2002) 84. Badzioch, M.D., DeFrance, H.B., Jarvik, G.P.: An examination of the genotyping error detection function of SimWalk2. BMC Genet. 4 (Suppl 1), S40 (2003) 85. Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J.: Tradeoff between no-call reduction in genotyping error rate and loss of sample size for genetic case/control association studies. In: Pacific Symposium on Biocomputing, pp. 116–127 (2004) 86. Gordon, D., Heath, S.C., Ott, J.: True pedigree errors more frequent than apparent errors for single nucleotide polymorphisms. Hum. Hered. 49(2), 65–70 (1999) 87. Cheung, C.Y., Thompson, E.A., Wijsman, E.M.: Detection of Mendelian consistent genotyping errors in pedigrees. Genet. Epidemiol. 38(4), 291–299 (2014). https://doi.org/10.1002/ gepi.21806 88. O’Connell, J.R., Weeks, D.E.: Pedcheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet. 63(1), 259–266 (1998). https://doi.org/10. 1086/301904 89. Abecasis, G.R., Cherny, S.S., Cookson, W.O., Cardon, L.R.: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30(1), 97–101 (2002). https://doi.org/ 10.1038/ng786ng786 90. Lathrop, G.M., Huntsman, J.W., Hooper, A.B., Ward, R.H.: Evaluating pedigree data. II. Identifying the cause of error in families with inconsistencies. Hum. Hered. 33(6), 377–389 (1983) 91. Mukhopadhyay, N., Buxbaum, S.G., Weeks, D.E.: Comparative study of multipoint methods for genotype error detection. Hum. Hered. 58(3–4), 175–189 (2004) 92. 
Gordon, D., Leal, S.M., Heath, S.C., Ott, J.: An analytic solution to single nucleotide polymorphism error-detection rates in nuclear families: implications for study design. In: Pacific Symposium on Biocomputing, pp. 663–674 (2000) 93. Anney, R.J., Kenny, E., O’Dushlaine, C.T., Lasky-Su, J., Franke, B., Morris, D.W., et al.: Nonrandom error in genotype calling procedures: implications for family-based and case-control genome-wide association studies. Am. J. Med. Genet. B Neuropsychiatr. Genet. 147b(8), 1379–1386 (2008). https://doi.org/10.1002/ajmg.b.30836 94. Cheng, K.F., Chen, J.H.: A simple and robust TDT-type test against genotyping error with error rates varying across families. Hum. Hered. 64(2), 114–122 (2007). https://doi.org/10. 1159/000101963 95. Cobat, A., Abel, L., Alcais, A., Schurr, E.: A general efficient and flexible approach for genome-wide association analyses of imputed genotypes in family-based designs. Genet. Epidemiol. 38(6), 560–571 (2014). https://doi.org/10.1002/gepi.21842 96. Leal, S.M., Yan, K., Muller-Myhsok, B.: Simped: a simulation program to generate haplotype and genotype data for pedigree structures. Hum. Hered. 60(2), 119–122 (2005). https://doi. org/10.1159/000088914 97. Pilipenko, V.V., He, H., Kurowski, B.G., Alexander, E.S., Zhang, X., Ding, L., et al.: Using Mendelian inheritance errors as quality control criteria in whole genome sequencing data set. BMC Proc. 8(Suppl 1 Genetic Analysis Workshop 18 Vanessa Olmo), S21 (2014). https:// doi.org/10.1186/1753-6561-8-s1-s21 98. Wijsman, E.M.: Family-based approaches: design, imputation, analysis, and beyond. BMC Genet. 17(Suppl 2), 9 (2016). https://doi.org/10.1186/s12863-015-0318-5 99. Yan, Q., Chen, R., Sutcliffe, J.S., Cook, E.H., Weeks, D.E., Li, B., Chen, W.: The impact of genotype calling errors on family-based studies. Sci. Rep. 6, 28323 (2016). https://doi.org/ 10.1038/srep28323 100. 
Yang, Y., Wise, C.A., Gordon, D., Finch, S.J.: A family-based likelihood ratio test for general pedigree structures that allows for genotyping error and missing data. Hum. Hered. 66(2), 99–110 (2008). https://doi.org/10.1159/000119109

94

2 Overview of Genomic Heterogeneity in Statistical Genetics

101. Yu, Z.: Family-based association tests using genotype data with uncertainty. Biostatistics 13(2), 228–240 (2012). https://doi.org/10.1093/biostatistics/kxr045 102. Heath, S.C., Ott, J.: TDT with errors: a likelihood based approach. Am. J. Hum. Genet. 65(4), A253–A253 (1999) 103. Bernardinelli, L., Berzuini, C., Seaman, S., Holmans, P.: Bayesian trio models for association in the presence of genotyping errors. Genet. Epidemiol. 26(1), 70–80 (2004). https://doi.org/ 10.1002/gepi.10291 104. Morris, R.W., Kaplan, N.L.: Testing for association with a case-parents design in the presence of genotyping errors. Genet. Epidemiol. 26(2), 142–154 (2004). https://doi.org/10.1002/gepi. 10297 105. Gordon, D., Haynes, C., Johnnidis, C., Patel, S.B., Bowcock, A.M., Ott, J.: A transmission disequilibrium test for general pedigrees that is robust to the presence of random genotyping errors and any number of untyped parents. Eur. J. Hum. Genet. 12(9), 752–761 (2004). https:// doi.org/10.1038/sj.ejhg.52012195201219 106. Contributors, W.: DNA Sequencing (2015) 107. de Magalhães, J.P., Finch, C.E., Janssens, G.: Next-generation sequencing in aging research: emerging applications, problems, pitfalls and possible solutions. Ageing Res. Rev. 9(3), 315– 323 (2010). https://doi.org/10.1016/j.arr.2009.10.006 108. Hall, N.: Advanced sequencing technologies and their wider impact in microbiology. J. Exp. Biol. 210(9), 1518–1525 (2007). https://doi.org/10.1242/jeb.001370 109. Church, G.M.: Genomes for all. Sci. Am. 294(1), 46–54 (2006) 110. Schuster, S.C.: Next-generation sequencing transforms today’s biology. Nat. Meth. 5(1), 16– 18 (2008) 111. Kalb, G., Moxley, R.: Massively Parallel, Optical, and Neural Computing in the United States. IOS Press, Amsterdam, Oxford, Washington, Tokyo (1992) 112. ten Bosch, J.R., Grody, W.W.: Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J. Mol. Diagn. 10(6), 484–492 (2008). 
https://doi.org/ 10.2353/jmoldx.2008.080027 113. Tucker, T., Marra, M., Friedman, J.M.: Massively parallel sequencing: the next big thing in genetic medicine. Am. J. Hum. Genet. 85(2), 142–154 (2009). https://doi.org/10.1016/j.ajhg. 2009.06.022 114. Maher, B.: Personal genomes: the case of the missing heritability. Nature 456(7218), 18–21 (2008). https://doi.org/10.1038/456018a 115. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., et al.: Finding the missing heritability of complex diseases. Nature 461(7265), 747–753 (2009). https://doi.org/10.1038/nature08494 116. Genetics Home Reference (2018). https://ghr.nlm.nih.gov/ 117. Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K.: dbSNP: The NCBI database of genetic variation. Nucl. Acids Res. 29(1), 308–311 (2001) 118. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., et al.: The human genome browser at UCSC. Genome Res. 12(6), 996–1006 (2002). https://doi.org/10. 1101/gr.229102 119. Karolchik, D., Hinrichs, A.S., Furey, T.S., Roskin, K.M., Sugnet, C.W., Haussler, D., Kent, W.J.: The UCSC table browser data retrieval tool. Nucl. Acids Res. 32(Database issue), D493–496 (2004). https://doi.org/10.1093/nar/gkh103 120. Wikipedia: Reference Genome (2018). https://en.wikipedia.org/wiki/Reference_genome 121. Contributors, W.: Reference genome. In: Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia (2018) 122. Li, W., Freudenberg, J.: Mappability and read length. Front. Genet. 5(381) (2014). https://doi. org/10.3389/fgene.2014.00381 123. Figure—mapping sequence reads. https://en.wikipedia.org/wiki/DNA_sequencing#/media/ File:Mapping_Reads.png. Accessed 7 May 2020 124. Wikipedia: Coverage (Genetics) (2016). https://en.wikipedia.org/wiki/Coverage_(genetics)

References

95

125. Illumina: Coverage depth recommendations (2018). https://www.illumina.com/science/edu cation/sequencing-coverage.html 126. Robasky, K., Lewis, N.E., Church, G.M.: The role of replicates for error mitigation in nextgeneration sequencing. Nat. Rev. Genet. 15(1), 56–62 (2014). https://doi.org/10.1038/nrg 3655 127. Zhou, L.: A Statistical Method for Genotypic Association That Is Robust to Sequencing Misclassification. The State University of New Jersey, Rutgers (2017) 128. 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010) 129. 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012) 130. A global reference for human genetic variation. Nature 526(7571), 68–74 (2015). https://doi. org/10.1038/nature15393 131. Sudmant, P.H., Rausch, T., Gardner, E.J., Handsaker, R.E., Abyzov, A., Huddleston, J., et al.: An integrated map of structural variation in 2,504 human genomes. Nature 526(7571), 75–81 (2015). https://doi.org/10.1038/nature15394 132. Calling SNPs/INDELs with SAMtools/BCFtools (2018). https://www.htslib.org/https://sam tools.sourceforge.net/mpileup.shtml 133. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al.: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009). https://doi. org/10.1093/bioinformatics/btp352 134. Project, T.G.: The 1000 genomes project phase 3 archive (2015). ftp://ftp.1000genomes.ebi. ac.uk/vol1/ftp/phase3/data/ 135. Li, H.: A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27(21), 2987–2993 (2011). https://doi.org/10.1093/bioinformatics/btr509 136. Danecek, P., Schiffels, S., Durbin, R.: Multiallelic calling model in Bcftools (-M), p. 2 (2016) 137. The variant call format (Vcf) version 4.2 specification, p. 28 (2017) 138. 
Ross, M.G., Russ, C., Costello, M., Hollinger, A., Lennon, N.J., Hegarty, R., et al.: Characterizing and measuring bias in sequence data. Genome Biol. 14(5), R51 (2013). https://doi. org/10.1186/gb-2013-14-5-r51 139. Goldstein, D.R., Zhao, H., Speed, T.P.: The effects of genotyping errors and interference on estimation of genetic distance. Hum. Hered. 47(2), 86–100 (1997) 140. Hou, L., Sun, N., Mane, S., Sayward, F., Rajeevan, N., Cheung, K.H., et al.: Impact of genotyping errors on statistical power of association tests in genomic analyses: a case study. Genet. Epidemiol. 41(2), 152–162 (2017). https://doi.org/10.1002/gepi.22027 141. Huebner, C., Petermann, I., Browning, B.L., Shelling, A.N., Ferguson, L.R.: Triallelic single nucleotide polymorphisms and genotyping error in genetic epidemiology studies: MDR1 (ABCB1) G2677/T/a as an example. Cancer Epidemiol. Biomarkers Prev. 16 (2007). https:// doi.org/10.1158/1055-9965.epi-06-0759 142. Knapp, M., Becker, T.: Impact of genotyping errors on type I error rate of the haplotypesharing transmission/disequilibrium test (HS-TDT). Am. J. Hum. Genet. 74(3), 589–591; author reply 591–583 (2004 143. Marquard, V., Beckmann, L., Heid, I.M., Lamina, C., Chang-Claude, J.: Impact of genotyping errors on the type I error rate and the power of haplotype-based association methods. BMC Genet. 10, 3 (2009). https://doi.org/10.1186/1471-2156-10-3 144. Miller, M.B., Schwander, K., Rao, D.C.: Genotyping errors and their impact on genetic analysis. Adv. Genet. 60, 141–152 (2008). https://doi.org/10.1016/S0065-2660(07)00406-3 145. Mitchell, A.A., Cutler, D.J., Chakravarti, A.: Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am. J. Hum. Genet. 72(3), 598–610 (2003). https://doi.org/10.1086/368203 146. Ott, J.: Issues in association analysis: error control in case-control association studies for disease gene discovery. Hum. Hered. 58(3–4), 171–174 (2004)

96

2 Overview of Genomic Heterogeneity in Statistical Genetics

147. Powers, S., Gopalakrishnan, S., Tintle, N.: Assessing the impact of non-differential genotyping errors on rare variant tests of association. Hum. Hered. 72(3), 153–160 (2011). https://doi. org/10.1159/000332222 148. Seaman, S.R., Holmans, P.: Effect of genotyping error on type-I error rate of affected sib pair studies with genotyped parents. Hum. Hered. 59(3), 157–164 (2005). https://doi.org/10.1159/ 000085939 149. Tung, L., Gordon, D., Finch, S.J.: The impact of genotype misclassification errors on the power to detect a gene-environment interaction using cox proportional hazards modeling. Hum. Hered. 63(2), 101–110 (2007). https://doi.org/10.1159/000099182 150. Cochran, W.G.: The chi-square test of goodness of fit. Ann. Math. Stat. 23(3), 315–345 (1952) 151. Li, H.: Toward Better Understanding of Artifacts in Variant Calling from High-Coverage Samples. Bioinformatics 30(20), 2843–2851 (2014). https://doi.org/10.1093/bioinformatics/ btu356 152. Yang, X., Chockalingam, S.P., Aluru, S.: A survey of error-correction methods for nextgeneration sequencing. Brief Bioinform. 14(1), 56–66 (2013). https://doi.org/10.1093/bib/ bbs015 153. Capobianchi, M.R., Giombini, E., Rozera, G.: Next-generation sequencing technology in clinical virology. Clin. Microbiol. Infect. 19(1), 15–22 (2013). https://doi.org/10.1111/14690691.12056 154. Annala, M.J., Parker, B.C., Zhang, W., Nykter, M.: Fusion genes and their discovery using high throughput sequencing. Cancer Lett. 340(2), 192–200 (2013). https://doi.org/10.1016/j. canlet.2013.01.011 155. Ozsolak, F.: Third-generation sequencing techniques and applications to drug discovery. Expert Opin. Drug Discov. 7(3), 231–243 (2012). https://doi.org/10.1517/17460441.2012. 660145 156. Lee, H., Tang, H.: Next-generation sequencing technologies and fragment assembly algorithms. Methods Mol. Biol. 855, 155–174 (2012). https://doi.org/10.1007/978-1-61779582-4_5 157. 
Cordero, F., Beccuti, M., Donatelli, S., Calogero, R.A.: Large disclosing the nature of computational tools for the analysis of next generation sequencing data. Curr. Top. Med. Chem. 12(12), 1320–1330 (2012) 158. Beerenwinkel, N., Zagordi, O.: Ultra-deep sequencing for the analysis of viral populations. Curr. Opin. Virol. 1(5), 413–418 (2011). https://doi.org/10.1016/j.coviro.2011.07.008 159. Nagarajan, N., Pop, M.: Sequencing and genome assembly using next-generation technologies. Methods Mol. Biol. 673, 1–17 (2010). https://doi.org/10.1007/978-1-60761-842-3_1 160. Day, I.N.: Dbsnp in the detail and copy number complexities. Hum. Mutat. 31(1), 2–4 (2010). https://doi.org/10.1002/humu.21149 161. Bravo, H.C., Irizarry, R.A.: Model-based quality assessment and base-calling for secondgeneration sequencing data. Biometrics 66(3), 665–674 (2010). https://doi.org/10.1111/j. 1541-0420.2009.01353.x 162. Gilad, Y., Pritchard, J.K., Thornton, K.: Characterizing natural variation using next-generation sequencing technologies. Trends Genet. 25(10), 463–471 (2009). https://doi.org/10.1016/j.tig. 2009.09.003 163. Box, G.E.P., Hunter, G.S., Hunter, W.G.: Statistics for Experimenters: Design, Discovery, and Innovation, 2nd edn. Wiley Series in Probability and Statistics. Wiley, Hoboken, New Jersey, USA (2005) 164. Zawistowski, M., Gopalakrishnan, S., Ding, J., Li, Y., Grimm, S., Zöllner, S.: Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am. J. Hum. Genet. 87(5), 604–617 (2010). https://doi.org/10.1016/j.ajhg.2010.10.012 165. Wu M , C., Lee, S., Cai, T., Li, Y., Boehnke, M., Lin, X.: Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89(1), 82–93 (2011). https://doi.org/10.1016/j.ajhg.2011.05.029 166. Fuchsberger, C., Flannick, J., Teslovich, T.M., Mahajan, A., Agarwala, V., Gaulton, K.J., et al.: The genetic architecture of type 2 diabetes. 

Chapter 3

Phenotypic Heterogeneity

We provide examples of how different forms of heterogeneity may arise. Also, we provide mathematical models for phenotype heterogeneity. We document the effects on population-based and family-based statistical tests of association. Finally, we discuss the relative effects of phenotype misclassification versus genotype misclassification.

© Springer Nature Switzerland AG 2020 D. Gordon et al., Heterogeneity in Statistical Genetics, Statistics for Biology and Health, https://doi.org/10.1007/978-3-030-61121-7_3

3.1 Phenotype Misclassification

By phenotype misclassification, we mean that the true phenotype of a participant and the observed phenotype of the participant (that is, the phenotype that is assigned to the participant) are not the same (Fig. 1.3 of Chap. 1). For this book, our definition applies only to phenotypes that are categorical (affected/unaffected). We note that quantitative traits may have measurement error. Statistical methods to analyze quantitative data typically have measurement error built into the mathematical model describing the relationship between the input (e.g., genotype) and output (e.g., phenotype) data [1].

The definitions for phenotype and genotype misclassification are nearly the same, with the exception that, in this book, the number of phenotypes we consider is at least two for phenotype misclassification (affected and unaffected), whereas the number of genotypes is at least three (when considering SNPs, Douglas et al. pointed out that errors usually occur in genotypes rather than in alleles [2]).

As an example, consider a genetic disease with the two classifications “affected” and “unaffected”. If we examine a subject whose true phenotype is affected, and we assign a status of “unaffected”, we say that we have misclassified a true affected individual as an unaffected individual. This type of misclassification can occur with diseases such as Alzheimer’s disease (AD), where the gold-standard diagnosis of AD requires a post-mortem examination of the brain, and the standard diagnosis involves cognitive function tests as well as brain imaging [3]. AD may be under-diagnosed. We quote from the 2011 report from the Alzheimer’s Association [4]: “…one study examining the medical records of people who had definitive Alzheimer’s disease at autopsy showed that, among the 463 medical records investigated, about 20% did not include a diagnosis of probable Alzheimer’s disease [5] [reference #166 in the 2011 report]. This finding illustrates the challenges of diagnosing Alzheimer’s disease even in severely affected people”.

As in Chap. 2, we use codes for the different categories: 0 = Unaffected, 1 = Affected. The phenotype misclassification matrix has the entries:

$$
\Pr(\text{observed phenotype} = i \mid \text{true phenotype} = i') = \pi_{i'i}, \quad 0 \le i', i \le 1. \tag{3.1}
$$

The misclassification matrix $(\pi_{i'i})$ is a 2 × 2 matrix, with two conditional probabilities that represent correct classification of phenotypes: π00 and π11. There are two misclassification probabilities: π01 and π10. As with genotypes (Chap. 2), we refer to the full set of probabilities as the phenotype (mis)classification matrix. We may expand the number of phenotype categories, in which case the dimensions of the misclassification matrix increase (Table 3.1).

We consider an example. Let π01 = 0.025. Then the probability of misclassifying an unaffected subject as an affected subject is 0.025. For any random sample of $N_0$ individuals whose true phenotype classification is unaffected, the average number of those individuals who are misclassified as affected is $0.025 N_0$. If $N_0$ = 10,000, then on average 250 of the true unaffected individuals will be classified as affected.

As with genotype misclassification, unless we have some gold-standard phenotype classification method with very low misclassification error rates, we cannot know the true counts for the different phenotypes. Consequently, statistical tests applied to data with phenotype misclassification show statistical power loss and/or inflation in the type I error rate [6–12].

There have been some software programs (based on statistical and machine-learning methodology) developed to address phenotype misclassification. These include programs that

1. Use double-sample data to incorporate phenotype misclassification when performing statistical tests of genetic association [13].
2. Employ simulation methods to evaluate power and sample size in the presence of phenotype misclassification for GWAS [3].
3. Apply machine-learning classification models iteratively to minimize the number of misclassified cell phenotypes [3, 14].

Also, there have been numerous classification methods developed in statistics and machine learning for the purposes of correct phenotype classification. Among the methods published since 2011 are those dealing with coronary heart disease, cancer, and autism [15–44]. In the majority of these publications, however, correct classification is never 100%. Less than perfect classification has implications for statistical hypothesis testing (see below).

Table 3.1 Table of (mis)classification parameters for affected and unaffected phenotypes

True phenotype      Observed phenotype                  Row total
                    Unaffected (0)    Affected (1)
Unaffected (0)      π00               π01               1
Affected (1)        π10               π11               1

For the true and observed phenotypes, the numbers in parentheses indicate numerical codes assigned to the phenotype. Each parameter $\pi_{i'i}$, $0 \le i', i \le 1$, is defined as the conditional probability $\Pr(\text{observed phenotype} = i \mid \text{true phenotype} = i')$. Note that, for any true phenotype $i'$, $\sum_{i} \pi_{i'i} = 1$, that is, the sum of each row's entries is one
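The expected-count example in this section (π01 = 0.025, 10,000 true unaffected individuals) can be sketched in a few lines of Python. This is only an illustration: the function name `expected_observed_counts` is hypothetical, and the second row of the matrix (no misclassification of affected individuals) is an assumption made here so that the matrix is complete, since the text specifies π01 only.

```python
def expected_observed_counts(pi, true_counts):
    """Expected number of individuals observed in each phenotype category.

    pi[t][o]       : Pr(observed phenotype = o | true phenotype = t), as in Table 3.1
    true_counts[t] : number of individuals whose true phenotype is t
    """
    n_categories = len(pi)
    for row in pi:
        # Each row of a (mis)classification matrix must sum to one.
        if abs(sum(row) - 1.0) > 1e-12:
            raise ValueError("each row of the matrix must sum to 1")
    return [
        sum(true_counts[t] * pi[t][o] for t in range(n_categories))
        for o in range(n_categories)
    ]


# pi_01 = 0.025 comes from the text; the second row is an illustrative assumption.
pi = [[0.975, 0.025],
      [0.000, 1.000]]

n_true_unaffected = 10_000
misclassified_as_affected = n_true_unaffected * pi[0][1]
observed = expected_observed_counts(pi, [n_true_unaffected, 0])

print(round(misclassified_as_affected))   # 250
print([round(x) for x in observed])       # [9750, 250]
```

The same function accepts larger matrices, which matches the remark that the matrix dimensions grow when more phenotype categories are considered.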

3.2 How Phenotype Misclassification May Arise in Practice

3.2.1 Lack of Access to Gold-Standard Classification

Phenotype misclassification is more likely to occur with diseases for which a gold-standard classification is difficult, if not impossible, to obtain. Examples of such phenotypes are those for which the gold-standard classification is obtained only after a post-mortem examination. In Table 3.2, we present a list of a few such diseases, along with the reference(s) indicating that a post-mortem examination is the gold-standard diagnosis.

Table 3.2 Phenotypes for which a post-mortem examination is the gold-standard diagnosis

Phenotype                                References
AD                                       [45, 46]
Supra-nuclear Palsy                      [14]
Amyotrophic Lateral Sclerosis (ALS)      [14]
Dementia with Lewy Bodies                [14]
Vascular Dementia                        [47]
Fronto-temporal Lobar Degeneration       [47]

Each of these phenotypes has a genetic component, as documented in the Online Mendelian Inheritance in Man (OMIM) database [48]. In the list provided here, we consider non-familial forms of each phenotype, so the penetrance for any gene associated with a phenotype is much less than 100%

3.2.2 Variability of Phenotype Expression over Time

Phenotype misclassification also may occur for diseases that have subtypes of different severity, whose symptoms change or vary over time, or that have variable age at onset. As pointed out by Ioannidis et al. [49], “A common problem is the classification of participants as non-diseased because of insufficient diagnostic workup or because participants have not been followed up long enough to develop disease”. Consequently, the mild form of a given phenotype may not be readily diagnosed.

An example of such a disease is Fabry disease. Saito et al. [50] wrote that “Fabry disease is a genetic disorder caused by a deficiency of alpha-galactosidase, exhibiting a wide clinical spectrum, from the early-onset severe ‘classic’ form to the late-onset mild ‘variant’ one”. Rett syndrome is another such disease. Ham et al. [51] reported, “Even within the constraints of the diagnostic criteria, classic and atypical forms of Rett syndrome demonstrate considerable variability in the presentation of specific signs and symptoms between patients and over time in the same patient. The evolution of phenotype with age is important to consider when evaluating the severity of the presentation of the disorder”.

3.2.3 Variable Age of Onset

Variable age of onset has been documented for numerous phenotypes. If one does a search on the key words “age of onset”, “single gene”, and “disease”, one obtains approximately 3,000 references at the time of this writing. To reduce the number of references presented here, we add the key word “Mendelian”. Doing this reduces the number of publications to 43 [52–94].

An example where initial phenotype misclassification resulted in an apparent false positive is a linkage analysis of bipolar disorder. In a paper by Egeland et al. [95] in 1987, the authors reported a multi-point lod score of approximately 5.0 near the HRAS1 gene on chromosome 11 in an Old Order Amish pedigree. This result strongly suggested that HRAS1 was either a susceptibility gene for bipolar disorder, or near such a gene [96]. In a subsequent reanalysis of the pedigree [97], the authors concluded that “This linkage can be excluded using a large lateral extension of the original Amish pedigree”. The main reasons for the exclusion were that: (i) two individuals who had not previously displayed bipolar disorder developed the disease, thus having their phenotype status change from unaffected to affected (that is, in the initial publication, their phenotypes were misclassified because their age at onset was higher than their age when the phenotyping was performed) and (ii) there was a change in the size of the pedigree, specifically, several new individuals were added.

3.2.4 Incomplete Knowledge of Gold-Standard Classifications

A number of publications document that a given “gold-standard” phenotype classification may be incomplete, in that there may be phenotypic subtypes that provide information on either disease susceptibility or progression that are unknown at the time. For example, Massi et al. [98] reported that for several years, it was claimed that sub-classification of pleomorphic sarcomas (cancers of connective tissue) lacked reproducibility and had no clinical impact (no modification of therapeutic strategies). However, Fletcher et al. [99] documented that malignant fibrous histiocytoma (MFH, a form of pleomorphic sarcoma) could be subtyped on the basis of clear-cut criteria, and provided evidence that such classification is clinically important. Specifically, they documented that one subtype of sarcoma was associated with a shorter time to metastasis than the other. This finding was replicated in a subsequent publication [100]. Given the existence of two subtypes, genetic association analyses may be performed to determine if there is a genetic basis for the difference in time to metastasis. If such genetic factors are found, at a minimum they help in classification and in the choice of therapeutic modalities, and additionally may suggest drug targets for treatment.

In statistical genetics, many researchers consider an increase in the false positive (type I error) rate to be a more deleterious form of error than a decrease in statistical power (see, e.g., [101]). The reason is that, when the false positive rate is increased, time, money, and effort are expended studying regions of the genome that have no association with the disease. By contrast, lower power can often be addressed by increasing the sample size. While increasing sample size requires time, money, and effort, a larger sample size will often lead to the identification of true disease loci.

3.2.5 Model Misspecification

Although it is not precisely phenotype misclassification, model (or mode of inheritance) misspecification in linkage and association analyses may also have a significant effect, especially on statistical power, for disease-gene localization. Several works document the effects, while others propose methods that are robust to model misspecification (see, e.g., [102–130]).

In summary, there are multiple ways in which phenotype misclassification may occur in genetics, and there are numerous phenotypes in which misclassification may happen [11].

3.3 Effects of Misclassification on Statistical Tests

As with genomic misclassification, phenotype misclassification is important because it can alter power (at a minimum) for statistical tests. In the work that follows, we consider two types of study designs: (i) single-stage design, in which individuals are only classified once and (ii) two-stage design, where individuals go through two classifications. For the two-stage design, only those individuals who have the same classification in each stage (e.g., an individual is classified as a case in both stages I and II) are included when performing statistical tests of association. This type of design is referred to as “duplicate” sampling [131]. It has been used with genomic data (see, e.g., [132–136]).

3.3.1 Non-differential Misclassification Error Example for Single-Stage Genetic Association

In this section, we focus on phenotype misclassification that is independent of any genetic exposure variables (e.g., alleles, genotypes, MLGs, diplotypes, etc.). We provide an example calculation that documents the statistical power loss in the presence of such misclassification when the ascertainment design is single stage (Table 3.3).

Suppose our true (without misclassification) case SNP genotype frequencies are Pr(AA) = 0.01, Pr(AT) = 0.18, Pr(TT) = 0.81, and our true control genotype frequencies are Pr(AA) = 0.04, Pr(AT) = 0.32, Pr(TT) = 0.64. The true population disease prevalence, $\phi_1$, is 0.05. Since the genotype frequencies are not equal in the two populations, the null hypothesis is false. Consider sample sizes $N_A = N_U = 350$. The NCP is 26.996 for the chi-square test of independence for genotypes. It follows that the distribution of the statistic (Chap. 1, Sect. 1.6, and Edwards et al. [7]) is a non-central chi-square distribution with two degrees of freedom and NCP equal to 26.996. The statistical power for our statistical test at the α = 0.0001 significance level is 0.8438 when there is no phenotype misclassification error, according to the method used in G*Power 3.1.

Table 3.3 Table of (mis)classification parameters for affected and unaffected phenotypes

Cases and controls
True phenotype      Observed phenotype
                    Unaffected (0)    Affected (1)
Unaffected (0)      0.92              0.08
Affected (1)        0.05              0.95

In this table, there are numerical assignments for each of the $\pi_{i'i}$, $0 \le i', i \le 1$, from Table 3.1. For example, π00 = 0.92, π10 = 0.05

The effect of phenotype misclassification is that the observed genotype frequencies are altered away from their true values. We compute each of the observed genotype frequencies in the case and control samples directly below. To numerically document the effects of misclassification on the conditional genotype frequencies, we consider an example calculation, specifically Pr(True AA | Obs Case). We may speak of true genotypes, since we do not consider genotype misclassification in this example.

$$
\begin{aligned}
\Pr(\text{True AA} \mid \text{Obs Case})
&= \frac{\Pr(\text{True AA}, \text{Obs Case})}{\Pr(\text{Obs Case})} \\
&= \frac{\Pr(\text{True AA}, \text{Obs Case}, \text{True Case}) + \Pr(\text{True AA}, \text{Obs Case}, \text{True Control})}{\Pr(\text{Obs Case}, \text{True Case}) + \Pr(\text{Obs Case}, \text{True Control})} \\
&= \frac{g^{t}_{01}\,\pi_{11}\,\phi_{1} + g^{t}_{00}\,\pi_{01}\,\phi_{0}}{\pi_{11}\,\phi_{1} + \pi_{01}\,\phi_{0}},
\end{aligned}
\tag{3.2}
$$

with $\phi_{i'}$, $i' = 0, 1$, being the prevalence of the $i'$th phenotype. The third equation is obtained by making the specifications:

Pr(True AA | Obs Case, True Case) = Pr(True AA | True Case),
Pr(True AA | Obs Case, True Control) = Pr(True AA | True Control).

That is, the probability of a given true genotype is only dependent upon the true underlying phenotype. We use this fact below with controls. In general, we have the conditional probability formula:

$$
\Pr(\text{True Genotype} = j' \mid \text{Observed Phenotype} = i)
= \frac{\sum_{i'=0}^{1} g_{j'i'}\,\pi_{i'i}\,\phi_{i'}}{\sum_{i'=0}^{1} \pi_{i'i}\,\phi_{i'}}.
\tag{3.3}
$$

In Eq. (3.3), we make the specification:

$$
\Pr(\text{True Genotype} = j' \mid \text{Obs Phenotype} = i, \text{True Phenotype} = i') = \Pr(\text{True Genotype} = j' \mid \text{True Phenotype} = i').
$$

Returning to our numerical example above, we compute

Pr(True Genotype = AA(0) | Obs Phenotype = Case (1)) = [(0.01)(0.95)(0.05) + (0.04)(0.08)(0.95)]/[(0.95)(0.05) + (0.08)(0.95)] = 0.003515/0.1235 = 0.028.
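A minimal Python sketch of Eq. (3.3), applied to the parameters of this example (Table 3.3, prevalence 0.05), is given here as a check on the arithmetic. The function name `observed_genotype_freqs` and the list layout are illustrative choices, not part of the original text.

```python
def observed_genotype_freqs(g, pi, phi, observed):
    """Eq. (3.3): Pr(true genotype = j | observed phenotype = observed).

    g[t][j]  : true frequency of genotype j given true phenotype t
    pi[t][o] : Pr(observed phenotype = o | true phenotype = t)
    phi[t]   : prevalence of true phenotype t
    """
    phenotypes = range(len(phi))
    denom = sum(pi[t][observed] * phi[t] for t in phenotypes)
    return [
        sum(g[t][j] * pi[t][observed] * phi[t] for t in phenotypes) / denom
        for j in range(len(g[0]))
    ]


# Genotype order: AA, AT, TT. Phenotype index 0 = unaffected, 1 = affected.
g = [[0.04, 0.32, 0.64],    # true controls
     [0.01, 0.18, 0.81]]    # true cases
pi = [[0.92, 0.08],         # Table 3.3
      [0.05, 0.95]]
phi = [0.95, 0.05]          # control and case prevalence

obs_case = observed_genotype_freqs(g, pi, phi, observed=1)
obs_control = observed_genotype_freqs(g, pi, phi, observed=0)

print([round(x, 3) for x in obs_case])      # [0.028, 0.266, 0.705]
print([round(x, 3) for x in obs_control])   # [0.04, 0.32, 0.64]
```

The observed case frequencies move markedly toward the control frequencies, while the observed control frequencies are almost unchanged.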

Using Eq. (3.3), we can compute the remaining conditional genotype frequencies:

Pr(True Genotype = AT | Obs Phenotype = Case) = 0.266;
Pr(True Genotype = TT | Obs Phenotype = Case) = 0.705;
Pr(True Genotype = AA | Obs Phenotype = Control) = 0.040;
Pr(True Genotype = AT | Obs Phenotype = Control) = 0.320;
Pr(True Genotype = TT | Obs Phenotype = Control) = 0.640.

Given Formula (1.8) and these probabilities, we compute

$$
\begin{aligned}
\lambda &= (350)(350) \times \left[ \frac{(0.028 - 0.040)^2}{(350)(0.028) + (350)(0.040)} + \frac{(0.266 - 0.320)^2}{(350)(0.266) + (350)(0.320)} + \frac{(0.705 - 0.640)^2}{(350)(0.705) + (350)(0.640)} \right] \\
&= 122{,}500 \times \left[ 5.481 \times 10^{-6} + 1.393 \times 10^{-5} + 8.942 \times 10^{-6} \right] \\
&= 3.474,
\end{aligned}
$$

where the bracketed terms in the second line are evaluated with the unrounded frequencies.

For the 0.0001 level of significance, with a non-centrality parameter of 3.474, we compute a statistical power of 0.0121 [7, 10]. Comparing this result with that of the no-error result (0.8438), the power loss is 83 percentage points [= (0.8438 − 0.0121) × 100]. This power loss is much greater than that with genotyping error (Chap. 2). With such power in the presence of misclassification, it is unlikely that one would establish association between the phenotype of interest and the SNP (and subsequently a gene). For this reason, it is important to develop strategies to deal with phenotype misclassification when performing genetic linkage and association studies. We document one such strategy, the two-stage design, below. First, we address the question of why such a large amount of power is lost.
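The NCP and power values quoted above can be approximated with the Python sketch below, which uses only the standard library. It is a sketch under stated assumptions rather than the procedure used by G*Power: the NCP follows the expression written out in the worked example, and the power is obtained from the Poisson-mixture representation of the non-central chi-square distribution, which has a closed form for two degrees of freedom. The function names are hypothetical.

```python
import math


def ncp_genotype_test(p_case, p_control, n_case, n_control):
    """Non-centrality parameter in the form used in the worked example."""
    return n_case * n_control * sum(
        (a - u) ** 2 / (n_case * a + n_control * u)
        for a, u in zip(p_case, p_control)
    )


def power_chisq_2df(ncp, alpha, max_terms=500):
    """Power of a 2-df chi-square test when the statistic is non-central.

    A non-central chi-square with 2 df is a Poisson(ncp / 2) mixture of
    central chi-squares with 2(k + 1) df, whose upper tail at x equals
    exp(-x/2) * sum_{j <= k} (x/2)^j / j!.
    """
    crit = -2.0 * math.log(alpha)   # upper-alpha point of the central 2-df chi-square
    half_x, half_ncp = crit / 2.0, ncp / 2.0
    weight = math.exp(-half_ncp)    # Poisson probability of k = 0
    term = 1.0                      # (x/2)^k / k!
    partial = 1.0                   # running sum of the terms up to k
    power = 0.0
    for k in range(max_terms):
        power += weight * math.exp(-half_x) * partial
        weight *= half_ncp / (k + 1)
        term *= half_x / (k + 1)
        partial += term
    return power


n = 350
true_case, true_control = [0.01, 0.18, 0.81], [0.04, 0.32, 0.64]
# Unrounded observed frequencies from Eq. (3.3) with the Table 3.3 parameters.
obs_case = [0.003515 / 0.1235, 0.03287 / 0.1235, 0.087115 / 0.1235]
obs_control = [0.034985 / 0.8765, 0.28013 / 0.8765, 0.561385 / 0.8765]

ncp_true = ncp_genotype_test(true_case, true_control, n, n)
ncp_obs = ncp_genotype_test(obs_case, obs_control, n, n)
power_true = power_chisq_2df(ncp_true, alpha=0.0001)
power_obs = power_chisq_2df(ncp_obs, alpha=0.0001)

print(round(ncp_true, 3), round(ncp_obs, 3))
print(round(power_true, 4), round(power_obs, 4))
```

With the inputs above, the two NCP values should agree with 26.996 and 3.474 to within rounding, and the two power values should be close to the reported 0.8438 and 0.0121.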

3.3.2 Why Do We Observe Such Large Power Loss/MSSN Increase for Phenotype Misclassification?

In 2005, Zheng and Tian [10] and Edwards et al. [7] quantified the loss of power (equivalently, the increase in sample size) for the test of trend and the chi-square test of independence for genotypes, respectively, in the presence of non-differential phenotype misclassification. Each set of authors considered a case-control genetic association design.

To explain why we observe such a large decrease in power, we consider the work of Edwards et al. [7], who addressed this issue using cost coefficients. The authors documented the costs of misclassifying a true case as an observed control ($C_{10}$) and misclassifying a true control as an observed case ($C_{01}$) (our notation is different from that in the Edwards et al. paper). The costs (see, e.g., [137], Chap. 6) are defined as follows:

$C_{10}$ = The percent increase in the sample size required to maintain a constant power for every one percent increase in the misclassification rate of a true case as an observed control.

$C_{01}$ = The percent increase in the sample size required to maintain a constant power for every one percent increase in the misclassification rate of a true control as an observed case.

For example, if $C_{10} = 2$, it means that every 1% increase in misclassification of a true case as an observed control requires a 2% increase in sample size to maintain constant power. If $C_{01} = 10$, then every 1% increase in misclassification of a true control as an observed case requires a 10% increase in sample size to maintain constant power. The principle is the same for power loss for a fixed sample size, that is, the higher the cost of each misclassification, the greater the power loss for a fixed sample size. Edwards et al. [7] documented that

$$
\begin{aligned}
C_{01} &= \frac{1 - \phi_1}{\phi_1} \times (\text{function of true genotype frequencies}), \\
C_{10} &= \frac{\phi_1}{1 - \phi_1} \times (\text{function of true genotype frequencies}).
\end{aligned}
\tag{3.4}
$$

Let us consider the parameters given in Sect. 3.3.1 as an example. There, $\phi_1 = 0.05$, so

$$
C_{01} = \frac{1 - 0.05}{0.05} \times (\text{function of true genotype frequencies}) = 19 \times (\text{function of true genotype frequencies}).
$$

That is, for any value of the true-genotype-frequencies function, the cost $C_{01}$ is increased 19-fold due to the given prevalence. Similarly, for any value of the true-genotype-frequencies function, the cost $C_{10}$ is decreased 19-fold due to the given prevalence. So, for this example, misclassification of a true control as an observed case is a much more damaging error (in terms of power loss/increased sample size).

To illustrate this point computationally, let us modify Table 3.3 so that there is only misclassification of true cases to observed controls. We present the modified parameters in Table 3.4. Using the work done in Sect. 3.3.1, we can show that the NCP in the presence of misclassification is λ = 26.8721. For the 0.0001 level of significance, we compute a statistical power of 0.8410 [7, 10]. Comparing this result with that of the no-error result (0.8438), we observe that there is virtually no power loss, even though we have π10 = 0.05, a fairly large misclassification rate by many standards.
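The asymmetry between the two error types can be checked numerically with the following Python sketch, which recomputes the NCP under several (mis)classification matrices using Eq. (3.3) and the NCP expression from Sect. 3.3.1. The scenario in which only true controls are misclassified is not part of the original example; it is added here as an assumption, purely for contrast. Function names are hypothetical.

```python
def observed_freqs(g, pi, phi, obs):
    """Eq. (3.3) for every genotype, given observed phenotype `obs`."""
    denom = sum(pi[t][obs] * phi[t] for t in (0, 1))
    return [sum(g[t][j] * pi[t][obs] * phi[t] for t in (0, 1)) / denom
            for j in range(len(g[0]))]


def ncp(g, pi, phi, n_case, n_control):
    """NCP of the genotype test computed from the observed frequencies."""
    p_case = observed_freqs(g, pi, phi, 1)
    p_ctrl = observed_freqs(g, pi, phi, 0)
    return n_case * n_control * sum(
        (a - u) ** 2 / (n_case * a + n_control * u)
        for a, u in zip(p_case, p_ctrl))


g = [[0.04, 0.32, 0.64],   # true controls (AA, AT, TT)
     [0.01, 0.18, 0.81]]   # true cases
phi = [0.95, 0.05]

scenarios = {
    "no misclassification": [[1.00, 0.00], [0.00, 1.00]],
    "cases -> controls only (Table 3.4)": [[1.00, 0.00], [0.05, 0.95]],
    "controls -> cases only (assumed)": [[0.92, 0.08], [0.00, 1.00]],
    "both (Table 3.3)": [[0.92, 0.08], [0.05, 0.95]],
}
results = {name: ncp(g, pi, phi, 350, 350) for name, pi in scenarios.items()}
prevalence_factor = phi[0] / phi[1]   # (1 - phi_1) / phi_1, about 19

for name, value in results.items():
    print(f"{name}: NCP = {value:.3f}")
```

Under these inputs, misclassifying 5% of true cases leaves the NCP near 26.87, whereas misclassifying 8% of true controls alone already drives it below 4, in line with the 19-fold prevalence factor.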

Table 3.4 Modified version of Table 3.3, where the only misclassification is of true cases to observed controls

Cases and controls
True phenotype      Observed phenotype
                    Unaffected (0)    Affected (1)
Unaffected (0)      1.00              0.00
Affected (1)        0.05              0.95

Fig. 3.1 Pictorial representation of π01 misclassification. This figure is to be viewed sequentially from left to right. In each figure, there is paint in the large (black exterior) drum and in the small (silver exterior) can. Cases and controls are represented by black and white paint, respectively. The case prevalence is the ratio of (volume of paint in silver can)/(volume of paint in silver can + volume of paint in black drum). Similarly, the control prevalence is the ratio of (volume of paint in black drum)/(volume of paint in silver can + volume of paint in black drum). Initially, the can holds black-colored paint and the drum holds white-colored paint. In the second frame of this sequence, a cup of white paint, amounting to approximately 1% of the controls, is poured into the can with black paint. The third frame shows the color of the paint in the silver can after the white paint has been added

Fig. 3.2 Pictorial representation of π10 misclassification. This figure is to be viewed sequentially from left to right. The description of prevalences is the same as given in the legend for Fig. 3.1. In the second frame of this sequence, a cup of black paint, amounting to approximately 1% of the cases, is poured into the oil drum with white paint. The third frame shows the color of the paint in the drum after the black paint has been added

Qualitatively, how do we explain such a difference in power loss? The key comes from the prevalence. We provide an illustration of the two types of misclassification below (Figs. 3.1 and 3.2). We thank Amy and Eleanor Gordon for creating these figures.

Suppose we have two containers, an oil drum and a paint can. The oil drum is filled with white paint (the volume representing the proportion of controls) and the paint can is filled with black paint (the volume representing the proportion of cases). The prevalence of cases is the ratio of (volume of black paint)/(volume of black paint + volume of white paint). As indicated in Figs. 3.1 and 3.2, a prevalence of 5% means that the volume of black paint is rather small relative to the volume of white paint.

Suppose we take a cup of white paint, representing 1% of the white paint volume, and pour it into the black paint can. Because 1% of the white paint (controls) is relatively large compared to the volume of the black paint (cases), we expect that the mixture will turn gray, that is, no longer black and closer to white (Fig. 3.1). Said another way, the color of the case paint looks much more like the color of the control paint. Mathematically, the observed case genotype frequencies move closer to the true control frequencies.

If we take a cup of black paint, representing 1% of the black paint volume, and pour it into the oil drum of white paint, we expect that the mixture will stay almost completely white (Fig. 3.2). That is, the color of the control paint has barely changed. Mathematically speaking, the observed control genotype frequencies have changed minimally from their true values.
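The paint analogy can be put into numbers with a short Python sketch. Following the figures, it assumes that a cup is only added to the receiving container (nothing is removed from it), and it uses the 5% prevalence of the running example; the function name `contamination` is hypothetical.

```python
def contamination(prevalence, frac_controls_to_cases, frac_cases_to_controls):
    """Share of each observed group that consists of misclassified individuals."""
    cases, controls = prevalence, 1.0 - prevalence
    moved_controls = frac_controls_to_cases * controls   # white paint poured into the can
    moved_cases = frac_cases_to_controls * cases         # black paint poured into the drum
    share_in_cases = moved_controls / (cases + moved_controls)
    share_in_controls = moved_cases / (controls + moved_cases)
    return share_in_cases, share_in_controls


# Fig. 3.1: 1% of the controls are labeled as cases.
white_in_can, _ = contamination(0.05, 0.01, 0.0)
# Fig. 3.2: 1% of the cases are labeled as controls.
_, black_in_drum = contamination(0.05, 0.0, 0.01)

print(f"{white_in_can:.1%} of the observed cases are true controls")
print(f"{black_in_drum:.3%} of the observed controls are true cases")
```

With a prevalence of 5%, the same 1% error rate contaminates roughly 16% of the observed cases but only about 0.05% of the observed controls.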

3.3.3 Multi-stage Phenotype Classification and Limits of Observed Genotype Frequencies

In Sects. 3.3.1 and 3.3.2, we documented, by formula and by example, that performing one stage of phenotype classification for each individual may result in large loss of power/increase in MSSN when phenotype misclassification is present. In this section, we document how one may mitigate the effects of phenotype misclassification. We consider a multi-stage approach, where phenotype classification occurs more than once. Only those individuals who have the same classification for each stage are kept for use in the test statistic. We consider a “best case” scenario and a “worst case” scenario when computing the conditional genotype frequencies. The key point here is that phenotype classification (affected, unaffected) now yields a vector of classifications, one per stage, rather than a single classification.

To begin, suppose we perform m stages of phenotyping per individual, where the misclassification parameters for each stage r, 1 ≤ r ≤ m, are given by $\pi_{i'(i_r = i)}$, $i' \neq i$. This notation simply means that, at the rth stage, the observed phenotype classification is i. Since our method for classifying individual phenotypes is that the observed phenotype category must be the same at each stage (e.g., a person is defined as affected if, at each stage, they are classified as affected), when placing individuals into different phenotype categories, we have $(i_1 = i, i_2 = i, \ldots, i_m = i)$. We abbreviate this vector by $\vec{i}$, with the understanding that the number of coordinates in the vector is m. Using our notation for unaffected and affected individuals, the two possible vectors are $\vec{0}$ and $\vec{1}$, respectively.

3.3.3.1 Conditional Genotype Frequencies in Presence of Conditionally Independent Phenotype Classification

Consider the following conditions:

a. $\Pr(j' \mid \vec{i}, i') = \Pr(j' \mid i') = g_{j'i'}$;
b. (Conditional independence): $\Pr(\vec{i} \mid i') = \Pr(i_1 = i \mid i') \times \Pr(i_2 = i \mid i') \times \cdots \times \Pr(i_m = i \mid i') = \prod_{r=1}^{m} \pi_{i'(i_r = i)}$;
c. $\Pr(i') = \phi_{i'}$ (by definition of prevalence);
d. $\pi_{i'(i_r = (1 - i'))} < \pi_{i'(i_r = i')}$, $0 \le i' \le 1$, $1 \le r \le m$.

Condition (d) states that the misclassification probability is always less than the correct classification probability. We consider this condition to be reasonable. Any classification procedure where Condition (d) is not met should not be used.

We extend the calculations of single-stage phenotype classification. Using the definitions provided above, we can write

$$
g_{j'\vec{i}} = \frac{\Pr(j', \vec{i})}{\Pr(\vec{i})}
= \frac{\Pr(j', \vec{i}, i' = i) + \Pr(j', \vec{i}, i' = (1 - i))}{\Pr(\vec{i}, i' = i) + \Pr(\vec{i}, i' = (1 - i))}.
\tag{3.5}
$$

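In the right-most term in Eq. (3.5), the summands in the numerator may be rewritten as

$$
\begin{aligned}
\Pr(j', \vec{i}, i' = i) &= \Pr(j' \mid \vec{i}, i' = i) \times \Pr(\vec{i} \mid i' = i) \times \Pr(i' = i) \\
&= g_{j'(i' = i)} \times \Pr(\vec{i} \mid i' = i) \times \phi_{(i' = i)}, \\
\Pr(j', \vec{i}, i' = (1 - i)) &= \Pr(j' \mid \vec{i}, i' = (1 - i)) \times \Pr(\vec{i} \mid i' = (1 - i)) \times \Pr(i' = (1 - i)) \\
&= g_{j'(i' = (1 - i))} \times \Pr(\vec{i} \mid i' = (1 - i)) \times \phi_{(i' = (1 - i))}.
\end{aligned}
\tag{3.6a}
$$

The terms in the denominator may be written as

$$
\begin{aligned}
\Pr(\vec{i}, i' = i) &= \Pr(\vec{i} \mid i' = i) \times \Pr(i' = i) \\
&= \Pr(\vec{i} \mid i' = i) \times \phi_{(i' = i)}, \\
\Pr(\vec{i}, i' = (1 - i)) &= \Pr(\vec{i} \mid i' = (1 - i)) \times \Pr(i' = (1 - i)) \\
&= \Pr(\vec{i} \mid i' = (1 - i)) \times \phi_{(i' = (1 - i))}.
\end{aligned}
\tag{3.6b}
$$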

The equations in Formulas (3.6a) and (3.6b) follow from Conditions (a–c) above. The challenging calculation regards the term $\Pr(\vec{i} \mid i')$. When m equals one stage, $\Pr(\vec{i} \mid i') = \pi_{i'i}$. When m equals two stages, $\Pr(\vec{i} \mid i') = \Pr(i_1 = i, i_2 = i \mid i')$, where $i_1$ is the specified phenotype at the first stage, and likewise for $i_2$. If we disregard Condition (b) above, the value of this conditional probability is not necessarily known. One might estimate each probability by doing a pilot study where each true phenotype category $i'$ is known (see double-sampling [13, 138]). However, in this section, we specify the formula (Condition (b)) to determine the conditional probability.

The equation in Condition (b) follows from a condition known as “conditional independence”. We used this term in a previous publication [13]. Returning to the m = 2 stages example above, conditional independence means that

$$
\begin{aligned}
\Pr(\vec{i} \mid i') &= \Pr(i_1 = i, i_2 = i \mid i') \\
&= \Pr(i_1 = i \mid i') \times \Pr(i_2 = i \mid i') \\
&= \pi_{i'(i_1 = i)} \times \pi_{i'(i_2 = i)}.
\end{aligned}
\tag{3.7}
$$

The meaning of Eq. (3.7) is that, conditional on the true phenotype category $i'$ for an individual, the phenotype classifications at stages 1 and 2 are independent. We can extend Eq. (3.7) to any number of stages. The general formula (as written in Condition (b)) is

$$
\begin{aligned}
\Pr(\vec{i} \mid i') &= \Pr(i_1 = i, i_2 = i, \ldots, i_m = i \mid i') \\
&= \Pr(i_1 = i \mid i') \times \Pr(i_2 = i \mid i') \times \cdots \times \Pr(i_m = i \mid i') \\
&= \prod_{r=1}^{m} \pi_{i'(i_r = i)}.
\end{aligned}
\tag{3.8}
$$

In the right-most term in Eq. (3.5), the summands in the numerator may be rewritten as

$$
\begin{aligned}
\Pr(j', \vec{i}, i' = i) &= g_{j'(i' = i)} \prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}\, \phi_{(i' = i)}, \\
\Pr(j', \vec{i}, i' = (1 - i)) &= g_{j'(i' = (1 - i))} \prod_{r=1}^{m} \pi_{(i' = (1 - i))(i_r = i)}\, \phi_{(i' = (1 - i))}.
\end{aligned}
\tag{3.9a}
$$

Each term in the denominator may be written as

$$
\begin{aligned}
\Pr(\vec{i}, i' = i) &= \prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}\, \phi_{(i' = i)}, \\
\Pr(\vec{i}, i' = (1 - i)) &= \prod_{r=1}^{m} \pi_{(i' = (1 - i))(i_r = i)}\, \phi_{(i' = (1 - i))}.
\end{aligned}
\tag{3.9b}
$$

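Combining Eqs. (3.9a)–(3.9b), the conditional genotype frequencies (3.5) are given by

$$
g_{j'\vec{i}} =
\frac{g_{j'(i' = i)} \prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}\, \phi_{(i' = i)} + g_{j'(i' = (1 - i))} \prod_{r=1}^{m} \pi_{(i' = (1 - i))(i_r = i)}\, \phi_{(i' = (1 - i))}}
{\prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}\, \phi_{(i' = i)} + \prod_{r=1}^{m} \pi_{(i' = (1 - i))(i_r = i)}\, \phi_{(i' = (1 - i))}}.
\tag{3.10}
$$

Let

$$
\rho_{i,r} = \frac{\pi_{(i' = (1 - i))(i_r = i)}}{\pi_{(i' = i)(i_r = i)}} < 1 \quad \text{(Condition (d))}.
$$

Define

$$
\rho_{\max} = \max_{\substack{1 \le r \le m \\ i = 0, 1}} \rho_{i,r}, \qquad
\rho_{\min} = \min_{\substack{1 \le r \le m \\ i = 0, 1}} \rho_{i,r}.
$$

These terms satisfy the inequality:

$$
0 \le (\rho_{\min})^{m} \le \prod_{r=1}^{m} \rho_{i,r}
= \frac{\prod_{r=1}^{m} \pi_{(i' = (1 - i))(i_r = i)}}{\prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}}
\le (\rho_{\max})^{m} < 1,
$$

for i equal to 0 or 1. Dividing the numerator and denominator in expression (3.10) by $\prod_{r=1}^{m} \pi_{(i' = i)(i_r = i)}$, we obtain

$$
\frac{g_{j'(i' = i)}\, \phi_{(i' = i)} + g_{j'(i' = (1 - i))} \times \prod_{r=1}^{m} \rho_{i,r} \times \phi_{(i' = (1 - i))}}
{\phi_{(i' = i)} + \prod_{r=1}^{m} \rho_{i,r} \times \phi_{(i' = (1 - i))}}.
\tag{3.11}
$$

The quotient (3.11) is $g_{j'\vec{i}}$. It satisfies the following inequalities:

$$
\begin{aligned}
\frac{g_{j'(i' = i)}\, \phi_{(i' = i)} + g_{j'(i' = (1 - i))} \times (\rho_{\min})^{m} \times \phi_{(i' = (1 - i))}}
{\phi_{(i' = i)} + (\rho_{\max})^{m} \times \phi_{(i' = (1 - i))}} &\le g_{j'\vec{i}}, \\
\frac{g_{j'(i' = i)}\, \phi_{(i' = i)} + g_{j'(i' = (1 - i))} \times (\rho_{\max})^{m} \times \phi_{(i' = (1 - i))}}
{\phi_{(i' = i)} + (\rho_{\min})^{m} \times \phi_{(i' = (1 - i))}} &\ge g_{j'\vec{i}}.
\end{aligned}
\tag{3.12}
$$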
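Because each ratio $\rho_{i,r}$ is below one, the products in Eq. (3.10) shrink geometrically with the number of stages. The Python sketch below evaluates Eq. (3.10) for the AA genotype among observed cases. As an assumption made only for illustration, every stage uses the same classification probabilities as Table 3.3 and the prevalence of Sect. 3.3.1; the function name is hypothetical.

```python
def multistage_freq(g_same, g_other, phi_same, phi_other, pi_correct, pi_wrong):
    """Eq. (3.10) for one genotype under conditional independence.

    g_same, phi_same   : genotype frequency and prevalence when true phenotype = i
    g_other, phi_other : the same quantities when true phenotype = 1 - i
    pi_correct[r]      : Pr(stage-r classification = i | true phenotype = i)
    pi_wrong[r]        : Pr(stage-r classification = i | true phenotype = 1 - i)
    """
    prod_correct = prod_wrong = 1.0
    for c, w in zip(pi_correct, pi_wrong):
        prod_correct *= c
        prod_wrong *= w
    numerator = g_same * prod_correct * phi_same + g_other * prod_wrong * phi_other
    denominator = prod_correct * phi_same + prod_wrong * phi_other
    return numerator / denominator


# Observed-case AA frequency for m = 1, ..., 5 identical stages (pi_11 = 0.95, pi_01 = 0.08).
freqs = [
    multistage_freq(0.01, 0.04, 0.05, 0.95, [0.95] * m, [0.08] * m)
    for m in range(1, 6)
]
for m, f in enumerate(freqs, start=1):
    print(m, round(f, 5))
```

For m = 1 the value reproduces the single-stage result of about 0.028, and with each additional stage it moves closer to the true case frequency of 0.01.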

3.3 Effects of Misclassification on Statistical Tests

113

As the number of stages m → ∞, the terms (ρmin )m and (ρmax )m → 0, and the g j  i  =i φ i  =i quotients (3.10) approach ( φ ) ( ) = g j  (i  =i ) from above and below. That is, (i =i ) → (3.5) approach the true genotype the observed conditional genotype frequencies g j  − i frequencies. Given this convergence, we pose a practical question. Is it possible to determine the number of stages m so that the total sample size in the presence of misclassification is within a certain percentage of the total sample size when no misclassification is present? The answer is yes; however, the value m depends on a number of factors, including the true genotype   frequencies, the ratio of controls to cases, and the misclassification matrices πi  i at each stage. We provide two example figures below. In Fig. 3.3, we present the quantities log10 (number of cases needed to detect association) as a function of the significance level (either 5% or 0.001%). The test statistic considered is the chi-square test of independence for genotypes. From this point forward, we use the abbreviation “log” to mean “log10 ”. For results and discussion regarding Figs. 3.3, we use the abbreviation and MSSN to refer to MSSN for cases. From this figure, we can determine the percent increase in MSSN with each additional stage of phenotype classification. For example, when the significance level is 5% and the number of stages is m = 1, then (Fig. 3.3)

Fig. 3.3 Minimum number of cases needed (MSSN) (log-transformed) to detect association for two different significance levels with a single-locus multiplicative mode of inheritance ($R_2 = R_1^2$), where $R_2 = 2$, $p_d = 0.25$, $k = 1$, and $q_1 = 0.05$. The notation for the genetic model parameters may be found in Sect. 1.4.2 of Chap. 1. Lines with open circles (○) at each stage represent log(MSSN) values when the significance level is 5%, while lines with solid squares (■) represent log(MSSN) values when the significance level is 0.001%. For every stage, the (mis)classification matrix is given by $\pi_{00} = \pi_{11} = 0.80$, $\pi_{01} = \pi_{10} = 0.20$. The horizontal lines (solid black) are log(MSSN) values when no misclassification is present (and hence are independent of the number of stages)


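$$
4.104 = \log(\text{MSSN with misclassification}), \qquad 2.555 = \log(\text{MSSN with no misclassification}).
$$

It follows that

$$
\begin{aligned}
4.104 - 2.555 &= \log\left(\frac{\text{MSSN with misclassification}}{\text{MSSN with no misclassification}}\right), \\
1.549 &= \log\left(\frac{\text{MSSN with misclassification}}{\text{MSSN with no misclassification}}\right), \\
10^{1.549} &= \frac{\text{MSSN with misclassification}}{\text{MSSN with no misclassification}}.
\end{aligned}
\tag{3.13}
$$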

Since $10^{1.549} \approx 35$, we have (from Eq. (3.13))

$$
(\text{MSSN with misclassification}) = 35 \times (\text{MSSN with no misclassification}). \tag{3.14}
$$

That is, for the model parameters specified in Fig. 3.3, when m = 1 (individuals phenotyped only once), to achieve 80% power at the 5% significance level in the presence of misclassification, we require at least 35 times as many cases as the number of cases needed when there is no misclassification.

To document how phenotyping in multiple stages helps reduce the sample size requirement, we perform the same calculation as above, only using the values in Fig. 3.3 for m = 3. From the results in Fig. 3.3, we obtain 2.773 = log(MSSN with misclassification) and 2.555 = log(MSSN with no misclassification). We have

$$
(\text{MSSN with misclassification}) = 10^{0.218}\,(\text{MSSN with no misclassification}) = 1.652\,(\text{MSSN with no misclassification}).
$$

In other words, when m = 3 (individuals phenotyped three times, and all phenotypes are conditionally independent), to achieve 80% power at the 5% significance level in the presence of misclassification, we require approximately 1.7 times as many cases as the number of cases needed when there is no misclassification. This sample size is approximately a 21-fold reduction from the MSSN value when m = 1.

Regarding the 0.001% significance level, applying the calculations above to the values in Fig. 3.3, we compute that the ratios (MSSN with misclassification)/(MSSN with no misclassification) for m = 1 and m = 4 (as example numbers) are 35.414 and 1.148, respectively. We achieve an approximate 31-fold reduction in MSSN when performing four phenotyping stages as compared with one. Also, the value 1.148 indicates that we have a 14.8% increase in MSSN for four stages as compared to MSSN with perfect classification, as compared with a 3441.4% increase when comparing the one-stage MSSN values with MSSN for perfect classification.
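The arithmetic for the 5% significance level can be reproduced with a few lines of Python. This is a minimal sketch: the log(MSSN) values are the ones quoted in the text as read from Fig. 3.3, and the helper name `mssn_ratio` is introduced here for illustration only.

```python
def mssn_ratio(log_mssn_with, log_mssn_without):
    """Ratio (MSSN with misclassification) / (MSSN without), from log10 values."""
    return 10 ** (log_mssn_with - log_mssn_without)


# log10(MSSN) values quoted in the text for the 5% significance level.
LOG_MSSN_NO_ERROR = 2.555
LOG_MSSN_ONE_STAGE = 4.104     # m = 1
LOG_MSSN_THREE_STAGES = 2.773  # m = 3

ratio_m1 = mssn_ratio(LOG_MSSN_ONE_STAGE, LOG_MSSN_NO_ERROR)
ratio_m3 = mssn_ratio(LOG_MSSN_THREE_STAGES, LOG_MSSN_NO_ERROR)
fold_reduction = ratio_m1 / ratio_m3  # gain from three stages versus one

print(f"m = 1: {ratio_m1:.1f} times as many cases as with no misclassification")
print(f"m = 3: {ratio_m3:.3f} times as many cases as with no misclassification")
print(f"fold reduction in MSSN from m = 1 to m = 3: {fold_reduction:.1f}")
```

Because the inputs are differences of logarithms, only the gap between each curve and the horizontal no-misclassification line in the figure matters, not the absolute MSSN values.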

3.3.3.2 Conditional Genotype Frequencies in the Presence of Biased Phenotype Classification

In this section, we consider conditional phenotype classification where the probability $\Pr(\vec{i} \mid i')$ makes use of the conditional probabilities:

$$
\Pr(i_m = i \mid i_{m-1} = i, i_{m-2} = i, \ldots, i_1 = i) = \frac{m-1}{m}, \qquad \Pr(i_1 = i \mid i') = \pi_{i'i}. \tag{3.15}
$$

In Eq. (3.15), as the number of stages (m) where the phenotype is classified into category i increases, so does the probability that the last phenotype classification (stage m) will be the value i. This statement is true when using just the observed phenotypes. Note that, as m → ∞, the fraction $(m-1)/m$ → 1. One can interpret this probability as an example of clear bias; each consecutive classifier is influenced by all previous classifications. In addition, the first conditional probability is not dependent upon the true classification $i'$.

One important point regarding the probabilities (3.15) is that the population of individuals considered for this analysis consists of those individuals who are classified as a case m times or as a control m times. Hence, there is a reduction in sample size. We specify the probability of observing the vector $\vec{i}$ conditional on the true phenotype category $i'$ to be

$$
\begin{aligned}
\Pr(\vec{i} \mid i') &= \frac{\Pr(i_m = i, i_{m-1} = i, i_{m-2} = i, \ldots, i_1 = i, i')}{\Pr(i')} \\
&= \frac{\Pr(i_m = i \mid i_{m-1} = i, i_{m-2} = i, \ldots, i_1 = i, i') \times \Pr(i_{m-1} = i, i_{m-2} = i, \ldots, i_1 = i, i')}{\Pr(i')} \\
&= \frac{\frac{m-1}{m} \times \Pr(i_{m-1} = i \mid i_{m-2} = i, \ldots, i_1 = i, i') \times \Pr(i_{m-2} = i, \ldots, i_1 = i, i')}{\Pr(i')} \\
&= \frac{\frac{m-1}{m} \times \frac{m-2}{m-1} \times \Pr(i_{m-2} = i, \ldots, i_1 = i, i')}{\Pr(i')} \\
&\;\;\vdots \\
&= \frac{\frac{m-1}{m} \times \frac{m-2}{m-1} \times \cdots \times \frac{1}{2} \times \pi_{i'i}}{\Pr(i')} \\
&= \frac{\pi_{i'i}}{(m)\Pr(i')} \\
&= \frac{\pi_{i'i}}{(m)\phi_{i'}}.
\end{aligned}
$$
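The chain of stage-wise factors in this derivation telescopes to 1/m. The short Python check below makes that step explicit using exact rational arithmetic. It is a sketch under one assumption read off the derivation: the conditional probability at stage k (for k ≥ 2) is (k − 1)/k. The function name `stagewise_product` is introduced here for illustration only.

```python
from fractions import Fraction


def stagewise_product(m):
    """Product of the assumed stage-wise factors (k - 1)/k for k = 2, ..., m."""
    product = Fraction(1)
    for k in range(2, m + 1):
        product *= Fraction(k - 1, k)  # each numerator cancels the previous denominator
    return product


for m in (1, 2, 5, 10):
    # Exact arithmetic confirms the product equals 1/m for every m tested.
    print(f"m = {m}: product = {stagewise_product(m)}, 1/m = {Fraction(1, m)}")
```

The factor 1/m is what appears as the $(m)$ in the denominators of the final two lines of the derivation.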

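As above, we use the formulas for the numerator (3.6a) and the denominator (3.6b), and we obtain

$$
\begin{aligned}
g_{j'|\vec{i}} &= \frac{\Pr(j', \vec{i}, i' = i) + \Pr(j', \vec{i}, i' = (1-i))}{\Pr(\vec{i}, i' = i) + \Pr(\vec{i}, i' = (1-i))} \\
&= \frac{g_{j'i}\,\dfrac{\phi_i\,\pi_{ii}}{(m)\phi_i} + g_{j'(1-i)}\,\dfrac{\phi_{(1-i)}\,\pi_{(1-i)i}}{(m)\phi_{(1-i)}}}{\dfrac{\phi_i\,\pi_{ii}}{(m)\phi_i} + \dfrac{\phi_{(1-i)}\,\pi_{(1-i)i}}{(m)\phi_{(1-i)}}} \\
&= \frac{g_{j'i}\,\pi_{ii} + g_{j'(1-i)}\,\pi_{(1-i)i}}{\pi_{ii} + \pi_{(1-i)i}}.
\end{aligned}
\tag{3.16}
$$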

The conditional probability $g_{j'|\vec{i}}$ (3.16) is a mixture of the true conditional probabilities $g_{j'i}$ and $g_{j'(1-i)}$. Also, the only way that $g_{j'|\vec{i}}$ approaches $g_{j'i}$ is if $\pi_{ii} > 0$ and $\pi_{(1-i)i} \approx 0$, or $\pi_{(1-i)i} \ll \pi_{ii}$. That is, there is no dependence on the number of phenotyping stages m used. Convergence to the true conditional genotype frequencies depends only on the classification probabilities for the first stage.
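The mixture in (3.16) can be evaluated directly. The Python sketch below uses hypothetical true genotype frequencies (0.30 and 0.10, not taken from the text) to show that the observed frequency moves toward the true one only when the misclassification probability becomes small relative to the correct-classification probability. The function name `biased_observed_freq` is introduced here for illustration only.

```python
def biased_observed_freq(g_same, g_other, pi_same, pi_other):
    """Mixture from Eq. (3.16): no argument for m, since m cancels out.

    g_same, g_other : true genotype frequencies g_{j'i} and g_{j'(1-i)}
    pi_same         : first-stage probability pi_{ii} (correct classification)
    pi_other        : first-stage probability pi_{(1-i)i} (misclassification)
    """
    return (g_same * pi_same + g_other * pi_other) / (pi_same + pi_other)


# Hypothetical true conditional genotype frequencies.
G_SAME, G_OTHER = 0.30, 0.10

for pi_other in (0.20, 0.05, 0.01, 0.001):
    freq = biased_observed_freq(G_SAME, G_OTHER, 0.80, pi_other)
    print(f"pi_(1-i)i = {pi_other}: observed = {freq:.4f} (true = {G_SAME:.4f})")
```

With first-stage probabilities 0.80 and 0.20, the observed frequency stays at 0.26 however many stages are used, which is the sense in which adding stages does not help under this biased scheme.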

3.4 Non-misclassification Forms of Heterogeneity

Until now in this chapter, we have considered only phenotype data in which a subset of the data was misclassified. Here, we consider non-misclassification phenotypic heterogeneity. The term "phenotypic heterogeneity" is defined as pleiotropy [139]. We extend the definition to include any fixed genomic data categories that may result in at least three phenotype categories (Chap. 1, Table 1.3). Given this definition, models of epistasis where the phenotype has at least three values are considered to be a form of phenotypic heterogeneity. An example using epistasis is covered extensively in Chap. 4. The one exception to our rule of at least three phenotype categories is that of threshold-selected quantitative trait loci. We make this exception because the underlying distribution for the determination of affected and unaffected individuals is a mixture of univariate or multivariate normal pdfs. We describe this model in detail in Chap. 6.


3.5 Summary

Based on the results in this chapter, we find that phenotype misclassification can reduce power to detect genetic association or, equivalently, increase the MSSN needed to detect genetic association with statistical tests like the chi-square test of independence for genotypes. Also, this power loss/sample size increase is largely determined by the prevalence of the disease phenotype in the population. We have documented a number of different phenotypes for which phenotype misclassification may occur. In the absence of a gold-standard phenotyping mechanism, we show that if we apply repeated phenotyping in stages, and if the phenotype classification at each stage is conditionally independent of the classifications at all other stages, then the observed genotype frequencies approach the true frequencies, and power will approach the power when no misclassification is present.

A final thought: as of this writing, efforts are being made to create very large data sets where phenotypes are derived from use of electronic medical records at medical institutions (see, e.g., [140–168]). One reason for using these approaches is that they have the potential (through much larger sample sizes) to increase statistical power to detect association with rare variants and/or variants with small effect sizes. Such effect sizes were observed in numerous GWAS performed since the development of the high-throughput SNP technology [169]. With the advantage of increased sample size come several challenges, including accurate phenotyping, especially across studies. While phenotype accuracy may be high for certain phenotypes, it may be more challenging for phenotypes like psychiatric disorders, where human judgement is involved in making a phenotype classification. As Monteith et al. [170] state, "human judgement and subject matter expertise are critical parts of big data analysis, and the active participation of psychiatrists is needed throughout the analytical process". This statement raises the question: when dealing with tens to hundreds of thousands of phenotypes, how will it be possible for individuals to provide sufficient scrutiny of each phenotype?

References

1. Box, G.E.P., Hunter, W.G., Hunter, J.S.: Statistics for Experimenters. Wiley Series in Probability and Mathematical Statistics. Wiley, New York (1978)
2. Douglas, J.A., Skol, A.D., Boehnke, M.: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am. J. Hum. Genet. 70(2), 487–495 (2002). https://doi.org/10.1086/338919
3. Doi: 2051-5960-2-4 [pii]
4. Alzheimer’s Association.: Alzheimer’s disease facts and figures. Alzheimers Dement. 7(2), 208–244 (2011). https://doi.org/10.1016/j.jalz.2011.02.004
5. Anstey, K.J., von Sanden, C., Salim, A., O’Kearney, R.: Smoking as a risk factor for dementia and cognitive decline: a meta-analysis of prospective studies. Am. J. Epidemiol. 166(4), 367–378 (2007). https://doi.org/10.1093/aje/kwm116


6. Bross, I.: Misclassification in 2 X 2 tables. Biometrics 10, 478–486 (1954) 7. Edwards, B.J., Haynes, C., Levenstien, M.A., Finch, S.J., Gordon, D.: Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Genet. 6, 18 (2005). https://doi.org/10.1186/1471-2156-6-18 8. Gordon, D., Haynes, C., Blumenfeld, J., Finch, S.J.: PAWE-3D: visualizing power for association with error in case-control genetic studies of complex traits. Bioinformatics 21(20), 3935–3937 (2005). https://doi.org/10.1093/bioinformatics/bti643 9. Ji, F., Yang, Y., Haynes, C., Finch, S.J., Gordon, D.: Computing asymptotic power and sample size for case-control genetic association studies in the presence of phenotype and/or genotype misclassification errors. Stat. Appl. Genet. Mol. Biol. 4, Article37 (2005). https://doi.org/10. 2202/1544-6115.1184 10. Zheng, G., Tian, X.: The impact of diagnostic error on testing genetic association in casecontrol studies. Stat. Med. 24(6), 869–882 (2005). https://doi.org/10.1002/sim.1976 11. Wojczynski, M.K., Tiwari, H.K.: Definition of phenotype. Adv. Genet. 60, 75–105 (2008). https://doi.org/S0065-2660(07)00404-X[pii]10.1016/S0065-2660(07)00404-X 12. Buyske, S., Yang, G., Matise, T.C., Gordon, D.: When a case is not a case: effects of phenotype misclassification on power and sample size requirements for the transmission disequilibrium test with affected child trios. Hum. Hered. 67(4), 287–292 (2009). https://doi.org/000194981 [pii]10.1159/000194981 13. Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., Haroutunian, V.: Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat. Appl. Genet. Mol. Biol. 3, Article26 (2004). https:// doi.org/10.2202/1544-6115.1085 14. 
King, A., Maekawa, S., Bodi, I., Troakes, C., Curran, O., Ashkan, K., Al-Sarraj, S.: Simulated surgical-type cerebral biopsies from post-mortem brains allows accurate neuropathological diagnoses in the majority of neurodegenerative disease groups. Acta Neuropathol. Commun. 1, 53 (2013) 15. Alonso-Betanzos, A., Bolon-Canedo, V., Heyndrickx, G.R., Kerkhof, P.L.: Exploring guidelines for classification of major heart failure subtypes by using machine learning. Clin. Med. Insights Cardiol. 9(Suppl 1), 57–71 (2015). https://doi.org/10.4137/cmc.s18746 16. Anderson, A., Douglas, P.K., Kerr, W.T., Haynes, V.S., Yuille, A.L., Xie, J., et al.: Nonnegative matrix factorization of multimodal MRI, FMRI and phenotypic data reveals differential changes in default mode subnetworks in ADHD. Neuroimage 102(Pt 1), 207–219 (2014). https://doi.org/10.1016/j.neuroimage.2013.12.015 17. Anderson, A.E., Kerr, W.T., Thames, A., Li, T., Xiao, J., Cohen, M.S.: Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: a cross-sectional, unselected, retrospective study. J. Biomed. Inform. 60, 162–168 (2016). https://doi.org/10.1016/j.jbi.2015.12.006 18. Candia, J., Maunu, R., Driscoll, M., Biancotto, A., Dagur, P., McCoy, J.P., Jr., et al.: From cellular characteristics to disease diagnosis: uncovering phenotypes with supercells. PLoS Comput. Biol. 9(9), e1003215 (2013). https://doi.org/10.1371/journal.pcbi.1003215 19. Crippa, A., Salvatore, C., Perego, P., Forti, S., Nobile, M., Molteni, M., Castiglioni, I.: Use of machine learning to identify children with autism and their motor abnormalities. J. Autism Dev. Disord. 45(7), 2146–2156 (2015). https://doi.org/10.1007/s10803-015-2379-8 20. Duarte, J.V., Ribeiro, M.J., Violante, I.R., Cunha, G., Silva, E., Castelo-Branco, M.: Multivariate pattern analysis reveals subtle brain anomalies relevant to the cognitive phenotype in neurofibromatosis type 1. Hum. Brain Mapp. 
35(1), 89–106 (2014). https://doi.org/10.1002/hbm.22161
21. Duda, M., Kosmicki, J.A., Wall, D.P.: Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Transl. Psychiatry 4, e424 (2014). https://doi.org/10.1038/tp.2014.65
22. Durr, O., Sick, B.: Single-cell phenotype classification using deep convolutional neural networks. J. Biomol. Screen (2016). https://doi.org/10.1177/1087057116631284


23. Gligorijevic, B., Bergman, A., Condeelis, J.: Multiparametric classification links tumor microenvironments with tumor cell phenotype. PLoS Biol. 12(11), e1001995 (2014). https:// doi.org/10.1371/journal.pbio.1001995 24. Guo, P., Zhang, Q., Zhu, Z., Huang, Z., Li, K.: Mining gene expression data of multiple sclerosis. PLoS ONE 9(6), e100052 (2014). https://doi.org/10.1371/journal.pone.0100052 25. Hizukuri, Y., Sawada, R., Yamanishi, Y.: Predicting target proteins for drug candidate compounds based on drug-induced gene expression data in a chemical structure-independent manner. BMC Med. Genomics 8, 82 (2015). https://doi.org/10.1186/s12920-015-0158-1 26. Holec, M., Kuzelka, O., Zelezny, F.: Novel gene sets improve set-level classification of prokaryotic gene expression data. BMC Bioinform. 16, 348 (2015). https://doi.org/10.1186/ s12859-015-0786-7 27. Jiang, H., Ching, W.K.: Classifying DNA repair genes by kernel-based support vector machines. Bioinformation 7(5), 257–263 (2011) 28. Kandaswamy, C., Silva, L.M., Alexandre, L.A., Santos, J.M.: High-content analysis of breast cancer using single-cell deep transfer learning. J. Biomol. Screen 21(3), 252–259 (2016). https://doi.org/10.1177/1087057115623451 29. Leung, R.K., Wang, Y., Ma, R.C., Luk, A.O., Lam, V., Ng, M., et al.: Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: a prospective case-control cohort analysis. BMC Nephrol. 14, 162 (2013). https://doi.org/10.1186/1471-2369-14-162 30. Lin, C., Karlson, E.W., Dligach, D., Ramirez, M.P., Miller, T.A., Mo, H., et al.: Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. J. Am. Med. Inform. Assoc. 22(e1), e151–161 (2015). https:// doi.org/10.1136/amiajnl-2014-002642 31. 
Lotsch, J., Ultsch, A.: A machine-learned knowledge discovery method for associating complex phenotypes with complex genotypes. Application to pain. J. Biomed. Inform. 46(5), 921–928 (2013). https://doi.org/10.1016/j.jbi.2013.07.010 32. Lueken, U., Hilbert, K., Wittchen, H.U., Reif, A., Hahn, T.: Diagnostic classification of specific phobia subtypes using structural MRI data: a machine-learning approach. J. Neural Transm. 122(1), 123–134 (2015). https://doi.org/10.1007/s00702-014-1272-5 33. Pasolli, E., Truong, D.T., Malik, F., Waldron, L., Segata, N.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12(7), e1004977 (2016). https://doi.org/10.1371/journal.pcbi.1004977 34. Peissig, P.L., Santos Costa, V., Caldwell, M.D., Rottscheit, C., Berg, R.L., Mendonca, E.A., Page, D.: Relational machine learning for electronic health record-driven phenotyping. J. Biomed. Inform. 52, 260–270 (2014). https://doi.org/10.1016/j.jbi.2014.07.007 35. Ramadan, E., Alinsaif, S., Hassan, M.R.: Network topology measures for identifying diseasegene association in breast cancer. BMC Bioinform. 17(Suppl 7), 274 (2016). https://doi.org/ 10.1186/s12859-016-1095-5 36. Sahu, A.D., Aniba, R., Chang, Y.P., Hannenhalli, S.: Epigenomic model of cardiac enhancers with application to genome wide association studies. Pac. Symp. Biocomput. 92–102 (2013) 37. Schmitz, B., De Maria, R., Gatsios, D., Chrysanthakopoulou, T., Landolina, M., Gasparini, M., et al.: Identification of genetic markers for treatment success in heart failure patients: insight from cardiac resynchronization therapy. Circ. Cardiovasc. Genet. 7(6), 760–770 (2014). https://doi.org/10.1161/circgenetics.113.000384 38. Schmitz, B., De Maria, R., Gatsios, D., Chrysanthakopoulou, T., Landolina, M., Gasparini, M., et al.: Genetic markers in cardiac resynchronization therapy treatment success. J. Hypertens. 33(Suppl 1), e3 (2015). 
https://doi.org/10.1097/01.hjh.0000467358.81839.de
39. Seffens, W., Evans, C., Taylor, H.: Machine learning data imputation and classification in a multicohort hypertension clinical study. Bioinform. Biol. Insights 9(Suppl 3), 43–54 (2015). https://doi.org/10.4137/bbi.s29473
40. Shi, M., Wu, M., Pan, P., Zhao, R.: Network-based sub-network signatures unveil the potential for acute myeloid leukemia therapy. Mol. Biosyst. 10(12), 3290–3297 (2014). https://doi.org/10.1039/c4mb00440j


41. Upstill-Goddard, R., Eccles, D., Ennis, S., Rafiq, S., Tapper, W., Fliege, J., Collins, A.: Support vector machine classifier for estrogen receptor positive and negative early-onset breast cancer. PLoS ONE 8(7), e68606 (2013). https://doi.org/10.1371/journal.pone.0068606 42. Wall, D.P., Kosmicki, J., Deluca, T.F., Harstad, E., Fusaro, V.A.: Use of machine learning to shorten observation-based screening and diagnosis of autism. Transl. Psychiatry 2, e100 (2012). https://doi.org/10.1038/tp.2012.10 43. Wilhelm, T.: Phenotype prediction based on genome-wide DNA methylation data. BMC Bioinform. 15, 193 (2014). https://doi.org/10.1186/1471-2105-15-193 44. Yotsukura, S., Karasuyama, M., Takigawa, I., Mamitsuka, H.: Exploring phenotype patterns of breast cancer within somatic mutations: a modicum in the intrinsic code. Brief Bioinform. (2016). https://doi.org/10.1093/bib/bbw040 45. Bird, T.D.: Genetic aspects of alzheimer disease. Genet. Med. 10(4), 231–239 (2008) 46. Appel, J., Potter, E., Shen, Q., Pantol, G., Greig, M.T., Loewenstein, D., Duara, R.: A comparative analysis of structural brain MRI in the diagnosis of alzheimer’s disease. Behav. Neurol. 21(1), 13–19 (2009). https://doi.org/X6514U5621006887[pii]10.3233/BEN-2009-0225 47. Sproul, A.A., Vensand, L.B., Dusenberry, C.R., Jacob, S., Vonsattel, J.P.G., Paull, D.J., et al.: Generation of iPSC lines from archived non-cryoprotected biobanked dura mater. Acta Neuropathol. Commun. 2, 4 (2014) 48. Online Mendelian Inheritance in Man, Omim®. https://omim.org/. Accessed 30 Dec 2019 49. Ioannidis, J.P.A., Yu, Y., Seddon, J.M.: Correction of phenotype misclassification based on high-discrimination genetic predictive risk models. Epidemiology 23(6), 902–909 (2012). https://doi.org/10.1097/EDE.0b013e31826c3129 50. Saito, S., Ohno, K., Sese, J., Sugawara, K., Sakuraba, H.: Prediction of the clinical phenotype of fabry disease based on protein sequential and structural information. J. Hum. Genet. 55(3), 175–178 (2010). 
https://doi.org/10.1038/jhg.2010.5 51. Ham, A.L., Kumar, A., Deeter, R., Schanen, N.C.: Does genotype predict phenotype in rett syndrome? J. Child Neurol. 20(9), 768–778 (2005) 52. Benitez, B.A., Davis, A.A., Jin, S.C., Ibanez, L., Ortega-Cubero, S., Pastor, P., et al.: Resequencing analysis of five mendelian genes and the top genes from genome-wide association studies in Parkinson’s disease. Mol. Neurodegener. 11, 29 (2016). https://doi.org/10.1186/s13 024-016-0097-0 53. Arning, L.: The search for modifier genes in Huntington disease—multifactorial aspects of a monogenic disorder. Mol. Cell Probes (2016). https://doi.org/10.1016/j.mcp.2016.06.006 54. Funayama, M., Ohe, K., Amo, T., Furuya, N., Yamaguchi, J., Saiki, S., et al.: CHCHD2 mutations in autosomal dominant late-onset Parkinson’s disease: a genome-wide linkage and sequencing study. Lancet Neurol. 14(3), 274–282 (2015). https://doi.org/10.1016/s1474-442 2(14)70266-2 55. Whyte, M.P., Tau, C., McAlister, W.H., Zhang, X., Novack, D.V., Preliasco, V., et al.: Juvenile Paget’s disease with heterozygous duplication within TNFRSF11A encoding rank. Bone 68, 153–161 (2014). https://doi.org/10.1016/j.bone.2014.07.019 56. Watanabe, A., Satoh, S., Fujita, A., Naing, B.T., Orimo, H., Shimada, T.: Perinatal hypophosphatasia caused by uniparental isodisomy. Bone 60, 93–97 (2014). https://doi.org/10.1016/j. bone.2013.12.009 57. Strickler, A., Perez, A., Risco, M., Gallo, S.: Bacillus Calmette-Guerin (BCG) disease and Interleukin 12 receptor beta1 deficiency: clinical experience of two familial and one sporadic case. Rev. Chilena Infectol. 31(4), 444–451 (2014). https://doi.org/10.4067/s0716-101820 14000400010 58. Renaux-Petel, M., Sesboue, R., Baert-Desurmont, S., Vasseur, S., Fourneaux, S., Bessenay, E., et al.: The MDM 2 285g–309g haplotype is associated with an earlier age of tumour onset in patients with Li-Fraumeni syndrome. Fam. Cancer 13(1), 127–130 (2014). https://doi.org/ 10.1007/s10689-013-9667-2 59. 
Ratnapriya, R., Zhan, X., Fariss, R.N., Branham, K.E., Zipprer, D., Chakarova, C.F., et al.: Rare and common variants in extracellular matrix gene Fibrillin 2 (FBN2) are associated with macular degeneration. Hum. Mol. Genet. 23(21), 5827–5837 (2014). https://doi.org/10.1093/hmg/ddu276


60. Abel, L., El-Baghdadi, J., Bousfiha, A.A., Casanova, J.L., Schurr, E.: Human genetics of tuberculosis: a long and winding road. Philos. Trans. R. Soc. Lond. B Biol. Sci. 369(1645), 20130428 (2014). https://doi.org/10.1098/rstb.2013.0428 61. Suraj Singh, H., Ghosh, P.K., Saraswathy, K.N.: DRD2 and ANKK1 gene polymorphisms and alcohol dependence: a case-control study among a Mendelian population of East Asian ancestry. Alcohol Alcohol 48(4), 409–414 (2013). https://doi.org/10.1093/alcalc/agt014 62. Blom, T., Schmiedt, M.L., Wong, A.M., Kyttala, A., Soronen, J., Jauhiainen, M., et al.: Exacerbated neuronal ceroid lipofuscinosis phenotype in Cln1/5 double-knockout mice. Dis. Model Mech. 6(2), 342–357 (2013). https://doi.org/10.1242/dmm.010140 63. Ritchie, M.D., Rowan, S., Kucera, G., Stubblefield, T., Blair, M., Carter, S., et al.: Chromosome 4q25 variants are genetic modifiers of rare ion channel mutations associated with familial atrial fibrillation. J. Am. Coll. Cardiol. 60(13), 1173–1181 (2012). https://doi.org/10.1016/j.jacc. 2012.04.030 64. Michelini, S., Degiorgio, D., Cestari, M., Corda, D., Ricci, M., Cardone, M., et al.: Clinical and genetic study of 46 Italian patients with primary lymphedema. Lymphology 45(1), 3–12 (2012) 65. Russo, L., Iafusco, D., Brescianini, S., Nocerino, V., Bizzarri, C., Toni, S., et al.: Permanent diabetes during the first year of life: multiple gene screening in 54 patients. Diabetologia 54(7), 1693–1701 (2011). https://doi.org/10.1007/s00125-011-2094-8 66. Reitz, C., Mayeux, R.: Endophenotypes in normal brain morphology and alzheimer’s disease: a review. Neuroscience 164(1), 174–190 (2009). https://doi.org/10.1016/j.neuroscience.2009. 04.006 67. Qari, A., Al-Mayouf, S., Al-Owain, M.: Mode of inheritance in systemic lupus erythematosus in saudi multiplex families. Genet. Couns. 20(3), 215–223 (2009) 68. 
Clarimon, J., Djaldetti, R., Lleo, A., Guerreiro, R.J., Molinuevo, J.L., Paisan-Ruiz, C., et al.: Whole genome analysis in a consanguineous family with early onset alzheimer’s disease. Neurobiol. Aging 30(12), 1986–1991 (2009). https://doi.org/10.1016/j.neurobiolaging.2008. 02.008 69. Beland, K., Lapierre, P., Alvarez, F.: Influence of genes, sex, age and environment on the onset of autoimmune hepatitis. World J. Gastroenterol. 15(9), 1025–1034 (2009) 70. Lesnick, T.G., Papapetropoulos, S., Mash, D.C., Ffrench-Mullen, J., Shehadeh, L., de Andrade, M., et al.: A genomic pathway approach to a complex disease: axon guidance and Parkinson disease. PLoS Genet. 3(6), e98 (2007). https://doi.org/10.1371/journal.pgen. 0030098 71. Kyriakou, T., Pontefract, D.E., Viturro, E., Hodgkinson, C.P., Laxton, R.C., Bogari, N., et al.: Functional polymorphism in ABCA1 influences age of symptom onset in coronary artery disease patients. Hum. Mol. Genet. 16(12), 1412–1422 (2007). https://doi.org/10.1093/hmg/ ddm091 72. Boardman, L.A., Morlan, B.W., Rabe, K.G., Petersen, G.M., Lindor, N.M., Nigon, S.K., et al.: Colorectal cancer risks in relatives of young-onset cases: is risk the same across all first-degree relatives? Clin. Gastroenterol. Hepatol. 5(10), 1195–1198 (2007). https://doi.org/10.1016/j. cgh.2007.06.001 73. Sundin, O.H., Jun, A.S., Broman, K.W., Liu, S.H., Sheehan, S.E., Vito, E.C.et al.: Linkage of late-onset fuchs corneal dystrophy to a novel locus at 13ptel-13q12.13. Invest. Ophthalmol. Vis. Sci. 47(1), 140–145 (2006). https://doi.org/10.1167/iovs.05-0578 74. Simpson, C.L., Al-Chalabi, A.: Amyotrophic lateral sclerosis as a complex genetic disease. Biochim. Biophys. Acta 1762(11–12), 973–985 (2006). https://doi.org/10.1016/j.bbadis. 2006.08.001 75. Mathias, R.A., Hening, W., Washburn, M., Allen, R.P., Lesage, S., Wilson, A.F., Earley, C.J.: Segregation analysis of restless legs syndrome: possible evidence for a major gene in a family study using blinded diagnoses. Hum. 
Hered. 62(3), 157–164 (2006). https://doi.org/10.1159/000096443
76. Bougeard, G., Baert-Desurmont, S., Tournier, I., Vasseur, S., Martin, C., Brugieres, L., et al.: Impact of the MDM2 SNP309 and P53 Arg72pro polymorphism on age of tumour onset in Li-Fraumeni syndrome. J. Med. Genet. 43(6), 531–533 (2006). https://doi.org/10.1136/jmg.2005.037952
77. Biskup, S., Mueller, J.C., Sharma, M., Lichtner, P., Zimprich, A., Berg, D., et al.: Common variants of LRRK2 are not associated with sporadic Parkinson’s disease. Ann. Neurol. 58(6), 905–908 (2005). https://doi.org/10.1002/ana.20664
78. Valeri, A., Briollais, L., Azzouzi, R., Fournier, G., Mangin, P., Berthon, P., et al.: Segregation analysis of prostate cancer in france: evidence for autosomal dominant inheritance and residual brother-brother dependence. Ann. Hum. Genet. 67(Pt 2), 125–137 (2003)
79. Zhang, X., Wang, H., Te-Shao, H., Yang, S., Chen, S.: The genetic epidemiology of psoriasis vulgaris in Chinese Han. Int. J. Dermatol. 41(10), 663–669 (2002)
80. Maher, B.S., Marazita, M.L., Zubenko, W.N., Spiker, D.G., Giles, D.E., Kaplan, B.B., Zubenko, G.S.: Genetic segregation analysis of recurrent, early-onset major depression: evidence for single major locus transmission. Am. J. Med. Genet. 114(2), 214–221 (2002)
81. McCabe, L.L., McCabe, E.R.: Postgenomic medicine. Presymptomatic testing for prediction and prevention. Clin. Perinatol. 28(2), 425–434 (2001)
82. Siegmund, K.D., Todorov, A.A., Province, M.A.: A frailty approach for modelling diseases with variable age of onset in families: the NHLBI family heart study. Stat. Med. 18(12), 1517–1528 (1999)
83. Pei, Y., He, N., Wang, K., Kasenda, M., Paterson, A.D., Chan, G., et al.: A spectrum of mutations in the polycystic kidney disease-2 (PKD2) gene from eight Canadian kindreds. J. Am. Soc. Nephrol. 9(10), 1853–1860 (1998)
84. Aitken, J.F., Bailey-Wilson, J., Green, A.C., MacLennan, R., Martin, N.G.: Segregation analysis of cutaneous melanoma in Queensland. Genet. Epidemiol. 15(4), 391–401 (1998). https://doi.org/10.1002/(sici)1098-2272(1998)15:4%3c391::aid-gepi5%3e3.0.co;2-5
85. Rao, V.S., Cupples, A., van Duijn, C.M., Kurz, A., Green, R.C., Chui, H., et al.: Evidence for major gene inheritance of alzheimer disease in families of patients with and without apolipoprotein E epsilon 4. Am. J. Hum. Genet. 59(3), 664–675 (1996)
86. Petronis, A., Kennedy, J.L.: Unstable genes-unstable mind? Am. J. Psychiatry 152(2), 164–172 (1995). https://doi.org/10.1176/ajp.152.2.164
87. Abel, L., Vu, D.L., Oberti, J., Nguyen, V.T., Van, V.C., Guilloud-Bataille, M., et al.: Complex segregation analysis of leprosy in Southern Vietnam. Genet. Epidemiol. 12(1), 63–82 (1995). https://doi.org/10.1002/gepi.1370120107
88. Rao, V.S., van Duijn, C.M., Connor-Lacke, L., Cupples, L.A., Growdon, J.H., Farrer, L.A.: Multiple etiologies for alzheimer disease are revealed by segregation analysis. Am. J. Hum. Genet. 55(5), 991–1000 (1994)
89. Yang, H., McElree, C., Roth, M.P., Shanahan, F., Targan, S.R., Rotter, J.I.: Familial empirical risks for inflammatory bowel disease: differences between Jews and non-Jews. Gut 34(4), 517–524 (1993)
90. Golbe, L.I., Lazzarini, A.M., Schwarz, K.O., Mark, M.H., Dickson, D.W., Duvoisin, R.C.: Autosomal dominant Parkinsonism with benign course and typical Lewy-body pathology. Neurology 43(11), 2222–2227 (1993)
91. Carter, B.S., Beaty, T.H., Steinberg, G.D., Childs, B., Walsh, P.C.: Mendelian inheritance of familial prostate cancer. Proc. Natl. Acad. Sci. USA 89(8), 3367–3371 (1992)
92. Fitzsimmons, J.S., Guilbert, P.R., Fitzsimmons, E.M.: Evidence of genetic factors in hidradenitis suppurativa. Br. J. Dermatol. 113(1), 1–8 (1985)
93. Costa, T., Scriver, C.R., Childs, B.: The effect of mendelian disease on human health: a measurement. Am. J. Med. Genet. 21(2), 231–242 (1985). https://doi.org/10.1002/ajmg.1320210205
94. Harper, P.S., Brotherton, B.J., Cochlin, D.: Genetic risks in Perthes’ disease. Clin. Genet. 10(3), 178–182 (1976)
95. Egeland, J.A., Gerhard, D.S., Pauls, D.L., Sussex, J.N., Kidd, K.K., Allen, C.R., et al.: Bipolar affective disorders linked to DNA markers on chromosome 11. Nature 325(6107), 783–787 (1987)


96. Ott, J.: Analysis of Human Genetic Linkage, 3rd edn. The John Hopkins University Press, Baltimore, MD (1999) 97. Kelsoe, J.R., Ginns, E.I., Egeland, J.A., Gerhard, D.S., Goldstein, A.M., Bale, S.J., et al.: Reevaluation of the linkage relationship between chromosome 11p loci and the gene for bipolar affective disorder in the Old Order Amish. Nature 342(6247), 238–243 (1989). https://doi. org/10.1038/342238a0 98. Massi, D., Beltrami, G., Capanna, R., Franchi, A.: Histopathological re-classification of extremity pleomorphic soft tissue sarcoma has clinical relevance. Eur. J. Surg. Oncol. 30(10), 1131–1136 (2004). https://doi.org/10.1016/j.ejso.2004.07.018 99. Fletcher, C.D., Gustafson, P., Rydholm, A., Willen, H., Akerman, M.: Clinicopathologic reevaluation of 100 malignant fibrous histiocytomas: prognostic relevance of subclassification. J. Clin. Oncol. 19(12), 3045–3050 (2001) 100. Deyrup, A.T., Haydon, R.C., Huo, D., Ishikawa, A., Peabody, T.D., He, T.C., Montag, A.G.: Myoid differentiation and prognosis in adult pleomorphic sarcomas of the extremity: an analysis of 92 cases. Cancer 98(4), 805–813 (2003). https://doi.org/10.1002/cncr.11617 101. Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., Munafo, M.R.: Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14(5), 365–376 (2013). https://doi.org/10.1038/nrn3475 102. Abreu, P.C., Hodge, S.E., Greenberg, D.A.: Quantification of type i error probabilities for heterogeneity lod scores. Genet. Epidemiol. 22(2), 156–169 (2002). https://doi.org/10.1002/ gepi.0155 103. Hodge, S.E., Hager, V.R., Greenberg, D.A.: Correction: using linkage analysis to detect genegene interactions. 2. Improved reliability and extension to more-complex models. PLoS ONE 11(3), e0151686 (2016). https://doi.org/10.1371/journal.pone.0151686 104. Hodge, S.E., Hager, V.R., Greenberg, D.A.: Using linkage analysis to detect gene-gene interactions. 2. 
Improved reliability and extension to more-complex models. PLoS ONE 11(1), e0146240 (2016). https://doi.org/10.1371/journal.pone.0146240 105. Hodge, S.E., Vieland, V.J., Greenberg, D.A.: Hlods remain powerful tools for detection of linkage in the presence of genetic heterogeneity. Am. J. Hum. Genet. 70(2), 556–559 (2002). https://doi.org/10.1086/338923 106. Spence, M.A., Greenberg, D.A., Hodge, S.E., Vieland, V.J.: The Emperor’s new methods. Am. J. Hum. Genet. 72(5), 1084–1087 (2003). https://doi.org/10.1086/374826 107. Durner, M., Greenberg, D.A., Hodge, S.E.: Phenocopies versus genetic heterogeneity: can we use phenocopy frequencies in linkage analysis to compensate for heterogeneity? Hum. Hered. 46(5), 265–273 (1996) 108. Greenberg, D.A., Hodge, S.E.: linkage analysis under “random” and “genetic” reduced penetrance. Genet. Epidemiol. 6(1), 259–264 (1989). https://doi.org/10.1002/gepi.137006 0145 109. Hodge, S.E., Abreu, P.C., Greenberg, D.A.: Magnitude of type I error when single-locus linkage analysis is maximized over models: a simulation study. Am. J. Hum. Genet. 60(1), 217–227 (1997) 110. Hodge, S.E., Durner, M., Vieland, V.J., Greenberg, D.A.: Better data analysis through data exploration. Am J. Hum. Genet. 53(3), 775–777 (1993) 111. Hodge, S.E., Greenberg, D.A.: Sensitivity of lod scores to changes in diagnostic status. Am. J. Hum. Genet. 50(5), 1053–1066 (1992) 112. Vieland, V., Greenberg, D.A., Hodge, S.E., Ott, J.: Linkage analysis of two-locus diseases under single-locus and two-locus analysis models. Cytogenet. Cell Genet. 59(2–3), 145–146 (1992) 113. Vieland, V.J., Greenberg, D.A., Hodge, S.E.: Adequacy of single-locus approximations for linkage analyses of oligogenic traits: extension to multigenerational pedigree structures. Hum. Hered. 43(6), 329–336 (1993) 114. Vieland, V.J., Hodge, S.E., Greenberg, D.A.: Adequacy of single-locus approximations for linkage analyses of oligogenic traits. Genet. Epidemiol. 9(1), 45–59 (1992). 
https://doi.org/ 10.1002/gepi.1370090106

124

3 Phenotypic Heterogeneity

115. Clerget-Darpoux, F., Bonaiti-Pellie, C., Hochez, J.: Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42(2), 393–399 (1986) 116. Ott, J.: Linkage analysis with misclassification at one locus. Clin. Genet. 12(2), 119–124 (1977) 117. Williamson, J.A., Amos, C.I.: On the asymptotic behavior of the estimate of the recombination fraction under the null hypothesis of no linkage when the model is misspecified. Genet. Epidemiol. 7(5), 309–318 (1990). https://doi.org/10.1002/gepi.1370070502 118. Bureau, A., Merette, C., Croteau, J., Fournier, A., Chagnon, Y.C., Roy, M.A., Maziade, M.: A new strategy for linkage analysis under epistasis taking into account genetic heterogeneity. Hum. Hered. 68(4), 231–242 (2009). https://doi.org/10.1159/000228921 119. Curtis, D., Sham, P.C.: Model-free linkage analysis using likelihoods. Am. J. Hum. Genet. 57(3), 703–716 (1995) 120. He, Z., Zhang, M., Lee, S., Smith, J.A., Guo, X., Palmas, W., et al.: Set-based tests for genetic association in longitudinal studies. Biometrics 71(3), 606–615 (2015). https://doi. org/10.1111/biom.12310 121. Heath, S.C.: Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am. J. Hum. Genet. 61(3), 748–760 (1997). https://doi.org/10.1086/515506 122. Mallick, H., Tiwari, H.K.: EM adaptive LASSO—a multilocus modeling strategy for detecting SNPs associated with zero-inflated count phenotypes. Front. Genet. 7, 32 (2016). https://doi. org/10.3389/fgene.2016.00032 123. Mandal, D.M., Sorant, A.J., Atwood, L.D., Wilson, A.F., Bailey-Wilson, J.E.: Allele frequency misspecification: effect on power and type I error of model-dependent linkage analysis of quantitative traits under random ascertainment. BMC Genet. 7, 21 (2006). https://doi.org/10. 1186/1471-2156-7-21 124. Mandal, D.M., Wilson, A.F., Bailey-Wilson, J.E.: Effects of misspecification of allele frequencies on the power of Haseman-Elston sib-pair linkage method for quantitative traits. Am. J. Med. 
Genet. 103(4), 308–313 (2001) 125. Mandal, D.M., Wilson, A.F., Elston, R.C., Weissbecker, K., Keats, B.J., Bailey-Wilson, J.E.: Effects of misspecification of allele frequencies on the type I error rate of model-free linkage analysis. Hum. Hered. 50(2), 126–132 (2000). https://doi.org/22900 126. Olson, J.M., Song, Y., Lu, Q., Wedig, G.C., Goddard, K.A.: Using overall allele-sharing to detect the presence of large-scale data errors and parameter misspecification in sib-pair linkage studies. Hum. Hered. 58(1), 49–54 (2004). https://doi.org/10.1159/000081456 127. Pal, D.K., Durner, M., Greenberg, D.A.: Effect of misspecification of gene frequency on the two-point lod score. Eur. J. Hum. Genet. 9(11), 855–859 (2001). https://doi.org/10.1038/sj. ejhg.5200724 128. Risch, N., Giuffra, L.: Model misspecification and multipoint linkage analysis. Hum. Hered. 42(1), 77–92 (1992) 129. Sung, Y.J., Rao, D.C.: Model-based linkage analysis with imprinting for quantitative traits: ignoring imprinting effects can severely jeopardize detection of linkage. Genet. Epidemiol. 32(5), 487–496 (2008). https://doi.org/10.1002/gepi.20321 130. Wang, J.Y., Tai, J.J.: Adaptive robust genetic association tests using case-parents triad families. Biom. J. 57(3), 453–467 (2015). https://doi.org/10.1002/bimj.201300135 131. Organisation for Economic Co-operation and Development (OECD.com): Definition of duplicate sample. https://stats.oecd.org/glossary/detail.asp?ID=3758 (2002) 132. Borchers, B., Brown, M., McLellan, B., Bekmetjev, A., Tintle, N.L.: Incorporating duplicate genotype data into linear trend tests of genetic association: methods and cost-effectiveness. Stat. Appl. Genet. Mol. Biol. 8, Article24 (2009). https://doi.org/10.2202/1544-6115.1433 133. Hossain, S., Le, N.D., Brooks-Wilson, A.R., Spinelli, J.J.: Impact of genotype misclassification on genetic association estimates and the Bayesian adjustment. Am. J. Epidemiol. 170(8), 994–1004 (2009). https://doi.org/10.1093/aje/kwp243 134. 
Huo, Y., Zou, H., Lang, M., Ji, S.X., Yin, X.L., Zheng, Z., et al.: Association between MTHFR c677t polymorphism and primary open-angle glaucoma: a meta-analysis. Gene 512(2), 179– 184 (2013). https://doi.org/10.1016/j.gene.2012.10.067

References

125

135. Lopez-Leon, S., Janssens, A.C., Gonzalez-Zuloeta Ladd, A.M., Del-Favero, J., Claes, S.J., Oostra, B.A., van Duijn, C.M.: Meta-analyses of genetic studies on major depressive disorder. Mol. Psychiatry 13(8), 772–785 (2008). https://doi.org/10.1038/sj.mp.4002088 136. Tintle, N., Gordon, D., Van Bruggen, D., Finch, S.: The cost effectiveness of duplicate genotyping for testing genetic association. Ann. Hum. Genet. 73(Pt 3), 370–378 (2009). https:// doi.org/10.1111/j.1469-1809.2009.00516.x 137. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis, 2nd edn. Wiley Series in Probability and Mathematical Statistics. Wiley, New York (1984) 138. Gordon, D., Haynes, C., Yang, Y., Kramer, P.L., Finch, S.J.: Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet. Epidemiol. 31(8), 853–870 (2007). https://doi.org/10.1002/gepi.20246 139. “Phenotypic Heterogeneity”. Accessed 30 Jan 2020 140. Bielinski, S.J., Pathak, J., Carrell, D.S., Takahashi, P.Y., Olson, J.E., Larson, N.B., et al.: A robust E-epidemiology tool in phenotyping heart failure with differentiation for preserved and reduced ejection fraction: the electronic medical records and genomics (emerge) network. J. Cardiovasc. Transl. Res. 8(8), 475–483 (2015). https://doi.org/10.1007/s12265-015-9644-2 141. Crawford, D.C., Crosslin, D.R., Tromp, G., Kullo, I.J., Kuivaniemi, H., Hayes, M.G., et al.: Emergeing progress in genomics-the first seven years. Front. Genet. 5, 184 (2014). https:// doi.org/10.3389/fgene.2014.00184 142. Cronin, R.M., Field, J.R., Bradford, Y., Shaffer, C.M., Carroll, R.J., Mosley, J.D., et al.: Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index. Front. Genet. 5, 250 (2014). https://doi. org/10.3389/fgene.2014.00250 143. 
Crosslin, D.R., McDavid, A., Weston, N., Nelson, S.C., Zheng, X., Hart, E., et al.: Genetic variants associated with the white blood cell count in 13,923 subjects in the emerge network. Hum. Genet. 131(4), 639–652 (2012). https://doi.org/10.1007/s00439-011-1103-9 144. Crosslin, D.R., Robertson, P.D., Carrell, D.S., Gordon, A.S., Hanna, D.S., Burt, A., et al.: Prospective participant selection and ranking to maximize actionable pharmacogenetic variants and discovery in the emerge network. Genome Med. 7(1), 67 (2015). https://doi.org/10. 1186/s13073-015-0181-z 145. Dumitrescu, L., Goodloe, R., Bradford, Y., Farber-Eger, E., Boston, J., Crawford, D.C.: The effects of electronic medical record phenotyping details on genetic association studies: Hdl-C as a case study. BioData Min. 8, 15 (2015). https://doi.org/10.1186/s13040-015-0048-2 146. Gottesman, O., Kuivaniemi, H., Tromp, G., Faucett, W.A., Li, R., Manolio, T.A., et al.: The electronic medical records and genomics (emerge) network: past, present, and future. Genet. Med. 15(10), 761–771 (2013). https://doi.org/10.1038/gim.2013.72 147. McCarty, C.A., Chisholm, R.L., Chute, C.G., Kullo, I.J., Jarvik, G.P., Larson, E.B., et al.: The emerge network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genomics 4, 13 (2011). https://doi.org/10.1186/ 1755-8794-4-13 148. Newton, K.M., Peissig, P.L., Kho, A.N., Bielinski, S.J., Berg, R.L., Choudhary, V., et al.: Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the emerge network. J. Am. Med. Inform. Assoc. 20(e1), e147–154 (2013). https://doi.org/10.1136/amiajnl-2012-000896 149. Pathak, J., Wang, J., Kashyap, S., Basford, M., Li, R., Masys, D.R., Chute, C.G.: Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the emerge network experience. J. Am. Med. Inform. Assoc. 18(4), 376–386 (2011). 
https://doi.org/10.1136/amiajnl-2010-000061 150. Pendergrass, S.A., Verma, S.S., Hall, M.A., Holzinger, E.R., Moore, C.B., Wallace, J.R. et al.: Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the phenx toolkit*. Pac. Symp. Biocomput. 495–505 (2015)

126

3 Phenotypic Heterogeneity

151. Pendergrass, S.A., Verma, S.S., Holzinger, E.R., Moore, C.B., Wallace, J., Dudek, S.M. et al.: Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the phenx toolkit. Pac. Symp. Biocomput. 147–158 (2013) 152. Rasmussen, L.V., Thompson, W.K., Pacheco, J.A., Kho, A.N., Carrell, D.S., Pathak, J., et al.: Design patterns for the development of electronic health record-driven phenotype extraction algorithms. J. Biomed. Inform. 51, 280–286 (2014). https://doi.org/10.1016/j.jbi.2014.06.007 153. Rasmussen-Torvik, L.J., Stallings, S.C., Gordon, A.S., Almoguera, B., Basford, M.A., Bielinski, S.J., et al.: Design and anticipated outcomes of the eMERGE-PGx project: a multicenter pilot for preemptive pharmacogenomics in electronic health record systems. Clin. Pharmacol. Ther. 96(4), 482–489 (2014). https://doi.org/10.1038/clpt.2014.137 154. Verma, S.S., Cooke Bailey, J.N., Lucas, A., Bradford, Y., Linneman, J.G., Hauser, M.A., et al.: Epistatic gene-based interaction analyses for glaucoma in emerge and neighbor consortium. PLoS Genet. 12(9), e1006186 (2016). https://doi.org/10.1371/journal.pgen.1006186 155. Zuvich, R.L., Armstrong, L.L., Bielinski, S.J., Bradford, Y., Carlson, C.S., Crawford, D.C., et al.: Pitfalls of merging gwas data: lessons learned in the emerge network and quality control procedures to maintain high data quality. Genet. Epidemiol. 35(8), 887–898 (2011). https:// doi.org/10.1002/gepi.20639 156. Bush, W.S., Boston, J., Pendergrass, S.A., Dumitrescu, L., Goodloe, R., Brown-Gentry, K.et al.: Enabling high-throughput genotype-phenotype associations in the epidemiologic architecture for genes linked to environment (eagle) project as part of the population architecture using genomics and epidemiology (page) study. Pac. Symp. Biocomput. 373–384 (2013) 157. Crawford, D.C., Goodloe, R., Brown-Gentry, K., Wilson, S., Roberson, J., Gillani, N.B. 
et al.: Characterization of the metabochip in diverse populations from the international hapmap project in the epidemiologic architecture for genes linked to environment (eagle) project. Pac. Symp. Biocomput. 188–199 (2013) 158. Dumitrescu, L., Carty, C.L., Franceschini, N., Hindorff, L.A., Cole, S.A., Buzkova, P., et al.: No evidence of interaction between known lipid-associated genetic variants and smoking in the multi-ethnic page population. Hum. Genet. 132(12), 1427–1431 (2013). https://doi.org/ 10.1007/s00439-013-1375-3 159. Dumitrescu, L., Goodloe, R., Brown-Gentry, K., Mayo, P., Allen, M., Jin, H., et al.: Serum vitamins a and e as modifiers of lipid trait genetics in the national health and nutrition examination surveys as part of the population architecture using genomics and epidemiology (page) study. Hum. Genet. 131(11), 1699–1708 (2012). https://doi.org/10.1007/s00439-012-1186-y 160. Dumitrescu, L., Restrepo, N.A., Goodloe, R., Boston, J., Farber-Eger, E., Pendergrass, S.A., et al.: Towards a phenome-wide catalog of human clinical traits impacted by genetic ancestry. BioData Min. 8, 35 (2015). https://doi.org/10.1186/s13040-015-0068-y 161. Kocarnik, J.M., Pendergrass, S.A., Carty, C.L., Pankow, J.S., Schumacher, F.R., Cheng, I., et al.: Multiancestral analysis of inflammation-related genetic variants and C-reactive protein in the population architecture using genomics and epidemiology study. Circ. Cardiovasc. Genet. 7(2), 178–188 (2014). https://doi.org/10.1161/circgenetics.113.000173 162. Lim, U., Wilkens, L.R., Monroe, K.R., Caberto, C., Tiirikainen, M., Cheng, I., et al.: Susceptibility variants for obesity and colorectal cancer risk: the multiethnic cohort and page studies. Int. J. Cancer. 131(6), E1038–1043 (2012). https://doi.org/10.1002/ijc.27592 163. 
Matise, T.C., Ambite, J.L., Buyske, S., Carlson, C.S., Cole, S.A., Crawford, D.C., et al.: The next page in understanding complex traits: design for the analysis of population architecture using genetics and epidemiology (page) study. Am. J. Epidemiol. 174(7), 849–859 (2011). https://doi.org/10.1093/aje/kwr160 164. Oetjens, M.T., Brown-Gentry, K., Goodloe, R., Dilks, H.H., Crawford, D.C.: Population stratification in the context of diverse epidemiologic surveys sans genome-wide data. Front. Genet. 7, 76 (2016). https://doi.org/10.3389/fgene.2016.00076 165. Pashova, H., LeBlanc, M., Kooperberg, C.: Boosting for detection of gene-environment interactions. Stat. Med. 32(2), 255–266 (2013). https://doi.org/10.1002/sim.5444

References

127

166. Pendergrass, S.A., Brown-Gentry, K., Dudek, S., Frase, A., Torstenson, E.S., Goodloe, R., et al.: Phenome-wide association study (PheWAS) for detection of pleiotropy within the population architecture using genomics and epidemiology (page) network. PLoS Genet. 9(1), e1003087 (2013). https://doi.org/10.1371/journal.pgen.1003087 167. Pendergrass, S.A., Brown-Gentry, K., Dudek, S.M., Torstenson, E.S., Ambite, J.L., Avery, C.L., et al.: The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genet. Epidemiol. 35(5), 410– 422 (2011). https://doi.org/10.1002/gepi.20589 168. Restrepo, N.A., Farber-Eger, E., Goodloe, R., Haines, J.L., Crawford, D.C.: Extracting primary open-angle glaucoma from electronic medical records for genetic association studies. PLoS ONE 10(6), e0127817 (2015). https://doi.org/10.1371/journal.pone.0127817 169. Ioannidis, J.P., Trikalinos, T.A., Khoury, M.J.: Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am. J. Epidemiol. 164(7), 609–614 (2006). https://doi.org/10.1093/aje/kwj259 170. Monteith, S., Glenn, T., Geddes, J., Whybrow, P.C., Bauer, M.: Big data for bipolar disorder. Int. J. Bipolar Disord. 4(1), 10 (2016). https://doi.org/10.1186/s40345-016-0051-7

Chapter 4

Association Tests Allowing for Heterogeneity

This chapter consists almost entirely of genetic association tests that allow for heterogeneity. The underlying mixture differs depending upon the statistical test. Specifically, we consider phenotype and genotype misclassification, locus heterogeneity, and phenotype heterogeneity. Our genomic data consist of genotype, multi-locus genotype, or next-generation sequence data. Our phenotype data are all categorical, with two or more phenotype categories. Virtually all of the tests are based on the EM algorithm. For each EM-based test, we provide definitions of terms, the null hypothesis for the test, closed-form solutions of the parameter estimates for each iteration step, and a formula for how the test statistic is computed. For a subset of the statistics, we provide examples of how the EM algorithm is applied to a fictitious data set.

4.1 Introduction

A number of papers published over the last several years deal with different forms of heterogeneity in tests of genetic association [1–41]. Roughly, the topics may be categorized into: genotype misclassification, phenotype misclassification, locus (or genetic) heterogeneity, and phenotypic (or diagnostic) heterogeneity. Below, we present statistics that allow for these different types of heterogeneity and also different types of data.

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-030-61121-7_4) contains supplementary material, which is available to authorized users.

© Springer Nature Switzerland AG 2020 D. Gordon et al., Heterogeneity in Statistical Genetics, Statistics for Biology and Health, https://doi.org/10.1007/978-3-030-61121-7_4


We reiterate the point we made in the Preface. Due to the complexity of the statistics, all notation is “local”, that is, for each statistical test, when necessary, we provide notation used in defining it.

4.2 Statistical Tests that Use Genotype Data

4.2.1 Likelihood Ratio Test that Allows for Random Phenotype and Genotype Misclassification Error (LRTae)

To develop our previously documented likelihood ratio statistic that allows for random misclassification in phenotypes and genotypes [10], we make use of the double-sampling strategy, first presented in Chap. 1. For our statistic, we specify that some individuals are double-sampled on either phenotype or genotype or both. As expressed in Chap. 1, being double-sampled on a category means that an individual is classified by two independent methods: a standard method and a “gold-standard” method. Also, the term “fallible” refers to the classification that may contain errors and the term “infallible” refers to the classification that has a much lower error rate [40].

Some examples of double-sample phenotype data are: Body Mass Index (BMI) (standard diagnosis) and Percent Body Fat (as determined by bioelectrical impedance anthropometric, physiological, and laboratory analyses) (gold-standard diagnosis) for obesity [41]; the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) diagnostic test [42] (standard diagnosis) and brain biopsy (gold-standard diagnosis) for Alzheimer’s Disease [6, 43]; and (non-invasive) clinical and imaging features (standard diagnosis) and liver biopsy (gold-standard diagnosis) for cirrhosis [44]. The quantitative traits can be converted to dichotomous or categorical traits through thresholds (see Chap. 6).

4.2.1.1 Notation and Definitions

We divide the sample of subjects into disjoint subsets called groups. We make the following definitions:

Group 1 = The set of all individuals who are double-sampled on both phenotype and genotype. We indicate the sample size by $N_1$.

Group 2 = The set of all individuals who are double-sampled on phenotype only. We indicate the sample size by $N_2$.

Group 3 = The set of all individuals who are double-sampled on genotype only. We indicate the sample size by $N_3$.

Group 4 = The set of all individuals who are double-sampled on neither phenotype nor genotype. We indicate the sample size by $N_4$.

$N = N_1 + N_2 + N_3 + N_4$ = Total sample size.

$n^{(1)}_{(i'j')(ij)}$ = Number of individuals in Group 1 with true phenotype $i'$, observed phenotype $i$, true genomic category $j'$, and observed genomic category $j$. There are two phenotype categories, numbered 0 (control) and 1 (case), and $k$ genomic categories, numbered 0 to $k-1$, so that $0 \le j, j' \le k-1$. The genomic categories may be alleles, genotypes, haplotypes, diplotypes, or multi-locus genotypes [1, 10, 35, 45, 46]. The ordering may have biological relevance (e.g., dosage of risk alleles) or none. For simplicity, we use the term “genotype” from this point forward for the genomic data. These individuals are double-sampled on both phenotype and genotype.

$n^{(1)}_{(i'+)(i+)} = \sum_{j'} \sum_{j} n^{(1)}_{(i'j')(ij)}$ = Number of individuals in Group 1 who have true phenotype $i'$ and observed phenotype $i$.

$n^{(1)}_{(+j')(+j)} = \sum_{i'} \sum_{i} n^{(1)}_{(i'j')(ij)}$ = Number of individuals in Group 1 who have true genotype $j'$ and observed genotype $j$.

$n^{(1)}_{(+j')(++)} = \sum_{i'} \sum_{i} \sum_{j} n^{(1)}_{(i'j')(ij)}$ = Number of individuals in Group 1 who have true genotype $j'$.

$n^{(2)}_{i'(ij)}$ = Number of individuals in Group 2 with true phenotype $i'$, observed phenotype $i$, and observed genotype $j$. These individuals are double-sampled on phenotype only.

$n^{(2)}_{i'(i+)} = \sum_{j} n^{(2)}_{i'(ij)}$ = Number of individuals in Group 2 who have true phenotype $i'$ and observed phenotype $i$.

$n^{(2)}_{i'(+j)} = \sum_{i} n^{(2)}_{i'(ij)}$.

$n^{(3)}_{j'(ij)}$ = Number of individuals in Group 3 with observed phenotype $i$, true genotype $j'$, and observed genotype $j$. These individuals are double-sampled on genotype only.

$n^{(3)}_{j'(+j)} = \sum_{i} n^{(3)}_{j'(ij)}$ = Number of individuals in Group 3 with true genotype $j'$ and observed genotype $j$.

$n^{(3)}_{j'(++)} = \sum_{j} n^{(3)}_{j'(+j)}$ = Number of individuals in Group 3 with true genotype $j'$.

$n^{(3)}_{j'(i+)} = \sum_{j} n^{(3)}_{j'(ij)}$.

$n^{(4)}_{(ij)}$ = Number of individuals in Group 4 with observed phenotype $i$ and observed genotype $j$. These individuals have no double-sample data.

$n^{(4)}_{(i+)} = \sum_{j} n^{(4)}_{(ij)}$, and $n^{(4)}_{(+j)} = \sum_{i} n^{(4)}_{(ij)}$.

$Y$ = Random variable whose outcome is in the set of possible phenotypes 0, 1; it is the fallible phenotype of an arbitrary individual.

$Y^t$ = Random variable whose outcome is in the set of possible phenotypes 0, 1; it is the true (infallible) phenotype of an arbitrary individual.

$X$ = Random variable whose outcome is in the set of possible genotypes $0, 1, \ldots, k-1$; it is the fallible genotype of an arbitrary individual.

$X^t$ = Random variable whose outcome is in the set of possible genotypes $0, 1, \ldots, k-1$; it is the true (infallible) genotype of an arbitrary individual.

$\theta_{j'j} = \Pr(X = j \mid X^t = j')$. When $j' \ne j$, these parameters are referred to as genotype misclassification parameters [47–49]. A fuller description of these parameters is provided in Chap. 2. As previously published [10],

$$
\hat{\theta}_{j'j} = \frac{n^{(1)}_{(+j')(+j)} + n^{(3)}_{j'(+j)}}{n^{(1)}_{(+j')(++)} + n^{(3)}_{j'(++)}}
$$

is the MLE of each classification and misclassification parameter.

$\pi_{i'i} = \Pr(Y = i \mid Y^t = i')$. When $i' \ne i$, these parameters are referred to as phenotype misclassification parameters [47–49]. As previously published,

$$
\hat{\pi}_{i'i} = \frac{n^{(1)}_{(i'+)(i+)} + n^{(2)}_{i'(i+)}}{n^{(1)}_{(i'+)(++)} + n^{(2)}_{i'(++)}}
$$

is the MLE of each classification and misclassification parameter.

$g_{j'i'} = \Pr(X^t = j' \mid Y^t = i')$ = True frequency of genotype $j'$, $0 \le j' \le k-1$, conditional on phenotype (affection status) $i'$, $0 \le i' \le 1$.

$g^{(r)}_{j'i'}$ = $r$th step estimate of the parameter $g_{j'i'}$ (determined through application of the EM algorithm).

$g_{j'*} = \Pr(X^t = j')$ = True frequency of genotype $j'$, $0 \le j' \le k-1$, under the null hypothesis $H_0: g_{j'0} = g_{j'1}$. The subscript “*” indicates that the genotype frequencies are independent of the phenotypes.

$g^{(r)}_{j'*}$ = $r$th step estimate of the parameter $g_{j'*}$ (determined through application of the EM algorithm).

$\phi_{i'} = \Pr(Y^t = i')$ = True frequency of phenotype $i'$, $0 \le i' \le 1$. As in previous chapters, this term represents the prevalence of phenotype $i'$ (see Sect. 1.4.2 of Chap. 1, Sect. 3.3.1 of Chap. 3).

$\phi^{(r)}_{i'}$ = $r$th step estimate of the parameter $\phi_{i'}$ (determined through application of the EM algorithm).

$\tau^{(G)(r)}_{(i'j')(ij)}$ = Bayesian posterior probability (BPP), at the $r$th iteration step, that an individual in Group $G$ has true phenotype $i'$ and true genotype $j'$ given that their observed phenotype and genotype are $i$ and $j$, respectively. Note that $1 \le G \le 4$.

$I_1(a, i', i, j', j)$ = Indicator function that Individual $a$ in Group 1 has true phenotype $i'$, observed phenotype $i$, true genotype $j'$, and observed genotype $j$.

$I_2(a, i', i, j)$ = Indicator function that Individual $a$ in Group 2 has true phenotype $i'$, observed phenotype $i$, and observed genotype $j$.

$I_3(a, i, j', j)$ = Indicator function that Individual $a$ in Group 3 has observed phenotype $i$, true genotype $j'$, and observed genotype $j$.

$I_4(a, i, j)$ = Indicator function that Individual $a$ in Group 4 has observed phenotype $i$ and observed genotype $j$.

$n^{(r)}_{(i'j')(ij)}$ = $r$th iteration step estimate of the number of individuals in the total sample that have true phenotype $i'$, observed phenotype $i$, true genotype $j'$, and observed genotype $j$. This value differs for each hypothesis (null or alternative).
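As an illustration of the MLE formulas for the classification parameters θ and π given in this section, the following Python sketch pools the two relevant double-sample margins and normalises each row. The function name `classification_mle` and every count below are hypothetical and chosen only for illustration; they are not data from this chapter.

```python
def classification_mle(counts_a, counts_b):
    """Pool two (true x observed) count tables and normalise each row.

    counts_a[t][o] and counts_b[t][o] hold the numbers of double-sampled
    individuals with true category t and observed category o in the two
    groups that carry gold-standard data for this variable.
    A true category with no double-sampled individuals would give a zero
    denominator; this sketch assumes every row total is positive.
    """
    estimates = []
    for row_a, row_b in zip(counts_a, counts_b):
        pooled = [a + b for a, b in zip(row_a, row_b)]
        total = sum(pooled)
        estimates.append([c / total for c in pooled])
    return estimates


# Hypothetical genotype margins (k = 3): rows = true j', columns = observed j.
n1_geno = [[95, 4, 1], [3, 90, 7], [0, 5, 45]]  # Group 1: n^(1)_(+j')(+j)
n3_geno = [[48, 2, 0], [2, 44, 4], [1, 2, 22]]  # Group 3: n^(3)_j'(+j)
theta = classification_mle(n1_geno, n3_geno)    # theta[j'][j]

# Hypothetical phenotype margins: rows = true i', columns = observed i.
n1_pheno = [[120, 5], [10, 115]]                # Group 1: n^(1)_(i'+)(i+)
n2_pheno = [[58, 2], [6, 54]]                   # Group 2: n^(2)_i'(i+)
pi = classification_mle(n1_pheno, n2_pheno)     # pi[i'][i]
```

Each row of `theta` and `pi` sums to one, because every individual with a given true category must be observed in exactly one category.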

$$
n^{(r)}_{(i'j')(++)} = \sum_{i} \sum_{j} n^{(r)}_{(i'j')(ij)}.
$$

$$
n^{(r)}_{(i'+)(++)} = \sum_{i} \sum_{j} \sum_{j'} n^{(r)}_{(i'j')(ij)}.
$$

$$
n^{(r)}_{(+j')(++)} = \sum_{i'} \sum_{i} \sum_{j} n^{(r)}_{(i'j')(ij)}.
$$

4.2.1.2 Log-Likelihoods of the Observed Data

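With appropriate ordering of individuals, we compute

$$
\begin{aligned}
\ln(L) ={}& \sum_{a=1}^{N_1} \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} I_1(a, i', i, j', j)\, \ln\left[\Pr\left(Y = i, Y^t = i', X = j, X^t = j' \text{ for ind } a\right)\right] \\
&+ \sum_{a=N_1+1}^{N_1+N_2} \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} I_2(a, i', i, j)\, \ln\left[\Pr\left(Y = i, Y^t = i', X = j \text{ for ind } a\right)\right] \\
&+ \sum_{a=N_1+N_2+1}^{N_1+N_2+N_3} \sum_{i=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} I_3(a, i, j', j)\, \ln\left[\Pr\left(Y = i, X^t = j', X = j \text{ for ind } a\right)\right] \\
&+ \sum_{a=N-N_4+1}^{N} \sum_{i=0}^{1} \sum_{j=0}^{k-1} I_4(a, i, j)\, \ln\left[\Pr\left(Y = i, X = j \text{ for ind } a\right)\right].
\end{aligned}
\tag{4.1}
$$

We simplify each of the probabilities as follows:

(Group 1)

$$
\begin{aligned}
\Pr\left(Y = i, Y^t = i', X = j, X^t = j'\right)
&= \Pr\left(Y = i, X = j \mid Y^t = i', X^t = j'\right) \times \Pr\left(Y^t = i', X^t = j'\right) \\
&= \Pr\left(Y = i \mid Y^t = i'\right) \times \Pr\left(X = j \mid X^t = j'\right) \times \Pr\left(X^t = j' \mid Y^t = i'\right) \times \Pr\left(Y^t = i'\right) \\
&= \pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'}.
\end{aligned}
\tag{4.2}
$$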

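The equality

$$
\Pr\left(Y = i, X = j \mid Y^t = i', X^t = j'\right) = \Pr\left(Y = i \mid Y^t = i'\right) \times \Pr\left(X = j \mid X^t = j'\right)
$$

that we use in Eq. (4.2) is a condition that we presented in previously published work [10]. The meaning is that, conditional on the true underlying phenotype and genotype values, the observed genotype and phenotype values are independent.

(Group 2)

$$
\begin{aligned}
\Pr\left(Y = i, Y^t = i', X = j\right)
&= \sum_{j'} \Pr\left(Y = i, Y^t = i', X = j, X^t = j'\right) \\
&= \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'} \\
&= \pi_{i'i}\, \phi_{i'} \left( \sum_{j'} \theta_{j'j}\, g_{j'i'} \right).
\end{aligned}
\tag{4.3}
$$

(Group 3)

$$
\begin{aligned}
\Pr\left(Y = i, X = j, X^t = j'\right)
&= \sum_{i'} \Pr\left(Y = i, Y^t = i', X = j, X^t = j'\right) \\
&= \sum_{i'} \pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'} \\
&= \theta_{j'j} \sum_{i'} \pi_{i'i}\, g_{j'i'}\, \phi_{i'}.
\end{aligned}
\tag{4.4}
$$

(Group 4)

$$
\begin{aligned}
\Pr(Y = i, X = j)
&= \sum_{i'} \sum_{j'} \Pr\left(Y = i, Y^t = i', X = j, X^t = j'\right) \\
&= \sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'}.
\end{aligned}
\tag{4.5}
$$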

We drop the suffix “for ind a” because the probabilities listed in Eqs. (4.2)–(4.5) are not dependent upon the individual. Rather, as we present in the next set of equations, the summation over the individuals $a$ in each group may be replaced by the counts $n^{(1)}_{(i'j')(ij)}$, $n^{(2)}_{i'(ij)}$, etc.
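To make Eqs. (4.2)–(4.5) concrete, the following Python sketch evaluates the joint probability for each group by summing the Group 1 product over whichever true classifications are unobserved. The function names and all parameter values are assumptions made for illustration; tables are indexed as `pi[i'][i]`, `theta[j'][j]`, `g[j'][i']`, and `phi[i']`.

```python
def joint_prob(pi, theta, g, phi, i, i_true, j, j_true):
    """Eq. (4.2): Pr(Y=i, Y^t=i', X=j, X^t=j')."""
    return pi[i_true][i] * theta[j_true][j] * g[j_true][i_true] * phi[i_true]


def prob_group2(pi, theta, g, phi, i, i_true, j):
    """Eq. (4.3): sum Eq. (4.2) over the unobserved true genotype j'."""
    return sum(joint_prob(pi, theta, g, phi, i, i_true, j, jt)
               for jt in range(len(theta)))


def prob_group3(pi, theta, g, phi, i, j, j_true):
    """Eq. (4.4): sum Eq. (4.2) over the unobserved true phenotype i'."""
    return sum(joint_prob(pi, theta, g, phi, i, it, j, j_true)
               for it in range(len(phi)))


def prob_group4(pi, theta, g, phi, i, j):
    """Eq. (4.5): sum Eq. (4.2) over both unobserved true classifications."""
    return sum(joint_prob(pi, theta, g, phi, i, it, j, jt)
               for it in range(len(phi)) for jt in range(len(theta)))


# Hypothetical parameter values (k = 3 genotypes, 2 phenotypes).
pi = [[0.95, 0.05], [0.10, 0.90]]
theta = [[0.98, 0.02, 0.00], [0.01, 0.98, 0.01], [0.00, 0.02, 0.98]]
g = [[0.56, 0.36], [0.37, 0.48], [0.07, 0.16]]  # each column sums to 1
phi = [0.9, 0.1]

# The observed-data probabilities of Eq. (4.5) form a distribution over (i, j).
total = sum(prob_group4(pi, theta, g, phi, i, j)
            for i in range(2) for j in range(3))
```

Because each row of `pi` and `theta` sums to one, and each column of `g` and the vector `phi` sum to one, `total` equals 1 up to floating-point error.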

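Using Eqs. (4.2)–(4.5), we may rewrite Eq. (4.1), the log-likelihood of the observed data, as

$$
\begin{aligned}
\ln(L) ={}& \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(1)}_{(i'j')(ij)} \ln\left(\pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'}\right) \\
&+ \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} n^{(2)}_{i'(ij)} \ln\left[\pi_{i'i}\, \phi_{i'} \left(\sum_{j'} \theta_{j'j}\, g_{j'i'}\right)\right] \\
&+ \sum_{i=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(3)}_{j'(ij)} \ln\left[\theta_{j'j} \sum_{i'} \pi_{i'i}\, g_{j'i'}\, \phi_{i'}\right] \\
&+ \sum_{i=0}^{1} \sum_{j=0}^{k-1} n^{(4)}_{(ij)} \ln\left[\sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, g_{j'i'}\, \phi_{i'}\right].
\end{aligned}
\tag{4.6}
$$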

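Under the null hypothesis, we can write Eq. (4.6) as

$$
\begin{aligned}
\ln(L) ={}& \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(1)}_{(i'j')(ij)} \ln\left(\pi_{i'i}\, \theta_{j'j}\, g_{j'*}\, \phi_{i'}\right) \\
&+ \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} n^{(2)}_{i'(ij)} \ln\left[\pi_{i'i}\, \phi_{i'} \left(\sum_{j'} \theta_{j'j}\, g_{j'*}\right)\right] \\
&+ \sum_{i=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(3)}_{j'(ij)} \ln\left[\theta_{j'j} \sum_{i'} \pi_{i'i}\, g_{j'*}\, \phi_{i'}\right] \\
&+ \sum_{i=0}^{1} \sum_{j=0}^{k-1} n^{(4)}_{(ij)} \ln\left[\sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, g_{j'*}\, \phi_{i'}\right].
\end{aligned}
\tag{4.7}
$$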
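The observed-data log-likelihoods of Eqs. (4.6) and (4.7) can be sketched in Python as a single function, since the null version differs only in that every column of the genotype-frequency table is the same. The dictionary layout for the counts, the function name, and all numbers below are assumptions made for illustration. The sketch also assumes that every cell with a positive count has positive probability, so that the logarithm is defined.

```python
import math


def log_likelihood(n1, n2, n3, n4, pi, theta, g, phi):
    """Observed-data log-likelihood, Eq. (4.6); Eq. (4.7) if g has equal columns.

    Counts are dicts keyed in subscript order:
      n1[(i_true, j_true, i, j)], n2[(i_true, i, j)],
      n3[(j_true, i, j)],         n4[(i, j)].
    Parameters: pi[i_true][i], theta[j_true][j], g[j_true][i_true], phi[i_true].
    """
    k, p = len(theta), len(phi)

    def pr(i, it, j, jt):  # Eq. (4.2)
        return pi[it][i] * theta[jt][j] * g[jt][it] * phi[it]

    ll = 0.0
    for (it, jt, i, j), n in n1.items():
        ll += n * math.log(pr(i, it, j, jt))
    for (it, i, j), n in n2.items():
        ll += n * math.log(sum(pr(i, it, j, v) for v in range(k)))
    for (jt, i, j), n in n3.items():
        ll += n * math.log(sum(pr(i, u, j, jt) for u in range(p)))
    for (i, j), n in n4.items():
        ll += n * math.log(sum(pr(i, u, j, v)
                               for u in range(p) for v in range(k)))
    return ll


# Hypothetical inputs with k = 2 genomic categories.
pi = [[0.95, 0.05], [0.10, 0.90]]
theta = [[0.97, 0.03], [0.05, 0.95]]
phi = [0.5, 0.5]
g_alt = [[0.70, 0.50], [0.30, 0.50]]    # g[j'][i'] under the alternative
g_null = [[0.60, 0.60], [0.40, 0.40]]   # g[j'][*] repeated for each i'
n1 = {(0, 0, 0, 0): 30, (1, 1, 1, 1): 20}
n2 = {(0, 0, 1): 5}
n3 = {(1, 1, 1): 8}
n4 = {(0, 0): 40, (0, 1): 15, (1, 0): 10, (1, 1): 35}

ll_alt = log_likelihood(n1, n2, n3, n4, pi, theta, g_alt, phi)
ll_null = log_likelihood(n1, n2, n3, n4, pi, theta, g_null, phi)
```

Passing the null frequencies as a table with identical columns keeps one code path for both hypotheses, which mirrors the way Eq. (4.7) is obtained from Eq. (4.6).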

The $r$th iteration step estimate of the log-likelihoods, $\ln(L^{(r)})$, is computed by substituting $g^{(r)}_{j'i'}$, $\phi^{(r)}_{i'}$, and $g^{(r)}_{j'*}$ in the appropriate places in Eqs. (4.6) and (4.7). Formulas for these terms are given below.

For expressions (4.8) and (4.9), we compute terms under the alternative hypothesis, with the understanding that each equation may be reduced to the null hypothesis situation by substituting $g^{(r)}_{j'*}$ for $g^{(r)}_{j'i'}$. In the set of equations (4.8), which are the BPPs for Groups 1 through 4, respectively, the variables $i$, $i'$, and $u'$ refer to phenotypes, while the variables $j$, $j'$, and $v'$ refer to genotypes. As in the equations above, a prime indicates the true or gold-standard classification. Following Gordon et al. [10], the BPPs for an arbitrary individual $a$ in Group $G$ may be written as

(Group 1)

$$
\tau^{(1)}_{(i'j')(ij)} =
\begin{cases}
1, & \text{if } I_1(a, i', i, j', j) = 1, \\
0, & \text{otherwise;}
\end{cases}
$$

(Group 2)

$$
\tau^{(2)(r)}_{(i'j')(ij)} =
\begin{cases}
\dfrac{\theta_{j'j}\, g^{(r)}_{j'i'}}{\sum_{v'} \theta_{v'j}\, g^{(r)}_{v'i'}}, & \text{if } I_2(a, i', i, j) = 1, \\
0, & \text{otherwise;}
\end{cases}
$$

(Group 3)

$$
\tau^{(3)(r)}_{(i'j')(ij)} =
\begin{cases}
\dfrac{\pi_{i'i}\, g^{(r)}_{j'i'}\, \phi^{(r)}_{i'}}{\sum_{u'} \pi_{u'i}\, g^{(r)}_{j'u'}\, \phi^{(r)}_{u'}}, & \text{if } I_3(a, i, j', j) = 1, \\
0, & \text{otherwise;}
\end{cases}
$$

(Group 4)

$$
\tau^{(4)(r)}_{(i'j')(ij)} =
\begin{cases}
\dfrac{\pi_{i'i}\, \theta_{j'j}\, g^{(r)}_{j'i'}\, \phi^{(r)}_{i'}}{\sum_{u'} \sum_{v'} \pi_{u'i}\, \theta_{v'j}\, g^{(r)}_{v'u'}\, \phi^{(r)}_{u'}}, & \text{if } I_4(a, i, j) = 1, \\
0, & \text{otherwise.}
\end{cases}
\tag{4.8}
$$
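The posterior probabilities in (4.8) for Groups 2 through 4 are each a normalised product of the current parameter estimates, which the following Python sketch illustrates. The function names and the numerical values are hypothetical; tables are indexed as `pi[i'][i]`, `theta[j'][j]`, `g[j'][i']`, and `phi[i']`, and the sketch assumes every normalising sum is positive.

```python
def bpp_group2(theta, g, i_true, j):
    """Group 2: posterior over true genotype j', given observed genotype j
    and the known true phenotype i'. Returns a list indexed by j'."""
    w = [theta[jt][j] * g[jt][i_true] for jt in range(len(theta))]
    s = sum(w)
    return [x / s for x in w]


def bpp_group3(pi, g, phi, j_true, i):
    """Group 3: posterior over true phenotype i', given observed phenotype i
    and the known true genotype j'. Returns a list indexed by i'."""
    w = [pi[it][i] * g[j_true][it] * phi[it] for it in range(len(phi))]
    s = sum(w)
    return [x / s for x in w]


def bpp_group4(pi, theta, g, phi, i, j):
    """Group 4: joint posterior over (i', j'), given observed (i, j).
    Returns a table tau[i'][j']."""
    w = [[pi[it][i] * theta[jt][j] * g[jt][it] * phi[it]
          for jt in range(len(theta))] for it in range(len(phi))]
    s = sum(sum(row) for row in w)
    return [[x / s for x in row] for row in w]


# Hypothetical current estimates with k = 2 genomic categories.
pi = [[0.95, 0.05], [0.10, 0.90]]
theta = [[0.9, 0.1], [0.2, 0.8]]
g = [[0.7, 0.5], [0.3, 0.5]]
phi = [0.6, 0.4]

tau2 = bpp_group2(theta, g, i_true=0, j=0)   # [0.63/0.69, 0.06/0.69]
tau3 = bpp_group3(pi, g, phi, j_true=1, i=1)
tau4 = bpp_group4(pi, theta, g, phi, i=1, j=0)
```

For Group 1 no computation is needed, because the true classifications are observed and the posterior is 0 or 1.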

As mentioned in Sect. 4.2.1.1, the term $\tau^{(G)(r)}_{(i'j')(ij)}$ is the $r$th step estimate of the conditional probability that an arbitrary individual $a$ from Group $G$, $1 \le G \le 4$, has true genotype $j'$ and true phenotype $i'$ given that their observed phenotype and genotype are $i$ and $j$, respectively. Let $n^{(r)}_{(i'j')(ij)}$ be defined by

$$
n^{(r)}_{(i'j')(ij)} = n^{(1)}_{(i'j')(ij)} + n^{(2)(r)}_{(i'j')(ij)} + n^{(3)(r)}_{(i'j')(ij)} + n^{(4)(r)}_{(i'j')(ij)},
$$

where

$$
\begin{aligned}
n^{(1)}_{(i'j')(ij)} &= n^{(1)}_{(i'j')(ij)} \times \tau^{(1)}_{(i'j')(ij)}, \\
n^{(2)(r)}_{(i'j')(ij)} &= n^{(2)}_{i'(ij)} \times \tau^{(2)(r)}_{(i'j')(ij)}, \\
n^{(3)(r)}_{(i'j')(ij)} &= n^{(3)}_{j'(ij)} \times \tau^{(3)(r)}_{(i'j')(ij)}, \\
n^{(4)(r)}_{(i'j')(ij)} &= n^{(4)}_{(ij)} \times \tau^{(4)(r)}_{(i'j')(ij)}.
\end{aligned}
\tag{4.9}
$$

In Formula (4.9), the superscripts $(r)$ indicate the $r$th step estimates of the counts of individuals in the respective groups (1)–(4) that have true phenotype $i'$, observed phenotype $i$, true genotype $j'$, and observed genotype $j$. Observe that $n^{(1)}_{(i'j')(ij)}$ has no iteration superscript $(r)$. The reason is that each individual in Group 1 is double-sampled on both phenotype and genotype, so there is no uncertainty regarding their true phenotype and genotype classifications.

Gordon et al. [10] proved that the closed-form solutions of the conditional genotype frequencies at the $(r+1)$st step, $r \ge 0$, under the alternative hypothesis are

$$
g^{(r+1)}_{j'i'} = \frac{n^{(r)}_{(i'j')(++)}}{n^{(r)}_{(i'+)(++)}}.
\tag{4.10}
$$

Similarly, the $(r+1)$st step values for the phenotype frequencies $\phi^{(r+1)}_{i'}$ are given by

$$
\phi^{(r+1)}_{i'} = \frac{n^{(r)}_{(i'+)(++)}}{N}.
\tag{4.11}
$$
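One full iteration, combining the expected counts of Formula (4.9) with the updates (4.10) and (4.11), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_step`, the dictionary layout of the counts, the `null` flag (which pools the expected counts over true phenotype to give one common genotype frequency), and all numbers are hypothetical.

```python
def em_step(n1, n2, n3, n4, pi, theta, g, phi, null=False):
    """One EM update of (g, phi); pi and theta are held fixed.

    Counts are dicts keyed in subscript order:
      n1[(i_true, j_true, i, j)], n2[(i_true, i, j)],
      n3[(j_true, i, j)],         n4[(i, j)].
    Parameters: pi[i_true][i], theta[j_true][j], g[j_true][i_true], phi[i_true].
    """
    k, p = len(theta), len(phi)

    def w(i, it, j, jt):  # joint probability, Eq. (4.2), at current estimates
        return pi[it][i] * theta[jt][j] * g[jt][it] * phi[it]

    # m[it][jt] accumulates the expected count n^(r)_(i'j')(++).
    m = [[0.0] * k for _ in range(p)]
    for (it, jt, i, j), n in n1.items():      # Group 1: truth is observed
        m[it][jt] += n
    for (it, i, j), n in n2.items():          # Group 2: true genotype unknown
        den = sum(w(i, it, j, v) for v in range(k))
        for jt in range(k):
            m[it][jt] += n * w(i, it, j, jt) / den
    for (jt, i, j), n in n3.items():          # Group 3: true phenotype unknown
        den = sum(w(i, u, j, jt) for u in range(p))
        for it in range(p):
            m[it][jt] += n * w(i, it, j, jt) / den
    for (i, j), n in n4.items():              # Group 4: both unknown
        den = sum(w(i, u, j, v) for u in range(p) for v in range(k))
        for it in range(p):
            for jt in range(k):
                m[it][jt] += n * w(i, it, j, jt) / den

    row = [sum(m[it]) for it in range(p)]     # n^(r)_(i'+)(++)
    total = sum(row)                          # equals the sample size N
    phi_new = [r / total for r in row]        # Eq. (4.11)
    if null:
        col = [sum(m[it][jt] for it in range(p)) for jt in range(k)]
        g_new = [[c / total] * p for c in col]  # common frequency for all i'
    else:
        g_new = [[m[it][jt] / row[it] for it in range(p)]
                 for jt in range(k)]            # Eq. (4.10)
    return g_new, phi_new


# Hypothetical data with k = 2 genomic categories.
pi = [[0.95, 0.05], [0.10, 0.90]]
theta = [[0.97, 0.03], [0.05, 0.95]]
n1 = {(0, 0, 0, 0): 30, (0, 1, 0, 1): 10, (1, 0, 1, 0): 12, (1, 1, 1, 1): 20}
n2 = {(0, 0, 1): 5, (1, 1, 0): 4}
n3 = {(1, 1, 1): 8, (0, 0, 0): 9}
n4 = {(0, 0): 40, (0, 1): 15, (1, 0): 10, (1, 1): 35}

g, phi = [[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5]
for _ in range(50):   # fixed iteration count, used here only for brevity
    g, phi = em_step(n1, n2, n3, n4, pi, theta, g, phi)
```

In practice the loop would stop when the change in consecutive log-likelihoods falls below a user-specified tolerance, rather than after a fixed number of iterations.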

Intuitively, Formulas (4.10) and (4.11) make sense. They state that the estimated values of the genotype (respectively, phenotype) frequencies are computed as the quotient of the estimated number of individuals who have a given true phenotype and genotype value (respectively, phenotype value) divided by the estimated number of individuals who have the true phenotype value (for the conditional genotype frequencies (4.10)) or by the total number of individuals (for (4.11)).

Under the null hypothesis, $g_{j'0} = g_{j'1} = g_{j'*}$ for all $j'$, that is, the true genotype frequencies are the same in the case and control groups. We have

$$
n^{(r)}_{(i'j')(ij)} = n^{(1)}_{(i'j')(ij)} + n^{(2)(r)}_{(i'j')(ij)} + n^{(3)(r)}_{(i'j')(ij)} + n^{(4)(r)}_{(i'j')(ij)},
\tag{4.12}
$$

where $n^{(2)(r)}_{(i'j')(ij)}$, $n^{(3)(r)}_{(i'j')(ij)}$, and $n^{(4)(r)}_{(i'j')(ij)}$ are given by Formula (4.9), the one modification being that $g^{(r)}_{j'i'}$ is replaced by $g^{(r)}_{j'*}$. It follows that the closed-form solutions of the $(r+1)$st step estimates are

$$
g^{(r+1)}_{j'*} = \frac{n^{(r)}_{(+j')(++)}}{N}, \quad 0 \le j' \le k-1,
\tag{4.13}
$$

$$
\phi^{(r+1)}_{i'} = \frac{n^{(r)}_{(i'+)(++)}}{N}.
\tag{4.14}
$$

While Eqs. (4.11) and (4.14) look identical, they are determined using the conditional genotype frequencies $g^{(r)}_{j'i'}$ and $g^{(r)}_{j'*}$, respectively.

4.2.1.3 Test Statistic: Likelihood Ratio Test Allowing for Error (LRTae)

Once we have computed the log-likelihoods (4.6) and (4.7) under the alternative and null hypotheses, respectively (recall from Chap. 1, Eq. (1.19), that we stop the algorithm for each hypothesis when the difference in consecutive log-likelihoods is less than a user-specified limit distance), we compute the following test statistic:

$$
\mathrm{LRT}_{ae} = 2\left[\ln\left(L_{r_1(s_1)}\right) - \ln\left(L_{r_0(s_0)}\right)\right]. \tag{4.15}
$$

The terms $s_1$, $r_1(s_1)$, $s_0$, and $r_0(s_0)$ are defined in Chap. 1. The log-likelihoods are functions of the parameters $g^{(r_1)}_{j'i'}$, $\phi^{(r_1)}_{i'}$, $\pi_{i'i}$, and $\theta_{j'j}$ under the alternative hypothesis, and $g^{(r_0)}_{j'*}$, $\phi^{(r_0)}_{i'}$, $\pi_{i'i}$, and $\theta_{j'j}$ under the null. The parameters $\pi_{i'i}$ and $\theta_{j'j}$ are estimated from the double-sample data and do not get updated (see Sect. 4.2.1, definitions of $\theta_{j'j}$ and $\pi_{i'i}$). The asymptotic null distribution of LRTae is a central chi-square distribution with k − 1 degrees of freedom. Results of simulation studies suggest that, when only one type of data (phenotype, genotype) is double-sampled, permutation on the cross-classified type produces an accurate estimate of the p-values [6, 10, 16].

4.2.1.4

Example Application

Suppose that we have a fictitious disease in a population, for which there is a trait (disease) locus whose parameter settings are as follows (notation from Chap. 1).

Genetic model parameters

a. True disease prevalence = $\phi_1$(True) = 0.01;
b. Disease allele frequency = $p_d$ = 0.25;
c. Genotype relative risk $R_1$ = 2;
d. Genotype relative risk $R_2$ = 4;
e. Coefficient of maximum LD = c = 1;
f. SNP marker allele frequency = $p_a$ = 0.25;
g. Proportion of observed cases in sample = 0.50, that is, there is an equal number of observed cases and controls.

With these parameter specifications the marker locus is the disease locus (Chap. 1).

Misclassification matrices

Genotype

$$
\left(\theta_{j'j}\right) = \begin{pmatrix} 0.965 & 0.025 & 0.010 \\ 0.025 & 0.950 & 0.025 \\ 0.010 & 0.025 & 0.965 \end{pmatrix},
$$

Phenotype

$$
\left(\pi_{i'i}\right) = \begin{pmatrix} 0.98 & 0.02 \\ 0.05 & 0.95 \end{pmatrix},
$$

where the entry in the $j'$th row and $j$th column of the genotype misclassification matrix is the probability of misclassifying true genotype $j'$ as genotype $j$, 0 ≤ $j'$, $j$ ≤ 2, and where the entry in the $i'$th row and $i$th column of the phenotype misclassification matrix is the probability of misclassifying true phenotype $i'$ as phenotype $i$, 0 ≤ $i'$, $i$ ≤ 1.

Using Eq. (1.13) from Chap. 1, we can show that the true genotype frequencies are

$g_{00}$ = 0.5645; $g_{10}$ = 0.3739; $g_{20}$ = 0.0615;
$g_{01}$ = 0.3600; $g_{11}$ = 0.4800; $g_{21}$ = 0.1600.

Using item (g) in the list above, we know the observed prevalence of the non-diseased/control population is $\phi_0$(Observed) = 0.5. From the definitions of the misclassification probabilities and the law of total probability, we have

$$
\begin{aligned}
0.5 = \phi_0(\text{Observed}) &= \pi_{00} \times \phi_0(\text{True}) + \pi_{10} \times \phi_1(\text{True}), \\
&= \pi_{00} \times \phi_0(\text{True}) + \pi_{10} \times \left(1 - \phi_0(\text{True})\right).
\end{aligned}
$$

It follows that

$$
0.5 = (\pi_{00} - \pi_{10}) \times \phi_0(\text{True}) + \pi_{10}, \qquad \phi_0(\text{True}) = \frac{0.5 - \pi_{10}}{\pi_{00} - \pi_{10}}.
$$

Using the values in the matrix $(\pi_{i'i})$ above,

$$
\phi_0(\text{True}) = \frac{0.50 - 0.05}{0.95 - 0.05} = \frac{0.45}{0.90} = 0.5.
$$

In fact, when $\phi_0$(Observed) = 0.5, we can show that $\phi_0$(True) is 0.5 for any settings of the misclassification matrix $(\pi_{i'i})$. Consider a simulation using the following steps:

1. Specify that the total sample size is 500 individuals. All following assignments are done through generation of random numbers drawn from a U(0,1) distribution.
2. Use the fact that $\phi_0$(True) = $\phi_1$(True) = 0.5. For each individual, assign infallible (true) case or control status according to these probabilities.
3. Create fallible phenotype assignments for each individual based on the phenotype misclassification matrix.
4. Create an infallible genotype call for each individual based on their true affection status (Step 2) and the true conditional genotype frequencies $g_{00}, \ldots, g_{21}$ above. For example, if an individual has an infallible phenotype assignment of being a case, and the random number 0.9729 is generated for their true genotype assignment, then the individual is assigned the genotype 2 as their true (infallible) genotype.
5. Create each individual’s fallible genotype through application of the genotype misclassification matrix.

With the completion of Steps 1–5, we have a full data set. We randomly select individuals on phenotype (35% are double-sampled) and on genotype (25% are double-sampled). Since the double-sampling is performed independently, we get approximately 8.75% of all 500 individuals double-sampled on both phenotype and genotype. In Table S1 in the Supplementary Material chapter, we present phenotype and genotype counts for one simulation replicate (Steps 1–5). The number of individuals who are double-sampled on both genotype and phenotype (Group 1) is 45, or 9% of all 500 individuals. From Table S1, we compute the observed (mis)classification parameters.

Calculation of test statistic

At this stage, we document an example calculation of the LRTae statistic. To begin, we need starting values for all parameters (Sect. 1.7 of Chap. 1). We generate the following random values (standardized to sum to 1 where appropriate):

H0:
$g^{(0)}_{0*}$ = 0.2326; $g^{(0)}_{1*}$ = 0.1546; $g^{(0)}_{2*}$ = 0.6128.
$\phi^{(0)}_0$ = 0.4210; $\phi^{(0)}_1$ = 0.5790.

H1:
$g^{(0)}_{00}$ = 0.5802; $g^{(0)}_{10}$ = 0.2166; $g^{(0)}_{20}$ = 0.2032;
$g^{(0)}_{01}$ = 0.0810; $g^{(0)}_{11}$ = 0.0760; $g^{(0)}_{21}$ = 0.8430.
$\phi^{(0)}_0$ = 0.6561; $\phi^{(0)}_1$ = 0.3439.

Substituting these initial parameter values, the (mis)classification estimates in Tables 4.1 and 4.2, and the observed counts from Table S2 into Eq. (4.7), we may compute the log-likelihood of the observed data under H0 for the r = 0th iteration step of the EM algorithm. One may check that the log-likelihood is −1140.6869.

Table 4.1 Double-sample genotype data for simulation replicate

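a. Counts

| Observed genotype (j) | True genotype j′ = 0 | j′ = 1 | j′ = 2 |
|---|---|---|---|
| 0 | 42 | 3 | 0 |
| 1 | 2 | 51 | 1 |
| 2 | 2 | 1 | 15 |
| Total | 46 | 55 | 16 |

b. Genotype (mis)classification MLEs $\theta_{j'j}$

| Observed genotype (j) | True genotype j′ = 0 | j′ = 1 | j′ = 2 |
|---|---|---|---|
| 0 | 0.9130 | 0.0545 | 0.0000 |
| 1 | 0.0435 | 0.9273 | 0.0625 |
| 2 | 0.0435 | 0.0182 | 0.9375 |

Table 4.2 Double-sample phenotype data for simulation replicate

a. Counts

| Observed phenotype (i) | True phenotype i′ = 0 | i′ = 1 |
|---|---|---|
| 0 | 92 | 3 |
| 1 | 2 | 83 |
| Total | 94 | 86 |

b. Phenotype (mis)classification MLEs $\pi_{i'i}$

| Observed phenotype (i) | True phenotype i′ = 0 | i′ = 1 |
|---|---|---|
| 0 | 0.9787 | 0.0349 |
| 1 | 0.0213 | 0.9651 |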
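The simulation Steps 1–5 of this example, followed by the independent double-sampling, can be sketched as follows. This is an illustration under stated assumptions rather than the code used to produce Table S1: the function and field names are hypothetical, double-sampling is drawn as an independent Bernoulli trial per individual (the text only says that individuals are selected randomly), and the group labels follow the convention that Group 2 is double-sampled on phenotype only and Group 3 on genotype only.

```python
import random

# True conditional genotype frequencies, indexed by true phenotype
# (0 = control, 1 = case), then by true genotype 0, 1, 2.
G_TRUE = {0: [0.5645, 0.3739, 0.0615],
          1: [0.3600, 0.4800, 0.1600]}
# Misclassification matrices: row = true value, column = observed value.
THETA = [[0.965, 0.025, 0.010],
         [0.025, 0.950, 0.025],
         [0.010, 0.025, 0.965]]
PI = [[0.98, 0.02],
      [0.05, 0.95]]


def draw_category(u, probs):
    """Map a U(0,1) draw u to a category using cumulative probabilities."""
    cumulative = 0.0
    for category, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return category
    return len(probs) - 1  # guard against rounding in the listed frequencies


def simulate(n=500, p_case=0.5, frac_pheno=0.35, frac_geno=0.25, seed=None):
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        y_true = 1 if rng.random() < p_case else 0            # Step 2
        y_obs = draw_category(rng.random(), PI[y_true])        # Step 3
        x_true = draw_category(rng.random(), G_TRUE[y_true])   # Step 4
        x_obs = draw_category(rng.random(), THETA[x_true])     # Step 5
        ds_pheno = rng.random() < frac_pheno   # assumed Bernoulli selection
        ds_geno = rng.random() < frac_geno
        if ds_pheno and ds_geno:
            group = 1
        elif ds_pheno:
            group = 2
        elif ds_geno:
            group = 3
        else:
            group = 4
        records.append({"y_true": y_true, "y_obs": y_obs,
                        "x_true": x_true, "x_obs": x_obs, "group": group})
    return records


sample = simulate(seed=1)
```

With these settings, `draw_category(0.9729, G_TRUE[1])` reproduces the assignment described in Step 4, because 0.9729 exceeds the cumulative case frequency 0.36 + 0.48 = 0.84.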

We perform the same substitution, albeit with three more conditional genotype frequency values, to compute the log-likelihood of the observed data under H1. Here, we use Eq. (4.6). For the alternative hypothesis, the log-likelihood is −1160.6543. Since the counts in Table S1 and the misclassification values in Tables 4.1 and 4.2 remain constant over all iterations, the log-likelihoods only change because of updates in the genotype and phenotype frequency estimates. These frequencies change because of updated values $n^{(2)(r)}_{(i'j')(ij)}$, $n^{(3)(r)}_{(i'j')(ij)}$, and $n^{(4)(r)}_{(i'j')(ij)}$ (see Eq. (4.9)).

We provide an example calculation of $n^{(2)(r)}_{(i'j')(ij)} = n^{(2)(0)}_{(i'j')(ij)}$ under H0 and H1. Let $i' = i = j' = j = 0$. Under H0, using Eqs. (4.8) and (4.9), we have

$$
n^{(2)(r)}_{(i'j')(ij)} = n^{(2)}_{i'(ij)} \, \frac{\theta_{j'j}\, g^{(r)}_{j'i'}}{\sum_{v'} \theta_{v'j}\, g^{(r)}_{v'i'}}.
$$

It follows:

$$
n^{(2)(0)}_{(00)(00)} = n^{(2)}_{0(00)} \, \frac{\theta_{00}\, g^{(0)}_{0*}}{\sum_{v'} \theta_{v'0}\, g^{(0)}_{v'*}}.
$$

There are a few things we notice about this equation. First, $n^{(2)}_{0(00)}$ is a constant value that we find in Table S1. The value is 45. Next, to complete this calculation we need the values $\theta_{v'0}$ and $g^{(0)}_{v'*}$, $v'$ = 0, 1, 2. The values $\theta_{v'0}$ are in Table 4.1, and $g^{(0)}_{v'*}$ are the starting values listed above. Consequently:

$$
n^{(2)(0)}_{(00)(00)} = n^{(2)}_{0(00)} \, \frac{\theta_{00}\, g^{(0)}_{0*}}{\sum_{v'} \theta_{v'0}\, g^{(0)}_{v'*}} = (45)\, \frac{\theta_{00}\, g^{(0)}_{0*}}{\theta_{00}\, g^{(0)}_{0*} + \theta_{10}\, g^{(0)}_{1*} + \theta_{20}\, g^{(0)}_{2*}}.
$$

Note that:

$\theta_{00}\, g^{(0)}_{0*}$ = (0.9130) × (0.2326) = 0.2124,
$\theta_{10}\, g^{(0)}_{1*}$ = (0.0545) × (0.1546) = 0.0084,
$\theta_{20}\, g^{(0)}_{2*}$ = (0.0000) × (0.6128) = 0.0000,

so

$$
n^{(2)(0)}_{(00)(00)} = (45)\, \frac{0.2124}{0.2124 + 0.0084 + 0} = 43.283.
$$
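The arithmetic of this step can be checked with a short sketch. The function name and argument layout are hypothetical; the inputs are the Group 2 count of 45, the H0 starting values, and the genotype (mis)classification estimates for observed genotype 0, entered once as the four-decimal values and once as the underlying fractions 42/46, 3/55, and 0/16 from the double-sample counts.

```python
def group2_expected_count(n_group2, theta_to_observed, g, j_true):
    """Expected count n^{(2)(r)} for one true genotype j'.

    theta_to_observed[v] is theta_{v'j}: the probability that true genotype
    v' is observed as the genotype j under consideration; g[v] is the
    current estimate of the frequency of true genotype v'.
    """
    weights = [theta * freq for theta, freq in zip(theta_to_observed, g)]
    return n_group2 * weights[j_true] / sum(weights)


g_start_h0 = [0.2326, 0.1546, 0.6128]

# Four-decimal (mis)classification estimates, as in the worked calculation.
from_rounded = group2_expected_count(45, [0.9130, 0.0545, 0.0000], g_start_h0, 0)

# Unrounded fractions taken from the double-sample genotype counts.
from_fractions = group2_expected_count(45, [42 / 46, 3 / 55, 0 / 16], g_start_h0, 0)

print(round(from_rounded, 3), round(from_fractions, 3))
```

The two results differ only in the third decimal place, which shows how sensitive the hand calculation is to the point at which rounding is applied.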

The value in Table S2 is 43.281. That value is more accurate, since rounding is performed after calculation of the product and quotient. Still, this calculation illustrates how we compute each of the values $n^{(2)(0)}_{(i'j')(ij)}$, $n^{(3)(0)}_{(i'j')(ij)}$, and $n^{(4)(0)}_{(i'j')(ij)}$. Using the numbers in Table S2 and Formulas (4.8), (4.9), (4.12)–(4.14), the r = 1st step parameter values under H0 are

$g^{(1)}_{0*}$ = 0.4561; $g^{(1)}_{1*}$ = 0.3530; $g^{(1)}_{2*}$ = 0.1909.
$\phi^{(1)}_0$ = 0.4943; $\phi^{(1)}_1$ = 0.5057.

Similarly, using Formulas (4.8)–(4.11), the r = 1st step parameter values under H1 are

$g^{(1)}_{00}$ = 0.5922; $g^{(1)}_{10}$ = 0.3130; $g^{(1)}_{20}$ = 0.0948;
$g^{(1)}_{01}$ = 0.3167; $g^{(1)}_{11}$ = 0.3565; $g^{(1)}_{21}$ = 0.3268.
$\phi^{(1)}_0$ = 0.5365; $\phi^{(1)}_1$ = 0.4635.

Note that these estimates are much closer to the true parameter values than those at step r = 0. As with the r = 0th step estimates, we may use these parameter values and the counts in Table S1 to compute the updated log-likelihoods at the r = 1st step. Under H0, the log-likelihood becomes −911.3840, a change of 229.3029 from the log-likelihood at the r = 0th step. Under H1, the log-likelihood becomes −909.1568, a change of 251.4975 from the log-likelihood for the r = 0th step.

In Table 4.3, we present the log-likelihoods for our example data under H0 and H1 at multiple steps. We indicate the steps $r_0$ and $r_1$ where the inequalities for H0 and H1 are satisfied, respectively. Also, we compute the LRTae statistic. The maximum difference ε is set at $10^{-5}$ = 1E−05. Studying this table, we note that convergence to particular log-likelihoods (labeled ln(L) in the table) occurs within a relatively small number of steps. In addition, the minimal values $r_0$ and $r_1$ where the inequalities (Formula (4.16)) are satisfied for H0 and H1 are the same for this set of observed data. Specifically, $r_0 = r_1 = 6$. In general, it is not necessarily true that $r_0 = r_1$. Using the respective log-likelihoods for Step 6 in Table 4.3, we compute

$$
\mathrm{LRT}_{ae} = 2\left[-889.9861 - (-902.0674)\right] = 24.1626. \tag{4.16}
$$

Table 4.3 H0 and H1 log-likelihoods for example (observed) data

| Step (r) | H0 ln(L) | Δ(H0 ln(L)) values for consecutive steps | H1 ln(L) | Δ(H1 ln(L)) values for consecutive steps |
|---|---|---|---|---|
| 0 | −1140.6869 | | −1160.6543 | |
| 1 | −911.3840 | 229.3029 | −909.1568 | 251.4975 |
| 2 | −902.3754 | 9.0086 | −890.5447 | 18.6121 |
| 3 | −902.0791 | 0.2963 | −890.0048 | 0.5399 |
| 4 | −902.0679 | 0.0112 | −889.9868 | 0.0180 |
| 5 | −902.0674 | 0.0005 | −889.9861 | 0.0007 |
| 6 | −902.0674 | 1.8546E−05 | −889.9861 | 3.0783E−05 |
| 7 | −902.0674 | 7.7419E−07 | −889.9861 | 1.7104E−06 |

The headings “H0 ln(L)” and “H1 ln(L)” in Columns 2 and 4 refer to the log-likelihoods of the example data under the null and alternative hypotheses, respectively, for given steps r. The values in the column labeled “Δ(H0 ln(L)) values for consecutive steps” are computed as |(H0 ln(L) at step r + 1) − (H0 ln(L) at step r)|, 0 ≤ r ≤ 6. The value r comes from the first column. We compute similar values in the last column, where ln(L) is computed under the alternative hypothesis.

This statistic follows a central chi-square distribution with 2 df under the null hypothesis. Thus, the asymptotic p-value is 5.6644 × $10^{-6}$. We computed LRTae in this manner for illustrative purposes. Recall that, in Chap. 1, Sect. 1.7, when computing the test statistic using the EM algorithm, we use a large number of starting vectors (0th step) for the null and alternative hypotheses separately. For each starting vector and for each hypothesis, we apply the EM algorithm until we obtain convergence or until the number of steps is exhausted. We could claim that the value in Eq. (4.16) is the LRTae value if we considered only one starting vector (strategically, a very poor option if we want to obtain an accurate estimate of the maximum log-likelihoods).
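The stopping rule, the test statistic, and the asymptotic p-value for this single starting vector can be reproduced with a few lines. In this sketch, the function name is hypothetical, the lists hold the absolute changes in log-likelihood between steps r and r + 1 reported for this example, and the p-value uses the fact that the upper tail of a chi-square distribution with 2 df at x equals exp(−x/2).

```python
import math


def first_converged_step(abs_changes, eps):
    """Smallest r with |lnL(r + 1) - lnL(r)| < eps, or None if never reached.

    abs_changes[r] is the absolute change in log-likelihood from step r
    to step r + 1.
    """
    for r, change in enumerate(abs_changes):
        if change < eps:
            return r
    return None


h0_changes = [229.3029, 9.0086, 0.2963, 0.0112, 0.0005, 1.8546e-05, 7.7419e-07]
h1_changes = [251.4975, 18.6121, 0.5399, 0.0180, 0.0007, 3.0783e-05, 1.7104e-06]
r0 = first_converged_step(h0_changes, eps=1e-5)
r1 = first_converged_step(h1_changes, eps=1e-5)

ln_l0 = -902.0674   # converged log-likelihood under H0
ln_l1 = -889.9861   # converged log-likelihood under H1
lrt_ae = 2.0 * (ln_l1 - ln_l0)

# Chi-square survival function with 2 degrees of freedom.
p_value = math.exp(-lrt_ae / 2.0)
print(r0, r1, round(lrt_ae, 4), p_value)
```

In practice, the same calculation would be repeated over many starting vectors, keeping the largest converged log-likelihood under each hypothesis before forming the statistic.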

4.2.2 Trend Statistic that Allows for Random Phenotype and Genotype Misclassification Error

Like the LRTae statistic from Sect. 4.2.1, the test of trend allowing for errors (LTTae) presented in this section also uses double-sample data [6]. It is an extension of the Cochran-Armitage test for trend (as the test is commonly referenced) [50, 51]. One reason for interest in this statistic in genetic association analyses is its robustness. Sasieni [52] documented that the chi-square test of independence applied to locus allele counts for cases and controls is not robust to departures from Hardy-Weinberg equilibrium (HWE) proportions for locus genotype frequencies in either the case or control populations. The Cochran-Armitage test is robust, and has the same statistical power as the chi-square test of independence on alleles when both populations have locus genotype frequencies that are in HWE proportions. As with the LRTae, we specify that some individuals are double-sampled on either phenotype or genotype or both. One of the key differences between the LTTae and the LRTae is that the LRTae log-likelihoods are written as functions of the conditional genotype frequencies $g_{j'i'}$ (Eqs. (4.6) and (4.7)) whereas the LTTae log-likelihoods are written in terms of penetrances $f_{i'j'}$ (definition in Sect. 4.2.2.1).

4.2.2.1

Notation and Definitions

Our presentation below follows the structure of the presentation in Sect. 4.2.1. When notation is the same, we omit additional details.

α = True baseline odds ratio.

β = True log odds ratio. For the LTTae, the null hypothesis is H0: β = 0.

$\alpha_0$ = Baseline odds ratio assuming the null hypothesis is true.

$\alpha_1$ = Baseline odds ratio assuming the alternative hypothesis is true.

$\beta_1$ = Log odds ratio assuming the alternative hypothesis is true.

$w_{j'}$ = Weight corresponding to the genotype or MLG $j'$.

$f_{i'j'} = \Pr(Y^t = i' \mid X^t = j')$ = True penetrance of phenotype (affection status) $i'$, 0 ≤ $i'$ ≤ 1, conditional on genotype $j'$, 0 ≤ $j'$ ≤ k − 1. Note that the penetrances follow a logistic model:

$$
f_{i'j'} = \frac{e^{i'\left(\alpha + \beta w_{j'}\right)}}{1 + e^{\alpha + \beta w_{j'}}}. \tag{4.17}
$$

We do not specify a particular hypothesis here (e.g., null or alternative). Our equation corresponds to Eq. (1) in Gauderman’s publication [53] using the notation (ours on the left-hand side of the equation, his on the right-hand side): $w_{j'} = G(j')$. Also, we do not model environmental or gene-environment interaction terms. That is, using Gauderman’s formulation, $\beta_e = \beta_{ge} = 0$. Finally, in the publication [53], the function G() is applied to genotypes written as pairs of alleles (e.g., aa, Aa, AA), whereas our $j'$ is integer-valued and represents the number of copies of a reference allele (typically the minor or risk allele) in a genotype, or an ordering of a set of multilocus genotypes. For example, given a SNP locus with alleles A and T, where A is the risk allele, $j'$ = 0, 1, 2 correspond to the genotypes TT, AT, and AA, respectively. We documented this notation previously (Chap. 1).

$f^{(r)}_{i'j'}$ = rth step estimate of the parameter $f_{i'j'}$ (determined through application of the EM algorithm).

$f_{i'*} = \Pr(Y^t = i')$ = True penetrance of phenotype (affection status) $i'$, 0 ≤ $i'$ ≤ 1, under H0: β = 0. For the LRTae section (4.2.1) above, we used the notation $\phi_{i'}$ to describe this probability. We use the notation “f” here because penetrances have traditionally been represented by this letter (see, e.g., [54, 55]). Also, we used this notation in our publication regarding the LTTae [6].

$g_{j'} = \Pr(X^t = j')$ = True frequency of genotype $j'$, 0 ≤ $j'$ ≤ k − 1, in the entire sample population (cases and controls).

$g^{(r)}_{j'}$ = rth step estimate of the parameter $g_{j'}$ (determined through application of the EM algorithm).

$\tau^{(G)(r)}_{(i'j')(ij)} = \Pr(X^t = j', Y^t = i' \text{ for ind in Group } G \mid X = j, Y = i \text{ for same ind})$. This probability is the Bayesian posterior probability (BPP), at the rth iteration step, that the true phenotype and genotype of an individual in Group G are $i'$ and $j'$, respectively, computed from the rth step parameter estimates. Note that 1 ≤ G ≤ 4.
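A minimal sketch of the logistic penetrance model (4.17) follows. The function name, the parameter values α = −2 and β = 0.7, and the additive weights $w_{j'} = j'$ are illustrative assumptions and are not estimates from any data set in this chapter.

```python
import math


def penetrance(i_true, w, alpha, beta):
    """Penetrance f_{i'j'} from Eq. (4.17) for true phenotype i' in {0, 1}.

    w is the weight w_{j'} attached to the true genotype j'.
    """
    eta = alpha + beta * w
    return math.exp(i_true * eta) / (1.0 + math.exp(eta))


alpha, beta = -2.0, 0.7          # hypothetical values
weights = [0, 1, 2]              # hypothetical additive weights w_{j'} = j'
f_case = [penetrance(1, w, alpha, beta) for w in weights]
f_control = [penetrance(0, w, alpha, beta) for w in weights]

# Under H0 (beta = 0) the penetrance no longer depends on the genotype.
f_case_null = [penetrance(1, w, alpha, 0.0) for w in weights]
```

For every genotype, the case and control penetrances sum to 1, and with a positive β the case penetrance increases with the weight.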

4.2.2.2

Log-Likelihoods of the Observed Data

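The log-likelihoods of the observed data are identical to those for the LRTae. The only change regards computation of the joint probabilities (Eqs. (4.2)–(4.5) above). For the LTTae, we have

(Group 1)

$$
\begin{aligned}
\Pr(Y = i, Y^t = i', X = j, X^t = j') &= \Pr(Y = i, X = j \mid Y^t = i', X^t = j') \times \Pr(Y^t = i', X^t = j'), \\
&= \Pr(Y = i \mid Y^t = i') \times \Pr(X = j \mid X^t = j') \times \Pr(Y^t = i' \mid X^t = j') \times \Pr(X^t = j'), \\
&= \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'}.
\end{aligned}
\tag{4.18}
$$

(Group 2)

$$
\begin{aligned}
\Pr(Y = i, Y^t = i', X = j) &= \sum_{j'} \Pr(Y = i, Y^t = i', X = j, X^t = j'), \\
&= \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'}, \\
&= \pi_{i'i} \left( \sum_{j'} \theta_{j'j}\, f_{i'j'}\, g_{j'} \right).
\end{aligned}
\tag{4.19}
$$

(Group 3)

$$
\begin{aligned}
\Pr(Y = i, X = j, X^t = j') &= \sum_{i'} \Pr(Y = i, Y^t = i', X = j, X^t = j'), \\
&= \sum_{i'} \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'}, \\
&= \theta_{j'j}\, g_{j'} \left( \sum_{i'} \pi_{i'i}\, f_{i'j'} \right).
\end{aligned}
\tag{4.20}
$$

(Group 4)

$$
\begin{aligned}
\Pr(Y = i, X = j) &= \sum_{i'} \sum_{j'} \Pr(Y = i, Y^t = i', X = j, X^t = j'), \\
&= \sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'}.
\end{aligned}
\tag{4.21}
$$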

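We drop the suffix “for ind a” because the probabilities listed in Eqs. (4.18)–(4.21) are not dependent upon the individual. As with the LRTae, the summation over the individuals a in each group may be replaced by the counts $n^{(1)}_{(i'j')(ij)}$, $n^{(2)}_{i'(ij)}$, etc. Applying Eqs. (4.18)–(4.21), we rewrite the log-likelihood of the observed data as

$$
\begin{aligned}
\ln(L) ={} & \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(1)}_{(i'j')(ij)} \ln\left( \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'} \right) \\
& + \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} n^{(2)}_{i'(ij)} \ln\left[ \pi_{i'i} \left( \sum_{j'} \theta_{j'j}\, f_{i'j'}\, g_{j'} \right) \right] \\
& + \sum_{i=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(3)}_{j'(ij)} \ln\left[ \theta_{j'j}\, g_{j'} \left( \sum_{i'} \pi_{i'i}\, f_{i'j'} \right) \right] \\
& + \sum_{i=0}^{1} \sum_{j=0}^{k-1} n^{(4)}_{ij} \ln\left[ \sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, f_{i'j'}\, g_{j'} \right].
\end{aligned}
\tag{4.22}
$$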

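Like the LRTae statistic, we can simplify the log-likelihood (4.22) under the null:

$$
\begin{aligned}
\ln(L) ={} & \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(1)}_{(i'j')(ij)} \ln\left( \pi_{i'i}\, \theta_{j'j}\, f_{i'*}\, g_{j'} \right) \\
& + \sum_{i=0}^{1} \sum_{i'=0}^{1} \sum_{j=0}^{k-1} n^{(2)}_{i'(ij)} \ln\left[ \pi_{i'i} \left( \sum_{j'} \theta_{j'j}\, f_{i'*}\, g_{j'} \right) \right] \\
& + \sum_{i=0}^{1} \sum_{j=0}^{k-1} \sum_{j'=0}^{k-1} n^{(3)}_{j'(ij)} \ln\left[ \theta_{j'j}\, g_{j'} \left( \sum_{i'} \pi_{i'i}\, f_{i'*} \right) \right] \\
& + \sum_{i=0}^{1} \sum_{j=0}^{k-1} n^{(4)}_{ij} \ln\left[ \sum_{i'} \sum_{j'} \pi_{i'i}\, \theta_{j'j}\, f_{i'*}\, g_{j'} \right].
\end{aligned}
\tag{4.23}
$$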

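Also, the rth step estimates of the log-likelihoods are computed using the corresponding estimates of the parameters $f_{i'j'}$ and $g_{j'}$ under the specified hypothesis (alternative or null). Gordon et al. documented [10] that the BPPs for an arbitrary individual a may be written as

$$
\begin{aligned}
\text{(Group 1)} \quad \tau^{(1)}_{(i'j')(ij)} &= \begin{cases} 1, & \text{if } I_1(a, i', i, j', j) = 1, \\ 0, & \text{otherwise;} \end{cases} \\
\text{(Group 2)} \quad \tau^{(2)(r)}_{(i'j')(ij)} &= \begin{cases} \dfrac{\theta_{j'j}\, f^{(r)}_{i'j'}\, g^{(r)}_{j'}}{\sum_{v'} \theta_{v'j}\, f^{(r)}_{i'v'}\, g^{(r)}_{v'}}, & \text{if } I_2(a, i', i, j) = 1, \\ 0, & \text{otherwise;} \end{cases} \\
\text{(Group 3)} \quad \tau^{(3)(r)}_{(i'j')(ij)} &= \begin{cases} \dfrac{\pi_{i'i}\, f^{(r)}_{i'j'}}{\sum_{u'} \pi_{u'i}\, f^{(r)}_{u'j'}}, & \text{if } I_3(a, i, j', j) = 1, \\ 0, & \text{otherwise;} \end{cases} \\
\text{(Group 4)} \quad \tau^{(4)(r)}_{(i'j')(ij)} &= \begin{cases} \dfrac{\pi_{i'i}\, \theta_{j'j}\, f^{(r)}_{i'j'}\, g^{(r)}_{j'}}{\sum_{u'} \sum_{v'} \pi_{u'i}\, \theta_{v'j}\, f^{(r)}_{u'v'}\, g^{(r)}_{v'}}, & \text{if } I_4(a, i, j) = 1, \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}
\tag{4.24}
$$

The conditional probabilities (4.24) may be rewritten under H0 by replacing every instance of $f^{(r)}_{i'j'}$ by $f^{(r)}_{i'*}$. Each term $n^{(r)}_{(i'j')(ij)}$ is given by

$$
\begin{aligned}
n^{(r)}_{(i'j')(ij)} &= n^{(1)}_{(i'j')(ij)} + n^{(2)(r)}_{(i'j')(ij)} + n^{(3)(r)}_{(i'j')(ij)} + n^{(4)(r)}_{(i'j')(ij)}, \quad \text{where} \\
n^{(2)(r)}_{(i'j')(ij)} &= n^{(2)}_{i'(ij)} \times \tau^{(2)(r)}_{(i'j')(ij)}, \\
n^{(3)(r)}_{(i'j')(ij)} &= n^{(3)}_{j'(ij)} \times \tau^{(3)(r)}_{(i'j')(ij)}, \\
n^{(4)(r)}_{(i'j')(ij)} &= n^{(4)}_{ij} \times \tau^{(4)(r)}_{(i'j')(ij)}.
\end{aligned}
\tag{4.25}
$$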
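The posterior probabilities for Groups 2 and 4 can be sketched as below, taking the joint probability $\pi_{i'i}\,\theta_{j'j}\,f_{i'j'}\,g_{j'}$ of Eq. (4.18) as the building block. The function names and array layouts are hypothetical. The misclassification matrices are the true values from the example application in Sect. 4.2.1.4, and the genotype frequencies, α, β, and weights are illustrative assumptions only.

```python
import math

# Row = true value, column = observed value.
PI = [[0.98, 0.02], [0.05, 0.95]]
THETA = [[0.965, 0.025, 0.010],
         [0.025, 0.950, 0.025],
         [0.010, 0.025, 0.965]]
G = [0.5625, 0.3750, 0.0625]       # hypothetical genotype frequencies g_{j'}
ALPHA, BETA = -0.5, 0.6            # hypothetical logistic parameters
W = [0, 1, 2]                      # hypothetical additive weights

# Penetrances f[i_true][j_true] from the logistic model (4.17).
F = [[math.exp(i * (ALPHA + BETA * w)) / (1.0 + math.exp(ALPHA + BETA * w))
      for w in W] for i in (0, 1)]


def bpp_group2(i_true, j_obs):
    """Posterior over true genotype j' when the true phenotype is known."""
    terms = [THETA[jt][j_obs] * F[i_true][jt] * G[jt] for jt in range(len(G))]
    total = sum(terms)
    return [term / total for term in terms]


def bpp_group4(i_obs, j_obs):
    """Posterior over (i', j') when only fallible data are available."""
    joint = [[PI[it][i_obs] * THETA[jt][j_obs] * F[it][jt] * G[jt]
              for jt in range(len(G))] for it in (0, 1)]
    total = sum(sum(row) for row in joint)
    return [[value / total for value in row] for row in joint]


tau2 = bpp_group2(i_true=1, j_obs=0)
tau4 = bpp_group4(i_obs=1, j_obs=0)
```

Multiplying these posteriors by the observed Group 2 and Group 4 counts gives the expected counts in Eq. (4.25).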

By definition, the (r + 1)st step estimate of $f_{i'j'}$ is given by

$$
f^{(r+1)}_{i'j'} = \frac{e^{i'\left(\alpha_1^{(r+1)} + \beta_1^{(r+1)} w_{j'}\right)}}{1 + e^{\alpha_1^{(r+1)} + \beta_1^{(r+1)} w_{j'}}}, \tag{4.26a}
$$

under the alternative hypothesis that β ≠ 0, and

$$
f^{(r+1)}_{i'*} = \frac{e^{i'\alpha_0^{(r+1)}}}{1 + e^{\alpha_0^{(r+1)}}}, \tag{4.26b}
$$

under the null hypothesis that β = 0. Similar to the LRTae statistic under the null hypothesis, the (r + 1)st step values for the genotype frequencies $g^{(r+1)}_{j'}$ are given by

$$
g^{(r+1)}_{j'} = \frac{n^{(r)}_{(+j')(++)}}{N}. \tag{4.27}
$$

From Formulas (4.26a) and (4.26b), we see that the (r + 1)st step estimate of $f_{i'j'}$ is determined by the (r + 1)st step estimates of α (null and alternative hypotheses) and β (alternative hypothesis). Computation of these estimates is more challenging than for the conditional genotype frequencies (4.10) in the LRTae statistic. The reason is that the LTTae statistic is “model-based” in the sense that different weights are assigned to different genotype categories. As such, the (r + 1)st step estimates depend on the weights $w_{j'}$. In a previous publication [6], we determined closed-form solutions for the α and β terms under either a recessive or dominant model of inheritance.

Recessive mode of inheritance

Under the assumption that a person is at increased risk of developing a disease only if they have the $m'$ genotype, 0 ≤ $m'$ ≤ (k − 1), we specify the test of trend weights as

$$
w_{j'} = \begin{cases} 1, & j' = m', \\ 0, & \text{otherwise.} \end{cases}
$$

With this set of weights, we can show that the closed-form solutions for estimates of α and β are

(Alternative hypothesis)

$$
\begin{aligned}
\alpha_1^{(r+1)} &= \ln\left( \sum_{j' \ne m'} n^{(r)}_{(1j')(++)} \right) - \ln\left( \sum_{j' \ne m'} n^{(r)}_{(0j')(++)} \right), \\
\beta_1^{(r+1)} &= \ln\left( n^{(r)}_{(1m')(++)} \right) - \ln\left( n^{(r)}_{(0m')(++)} \right) - \alpha_1^{(r+1)}.
\end{aligned}
\tag{4.28}
$$

(Null hypothesis)

$$
\alpha_0^{(r+1)} = \ln\left( n^{(r)}_{(1+)(++)} \right) - \ln\left( n^{(r)}_{(0+)(++)} \right). \tag{4.29}
$$

Dominant mode of inheritance

Given the model that a person has an increased risk of developing a disease only if they have any one of a subset of genotypes, and that the risk is the same for each genotype in that subset, we may specify the test of trend weights as

$$
w_{j'} = \begin{cases} 1, & 0 \le j' \le m', \\ 0, & \text{otherwise.} \end{cases}
$$

Without loss of generality, we order the genotypes so that the first $m' + 1$ genotypes (labeled 0, 1, 2, …, $m'$) are the ones that produce higher risk, and the remaining genotypes $m' + 1, m' + 2, \ldots, (k - 1)$ are non-risk or low risk. Given this set of weights and ordering, the closed-form solutions for estimates of α and β are

(Alternative hypothesis)

$$
\begin{aligned}
\alpha_1^{(r+1)} &= \ln\left( \sum_{j' > m'} n^{(r)}_{(1j')(++)} \right) - \ln\left( \sum_{j' > m'} n^{(r)}_{(0j')(++)} \right), \\
\beta_1^{(r+1)} &= \ln\left( \sum_{0 \le j' \le m'} n^{(r)}_{(1j')(++)} \right) - \ln\left( \sum_{0 \le j' \le m'} n^{(r)}_{(0j')(++)} \right) - \alpha_1^{(r+1)}.
\end{aligned}
\tag{4.30}
$$

(Null hypothesis)

$$
\alpha_0^{(r+1)} = \ln\left( n^{(r)}_{(1+)(++)} \right) - \ln\left( n^{(r)}_{(0+)(++)} \right). \tag{4.31}
$$

The values $n^{(r)}_{(i'j')(ij)}$ that are used to determine the closed-form solutions under the recessive and dominant modes of inheritance differ for the null and alternative hypothesis settings.

General set of weights

When applying weights for models different than a dominant or recessive mode of inheritance, one can compute parameter estimates by means of different maximization methods. We applied the Newton-Raphson algorithm in our previous work, where we documented a step-by-step process for applying the algorithm [6].
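For weights that take only the values 0 and 1, the recessive and dominant closed-form solutions share one structure: α comes from the genotypes with weight 0 and β from the genotypes with weight 1. The sketch below assumes the expected counts have already been summed over observed classifications. The function names, array layout `n[i_true][j_true]`, and the counts are hypothetical.

```python
import math


def logistic_updates_binary_weights(n, w):
    """Closed-form (alpha_1, beta_1) for 0/1 weights, as in (4.28) and (4.30).

    n[i_true][j_true] is the expected count for true phenotype i' and true
    genotype j'; w[j_true] is the weight (0 or 1) of genotype j'.
    """
    baseline = [j for j, weight in enumerate(w) if weight == 0]
    at_risk = [j for j, weight in enumerate(w) if weight == 1]
    alpha_1 = (math.log(sum(n[1][j] for j in baseline))
               - math.log(sum(n[0][j] for j in baseline)))
    beta_1 = (math.log(sum(n[1][j] for j in at_risk))
              - math.log(sum(n[0][j] for j in at_risk)) - alpha_1)
    return alpha_1, beta_1


def alpha_null(n):
    """Closed-form alpha_0 under H0, as in (4.29) and (4.31)."""
    return math.log(sum(n[1])) - math.log(sum(n[0]))


# Hypothetical expected counts: row 0 = controls, row 1 = cases.
counts = [[200.0, 40.0, 10.0],
          [100.0, 60.0, 40.0]]
alpha_1, beta_1 = logistic_updates_binary_weights(counts, w=[0, 0, 1])
alpha_0 = alpha_null(counts)

# Implied case penetrance for the at-risk genotype under model (4.17).
f_case_risk = math.exp(alpha_1 + beta_1) / (1.0 + math.exp(alpha_1 + beta_1))
```

With these counts the implied case penetrance for the at-risk genotype equals 40/(40 + 10), the observed case proportion in that category, which is what a saturated two-category logistic fit should return.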

4.2.2.3

Test Statistic—(Linear) Test of Trend Allowing for Error (LTTae )

Having determined the log-likelihoods (4.22) and (4.23), the test of trend statistic allowing for errors is given by

$$
\mathrm{LTT}_{ae} = 2\left[\ln\left(L_{r_1(s_1)}\right) - \ln\left(L_{r_0(s_0)}\right)\right]. \tag{4.32}
$$

The log-likelihoods are functions of the parameters $\alpha_1^{(r_1)}$, $\beta_1^{(r_1)}$, $g^{(r_1)}_{j'}$, $\pi_{i'i}$, and $\theta_{j'j}$ under the alternative hypothesis, and $\alpha_0^{(r_0)}$, $g^{(r_0)}_{j'}$, $\pi_{i'i}$, and $\theta_{j'j}$ under the null. The parameters $\pi_{i'i}$ and $\theta_{j'j}$ are estimated from the double-sample data and do not get updated. The asymptotic null distribution of LTTae is a central chi-square distribution with one degree of freedom. The LTTae statistic (4.32) may be a more powerful statistic than the LRTae statistic under certain circumstances, because it has only one degree of freedom. The challenge when using this statistic is in specifying weights that accurately reflect the different penetrances based on the underlying genotypes.

4.2.3 Likelihood Ratio Statistic for Family-Based Association that Incorporates Genotype Misclassification Errors (TDTae)

Here, we present the first statistic in this chapter for which the data are family-based. We provide the details of a transmission disequilibrium test [56] that allows for genotype misclassification error. We refer to this statistic as TDTae. Versions of such a statistic have been published previously [57–61]. The sampling unit is a trio of father, mother, and affected child (or child who displays the selected phenotype value for a dichotomous trait), all of whom have been genotyped at a given locus. What is the importance of such statistics? In a word, robustness. A common feature of statistical software packages that compute TDT statistics is that trios that do not display Mendelian consistency in their genotypes are ignored. That is, only trios that show Mendelian consistency are used when computing the TDT statistic (Fig. 2.3a, b of Chap. 2).

4.2.3.1

Notation

d = The disease allele at the putative disease SNP locus.

“+” = The non-disease allele at the putative disease SNP locus.

$T_{abc}$ = The vector of coded (true) genotypes for a trio, where the father’s coded (true) genotype is a, the mother’s coded (true) genotype is b, and the affected child’s coded (true) genotype is c. Each of the coded genotypes is between 0 and 2, and indicates the number of d alleles in the person’s genotype. The notation $T_{abc}$ may be extended to any set of trio genotypes for any SNP locus by specifying the allele to be counted in each person’s genotype. The complete list is provided in Table 4.4. Some examples are

a. $T_{222}$, where the father has true genotype dd (coded genotype 2), the mother has true genotype dd (coded genotype 2), and the affected child has true genotype dd (coded genotype 2);
b. $T_{121}$, where the father has true genotype d+ (coded genotype 1), the mother has true genotype dd (coded genotype 2), and the affected child has true genotype d+ (coded genotype 1);
c. $T_{101}$, where the father has true genotype d+ (coded genotype 1), the mother has true genotype ++ (coded genotype 0), and the affected child has true genotype d+ (coded genotype 1).

Table 4.4 Probability distribution function for MC trio genotypes when child is affected

| Mating type (father × mother) | Pr(Mating type) = Pr(MT) | Child genotype | Notation | Pr(Child genotype ∣ MT) | Pr(Child genotype, MT) = Pr($T_{abc}$) |
|---|---|---|---|---|---|
| dd × dd | μ1 | dd | $T_{222}$ | 1 | μ1 |
| dd × d+ | μ2/2 | dd | $T_{212}$ | t | (μ2/2) t |
| dd × d+ | μ2/2 | d+ | $T_{211}$ | (1 − t) | (μ2/2)(1 − t) |
| d+ × dd | μ2/2 | dd | $T_{122}$ | t | (μ2/2) t |
| d+ × dd | μ2/2 | d+ | $T_{121}$ | (1 − t) | (μ2/2)(1 − t) |
| dd × ++ | μ3/2 | d+ | $T_{201}$ | 1 | μ3/2 |
| ++ × dd | μ3/2 | d+ | $T_{021}$ | 1 | μ3/2 |
| d+ × d+ | μ4 | dd | $T_{112}$ | t² | μ4 t² |
| d+ × d+ | μ4 | d+ | $T_{111}$ | 2t(1 − t) | 2 × μ4 × t(1 − t) |
| d+ × d+ | μ4 | ++ | $T_{110}$ | (1 − t)² | μ4 (1 − t)² |
| d+ × ++ | μ5/2 | d+ | $T_{101}$ | t | (μ5/2) t |
| d+ × ++ | μ5/2 | ++ | $T_{100}$ | (1 − t) | (μ5/2)(1 − t) |
| ++ × d+ | μ5/2 | d+ | $T_{011}$ | t | (μ5/2) t |
| ++ × d+ | μ5/2 | ++ | $T_{010}$ | (1 − t) | (μ5/2)(1 − t) |
| ++ × ++ | μ6 | ++ | $T_{000}$ | 1 | μ6 |

Here, the high risk/disease allele is d. The last column is computed using the definition of conditional probability. Schaid and Sommer [55] demonstrated this calculation for a more general setting of genotype relative risks. The value t is defined as t = Pr(heterozygous parent transmits a d allele to an affected child). In Schaid and Sommer’s publication [55], the TDTae publication [57], and in the definition of the parameter t, the trios to which the statistic is applied are only those in which the child is affected with the trait of interest. For that reason, we remove the event “Aff” from the column headers above. In the notation column, the term $T_{abc}$ indicates that the father’s coded (true) genotype is a, the mother’s coded (true) genotype is b, and the affected child’s coded (true) genotype is c. Each of the coded genotypes is between 0 and 2, and indicates the number of d alleles in the person’s genotype. The notation $T_{abc}$ may be extended to any set of trio genotypes for any SNP locus by specifying the allele to be counted in each person’s genotype.

$O_{xyz}$ = The coded (observed) genotypes for a trio, where the father’s coded (observed) genotype is x, the mother’s coded (observed) genotype is y, and the affected child’s coded (observed) genotype is z. Each of the coded genotypes is between 0 and 2, and indicates the number of d alleles in the person’s genotype. Similar to $T_{abc}$, the notation $O_{xyz}$ may be extended to any set of trio genotypes for any SNP locus by specifying the allele to be counted in each person’s genotype. The complete list is provided in Table 4.5. The observed genotypes for a trio are not necessarily MC. Some examples are

a. $O_{200}$, where the father has observed genotype dd, the mother has observed genotype ++, and the affected child has observed genotype ++;
b. $O_{101}$, where the father has observed genotype d+, the mother has observed genotype ++, and the affected child has observed genotype d+;
c. (Example in Fig. 2.3a after errors have been introduced): Let A be the disease allele, and T the wild-type allele. Then the coded genotype trio is $O_{221}$, where the father has observed genotype AA, the mother has observed genotype AA, and the affected child has observed genotype AT.

From this point forward, we will refer to a trio whose genotypes are indicated by a string of three letters (e.g., abc and xyz in Tables 4.4 and 4.5, respectively) as a genotype trio. When the genotype values of the father, mother, and affected child are known, we may sometimes shorten this term to simply trio. For example, it is clear that the observed genotypes of the genotype trio $O_{101}$ are 1 (father), 0 (mother), and 1 (affected child). In such an instance, we may refer to the given genotype trio as “the trio $O_{101}$”.
Using the definition of conditional probability and the law of total probability, we can write

$$
\Pr\left(O_{xyz}\right) = \sum_{abc} \Pr\left(O_{xyz} \mid T_{abc}\right) \Pr\left(T_{abc}\right), \tag{4.33}
$$

where

$$
\Pr\left(O_{xyz} \mid T_{abc}\right) = \theta_{ax}\, \theta_{by}\, \theta_{cz}. \tag{4.34}
$$

The genotype (mis)classification matrix θ for SNPs is given in Chap. 2, Table 2.1. In that table, we use example SNP alleles A and T; in practice, the SNP alleles may be any pair of alphanumeric characters, as long as they are defined and used consistently. We obtain the condition (4.34) by applying the rule that genotype errors are random and independent for each individual in a data set. This rule was applied in the work of Bernardinelli et al. [59] and Morris and Kaplan [60]. An example calculation, Pr($O_{002}$), is provided below, after other notation is defined. From Chap. 2, Table 2.1, we know that there are constraint equations given by

154 Table 4.5 All possible sets of trio genotypes (including MI ones) for a SNP locus

4 Association Tests Allowing for Heterogeneity Father observed genotype

Mother observed genotype

Child observed genotype

Notation Oxyz

dd

dd

dd

O222

dd

dd

d+

O221

dd

dd

++

O220

dd

d+

dd

O212

dd

d+

d+

O211

dd

d+

++

O210

dd

++

dd

O202

dd

++

d+

O201

dd

++

++

O200

d+

dd

dd

O122

d+

dd

d+

O121

d+

dd

++

O120

d+

d+

dd

O112

d+

d+

d+

O111

d+

d+

++

O110

d+

++

dd

O102

d+

++

d+

O101

d+

++

++

O100

++

dd

dd

O022

++

dd

d+

O021

++

dd

++

O020

++

d+

dd

O012

++

d+

d+

O011

++

d+

++

O010

++

++

dd

O002

++

++

d+

O001

++

++

++

O000

In the notation column, the term Oxyz indicates that the father’s observed coded genotype is x, the mother’s observed coded genotype is y, and the affected child’s observed coded genotype is z. Twelve of the trio’s observed genotypes are MI (they are not part of the list provided in Table 4.4). Each of the coded genotypes is between 0 and 2, and indicates the number d alleles in the person’s genotype. The notation Oxyz may be extended to any set of trio genotypes for any SNP locus by specifying the allele to be counted in each person’s genotype

$$\sum_{j'=0}^{2} \theta_{jj'} = 1, \quad 0 \le j \le 2. \tag{4.35}$$

In Eq. (4.35), the indices j and j′ refer to the specific true coded genotype value and the observed genotype value, respectively; the notation is used in Sects. 4.2.1 and 4.2.2 as well. We use these constraint equations when specifying starting values in the EM algorithm.

N = The total number of trios in the study.

Aff = Event that the child in a trio is affected.

t = Pr(heterozygous parent transmits d allele to affected offspring). In this work, the null hypothesis, H0, is t = 0.5. The alternative hypothesis, H1, is t ≠ 0.5. Under a multiplicative mode of inheritance (Chap. 1, Eq. (1.8)), one can show that

$$t = \frac{R_2}{R_1 + R_2} = \frac{R_1}{1 + R_1},$$

where R1 and R2 are the genotype relative risks given in Chap. 1, Sect. 1.4.2.

a × b = Mating type where father’s coded genotype is a and mother’s coded genotype is b. For this statistic, we noted above that the possible coded genotypes are 0 (++ genotype), 1 (d+ genotype), and 2 (dd genotype).

μi = Probability that the mating type is i. Like Schaid and Sommer [55], we consider six mating types. However, unlike Schaid and Sommer, we consider 15 possible sets of trio genotypes (Table 4.4). Other models, such as those considered by Weinberg and colleagues [62, 63], require more than six mating-type frequencies. It is relatively easy to increase the number of mating types up to nine; the most general scenario is one where the mating types are 0 × 0, 0 × 1, 0 × 2, 1 × 0, 1 × 1, 1 × 2, 2 × 0, 2 × 1, 2 × 2. We might consider unique mating-type frequencies for these nine mating types in a mathematical model that allows for maternal effects or parental imprinting [63]. For the TDTae statistic and the TDT-HET statistic below, we consider six mating-type frequencies:

1. μ1 = Pr(2 × 2);
2. μ2 = Pr(2 × 1 or 1 × 2);
3. μ3 = Pr(2 × 0 or 0 × 2);
4. μ4 = Pr(1 × 1);
5. μ5 = Pr(1 × 0 or 0 × 1);
6. μ6 = Pr(0 × 0).

Further, we constrain the mating-type frequencies to satisfy the condition Pr(a × b) = Pr(b × a). This equality implies that, for example,

$$\mu_3 = \Pr(2 \times 0 \text{ or } 0 \times 2) = \Pr(2 \times 0) + \Pr(0 \times 2) = 2\Pr(2 \times 0).$$

It follows that μ3/2 = Pr(2 × 0) = Pr(0 × 2). This calculation explains why some of the mating types in the Pr(Mating type) column of Table 4.4 include the factor ½. Finally, $\sum_{i=1}^{6} \mu_i = 1$.

The TDTae statistic may be computed by making use of the two tables presented below.

$z_{k,b}$ = The indicator variable whose value is 1 when an individual sample’s observed genotype vector (for the trio) is $O_b$ and its true genotype vector is $T_k$.

$V = (t, \mu_1, \ldots, \mu_5, \mu_6)$ = Vector of latent variables for the TDTae statistic. All variables have been previously defined. The six genotype misclassification variables are presented in Chap. 2 (Table 2.1 and Sect. 2.3).

$V^{(r)}$ = The same vector as V with each of the variables replaced by their rth step estimates. These estimates are determined by applying the EM algorithm.

θ = The 3 × 3 (mis)classification matrix (Table 2.1 in Chap. 2). Below, we use the term $\theta^{(r)}$ to represent the same matrix with the rth iteration step estimates for each (mis)classification probability.

We commented above that we would provide a derivation of Pr(O002). With formulas (4.33) and (4.34), we can write

$$
\begin{aligned}
\Pr(O_{002}) &= \sum_{abc} \Pr(O_{002} \mid T_{abc})\,\Pr(T_{abc}) \\
&= \Pr(O_{002} \mid T_{222})\Pr(T_{222}) + \Pr(O_{002} \mid T_{212})\Pr(T_{212}) + \cdots \\
&\quad + \Pr(O_{002} \mid T_{011})\Pr(T_{011}) + \Pr(O_{002} \mid T_{010})\Pr(T_{010}) + \Pr(O_{002} \mid T_{000})\Pr(T_{000}) \\
&= \theta_{20}\theta_{20}\theta_{22}\,(\mu_1) + \theta_{20}\theta_{10}\theta_{22}\left(\frac{\mu_2}{2}\right)t + \cdots \\
&\quad + \theta_{00}\theta_{10}\theta_{12}\left(\frac{\mu_5}{2}\right)t + \theta_{00}\theta_{10}\theta_{02}\left(\frac{\mu_5}{2}\right)(1 - t) + \theta_{00}\theta_{00}\theta_{02}\,(\mu_6).
\end{aligned}
\tag{4.36}
$$
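The sum in Eqs. (4.33), (4.34), and (4.36) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the probabilities Pr(T_abc) are inferred from the pattern of terms in Eq. (4.36) (with the 1 × 1 mating type assumed to give t², 2t(1 − t), and (1 − t)²) rather than copied from Table 4.4. The matrix entry `theta[a][x]` is taken to mean Pr(observed x | true a), and the numerical values of `theta` and `mu` are made up for the demonstration.

```python
from itertools import product

def true_trio_probs(t, mu):
    """Pr(T_abc) for the 15 Mendelian-consistent trios (father a, mother b, child c).

    ASSUMPTION: layout inferred from the terms of Eq. (4.36), not taken from Table 4.4.
    """
    mu1, mu2, mu3, mu4, mu5, mu6 = mu
    s = 1.0 - t  # probability that a heterozygous parent transmits the + allele
    return {
        (2, 2, 2): mu1,
        (2, 1, 2): mu2 / 2 * t, (2, 1, 1): mu2 / 2 * s,
        (1, 2, 2): mu2 / 2 * t, (1, 2, 1): mu2 / 2 * s,
        (2, 0, 1): mu3 / 2, (0, 2, 1): mu3 / 2,
        (1, 1, 2): mu4 * t * t, (1, 1, 1): mu4 * 2 * t * s, (1, 1, 0): mu4 * s * s,
        (1, 0, 1): mu5 / 2 * t, (1, 0, 0): mu5 / 2 * s,
        (0, 1, 1): mu5 / 2 * t, (0, 1, 0): mu5 / 2 * s,
        (0, 0, 0): mu6,
    }

def observed_trio_prob(obs, theta, t, mu):
    """Pr(O_xyz): sum over true trios of theta_ax * theta_by * theta_cz * Pr(T_abc)."""
    x, y, z = obs
    return sum(theta[a][x] * theta[b][y] * theta[c][z] * p
               for (a, b, c), p in true_trio_probs(t, mu).items())

# Hypothetical inputs: each row of theta sums to 1 (Eq. (4.35)); mu sums to 1.
theta = [[0.98, 0.01, 0.01], [0.02, 0.96, 0.02], [0.01, 0.01, 0.98]]
mu = (0.05, 0.15, 0.10, 0.25, 0.30, 0.15)
print(observed_trio_prob((0, 0, 2), theta, 0.6, mu))  # small but non-zero
print(sum(observed_trio_prob(o, theta, 0.6, mu)
          for o in product(range(3), repeat=3)))      # the 27 probabilities sum to 1
```

With an error-free (identity) matrix the Mendelian-inconsistent trio O002 has probability zero; it becomes possible only through misclassification, which is the point of the derivation above.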

The second equality in the derivation (4.36) is determined using the probabilities listed in Table 4.4. The total number of terms in the summation (4.36) is 15. All parameters (those listed in the vector V) are estimated using the EM algorithm.

$f_1(O_{xyz}, T_{abc}; \theta) = \Pr(O_{xyz} \mid T_{abc}) = \theta_{ax}\theta_{by}\theta_{cz}$ = Probability that a trio has observed (coded) genotypes x, y, and z given that the respective true coded genotypes are a, b, and c and that the (mis)classification matrix is given by θ. We substitute the notation $f_1(O_{xyz}, T_{abc}; \theta^{(r)}) = \theta^{(r)}_{ax}\theta^{(r)}_{by}\theta^{(r)}_{cz}$ when using the rth iteration step estimates for the (mis)classification probabilities.

$f_2(T_{abc}; V) = \Pr(T_{abc})$ = Probability density function of the trio $T_{abc}$ in Table 4.4. Given specified parameter values for V, the probability density function is determined using the last column of Table 4.4. A similar definition is used for trios that have been sequenced (Table 4.20).

$\tau^{(r)}_{(abc)(xyz)}$ = rth iteration step probability estimate that the trio $O_{xyz}$ (observed genotypes) is really $T_{abc}$ before introduction of errors. This probability is also referred to as the Bayesian Posterior Probability (BPP).

$n_{xyz}$ = Number of trios in the data set where the observed trio genotypes are x, y, and z. Given the notation in Table 4.5:

$$\sum_{0 \le x, y, z \le 2} n_{xyz} = N.$$

Note that there are 27 summands on the left-hand side of this equation.

4.2.3.2

Determination of the Bayesian Posterior Probabilities (BPPs) $\tau^{(r)}_{(abc)(xyz)}$

To compute the rth step estimates of the necessary parameters in the TDTae statistic, we must first report the values of the BPPs $\tau^{(r)}_{(abc)(xyz)}$. The reason is that all rth step estimates of these parameters are functions of the BPPs, as is the case with all other statistics that use the EM algorithm (e.g., LRTae, MLRT). Applying Bayes’ rule, and following the work of Morris and Kaplan [60], we report that the BPPs are

$$
\tau^{(r)}_{(abc)(xyz)} = \frac{\Pr(O_{xyz} \mid T_{abc})\,\Pr(T_{abc})}{\Pr(O_{xyz})}
= \frac{f_1\!\left(O_{xyz}, T_{abc}; \theta^{(r)}\right) \times f_2\!\left(T_{abc}; V^{(r)}\right)}{\sum_{uvw} f_1\!\left(O_{xyz}, T_{uvw}; \theta^{(r)}\right) \times f_2\!\left(T_{uvw}; V^{(r)}\right)},
\tag{4.37}
$$

where the sum in the denominator is over all possible coded genotype vectors in Table 4.4.
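As a minimal sketch of Eq. (4.37), the function below normalizes the products f1 × f2 over all true trios supplied in a prior. The function name `trio_bpp`, the two-trio toy prior, and the numerical values of `theta` are hypothetical choices for illustration; in the actual statistic the prior would contain all 15 trios of Table 4.4 evaluated at the current estimates.

```python
def trio_bpp(obs, theta, prior):
    """Posterior Pr(T_abc | O_xyz) for every true trio abc in `prior`.

    theta[a][x] is assumed to be Pr(observed x | true a);
    prior maps (a, b, c) -> Pr(T_abc), i.e. f2 at the current parameter values.
    """
    x, y, z = obs
    joint = {abc: theta[abc[0]][x] * theta[abc[1]][y] * theta[abc[2]][z] * p
             for abc, p in prior.items()}
    total = sum(joint.values())  # Pr(O_xyz), the denominator of Eq. (4.37)
    # Note: total is zero if the observed trio is impossible under theta and prior.
    return {abc: v / total for abc, v in joint.items()}

# Toy illustration with two candidate true trios (hypothetical numbers).
theta = [[0.98, 0.01, 0.01], [0.02, 0.96, 0.02], [0.01, 0.01, 0.98]]
prior = {(0, 0, 0): 0.9, (0, 1, 1): 0.1}
print(trio_bpp((0, 0, 1), theta, prior))
```

For any observed trio, the returned values sum to 1, which is a convenient check when implementing the E step.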

4.2.3.3

TDT ae Parameter Estimates

As with all EM algorithm applications, we need to determine the rth step estimates of all relevant parameters to determine the maximum log-likelihoods and subsequently the test statistic value. For the TDTae statistic, it amounts to deriving the rth step estimates of the parameters in the (mis)classification matrix θ and the vector V. With the exception of the transmission probability t, our results are consistent with those of Morris and Kaplan [60].

Transmission probability t

The (r + 1)th iteration step estimate of t is given by

$$t^{(r+1)} = \frac{A^{(r)}}{A^{(r)} + B^{(r)}}, \tag{4.38}$$

where

$$A^{(r)} = \sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(212)(xyz)} + \tau^{(r)}_{(122)(xyz)} + 2 \times \tau^{(r)}_{(112)(xyz)} + \tau^{(r)}_{(111)(xyz)} + \tau^{(r)}_{(101)(xyz)} + \tau^{(r)}_{(011)(xyz)}\right],$$

and

$$B^{(r)} = \sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(211)(xyz)} + \tau^{(r)}_{(121)(xyz)} + \tau^{(r)}_{(111)(xyz)} + 2 \times \tau^{(r)}_{(110)(xyz)} + \tau^{(r)}_{(100)(xyz)} + \tau^{(r)}_{(010)(xyz)}\right].$$

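We have 0 ≤ x, y, z ≤ 2 for the terms that make up $A^{(r)}$ and $B^{(r)}$, for a total of $3^3 = 27$ coded genotype vectors. All 27 values are listed in Table 4.5.

Mating-type frequency parameters μi

Through application of the E and M steps of the EM algorithm, one can show

$$
\begin{aligned}
\mu_1^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\,\tau^{(r)}_{(222)(xyz)}}{N}, \\
\mu_2^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(212)(xyz)} + \tau^{(r)}_{(211)(xyz)} + \tau^{(r)}_{(122)(xyz)} + \tau^{(r)}_{(121)(xyz)}\right]}{N}, \\
\mu_3^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(201)(xyz)} + \tau^{(r)}_{(021)(xyz)}\right]}{N}, \\
\mu_4^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(112)(xyz)} + \tau^{(r)}_{(111)(xyz)} + \tau^{(r)}_{(110)(xyz)}\right]}{N}, \\
\mu_5^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\left[\tau^{(r)}_{(101)(xyz)} + \tau^{(r)}_{(100)(xyz)} + \tau^{(r)}_{(011)(xyz)} + \tau^{(r)}_{(010)(xyz)}\right]}{N}, \\
\mu_6^{(r+1)} &= \frac{\sum_{xyz} n_{xyz}\,\tau^{(r)}_{(000)(xyz)}}{N}.
\end{aligned}
\tag{4.39}
$$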
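The updates (4.38) and (4.39) share one ingredient: the expected number of trios of each true type, $e_{abc} = \sum_{xyz} n_{xyz}\tau^{(r)}_{(abc)(xyz)}$. The sketch below, with hypothetical names (`expected_true_counts`, `m_step`) and toy inputs, first accumulates these expected counts and then forms t and the six μ values. It assumes that the BPPs for each observed trio sum to 1, so that the expected counts add up to N.

```python
# Weights of each true trio in A (d transmitted) and B (+ transmitted), Eq. (4.38).
A_WEIGHTS = {(2, 1, 2): 1, (1, 2, 2): 1, (1, 1, 2): 2, (1, 1, 1): 1, (1, 0, 1): 1, (0, 1, 1): 1}
B_WEIGHTS = {(2, 1, 1): 1, (1, 2, 1): 1, (1, 1, 1): 1, (1, 1, 0): 2, (1, 0, 0): 1, (0, 1, 0): 1}
# True trios grouped by mating type, in the order mu_1, ..., mu_6 of Eq. (4.39).
MATING_TYPES = [
    [(2, 2, 2)],
    [(2, 1, 2), (2, 1, 1), (1, 2, 2), (1, 2, 1)],
    [(2, 0, 1), (0, 2, 1)],
    [(1, 1, 2), (1, 1, 1), (1, 1, 0)],
    [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 0)],
    [(0, 0, 0)],
]

def expected_true_counts(n, tau):
    """e[abc] = sum over observed trios xyz of n[xyz] * tau[(abc, xyz)]."""
    e = {}
    for (abc, xyz), weight in tau.items():
        e[abc] = e.get(abc, 0.0) + n.get(xyz, 0) * weight
    return e

def m_step(e):
    """Return (t, [mu_1, ..., mu_6]) from expected true-trio counts."""
    total = sum(e.values())  # equals N when the BPPs sum to 1 per observed trio
    a = sum(w * e.get(k, 0.0) for k, w in A_WEIGHTS.items())
    b = sum(w * e.get(k, 0.0) for k, w in B_WEIGHTS.items())
    mu = [sum(e.get(k, 0.0) for k in group) / total for group in MATING_TYPES]
    return a / (a + b), mu

# Toy expected counts (hypothetical): 80 trios from 1 x 1 matings, 20 from 2 x 2.
t_new, mu_new = m_step({(1, 1, 2): 30, (1, 1, 1): 40, (1, 1, 0): 10, (2, 2, 2): 20})
print(t_new, mu_new)  # t = 100 / 160 = 0.625; mu_1 = 0.2, mu_4 = 0.8
```

Working through expected counts keeps the two updates consistent with each other and avoids repeating the double sum over true and observed trios.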

Morris and Kaplan [60] derived the same estimates using different notations. Recall that N is the number of genotyped trios in the sample.

Genotype (mis)classification parameters θuv

Due to the complexity in deriving the genotype (mis)classification rth iteration step estimates, we demonstrate how these terms are derived mathematically. The interested reader may review published results in [57, 59, 60].

A straightforward way to estimate the (mis)classification parameters $\theta^{(r)}_{uv}$, 0 ≤ u, v ≤ 2, in the matrix $\theta^{(r)}$ is through consideration of the pairs $(T_{abc}, O_{xyz})$ in Tables 4.4 and 4.5, respectively. We determine pairs of genotype vectors where exactly one, two, or three genotypes in $T_{abc}$ is/are equal to the value u, and at least one of the corresponding genotype(s) in $O_{xyz}$ is/are equal to the value v. For example, if u = 0 and v = 1, and we consider the trio $T_{100}$, then the trios $O_{xyz}$ that correspond precisely to one 0 genotype are $O_{x10}$, $O_{x12}$, $O_{x01}$, $O_{x21}$, 0 ≤ x ≤ 2 (a total of 12 trios). The trios $O_{xyz}$ that correspond to precisely two 0 genotypes are $O_{x11}$, 0 ≤ x ≤ 2 (a total of three trios), and so forth.

Computing $\theta^{(r)}_{uv}$, u ≠ v

Let:

$$
E^{(r)} = E\!\left[\ln L_c^{(r)} \,\middle|\, OD\right]
= \sum_{abc}\sum_{xyz} I(T_{abc})\, n_{xyz}\, \tau^{(r)}_{(abc)(xyz)} \left[\ln\theta^{(r)}_{ax} + \ln\theta^{(r)}_{by} + \ln\theta^{(r)}_{cz}\right],
\tag{4.40a}
$$

where abc comes from Table 4.4 and xyz comes from Table 4.5. Specifically, $I(T_{abc})$ is the indicator function for whether or not the trio $T_{abc}$ is listed in Table 4.4. We can express the expected value this way because the probability density term ($f_2$) in the complete likelihood $L_c^{(r)}$ contains no misclassification terms; only the $f_1$ term does. Using the method explained in Chap. 1, we compute

$$
\begin{aligned}
\frac{\partial E^{(r)}}{\partial\theta_{uv}} = \frac{\partial}{\partial\theta_{uv}} \sum_{xyz} \Big\{\,
& n_{xyz}\,\tau^{(r)}_{(222)(xyz)}\left[\ln(\theta_{2x}) + \ln(\theta_{2y}) + \ln(\theta_{2z})\right] \\
+\; & n_{xyz}\,\tau^{(r)}_{(212)(xyz)}\left[\ln(\theta_{2x}) + \ln(\theta_{1y}) + \ln(\theta_{2z})\right] \\
+\; & n_{xyz}\,\tau^{(r)}_{(211)(xyz)}\left[\ln(\theta_{2x}) + \ln(\theta_{1y}) + \ln(\theta_{1z})\right] \\
& \;\vdots \\
+\; & n_{xyz}\,\tau^{(r)}_{(010)(xyz)}\left[\ln(\theta_{0x}) + \ln(\theta_{1y}) + \ln(\theta_{0z})\right] \\
+\; & n_{xyz}\,\tau^{(r)}_{(000)(xyz)}\left[\ln(\theta_{0x}) + \ln(\theta_{0y}) + \ln(\theta_{0z})\right] \Big\}.
\end{aligned}
\tag{4.40b}
$$

Solving the six equations $\partial E^{(r)}/\partial\theta_{uv} = 0$ provides updated estimates $\theta^{(r+1)}_{uv}$, u ≠ v. We may reduce the number of equations to three or fewer when considering genotype misclassification models that have only three or fewer parameters (see Chap. 2, Genotype Error Models).

Observe that the parameter estimates $\theta^{(r+1)}_{uu}$, 0 ≤ u ≤ 2, are solved using the equations:

$$\theta^{(r+1)}_{uu} = 1 - \sum_{v \ne u} \theta^{(r+1)}_{uv}.$$

4.2.3.4

Log-Likelihood of Observed Data

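For the likelihood form of the TDT statistic [57] and the TDTae in this section, the null hypothesis is H0: t = ½. The alternative hypothesis, H1, is t ≠ ½. In general, using Eq. (4.33), we can write the log-likelihood of the observed data for the rth iteration step as

$$
\begin{aligned}
\ln L^{(r)} &= \sum_{xyz} n_{xyz} \ln\!\left[\Pr(O_{xyz})\right] \\
&= \sum_{xyz} n_{xyz} \ln\!\left[\sum_{abc} \Pr(O_{xyz} \mid T_{abc})\,\Pr(T_{abc})\right] \\
&= \sum_{xyz} n_{xyz} \ln\!\left[\sum_{abc} f_1\!\left(O_{xyz}, T_{abc}; \theta^{(r)}\right) \times f_2\!\left(T_{abc}; V^{(r)}\right)\right].
\end{aligned}
\tag{4.41}
$$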

Given the information above, under the null hypothesis, the transmission probability t (r) in V (r) is fixed at ½. Under the alternative, all parameters θ (r) and V (r) are estimated.

4.2.3.5

The TDT ae Statistic

The TDTae statistic is given by the formula:

$$\mathrm{TDT}_{ae} = 2 \times \left[\ln L^{(S_1,\, r_1(S_1))} - \ln L^{(S_0,\, r_0(S_0))}\right]. \tag{4.42}$$

From our definitions above, there are 5 (from $V^{(r)}$) + 6 (from $\theta^{(r)}$) = 11 parameters estimated under the null hypothesis, and 12 parameters estimated under the alternative. Under the null hypothesis the transmission probability is fixed to ½. It follows that the likelihood ratio statistic (4.42) is asymptotically distributed as a central chi-square distribution with one df under the null hypothesis.
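Given maximized log-likelihoods under H1 and H0, the statistic (4.42) and its asymptotic p-value can be computed with the standard library alone. The sketch below relies on the identity that the upper tail of a one-df chi-square variable at x equals erfc(√(x/2)); the function names and the two log-likelihood values in the usage line are hypothetical.

```python
import math

def lrt_statistic(loglik_alt, loglik_null):
    """Twice the difference of maximized log-likelihoods, as in Eq. (4.42)."""
    return 2.0 * (loglik_alt - loglik_null)

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a central chi-square with one df.

    Uses P(X > x) = erfc(sqrt(x / 2)); negative inputs (numerical noise) are clipped to 0.
    """
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical maximized log-likelihoods, for illustration only.
stat = lrt_statistic(-1000.0, -1003.2)
print(stat, chi2_1df_pvalue(stat))
```

The clipping at zero is a defensive choice: when the EM runs under H1 and H0 stop at slightly different tolerances, the difference can be marginally negative even though the true statistic cannot be.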

4.3 Statistical Tests that Consider Heterogeneity Other Than Misclassification

The test statistics presented in Sect. 4.2 all focus on phenotype and/or genotype data that have been misclassified. In this section, we consider heterogeneity where there is no misclassification. The types of heterogeneity are locus or genetic heterogeneity, and phenotype heterogeneity.

4.3.1 Mixture Likelihood Ratio Test (MLRT) for Genetic Association in the Presence of Locus Heterogeneity

Starting with this section, we shift our focus from test statistics that address phenotype and/or genotype misclassification errors to ones that consider another type of heterogeneity, specifically locus heterogeneity. In this section, we present a test statistic, developed by Zhou and Pan, for case-control genetic association [7]. The authors refer to it as the MLRT or mixture likelihood ratio test. Throughout this section, whenever we refer to Zhou and Pan, we will mean the referenced paper documenting their statistic. The MLRT is applied to SNP genotype data for cases. The underlying assumption in their mathematical model of genotype frequencies is that the control population is a homogeneous population for which the SNP genotype frequencies follow HWE proportions. The case population is a mixture of two sub-populations: (i) non-associated cases, whose SNP genotype frequencies are equal to the control genotype frequencies, and (ii) associated cases, whose SNP genotype frequencies also follow HWE proportions, but with different allele frequencies than those for the control population. Sub-population (i) may be thought of as cases who are affected due to mutations/variants at SNPs/loci at other locations in the genome, and not due to mutations/variants at the typed SNP. Other possible factors for affection might be environmental exposures, where the penetrances, Pr(affected|e, g), satisfy the equation Pr(affected|e, g) = Pr(affected|e). In this equation, e is an environmental variable measurement, and g is a specific genotype at the typed SNP (see, e.g., [53]). The statistic, based on the EM algorithm, is applied only to the case data. The statistic tests for association in the presence of locus heterogeneity. This approach has been applied before for statistical tests of genetic linkage in the presence of locus heterogeneity [64–69].
Comprehensive discussions of heterogeneity tests for linkage analysis may be found in Ott [54] and Londono et al. [8]. We present notation first, followed by observed log-likelihoods, the EM algorithm rth step estimates of all relevant parameters, the test statistic, the null hypothesis being tested, and the permutation method used to determine significance.

4.3.1.1

Notation and Definitions

In what follows, we provide relevant notation for the derivation of the test statistic developed by Zhou and Pan, modifying it to be consistent with the notation we have employed throughout this book. While Zhou and Pan consider the one-SNP scenario and also multi-SNP scenarios, we confine our attention to the one-SNP scenario. We provide reasons below.

Observed data

$n_{ij}$, 0 ≤ i ≤ 1, 0 ≤ j ≤ 2 = Number of individuals with genotype j and phenotype i.

Group 0 = The set of cases who are not associated with the genotyped SNP. The non-associated cases are characterized above. The disease allele frequency in the population from which this sample is drawn is equal to the disease allele frequency in the control population. We indicate the sample size by $N_0$. Zhou and Pan use the index “2” to refer to this group. We choose “0” to be consistent with our notation throughout the book, that is, in this work, the index “0” refers to the control group/population.

Group 1 = The set of all cases who are associated with the genotyped SNP. By associated, we mean that the disease allele frequency in the population from which this sample is drawn is different than the disease allele frequency in the control population. We indicate the sample size by $N_1$. Observe that $N_0 + N_1 = \sum_j n_{1j} = N_{1+}$, the total number of cases.

$p_{d0}$ = True disease allele frequency in the control population and in the sub-population of cases who are “not associated” with the typed SNP; d is the disease allele. Zhou and Pan refer to this parameter as $\theta_b$.

$p_{+0} = 1 - p_{d0}$, true low-risk or wild-type allele frequency in the control population and in the sub-population of cases who are “not associated” with the typed SNP, where “+” is the low-risk allele. Throughout the work with the MLRT, the parameters $p_{d0}$ and $p_{+0}$ are fixed. They are considered to be known (through estimation using the counting method), and are never updated in any application of the EM algorithm.

$p^{t}_{d1}$ = True disease allele frequency in the sub-population of cases who are associated with the typed SNP; d is the disease allele. Zhou and Pan refer to this parameter as the high-risk allele frequency and use the notation θ.

$p^{t}_{+1} = 1 - p^{t}_{d1}$, true non-disease or wild-type allele frequency in the sub-population of cases who are associated with the typed SNP, where “+” is the non-disease allele.

$\pi_1$ = The true probability that a case/affected individual is “associated” (e.g., has the disease mutations/variants at the SNP locus). That is, this variable represents the probability that a case is in Group 1.
$\tau^{(r)}_{(m)(j)}$ = rth step estimate of the Bayesian Posterior Probability (BPP) that an affected individual (case) with genotype j is in Group m (m = 0 or 1).

$g_{ji}$, 0 ≤ i ≤ 1, 0 ≤ j ≤ 2 = Frequency for genotype j conditional on phenotype i in the entire (mixed) population. The value j indicates the number of d alleles in an individual’s genotype. Recall that i = 0 means control, and i = 1 means case. The d allele is specified by the researcher. Typically, the allele with the smaller ( […] 0 and $p^{t}_{d1}$ = $p_{d0}$. Zhou and Pan provide the general log-likelihood over all cases:

$$
\ln L\!\left(\pi_1^{(r)}, p_{d1}^{t(r)}, p_{d0}\right) = \sum_j n_{1j} \ln\!\left[\pi_1^{(r)}\, g_{j1}^{t(r)} + \left(1 - \pi_1^{(r)}\right) g_{j0}\right].
\tag{4.50}
$$

For Eq. (4.50), the log-likelihoods are only defined for case data. From Eqs. (4.44) and (4.45), we know that the conditional genotype frequencies $g_{j0}$ and $g^{t(r)}_{j1}$ are functions of the parameters $p_{d0}$ and $p^{t(r)}_{d1}$. Zhou and Pan’s test statistic, labeled the Mixture Likelihood Ratio Test (MLRT), is given by

$$
\mathrm{MLRT} = 2\left[\ln L\!\left(\pi_1^{(r_1(s_1))}, p_{d1}^{t(r_1(s_1))}, p_{d0}\right) - \ln L\!\left(0, p_{d0}, p_{d0}\right)\right].
\tag{4.51}
$$
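Equations (4.50) and (4.51) translate almost directly into code. The sketch below assumes, as the model description in Sect. 4.3.1 states, that the genotype frequencies in both groups follow HWE proportions in the respective disease allele frequency; the function names are hypothetical. The case counts and parameter values in the usage lines are those reported in the example application of this section.

```python
import math

def hwe(p_d):
    """HWE genotype frequencies for 0, 1, and 2 copies of the d allele."""
    q = 1.0 - p_d
    return [q * q, 2.0 * p_d * q, p_d * p_d]

def mixture_loglik(n_case, pi1, p_d1, p_d0):
    """Case-only log-likelihood of Eq. (4.50) for genotype counts n_case = [n10, n11, n12]."""
    g1, g0 = hwe(p_d1), hwe(p_d0)
    return sum(n * math.log(pi1 * a + (1.0 - pi1) * b)
               for n, a, b in zip(n_case, g1, g0))

def mlrt(n_case, pi1_hat, p_d1_hat, p_d0):
    """Eq. (4.51): the null log-likelihood sets pi1 = 0, so only p_d0 matters there."""
    null = mixture_loglik(n_case, 0.0, p_d0, p_d0)
    return 2.0 * (mixture_loglik(n_case, pi1_hat, p_d1_hat, p_d0) - null)

cases = [355, 319, 76]  # total case counts used in the example application
print(mixture_loglik(cases, 0.0, 0.25, 0.25))  # null log-likelihood, about -727.86
print(mlrt(cases, 0.5990, 0.3568, 0.25))       # about 31.27 at the reported MLEs
```

Note that the function only evaluates the statistic at supplied estimates; finding the maximizing values of $\pi_1$ and $p^{t}_{d1}$ is the job of the EM iterations.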

The iteration step $r_1(s_1)$ has the same meaning here as it does for test statistics (4.15), (4.32), and (4.42). What is interesting here is that we have no value $r_0(s_0)$. The reason is that the log-likelihood under the null hypothesis is computed only once, and the parameter estimates are determined using control genotype frequencies (4.44). Under H0, the condition $\pi_1 = 0$ is equivalent to $p^{t(r_1)}_{d1} = p_{d0}$, so hypothesis testing cannot be performed by making use of an asymptotic null distribution. This phenomenon has been observed in the heterogeneity LOD or “H-LOD” statistic developed by C.A.B. Smith and extended by Hodge, Ott, and others (see, e.g., [54, 64, 66, 68, 76]). For the MLRT statistic, Zhou and Pan propose determining significance by permutation.

4.3.1.3

Example Application

Because the MLRT uses data either in HWE (Group 0) or data that are a mixture of sets (Group 1), each of which is in HWE, we consider an example where genotype frequencies are modeled using a modification of the formulas presented in Chap. 1, Sect. 1.4.1.1. In our example, our generating parameters are

a. Disease allele frequency in Group 0 (controls, non-associated cases) = $p_{d0}$ = 0.25.
b. Disease allele frequency in Group 1 (associated cases) = $p^{t}_{d1}$ = 0.35.
c. Proportion of cases in Group 1 = $\pi_1$ = 0.64.
d. Number of cases = $N_{1+}$ = 750.
e. Number of controls = 750.
f. Number of 0th step (starting) vectors $\left(p^{t(0)}_{d1}, \pi^{(0)}_1\right)$ = 300.
g. Number of steps (r) for each starting vector (Item f) = 200.
h. Number of permutations: 10,000.

The presentation here is similar to that in Sect. 4.2.1.4, in that we specify starting values for the parameters (a–c), compute the log-likelihoods and BPP values $\tau^{(r)}_{(m)(j)}$, and update parameters and log-likelihoods for r = 0, 1, and 2.

We have a slight modification in this example when calculating the MLRT statistic. Because we have only two parameters (Item f) when performing updating at each step, we are able to compute the maximum log-likelihood under H1 (Formula (4.51)) over all 300 × 200 = 60,000 settings of the parameters $\left(p^{t}_{d1}, \pi_1\right)$. In this example, we specify case and control genotype counts that are either equal to or close to the expected values for the parameter settings. We do this to document, for at least one example, how well the MLRT statistic performs in terms of obtaining accurate estimates of the true parameter values $p^{t}_{d1}$ and $\pi_1$. Let us consider the expected control counts first. From Eq. (4.44), we determine

$$
\begin{aligned}
E(g_{00}) &= 750\,(p_{+0})^2 = 750\,(1 - 0.25)^2 = 421.875 \approx 422, \\
E(g_{10}) &= 750\,(2)(p_{d0})(p_{+0}) = 750\,(2)(0.25)(1 - 0.25) = 281.250 \approx 281, \\
E(g_{20}) &= 750\,(p_{d0})^2 = 750\,(0.25)^2 = 46.875 \approx 47.
\end{aligned}
\tag{4.52}
$$

In Eq. (4.52), E() is the expectation operator/function. For the non-associated case genotype counts, we use Eq. (4.45), and items (c) and (d) above. It follows that:

$$
\begin{aligned}
\text{Expected count for 0 genotype, non-associated cases} &= 750\,(1 - \pi_1)(p_{+0})^2 = 750\,(1 - 0.64)(1 - 0.25)^2 = 151.875 \approx 152, \\
\text{Expected count for 1 genotype, non-associated cases} &= 750\,(1 - \pi_1)(2)(p_{d0})(p_{+0}) = 750\,(1 - 0.64)(2)(0.25)(1 - 0.25) = 101.25 \approx 101, \\
\text{Expected count for 2 genotype, non-associated cases} &= 750\,(1 - \pi_1)(p_{d0})^2 = 750\,(1 - 0.64)(0.25)^2 = 16.875 \approx 17.
\end{aligned}
\tag{4.53}
$$
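The arithmetic in Eqs. (4.52) and (4.53) follows one pattern: a sample size, times a group weight, times the HWE genotype frequencies. The short sketch below reproduces it; the function names are hypothetical, and the generating values are items (a), (c), (d), and (e) of the list above.

```python
def hwe(p_d):
    """HWE genotype frequencies for 0, 1, and 2 copies of the d allele."""
    q = 1.0 - p_d
    return [q * q, 2.0 * p_d * q, p_d * p_d]

def expected_counts(n_total, weight, p_d):
    """Expected genotype counts for a group that makes up `weight` of n_total individuals."""
    return [n_total * weight * g for g in hwe(p_d)]

controls = expected_counts(750, 1.0, 0.25)           # Eq. (4.52)
non_assoc = expected_counts(750, 1.0 - 0.64, 0.25)   # Eq. (4.53)
print([round(c) for c in controls])   # [422, 281, 47]
print([round(c) for c in non_assoc])  # [152, 101, 17]
```

The same function applies to the associated cases by passing the weight $\pi_1$ and the allele frequency $p^{t}_{d1}$.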

Similarly, for the associated cases, we have (Eq. (4.46))

$$
\begin{aligned}
E\!\left(g^{t}_{01}\right) &= 750\,(\pi_1)\left(p^{t}_{+1}\right)^2 = 750\,(0.64)(1 - 0.35)^2 = 202.8 \approx 203, \\
E\!\left(g^{t}_{11}\right) &= 750\,(\pi_1)(2)\left(p^{t}_{d1}\right)\left(p^{t}_{+1}\right) = 750\,(0.64)(2)(0.35)(1 - 0.35) = 218.4 \approx 218, \\
E\!\left(g^{t}_{21}\right) &= 750\,(\pi_1)\left(p^{t}_{d1}\right)^2 = 750\,(0.64)(0.35)^2 = 58.8 \approx 59.
\end{aligned}
\tag{4.54}
$$

The rounded values from Eqs. (4.52)–(4.54) are precisely the values given in Table 4.6. To compute the MLRT statistic, we add the genotype counts for the non-associated and associated cases, and use the “Total Cases” genotype counts. In this way, we “hide” the true $\pi_1$ and $p^{t}_{d1}$ values and, through application of the statistic, assess the accuracy of these parameters via estimation. In what follows, we present calculations and results for the r = 0, 1, 2 steps. After that, we present the final result.

Table 4.6 Case and control genotype counts used in example MLRT calculation

                         Genotype
Cases                    0      1      2
Non-associated           152    101    17
Associated               203    218    59
Total cases              355    319    76
Total controls           422    281    47

The counts here are determined using the expected values in Eqs. (4.52)–(4.54). The values in the Total cases and Total controls rows are treated as the observed values $n_{ij}$ for the determination of the MLRT statistic.

(r = 0 step)

Null hypothesis (H0) From Formula (4.44) and Table 4.6, we determine

$$
\begin{aligned}
p_{d0} &= \frac{281 + 2(47)}{2(422 + 281 + 47)} = \frac{375}{1500} = 0.25, \\
p_{+0} &= 1 - p_{d0} = 0.75.
\end{aligned}
\tag{4.55}
$$

Using Eq. (4.44), we compute the respective control genotype frequencies as

$$
\begin{aligned}
g_{00} &= (p_{+0})^2 = (0.75)^2 = 0.5625, \\
g_{10} &= 2\,(p_{d0})(p_{+0}) = 2(0.25)(0.75) = 0.3750, \\
g_{20} &= (p_{d0})^2 = (0.25)^2 = 0.0625.
\end{aligned}
\tag{4.56}
$$

In the specification of the MLRT statistic (Eqs. (4.50) and (4.51)), the log-likelihood of our data under the null hypothesis is

$$
\begin{aligned}
\ln L(0, p_{d0}, p_{d0}) &= \sum_j n_{1j} \ln\!\left(g_{j0}\right) \\
&= (355) \times \ln(0.5625) + (319) \times \ln(0.3750) + (76) \times \ln(0.0625) \\
&= -727.8555.
\end{aligned}
\tag{4.57}
$$

We noted above, in the notation section, that the log-likelihood (4.57) does not change with each step. Thus, there is no need for an (r) superscript.

Alternative hypothesis (H1) For this hypothesis, we begin with random starting points. In this example, we specify $\left(p^{t(0)}_{d1}, \pi^{(0)}_1\right) = (0.1590, 0.6235)$. From the $p^{t(0)}_{d1}$ value, we have $p^{t(0)}_{+1} = 1 - 0.1590 = 0.8410$. Also, $1 - \pi^{(0)}_1 = 0.3765$. These starting point vectors were chosen from a set of 300 random vectors (Item (f) above), because they produced the largest log-likelihoods (through 200 EM steps, Item (g)) over the entire set (results not shown).

Using Eq. (4.46), and following the pattern in Eq. (4.56), the respective associated-case genotype frequencies are

$$
\begin{aligned}
g^{t(0)}_{01} &= (0.8410)^2 = 0.7073, \\
g^{t(0)}_{11} &= 2(0.1590)(0.8410) = 0.2674, \\
g^{t(0)}_{21} &= (0.1590)^2 = 0.0253.
\end{aligned}
\tag{4.58}
$$

We apply Eq. (4.50) and compute the r = 0th step log-likelihood under the alternative hypothesis as

$$
\begin{aligned}
\ln L\!\left(\pi^{(0)}_1, p^{t(0)}_{d1}, p_{d0}\right) &= \sum_j n_{1j} \ln\!\left[\pi^{(0)}_1\, g^{t(0)}_{j1} + \left(1 - \pi^{(0)}_1\right) g_{j0}\right] \\
&= (355) \times \ln(0.6235 \times 0.7073 + 0.3765 \times 0.5625) \\
&\quad + (319) \times \ln(0.6235 \times 0.2674 + 0.3765 \times 0.3750) \\
&\quad + (76) \times \ln(0.6235 \times 0.0253 + 0.3765 \times 0.0625) \\
&= -151.3996 - 375.7673 - 246.0004 \\
&= -773.1673.
\end{aligned}
\tag{4.59}
$$

(r = 1 step)

Null hypothesis (H0) Recall from the construction of the MLRT statistic that the null hypothesis values are fixed at the 0th step. Therefore, we do not perform any further calculations for this hypothesis.

Alternative hypothesis (H1) The first item in determining the Step 1 (updated) parameter estimates is the determination of the BPPs $\tau^{(r)}_{(m)(j)}$ (Formula (4.47)) for r = 1. The calculations are most readily determined by computing the terms $\pi^{(0)}_1 \times g^{t(0)}_{j1}$ and $\left(1 - \pi^{(0)}_1\right) \times g_{j0}$ (values from the previous section). These terms are the numerators for each of the six (two groups × three genotypes) $\tau^{(r)}_{(m)(j)}$ values. Using the values in Table 4.7, we can determine all BPPs. We have

$$
\begin{aligned}
\tau^{(1)}_{(0)(0)} &= \frac{0.2118}{0.2118 + 0.4410} = \frac{0.2118}{0.6528} = 0.3244, &
\tau^{(1)}_{(1)(0)} &= \frac{0.4410}{0.6528} = 0.6756, \\
\tau^{(1)}_{(0)(1)} &= \frac{0.1412}{0.1412 + 0.1667} = \frac{0.1412}{0.3079} = 0.4585, &
\tau^{(1)}_{(1)(1)} &= \frac{0.1667}{0.3079} = 0.5415, \\
\tau^{(1)}_{(0)(2)} &= \frac{0.0235}{0.0235 + 0.0158} = \frac{0.0235}{0.0393} = 0.5989, &
\tau^{(1)}_{(1)(2)} &= \frac{0.0158}{0.0393} = 0.4011.
\end{aligned}
\tag{4.60}
$$
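The E step behind Eq. (4.60) can be sketched as follows, assuming HWE genotype frequencies in both groups as in the model description. The function names are hypothetical; the inputs in the usage line are the starting values and the control allele frequency from the example.

```python
def hwe(p_d):
    """HWE genotype frequencies for 0, 1, and 2 copies of the d allele."""
    q = 1.0 - p_d
    return [q * q, 2.0 * p_d * q, p_d * p_d]

def e_step(pi1, p_d1, p_d0):
    """Return (tau1, tau0): per-genotype BPPs of Group 1 and Group 0 membership."""
    num1 = [pi1 * g for g in hwe(p_d1)]          # numerators for associated cases
    num0 = [(1.0 - pi1) * g for g in hwe(p_d0)]  # numerators for non-associated cases
    tau1 = [a / (a + b) for a, b in zip(num1, num0)]
    tau0 = [1.0 - t for t in tau1]               # the two BPPs sum to 1 per genotype
    return tau1, tau0

tau1, tau0 = e_step(pi1=0.6235, p_d1=0.1590, p_d0=0.25)
print([round(t, 4) for t in tau1])  # close to 0.6756, 0.5415, 0.4011
print([round(t, 4) for t in tau0])  # close to 0.3244, 0.4585, 0.5989
```

Small differences in the fourth decimal place relative to Eq. (4.60) can arise because the worked example rounds intermediate values.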

Table 4.7 Numerator values for all BPPs $\tau^{(1)}_{(m)(j)}$, indexed by group (m) and genotype (j)

                                 Genotype (j)
Group                            0         1         2
m = 0 (Non-associated cases)     0.2118    0.1412    0.0235
m = 1 (Associated cases)         0.4410    0.1667    0.0158

Each of the cells in this table is computed using the appropriate numerator in Formula (4.47). For example, if m = 0, then the value in the column corresponding to the j genotype is $\left(1 - \pi^{(0)}_1\right) \times g_{j0}$. If m = 1, then the value in the column corresponding to the j genotype is $\pi^{(0)}_1 \times g^{t(0)}_{j1}$. Recall that $\pi^{(0)}_1 = 0.6235$ and the genotype frequencies are provided in Eqs. (4.56) and (4.58). Using Formula (4.47), the sum of the column corresponding to the (j)th genotype is the denominator of the BPP $\tau^{(1)}_{(m)(j)}$.

As a check that the calculations are correct, one may compute the sums $\tau^{(1)}_{(0)(j)} + \tau^{(1)}_{(1)(j)}$ for each genotype, and confirm that they sum to 1.0.

What do these BPPs tell us? They answer the question: what is the probability that a case with a given genotype is in either group (associated cases or non-associated cases)? This information may be useful, because, if there is significant association among the phenotype and genotypes via the MLRT, these BPPs can inform us as to what cases would be prioritized for follow-up using molecular/biological/clinical methods. Note that we would not use (r = 1)st step estimates of the BPPs, but rather those at the rth step where the MLEs have been determined.

Next, we compute the parameters $\left(p^{t(1)}_{d1}, \pi^{(1)}_1\right)$. Instead of choosing these values randomly, we update them using the information from the 0th step. Given Formula (4.48), the BPPs computed in Eqs. (4.60), and the values in Table 4.7, we determine

$$
\begin{aligned}
p^{t(1)}_{d1} &= \frac{n_{11}\tau^{(1)}_{(1)(1)} + 2 n_{12}\tau^{(1)}_{(1)(2)}}{2\left[n_{10}\tau^{(1)}_{(1)(0)} + n_{11}\tau^{(1)}_{(1)(1)} + n_{12}\tau^{(1)}_{(1)(2)}\right]} \\
&= \frac{(319)(0.5415) + 2(76)(0.4011)}{(2)\left((355)(0.6756) + (319)(0.5415) + (76)(0.4011)\right)} \\
&= \frac{233.693}{886.093} \\
&= 0.2637.
\end{aligned}
\tag{4.61}
$$

From the $p^{t(1)}_{d1}$ value, we have $p^{t(1)}_{+1} = 1 - 0.2637 = 0.7363$. Applying Formula (4.49), the values in Eq. (4.60), and the values in Table 4.7, we calculate that the updated mixture probability is

$$
\begin{aligned}
\pi^{(1)}_1 &= \frac{n_{10}\tau^{(1)}_{(1)(0)} + n_{11}\tau^{(1)}_{(1)(1)} + n_{12}\tau^{(1)}_{(1)(2)}}{N_{1+}} \\
&= \frac{(355)(0.6756) + (319)(0.5415) + (76)(0.4011)}{750} \\
&= 0.5907.
\end{aligned}
\tag{4.62}
$$
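The two updates in Eqs. (4.61) and (4.62) use the same BPP-weighted case counts, so they can be computed together. The sketch below uses a hypothetical function name; its inputs are the total case counts of Table 4.6 and the Group 1 BPPs of Eq. (4.60).

```python
def m_step(n_case, tau1):
    """Update (pi1, p_d1) from case genotype counts and Group 1 BPPs.

    n_case = [n10, n11, n12]; tau1[j] = BPP that a case with genotype j is in Group 1.
    """
    weighted = [n * t for n, t in zip(n_case, tau1)]  # expected Group 1 cases per genotype
    total = sum(weighted)                             # expected number of Group 1 cases
    p_d1 = (weighted[1] + 2.0 * weighted[2]) / (2.0 * total)  # d alleles over all alleles
    pi1 = total / sum(n_case)
    return pi1, p_d1

pi1_new, p_d1_new = m_step([355, 319, 76], [0.6756, 0.5415, 0.4011])
print(round(pi1_new, 4), round(p_d1_new, 4))  # 0.5907 0.2637
```

The allele frequency update is the usual allele-counting estimate applied to the expected (fractional) number of associated cases, which is why the same weighted total appears in both denominators.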

Also, $1 - \pi^{(1)}_1 = 0.4093$. Under Hardy-Weinberg, the updated associated-case genotype frequencies are

$$
\begin{aligned}
g^{t(1)}_{01} &= (0.7363)^2 = 0.5421, \\
g^{t(1)}_{11} &= 2(0.7363)(0.2637) = 0.3884, \\
g^{t(1)}_{21} &= (0.2637)^2 = 0.0696.
\end{aligned}
\tag{4.63}
$$

Finally, applying Eq. (4.50), we compute the r = 1st step log-likelihood under the alternative hypothesis as

$$
\begin{aligned}
\ln L\!\left(\pi^{(1)}_1, p^{t(1)}_{d1}, p_{d0}\right) &= \sum_j n_{1j} \ln\!\left[\pi^{(1)}_1\, g^{t(1)}_{j1} + \left(1 - \pi^{(1)}_1\right) g_{j0}\right] \\
&= (355) \times \ln(0.5907 \times 0.5421 + 0.4093 \times 0.5625) \\
&\quad + (319) \times \ln(0.5907 \times 0.3884 + 0.4093 \times 0.3750) \\
&\quad + (76) \times \ln(0.5907 \times 0.0696 + 0.4093 \times 0.0625) \\
&= -211.947 - 306.242 - 205.810 \\
&= -724.000.
\end{aligned}
\tag{4.64}
$$

There are several changes we observe when updating from the 0th (random starting) step to the 1st step. We list them by parameter. Each “Δ” refers to the difference in the respective parameters/log-likelihood. That is, we compute (value at Step 1 − value at Step 0).

$\Delta p^{t(r)}_{d1} = 0.1048$,
$\Delta \pi^{(1)}_1 = -0.0328$,
$\Delta g^{t(r)}_{01} = -0.1652$,
$\Delta g^{t(r)}_{11} = 0.1210$,
$\Delta g^{t(r)}_{21} = 0.0443$,
$\Delta(\text{log-likelihood}) = 49.168$.

From the list (a)–(h) at the beginning of Sect. 4.3.1.3, we know that the true disease allele frequency in the associated cases is 0.35. Our original estimate for the disease allele frequency in Step 0 was 0.1590. Given that $\Delta p^{t(r)}_{d1}$ is greater than 0.10, this result suggests that in just one step, our estimated disease allele frequency is moving closer to its true value. The same cannot be said about $\pi^{(1)}_1$. Its true value is 0.64, and the initial estimate was 0.6235. However, after one step, the value moves away from 0.64 to under 0.60. Because we have the constraint of Hardy-Weinberg Equilibrium proportions for our true genotype frequencies and for the frequencies determined in the MLRT statistic, the fact that the disease allele frequency moves closer to its true value means that the genotype frequencies also move closer to their true values. Finally, the increase in the log-likelihood suggests that the parameter estimates at Step 1 are closer to the true values than the estimates for Step 0. Also, because the log-likelihood of the null hypothesis is fixed, our increase suggests that we may obtain an MLRT statistic in which we reject the null hypothesis in favor of the alternative (which we know to be true, based on our specifications of the genetic model parameters in list (a)–(h)).

(r = 2 step)

Alternative hypothesis (H1) Like the (r = 1) step calculations, our first goal is the determination of the BPPs $\tau^{(2)}_{(m)(j)}$ for r = 2. We noted in the previous section that the calculations are most readily determined by computing the terms $\pi^{(1)}_1 \times g^{t(1)}_{j1}$ and $\left(1 - \pi^{(1)}_1\right) \times g_{j0}$ (with r updated to 1). Using the values in Table 4.8, we determine

$$
\begin{aligned}
\tau^{(2)}_{(0)(0)} &= 0.4182, & \tau^{(2)}_{(1)(0)} &= 0.5818, \\
\tau^{(2)}_{(0)(1)} &= 0.4008, & \tau^{(2)}_{(1)(1)} &= 0.5992, \\
\tau^{(2)}_{(0)(2)} &= 0.3837, & \tau^{(2)}_{(1)(2)} &= 0.6163.
\end{aligned}
\tag{4.65}
$$

Table 4.8 Numerator values for all BPPs $\tau^{(2)}_{(m)(j)}$, indexed by group (m) and genotype (j), for determination of r = 2nd step parameter estimates

                                 Genotype (j)
Group                            0         1         2
m = 0 (non-associated cases)     0.2302    0.1535    0.0256
m = 1 (associated cases)         0.3202    0.2294    0.0411

Each of the cells in this table is computed using the appropriate numerator in Formula (4.47). Recall that $\pi^{(1)}_1 = 0.5907$ and the genotype frequencies are provided in Eqs. (4.56) and (4.63). Also, as indicated in Formula (4.47), the sum of the column corresponding to the (j)th genotype is the denominator of the BPP $\tau^{(2)}_{(m)(j)}$.

As before, we use the updated BPPs to update the other parameter estimates. Like the Step 1 calculations, we apply Formulas (4.48) and (4.49), the BPPs computed in Eqs. (4.65), and the values in Table 4.6, to conclude (after rounding) that $p^{t(2)}_{d1} = 0.3204$ and $\pi^{(2)}_1 = 0.5927$. The updated associated-case genotype frequencies are

$$
\begin{aligned}
g^{t(2)}_{01} &= 0.4619, \\
g^{t(2)}_{11} &= 0.4355, \\
g^{t(2)}_{21} &= 0.1026.
\end{aligned}
\tag{4.66}
$$

The r = 2nd step log-likelihood under the alternative hypothesis is $\ln L\!\left(\pi^{(2)}_1, p^{t(2)}_{d1}, p_{d0}\right) = -714.006$. Of the changes in parameters, perhaps the most interesting change is that for the disease allele frequency. One can check that from Step 1 to Step 2, the change is $\Delta p^{t(r)}_{d1} = 0.0567$, and the resultant estimated frequency, $p^{t(2)}_{d1} = 0.3204$, is even closer to the true value of 0.35 than in Step 1. Note, however, that the mixture parameter $\pi^{(2)}_1$ has not changed much. The change in the log-likelihood from Step 1 to Step 2 is −714.006 − (−724.000) = 9.994. The increase in the log-likelihood suggests that the parameter estimates at Step 2 are closer to the true values than the estimates for Step 1, and the increase suggests that we may obtain an MLRT statistic in which we reject the null hypothesis in favor of the alternative.

4.3.1.4 Computing the MLRT Statistic for the Example Data

In this example, because of the small number of parameters estimated, and the speed with which the program runs, we are able to run all 200 steps for all sets of 300 starting vectors. That is not to say that one should not use thresholds (described in Chap. 1). In practice, adding more steps is a “diminishing returns” exercise, because the log-likelihoods typically converge before the final EM step (199 in this example) is reached. Having said that, one co-author (DG) finds it considerably easier to code the EM using all steps rather than specifying threshold conditions, and the results are usually the same. The “price” paid is the increase in time needed to compute all 60,000 parameter and log-likelihood values.

The starting vector $\left(p^{t(0)}_{d1}, \pi_1^{(0)}\right) = (0.1590, 0.6235)$ documented above (section on alternative hypothesis) was the 42nd out of 300 random starting vectors that our computer program generated; it was the starting vector that produced the maximum


log-likelihood over all starting vectors. When running the MLRT for this vector over all 200 steps, we have obtained the following maximum likelihood estimates of all parameters:

$$\begin{aligned}
p^{t(199)}_{d1} &= 0.3568, & \pi_1^{(199)} &= 0.5990, \\
\tau^{(199)}_{(0)(0)} &= 0.4765, & \tau^{(199)}_{(1)(0)} &= 0.5235, \\
\tau^{(199)}_{(0)(1)} &= 0.3536, & \tau^{(199)}_{(1)(1)} &= 0.6464, \\
\tau^{(199)}_{(0)(2)} &= 0.2473, & \tau^{(199)}_{(1)(2)} &= 0.7527.
\end{aligned} \tag{4.67}$$

Also, the final log-likelihood under $H_1$ is −712.221, so that the value of the statistic is MLRT = 2(−712.221 − (−727.856)) = 31.268. We use the number 199 instead of 200 in Eq. (4.67) because the 0th (starting point) step is also a step in the EM algorithm. Zhou and Pan noted that the MLRT value is difficult to interpret, because the null distribution is typically not known. Using their recommendation, we determined the p-value for this statistic through permutation. Specifically, for each replicate data set, we randomly reassigned case and control status, keeping totals fixed, and computed the MLRT statistic on that permuted (replicate) data set, applying 300 starting points and 200 steps. We did this for 10,000 replicate data sets. The permutation p-value is the proportion of permuted data sets for which the MLRT statistic is larger than 31.268. For our 10,000 replicate data sets, we obtained a permutation p-value of 0.00. To determine an exact 95% confidence interval for the permutation p-value, we applied the method implemented in the BINOM program [77]. The 95% confidence interval computed was (0, 0.0003). The results of this analysis suggest that the fictitious disease phenotype is associated with the SNP locus (for example, at the 5% significance level), and that there appear to be two groups of cases: those who are affected due to variants/mutations at this locus and those who are affected for other reasons. These results should not be surprising, given that we specified our case and control counts (Table 4.6) under the alternative hypothesis (list (a)–(h) at the beginning of the example section). When applying the MLRT for this example, the estimate of the true disease allele frequency appeared to be more precise than that of the proportion of associated cases. Nonetheless, we observe that an estimate of 0.59 for the proportion of associated cases does not differ much from the true value, 0.64.
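The permutation p-value and its upper confidence limit can be sketched as follows. This is not the BINOM program; it is a minimal illustration under the assumption that no replicate statistic exceeds the observed one, in which case the exact (Clopper-Pearson) one-sided upper limit has the closed form $1 - \alpha^{1/n}$. With α = 0.05, this closed form gives approximately 0.0003 for 10,000 replicates, which is consistent with the interval reported above. The function names are ours.

```python
def permutation_p_value(observed, replicate_stats):
    """Proportion of replicate statistics strictly larger than the observed statistic."""
    return sum(s > observed for s in replicate_stats) / len(replicate_stats)


def zero_count_upper_limit(n_replicates, alpha=0.05):
    """Exact one-sided upper limit for a proportion when 0 of n replicates exceed the observed value."""
    return 1.0 - alpha ** (1.0 / n_replicates)


upper_10000 = zero_count_upper_limit(10_000)
print(round(upper_10000, 4))
```

When at least one replicate exceeds the observed statistic, the closed form no longer applies and the general exact binomial interval is needed.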


Some have recommended (see, e.g., [77, 78]) that a method for obtaining “support” intervals (as compared with confidence intervals) for maximum likelihood parameters is to look at the set of all parameter estimates corresponding to log-likelihoods within a certain “distance” of the max-log-likelihood (e.g., all parameter values $\left(\pi_1, p^{t}_{d1}\right)$ that satisfy $\ln \mathcal{L}\left(\pi_1^{(199)}, p^{t(199)}_{d1}, p_{d0}\right) - \ln \mathcal{L}\left(\pi_1, p^{t}_{d1}, p_{d0}\right) < L$). Different values of L have been proposed, depending on the context. For an example in linkage analysis, see [79].

Zhou and Pan designed the MLRT so that the BPPs might be used to determine the most informative cases for follow-up analyses. If one is interested in variants/genotypes that either increase risk for disease or cause disease, it is possible to specify decision rules for follow-up. One such rule, documented by Ott [77], is that any case with genotype j for which the BPP satisfies

$$\tau^{(r_1(s_1))}_{(1)(j)} > \pi_1^{(r_1(s_1))}$$

is in the associated group (m = 1). The term $r_1(s_1)$ is defined in Chap. 1. In this example, $s_1 = \left(p^{t(0)}_{d1} = 0.1590, \pi_1^{(0)} = 0.6235\right)$ and $r_1(s_1) = 200 - 1 = 199$. If we were to apply this rule, we would declare that cases who are either heterozygous (j = 1) or homozygous (j = 2) for the d allele are in the associated group (see Eqs. (4.67)). Of course, one can always perform simulation studies ([80, 81]) to verify different decision rules, as long as there are accurate estimates of the parameters involved.
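The decision rule can be applied mechanically to the final-step estimates in Eqs. (4.67). A minimal sketch, with variable names of our own choosing:

```python
# Final-step estimates from Eqs. (4.67).
pi1_final = 0.5990
bpp_associated = {0: 0.5235, 1: 0.6464, 2: 0.7527}  # BPP of group m = 1, keyed by genotype j

# A genotype is assigned to the associated group when its BPP exceeds the mixture estimate.
associated_genotypes = [j for j, tau in bpp_associated.items() if tau > pi1_final]
print(associated_genotypes)
```

Only genotypes j = 1 and j = 2 satisfy the inequality, in agreement with the conclusion stated in the text.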

4.3.2 Transmission Disequilibrium Test that Allows for Locus Heterogeneity (TDT-HET)

With the exception of Sect. 4.2.3 ($TDT_{ae}$), all of the test statistics presented to this point have been for population-based designs; that is, the sample measurements have come from unrelated cases and controls. In this section, we present the details of a transmission disequilibrium test that allows for locus heterogeneity. We refer to this statistic as TDT-HET. The sampling unit is a trio of father, mother, and affected child (or child who displays the selected phenotype value for a dichotomous trait), all of whom have been genotyped at a given locus. When the data set consists of only trios (that is, every family has only one affected child), then the TDT-HET statistic may be used as a test of association [82]. In other words, we may test the null hypothesis that the measure of disequilibrium between the trait and the marker locus is 0. Here, we present a derivation of the TDT-HET statistic. This derivation follows closely from the work presented in a previous publication [8].

4.3.2.1 Notation

Much of the notation here is similar, if not identical, to the notation for the $TDT_{ae}$ (Sect. 4.2.3).

d = The disease allele at the putative disease SNP locus.

“+” = The non-disease allele at the putative disease SNP locus. Note: this notation is consistent with that presented in Chap. 1.

$T_{abc}$ = The genotype trio where father, mother, and affected child have a, b, and c copies of the d allele at the putative disease locus (range for all copies: 0, 1, 2). This notation is the same as that in Table 4.4.

{abc} = The set of 15 trio subscripts in Table 4.4. This notation is used to indicate summation over all 15 trios.

N = The total number of trios in the study.

Aff = Event that the child in a trio is affected.

t = Pr(heterozygous parent transmits d allele to affected child). In this work, the null hypothesis is $H_0: t = 0.5$. The alternative hypothesis, $H_1$, is $t \neq 0.5$. Note: this variable should not be confused with the superscript t that indicates the true value of a parameter. Under a multiplicative mode of inheritance (Chap. 1, Eq. (1.8)), one can show that $t = \frac{R_2}{R_1 + R_2} = \frac{R_1}{1 + R_1}$, where $R_1$ and $R_2$ are the genotype relative risks previously defined.

$\mu_i$ = Pr(Mating type = i | Aff) = Probability that the mating type is i given that the child is affected. From this point forward, we shall omit the event Aff since it is understood the TDT-HET statistic only applies to the population of trios for which the child is affected. To be consistent with the work in Sect. 4.3.1 (MLRT statistic), we shall replace the term “population” by “group” from this point forward for our discussion of the TDT-HET.

Group m, m = 0, 1 = The set of genotype trios for which $H_m$ is true.

$\pi_m$ = Pr(trio is in Group m). The null hypothesis above may also be written as $H_0: \pi_1 = 0$. The equivalence of the two nulls is proved below.

Equivalence of the two null conditions (i) t = 0.5 for all trios, and (ii) $\pi_1 = 0$: If t = 0.5 for all trios, then the probability that any trio is in Group 0 is 1, that is, $\pi_0 = 1$ and $\pi_1 = 1 - \pi_0 = 0$. Thus, Null Condition (i) implies Null Condition (ii). Conversely, if $\pi_1 = 0$, then $\pi_0 = 1$, so that every trio is in Group 0. By definition, this means that, for every trio, t = 0.5. Hence, Null Condition (ii) implies Null Condition (i). The specification that t is the same for all linked trios is consistent with previous statistical methods for heterogeneity in linkage analysis. In some tests of linkage allowing for heterogeneity, the recombination fraction is considered to be the same value for all linked pedigrees (see work by C. A. B. Smith [64, 68] and Ott [54], specifically the method implemented in programs such as HOMOG [54], GENEHUNTER [83], SIMWALK2 [84], VITESSE [85], MERLIN [80], and other programs).
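The relation between the transmission probability t and the genotype relative risks under a multiplicative mode of inheritance can be checked numerically. The sketch below assumes $R_2 = R_1^2$ (the multiplicative model) and uses the values $R_1 = 1.50$, $R_2 = 2.25$ that appear in the example calculation later in this section; the function name is ours.

```python
def transmission_probability(r1):
    """t = R1 / (1 + R1) for a multiplicative mode of inheritance (R2 = R1 ** 2)."""
    return r1 / (1.0 + r1)


r1 = 1.50
r2 = r1 ** 2  # 2.25 under the multiplicative model
t_from_r1 = transmission_probability(r1)
t_from_both = r2 / (r1 + r2)  # the equivalent form R2 / (R1 + R2)
print(round(t_from_r1, 4), round(t_from_both, 4))
```

Both forms give t = 0.60, and setting $R_1 = 1$ (no increased risk) gives t = 0.5, the null value.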


$f(T_{abc}; V_m)$ = Probability density function (pdf) of the trio $T_{abc}$ and the vector of parameters $V_m$, 0 ≤ m ≤ 1 (see directly below). The values of this function are provided in Table 4.4 (right-most column). The pdf $f(T_{abc}; V_m)$ for the TDT-HET is the same as the pdf $f_2(T_{abc}; V_m)$ for the $TDT_{ae}$.

$V_m$, 0 ≤ m ≤ 1 = Vector of parameters for the probability density functions $f(T_{abc}; V_m)$. In this work, all trios are drawn from a group with the same set of mating-type frequencies. Our work below shows that the test statistic is robust to sampling trios from different populations, each with a unique vector of mating-type frequencies (Sect. 4.3.5.7). We define

$$V_0 = (1/2, \pi_0, \mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6), \quad V_1 = (t, \pi_1, \mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6). \tag{4.68}$$

$V_m^{(r)}$, 0 ≤ m ≤ 1 = Vector of parameters (4.68) where each coordinate is replaced by its rth iteration step value.

The mating-type frequencies satisfy the constraint $\sum_{i=1}^{6} \mu_i = 1$. The subscript for each mating-type frequency is indicated in Table 4.4. Note that $\Pr(T_{abc} \mid \text{Group } m) = f(T_{abc}; V_m)$. Finally, t = Pr(heterozygous parent transmits a d allele to an affected child). In the work by Schaid and Sommer [55], our publication on the TDT-HET [8], and in the definition of the parameter t, it is understood that the trios to which the statistic is applied are only those in which the child is affected with the trait of interest. For that reason, we remove the event “Aff” from the column headers above.

$\tau^{(r)}_{(m)(abc)}$ = rth iteration step BPP that trio $T_{abc}$ is in the mth group, 0 ≤ m ≤ 1.

4.3.2.2 Determination of the BPPs $\tau^{(r)}_{(m)(abc)}$

To compute the rth step estimates of the necessary parameters in the TDT-HET statistic, we must first determine the values of the BPPs $\tau^{(r)}_{(m)(abc)}$. The reason is that all rth step estimates are functions of the BPPs, as is the case with all other statistics that use the EM algorithm (e.g., $LRT_{ae}$, MLRT above). Applying Bayes rule, and following the work of Zhou and Pan, we report that the BPPs are

$$\tau^{(r)}_{(m)(abc)} = \frac{\pi_m^{(r)} f\left(T_{abc}; V_m^{(r)}\right)}{\sum_{m'=0}^{1} \pi_{m'}^{(r)} f\left(T_{abc}; V_{m'}^{(r)}\right)}. \tag{4.69}$$
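Formula (4.69) can be sketched in a few lines. The authoritative pdf values are those in Table 4.4; the function `g` below is our reconstruction of the transmission part of that pdf (a binomial term for the heterozygous parents), written under the assumption that the mating-type multiple is common to both groups and therefore cancels. The function names and the string encoding of a trio ("112" for $T_{112}$) are ours.

```python
from math import comb


def g(trio, t):
    """Transmission part of f(T_abc; V): probability of the child's genotype given the parents."""
    a, b, c = (int(ch) for ch in trio)
    n_het = (a == 1) + (b == 1)        # number of heterozygous parents
    k = c - (a == 2) - (b == 2)        # d alleles transmitted by heterozygous parents
    return comb(n_het, k) * t ** k * (1.0 - t) ** (n_het - k)


def bpp_group1(trio, t, pi1):
    """BPP that a trio belongs to Group 1; Group 0 uses t = 1/2."""
    numerator = pi1 * g(trio, t)
    return numerator / (numerator + (1.0 - pi1) * g(trio, 0.5))


# Starting values used in the example calculation of Sect. 4.3.2.6.
print(round(bpp_group1("112", 0.4497, 0.5664), 4))
print(round(bpp_group1("100", 0.4497, 0.5664), 4))
```

For trios with no heterozygous parent (222, 201, 021, 000), `g` equals one for any t, so the BPP equals the current value of $\pi_1$.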

4.3.2.3 TDT-HET Parameter Estimates

Like all EM algorithm applications, we need to determine the (r + 1)st step estimates of all relevant parameters. For the TDT-HET statistic, that amounts to deriving the (r + 1)st step estimates of vectors $V_0$ and $V_1$. All of the parameter estimates below are documented in the work by Londono et al. [8]. We present the formulas in the order that they appear in the vectors $V_m$.

Transmission probability t

The (r + 1)st iteration step of t (under $H_1$ only) is given by

$$t^{(r+1)} = \frac{A^{(r)}}{A^{(r)} + B^{(r)}}, \tag{4.70}$$

where

$$A^{(r)} = n_{212}\tau^{(r)}_{(1)(212)} + n_{122}\tau^{(r)}_{(1)(122)} + 2n_{112}\tau^{(r)}_{(1)(112)} + n_{111}\tau^{(r)}_{(1)(111)} + n_{101}\tau^{(r)}_{(1)(101)} + n_{011}\tau^{(r)}_{(1)(011)},$$

and

$$B^{(r)} = n_{211}\tau^{(r)}_{(1)(211)} + n_{121}\tau^{(r)}_{(1)(121)} + 2n_{110}\tau^{(r)}_{(1)(110)} + n_{111}\tau^{(r)}_{(1)(111)} + n_{100}\tau^{(r)}_{(1)(100)} + n_{010}\tau^{(r)}_{(1)(010)}.$$

If $\pi_0$ equals 0, so that all the trios are linked ($t \neq \frac{1}{2}$), then $\tau^{(r)}_{(1)(abc)}$ is 1 for all trios, and the formula (4.70) reduces to the quotient of the number of heterozygous parents that transmit the d allele divided by the number of heterozygous parents in the sample. A number of authors [57, 86, 87] have shown that this quotient is the maximum likelihood estimate of t under locus homogeneity.

$\pi_1$ parameter

Our solution for $\pi_1^{(r+1)}$ is given by

$$\pi_1^{(r+1)} = \frac{\sum_{\{abc\}} n_{abc}\,\tau^{(r)}_{(1)(abc)}}{N}, \tag{4.71}$$

where each of the parameters has been defined above. We observe that Eq. (4.71) is virtually identical to the solution provided by Zhou and Pan (Eq. (7) in their publication [7]), with the exception that the BPPs $\tau^{(r)}_{(1)(abc)}$ for this statistic differ from the MLRT BPPs, and we have removed the constant C. Also, $\pi_0^{(r+1)} = 1 - \pi_1^{(r+1)}$.

$\mu_i$ parameters

In our previous work [8], we determined


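$$\begin{aligned}
\mu_1^{(r+1)} &= \mu_1 = \frac{n_{222}}{N}, \\
\mu_2^{(r+1)} &= \mu_2 = \frac{n_{212} + n_{211} + n_{122} + n_{121}}{N}, \\
\mu_3^{(r+1)} &= \mu_3 = \frac{n_{201} + n_{021}}{N}, \\
\mu_4^{(r+1)} &= \mu_4 = \frac{n_{112} + n_{111} + n_{110}}{N}, \\
\mu_5^{(r+1)} &= \mu_5 = \frac{n_{101} + n_{100} + n_{011} + n_{010}}{N}, \\
\mu_6^{(r+1)} &= \mu_6 = \frac{n_{000}}{N}.
\end{aligned} \tag{4.72}$$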

Each term $\mu_i^{(r+1)}$ in Eq. (4.72) is the proportion of trios in the total sample that have mating type i (Table 4.4), and each parameter is constant over all r-steps. For that reason, we drop the superscript r from the parameter notation.
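The updates (4.70)–(4.72) can be sketched as a single function. The counts and step-0 BPPs below are copied from Tables 4.10 and 4.11 of the example calculation in Sect. 4.3.2.6; the dictionaries of weights simply encode the coefficients of $A^{(r)}$ and $B^{(r)}$. All names are ours, and this is an illustration rather than the authors' program.

```python
# Trio-type counts n_abc (Table 4.10) and step-0 BPPs for Group 1 (Table 4.11).
counts = {
    "222": 252, "212": 109, "211": 133, "122": 109, "121": 133,
    "201": 60, "021": 60, "112": 103, "111": 250, "110": 153,
    "101": 86, "100": 105, "011": 86, "010": 105, "000": 56,
}
bpp1 = {
    "222": 0.5664, "212": 0.5402, "211": 0.5898, "122": 0.5402, "121": 0.5898,
    "201": 0.5664, "021": 0.5664, "112": 0.5138, "111": 0.5639, "110": 0.6128,
    "101": 0.5402, "100": 0.5898, "011": 0.5402, "010": 0.5898, "000": 0.5664,
}
# Coefficients of A (d transmitted) and B (d not transmitted) in Formula (4.70).
A_WEIGHTS = {"212": 1, "122": 1, "112": 2, "111": 1, "101": 1, "011": 1}
B_WEIGHTS = {"211": 1, "121": 1, "110": 2, "111": 1, "100": 1, "010": 1}
# Trio types grouped by mating type i = 1, ..., 6 as in Eq. (4.72).
MATING_TYPES = [
    ["222"], ["212", "211", "122", "121"], ["201", "021"],
    ["112", "111", "110"], ["101", "100", "011", "010"], ["000"],
]


def m_step(counts, bpp1):
    """Return (t, pi_1, [mu_1, ..., mu_6]) for the next iteration step."""
    n_total = sum(counts.values())
    a = sum(w * counts[k] * bpp1[k] for k, w in A_WEIGHTS.items())
    b = sum(w * counts[k] * bpp1[k] for k, w in B_WEIGHTS.items())
    pi1_next = sum(counts[k] * bpp1[k] for k in counts) / n_total
    mu = [sum(counts[k] for k in group) / n_total for group in MATING_TYPES]
    return a / (a + b), pi1_next, mu


t_next, pi1_next, mu = m_step(counts, bpp1)
print(round(t_next, 4), round(pi1_next, 4), [round(m, 4) for m in mu])
```

With these inputs, the updated values agree to four decimals with the r = 1 entries of Table 4.11, and the mating-type proportions agree with Table 4.9.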

4.3.2.4 Log-Likelihood of Observed Data

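As with the MLRT statistic, we can write the null hypothesis $H_0$ as $t = \frac{1}{2}$ or $\pi_1 = 0$. The alternative hypothesis, $H_1$, is $t \neq \frac{1}{2}$ and $\pi_1 > 0$. The log-likelihood of the observed data is

$H_1$:

$$\begin{aligned}
\ln\left(L_1^{(r)}\right) &= \sum_{\{abc\}} \ln\left(\Pr(T_{abc})\right), \\
&= \sum_{\{abc\}} \ln\Big[\Pr(T_{abc} \mid T_{abc} \text{ is in linked group}) \times \Pr(T_{abc} \text{ is in linked group}) \\
&\qquad\quad + \Pr(T_{abc} \mid T_{abc} \text{ is in unlinked group}) \times \Pr(T_{abc} \text{ is in unlinked group})\Big], \\
&= \sum_{\{abc\}} \ln\left[\pi_1^{(r)} f\left(T_{abc}; V_1^{(r)}\right) + \left(1 - \pi_1^{(r)}\right) f\left(T_{abc}; V_0^{(r)}\right)\right], \\
&= \sum_{\{abc\}} \ln\left[\pi_1^{(r)} f\left(T_{abc}; V_1^{(r)}\right) + \pi_0^{(r)} f\left(T_{abc}; V_0\right)\right].
\end{aligned} \tag{4.73}$$

$H_0$: The null hypothesis can be written as either $\pi_1 = 0$ or t = 0.5 in $V_1$, in which case the log-likelihood (4.73) reduces to

$$\ln\left(L_0^{(r)}\right) = \sum_{\{abc\}} \ln\left(f\left(T_{abc}; V_0^{(r)}\right)\right) = \sum_{\{abc\}} \ln\left(f(T_{abc}; V_0)\right). \tag{4.74}$$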


For several statistics (e.g., TDT-HET, MLRT (Sect. 4.3.1), HLOD (H-LOD) in linkage analysis [54, 88]), the null hypothesis is a joint hypothesis, in that different conditions are equivalent. For such statistics, the asymptotic null distribution can be difficult, if not impossible, to determine. As an example, Huang and Vieland [76] documented that the asymptotic null distribution of the HLOD statistic depends on the assumed genetic model for the trait. To address this difficulty, we compute p-values by permutation. Observe that the log-likelihood under the null hypothesis is constant, since each of the coordinates of the $V_0$ vector is constant. Hence, we drop the superscript r.
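The log-likelihoods (4.73) and (4.74) can be evaluated as follows for the counts in Table 4.10 of the example calculation (Sect. 4.3.2.6). This sketch rests on several assumptions of ours: each trio type is weighted by its observed count $n_{abc}$, so that the sum runs over all N observed trios; the pdf is reconstructed as (ordering multiple) × $\mu_i$ × (transmission term), with a multiple of 1/2 when the two parental genotypes differ; and the $\mu_i$ are the sample proportions of Eq. (4.72). Table 4.4 remains the authoritative source for the pdf. Under these assumptions, the starting values $t^{(0)} = 0.4497$ and $\pi_1^{(0)} = 0.5664$ give values that agree closely with the r = 0 log-likelihoods reported in Table 4.11.

```python
from math import comb, log

counts = {
    "222": 252, "212": 109, "211": 133, "122": 109, "121": 133,
    "201": 60, "021": 60, "112": 103, "111": 250, "110": 153,
    "101": 86, "100": 105, "011": 86, "010": 105, "000": 56,
}
PARENT_PAIRS = [(2, 2), (2, 1), (2, 0), (1, 1), (1, 0), (0, 0)]  # mating types 1..6


def mating_type(trio):
    """Zero-based mating-type index, ignoring which parent carries which genotype."""
    pair = tuple(sorted((int(trio[0]), int(trio[1])), reverse=True))
    return PARENT_PAIRS.index(pair)


def g(trio, t):
    """Probability of the child's genotype given the parental genotypes."""
    a, b, c = (int(ch) for ch in trio)
    n_het = (a == 1) + (b == 1)
    k = c - (a == 2) - (b == 2)
    return comb(n_het, k) * t ** k * (1.0 - t) ** (n_het - k)


def f(trio, t, mu):
    """Reconstructed pdf: ordering multiple x mating-type frequency x transmission term."""
    multiple = 0.5 if trio[0] != trio[1] else 1.0
    return multiple * mu[mating_type(trio)] * g(trio, t)


def log_likelihoods(counts, t, pi1):
    """Return (ln L0, ln L1) with mating-type frequencies estimated as in Eq. (4.72)."""
    n_total = sum(counts.values())
    mu = [0.0] * 6
    for trio, n in counts.items():
        mu[mating_type(trio)] += n / n_total
    ll0 = sum(n * log(f(trio, 0.5, mu)) for trio, n in counts.items())
    ll1 = sum(
        n * log(pi1 * f(trio, t, mu) + (1.0 - pi1) * f(trio, 0.5, mu))
        for trio, n in counts.items()
    )
    return ll0, ll1


ll0, ll1 = log_likelihoods(counts, t=0.4497, pi1=0.5664)
print(round(ll0, 2), round(ll1, 2))
```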

4.3.2.5 Computing the TDT-HET Statistic

The TDT-HET statistic is given by the formula:

$$TDT\text{-}HET = 2 \times \left[\ln\left(L_1^{(r_1(s_1))}\right) - \ln(L_0)\right]. \tag{4.75}$$

The maximum log-likelihood estimates are determined above (notation). There are two elements in every starting vector for the $H_1$ hypothesis. The vector is $\left(t^{(0)}, \pi_1^{(0)}\right)$. There are no starting vectors under $H_0$, since we know that t = 0.5 and $\pi_1 = 0$ for all iteration steps.

4.3.2.6 Example Calculation

In what follows, we compute the TDT-HET for data simulated from a set of genetic model-based parameters. While in practice, we consider a large number of starting vectors, for the example calculation, we consider only one. Also, in contrast to the component calculations performed in Sect. 4.3.1.3, where BPPs are computed for all genotypes and all groups, in this section, we perform a calculation for two trio types, and provide the BPPs for all other trio types, as well as the probabilities $f\left(T_{abc}; V_1^{(r)}\right)$, the rth iteration step estimates $t^{(r)}$ and $\pi_1^{(r)}$, and the log-likelihoods under $H_0$ (a constant value) and $H_1$ for r = 0, 1, 2. The genetic model-based parameter settings from which we generate our data are

a. Proportion of cases in Group 1 = $\pi_1$ = 0.5556.
b. Mating-type frequencies: Presented in Table 4.9.
c. Probability that a heterozygous parent transmits a d allele to an affected child = t = 0.60.
   i. For a multiplicative mode of inheritance, the t value corresponds to genotype relative risks: $R_1 = 1.50$, $R_2 = 2.25$.
d. Number of trios with affected child = N = 1,800.
e. Number of 0th step (starting) vectors $\left(t^{(0)}, \pi_1^{(0)}\right)$ = 400.
   i. In our example, we specify $t^{(0)} = 0.4497$ and $\pi_1^{(0)} = 0.5664$. In results not shown, these starting values are the ones that produced the largest $H_1$ log-likelihood and therefore the TDT-HET statistic.
   ii. From Item e.i, we see that results are presented for only one of the 400 vectors.
f. Number of steps (r) for each starting vector (Item e) = 200.
g. Stopping criterion = $10^{-6}$.
h. Number of permutations: 5,000.

Table 4.9 Mating-type frequencies for example calculation

Mating type    Frequency
μ1             0.1400
μ2             0.2689
μ3             0.0667
μ4             0.2811
μ5             0.2122
μ6             0.0311

These frequencies are randomly generated and standardized so that their sum is one

With these parameter settings, we generate the data set shown in Table 4.10. Let us consider $T_{112}$ and $T_{100}$ as our example trios for computation. To begin, we need starting values. We use the BPPs to derive the updated parameter estimates. From Formula (4.69), with $T_{112}$, we compute

$$\begin{aligned}
\tau^{(0)}_{(1)(112)} &= \frac{\pi_1^{(0)} f\left(T_{112}; V_1^{(0)}\right)}{\sum_{m=0}^{1} \pi_m^{(0)} f\left(T_{112}; V_m^{(0)}\right)}, \\
&= \frac{(0.5664)(\mu_4)\left(t^{(0)}\right)^2}{(0.5664)(\mu_4)\left(t^{(0)}\right)^2 + (1 - 0.5664)(\mu_4)(0.5)^2}, \\
&= \frac{(0.5664)\left(t^{(0)}\right)^2}{(0.5664)\left(t^{(0)}\right)^2 + (1 - 0.5664)(0.5)^2}, \\
&= \frac{(0.5664)(0.4497)^2}{(0.5664)(0.4497)^2 + (0.4336)(0.5)^2}, \\
&= 0.5138.
\end{aligned} \tag{4.76}$$

In Expression (4.76), the mating-type frequency cancels. That is, the BPPs are not dependent upon the mating-type frequencies. This result is true for every BPP, since every term in the numerator and the denominator has the appropriate multiple of $\mu_i$ as a factor (Table 4.4).


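For $T_{100}$, we similarly compute

$$\begin{aligned}
\tau^{(0)}_{(1)(100)} &= \frac{\pi_1^{(0)} f\left(T_{100}; V_1^{(0)}\right)}{\sum_{m=0}^{1} \pi_m^{(0)} f\left(T_{100}; V_m^{(0)}\right)}, \\
&= \frac{(0.5664)\left(1 - t^{(0)}\right)}{(0.5664)\left(1 - t^{(0)}\right) + (0.4336)(1 - 0.5)}, \\
&= \frac{(0.5664)(1 - 0.4497)}{(0.5664)(1 - 0.4497) + (0.4336)(0.5)}, \\
&= 0.5898.
\end{aligned} \tag{4.77}$$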

One may confirm the results in Table 4.11, where we report the BPPs and log-likelihoods for the r = 0th iteration step, and the updated parameters $t^{(1)}$, $\pi_1^{(1)}$ for the r = 1st iteration step. From an inspection of Table 4.11, we see that all the BPPs are values that are relatively close to $\pi_1^{(0)}$. Also, for the trio types 222, 201, 021, and 000, the BPPs are identical to the value $\pi_1^{(0)}$. This result will always be true for any data set and any estimates $\pi_1^{(r)}$. The reason is that all of these trio types are functions only of their respective mating-type frequencies, that is, their probabilities $f\left(T_{abc}; V_1^{(r)}\right)$ are constant

Table 4.10 Expected trio-type counts based on genetic model-based parameter settings provided above (Items (a)–(g))

Trio type (abc)    Count
222                252
212                109
211                133
122                109
121                133
201                60
021                60
112                103
111                250
110                153
101                86
100                105
011                86
010                105
000                56

For each trio type $T_{abc}$, counts are computed as $N \times f\left(T_{abc}; V_1^{(0)}\right)$, rounded to the nearest integer. The total count is 1800

Table 4.11 Bayesian posterior probabilities (BPPs) $\tau^{(0)}_{(1)(abc)}$ for the 15 trio types in Table 4.10

Trio type (abc)    $\tau^{(0)}_{(1)(abc)}$
222                0.5664
212                0.5402
211                0.5898
122                0.5402
121                0.5898
201                0.5664
021                0.5664
112                0.5138
111                0.5639
110                0.6128
101                0.5402
100                0.5898
011                0.5402
010                0.5898
000                0.5664

(r) = 0            Log-likelihood
$H_0$              −4696.6201
$H_1$              −4689.0606

(r) = 1            Value
$t^{(1)}$          0.4289
$\pi_1^{(1)}$      0.5675

for any value of t, and for these four trio types, $f\left(T_{abc}; V_1^{(r)}\right) = f(T_{abc}; V_0)$. Hence, these probabilities drop out of Eq. (4.69), and that equation reduces to $\tau^{(r)}_{(1)(abc)} = \pi_1^{(r)}$ for any step r.

In results not shown, when we applied the EM algorithm to our data in Table 4.10 with the specifications provided in the list (a)–(g) in this section, we found that the maximum $H_1$ log-likelihood occurred for the starting points provided in the example. This pair was the 250th starting vector of the 400 applied (Item (e)). Also, the stopping value $10^{-6}$ (Item (g)) was achieved at the r = 10th step. That is, the difference ($H_1$ log-likelihood at the 11th step) − ($H_1$ log-likelihood at the 10th step) was $9.903 \times 10^{-7}$, and the difference ($H_1$ log-likelihood at the 10th step) − ($H_1$ log-likelihood at the 9th step) was $1.302 \times 10^{-6}$. The $H_0$ log-likelihood stays constant for every iteration; it is only used for computing the TDT-HET statistic (Formula (4.75)). The TDT-HET statistic was 18.440, with a permutation p-value of 0. Using the method implemented in the BINOM program [54], we computed a 95% exact confidence interval of [0, 0.0006) for the permutation p-value (based on 5,000 permutations; Item (h)).


Finally, the maximum likelihood estimates of the two parameters were $t^{(10)} = 0.5869$ and $\pi_1^{(10)} = 0.5688$. Each of these MLEs is close to its (true) generating value. For this example with these starting values, the estimated probability that a trio is in Group 1, $\pi_1^{(10)}$, is reasonably close to the generating value $\pi_1 = 0.5556$ (Item (a)). In a large number of simulations (data not shown), we observed that the TDT-HET tends to give more accurate estimates of the transmission probability t than of the group membership probability.

4.3.2.7 How TDT-HET Permutation p-Values Are Computed

To generate the permutation p-values, we first create replicates under the null hypothesis that t = 0.5. For our observed data set, we create a permutation data set where the parental genotypes for each permutation trio correspond to the parental genotypes from the corresponding trio in the observed data set. In the permutation data set, the affected child’s genotype is replaced by the genotype generated using a transmission probability of t = 0.5 for each heterozygous parent. That is, each heterozygous parent transmits either the 1 or the 2 allele (allele names arbitrary), each with probability 50%. For example, if the kth trio (1 ≤ k ≤ N) in the observed data set is $T_{101}$, then the kth trio in the permutation data set is either $T_{101}$ or $T_{100}$, each occurring with 50% probability. A random number generator is used to assign the trio in the permutation data set. We apply these permutations for all N trios, to create a single permutation data set. This data set is called a replicate. For a pre-specified number of replicates (e.g., Item (h) is 5,000 replicates; Sect. 4.3.2.6), we compute the TDT-HET statistic. The proportion of statistics using replicate data sets that are greater than the TDT-HET statistic for the observed data set is the permutation p-value for the TDT-HET. This permutation method is similar to that implemented in the PLINK software program [70].
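The construction of one replicate can be sketched as follows. This is a minimal illustration of the procedure described above, not the authors' or PLINK's implementation: trios are stored as tuples (a, b, c) of d-allele counts, parental genotypes are kept, and each heterozygous parent transmits d with probability 0.5. The function names, the small example data set, and the seed are ours.

```python
import random


def permute_child(father, mother, rng):
    """Draw the child's d-allele count under t = 0.5, given the parental genotypes."""
    child = 0
    for parent in (father, mother):
        if parent == 2:
            child += 1                                  # homozygous dd parent always transmits d
        elif parent == 1:
            child += 1 if rng.random() < 0.5 else 0     # heterozygous parent: fair coin
    return child


def make_replicate(trios, rng):
    """Keep each trio's parental genotypes and redraw the affected child's genotype."""
    return [(a, b, permute_child(a, b, rng)) for a, b, _ in trios]


rng = random.Random(2021)  # arbitrary seed, used here only for reproducibility
observed = [(1, 0, 1), (2, 2, 2), (1, 1, 2), (2, 0, 1)]  # hypothetical observed trios
replicate = make_replicate(observed, rng)
print(replicate)
```

Repeating `make_replicate` for the pre-specified number of replicates, and computing the statistic on each, yields the replicate statistics from which the permutation p-value is taken.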

4.3.2.8 A Proof of the Robustness of the TDT-HET Statistic’s Type I Error When Null Data Are Drawn from Multiple Sub-populations

One of the elegant features of the TDT concerns population stratification. Population-based tests of association are susceptible to inflation in the false positive rate if there is population stratification (e.g., the case and control samples come from two or more separate populations with differing allele/genotype frequencies for most/all SNPs). The TDT and statistics like it are not, and this is one of the main reasons that such statistics, with trios as sampling units, were originally developed for dichotomous and quantitative phenotype data (e.g., [55, 56, 89–93]; also see [94]). Below, we prove that the TDT-HET satisfies this condition as well.


Proof for robustness of the TDT-HET type I error rate to different mating-type frequencies in different sub-populations

The null hypothesis, $H_0$: t = 0.5 (equivalent to $\pi_1 = 0$)

Consider a population of trios that is comprised of S > 1 sub-populations, parameterized by an index 1 ≤ s ≤ S. We denote the vector $V_0$ (Eq. (4.68)) for the sth sub-population by $V_{0,s} = \left(\frac{1}{2}, \pi_{0,s} = 1, \mu_{1,s}, \mu_{2,s}, \mu_{3,s}, \mu_{4,s}, \mu_{5,s}\right)$. The mating-type frequencies satisfy the equation $\sum_{i=1}^{6} \mu_{i,s} = 1$, where the subscript i is defined in Sect. 4.2.3.1. For any two indices, $1 \le s_1 \neq s_2 \le S$, $V_{0,s_1} \neq V_{0,s_2}$. Let $\{\gamma_s\}$ = Pr(an arbitrary trio is in the sth sub-population) denote the set of mixing proportions. We have $\sum_{s=1}^{S} \gamma_s = 1$. Under the null, it follows that the probability of observing a trio in the general population is

$$\begin{aligned}
\Pr(T_{abc}) &= \sum_{s} \Pr(T_{abc} \mid T_{abc} \text{ is in sub-population } s) \times \Pr(T_{abc} \text{ is in sub-population } s), \\
&= \sum_{s} f\left(T_{abc}; V_{0,s}\right) \times \gamma_s.
\end{aligned} \tag{4.78a}$$

From Table 4.4, we know that $f\left(T_{abc}; V_{0,s}\right)$ may be written as

$$f\left(T_{abc}; V_{0,s}\right) = g\left(T_{abc}, t = \tfrac{1}{2}\right) \times \mu_{i,s}. \tag{4.78b}$$

Here, g is a function of the transmission probability t when it is fixed at ½, and not of the mating types. The key observation regarding identity (4.78b) is that, for any trio, the probability density function under the null is the product of g and the respective mating-type frequency for the genotyped trio in the sth sub-population. Consider some examples: if $T_{abc} = T_{222}$, then $g\left(T_{222}, t = \frac{1}{2}\right) = 1$, and $\mu_{i,s} = \mu_{1,s}$ is the mating-type frequency for the sth sub-population where both parents have the 2 genotype. Also, if $T_{abc} = T_{101}$, then $g\left(T_{101}, t = \frac{1}{2}\right) = t = \frac{1}{2}$, and $\mu_{i,s} = \frac{\mu_{5,s}}{2}$ is the mating-type frequency for the sth sub-population where one parent has genotype 1 and the other has genotype 0. Because the $g\left(T_{abc}, t = \frac{1}{2}\right)$ function may always be factored out of the probability $f\left(T_{abc}; V_{0,s}\right)$, we may rewrite Eq. (4.78a) as

$$\Pr(T_{abc}) = \sum_{s} f\left(T_{abc}; V_{0,s}\right) \times \gamma_s = g\left(T_{abc}, t = \tfrac{1}{2}\right) \sum_{s} \mu_{i,s} \times \gamma_s. \tag{4.79}$$

The mating-type frequency for $T_{abc}$ is given by the formula $\sum_{s} \mu_{i,s} \times \gamma_s$ (Formula (4.79)). One may confirm that

$$\sum_{\{abc\}} \Pr(T_{abc}) = \sum_{\{abc\}} \sum_{s} f\left(T_{abc}; V_{0,s}\right) \times \gamma_s = \sum_{s} \left[\gamma_s \times \left(\sum_{\{abc\}} f\left(T_{abc}; V_{0,s}\right)\right)\right].$$

For any index s, we know that $\sum_{\{abc\}} f\left(T_{abc}; V_{0,s}\right) = 1$. Thus,

$$\sum_{\{abc\}} \Pr(T_{abc}) = \sum_{s} \gamma_s \times \sum_{\{abc\}} f\left(T_{abc}; V_{0,s}\right) = \sum_{s} \gamma_s = 1.$$

We conclude that the function $f\left(T_{abc}; V_{0,S}\right)$ (S fixed), with $V_{0,S} = \left(\frac{1}{2}, \pi_0 = 1, \mu_{1,S}, \ldots, \mu_{5,S}\right)$ and $\mu_{i,S} = \sum_{s} \mu_{i,s} \times \gamma_s$, is a probability density function for the trios $T_{abc}$ in the general population. The fact that $\pi_0 = 1$ (equivalently, t = 1/2) in the general population comes from the fact that these results are true in every sub-population, and the conditions are independent of the mating-type frequency values. Thus, we determine a pdf for the entire population where the null hypothesis is true.

Why does this result not hold true for any mixture of S sub-populations (e.g., where some sub-populations may have $t \neq 0.5$)? The answer is that there may be different values of t for different sub-populations. As such, we cannot necessarily factor out the function g in the term (4.79), and hence we cannot compute mating-type frequencies.
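The factorization in (4.79) can be illustrated numerically. In the sketch below, the two sub-population mating-type frequency vectors and the mixing proportions are hypothetical values chosen only for illustration. The constant attached to each trio type is our reconstruction of $g\left(T_{abc}, t = \frac{1}{2}\right)$ multiplied by the ordering multiple (for example, 1/2 × 1/2 for $T_{101}$); Table 4.4 is the authoritative source.

```python
# (zero-based mating-type index i, null constant) for each of the 15 trio types.
NULL_FACTORS = {
    "222": (0, 1.0),
    "212": (1, 0.25), "211": (1, 0.25), "122": (1, 0.25), "121": (1, 0.25),
    "201": (2, 0.5), "021": (2, 0.5),
    "112": (3, 0.25), "111": (3, 0.5), "110": (3, 0.25),
    "101": (4, 0.25), "100": (4, 0.25), "011": (4, 0.25), "010": (4, 0.25),
    "000": (5, 1.0),
}

# Hypothetical mating-type frequencies for S = 2 sub-populations (each row sums to one).
sub_pop_mu = [
    [0.10, 0.20, 0.10, 0.30, 0.20, 0.10],
    [0.30, 0.10, 0.05, 0.15, 0.25, 0.15],
]
gamma = [0.4, 0.6]  # hypothetical mixing proportions

# General-population mating-type frequencies: sum over s of mu_{i,s} * gamma_s.
mixed_mu = [sum(g_s * mu[i] for g_s, mu in zip(gamma, sub_pop_mu)) for i in range(6)]

# Left side of (4.79): mixture of the sub-population null pdfs.
mixture_prob = {
    trio: sum(g_s * const * mu[i] for g_s, mu in zip(gamma, sub_pop_mu))
    for trio, (i, const) in NULL_FACTORS.items()
}
# Right side of (4.79): null constant times the mixed mating-type frequency.
factored_prob = {trio: const * mixed_mu[i] for trio, (i, const) in NULL_FACTORS.items()}

print(round(sum(mixture_prob.values()), 10))
```

The two dictionaries agree for every trio type, and the probabilities sum to one, as the proof requires.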

4.3.3 Tests that Incorporate Phenotype Heterogeneity

In Chap. 1, we defined phenotype heterogeneity, a situation where different variants in the same gene or in a small number of genes produce different disease phenotypes. The genetic information may be extended to consider factors like mode of inheritance, age, gender, ancestry, gene expression values, copy number variation, and other factors. At a minimum, phenotype heterogeneity requires information of different variants in the same or multiple genes. One of the interesting and unique features of the statistics in this section is that grouping for both phenotype and genotype is known; throughout this section, we do not employ the EM algorithm, nor do we compute any BPPs that an observed data value is in a particular group. Among the more common ways to test for association given phenotype heterogeneity are

1. Chi-square tests of independence applied to R × C tables (Table 1.3 in Chap. 1, Sect. 1.6.1).


   a. In our definition of phenotype heterogeneity in Chap. 1, we specify that the number of phenotypes R is greater than or equal to three when considering phenotype heterogeneity. The one exception is for threshold-selected quantitative traits (Chap. 6). For such traits, R is greater than or equal to two.
2. Chi-square tests of multiple 2 × C tables, determined using data from Table 1.3 (Chap. 1).
3. Logistic regression, with stratifying variables (e.g., age, gender, severity) as covariates.

4.3.3.1 Analysis of Data with R (Greater Than Two) Phenotypes and C (Greater Than One) Genomic Data Categories

The chi-square test of independence for genotypes is presented in Formulas (1.18) and (1.19) of Chap. 1. This statistic has been widely used in genetics for cross-classified (categorical) data, especially when R is two (case, control) and C is two (different alleles for a single SNP). Fisher [95] documented that, under $H_0$ stated above, the limiting distribution of the test statistic is a central chi-square distribution with (R − 1) × (C − 1) degrees of freedom (also reported in Chap. 1). When certain sample size criteria are not satisfied, Cochran [96] provided recommendations on how to adjust the statistic to maintain correct type I error. His recommendations were applied to both goodness-of-fit tests and chi-square tests of independence. In addition, Fisher developed the “Exact Test” [95, 97] that computes the exact probability of observing cross-classified data under the null hypothesis stated above. Fisher’s test does not rely upon an asymptotic null distribution (accurate when sample sizes are larger); instead, his calculations are based on the hypergeometric distribution being the null distribution.

The chi-square statistic is an omnibus test [98] that tests for overall independence without indicating which particular cells contribute significantly to the significance (size of the p-value). It has the greatest statistical power when the conditional genotype frequencies are different for every phenotype category. Because the chi-square test is an omnibus test, follow-up (or post hoc) tests are typically applied to determine cells that significantly contribute to the association. Numerous statistical researchers have proposed post hoc tests. Among the more well known of these are tests using the Pearson residuals (see, e.g., [99], Chap. 3).

Chi-square test of independence on multiple $r_g \times C$ tables, $\sum_{g=1}^{G} r_g = R$

This series of tests is used when one hypothesizes that certain phenotype categories are similar to one another, and may group into a single phenotype.
This concept is similar to a blocking design in regression (see, e.g., [100]). To create data for one rg × C table, we add the column values in Table 1.3 for those rows Ri that satisfy the conditions for being phenotype rg . For example, let the phenotype in Table 1.3 be defined as R1 : BMI