323 11 10MB
English Pages 578 Year 2012
METHODS
IN
MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK
For further volumes: http://www.springer.com/series/7651
TM
.
Statistical Human Genetics Methods and Protocols
Edited by
Robert C. Elston Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
Jaya M. Satagopan Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
Shuying Sun Department of Epidemiology and Biostatistics, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
Editors Robert C. Elston Department of Epidemiology and Biostatistics Case Western Reserve University Cleveland, OH, USA [email protected]
Jaya M. Satagopan Department of Epidemiology and Biostatistics Memorial Sloan-Kettering Cancer Center New York, NY, USA [email protected]
Shuying Sun Department of Epidemiology and Biostatistics Case Comprehensive Cancer Center Case Western Reserve University Cleveland, OH, USA [email protected]
ISSN 1064-3745 e-ISSN 1940-6029 ISBN 978-1-61779-554-1 e-ISBN 978-1-61779-555-8 DOI 10.1007/978-1-61779-555-8 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2011945439 ª Springer Science+Business Media, LLC 2012 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface The recent advances in genetics, especially in the molecular techniques that have over the last quarter of a century spectacularly reduced the cost of determining genetic markers, open up a field of research that is becoming of increasing help in detecting, preventing and/or curing many diseases that afflict us. This has brought with it the need for novel methods of statistical analysis and the implementation of these methods in a wide variety of computer programs. It is our aim in this book to make these methods and programs more easily accessible to the beginner who has data to analyze, whether a student or a senior investigator. Apart from the first chapter, which defines some of the genetic terms we shall use, and the last chapter, which compares three major multipurpose programs/software packages, each chapter of this book takes up a particular analytical topic and illustrates the use of at least one piece of software that the authors have found helpful for the relevant statistical analysis of their own human genetic data. There is often more than one program that performs a particular type of analysis and, once you have used one program for a particular analysis, you may find you prefer another program—and there is a good chance you will find that the same basic analysis is described in more than one chapter of this book. You may therefore wish to browse over several chapters, in the first place restricting your reading to only the introductory sections, which describe the underlying theory. The chapters are ordered in the approximate logical order in which human genetic studies are often conducted; so, if you are new to research in human genetics, this initial reading could serve as an introduction to the subject. Our main purpose, however, is to serve the needs of those who have already performed their study and now need to analyze their data. The second sections of the chapters give you step-by-step instructions for running the programs and interpreting the program outputs, with extra notes in the third sections. However, although our aim is very much to offer a “do it yourself” manual, there will be times when you will need to consult a statistical geneticist, especially for the interpretation of computer output. We have tried to be fairly comprehensive in covering statistical human genetics, but we do not cover here any of the bioinformatic software for gene sequencing, which is still very much in its infancy. Cleveland, OH, USA New York, NY, USA Cleveland, OH, USA
Robert C. Elston Jaya M. Satagopan Shuying Sun
Cleveland, OH, USA
Robert Elston
v
Contents Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
v ix
1 Genetic Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert C. Elston, Jaya M. Satagopan, and Shuying Sun 2 Identification of Genotype Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yin Y. Shugart and Ying Wang 3 Detecting Pedigree Relationship Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Sun 4 Identifying Cryptic Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Sun and Apostolos Dimitromanolakis 5 Estimating Allele Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Indra Adrianto and Courtney Montgomery 6 Testing Departure from Hardy–Weinberg Proportions . . . . . . . . . . . . . . . . . . . . . . Jian Wang and Sanjay Shete 7 Estimating Disequilibrium Coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maren Vens and Andreas Ziegler 8 Detecting Familial Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adam C. Naj, Yo Son Park, and Terri H. Beaty 9 Estimating Heritability from Twin Studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Karin J.H. Verweij, Miriam A. Mosing, Brendan P. Zietsch, and Sarah E. Medland 10 Estimating Heritability from Nuclear Family and Pedigree Data. . . . . . . . . . . . . . Murielle Bochud 11 Correcting for Ascertainment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Warren Ewens and Robert C. Elston 12 Segregation Analysis Using the Unified Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangqing Sun 13 Design Considerations for Genetic Linkage and Association Studies . . . . . . . . . . Je´re´mie Nsengimana and D. Timothy Bishop 14 Model-Based Linkage Analysis of a Quantitative Trait . . . . . . . . . . . . . . . . . . . . . . Audrey H. Schnell and Xiangqing Sun 15 Model-Based Linkage Analysis of a Binary Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rita M. Cantor 16 Model-Free Linkage Analysis of a Quantitative Trait . . . . . . . . . . . . . . . . . . . . . . . . Nathan J. Morris and Catherine M. Stein 17 Model-Free Linkage Analysis of a Binary Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Xu, Shelley B. Bull, Lucia Mirea, and Celia M.T. Greenwood
1
vii
11 25 47 59 77 103 119 151
171 187 211 237 263 285 301 317
viii
18
Contents
Single Marker Association Analysis for Unrelated Samples . . . . . . . . . . . . . . . . . . . Gang Zheng, Jinfeng Xu, Ao Yuan, and Joseph L. Gastwirth 19 Single-Marker Family-Based Association Analysis Conditional on Parental Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ren‐Hua Chung and Eden R. Martin 20 Single Marker Family-Based Association Analysis Not Conditional on Parental Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junghyun Namkung 21 Allowing for Population Stratification in Association Analysis . . . . . . . . . . . . . . . . Huaizhen Qin and Xiaofeng Zhu 22 Haplotype Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Li and Jing Li 23 Multi-SNP Haplotype Analysis Methods for Association Analysis. . . . . . . . . . . . . Daniel O. Stram and Venkatraman E. Seshan 24 Detecting Rare Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tao Feng and Xiaofeng Zhu 25 The Analysis of Ethnic Mixtures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaofeng Zhu 26 Identifying Gene Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gurkan Bebek 27 Structural Equation Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Catherine M. Stein, Nathan J. Morris, and Nora L. Nock 28 Genotype Calling for the Affymetrix Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arne Schillert and Andreas Ziegler 29 Genotype Calling for the Illumina Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yik Ying Teo 30 Comparison of Requirements and Capabilities of Major Multipurpose Software Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert P. Igo Jr. and Audrey H. Schnell Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
347
359
371 399 411 423 453 465 483 495 513 525
539 559
Contributors INDRA ADRIANTO Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA TERRI H. BEATY Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA GURKAN BEBEK Center for Proteomics and Bioinformatics, Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA D. TIMOTHY BISHOP Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK MURIELLE BOCHUD Institute of Social and Preventive Medicine, University of Lausanne, Lausanne, Switzerland SHELLEY B. BULL Samuel Lunenfeld Research Institute of Mount Sinai Hospital and Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada RITA M. CANTOR Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA; Center for Neurobehavioral Genetics, Department of Psychiatry, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA REN-HUA CHUNG John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA APOSTOLOS DIMITROMANOLAKIS Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada ROBERT C. ELSTON Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA WARREN EWENS Department of Biology, University of Pennsylvania, Philadelphia, PA, USA TAO FENG Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA CELIA M.T. GREENWOOD Centre for Clinical Epidemiology, Lady Davis Research Institute, Jewish General Hospital, Montreal, QC, Canada; Cancer Research Society Division of Epidemiology, Department of Oncology, McGill University, Montreal, QC, Canada JOSEPH L. GASTWIRTH Department of Statistics, George Washington University, Washington, DC, USA ROBERT P. IGO JR Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA JING LI Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA XIN LI Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA EDEN R. MARTIN John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA ix
x
Contributors
SARAH E. MEDLAND Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia LUCIA MIREA Maternal-Infant Care Research Centre, Mount Sinai Hospital, Toronto, ON, Canada COURTNEY MONTGOMERY Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA NATHAN J. MORRIS Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA MIRIAM A. MOSING Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia ADAM C. NAJ John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA JUNGHYUN NAMKUNG Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA JE´RE´MIE NSENGIMANA Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK NORA L. NOCK Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA YO SON PARK John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA HUAIZHEN QIN Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA JAYA M. SATAGOPAN Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA ARNE SCHILLERT Institut f€ ur Medizinische Biometrie und Statistik, Universit€ a tsklinikum Schleswig-Holstein, Universit€ a t zu Lubeck, L€ ubeck, Germany AUDREY SCHNELL Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA VENKATRAMAN E. SESHAN Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA SANJAY SHETE Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA YIN Y. SHUGART Unit of Statistical Genomics, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD, USA CATHERINE M. STEIN Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA DANIEL O. STRAM Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA LEI SUN Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada SHUYING SUN Department of Epidemiology and Biostatistics, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
Contributors
XIANGQING SUN Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA YIK YING TEO Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore; Department of Epidemiology and Public Health, National University of Singapore, Singapore, Singapore; Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore, Singapore ur Medizinische Biometrie und Statistik, MAREN VENS Institut f€ Universit€ a tsklinikum Schleswig-Holstein, Universit€ a t zu L€ ubeck, L€ ubeck, Germany KARIN J.H. VERWEIJ Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia JIAN WANG Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA YING WANG Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD, USA JINFENG XU Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore WEI XU Department of Biostatistics, Princess Margaret Hospital, Toronto, ON, Canada; Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada AO YUAN National Human Genome Center, Howard University, Washington, DC, USA GANG ZHENG Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, MD, USA XIAOFENG ZHU Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA ANDREAS ZIEGLER Institut f€ ur Medizinische Biometrie und Statistik, Universit€ a tsklinikum Schleswig-Holstein, Universit€ a t zu Lubeck, L€ ubeck, Germany BRENDAN P. ZIETSCH Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia
xi
sdfsdf
Chapter 1 Genetic Terminology Robert C. Elston, Jaya M. Satagopan, and Shuying Sun Abstract Common terms used in genetics with multiple meanings are explained and the terminology used in subsequent chapters is defined. Statistical Human Genetics has existed as a discipline for over a century, and during that time the meanings of many of the terms used have evolved, largely driven by molecular discoveries, to the point that molecular and statistical geneticists often have difficulty understanding each other. It is, therefore, imperative, now that so much of molecular genetics is becoming an in silico statistical science, that we have well-defined, common terminology. Key words: Gene, Allele, Locus, Site, Genotype, Phenotype, Dominant, Recessive, Codominant, Additive, Phenoset, Diallelic, Multiallelic, Polyallelic, Monomorphic, Monoallelic, Polymorphism, Mutation, Complex trait, Multifactorial, Polygenic, Monogenic, Mixed model, Transmission probability, Transition probability, Epistasis, Interaction, Pleiotropy, Quantitative trait locus, Probit, Logit, Penetrance, Transformation, Scale of measurement, Identity by descent, Identity in state, Haplotype, Phase, Multilocus genotype, Allelic association, Linkage disequilibrium, Gametic phase disequilibrium
In this introductory chapter, we give the original meanings of various genetic terms (which will be found in the older literature), together with some of the various meanings that are sometimes ascribed to them today, and how the terms will be defined in the following chapters. For simplicity, exceptions are ignored and what is stated is usually, but not invariably, true.
1. Gene, Allele, Locus, Site The concept of a gene (the word itself was introduced by Bateson) is due to Mendel, who used the German word “Factor.” Mendel used the word in the same way that we might call “hot” and “cold” factors, not in the way that we call “temperature” a factor. In other words, his Factor was the level of what statisticians now call a factor. In the original terminology, still used by some population geneticists, genes occur in pairs on homologous chromosomes. In this Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_1, # Springer Science+Business Media, LLC 2012
1
2
R.C. Elston et al.
terminology, the four blood groups A, B, O, and AB (defined in terms of agglutination reactions) are determined by three (allelic) genes: A, B, and O. Nowadays molecular geneticists do not call these three factors genes, but rather “alleles,” defined as “alternative forms” of a gene that can occur at the same locus, or place, in the genome. Whereas Drosophila geneticists used to talk of two loci for a gene, and human geneticists used to talk of two genes at a locus, modern geneticists talk of “two alleles of a gene” or “two alleles at a locus”; this last, which is nowadays so common, is the terminology that will thus be used in this book. It then follows (rather awkwardly) that two alleles at the same locus are allelic to each other, whereas two alleles that are at different loci are nonallelic to each other. A gene is commonly defined as a DNA sequence that has a function, meaning a class of similar DNA sequences all involved in the same particular molecular function, such as the formation of the ABO red cell antigens. (Note the common illogical use of the phrase “cloning genes” by molecular geneticists when, by their own terminology, “cloning alleles” is meant.) Some restrict the word gene to protein-coding genes, but there are many more sequences of DNA that have function by virtue of being transcribed to RNA without ever being translated to DNA and hence protein coding, so this restricted definition of a gene would appear to be unwarranted. See ref. 1 for a more detailed explanation of the evolution of, and a modern definition of, the word “gene.” A locus is the location on the genome of a gene, such as the “ABO gene.” By any definition a gene must involve more than one nucleotide base pair. Single nucleotide polymorphisms (SNPs) thus do not occur at loci, but rather in and around loci and, in this book, we shall not write SNP markers as being “at” loci. Because of the confusion that occurs when SNPs are described as occurring at loci, some use the term “gene-locus,” but we shall always use the term locus for the location of a functional gene. We shall, however, allow SNP markers to have alleles and use the original term for their locations: “sites” within loci or, more generally, sites within the region of a locus or anywhere in the genome. If in the population only one allele occurs at a site or locus, we shall say that it is monomorphic, or monoallelic, in that population. If two alleles occur, as is common for SNPs, we shall use the original term diallelic which, apart from having precedence, is etymologically sounder than the now commonly used term biallelic. If many alleles occur, we shall describe the polymorphism as polyallelic or multiallelic (the former term is arguably more logical, the latter more common). When there are just two alleles at a locus, the one with the smaller population frequency is called the minor allele. In genetics, the term allele “frequency”—which is strictly speaking a count—is used to mean relative frequency, i.e., the proportion of all such alleles at that locus among the members of a population; thus the term minor allele frequency is often used for diallelic markers.
1 Genetic Terminology
2. Genotype, Phenotype, Dominant, Recessive, Codominant, Additive
3
An individual’s genotype is the totality of that individual’s hereditary material, whereas an individual’s phenotype is the individual’s appearance. However, the terms genotype and phenotype are usually used in reference to a particular locus or set of loci, and to a particular trait or set of traits. Genotypes are not observed directly, but rather inferred from particular phenotypes. Thus, with respect to the ABO locus, the four blood types A, B, O, and AB are (discrete) phenotypes; and the possible genotypes, formed by pairs of alleles, are AA, AO, BB, BO, AB, and OO. With respect to the ABO blood group phenotypes, the A allele is dominant to the O allele and the O allele is recessive to the A allele. Similarly, the B allele is dominant to the O allele and the O allele is recessive to the B allele. The A and B alleles are codominant. Note that for the words “dominant” and “recessive” to have any meaning, at least two alleles and two phenotypes must be specified. If a particular allele at a locus is dominant with respect to the presence of a disease, there must be at least one other allele at that locus that is recessive with respect to the absence of that disease. Geneticists loosely talk about a disease being dominant, meaning that, with respect to the phenotype “disease”, the underlying disease allele is dominant, i.e., the disease is present when either a single or two copies of the allele is present. Similarly, they talk of a disease being recessive, meaning that, with respect to the same phenotype, the underlying disease allele is recessive, i.e., the disease is present only when two disease alleles are present. Alternatively, they may talk of an allele being dominant or recessive, the particular phenotype (often disease) being understood. The important thing to realize is that “dominance” and “recessivity” describe a relationship between one or more genotypes and a particular phenotype. This leads to the concept of phenosets: the genotypes AA and AO form the phenoset corresponding to the A blood type, and the genotypes BB and BO form the phenoset corresponding to the B blood type. In the case of the ABO blood group, a person who has one A allele and one B allele has the blood type AB, which is a different phenotype from that of either of the corresponding homozygotes, AA and BB; this relationship is called codominance. In general, a locus is codominant with respect to the set of phenotypes it controls if the phenotypes of each heterozygote at that locus differ from that of each of the corresponding homozygotes. We make a distinction between codominant and additive; the latter implies that the phenotype (or phenotypic distribution, see below under quantitative phenotypes) corresponding to the heterozygote is in some sense half-way between those of the two corresponding homozygotes. Whereas the term additive is only meaningful when a
4
R.C. Elston et al.
scale of measurement has been defined, codominance is a more general concept that does not require the definition of a scale of measurement.
3. Polymorphism, Mutation The A, B, O, and AB blood types comprise a polymorphism, in the sense that they are alternative phenotypes that commonly occur in the population. A polymorphic locus was originally defined operationally as a polymorphism-determining locus at which the least common allele occurs with a “frequency” of at least 1% (2); but a more appropriate definition would be a locus at which the most common allele occurs with a “frequency” of at most 99%. Different alleles arise at a locus as a result of mutation, or sudden change in the genetic material. Mutation is a relatively rare event, caused, for example, by an error in replication. Thus all alleles are by origin mutant alleles, and a genetic polymorphism was conceived of as a locus at which the frequency of the least common allele has a frequency too large to be maintained in the population solely by recurrent mutation. However, what is important at a locus is the degree of polymorphism, and a locus in which there are 1,000 equifrequent alleles would be considered much more polymorphic than a locus at which there are two alleles with frequencies 0.01 and 0.99. Many authors now use the term mutation for any rare allele, and the term polymorphism for any common allele.
4. Complex Trait, Multifactorial, Polygenic, Monogenic
The term “complex trait” was introduced about two decades ago without a clear definition. It appears to be used for traits that do not exhibit clear one-locus (“Mendelian”) segregation, usually because segregation at more than one locus is involved. Whereas multifactorial and complex are ill-defined and often used interchangeably, a clear distinction should be made between multifactorial and polygenic. Multifactorial implies that more than one factor is involved in the etiology of the phenotype, whether genetic, environmental, or both. Polygenic, on the other hand, implies that only genetic factors are involved, usually in an additive fashion, with the original definition that the number of factors (loci) is so large that they cannot be individually characterized. Thus, strictly speaking, the term polygenic should not be used to include any environmental factors—though in practice it is often used that way.
1 Genetic Terminology
5
Monogenic inheritance implies segregation at a single locus, and the term “mixed model” is used by geneticists to denote an additive combination of monogenic and polygenic inheritance. When both components are present in a segregation model in which both components are latent variables (the former discrete and the latter continuous), the underlying statistical model is random, not mixed, because there are two random components other than any error term. Statistical geneticists often use the term “transmission probability” in two quite different senses. In this book, we carefully distinguish transmission probabilities, probabilities that a parent having a particular genotype transmits particular alleles to offspring, from transition probabilities, probabilities that offspring receive particular genotypes from their parents. This distinction was introduced in ref. 3.
5. Haplotype, Phase, Multilocus Genotype
Let A, B be two alleles at one locus, and D, d be two alleles at another locus. If one parent transmits A and D to an offspring, while the other transmits B and d, the offspring genotype is denoted AD/Bd (or Bd/AD), in which the parental origins are separated by “/”. The two alleles transmitted by one parent constitute a two-locus haplotype; with respect to two alleles at each of two loci there are four possible haplotypes—AD, Ad, BD, and Bd in this case, with AD/Bd and Bd/AD being the two possible phases. If n1 alleles can occur at one of the loci and n2 at the other, n1n2 two-locus haplotypes are possible. At the first locus n1(n1 + 1)/2 genotypes are possible (n1 homozygotes and n1(n1 1)/2 heterozygotes), while at the second locus n2(n2 + 1)/2 genotypes are possible. If we pair these genotypes, one from each locus, the total number of pairs possible is n1 þ 1 n2 þ 1 n1 n2 þ 1 n1 n2 ¼ n1 n2 2 2 2 n1 1 n2 1 n2 : n1 2 2 On the other hand, at the two loci together, there are n1n2 haplotypes; and pairing these we have n1n2(n1n2 + 1)/2 possible pairs of two-locus haplotypes, or diplotypes. In this book, we shall define “two-locus genotypes” this way, i.e., without differentiating the two phases, so that for the same number of alleles at each locus there is a smaller number of two-locus genotypes than there is of two-locus diplotypes. Thus we shall consider the two phases of the double heterozygote, Ad/BD and BD/Ad, as being the same two-locus genotype. Usually, the term “multilocus genotype”
6
R.C. Elston et al.
refers to genotypes when the phases are not distinguished, and the term diplotype is useful for the case when they are distinguished (though this term is not yet in common usage). More generally, a haplotype is the multilocus analog of an allele at a single locus. It consists of one allele from each of multiple loci that are transmitted together from a parent to an offspring. When haplotypes made up of multiple alleles (one from each locus) are paired, a pair in which the genotype at each of n loci is heterozygous corresponds to 2n 1 different diplotypes or phases. It is usual nowadays to restrict the word haplotype to the case where all the loci involved are on the same chromosome pair, so that all the alleles involved are on the same chromosome. Typically, but not always, it is assumed that all the different phases of a particular multiple heterozygote have the same phenotype.
6. Epistasis, Interaction, Pleiotropy
7. Allelic Association, Linkage Disequilibrium, Gametic Phase Disequilibrium
When two loci are segregating, each typically influences a separate phenotype. For example, A and B may be alleles at the ABO locus, determining the ABO blood types, while D and d are alleles at a disease locus, determining disease status. But if alleles at a single locus influence two different phenotypes, we say there is pleiotropy. It is known that a person’s ABO genotype influences the risk of gastric cancer as well as determining blood type. Thus the ABO locus is pleiotropic. Alternatively, alleles at two different loci may determine the same phenotype, such as the presence or absence of a disease; and if the phenotype associated with the genotypes at one locus depends on the genotypes at another locus, we say there is epistasis. Thus gastric cancer may perhaps be caused by the epistatic effect of alleles at two (or more) loci. Epistasis and pleiotropy are sometimes confused in statistical genetics.
If the alleles at one locus are not distributed in the population independently of the alleles at another locus, the two loci exhibit allelic association. If this association is a result of a mixture of subpopulations (such as ethnicities or religious groups) within each of which there is random mating, the association is often denoted as “spurious.” In such a case, there is true association, but the cause is not of primary genetic interest. If the association is not due to this kind of population structure, it is either due to linkage disequilibrium (LD) or, more generally, to gametic phase disequilibrium (GPD); in the former case the loci are linked, i.e., they
1 Genetic Terminology
7
cosegregate in families, in the latter case they need not be linked, i.e., they may segregate independently in families. Owing to an unintended original definition, loci that are not linked have often been mistakenly described as being in LD (4, 5).
8. Identity The concept of allelic identity is an important one. Alleles are identical by descent (IBD) if they are copies of the same ancestral allele, and must be differentiated from alleles that are physically identical but not (at least within the previous dozen or so generations) ancestrally identical. Such alleles, when not IBD, are identical in state (IIS). It is well understood that molecules, atoms, etc., can be in different states (not “by” different states), and the same is true of alleles, though here the states are ancestrally, not physically, different. Whereas in the animal and plant genetics literature the phrases “identity in state” and “identical in state” are commonly used, for no good reason the phrases “identity by state” and “identical by state” are now commonly used in the human genetics literature. In this book, to stress the difference and to be consistent with both the earlier common usage and the usage in the animal and plant genetics literature, we shall use the terminology IIS, not IBS.
9. Quantitative Traits A locus at which alleles determine the level of a quantitative phenotype is called a QTL (quantitative trait locus). Typically, the word “quantitative” is used interchangeably with “continuous” when describing a phenotype. However, quantitative traits can be discrete. Care should be taken to distinguish between those methods of analysis of quantitative traits for which distributional assumptions, such as conditional normality, are critical, and those for which they are not. Transforming the phenotype of a QTL corresponds to changing its units if the transformation is linear, or more generally to changing the scale of measurement (e.g., square root or logarithmic) if the transformation is nonlinear. On the scale of measurement used, alleles at a QTL have an additive effect if the phenotypic distribution of the heterozygote is the average of the corresponding two homozygote phenotypic distributions. With respect to that phenotype, allele A is dominant to the allele B, and allele B is recessive to allele A, if the whole phenotypic distribution of the heterozygote AB is the same as that of the homozygote AA. Any variance among the phenotypic means of the genotypes at a locus over and above that
8
R.C. Elston et al.
due to additive allele action is called dominance, or dominant genetic, variance. Thus dominance variance can arise as a result of one allele being dominant to another, but such simple allele action is not necessarily implied by dominance variance. The presence of dominance variance depends on the scale of measurement; dominant allele action (complete dominance, as described above for discrete traits such as the ABO blood group) does not. If the phenotypic distribution of a heterozygote is not the average of the corresponding two homozygote phenotypic distributions, we shall say there is codominance. Thus in this book we shall not restrict the word codominance to the case of additivity (with the result that codominance is scale independent). Just as dominance has a different meaning when applied to quantitative traits, so does epistasis. From a statistical point of view, dominance can be considered as intralocus interaction, or nonadditivity of the allelic contributions to the phenotype. Epistasis is a genetic term, now generalized when applied to quantitative traits to indicate nonadditivity of the effects on the phenotype of the genotypes at two (or more) loci in a population. It is thus from a statistical viewpoint interlocus interaction, and so dependent on how the phenotype is measured. Statistical interaction is a term with a similar limitation, but is not restricted to genetic factors. Statistical interaction should be carefully distinguished from biological interaction (5, 6). Whereas biological interaction does not require the presence of statistical interaction, the presence of the latter implies the existence of the former. Indeed, statistical interaction is removable if a monotonic transformation can make the effects of the two factors involved (e.g., segregation at two loci, or segregation at one locus and levels of an environmental factor) additive. Furthermore, the magnitude of any interaction effects can depend critically on how the individual factor effects (single locus genotypes in the case of genetic factors) are defined (5). There is usually no loss of generality in assuming that disease status, unaffected or affected, is a quantitative trait that takes on the values 0 or 1, respectively, so that its mean value is the population prevalence of the disease. Then everything that has been written here with regard to dominant allele action, dominance variance, and epistasis also holds in the case of a binary disease phenotype, except that now the scale of measurement (in the sense of a nonlinear monotonic transformation) is irrelevant in the absence of a quantitative measure. However, if there is a quantitative measure, such as a relative risk or odds ratio, then the scale of measurement will determine whether or not there is interaction. Also, in the case of a binary disease phenotype, the penetrance, or probability of being affected, is often transformed to a probit (or logit), giving rise to what is called the “liability” to disease, and this liability is treated as a continuous phenotype. Dominance and epistatic variance can be
1 Genetic Terminology
9
quite different on this liability scale from that measured on the original “penetrance” scale. For a QTL, dominance variance is present when there is intralocus nonadditivity. By the same token, epistatic variance is present when there is interlocus nonadditivity. Each locus gives rise to its own components of additive genetic and dominant genetic variance. If multiple loci affect a QTL, there are multiple components of epistatic variance. Except in the case of a binary phenotype with no associated quantitative measure, the relative magnitudes of all such components are scale (i.e., transformation) dependent, just as corresponding components of genotype (or allele) environment interaction are scale dependent. Finally, for those who wish to have a better theoretical understanding of statistical human genetics, ref. 7 provides an exceptionally good introduction.
Acknowledgments This work was supported in part by the following grants from the National Institutes of Health, USA: P41RR003655 (RCE) and R01CA137420 (JMS). References 1. Gerstein MB et al (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17: 669–681 2. Ford EB (1940) Polymorphism and taxonomy. In Huxley J (ed) The new systematics, Oxford 3. Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21: 523–542 4. Lewontin RC (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67
5. Wang X, Elston RC, and Zhu X (2010) The meaning of interaction. Hum Hered 70: 269–277 6. Wang X, Elston R, Zhu X (2010) Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet doi:10.1038/ nrg2579-c2 7. Ziegler A, Ko¨nig IR (2010) A statistical spproach to senetic spidemiology: Concepts and applications, 2nd edn. Wiley-VCH, Weinheim
sdfsdf
Chapter 2 Identification of Genotype Errors Yin Y. Shugart and Ying Wang Abstract It has been documented that there exist some errors in most large genotype datasets and that an error rate of 1–2% is adequate to lead to the distortion of map distance as well as a false conclusion of linkage (Abecasis et al. Eur J Hum Genet 9(2):130–134, 2001), therefore one needs to ensure that the data are as clean as possible. On the other hand, the process of data cleaning is tedious and demands efforts and experience. O’Connell and Weeks implemented four error-checking algorithms in computer software called PedCheck. In this chapter, the four algorithms implemented in PedCheck are discussed with a focus on the genotypeelimination method. Furthermore, an example for four levels of error checking permitted by PedCheck is provided with the required input files. In addition, alternative algorithms implemented in other statistical computing programs are also briefly discussed. Key words: Genotype, Genotype error, Parametric linkage analysis, LOD score, Computational efficiency, Automatic genotype elimination, Nuclear-pedigree method, Genotype-elimination method, Critical-genotype method, Odds-ratio method
1. Introduction While gene hunters have limited access to computational resources, they have to rely on visual inspection to check for genotypic errors occurring in a human pedigree. The errors come from two main major sources (1) pedigree errors and (2) true genotyping errors. Pedigree errors include, but are not limited to, nonpaternity, unreported adoption status, errors in data entry as well as sample mix-ups. In this chapter, we focus only on true genotyping errors. One can imagine that if automated approaches for error checking were not available, detection of erroneous genotypes by visual inspection could be a tedious task, particularly in extended pedigrees with multiple generations. Therefore, there is a demand for computational algorithms that can efficiently identify erroneous
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_2, # Springer Science+Business Media, LLC 2012
11
12
Y.Y. Shugart and Y. Wang
genotypes regardless the size of the pedigrees. Needless to say, the elimination of genotype errors will benefit both linkage analysis and family-based association analysis. In this chapter, we use linkage analysis as an example for the sake of illustration. Linkage analysis is a statistical method that has been widely applied for mapping genes related to human disorders with relatively high penetrance and rare disease allele frequencies. Because of the fact that DNA segments located near each other on a chromosome tend to be passed together from one generation to another, genetic markers are often used as tools for tracking the inheritance pattern of a gene that has not yet been identified but the approximate location of the gene of interest is known. The main use of linkage analysis is to test for evidence of linkage between a disease locus of interest and an arbitrary marker locus. The LOD score was defined as a test statistic by Morton (2). A single point LOD score test is typically performed by maximizing the LOD score over a grid of values of y in the interval 0–0.5, where y is the probability of the recombination fraction, viewed as a measure of the extent of linkage. The recombination fraction ranges from y ¼ 0 for loci in close proximity to each other through y ¼ 0.5 for loci that are far apart or on different chromosomes. The formula for a LOD score is Z ðyÞ ¼ log10 ½LðyÞ
log10 ½Lðy ¼ 0:50Þ
In practice, LOD scores can be calculated using linkage analysis programs such as LIPED (3) and LINKAGE (4–6). A wellknown algorithm for parametric linkage analysis is the Elston and Stewart algorithm (7), in which the genetic mode of inheritance is assumed, and this algorithm was implemented in the LINKAGE software in an elegant manner. It is common knowledge that LINKAGE is capable of handling relatively large pedigrees, but the computation time of LINKAGE is exponential in the number of genetic markers (8). In the presence of genotypic errors, the increase in computational time can be tremendous. Therefore, eliminating Mendelian inconsistencies in pedigree data is an important task if only for the sake of improving the efficiency of linkage analysis. Lange and Goradia (9) described an algorithm for automatic genotype elimination which enables great reduction in computational times required for pedigree analysis. Their genotype elimination program was an extension of the one given by Lange and Boehnke (10), which is discussed in this chapter. Later, Stringham and Boehnke (11) developed two methods that were implemented in the Mendel computer software (12) to calculate the posterior probability of erroneous genotypes for each pedigree member. Stringham and Beohnke developed two novel approaches to compute an individual posterior probability of
2 Identification of Genotype Errors
13
genotype error using a weighted sum of all possible genotypes, where the weight is the probability of the genotype being an error. Their first approach allows each individual to have every possible genotype weighted by the probability that the genotype is erroneous and then computes the posterior probability of this individual’s genotype, whereas their second approach allows one genotyped pedigree member at a time to have every possible genotype while the other pedigree members are held at their originally assigned genotypes. They then compute the posterior probability for each genotyped individual, based on the assumption that all other pedigree members have been correctly genotyped. It has been recognized that both methods have the weakness of not being able to automatically deal with more than one marker at a time; therefore, the computation with markers with multiple alleles (i.e., microsatellites) is very time consuming. These two algorithms were implemented in MENDEL (12). Of note, Sobel and Lange (13) developed important extensions to the Lander–Green–Kruglyak algorithm for small pedigrees and to the Markov-chain Monte Carlo stochastic algorithm for large pedigrees to conduct linkage analysis while accommodating genotype errors. These developments have been implemented in the program MENDEL, version 4. In recognition of the need for developing efficient computer software for automatic error checking, O’Connell and Weeks (14) implemented four error-checking algorithms in a new computer program called PedCheck. The main goal of this chapter is to illustrate the algorithms used in PedCheck and, for the purpose of demonstration, offer a few examples for different levels of error checking permitted by PedCheck, as well as provide detailed instructions for users to follow using the simulated pedigree data presented in Figs. 1 and 2. 1.1. The NuclearPedigree Algorithm
Lange and Goradia (9) described the steps involved in a using nuclear pedigree as follows: (A) For each pedigree member, save only those genotypes compatible with his or her phenotype. (B) For each nuclear family: 1. Examine each mother–father genotype pair. (a) Determine which zygote genotypes can arise. (b) If each child in the nuclear family has one or more of these zygote genotypes among his or her list of genotypes, then save the parental genotypes as well as any child genotype that is not conflicted with one of the created zygote genotypes.
14
Y.Y. Shugart and Y. Wang
1
2
3
4
5
6
2/1
4/3
4/6
5/1
7 3/2
Fig. 1. Example of level 1 error.
(c) If any child is incompatible with the current parental pair, then take no action to save any genotypes. 2. For each person in the nuclear family, exclude any genotypes not saved during step 1. (C) Repeat step B until no more genotypes can be excluded. 1.2. The GenotypeElimination Algorithm
The genotype-elimination algorithm used by O’Connell and Weeks was an extension of the Lange–Goradia algorithm (9, 15) for set-recoded genotypes (14). Because the genotype-elimination algorithm is able to detect subtle levels of inconsistencies due to the elimination of certain genotypes presented in pedigrees with more complex structure, it is more powerful than the nuclear family algorithm originally described by Lange and Goradia (9). For example, Fig. 1 shows that individual 7 cannot possibly carry allele 3 given that the genotypes of his father is 4/3 and of his mother is 4/6. As described by O’Connell and Weeks (14), for each pedigree and locus, the genotype-elimination algorithm can pick the first component nuclear family with an error that has been missed by the nuclear family algorithm, and then provide the inferred-genotype lists for each member of that nuclear family. To illustrate the
2 Identification of Genotype Errors
3
1
2
2/4
1/2
4
1/2
15
5 2/2
6 1/3
Fig. 2. Example of Mendelian error in a pedigree with three generations.
situation where the nuclear-family algorithm is not able to find any errors, we presented an example in Fig. 2. As shown in Fig. 2, genotypes in each nuclear family appear to be consistent at the level 1 checking, but individual 3 determines phase (“linkage phase” is defined as the haplotype of the gamete transmitted from parent to offspring) of individual 6, which forces individual 3 to carry a “2” allele, which also means that a parent of individual 4 is a carrier of a “3” allele. However, we know the genotypes of the parents of individual 4, which are 4/2 and 1/2; therefore, he cannot possibly carry a “3” allele. This type of error can easily be detected by the genotype-elimination algorithm as a level 2 error. 1.3. The CriticalGenotype Method
In Subheading 2, we described the nuclear-family method and the genotype-elimination method. Below we discuss two additional algorithms termed “critical genotype” and “odds ratio method,” respectively, both implemented in PedCheck. The genotype of an individual that resolves the issue of inconsistency within a pedigree when removed from the data is defined as a “critical genotype” by O’Connell and Weeks (14). They pointed
16
Y.Y. Shugart and Y. Wang
out that a “critical genotype” does not always provide a perfect solution. The critical genotypes may not be independent, for example, when a parent and offspring are homozygous for different alleles. Under this circumstance, both will be critical genotypes, but blanking either resolves the inconsistency. Therefore, O’Connell and Weeks concluded that “the set of erroneous genotypes is a subset of the critical genotypes.” The critical-genotype algorithm implemented in PedCheck enables users to identify the critical genotypes in a particular pedigree by “untyping” one typed individual at a time (meaning call this individual unknown in the input file), and then applying the genotype-elimination algorithm to determine if the inconsistency has been resolved. If one critical genotype is found, this genotype represents the error; otherwise, the set of critical genotypes blanked may depend on the order of the individuals who were untyped. Below we give one example to demonstrate how PedCheck detected errors at different levels. The genotype data were simulated. PedCheck can be freely downloaded from the following Web site: http://watson.hgen.pitt.edu/register/docs/pedcheck.html. As documented in the PedCheck menu, using PedCheck one is able to specify the algorithms at four levels (from basic to comprehensive): level 1 uses the above described nuclear family algorithm; level 2 checking uses the genotype-elimination algorithm; level 3 checking uses the critical-genotype algorithm; and level 4 checking uses an odds-ratio algorithm. Typically, users start with level 1 checking, resolving identified problems, and then move to levels two, three, and four. Using a program such as PedCheck, if a complete genotype-elimination algorithm finds no errors, that means the genotypes are consistent with Mendelian laws of inheritance, and other downstream analysis can be performed. As clearly stated by the authors of PedCheck, although the genotype-elimination algorithm are guaranteed to find subtle errors, the inferred-genotype lists provided by PedCheck do not always help to detect the source of the inconsistency, simply because the genotype lists for untyped individuals may be extensive. Even if there is only one error, the individual involved may be difficult to detect by examination of only the genotype lists, since either more than one individual may be identified as the possible source of that error or the error may not be in the particular nuclear-family data appearing in the output. Therefore, the authors attempted to find an appropriate statistic to distinguish between several critical genotypes. 1.4. The Odds-Ratio Algorithm
It was thoroughly discussed by O’Connell and Weeks that, in the presence of several critical genotypes at a given locus, one is not able to decide a priori which critical genotype is most likely to be a mistake. To help distinguish between various critical genotypes, O’Connell
2 Identification of Genotype Errors
1
3
4
5 4/7
6 8/9
7
17
2
8
9
11 6/9
10
12 4/6
13
+ 14 7/9
15 4/9
16 4/8
17 5/7
31 4/6
32 4/7
18 4/9
19
20 6/9
Fig. 3. A simulated pedigree that was used for the purpose of demonstration. The plus sign indicates the individual with the actual genotype error; and the triangle points to the individual identified by PedCheck as most likely to have an erroneous genotype.
and Weeks implemented an odds-ratio statistic based on single-locus likelihoods for the pedigree. We would first like to bring to the users’ attention that the authors restrict the genotypes to contain only alleles appearing in the pedigree under consideration. Because “untyping” an individual with a critical genotype results in a consistent pedigree, one recognizes that after genotype elimination the individual must have at least one alternative genotype that is equally valid in a statistical sense. Second, for the particular locus, the authors compute and store the likelihood for the pedigree data of each alternative valid genotype at each critical genotype, while keeping all other critical genotypes at their original value. To explain how PedCheck works at each level of assessment, we present a test data set using one simulated multigeneration pedigree (Fig. 3). Note that, in this particular example, the level 2 option automatically runs level 1 first. Level 3 checking identified a total of three critical genotypes. It is very fast to compute, since it involves only single-locus likelihoods. Finally, level 4 checking calculates a full-pedigree likelihood.
18
Y.Y. Shugart and Y. Wang
2. Method The procedure of running PedCheck is fairly straightforward: Level 1 is a first screening using the nuclear-pedigree algorithm. If no level 1 errors were detected or if the errors have been removed, then level 2 should be run to detect relatively subtle errors using the genotype-elimination algorithm. Level 3 should be run only if the researchers are interested in knowing the “critical genotypes,” and level 4 should be run to detect the most likely source of error. After level 4 checking finishes, level 2 should be run again if there has been any change in the data, until PedCheck indicates that no more pedigree errors can be found and that you are ready for analysis. PedCheck requires two input files: ped file and data file. They can be either pre-MAKEPED files or LINKAGE format files. The files below are prepared in the pre-MAKEPED format. The first row stands for the pedigree ID and then followed by ID, father’s ID, mother’s ID, gender (1 ¼ male, 2 ¼ female), affection status (1 ¼ normal, 2 ¼ affected and 0 ¼ unknown), and alleles. While entering the ID numbers for a father or mother who is a founder (i.e., his or her parents are not known in this particular pedigree), one should use zero for his or her ID. The number of spaces between fields should not matter. The file below is called myfile1. ped for pedigree 1 (see Fig. 1). Fam
ID
Father
Mother
Gender
Aff
Allele1
Allele2
1
1
0
0
1
0
0
0
1
2
0
0
2
0
0
0
1
3
1
2
2
0
2
1
1
4
1
2
2
0
4
3
1
5
1
2
2
0
4
6
1
6
0
0
1
0
5
1
1
7
6
3
1
0
3
2
Below is myfile23.ped for pedigree 2 and 3 (see Figs. 2 and 3). Fam
ID
Father
Mother
Gender
Aff
Allele1
Allele2
2
1
0
0
1
0
2
4
2
2
0
0
2
0
1
2
2
3
1
2
2
0
1
2 (continued)
2 Identification of Genotype Errors
Fam
ID
Father
Mother
19
Gender
Aff
Allele1
Allele2
2
4
0
0
1
0
0
0
2
5
4
3
1
0
2
2
2
6
4
3
2
0
1
3
3
1
0
0
1
0
0
0
3
2
0
0
2
0
0
0
3
3
1
2
2
0
0
0
3
4
1
2
2
0
0
0
3
5
0
0
1
0
4
7
3
6
1
2
2
0
8
9
3
7
0
0
1
0
0
0
3
8
1
2
2
0
0
0
3
9
1
2
1
0
0
0
3
10
0
0
2
0
0
0
3
11
1
2
2
0
6
9
3
12
1
2
2
0
4
6
3
13
1
2
1
0
0
0
3
14
5
6
2
0
7
9
3
15
5
6
2
0
4
9
3
16
7
8
2
0
4
8
3
17
7
8
1
0
5
7
3
18
9
10
2
0
4
9
3
19
0
0
1
0
0
0
3
20
19
18
2
0
6
9
3
31
0
0
2
0
6
4
3
32
17
31
2
0
4
7
Now, we introduce a file called “myfile1.dat” to go with pedigree file one. The meaning of each parameter is provided in each row. This format was described by Terwillier and Ott (16) in their well-known text book called “Handbook of Human Genetic Linkage.”
20
Y.Y. Shugart and Y. Wang
2 Identification of Genotype Errors
21
We now provide the details for each step involved in error checking using PedCheck: Step 1: Run PedCheck to check level 0 and level 1 errors in pedigrees 1, 2, and 3. The results will be saved in a file called pedcheck.err. Command: pedcheck -m -p myfile1.ped -d myfile1.dat Command: pedcheck -m -p myfile23.ped -d myfile23.dat PedCheck identified three level 1 errors in pedigree 1. First, the children have more than 4 alleles. Second, mother 5’s allele is out of range. Third, child 7 and mother are inconsistent. On the other hand, there were no level 0 or level 1 errors in pedigree 2 (see Fig. 2) and pedigree 3 (see Fig. 3), respectively. Setp 2: Run Pedcheck to check level 2 errors in pedigrees 2 and 3. Command: Pedcheck -2 -m -p myfile23.ped -d myfile23.dat After this run, PedCheck can find one inconsistency in each pedigree. Step 3: Run Pedcheck to check level 3 and level 4 errors. Command: Pedcheck -4 -m -p myfile23.ped -d myfile23.dat PedCheck’s run at level 3 and the output indicated that there are four critical genotypes in individuals 1, 2, 3, 6 in pedigree 2 and three critical genotypes in individuals 6, 12, and 17 in pedigree 3. Untyping any person listed will result in a consistent pedigree at the given locus. According to the results of level 4 (see Notes 1 and 2), one can conclude that individual 6 in pedigree 2 and individual 17 in pedigree 3 mostly likely have the erroneous genotypes. We reran PedCheck after their genotypes were removed. No inconsistencies were found. The following shows the diagnostic output from PedCheck for pedigree in Fig. 3:
22
Y.Y. Shugart and Y. Wang
2 Identification of Genotype Errors
23
3. Notes 1. Although PedCheck has been frequently used by the linkage analysts, we are aware of one limitation—that it cannot be applied to pedigrees with loops. Therefore, the readers may wish to visit some algorithms implemented in SimWalk2 version 2.82 (17) which can handle pedigrees with multiple loops. Further, we also like to highlight a few advantages of SimWalk2, which was not discussed in depth in this paper. Similar to the idea of level 4 checking in Pedcheck, SimWalk2 version 2.8 reports the overall probability of mistyping at each observed genotype. Sobel et al. (13) indicated that construction of these so-called posterior mistyping probabilities is based on the marker map, a prior error model and the error model that is used to define the penetrance function at the marker loci. Their algorithm can also accommodate alternative error models. The authors of SimWalk2 pointed out that false homozygosity is often the most common genotyping error. Therefore, SimWalk2 version 2.8 includes an empirical error model that incorporates this information and recognizes that misreading one allele occurs more frequently than misreading two. In addition, SimWalk2 version 2.8 imputes at each genotype the expected number of each allele appearing in that genotype while allowing for mistyping. 2. Other computer programs can also be used for error checking for various types of pedigree structure, including UNKNOWN of the Linkage package (4–6), MERLIN (18), RELPAIR (19), SIMWALK2 (17), and MENDEL version 4 (20). Owing to limited space, we did not discuss the statistical details behind each of these programs. The detailed algorithms for these programs are all freely available on line. However, most of these programs are not general enough to catch all genotyping errors. For instance, UNKNOWN takes a very long time to run and does not always provide diagnostic errors in the output file. Finally, it has been shown there exists a good amount of errors in most large genotype datasets and that an error rate of 1–2% is adequate to lead to the distortion of map distance as well as false conclusion of linkage (1), so the analysts may wish to consider using an analytical method using error models when analyzing large pedigree datasets (13).
Acknowledgments The views expressed in this chapter do not necessarily represent the views of the NIMH, NIH, HHS, or the US Government.
24
Y.Y. Shugart and Y. Wang
References 1. Abecasis GR, Cherny SS, Cardon LR. (2001) The impact of genotyping error on familybased analysis of quantitative traits. Eur J Hum Genet. 9(2):130–134. 2. Morton NE.(1955) Sequential tests for the detection of linkage. Am J Hum Genet 7(3): 277–318. 3. Ott J.(1974) Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am J Hum Genet. 26(5):588–597. 4. Lathrop GM, Lalouel JM (1984) Easy calculations of LOD scores and genetic risks on small computers. Am J Hum Genet. 36(2):460–465. 5. Lathrop GM, Lalouel JM, Julier C, Ott J. (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA. 81(11):3443–3446. 6. Lathrop GM, Lalouel JM, White RL. (1986) Construction of human linkage maps: likelihood calculations for multilocus linkage analysis. Genet Epidemiol. 3(1):39–52. 7. Elston RC, Stewart J. (1971) A general model for the genetic analysis of pedigree data. Hum Hered. 21(6):523–542. 8. Ott J.(1999) Analysis of Human Genetics Linkage. Baltimore: Hopkins University Press. 9. Lange K, Goradia TM. (1987) An algorithm for automatic genotype elimination. Am J Hum Genet. 40(3):250–256. 10. Lange K, Boehnke M. (1983) Extensions to pedigree analysis. V. Optimal calculation of Mendelian likelihoods. Hum Hered. 33(5): 291–301. 11. Stringham HM, Boehnke M. (1996) Identifying marker typing incompatibilities in linkage analysis. Am J Hum Genet. 59(4):946–950.
12. Lange K, Weeks D, Boehnke M. (1988) Programs for Pedigree Analysis: MENDEL, FISHER, and dGENE. Genet Epidemiol. 5(6):471–472. 13. Sobel E, Papp JC, Lange K.(2002) Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 70(2): 496–508. 14. O’Connell JR, Weeks DE. (1998) PedCheck a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet. 63(1):259–266. 15. Lange K, Weeks DE.(1989) Efficient computation of LOD scores: genotype elimination, genotype redefinition, and hybrid maximum likelihood algorithms. Ann Hum Genet. 53(Pt 1):67–83. 16. Terwilliger JD, Ott J. (1994) Handbook of Human Genetics Linkage. 1 ed. The Johns Hopkins University Press. 17. Sobel E, Lange K. (1996) Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet. 58(6):1323–1337. 18. Abecasis GR, Cherny SS, Cookson WO, Cardon LR. (2002) Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 30(1):97–101. 19. Broman KW, Weber JL. (1998) Estimation of pairwise relationships in the presence of genotyping errors. Am J Hum Genet. 63(5): 1563–1564. 20. Sobel E, Sengul H, Weeks DE.(2001) Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered. 52(3): 121–131.
Chapter 3 Detecting Pedigree Relationship Errors Lei Sun Abstract Pedigree relationship errors often occur in family data collected for genetic studies, and unidentified errors can lead to either increased false positives or decreased power in both linkage and association analyses. Here we review several allele sharing, as well as likelihood-based statistics, that were proposed to efficiently extract genealogical information from available genome-wide marker data, and the software package PREST that implements these methods. We provide detailed analytical steps involved using two application examples, and we discuss various practical issues including results interpretation. Key words: Pedigree error, Relationship estimation, IBD, IBS, IIS, Allele sharing, Likelihood, Hidden Markov model, EM algorithm, Software, PREST, PREST-plus, Multiple hypothesis testing, Simulation, Linkage, Association, Robustness
1. Introduction Pedigree errors or misspecified relationships among individuals often occur in data collected for genetic studies using family data. The potential causes of pedigree errors are numerous, including undocumented non-paternity, non-maternity, adoption, sample duplication, sample swap or mating between relatives. Unidentified pedigree relationship errors can lead to either increased false negatives or false positives in both linkage (1) and association analyses (2). For example, linkage analysis looks for regions of the genome that are shared by affected relatives, in excess of what is expected under the null hypothesis of no linkage (see Chapter 17 for details on the concept of “identical by decent,” IBD, and model-free linkage analysis based on IBD sharing). The null expected IBD sharing by a pair of relatives, however, is determined by the assumed pedigree structure. If two putative full sibs (with expected IBD sharing of 1 under the null hypothesis of no linkage, Snull, full-sib ¼ 1) are in fact half-sibs (Snull, half-sib ¼ 1/2), the
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_3, # Springer Science+Business Media, LLC 2012
25
26
L. Sun
linkage statistic calculated under the erroneous full-sib relationship, Sobs Snull, full-sib, will be smaller than that calculated under the true half-sib relationship, Sobs Snull, half-sib, thus resulting in the loss of power. This is true whenever the true relationship is more distant than the hypothesized relationship. Conversely, if two putative full sibs are in fact monozygotic (MZ) twins, then an increase in type 1 error rate is expected. Thornton and McPeek (3) among others showed that misspecified relationship has similar consequences in association analyses, and Chapter 4 investigates cryptic relatedness and its effects on case–control association studies. Some pedigree relationship errors can be detected using additional genealogical information (e.g., surname, age, or gender), however such information might not be available or powerful. Genome-wide marker data, collected for linkage or association scans, can provide accurate information on relatedness among individuals. Consequently, a number of statistical methods have been proposed and corresponding analysis software developed, including, for example, RELATIVE (4), RELCHECK (1), PEDCHECK (5), SIBERROR (6), PREST (7), GRR (8), and ECLIPSE (9). All these computational tools are listed and available at http:// linkage.rockefeller.edu/soft/. Mendelian inconsistencies can be used to identify relationship errors (i.e., inconsistencies occurring at the genome-wide level within a specific parent–child trio), in addition to genotyping error (i.e., inconsistencies occurring at a specific marker across multiple parent–child trios). However, most existing methods use the more powerful allele sharing- or likelihood-based statistics. Here we focus on the methods implemented in PREST (7, 10), because the proposed statistics are suitable for a range of relative pairs broader than sib pairs. For clarity we start with data set-up and notation. We then review four test statistics, IBS, AIBS, EIBD, and MLRT, and a simple relationship estimation method proposed by McPeek and Sun (7). We leave methods evaluation (see Note 1) and discussion of alternative methods (see Notes 2 and 3) to the Subheading 3. Although most information can be found in the original work and PREST documentation (http://www.utstat.toronto.edu/sun/Software/Prest), we try to make the material in this chapter comprehensive and self-sufficient, particularly when it comes to practical details. 1.1. Notation
Let R denotes the relationship type between a pair of individuals. Figure 1 shows pedigree structures for a set of 11 relationship types, R, common for human pedigrees, and Table 1 provides the corresponding IBD distributions and the kinship coefficients. The collected family data typically contain information on family ID, individual ID, father’s ID, mother’s ID, gender, and disease status, as well as on genotypes of either microsatellites or SNPs for each individual, Gm : genotype for marker m, m ¼ 1; :::; M (Table 2).
3 Detecting Pedigree Relationship Errors
Pedigree
27
Relationship MZ-twin
parent-offspring
full-sib
half-sib+first-cousin
half-sib
grandparent-grandchild
avuncular
first-cousin
half-avuncular
half-first-cousin
unrelated
Fig. 1. Eleven relationships common for human pedigree. The shaded pair of individuals has the specified relationship. See Table 1 for the corresponding IBD distributions and kinship coefficients.
The first five columns, C1–C5, determine the pedigree structure and the presumed or hypothesized relationship type, R0 , between a pair of individuals (e.g., in family 1, individuals with IDs 1 and 2 have full-sib relationship; in family 4, IDs 1 and 5 have the grandparent– grandchild relationship). The goal of detecting relationship errors can be formulated as a hypothesis-testing problem with H0 : R ¼ R0 ;
28
L. Sun
Table 1 IBD marginal distribution and kinship coefficient Distribution of IBD Sharing Reltype in PREST
Relationship type (notation)
p0
p1
p2
Kinship coefficient, f
11
MZ-twin (MZ)
0
0
1
0.5
10
Parent–offspring (PO)
0
1
0
0.25
1
Full-sib (FS)
0.25
0.5
0.25
0.25
9
Half-sib + first-cousin (HSFC)
0.375 0.5
0.125 0.1875
2
Half-sib (HS)
0.5
0.5
0
0.125
3
Grandparent–grandchild 0.5 (GPC)
0.5
0
0.125
4
Avuncular (AV)
0.5
0.5
0
0.125
5
First-cousin (FC)
0.75
0.25
0
0.0625
7
Half-avuncular (HAV)
0.75
0.25
0
0.0625
8
Half-first-cousin (HFC)
0.875 0.125 0
0.03125
6
Unrelated (UN)
1
0
0
0
p ¼ ðp0 ; p1 ; p2 Þ and ’ are shown for each of the 11 relationship types shown in Fig. 1. Note that p1 þ 2p2 ¼ 4f. PREST reltype numerical coding is also provided
where R0 is the relationship implied by the presumed pedigree structure. Evidence supporting or rejecting the null hypothesis can be sought in Gm ; m ¼ 1; :::; M ; by assessing whether the observed genotype data are compatible with what’s expected for a pair of individuals with the hypothesized relationship type. For example, if R0 is MZ-twin but the observed genotypes for the two individuals are not identical across the genome (allowing for a small genotyping error rate), then we have strong evidence for relationship error. To formally assess the statistical evidence, we review four test statistics. In the following, we use Gm ¼ ðði; j Þ; ðk; lÞÞ to denote the observed genotypes for a pair of individuals at marker m, where (i, j) is the genotype for one individual and (k, l) for the other. We also introduce Dm ¼ i; i ¼ 0; 1; or 2 to denote the number of alleles shared IBD by the pair at marker m, which is typically unknown. The underlying fDm g; m ¼ 1; :::; M ; along the genome is also called the IBD process.
3 Detecting Pedigree Relationship Errors
29
Table 2 Typical information contained in collected family data C1
C2
C3
C4
C5
C6
Genotype data
Family ID
Individual ID
Father ID
Mother ID
Disease Gender status
G1
G2
G3
G4
Gm
1
1
7
6
1
2
5 5
4 4
3 4
2 5
...
1
2
7
6
2
2
5 5
0 0
0 0
5 6
...
1
3
7
6
1
2
5 5
0 0
3 3
2 5
...
1
7
0
0
1
1
0 0
0 0
3 3
5 5
...
1
6
0
0
2
0
5 7
0 0
0 0
2 6
...
4
1
8
9
2
2
5 7
4 5
0 0
7 8
...
4
2
8
9
1
1
5 7
0 0
6 6
5 7
...
4
3
8
9
2
1
5 5
0 0
6 6
7 8
...
4
8
0
0
1
0
5 5
0 0
3 6
7 7
...
4
9
5
6
2
1
5 7
0 0
0 0
5 8
...
4
5
0
0
1
0
0 0
0 0
0 0
0 0
...
4
6
0
0
2
0
0 0
0 0
0 0
0 0
...
Sex: 1 ¼ male, 2 ¼ female; diseases status: 1 ¼ unaffected, 2 ¼ affected (or quantitative phenotype measures); 0 ¼ unknown in all cases
1.2. Likelihood-Based Statistic: MLRT
The likelihood for a relationship type R is LðRÞ ¼ PR ðG1 ; G2 ; :::; Gm ; :::; GM Þ: Due to the linkage correlation between markers, this probability calculation is nontrivial and is typically executed via the hidden Markov model (HMM) (11) assumed for the underlying IBD process, {Dm} (see Chapters 14–17 for more details). For a pair of individuals, the likelihood calculation boils down to three key components: PR ðDm ¼ iÞ; the IBD marginal or stationary distribution PR ðDmþ1 ¼ j j Dm ¼ iÞ; the IBD transition probability PðGm j Dm ¼ iÞ; the genotype conditional probability. Table 1 lists the IBD marginal distribution, p ¼ ðp0 ; p1 ; p2 Þ ¼ ðPR ðDm ¼ 0Þ; PR ðDm ¼ 1Þ; PR ðDm ¼ 2ÞÞ; for each of the 11 relationships shown in Fig. 1. Note that the IBD marginal distribution is independent of marker m but dependent on the relationship type R. We did not include subscript R in p ¼ ðp0 ; p1 ; p2 Þ for notation simplicity. Table 1 also includes the kinship coefficient, f, for each
30
L. Sun
Table 3 IBD transition probability for a full-sib pair, PR=full–sib (Dm+1 = j |Dm = i ) Next IBD status, Dm+1 Current IBD Status, Dm
0
1
2
0
’2
2’C
C2
1
fy
f 2 + C2
fy
2
y2
2fy
’2
’ ¼ y2 þ ð1 yÞ2 and C ¼ 2 y ð1 yÞ, and y is the recombination fraction between the two markers. Under a no interference model, y ¼ ð1 e 2t Þ=2, where t is the genetic distance (in units of Morgans) between the two markers
Table 4 Genotype conditional probability, conditional on the IBD status, P(Gm|Dm = i ) IBD status, Dm Unordered genotype, Gm
0
1
2
((i, i), (i, i))
fi 4
fi3
fi 2
((i, i), (i, j))
4fi3fj
2 fi2fj
0
((i, i), (j, j))
2fi2fj2
0
0
((i, i), (j, k))
4fi2fjfk
0
0
((i, j), (i, j))
4fi2fj2
fi fj(fi+fj)
2fifj
((i, j), (i, k))
8fi2fj fk
2fifjfk
0
((i, j), (k, l))
8fifjfkfl
0
0
Note that genotypes in this table are unordered within an individual and between the two individuals, e.g., ((i, j), (k, l)) ¼ ((j, i), (l, k)) ¼ ((k, l ), (i, j)). fi, fj, fk, and fl, are the population frequencies of alleles i, j, k, and l, and i 6¼ j 6¼ k 6¼ l. For a diallelic marker such as a SNP, the table could be simplified further
relationship. Note that p ¼ ðp0 ; p1 ; p2 Þ and f are simplified summaries of R and do not always uniquely determine a relationship type. Table 3 shows the IBD transition probability for a full-sib pair, which depends on R and the map distance between the two markers (see ref. 12 for other relationships). Table 4 provides the
3 Detecting Pedigree Relationship Errors
31
conditional probability of the observed genotype, conditional on the underlying IBD status at that marker. Note that this conditional probability does not depend on R (13, 14). To gain computational efficiency, the actual calculation utilizes a recursive formula instead of considering the unknown IBD status at all markers simultaneously (see refs. 7 and 11 for details of the algorithm). As a result, the computational time grows only linearly with the number of markers, making the likelihood calculation feasible for genome-wide scans with thousands or more markers and for analyzing thousands or more different relative pairs. The Markov property of the IBD process requires the assumptions of linkage equilibrium and no interference. In addition, McPeek and Sun (7) emphasized that it is valid for only certain relative pairs (e.g., full-sib or half-sib) and proposed an alternative approach for general relationships (e.g., avuncular or first-cousin). However, they also showed that the likelihood calculated under the Markov assumption is a good approximation of the true likelihood in most cases. To perform the likelihood ratio test, the likelihood is first calculated for the hypothesized relationship (e.g., full-sib), LðR0 Þ, then for an alternative (e.g., half-sib), LðR1 Þ, and the difference between the log likelihoods serves as the test statistic. In practice, the possible alternative is often not unique. For example, full-sib could be MZ-twin, half-sib due to non-paternity, or unrelated due to adoption. In that case, we can consider a composite alternative hypothesis, calculate the likelihood for each plausible alternative, ^ LðR1 Þ; R1 2 A, and use the maximum, LðAÞ. Formally, to test H0 : R ¼ R0 ; H1 : R ¼ R1 ; R1 2 AnR0 ; the maximum likelihood ratio test (MLRT) statistic is ^ MLRT ¼ log LðAÞ logðLðR0 ÞÞ:
The distribution of 2 MLRT, however, cannot be approximated by the standard w2 distribution due to the discreteness of the parameter space (i.e., each parameter value here is a pedigree relationship type). Therefore, statistical significance must be assessed empirically via simulations. Briefly, for each replicate k, genotype data for the pair are first simulated assuming R ¼ R0 , the corresponding MLRTðkÞ then calculated. The number of replicates with MLRTðkÞ more extreme than the observed MLRT, divided by the total number of replicates, is the empirical P -value. Although MLRT is powerful, it is computationally expensive. In the following, we review three allele sharing-based statistics that are simpler to calculate and easier to assess statistical significance via normal approximation.
32
L. Sun
1.3. Allele-SharingBased Statistics: IBS, AIBS, and EIBD
The idea behind these three statistics stems from nonparametric linkage analysis (see Chapter 17 for details), though with some key differences. For example, linkage analysis with the affected-sib-pair design investigates localized allele sharing for a given marker, averaged over all affected sib-pairs, and over-sharing indicates linkage between the marker and the susceptibility locus. In contrast, pedigree relationship analysis investigates global allele sharing for a given relative pair, averaged over all markers, and over-sharing (or under-sharing) indicates that the true relationship has a higher (or lower) kinship coefficient than the hypothesized one. Also note that, unlike linkage or association analyses, phenotype information is not needed for pedigree error detection. To detect pedigree relationship errors, the sharing statistics proposed by McPeek and Sun (7) have the general form of S¼
M 1 X Xm ; M m¼1
where Xm is the marker-specific allele-sharing statistic for a pair of individuals at marker m, and S is the average of Xm across all available markers. In the genome-wide setting where M is typically large (e.g., ~500 or more for linkage scans and ~100 K or more for association scans), S is approximately normally distributed, S N ðmR ; s2R Þ; where mR is the expected value and s2R is the variance of S for a pair of individuals with relationship type R. To test the null hypothesis, H0 : R ¼ R0 ; the statistical significance of S is assessed using the null distribution, S N ðmR0 ; s2R0 Þ: Specifically, the IBS statistic counts the number alleles shared “identical in state,” IIS (also known as “identical by state,” IBS; see Chapter 1), i.e., Xm ¼ the number of alleles shared IIS: The AIBS statistic calculates the probability that two alleles are IBD conditional on being shared IIS, i.e., Xm ¼ 0, if 0 alleles IIS, Gm ¼ ðði; j Þ; ðk; lÞÞ, f Xm ¼ f þð1R0f Þfi , if 1 allele IIS, Gm ¼ ðði; jÞ; ði; lÞÞ, R0
Xm ¼
R0
fR0 fR0 þð1 fR0 Þfi
þf
fR0 fR0 Þfj
R0 þð1
, if 2 alleles IIS, Gm ¼ ðði; j Þ;
ði; j ÞÞ, where fR0 is the kinship coefficient for the null relationship R0, and fi and fj are the allele frequencies. The EIBD statistic is the expected number of alleles shared IBD conditional on the observed genotype, i.e.,
3 Detecting Pedigree Relationship Errors
Xm ¼ ER0 ½Dm jGm ¼
33
PðGm jDm ¼ 1Þp1 þ PðGm jDm ¼ 2Þp2 P ; i¼0;1;2 PðGm jDm ¼ iÞpi
where p ¼ ðp0 ; p1 ; p2 Þ is the IBD marginal distribution for R0 as in Table 1, and PðGm j Dm ¼ iÞ is the genotype conditional probability as in Table 4. These statistics are simple to calculate and require only the IBD marginal distribution, genotype conditional distribution and population allele frequencies. However, the calculations of mR0 and s2R0 for assessing the statistical significance are more complicated. Briefly, mR0 is the average of ER0 ½Xm , whose calculation requires information similar to that for Xm. However, s2R0 involves ER0 ½Xm2 and ER0 ½Xm1 Xm2 , whose calculation needs also the IBD transition probability between marker m1 and marker m2, i.e., the second order of the IBD process. We refer readers to (7) and (12) for technical details of the calculation. Interestingly, mR0 for the EIBD statistic depends only on the IBD marginal distribution for R0 and has a close relationship with the kinship coefficient, ER0 ½EIBD ¼ ¼
1.4. Relationship Estimation
M M 1 X 1 X ER0 ½Xm ¼ ER ER ½Dm jGm M m¼1 M m¼1 0 0 M 1 X ER ½Dm ¼ p1 þ 2p2 ¼ 4fR0 : M m¼1 0
When strong evidence against the hypothesized null relationship, H0 : R ¼ R0 has been found, it is very helpful to know the alternative relationship(s) that are compatible to the genotype data observed for the pair of individuals. McPeek and Sun (7) proposed a simple strategy that estimates the most likely IBD distribution, p ¼ ðp0 ; p1 ; p2 Þ; by obtaining ^p ¼ ð^p0 ; ^p1 ; ^p2 Þ that maximizes the following quantity ! M M X X X LðpÞ ¼ log PðGm jDm ¼ iÞpi : logðPðGm ; pÞÞ ¼ m¼1
m¼1
i¼0;1;2
The maximization can be efficiently achieved by an application of the EM algorithm. Specifically, PM 1 ðkþ1Þ ðkÞ M m¼1 PðGm jDm ¼ iÞ ; and pi ¼ pi P ðkÞ j ¼0;1;2 PðGm jDm ¼ iÞpj ðkÞ
ðkÞ
ðkÞ
p0 þ p1 þ p2 ¼ 1;
ðkÞ
ðkÞ
ðkÞ
where pðkÞ ¼ ðp0 ; p1 ; p2 Þ is the current estimate and pðkþ1Þ ¼ ðkþ1Þ ðkþ1Þ ðkþ1Þ ðp0 ; p1 ; p2 Þ is the next updated estimate. When the markers are unlinked, L(p) is the correct log likelihood for parameter p, and ^p is the MLE (13, 14). When markers are linked, this procedure
34
L. Sun
still provides good estimates for p (7), and Chapter 4 shows that the conclusion holds even for high-throughput SNP data collected for genome-wide association studies (GWAS) (15). Depending on the values of ^p ¼ ð^p0 ; ^p1 ; ^p2 Þ, one can propose plausible relationship(s) and formally test if the proposed relationship is compatible with the observed genotype data (10). This procedure can be practically important because it allows the possibility to keep individuals that had relationship errors by reconstructing alternative pedigrees. Although the IBS statistic is less informative than IBD, IBS is more robust to misspecified allele frequencies. Therefore, it is also useful to obtain the proportion of markers that have 0, 1, or 2 alleles shared IIS. Such an IIS distribution can quickly pinpoint, for example, MZ twins, for which we expect the proportion of 2 IIS to be 1 or close to 1 if allowing for some genotyping errors.
2. Methods The test statistics and estimation method reviewed in Subheadings 1.2–1.4 have been implemented in a program called PREST, which stands for Pedigree RElationship Statistical Test (10). To meet the computational challenges introduced by the high-throughput genotype data, PREST-plus was later developed to increase the computational efficiency as well as the interface of PREST (15). We will use PREST and PREST-plus interchangeably unless specified otherwise. In the following, we provide details on essential steps and result interpretation when using PREST to detect pedigree relationship errors, as well as applications to two datasets with distinct features. 2.1. Essential Steps in Detecting Pedigree Relationship Errors via PREST
Step 0: Download a PREST version suitable for your computer platform at http://www.utstat.toronto.edu/sun/Software/Prest/. Step 1: Prepare two input files, both are identical to what are used by PLINK (16): l
“prest.ped”: pedigree and genotype information. Specifically each row is an individual and the total number of columns is 6 þ 2 M , where the first six columns are C1–C6 as shown in Table 2, M is the total number of markers that were genotyped. C6 is not used but a required column to be consistent with linkage or association input files. Markers can be either microsatellites or SNPs or both, with corresponding alleles coded as integers (0 for missing data). (The alleles of SNPs should be coded as 1/2 and not with A/C/G/T characters. To accomplish this, the plink option –recode12 can be used to recode the data).
3 Detecting Pedigree Relationship Errors l
35
“prest.map”: map and allele frequency information. Specifically each row is a marker and columns provide information of, in this order, chromosome, marker name, map in cM (optional), bp position (not used), number of alleles, followed by the corresponding allele frequency for each allele (optional). If a cM map is not provided, relationship estimation is performed but not relationship testing. If allele frequency is not provided, it will be estimated from the genotype data. (A note to PREST users: convert-geno pedigree chromfiles will convert PREST-ready input files to PREST-plus-ready input files. The script convert-geno is included in the PRST-plus package.)
Step 2: Run the program with different command options depending on user-specific needs. For example, l
option 0 (least computing time): relationship estimation only
l
prest –file prest.ped –map prest.map –wped –out prest.wped. option0 option 1 (more computing time): both relationship estimation and testing, but using the fast EIBD, AIBS, and IBS statistics only
l
prest –file prest.ped –map prest.map –wped –cm –out prest.wped. option1 option 2 (substantially more computing time): testing using the more powerful MLRT statistic as well prest –file prest.ped –map prest.map –wped –cm –mlrt –out prest. wped.option2
Step 3: Interpret the results. PREST produces several output files: l
l
l
2.2. Essential Steps in Interpreting the Results
“prest_mendel”: family trios with Mendelian inconsistencies if there are any. “prest_summary”: summary statistics such as the total numbers of families, individuals, and markers, and the number of relative pairs being tested. “prest_results”: the main results file. Each row provides various relationship testing and estimation results for a pair of individuals, as detailed in Table 5. An R program is provided to assist the users navigate and digest the results. For details, see two application examples in Subheadings 2.4 and 2.5.
Step 1: Checking the estimated IBD distribution (e.g., Fig. 2) and the IIS distribution (e.g., Fig. 3), stratified by the relationship type can quickly pinpoint obvious outliers. Step 2: The histogram of P -values (e.g., the empirical P -values of the MLRT statistic in Fig. 4) can also provide a sense of pedigree relationship error rate. Note that if all hypotheses are truly null
36
L. Sun
Table 5 PREST results interpretation Column name
Interpretation
FID1
Family ID and individual ID of individual 1
Option
IID1 FID2
Family ID and individual ID of individual 2
IID2 reltype
The hypothesized null relationship type for the pair of individuals determined by the given family data (see Table 1 for relationship types and their reltype codings)
commark
The number of markers with genotype data for both individuals
EIBD
The EIBD, AIBS, and IBS statistics as reviewed in Subheading 1.3
Option 0—basic output point estimates only
AIBS IBS p.IBS0
The proportion of markers with 0, 1, or 2 alleles IIS
p.IBS1 p.IBS2 p.IBD0
The estimated probability of 0, 1, or 2 alleles IBD as reviewed in Subheading 1.4, i.e., the estimated IBD distribution
p.IBD1 p.IBD2 p.normal.EIBD
The P -value of EIBD, AIBS, or IBS based on normal approximation (P -value of MLRT requires simulations as discussed in Subheading 1.2)
Option 1—statistical significance provided for EIBD, AIBS, IBS
The P -value of EIBD, AIBS, IBS, or MLRT based on simulations
Option 2—statistical significance provided for MLRT as well, but simulation needed
p.normal.AIBS p.normal.IBS p.emprical.EIBD
p.emprical.AIBS p.emprical.IBS p.emprical. MLRT
p1, Prob(IBD=1)
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.4
0.2
0.0
p1, Prob(IBD=1)
0.0
0.0
0.0
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.6
p0, Prob(IBD=0)
0.4
X
0.8
1.0
R0: HSFC, 0 pairs
0.2
X
R0: FC, 309 pairs
0.2
X
0.0
X
0.0
0.0
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
R0: PO, 993 pairs
0.2
X
R0: UN, 1293 pairs
0.2
X
R0: HS, 66 pairs
0.0
X
0.0
0.0
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
0.2
0.2
0.4
0.6
0.8
0.6 p0, Prob(IBD=0)
0.4
0.8
R0: MZ, 0 pairs
p0, Prob(IBD=0)
X
1.0
1.0
R0: HAV, 62 pairs
0.2
X
R0: GPC, 165 pairs
0.0
0.0
0.0
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
0.2
0.2
0.4
0.6
0.8
0.6 p0, Prob(IBD=0)
0.4
0.8
All, 5035 pairs
p0, Prob(IBD=0)
X
1.0
1.0
R0: HFC, 40 pairs
0.2
X
R0: AV, 900 pairs
Fig. 2. Application 1: Relationship estimation based on IBD. Each plot shows the estimated IBD distribution, ^ p1 vs. p^0 , for relative pairs having the hypothesized relationship, R0. The cross marks the expected IBD distribution for R0. The bottom right plot shows the results for all 5,035 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
p1, Prob(IBD=1)
1.0
0.8
0.6
0.4
0.2
0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
p1, Prob(IBD=1) p1, Prob(IBD=1) p1, Prob(IBD=1)
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
p1, Prob(IBD=1) p1, Prob(IBD=1) p1, Prob(IBD=1)
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
p1, Prob(IBD=1) p1, Prob(IBD=1) p1, Prob(IBD=1)
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
R0: FS, 1207 pairs
3 Detecting Pedigree Relationship Errors 37
1.0
proportion of IBS=1 0.2 0.4 0.6 0.8
0.0
1.0
proportion of IBS=1 0.2 0.4 0.6 0.8
0.0
1.0
proportion of IBS=1 0.2 0.4 0.6 0.8
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
proportion of IBS=0
0.2
1.0
R0: HSFC, 0 pairs
proportion of IBS=0
0.2
R0: FC, 309 pairs
proportion of IBS=0
0.2
R0: FS, 1207 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
proportion of IBS=0
0.2
1.0
R0: PO, 993 pairs
proportion of IBS=0
0.2
R0: UN, 1293 pairs
proportion of IBS=0
0.2
R0: HS, 66 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
0.4
0.6
0.8 proportion of IBS=0
0.2
R0: MZ, 0 pairs
proportion of IBS=0
0.2
1.0
1.0
R0: HAV, 62 pairs
proportion of IBS=0
0.2
R0: GPC, 165 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
0.4
0.6
0.8 proportion of IBS=0
0.2
All, 5035 pairs
proportion of IBS=0
0.2
1.0
1.0
R0: HFC, 40 pairs
proportion of IBS=0
0.2
R0: AV, 900 pairs
Fig. 3. Application 1: Relationship estimation based on IIS (also known as IBS). Each plot shows the proportion of markers with 0 IIS vs. 1 IIS for relative pairs having the hypothesized relationship, R0. The expected IIS distribution for R0 depends on the allele frequencies but is generally the center of the cluster, assuming the majority of the pairs do not have relationship error and have the same markers genotyped. The bottom right plot shows the results for all 5,035 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
0.0
1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0 1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0 1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0
1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0 1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0 1.0 proportion of IBS=1 0.2 0.4 0.6 0.8 0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
proportion of IBS=1 proportion of IBS=1 proportion of IBS=1
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
38 L. Sun
3 Detecting Pedigree Relationship Errors
39
Fig. 4. Application 1: Relationship testing. Histogram of MLRT P -values.
(i.e., no relationship error), the P -values follow the Uniform (0, 1) distribution, assuming tests are independent of each other. Although pairs from the same pedigree have some dependency, it generally does not noticeably affect the Uniform assumption. Therefore, abundant amount of small P -values indicate that a proportion of the relative pairs might have relationship errors. Step 3: To formally assess the statistical significance of the P -value for a relative pair, one must adjust for multiple hypothesis testing because of the thousands or more pairs investigated. This can be achieved by using the conservative Bonferroni correction, i.e., rejecting the null hypothesis if the P -value is less than a=total no. of pairs, where a is the overall type 1 error rate for all the pairs tested. Alternatively, one can use the less stringent false discovery rate (FDR) control (17) to strike a balance between the trade-off between type 1 error rate and power (18), and adopt a stratified FDR approach to further improve power (19). Step 4: There is a number of other factors influencing the conclusion of pedigree relationship errors, including the number of markers used and the pattern of results among multiple pairs from the same pedigree. For pairs with strong evidence for relationship error, it is also useful to use the companion program, ALTERTEST (10), to identify potential alternative relationship(s) that are compatible with the observed genotype data. In some cases, alternative pedigree structure could be proposed. See PREST documentation for additional details.
40
L. Sun
2.3. Pedigree Relationship Errors and Cryptic Relationships
Cryptic relationships can also occur in family data, including relatedness between founders (often causes inbreeding) and across pedigrees. The methods discussed in this chapter were designed for general relationships, including unrelated relationship type, therefore they are suitable for identifying cryptic relatedness as well. Essentially, the null hypothesis, H0 : R ¼ R0 , hypothesizes the relationship as unrelated, and corresponding MLRT and IBS statistics can be calculated. Note that EIBD and AIBS are not applicable to unrelated pairs (7). In addition, relationship estimation can be performed in the exact same manner by obtaining ^p ¼ ð^p0 ; ^p1 ; ^p2 Þ that maximizes the probability of the observed genotype data. PREST can detect cryptic relatedness between families by using the –aped option combined with other options as listed above, e.g., prest –file prest.ped –map prest.map –aped –out prest.aped.option0 prest –file prest.ped –map prest.map –aped –cm –out prest.aped. option1 prest –file prest.ped –map prest.map –aped –cm –mlrt –out prest.aped. option2 Note that the computing time increases considerably when using the –aped option, because the number of pairs to be analyzed is approximately in the order of n2 =2, where n is the total number of individuals in the dataset. In the context of GWAS, PLINK (16) implemented a methodof-moment-based p ¼ ðp0 ; p1 ; p2 Þ estimation method, tailored for identifying cryptic relationships among putatively unrelated case–control samples. We refer readers to Chapter 4 for issues that are specific to detecting cryptic relationships, particularly when high-throughput SNP data are available.
2.4. Application 1: Small to ModerateSized Pedigrees with Genome-Wide Low-Density Microsatellite Data
The COGA data (20) were provided by the collaborative study on the genetics of alcoholism (U10AA008401) and distributed as part of the biennial genetic analysis workshops (GAW11). (We thank COGA investigators and the GAW advisory committee for permission to use the data.) The data consist of 105 families (mostly three- or four-generation pedigrees) with 1,214 individuals in total, among which 992 were genotyped at 285 autosomal microsatellite markers. Missing genotype rates vary across individuals but most of the 992 have genotype data on >200 markers. Potential pedigree errors in the data were previously analyzed and reported (7, 12). Here we provide some of the key results and practical advice, after running PREST with command line: prest –file coga.ped –map coga.map –wped –cm –mlrt –out coga.wped. option2
3 Detecting Pedigree Relationship Errors
41
In total, 5,381 relative pairs (within pedigree pairs, two individuals of a pair must have the same family ID; see Chapter 4 for results of across pedigree analysis) were analyzed, but some pairs have as few as 25 markers genotyped in common. Deleting pairs with insufficient genotype data (2 individuals simultaneously; Sun et al. (24) extended the EIBD statistic to analyzing large inbred pedigrees such as the Hutterites. (Note that animal genetic literature offers extensive references on inbred pedigrees.) 3. Pedigree errors and genotyping errors are often confounded. The common practice is to first detect genotyping errors by evaluating marker-specific statistic such as the Mendelian inconsistencies rate, averaged across all pedigrees or samples. After removing markers with high genotyping error rate, the relative pair-specific statistics as discussed above are then used to identify pedigree errors. Broman and Weber (25) proposed a method that incorporates a pre-specified genotyping error rate in the likelihood calculation, which is implemented in the MLRT statistic (10). However, joint modeling the two types of errors in a single framework remains an open question. In addition, little work has been done to tackle the problem of incorporating pedigree errors or relationship uncertainties in linkage or association analyses with only a few exceptions (3, 26).
1.0
p1, Prob(IBD=1) 0.2 0.4 0.6 0.8
0.0
1.0
p1, Prob(IBD=1) 0.2 0.4 0.6 0.8
0.0
1.0
p1, Prob(IBD=1) 0.2 0.4 0.6 0.8
0.0
0.0
0.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.6
p0, Prob(IBD=0)
0.4
X
0.8
1.0
R0: HSFC, 0 pairs
0.2
X
R0: FC, 2192 pairs
0.2
X
0.0
X
0.0
0.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.6
p0, Prob(IBD=0)
0.4
0.8
1.0
R0: PO, 226 pairs
0.2
X
R0: UN, 1110 pairs
0.2
X
R0: HS, 16 pairs
0.0
X
0.0
0.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.2
0.4
0.6
0.8
0.6 p0, Prob(IBD=0)
0.4
0.8
R0: MZ, 0 pairs
p0, Prob(IBD=0)
X
1.0
1.0
R0: HAV, 6 pairs
0.2
X
R0: GPC, 119 pairs
0.0
0.0
0.0
0.4
0.6
p0, Prob(IBD=0)
0.8
1.0
0.2
0.2
0.4
0.6
0.8
0.6 p0, Prob(IBD=0)
0.4
0.8
All, 5550 pairs
p0, Prob(IBD=0)
X
1.0
1.0
R0: HFC, 0 pairs
0.2
X
R0: AV, 1468 pairs
Fig. 5. Application 2: Relationship estimation based on IBD. Each plot shows the estimated IBD distribution, ^ p1 vs. ^ p0 , for relative pairs having the hypothesized relationship, R0. The cross marks the expected IBD distribution for R0. The bottom right plot shows the results for all 5,550 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
0.0
1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0
1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0
1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0 1.0 p1, Prob(IBD=1) 0.2 0.4 0.6 0.8 0.0
R0: FS, 413 pairs
3 Detecting Pedigree Relationship Errors 43
proportion of IBS=1
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.4
0.2
0.0
proportion of IBS=1
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
proportion of IBS=0
0.2
1.0
R0: HSFC, 0 pairs
proportion of IBS=0
0.2
R0: FC, 2192 pairs
proportion of IBS=0
0.2
R0: FS, 413 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
proportion of IBS=0
0.2
1.0
R0: PO, 226 pairs
proportion of IBS=0
0.2
R0: UN, 1110 pairs
proportion of IBS=0
0.2
R0: HS, 16 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
0.4
0.6
0.8 proportion of IBS=0
0.2
R0: MZ, 0 pairs
proportion of IBS=0
0.2
1.0
1.0
R0: HAV, 6 pairs
proportion of IBS=0
0.2
R0: GPC, 119 pairs
0.0
0.0
0.0
0.4
0.6
0.8
1.0
0.4
0.6
0.8
0.4
0.6
0.8 proportion of IBS=0
0.2
All, 5550 pairs
proportion of IBS=0
0.2
1.0
1.0
R0: HFC, 0 pairs
proportion of IBS=0
0.2
R0: AV, 1468 pairs
Fig. 6. Application 2: Relationship estimation based on IIS (also known as IBS). Each plot shows the proportion of markers with 0 IIS vs. 1 IIS for relative pairs having the hypothesized relationship, R0. The expected IIS distribution for R0 depends on the allele frequencies but is generally the center of the cluster, assuming the majority of the pairs do not have relationship error and have the same markers genotyped. The bottom right plot shows the results for all 5,550 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
proportion of IBS=1
1.0
0.8
0.6
0.4
0.2
0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
proportion of IBS=1 proportion of IBS=1 proportion of IBS=1
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
proportion of IBS=1 proportion of IBS=1 proportion of IBS=1
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4
proportion of IBS=1 proportion of IBS=1 proportion of IBS=1
0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0
44 L. Sun
3 Detecting Pedigree Relationship Errors
45
Table 6 Methods comparison in terms of power, robustness, and computational efficiency Statistics
Power
Robustness
Computation
Others
MLRT
1
3
3
EIBD
2
2
2
na for unrelated and parent–offspring
AIBS
2
2
2
na for unrelated
IBS
3
1
1
Ranks are from 1 (best) to 3 (worst)
References 1. Boehnke M, Cox NJ (1997) Accurate inference of relationships in sib-pair linkage studies. Am J Hum Genet 61(2):423–429 2. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1(3):e32 3. Thornton T, McPeek MS (2010) Roadtrips: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet 86 (2):172–184 4. Goring HH, Ott J (1997) Relationship estimation in affected sib pair analysis of late-onset diseases. Eur J Hum Genet 5(2):69–77 5. O’Connell JR, Weeks DE (1998) Pedcheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 63(1):259–266 6. Ehm M, Wagner M (1998) A test statistic to detect errors in sib-pair relationships. Am J Hum Genet 62(1):181–188 7. McPeek MS, Sun L (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet 66 (3):1076–1094 8. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2001) Grr: graphical representation of relationship errors. Bioinformatics 17 (8):742–743 9. Sieberts SK, Wijsman EM, Thompson EA (2002) Relationship inference from trios of individuals, in the presence of typing error. Am J Hum Genet 70(1):170–180
10. Sun L, Wilder K, McPeek MS (2002) Enhanced pedigree error detection. Hum Hered 54(2):99–110 11. Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77:2257–2286 12. Sun L (2001) Two statistical problems in human genetics. PhD thesis, University of Chicago 13. Thompson EA (1975) The estimation of pairwise relationships. Ann Hum Genet 39 (2):173–188 14. Thompson EA (1986) Pedigree analysis in human genetics. The Johns Hopkins University Press, Baltimore 15. Dimitromanolakis A, Paterson AD, Sun L (2009) Accurate IBD inference identifies cryptic relatedness in 9 hapmap populations. Abstract no. 1768 16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575 17. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 57:289–300 18. Craiu RV, Sun L (2008) Choosing the lesser evil: trade-o_ between false discovery rate and non-discovery rate. Statistica Sinica 18:861–879
46
L. Sun
19. Sun L, Craiu RV, Paterson AD and Bull SB (2006) Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genet Epidemiol 30:519–530 20. Begleiter H, Reich T, Nurnberger JJ, Li TK, Conneally PM, Edenberg H, Crowe R, Kuperman S, Schuckit M, Bloom F, Hesselbrock V, Porjesz B, Cloninger CR, Rice J, Goate A (1999) Description of the genetic analysis workshop 11 collaborative study on the genetics of alcoholism. Genet Epidemiol 17 Suppl 1:S25–30 21. Antoni G, Morange P, Luo Y, Saut N, Burgos G, Heath S, Germain M, Biron-Andreani C, Schved J, Pernod G, Galan P, Zelenika D, Alessi M, Drouet L, Visvikis-Siest S, Wells P, Lathrop M, Emmerich J, Tregouet D, Gagnon F (2010) A multi-stage multi-design strategy provides strong evidence that the bai3 locus is associated with early-onset venous thromboembolism. J Thromb Haemost DOI 10.1111/j.15387836.2010.04092.x
22. Epstein MP, Duren WL, Boehnke M (2000) Improved inference of relationship for pairs of individuals. Am J Hum Genet 67 (5):1219–1231 23. McPeek MS (2002) Inference on pedigree structure from genome screen data. Statistica Sinica 12:311–335 24. Sun L, Abney M, McPeek MS (2001) Detection of mis-specified relationships in inbred and outbred pedigrees. Genet Epidemiol 21 Suppl 1:S36–41 25. Broman KW, Weber JL (1998) Estimation of pairwise relationships in the presence of genotyping errors. Am J Hum Genet 63 (5):1563–1564 26. Ray A, Weeks DE (2008) Relationship uncertainty linkage statistics (ruls): affected relative pair statistics that model relationship uncertainty. Genet Epidemiol 32(4):313–324
Chapter 4 Identifying Cryptic Relationships Lei Sun and Apostolos Dimitromanolakis Abstract Cryptic relationships such as first-degree relatives often appear in studies that collect population samples such as the case–control genome-wide association studies (GWAS). Cryptic relatedness not only creates increased type 1 error rate but also affects other aspects of GWAS, such as population stratification via principal component analysis. Here we discuss two effective methods, as implemented in PREST and PLINK, to detect and correct for the problem of cryptic relatedness using high-throughput SNP data collected from GWAS or next-generation sequencing (NGS) experiments. We provide the analytical and practical details involved using three application examples. Key words: Cryptic relatedness, Pedigree error, Relationship estimation, IBD, IBS, IIS, Likelihood, EM algorithm, Method-of-moments, Software, PREST, PREST-plus, PLINK, GWAS, Sequencing
1. Introduction Data quality control and error detection is paramount in ensuring valid and powerful gene discoveries and the success of replication studies. Apart from quality control of the genotypes, it is important to specify correct genealogical relationships for all individuals involved in a study. In the context of high-throughput genome scans such as genome-wide association studies (GWAS), it is common to collect putative unrelated case–control samples. However, cryptic relationships, such as duplicated samples or first-degree relatives, often appear in a study. In this chapter, we discuss effective methods to detect and correct for the problem of cryptic relatedness using high-throughput genome-wide scan data collected from GWAS or next-generation sequencing (NGS) experiments. GWAS form a new paradigm that provides an agnostic search of the whole human genome by looking for association between a trait and a vast number of SNPs that tag a large proportion of the
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_4, # Springer Science+Business Media, LLC 2012
47
48
L. Sun and A. Dimitromanolakis
genome (see Chapter 18 for more details on GWAS). Modern GWAS collect thousands of case–control samples genotyped on half-million or more SNPs. Prior to the actual association analysis, data quality control must be performed on both the SNPs and samples, which includes, among other things, genotype quality and cryptic relatedness. The latter turns out to be a crucial element for a number of reasons. First and foremost, related individuals within GWAS can lead researchers to false association results, because genotypes of different samples are no longer independent of each other as assumed by the standard case–control association analysis. Cryptic relationship creates increased type 1 error rate and overall inflation of the significance of P -values (1), unless a correct correlation structure among individuals is specified in the association model (2). Cryptic relatedness also affects other aspects of the data quality control, such as population stratification via principal component analysis (see Chapter 21 and (3) for more details on the topic of population stratification and association analysis). In this case, cryptic first-degree relatives, for example, could completely alter the population stratification results, in which the most important principal component partitions samples based on different degrees of relatedness rather than different populations. Unfortunately, this type of error is not academic and occurs frequently in practice. In fact, the GWAS literature so far shows that there are not many GWAS that did not have the cryptic relatedness issue. It is even more important to pay attention to this type of error when the GWAS design is moving from using thousands of samples to using tens of thousands of individuals, since the probability of sampling related individuals is then likely to increase dramatically, particularly among the cases. Another source for cryptic relatedness is duplication of samples, which can occur in the lab during the genotyping process. This is an extreme case where the two samples are in effect having a relationship equivalent to monozygotic twins. Some GWAS and the emerging high-through sequencing studies also collect family data, for example, to improve power by enriching the occurrence of rare variants. Because cryptic relatedness can also occur between putatively independent families, identifying cryptic relationship is not limited to GWAS with case–control samples, just as detecting pedigree errors is not limited to data collected for linkage analysis. The methods described in Chapter 3 and implemented in PREST (i.e., the three allele-sharing and the likelihood-based statistics and relationship estimation) were developed for general relationship types, including putatively unrelated individuals, and so they are suitable for identifying cryptic relationships (4). Application of these methods to high-throughput GWAS or sequencing
4
Identifying Cryptic Relationships
49
data, however, requires additional care. To perform formal relationship hypothesis testing using genotype data, linkage disequilibrium (LD) between SNPs must be modeled in the variance and likelihood calculation. This approach utilizes all available data but considerably increases the sensitivity of the methods to the assumption that map and allele frequencies were accurately estimated, and it also significantly increases the computational burden. Alternatively, the most informative SNPs that are also in linkage equilibrium can be selected so that existing methods are directly applicable. In practice, relationship estimation alone, based on the estimates of the identity by descent (IBD) distribution using all or partial GWAS or NGS SNP data, can often provide sufficient information to identify cryptic relationships such as first-degree relatives (5). In the following, we first briefly review the relationship estimation methods implemented in PREST (5, 6) and PLINK (7). In Subheading 2, we compare the two methods using the HapMap (8) phase III GWAS data (May 2010 release) and provide implementation and application details (see also Notes 1 and 2). 1.1. Measures of Relatedness
The relatedness between a pair of individuals can be summarized by their mean IBD probability distribution, p ¼ ðp0 ; p1 ; p2 Þ, as defined in Chapter 3. It describes the probability of a randomly sampled marker to have 0, 1, or 2 common ancestry alleles between the two individuals. The kinship coefficient f is an alternative measure of relatedness, which is also commonly employed. It provides the probability of two randomly sampled alleles from a pair of individuals, at a randomly sampled marker, to have the same common ancestry. Therefore, the following relationship holds between the IBD probabilities and the kinship coefficient: f ¼ 0:25p1 þ 0:5p2 : Table 1 in Chapter 3 summarizes the IBD distributions and kinship coefficients for the most commonly encountered human relationships. Although these summary statistics do not fully determine a relationship type (e.g., half-sib, grandparent–grandchild, and avuncular), the estimates of these statistics could be used to identify cryptic relatedness, particularly when the true relationship is first-degree or second-degree (e.g., the first seven relationships in Table 1 of Chapter 3 that have f 0.125).
1.2. Relationship Estimation Methods Implemented in PREST and PLINK
There are two commonly employed methods that use genomewide scan data to estimate the IBD distribution, p ¼ ðp0 ; p1 ; p2 Þ. One is the maximum likelihood-based method (see Chapter 3) and implemented in PREST (5, 6). The other is the method-ofmoments approach implemented in PLINK (7). The likelihoodbased method is more powerful than the method-of-moments method, but there is a trade-off between statistical power and computational efficiency. With the increasing availability of raw
50
L. Sun and A. Dimitromanolakis
Table 1 Advantages and disadvantages of PREST and PLINK, the two computer programs commonly used for IBD estimation PREST vs. PLINK for identifying cryptic relationships Computation
PREST is more computationally intensive but can be easily parallelized in a computing cluster
Interface
Both are user friendly and use the SAME input files, but PLINK is more familiar to many users
Relationship estimation
PREST provides considerably more accurate IBD estimates, particularly when the number of markers is limited (e.g., 0.15 and genotype call rate >0.97, and –indeppairwise 50 5 0.2 performs a pair-wise LD pruning of the SNPs so that values of r2 between the remaining SNPs are less than 0.2. Level 3 is used only if a formal relationship hypothesis testing is desired. In that case, only SNPs that are in linkage equilibrium are used and a bp map must be converted to a cM map.
2.4. Application 1: Identifying Cryptic Relatedness Across Pedigrees with GenomeWide Low-Density Microsatellite Data
In this application, we revisit the COGA data (9) discussed in Subheading 2.4 of Chapter 3. Briefly, the data consist of 105 families and 1,214 individuals in total, among which 992 were genotyped at 285 autosomal microsatellite markers. To identify cryptic relatedness between families, we can first obtain the estimates of the IBD distribution with the PREST command line: prest –file coga.ped –map coga.map –aped –out coga.aped.option0
54
L. Sun and A. Dimitromanolakis
Fig. 2. Relationship estimation for COGA between-family pairs via PREST as in Subheading 2.4. The plot shows the estimated IBD distribution, ^ p1 vs. ^ p0 , for the 483,571 pairs having the hypothesized relationship, R0, of UNrelated. The estimates are based on genotypes of 100–285 microsatellite markers. The cross marks the expected IBD distribution for R0, i.e., p ¼ ðp0 ; p1 ; p2 Þ ¼ ð1; 0; 0Þ.
In total, 486,036 putatively unrelated pairs (across-pedigree pairs, two individuals of a pair must have different family IDs; see Chapter 3 for results of the within-pedigree analysis) were analyzed, of which 483,571 had 100 markers genotyped in common (Fig. 2). Note that PLINK cannot be applied to microsatellite markers and the IBD estimates via the method-of-moments are not reliable when only a few thousand markers are available. There are some clear outliers, indicating cryptic relationships. For example, there are seven pairs with ^p0 install.packages("genetics") > library("genetics") 2. Read the data for the 18 genetic markers (see Note 4), and create genotype object using function genotype. Since the genotype function only works for a single marker, we use a marker in Table 3 as an example with genotypic counts for AA, Aa, and aa of 5, 185, and 810, respectively. > allmarker onemarker genodata # to obtain chi-square test statistics without continuity correction and P -value based on simulation one needs to run default setting of the function > t_chisq # to obtain asymptotic chi-square test statistics with continuity correction and associated P -value one needs to run the following script > t_chisq # to perform asymptotic chi-square test and associated P -value without continuity correction > t_chisq t_exact genocounts p_exact p_exactR>PLINK if the exact test is performed (the asymptotic
98
J. Wang and S. Shete
chi-square test is always quick). Using 100,000 permutations, SAS needs approximately 4–5 min to complete the analysis, R package “genetics” needs about 2–3 s, and PLINK only needs about 0.03 s. Therefore, the time it takes to perform the exact Hardy–Weinberg proportion test for SNPs in a candidate region or at the genomewide level is less than a day. Given the liberal nature of the asymptotic-based chi-square test, we recommend that the exact test be performed routinely. 2.4. Other Software
Many other software/programs are also useful for testing the departure from Hardy–Weinberg proportion: SNP-HWE (http://www.sph.umich.edu/csg/abecasis/Exact/) (17) HWtest (http://www.mathworks.com/matlabcentral/fileexchange/ 14425-hwtest) (78) Haploview (http://www.broadinstitute.org/mpg/haploview/) (17, 79) TFPGA (http://www.marksgeneticsoftware.net/)
3. Notes 1. Li and Leal (80) studied the departure from Hardy–Weinberg equilibrium in a family-based study (i.e., parental and unaffected sibling genotype data). They found that the pattern of departure from Hardy–Weinberg equilibrium is different in different groups of individuals, such as the parent group, affected proband group, and unaffected sibling group. 2. The number of mixture samples L can be decided by conducting simulations. For example, given a data set, one can use different numbers of L to evaluate the empirical distribution of P -values and the maximum likelihood estimator. If the empirical distribution and the value of the maximum likelihood estimator are approaching stability when L is greater than some number, one can use this number or a greater number in the analysis. 3. We tried two different numbers of permutations, PERMS ¼ 10,000 and 100,000. In both cases, the exact P -values show some variation if we conduct the ALLELE procedure multiple times without the SEED ¼ option. If the multiple tests are conducted using a fixed random seed number, the exact same results can be replicated. The variations of exact P -values are larger with PERMS ¼ 10,000 than with PERMS ¼ 100,000. These variations might not have a significant impact on the conclusions from the exact test (Hardy–Weinberg proportion test is significant or nonsignificant), but we still recommend more permutations for accurate results, if it is feasible.
6 Testing Departure from Hardy–Weinberg Proportions
99
4. The input data are in a format of columns of genotype pairs. One can use different delimiters to separate two alleles, such as “A/a” and “A-a,” or use no delimiter between two alleles, such as “Aa.” When creating genotypes using genotype function, the delimiter needs to be specified in the function with the option sep¼"" (by default, sep¼"/"). One can also use 0, 1, or 2 to represent genotypes aa, Aa, or AA, and then use function as.genotype.allele.count to convert them to genotype pairs A/A, A/a, and a/a. If only the genotypic counts are available, one can also create the genotype data and then apply the genotype function: > genocounts data genodata > < minðp :q ; q :p Þ if DAB >0 A B A B D0 ¼ : (3) p :p p :p21 > 11 22 12 > if DAB 1,000 kb apart, and we ignore the option that excludes individuals. After clicking “OK”, four flags may be chosen: “LD Plot”, “Haplotypes”, “Check Markers”, and “Tagger”. If “LD Plot” is chosen, the LD plot can be saved, e.g., as a PNG file. Choose “File”, then “Export Data”, as Output Format we choose “PNG Image” and we restrict the PNG Image to Markers “1–41”. Afterwards we click on “OK” and save the image wherever we want.
Fig. 1. Illustrative linkage disequilibrium plot generated with Haploview.
7 Estimating Disequilibrium Coefficients 111
112
M. Vens and A. Ziegler
The result is displayed in Fig. 1. It is possible to use different color schemes in Haploview. To change the color schemes choose “Display” and “LD color scheme”. We want to change the color scheme to “R-squared”. Afterwards we want to show other LD values. Choose again “Display”, then “Show LD values” and change to “R-squared”. We saved this image analogously to the procedure described above. The result is shown in Fig. 2. It is possible in Haploview to get detailed information about the relationship of two SNPs by doing a right-click on the square belonging to the SNPs. By doing so, you get the following information: distance in kb, D0 , LOD, D2, confidence bounds for D0, and haplotype frequencies. 2.3. Further Software Implementations
There are different packages for estimation or visualization of LD available. The package GENASSOC is based on STATA/SE and provides a variety of allelic LD measures (http://www-gene.cimr.cam.ac.uk/ clayton/software/stata/). Several R libraries are available for the estimation of LD. A commonly used library is GenABEL (28). A further possibility to plot the estimated LD structure in heat maps is GOLD (29). LocusZoom is a plotting tool which displays, e.g., regional information relative to LD (30).
3. Notes 1. The measure DAB is the most straightforward measure of LD. However, it depends on allele frequencies. Its maximum range is 0.25 to 0.25 when allele frequencies are 0.5. This range of values of DAB is even more restricted if allele frequencies become closer to 0 or 1 (31). The values of DAB depends on the allele frequencies at both marker loci (2). It can be shown that maxð pA pB ; qA qB Þ DAB minðpA qB ; qA pB Þ: DAB is comparable between two pairs of SNPs only if the allele frequencies of the SNPs are similar. This limits the practical use of this measure. To overcome this limitation, several standardizations of DAB have been suggested (32) (see above). All of them have in common that DAB is in the numerator (see Eqs. 2–5, and 7). b AB of DAB has expected value and large The estimator D sample variance (14)
Fig. 2. Illustrative linkage disequilibrium plot generated with Haploview using the “R-squared” “LD color scheme”.
7 Estimating Disequilibrium Coefficients 113
114
M. Vens and A. Ziegler
b AB ¼ 2n 1 DAB E D 2n b AB ¼ 1 pA qA pB qB þ ð1 Var D 2n
2pA Þð1
2pb ÞDAB
D2AB ;
d D b AB Þ for and it is asymptotically unbiased. An estimator for Varð b AB Þ is obtained by replacing the allele frequencies and VarðD DAB by their estimators. It is possible to use DAB to test for LD using the null hypothesis H0 : DAB ¼ 0 versus H1 : DAB 6¼ 0. Statistical tests can be constructed as score tests (2): b AB b AB EH D D b AB 0 D TS ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 ^ ^ ^ ^ b AB d H0 D Var 2n ðpA q A pB q B Þ
or Wald tests: TW
b AB b AB EH0 D D b AB D ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi : ffi ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi d D b AB d D b AB Var Var
Both test statistics are asymptotically standard normally distributed under H0. Asymptotically valid (1 a)-confidence intervals for LD can thus be constructed as rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi b d b DAB z1 a=2 Var DAB :
2. Sometimes D is also termed r or r. D is usually squared to avoid the arbitrary sign which is introduced when the alleles are labeled. D is also known as the Pearson’s product moment correlation coefficient, and D2 is often termed the coefficient of determination. The definition of D2 can be understood by considering the alleles as realizations of two binary random variables with values 0 and 1 among which the correlation coefficient is estimated. D2 ranges between 0 and 1. It is equal to one only if two entries of Table 2 are equal to 0, and the LD is said to be total (or perfect) in this case. This measure is preferred whenever the focus is on the predictability of one polymorphism given the other. Thus, it is often used in power studies for association designs (33); for properties of D, see VanLiere and Rosenberg (34). 3. D0 ranges between 1 and 1. If the denominator of D0 equals 1, the LD is said to be complete. To achieve this property, a single cell in Table 2 needs to equal 0. D0 is preferred to assess recombination patterns. Haplotypes have often been defined using D0 (33). If an SNP has only a low minor allele frequency, it is possible that the corresponding rare haplotype is not observed in a small study population. This leads to a Lewontin’s D0 equal
7 Estimating Disequilibrium Coefficients
115
to 1, independent of the true level of LD (35, 36). Thus, the values of D0 are inflated in the presence of rare alleles. To avoid these spurious results, empirical confidence intervals for D0 should be estimated using resampling schemes (37). Lewontin’s D0 is sometimes rewritten as DAB Dmax with Dmax ¼ minðpA qB ; qA pB Þ if DAB >0 and Dmax ¼ maxð pa pB ; qA qB Þ if DAB 1 would indicate an association of positive family history with disease, while an OR ¼ 1 would indicate no association. An OR < 1 would indicate an excess positive family history among those without disease (i.e., controls). As it is expected that (1) the frequency of FH + among controls should approximate the prevalence of disease in the general population and (2) the prevalence of affected family members with disease among cases should be no less than the population prevalence, observing OR < 1 would suggest the need to reevaluate the study design for the presence of factors that may bias this association.
124
A.C. Naj et al.
Table 2 Distribution of family history of psoriasis among acute guttate psoriasis cases and unrelated controls (adapted from Naldi et al. (6)) Trait
Cases, No. (%)
Controls, No. (%)
Chi-square test for homogeneitya (P value)
Family history of psoriasis (parents or siblings)
a
No
49 (67.1)
397 (92.3)
Yes
24 (32.9)
33 (7.7)
36.98 (0.000)
The number of degrees of freedom is equal to the number of categories, 1
An example estimating association of positive family history with disease is explored in a study by Naldi et al. (6), which examined several risk factors for acute guttate psoriasis in 73 cases and 430 controls. As demonstrated in Table 2, 24 of 73 cases (32.9%) had a family history of psoriasis, whereas only 33 of 430 controls (7.7%) had a family history of this condition. Assuming there are no sources of other bias that need to be accounted for, the unadjusted OR and 95% confidence interval (CI) for odds of psoriasis from a history of psoriasis in parents or siblings were estimated as 5.89 (3.22, 10.8). In Subheadings 2.1 and 2.2, we provide a step-by-step description of how to calculate the odds ratio and the confidence interval using the R statistical software package (http://cran.r-project.org). In this study, the investigators identified several potential confounding variables (age, sex, body mass index, smoking, and alcohol consumption) to adjust for in estimating this OR, and thus chose more complex approaches to report the adjusted OR estimates, some of which are described in more detail subsequently. 1.1.2. Measures of FH-Disease Association with Cochran–Mantel–Haenszel Adjustment for Potential Confounders
If any variables are recognized as potential confounders of observed relationship between FH and disease, these may be accounted for in the design stage of the study, using approaches such as frequency matching or individually matching cases and controls, or restricting ascertainment to individuals meeting certain criteria (though this may reduce the case–control sample’s overall representativeness of the general population). If these potentially biasing factors are frequent enough in the population from which the cases and controls are sampled, stratifying cases and controls into groups with similar values for these variables (Fig. 1b) may allow for their effects to be adjusted out using various procedures, such as the Cochran–Mantel–Haenszel (CMH) approach (7). Stratification may also permit a cursory
8 Detecting Familial Aggregation
125
examination of whether these potential confounders modify the effects of FH on risk of disease. If a trait has K states, the within-stratum CMH OR for comparing positive family history with affection status would be defined for each stratum i ¼ 1, . . ., K as: ORi ¼
PrðFH þ jD; ith stratumÞ= PrðFH ith stratumÞ= PrðFH PrðFH þ jD;
jD; ith stratumÞ ith stratumÞ : jD; (2)
Using the cell counts in Fig. 1b, we note for stratum i, each of these probabilities is defined as: ai PrðFH ai þ bi bi ¼ ai þ bi ci PrðFH þ jD; ith stratumÞ ¼ PrðFH ci þ di di : ¼ ci þ di
PrðFH þ jD; ith stratumÞ ¼
jD; ith stratumÞ
ith stratumÞ jD;
(3) Restating the ORi in these terms, we have ORi ¼
ðai =ai þ bi Þ=ðbi =ai þ bi Þ ai di ¼ : ðci =ci þ di Þ=ðdi =ci þ di Þ bi ci
(4)
To estimate an average effect across all K strata weighted by the number of individuals in each stratum, the across-stratum CMH OR would be defined as: PK ai di =ni ORCMH ¼ Pi¼1 ; (5) K i¼1 bi ci =ni
where ni is the total sample size of the ith stratum. CMH estimation of ORs and other measures of association are widely implemented in most statistical software packages (see Subheading 2.4). A study by Ponz de Leon and colleagues (8) examining familial aggregation of tumor development among individuals in a colon cancer registry used a modified version of this approach to estimate the OR for tumor development among first-degree relatives of cases and age- and sex-matched controls. The investigators parsed first-degree relatives into 389 strata (K ¼ 389), each stratum containing the relatives of one case and the relatives of their matched control to adjust the measured association for the effects of age and sex. Within stratum i, for example, the probability of a positive family history for case i, Pr(FH + |D, ith stratum), would be estimated as the fraction of person i’s first-degree relatives with tumors out of the total number of their first-degree relatives at risk for tumor development, for example; all other within-stratum
126
A.C. Naj et al.
probabilities would be estimated similarly. Using this approach, these authors observed a CMH OR ¼ 7.5 for tumor development (P < 0.001), indicating a strong excess of tumors among firstdegree relatives of colon cancer patients. The functions available in the software packages R, SAS, and Stata to calculate the CMH odds ratio are outlined in Subheading 2.5. 1.1.3. Logistic Regression Approaches to Measuring FH-Disease Association with Adjustment for Potential Confounders
Several approaches for implementing general linear models to measure association of disease among unrelated cases and controls with family history of disease in their relatives have been explored. Hopper et al. (9) proposed examining familial binary trait data using a restricted class of log-linear models, within which the modeling of a first-order interaction term can be interpreted as the conditional log odds ratio measuring familial aggregation. An approach similarly using log-linear models was developed and described by Connolly and Liang (10). Subsequently, Hopper and Derrick (11) demonstrated how data from families of different sizes and structure can be used to estimate within-class and betweenclass “correlations” for binary data as functions of their odds ratios, under the assumption that no second-order or higher-level interactions would be included in their log-linear model. Unlike conventional epidemiologic risk factors for disease, family history is not a physical characteristic particular to individuals being studied, but depends on factors that may be entirely unrelated to disease, such as number of family members, types of relationship to the proband, age distribution of the family members, and population prevalence of the disease (12). Because of this, logistic regression approaches modified to account for both withinfamily and among-family effects, such as the approach described by Liang and Beaty (13), are warranted. Under this approach, a pedigree with n members, with Yj denoting the binary outcome of the jth individual, we have: log
PrðYj ¼ 1Þ ¼ b0 þ b1 x1j þ þ bp xpj ¼ bt xj ; PrðYj ¼ 0Þ
(6)
where xj is a p 1 column vector of covariates thought to be associated with risk of disease through bt, a 1 p row vector. These covariates may include demographic variables (such as sex and age), environmental factors (e.g., smoking habits), as well as variables specific to the entire family (such as race/ethnicity and markers of socioeconomic status, like family income). Under this logistic regression model, a common risk among all family members need not be assumed. Following on this, we can define the odds ratio between the jth and kth members of the family as: OR jk ¼
PrðYj ¼ 1; Yk ¼ 1Þ PrðYj ¼ 0; Yk ¼ 0Þ : PrðYj ¼ 1; Yk ¼ 0Þ PrðYj ¼ 0; Yk ¼ 1Þ
(7)
8 Detecting Familial Aggregation
127
The odds ratio is the most common measure of association between categorical variables and ranges from zero to infinity (14). If this odds ratio is truly equal to one, there is no association between pairs of relatives. In our context, OR ¼ 1 for each of the n pairs of relatives indicates no genetic contribution to disease 2 risk. On the other hand, OR > 1 would reflect familial aggregation of risk. To complete the modeling, we assume log OR jk ¼ gtz ;
(8)
where z, a q 1 vector, includes variables specific to the family and variables indicating the status of the (j, k) pair in the family. Thus, in families including all first-degree relatives, we may assume that log ORjk ¼ gss, gpp, or gps depending on whether the (j, k) pair are siblings, parents, or a parent–sibling pair. This equation introduces two intraclass odds ratios for familial aggregation, namely egss and egpp , and one interclass parameter, egps . The model characterized in Eq. 8 is extensible to examine additional hypotheses, for instance, differences in familial aggregation by race/ethnic group. A familial aggregation study including both white and black families might model log OR jk ¼ g0 þ g1 ðRaceÞ, where Race ¼ 1 for white families and Race ¼ 0 for black families, and a significant ^g1 term would lead to the conclusion that the degree of familial aggregation (as measured by this odds ratio) may differ between black and white families. 1.1.4. Extending Modeling of Family History as a Covariate in Regression Approaches
An alternative approach to addressing concerns about variable family sizes, age distribution and biologic relationships across cases and controls is to create for each family a family history score (FHS) (12). This is accomplished by first deriving for each case or control the expected number of affected relatives based on person-time at risk, i.e., E¼
n X
Ej ;
(9)
j ¼1
where Ej is the expected risk to the jth relative based on risks in the general population considering age, gender, and other demographic variables, such as birth year. A FHS is then defined as ðO E Þ ; E 1=2
(10)
the “standardized” version of O, the observed number of affected among n relatives. Consider a 1980 genetic epidemiologic study by Cohen examining chronic obstructive pulmonary disease (COPD) (15), which sampled 105 cases and 79 controls and their families. First-degree relatives of case and control probands were directly examined for
128
A.C. Naj et al.
Table 3 Affection status for chronic obstructive pulmonary disorder (COPD) among relatives of COPD case and control probands (adapted from Liang and Beaty (64)) Disease status Affected
Unaffected
Case relative
71
173
244
Control relative
29
134
163
pulmonary function in this study, and Table 3 shows the number of affected relatives by case–control status. In this study, FHSs were estimated for cases and controls, adjusting the expected number of affected first-degree relatives for age, sex, and birth year. The average FHS among cases was 0.383 [standard error (s.e.) ¼ 0.117], while the average FHS among controls was 0.006 (s.e. ¼ 0.107) (data not shown here). FHS is a Poisson variable, where the mean and variance are the same, and standard error is estimated accordingly (16). A simple t-test revealed a significant difference in FHS between cases and controls, suggesting that case relatives were at excess risk compared to control relatives, as measured by FHS. 1.1.5. Accounting for Nongenetic Contributors to Excess Family History of Disease
Family history (FH) as a variable can be subject to misclassification. For example, even when the disease has no genetic etiology, the probability of a case having a positive family history is a function of his/her total number of first-degree relatives n 1
ð1
pÞn ;
(11)
where p is the disease prevalence. With p ¼ 0.05, this proportion would be 0.19 when n ¼ 4 and increases to 0.34 when n ¼ 8. Meanwhile, a proportion of 0.34 is expected if the disease prevalence is 0.10 instead of 0.05, even for a constant family size of n ¼ 4. Thus, if the distribution of family sizes differs substantially between cases and controls, this odds ratio based on FH can be quite misleading. Another concern about the use of this simple family history variable is the potential bias of information or recall. Depending on biologic relationship, number of relatives and other factors, the true disease status of relatives may be misreported, leading to further misclassification of FH. Unlike the previous concerns about family size, however, the degree of recall bias may often differ between cases and controls. Consequently, the estimated odds ratio could either be attenuated or inflated, and the magnitude of this discrepancy can be substantial (12). This concern
8 Detecting Familial Aggregation
129
about potential recall bias may be alleviated, to some extent, by carefully choosing informants from whom to gather FH information. For example, rather than only interviewing cases (or controls), one may instead consider interviewing parents or spouses as informants (multiple informants should also help). Indeed, in the situation where the cases (or controls) are deceased, such an alternative will become a necessity. 1.1.6. Modifications of Logistic Regression Approaches
Testing gene–environment interactions. Case–control designs can also be utilized to test for interactions between genes and environmental risk factors. In the absence of knowledge regarding specific susceptibility genes, one may use either FH or FHS as a surrogate measure of “genetic loading,” or one can use markers in candidate genes. To test the hypothesis of interaction between environmental factors and such factors as family history (FH or FHS), one can consider the following logistic regression model, (17): logit PrðDjFHðSÞ; ENVÞ ¼ a þ b1 FHðSÞ þ b2 ENV þ b3 ½FHðSÞ ENV;
(12)
where ENV stands for an observed environmental variable such as maternal smoking, etc. One can test the hypothesis of no interaction by examining the magnitude and sign of b3, the coefficient for the interaction term. It is worth noting that this interaction, if it exists, is modeled in a multiplicative fashion. Specifically, when comparing two individuals who differ by one unit in ENV, E + 1 versus E say, the odds ratio relating disease risk to ENV for those with positive family history (FH+) is eb3 times that of those without family history (FH ), i.e., OR FHþ ðE þ 1; EÞ ¼ eb2 þb3 and OR FH ðE þ 1; EÞ ¼ eb2 :
(13)
In the situation where the risk factor is a marker in a candidate gene, one can test for interaction between marker genotype and exposure to the environmental risk factor by applying Eq. 13 with FH(S) replaced by a dichotomous variable, GEN, which is 1 if the case (or control) carries the targeted allele(s) and 0 otherwise. As an illustration, consider a case–control study on oral clefts in which 333 children born with oral clefts and 166 healthy infants were sampled (18). A main objective of the study was to test for association between risk of being a case and genotype at a candidate gene, transforming factor alpha locus (TGFA), and the possible gene– environment interaction between TGFA and maternal smoking (MS). These data are summarized in Table 4 in two 2 2 tables denotes the stratified by the maternal smoking status; here G (G) presence(absence) of a C2 allele at the taq I polymorphism in TGFA. There was little evidence of association between oral cleft and TGFA whether the mother smoked or not (OR ¼ 1:05 and 1:07; respectively). The logistic regression model was fitted including an
130
A.C. Naj et al.
Table 4 Distribution of presence (G) or absence (G ) of the C2 allele at the transforming factor alpha locus (TGFA) by oral cleft case (D )/control (D ) status, stratifying by maternal smoking status (yes/no) (adapted from Liang and Beaty (64))
Maternal smoking D D No maternal smoking D D
G
G
28 5 33
80 15 95
108 20 128
31 19 50
194 127 321
225 146 371
interaction term (as in Eq. 12 above) with the addition of maternal age (MA) as another covariate. Results were very similar to that shown above, as the odds ratio relating TGFA to the risk for oral cleft was estimated at 1.00 (¼e0.0014) for children whose mothers did not smoke and as 1.09 (¼e0.0014 + 0.088) for children whose mothers did smoke. Thus, these data provided little evidence of interaction between TGFA and maternal smoking regarding risk for oral cleft. Subheading 2.3 describes how to use the R software package to fit the model given by Eq. 12 that involves an interaction term. Lung disease provides an example of a complex disease in which genetic effects can be understood only when both genetic and environmental contributions are considered (reviewed in detail elsewhere (19)). Environmental factors such as cigarette smoking may modify genetic effects in disease pathogenesis in diseases such as COPD, when genetically susceptible individuals are exposed chronically or to high concentrations of these factors (20). Studies have shown that even for a form of COPD under monogenic control, a 1 antitrypsin (AAT) deficiency, there appears to be significant variation among family members in pulmonary function (21), suggesting the importance of environmental factors as modifiers of risk (19). Pare and colleagues (22) have recently proposed a new method called “variance prioritization,” which prioritizes the genetic markers examined to facilitate subsequent tests of gene–gene and gene– environment interactions. Variance prioritization applies Levene’s test of variance equality to select SNPs under a predefined threshold and uses standard regression models to study them further for interaction effects (22). Similar efforts have been undertaken by others,
8 Detecting Familial Aggregation
131
who have noted the importance of evaluating gene–environment interaction effects in addition to composite genetic effects (23). Finally, sample size calculations have been developed to detect gene–environment interactions in unmatched case–control studies (24–26). Sturmer and Brenner (27) pointed out that considerable gain in power for testing interactions may be achieved by matching (frequency or individual) on the environmental factor in the design stage, especially if the environmental factors are rare in the population. Such a gain in statistical power may be offset by the extra difficulty in identifying matched controls, for the very reason of low prevalence of environmental factors. Sturmer and Brenner (27) suggest the balance between power gain and extra costs to achieve matching must take into account the specific research questions and surrounding circumstances. Statistical software programs such as QUANTO (28) allow for the consideration of prevalence of environmental factors, varying allelic and genotypic frequencies, and modeling of different statistical tests to determine how much power is available for tests of gene–environment interaction under a varying set of conditions. Identifying homogeneous subgroups. Under the general framework of case–control designs, one can test for homogeneity among different subgroups of cases. Assuming a hypothesized subtyping variable (such as early versus late onset of the disease) is available, the following polytomous logistic regression model can be fitted: log
PrðY ¼ j jFHðSÞÞ ¼ aj þ bj FHðSÞ; PrðY ¼ 0jFHðSÞÞ
j ¼ 1; . . . ; C;
(14)
where Y is a categorical representation of C + 1 categories with Y ¼ 0 for controls and Y ¼ j for cases of the particular subtype j, j ¼ 1, . . ., C. For the onset age example, C would be 2 with Y ¼ 2 (1) if the case is diagnosed with late (early) onset. If the regression coefficients (bj s) are significantly different from one another, then more homogeneous subgroups can be identified and may be targeted for further investigation. Returning to the oral cleft study, it is also of interest to examine if the two anatomical subtypes of oral cleft, cleft palate only (CP), and cleft lip with/without palate (CLP) are heterogeneous etiologically. The 2 3 table in Table 5 shows the prevalence of carrying the C2 allele is highest among cases with CP (21.8%), followed by CLP cases (15.3%) and 14.5% among controls. To formally answer this question, consider the following polytomous logistic regression model log
PrðY ¼ jjTGFA; MS; MAÞ ¼ aj þ bj 1 TGFA þ bj 2 MS PrðY ¼ 0jTGFA, MS, MAÞ þ bj 3 TGFA MS þ bj 4 MA,
132
A.C. Naj et al.
Table 5 Distribution of presence (G) or absence (G ) of the C2 allele at the transforming factor alpha locus (TGFA) by oral cleft case (CP)/oral cleft lip/palate (CLP)/control (D ) status, stratifying by maternal smoking status (yes/no) (adapted from Liang and Beaty (64)) Oral cleft TGFA
CP
CLP
D
G
27
32
24
83
G
97
177
142
416
124
209
166
499
where Y ¼ 2(1) if the case had CP (CLP) and Y ¼ 0 for controls. It appeared children of nonsmoking mothers showed no association between TGFA and the risk of either type of oral cleft ^ ^ (eb11 ¼ 1:05 and eb21 ¼ 0:98 for CP and CLP, respectively), and hence there was very little evidence of heterogeneity in genotypic effect between these two types of cleft. However, for children whose mothers smoked, there is a stronger positive association ^ ^ between TGFA and risk of being born with CP (eb21 þb13 ¼ 1:87) compared to the association between TGFA and the risk of being ^ ^ born with CLP (eb21 þb23 ¼ 0:74). Although the difference between these two odds ratios is not statistically significant (at the conventional 0.05 level), these data raise the possibility that these two subtypes of oral cleft may have different genetic etiology with respect to TGFA, and perhaps this is modified by maternal smoking. In Subheading 2.4, we provide a step-by-step description of fitting a polytomous logistic regression model using the R software package. 1.2. Family-Based Approaches to Familial Aggregation of Binary Traits 1.2.1. Description
During the past decade, family-based designs have drawn a good deal of attention among researchers in genetic epidemiology to address issues considered here, see, for example, Claus et al. (29) and Mettlin et al. (30) for breast cancer, Pulver and Liang (31) for schizophrenia and, more recently, Nestadt et al. (2) for obsessive compulsive disorders. Specifically, with the consent of cases and controls, their relatives were recruited and interviewed directly for detailed evaluations about their disease status, laboratory assessments, and demographic and risk factor information relevant to
8 Detecting Familial Aggregation
133
Fig. 2. Epidemiologic 2 2 contingency tables for disease status among relatives of cases and controls data.
disease. For this design, the phrase “case (control) proband” has been coined for cases (controls) as they represent probands, or the individuals through which the family is ascertained. Just as in Subheading 2, the data can be summarized in a 2 2 contingency table (Fig. 2). The primary response in this design is the risk among relatives, known as the familial risk. Thus, familial aggregation of disease may be claimed if the risk among case relatives is substantially higher than that among control relatives. It is important to point out that a primary difference between the conventional case–control design and this family case–control design lies in the sampling unit (here the family) and on the quantity to be used for comparison. For the standard case–control design, the unit of sampling and analysis is an individual (i.e., the case and control). Characteristics (including family history) are compared between cases and controls. For the latter design, however, the unit becomes the family and characteristics of individual relatives, and disease status within families is compared between case families and control families. This distinction has profound implications on validity of statistical inferences and on practical implementation, as addressed below. 1.2.2. Matching Cases and Controls
Just as in conventional case–control studies, one has the choice of matching each case with a control or not. The matching criteria for family case–control designs should be subject to some modification, however. Recall a major principle behind matching is to assure, to the extent possible, that “units” to be compared are indeed comparable. Given the comparison is made between case relatives and control relatives, sampling a control comparable to a case may not be adequate for this purpose. Thus, for family case– control designs, matching case and control probands in the design stage may be warranted if the primary confounding variables are themselves highly familial. This would increase the likelihood that case relatives and matched control relatives are more comparable to each other, at least for the matching variables. Confounding variables that are not necessarily familial—for instance, sex—can be easily adjusted through regression. Whether matching or not, statistical methods that have been fully developed for the conventional case–control design, including the conditional logistic regression analysis for matched designs, are
134
A.C. Naj et al.
not adequate here. This is because the unit is a family and the response variables (i.e., affected status of relatives) are not statistically independent of each other. This additional analytical complication of within-family correlation in risk can be adjusted for by using the generalized estimating equation (GEE) method, which was specifically developed to consider dependence within clusters (2, 32). In a matched design, the method developed by Liang (33) could be used to address issues considered here to account for correlations among related individuals. These two methods have been successfully applied to some genetic studies that address the three issues discussed below. For a more detailed discussion of the utility of this design and the two analytical methods mentioned above for other questions of interest, and the issue of sample size calculations, see Liang and Pulver (34). Finally, for age of onset outcomes, methods for detecting familial aggregation have also been developed to take into account the within-family correlation complication (35–37). In Subheading 5, we outline the functions that are available in R, SAS, and Stata programs for analyzing data using the GEE method. 1.2.3. Detecting Familial Aggregation and Testing Gene–Environment Interactions
To see how the family case–control design may be used to more formally address the issue of familial aggregation, let Y ¼ (Y1, . . ., Yn) be the disease status of n relatives from a family ascertained through a proband (either a case or control). Consider for each relative j, j ¼ 1, . . ., n, logit PrðYj ¼ 1Þ ¼ a þ bt xj þ gz;
(15)
where z ¼ 1 (0) if the proband for this family was a case (control) and the xj covariates are specific to the jth relative or to the proband. The key parameter of interest is obviously g which corresponds to the log odds ratio from the 2 2 table displayed in Fig. 2 (ignoring covariates). In general, this parameter characterizes the overall difference in familial risk, i.e., Pr(Y ¼ 1), between case families and control families, but the logistic regression model allows for differential risk among individuals with different observed risk factor values, xj. Ignoring correlations in Y among related individuals would lead to incorrect estimates of the variance of these regression estimators in Eq. 15 above (38). This concern can be alleviated by adopting the GEE method for unmatched designs and the extended Mantel–Haenszel method by Liang (33) for matched designs. Returning to the COPD study, we recall that first-degree relatives of case and control probands were directly examined for pulmonary function in this study, and Table 3 shows the number of affected relatives by case/control status. Here 71 out of 244 case relatives were diagnosed with impaired pulmonary function (29%), whereas 29 of 163 control relatives experienced the same condition
8 Detecting Familial Aggregation
135
(18%). This led to an estimated log odds ratio of 0.64 (with s.e. ¼ 0.25), suggesting that the familial risk among case families is twice (e0.64) that of control families. This simple approach may be criticized on the grounds that the estimated standard errors may be too conservative, owing to their failure to account for within-family correlations in Y, and that important risk factors such as smoking status were not properly considered. To alleviate these concerns, a logistic regression model with and without the GEE method of adjusting for dependence within family was applied to these data, in models with covariate adjustment only for race (Model I). Compared with the non-GEE logistic regression analysis results, GEE results in Model I suggested corrected s.e.s for ^g were larger (s.e. ¼ 0.28), although the discrepancy (0.25 versus 0.28) was modest in this example. Model II adjusted for some individual risk factors, including smoking, age, etc. Comparing results from GEE and non-GEE analyses revealed Model II provided stronger evidence of familial aggregation after adjustment. Specifically, the familial risk in case families was now estimated to be 2.23 (¼e0.80) times that in control families. To test for gene–environment interaction under this family case–control framework, one could simply add in Eq. 15 interaction terms between a covariates (x) and z. A third model (Model III) tested for possible interaction between z and smoking status. While not statistically significant, results suggest the effect of smoking is considerably higher among case families (OR ¼ e0:91þ0:14 ¼ 2:86; 95% CI : 0:67; 12:13) than that among control families (OR ¼ e0:14 ¼ 1:15; 95%CI : 0:46; 2:89). 1.2.4. Searching for Homogeneous Subgroups
Family case–control designs are particularly useful in identifying subgroups that may be etiologically distinct, especially if observed variables for subtyping are clinical characters associated with disease. Examples of the use of subgrouping in studies include age at onset (early versus late) for breast cancer (29, 30) and schizophrenia (31). A more detailed example of subgrouping is of a matched family case–control study of patients diagnosed with congenital cardiovascular malformation (CCVM), where probands were categorized by the presence or absence of “flow lesion” defects (39). In the context of analyzing this dataset, if we wanted to account for the difference between probands with or without flow lesions, we could expand the logistic regression model in Eq. 14 of the previous section by allowing multiple zs, i.e., logit PrðYj ¼ 1Þ ¼ a þ bt xj þ g1 z1 þ þ gC zC ;
(16)
where C represents the number of subgroups among all case probands and controls serve as the reference group. For the CCVM example, C would involve two contrast variables with
136
A.C. Naj et al.
z1 ¼ z2 ¼
0 if the case proband has a flow lesion 1 otherwise
0 if the case proband does not a flow lesion ; 1 otherwise
to characterize three groups, case probands with flow lesions ðz1 ¼ 1; z2 ¼ 0Þ, case probands without flow lesions ðz1 ¼ 0; z2 ¼ 1Þ, and controls ðz1 ¼ 0; z2 ¼ 0Þ. This expanded model where zs identify subgroups having different patterns of familial risk could be quite useful, and presumably families with higher familial aggregation could be targeted first for further investigation. We now illustrate the use of the model in Eq. 16 and the statistical method developed by Liang (33) on data from the matched family case–control study of CCVM (39) mentioned in the previous section. Here each of 570 cases who have one or more full sibs (363 with flow lesion defects and 207 without) was matched with a control born within the same 30-day period and also having at least one full sib. Among 1963 case relatives (1,140 parents and 823 siblings), 41 were diagnosed with CCVM; whereas only 10 of 1,946 control relatives (1,140 parents and 806 siblings) were affected. While these data revealed strong evidence of familial aggregation overall (2.1% versus 0.05%), this ad hoc approach ignores the matching aspect of the study design and individual risk factors such as gender were not considered. After adjusting for race, gender, and relationship to probands (parents versus siblings) and using the GEE method proposed by Liang (33), Maestri et al. (39) found that Eq. 16 yielded an estimated g1 ¼ 1.405 (s.e. ¼ 0.424), corresponding to the familial risk among case families being 4.14 (¼e1.405) times higher than among relatives of controls, again showing a strong evidence of familial aggregation. To further test the hypothesis that familial risks in families of a proband with a flow lesion and nonflow lesion case probands were different, Maestri et al. (39) considered the model contrasting these two subtypes of CCVM and found ^g1 ¼ 1:698 (s.e. ¼ 0.498) and ^g2 ¼ 0:330 (s.e. ¼ 0.765). These estimated parameters suggested familial risk among relatives of flow lesion cases is stronger than in cases with other types of CCVM, and there may be etiologic heterogeneity between these two types of CCVM because the parameter estimates yielded ð^g1 ^g2 Þ2 ¼ 4:94; Varð^g1 ^g2 Þ corresponding to a statistically significant difference at the 0.05 level.
8 Detecting Familial Aggregation 1.2.5. Estimating Risk of Recurrence Among First-Degree Family Members
137
Using family case–control designs, it is possible to quantitate familial aggregation of disease and estimate how much higher a burden of disease may exist in families of affected persons compared to expected (1) among the family members of individuals without disease [controls] (familial relative risks) or (2) among individuals sampled from the population at random (familial recurrence risks) (40). How much of this excess disease occurs among the families of affected persons compared to the families of unaffected persons can be determined from a traditional Poisson regression approach, estimating relative risk of disease, where the presence or absence of disease among index individuals (cases and controls) is taken to be presence/absence of an exposure of interest. This expected relative risk (RR) can be written as: Sibling RR ¼
Prðsib2 Djsib1 DÞ : Prðsib2 Djsib1 DÞ
(17)
If the disease of interest is observed with frequency p in the population, it is possible to compute the numerators and denominators in Eq. 17 above using conditional probabilities defined as follows: Prðsib2 Djsib1 DÞ ¼
Prðsib2 D and sib1 DÞ ; p
(18)
¼ Prðsib2 Djsib1 DÞ
Prðsib2 D and sib1 DÞ : ð1 pÞ
(19)
The joint probabilities of disease in Eqs. 18 and 19 can be computed using the Uij matrix as shown in Figs. 3 and 4. Note Pr (D)i and Pr(D)j are exposure-specific risks for siblings i and j defined as: Prðsib2 D and sib1 DÞ ¼
2 2 X X
Uij Pr ðDÞj Pr ðDÞi : (20)
i¼1 j ¼1
¼ Prðsib2 D and sib1 DÞ
2 2 X X
Uij Pr ðDÞj 1
i¼1 j ¼1
Pr ðDÞi :
(21)
With this information, we can estimate the most common measure of familial recurrence risk, the sibling recurrence risk (ls): " # 2 P 2 P Uij Pr ðDÞi Pr ðDÞj Sibling recurrence riskðlz Þ ¼
j ¼1 i¼1
p
:
(22)
138
A.C. Naj et al.
Likewise, with this information, a more thorough characterization of sibling relative risk is derived as: " # 2 P 2 P Uij Pr ðDÞj Pr ðDÞi =p Sibling RR ¼ "
i¼1 j ¼1
2 P 2 P
i¼1 j ¼1
Uij Pr ðDÞj Pr ðDÞi ð1
#
:
Pr ðDÞi Þ =p (23)
Sibling 1 Sibling 2
Total Exposed (i =1)
Exposed ( j =1) Nonexposed ( j = 2) Total
f [ (1 − c) f + c ] f (1 − c) (1 − f) f
Nonexposed (i = 2) f [ (1 − c) ( 1 − f) (1 − f ) [1− (1 − c) f]
f 1− f
1 −f
1
* Uij is the relative frequency of sibling 1 and sibling 2 to be of exposure status i and j, respectively. ** f is the probability of exposure to the factor in the population and, in this instance, the population prevalence of the disease of interest; c is the correlation in exposure (in this case, disease status) between two siblings.
Fig. 3. Sibling–sibling probability matrix (Uij)* for exposure** to a risk factor of interest, in this case the presence/absence of disease among index individuals (cases and controls) (adapted from Khoury et al. (40)).
E E
E
Fig. 4. Various probabilities of disease in two siblings according to their exposure status* (adapted from Khoury et al. (40)).
8 Detecting Familial Aggregation
139
Fig. 4. (continued).
We consider an application of estimating sibling RR and ls in a study of congenital amusia (commonly known as “tone deafness”) (3), conducted on a set of 71 members of 9 large families of amusic probands and 75 members of 10 control families. As shown in Table 6, using data reported by probands and controls under the assumption of single ascertainment, 9 of 21 tested siblings of case probands were amusic, whereas 2 of 22 control siblings were amusic. The investigators cite a population prevalence (p) of amusia of 4% (in Fig. 3, this would correspond to f ¼ 0:04). Therefore, we can estimate the probability of the sib and proband were affected to be Prðsib2D and sib1 DÞ ¼ 9=21 ¼ 0:43 (in Fig. 3, this would correspond to c ¼ 0:43), and with p ¼ 0.04, ls ¼ 10.8. Likewise, ¼ 2=22 ¼ 0:09, which is more than twice Prðsib2 D and sib1 DÞ the population prevalence of p ¼ 0.04. With this information, the sibling relative risk can be estimated to be more conservative than the sibling recurrence risk, which was RR ¼ 4.73. 1.2.6. Assessing Standardized Incidence Ratio
Often alternative approaches are necessary to study disease incidence and relative risks among families when familial case–control studies are not feasible or possible. One such method is the standardized incidence ratio (SIR, or, for mortality data, the standardized mortality ratio, SMR). SIRs can be used to analyze events and person-time at risk in a longitudinal cohort study in which individuals might be followed over a long and variable length of time and a range of ages (41). SIRs can be calculated as the ratio of the observed to the
140
A.C. Naj et al.
Table 6 Amusia in siblings of amusic probands and in siblings of controls (excluding probands) (adapted from Peretz et al. (3)) Determined
n
Affected
Unaffected
Unknown*
No. of siblings of probands By report By test
57 21
39 9
15 12
3 ...
No. of siblings of controls By report By test
52 22
22 2
41 20
9 ...
*When the proband was unsure whether a relative was amusic, the relative was classified as “unknown”
Fig. 5. Application of the standardized incidence ratio (SIR) methods in a cohort study (adapted from Thomas (41)).
expected number of cases, and inferences can be made based on the assumption of an underlying Poisson distribution (for the estimation of SIRs in statistical programs, see Subheading 2.4). The standardized RR is defined as the ratio of SIRs between exposure categories. Thomas (41) explains the methods of calculating SIR through tabulating each individual’s time at risk over a two-dimensional array of ages and calendar years in a certain interval, to obtain the total person-time Tzs in each age-year stratum s and exposure category z. The number of expected cases in stratum z, Ez, can be estimated by summing the product of age-year-specific incidence rates ls and the total person-time Tzs across all age-year strata. Comparing the number of observed cases, Yz, to the number of expected, Ez, calculates the SIR within each exposure category (SIRz). These estimations are illustrated in Fig. 5. One of the major advantages of estimating SIRs within the cohorts is that population incidence rates can be used instead of examining controls as a proxy for the larger population (41).
8 Detecting Familial Aggregation
141
Bratt and colleagues (42) estimated SIRs in their study of the nationwide population-based Prostate Cancer Database Sweden (PCBaSe Sweden), which includes multiple Register cohorts and the Census database. Calculation of the expected number of specific incidences were derived from the corresponding age- and timespecific annual incidence of all patients registered in the National Prostate Cancer Register, comprising 98% of all diagnosed prostate cancer patients in Sweden during the studied time period (42). SIRs and their 95% confidence intervals (CIs) were calculated by dividing the total number of observed cases by the number of expected cases, as explained in the previous paragraph. The observed patients were assumed to have a Poisson distribution by the use of Byar’s normal approximation (42). The investigators found higher incidences of prostate cancer among brothers of the proband (FH+) than among men of the same age in the general Swedish population (SIR ¼ 3.1, 95% CI: 2.9, 3.3), and the incidences were still higher if individuals had both a father and brother with prostate cancer (SIR ¼ 5.3, 95% CI: 4.6, 6.0). Incidence was highest among men with two brothers diagnosed with prostate cancer (SIR ¼ 11, 95% CI: 8.7, 14). In Subheading 2.5, we outline the functions available in SAS and Stata programs for calculating SIR. 1.3. Familial Aggregation of Quantitative Traits: Measuring Correlations of Trait Levels in Families
An important advantage of the family case–control design is its ability to measure familial correlations in terms of both their magnitude and patterns. It is intuitive that the larger the magnitude of familial correlation, the stronger the evidence for some genetic basis in determining disease. This information about the magnitude of familial risk has a significant bearing on the statistical power available to locate or map potential causal genes. Furthermore, examining the patterns of familial aggregation closely allows investigators to further tease apart the roles played by genetic and environmental factors. For example, with phenotype data from first-degree relatives of probands, higher correlations among siblings compared to parents could suggest a dominance effect of unobserved gene(s). On the other hand, higher correlations between mothers and offspring compared to fathers and offspring provides a clue that the underlying genetic mechanism(s) is not simply autosomal. Thus, while examining familial correlations remains an exploratory exercise where assumptions about genetic mechanisms are typically held to a minimum, it still provides invaluable information and clues concerning the underlying genetic mechanisms and their possible interaction with environmental factors. For quantitative traits observed on pairs of relatives (e.g., twins), such as cholesterol level or birth weight, Y1 and Y2 say, the most commonly used measure of familial correlation is
142
A.C. Naj et al.
Pearson’s product moment correlation coefficient. This interclass correlation is defined as: r¼
CovðY1 ; Y2 Þ ðVarðY1 ÞVarðY2 ÞÞ1=2
:
(24)
For designs involving twins only, this correlation also provides a direct estimate of heritability, the proportion of total variation due to genetic variance due to one or more autosomal loci. For sets of three or more relatives, such as sibships, the intraclass correlation (ri) can be used. This is defined as the ratio of the variance among sibships to the total variance (i.e., the sum of the within sibship variance and the among sibship variance). This intraclass correlation is viewed as the average intraclass correlation over all possible pairs of sibs. Clearly, evidence of familial resemblance is stronger for higher ri, but observed correlations may or may not reflect actions of genes. Careful design and modeling of r is still necessary to provide clues about the roles of genetic factors. For a family of arbitrary size n, a general model can be constructed for rjk, the correlation coefficient for a quantitative trait Y in the jth and kth relatives, j < k ¼ 1, . . ., n 1 þ rjk ¼ a0 þ at zjk ; (25) logit 2 where zjk is a set of q covariates which could be specific to the (j, k) pair, specific to the family, or some combination of both (43). The transformation on the left-hand side of Eq. 25, logit [(1 + r)/2], ensures this measure ranges over the whole set of real numbers. 1.3.1. Measuring Correlations in Nuclear Families
For designs including nuclear families, which consist of parents and offspring, several different correlations should be considered: rjk ¼ rSS ; rPS ; or rPP ;
(26)
depending on whether the (j, k) relatives pair are siblings, a parent and an offspring or two parents, respectively. Of particular interest is the comparison between rSS and rPP. Assuming all relatives share a common environment, a significantly higher rSS compared to rPP would strengthen the argument for some genetic control of the trait. On the other hand, one can test the hypothesis of, for example, maternal transmission mechanism by contrasting rMS with rFS, where rMS (rFS) is the pairwise r between mother (father) and each offspring. 1.3.2. Measuring Correlations with Sibling Pairs
For designs involving siblings only, one can readily test the hypothesis that a constant within-family correlation is associated with observable family-specific covariates, for example, race. For example, let zjk ¼ 1(0) if the family is white (black), and then fit separate familial correlation coefficients. This approach would also allow
8 Detecting Familial Aggregation
143
investigators to identify variables reflecting heterogeneity among subgroups of families. 1.3.3. Adjusting for Additional Risk Factors When Measuring Correlations with Sibling Pairs
While a constant sibling correlation does provide one measure of heritability, one can further examine risk factors that may influence the sibship correlation using Eq. 24. For traits such as birth weight, one can, for example, model rjk as a function of the time between the two pregnancies of sibs j and k. To make inference on the rs, one can take the likelihood approach by assuming Y ¼ (Y1, . . ., Yn) from a sibship of size n follows a multivariate normal distribution with covariance matrix consistent with the rs that are specified. An alternative is to take the estimating function approach (38), which may be viewed as a multivariate analog of the quasi-likelihood method of Wedderburn (44). In either case, it is imperative that important risk factors for the phenotype Ys be properly considered. This can be achieved by modeling EðYj jxj Þ ¼ xjt b;
(27)
where xj is the set of observed covariates for the jth relative, j ¼ 1, . . . , n. For applications of the estimating function approach and use of Eq. 27 to measure familial correlations of birth weight, see Beaty et al. (43).
2. Methods Here we illustrate how to conduct some of the analyses described in Subheading 1 using the statistical software package R (http://cran. r-project.org). These calculations may also be performed using alternative software packages, such as SAS or Stata. 2.1. Calculating the Odds Ratio Described in Subheading 1
Consider the data given in Table 2. The calculations described in Subheading 1 can be done using the R package as follows. We can use the “oddsratio” function available in the “epitools” library, which can be implemented as follows: 1. Download and install the “epitools” library to R. install.packages("epitools") library(epitools)
2. Input the data into a 2 2 matrix for analysis. disease.fh.data
0.05) while the majority are also filtered to minimise information redundancy by selecting tag SNPs in the areas of high LD. Although these arrays have different genome coverage rates, they have virtually similar power in populations of European ancestry (53). Study design tries to balance expenditure on sample collection with genotyping costs; currently, the increasing density of arrays provides more genome coverage, increasing the chance of mapping relevant loci. In the near future, the critical density may well be reached where there is comparatively little benefit achieved by denser arrays. Denser arrays provide higher power in populations with weaker LD (e.g. African descent), and they are also more advantageous for European populations in the interpretation of results since they provide more significant findings in areas of true association; finding multiple support SNPs in a small region can be a major encouragement in terms of believing the signal is real rather than due to the aberrant performance of a single SNP. If one opts for a cheaper, less dense array, in silico genotyping can be a remedy to the limited number of significant SNPs in a particular region. In fact, despite an apparent saturation of extant GWAS arrays in European descent populations, they all substantially benefit from imputation of many more SNPs using the latest HAPMAP SNP data as a gold standard (53).
Multistage Design Multistage design is a strategy originally proposed to reduce the genotyping cost of GWAS (63). Under this design, a maximum number of SNPs are genotyped on the whole genome in a small number of cases and controls in the first stage, and thereafter, a decreasing number of SNPs are selected for genotyping in an increasing number of cases and controls in successive stages, using a more stringent selection threshold at each stage until reaching genome-wide significance. There is no predetermined number of stages, but it has been shown that up to four stages could be necessary to maximise the power and, at the same time, minimise the cost (64). Owing to the reasonable cost of SNP genotyping,
13 Design Considerations for Genetic Linkage and Association Studies
253
this scheme has not so far been as popular as anticipated. It can be a good avenue, however, in designing follow-up studies to strengthen the evidence of association for the multitude of SNPs that, in a typical large-scale single-stage study, fall below the stringent genome-wide significance threshold. Table 6 and Fig. 3 show the power of a two-stage design (a ¼ 5 10 7) with a total sample of 1,000 cases and 1,000 controls when 20–50% of the samples are genotyped on the whole genome (stage 1) and the 1–10% most promising SNPs are taken forward to stage 2 in the remaining samples. The assumed disease prevalence is 10%, the risk allele frequency is 25%, and the allelic relative risk is 1.5 (multiplicative). All estimations were conducted within CaTS (65) (see Note 3). The maximum achievable power in the second stage analysis ignoring stage 1 samples is 0.83, while the power of a one-stage design is 0.93. There is a probability of 0.94 or higher to rank a truly associated locus among the top 5% of stage 1 results when 30% or more of the samples are genotyped, or in the top 1% if 40% or more of the samples are genotyped (Table 6). There are two types of two-stage designs: (1) a joint analysis of stage 1 and stage 2 samples and (2) ignoring the stage 1 samples when analysing the stage 2 data. The power of the joint analysis of stage 1 and stage 2 samples is higher than ignoring the stage 1 samples, and it approaches the power of the one-stage design when at least 40% of samples are genotyped in stage 1 (or 30% of samples in stage 1 with at least 5% SNPs selected for stage 2). Joint analysis is therefore the most cost-effective (65), as it reduces the total number of genotypes in the same way as the two-stage design that ignores stage 1 samples when analysing the stage 2 data (45–79% of reduction in Table 6), and has nearly the power of a single-stage design.
Replication The requirement to validate findings before they can be published in high-impact journals has prompted most GWAS investigators to split their samples into two datasets: one for initial analysis and the other for replication. This scheme is different from a two-stage GWAS design as only the significant findings from the original analysis are tested in the second dataset rather than a predetermined proportion of initial SNPs. Although this scheme allows confirming the best findings, it reduces the overall study power because the sample size in each data subset is smaller than the pooled data (65). Jointly reanalysing the two data subsets increases the power, but only for those SNPs significant in the first subset for which replication is sought. The ideal replication remains an independent study. When planning a replication study following an initial GWAS, it should be borne in mind that there is danger that the study is
254
J. Nsengimana and D.T. Bishop
Table 6 Power of a two-stage GWAS (a ¼ 5 10 7). Calculations obtained using the CaTS program package % Samples genotyped on whole genome (stage 1)
% Markers Genotype selected for reduction stage 2 rate
Probability of carrying over an associated locus to stage 2
Power of the two-stage design that ignores the stage 1 samples in stage 2 analysis
Power of joint analysis of stages 1 and 2 for selected markers
50
10
0.45
0.998
0.57
0.93
50
5
0.48
0.996
0.63
0.93
50
1
0.50
0.978
0.74
0.92
40
10
0.54
0.99
0.73
0.93
40
5
0.57
0.98
0.77
0.92
40
1
0.59
0.94
0.82
0.89
30
10
0.63
0.97
0.82
0.91
30
5
0.67
0.94
0.83
0.90
30
1
0.69
0.84
0.79
0.81
20
10
0.72
0.86
0.82
0.85
20
5
0.76
0.83
0.78
0.80
20
1
0.79
0.63
0.61
0.62
1.0 stage 2=1%SNPs stage 2=5%SNPs
0.9
stage 2=10%SNPs
Power
0.8
0.7
0.6
0.5 20
30
40 50 % Samples in stage 1
60
Fig. 3. Power of a two-stage GWAS as function of the proportion of samples genotyped in stage 1 and the proportion of SNPs selected after stage 1 analysis to genotype in stage 2 samples, with the stage 2 analysis not including stage 1 samples.
13 Design Considerations for Genetic Linkage and Association Studies
255
underpowered. It is well documented that genetic effect size estimates for the most significant or top-ranked SNPs, as is often the case in GWAS, are biased, particularly if they are small (66, 67). Furthermore, estimates may also be subject to ascertainment bias if, for example, the case set were genetically enriched (68, 69). Indeed, GWAS are explorative in nature and are more concerned with increasing the study power than correctly estimating the effect size. The consequence is that using these estimates to plan a future study is too optimistic. Garner (67) examined analytically and through simulations the relationships between bias, true genetic effect, allele frequency, sample size, and significance threshold. These relationships should be used to determine the bias in a particular study when planning for its replication. 2.2.2. Sources of False Positives
In case–control studies, the statistical power is not the only issue to worry about. In fact, it is well known that this design is subject to a number of confounders, which may explain, at least in part, the low success in replicating early candidate gene studies. For GWAS, the high cost of collecting a large number of samples and largescale genotyping commands that a careful approach is followed to ensure that positive findings are not spurious associations. Three main confounders have been extensively discussed in the literature: population structure (or stratification), cryptic relatedness, and differential genotype calling between cases and controls (differential bias).
Population Structure and Cryptic Relatedness
According to Astle and Balding (70), population structure and cryptic relatedness are two ends of a spectrum of the same confounder: the unobserved pedigree of possibly distant relationships among cases and controls. Population structure (see Chapter 21) is a confounder when there are different identifiable groups or (sub)populations (e.g. due to geographical, social, or cultural reasons) with unequal disease prevalence rates and the proportions of cases and controls sampled from each group do not reflect those prevalence rates. Cryptic relatedness (see Chapter 4), on the other hand, is the unobserved pedigree structure that links cases and/or controls across generations. The multigenerational pedigree structure is hidden and may not create any visible structure in the population, but it is acknowledged that if there is a genetic cause to the disease, then cases are likely to share some distant relatives from whom the disease-causing mutations originate (71). It is both easy and difficult to tackle the problem of population stratification at the study design stage. Well-designed studies select cases and controls in the same population, matching by ancestry. However, it is impossible to match the study on all potentially confounding variables. Subgroups might not be perfectly independent, as there is almost always some admixture between them. In populations of European ancestry, population structure inflates
256
J. Nsengimana and D.T. Bishop
tests of genetic association by about 4%, which is regarded as negligible (50, 58). It is admitted that population structure has a limited impact as long as samples are matched by ethnic background and checks are undertaken to find and exclude samples with admixed or ambiguous ethnicity (54). However, the effect of stratification increases with sample size, and hence, the larger samples needed to find low-risk variants can cause greater false-positive rates if this is not accounted for (72). Simulation studies have shown that cryptic relatedness can have only a limited impact on genetic association with diseases in most populations (71). However, it can be a serious cause of concern in small expanding populations or in populations where marriages between relatives are common. Fortunately, there is a host of methods that can be used to correct for the adverse effect of both stratification and cryptic relatedness. It is important to keep in mind that some of these methods can only deal with population structure or cryptic relatedness (73–76), while others adjust for both confounders (70, 77, 78). Family-based designs offer an alternative scheme to investigate genetic association with robustness to confoundings. This robustness comes at some cost, however, as more subjects have to be genotyped than in a case–control design. Family designs are often criticised because they are less powerful than unrelated cases and controls, but this criticism is not always justified, because any adjustment of confounding factors in cases and controls also causes a power loss. Furthermore, it was shown that family trios could have equal or higher power than cases and controls under some genetic models (79). Additional advantages of this design are better detection of genotyping errors and the possibility of analysing complex genetic components such as maternal effects and imprinting. Differential Bias
Differential genotype calling between cases and controls in large-scale GWAS will inflate association tests (58–60). It can be triggered by poor DNA quality in some samples, which causes suboptimal interaction with chemical reagents. Separately calling cases and controls has been suggested to reduce this problem when samples are collected by different research centres (59). However, it recently emerged that randomising cases and controls on plates and genotyping them together offers the advantage of testing the plate effect on genotype calls and allows, therefore detecting sample processing errors that would otherwise remain undetectable. In fact, a handful of poor-quality samples can induce erroneous genotype calling on the whole plate, and even when those few samples are detected through normal quality control measures, their removal does not eliminate all the errors (60). Furthermore, eliminating the errors may cause unequal call rates in cases and controls, which itself can create false positives (63). Although methods have been proposed to detect and adjust
13 Design Considerations for Genetic Linkage and Association Studies
257
Fig. 4. Top screen of parameter input in the DESPAIR program.
for the adverse effects of differential bias (58), it remains unclear whether these methods eliminate all false positives while keeping all true genetic associations. It is therefore necessary to take this aspect into account at the study design stage by controlling all aspects of DNA acquisition, storage, preparation, and processing. It is acknowledged that, even with extra care, subtle differences between samples generally subsist and could, in some cases, outweigh population structure in causing spurious associations.
3. Notes 1. DESPAIR: This program calculates the power and sample size for multistage linkage studies using concordant and discordant relative pairs while minimising the cost (47). It is part of the
258
J. Nsengimana and D.T. Bishop
Fig. 5. Outlook of the parameter input screen for case–control power calculation using Genetic Power Calculation software online.
S.A.G.E. package (see Chapter 30) that is utilised interactively online at http://darwin.cwru.edu/despair. Figure 4 illustrates the top screen of input parameters when the program is run. All the required information is self-explanatory, and documentation can be downloaded for a detailed description. Non-recurrence risk parameters are only required when the data contain discordant pairs. Reference (46) gives guidelines on which test version to use among the mean and proportion tests. Briefly, the mean test is appropriate for most designs and is the only method implemented for data containing discordant
13 Design Considerations for Genetic Linkage and Association Studies
259
Fig. 6. Outlook of CaTS screen displaying parameter input boxes and power calculations for one-stage, two-stage, and joint analysis designs.
relative pairs, but the proportion test can be more powerful when the recurrence risk is as high as or greater than 5. 2. Genetic Power Calculator: The Genetic Power Calculator (49) can be freely and interactively used online (http://pngu.mgh. harvard.edu/~purcell/gpc/) to estimate association study power for both discrete and quantitative traits. Figure 5 displays the parameter input screen for the case–control design. Note that the user-defined power and sample size are both required. The program calculates the power achievable with the entered sample size, and the sample size required to attain the specified power under different models. By ticking “unselected controls,” the user assumes that the controls represent the population and may contain undiagnosed cases. The program assumes by default that controls are screened for the disease to avoid misclassifications. 3. CaTS: This software calculates the power and sample sizes for a two-stage GWAS (65) and can be freely downloaded
260
J. Nsengimana and D.T. Bishop
(www.sph.umich.edu/csg/abecasis/CaTS/) for installation on a local machine. Like the other software described above, CaTS is interactive and easy to use. Figure 6 is an outlook of a self-explanatory CaTS screen with dedicated boxes for parameter input at the top and the estimated powers at the bottom. References 1. Lee-Kirsch M A, et al (2006) Familial chilblain lupus, a monogenic form of cutaneous lupus erythematosus, maps to chromosome 3p. Amer J Hum Genet 79: 731–737 2. Kruglyak L, et al (1996) Parametric and nonparametric linkage analysis: A unified multipoint approach. Amer J Hum Genet 58: 1347–1363 3. Lander ES, Botstein D (1987) Homozygosity Mapping - a Way to Map Human Recessive Traits with the DNA of Inbred Children. Science 236: 1567–1570 4. Mueller RF, Bishop DT (1993) Autozygosity Mapping, Complex Consanguinity, and Autosomal Recessive Disorders. J Med Genet 30: 798–799 5. Wang S, Haynes C, Barany F, Ott, J (2009) Genome-Wide Autozygosity Mapping in Human Populations. Genet Epidemiol 33: 172–180 6. Boehnke M (1986) Estimating the Power of a Proposed Linkage Study - a Practical Computer-Simulation Approach. Amer J Hum Genet 39: 513–527 7. Ploughman LM, Boehnke M (1989) Estimating the Power of a Proposed Linkage Study for a Complex Genetic Trait. Amer J Hum Genet 44: 543–551 8. Samani N J, et al. (2005) A genomewide linkage study of 1,933 families affected by premature coronary artery disease: The British heart foundation (BHF) family heart study. Amer J Hum Genet 77: 1011–1020 9. Whittemore AS, Tu IP (1998) Simple, robust linkage tests for affected sibs. Amer J Hum Genet 62: 1228–1242 10. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases, Science 273: 1516–1517 11. Risch N (1990) Linkage Strategies for Genetically Complex Traits .2. The Power of Affected Relative Pairs. Amer J Hum Genet 46: 229–241 12. Lander E, Kruglyak L (1995) Genetic Dissection of Complex Traits - Guidelines for Interpreting and Reporting Linkage Results, Nature Genetics 11: 241–247
13. Bishop DT, Williamson JA (1990) The Power of Identity-by-State Methods for Linkage Analysis. Amer J Hum Genet 46: 254–265 14. Risch NJ (2000) Searching for genetic determinants in the new millennium, Nature 405: 847–856 15. Brown BD, et al (2010) An evaluation of inflammatory gene polymorphisms in sibships discordant for premature coronary artery disease: the GRACE-IMMUNE study, BMC Medicine 8: 5 16. Hodge SE, Vieland VJ, Greenberg DA (2002) HLODs remain powerful tools for detection of linkage in the presence of genetic heterogeneity. Amer J Hum Genet 70: 556–558 17. Whittemore AS, Halpern J (2001) Problems in the definition, interpretation, and evaluation of genetic heterogeneity. Amer J Hum Genet 68: 457–65 18. Altmuller J, et al (2001) Genomewide scans of complex human diseases: True linkage is hard to find. Amer J Hum Genet 69: 936–50 19. Hauser ER, et al. (2004) Ordered subset analysis in genetic linkage mapping of complex traits. Genet Epidemiol 27: 53–63 20. Nsengimana J, et al (2007) Enhanced linkage of a locus on chromosome 2 to premature coronary artery disease in the absence of hypercholesterolemia. Eur J Hum Genet 15: 313–319 21. Almasy L, Blangero J (2009) Human QTL linkage mapping. Genetica 136: 333–340 22. Abecasis GR, Cherny SS, and Cardon LR (2001) The impact of genotyping error on family-based analysis of quantitative traits. Eur J Hum Genet 9: 130–134 23. Abecasis GR, et al (2001) GRR: graphical representation of relationship errors. Bioinformatics 17: 742–743 24. Pompanon F, et al (2005) Genotyping errors: Causes, consequences and solutions, Nat Rev Genet 6: 847–859 25. Chang YPC, et al (2006) The impact of data quality on the identification of complex disease genes: experience from the Family Blood Pressure Program. Eur J Hum Genet 14: 469–477
13 Design Considerations for Genetic Linkage and Association Studies 26. Goring HHH, OttJ (1997) Relationship estimation in affected rib pair analysis of late-onset diseases. Eur J Hum Genet 5: 69–77 27. Boehnke M, Cox NJ (1997) Accurate inference of relationships in sib-pair linkage studies. Amer J Hum Genet 61: 423–429 28. Douglas JA, Boehnke M, Lange K (2000) A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Amer J Hum Genet 66: 1287–1297 29. Sun L, Wilder K, McPeek MS (2002) Enhanced pedigree error detection. Hum Hered 54: 99–110 30. Sobel E, Papp JC, Lange K (2002) Detection and integration of genotyping errors in statistical genetics. Amer J Hum Genet 70: 496–508 31. Ray A, Weeks DE (2008) Relationship uncertainty linkage statistics (RULS): Affected relative pair statistics that model relationship uncertainty. Genet Epidemiol 32: 313–324 32. Hauser ER, et al. (1996) Affected-sib-pair interval mapping and exclusion for complex genetic traits: Sampling considerations. Genet Epidemiol 13: 117–137 33. Sawcer SJ, et al (2004) Enhancing linkage analysis of complex disorders: an evaluation of high-density genotyping. Hum Mol Genet 13: 1943–1949 34. Evans DM, Cardon LR (2004) Guidelines for genotyping in genomewide linkage studies: Single-nucleotide-polymorphism maps versus microsatellite maps. Amer J Hum Genet 75: 687–692 35. Guo XQ, Elston RC (2000) Two-stage global search designs for linkage analysis II: Including discordant relative pairs in the study. Genet Epidemiol 18: 111–27 36. Huang QQ, Shete S, Amos CI (2004) Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Amer J Hum Genet 75: 1106–1112 37. Schaid DJ, et al (2004) Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage screen for prostate cancer-susceptibility loci. Am J Hum Genet 75: 948–65 38. Nsengimana J, Renard H, Goldgar D (2005) Linkage analysis of complex diseases using microsatellites and single-nucleotide polymorphisms: application to alcoholism. BMC Genet 6: S10 39. Wilcox MA, et al (2005) Comparison of singlenucleotide polymorphisms and microsatellite markers for linkage analysis in the COGA and simulated data sets for genetic analysis workshop 14. Genet Epidemiol 29: S7-S28
261
40. Boyles AL, et al (2005) Linkage disequilibrium inflates type I error rates in multipoint linkage analysis when parental genotypes are missing. Hum Hered 59: 220–227 41. Abecasis GR, Wigginton JE (2005) Handling marker-marker linkage disequilibrium: Pedigree analysis with clustered markers. Am J Hum Genet 77: 754–67 42. Kurbasic A, Hossjer O (2008) A general method for linkage disequilibrium correction for multipoint linkage and association. Genet Epidemiol 32: 647–57 43. Webb EL, Sellick GS, Houlston RS (2005) SNPLINK: multipoint linkage analysis of densely distributed SNP data incorporating automated linkage disequilibrium removal. Bioinformatics 21: 3060–3061 44. Fukuda Y, et al (2009) SNP HiTLink: a highthroughput linkage analysis system employing dense SNP data. BMC Bioinformatics 10: 121 45. Selmer KK, et al (2009) Genome-wide Linkage Analysis with Clustered SNP Markers. J Biomol Screen 14: 92–96 46. Fischer ANM, et al (2010) A genome-wide linkage analysis in 181 German sarcoidosis families using clustered bi-allelic markers. Chest 138: 151–157 47. Guo XQ, Elston RC (2000) Two-stage global search designs for linkage analysis I: Use of the mean statistic for affected sib pairs. Genet Epidemiol 18: 97–110 48. Ochs-Balcom HM, et al (2010) Program update and novel use of the DESPAIR program to design a genome-wide linkage study using relative pairs. Hum Hered 69: 45–51 49. Purcell S, Cherny SS, Sham PC (2003) Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19: 149–150 50. WTCCC. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678 51. Bishop DT, et al (2009) Genome-wide association study identifies three loci associated with melanoma risk. Nat Genet 41: 920–925 52. Panoutsopoulou KZE (2009) Finding common susceptibility variants for complex disease: past, present and future. Brief Funct Genomic Proteomic 8: 345–352 53. Spencer CCA, et al (2009) Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip. PLOS Genetics 5: e1000477
262
J. Nsengimana and D.T. Bishop
54. McCarthy MI, et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369 55. Amos CI (2007) Successful design and conduct of genome-wide association studies. Hum Mol Genet Spec 2: R220-R225 56. Zondervan KT, Cardon LR, Kennedy SH (2002) What makes a good case-control study? Design issues for complex traits such as endometriosis. Hum Reprod 17: 1415–1423 57. Newton-Cheh C, Hirschhorn JN (2005) Genetic association studies of complex traits: design and analysis issues. Mutat Res-Fund Mol M 573: 54–69 58. Clayton DG, et al (2005) Population structure, differential bias and genomic control in a largescale, case-control association study. Nat Genet 37: 1243–1246 59. Plagnol V, et al (2007) A method to address differential bias in genotyping in large-scale association studies. PLOS Genet 3: e74 60. Pluzhnikov A, et al. (2010) Spoiling the whole bunch: quality control aimed at preserving the integrity of high-throughput genotyping. Am J Hum Genet 87: 123–28 61. Tabor HK, Risch NJ, Myers RM (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet 3: 391–7 62. Pettersson FH, et al. (2009) Marker selection for genetic case-control association studies. Nat Protoc 4: 743–752 63. Hirschhorn JN, Daly MJ. (2005) Genomewide association studies for common diseases and complex traits, Nature Reviews Genetics 6: 95–108 64. Pahl R, Schafer H, Muller HH (2009) Optimal multistage designs-025EFa general framework for efficient genome-wide association studies. Biostatistics 10: 297–309 65. Skol AD, et al. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38: 209–213 66. Bowden J, Dudbridge F (2009) Unbiased Estimation and Inference for Replicated Associa-
tions Following a Genome Scan. Genet Epidemiol 33: 406–418 67. Garner C (2007) Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol 31: 288–295 68. Goldgar D, et al (2007) BRCA phenocopies or ascertainment bias? J Med Genet 44: 10–15 69. Terwilliger JD, Weiss KM (2003) Confounding, ascertainment bias, and the blind quest for a genetic ‘fountain of youth’. Ann Med 35: 532–544 70. Astle, W., and Balding, D. J. (2009) Population Structure and Cryptic Relatedness in Genetic Association Studies. Stat Sci 24: 451–471 71. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLOS Genet 1: 302–311 72. Marchini J, et al (2004) The effects of human population structure on large genetic association studies. Nat Genet 36: 512–517 73. Choi Y, Wijsman EM, Weir BS (2009) CaseControl Association Testing in the Presence of Unknown Relationships. Genet Epidemiol 33: 668–678 74. Slager SL, Schaid DJ (2001) Evaluation of candidate genes in case-control studies: A statistical method to account for related subjects. Am J Human Genet 68: 1457–1462 75. Bourgain C, et al (2003) Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73: 612–626 76. Pritchard JK, et al. (2000) Association mapping in structured populations. Am J Hum Genet 67: 170–181 77. Sillanpaa MJ (2011) Overview of techniques to account for confounding due to population stratification and cryptic relatedness in genomic data association analyses. Heredity 106(4):511–519 78. Price AL, et al (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11: 459–463 79. Laird NM, Lange C (2009) The Role of Family-Based Designs in Genome-Wide Association Studies. Statist Sci 24: 388–397
Chapter 14 Model-Based Linkage Analysis of a Quantitative Trait Audrey H. Schnell and Xiangqing Sun Abstract Linkage analysis is a family-based method of analysis to examine whether any typed genetic markers co-segregate with a given trait, in this case a quantitative trait. If linkage exists, this is taken as evidence in support of a genetic basis for the trait. Historically, linkage analysis was performed using a binary disease trait, but has been extended to include quantitative disease measures. Quantitative traits are desirable as they provide more information than binary traits. Linkage analysis can be performed using single marker methods (one marker at a time) or multipoint (using multiple markers simultaneously). In model-based linkage analysis, the genetic model for the trait of interest is specified. There are many software options for performing linkage analysis. Here, we use the program package Statistical Analysis for Genetic Epidemiology (S.A.G.E.). S.A.G.E. was chosen because it includes programs to perform data cleaning procedures and to generate and test genetic models for a quantitative trait, in addition to performing linkage analysis. We demonstrate in detail the process of running the program LODLINK to perform single marker analysis and MLOD to perform multipoint analysis using output from SEGREG, where SEGREG was used to determine the best fitting statistical model for the trait. Key words: Linkage analysis, Quantitative trait, LOD score, Recombination fraction, Statistical Analysis for Genetic Epidemiology, Single point, Multipoint, Model-based, Segregation analysis, Homogeneity
1. Introduction 1.1. Linkage Analysis
“Linkage” is the term used to denote the occurrence of two loci on the same chromosome (i.e., syntenic) that are inherited together. Two loci are said to be linked if they are close enough on the same chromosome that recombination during meiosis is rare enough for the loci to “cosegregate,” and this phenomenon can then be observed within families. Classic model-based linkage, where model-based refers to the genetic model for the trait, was developed to assess the association of a binary trait (e.g., affection status) and a genetic marker within families (pedigrees). However, procedures have also been developed to adapt linkage analysis for a
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_14, # Springer Science+Business Media, LLC 2012
263
264
A.H. Schnell and X. Sun
quantitative trait, and these will be demonstrated in this chapter using methods based on the exact likelihood calculation. Modelbased linkage is sometimes referred to as parametric linkage but the term model-based will be used here. If a trait demonstrates familial aggregation (see Chapters 8–10) and/or segregation (see Chapter 12), linkage analysis would be justified. A finding of linkage would provide a basis for further study using association analysis (described in later chapters). Classic model-based linkage analysis calculates the likelihood ratio in favor of linkage to a marker over a range of recombination fractions (y). If recombination between the marker and trait loci is infrequent, it is possible to detect linkage. The LOD score is the 10 based logarithm of the likelihood ratio, often miscalled the log of the odds for linkage. Rather than calculating an odds for linkage it compares the likelihood of observing the data if the two loci are linked (y PEDINFO. 2. On the left panel of the Project Window drag your pedigree file from its position in Data—Internal to Jobs—PEDINFO1— Errors—Missing Data File. 3. On the lower right hand side of the Project Window, press the Next button and the Run button. At this point the program should run. Typical run time is only on the order of seconds. 4. The resulting output should now appear under Jobs—PEDINFO1—Output. Simply double click on the output files to view the results. When viewing the results, first view the pedinfo.inf information file to look for potential errors. Next, view the actual results file PEDINFO1. Finally, we point out that, if the user has the phenotype data, pedigree data, and genotype data in separate files, it will be helpful for downstream analysis to merge the genotype, pedigree, and phenotype files. That is, there should be one file for each region/ chromosome to be analyzed. (Note, however, that it is possible to keep the phenotype and genotype data separate. That is, you could have one file per chromosome that has pedigree and genotype information for calculating allele frequencies and IBD values, and one file that has pedigree and phenotype information for actually performing the final linkage analysis.) 2.3. Obtaining and Formatting a Genetic Map File
Besides the actual data, a second crucial piece of information for linkage analysis is known as the genetic map information. In order to make full use of your genetic data the program GENIBD, which estimates the IBD values, must know the genetic distance between linked markers. Some of the vendors of marker assays distribute a genetic map file. The Rutgers Genome Map file based on the work of Kang et al. (20, 21) contains a number of genetic markers and can be found as a supplementary file. If no genetic map is available, then the genetic position of the markers should be estimated by linear interpolation using the physical position of the marker between two markers of known physical and genetic position.
16
Model-Free Linkage Analysis of a Quantitative Trait
307
In S.A.G.E., the genetic map file for the markers is known as the “Genome Description File.” The information must be saved in an ASCII text file in a format similar to this:
The distance in the above example is the distance between markers and it must be recorded in centimorgans. The map function may be specified as Haldane or Kosambi, depending on how the genetic distance was measured. The first entry may be entered as the distance from the p terminus to the first marker, or the user may choose to start with the first marker, as exemplified above. The marker names (e.g., “chr1marker1” and “chr1marker2” etc.) should be replaced by the true marker names. In order to use the “Genome Description File” in the S.A.G.E. GUI, the user must import the file. To accomplish this critical step, drag the genome file icon (i.e., ) from the Pallet to Data— Internal in the Project Window. It is also possible to import the map file from several other formats using the Tools > Create a Genome Description File option. We note that, historically, linkage studies were sometimes performed using what is sometimes referred to as “single-marker” or “two-point” linkage analysis. Single-marker linkage analysis analyzes the genetic markers independently, and so it is not necessary to know the genetic distance between markers. We strongly recommend that “multipoint” IBD values be calculated to make full use of the information available. Thus, it is critical to obtain a correctly formatted genetic map file. 2.4. Obtaining the Allele Frequencies for the Markers
In order to calculate the IBD values, GENIBD needs to know the allele frequencies of your markers. In S.A.G.E., this information is stored in a file known as the “Marker Locus Description File.”
308
N.J. Morris and C.M. Stein
If the user already has estimates of the allele frequencies, then the frequencies may be entered as described in the S.A.G.E. manual. We assume that the user estimates the allele frequencies from the available data using the program FREQ. Estimating allele frequencies has already been covered in greater detail in Chapter 5, but we briefly go over the steps to achieving this in S.A.G.E.: 1. Go to Analysis > Allele Frequency Estimation > FREQ. 2. On the left panel of the Project Window, drag the pedigree file from its position in Data—Internal to Jobs—PEDINFO1— Errors—Missing Data File. 3. On the lower right hand side of the Project Window, press the Next button. 4. On the right hand side of the Project Window, click on the drop down box for “Marker” and click “Select All.” 5. Click Run on the lower right hand side of the Project Window. 6. Click OK in the “Analysis Information” pop-up box. 7. Run time may take some time depending on the data size. The user may look at the tasks window to see the status of the job. When the job is finished, as always check the information file (i.e., the file ending in .inf) first, to make sure there were no errors in analyzing the data. Next, the user should look at the summary file (ending in .sum) and the detail file (ending in .det). The final file that FREQ outputs is the “Marker Locus Description File” which ends in .loc. This file will be used as input for GENIBD. 2.5. Obtaining IBD Information Using GENIBD
Before any model-free analysis can proceed, the IBD information must be obtained by using the program GENIBD. This program takes as input the “Marker Locus Description File” and the “Genome Description File.” To run GENIBD in the S.A.G.E. GUI: 1. Go to Analysis > IBD Allele Sharing Estimation > GENIBD. 2. On the left panel of the Project Window drag the pedigree file from its position in Data—Internal to Jobs—GENEIBD1— Errors—Missing Data File. 3. On the left panel of the Project Window drag the “Genome Description File” imported earlier from its position in Data— Internal to Jobs—GENEIBD1—Errors—Missing Genome file to the multipoint. 4. On the left panel of the Project Window drag the “Marker Locus Description File” created by FREQ from its position in Jobs—FREQ1—Output to Jobs—GENIBD1—Errors—Missing Marker locus file. 5. On the left panel of the Project Window click Next. The left panel of the Project Window should now be on the Analysis Definition tab.
16
Model-Free Linkage Analysis of a Quantitative Trait
309
6. Select the desired options for the analysis. The default options are generally appropriate, but we briefly discuss what the various options are: (a) Title. Just describes the title that will be displayed in the output. (b) Region. Selects which unlinked regions in the “Genome Description File” will be analyzed. (c) IBD mode. Selects whether the markers will be considered independent or not. As discussed in our section on the genetic map file, this should usually be set to Multipoint. (d) Scan type. Determines whether the IBD values will be reported at the actual marker position, or at those positions and at some uniform grid across the entire region. The uniform grid (as specified by selecting the intervals option) may result in slightly nicer plots of the linkage signal. (e) Output pair types. Determines for which types of relative pairs the IBD values are calculated. If SIBPAL is going to be used, then it is sufficient to output All sibs, but doing the calculations for all relatives will not hurt anything. If RELPAL is going to be used, then it is preferable to use the All relatives option. (f) Use simulation. Decides whether an exact Hidden Markov Model or a Markov Chain Monte Carlo (MCMC) method will be used to estimate the IBD values. If set to True, then the MCMC method will be used only when the pedigrees are large and an exact analysis would take up a large amount of memory. If set to False, then the MCMC method will never be used. This is often faster than the MCMC method, but it may take too much memory. It is often more computationally efficient to set Use simulation to False and Split pedigree to true. We do not recommend that Use simulation be set to Always. Also, we do not recommend that the user change the other settings for the MCMC simulation without a deeper understanding of the algorithm. (g) Maximum bit. Determines the largest family that can be analyzed exactly. We recommend that the user not change this without some knowledge of the underlying algorithm. (h) Split pedigrees. Splits multigenerational pedigrees into nuclear families before analysis. This loses some information, and makes it impossible to estimate non-sibling IBD pairs. However, it is often much faster computationally than using MCMC simulations. (i) Allow loop. Allows pedigrees with inbreeding to be analyzed. This option cannot be used with the multipoint option.
310
N.J. Morris and C.M. Stein
7. Once the appropriate options have been selected, click the Run button in the lower right hand side of the project window. 8. Click OK on the Analysis Information pop-up window. 9. Run time may vary depending on the amount of data. For large pedigrees with a large number of markers, runtime can potentially be hours for a single chromosome. If this is the case, then the user may wish to run the analysis on a computer cluster instead of a personal computer. We briefly discuss this later. The output may be found under Jobs—GENIBD1—Output. As always, check the information file (genibd.inf) first for errors and warnings. The file ending with .ibd contains the actual IBD information, which will be used by the other programs. 2.6. Analyzing Sibling Pair Data Using SIBPAL
After following all of the previous steps, it is now actually time to perform the linkage analysis. SIBPAL is a program that can robustly make use of sibling pair information. To run SIBPAL: 1. Go to Analysis > Linkage Analysis > Model-free > Sibling Pairs > SIBPAL. 2. On the left panel of the Project Window drag the pedigree file from its position in Data—Internal to Jobs—SIBPAL1— Errors—Missing Data File. 3. On the left panel of the Project Window drag the IBD file created earlier from its position in Jobs—GENIBD1—Output to in Jobs—SIBPAL1—Missing IBD file. 4. On the left panel of the Project Window click Next. The left panel of the Project Window should now be on the Analysis Definition tab. 5. Select the Trait Regression option. The left panel of the Project Window should now be on Trait Regression tab. 6. Next to the Trait option click Define. In the Specification popup box, select the trait that you wish to analyze. If the data involves unascertained samples, we recommend that the user select BLUP mean, for the mean. For strongly ascertained traits, the user should look to the paper by Sinha et al. (22) for guidance (see Note 1). Click Add trait and OK. 7. If you have covariates that you wish to adjust out, click on the Define button next to Covariate. Select the covariate that you wish to use. Recall that in HE regression, covariates must be pair specific. SIBPAL will form a pair-specific covariate using one of the operations under option. For covariates that are measured at the individual level, available options are sum, difference, and mean. The user should select the option that will be the easiest to interpret. If the covariate is already coded at the sibship level (e.g., ethnicity), then this could be indicated and analyzed as is. The power option will put the newly formed
16
Model-Free Linkage Analysis of a Quantitative Trait
311
pair-specific covariate to some power. Click Add covariate and OK. (Note that it is often more readily interpretable to preadjust the trait for the covariates by performing linkage analysis on the residuals of the trait after linear regression.) 8. Select the markers you wish to analyze. Generally, this will be All. 9. If you wish to analyze only a subset of the data, you may select a variable for the Subset option. This variable must be a dummy variable (i.e., take on only the values 0 or 1). Only those individuals with a value of 1 for this variable will be considered for analysis. 10. If you wish to model interactions between markers, you may do so using the Interactions option. 11. We recommend that the user compute empirical P-values. This requests that SIBPAL use a permutation method to assess statistical significance. To do this, click on the Define button next to Compute empirical P-values. The default in the Specification pop-up box is to only compute empirical P-values if the asymptotic P-value is less than 0.05. We recommend that the user keep the default values and click ok. If more precision is desired for the P-value, this may be done in a second analysis of the data by decreasing the Width of P-value option. 12. For the Dependent variable, we recommend that the user select the W4 option when the sample is not ascertained. As mentioned in Subheading 2, this has been shown to have good power when compared to the other methods (see Note 1). 13. For Pair type select full, half, or both, depending on whether full sibling pairs or half sibling pairs are available in the data. 14. It is particularly useful to select the Produce tab delimited output option because this will allow for the output to be easily imported into other programs to make nice looking plots and tables. 15. We do not recommend using the robust variance estimator as it is believed to be overly conservative. 16. Click Next. 17. Click Run. 18. Click OK. 19. The output should show up under to Jobs—SIBPAL1—Output. View the information file first to find any errors and warnings. The .treg file contains a summary of the results. The .treg_det contains more details about the results. The .treg_export file is only produced if the Produce tab delimited output check box is selected. It will contain a file which may be easily imported into programs, such as R or Excel. See Note 2 below for detail regarding interpretation of jagged P-value distributions. Also see Fig. 2 for some sample output and notes on interpretation.
312
N.J. Morris and C.M. Stein
Fig. 2. SIBPAL output (slightly edited).
16
2.7. Analyzing Extended Families Using RELPAL
Model-Free Linkage Analysis of a Quantitative Trait
313
If the data contain extended relative pairs, it may be more appropriate to use the program RELPAL. To analyze data using RELPAL: 1. Go to Analysis > Linkage Analysis > Model-free > Arbitrary Relative Pairs > RELPAL. 2. On the left panel of the Project Window drag the pedigree file from its position in Data—Internal to Jobs—RELPAL1— Errors—Missing Data File. 3. On the right panel of the Project Window click the “. . .” next to Data file. Select the IBD file created earlier. 4. Click Next. 5. Select a Trait from the drop down box. 6. For Model select Single marker. This is the option for linkage analysis in general. The Multiple marker option allows you to adjust your signal for linkage at other locations. The Zero marker option allows you to estimate the covariate effects without doing linkage analysis. 7. If there are covariates to be adjusted for, click define next to First level. Select a Covariate and leave Trait to be adjusted blank. Click Add Covariate. You may repeat this to add multiple covariates. 8. If the user wishes to analyze only specific markers, then click define next to Second level. We do not recommend that the user select covariates for the second level because it is difficult to interpret what such covariates mean. See the S.A.G.E. user manual for more about these options. 9. The Produce comma delimited output option may be used to create an output file which can be easily imported into other programs such as R or Excel. 10. Click Run. 11. The output should show up under to Jobs—RELPAL1—Output. View the information file first to find any errors and warnings. The .out file contains a summary of the results. The .out file is only produced if the Produce comma delimited output check box is selected. It will contain a file that may be easily imported into programs such as R or Excel. Note that the “Empirical P-value” reported by RELPAL is actually asymptotic, and does not involve permutations. See Note 3 below for detail on convergence of RELPAL, and Note 2 regarding numerical instability if P-values appear jagged. See the S.A.G.E. user manual for the most current overview of the different possible robust score tests that are reported. We recommend against using the naı¨ve variance estimate unless the data is known to be very close to a normal distribution. See Fig. 3 for some sample output.
314
N.J. Morris and C.M. Stein
Fig. 3. RELPAL output (slightly edited).
16
Model-Free Linkage Analysis of a Quantitative Trait
315
3. Notes 1. Choice of parameterization of dependent variable. As summarized by Sinha and Gray-McGuire (22), there are many options for the parameterization of the dependent variable and mean correction factor. Their paper examines the implications of ascertainment scheme and trait distribution on power and type I error of various parameterizations. We refer the interested reader to Figures 7 and 8 of their paper (22) for a scheme to select the dependent variable and mean correction. Unfortunately, not all the possible options in SIBPAL were investigated because the BLUP may be combined with W4. 2. Marks of numerical instability. If the plot of the log10 P -value is extremely jagged (i.e., nearby markers do not have similar P -values), there may be a problem with numerical stability. In extreme cases, the reader ought to consider using a different method. However, note that if empirical P-values are being used, the problem may also be due to an insufficient number of simulations/permutations. In the GUI for SIBPAL, this may be changed by clicking on Define next to Compute empirical P-values. 3. Failure to converge in RELPAL. Occasionally, RELPAL may issue an error stating that it failed to converge. This is typically because the data displays extreme non-normality. If this is the case, it may help to first transform the residuals. Alternatively, under the First level analysis definition, the user may specify Normalize residual. This may also help. References 1. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2: 3–19 2. Elston RC, et al (2000) Haseman and Elston revisited. Genet Epidemiol 19: 1–17 3. Shete S, Jacobs KB, Elston RC (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum Hered 55: 79–85 4. Wang T, Elston RC (2004) A modified revisited Haseman-Elston method to further improve power. Hum Hered 57: 109–116 5. Amos CI. (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Amer J Hum Genet 54: 535–543
6. Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Amer J Hum Genet 62: 1198–1211 7. Allison DB, et al (1999) Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci–mapping procedure. Amer J Hum Genet 65: 531–544 8. Sham PC, Purcell S (2001) Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Amer J Hum Genet 68: 1527–1532 9. Chen WM, Broman KW, Liang KY (2004) Quantitative trait linkage analysis by generalized estimating equations: Unification of variance components and Haseman-Elston regression. Genet Epidemiol 26: 265–272
316
N.J. Morris and C.M. Stein
10. Goldgar DE (1990) Multipoint analysis of human quantitative genetic variation. Amer J Hum Genet 47: 957–967 11. Shete S, Jacobs KB, Elston RC (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum Hered 55: 79–85 12. Wang T, Elston RC (2005) Two-level Haseman-Elston regression for general pedigree data analysis. Genet Epidemiol 29: 12–22 13. Lange KL, Little RJA, Taylor J (1989) Robust statistical modeling using the T distribution. J Amer Statist Assoc 84: 881–896 14. Blangero J, Williams JT, Almasy L (2000) Robust LOD scores for variance componentbased linkage analysis. Genet Epidemiol Suppl 19: S8–S14. 15. Chen WM, Broman KW, Liang KY (2005) Power and robustness of linkage tests for quantitative traits in general pedigrees. Genet Epidemiol 28: 11–23 16. Abecasis GR, et al (2001) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97–101
17. Cardon LR, et al (1994) Quantitative trait locus for reading disability on chromosome 6. Science 266: 276–279 18. Cardon LR, et al (1995) Quantitative trait locus for reading disability: correction. Science 268: 1553 19. Goode EL, Jarvik GP (2005) Assessment and implications of linkage disequilibrium in genome wide single nucleotide polymorphism and microsatellite panels. Genet Epidemiol Suppl 29: S72–S76 20. Kong X, et al (2004) A combined linkage-physical map of the human genome. Amer J Hum Genet 75: 1143–1148 21. Matise TC, et al (2007) A second-generation combined linkage–physical map of the human genome. Genome Res 17: 1783–1786 22. Sinha R, Gray-McGuire C (2008) Haseman Elston regression in ascertained samples: Importance of dependent variable and mean correction factor selection. Hum Hered 65: 66–76
Chapter 17 Model-Free Linkage Analysis of a Binary Trait Wei Xu, Shelley B. Bull, Lucia Mirea, and Celia M.T. Greenwood Abstract Genetic linkage analysis aims to detect chromosomal regions containing genes that influence risk of specific inherited diseases. The presence of linkage is indicated when a disease or trait cosegregates through the families with genetic markers at a particular region of the genome. Two main types of genetic linkage analysis are in common use, namely model-based linkage analysis and model-free linkage analysis. In this chapter, we focus solely on the latter type and specifically on binary traits or phenotypes, such as the presence or absence of a specific disease. Model-free linkage analysis is based on allele-sharing, where patterns of genetic similarity among affected relatives are compared to chance expectations. Because the model-free methods do not require the specification of the inheritance parameters of a genetic model, they are preferred by many researchers at early stages in the study of a complex disease. We introduce the history of model-free linkage analysis in Subheading 1. Table 1 describes a standard model-free linkage analysis workflow. We describe three popular model-free linkage analysis methods, the nonparametric linkage (NPL) statistic, the affected sib-pair (ASP) likelihood ratio test, and a likelihood approach for pedigrees. The theory behind each linkage test is described in this section, together with a simple example of the relevant calculations. Table 4 provides a summary of popular genetic analysis software packages that implement model-free linkage models. In Subheading 2, we work through the methods on a rich example providing sample software code and output. Subheading 3 contains notes with additional details on various topics that may need further consideration during analysis. Key words: Genetic linkage analysis, Nonparametric linkage (NPL) score, Identity by descent (IBD) sharing, Affected relative pairs, Affection status, Likelihood ratio based linkage model, Pedigree structure, Kong and Cox model, Genetic heterogeneity, GENEHUNTER, ALLEGRO
1. Introduction 1.1. Overview
Genetic linkage analysis aims to detect chromosomal regions containing disease genes that influence risk of specific inherited diseases, by examining patterns of inheritance in families. Pedigree relationships are needed, as well as disease or trait information and genetic marker genotypes for at least some of the family members.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_17, # Springer Science+Business Media, LLC 2012
317
318
W. Xu et al.
The presence of linkage is indicated when a disease or trait cosegregates through the families with genetic markers at a particular region of the genome. Two main types of genetic linkage analysis are in common use. One of these, usually referred to as model-based or parametric linkage analysis, requires the assumption of a number of parameters about the relationship between the unknown gene mutations and the disease or trait and the genetic model underlying it. In this chapter, we focus solely on the other type, known as model-free linkage analysis, and specifically on binary traits or phenotypes, indicating the presence or absence of a specific disease. Modelfree linkage analysis is based on allele-sharing, where patterns of genetic similarity among affected relatives are compared to chance expectations. Allele sharing models are called model-free methods because prior specification of a disease inheritance model is not required. The popularity of these methods is largely derived from the conceptual simplicity of the approach. In a linkage study, the disease susceptibility genotypes carried by each individual are unobserved. The relationship between a particular genetic variant and disease risk can be described by a penetrance function, which is defined as the probability of being affected given the genotype at the disease gene. Let y denote the disease status, with y ¼ 1 denoting disease, y ¼ 0 denoting no disease. Then P(y|g) denotes the penetrance for phenotype y of an individual with risk genotype g. If there are only two alleles at the locus, one abnormal (D), and one normal (d), then there are only three genotypes (DD, Dd, and dd). If the penetrances are known, a model-based linkage analysis that uses this information will be the most powerful test of linkage. However, this is rarely the case, especially for complex diseases. Since the model-free methods do not require specification of the penetrance parameters, they are preferred by many researchers at early stages in the study of complex disease (1, 2). The relevant observation in studies of affected relative pairs (ARPs) is how frequently the two related individuals share copies of the same ancestral marker allele; such copies are said to be inherited with “identity by descent” (IBD) (3, 4). Usually, IBD sharing at a genetic marker cannot be unequivocally determined from genotype data, but rather, is determined probabilistically based on observed multiple marker (multipoint) data. The software GENEHUNTER (5) was one of the first of several programs that can carry out such multipoint calculations to estimate IBD sharing probabilities for pairs of relatives using the Lander–Green algorithm (6). This algorithm is based on a hidden Markov model formulation of the pattern of inheritance at multiple loci. Gudbjartsson et al. improved the algorithm to run much faster, allowing slightly larger families to be analyzed (ALLEGRO software) (7, 8). Here we describe three popular model-free linkage analysis methods. First, we discuss methods developed for particular types of relative
17
Model-Free Linkage Analysis of a Binary Trait
319
Table 1 Standard analysis workflow for model-free linkage analysis Step 1
Assemble families for analysis, paying attention to consistent phenotype definitions (see Note 4) and completeness and quality of DNA collection. Estimate power to detect linkage in the available families, prior to genotyping (see Chapter 13)
Step 2
Obtain genotyping data from desired set of markers. Perform error checking using the genotype data—Mendelian errors (see Chapter 2), pedigree errors (see Chapters 3 and 4), Hardy–Weinberg equilibrium to test for poorly performing markers (Chapter 7) and clean the data set as needed
Step 3
Estimate allele frequencies, or obtain appropriate allele frequency estimates (see Chapter 5 and Note 2). Estimate marker informativity (see Note 1)
Step 4
Decide on the appropriate model-free linkage method to be used (see Subheading 1, and Note 5), and hence on the software to use (see Table 4 for a comparison of software features)
Step 5
Calculate patterns of IBD allele sharing (see Subheading 1.3, Note 3)
Step 6
Calculate model-free evidence of linkage. Plot results as a function of genetic distance and calculate likely intervals for linkage peaks (see Note 12)
Step 7
Estimate significance levels using random gene dropping, to take pedigree structures and marker informativity into account (see Note 10)
Step 8
Assess sensitivity to assumptions such as (1) phenotype definition (see Note 4), (2) allele frequencies (see Note 3), (3) choice of test statistic (see Note 5)
Step 9
Consider possible evaluation of evidence of linkage heterogeneity as a function of covariates (see Note 8)
pairs, in particular sibling pairs. For example, the affected sibpair (ASP) likelihood ratio test of Risch (9, 10) is based on three parameters, the respective probabilities that a sibling pair shares 0, 1, and 2 parental alleles IBD at a marker or a particular place in the genome; this approach and its extensions are implemented in the S.A.G.E. software LODPAL. Secondly, the nonparametric linkage (NPL) statistic, developed by Whittemore and Halpern (11) and implemented in several software packages is described. This approach can be used for larger families than simple relative pairs. Finally, we describe the likelihood approach of Kong and Cox (12), implemented in the ALLEGRO software, which has better properties than the NPL statistic when the information about the patterns of inheritance, obtained from the marker data, is incomplete. Table 1 describes a standard model-free linkage analysis workflow. The theory behind the linkage tests is described in Subheadings 1.3–1.7 together with a simple example of the relevant calculations. Subheading 2 works through a richer example in detail. Subheading 3 contains notes on various topics. Notes 1–3 address genotyping considerations, while notes 4 and 5 address phenotyping and model choice respectively. We discuss special topics in Notes 6–9, and issues of statistical inference in Notes 10–12.
320
W. Xu et al.
1.2. Simple Example
To illustrate the basic concepts, we present a simple example with one microsatellite marker genotyped in two families, each containing three individuals affected with a disease of interest. The pedigree diagrams are presented in Fig. 1. The genetic marker has five possible alleles (1/2/3/4/5) with allele frequencies (0.20, 0.20, 0.20, 0.20, and 0.20). In family 1, the parental genotype information is unavailable, but in family 2, everyone has genotype data. Linkage analysis can be undertaken to investigate whether this marker is linked to the disease locus or, in other words, whether the disease and the marker tend to be inherited together.
1.3. Identity by Descent
All linkage analysis methods depend on estimates of IBD, that is, estimates of whether a chromosomal segment has been inherited from the same ancestor. This is simplest to describe for a pair of siblings. At any location on the autosomes, siblings can share 0, 1, or 2 copies of their parents’ chromosomes. The two unaffected siblings in family 2, for example, share two alleles IBD since they inherited the same alleles from each parent. In contrast, sibs 23 and 25 (in family 2) share no alleles IBD since allele 3, which they each carry, was
Fig. 1. Pedigree diagrams for the simple example. Notes: Circles represents females, rectangles represent males, and black symbols represent affected individuals. Each individual is assigned an identifying label (above the symbols). and the genotypes of each individual are marked below each symbol. For example, individuals 21 and 22 in family 2 each inherited marker allele 4 from their father, and marker allele 5 from their mother.
17
Model-Free Linkage Analysis of a Binary Trait
321
Table 2 Expected IBD sharing under the null hypothesis of no linkage (z0, z1, z2), with IBD estimates based on the marker genotypes ðz^0 ; z^1 ; z^2 Þ for all sib pairs in two simple pedigrees (estimated by GENEHUNTER) Pedigree
Sib pair
z0
z1
z2
z^0
z^1
z^2
Family 1
21, 22
0.25
0.5
0.25
0
0
1
Family 1
21, 23
0.25
0.5
0.25
0
1
0
Family 1
21, 24
0.25
0.5
0.25
1
0
0
Family 1
21, 25
0.25
0.5
0.25
0
1
0
Family 1
22, 23
0.25
0.5
0.25
0
1
0
Family 1
22, 24
0.25
0.5
0.25
1
0
0
Family 1
22, 25
0.25
0.5
0.25
0
1
0
Family 1
23, 24
0.25
0.5
0.25
0
1
0
Family 1
23, 25
0.25
0.5
0.25
0.5
0
0.5
Family 1
24, 25
0.25
0.5
0.25
0
1
0
Family 2
21, 22
0.25
0.5
0.25
0
0
1
Family 2
21, 23
0.25
0.5
0.25
0
1
0
Family 2
21, 24
0.25
0.5
0.25
1
0
0
Family 2
21, 25
0.25
0.5
0.25
0
1
0
Family 2
22, 23
0.25
0.5
0.25
0
1
0
Family 2
22, 24
0.25
0.5
0.25
1
0
0
Family 2
22, 25
0.25
0.5
0.25
0
1
0
Family 2
23, 24
0.25
0.5
0.25
0
1
0
Family 2
23, 25
0.25
0.5
0.25
1
0
0
Family 2
24, 25
0.25
0.5
0.25
0
1
0
inherited from a different parent. At a location that is not associated with disease, for a sibling pair: IBD ¼ 0 with probability ¼, IBD ¼ 1 with probability ½, and IBD ¼ 2 with probability ¼. Table 2 shows the IBD estimates for all the sibling pairs in our simple example. When multiple markers have been genotyped, recombination or crossovers during meiosis in each parent will lead to observable changes in the IBD patterns between relatives along the chromosomes. Suppose that, for a pair of siblings, a large number of genetic markers spanning the genome have been genotyped. Then, assuming that recombination rates and allele frequencies are known, estimates of IBD can be calculated at any point in the genome,
322
W. Xu et al.
including between marker locations. Usually, IBD cannot be completely inferred from the available marker data, so IBD estimates are simply the probabilities of the three IBD states at a particular location. In larger pedigrees of more general structure, Kruglyak et al. (5) demonstrated how to obtain patterns of IBD in a computationally efficient manner. All possible inheritance patterns from the founders in each family down to the bottom generation are enumerated for each founder allele. These patterns are termed inheritance vectors, and the probability of each inheritance vector can be calculated by taking into account the recombination in each pattern. Then IBD is estimated in a straightforward manner from the inheritance patterns and their probabilities. Details of the algorithm and more recent advances are well-described elsewhere (8, 13, 14). 1.4. Model-Free Linkage Analysis for Affected Relative Pairs
Once estimates of IBD sharing at markers for ARPs have been calculated, tests of linkage can be developed. Let zi denote the probability that two relatives inherit i marker alleles IBD, i ¼ 0, 1, and 2. Let z ¼ ðz0 ; z1 ; z2 Þ, and ^z ¼ ð^z0 ; ^z1 ; ^z2 Þ be the IBD estimates from the data at a genomic location of interest. Outbred relatives cannot share more than two alleles IBD. Several tests of linkage have been proposed for ARPs and, in each case, the null hypothesis is that the proportions z follow their expectations in the absence of linkage, that is z ¼ ðz0 ; z1 ; z2 Þ ¼ p, where p depends on the relative pair type. For example, for sib pairs, p ¼ (0.25, 0.5, 0.25). Under an alternative hypothesis of linkage, excess allele sharing would be expected for ARPs at positions where the marker locus is linked to the disease gene, with a pattern of sharing along the chromosome that attenuates with distance from a linkage peak of excess sharing. An early model-free ARP linkage test is the “mean test,” which compares the mean value of alleles shared IBD by the ARPs, 2^z2 þ ^z1 , with its null expectation 2z2 þ z1 . A second test, the proportion test, compares the proportion ^z2 with its null expectation z2 . This latter test is appropriate only for sib pairs since other relative pairs are not expected to share two alleles IBD (4, 15–17). Both the mean test and the proportion test can be seen as special cases of a general test statistic obtained by taking a weighted linear combination (18) w0 ð^z0
z0 Þ þ w1 ð^z1
z1 Þ þ w2 ð^z2
z2 Þ:
Since the weights can be standardized arbitrarily, it is possible to set w0 ¼ 0 and w2 ¼ 1. Therefore, the test can be completely specified by assigning the weight w1 . Specifically, w1 ¼ 0:5 corresponds to the mean test, and w1 ¼ 0 corresponds to the proportion test. Tests in this family are referred to as “1 degree of freedom tests.” Although all are valid tests in the absence of linkage, they have different power for detecting linkage, and the power depends on the true penetrances and genetic model.
17
Model-Free Linkage Analysis of a Binary Trait
323
In each of families 1 and 2, we can examine the three pairs of affected siblings. In family 1, two affected pairs share one allele IBD, and one affected pair shares two alleles IBD; while in family 2, one affected pair shares zero alleles IBD, and two affected pairs share one allele IBD. If for the moment we assume that all the six sibling pairs are independent, the mean test would compare 2(1/6) + 1(4/6) ¼ 6/6 to the expected value of 1. The proportion test would compare 1/6 to the expected value of 1/4. 1.5. Model-Free Linkage Analysis in General Pedigrees
For families containing a variety of configurations of affected individuals, more general test statistics are needed. Whittemore (19) showed how tests of linkage could be conceptually unified, by demonstrating that patterns of IBD sharing among affected relatives can be assigned scores with the property that the scores are larger when there is more sharing among affected relatives. Allele sharing is quantified by a scoring function Sðvi ðtÞ; Fi Þ that depends on the inheritance vector vi ðtÞ and phenotype Fi specific to pedigree i. The scoring function of each family is normalized under the null hypothesis of no linkage to obtain a pedigree-specific NPL score, and then inference about linkage is based on a linear combination of pedigree-specific NPL scores (5). This approach to linkage, often termed “nonparametric linkage” became widely used with the development of software to calculate the IBD patterns simultaneously with the scores (5, 20). Several different scoring functions have been proposed (see Note 5). One popular scoring function is Sall (11) which measures the IBD allele sharing by giving a larger weight to alleles shared among many different individuals in a family. Let h represent a collection of alleles obtained by selecting one allele from each of the a affected individuals, and let bj (h) equal the number of times that the j th founder allele appears in h (for j ¼ 1, . . ., 2f ). Then, " 2f # 1 X Y bj ðhÞ! : Sall ¼ a 2 h j ¼1 An alternative scoring function is Spairs, which is the number of alleles IBD shared by two distinct affected relatives, summed over all possible pairs. Suppose that IBD can be inferred with certainty at a particular marker at chromosomal position t (this case is also known as “complete data”). Then for pedigree i with phenotype Fi the inheritance vector vi ðtÞ is known with certainty and, for a chosen statistic S, the corresponding NPL score Zi(t) is defined as: S ½vi ðtÞ; Fi mi Zi ðtÞ ¼ ; si
324
W. Xu et al.
where the terms mi and si are the mean and standard deviation of the scoring function, respectively, calculated under the null hypothesis of no linkage. Let P0 ½vi ðtÞ ¼ w represent the null probability of inheritance vector w and define: Si;w ðtÞ ¼ S ½vi ðtÞ ¼ w; Fi ;
mi ¼ E0 ½Sðvi ðtÞ; Fi Þ; and s2i ¼ E0 S 2 ðvi ðtÞ; Fi Þ
E02 ½S ðvi ðtÞÞ:
When the inheritance vector is known with certainty, i.e., in the complete data case, S ½vi ðtÞ; Fi can be directly calculated by enumerating all sets. For the collection of N pedigrees, the overall NPL score is: N X
Z ðtÞ ¼
i¼1
gi Zi ðtÞ;
where gi is a pedigree-specific weight. In the complete data case, the null variance of Z(t) is well pffiffiffiffiffiapproximated by 1. Kruglyak et al. (5) proposed using gi ¼ 1= N , so that in the absence of linkage the NPL score has asymptotically a standard normal distribution. When full inheritance information is not available, several inheritance vectors may be compatible with the observed data, and the probability distribution of the set of possible inheritance vectors can be calculated from the marker data. Then, the expected value of the scoring function is computed using the possible inheritance vectors indicated by the data, weighted by their probabilities. Let gi;w ðtÞ ¼ P½vi ðtÞ ¼ wj marker data. The expected value of the scoring function is then: X i ðtÞ; FÞ ¼ Sðv Si;w ðtÞgi;w ðtÞ: w
Given that
Si;w ðtÞ Zi;w ðtÞ ¼ si
mi
;
the expected value Zi ðtÞ is the pedigree-specific NPL score X Zi;w ðtÞgi;w ðtÞ Zi ðtÞ ¼ w
and the Zi ðtÞ are combined for the N sampled families to obtain an overall NPL statistic: Z ðtÞ ¼
N X i¼1
gi Zi ðtÞ:
A large NPL value at a particular genomic location provides evidence of linkage between that locus and a gene that increases disease risk. Using the program GENEHUNTER (5), Table 3 shows the estimated NPL scores and P -values for the single genetic
17
Model-Free Linkage Analysis of a Binary Trait
325
Table 3 NPL scores and P -values using Sall or Spairs in our simple example Test
Sall (Spairs)
Family
NPL score
P -value
Family 1
0.816
0.438
Family 2
0.816
1.000
Total
0.000
0.684
marker in our simple example. In family 2, the NPL score is negative, reflecting less than expected allele sharing. This yields a P -value of 1.0 due to the test being one-sided in favor of excess sharing. For ASPs, the scoring functions Sall and Spairs give the same result. 1.6. Likelihood-Based Models in Allele-Sharing Linkage Analysis
Risch (9, 10) proposed a likelihood-based model for analyzing marker data from n ARPs. A test of linkage is based on the likelihood ratio Lð^z Þ=LðpÞ for affected sib pairs as follows: ! P2 n X z i wij i¼0 ^ LLR ¼ 2 ; loge P2 j ¼1 i¼0 zi wij
where wij are the probabilities of observing the marker genotypes of the j th pair, given that they share i alleles IBD. The ^zi are the probabilities of an affected pair sharing i alleles IBD (which are unknown and need to be estimated) under the alternative hypothesis, and zi are the null IBD sharing probabilities, for example, (0.25, 0.5, 0.25) for sib pairs. The LLR test statistic compares the likelihood of the ARP data when the three allele sharing probabilities are estimated, Lð^z0 ; ^z1 ; ^z2 Þ, to the likelihood under the null hypothesis LðpÞ. The EM algorithm is used to maximize this LLR statistic with respect to the parameters z. If the maximized LLR statistic is greater than the test criterion, the null hypothesis of no linkage is rejected. The only constraint on the maximum likelihood estimate (MLE) is that ^z0 þ ^z1 þ ^z2 ¼ 1. When no other constraints are imposed on the estimated proportions, this likelihood ratio test is expected to be asymptotically distributed as w22 . In ASPs, however, the power of the LR test can be increased by evaluating the numerator not at the point ^z but, rather, at a different point z 0 , where z 0 is constrained to lie within the triangle of values consistent with the underlying genetics of IBD allele sharing by the sibs (21). Hence, this constrained likelihood ratio test is based on the ratio Lðz 0 Þ=LðpÞ.
326
W. Xu et al.
Using this likelihood ratio test for sibling pairs in our simple example, the ð^z0 ; ^z1 ; ^z2 Þ estimate is (0.266, 0.509, 0.225), and the test statistic for linkage is 1.401 with corresponding P -value 0.496. The test is based on all the affected sib pairs, assuming independence of the pairs (see Note 11). In general pedigrees, Whittemore (19) presented a differently parameterized likelihood approach to linkage analysis and showed its correspondence with NPL analysis. To construct the likelihood model, we can consider a pedigree i with M markers genotyped at position t1, . . ., tM along a chromosome. Let Yi ðtm Þ represent the observed genotypes of the marker at position tm and let Wi ðtm Þ denote the phase-known genotypes such that Yi ðtm Þand Wi ðtm Þ are equivalent to identity in state (IIS) and IBD allele configurations. Whittemore (19) defined the pedigree risk ratio Ri;w ðt; dÞ as the ratio of the conditional probability of phenotype Fi given IBD configuration w at locus t, over the probability of observing exactly the same pedigree phenotype irrespective of the IBD configuration at t for an arbitrary pedigree of the same size and structure. This ratio depends on a parameter d, which measures the effect of the gene at t on the disease risk: Ri;w ðt; dÞ ¼
P ½Fi jWi ðtÞ ¼ w; d : P ½Fi
A simple linear model for the pedigree phenotype risk ratio is: Ri;w ðt; dÞ ¼ 1 þ xi;w ðtÞd; where xi;w ðtÞ is an explanatory variable that is a function of the IBD configuration of the disease gene. The likelihood function incorporating the pedigree phenotype risk ratio is X Li ðt; dÞ ¼ Ri;w ðt; dÞgi;w ðtÞL0;i : w
The factors gi;w ðtÞ and L0;i in the above likelihood are calculated directly using the observed marker genotypes and populationbased estimates of marker allele frequencies. The pedigree phenotype risk ratio Ri;w ðt; dÞ is the only component of the likelihood that depends on the unknown parameter d. To test for linkage, Whittemore (19) specified the null hypothesis H0 : d ¼ 0 and the one-sided alternative HA : d>0. A likelihood ratio statistic LR(t) is constructed to compare the likelihood, maximized with respect to d, to the likelihood under the null when d0 ¼ 0: LRðtÞ ¼ 2 ln
N X X Lðt; ^dÞ Ri;w ðt; ^dÞgi;w ðtÞ: ¼ 2 ln Lðt; d0 Þ i¼1 w
The distribution of LR(t) is asymptotically w2 with one degree of freedom. The null hypothesis of no linkage is rejected when the LOD(t) score exceeds a predetermined critical level, where
17
Model-Free Linkage Analysis of a Binary Trait
LODðtÞ ¼ log10
327
Lðt; ^dÞ LRðtÞ : ¼ L0 ðtÞ 2 lnð10Þ
Based on the peak LOD estimator of location we can construct a 1-LOD support interval (22). The 1-LOD support interval is determined by the chromosomal points where the LOD scores are within 1 LOD unit of the peak LOD score (see Note 12). 1.7. Kong and Cox Model
Following the approach of Whittemore (19), Kong and Cox (12) developed a linkage likelihood model in which the covariates of the pedigree phenotype risk ratio are weighted NPL scores. They defined xi;w ¼ gi Zi;w ðtÞ. Then the corresponding pedigree phenotype risk ratio is defined as Ri;w ðt; dÞ ¼ ½1 þ dgi Zi;w ðtÞ: For a set of N pedigrees, the resulting score statistic SS(t) is equivalent to the overall NPL When the pedigree-specific weights gi PNscore. 2 ¼ 1, and the null variance of Z(t) is are constrained to g i¼1 i approximated by 1, the efficient score statistic simplifies to ESðtÞ ¼ Z ðtÞV0 ½SSðtÞ 1 Z ðtÞ ¼ ½Z ðtÞ2 :
When inheritance information is only partially available, however, application of the complete data approximation of Kruglyak et al. (5) gives conservative results, because the null variance of Z(t) is less than 1. The likelihood ratio test proposed by Kong and Cox (12) does not require the complete data approximation and provides a more accurate linkage test than ES(t) or NPL analysis for most situations where inheritance information is incomplete. The linear likelihood function specified by Kong and Cox (12) for a set of N pedigrees is: L ðt; dÞ ¼
N Y i¼1
½1 þ dgi Zi ðtÞL0;i :
Testing for linkage via a likelihood ratio test requires the maximization of Lðt; dÞ with respect to d. An upper bound b is imposed on the MLE ^d to ensure Ri ðt; dÞ 0: If ai ðtÞ is the smallest possible value that the scoring function Si;w ðtÞ can theoretically take at position t, bi ðtÞ ¼ si =½mi ai ðtÞ and the upper bound of ^d is b ¼ minðbi ðtÞÞ for i ¼ 1, . . ., N sampled pedigrees. The lower bound of ^d is 0 as the model does not permit a negative gene effect. A likelihood ratio test LR(t) is constructed using the likelihood maximized under the constraint 0 ^d b: " # h i Lðt; ^dÞ LRðtÞ ¼ 2 ln ¼ 2 lðt; ^dÞ l ðt; d0 Þ : L ðt; d0 Þ
328
W. Xu et al.
Taking the natural logarithm of the likelihood function gives: ( ) N Y l ðt; dÞ ¼ ln ½1 þ dgi Zi ðtÞL0;i i¼1
¼Cþ
N X i¼1
ln½1 þ dgi Zi ðtÞ;
P where C ¼ N i¼1 ln½L0;i is a constant calculated using the observed marker data. Equivalently one can compute the statistic rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi h iffi ^ ZLR ðtÞ ¼ 2 lðt; dÞ l ðt; d0 Þ vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u N h i u X ln 1 þ ^dgi Zi ðtÞ ; ¼ t2 i¼1
which is well approximated by a Gaussian distribution when the number of pedigrees is large. Kong and Cox (12) implemented the ZLR(t) linkage test in the program GENEHUNTER-PLUS, and this test statistic is currently available in the ALLEGRO and MERLIN software (see Table 4). When inheritance is incomplete, ZLR(t) is a more powerful test than the NPL method. However, in the presence of a gene effect, the upper bound imposed on the MLE of d restricts the amount of possible deviation. This can lead to substantial power losses if the dataset consists of a small number of pedigrees and dramatic sharing is observed (12). Kong and Cox (12) outlined an alternative exponential model where the parameter d has no upper bound. The exponential model can be written as dgi Si;w ðtÞ mi P ðvi ðtÞ ¼ wjdÞ ¼ pi;w ðtÞri ðdÞ exp ; si where ri ðdÞ ¼
pi;w ðtÞ
X w
dg ½Si;w ðtÞ exp i si
mi
!
1
isP the renormalization constant necessary to ensure that w P½vi ðtÞ ¼ wjd ¼ 1. Computation of ZLR(t) is more demanding for the exponential model because the conditional distribution of each Zi(t) must be calculated. However, the upper bound is no longer a problem for this exponential model. Since our simple example contains only two families, the Kong and Cox tests of linkage would be unreliable and are not given here. 1.8. Summary
In contrast to model-based tests of linkage (see Chapter 15), model-free linkage tests do not need to specify the true relationship
ASP, ARP, pedigrees
MERLIN (Multipoint Engine for Rapid Likelihood INference)c
ASP, ARP, pedigrees
ASP, ARP, pedigrees
a
Must be provided by the user; Haplotype reconstruction; multiple measures of information content
Allele frequency estimation
Option to use pedigree information to identify unlikely genotypes
Perfect-data approximation
NPL statistics, Perfect-data using S(all), S approximation (pairs), and four for NPL; other scoring asymptotic LR functions; LOD for LOD score; score from Kong two types of and Cox linear or multipoint data exponential simulation model
Models and statistics P -value calculation
NPL statistics, Maximum using S(all) or S likelihood (pairs) scoring estimation for all functions non-founder (calculated as relative pairs single-point or multipoint)
Prior and posterior pairwise IBD sharing for all relative pairs, single-point and multipoint
IBD allele sharing estimation
May be provided by Single-point or NPL statistics, Asymptotic or by the user or multipoint for all using S(all) or S gene-dropping estimated by relative pairs in a (pairs), and LOD simulation maximum pedigree, with score from Kong likelihood or additional IBD and Cox linear or counting; states for inbred exponential haplotype and non-inbred model estimation in pedigrees pedigrees; estimation of information content
By examination of Must be provided by the user; estimated estimation of haplotypes for information excessive obligate content recombination
Calculation of observed crossover rate to detect genotyping errors or marker order problems
Family structures Error checking
GENEHUNTERb
ALLEGRO
Software
Table 4 Comparison of available software for model-free linkage analysis of binary traits
(continued)
Options to allow for LD among neighboring markers; option for analysis of large pedigrees using SIMWALK2
Includes X chromosome analysis
Allows unequal family weights; use of sexspecific marker maps
Other
ASP, ARP, Discordant Sib Pairs (DSP)
MARKER INFO: for Mendelian errors, RELTEST: for relationship errors
Family structures Error checking
IBD allele sharing estimation
FREQ: estimated in GENIBD: singleand multisingletons and in marker, pairwise pedigrees by for fullsibs, half maximum sibs, likelihood or by grandparental, counting avuncular, first cousin
Allele frequency estimation SIBPAL: test of mean allele sharing, for full and half-sibs; LODPAL: LOD score for ASPs, LOD score for ARPs in 1 and 2 parameter conditional logistic model Asymptotic and empirical
Models and statistics P -value calculation
Links to software, documentation, and tutorials: a http://www.decode.com/software/ b http://www.broad.mit.edu/ftp/distribution/software/genehunter/ c http://www.sph.umich.edu/csg/abecasis/Merlin/ http://www.sph.umich.edu/csg/abecasis/Merlin/tour/ d http://darwin.cwru.edu/sage/
SAGE (Statistical Analysis for Genetic Epidemiology)d
Software
Table 4 (continued)
Allows for discordant relative pairs, Xlinked models, parent-oforigin models, and option for pair-level covariates
Other
17
Model-Free Linkage Analysis of a Binary Trait
331
between the genes and the disease risks. However, while the presumed disease model is made explicit in model-based linkage analysis, model-free methods make implicit assumptions about the disease–gene relationships, which can influence the power of the tests of linkage (see Note 5). In a genome scan for linkage, modelfree linkage tests are usually only the first step in the search for disease susceptibility genes, and a genome-wide scan for linkage peaks may be used simply to identify regions that may harbor such genes. These regions can then be further analyzed with alternative methods such as fine-mapping studies. These allele-sharing methods have been widely used for the study of linkage of complex diseases; however, in their simple form they cannot directly detect gene–gene or gene–environment interactions.
2. Methods 2.1. Data
The methods for model-free linkage analysis are illustrated by analysis of a data set containing families with multiple cases of inflammatory bowel disease. Inflammatory bowel disease is a disorder of the autoimmune system characterized by chronic inflammation of the gastrointestinal tract. Epidemiological studies have provided evidence of a substantial genetic contribution to susceptibility. Familial aggregation has been observed in 10% of cases, with monozygotic and dizygotic twin concordance rates of 40–50% and 8%, respectively, and disease prevalence ten times greater in first degree relatives of affected persons than in unrelated individuals from the general population (23, 24). Cases with early age at onset are hypothesized to have a more strongly genetic etiology. The disease occurs in two main forms, Crohn disease (CD) and ulcerative colitis (UC). To illustrate model-free linkage methods, we apply them in an analysis of 122 Canadian CD families recruited from the Toronto area as part of a genome-wide linkage study (25). As reported previously (26, 27), all types of affected relatives, as available, were included in the analyses (Table 5). The 122 CD families are classified into two categories: CD16 (at least one patient diagnosed at 16 years or younger) and CD > 16 (families not in CD16). Genotyping data were available for 17 microsatellite markers on chromosome 5 spanning 156 cM with an average inter-marker distance of 10.34 centiMorgans (cM) (Table 6). These were highly polymorphic markers with considerable variation in allele frequencies.
2.2. Statistical Analyses
Pedigree-specific NPL scores from multipoint linkage analyses, using the allele-sharing scoring functions Sall and Spairs, were obtained in ALLEGRO. We also fitted the Kong and Cox (12) linear and exponential likelihood models separately to CD, CD16, and
332
W. Xu et al.
Table 5 A breakdown of the 122 CD families, according to the type of affected relatives Number of families Relationship of other affected relatives to the affected siblings
1 Affected 2 Affected 3 Affected sibling siblings siblings
None
–
106
7
Aunt/uncle
2
2
–
Cousin
–
2
–
a
1
1
1
Other a
Other includes: child; parent and grandparent; great aunt/uncle and great niece/nephew
CD > 16 family subgroups. To compare CD16 vs. CD > 16 family subgroups, analyses of genetic heterogeneity were conducted at the chromosome 5 position with the highest linkage signal, as reported previously (26, 27). A positive covariate value (X ¼ 1) was assigned to the CD16 subgroup in which the observed evidence for linkage was greater. The linear and exponential models were applied to yield the one-sided ZLR test for linkage and two likelihood ratio tests: LRC (2 df) assessing linkage and heterogeneity, and LRH (1 df) testing heterogeneity in allele sharing between CD16 and CD > 16 family subgroups (27). The test statistics, LR C ¼ Z 2 LR ðCD16Þ þ Z 2 LR ðCD > 16Þ and LR H ¼ Z 2 LR ðCD16Þ þ Z 2 LR ðCD > 16Þ
Z 2 LR ðCDÞ ;
were computed using the ZLR allele-sharing test statistics evaluated in the CD16, CD > 16, and CD (combined) samples. 2.3. Linkage Analyses of CD16 Families Using ALLEGRO
The linkage analyses software package ALLEGRO can be downloaded from http://www.decode.com/software/. At this time, the latest ALLEGRO version 2.0 includes distribution files for Unix, Redhat/Linux, Windows/DOS, and Mac/G5 platforms. Analyses are performed using a script or options file that specifies input data files, commands for various analyses, and output files with results. Here we illustrate running ALLEGRO on a Windows machine to analyze the CD16 families. The program is invoked by typing allegro-2_v0f CD.opt at the DOS command prompt in a directory that includes both the allegro-2v0f.exe executable file and
17
Model-Free Linkage Analysis of a Binary Trait
333
Table 6 Allele frequencies and Marker map distances on chromosome 5 Distance from first marker (cM)
Marker
Allele frequency distribution (estimated from founders)
D5S1492
(0.002, 0.006, 0.0636, 0.0915, 0.3738, 0.4632)
0
D5S807
(0.0019, 0.0075, 0.0188, 0.0226, 0.0245, 0.0245, 0.0414, 0.0546, 0.064, 0.1205, 0.2015, 0.4181)
9.6
D5S817
(0.0039, 0.0097, 0.0874, 0.2019, 0.2427, 0.4544)
13.5
D5S1473
(0.002, 0.002, 0.002, 0.002, 0.0059, 0.0059, 0.0119, 0.0257, 0.0317, 0.0931, 0.1743, 0.2554, 0.3881)
26.8
D5S1470
(0.0019, 0.0039, 0.0154, 0.0751, 0.0963, 0.106, 0.1233, 0.1464, 0.1734, 0.2582)
35.9
D5S2494
(0.0303, 0.0303, 0.0303, 0.0606, 0.3333, 0.5152)
49.5
GATA67D03 (0.0021, 0.0083, 0.0145, 0.0455, 0.0663, 0.0787, 0.0911, 0.1863, 0.2505, 0.2567)
59.8
D5S1501
(0.0077, 0.0251, 0.0271, 0.029, 0.0445, 0.0754, 0.1161, 0.1721, 0.1915, 0.3114)
75.8
D5S1719
(0.0041, 0.0164, 0.0184, 0.0777, 0.0879, 0.2495, 0.2495, 0.2965)
85.4
D5S1453
(0.002, 0.004, 0.0061, 0.0242, 0.0323, 0.0384, 0.0505, 0.0545, 105.3 0.1273, 0.2061, 0.4545)
D5S1505
(0.0017, 0.0087, 0.0419, 0.0471, 0.1082, 0.1431, 0.1728, 0.2286, 0.2478)
120.4
GATA68A03 (0.002, 0.0082, 0.0368, 0.0573, 0.1268, 0.1391, 0.2188, 0.411) 124.2 D5S816
(0.0018, 0.0036, 0.0344, 0.058, 0.1087, 0.1703, 0.1884, 0.2174, 0.2174)
129.9
D5S1480
(0.002, 0.004, 0.0381, 0.0641, 0.0842, 0.0882, 0.2244, 0.2385, 138.1 0.2565)
D5S820
(0.0019, 0.0243, 0.0598, 0.0654, 0.086, 0.1981, 0.2804, 0.2841)
D5S1471
(0.0018, 0.0037, 0.0221, 0.0314, 0.0978, 0.1402, 0.286, 0.417) 162.7
D5S1456
(0.0039, 0.0308, 0.1291, 0.1541, 0.1792, 0.1888, 0.3141)
150.4
165.4
334
W. Xu et al.
the following CD.opt options file:
Lines preceded by % correspond to comments and do not affect analyses. The commands PREFILE and DATFILE read the input pedigree cd16.ped and marker data ch5.loci files, respectively. Both of these are in LINKAGE format as described by Terwilliger and Ott (22). The pedigree file cd16.ped specifies the pedigree structure, affection status and genotypes at the 17 microsatellite markers on chromosome 5, for members of the CD16 families (N ¼ 51). The marker file ch5.dat specifies the marker positions on chromosome 5 and allele frequencies of the 17 microsatellite markers. The LODEXACTP and NPLEXACTP commands request computation of exact P -values for the LOD and NPL scores, respectively. ALLEGRO has the capacity to run multiple models simultaneously. The first MODEL command requests the following linkage analysis: multipoint (mpt) IBD estimation using the linear (lin) model and the Sall scoring function (all) with families receiving equal weights (equal). The output file cd16lin.out provides a summary of results summarized across families, whereas cd16lin.out provides family-specific results. The second MODEL command requests a similar linkage analysis, in this case for the exponential likelihood model (exp). The output file cd16exp.out is listed below.
17
Model-Free Linkage Analysis of a Binary Trait
335
336
W. Xu et al.
ALLEGRO computes all statistics at (m 1) positions between consecutive markers with m ¼ 2 as the default. The first column corresponds to the location in cM, relative to the first marker, and the last column contains the marker name or “–” for positions between markers. The second and fourth columns provide the allele-sharing LOD score and the nonparametric linkage NPL score. The dhat column is the MLE ^d and the Zlr column gives the ZLR score as defined earlier. The nplexactp and lodexactp columns provide P -values for the LOD and NPL scores, respectively, and the info column gives a likelihood-based measure of information (see Note 1). The family-specific output file cd16fexp.out has a similar format, as do the linear model output files. For additional commands and further details regarding alternative analyses options, please see the documentation included with the ALLEGRO software. 2.4. Results
The information content of the genotyped markers for CD families is provided in Fig. 2. The greatest evidence for linkage was observed near position 129.9 cM (Fig. 3) for both CD and CD16 subgroups. At this locus, the summary NPL scores using Sall are Z ¼ 2.11 (p ¼ 0.017) for CD and Z ¼ 2.68 (p ¼ 0.0037) for CD16 families, but only Z ¼ 0.49 for CD > 16 (Table 7). Although the NPL score was higher in the CD16 families, these data alone do not provide sufficient evidence to declare significant linkage (p < 2 10 5) according to established criteria (28). As expected, when inheritance information is less than complete, the model-based ZLR values from the Kong and Cox (12) likelihood models are consistently higher than the summary Z statistics (Table 7). In all subgroups, the linkage parameter d in the linear model maximized within the imposed constraints.
Fig. 2. Information content for CD families (n ¼ 122).
17
Model-Free Linkage Analysis of a Binary Trait
337
Fig. 3. Summary multipoint NPL scores for CD (n ¼ 122), CD16 (n ¼ 51), and CD > 16 (n ¼ 71) family subgroups. Calculations were performed using the Sall scoring function, with similar results obtained for Spairs.
Table 7 Results of linkage analyses at 129.9 cM position in CD family subgroups defined according to age at diagnosisa Linear model Subgroup N
Info Z
ZLR
d
Exponential model
P -value ZLR
d
P -value
(1) Z and ZLR tests based on Sall scoring function CD 122 0.74 2.11 2.47 0.26 0.007 CD16 51 0.77 2.68 3.10 0.46 0.001 CD > 16 71 0.71 0.49 0.58 0.08 0.28
2.46 0.26 0.007 3.06 0.49 0.001 0.59 0.08 0.28
(2) Z and ZLR tests based on Spairs scoring function CD 122 0.74 2.25 2.62 0.27 0.004 CD16 51 0.77 2.71 3.10 0.46 0.001 CD > 16 71 0.71 0.66 0.78 0.11 0.22
2.63 0.28 0.004 3.11 0.51 0.001 0.78 0.11 0.22
a
The summary NPL score Z and one-sided ZLR tests for the linear and exponential models were evaluated using Sall and Spairs scoring function implemented in the package ALLEGRO
The results of joint linkage and heterogeneity analyses, conducted at locus 129.9 cM, yielded similar results under the linear and exponential likelihood models (Table 8). In the CD families, the P -value for the LRC (2 df) test for linkage including a binary covariate defining the CD16 and CD > 16 subgroups is of the same order of magnitude as that of the ZLR (1 df) test for linkage without covariates (Table 8). Marginal evidence for heterogeneity between the CD16 and CD > 16 groups was detected with the Sall scoring function in both the linear and exponential models,
338
W. Xu et al.
Table 8 p Values for test statistics: ZLR for linkage in all CD families, LRC for linkage in CD16 and CD > 16 groups jointly, and LRH for allele-sharing heterogeneity between two CD family subgroups defined according to age at diagnosis Linear model
Subgroup
ZLR (1 df)
Exponential model
LRC (2 df)
LRH (1 df)
ZLR (1 df)
LRC (2 df)
LRH (1 df)
CD16 vs. CD > 16 Sall
0.007
0.007
0.051
0.007
0.008
0.055
Spairs
0.004
0.006
0.068
0.004
0.006
0.065
consistent with the observation of higher excess allele sharing in the families with early age at onset. This locus has become known as IBD5, and subsequent analyses refined the locus to a 250-kb risk haplotype (29). In a recent review (30), the authors note that the association of this haplotype with IBD has been widely replicated in a number of independent populations and that the IBD5 risk haplotype has been principally associated with CD. This region has proven to be particularly difficult to fine map because it contains a significant degree of linkage disequilibrium, making it difficult to discern a causative allele from a marker allele co-inherited with the disease-causing allele.
3. Notes 1. Marker informativity. The ability to detect linkage rests on the ability to accurately estimate the patterns of identity by descent. Microsatellite markers, which tend to have four or more alleles, are more informative about inheritance patterns. However, SNP markers, which only have two alleles, are now the most commonly used marker type since there are millions in the genome and they can be typed efficiently. To achieve the same level of information about the inheritance patterns across the genome as obtained with microsatellites, denser SNP typing is needed (31). Several alternative measures of information content have been proposed, including an entropy-based measure (5) and likelihood-based measures (8, 32).
17
Model-Free Linkage Analysis of a Binary Trait
339
2. Allele Frequency Errors. Genotype data will be unavailable for some individuals, especially for the top generation of a pedigree. Estimation of IBD then relies on population frequency estimates of each allele (allele frequencies, see Chapter 5). Unfortunately, linkage results can be substantially biased if the wrong allele frequencies are used (33, 34). When a common allele is mistakenly assumed to be rare and the founders are not genotyped, false-positive linkage signals can be obtained, since family members carrying this allele will be inferred (erroneously) to have the same ancestor (35). Ideally, allele frequency estimates appropriate for the ancestral background of each pedigree should be used, but most software packages do not allow different allele frequencies for different sets of families. 3. Linkage Disequilibrium Between Markers. Most algorithms to calculate IBD assume linkage equilibrium between markers; however, closely spaced markers are likely to be in linkage disequilibrium. Hence, some haplotypes may be more common than would be expected under linkage equilibrium, and this can be falsely interpreted as within-family sharing (36, 37). It is, therefore, important to ensure that the markers used in linkage analysis are well spaced and in approximate equilibrium. This can be achieved by judiciously removing markers from the data set (38). The MERLIN software uses a clustering method to address this issue (39). An alternative approach is implemented in the EAGLET software (40). 4. Phenotype Definitions. The trait being studied is crucial to the success of any linkage study. Ideally, the best choices for phenotypes are diseases or traits that are closely or directly influenced by risk genes. However, since the mechanisms relating genes to phenotypes are unknown, choosing on this basis directly is impossible. The chosen phenotypes for analysis should, at a minimum, be clearly measurable and show good inter- and intra-rater reliability. Particularly for psychiatric disorders, it has been argued that the major diagnoses may be too heterogeneous, and so different phenotypes, possibly based on specific test results, may be more useful (41). For example, when studying suicide, impulsive aggressive behavior is associated with completed suicides, and this trait may be more directly affected by genes (42). 5. Choice of Models and Test Statistics. Although available test statistics provide valid tests of linkage (so that in the absence of linkage their distributions in large samples are known), their power varies, and the optimal statistic to test for linkage depends on the true underlying penetrances and genetic model. McPeek (43) compared a number of different scoring functions for tests of linkage against a number of different genetic models and identified the most powerful scores for
340
W. Xu et al.
different situations. In the absence of knowledge about the best choice, some have tried maximizing the tests of linkage across a selection of models, with empirical P -values obtained through simulation (44). Unfortunately, this approach does not always lead to better power. In the Kong and Cox approach, even after specifying the scoring function, it is necessary to choose either the linear or the exponential model. Although estimation in the linear model is easier, when sharing deviates markedly from the null hypothesis the exponential model may be more powerful. Certainly, in the presence of heterogeneity, the exponential model appears to be more powerful (27). For multigenerational pedigrees, Basu et al. (45) proposed a new model specification, within the likelihood framework of Kong and Cox, implemented in the program lm ibdtests in the software package MORGAN (http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml). Their approach is also applicable to smaller pedigrees, can incorporate information on both affected and unaffected individuals, and does not require specification of an IBD measure such as Sall or Spairs. 6. X Chromosome. Any linkage method, including estimation of IBD, needs modification to work properly for markers on the X chromosome. The most popular software packages for model-free linkage with binary traits all work properly on X chromosome data (see Table 4). 7. Unaffected Relative Pairs or Discordant Pairs. Although linkage tests can also be developed as a function of IBD sharing patterns in unaffected relatives, or between pairs of individuals who are discordant for the disease of interest, such study designs generally have low power (46). However, when the disease has very high prevalence, using discordant pairs can be more powerful than using pairs of affected relatives (47). 8. Detection of Heterogeneity in Allele-Sharing Linkage Analysis. The etiology of complex disorders is varied and may involve several susceptibility loci with interaction among multiple genetic and environmental factors. A likely component of complex disease is genetic heterogeneity, which refers to the situation in which the disease trait is independently caused by two or more factors, at least one of which is genetic. Analytic approaches used in gene mapping studies, including traditional likelihood LOD scores and model-free allele-sharing methods, are generally sensitive to the underlying genetic mechanism, and particularly to the presence of genetic heterogeneity. If analyses overlook the presence of genetic heterogeneity, then results may be biased and conclusions misleading. Heterogeneity among the allele-sharing distributions of sampled pedigrees can arise when genetic susceptibility varies
17
Model-Free Linkage Analysis of a Binary Trait
341
among the sampled pedigrees. This is the case with locus heterogeneity when two unlinked genes independently cause disease. Environmental factors can also independently affect the disease phenotype or interact with one or more genetic loci to alter the penetrance of susceptibility genotypes. Age, for example, is an important covariate that may affect the penetrance of susceptibility loci. Regardless of the cause, the presence of genetic heterogeneity affects the extent of observed IBD allele sharing at map positions closely linked to a disease gene, with serious consequences for linkage analysis. An approach to modeling heterogeneity in allele sharing in ASPs was developed by Greenwood and Bull (48, 49) who generalized the LR approach of Risch (9, 10) to include covariates and developed an EM approach to parameter estimation. They proposed a multinomial logistic regression model for the inclusion of covariates, and applied it with multiple covariates (50). Olson (51) developed a closely related conditional-logistic model for ARP linkage analysis, allowing different types of relative pairs to be included in the same ARP analysis (see Goddard et al. (52) for an example application). This model is parameterized in terms of the logarithms of allele-sharing-specific relative risks and is equivalent to Risch’s LR models (10). For multiple covariates, Xu et al. (53, 54) developed a recursive partitioning algorithm to identify nonlinear GxE interactions based on the allele-sharing likelihood ratio tests. The Kong and Cox (12) linear and exponential likelihood models were extended by Nicolae (55) and Mirea et al. (26, 27) to model heterogeneity in allele sharing among affected relatives using a family-level covariate vector and a corresponding regression parameter in the likelihood function. Incorporating individual-level covariates in model-free linkage analysis is conceptually challenging, since linkage is based on sharing between individuals. However, Whittemore and Halpern (56) proposed a method for including individual, covariate-based weights in NPL linkage analysis and they showed that this approach could lead to increased power to detect linkage. 9. Imprinting and Transmission Ratio Distortion. Most linkage models assume that inheritance of alleles from parents follows the usual Mendelian rules. However, violations of these assumptions are known to exist. Particular parental alleles may be preferentially inherited, leading to transmission ratio distortion. Such effects may be mediated through imprinting in the parental genome. Linkage models have been extended to allow for parent-of-origin imprinting effects (57, 58), and the performance of affected sib pair models has been examined when transmission ratio distortion is present (59). 10. Genome-Wide Significance for Linkage. Guidelines for genomewide significance levels for linkage analysis were proposed in
342
W. Xu et al.
1995 and became generally accepted (60). For sibling pairs, these authors recommended that a P -value of 7 10 4 be considered suggestive linkage, but a P -value of 2 10 5 would be needed to conclude significant linkage. Slightly different thresholds apply for different relative pair types. These stringent thresholds were recommended to control for the fact that researchers would repeat analyses with increasing numbers of families and increasingly dense markers in the hopes of finding significant linkage, and these numbers rest on the assumption of infinitely dense markers (completely informative for IBD patterns) and on large numbers of families. It may be impossible to achieve anything like these P -values in a small data set with a fixed marker set. The ability to detect linkage will also depend on the available family structures. Therefore, simulation (or gene dropping) is recommended to estimate the expected distribution of linkage test statistics specific to the data set, in the absence of linkage. The observed family structures (with the disease status) are held fixed, and marker data are “dropped” through each family from the founders down to the bottom of each pedigree, allowing for recombination at the normal rates. Tests of linkage can then be calculated on the simulated data, and a null distribution obtained by repeating the gene dropping many times. 11. Dependence of Multiple Relative Pairs from the Same Family. When using a linkage method for ARPs, multiple pairs from the same family are not independent. Methods such as the mean test or the proportion test for ASPs will give valid tests of linkage, but ARP likelihood-based methods, as described in Subheading 1.6, will be biased, generally inflating the evidence for linkage. Weighting schemes have been proposed (61) to reduce the total contribution of families containing multiple pairs; however, such weighting schemes can lead to overly conservative tests of linkage (62). When many pedigrees contain more than a single affected pair, it is a better choice to use a pedigree-based method (such as the NPL score or the Kong and Cox method) for calculating the evidence for linkage. 12. Intervals for Disease Gene Localization. The likely location of a putative disease locus is often constructed by the “1-LOD interval,” finding all points on the chromosome with an LOD score greater than the peak LOD score minus one. For parametric linkage models, this simple interval has approximately 95% coverage (22). However, these intervals are not always accurate, especially for model-free linkage tests, and so several other interval estimates have been proposed. Sinha et al. (63) describe several proposed methods, in addition to developing a new Bayesian method that has better coverage than many of the other approaches.
17
Model-Free Linkage Analysis of a Binary Trait
343
Acknowledgments We acknowledge the support of research grants from the Natural Sciences and Engineering Research Council of Canada and the Canadian Network of Centres of Excellence in Mathematics (MITACS, Inc.). References 1. Ott J (1996) Complex traits on the map. Nature 379: 772–773 2. Elston RC (2000) Introduction and overview. Statistical methods in genetic epidemiology. Statist Meth Med Res 9: 527–541 3. Fishman, et al (1978) A robust method for the detection of linkage in familial disease. Amer J Hum Genet 30: 308–321 4. Suarez BK (1978) The affected sib pair IBD distribution for HLA-linked disease susceptibility genes. Tissue Antigens 12: 87–93 5. Kruglyak L, et al (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Amer J Hum Genet 58: 1347–1363 6. Lander ES, Green P (1987) Construction of multilocus genetic maps in humans. Proc Natl Acad Sci 84: 2363–2367 7. Gudbjartsson DF, et al (2000) Allegro, a new computer program for multipoint linkage analysis. Nat Genet 25: 12–13 8. Gudbjartsson DF et al (2005) Allegro version 2, Nat Genet 37: 1015–1016 9. Risch N (1990a) Linkage strategies for genetically complex traits. I. Multilocus models. Amer J Hum Genet 46: 222–228 10. Risch N (1990b) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Amer J Hum Genet 46: 229–241 11. Whittemore AS, Halpern J (1994) A class of tests of linkage using affected pedigree members. Biometrics 50: 118–127 12. Kong A, Cox NJ (1997) Allele-sharing models: LOD scores and accurate linkage tests. Amer J Hum Genet 61: 1179–1188 13. Kruglyak L, Lander ES (1998) Faster multipoint linkage analysis using Fourier transforms. J Comp Biol 5: 1–7 14. Markianos K, Daly MJ, Kruglyak L (2001) Efficient multipoint linkage analysis through reduction of inheritance space. Amer J Hum Genet 68: 963–977 15. Suarez BK, Van Eerdewegh P (1984) A comparison of three affected-sib-pair scoring meth-
ods to detect HLA-linked disease susceptibility genes. Amer J Med Genet 18: 135–46 16. Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2: 85–97 17. Tierney C, McKnight B (1993) Power of affected sibling method tests for linkage. Hum Hered 43: 276–287 18. Schaid DJ, Nick TG (1990) Sib-pair linkage tests for disease susceptibility loci: common tests vs. the asymptotically most powerful test. Genet Epidemiol 7: 359–730 19. Whittemore AS (1996) Genome scanning for linkage: an overview. Amer J Hum Genet 59: 704–716 20. Kruglyak L, Lander ES (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Amer J Hum Genet 57: 439–454 21. Holmans P (1993) Asymptotic properties of affected-sib-pair linkage analysis. Amer J Hum Genet 52: 362–374 22. Terwilliger JD, Ott J (1994) Handbook of human genetic linkage. The Johns Hopkins University Press, Baltimore 23. Orholm M, et al (1991) Familial occurrence of inflammatory bowel disease. New England J Med 324: 84–88 24. Tysk C, et al (1988) Ulcerative colitis and Crohn’s disease in an unselected population of monozygotic and dizygotic twins. A study of heritability and the influence of smoking. Gut 29: 990–996 25. Rioux JD, et al (2000) Genome wide search in Canadian families with inflammatory bowel disease reveals two novel susceptibility loci. Amer J Hum Gene 66: 1863–70 26. Mirea L (1999) Detection of heterogeneity in allele sharing of affected relatives. M.Sc. Thesis, University of Toronto 27. Mirea L, Briollais L, Bull S (2004) Tests for covariate-associated heterogeneity in IBD allele sharing of affected relatives. Genetic Epidemiology 26: 44–60
344
W. Xu et al.
28. Kruglyak L, Lander ES (1995) High-resolution genetic mapping of complex traits. Amer J Hum Genet 56: 1212–1223 29. Rioux JD, et al (2001) Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genetics 29: 223–228 30. Walters TD, Silverberg MS (2006) Genetics of inflammatory bowel disease: current status and future directions. Can J Gastroenterol 20: 633–639 31. Evans DM, Cardon LR (2004) Guidelines for genotyping in genome wide linkage studies: single-nucleotide-polymorphism maps versus microsatellite maps. Amer J Human Genet 75: 687–692 32. Nicolae DL, Kong A (2004) Measuring the relative information in allele-sharing linkage studies. Biometrics 60: 368–375 33. Goring HH, Terwilliger JD (2000) Linkage analysis in the presence of errors I: complexvalued recombination fractions and complex phenotypes. Amer J Hum Genet 66: 1095–1106 34. Margaritte-Jeannin P, et al (1997) Heterogeneity of marker allele frequencies hinders interpretation of linkage analysis: Illustration on chromosome 18 markers. Genet Epidemiol, 14: 669–674 35. Williamson JA, Amos CI (1995) Guess LOD approach: sufficient conditions for robustness. Genet Epidemiol 12: 163–176 36. Huang, Q, Shete S, Amos CI (2004) Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Amer J Hum Genet 75: 1106–1112 37. Schaid DJ, et al (2002) Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Amer J Hum Genet 71: 992–995 38. Cho K, Dupuis J (2009) Handling linkage disequilibrium in qualitative trait linkage analysis using dense SNPs: a two-step strategy. BMC Genetics 10: 44 39. Abecasis GR, Wigginton JE (2005) Handling marker-marker linkage disequilibrium: Pedigree analysis with clustered markers. Amer J Hum Genet 77: 754–767 40. Stewart WC, Peljto AL, Greenberg DA (2010) Multiple subsampling of dense SNP data localizes disease genes with increased precision. Hum Hered 69: 152–159 41. Gershon ES, Goldin LR (1986) Clinical methods in psychiatric genetics, I: Robustness of genetic marker investigative strategies. Acta Psychiatr Scand 74: 113–118
42. Zouk H, et al (2007) The effect of genetic variation of the serotonin 1B receptor gene on impulsive aggressive behavior and suicide. Amer J Med Genet B Neuropsychiatr. Genet 144B: 996–1002 43. McPeek MS (1999) Optimal allelesharing statistics for genetic mapping using affected relatives. Genet Epidemiol 16: 225–249 44. Margaritte-Jeannin P, Babron MC, ClergetDarpoux F (2007) On the choice of linkage statistics. BMC Proc 2007 Suppl 1: S102 45. Basu S, et al (2010) A likelihood-based traitmodel-free approach for linkage detection of a binary trait. Biometrics 66: 201–213 46. Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268: 1584–1589 47. Rogus JJ, Krolewski AS (1996) Using discordant sib pairs to map loci for qualitative traits with high sibling recurrence risk. Amer J Hum Genet 59: 1376–1381 48. Greenwood CMT, Bull SB (1997) Incorporation of covariates into genome scanning using sib-pair analysis in bipolar affective disorder. Genet Epidemiol 14: 635–640 49. Greenwood CMT, Bull SB (1999) Analysis of affected sib pairs, with covariates–with and without constraints. Amer J HumGenet 64: 871–885 50. Bull SB, et al (2002) Regression models for allele sharing: analysis of accumulating data in affected sib pair studies. Statist Med 21: 431–444 51. Olson JM (1999) A general conditionallogistic model for affected-relative-pair linkage studies. Amer J Hum Genet 65: 1760–1769 52. Goddard KA et al (2001) Model-free linkage analysis with covariates confirms linkage of prostate cancer to chromosomes 1 and 4. Amer J Hum Genet 68: 1197–1206 53. Xu W, et al (2005) Recursive partitioning models for linkage in COGA data. BMC Genet 6 Suppl 1: S38 54. Xu W, et al (2006) A tree-based model for allelesharing-based linkage analysis in human complex diseases. Genet Epidemiol 30: 155–169 55. Nicolae DL (1999) Allele sharing models in gene mapping: A likelihood approach. Ph.D. thesis, Dept. Statistics, Univ. Chicago 56. Whittemore AS, Halpern J (2006) Nonparametric linkage analysis using person-specific covariates. Genet Epidemiol 30: 369–379 57. Strauch K, et al (2000) Parametric and nonparametric multipoint linkage analysis with
17
Model-Free Linkage Analysis of a Binary Trait
imprinting and two-locus-trait models: application to mite sensitization. Amer J Hum Genet 66: 1945–1957 58. Sinsheimer JS, Blangero J, Lange K (2000) Gamete-competition models. Amer J Hum Genet 66: 1168–1172 59. Greenwood CMT, Morgan K (2000) The impact of transmission ratio distortion on allele sharing in affected sibling pairs. Amer J Hum Genet 66: 2001–2004 60. Lander ES, Kruglyak L (1995) Genetic dissection of complex traits: Guidelines for interpret-
345
ing and reporting linkage results. Nat Genet 11: 241–247 61. Sham PC, Zhao JH, Curtis D (1997) Optimal weighting scheme for affected sib-pair analysis of sibship data. Ann Hum Genet 61: 61–69 62. Greenwood CMT, Bull SB (1999) Downweighting of multiple affected sib pairs leads to biased likelihood ratio tests under no linkage. Amer J Hum Genet 64: 1248–1252 63. Sinha R, et al (2009) Bayesian intervals for linkage locations. Genet Epidemiol 33: 604–616
sdfsdf
Chapter 18 Single Marker Association Analysis for Unrelated Samples Gang Zheng, Jinfeng Xu, Ao Yuan, and Joseph L. Gastwirth Abstract Methods for single marker association analysis are presented for binary and quantitative traits. For a binary trait, we focus on the analysis of retrospective case–control data using Pearson’s chi-squared test, the trend test, and a robust test. For a continuous trait, typical methods are based on a linear regression model or the analysis of variance. We illustrate how these tests can be applied using a public available R package “Rassoc” and some existing R functions. Guidelines for choosing these test statistics are provided. Key words: Additive, Association, ANOVA, Binary trait, Case–control design, Dominant, Genetic model, Genotype relative risks, MAX3, Mode of inheritance, Penetrance, Rassoc, Recessive, Quantitative trait, Robustness
1. Introduction Statistical procedures for testing whether there is an association between a phenotype and a single nucleotide polymorphism (SNP) are described and illustrated. Usually, the phenotype of interest is either a binary or quantitative one. For a binary trait, we focus on a retrospective case–control study, in which cases and controls are randomly drawn from case and control populations, respectively. For a continuous trait, the data are obtained from a random sample of the general population. Although a large number of SNPs is available for testing association, single marker analysis is often employed. The significance level to test a single hypothesis is 0.05. When multiple SNPs are tested, the Bonferroni correction can be applied. Denote the genotypes of an SNP as G0, G1, and G2. For case–control data, denote the penetrances as f0, f1, and f2 with respect to the three genotypes, respectively. Under the null hypothesis H0, we have f0 ¼ f1 ¼ f2 ¼ Pr(case). A genetic model is recessive if f1 ¼ f0, additive if f1 ¼ ( f0 + f2)/2, or dominant if f1 ¼ f2. For a single SNP, Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_18, # Springer Science+Business Media, LLC 2012
347
348
G. Zheng et al.
the observed case–control data consist of genotype counts (r0, r1, r2) among r cases and (s0, s1, s2) among s controls. Denote nj ¼ rj + sj (j ¼ 0, 1, 2) and n ¼ r + s. The Cochran-Armitage trend test (referred to as the trend test) is one of the two most commonly used statistics for the analysis of case–control data. It can be written as (1, 2) P2 ’Þrj ’sj g j ¼0 xj fð1 T1 ðxÞ ¼ P 2 1=2 (1) P2 2 2 n’ð1 ’Þ j ¼0 xj nj =n j ¼0 xj nj =n
where (x0, x1, x2) ¼ (0, x, 1), x is determined by the genetic model and ’ ¼ r/n. Under H0, given x, T1(x) asymptotically follows a standard normal distribution N(0,1). When the genetic model is recessive and the risk allele is known, T1(0) is used. When the genetic model is dominant and the risk allele is known, T1(1) is used. When we only know that the genetic model is recessive (or dominant) but not the risk allele, T1(0) (or T1(1)) cannot be used alone. When the genetic model is additive regardless of the risk allele, T1(1/2) is used. Pearson’s chi-squared test (referred to as Pearson’s test) is another commonly used test. It can be written as T2 ¼
2 X ðrj j ¼0
2 rnj =nÞ2 X ðsj snj =nÞ2 þ : ðrnj =nÞ ðsnj =nÞ j ¼0
Under H0, T2 asymptotically follows a chi-squared distribution with two degrees of freedom (df), denoted as w22 . The robust test, MAX3, is given by MAX3 ¼ maxðjT1 ð0Þj; jT1 ð1=2Þj; T1 ð1ÞjÞ; where T1(0), T1(1/2), and T1(1) are the trend tests given by Eq. 1 with different x values. Under H0, the asymptotic distribution of MAX3 is far more complex than those of T1(x) and T2. A procedure for determining its asymptotic null distribution and P -value is discussed in Note 2. For other robust tests, see Note 3. The power of each statistical test depends on the underlying genetic model. It will be seen that MAX3 is more robust than any single trend test or Pearson’s chi-squared test when the genetic model is unknown, thus, it should be used in practice. Since MAX3 does not follow any chi-squared distribution, we discuss how to find its P -value using the R package Rassoc (3). This package is available from the Comprehensive R Archive Network at “http:// CRAN.R-project.org/package¼Rassoc.” For a continuous trait Y, a typical model is Y ¼ m + g + [, where m is a fixed overall mean of the trait under H0, g is the random genetic effect due to G, and [ is a random error. The genetic value of g is a when G ¼ G0, d when G ¼ G1, and a when G ¼ G2. Under H0, we have a ¼ d ¼ 0. A genetic model is
18 Single Marker Association Analysis for Unrelated Samples
349
recessive, additive, or dominant if d ¼ a, d ¼ 0, or d ¼ a, respectively. The observed data consist of pairs (Yij, Gj) for i ¼ 1,. . ., nj and j ¼ 0, 1, 2, where Yij is the trait value of the ith individual with genotype Gj. Denote n ¼ n0 + n1 + n2. For the analysis of a quantitative trait, linear regression and the analysis of variance (ANOVA) are routinely used. Let (x0, x1, x2) ¼ (0, x, 1) be the values for the genotypes (G0, G1, G2), where x ¼ 0, 1/2, or 1 for the recessive, additive, or dominant models, respectively. Using the data (Yij, Gi), i ¼ 1,. . ., nj and j ¼ 0, 1, 2, the F-test derived from a linear regression model is given by ðn
F ðxÞ ¼ PP j
i
ðxj
2
xÞ
2Þ
( PP j
PP j
i
ðYij
i
ðxj 2
Þ Y
xÞðYij ( PP j
i
Þ Y ðxj
2
) xÞðYij
Þ Y
)2 ;
(2)
Pnj
¼ where x ¼ j ¼0 nj xj =n and Y j ¼0 i¼1 Yij =n: Given x, F(x) has an asymptotic F-distribution with (1,n 2) df under H0. :j ¼ Pnj Yij nj for j ¼ 0, 1, 2, Alternatively, denote Y i¼1
P :j Y Þ2 and SSe ¼ P2 Pnj Yij Y :j 2 . The SSb ¼ 2j ¼0 nj ðY j ¼0 i¼1 F-test derived from the ANOVA is given by P2
P2
F ¼
SSb =2 ; SSb =ðn 3Þ
(3)
which under H0, has an asymptotic F-distribution with (2,n 3) df. We illustrate how to use these F-tests using some existing R functions later. The allele-based analysis, valid under Hardy–Weinberg equilibrium (HWE), has similar performance to the genotype-based analysis under the additive model. Therefore, we focus on the genotypebased analysis, which does not require HWE. In Note 4, we present some power comparisons of different F-tests for quantitative traits.
2. Methods 2.1. Analysis of Case–Control Data
Denote the genotypes as (G0, G1, G2) ¼ (AA, AB, BB). If we happen to know the risk allele, it is always denoted as B. The choice of test statistic depends on which of the following four situations holds: (1) the genetic model and the risk allele are known, (2) the genetic model is known but not the risk allele, (3) the risk allele is known but not the genetic model, and (4) neither the genetic model nor the risk allele is known. Common genetic models include recessive, additive, and dominant. It is important that one does not determine the genetic model and/or the risk allele from
350
G. Zheng et al.
the same data that will be used in the subsequent association analysis. Of course, the genetic model and/or the risk allele may be known based on scientific knowledge or information from previous data. In our view, (3) and (4) are the most common situations in practice. 2.1.1. Which Test to Use?
Which test to choose depends on each of the four situations outlined above (see Note 1). Let w22 ð1 aÞ be the upper 100(1 a)th percentile of w22 and z(1 a) be the upper 100(1 a)th percentile of N(0,1). 1. The genetic model and the risk allele are known. T1(x) is optimal and should be used. Since the risk allele is known, a one-sided H1 is used and z(0.95) ¼ 1.645. For the recessive (additive, or dominant) model, reject H0 if T1(0) > z(0.95) (T1(1/2) > z(0.95), or T1(1) > z(0.95)). In each case, the P -value equals the probability of Z > T1(x), where Z ~ N(0,1). 2. The genetic model is known but not the risk allele. When the model is additive, use T1(1/2) and reject H0 if |T1(1/2)| > z (0.975) ¼ 1.96. The P -value equals two times the probability of Z > |T1(1/2)|. When the model is recessive or dominant, use MAX3 (see (3) next). 3. The genetic model is unknown (regardless of the risk allele). Use MAX3. Three approaches are available to calculate the P -value of MAX3 using the R package Rassoc. But we recommend using the one based on the asymptotic null distribution of MAX3. 4. The same as (3). Thus, we only discuss (3) in the following. Note that we do not recommend T2, because it is always less powerful than MAX3 (see Note 1). If T2 is used, reject H0 if T2 >w22 ð0:95Þ ¼ 5:9915. The P -value equals the probability of T > T2, where T w22 .
2.1.2. Examples Using R
The R package Rassoc can be loaded from
There are two functions CATT(data,x) and MAX3(data, method,m) in the package for computing the trend tests and MAX3 and their P -values. Pearson’s test and its P -value can be obtained using an existing R function. In both functions, the “data” comprises a 2 3 contingency table, i.e., genotype counts (r0, r1, r2) for cases and (s0, s1, s2) for controls. In the first function, “x” is 0, 0.5, or 1 for the recessive, additive, or dominant models, respectively. In the second function, the “method” refers to the procedure to calculate the P -value of MAX3. Three methods are available: “boot” for the bootstrap procedure, “bvn” for the bivariate normal procedure, or “asy” for the asymptotic procedure.
18 Single Marker Association Analysis for Unrelated Samples
351
The first two procedures are simulation-based and the last one is based on the asymptotic distribution of MAX3 (see Note 2). The “m” in the second function refers to the number of replicates when “boot” or “bvn” is used. When “asy” is used, “m” can be any positive integer. For illustration, we use a SNP (rs420259) reported by the WTCCC (4), which was the only SNP showing strong association with bipolar disorder in a genome-wide association study (GWAS) with 500,000 SNPs (the actual number of SNPs tested after quality control steps is less than 500,000). The genome-wide significance level used by (4) was 5 10 7 for strong association. The genotype counts are (r0, r1, r2) ¼ (83, 755, 1020) and (s0, s1, s2) ¼ (260, 1134, 1537). The data can be entered as follows.
To check that the data are correctly entered, just type a.
We may not have sufficient scientific knowledge to claim a priori which allele (A or B) is the risk one and what the true genetic model is. For illustration purpose, let us say we know a priori the true model is dominant and that B is the risk allele. The analysis is carried out based on the three situations outlined before. 1. If we know the genetic model is dominant and the risk allele is B, apply T1(1) using the R function CATT as follows.
The output shows that |T1(1)| ¼ 5.7587 and its P -value is 8.478 10 9. This is a two-sided test. We use a one-sided test because the risk allele is known. Thus, the actual P -value is half of the reported one, that is, 4.239 10 9, which is less than the significance level 5 10 7. Hence we reject H0.
2. If we know the genetic model is dominant but not the risk allele, use MAX3 not T1(1). Suppose we happen to enter the data as b and apply T1(1) as before.
352
G. Zheng et al.
The two-sided P -value is 0.09656, which is not significant. This example shows that knowing the risk allele is necessary for using T1(0) or T1(1). The use of MAX3 is illustrated in (3) later. We first show how to use the following R function to obtain Pearson’s test T2 ¼ 33.165 and its P -value 6.285 10 8. This P -value is also significant but larger than that of T1(1) in case (1), because T1(1) is optimal for the dominant model. If we apply T2 to the dataset b, we would obtain the same results.
3. If we do not know the genetic model, we apply MAX3 and calculate its P -value using the “asy” procedure. The reported statistic is MAX3 ¼ 5.7587 with P -value ¼ 2.347 10 8. Thus, we reject H0. Note that this P -value is smaller than that of T2 but larger than that of T1(1) in case (1).
If x ¼ 0.5 is used in T1(x) regardless of the true genetic model and the risk allele, the following results show that the P -value of T1(1/2) is not significant.
2.2. Quantitative Trait 2.2.1. Which Test to Use?
For a continuous trait, the four situations outlined before also apply. Let Fu,v(1 a) be the upper 100(1 a)th percentile of an F-distribution with (u,v) df. 1. When the genetic model and the risk allele are known, the statistic F(x) given in Eq. 2 is used. Since the risk allele is known, a onesided H1 is used. For the recessive (additive, dominant) model, reject H0 if F(0) > F1,n 2(0.95) (F(1/2) > F1,n 2(0.95), F(1) > F1,n 2(0.95)), where F(0) (F(1/2), F(1)) is the observed statistic. In each case, the P -value equals half of the probability of f1,n 2 > F(x), where f1,n 2 follows an F-distribution with (1,n 2) df. 2. The genetic model is known but not the risk allele. When the genetic model is additive, F(x) given in Eq. 2 is used (with x ¼ 1/2). Reject H0 if F(1/2) > F1,n 2(0.95), where F(1/2) is the observed statistic. The P -value equals the probability of f1,n 2 > F(1/2). When the genetic model is not additive
18 Single Marker Association Analysis for Unrelated Samples
353
(either recessive or dominant), F given in Eq. 3 is used. Reject H0 if F > F2,n 3(0.95), where F is also the observed statistic. f2,n
The P -value equals the probability of f2,n 3 > F, where 3) df. 3 follows an F-distribution with (2,n
3. When the genetic model is unknown, F given in Eq. 3 is used. The rejection rule and P -value are similar to those in case (2) when F is used. 4. The same as (3). Thus, we focus on (3) only. 2.2.2. Examples Using R
For illustration, we simulated a dataset called “QTLex.txt,” which contains (Y,G) for n ¼ 100 individuals. In the simulation, the true model was dominant and the risk allele was B with population frequency 0.3. HWE was assumed in the population. The heritability was set to 0.1. The trait Y was simulated from a normal distribution using the model given in Subheading 1 where m ¼ 0, E([) ¼ 0, and Var([) ¼ 1. The genotype G is AA, AB, or BB. The data can be read as follows.
If we know the true genetic model (dominant) and the risk allele a priori, we use F(x) given in Eq. 2 with x ¼ 1 as follows.
The function “as.integer(G)” assigns 1 for AA, 2 for AB, and 3 for BB. Thus, “objF1 ¼ aov(Y ~ (as.integer(G)¼¼1),data ¼ c)” conducts the ANOVA comparing the mean trait values between the two genotype groups: AA and AB + BB. In the first line of the output, “F value” is the statistic F(1) and “Pr(>F)” is the P -value. These values are reported in the second line. In this example, F(1) ¼ 31.317 and the P -value is 1.993 10 7. The strength of association is indicated by “***” near the P -value, and the interpretation of this significance code is given in the last line of the output. Since the risk allele is known, a one-sided test should be used. Thus, the actual P -value is half of that reported one, i.e., 9.965 10 8. This P -value is very significant compared to the 0.05 significance level. If we did not know the genetic model, the statistic F given in Eq. 3 should be used. See the following output. In this case, we would not assign scores (1, 2, 3) to the three genotypes. The output given below shows F ¼ 15.575 with P -value 1.361 10 6,
354
G. Zheng et al.
which is also significant, but larger than the P -value obtained from F(1), which is the most powerful test when the true model is dominant.
For illustration, we also calculate F(1/2) and F(0) and their P -values. F(1/2) can be obtained by
In this case, “as.integer(G)” is equivalent to using scores 1, 2, 3 for the three genotypes (under the additive model). The reported P -value is 1.011 10 6, which is the P -value if we do not know the risk allele. If we know the risk allele, the P -value is 1.011 10 6/2 ¼ 5.055 10 7. Both one-sided and two-sided P -values are significant. Interestingly, if we apply F(1/2) even when the risk allele is unknown, the two-sided P -value (1.011 10 6) is smaller than that of F. The test F(0), which is optimal for a recessive model, is obtained as follows.
In this case, the ANOVA is applied with two genotype groups: AA + AB and BB. The reported P -value is 0.1398. If we know the risk allele, the P -value is 0.1398/2 ¼ 0.0699, which is not significant at the 0.05 level. This illustrates loss of power can occur if the test used is not appropriate for the underlying genetic model.
3. Notes 1. Choosing among the trend tests, Pearson’s test, and MAX3. In practice, neither the genetic model nor the risk allele is known. The trend test T1(1/2) and Pearson’s test T2 are not robust when they are used alone. A robust test should protect against substantial loss of power when the model is misspecified (5, 6). To examine which test is most robust across the three genetic models, we conducted a simulation choosing the genotype relative risk (GRR), given by f2/f0, for a given x so that the optimal trend test for that x had about 80% power. The results are reported in Table 1. The power of the optimal test given
18 Single Marker Association Analysis for Unrelated Samples
355
Table 1 Empirical power (%) and robustness of different tests for the analysis of case–control data P
x
GRR
T1(0)
T1(1/2)
T1(1)
T2
MAX3
0.10
0.0 0.5 1.0 Min Max of min
3.15 1.88 1.47
80.40 21.23 8.94 8.94
30.30 80.30 77.60 30.30
11.56 79.20 79.85 11.56
70.58 73.09 72.78 70.58
72.42 76.89 75.07 73.42 72.42
0.30
0.0 0.5 1.0 Min Max of min
1.65 1.60 1.38
80.37 44.78 12.91 12.91
52.07 81.71 71.00 52.07
15.52 76.59 80.73 15.52
71.18 73.01 70.92 70.92
72.46 76.96 72.54 72.54 72.54
0.45
0.0 0.5 1.0 Min Max of min
1.45 1.57 1.44
79.69 56.52 14.63 14.63
61.13 80.01 64.92 61.13
16.46 67.91 80.81 16.46
70.25 71.52 71.72 70.25
72.27 76.07 73.74 73.74 73.74
a genetic model is in bold. The minimum power of each test across the three genetic models is presented. In the table, p ¼ Pr(B). The test with higher minimum power for any of the three possible underlying genetic models is the most robust test. The results show that the power of T1(1/2) ranges from 30 to 80% for p ¼ 0.1, 50 to 80% for p ¼ 0.3, and 60 to 80% for p ¼ 0.45. However, MAX3 is most robust as the power of MAX3 always exceeds 70% regardless of the underlying genetic model or the allele frequency p. The minimum power of T2 across the three genetic models exceeds 70%, although it has slightly lower power than MAX3 in the simulation studies. More extensive simulations and results can be found in ref. 7. 2. The asymptotic null distribution and P -value of MAX3. Three approaches to compute the P -value of MAX3 are presented in ref. 3. Let rxy be the asymptotic null correlation of T1(x) and T1(y), where x,y ¼ 0, 1/2, 1, and let pj be the population frequency of genotype Gj (j ¼ 0, 1, 2). Then rxy ¼
ðxyp1 þ p2 Þ fðx 2 p
1
þ p2 Þ
ðxp1 2 1=2
ðxp1 þ p2 Þ g
þ p2 Þðyp1 þ p2 Þ
fðy 2 p1 þ p2 Þ
1=2
ðyp1 þ p2 Þ2 g
:
356
G. Zheng et al.
Denote o0 ¼ ðr01 r01 r11 Þ 1 r201 and o1 ¼ ðr11 2 2 2
r01 r01 Þ 1 r201 . In the following, rxy, o0 , and o1 are 2
estimated under H0 by replacing pj with ^pj ¼ nj =n (j ¼ 1,2). The asymptotic distribution of MAX3 under H0 is far more complex than those of the trend test and Pearson’s test. An expression for the asymptotic null distribution of MAX3, P(t) ¼ Pr(MAX3 < t), is given by 1 0 Z tð1 o1 Þ o0 B t r01 u C ffiAfðuÞ du; PðtÞ ¼ 2 F@qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 0 1 r01 0 1 Z t B t r01 uC ffi AfðuÞ du 2 F@ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 0 1 r201 0 1 Z t Bt o0 u=o1 r01 uC qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi þ 2 tð1 o Þ F@ AfðuÞ du; 1 o0 1 r201 (4)
where f and F are the density and distribution functions of N (0,1) (3). Using Eq. 4, the asymptotic P -value of MAX3 is given by 1 P(max3), where max3 is the observed MAX3. This approach is denoted as “asy” in the second R function in Rassoc. Alternatively, simulations can be used to approximate the null distribution of MAX3. Given the observed data (r0, r1, r2) and (s0, s1, s2), in the jth simulation (j ¼ 1,. . ., m), we generate (r0j, r1j, r2j) from the multinomial distribution Mulðr; ^p0 ; ^p1 ; ^p2 Þ and (s0j, s1j, s2j) from the same distribution except that r is replaced by s, where ^pi ¼ ni =n (i ¼ 0, 1, 2). For each j, we compute MAX3 denoted as MAX3j. Then MAX31, . . ., MAX3m form an empirical null distribution of MAX3 when m is large enough. For single marker analysis, we use m ¼ 100,000 to determine the null distribution. A larger m may be used, if the P -value of MAX3 is smaller than 10 5. This parametric bootstrap procedure is denoted as “boot” in the second R function in Rassoc. A more efficient simulation approach, denoted as “bvn” in the second R function in Rassoc, is to directly generate T1(0) and T1(1) in the jth simulation from a bivariate normal distribution with zero means and unit variances with correlation r01. Then compute T1(1/2) ¼ o0T1(0) + o1T1(1) and MAX3 denoted as MAX3j. Hence, MAX31,. . ., MAX3m form an empirical null distribution of MAX3.
18 Single Marker Association Analysis for Unrelated Samples
357
We recommend using the method “asy” to compute the P -value of MAX3, especially for small P -values, which require a large number of replicates m. For example, in the illustration in Subheading 2.2, to obtain an accurate estimate of a P -value as small as 1e 8, we need at least 10 million replicates, which is computationally intensive. Using a smaller m, say, m ¼ 100,000 the estimated P -value would be 0, which may be reported as “0” or as “ < 2.2e 16” by R. In the following example, we used the “boot” procedure with 100,000 replicates. The output shows the P -value is less than 2.2 10 16.
One can also use an approximation of the tail probability of MAX3 to approximate the P -value of MAX3 (8), which has a closed form. This approximate P -value, however, is not reported by the R package Rassoc. 3. Other Robust Tests for Binary Traits. Nearly all robust methods have been developed for case– control studies. In addition to MAX3 (2), other robust tests are also developed for case–control association studies. A review of different robust tests for association studies can be found in ref. 9, 10. Although different robust tests have been developed, they have similar performance under the alternative hypothesis. In ref. 9, a function “casecontrol” in R is provided, which also outputs the three trend tests: Pearson’s test, MAX3, and other robust tests. The P -values of the trend tests and Pearson’s test are based on the asymptotic distributions, while the P -value for MAX3 is based on the bootstrap simulation. Discussion of applying single-marker analysis with robust tests in genomewide association studies can be found in ref. 11. 4. Comparison of F-Tests for Quantitative Traits. We conducted a simulation to compare F(x) (x ¼ 0,1/2,1) and F by choosing the frequency of allele B (denoted as p), the sample size n, the heritability h, and the unit variance for the random error. Given a genetic model x and the values of p and h, we computed a and d. The empirical power is reported in Table 2. F(x) is most powerful when x is correctly specified. However, when x is misspecified, F(1/2) is most robust among the three F(x) statistics. On the other hand, F is slightly less powerful than F(1/2) under the additive or dominant models, but it protects against substantial power loss under the recessive model.
358
G. Zheng et al.
Table 2 Empirical power (%) for the analysis of a quantitative trait using different test statistics given p, h ¼ 0.1, and n ¼ 100 Model (x)
P
F(0)
F(1/2)
F(1)
F
0.0
0.10 0.30 0.45
62.29 87.65 89.86
31.54 56.16 70.22
12.88 15.60 17.00
59.70 80.49 82.87
0.5
0.10 0.30 0.45
53.80 57.15 70.88
89.05 90.57 90.49
87.65 83.96 77.42
82.89 83.50 84.03
1.0
0.10 0.30 0.45
40.03 15.58 17.76
87.82 84.14 77.45
89.75 90.15 89.62
84.44 83.39 82.85
References 1. Sasieni PD (1997) From Genotypes to Genes: Doubling the Sample Size. Biometrics 53: 1253–1261 2. Freidlin B, Zheng G, Li Z, Gastwirth JL (2002) Trend tests for case–control studies of genetic markers: power, sample size and robustness. Hum Hered 53: 146–152. (Erratum (2009) 68: 220) 3. Zang Y, Fung WK, Zheng G (2010) Simple algorithms to calculate asymptotic null distributions of robust tests in case–control genetic association studies in R. J Stat Softw 33 (8): 1–24 4. The Wellcome Trust Case Control Consortium (WTCCC) (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678 5. Gastwirth JL (1966) On Robust Procedures. J Am Stat Assoc 61: 929–948 6. Gastwirth JL (1985) The Use of Maximin Efficiency Robust Tests in Combining Contingency Tables and Survival Analysis. J Am Stat Assoc 80: 380–384
7. Zheng G, Freidlin B, Gastwirth JL (2006) Comparison of robust tests for genetic association using case–control studies. In: Rojo J (ed) Optimality: The Second Eric L. Lehmann Symposium, IMS Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Beachwood, Ohio 8. Li QZ, Zheng G, Li Z, Yu K (2008) Efficient approximation of P -value of the maximum of correlated tests, with applications to genomewide association studies. Ann Hum Genet 72: 397–406 9. Joo J, Kwak M, Chen Z, Zheng G (2010) Efficiency robust statistics for genetic linkage and association studies under genetic model uncertainty. Stat Med 29: 158–180 10. Kuo CL, Feingold E (2010) What’s the Best Statistic for a Simple Test of Genetic Association in a Case–control Study? Genet Epidemiol 34: 246–253 11. Zheng G, Joo J, Tian X, Wu CO, Lin J-P, Stylianou M, Waclawiw MA, Geller NL (2009) Robust genome-wide scans with genetic model selection using case–control design. Stat Its Interface 2: 145–151
Chapter 19 Single-Marker Family-Based Association Analysis Conditional on Parental Information Ren‐Hua Chung and Eden R. Martin Abstract Family-based designs have been commonly used in association studies. Different family structures such as extended pedigrees and nuclear families, including parent–offspring triads and families with multiple affected siblings (multiplex families), can be ascertained for family-based association analysis. Flexible association tests that can accommodate different family structures have been proposed. The pedigree disequilibrium test (PDT) (Am J Hum Genet 67:146-154, 2000) can use full genotype information from general (possibly extended) pedigrees with one or multiple affected siblings but requires parental genotypes or genotypes of unaffected siblings. On the other hand, the association in the presence of linkage (APL) test (Am J Hum Genet 73:1016-1026, 2003) is restricted to nuclear families with one or more affected siblings but can infer missing parental genotypes properly by accounting for identity-by-descent (IBD) parameters. Both the PDT and APL are powerful association tests in the presence of linkage and can be used as complementary tools for association analysis. This chapter introduces these two tests and compares their properties. Recommendations and notes for performing the tests in practice are provided. Key words: Family-based association test, Linkage disequilibrium, Transmission statistics, Nontransmission statistics, Parental information, EM algorithm, Rare variants, Genome-wide association, Extended pedigree, Nuclear family, Parallelization, Population stratification
1. Introduction Family-based designs have been commonly used for association analysis in candidate gene studies and more recently in genomewide association studies (GWAS). Several studies have identified candidate genes based on the family design (1–4). Family-based association tests using full genotype data are generally robust to population stratification, which can cause spurious results for population-based association analysis (e.g., case–control study) (5). However, family data may be more difficult to ascertain than unrelated individuals used in case–control studies.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_19, # Springer Science+Business Media, LLC 2012
359
360
R.-H. Chung and E.R. Martin
The transmission/disequilibrium test (TDT) for case–parent triads was the first popular family-based association test (6). The TDT statistic is calculated conditional on the parental genotypes so that the statistic is robust to population stratification. Thereafter, the TDT was generalized to different family structures such as extended pedigrees with multiple affected siblings in the pedigree disequilibrium test (PDT) (7, 8). The TDT requires parental genotypes for the test; however, for late-onset diseases, parental genotypes are often missing. Two general approaches were proposed to deal with this problem. The first approach is to compare the difference in allele frequencies between affected and unaffected siblings without using parental information. Examples of such approaches include the S-TDT, the DAT, and SDT (9–11). The second approach is to infer missing parental genotypes based on the siblings’ genotypes and then calculate the transmission and non-transmission statistics. The missing parental genotypes can be constructed based on sample allele frequencies, such as done by association in the presence of linkage (APL), TRANSMIT, and UNPHASED (12–14), or conditional on the observed genotypes within each family, such as in the RC-TDT and FBAT (15, 16). It has been shown that methods that infer missing parental genotypes based on sample allele frequencies can have more power than methods that reconstruct parental mating types based on information within each family (17). However, methods that use sample allele frequencies to infer missing parental genotypes may have inflated type 1 error rates in the presence of population stratification, owing to the difference in allele frequencies among subpopulations. In this chapter, we introduce two popular single-marker familybased association tests: PDT and APL. The PDT considers phenotypically informative nuclear families and discordant sibships in each pedigree. Phenotypically informative nuclear families are ones in which there is at least one affected child, and both parents genotyped at the marker. Phenotypically informative discordant sibships have at least one affected and one unaffected sibling (discordant sibpair; DSP) and may or may not have parental genotype data. For a specific allele M1 at a marker, define a random variable XT as the difference between the transmission and non-transmission statistics: XT ¼ ð#M1 transmittedÞ
ð#M1 not transmitted)
and define a random variable XS as: XS ¼ ð#M1 in affected sibs)
ð#M1 in unaffected sibs):
Then for a pedigree with nT informative family triads and nS informative DSPs, a summary random variable D is defined as ! nT nS X X 1 D¼ (1) XSj : XTj þ nT þ nS j ¼1 j ¼1
19
Single-Marker Family-Based Association Analysis Conditional. . .
361
This is referred to as the “PDT-avg” statistic (18). Alternatively, “PDT-sum” was proposed to use the sum in Eq. 1 without dividing by nT + nS (18). Note that triads with homozygous parents or DSPs with identical genotypes do not contribute to the PDT-sum statistic. If there are n unrelated informative pedigrees and Di is the summary random variable for pedigree i, the PDT statistic T is defined as n P
Di ffiffiffiffiffiffiffiffiffiffiffiffiffi ; T ¼ si¼1 n P Di2
(2)
i¼1
which asymptotically follows a normal distribution with a mean of 0 and a standard deviation of 1 under the null hypothesis of no linkage or no association. The PDT-sum gives more weight to families of larger size, whereas the PDT-avg gives all families equal weight. Simulation results suggested that neither test is uniformly more powerful over all genetic models (18). The PDT was also extended to the genoPDT, which is a genotype-based association test for general pedigrees (19). Simulation results showed that the geno-PDT can have more power than the allele-based PDT in recessive and dominant models, whereas the original allele-based PDT can have more power than the geno-PDT if the alleles have an additive effect. The most important property of the geno-PDT is its ability to test for association with particular genotypes, which can reveal underlying patterns of association at the genotypic level. The APL considers independent informative nuclear families. Informative nuclear families for the APL are ones that have at least one affected sibling, with or without parental genotypes. Unaffected siblings are not required, but they can improve the parental genotype inference. The APL statistic is based on the difference between the observed number of alleles in affected siblings and its expected value conditional on parental genotypes under the null hypothesis that there is no linkage or no association. Specifically, for nuclear family i, let Xi be the number of copies of a specific allele in affected siblings, Gi be a vector of genotypes of the siblings, A be the siblings’ affection status, Gpj be a vector of the parental mating-type, C be the set of all possible parental mating types conditional on Gi, and Npij be the observed number of alleles in Gpj. Then the numerator of the APL statistic Ti is calculated as X ^ pj jGi ; AÞNpij : PðG (3) Ti ¼ Xi j 2C
362
R.-H. Chung and E.R. Martin
When parental genotypes are available, C is the vector of observed genotypes. When parental genotypes are missing, we replace the parental genotype probabilities in Eq. 3 with the following: mGp PðGp jG; AÞ ¼
2 P
k¼0
zk PðGjGp ; IBD ¼ kÞ PðGjAÞ
;
(4)
where Gp is a vector of parental mating types, G is a vector of the siblings’ genotypes, mGp is the unconditional parental mating-type probability, zk is the identity-by-descent (IBD) parameter, and P GjGp ; IBD ¼ k is the Mendelian transition probability conditional on the IBD status of the siblings. The APL correctly infers missing parental genotypes in the presence of linkage by considering the IBD parameters when estimating parental mating-type probabilities. The probabilities P GjGp ; IBD ¼ k reduce to Mendelian transition probabilities P GjGp if there is only one affected sibling in a family. The APL also uses genotypes from unaffected siblings and partial parental genotypes to help estimate parental mating-type probabilities (12). The expectation maximization (EM) algorithm is used to estimate the parameters mGp and zk. The APL statistic Ts is the sum of Ti over all nuclear families. The APL has been generalized to use nuclear families with different missing patterns and different numbers of affected and unaffected siblings by adopting a bootstrap procedure to estimate the variance of Ts (17). Briefly, k bootstrap resamplings are performed. Each family is treated as an independent unit for resampling. For each bootstrap sample, a new set of n families is resampled with replacement from the original n families. The APL statistic is calculated for each bootstrap sample and the sample variance is calculated based on the APL statistics from the bootstrap samples. The sample variance provides the estimate of the variance for the APL statistic Ts, which is standardized to be asymptotically normal with a mean of 0 and variance 1. Both the PDT and APL tests have been widely used for different disease studies (20–25). The PDT and APL share some properties such as both being powerful association tests in the presence of linkage and they can both use families with one or more affected siblings. They also have several complementary properties. For example, the PDT can use full information for extended pedigrees, while the APL uses general nuclear families but can infer missing parental genotypes. Steps to perform the two tests will be described in the following section including (1) software download, (2) input files, (3) control file setup, and (4) interpretation of the results. Finally, practical notes regarding the two methods will be discussed.
19
Single-Marker Family-Based Association Analysis Conditional. . .
363
2. Methods 2.1. Perform the PDT Test 2.1.1. Software Download
The PDT software, PDT2 (currently PDT version 6.0), is available for download at the Hussman Institute for Human Genomics (HIHG) website: http://hihg.med.miami.edu/software-download/ pdt. PDT2 is implemented in C++, and the source code is included in the package for users to compile on local machines.
2.1.2. Input Files
PDT2 requires a map file, which contains three columns for chromosomes, marker names and base-pair positions, and a ped file, which contains pedigree and genotype information. For SNPs with two alleles, the alleles are coded as 1s and 2s, and missing alleles are coded as 0s. Both map and ped files can be generated using PLINK (26) with the “recode12” option. However, a PLINK map file contains two columns for map information (one for the genetic map distance and the other for the base pair positions). The column for genetic distance in a PLINK map file needs to be removed as PDT2 only accepts one column for map information. Currently, covariates are not considered in PDT or APL (see Note 1). Therefore, covariates are not accepted in the input files. More detailed descriptions about the input files can be found in the PDT2 user manual included in the PDT2 software package.
2.1.3. Control File Setup
A control file is necessary for PDT2, and the parameters in the control file should be carefully specified because they determine how PDT2 performs the tests. Some parameters that need special attention are listed as follows: geno_pdt This option decides whether the allele-based PDT or the global test for the geno-PDT will be performed. The null hypothesis for the global test in the geno-PDT is that none of the genotypes are associated with the disease or the locus is not linked. max_cpus To efficiently handle GWAS data that may have > 1 million markers, PDT2 is implemented with parallel algorithms based on the POSIX threads (p-threads) technique. Each independent family with all of the markers is analyzed in parallel threads. The number of threads is specified in max_cpus. Each thread keeps receiving and analyzing one independent family at a time over all markers until all families are analyzed. The number of threads should be equal to or less than the number of cores on the computer. Our simulation results suggested that on a computer with dual quad-core processors, using seven threads achieved the optimum performance. It is ideal
364
R.-H. Chung and E.R. Martin
to leave one thread available so that it can perform routine work such as memory or task management on the system. options This parameter lets users decide whether to consider only transmission and non-transmission statistics in nuclear families, to use only the difference in the numbers of a specific allele in discordant sibships, or to use all of the available information as in Eq. 1, in the PDT test. This option allows users to examine separately whether an association signal comes largely from the transmission/ non-transmission component or from the DSP component in the PDT statistic. Using only transmission/non-transmission statistics may be more powerful if parents are available and if we expect reduced penetrance so that unaffected siblings may actually be carriers of the disease allele. 2.1.4. Interpretation of the Results
In the PDT2 output files, users can find the numbers of triads, DSPs, and independent pedigrees considered in the PDT statistics for each marker. For the allele-based PDT, the transmission and non-transmission statistics from parents to affected siblings, allele counts in DSPs, and statistic calculated based on Eq. 2 for each allele will be shown. For the geno-PDT, the statistic similar to Eq. 2 for each genotype and P -values for the global tests will be reported.
2.2. Perform the APL Test
The APL test, which is included in the CAPL software package, is available for download at the HIHG website (http://hihg.med. miami.edu/software-download/capl). CAPL is implemented in C ++, and the source code is included in the package. The CAPL software package provides the CAPL association test, a generalization of APL that can accommodate family and case–control data, and adjusts for population stratification (see Note 2). When there are only family data in the sample and one population is considered, CAPL reduces to the APL. In the following text, we refer to CAPL as the software implementation of the APL.
2.2.1. Package Download
2.2.2. Input Files
Three input file formats are accepted in the CAPL, which can all be generated by the commonly used software PLINK. The three formats are the binary files (bim, bed, and fam files), text files with alleles coded as A, T, C, and G (map and ped files); text files with alleles coded as either 1’s or 2’s; and missing alleles coded as 0’s (also map and ped files). Note that each marker should have only two types of allele coding with the missing allele code because current implementation of CAPL assumes only diallelic markers (see Note 3). More detailed descriptions about the input files can be found in the user manual included in the CAPL software package.
19 2.2.3. Control File Setup
Single-Marker Family-Based Association Analysis Conditional. . .
365
The control file for the CAPL determines how the CAPL performs the test and therefore parameters in the control file should be carefully specified. Some parameters that need special attention are as follows: em_precision The EM algorithm is used in the APL test to estimate the parameters such as the allele frequencies and IBD parameters. CAPL stops the EM iterations if the difference in the allele frequency estimates between the current iteration and the previous iteration is less than em_precision. Therefore, more EM iterations in CAPL will be performed if smaller em_precision is specified and more CPU time will be required. Our simulation results suggested that 10 6 for em_precision gives good estimates for the parameters in CAPL. bootstrap_length This parameter determines how many bootstrap resamplings will be performed in the APL for the variance estimator. Generally, 200 bootstrap replicates are enough to give a good estimate of the variance (27). Since the bootstrap procedure is a stochastic process, the estimated variance, and consequently resulting test statistic and P -value for a marker, may not be exactly the same if the test is repeated. Our simulation results suggested that generally using 1,000 bootstrap replicates gives a reasonably accurate estimate of the variance. Note that each bootstrap replicate involves several EM iterations to estimate the parameters in the APL. Therefore, increasing the number of bootstrap replicates also increases the running time linearly. In practice, users can first use 200–500 bootstrap replicates for initial estimates of P -values. Markers with P -values less than a certain threshold can then be tested with 1,000 bootstrap replicates for more accurate estimates of the P -values. max_cpus CAPL is also implemented with parallel algorithms to efficiently handle GWAS data. Since each marker can be analyzed independently in the APL, testing of different markers can be performed in parallel. We provide two versions of CAPL. The first version is implemented with p-threads and the second is implemented with both message passing interface (MPI) and p-threads. The first version of CAPL uses parallel threads with shared memory on one machine to take advantage of current computers with multi-core design. Each thread keeps receiving and analyzing one marker at a time until all markers are analyzed. The number of threads is specified in max_cpus. mpi_processes The second version of CAPL is implemented with a hybrid of MPI and p-threads, which allows the jobs to be distributed across a cluster
366
R.-H. Chung and E.R. Martin
of computers, and parallel threads are performed on each computer with shared memory. The number of computer nodes is specified in mpi_processes. The number of threads to be invoked on each node is specified in max_cpus. The total number of threads is therefore the number of nodes times the number of threads on each node. start and stop These two parameters are useful when users are only interested in a specific region, such as a candidate region or a chromosome. The start and stop parameters correspond respectively to the start and end positions of the SNPs in a region in the map file. 2.2.4. Interpretation of the Results
CAPL shows the numbers of families for different nuclear family structures used in the APL test. Five types of family structures are shown: C for unrelated cases, U for unrelated controls, A for families with one affected sib, AA for families with two affected sibs, and AAA for families with at least three affected sibs (see Note 4 for more details). The CAPL output file shows the names of markers, allele frequencies, observed allele counts in affected siblings, expected allele counts conditional on parental genotypes, variance for the APL statistics, and P -values. The variance for the APL statistics in the output file can be used to evaluate the asymptotic property of the APL statistics (see Note 5). One of the frequently asked questions from users is how to identify the risk alleles based on the results. Users can identify a risk allele if its observed allele count is greater than the expected allele count for a significant marker. Also, users frequently ask the choice between using PDT and APL. Detailed comparisons between the PDT and APL can be found in Note 6. Currently, odds ratios for markers are not calculated in either the PDT or APL. However, odds ratios can be calculated with other tools (see Note 7).
3. Notes 1. Covariates are not considered in either PDT or APL. We have developed APL-OSA (36), which is an extension of the ordered subset analysis (OSA) (37). APL-OSA can identify a subset of families that provide the most evidence of association based on covariate values and tests the null hypothesis that there is no relationship between the family-specific covariate and the family-specific evidence for allelic association. Alternatively, the software UNPHASED (14) can also be used to include covariates in association tests. 2. The APL uses allele frequencies estimated from the entire sample to infer the missing parental mating-type probabilities.
19
Single-Marker Family-Based Association Analysis Conditional. . .
367
Table 1 Properties for the PDT and APL software Test/properties
PDT
APL
Null hypothesis
No linkage or no association
No linkage or no association
Family structures
Independent extended pedigrees or nuclear families with one or more affected siblings
Independent nuclear families with one or more affected siblings
Parental genotypes
Complete/incomplete (incomplete requires DSP)
Complete/incomplete
Missing parental genotype inference
No
Yes
Allele-based test
Yes
Yes
Genotype-based test
Yes
No
Parallelization in software
Threads with shared memory
Threads with shared memory and MPI with threads in a distributed system
When there is population stratification in the data, the allele frequency estimates may reflect the true allele frequency for each family, which can cause inflated type 1 error rate for the APL (32). In practice, software to identify population structure such as STRUCTURE (33) or EIGENSTRAT (34) should be used in the quality control (QC) step to generate a homogeneous sample. Recently, the CAPL, which is an extension of the APL, was proposed to account for population stratification in the APL statistic and allow for use of unrelated cases and controls (32). Briefly, a clustering algorithm is used in the CAPL to identify subpopulations in the sample, and Eq. 4 is calculated conditional on the subpopulation information. The CAPL is useful when users have discrete subpopulations in the sample and would like to perform joint analysis using all of the samples from the subpopulations. 3. Current implementation of CAPL only assumes diallelic markers, while PDT2 can handle markers with multiple alleles. To analyze a multi-allelic marker using CAPL, users can collapse alleles other than the allele to be tested into one allele. For example, the allele to be tested can be coded as 1, and other alleles can be all coded as 2 in the ped file. 4. The APL test implemented in the CAPL software currently uses nuclear families with up to three affected siblings (17). When there are extended pedigrees or nuclear families with more than three affected siblings, the APL will select the
368
R.-H. Chung and E.R. Martin
most informative nuclear families. For a nuclear family with more than three affected siblings, the APL selects the three affected siblings with the highest genotyping rates based on all markers among all affected siblings in the nuclear family. There is no restriction on the number of unaffected siblings in the APL. Therefore, all unaffected siblings in a nuclear family will be used. In an extended pedigree, the APL examines every nuclear family and extracts the nuclear family with the most affected siblings and with the highest genotyping rate. 5. When sample size is small or allele frequency is low, the APL may have inflated type 1 error rate because the assumption of an asymptotic normal distribution for the APL statistic may not hold (17). Chung et al. (17) used simulations to demonstrate that APL statistics with variance >5 are generally valid. In practice, APL tests with variance