Disease Gene Identification: Methods and Protocols [1 ed.] 1617379530, 9781617379536

Recent efforts to characterize genetic variation in the human genome, coupled with the rapidly developing field of genom


English Pages 312 [315] Year 2011



Author / Uploaded
Johanna K. DiStefano
Darin M. Taverna (auth.)
Johanna K. DiStefano (eds.)


METHODS IN MOLECULAR BIOLOGY™

Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK

For other titles published in this series, go to www.springer.com/series/7651


Disease Gene Identification Methods and Protocols

Edited by

Johanna K. DiStefano Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA

Editor Johanna K. DiStefano, Ph.D. Diabetes, Cardiovascular, and Metabolic Diseases Division Translational Genomics Research Institute Phoenix, AZ USA

ISSN 1064-3745 e-ISSN 1940-6029 ISBN 978-1-61737-953-6 e-ISBN 978-1-61737-954-3 DOI 10.1007/978-1-61737-954-3 Springer New York Dordrecht Heidelberg London © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Humana Press is part of Springer Science+Business Media (www.springer.com)

Dedication This book is dedicated in memory of Stephanie Savage, a colleague and friend who touched the hearts of many and devoted her work to improving the lives of others.


Preface

Completion of the Human Genome Project (HGP) has provided us with a greatly enhanced understanding of human genetics, including a greater appreciation of how DNA shapes species development and evolution, biology, and disease susceptibility. The HGP has also affected the development and/or maturation of research disciplines such as genome annotation, knowledge of genome evolution and segmental duplication, and comparative genomics, among others. Yet, perhaps the greatest impact of the HGP has been on the manner in which researchers investigate the causes of complex human diseases. Completion of the HGP gave rise to the development of efforts to characterize genetic variation in the human genome, which has led directly to the application of whole genome association studies to identify common alleles which contribute to complex disease risk.

Efforts to identify genetic mutations underlying highly penetrant diseases have been widely successful due to the facts that (1) a single mutation is enough to cause the disease (i.e., monogenic) and (2) the mutation is inherited in a simple manner between generations in affected families. To date, more than a thousand genes for such disorders have been identified. These monogenic diseases are rare compared to diseases such as type 2 diabetes mellitus, cardiovascular disease, and cancer, which have complex etiologies. Unlike monogenic diseases, which arise due to a single genetic aberration, complex diseases result from a complicated interaction of multiple genetic and environmental determinants, none of which are amenable to identification and characterization using the traditional approaches to monogenic disease gene discovery. Recent efforts to characterize genetic variation in the human genome, coupled with the rapidly developing field of genomics, have led directly to the development of new and innovative approaches to the identification of genes contributing to complex human diseases.
This volume is written to provide up-to-date molecular methodologies used in the process of identifying a disease gene, from the initial stage of study design to the next stage of preliminary locus identification, and ending with stages involved in target characterization and validation. The need for such a book derives from the intellectual revolution in biomedical science and the realization that the molecular determinants of human diseases are rapidly becoming identifiable through well-planned, technologically advanced approaches. Although the research literature is replete with descriptions of technical procedures, there is typically a dearth of extensive practical detail in these publications, particularly in terms of modifications developed from personal experience, as well as discussions of potential problems that may be encountered throughout the protocol and ways to avoid them. The structure of this volume is unique in that it aims to address these deficiencies. The chapters contained within have been contributed by experts in their fields, who have kindly accepted the invitation to compile the protocols within this volume and share with us their expertise, experience, and results. This text is written at a level accessible to graduate students, postdoctoral researchers, and bench scientists in the fields of molecular genetics and molecular biology. The primary aim of


this volume is to present detailed laboratory procedures in an easy-to-follow format that can be carried out with success by competent investigators lacking previous exposure to a specific research method. The book’s main focus is on the application of molecular approaches to disease gene identification, but overviews and case studies are also presented. The volume begins with three introductory chapters which provide an overview of disease gene identification strategies and a description of study sample selection and successful experimental design. The next section of the text contains chapters presenting methods for identifying potential susceptibility loci, including practical procedures for high-throughput SNP genotyping using whole genome arrays, medium-throughput SNP genotyping using mass spectrometry, and low-throughput, targeted SNP genotyping approaches commonly used in fine-mapping and candidate gene investigations. The section ends with a chapter on bar-coded, multiplexed sequencing of targeted DNA regions to pinpoint specific allelic variants which contribute to disease risk. The volume follows with a section on current applications in human genomics, which provide tools for target validation and functional assessment. These protocols are typically associated with the steps of disease gene identification pursuant to initial locus discovery, such as those pertaining to functional characterization of susceptibility alleles and loci. Examples of such approaches include global mRNA expression profiling using mainstream platforms, the newly emerging field of microRNA profiling, and allelic expression profiling, as well as quantitative PCR and RNA mapping methods. Chapters on comparative genomic hybridization, which is a molecular-cytogenetic method to detect copy number changes, and high content analysis are also included. Finally, we end with four discursive chapters, to provide examples of disease gene identification and application. 
The first chapter in this section is related to bioinformatics approaches to the elucidation of gene identity and function, with a particular focus on an integrative systems biology approach. The second chapter in this section illustrates the entire process of disease gene identification with a real-life case study, and the concluding chapters present an RNA-interference approach to functional pharmacogenomics and an application of molecular diagnostics to the treatment of β-thalassemia.

Without the support and contributions of many individuals, completion of this volume would not have been possible. In particular, I thank Dr. John Walker, the editor of this series at Humana Press, who ensured the smooth and effortless completion of this project from the very start. I also express gratitude to the authors, all of whom contributed outstanding chapters, emanating from years of experience in the field. It was a pleasure working with this expert team of scientists, and I would gladly do so again at a moment’s notice. It is my hope that this volume leads to the identification and characterization of many more disease-related genes, which may someday pave the way toward more accurate and improved methods for disease diagnosis, as well as novel and effective strategies for disease treatment and prevention.

Phoenix, AZ, USA

Johanna K. DiStefano

Contents

Preface
Contributors

PART I INTRODUCTION

1 Technological Issues and Experimental Design of Gene Association Studies
  Johanna K. DiStefano and Darin M. Taverna
2 Statistical Issues in Gene Association Studies
  Richard M. Watanabe
3 Identification of Causal Sequence Variants of Disease in the Next Generation Sequencing Era
  Christopher B. Kingsley

PART II METHODS FOR GENE IDENTIFICATION

4 Microarray-Based Genome-Wide Association Studies Using Pooled DNA
  Szabolcs Szelinger, John V. Pearson, and David W. Craig
5 Medium-Throughput SNP Genotyping Using Mass Spectrometry: Multiplex SNP Genotyping Using the iPLEX® Gold Assay
  Meredith P. Millis
6 Targeted SNP Genotyping Using the TaqMan® Assay
  Dorit Schleinitz, Johanna K. DiStefano, and Peter Kovacs
7 Bar-Coded, Multiplexed Sequencing of Targeted DNA Regions Using the Illumina Genome Analyzer
  Szabolcs Szelinger, Ahmet Kurdoglu, and David W. Craig

PART III FUNCTIONAL CHARACTERIZATION OF SUSCEPTIBILITY ALLELES AND LOCI

8 Site-Directed Mutagenesis
  Patricia E. Carrigan, Petek Ballar, and Sukru Tuzmen
9 Gene Expression Profiling of Tissues and Cell Lines: A Dual-Color Microarray Method
  Sonsoles Shack
10 Methods for MicroRNA Microarray Profiling
  Aarati R. Ranade and Glen J. Weiss
11 Allelic Expression Profiling to Dissect Genome-Wide Association Study Signals
  Jonathan D. Gruber
12 Quantitative Polymerase Chain Reaction Using the Comparative Cq Method
  Kimberly Yeatts
13 Genomic Analysis by Oligonucleotide Array Comparative Genomic Hybridization Utilizing Formalin-Fixed, Paraffin-Embedded Tissues
  Stephanie J. Savage and Galen Hostetter
14 RNA Mapping Protocols: Northern Blot and Amplification of cDNA Ends
  Maria Lucrecia Alvarez and Mahtab Nourbakhsh
15 High-Content RNA Interference Assay: Analysis of Tau Hyperphosphorylation as a Generic Paradigm
  RiLee H. Robeson and Travis Dunckley

PART IV ALTERNATIVE APPROACHES

16 Integrative Systems Biology Approaches to Identify and Prioritize Disease and Drug Candidate Genes
  Vivek Kaimal, Divya Sardana, Eric E. Bardes, Ranga Chandra Gudivada, Jing Chen, and Anil G. Jegga
17 Identification of a Common Variant Affecting Human Episodic Memory Performance Using a Pooled Genome-Wide Association Approach: A Case Study of Disease Gene Identification
  Traci L. Pawlowski and Matthew J. Huentelman
18 RNAi-Based Functional Pharmacogenomics
  Sukru Tuzmen, Pinar Tuzmen, Shilpi Arora, Spyro Mousses, and David Azorsa
19 Genetic Predisposition to β-Thalassemia and Sickle Cell Anemia in Turkey: A Molecular Diagnostic Approach
  A. Nazli Basak and Sukru Tuzmen

Index

Contributors

MARIA LUCRECIA ALVAREZ • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
SHILPI ARORA • Translational Genomics Research Institute, Phoenix, AZ, USA
DAVID AZORSA • Translational Genomics Research Institute, Phoenix, AZ, USA
PETEK BALLAR • Faculty of Pharmacy, Biochemistry Department, Ege University, Bornova, Izmir, Turkey
ERIC E. BARDES • Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
A. NAZLI BASAK • Translational Genomics Research Institute, Phoenix, AZ, USA
PATRICIA E. CARRIGAN • Biodesign Institute, Arizona State University, Phoenix, AZ, USA
JING CHEN • Department of Environmental Health, University of Cincinnati, Cincinnati, OH, USA
DAVID W. CRAIG • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
JOHANNA K. DISTEFANO • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
TRAVIS DUNCKLEY • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
JONATHAN D. GRUBER • Department of Ecology & Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA; Department of Molecular, Cellular, & Developmental Biology, University of Michigan, Ann Arbor, MI, USA
RANGA CHANDRA GUDIVADA • Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
GALEN HOSTETTER • Integrated Cancer Genomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
MATTHEW J. HUENTELMAN • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
ANIL G. JEGGA • Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA; Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH, USA; Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, USA
VIVEK KAIMAL • Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA; Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH, USA
CHRISTOPHER B. KINGSLEY • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
PETER KOVACS • Interdisciplinary Center for Clinical Research, University of Leipzig, Leipzig, Germany
AHMET KURDOGLU • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
NATHALIE MEURICE • Translational Genomics Research Institute, Phoenix, AZ, USA
MEREDITH P. MILLIS • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
SPYRO MOUSSES • Translational Genomics Research Institute, Phoenix, AZ, USA
MAHTAB NOURBAKHSH • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
TRACI L. PAWLOWSKI • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
JOHN V. PEARSON • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
AARATI R. RANADE • Translational Genomics Research Institute, Phoenix, AZ, USA
RILEE H. ROBESON • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
DIVYA SARDANA • Department of Computer Science, College of Engineering, University of Cincinnati, Cincinnati, OH, USA
STEPHANIE J. SAVAGE • Integrated Cancer Genomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
DORIT SCHLEINITZ • Interdisciplinary Center for Clinical Research, University of Leipzig, Leipzig, Germany
SONSOLES SHACK • Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, USA
SZABOLCS SZELINGER • Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
DARIN M. TAVERNA • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA
PINAR TUZMEN • Translational Genomics Research Institute, Phoenix, AZ, USA
SUKRU TUZMEN • Pharmaceutical Genomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
RICHARD M. WATANABE • Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Physiology & Biophysics, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
GLEN J. WEISS • Translational Genomics Research Institute, Phoenix, AZ, USA
KIMBERLY YEATTS • Diabetes, Cardiovascular, and Metabolic Diseases Division, Translational Genomics Research Institute, Phoenix, AZ, USA

Part I Introduction

Chapter 1

Technological Issues and Experimental Design of Gene Association Studies

Johanna K. DiStefano and Darin M. Taverna

Abstract

Genome-wide association studies (GWAS), in which thousands of single-nucleotide polymorphisms (SNPs) spanning the genome are genotyped in individuals who are phenotypically well characterized, currently represent the most popular strategy for identifying gene regions associated with common diseases and related quantitative traits. Improvements in technology and throughput capability, development of powerful statistical tools, and more widespread acceptance of pooling-based genotyping approaches have led to greater utilization of GWAS in human genetics research. However, important considerations for optimal experimental design, including selection of the most appropriate genotyping platform, can enhance the utility of the approach even further. This chapter reviews experimental and technological issues that may affect the success of GWAS findings and proposes strategies for developing the most comprehensive, logical, and cost-effective approaches for genotyping given the population of interest.

Key words: Association analysis, Genome-wide association study, Candidate gene, Single-nucleotide polymorphism, Microarray, Linkage disequilibrium

1. Introduction

Most common diseases have both environmental and genetic determinants, and it is likely that an individual’s unique genetic load, in combination with his/her lifetime environmental exposure, determines the overall risk of disease development. At present, there are an unprecedented number of approaches for querying the human genome to identify genetic determinants of common human diseases. Since the late 1990s, major advances in genotyping technology have moved the field from testing one marker at a time to assaying thousands of markers in a single experiment. At present, microarrays representing over a million

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_1, © Springer Science+Business Media, LLC 2011


single-nucleotide polymorphisms (SNPs) are available and that capacity will probably be doubled within the year. Although there are millions of SNPs in the human genome, it is well recognized that the effect of an ungenotyped functional allele may still be gauged if it is correlated with a variant that has been genotyped. This correlation between pairs of markers makes it possible to query the entire genome without genotyping every variant and provides the basis for genome-wide approaches for identification of alleles which contribute to risk for common diseases (1, 2). Genome-wide association (GWA) studies are currently the most widely used approach to identify genetic variation related to phenotypic diversity (3). Over the past few years, GWA studies have identified statistical association between hundreds of markers throughout the genome and various common complex traits and diseases (4). Findings reported from these studies have substantially enhanced our understanding and appreciation of the diverse molecular pathways underlying specific human diseases. These GWA studies also complement targeted approaches such as candidate gene/pathway investigations and provide a foundation for deep sequencing efforts to identify causal alleles. Further, results from such studies provide insight into the genetic underpinnings of disease susceptibility, such as validation of the common disease–common variant hypothesis, which posits that complex traits are largely due to frequently occurring alleles with small to modest effect sizes (5). However, emerging studies suggest that rare variants may also contribute to disease susceptibility through the accumulation of low-frequency, high-penetrance variants (6). Results from recent association studies are demonstrating that the two hypotheses are not mutually exclusive and that both types of variants are likely important for disease risk. 
Issues of GWA study design, including definition and characterization of individuals for a case–control sample, assessment of population substructure within the study sample, and the importance of replication studies, have been addressed in many excellent reviews (3, 7) and are, therefore, not discussed here. Instead, this chapter focuses on technological issues to consider in designing a GWA study, with attention on choice of genotyping platform, and the kind of data to expect based upon platform selection, population of interest, and specific experimental approach. We also address instances in which a targeted investigation of genes constituting a specific pathway is more appropriate and scientifically meaningful than a global approach, and propose strategies for developing the most comprehensive, logical, and cost-effective approaches, given the population of interest.

2. Gene Association Studies

2.1. Linkage Disequilibrium and Implications for Association Studies

The alleles of SNPs located in the same genomic interval are often correlated with one another. The efforts of the International HapMap Project (1) were based on the aim of identifying groups of such highly correlated SNPs throughout the genome in three major ethnic groups, represented by 90 individuals (30 parent–parent–offspring trios) of European descent from Utah (CEU), 90 Yoruba individuals (30 family trios) from Ibadan, Nigeria (YRI), 45 unrelated individuals from Tokyo (JPT), and 45 unrelated individuals from Beijing (CHB). An important finding from the HapMap project established that the majority of SNPs with a minor allele frequency of at least 5% could be effectively reduced to ~550,000 or 1,100,000 representative markers for individuals of European and Asian or African ancestries, respectively (2). Genome-wide approaches quickly made resourceful use of the knowledge that genotyping a “tag” SNP representative of a group of correlated markers provided information for over 80% of the variants (MAF ≥ 0.05) present in the human genome (8–10).

Nonrandom association between alleles at two different loci is described in terms of linkage disequilibrium (LD), and this correlative relationship varies in a complex and unpredictable manner across the genome and between different populations (11). The level of LD is influenced by a number of factors including genetic linkage, rate of recombination, rate of mutation, genetic drift, nonrandom mating, and population structure. Decay of LD depends largely on rate of recombination; thus, in populations with historically more opportunities for recombination, LD typically extends over shorter distances compared to populations showing less expansion. The range of linkage disequilibrium varies significantly among populations with different gene histories. For example, the range of LD in individuals from Africa typically extends over only short distances, while the range of LD in individuals from populations like the Old Order Amish extends over large chromosomal regions (12). We have also found that LD extends over greater distances in Native American Indians compared to other populations (13). For example, as shown in the comparative plots, LD between markers is stronger and block size is greater in Pima Indians (Fig. 1a) compared to Caucasians (Fig. 1b). This property of LD has strong implications for genome-wide association study design.

LD is typically described in terms of r², the squared correlation coefficient between alleles at two loci (the squared disequilibrium coefficient divided by the product of the two locus-specific allele variances), which serves as a measure of concordance: a value of 1 occurs only when there is complete LD between markers and when the associated alleles have identical frequencies. Examination of pairwise r² values can thus determine which SNPs are largely


Fig. 1. Estimation of pairwise linkage disequilibrium between markers on 8q24 in (a) Pima Indians and (b) Caucasians. Linkage disequilibrium is shown in terms of r², which is a measure of concordance between markers. Haplotype blocks were estimated using the method of Gabriel et al. (46).

redundant (i.e., they give the same information). If two variants are in 100% genotypic concordance with one another (implying r² ≈ 1), then only one is selected as a tag for the locus (the concept of “tagging” refers to the use of one marker, i.e., the tag SNP, as a representative SNP for other markers in LD with each other).
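As a minimal sketch of the pairwise calculation described above, the following Python function computes r² for two biallelic SNPs from phased haplotypes, using the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB. The function name, the two-character haplotype encoding, and the toy samples are illustrative assumptions and are not taken from the chapter or from any genotyping software.

```python
from collections import Counter


def ld_r2(haplotypes):
    """Return r^2 between two biallelic SNPs.

    haplotypes: list of 2-character strings such as "AB", "Ab", "aB", "ab".
    The first character is the allele at SNP 1 ("A" or "a"), the second
    the allele at SNP 2 ("B" or "b"). This encoding is hypothetical.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    # Allele frequencies of "A" at SNP 1 and "B" at SNP 2
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n
    # Frequency of the "AB" haplotype and the disequilibrium coefficient D
    p_ab = counts["AB"] / n
    d = p_ab - p_a * p_b
    # Product of the two locus-specific allele variances
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Toy samples (hypothetical data)
perfect = ["AB"] * 6 + ["ab"] * 4                           # complete LD, equal frequencies
partial = ["AB"] * 5 + ["Ab"] * 1 + ["aB"] * 1 + ["ab"] * 3  # incomplete LD
unequal = ["AB"] * 2 + ["aB"] * 3 + ["ab"] * 5              # "A" only occurs with "B", frequencies differ

for name, sample in [("perfect", perfect), ("partial", partial), ("unequal", unequal)]:
    print(name, round(ld_r2(sample), 2))
```

In the third sample the "A" allele is only ever seen with "B", yet r² stays well below 1 because the two alleles have different frequencies, which is the property the text relies on when it notes that r² = 1 requires identical allele frequencies.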


Generally, an r² value of 0.8 or higher is considered appropriate for one SNP to tag another in genotyping studies (8, 10). A consequence of the calculation for LD is that for r² = 1 between two SNPs, they must have the same minor allele frequency; accordingly, tagging a genome for rare SNPs will require many more tags than for common SNPs. When LD structure is examined across populations, it is also apparent that tag SNPs have a dramatically varied representation of LD among different ethnic groups. When LD is high, tag SNPs represent more markers; in areas of short LD, tag SNPs represent less total genetic variance and, therefore, provide less coverage in that area. Tags are chosen based on population-specific patterns of LD, and it is critical to realize that what makes a marker a good tag in one population may not make it appropriate for another due to differing LD structure. For instance, a strong tag SNP in one population may represent ten SNPs with an r² of 0.8, while in another population the same marker may represent only two markers. As discussed in the following section, the LD structure relevant to a particular population will have important implications for selection of the most appropriate genotyping approach and technological platform.

2.2. Genome-Wide Association Studies

2.2.1. Considerations for Selecting the Appropriate Platform

GWA studies are performed on DNA microarray “chips.” Although these chips are just barely thumb-sized, they contain a dense, regular array of DNA oligonucleotides designed to probe the presence of over a million alleles dispersed throughout the genome. However, not all GWAS platforms are considered equal. The currently available platforms differ in cost, genomic pathway or gene coverage, and added value content (such as mitochondrial SNPs, nonsynonymous SNPs and copy number variants). Other differences, such as amount of target DNA required, protocol complexity, chip throughput, and scanner expense are not dealt with here, but should be considered contingent on experimental concerns. It is well recognized that there are presently two competing technologies leading the market: Affymetrix (Santa Clara, CA) and Illumina (San Diego, CA). Both companies offer products containing roughly 500,000 and 1 million SNPs and chips from these manufacturers also include added content, such as tests for copy number variations. Despite these similarities, the two platforms differ both in terms of physical chip design and SNP content, and not surprisingly, these differences have implications for overall effectiveness and quality of data findings from GWA studies. Chips available from Affymetrix use a printed-array format where each spot of the array contains a cluster of 25-mer oligonucleotides representing either a locus or allele. Genotyping with these chips is based on allelic discrimination resulting from differential hybridization. The costs to produce a printed chip are mainly up-front, making this a cost-effective platform for high-volume studies.

In contrast, chips from Illumina consist of an ordered array of beads (again, one for each locus or allele) representing 50-mer oligonucleotides. This platform uses hybridization and single-base extension at the locus to detect alleles. Although these chips are more expensive than printed arrays and decoding bead positions with this technology is time-intensive, oligonucleotide specificity is much higher with this platform than with that offered by Affymetrix. Further, the Illumina platform also provides the flexibility to quickly adapt new content simply by changing bead pool composition. For example, at the time of this writing, the updated Illumina 1M chip has been redesigned to incorporate new markers detected through the 1000 Genomes Project (14). Despite differences between chips offered by these manufacturers, experiments using both the Affymetrix and Illumina platforms have been largely and equally successful in producing hundreds of hits in GWA studies for complex traits and diseases (4, 15).

2.2.2. Genomic Coverage and Redundancy

A critical consideration for selecting the most suitable platform is determining how well the chip relates to the study population of interest. As noted earlier, the effect of an ungenotyped functional allele may still be gauged if it is correlated with a SNP that is genotyped, thereby obviating the need to test every single variant in a gene or locus. Correspondingly, commercially available microarrays with genome-wide coverage are typically developed using markers that are tags for another variant or series of variants in linkage disequilibrium (LD) with one another. Given the LD structure in the population of interest, the tags available on a chip will cover a certain proportion of LD and thus, a certain proportion of the genome. For SNP selection strategies, Illumina almost exclusively uses a tagging approach based on data from HapMap samples, while Affymetrix primarily focuses on spacing markers evenly throughout the genome. To discuss the ramifications of these two disparate approaches, we discuss their effects on genomic coverage and tagging redundancy. In addressing these areas below, we assume the standard r² cutoff of 0.8 as a measure of significant LD.

\[
\text{Genomic coverage} = \frac{(\text{number of common SNPs in LD} \geq 0.8 \text{ to a tag set}) + (\text{number of tags})}{\text{number of SNPs present in reference population}} \tag{1}
\]

\[
\text{Tagging redundancy} = 1 - \frac{\text{number of groups where a set of tags is in LD} \geq 0.8 \text{ to each other}}{\text{number of tags}} \tag{2}
\]

Genomic coverage is most simply defined as the fraction of SNPs that are represented by being either tags or markers in LD with tags on a genotyping chip in a given population. That is, the more SNPs belonging to a particular population are represented on a chip, the greater the coverage of that genome. While there are more thorough and mathematically detailed ways to define coverage, these are beyond the introductory scope of this chapter. Further, coverage is only one measure to assess the effectiveness of a particular platform; other considerations include overall power with respect to allele frequency, sample size, and expected disease risk (46).

Assuming that tags on a particular chip have been optimized using a training population, Eq. 1 identifies at least three ways in which coverage may be decreased by applying those tags to another reference population. First, coverage may be lowered by reducing the number of tags. Clearly, chips with one million optimized tags will cover a larger fraction of the genome than chips with only half a million tags. Second, SNPs in the reference population could be in low LD with the tags. This can happen if the training population in which the tags were developed has a different LD structure than the reference population. Finally, the size of the reference set could be increased by including newly discovered SNPs outside of tagged LD blocks or by being permissive in which SNPs to include in the LD calculations.

Genomic coverage calculations show that chips provided by Illumina are composed mainly of tags optimized for the HapMap CEU (i.e., Caucasian) population; specifically, the Illumina 1M product is estimated at 0.95 coverage in these individuals (16). In contrast, DNA from the Yoruban (YRI) population sampled from Ibadan, Nigeria, has both more variation and smaller LD blocks than that of the CEU population, resulting in lower coverage (estimated at 0.80). Chips provided by Affymetrix, whose design was based on selecting one million tags evenly spaced across the genome without preference for population-specific tags, give estimated coverages of 0.84 and 0.68 for the CEU and YRI populations, respectively.
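As a rough illustration of Eqs. 1 and 2, the following Python sketch computes coverage and redundancy from a table of pairwise r² values. Everything in it is hypothetical: the SNP names, the r² values, and the function names are invented for the example, and two assumptions are made that the text does not specify. Tags absent from the reference population are not counted, and a "group" of tags in Eq. 2 is taken to be a set of tags connected directly or through a chain of tags at r² ≥ 0.8.

```python
from itertools import combinations


def ld_table(pairs):
    """Build an order-independent lookup from {(snp_a, snp_b): r2} pairs."""
    return {frozenset(pair): value for pair, value in pairs.items()}


def genomic_coverage(reference_snps, tags, r2, threshold=0.8):
    """Eq. 1: (SNPs in LD >= threshold with a tag + tags) / reference SNPs."""
    reference = set(reference_snps)
    tag_set = set(tags) & reference  # assumption: ignore tags absent from the reference
    covered = set(tag_set)
    for snp in reference - tag_set:
        if any(r2.get(frozenset((snp, tag)), 0.0) >= threshold for tag in tag_set):
            covered.add(snp)
    return len(covered) / len(reference)


def tagging_redundancy(tags, r2, threshold=0.8):
    """Eq. 2: 1 - (number of tag groups) / (number of tags)."""
    parent = {tag: tag for tag in tags}  # union-find parent pointers

    def find(tag):
        while parent[tag] != tag:
            tag = parent[tag]
        return tag

    # Assumption: a group is a connected component of tags linked at r2 >= threshold.
    for a, b in combinations(tags, 2):
        if r2.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)
    groups = {find(tag) for tag in tags}
    return 1 - len(groups) / len(tags)


# Hypothetical toy data: the same six SNPs in two populations with different LD.
reference = ["snp1", "snp2", "snp3", "snp4", "snp5", "snp6"]
tags = ["snp1", "snp4"]
pop_a = ld_table({("snp1", "snp2"): 0.95, ("snp1", "snp3"): 0.85,
                  ("snp4", "snp5"): 0.90, ("snp4", "snp6"): 0.82})
pop_b = ld_table({("snp1", "snp2"): 0.90, ("snp1", "snp3"): 0.40,
                  ("snp4", "snp5"): 0.60, ("snp4", "snp6"): 0.30})

coverage_a = genomic_coverage(reference, tags, pop_a)            # training-like population
coverage_b = genomic_coverage(reference, tags, pop_b)            # weaker LD with the tags
coverage_fewer_tags = genomic_coverage(reference, ["snp1"], pop_a)
coverage_bigger_ref = genomic_coverage(reference + ["snp7", "snp8"], tags, pop_a)

chip_tags = ["snp1", "snp2", "snp4", "snp6"]
redundancy_a = tagging_redundancy(chip_tags, pop_a)
redundancy_b = tagging_redundancy(chip_tags, pop_b)
```

The three extra coverage calls mirror the three ways of lowering coverage listed above: a population with weaker LD to the tags, fewer tags, and a larger reference set. Restricting `reference_snps` to the SNPs in a pathway or region of interest would give a targeted coverage estimate using the same function.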
Information related to population-specific LD structure can also be used to analyze the cost of a study in the following way. Coverage provided by a chip is inversely proportional to the number of samples needed in common disease models (17). For instance, if N subjects are needed for a specified level of power when the disease variant is observed directly (that is, the marker in question is a tag), then N/r² samples are needed to achieve that same level of power if the true variant is in LD with the tag at a given r² (18). The ratio of Illumina to Affymetrix coverage for CEU individuals using the one million SNP chips is 1.13 (0.95/0.84), implying that 13% more samples will need to be run on the Affymetrix platform to equal the power of the same study using the Illumina platform. Similarly, a study composed of YRI (or other African ancestry) samples would require 17% more samples using the Affymetrix platform. Clearly, in these cases, one would want to consider the relative costs of the chips, along with the costs of acquiring and running the increased number of samples, before selecting a genotyping platform.

Findings from GWA studies can also be significantly impacted by redundancy. As shown in Eq. 2, redundancy is simply the fraction of tags that are effectively represented by other SNPs on the chip. When two tags are in LD with each other, one SNP effectively tags the other; if a chip contains many instances of tags in LD, the redundancy can be high. Redundancy can also gauge the robustness of data in the presence of individual tag failure. It is not unusual to have a tag fail in a sample due to a nearby variant interfering with hybridization, and a measure of redundancy can provide reassurance that variants in the tagged region are adequately queried. We have estimated redundancy in the CEU population from the HapMap resource for the Affymetrix 500K and 1M chips to be 0.46 and 0.50, respectively, and for the Illumina 500K and 1M chips to be 0.26 and 0.46, respectively (unpublished data). A similar redundancy level for both Affymetrix platforms is expected due to the evenly spaced tag SNP selection strategy. On the other hand, evaluation of the Illumina platform suggests a point of diminishing returns with an increase in the number of optimized tags, with a measure of redundancy approaching that of the Affymetrix product. As mentioned earlier, Illumina recently updated their 1M chip by removing some of this redundancy, and these markers have been replaced with added content from the 1000 Genomes Project.

Although the HapMap resource offers comprehensive information pertaining to three ethnically diverse populations of Caucasian, African, and Asian ancestries, these genotype data do not always extend easily to other populations.
For example, we found that only 30–40% of common variants on the Affymetrix array have r² > 0.80 with another marker in American Indian populations (19), implying that genetic variation may not be exhaustively captured with this chip in this population. In this population, it is possible that variants with important effects on disease susceptibility may not be identified due to differences in LD and incomplete marker coverage. Similar results may be expected for other populations with ancestries different from those represented in the HapMap database. In these cases, populations with larger LD blocks are assumed to show correspondingly more redundancy on genome-wide chips. In other words, a larger LD block will carry more tag SNPs, which will correspondingly account for a greater proportion of genetic variation at that locus. Tools such as EIGENSTRAT (20) or STRUCTURE (21, 22), in combination with a set of SNPs testing for admixture (ranging from several hundred to a few thousand), can assess the relatedness of a population of interest to the populations represented in the HapMap resource. This is a well-recognized consideration when testing individuals not belonging to the HapMap study samples. For example, Illumina provides a product composed of 1,509 markers, the African American Admixture Panel, that provides estimates of varying degrees of Caucasian and African ancestry in African American individuals. Despite products such as this, as well as the overall appreciation for LD structure and its implications for GWA study design, it is imperative to understand that what is currently known about LD in world populations is based upon examination of a relatively small number of individuals representing only three ethnic groups. As such, there is much about whole genome LD structure that remains to be learned.

2.2.3. Getting the Most Out of the Chip Selected

We can extend the concept of coverage to any portion of the genome. This portion may be defined by any prior knowledge indicating possible areas of association, such as localization to a pathway, gene family, or portion of a chromosome. One can run coverage calculations on just these domains to see which technology best addresses this emphasized content, thereby increasing the chances of identifying a susceptibility locus. However, by choosing a GWAS chip based on an analysis of coverage in a subset of the genome, the potential for bias is introduced into the study design. Care must be taken to incorporate the best knowledge of where the association could lie. As an example, if a researcher is specifically interested in diseases related to inflammation pathways, then consideration of coverage for inflammation-related genes on different chips is warranted. In this case, it is possible that the regions of interest may be similarly covered on two different chips but at vastly different prices. On the other hand, one may decide that the overall coverage for a targeted area is so poor on available chips that it is necessary to design a custom chip representing the region of interest.

All of the chips mentioned in this chapter can interrogate more than just tag SNPs. As extra content, these chips contain probes that directly interrogate copy number variants (CNVs), loss of heterozygosity (LOH), mitochondrial SNPs, SNPs in recombination hotspots, expressed SNPs, and otherwise “high-value” SNPs, as determined by the research and medical communities. This added value content should also be considered if relevant to the research goals. Because of the growing interest in CNVs as determinants of complex disease (16, 23–25), SNP genotyping manufacturers have provided products to meet the demand for assessing these variants. At this time, the Affymetrix products include 2,000 and 5,677 probes specifically designated for known CNVs on its 5.0 and 6.0 chips, respectively, while the Illumina products include 5,000, 4,247, and 6,000 probes specifically designated for known CNVs on its 660K, 1M duo, and 1M Omni chips, respectively.

SNPs can also be used to boost CNV coverage. There are numerous algorithms available that use neighboring SNP tags to predict where CNVs have occurred, such as cnvPartition (http://www.illumina.com), Canary (http://www.Affymetrix.com), the freeware PennCNV (26–29), and the vendor software CNAM (http://www.goldenhelix.com). These algorithms calculate the likelihood that there has been a change in heterozygosity or model consistent deviations in probe hybridization for a continuous stretch of SNPs in a particular sample. CNVs for which probes have specifically been designed are usually covered with a density of one marker every 15–50 bases. This allows resolution of CNVs smaller than 1 kb. Software that uses genotype calls to assess CNVs is subject to the tag distribution on a particular chip, where the density of SNPs ranges from one every 3–10 kb (500K chips) to one every 1–4 kb (1M chips). As available algorithms currently require a stretch of at least four neighboring SNPs to predict a CNV, resolution decreases to sizes larger than 12 and 5 kb for the 500K and 1M chips, respectively.

2.2.4. Pooled DNA Approaches to GWAS

Because GWAS require large numbers of carefully phenotyped cases and controls, the present costs to undertake such an approach remain generally beyond the fiscal reach of many research groups (30). To circumvent the costs associated with GWA studies, we (31), and others (32–40) have explored the feasibility of genomewide scans using pooled genomic DNA. Multiple studies have assessed the accuracy of allelic frequency predictions using the Affymetrix 10K or 100K platforms (32–40). In these studies, fairly good agreement in allele frequency differences was observed in genotyping between pools of individuals, suggesting that this approach may be suitable and accurate for identification of susceptibility alleles for complex disease. Consideration of the strengths and limitations of a pooled approach is useful before designing a GWA study. The major strength of a pooled DNA genotyping design is its cost-effectiveness, while the most significant weakness is the loss of power when compared to results obtained from genotyping individual samples. For example, in our work, we estimated the loss in power due to utilization of a pooled study design to be approximately 17% (31). However, previous studies have convincingly shown this approach to be an economical and efficient strategy for accurately identifying markers showing significant differences in allele frequency between affected and unaffected individuals in complex diseases such as melanoma, palsy, and Alzheimer disease, despite similar reductions in power due to a pooled design (41, 42). Similarly, we have utilized this approach to identify loci which may contribute to the development of end-stage renal disease attributed to type 2 diabetes, which were later validated in individuals with ESRD attributed to type 1 diabetes. Thus, the potential for identifying new risk loci for complex diseases may override the loss of power associated with a pooled study design.

2.3. Characterization of Targeted Regions

Consideration of technological issues and experimental design in gene association studies is also pertinent in the characterization of specific genomic regions targeted by linkage analysis, GWAS findings, gene expression results, or specific candidate gene investigation. For example, findings obtained from a GWAS approach may pinpoint a region of potential interest, but because of issues related to incomplete coverage, a deeper investigation of the locus is warranted. In these cases, two approaches are generally utilized: sequencing the region of interest in a representative sample to identify all genetic variation or genotyping more markers throughout the locus of interest to refine the association signal and identify putative functional variants. Although the merits and limitations of a sequencing approach are discussed in detail in a companion chapter, it is worth mentioning that this is presently an area of rapid evolution in technologies, with equipment costs decreasing and amount of base sequence reads increasing. Sequencing reveals many types of variation including rare variants, CNVs, SNPs with more than two alleles, insertions, deletions, and inversions. There are also new cost-cutting strategies being implemented such as pooling and pooling-indexed samples (also known as multiplexing) in which multiple samples can be uniquely identified and sequenced together in the same reaction. Multiplexing has similar cost savings as pooling, yet because each sample remains distinct, this strategy offers more flexibility in data analysis and is therefore potentially more powerful than a pooling approach. When designing probes for fine-map genotyping, there are many choices of technologies available. For a moderate sample of ~300 SNPs, platform options include the iPLEX technology (Sequenom, Inc.; San Diego, CA) or the VeraCode system (Illumina). For more than 300, but less than 70,000, SNPs, both Illumina and Affymetrix offer suitable platforms. 
In designing an assay, one must naturally consider a selection strategy. This process can be broken down into three parts: choosing high-value SNPs, tagging SNPs to maximize coverage, and marker density. High-value SNPs, like nonsynonymous or exon splice site variants, are usually directly targeted. In so doing, potential uncertainty about the genotype call is avoided, and with findings of association, functional significance is already implied based upon the nature of the marker. Knowledge of the LD structure must also be considered when choosing tagging SNPs to maximize coverage. In cases of large genes/loci or a huge LD/haplotype block, it is more efficient to genotype a denser set of markers throughout the region of interest. Two common and easy-to-use SNP selection tools are Tagger (43) and ldSelect (44), which provide similar output but are distinguished by different tagging algorithms. The Tagger algorithm selects tags based on direct association with a set of SNPs, given population-specific LD structure. The Tagger program also uses LD patterns and multimarker analysis to minimize the number of tags needed. In contrast, the ldSelect algorithm assesses LD between markers, bins markers according to a specified r² value, and partitions haplotypes into blocks represented by tags. In general, both algorithms are excellent, but a Tagger output typically requires fewer tags to cover a region, which is important to consider for genotyping costs and available space in the assay design. Multi-SNP tagging is another possibility with Tagger, although this is quite sensitive to LD structure. However, in cases where the tags chosen are not the causal alleles and LD structure is poorly understood, haplotype structure has an advantage, making the ldSelect algorithm a better choice for SNP selection. Given the current trend toward large GWAS meta-analyses, algorithms have been designed to choose tag SNPs over multiple populations in follow-up studies (45). These algorithms tend to be sensitive to SNP “missingness,” in that SNPs may be removed from the analysis when they are not present (or fail a given MAF cutoff) in one of the populations. Finally, due to low or unknown LD structure, there are regions of the genome that may require coverage with evenly spaced SNPs at a specified density (e.g., 1 marker/kb).
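The idea of binning markers by a specified r² value can be sketched in a few lines of Python. This is a simplified greedy scheme written only for illustration; it is not the published ldSelect or Tagger implementation, and the function name, SNP names, and r² values are all hypothetical. At each step the SNP in LD (r² ≥ threshold) with the largest number of still-unbinned SNPs is chosen as a tag, and that SNP plus its partners form one bin.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Greedily bin SNPs by pairwise r2; returns a list of (tag, members) tuples."""
    remaining = list(snps)  # input order doubles as the tie-breaker
    bins = []
    while remaining:
        def partners(snp):
            # Unbinned SNPs whose r2 with `snp` meets the threshold.
            return [other for other in remaining
                    if other != snp
                    and r2.get(frozenset((snp, other)), 0.0) >= threshold]

        # Pick the SNP that captures the most unbinned SNPs (first one wins ties).
        tag = max(remaining, key=lambda snp: len(partners(snp)))
        members = [tag] + partners(tag)
        bins.append((tag, members))
        remaining = [snp for snp in remaining if snp not in members]
    return bins


# Hypothetical pairwise r2 values for six SNPs in one population.
r2_values = {frozenset(pair): value for pair, value in {
    ("snpA", "snpB"): 0.90,
    ("snpA", "snpC"): 0.85,
    ("snpB", "snpC"): 0.92,
    ("snpC", "snpD"): 0.30,
    ("snpD", "snpE"): 0.95,
}.items()}

bins = greedy_ld_bins(["snpA", "snpB", "snpC", "snpD", "snpE", "snpF"], r2_values)
selected_tags = [tag for tag, _ in bins]
```

A SNP with no partner above the threshold, such as the hypothetical `snpF`, ends up in a bin of its own and must be genotyped directly. Running the same function on r² values from a second population would generally give different bins, which is the population dependence discussed above.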

3. Conclusions

Gene association studies offer a powerful approach for identifying allelic variants underlying complex traits and common diseases. Genome-wide assessment of markers is presently the most widely used strategy for identifying gene regions associated with non-Mendelian phenotypes and to date, hundreds of SNP-disease associations have been reported using GWAS. Consideration of factors such as population-specific LD structure, genomic coverage, redundancy, and ethnic admixture presents profound implications for data resulting from such studies. The chances of successfully identifying a true “causal” locus may be enhanced through consideration of these issues during the experimental study design.

References

1. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P (2005) A haplotype map of the human genome. Nature 437:1299–1320
2. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA et al (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861
3. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9:356–369
4. Manolio TA, Brooks LD, Collins FS (2008) A HapMap harvest of insights into the genetics of common disease. J Clin Invest 118:1590–1605
5. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17:502–510
6. Bodmer W, Bonilla C (2008) Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 40:695–701
7. Pearson TA, Manolio TA (2008) How to interpret a genome-wide association study. JAMA 299:1335–1344
8. Barrett JC, Cardon LR (2006) Evaluating coverage of genome-wide association studies. Nat Genet 38:659–662
9. Clark AG, Li J (2007) Conjuring SNPs to detect associations. Nat Genet 39:815–816
10. Pe’er I, de Bakker PI, Maller J, Yelensky R, Altshuler D, Daly MJ (2006) Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet 38:663–667
11. Slatkin M (2008) Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nat Rev Genet 9:477–485
12. Hsueh WC, Mitchell BD, Aburomia R, Pollin T, Sakul H, Gelder Ehm M et al (2000) Diabetes in the Old Order Amish: characterization and heritability analysis of the Amish Family Diabetes Study. Diabetes Care 23:595–601
13. Millis MP, Bowen D, Kingsley C, Watanabe RM, Wolford JK (2007) Variants in the plasmacytoma variant translocation gene (PVT1) are associated with end-stage renal disease attributed to type 1 diabetes. Diabetes 56:3027–3032
14. Siva N (2008) 1000 Genomes project. Nat Biotechnol 26:256
15. Ragoussis J (2009) Genotyping technologies for genetic research. Annu Rev Genomics Hum Genet 10:117–133
16. Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D et al (2008) A robust statistical method for case-control association testing with copy number variation. Nat Genet 40:1245–1252
17. Li C, Li M, Long JR, Cai Q, Zheng W (2008) Evaluating cost efficiency of SNP chips in genome-wide association studies. Genet Epidemiol 32:387–395
18. Pritchard JK, Przeworski M (2001) Linkage disequilibrium in humans: models and data. Am J Hum Genet 69:1–14
19. Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA et al (2006) A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet 38:1251–1260
20. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909
21. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587
22. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
23. Cooper GM, Zerr T, Kidd JM, Eichler EE, Nickerson DA (2008) Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 40:1199–1203
24. Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S et al (2008) Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 40:1253–1260
25. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A et al (2008) Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 40:1166–1174
26. Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H et al (2008) Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res 36:e126
27. Spencer CC, Su Z, Donnelly P, Marchini J (2009) Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5:e1000477
28. Wang K, Chen Z, Tadesse MG, Glessner J, Grant SF, Hakonarson H et al (2008) Modeling genetic inheritance of copy number variations. Nucleic Acids Res 36:e138
29. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF et al (2007) PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17:1665–1674
30. Wang WY, Barratt BJ, Clayton DG, Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109–118
31. Hanson RL, Craig DW, Millis MP, Yeatts KA, Kobes S, Pearson JV et al (2007) Identification of PVT1 as a candidate gene for end-stage renal disease in type 2 diabetes using a pooling-based genome-wide single nucleotide polymorphism association study. Diabetes 56:975–983

32. Brohede J, Dunne R, McKay JD, Hannan GN (2005) PPC: an algorithm for accurate estimation of SNP allele frequencies in small equimolar pools of DNA using data from high density microarrays. Nucleic Acids Res 33:e142
33. Meaburn E, Butcher LM, Schalkwyk LC, Plomin R (2006) Genotyping pooled DNA using 100K SNP microarrays: a step towards genome-wide association scans. Nucleic Acids Res 34:e27
34. Meaburn E, Butcher LM, Liu L, Fernandes C, Hansen V, Al-Chalabi A et al (2005) Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics 6:52
35. Craig I, Meaburn E, Butcher L, Hill L, Plomin R (2005) Single-nucleotide polymorphism genotyping in DNA pools. Methods Mol Biol 311:147–164
36. Kirov G, Nikolov I, Georgieva L, Moskvina V, Owen MJ, O’Donovan MC (2006) Pooled DNA genotyping on Affymetrix SNP genotyping arrays. BMC Genomics 7:27
37. Craig I, Plomin R (2006) Quantitative trait loci for IQ and other complex traits: single-nucleotide polymorphism genotyping using pooled DNA and microarrays. Genes Brain Behav 5(Suppl 1):32–37
38. Liu QR, Drgon T, Walther D, Johnson C, Poleskaya O, Hess J et al (2005) Pooled association genome scanning: validation and use to identify addiction vulnerability loci in two samples. Proc Natl Acad Sci USA 102:11864–11869
39. Butcher LM, Meaburn E, Dale PS, Sham P, Schalkwyk LC, Craig IW et al (2005) Association analysis of mild mental impairment using DNA pooling to screen 432 brain-expressed single-nucleotide polymorphisms. Mol Psychiatry 10:384–392
40. Butcher LM, Meaburn E, Knight J, Sham PC, Schalkwyk LC, Craig IW et al (2005) SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Hum Mol Genet 14:1315–1325
41. Brown KM, Macgregor S, Montgomery GW, Craig DW, Zhao ZZ, Iyadurai K et al (2008) Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat Genet 40:838–840
42. Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, Homer N et al (2007) Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet 80:126–139
43. de Bakker PI, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D (2005) Efficiency and power in genetic association studies. Nat Genet 37:1217–1223
44. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 74:106–120
45. Howie BN, Carlson CS, Rieder MJ, Nickerson DA (2006) Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet 120:58–68
46. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B et al (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229

Chapter 2

Statistical Issues in Gene Association Studies

Richard M. Watanabe

Abstract

This chapter reviews statistical issues related to gene association studies. The goal is to review various aspects of study design and analysis for individuals who do not have an extensive statistical background. We will review statistical issues as they relate to both genome-wide and candidate gene studies. Topics reviewed include study design, power and sample size, data checking, statistical methods, population stratification, and multiple testing. We draw examples from the type 2 diabetes genetics literature to illustrate some of the issues discussed.

Key words: Statistics, Study design, Statistical power, Sample size, Population stratification, Statistical analysis, Multiple testing

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_2, © Springer Science+Business Media, LLC 2011

1. Introduction

The positional cloning approach has been used with great success to identify genes underlying susceptibility to a variety of monogenic disorders; the first success story was cystic fibrosis (1, 2). This success led to the application of this approach to complex diseases. However, the low penetrance of genes underlying susceptibility to complex diseases, combined with the complex interplay among genes and environmental exposures, resulted in relatively few success stories. The relatively low success rate was a consequence of a convergence of many issues. Genomic technology was initially limited to labor-intensive approaches such as sequencing and microsatellite marker genotyping. Technology for the latter allowed for genotyping of several hundred markers in several hundred individuals over the course of weeks or months and only provided genome-wide scans on the centimorgan scale. This rather low resolution meant that even if evidence for a possible locus was detected, megabases of DNA would need to be screened to identify the specific locus. Additionally, statistical methodology was limited to linkage analysis (3–7) in related individuals, which integrated relatively difficult patient recruitment with methodology of relatively low statistical power.

In 1996, Risch and Merikangas published a landmark paper in which they showed theoretically, and by computer simulation, that gene association was statistically more powerful than linkage analysis (8). Based on their results, Risch and Merikangas put forward the idea of genome-wide association. Quoting from their paper, “[F]or example, imagine the time when all human genes (say 100,000 in total) have been found and that simple, diallelic polymorphisms in these genes have been identified. Assume that five such diallelic polymorphisms have been identified within each gene, so that a total of 10 × 10⁵ = 10⁶ alleles need to be tested.” Unfortunately, at that time “all” human genes had not been identified, a catalog of polymorphisms across the human genome did not exist, and the technology to genotype polymorphisms on such a scale in large numbers of individuals was not feasible.

However, after the year 2000, a series of significant events occurred that allowed investigators to seriously consider genome-wide association (GWA) studies. In 2001, the Human Genome Project published results from the first draft of the human genome (9) and by 2005 the International Haplotype Mapping Project (HapMap) had cataloged over 6 million single nucleotide polymorphisms (SNPs) across the human genome (10). Most importantly, the information from both the Human Genome and HapMap projects was made available to the public via the internet. Finally, improved genotyping technologies became available that allowed for hundreds of thousands of SNPs to be efficiently and cost-effectively genotyped in thousands of individuals (11).
Of note, the human genome is currently estimated to encompass 20–25,000 genes (9), which is substantially lower than the 100,000 human genes speculated by Risch and Merikangas (8). Together, these events culminated in the birth of the era of genome-wide association (GWA) studies. Although prior to the year 2000, a limited number of susceptibility genes for complex diseases had been identified, as we move forward into the new century, GWA has led to a literal explosion in the identification of genes underlying susceptibility to such diseases, as well as those contributing to variation in quantitative traits. This is exemplified by the number of GWA-based publications appearing in the literature, which the National Human Genome Research Institute is attempting to catalog and make available through a public archive (http://genome.gov/gwastudies/). As of October 17, 2010, this catalog lists 702 publications with 3,406 SNPs for a variety of diseases and quantitative traits. In this chapter, we will review statistical issues related to gene association studies, including both GWA and candidate gene


investigations. We will consider study design and statistical power, and discuss specific issues related to statistical analysis and interpretation of results. We will emphasize various considerations that must be made to avoid the so-called lemming effect, which has recently led to a skewed view that all gene association studies must meet the GWA standard. Therefore, we will also discuss the merits of smaller scale gene association studies. The goal is to provide a broad synopsis to assist those whose “home” is not statistics to better understand and interpret gene association studies. Finally, throughout this chapter we will cite examples from the type 2 diabetes mellitus literature, given the recent successes of GWA studies in this disease.

2. Methods

2.1. Study Design

Any good investigation starts with a solid study design. For gene association studies, it is critical to define the primary outcome of interest, the potential gene effect size, and the assumptions underlying the estimate of the magnitude of the gene effect size. These choices have implications not just for sample size/power, but also for the statistical approach that might be used to test the hypothesized association. For complex diseases, the assumptions have typically revolved around the “common disease, common variant” hypothesis. One premise of this hypothesis has been that disease susceptibility variants have low effect sizes and this has, for the most part, turned out to be true. Over a dozen susceptibility variants have now been identified for type 2 diabetes (12–17) and the estimated odds ratios for disease reported for individual susceptibility variants has generally been 2.0) than that observed for disease susceptibility (42, 45). If we examine Fig. 1, even on the genome-wide scale this larger effect size results in a significant smaller sample size requirement. The PPARG example highlights three important concepts. First, gene effects need to be considered carefully within the context being examined. One additional characteristic of the pharmacogenetics underlying response to TZD therapy is that the nonresponse rate appears to be consistent across very diverse samples and populations. This suggests that the traditional complex disease model of many genes of small effect interacting among themselves and with environmental exposures is not likely to be correct in the pharmacogenetics context. Second, the validity of results from any given study is not dependent upon whether the sample size was on the scale of a GWA or whether it achieved genome-wide statistical significance. 
The success of GWA studies, especially GWA-based meta-analyses, has raised confidence in the GWA-based results, while raising questions regarding previously reported associations that have not replicated in GWA studies. However, many of the previous studies were either poorly designed, under-powered, or did not fully appreciate the issue of multiple testing. Thus, each association result must be carefully evaluated within the context in which it was performed. The fact that a study is not performed in a GWA context does not necessarily negate the results. The ability to identify genetic variants contributing to susceptibility to disease or to variation underlying a quantitative trait


essentially reduces to an issue of signal-to-noise ratio. While there are many approaches to this problem, the most common is to increase the quality of the signal, such that it rises above the noise and is detectable. Improving the “signal” can be achieved by increasing the sample size such that the “noise” is reduced and “signal” improved. This is the strategy that essentially underlies GWA studies and has proven to be very successful. In fact, the need for ever larger sample sizes has resulted in large consortia that pool together results from multiple GWA studies, as has been successfully implemented for type 2 diabetes and related traits (12–29, 46–48). An alternative is to be more selective in the design of the study sample to reduce “noise” and thereby increase the quality of the “signal.” In the case of type 2 diabetes, this may include refining the definition of what constitutes a case, such as selecting only subjects with younger age-at-onset or subjects who only develop secondary complications. For quantitative traits, this may mean selecting subjects within the extremes of the distribution of the trait of interest or selecting subjects stratified by a secondary trait. Alternatively, one can ascertain subjects based on traits that are predictive of future disease. In the case of type 2 diabetes, one such trait is the disposition index (49), which is a strong predictor of future type 2 diabetes (50–52). One important consideration with respect to case–control sampling that is typically ignored is that cases are usually defined using standard clinical criteria. However, clinical definitions change over time and are defined with healthcare issues as a primary concern. For example, in the USA the definition of diabetes established in 1979 (53), changed in 1997 (54) and differed from the World Health Organization definitions, which changed in 1985 and 1998 (55, 56). 
Furthermore, the clinical definition is based on examining risk for diabetes complications, which has additional implications given that not all individuals with disease will develop complications. Similar changes in definition can be seen for other complex diseases such as hypertension (57). While such definitions are defensible for clinical care purposes, they may not be optimal for genetic studies as such criteria may introduce significant heterogeneity. This practice likely accounts, in part, for the low odds ratios associated with disease susceptibility variants and the need for large sample sizes. The selection effect can be illustrated by looking at the association between transcription factor 7-like 2 (TCF7L2) and type 2 diabetes. TCF7L2 was among the first susceptibility genes identified for type 2 diabetes (58) and confirmed in GWA studies (12–15) and other large case–control association studies (59–64). Some groups reported that the odds ratio for the association between specific TCF7L2 variants and type 2 diabetes increased when they restricted their sample to individuals with a body mass index (BMI) < 30 kg/m2 (61, 63, 64). Also, several smaller studies


have examined the association between gestational diabetes mellitus (GDM) and variation in TCF7L2 and reported strong evidence for association despite having only hundreds of GDM cases (65–67) and with odds ratios slightly higher than that reported for type 2 diabetes.

Quantitative traits present additional challenges other than those mentioned above. Is the primary goal to identify genes underlying variation in a given trait in the population? Or are we interested in identifying genes underlying variation for a given trait in the context of disease? If the former, then a random sample from the population may be appropriate so as to capture the entire trait distribution. If the latter, additional confounding effects of disease treatment and its impact on the quantitative trait of interest needs to be considered. Also, given a case–control sample for disease association, the use of the control individuals for quantitative trait association will likely truncate the trait distribution, which may reduce statistical power. Additionally, if specific ascertainment criteria are used to collect samples, then one needs to be cautious about extending study observations to the general population. For highly selected samples, one could consider the application of ascertainment correction (68–73) to allow extension of study results to the general population.

When considering power or sample size for quantitative trait associations, it is useful to frame the problem in terms of the fraction of variance in the trait accounted for by the quantitative trait locus (QTL). This general framework is derived from the seminal work of R.A. Fisher (74) who demonstrated that the fraction of the total trait variance ($\sigma^2$) attributable to the effect of polygenes ($\sigma_G^2$) could be defined as follows:

$$\sigma^2 = \sigma_G^2 + \sigma_e^2 = \sigma_A^2 + \sigma_D^2 + \sigma_e^2, \tag{1}$$

where $\sigma_A^2$ is the additive genetic variance component or the additive effect of polygenes, $\sigma_D^2$ is the dominance genetic variance component or the nonlinear interaction between alleles at a given locus, and $\sigma_e^2$ is the environmental variance component or all other sources of variation. Fisher demonstrated that given allele frequencies for a genetic marker and known quantities for the additive and dominance genetic effects, the genetic variance component could be estimated. This framework is easily extended to quantify the effect of a single QTL by extracting it from the additive genetic component:

$$\sigma^2 = \sigma_G^2 + \sigma_e^2 = \sigma_{QTL}^2 + \sigma_A^2 + \sigma_D^2 + \sigma_e^2. \tag{2}$$
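As a rough illustration of how the QTL variance fraction in Eq. 2 connects to sample size, the following sketch may help. It is not the method used to produce the figures in this chapter. It assumes a purely additive biallelic QTL in Hardy-Weinberg proportions (so that the QTL variance is 2p(1 − p)a², with p the allele frequency and a the per-allele effect), a 1-df association test whose noncentrality is approximately n·h²/(1 − h²), and a normal approximation to power. The function names are hypothetical.

```python
from statistics import NormalDist


def qtl_variance_fraction(p, a, total_variance):
    """Fraction of trait variance explained by an additive biallelic QTL.

    p: frequency of the reference allele (assumption: Hardy-Weinberg proportions)
    a: change in the trait per copy of the reference allele (no dominance)
    total_variance: total phenotypic variance of the trait
    """
    qtl_variance = 2.0 * p * (1.0 - p) * a ** 2
    return qtl_variance / total_variance


def min_detectable_fraction(n, alpha, power=0.80):
    """Approximate smallest variance fraction detectable with n subjects.

    Uses a two-sided normal approximation for a 1-df test; the required
    noncentrality (z_alpha + z_power)^2 is set equal to n * h2 / (1 - h2).
    """
    z = NormalDist()
    z_alpha = -z.inv_cdf(alpha / 2.0)  # two-sided critical value
    z_power = z.inv_cdf(power)         # quantile for the target power
    ratio = (z_alpha + z_power) ** 2 / n
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    for alpha in (1e-3, 1e-4, 1e-8):
        h2 = min_detectable_fraction(500, alpha)
        print(f"alpha={alpha:g}: detectable variance fraction ~ {h2:.3f}")
```

Under these assumptions, 500 subjects at 80% power give detectable fractions of approximately 3.3%, 4.3%, and 8.0% for significance levels of 0.001, 0.0001, and 10⁻⁸. Exact values depend on the power method used.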

Figure 2 shows an example where power is computed at various $\alpha$-levels for a sample of 500 individuals. At $\alpha = 0.001$, which is already more stringent than the typical $\alpha = 0.05$, 500 individuals would allow detection of a QTL that explains ~3.5% of the total trait variance at 80% power. As the $\alpha$-level is tightened to 0.0001 and to the genome-wide significance level of $10^{-8}$, the same sample size only allows detection of QTLs that explain 4.5 and 8% of the total trait variance, respectively. Thus, as was seen with disease status, sample size and power are largely driven by the hypothesized effect size. However, as mentioned above, it is extremely important to carefully consider effect size in the context of the specific hypothesis being addressed.

Fig. 2. Statistical power to detect association with a quantitative trait. Statistical power to detect association is shown as a function of gene effect size (represented as percent trait variance explained) at a fixed sample size of 500 subjects and 80% power for three different levels of significance. The solid line shows power at $\alpha = 1 \times 10^{-3}$, the dashed line at $\alpha = 1 \times 10^{-4}$, and the dotted line at $\alpha = 1 \times 10^{-8}$.

2.2. Statistical Analysis

2.2.1. Data Checking

Testing for association between genetic variation and outcomes of interest can take a variety of different forms, each with advantages and disadvantages. However, before any data analysis can occur, it is important that both genotype and phenotype data are free from error. Genotype data should be checked for consistency with Hardy Weinberg equilibrium (HWE) (75). Deviation from HWE can be indicative of genotyping errors, e.g., heterozygotes called as homozygotes, which distorts HWE proportions. However, assuming no evidence for genotyping error is found, the fact that a given marker fails HWE does not necessarily exclude that marker from genetic analysis. In general, for most data analysis approaches, including markers that have failed HWE is acceptable, although results for those markers should be interpreted taking into account that they are out of HWE. The exception is when genotype data are analyzed using tests of proportions, such as chi-square analysis, where the deviation from HWE can result in false positives. Additional genotype data checks should include completeness of genotyping, which could be indicative of nonrandom missing data and can also skew results. Testing only those markers that pass stringent completeness criteria will tend


to protect one against this problem, although it does not guarantee total protection from the effects of nonrandom missing data.

If the sample includes related individuals, such as parent–offspring trios, nuclear families, or extended pedigrees, additional tests of non-Mendelian inheritance should be performed. Although family members typically self-report as being a specific relative type, frequently such information is inaccurate. Thus, Mendelian inheritance checks can reveal unknown half-siblings, adoptions, uncertain parentage, etc. Although most statistical genetic analysis software will perform basic data checks, such as non-Mendelian inheritance, specific programs are available that perform the spectrum of data checking. Examples of such software include PEDCHECK (76), MENDEL (77), PREST (78), and PEDSTATS (79). If sufficient marker data are available, likelihood-based methods that test for specific familial relationships can be used to identify additional errors. Such methods are implemented in a variety of software including RELPAIR (80, 81) and PREST (78). In general, these packages should be used to perform thorough data cleaning prior to any data analysis, rather than relying on built-in data checks in the analysis software, whose routines tend to be less stringent. In this context, the maxim “garbage in, garbage out” is highly relevant.

2.2.2. Association Testing

Methods for testing association between a marker and phenotype are partly dependent upon the type of sample being analyzed. Analysis of unrelated individuals can be accomplished using standard statistical methods that are familiar to most investigators. In the case of association with disease status, one is interested in determining if the frequency of a specific marker allele is significantly different between cases and controls. Such contrasts can be immediately accomplished using chi-square analysis. However, such analysis does not allow one to adjust for potential confounders, which can be accomplished using the logistic regression framework. When testing under this framework, one consideration is how to code the genetic data. Traditionally, one assumes a specific genetic model (additive, dominant, or recessive) although alternative models such as multiplicative or simple contrast among genotype groups can be accomplished. The logistic model can be written as follows:

$$\operatorname{logit}\,\Pr(D \mid G) = \beta_0 + \beta_1 G + \sum_i \beta_i C_i, \tag{3}$$

where $\operatorname{logit}(p) = \ln[p/(1-p)]$, so that the probability of disease ($D$) given a specific genotype ($G$) is dependent upon $G$ and a series of covariates ($C_i$). In this framework, $G$ is coded for the specific genetic model, $\beta_1$ is tested for significance, and the exponent of $\beta_1$ ($e^{\beta_1}$) is equivalent to the odds ratio. The additive genetic model is typically used for most genetic data analysis, although the choice of genetic model can be driven


by a variety of factors. For example, if one has a relatively small sample size and wishes to assess markers with relatively low allele frequencies, then it may be appropriate to choose a dominant genetic model to maximize the data. The standard linear regression framework can be used to test for association with a quantitative trait. The test of association can be summarized as follows:

$$y = \beta_0 + \beta_1 G + \sum_i \beta_i C_i, \tag{4}$$

where $y$ is a continuous quantitative trait and $G$ and $C_i$ are as described for Eq. 3. Here, $\beta_1$ is tested for significance and represents the rate of change in $y$ given $G$. For example, if an additive genetic model is assumed for $G$, then $\beta_1$ represents how much $y$ changes per copy of the reference allele. For both logistic and linear regression, it is critical that the underlying assumptions for both methods are met to reduce the risk of false-positive results. For example, false-positive results are possible if the univariate normality assumption underlying linear regression is violated.

If the sample includes related individuals, then the framework presented in Eqs. 3 and 4 is not valid, and methodology that takes into account the correlation among related individuals is necessary. If methods for uncorrelated observations are applied to correlated data, the variance in the data will be underestimated, which can lead to false-positive results. The appropriate method to apply to related individuals depends on the data to be analyzed. Specific methodology exists for certain family structures. For example, in the case of parent–offspring trios, the family of transmission disequilibrium tests (TDT) can be applied (82, 83). One advantage of the TDT family of tests is that it protects from population or cryptic stratification (discussed below). Samples based on sibships can take advantage of mixed models or generalized estimating equations (84–86). However, if complex family structures are involved, then methods that incorporate the kinship coefficient (74, 87, 88), such as variance components (89–93), can be used. The kinship coefficient uses the observed family structure and assumes Mendelian inheritance to estimate genetic sharing probabilities. The kinship coefficient can then be used to account for the level of relatedness to estimate the variance and provide more accurate statistics.

2.3. Cryptic Stratification

Population or cryptic stratification is a condition in which a marker allele is associated with an unobserved difference in the comparison groups. Population stratification was first brought to attention by Knowler and colleagues when examining association between type 2 diabetes and a specific haplotype thought to contribute to risk for type 2 diabetes (94). Their initial examination


suggested an association between the haplotype and risk for type 2 diabetes. However, when the analysis was restricted to full-heritage Pima Indians, there was no evidence for association. In their specific case, the frequency of the haplotype differed between individuals of northern European and Pima Indian ancestry. Therefore, their analysis was confounded by northern European admixture in their sample, which created the appearance of an association. There is a misconception that population stratification is restricted to issues of population substructure, but it can be the consequence of any unobserved difference, which make the term “cryptic stratification” a better descriptor of the problem. There are several approaches to handling cryptic stratification. Careful consideration of subject recruitment criteria can be helpful in minimizing the effect of cryptic stratification due to population substructure, but clearly cannot eliminate the effect given that self-identified ethnicity does not accurately or precisely capture population substructure. If the sample being analyzed is composed of parent–offspring trios or pedigree data, there are statistical approaches to avoid the effect of cryptic stratification. These methods examine transmission of alleles from one generation to another. The transmission disequilibrium test (TDT) was the first to introduce this approach (82) for disease association and was quickly followed by methodologies to be applied to quantitative traits (83). The general framework has now been extended to pedigree-based analyses (95–98). While essentially providing protection from cryptic stratification, the transmission-based methods tend to be less efficient and have lower power than other association methods due to the fact that only genetically informative transmissions contribute to the statistic. 
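To make concrete the point that only informative transmissions contribute, here is a minimal sketch of the basic TDT statistic for a single biallelic marker. It assumes the standard McNemar-type form, (b − c)²/(b + c), where b and c count how often heterozygous parents transmit or do not transmit the allele of interest to an affected child. The input layout and the function name are hypothetical, and the extensions to quantitative traits and pedigrees cited above are not covered.

```python
import math


def tdt(transmissions, test_allele):
    """Basic transmission disequilibrium test for one biallelic marker.

    transmissions: iterable of (parent_genotype, transmitted_allele) pairs,
        one per parent of an affected child, e.g. (("A", "G"), "A").
    test_allele: the allele whose over-transmission is being tested.
    Returns (b, c, chi_square, p_value) with 1 degree of freedom.
    """
    b = 0  # heterozygous parents that transmitted test_allele
    c = 0  # heterozygous parents that transmitted the other allele
    for genotype, transmitted in transmissions:
        if genotype[0] == genotype[1]:
            continue  # homozygous parents are uninformative and are skipped
        if transmitted == test_allele:
            b += 1
        else:
            c += 1
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi_square = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return b, c, chi_square, p_value


if __name__ == "__main__":
    data = (
        [(("A", "G"), "A")] * 60    # informative, test allele transmitted
        + [(("A", "G"), "G")] * 40  # informative, other allele transmitted
        + [(("A", "A"), "A")] * 25  # uninformative homozygous parents
    )
    print(tdt(data, "A"))
```

In the toy data, the 25 homozygous parents are ignored. This is why a trio sample of a given size typically carries less information than a case–control sample of comparable size.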
Another approach to correcting for cryptic stratification is to use genetic marker data to formally assess population substructure in the sample and then adjust for individual-level substructure information in the statistical analysis (99–104). The main issue is selecting appropriate genetic markers that will distinguish among the different potential subpopulations. Specific marker panels of ancestrally informative markers (AIM) have been developed to differentiate between African Americans and individuals of northern European ancestry and between Latino Americans and individuals of northern European ancestry (105–111). In the case of GWA marker data, there already exists sufficient marker information to perform substructure analysis without having to genotype a separate panel of AIMs. However, it is sometimes not sufficient to simply genotype a panel of AIMs and perform substructure analysis. Careful consideration should be made with respect to what populations are being studied and what method is being used to assess


substructure. For example, the model-based clustering method of Pritchard and colleagues (101) is implemented in the widely used STRUCTURE package. In STRUCTURE, it is useful to include genotype data from reference populations to assist in the identification of the various subpopulations within one's data. For this purpose, many studies have relied upon the HapMap reference populations, whose data are readily available for download from the HapMap web site (http://www.hapmap.org). Yet, depending upon the population being studied, the HapMap populations may not be sufficient to appropriately identify different subpopulations. For example, if one is studying a sample of Mexican Americans, Amerindian admixture may be a significant subpopulation in this sample and would not be captured by using the HapMap reference samples.

Finally, for GWA studies, the genomic control approach has been widely used to adjust statistics for inflation in type 1 error rates due to cryptic stratification (112, 113). In this approach, neutral loci from genome-wide marker data are used to quantify the inflation as represented by the parameter $\lambda$. If there is no inflation due to cryptic stratification, $\lambda$ should take a value of 1. However, in the presence of stratification, $\lambda$ will take a value >1 and $\lambda$ can then be used to adjust statistics to achieve the correct type 1 error rate. The genomic control approach has been used with great success in GWA studies. It is of significant note that many of the GWA studies for type 2 diabetes have been in populations of northern European ancestry, which in general is considered to be relatively homogeneous. However, many of these studies were required to adjust their statistics due to inflation in the type 1 error rate as assessed by genomic control values >1 (12–29, 46–48). This demonstrates the need to evaluate one's data for potential cryptic stratification.

2.4. Multiple Testing

One of the most important, yet most misunderstood, statistical issues in genetic association analysis is that of multiple testing. The basic premise of multiple testing is that if the same hypothesis is assessed multiple times, a number of positive associations will be observed due to random chance alone. A good example is screening a candidate gene for association with disease. In this case, it is typical to genotype several SNPs across the coding region of the gene and then test each SNP for association with the disease of interest. In this case, the hypothesis “my gene of interest is associated with my disease of interest” is being tested multiple times. Therefore, correction for multiple testing is required. It is possible to estimate the number of false-positive results expected under the null hypothesis. This estimate is derived by multiplying the $\alpha$-level used to assess statistical significance by the number of tests performed.
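The arithmetic in the preceding paragraph can be sketched as follows. The first function is the expected count described in the text. The second, which gives the chance of at least one false positive, additionally assumes that the tests are independent, which is rarely true for SNPs in the same gene. The third shows the per-test threshold obtained by dividing the significance level by the number of tests. All function names are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive (assumption: independent tests)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def per_test_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha, n_snps = 0.05, 20  # e.g. 20 SNPs typed across one candidate gene
    print(expected_false_positives(alpha, n_snps))
    print(familywise_error_rate(alpha, n_snps))
    threshold = per_test_threshold(alpha, n_snps)
    print(threshold, familywise_error_rate(threshold, n_snps))
```

With 20 tests at a nominal level of 0.05, one false positive is expected on average, and under independence the chance of at least one is roughly 64%.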


There are a variety of approaches to correct for multiple testing (114–117), but the most commonly known and easiest to apply is the Bonferroni method (118). The Bonferroni approach essentially attempts to maintain the family-wise error rate at the desired level by adjusting for the number of tests performed. So, if $n$ statistical tests were performed, the Bonferroni corrected level of significance would be $\alpha/n$ or alternatively one could multiply the observed p-value by $n$ to achieve the same correction. Although the Bonferroni approach is simple to apply and easy to understand, care must be exercised when applying the approach to genetic association studies. For example, the Bonferroni correction holds under any dependence structure but is close to exact only when the statistical tests being performed are independent. However, in genetic analysis there is typically linkage disequilibrium (LD) among SNPs in a gene region, which results in some level of dependence among SNPs being tested for association. The selection of tag SNPs has been used to avoid this level of dependence (119–121), but most of the algorithms implemented in software such as TAGGER (121) use LD criteria such as $r^2 > 0.8$ to eliminate SNPs with redundant information, which means that even tag SNPs have some degree of LD among them. Therefore, when a Bonferroni correction is applied there is an overcorrection for multiple testing, resulting in overly conservative conclusions.

The testing of genetic variants for association with quantitative traits introduces an additional factor that must be taken into account. It is common to test SNPs for association with related quantitative traits. A classic example is lipid levels which are usually examined as total cholesterol and other subfractions of cholesterol (HDL, LDL, and VLDL). In this case, in addition to possible correlation among SNPs, there is correlation among the traits being tested. Therefore, two levels of correlation must be appropriately accounted for, if accurate multiple test correction is to be performed. Of course, one could still apply the Bonferroni correction; however, the results will again be overly conservative. Recently, Conneely and Boehnke introduced an approach that attempts to account for correlation among SNPs and quantitative traits in genetic association studies (117). This approach has been implemented in the p_ACT program, written in R (http://csg.sph.umich.edu/boehnke/p_act.php).

The issue of multiple test correction in GWA studies emanates from similar concerns, as discussed above for single gene association studies, with the exception that the magnitude of the correction is now magnified by the large number of SNPs typically included in many of the GWA chipsets. Studies by the HapMap group examining the so-called ENCODE regions of the genome (122), which are specific regions targeted for detailed sequencing and analysis, suggested that a genome-wide p-value of $10^{-8}$ is necessary to overcome the multiple testing issue for a GWA


study (10). However, this benchmark is only based on testing of SNPs for association with a single trait (disease status or quantitative trait) and does not take into consideration the issues of multiple traits as discussed above. An additional word of caution is that the genome-wide significance level of $10^{-8}$ should not be construed as a benchmark for all genetic studies. While the smaller the p-value the greater the potential strength of the association, any p-value needs to be evaluated within the context of the study design, the population being examined, and the statistical analysis. Thus, larger p-values should not be summarily dismissed simply because they do not meet the current benchmark for genome-wide significance.

2.5. Replication

Given that many gene effects associated with common diseases and quantitative traits have relatively small effect, the concept of replication plays an important role in confirming any reported association. In fact, many GWA studies have now implemented a general study design that first examines evidence for association within individual GWA studies, followed by combining GWA results via meta-analysis, and ending with replication in samples independent of the original GWA samples. Such study designs build confidence that a reported association is a true result and not a type 1 error. For gene association studies with smaller sample sizes, replication becomes a critical component to support original findings. Replication of findings entails some attention to specific details. For example, are the observed allele frequencies similar among the replication studies? Assuming replication is performed in samples derived from similar populations, then allele frequency estimates should be similar among studies. Is the same allele for the given marker associated with the outcome? Differences in alleles could suggest differences in genotyping among the replication studies, e.g., PCR primers designed for different strands of the chromosome. Is the effect of the associated allele in the same direction? Flips in direction can suggest a variety of possible differences among studies, including differences in coding of the genetic model to issues related to transformation of the data in the case of quantitative traits. It is critical to pay close attention to such details and ensure consistency across replication studies. It is also important to note that because replication is typically assessed in samples derived from the same or similar populations, observations from replication studies cannot be immediately extended to other populations. 
For example, the susceptibility loci identified by GWA for type 2 diabetes have exclusively been derived from populations of northern European ancestry. Therefore, it is equivocal whether the association also extends to other populations. Taking TCF7L2 as a specific example, the association between variants in this gene and type 2 diabetes susceptibility has been replicated in almost every population examined, except for the Pima Indians (123) and some Chinese populations (124). Therefore, one cannot readily assume that because association between a genetic variant and disease or quantitative trait is replicated in one population, the same association should exist in other populations. In order for any genetic association to be generalized beyond the specific population in which the original observation and initial replication was made, it must be tested in other populations.
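The replication checks listed earlier in this section (similar allele frequencies, the same associated allele, and the same direction of effect) can be sketched as a small consistency routine. The record layout, the field names, and the 0.10 frequency tolerance are hypothetical choices for illustration. Strand-ambiguous A/T and C/G SNPs cannot be resolved by this simple approach and would need separate handling.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize(reference, other):
    """Re-express `other` on the effect allele used by `reference`.

    Each study is a dict with keys effect_allele, other_allele,
    effect_allele_freq, and beta. Returns a new dict, or None when the
    allele pairs cannot be matched even after a strand flip.
    """
    ref_pair = (reference["effect_allele"], reference["other_allele"])
    pair = (other["effect_allele"], other["other_allele"])
    if set(pair) != set(ref_pair):
        # Alleles may have been reported on the complementary strand.
        pair = (COMPLEMENT[pair[0]], COMPLEMENT[pair[1]])
        if set(pair) != set(ref_pair):
            return None
    aligned = dict(other, effect_allele=pair[0], other_allele=pair[1])
    if pair[0] != ref_pair[0]:
        # Same SNP but opposite effect allele: swap alleles and flip the sign.
        aligned.update(
            effect_allele=ref_pair[0],
            other_allele=ref_pair[1],
            effect_allele_freq=1.0 - other["effect_allele_freq"],
            beta=-other["beta"],
        )
    return aligned


def replication_checks(reference, other, max_freq_diff=0.10):
    """Return the three consistency checks as a dict of booleans."""
    aligned = harmonize(reference, other)
    if aligned is None:
        return {"alleles_match": False, "freq_similar": False, "same_direction": False}
    freq_diff = abs(aligned["effect_allele_freq"] - reference["effect_allele_freq"])
    return {
        "alleles_match": True,
        "freq_similar": freq_diff <= max_freq_diff,
        "same_direction": aligned["beta"] * reference["beta"] > 0,
    }


if __name__ == "__main__":
    discovery = {"effect_allele": "T", "other_allele": "C",
                 "effect_allele_freq": 0.30, "beta": 0.35}
    # Hypothetical replication study reporting the other allele on the opposite strand.
    replication = {"effect_allele": "G", "other_allele": "A",
                   "effect_allele_freq": 0.68, "beta": -0.30}
    print(replication_checks(discovery, replication))
```

A failed check does not by itself mean non-replication. It flags a study pair whose genotyping, coding, or trait transformation should be examined before the results are compared.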

Acknowledgments

The author would like to thank Drs. Richard N. Bergman and Michael Boehnke for mentoring and support. R.M.W. is supported by grants from the NIH (R01-DK-069922) and Merck & Co.

References

1. Kerem B-S, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A et al (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245:1073–1080
2. Riordan JR, Rommens JM, Kerem B-S, Alon N, Rozmahel R, Grzelczak Z et al (1989) Identification of the cystic fibrosis gene: cloning and characterization of the complementary DNA. Science 245:1066–1073
3. Morton NE (1956) The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8:80–96
4. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19
5. Risch N (1990) Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46:222–228
6. Risch N (1990) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46:229–241
7. Risch N (1990) Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am J Hum Genet 46:242–253
8. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
9. International Human Genome Mapping Consortium (2001) A physical map of the human genome. Nature 409:934–941
10. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320
11. Oliphant A, Barker DL, Stuelpnagel JR, Chee MS (2002) BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques Suppl:56–58
12. Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University and Novartis Institute for Biomedical Research (2007) Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316:1331–1336
13. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL et al (2007) A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316:1341–1345
14. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H et al (2007) Replication of genome-wide association signals in U.K. samples reveals risk loci for type 2 diabetes. Science 316:1336–1341
15. Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D et al (2007) A genome-wide association study identified novel risk loci for type 2 diabetes. Nature 445:881–885
16. Steinthorsdottir V, Thorleifsson G, Reynisdottir I, Benediktsson R, Jonsdottir T, Walters GB et al (2007) A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet 39:770–775


17. Zeggini E, Scott LJ, Saxena R, Voight BF, Diabetes Genetics Replication and Meta-analysis (DIAGRAM) Consortium (2008) Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet 40:638–645
18. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM et al (2007) A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316:889–894
19. Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, Bonnycastle LL et al (2008) Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet 40:198–203
20. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R et al (2008) Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40:161–169
21. Chen W-M, Erdos MR, Jackson AU, Saxena R, Sanna S, Silver KD et al (2008) Variations in the G6PC2/ABCB11 genomic region are associated with fasting glucose levels. J Clin Invest 118:2609–2628
22. Bouatia-Naji N, Rocheleau G, Van Lommel L, Lemaire K, Schuit F, Cavalcanti-Proença C et al (2008) A polymorphism within the G6PC2 gene is associated with fasting plasma glucose levels. Science 320:1085–1088
23. Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I et al (2008) Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet 40:768–775
24. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M et al (2008) Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet 40:575–583
25. Orho-Melander M, Melander O, Guiducci C, Perez-Martinez P, Corella D, Roos C et al (2008) Common missense variant in the glucokinase regulatory protein gene is associated with increased plasma triglyceride and C-reactive protein but lower fasting glucose concentrations. Diabetes 57:3112–3121
26. Vaxillaire M, Cavalcanti-Proenca C, Dechaume A, Tichet J, Marre M, Balkau B et al (2008) The common P446L polymorphism in GCKR inversely modulates fasting glucose and triglyceride levels and reduces type 2 diabetes risk in the DESIR prospective general French population. Diabetes 57:2253–2257
27. Bouatia-Naji N, Bonnefond A, Cavalcanti-Proenca C, Sparso T, Holmkvist J, Marchand M et al (2009) A variant near MTNR1B is associated with increased fasting plasma glucose levels and type 2 diabetes risk. Nat Genet 41:89–94
28. Prokopenko I, Langenberg C, Florez JC, Saxena R, Soranzo N, Thorleifsson G et al (2009) Variants in MTNR1B influence fasting glucose levels. Nat Genet 41:77–81
29. Kathiresan S, Willer CJ, Peloso GM, Demissie S, Musunuru K, Schadt EE et al (2009) Common variants at 30 loci contribute to polygenic dyslipidemia. Nat Genet 41:56–65
30. Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB (2002) Two-stage designs for gene-disease association studies. Biometrics 58:163–170
31. Satagopan JM, Elston RC (2003) Optimal two-stage genotyping in population-based association studies. Genet Epidemiol 25:149–157
32. Skol AD, Scott LJ, Abecasis GR, Boehnke M (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38:209–213
33. Gauderman WJ (2002) Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med 21:35–50
34. Gauderman WJ (2002) Sample size calculations for association studies of gene-gene interaction. Am J Epidemiol 155:478–484
35. Deeb SS, Fajas L, Nemoto M, Pihlajamäki J, Laakso M, Fujimoto W et al (1998) A Pro12Ala substitution in PPARγ2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity. Nat Genet 20:284–287
36. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl M-C, Nemesh J et al (2000) The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26:76–80
37. Nolan JJ, Ludvik B, Beerdsen P, Joyce M, Olefsky J (1994) Improvement in glucose tolerance and insulin resistance in obese subjects treated with troglitazone. N Engl J Med 331:1188–1193
38. Antonucci T, Whitcomb R, McLain R, Lockwood D (1997) Impaired glucose tolerance is normalized by treatment with the thiazolidinedione troglitazone. Diabetes Care 20:188–193

39. Spiegelman BM (1998) PPAR-γ: adipogenic regulator and thiazolidinedione receptor. Diabetes 47:507–514
40. Aronoff S, Rosenblatt S, Braithwaite S, Egan JW, Mathisen AL, Schneider RL et al (2000) Pioglitazone hydrochloride monotherapy improves glycemic control in the treatment of patients with type 2 diabetes. Diabetes Care 23:1605–1611
41. Baba S (2001) Pioglitazone: a review of Japanese clinical studies. Curr Med Res Opin 17:166–189
42. Blüher M, Lübben G, Paschke R (2003) Analysis of the relationship between the Pro12Ala variant in the PPAR-γ2 gene and the response rate to therapy with pioglitazone in patients with type 2 diabetes. Diabetes Care 26:825–831
43. Buchanan TA, Xiang AH, Peters RK, Kjos SL, Marroquin A, Goico J et al (2002) Preservation of pancreatic β-cell function and prevention of type 2 diabetes by pharmacological treatment of insulin resistance in high-risk Hispanic women. Diabetes 51:2796–2803
44. Camp HS, Li O, Wise SC, Hong YH, Frankowski CL, Shen X et al (2000) Differential activation of peroxisome proliferator-activated receptor-γ by troglitazone and rosiglitazone. Diabetes 49:539–547
45. Wolford JK, Yeatts KA, Dhanjal SK, Black MH, Xiang AH, Buchanan TA et al (2005) Sequence variation in PPARG may underlie differential response to troglitazone. Diabetes 54:3319–3325
46. Pare G, Chasman DI, Parker AN, Nathan DM, Miletich JP, Zee RY et al (2008) Novel association of HK1 with glycated hemoglobin in a non-diabetic population: a genome-wide evaluation of 14,618 participants in the Women’s Genome Health Study. PLoS Genet 4:e1000312
47. Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, Heid IM et al (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41:25–34
48. Thorleifsson G, Walters GB, Gudbjartsson DF, Steinthorsdottir V, Sulem P, Helgadottir A et al (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41:18–24
49. Bergman RN, Phillips LS, Cobelli C (1981) Physiologic evaluation of factors controlling glucose tolerance in man. Measurement of insulin sensitivity and β-cell glucose sensitivity from the response to intravenous glucose. J Clin Invest 68:1456–1467
50. Buchanan TA (2001) Pancreatic B-cell defects in gestational diabetes: implications for the pathogenesis and prevention of type 2 diabetes. J Clin Endocrinol Metab 86:989–993
51. Lyssenko V, Nagorny CL, Erdos MR, Wierup N, Jonsson A, Spegel P et al (2009) Common variant in MTNR1B associated with increased risk of type 2 diabetes and impaired early insulin secretion. Nat Genet 41:82–88
52. Weyer C, Bogardus C, Mott DM, Pratley RE (1999) The natural history of insulin secretory dysfunction and insulin resistance in the pathogenesis of type 2 diabetes mellitus. J Clin Invest 104:787–794
53. National Diabetes Data Group (1979) Classification and diagnosis of diabetes mellitus and other categories of glucose intolerance. Diabetes 28:1039–1057
54. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus (1997) Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 20:1183–1197
55. World Health Organization (1985) Diabetes mellitus: report of a WHO Study Group. Technical Report Series 727 ed. World Health Organization, Geneva
56. World Health Organization (1999) Definition, diagnosis and classification of diabetes mellitus and its complications. Report of a WHO Consultation. Technical Report Series 646 ed. World Health Organization, Geneva
57. Joint National Committee 7 (2004) The seventh report of the Joint National Committee on prevention, detection, evaluation, and treatment of high blood pressure – complete report. 04-5230 ed. National Heart, Lung, and Blood Institutes, Bethesda, MD, pp 1–86
58. Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J et al (2006) Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet 38:320–323
59. Damcott CM, Pollin TI, Reinhart LJ, Ott SH, Shen H, Silver KD et al (2006) Polymorphisms in the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in the Amish. Diabetes 55:2654–2659
60. Scott LJ, Bonnycastle LL, Willer CJ, Sprau AG, Jackson AU, Narisu N et al (2006) Association of transcription factor 7-like 2 (TCF7L2) variants with type 2 diabetes in a Finnish sample. Diabetes 55:2649–2653
61. Groves CJ, Zeggini E, Minton J, Frayling TM, Weedon MN, Rayner NW et al (2006) Association analysis of 6,736 U.K. subjects provides replication and confirms TCF7L2 as a type 2 diabetes susceptibility gene with a substantial effect on individual risk. Diabetes 55:2640–2644
62. Zhang C, Qi L, Hunter DJ, Meigs JB, Manson JE, van Dam RM et al (2006) Variant of transcription factor 7-like 2 (TCF7L2) gene and the risk of type 2 diabetes in large cohorts of U.S. women and men. Diabetes 55:2645–2648
63. Saxena R, Gianniny L, Burtt NP, Lyssenko V, Giuducci C, Sjögren M et al (2006) Common single nucleotide polymorphisms in TCF7L2 are reproducibly associated with type 2 diabetes and reduced insulin response to glucose in nondiabetic individuals. Diabetes 55:2890–2895
64. Cauchi S, Meyre D, Dina C, Choquet H, Samson C, Gallina S et al (2006) Transcription factor TCF7L2 genetic study in the French Population. Diabetes 55:2903–2908
65. Watanabe RM, Allayee H, Xiang AH, Trigo E, Hartiala J, Lawrence JM et al (2007) Transcription factor 7-like 2 (TCF7L2) is associated with gestational diabetes mellitus and interacts with adiposity to alter insulin secretion in Mexican Americans. Diabetes 56:1481–1485
66. Shaat N, Lernmark A, Karlsson E, Ivarsson S, Parikh H, Berntorp K et al (2007) A variant in the transcription factor 7-like 2 (TCF7L2) gene is associated with an increased risk of gestational diabetes mellitus. Diabetologia 50:972–979
67. Lauenborg J, Grarup N, Damm P, Borch-Johnsen K, Jørgensen T, Pedersen O et al (2009) Common type 2 diabetes risk gene variants associated with gestational diabetes. J Clin Endocrinol Metab 94:145–150
68. Cannings C, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12:208–212
69. Dawson DV, Elston RC (1984) A bivariate problem in human genetics: ascertainment of families through a correlated trait. Am J Med Genet 18:435–448
70. Vieland VJ, Hodge SE (1995) Inherent intractability of the ascertainment problem for pedigree data: a general likelihood framework. Am J Hum Genet 56:33–43
71. Vieland VJ, Hodge SE (1996) The problem of ascertainment for linkage analysis. Am J Hum Genet 58:1072–1084

72. Hodge SE, Vieland VJ (1996) The essence of single ascertainment. Genetics 144:1215–1223
73. de Andrade M, Amos CI (2000) Ascertainment issues in variance components models. Genet Epidemiol 19:333–344
74. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb 52:399–433
75. Xu J, Turner A, Little J, Bleecker ER, Meyers DA (2002) Positive results in association studies are associated with departure from Hardy-Weinberg equilibrium: hint for genotype error? Hum Genet 111:573–574
76. O’Connell JR, Weeks DE (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 63:259–266
77. Lange K, Weeks D, Boehnke M (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet Epidemiol 5:471–472
78. McPeek MS, Sun L (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet 66:1076–1094
79. Wigginton JE, Abecasis GR (2005) PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics 21:3445–3447
80. Boehnke M, Cox NJ (1997) Accurate inference of relationships in sib-pair linkage studies. Am J Hum Genet 61:423–429
81. Epstein MP, Duren WL, Boehnke M (2000) Improved inference of relationship for pairs of individuals. Am J Hum Genet 67:1219–1231
82. Spielman RS, Ewens WJ (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am J Hum Genet 59:983–989
83. Allison DB (1997) Transmission-disequilibrium tests for quantitative traits. Am J Hum Genet 60:676–690
84. Zeger SL, Liang K-Y (1988) Models for longitudinal data: a generalized estimating equation approach. Biometrics 44:1049–1060
85. Grove J, Zhao LP, Quiaoit F (1993) Correlation analysis of twin data with repeated measures based on generalized estimating equations. Genet Epidemiol 10:539–544
86. Bull SB, Chapman NH, Greenwood CM, Darlington GA (1995) Evaluation of genetic and environmental effects using GEE and APM methods. Genet Epidemiol 12:729–734

87. Hopper JL, Mathews JD (1982) Extensions to multivariate normal models for pedigree analysis. Ann Hum Genet 46:373–383
88. Chen W-M, Abecasis GR (2007) Family based association tests for genome wide association scans. Am J Hum Genet 81:913–926
89. Lange K, Westlake J, Spence MA (1976) Extensions to pedigree analysis. III. Variance components by the scoring method. Ann Hum Genet 39:485–491
90. Lange K, Boehnke M (1983) Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am J Med Genet 14:513–524
91. Amos CI (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet 54:535–543
92. Blangero J, Almasy L (1997) Multipoint oligogenic linkage analysis of quantitative traits. Genet Epidemiol 14:959–964
93. Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet 62:1198–1211
94. Knowler WC, Williams WC, Pettitt DJ, Steinberg AG (1988) Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 43:520–526
95. Abecasis GR, Cookson WO, Cardon LR (2000) Pedigree tests of transmission disequilibrium. Eur J Hum Genet 8:545–551
96. Martin ES, Monks SA, Warren LL, Kaplan NL (2000) A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 67:146–154
97. Lange C, Laird NM (2002) Power calculations for a general class of family-based association tests: dichotomous traits. Am J Hum Genet 71:575–584
98. Lange C, DeMeo DL, Laird NM (2002) Power and design considerations for a general class of family-based association tests: quantitative traits. Am J Hum Genet 71:1330–1341
99. Rabinowitz D, Laird N (2000) A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered 50:211–223
100. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909
101. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
102. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P (2000) Association mapping in structured populations. Am J Hum Genet 67:170–181
103. Pritchard JK, Donnelly P (2001) Case-control studies of association in structured or admixed populations. Theor Popul Biol 60:226–237
104. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA et al (2002) Genetic structure of human populations. Science 298:2381–2385
105. Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R et al (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 60:957–964
106. Parra EJ, Marcini A, Akey J et al (1998) Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet 63:1839–1851
107. Parra EJ, Kittles RA, Argyropoulos G, Pfaff CL, Hiester K, Bonilla C et al (2001) Ancestral proportions and admixture dynamics in geographically defined African Americans living in South Carolina. Am J Phys Anthropol 114:18–29
108. Pfaff CL, Parra EJ, Bonilla C, Hiester K, McKeigue PM, Kamboh MI et al (2001) Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. Am J Hum Genet 68:198–207
109. Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska A et al (2004) A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74:1001–1013
110. Allard MW, Polanskey D, Wilson MR, Monson KL, Budowle B (2006) Evaluation of variation in control region sequences for Hispanic individuals in the SWGDAM mtDNA data set. J Forensic Sci 51:566–573
111. Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, McDonald GJ et al (2007) A genomewide admixture map for Latino populations. Am J Hum Genet 80:1024–1036
112. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55:997–1004
113. Bacanu SA, Devlin B, Roeder K (2000) The power of genomic control. Am J Hum Genet 66:1933–1944


114. Hochberg Y, Benjamini Y (1990) More powerful procedures for multiple significance testing. Stat Med 9:811–818
115. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
116. Sabatti C, Service S, Freimer N (2003) False discovery rate in linkage and association genome screens for complex disorders. Genetics 164:829–833
117. Conneely KN, Boehnke M (2007) So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am J Hum Genet 81:1158–1168
118. Bonferroni CE (1936) Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni Del R Istituto Superiore Di Scienze Economiche e Commerciali Di Firenze 8:3–62
119. Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G et al (2001) Haplotype tagging for the identification of common disease genes. Nat Genet 29:233–237
120. Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE et al (2003) Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered 55:27–36
121. de Bakker PIW, Yelensky R, Pe’er D, Gabriel SB, Daly MJ, Altshuler D (2005) Efficiency and power in genetic association studies. Nat Genet 37:1217–1223
122. The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306:636–640
123. Guo T, Hanson RL, Traurig M, Muller YL, Ma L, Mack J et al (2007) TCF7L2 is not a major susceptibility gene for type 2 diabetes in Pima Indians: analysis of 3,501 individuals. Diabetes 56:3082–3088
124. Ren Q, Han XY, Wang F, Zhang XY, Han LC, Luo YY et al (2008) Exon sequencing and association analysis of polymorphisms in TCF7L2 with type 2 diabetes in a Chinese population. Diabetologia 51:1146–1152

Chapter 3

Identification of Causal Sequence Variants of Disease in the Next Generation Sequencing Era

Christopher B. Kingsley

Abstract

Over the last decade, genetic studies have identified numerous associations between single nucleotide polymorphism (SNP) alleles in the human genome and important human diseases. Unfortunately, extending these initial associative findings to identification of the true causal variants that underlie disease susceptibility is usually not a straightforward task. Causal variant identification typically involves searching through sizable regions of genomic DNA in the vicinity of disease-associated SNPs for sequence variants in functional elements including protein coding, regulatory, and structural sequences. Prioritization of these searches is greatly aided by knowledge of the location of functional sequences in the human genome. This chapter briefly reviews several of the common approaches used to functionally annotate the human genome and discusses how this information can be used in concert with the emerging technology of next generation high-throughput sequencing to identify causal variants of human disease.

Key words: Genetics, Human disease, Causal variant, High-throughput sequencing

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_3, © Springer Science+Business Media, LLC 2011

1. Introduction

Recent technological advances in high-throughput genotyping have revolutionized the field of human genetics. In particular, cost-effective, microarray-based genotyping of up to one million SNPs per hybridization experiment has enabled population-based, genome-wide association (GWA) studies for common human diseases such as diabetes, schizophrenia, and coronary heart disease. GWA studies (followed by requisite validation studies) have identified statistically significant associations between complex human diseases and hundreds of common SNP alleles in human populations (1). As discussed in subsequent sections, finding disease-associated SNPs is only the first step in the identification of the causal variants that directly contribute to disease risk. Therefore, one of the major challenges currently facing the field of human genetics is how to move from the SNPs identified by GWA approaches to the delineation of causal variants in nearby genomic regions.

For many years, identification of causal variants has involved querying public databases for the presence of characterized functional elements in the vicinity of associated SNPs. These functional elements would then be prioritized for targeted resequencing of relatively small genomic regions in affected individuals, with the aim of identifying novel variants that could be causal for disease. Of course, the success of such an approach depends strongly on the means and accuracy with which functional elements can be identified. Further, traditional, low-throughput sequencing technologies place severe limitations on the number of individuals and the number of genomic regions that can be resequenced in an economically feasible way.

Recently, several “next generation” (i.e., next-gen) technologies have emerged, promising inexpensive and efficient sequencing of large amounts of genomic DNA as a follow-up to GWA studies (2). While this advance will enable researchers to find many more novel sequence variants than previously possible, it seems likely that the bottleneck in causal variant identification will move from sequencing to functional validation, a process for which databases of functional genomic elements will also prove useful in prioritizing specific variants for testing. In this review, several methods for the functional annotation of genomic sequences are introduced and discussed briefly. Other sections cover the benefits and challenges of next-gen sequencing for the discovery of causal variants of human disease.

2. Methods

2.1. Genome Annotation

Acquisition of DNA sequence, through the efforts of the Human Genome Project and the International HapMap Project, represented the first step toward the goal of understanding genome function at the molecular level. The next step is genome annotation, or the systematic identification and characterization of the functional units of the human genome. Genome annotation is an ongoing, multidisciplinary process that includes many experimental and computational components designed to identify different types of functional sequences. Several such approaches are described in this section.


2.1.1. Gene Identification

Identification of genes in the human genome has traditionally been performed using a combination of experimental and computational methods. Experimentally, cDNA sequencing projects have generated large databases of EST (expressed sequence tag) libraries that have been used to identify potential protein-coding sequences in the human genome. Computational approaches to gene finding in the human genome are much more difficult than in bacterial genomes, for example, due to the low fraction of coding sequences and the presence of introns within genes. Therefore, computational methods typically incorporate many sources of information in addition to the presence of open reading frames, including amino acid coding bias, splice site structure, and known distributions of intron and exon lengths. Both experimental and computational methods were used to provide an estimate of 30,000–40,000 genes in the initial human genome publications (3, 4). Since that time, as many of the initial predicted genes were determined to be nonfunctional pseudogenes, the estimate of the number of protein-coding genes has decreased to approximately 20,000 (5).
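As a rough illustration of why open reading frames alone are not sufficient, the following minimal Python sketch scans the three forward reading frames of a DNA string for stretches running from an ATG to an in-frame stop codon. The function name `find_orfs` and the `min_codons` threshold are illustrative assumptions and do not correspond to any published gene finder; real gene-prediction programs also model coding bias, splice sites, and intron and exon length distributions, as noted above, and must consider the reverse strand as well.

```python
# Naive open-reading-frame scan; an illustration only, not a gene finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each forward-strand ORF.

    An ORF here runs from the first in-frame ATG to the next in-frame
    stop codon. `end` is exclusive and includes the stop codon.
    `min_codons` counts the ATG but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs

# Hypothetical toy sequence with a single short ORF.
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

In intron-rich human genes, a scan of this kind would typically recover only fragments of exons, which is one reason the additional sources of information listed above are needed.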

2.1.2. Phylogenetic Conservation

The availability of whole-genome assemblies from many different organisms has allowed the identification of numerous sequences that are presumed to be functional on the basis of sequence conservation. This inference rests on the assumption that functional regions are less tolerant of mutations and are therefore more likely to be conserved over evolutionary time. Comparison of the human genome sequence with the genome sequences of species at long, intermediate, and close evolutionary distances (e.g., fruit fly, mouse, and chimpanzee, respectively) provides complementary information on the functional units of the genome (6–9). Distant evolutionary comparisons can define the core set of genes required for the development and function of eukaryotes, intermediate evolutionary comparisons can identify both coding and noncoding sequences that are likely to be functional, and close evolutionary comparisons can identify the small percentage of divergent sequence that is responsible for specifically human traits. Human-to-mouse comparisons have been particularly revealing, in that the amount of conserved noncoding DNA sequence exceeds the amount of conserved coding sequence (7). This finding indicates the existence of a large amount of conserved regulatory and/or structural sequence that may affect gene expression (10). In addition, several “ultra conserved” noncoding elements that are highly similar (>95% over 200 bp or more) across a wide range of vertebrate species have been identified, indicating strong selective pressure over hundreds of millions of years (11). The reason for such a high degree of conservation of these elements over so many branches of the vertebrate phylogenetic tree remains an active topic of research.
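The similarity criterion quoted for these elements (>95% over 200 bp or more) can be illustrated with a simple sliding-window calculation. The sketch below assumes that a pairwise alignment of the two sequences has already been produced by some other tool and is supplied as two equal-length strings with '-' marking gaps. The function name `conserved_windows` and the decision to count gap columns as mismatches are assumptions made for this example only.

```python
def conserved_windows(aln_a, aln_b, window=200, min_identity=0.95):
    """Return start columns of alignment windows exceeding an identity threshold.

    aln_a and aln_b are two rows of a precomputed pairwise alignment
    (equal length, '-' for gaps). Gap columns are counted as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two rows carry the same base, 0 otherwise.
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    if len(match) < window:
        return hits
    count = sum(match[:window])  # matches in the first window
    for start in range(len(match) - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            count += match[start + window - 1] - match[start - 1]
        if count / window > min_identity:
            hits.append(start)
    return hits

# Toy example with a short window so the behaviour is easy to follow.
print(conserved_windows("ACGTACGTAC", "ACGTACGTTT", window=4, min_identity=0.75))
```

Overlapping hits would normally be merged into maximal conserved intervals before being compared against annotated coding sequence; that step is omitted here for brevity.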


2.1.3. Tiling Microarrays

Advances in microarray fabrication techniques have enabled the manufacture of whole genome tiling arrays, whose probes uniformly cover the entire human genome. Tiling arrays can be hybridized with fluorescently labeled cDNA for an unbiased identification of transcribed regions across the entire human genome. This technique has shown that a much larger fraction of the human genome is transcribed than had been previously known, with most of the transcription producing noncoding RNA including large numbers of antisense transcripts (12). In addition, chromatin immunoprecipitation with transcription factor-specific antibodies, followed by hybridization to tiling microarrays, has been used as an unbiased means of identifying binding sites for individual transcription factors (and potential regulatory regions) in the human genome.

2.2. High-Throughput Sequencing

DNA sequencing methods developed three decades ago led to a revolution in the biological sciences. Indeed, the whole genome sequences that began to appear in the 1990s were only possible because of the development of commercial capillary sequencers based upon the Sanger sequencing method (13, 14). These paradigm-shifting advances led to the emergence of entire research disciplines devoted to the analysis of the large amount of nucleotide sequence now available to researchers. Despite this considerable success, however, considerations of time and cost have remained a major roadblock to large-scale sequencing efforts by smaller research groups. Recently, a new set of technologies has emerged that promises to reduce both the cost of sequencing entire mammalian genomes (from many millions to only a few thousand dollars) and the time needed to complete the task (from months to days). These next-gen technologies have been, and continue to be, developed by a number of commercial entities, and at the time of this writing, include the Illumina Genome Analyzer, the ABI SOLiD, the Helicos HeliScope, and the Roche 454 FLX sequencers. These technologies use advances in chemistry, fluorescent imaging, and automation to facilitate massively parallel sequencing reactions in which many millions of individual sequences are determined simultaneously (2). While next generation sequencing is still an emerging field, it is clearly in the ascendancy, and is rapidly advancing to the point where almost any review article on the subject will be outdated by the time of publication.

While many advances have been made, many challenges associated with the analysis of high-throughput sequencing have also emerged. With the exception of 454 technology, most high-throughput sequencing methods yield relatively short sequencing reads in the 30–40 nucleotide range, making alignment or de novo sequence assembly particularly difficult. In addition, the sequencing error rates associated with the next-gen methods are currently higher than those achieved by capillary sequencing. It is expected that further refinements of the sequencing technology leading to lower error rates and longer read lengths, combined with improved computational methods, will be effective in overcoming these difficulties to the point where a small research group will be able to sequence large genomic regions, or even entire genomes, in a matter of days for only a few thousand dollars.

2.3. Limitations of Genome-Wide Association Studies

Over the past 5 years, there has been a sizeable increase in the number of published genetic associations between clinical phenotypes and sequence variants in the human genome, which has largely been due to the introduction of high-throughput SNP genotyping microarrays from manufacturers such as Affymetrix and Illumina. Although GWA studies have emerged as powerful tools for elucidating the genetic basis of disease, several limitations of this approach can hamper its use unless they are carefully considered (1). One such limitation is the difficulty in moving from disease-associated SNPs identified by GWA studies to causal variants. The SNP alleles identified by GWA studies are rarely the true causal variants that directly lead to an increased risk of disease. Rather, the associated SNP alleles are identified because they are in close enough genomic proximity to be correlated with alleles of the causal variant (i.e., they are in linkage disequilibrium with it) and are therefore associated with the disease only indirectly. At the time of writing, GWA studies have implicated many genomic regions in various clinical disorders and traits, but no definitive causal variants have yet been identified. Identification of potential causal variants can be straightforward if the associated SNP lies near a protein-coding gene and the nearby causal variant has an obvious functional consequence, such as a nonsynonymous mutation within the coding region. For example, resequencing of candidate genes that cause Mendelian forms of certain diseases has identified several potential coding variants in population samples (15). Many GWA studies, however, have identified associated SNPs that lie far from any known protein-coding gene and are presumed to be associated with functional variants in long-range transcriptional regulatory sequences.
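The indirect relationship between an associated SNP and a nearby causal variant is usually quantified as linkage disequilibrium, commonly summarized by the squared correlation r² between alleles at the two sites. The following sketch computes r² from phased haplotypes coded 0/1. The function name `ld_r_squared` and the toy haplotypes are hypothetical and serve only to illustrate the calculation; in practice haplotypes are often unphased and must first be estimated.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic loci from phased haplotypes.

    `haplotypes` is a sequence of (tag_allele, causal_allele) pairs,
    each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at tag SNP
    p_b = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at causal site
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Tag and causal alleles always travel together: a perfect proxy.
perfect = [(1, 1)] * 3 + [(0, 0)] * 7
# All four haplotypes equally common: the tag SNP carries no information.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(perfect), ld_r_squared(independent))
```

An r² near 1 means the genotyped SNP is an almost perfect proxy for the causal variant, whereas a low r² means that even a strong causal effect would produce only a weak association signal at the genotyped SNP.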
Causal variants of this class present a much greater challenge, with the result that many high-profile GWA publications have reported significant associations between SNPs and human diseases, but have yet to identify even potential causal variants. A second limitation in GWA studies is the fact that the number of individuals required to achieve adequate power to detect a significant SNP/disease association becomes extremely large as the allelic frequency of the causal variant declines much below 5% (16). As a result, while many disease-associated variants have been identified through population GWA studies, they have largely

been common variants whose allelic frequencies are typically moderate to high (i.e., >0.1) and whose odds ratios are typically moderate to low (i.e., 1.1–1.3). Variants with low odds ratios such as these are unlikely to be responsible for the observed familial clustering of many common diseases, such as heart disease and diabetes, because familial clustering generally requires that individuals who share the risk allele have a high probability of displaying the phenotype. As disease-associated variants with low odds ratios are unlikely to be responsible for a large fraction of an individual’s inherited risk, the accumulated results of recent GWA studies have called into question the validity of the “common disease–common variant” hypothesis for complex diseases, that is, the proposal that common polymorphisms contribute a significant proportion of the susceptibility to common diseases. As it now appears that only a minority of inherited disease risk can be accounted for by common variants, it is presumed that the majority of the inherited risk is caused by a heterogeneous collection of rare variants that exist in the population. While identification of these rare variants has become a major focus in human genetics, study samples of tens of thousands of individuals would be required to adequately power a GWA study to identify them, an option too costly for the vast majority of researchers. Because of these limitations in GWA studies, many researchers are turning to high-throughput sequencing as a means of identifying rare causal variants. For example, the 1000 Genomes Project (http://www.1000genomes.org) is a multisite effort designed to find rare variants by sequencing and comparing the entire genomes of large numbers of individuals.
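The dependence of sample size on allele frequency described above can be illustrated with a rough calculation. The following Python sketch is not taken from the cited studies. It assumes an allelic case-control comparison of two proportions under the normal approximation, equal numbers of cases and controls, and a genome-wide significance threshold of 5e-8 (a commonly used convention, assumed here). The function names are hypothetical.

```python
from statistics import NormalDist


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 5e-8, power: float = 0.8) -> float:
    """Approximate number of cases (with an equal number of controls).

    Normal-approximation sample size for comparing two proportions,
    counting two alleles per individual. alpha and power are assumptions.
    """
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * (2.0 * p_bar * (1.0 - p_bar)) ** 0.5
                 + z_beta * (p0 * (1.0 - p0) + p1 * (1.0 - p1)) ** 0.5) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return alleles_per_group / 2.0  # two alleles per individual


if __name__ == "__main__":
    # Illustrative frequencies only; odds ratio of 1.2 lies in the 1.1-1.3 range.
    for freq in (0.30, 0.05, 0.01):
        print(f"allele frequency {freq:.2f}: about {cases_needed(freq, 1.2):,.0f} cases")
```

Under these assumptions, the required number of cases rises steeply as the allele frequency falls, which is the pattern described in the text.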
While obtaining whole genome assemblies of large numbers of affected individuals at high depth of coverage is still cost-prohibitive for individual laboratories at the time of writing, high-throughput sequencing does show great promise for identifying elusive causal variants through the sequencing of large numbers of candidate genes and large genomic regions in the vicinity of associated SNPs.

2.4. Identification of Disease-Associated Sequence Variants

Targeted identification of novel rare variants requires an efficient means of isolating DNA sequences from a given genomic region of interest over a large number of samples. Polymerase chain reaction (PCR) amplification of DNA fragments from individuals or from pooled DNA samples has been used for several such studies, but for very large genomic regions or large numbers of separate regions (e.g., the set of all protein-coding exons) this approach is not cost-effective. Because of these limitations, several academic and commercial approaches have been developed using nucleotide hybridization to preferentially capture genomic DNA from targeted regions (17–20). Under such approaches, large numbers of custom-synthesized complementary oligonucleotides, typically

immobilized on chromatographic beads or microarray surfaces, are used to capture DNA fragments, which are then eluted and sequenced. Following sequencing, individual nucleotide sequences are aligned against the reference human genome, often using alignment pipelines supplied by the manufacturer of the sequencing device. Deviations of the aligned sequence from the reference genome are classified as sequencing errors or genuine population variants using criteria such as quality control values from the sequencer and expected distributions of deviations under error vs. true variant models (21, 22). While this approach scales much better than PCR, the specificity of hybridization-based methods suffers somewhat due to nonspecific hybridization with off-target genomic DNA.

2.5. Prioritization of Identified Variants

Once a list of sequence variants has been assembled through resequencing of a genomic region of interest, it is important to prioritize them for experimental assays that assess their functional consequences. Prioritization of variants is typically conducted through a sequential series of filters until the number of variants to be tested is consistent with the available experimental resources. These filters often involve statistical analysis of genetic association in primary and validation study samples, followed by prediction of the functional consequences of individual sequence variants. It is the latter functional prediction step that makes use of genome annotation information. The first selection criterion typically applied to presumptive causal sequence variants is the degree of association between each individual marker and the trait of interest in the original study sample. Those markers showing the most significant association (i.e., lowest p value or highest odds ratio) in the initial study population are selected for validation in a second population to reduce the number of false positives. While validation samples are usually drawn from populations of the same ethnicity as the primary sample, different ethnic populations can be used when many adjacent markers are significantly associated with the outcome due to high levels of correlation between the markers, or linkage disequilibrium (LD). In these cases, validation in a sample of different ethnic origin can sometimes distinguish causal variants from indirectly associated variants, because of the different allelic frequencies and patterns of linkage disequilibrium among ethnic populations. For example, validation in samples of African origin can occasionally refine genetic associations due to shorter blocks of LD relative to other ethnic populations. 
This approach can be complicated, however, in that population differences in allelic frequencies may also lead to decreased power to detect genuine genetic associations in the validation population (1). Analysis of the genomic context of the associated sequence variants is typically the second step in marker prioritization.

Variants with clear functional consequences on coding sequences (e.g., nonsense, missense, and splice site mutations) are obvious candidates for further study. In addition, analytical approaches have been developed to predict the functional consequences of nonsynonymous mutations, based upon the analysis of multiple sequence alignments and/or protein 3D structures (23). For those cases in which the associated variants occur far from any known genes, as has been the case for most regions identified by GWAS, genome annotation information can be very useful. For this class of variants, colocalization with functional genomic elements such as transcription factor binding sites, noncoding RNAs, and/or regions of strong phylogenetic conservation should be taken into consideration when prioritizing potential causal variants for further study. By applying genetic and genomic filters such as those described above, the number of associated sequence variants can be pared down to a reasonable size for in vitro and/or in vivo functional studies to assess the effect of the variant on some qualitative or quantitative outcome. For those variants that affect coding sequences of genes whose protein products possess a measurable activity, this can be a straightforward process of expressing a version of the protein containing the variant and conducting the appropriate assay. For those variants that occur in gene-proximal regulatory elements, transcriptional effects can be measured using reporter constructs containing the variant compared with the normal sequence in transfection experiments (24). For those variants that occur far from known genes, transfection experiments may also be used to measure transcriptional effects, although transgenic and/or knockout technology in mice has also been used and may be more appropriate for long-range acting regulatory sequences (10).
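The sequential filtering described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` fields, the p value threshold, and the variant names are hypothetical, and a real analysis would draw these values from association results and genome annotation tracks.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    p_value: float             # association p value in the primary sample
    replicated: bool           # association confirmed in a validation sample
    consequence: str           # e.g. "nonsense", "missense", "splice_site", "intergenic"
    overlaps_annotation: bool  # TF binding site, noncoding RNA, or conserved region


CODING_CONSEQUENCES = {"nonsense", "missense", "splice_site"}


def prioritize(variants, capacity, p_threshold=1e-4):
    """Apply filters one at a time, stopping once the list fits the capacity."""
    filters = [
        lambda v: v.p_value <= p_threshold,   # primary association
        lambda v: v.replicated,               # validation sample
        lambda v: v.consequence in CODING_CONSEQUENCES or v.overlaps_annotation,
    ]
    remaining = sorted(variants, key=lambda v: v.p_value)
    for keep in filters:
        if len(remaining) <= capacity:
            break
        remaining = [v for v in remaining if keep(v)]
    return remaining


# Hypothetical candidates, for illustration only.
candidates = [
    Variant("var_a", 1e-6, True, "missense", False),
    Variant("var_b", 1e-5, True, "intergenic", True),
    Variant("var_c", 2e-5, False, "intergenic", False),
    Variant("var_d", 1e-2, True, "intergenic", False),
    Variant("var_e", 3e-6, True, "intergenic", False),
]
shortlist = prioritize(candidates, capacity=2)
print([v.name for v in shortlist])
```

The filters are applied in the order given in the text (primary association, validation, then functional context), and filtering stops as soon as the remaining list matches the available experimental resources.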

3. Conclusions

While high-throughput sequencing technology has shown dramatic advances in a very short time, several challenges remain for its use in identifying causal variants in complex human diseases. From a technical standpoint, longer read lengths and lower error rates will improve the accuracy of alignment of sequences to the reference genome, with subsequent increases in the sensitivity and specificity of detecting genuine sequence variants. Analytically, improved algorithms for variant detection, especially structural variants such as indels, will more completely identify sequence variants in genomic regions of interest. With respect to the functional annotation of the genome, a good deal of work has yet to be done to characterize

non-protein-coding functional elements. More noncoding genomic sequences are phylogenetically conserved than coding sequences, but the function of this noncoding sequence remains largely unknown. Also, while much of the human genome is transcribed, the relevance of many transcribed regions remains controversial due to the lack of demonstrable function and lack of phylogenetic conservation. Genome annotation is an ongoing effort, and the catalog of functionally characterized genomic regions will grow as more experimental and computational approaches are developed, with a resulting increase in the number of identified non-protein-coding causal variants. Despite these challenges, high-throughput resequencing of genes identified by GWAS and family studies has already been successful in identifying disease-associated rare variants (15, 25–27). Although no definitive causal variants have been identified through GWAS, it is important to remember that this experimental approach has only become available in the last few years. Follow-up studies involving targeted resequencing of regions identified by GWAS to find potential causal variants, followed by in vitro and in vivo functional studies to assess the functional effects of these variants, will be required to pinpoint the causal variants themselves. While many of the common variants identified by GWAS have been difficult to interpret due to their presence in gene deserts, the higher odds ratios associated with many disease-associated rare variants are consistent with mutations in or near protein-coding regions that will have more obvious effects on protein function, and whose identification may be more straightforward as a result.

References

1. Altshuler D, Daly MJ, Lander ES (2008) Genetic mapping in human disease. Science 322:881–888
2. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145
3. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
4. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG et al (2001) The sequence of the human genome. Science 291:1304–1351
5. Pennisi E (2007) Genetics. Working the (gene count) numbers: finally, a firm answer? Science 316:1113
6. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG et al (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195
7. Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, Wides R et al (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296:1661–1671
8. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P et al (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562
9. The Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69–87
10. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M et al (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature 444:499–502
11. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS et al (2004) Ultraconserved elements in the human genome. Science 304:1321–1325
12. Kapranov P, Willingham AT, Gingeras TR (2007) Genome-wide transcription and the implications for genomic organization. Nat Rev Genet 8:413–423
13. Hunkapiller T, Kaiser RJ, Koop BF, Hood L (1991) Large-scale and automated DNA sequence determination. Science 254:59–67
14. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA et al (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687–695
15. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH (2004) Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305:869–872
16. Wang WY, Barratt BJ, Clayton DG, Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109–118
17. Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL et al (2007) Multiplex amplification of large sets of human exons. Nat Methods 4:931–936
18. Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME (2007) Microarray-based genomic selection for high-throughput resequencing. Nat Methods 4:907–909
19. Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X et al (2007) Direct selection of human genomic loci by microarray hybridization. Nat Methods 4:903–905
20. Turner EH, Lee C, Ng SB, Nickerson DA, Shendure J (2009) Massively parallel exon capture and library-free resequencing across 16 genomes. Nat Methods 6:315–316
21. Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ et al (2008) Identification of genetic variants using barcoded multiplexed sequencing. Nat Methods 5:887–893
22. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851–1858
23. Ramensky V, Bork P, Sunyaev S (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30:3894–3900
24. Wang QF, Prabhakar S, Wang Q, Moses AM, Chanan S, Brown M et al (2006) Primate-specific evolution of an LDLR enhancer. Genome Biol 7:R68
25. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA (2009) Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324:387–389
26. Ji W, Foo JN, O’Roak BJ, Zhao H, Larson MG, Simon DB et al (2008) Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet 40:592–599
27. Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH et al (2007) Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet 39:513–516

Part II Methods for Gene Identification

Chapter 4

Microarray-Based Genome-Wide Association Studies Using Pooled DNA

Szabolcs Szelinger, John V. Pearson, and David W. Craig

Abstract

Pooling genomic DNA samples within clinical classes of disease for use in whole-genome single nucleotide polymorphism (SNP) genotyping allows for rapid and inexpensive genome-wide association studies (GWAS). We describe here a general outline for combining hundreds of genomic DNA samples prior to genotyping on commercially available high-density SNP microarrays. The pool construction approach is universal, and independent of the SNP genotyping platform utilized, and therefore provides a quick, efficient, and low-cost alternative to interrogating thousands of individual samples on singular SNP microarrays. While the strategy for pooled DNA genotyping on SNP microarrays is straightforward, the success of such studies is critically dependent upon the accuracy of allelic frequency calculations, the ability to identify falsely positive results arising from assay variability, and the willingness to better resolve association signals through investigation of neighboring SNPs.

Key words: Genome-wide association study, Single nucleotide polymorphism, DNA pooling, Equimolar concentration, Allelic frequency, GenePool

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_4, © Springer Science+Business Media, LLC 2011

1. Introduction

While 99.9% of the human genome sequence is identical in human beings, it is estimated that there are approximately ten million single nucleotide polymorphisms (SNPs) that, in combination, can distinguish among individuals (1–3). SNPs are nucleotide variants, for example, an adenine to thymine (A/T) substitution, that occur throughout the human genome. These variants can contribute directly to disease predisposition by modifying a gene’s function, or they can be used as genetic markers to detect nearby disease-causing mutations through association or family-based linkage studies. There are three general classes of SNPs: those with strong functional effects, which dramatically alter a

gene’s behavior (classic mutations); those with more subtle functional effects that predispose to disease in concert with an individual’s genetic background or environment (functional SNPs); and those which are completely silent with respect to function (nonfunctional SNPs). The immediate importance of genotyping a high-density panel of SNPs does not necessarily come from their functional relevance, but rather from the proximity of the marker to the disease-causing variant or functional SNP (7). These types of studies, while long anticipated, have only recently become possible for complex diseases (5, 6), and genotyping hundreds of individual DNA samples is a popular approach taken by researchers conducting GWAS. However, a major weakness of genome-wide genotyping approaches is the significant cost associated with purchase of the necessary reagents and microarrays, putting such studies beyond the reach of many research groups. We have successfully (7) developed an alternative approach of genotyping pooled instead of individual genomic DNA samples to reduce the overall cost of GWA studies. Several previous reports have investigated the feasibility of pooling on SNP genotyping microarrays or related technologies (8–11). With a few exceptions, these reports have focused on predicting allelic frequencies across thousands of SNPs rather than on the effectiveness of pooling in identifying the genetic basis of a specific complex disorder. For individual genotyping, there are a number of factors that influence the ability of GWA studies to detect genetic associations. These include, but are not limited to: (1) the allele frequency of the causal variant; (2) the odds ratio (OR) or genetic relative risk; (3) the linkage disequilibrium (LD) between the causal variant and the probed SNPs; (4) the number of individuals in each cohort; (5) the number of probed SNPs in LD with the causal variant; and (6) the analysis approach taken.
Specific to pooling, there are additional factors that influence the ability to detect a true association. These additional factors include (7) the precision of allele frequency measurements made by the SNP genotyping microarray; (8) the accuracy of pool construction by pipetting; (9) the integrity of the pooled genomic DNA; (10) the number of individuals pooled or overall pooling strategy; and (11) the number of microarray technical replicates. Furthermore, population stratification and admixture can mask true associations in all studies. Beyond these additional factors, in pooling, one loses the ability to compare subphenotypes of pools, to directly measure genotype, and to detect gene–gene interactions. However, the most important factor in favor of pooling-based GWA studies is that this study design can be completed for thousands of dollars, whereas individual genotyping may require millions of dollars. There are numerous orphan diseases and many small populations which cannot realistically be studied using individual genotyping at this time, and a pooling-based GWA study presents an attractive, cost-effective alternative.

2. Materials

2.1. Quality Control and Quantification of Genomic DNA

2.1.1. Quality Control Using Gel Electrophoresis

1. Molecular biology grade water.
2. 50× Tris-acetate–EDTA buffer.
3. GelStar nucleic acid gel stain 10,000× (Lonza; Rockland, ME).
4. DNA ladder.
5. Ultrapure agarose.
6. 96-well Gel System (Thermo Fisher Scientific; Waltham, MA).
7. Microwave.
8. Magnetic stir plate and magnetic stir bars.
9. Dark Reader (Clare Chemical Research; Dolores, CO).

2.1.2. Quality Control and Quantification Using PicoGreen

1. Quant-iT PicoGreen dsDNA kit (Invitrogen; Carlsbad, CA).
2. Costar 96-well, black, flat bottom plates (Corning Life Sciences; Lowell, MA).
3. Molecular biology grade water.
4. Multi- and single-channel pipettes.
5. Aluminum foil.
6. Sterile glass bottle for 1× Tris–EDTA working solution.
7. Microcentrifuge.
8. Vortex.
9. Sterile tubes (1.5 mL).
10. Falcon tubes (50 mL, 15 mL).
11. Disposable reagent reservoirs.
12. Human genomic DNA, Cat# 11 691 112 001 (Roche Applied Science; Indianapolis, IN).

2.2. Pool Construction

1. Single-channel pipettes (P1, P2, P10, P20).
2. Sterile collection tube (1.5–5.0 mL).
3. Ice bucket.
4. Parafilm.

2.3. Genotyping

Materials for pooled genotyping can vary and depend primarily on the infrastructure, funding, and choices of the researcher. For the purposes of this chapter, however, the array type is irrelevant, as the focus here is on a pooled genomic DNA technique that is independent of genotyping platform. For interested readers, specific protocols for some of the genotyping assays are described in other sections of this book. That said, the predominant array types include those manufactured by Illumina (e.g., the Human 610-Quad v1.0, Human 1M-Duo v3.0, and Human CNV370-Quad v3.0) and Affymetrix (e.g., the Genome-Wide Human SNP Array 5.0 and Genome-Wide Human SNP Array 6.0).

2.4. Analysis of Pooled Genotype Data

GenePool software (freely available from http://genepool.tgen.org).

3. Methods

3.1. Quality Control and Quantification of Genomic DNA

3.1.1. Quality Control and Quantification Using Agarose Gel Electrophoresis

The visualization of DNA samples to be pooled on an agarose gel is not only an excellent way to control for the use of DNA samples of suboptimal quality, but it is also a great reference for use during troubleshooting procedures (see Notes 1 and 2). Agarose gel electrophoresis has long been associated with fluorometric assays that demanded the use of hazardous compounds such as ethidium bromide and an excitation wavelength in the ultraviolet range, both of which increase health risks to the researcher. However, novel, less toxic, fluorescent gel-staining products have become available, which are more sensitive for nucleic acid visualization and do not carry the health risks associated with ethidium bromide and ultraviolet radiation. Thus, we recommend fluorescent gel-staining products for DNA visualization during agarose gel electrophoresis.

1. Thaw the GelStar stain at room temperature for 10–15 min (see Note 3).
2. Dilute the 50× TAE buffer to a 1× working concentration with molecular biology grade water.
3. Prepare a 1% agarose gel using 1× TAE (e.g., 2 g Ultrapure agarose for a 200 mL gel).
4. Heat the mixture in a microwave until the agarose is dissolved.
5. Place the agarose solution on a stir plate and agitate gently until the temperature of the mixture reaches 55°C.
6. Add GelStar stain (10,000× stock) to the agarose solution to achieve a 1× concentration (see Note 4).
7. Mix the stain and agarose on the stir plate, then pour into a gel tray and let sit at room temperature until solidified.
8. Place the tray containing the gel in the electrophoresis apparatus and cover with 1× TAE. Load the appropriate DNA ladder and DNA samples with running dye and separate fragments with a constant voltage.
9. Visualize the DNA bands using the Dark Reader's blue light (between 400 and 500 nm).
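The quantities in steps 2, 3, and 6 follow from simple dilution arithmetic. The Python sketch below works through them for the 200 mL gel used as the example in step 3; the function names are hypothetical, and the volumes for other gel sizes should be checked against the gel system in use.

```python
def agarose_grams(gel_volume_ml: float, percent_w_v: float) -> float:
    """Mass of agarose for a gel of the given % (w/v), i.e., grams per 100 mL."""
    return gel_volume_ml * percent_w_v / 100.0


def stock_volume(final_volume: float, stock_x: float, final_x: float = 1.0) -> float:
    """Volume of an N-fold concentrate needed, in the same unit as final_volume."""
    return final_volume * final_x / stock_x


gel_ml = 200.0
agarose_g = agarose_grams(gel_ml, 1.0)          # 1% gel
tae_stock_ml = stock_volume(gel_ml, 50.0)       # 50x TAE diluted to 1x
stain_ul = stock_volume(gel_ml, 10_000.0) * 1000.0  # 10,000x stain, mL converted to µL

print(f"agarose: {agarose_g:.1f} g")
print(f"50x TAE: {tae_stock_ml:.1f} mL made up to {gel_ml:.0f} mL with water")
print(f"GelStar stain: {stain_ul:.1f} µL")
```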

3.1.2. Quality Control and Quantification Using Quant-iT dsDNA PicoGreen

1. Precise quantification of genomic DNA is critical for the success of a pooling approach to GWAS (see Note 5). This protocol assumes high-throughput sample quantification, and the following description is for large-scale studies requiring the use of several 96-well Costar plates filled to capacity (see Note 6). However, the framework of the protocol is the same for scaled-down pooling studies. The most critical part of the assay is the creation of the standard curve. The reference human genomic DNA comes at a stock concentration of 200 ng/µL; therefore, the DNA to be quantitated should be at a lower concentration. Prior to standard curve preparation, the sample DNA concentration can be assessed quickly, for example, on a NanoDrop or by an absorbance reading at 260 nm. The DNA concentration should fall somewhere in the middle of the standard curve. After the approximate concentration of the DNA is determined, all samples to be pooled should be diluted with sterile water to an approximate concentration of 100 ng/µL.
2. Dilute the 20× TE buffer, included in the Quant-iT kit, to a 1× working concentration (see Note 7).
3. Create the eight-point standard curve by serial dilution of the stock Roche DNA (see Note 8). Use 1.5 mL sample tubes and 1× TE buffer to obtain a final concentration series of 0, 3.125, 6.25, 12.5, 25, 50, 100, and 200 ng/µL. Assay each point of the standard curve in triplicate and use the median concentration of each to generate the curve.
4. Label the Costar 96-well, black assay plate with the sample/plate name.
5. Keep all samples to be assayed on ice throughout the entire procedure.
6. Add 99 µL of 1× TE to each well of the assay plate.
7. Add 1 µL of sample DNA to each well of the assay plate in triplicate or quintuplicate in rows A–F, and 1 µL of DNA standard to rows G–H (see Note 9). Cover the plate with a plate seal, vortex briefly, and spin at 1,000 rpm for 30 s to collect the TE and DNA at the bottom of the wells.
8. Dilute the PicoGreen reagent (150 µL of stock reagent in 19.85 mL of 1× TE). Prepare the dilution in a plastic Falcon tube and wrap in aluminum foil to prevent photodegradation. PicoGreen reagent has a light orange/magenta color.
9. Pour the diluted PicoGreen reagent into a large trough and add 100 µL to each well of the assay plate containing the DNA using a multichannel pipette. Remove any bubbles from the bottom of the wells, as they can influence the fluorescence signal detected by the reader. Cover the plate with aluminum foil and let stand for 5 min at room temperature.
10. Read and record the fluorescence signal in a plate reader. DNA-bound fluorescent dye is excited at 480 nm and the emission intensity is measured at 520 nm.
11. Calculate median DNA concentrations for each sample (see Note 10).

3.2. Pool Construction

1. Calculate the desired amount of genomic DNA to combine for each individual sample, and list in ascending order on a pooling sheet with the corresponding well number (see Note 11).
2. Load the samples to be pooled on a 96-well pooling plate corresponding to the pooling sheet.
3. Once this sheet is printed, place the DNA plate on ice and label a 1.5–5 mL, sterile collection tube with the desired pool ID, date, and DNA concentration.
4. Using a single-channel pipette, transfer the desired amount of DNA, per pooling sheet, into the collection tube. Once the sample is added to the collection tube, check off the corresponding number on the pooling sheet to avoid errors and track progress (see Notes 12, 13, 14).
5. Once all samples from the DNA plate are pooled, determine the concentration of each pool in triplicate using the PicoGreen assay.
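The arithmetic behind the standard curve (Subheading 3.1.2) and the pooling sheet (step 1 above) can be sketched in Python. This is a minimal illustration, not the authors' worksheet: it assumes that fluorescence is linear in DNA concentration over the range of the standards, that each sample contributes an equal mass of DNA to the pool, and that concentrations are in ng/µL. All function names, well labels, and fluorescence readings are invented for the example.

```python
from statistics import median


def fit_standard_curve(standards):
    """Least-squares line through (concentration, median fluorescence) points.

    `standards` maps each known concentration to its replicate readings.
    Returns (slope, intercept).
    """
    xs = list(standards)
    ys = [median(standards[x]) for x in xs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def sample_concentration(readings, slope, intercept):
    """Median concentration of one sample from its replicate readings."""
    return median((r - intercept) / slope for r in readings)


def pooling_volumes(concentrations, ng_per_sample):
    """Volume (µL) of each sample that contributes the same mass of DNA."""
    return {well: ng_per_sample / conc for well, conc in concentrations.items()}


# Invented readings: fluorescence = 10 * concentration + 50, with small scatter.
points = (0, 3.125, 6.25, 12.5, 25, 50, 100, 200)
standards = {c: [10 * c + 49, 10 * c + 50, 10 * c + 51] for c in points}
slope, intercept = fit_standard_curve(standards)

concentrations = {
    "A1": sample_concentration([1040, 1050, 1060], slope, intercept),
    "A2": sample_concentration([845, 850, 860], slope, intercept),
}
volumes = pooling_volumes(concentrations, ng_per_sample=200)

# Pooling sheet listed in ascending order of volume.
for well, vol in sorted(volumes.items(), key=lambda item: item[1]):
    print(f"{well}: {vol:.2f} µL")
```

Taking the median of the replicate wells, as the protocol specifies, limits the influence of a single aberrant reading on the calculated volume.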

3.3. Genotyping

The study design in this chapter suggests the combined usage of both Illumina and Affymetrix SNP genotyping arrays. Both Illumina and Affymetrix assays are performed according to vendor protocols, described elsewhere in this book.

3.4. Analysis of Pooled Genotype Data

We exclusively use the GenePool software package to detect shifts in relative allele frequency between pooled genomic DNA from cases and controls using Affymetrix or Illumina SNP-based genotyping microarrays. GenePool is written in C++ (gpextract) and C (gpanalyze). These programs can be run individually from the Unix command line. GenePool can be downloaded from the GenePool Web site at http://genepool.tgen.org/. The software is currently provided as a precompiled binary for x86 Linux and as source code. Manual pages for all executables are bundled in both source and binary distributions and are also available from the GenePool Web site in PDF and HTML formats for online viewing (7).

3.5. Gpextract

The gpextract program is the first of the three programs in the GenePool system. For Affymetrix analysis, gpextract uses Affymetrix’s Fusion library to extract intensity values from the Affymetrix CEL files and write them out to a more compact, customized binary file format that will be read by gpanalyze. The new files have the same name as the original CEL file but with the string .gpb appended to each filename. For Illumina analysis, gpextract uses the raw Illumina .txt files to extract intensity values and writes them to a more compact, customized binary file format that will be read by gpanalyze. Note that for Illumina chips, gpextract may report extracting intensity values for more SNPs than is stated for a given chip, since there are SNPs on the chip that are used only for quality control and testing purposes. In addition, each Illumina chip may have a different number of SNPs and intensity values, since the distribution of beads on the chip is random.

3.6. Gpanalyze

The gpanalyze program takes the intensity values from the custom binary file (.gpb) and uses a variety of distance measures to assign a probability score to each SNP, where the score indicates how unlikely the observed difference in allele frequency is between the hybridizations for the two DNA pools.

3.7. Gpcommand

The gpcommand program is the third of the three programs in the GenePool system. It is a Perl script that helps users run basic pooling analyses by reading a configuration file and automatically invoking the other two GenePool programs, gpextract and gpanalyze. Both of these programs are designed for direct use by researchers; however, each has a large number of command line options which can, initially, be difficult to understand and use correctly. The gpcommand wrapper creates and executes basic command lines that run gpextract and gpanalyze. All commands executed by gpcommand can optionally be echoed to a log file for later inspection.

3.8. Affymetrix 500K Arrays

For Affymetrix 500K arrays, the following protocol is recommended. Probe intensity data are read from CEL files and relative allele signal (RAS) values are calculated for each quartet. RAS values correspond to the ratio of the A probe to the sum of the A and B probes (where A is the major allele and B is the minor allele) and provide a quantitative index correlating to allele frequencies in pooled DNA. These values yield independent measures of different hybridization events and are consequently treated as individual data points. A silhouette statistic, which represents the mean of the distance of a point to all other points in its class (i.e., case pool) versus points in the other class (i.e., control pool), is used to rank all genotyped SNPs. Silhouette statistics range from 1, where complete separation between pools has been achieved, to −1, where it is not possible to distinguish between case and control pools. The calculation for a silhouette statistic is given as follows:

\[
S = \frac{\sum_{i=1}^{N} s(i)}{N}, \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
b(i) = \min\{b(i)\} \tag{1}
\]

56

Szelinger, Pearson, and Craig

where the overall silhouette statistic, S, is the average of all of the individual silhouette values, s(i), for each of the class comparisons, N refers to the number of replicate measures, I represents the replicate array within a class, a(i) refers to the average Euclidian distance of RASa and RASs for replicate i of a class to all other replicates within its class, and b(i) refers to the average distance of a replicate (i) to all replicates not within its class. This particular test-statistic has been heuristically shown to perform well at identifying SNPs with large allelic frequency differences for the Affymetrix 500K arrays only. 3.9. Illumina and Affymetrix 5.0/6.0

For the Illumina and Affymetrix 5.0/6.0 arrays, we utilize a modified test of two proportions based on predicted allelic frequencies. Allelic frequencies are approximated using a k-correction method such that f = A/(A + kB), where k is the correction factor for the ratio of the intensity of A and B probes. The pooling-test statistic, T, is given as follows:

$$T = \frac{f_a - f_c}{\sqrt{f(1 - f)\left(\dfrac{1}{2n_a} + \dfrac{1}{2n_c}\right) + s_r^2\left(\dfrac{1}{m_a} + \dfrac{1}{m_c}\right)}}, \tag{2}$$

where $f_a$ and $f_c$ are the respective predicted allelic frequencies for the a and c cohorts, $n_a$ and $n_c$ are the numbers of individuals pooled, $s_r$ is the standard deviation for technical replicates, and $m_a$ and $m_c$ are the numbers of replicate arrays. For analysis, one must specify the chip files (CEL for Affymetrix and .TXT for Illumina) that belong to each of the respective cohorts. One must also specify the number of individuals within each cohort. All other values are computed during the analysis, and a list of SNPs ranked by the test statistic is produced. Depending on the study, one must then individually genotype SNPs to determine a p-value. P-values are not computed from the analysis of pooled data because any number of spurious environmental factors can substantially distort the analysis. In the end, pooling can be viewed as a screening tool, but it does not replace individual genotyping.

3.10. Validation of Analysis Results

Various approaches exist for validating analysis results. Essentially, one individually genotypes the SNPs identified during analysis. Available methods include Illumina GoldenGate, Illumina iSelect, Illumina BeadArray/Veracode, Sequenom MassArray genotyping, Applied Biosystems TaqMan, direct sequencing, and restriction-fragment length polymorphism (RFLP) analysis. Most of these methods are discussed elsewhere in this text. Because the choice of validation method is subjective, and because of space constraints, we do not describe the methods for individual genotyping here. However, we do encourage researchers to validate the analysis results, as pooling-based GWAS should be viewed only as a screening tool for the SNPs showing the greatest differences in allele frequency between pools. When reporting results, only p-values from individual genotyping should be considered.
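The two ranking statistics of Subheadings 3.8 and 3.9 can be illustrated with a minimal Python sketch. It is not the GenePool implementation: the function names are hypothetical, each replicate array is represented as a tuple of RAS values, the overall frequency f must be supplied by the caller because its estimator is not specified above, and the denominator of T is assumed to be the square root of the summed variance terms, as in a standard test of two proportions.

```python
import math


def ras(a, b):
    """Relative allele signal: A / (A + B)."""
    return a / (a + b)


def k_corrected_frequency(a, b, k):
    """Predicted allele frequency f = A / (A + k*B)."""
    return a / (a + k * b)


def silhouette(case, control):
    """Mean silhouette over all replicates of both pools.

    case and control are lists of equal-length tuples of RAS values,
    one tuple per replicate array; each list needs at least two entries.
    """
    def mean_distance(point, others):
        return sum(math.dist(point, q) for q in others) / len(others)

    scores = []
    for own, other in ((case, control), (control, case)):
        for idx, point in enumerate(own):
            within = own[:idx] + own[idx + 1:]
            a_i = mean_distance(point, within)   # distance inside own class
            b_i = mean_distance(point, other)    # distance to the other class
            largest = max(a_i, b_i)
            # Assumption: identical points give a score of 0 rather than 0/0.
            scores.append((b_i - a_i) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)


def pooling_t(f_a, f_c, f, n_a, n_c, s_r, m_a, m_c):
    """Pooling test statistic with sampling and technical variance terms."""
    sampling = f * (1 - f) * (1 / (2 * n_a) + 1 / (2 * n_c))
    technical = s_r ** 2 * (1 / m_a + 1 / m_c)
    return (f_a - f_c) / math.sqrt(sampling + technical)


# Made-up RAS values for three replicate arrays per pool.
case = [(0.80, 0.78), (0.82, 0.79), (0.81, 0.80)]
control = [(0.60, 0.61), (0.59, 0.62), (0.61, 0.60)]
print(silhouette(case, control))
print(pooling_t(0.60, 0.40, 0.50, n_a=100, n_c=100, s_r=0.02, m_a=3, m_c=3))
```

Keeping the sampling and technical variance terms separate makes it easy to see how adding replicate arrays (larger m) or larger pools (larger n) changes the statistic.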

4. Notes

1. The use of sample DNA amplified with available whole-genome amplification technologies is not recommended because of uneven amplification.

2. As in most molecular biology applications, the quality of the template DNA greatly determines the success of the study. Therefore, great care must be taken to ensure that the DNA used in pooling-based genotyping is of high quality and free of protein and other contaminants. We suggest that DNA samples be plated out on a 96-well microplate prior to the study for easy handling. The layout of this “DNA plate” should be used for subsequent procedures.

3. Gel Star stains not only dsDNA but also ssDNA and RNA, which are common components of purified DNA solutions, although it is several orders of magnitude more sensitive to dsDNA. Stains that are more specific to dsDNA are recommended for DNA quantification.

4. A 1× stain concentration in the agarose solution is recommended by the manufacturer, e.g., 5 µL of Gel Star stain added to every 50 mL of agarose solution. However, we achieved comparable results using only a 0.2× stain concentration in the gel, e.g., 2 µL of Gel Star stain in 100 mL of agarose solution.

5. Precise quantification of genomic DNA is critical for the success of a pooling approach to GWAS. Because pooling-based genotyping measures the allelic frequency of a genetic variant using pools of genomic DNA partitioned into risk and nonrisk groups, inaccuracies in the estimation of DNA concentration could mask a predicted allele frequency difference between affected and unaffected pools. The Quant-iT kit is a rapid, accurate, and cost-effective method when working with dozens to hundreds of DNA samples simultaneously. Not only does this method enable high-throughput application, but PicoGreen results are also less influenced by common sample contaminants such as residual proteins, salts, or RNA. However, the PicoGreen reagent is sensitive to photodegradation and should therefore be kept in a dark place, preferably at 4°C. Further, while performing the quantification assay, all prepared reagents should be kept in a plastic container, as PicoGreen reagent adheres easily to glass surfaces. The method used by the authors follows the basic outline of the manufacturer’s protocol, adjusted to accommodate high-throughput requirements. Human genomic DNA standard is used instead of the calf thymus DNA (supplied with the Quant-iT kit) to more closely emulate the properties of the sample DNA. At the very least, sample DNA concentration should be measured in triplicate, but for higher accuracy, we recommend five replicate measurements for each sample to be pooled. The increased replicate count helps to expose outlier measurements that would otherwise inflate or deflate the estimated concentration. Thus, more replicates, summarized by the median instead of the mean, will likely give the most accurate concentration estimate.

6. Pooling studies usually require dozens to hundreds of samples to be combined; therefore, working with DNA samples in a 96-well microplate format greatly increases speed, simplicity, and precision. To reduce error during sample handling, the same sample layout should be used for the entire quantification and pooling process.

7. If large numbers of samples are to be assayed, the 1× TE buffer can be made up in a large batch and stored in a labeled glass container at room temperature for an extended period of time.

8. The Roche DNA should never be kept at −20°C. We consistently observed increased variance, with R² values below 0.8, for DNA standards made from previously frozen Roche reference DNA.

9. We recommend that rows G and H be dedicated to the serially diluted Roche DNA standard. Once 99 µL of 1× TE has been added to these wells, the standards are loaded in triplicate: the 0 ng/µL standard takes positions G1, G2, and G3; the 3.125 ng/µL standard takes positions G4, G5, and G6; and so on, until wells H10, H11, and H12 contain the most concentrated standard, 200 ng/µL. Just like the DNA samples of interest, 1 µL of each standard is added to each reaction well. For clarity, a standard plate layout can be created and used for all subsequent assays. A standardized layout will also require only a single protocol in a fluorescent dye reader such as the Bio-Tek KC4.

10. Samples whose pooling volume deviates by more than 2 standard deviations from the median pooling volume are not included in the final pool. The other samples are pooled at equimolar concentrations, ensuring that the same number of DNA molecules from each sample is represented in the combined DNA pool. If the concentration measurement of a sample to be pooled is inaccurate, if samples are not pooled equimolarly, or if pipetting errors occur, the allelic signal is likely to be biased by over- or under-represented genomic loci.

11. Pooling reduces the power to detect associations; we therefore suggest study designs with multiple subpools containing an equal number of samples at equimolar concentration. The creation of subpools does not increase power, but it reduces the pooling variance introduced by pipetting error during pool construction. To increase resolution, we suggest the application of as many SNP platforms as possible, with each subpool run in triplicate technical replicates. The redundancy between the SNP platforms reduces measurement noise on highly correlated neighboring SNPs in LD and ensures that there are very few gaps in overall genome-wide SNP coverage. The pool concentration is based on the median concentrations of individual samples obtained by the PicoGreen assay. Based on the median concentration of each individual sample, the pooling volume of each DNA sample is calculated so that all samples contribute the same amount of DNA to each pool. At this point, the number of samples for each subpool is determined. The pool concentration is calculated to be approximately 10× greater than the template concentration needed for the genotyping assay. This stock pool can later be diluted to the working concentration required by the various genotyping assays. If the PicoGreen concentration of a DNA sample is 10 ng/µL, for example, and 100 ng of each sample is to be pooled, the sampled DNA volume is 100/10 = 10 µL, as 10 µL of sample contains 100 ng of DNA. This calculation can be automated in a spreadsheet for all samples.

12. No less than 1 µL of any sample is added to the pool. Volumes below 1 µL increase pipetting error significantly.

13. For clarity and to reduce error, the use of an unopened box of pipette tips is recommended. Pipette the sample DNA from well A1 of the pooling plate using the pipette tip from the corresponding location in the tip box, and continue in the same way for the remaining wells. This ensures that a sample is not pooled more than once and helps keep track of progress.

14. The DNA samples on the pooling sheet should be sorted from smallest to largest volume so that changing volumes on the pipette is straightforward. Use of a spreadsheet with pooling volumes listed in increasing order prevents hand fatigue caused by constant adjustment of the pipette to the desired volume. If the smallest volume is pooled first, only upward adjustment of the pipette dial is needed until the last sample is added to the pool.
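The pooling arithmetic described in Notes 5, 10, 11, 12, and 14 can be sketched in Python as follows. This is an illustration rather than the authors' spreadsheet: the function name, the dictionary layout, and the default of 100 ng per sample (taken from the worked example in Note 11) are assumptions, and the sample standard deviation is used for the 2 SD filter because the notes do not state which estimator was applied.

```python
from statistics import median, stdev


def pooling_sheet(readings, ng_per_sample=100.0, min_volume_ul=1.0, max_sd=2.0):
    """Return (sample, volume in uL) pairs sorted from smallest to largest volume.

    readings maps a well or sample name to its replicate PicoGreen
    concentrations in ng/uL.
    """
    # The median of the replicates is robust to a single outlier reading.
    conc = {name: median(values) for name, values in readings.items()}
    # Volume that delivers the same DNA mass from every sample.
    volumes = {name: ng_per_sample / c for name, c in conc.items() if c > 0}

    vols = list(volumes.values())
    centre = median(vols)
    spread = stdev(vols) if len(vols) > 1 else 0.0

    kept = [
        (name, vol)
        for name, vol in volumes.items()
        # Drop volumes more than max_sd standard deviations from the median
        # and volumes too small to pipette reliably.
        if abs(vol - centre) <= max_sd * spread and vol >= min_volume_ul
    ]
    # Ascending order, so the pipette dial only ever moves upward.
    return sorted(kept, key=lambda item: item[1])


readings = {
    "A1": [10.0, 10.0, 10.2, 9.9, 50.0],   # one outlier replicate
    "A2": [20.0, 20.0, 20.0],
    "A3": [25.0, 25.0, 25.0],
}
for well, volume in pooling_sheet(readings):
    print(f"{well}: {volume:.1f} uL")
```

Samples omitted by this sketch are simply left off the sheet; how to handle them (for example, by requantifying) is a decision the notes leave to the researcher.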


Chapter 5

Medium-Throughput SNP Genotyping Using Mass Spectrometry: Multiplex SNP Genotyping Using the iPLEX® Gold Assay

Meredith P. Millis

Abstract

Depending on the scope of the research project, categories of single-nucleotide polymorphism (SNP) genotyping experiments range from low to medium to high throughput, with each approach differing widely in cost, platform, and efficiency. Medium-throughput genotyping is generally appropriate for assaying up to 36 markers in 384 individuals and is commonly used for fine-mapping chromosomal regions identified in genome scans. Multiplexing, which allows for simultaneous assessment of multiple SNPs, is an efficient, rapid, and economic way to augment medium-throughput genotyping output and is readily performed using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). In this chapter, we describe a technique for medium-throughput genotyping using the iPLEX® Gold assay available from Sequenom, Inc. (San Diego, CA). This multiplex SNP genotyping platform incorporates locus-specific PCR amplification of genomic DNA, followed by shrimp alkaline phosphatase treatment to inactivate unincorporated nucleotides, single-base primer extension using mass-modified terminators, and MALDI-TOF MS for allele-specific detection. This protocol utilizes proprietary enzymes, software, and SpectroCHIP® Arrays that are pre-spotted with a MALDI matrix.

Key words: SNP genotyping, Multiplex PCR, Primer extension, MALDI-TOF MS, iPLEX® Gold

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_5, © Springer Science+Business Media, LLC 2011

1. Introduction

Platform selection for SNP genotyping primarily depends on (1) the number of markers to investigate and (2) the number of individuals to genotype. There are several genotyping options that rely on different detection methods such as fluorescence or matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) (1). Low-throughput genotyping typically consists of fewer than four markers in a few hundred samples and utilizes platforms such as the TaqMan SNP genotyping assay (Applied Biosystems; Foster City, CA). In contrast, high-throughput SNP genotyping is generally considered when assaying more than a couple hundred SNPs in hundreds to thousands of individuals (e.g., genome-wide association studies). Medium-throughput genotyping is an intermediate option between these two extremes and is ideal for medium-scale studies in which a total of 5–100 SNPs are to be assayed in a moderate number of samples (100–2,000). Medium-throughput genotyping is commonly used for fine-mapping, identification of candidate disease genes, and validation of whole-genome association studies (2).

A popular method for medium-throughput genotyping is the iPLEX® Gold assay, used in conjunction with the MassARRAY® System from Sequenom, Inc. (San Diego, CA). This system offers a cost-effective method for genotyping up to 40 SNPs per well, with scalable sample runs from 96 samples to 3,840 samples, in as little as 2 days. Allelic discrimination is achieved using MALDI-TOF MS, which has been used for many years to analyze multiplex products and genotype individuals based on nucleotide molecular weight differences (3). An advantage of the iPLEX® Gold assay is that Sequenom provides Assay Design software to design the three necessary primers for assays with up to 95% efficiency. This enables flexibility to design any number of assays to a wide variety of genomic targets. Also, the MassEXTEND® primer extension reaction utilizes a single termination mix and universal cycling conditions for any SNP in the plex. Sequenom has designed 384- and 96-sample SpectroCHIP® Arrays that are available pre-spotted with a MALDI matrix and are loaded into the mass spectrometer for real-time analysis of mass-dependent genotypes.
The SpectroCHIP® Arrays have also recently been upgraded to improve signal-to-noise ratios, signal intensities of the main peaks, and reduce salt adducts that previously affected data quality. Genotype data are analyzed using the Sequenom software, Typer 4.0, and application, Typer Analyzer, which clusters genotype plots across multiple chips and performs automated QC analysis and assesses assay quality. Within 1 week, a large amount of data can be generated for statistical analysis using the iPLEX® Gold and MassARRAY® System. This chapter focuses on medium-throughput SNP genotyping using the iPLEX® Gold Assay from Sequenom and describes each step in detail, from assay design to bench-work to data analysis. In particular, we focus on techniques and modifications gleaned from our own experience that improve both the efficiency of the assay and the quality of resulting genotype data.

2. Materials

2.1. Assay Design

To perform the genomic DNA amplification PCR, forward and reverse primers are designed to produce amplicons of approximately 100 bp spanning the SNP of interest. Primers should map uniquely to the region of interest and should not dimerize with themselves or other primers in the mix. The extension primers are designed to anneal directly adjacent to the SNP of interest. The following programs are required to design iPLEX® Gold assays and can be found at https://mysequenom.com:

1. rs Sequence Retriever.
2. ProxSNP.
3. PreXTEND.
4. PleXTEND.
5. Assay Design 3.1 (download).

2.2. PCR Amplification

1. Complete PCR reagent set (Sequenom, see Note 1):
   (a) PCR buffer with 20 mM MgCl2: 10×.
   (b) MgCl2: 25 mM.
   (c) dNTP mix: 25 mM.
   (d) PCR enzyme: 5 Units/µL.
2. Forward and reverse primer mix: 500 nM each (see Note 2).
3. Nanopure or HPLC-grade water (see Note 3).
4. Genomic DNA: 5–10 ng (see Note 4).
5. PCR thermocycler (96- or 384-well plate format).

2.3. Shrimp Alkaline Phosphatase Incubation

1. iPLEX® Gold Reagent Kit for processing 10 SpectroCHIP® Arrays (960 or 3,840 reactions):
   (a) Shrimp alkaline phosphatase (SAP) buffer: 10×.
   (b) SAP enzyme: 1.7 Units/µL.

2.4. Primer Extension

1. iPLEX® Gold Reagent Kit for processing 10 SpectroCHIP® Arrays (960 or 3,840 reactions):
   (a) iPLEX® Buffer Plus: 10×.
   (b) iPLEX® Termination mix.
   (c) iPLEX® enzyme: 1.7 Units/µL.
2. Extension primer mix. Tier the primer concentrations according to plex size (see Note 5).

2.5. Resin Addition

1. SpectroCHIP® Arrays and clean resin kit:
   (a) Clean resin (see Note 6).
2. MassARRAY® Clean Resin Tool Kit:
   (a) 96- or 384-well (6 mg) resin plate.
   (b) Clean resin spoon and scraper.
3. Rotator capable of rotating microtiter plates 360° around the long axis.

2.6. Chip Dispensing

1. SpectroCHIP® Arrays and clean resin kit:
   (a) SpectroCHIPs.
2. iPLEX® Gold Reagent Kit for processing 10 SpectroCHIP® Arrays:
   (a) 3 pt Calibrant.
3. RoboDesign Nanodispenser (see Note 7).
4. Water.
5. Isopropyl alcohol.
6. Sodium hydroxide.

2.7. Chip Detection

1. MassARRAY® Analyzer Compact (see Note 8).
2. MassARRAY® Workstation RT software.

2.8. Data Analysis

1. Typer 4.0 software (see Note 9).

3. Methods

3.1. Assay Design

Designing and selecting the best assay to order can take 1–3 days, depending on the number of SNPs to genotype and on how readily primers can be designed in the flanking sequences. If a SNP is excluded from an assay because it lies in a region with a high number of repeats or is homologous to other areas of the genome, and genotype data must nevertheless be obtained, another SNP can be selected to take its place based on linkage disequilibrium (4). Primer ordering and delivery can take 5 days.

1. Go to https://mysequenom.com and register for a mySequenom account to use the online genotyping tools for preparing assay design files. The tools used for iPLEX® assay design include:
   (a) rs Sequence Retriever – from NCBI dbSNP.
   (b) ProxSNP – Sequence Mapping/Proximal SNP Demarkation.
   (c) PreXTEND – SNP Validation/Amplicon Design.
   (d) PleXTEND – Multiplexed Assay Validation.
   The desktop software, MassARRAY® Assay Design 3.1, must also be downloaded to complete the assay design process.
2. Select the SNPs to be genotyped and name them in a list by SNP ID (rs#) (see Note 10). Go to the Sequenom website and select the “rs Sequence Retriever from the NCBI dbSNP” tool. This application retrieves the flanking sequence for each SNP and formats the results in a text file used for the next step, “ProxSNP.” Alternatively, a dbSNP file can be uploaded by clicking the Browse button and selecting the file (see Note 11).
3. Send the file to ProxSNP using the default settings. This program searches the current database for proximal SNPs that lie in the flanking sequences of the SNPs of interest and marks the alleles using the IUPAC code in the Output file (_FAS_ sqnm -s.txt). These marked SNPs will subsequently be avoided during primer design.
4. Now send the ProxSNP Output file to PreXTEND using the default settings. The PreXTEND program scans the flanking sequences and chooses forward and reverse primers for each SNP, putting the primer sequences in capital letters. These primers are mapped to the genome and preferentially designed to uniquely amplify the SNPs. Save the PreXTEND Output file (_FAS_sqnm -p.txt) to your PC and open the MassARRAY® Assay Design 3.1 software (see Note 12).
5. Click Browse for the SNP Group field and upload the _FAS_ sqnm -p.txt file. Choose SBE (Single-Base Extension) Mass Extend. Go to Presets and choose low, medium, or high plexing (depending on the number of SNPs you are uploading). Keep the Scan & Restrict option marked. If given the option (as in higher plexes), choose “Progressive Search.” Click Run.
6. The Assay Design software generates two files (.trs and .xls), which are uploaded into the Typer 4.0 software when creating the plate and assay documents (see Note 13).
7. Run the Assay Group file (.xls) through the online genotyping tool, PleXTEND, on the Sequenom website (see Note 14).

3.2. PCR Amplification

1. Array sample plates with the following controls:
   (a) Duplicate samples for assessing genotyping accuracy.
   (b) No template controls (NTC; no DNA plus PCR cocktail with enzyme) to check for contamination.
   (c) No polymerase controls (NPC; DNA plus PCR cocktail without enzyme) to evaluate self-extension of primers.

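Table 1
iPLEX® Gold PCR

| Reagent                         | Final concentration | Vol (1×) (µL) | Vol (384× + 20%) (µL) |
|---------------------------------|---------------------|---------------|-----------------------|
| Water, HPLC grade               | N/A                 | 1.8           | 829.8                 |
| 10× PCR buffer with 20 mM MgCl2 | 2 mM MgCl2          | 0.5           | 230.4                 |
| 25 mM MgCl2                     | 2 mM                | 0.4           | 184.4                 |
| 25 mM dNTP mix                  | 500 µM              | 0.1           | 46.1                  |
| 0.5 µM Primer mix               | 0.1 µM              | 1.0           | 461.0                 |
| 5 U/µL PCR enzyme(a)            | 1 Unit              | 0.2           | 92.2                  |
| Total                           |                     | 4.0           | 1,843.9               |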
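As a check on the arithmetic in Table 1, the sketch below scales the per-reaction volumes to a full plate. The reagent labels, the function name, and the overage parameter are illustrative choices, not part of the Sequenom protocol. The printed table appears to use a rounded multiplier of about 461 reactions, so values computed from 384 × 1.2 = 460.8 differ slightly in the first decimal place.

```python
# Per-reaction volumes in microliters, transcribed from Table 1.
PER_REACTION_UL = {
    "Water, HPLC grade": 1.8,
    "10x PCR buffer with 20 mM MgCl2": 0.5,
    "25 mM MgCl2": 0.4,
    "25 mM dNTP mix": 0.1,
    "0.5 uM primer mix": 1.0,
    "5 U/uL PCR enzyme": 0.2,
}


def master_mix(n_reactions, overage=0.20, per_reaction=PER_REACTION_UL):
    """Scale each reagent volume to n_reactions plus a fractional overage."""
    factor = n_reactions * (1 + overage)
    return {reagent: vol * factor for reagent, vol in per_reaction.items()}


mix = master_mix(384)
for reagent, vol in mix.items():
    print(f"{reagent:35s}{vol:8.1f} uL")
print(f"{'Total':35s}{sum(mix.values()):8.1f} uL")
```

Passing a different `n_reactions` (for example, 96) or a different `overage` reuses the same per-reaction recipe for other plate formats.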

Add to DNA template (1 µL, 10 ng/µL).
(a) Adjust PCR enzyme to 0.5 U/rxn if >.” Select the passive reference depending on the genotyping master mix you use (in most cases ROX). Click Next.

6. On the setup sample plate page, select the marker for the wells: click-drag to select wells, then select the “Use” box for the marker. Click the “task” field for one of the detectors and select “NTC” for no-template controls or “Unknown” for samples. This determines how the SDS software uses the data collected from the well during analysis.
7. Click Finish. With the help of the well inspector (“View > Well Inspector”), it is now possible to enter sample names, although this step is not necessary if a master plate set-up is available (see Subheading 3.1).
8. Select the “Instrument” tab. After creating and setting up an allelic discrimination plate read document in the SDS software, the plate read is performed at 60°C. Change the sample volume to correspond to the reaction volume (see Subheading 3.2).
9. Save the file (“File > Save”). Be sure that the file name specifies which samples/cohort/SNPs were used for the experiment. Then click “Post-Read.”
10. After the run, click the analysis button to start the genotype read. Normally, the analysis of the plate read document is performed automatically, but the software also allows alleles to be called manually (see Note 7). An example of allele calling is presented in Fig. 2. Ultimately, the allele calls need to be converted to genotypes (see Note 8). The appropriate allele labeling is included with every assay provided by the company.

4. Notes

1. Depending on the quality of the starting material, the best DNA extraction method needs to be determined individually by the researcher. We strongly recommend quantifying genomic DNA using a spectrophotometer and assessing integrity using agarose gel electrophoresis. The purified DNA (stock or diluted to the final concentration) should be stored at ≤ −20°C. Biological samples such as tissues, body fluids, infectious agents, and blood have the potential to transmit infectious diseases. Follow all applicable local, state/provincial, and/or national regulations. Wear appropriate protective equipment, which includes, but is not limited to, protective eyewear, face shield, clothing/lab coat, and gloves. All work should be conducted in properly equipped facilities using the appropriate safety equipment.

2. While ABI recommends specific instruments, we have found that any conventional thermocycler can be used in this protocol. In our experience, it is rare that the thermal cycler does not perform adequately for the assay; however, because the potential exists, we recommend testing each TaqMan® probe in several plates on the thermocycler prior to using it in a large-scale genotyping study.

3. Nontemplate controls (NTCs, such as water) are included in the plate set-up for internal quality control purposes. We also highly recommend inclusion of positive controls, such as DNA samples with known genotypes (determined by a different method), as well as quality controls (QCs) representing blinded duplicates of the genotyped samples, to validate the reproducibility and accuracy of the assay. Including positive controls for each allele combination is particularly important in experiments with small sample sizes (e.g., N < 50) or for variants with rare minor allele frequencies or a low number of heterozygotes.

4. In our experience, the assays are typically easy to handle and the amplification reaction is stable. Therefore, it is sometimes possible to reduce the genotyping reaction volume, which may help to reduce genotyping costs and preserve DNA samples. Considering the costs and time required for other techniques, such as sequencing, this protocol is efficient for sample sizes of 50 up to several thousand (use of different semiautomatic/automatic pipette solutions may be advantageous).

5. Applied Biosystems provides an “Allelic Discrimination (AD) Getting Started Guide” for the AB Real-Time PCR System, in which one can find detailed information about performing the AD pre-run (although in our experience, this is not essential when working according to good laboratory practices [GLP]), the amplification run, and the AD post-read run. The authors strongly recommend reading this guide prior to beginning the TaqMan® genotyping experiment.

6. The PCR amplification reaction is followed by the endpoint analysis for allelic discrimination using an Applied Biosystems Real-Time PCR System. The Sequence Detection System (SDS) software uses the fluorescence measurements obtained during the plate read to plot fluorescence (Rn) values based on the signals from each well. The plotted fluorescence signals indicate the alleles in each sample (TaqMan® SNP Genotyping Assays Protocol, Applied Biosystems 2006).

7. Most of the TaqMan® Assays are robust, but we nevertheless recommend processing 1–2 test plates per genotyping assay and cohort. Doing so shows how the assay performs and how frequent the minor allele is. However, a TaqMan® Assay may occasionally perform poorly (i.e., the plate readout is difficult or not possible at all). Commonly, this is due to poor DNA quality (e.g., DNA extracted from serum or small amounts of tissue, which leads to low concentration and poor quality). Also, if the DNA concentration is not consistent between wells across the plate, the TaqMan® Assay may not perform optimally. Check and improve these factors, if necessary, and test again. If these problems are still encountered, run the plate for an additional ten cycles or add Taq polymerase to the master mix prior to the PCR reaction to increase the PCR yield. One can also test genotyping master mixes offered by other companies (please be aware of potential differences in PCR cycling conditions when using various master mixes) or change the thermal cycler. If any initial setting has been changed, start the experiment again with just one or two plates to check whether it performs differently.

8. For data analyses, use suitable programs; for phenotype–genotype association studies, examples are SPSS (SPSS, Inc.; Chicago, IL, USA) and GraphPad Prism (GraphPad Software, Inc.; La Jolla, CA, USA). For statistical calculations, we recommend coding the SNP alleles as 11 or 22 for homozygotes for Allele 1 or Allele 2, respectively, and 12 for heterozygotes that carry both alleles, and converting them to their respective nucleotides afterward.
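The allele coding recommended in Note 8 can be sketched as follows. This is a minimal illustration with hypothetical function names; which nucleotide corresponds to Allele 1 and Allele 2 must be taken from the labeling supplied with each assay, and the minor allele frequency helper is included only because Note 7 suggests checking how frequent the minor allele is on test plates.

```python
def decode_genotype(code, allele1, allele2):
    """Translate a coded genotype ('11', '12', or '22') into nucleotides."""
    if code not in {"11", "12", "22"}:
        raise ValueError(f"unexpected genotype code: {code!r}")
    lookup = {"1": allele1, "2": allele2}
    return "".join(lookup[digit] for digit in code)


def minor_allele_frequency(codes):
    """Frequency of the rarer allele among coded genotypes."""
    n_allele2 = sum(code.count("2") for code in codes)
    freq2 = n_allele2 / (2 * len(codes))
    return min(freq2, 1 - freq2)


calls = ["11", "12", "22", "11"]
print([decode_genotype(code, "A", "G") for code in calls])
print(minor_allele_frequency(calls))
```

Keeping the numeric codes through the statistical analysis and decoding only at the end avoids strand and labeling mix-ups between assays.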

Chapter 7

Bar-Coded, Multiplexed Sequencing of Targeted DNA Regions Using the Illumina Genome Analyzer

Szabolcs Szelinger, Ahmet Kurdoglu, and David W. Craig

Abstract

To date, genome-wide association (GWA) studies, in which thousands of markers throughout the genome are simultaneously genotyped, have identified hundreds of loci underlying disease susceptibility. These regions typically span 5–100 kb, and resequencing efforts to identify potential functional variants within these loci represent the next logical step in the genetic characterization pipeline. Next-generation DNA sequencing technologies are, in principle, well-suited for this task, yet despite the massive sequencing capability afforded by these platforms, the present-day reality is that it remains difficult, time-consuming, and expensive to resequence large numbers of samples across moderately sized genomic regions. To address this obstacle, we developed a generalized framework for multiplexed resequencing of targeted regions of the human genome on the Illumina Genome Analyzer using degenerate, indexed DNA sequence barcodes ligated to fragmented DNA prior to sequencing. Using this method, the DNA of multiple individuals can be simultaneously sequenced at several regions. We find that achieving adequate coverage is one of the most important factors in the design of an experiment, but other key considerations include whether the objective is to discover genetic variants for genotyping later by a separate method, to genotype all identified variants by sequencing, or to exhaustively identify all common and rare variants in the region. Given the massive bandwidth of next-generation sequencing technologies and their low inherent throughput in terms of sequencing arrays per week, multiplexed sequencing using the barcoding approach offers a clear mechanism for focusing bandwidth to a smaller region across many more individuals or samples.
Key words: Barcode, Multiplexing, Next-generation sequencing, Illumina genome analyzer, Bayes factor

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_7, © Springer Science+Business Media, LLC 2011

1. Introduction

Genome-wide association (GWA), candidate gene, and linkage studies have identified thousands of moderately sized genomic regions that are associated with human disease. In particular, GWA studies have identified hundreds of disease-associated loci, typically spanning 5–100 kb (1–3), and in such disease identification studies, a logical follow-up strategy consists of resequencing the associated regions to identify any potential functional variants underlying the original association signal. Numerous biological studies, linkage studies, and pathway analyses have yielded a large number of candidate genes as potential sequencing targets, and RNA (e.g., microRNA), small genomes (mitochondria), and other non-human genomic DNA are also frequently considered potential targets for high-throughput sequencing. Thus, sequencing provides a powerful approach for enhancing our understanding of the molecular mechanisms underlying differences in genetic architecture that lead to increased risk for disease.

Next-generation DNA sequencing technologies are, in principle, well-suited to these and other tasks due to their capabilities for high-throughput, low-cost sequencing. While these technologies offer massive sequencing capacity, it remains difficult, time-consuming, and/or expensive to resequence large numbers of samples across moderately sized genomic regions (i.e., 5 kb to 1 Mb). The sequencing approach detailed here is based on barcoding (or indexing) fragmented DNA from each individual with a short identifying oligonucleotide. While indexing has the obvious benefit of multiplexing large numbers of samples within a run, DNA indexing offers two key additional advantages: a direct measure of the base-by-base error rate and a reduction of array-to-array or day-to-day variability. Previous pioneering efforts to develop DNA indexing have shown considerable promise (4–7); however, adoption of this approach is still in its infancy and considerable challenges remain, including the development of practical and cost-effective approaches for short-read platforms.
Beyond these experimental challenges, few well-characterized analytical frameworks exist for discovering and genotyping genetic variants across a targeted region using multiplexed short-read sequence data from multiple individuals. In this chapter, we present an experimental and analytical approach for simultaneous sequencing of multiple individuals using DNA indexes on the Illumina Genome Analyzer (GA I and II). The basic workflow includes PCR amplification of the region of interest for all samples, followed by quantification and purification of individual amplicons. After quantification, all amplicons are pooled for each sample to obtain 5 µg of total amplicon pool. The amplicon pools are then randomly fragmented, the ends of the amplicon fragments are blunt end-repaired, and an adenosine is added to the 3′ end of each amplicon fragment. An adenosine-to-thymine ligation then attaches unique, bar-coded adaptor oligonucleotides (Table 1) to both ends of the fragments in each amplicon pool, creating uniquely identifiable samples. Post-ligation, the uniquely indexed amplicon pools are combined into a single collection, which is referred to as a “library.” The resulting library is PCR-enriched, size-selected, quantified, and diluted prior to sequencing on the Illumina GA.

Table 1 Full oligonucleotide sequences of indexed adaptors


2. Materials

2.1. Amplification of Targeted Regions

2.1.1. Amplification of Targeted Regions Using Long-Range PCR

1. PfuUltra II HS DNA polymerase kit (Stratagene; La Jolla, CA).
2. 96-Well PCR plates.
3. Molecular biology grade water.
4. Multichannel pipettes.
5. Disposable reagent reservoirs.

2.1.2. Quality Control and Quantification Using PicoGreen

1. Quant-iT PicoGreen dsDNA kit (Invitrogen; Carlsbad, CA).
2. Costar 96-well, black, flat-bottom plates (Corning Life Sciences; Lowell, MA).
3. Molecular biology grade water.
4. Multichannel pipettes.
5. Aluminum foil.
6. Single-channel pipettes (P1000, P200, P20, and P10).
7. Sterile glass bottle for 1× TE working solution (0.5 L).
8. Microcentrifuge.
9. Vortexer.
10. Sterile tubes (1.5 mL).
11. Falcon tubes (50 and 15 mL).
12. Disposable reagent reservoirs.
13. Human genomic DNA (Roche Applied Science; Indianapolis, IN).

2.2. Indexing Adaptor Preparation

1. Annealing buffer: 1 M Tris–HCl, pH 7.8.
2. 0.5 M NaCl.
3. Molecular biology grade water.

2.3. Indexed Library Preparation

1. Covaris S2 system (Covaris; Woburn, MA).
2. Covaris microTube with AFA fiber and Snap-Cap with presplit Teflon/silicone/Teflon septa (Covaris; Woburn, MA).
3. EB buffer: 10 mM Tris-Cl, pH 8.5.
4. Molecular biology grade water.
5. T4 DNA ligase buffer with 10 mM ATP.
6. dNTP Mix (10 mM each).
7. T4 DNA polymerase (3 U/µL).
8. DNA polymerase I, large (Klenow) fragment (5 U/µL).
9. T4 polynucleotide kinase (10 U/µL).
10. NEB2 buffer (New England Biolabs; Ipswich, MA).
11. dATP (10 mM).
12. Klenow fragment (3′→5′ exo–) (5 U/µL).
13. T4 DNA ligase, 400 U/µL (NEB; Ipswich, MA).
14. Qiaquick Gel Extraction kit (Qiagen; Germantown, MD).
15. Purelink 96 PCR purification kit (Invitrogen; Carlsbad, CA).
16. Agarose.
17. Dark Reader (Clare Chemical Research; Dolores, CO).
18. GelStar DNA stain (Lonza; Rockland, ME).
19. Loading buffer (50 mM Tris, pH 8.0, 40 mM ethylenediaminetetraacetic acid, 40% (w/v) sucrose).
20. Low molecular weight DNA ladder (NEB; Ipswich, MA).
21. Thermal bath or heat block.
22. Phusion High-Fidelity PCR Master Mix with HF buffer (NEB; Ipswich, MA).
23. iQ SYBR Green Supermix (BioRad; Hercules, CA).

2.4. Analysis of Sequence Data and Validation of Analysis Results

1. Linux operating system.
2. High-performance, multiple-core computer.
3. Compiled Illumina Genome Analyzer Pipeline.
4. Various Perl and Bash scripts.
5. R statistical package.

3. Methods

3.1. Amplification of Targeted Regions

3.1.1. Amplification of Targeted Regions Using Long-Range PCR

This discussion focuses on traditional single-plex PCR amplification (see Note 1). PCR is robust and easy to optimize with a variety of readily available, high-fidelity amplification products, which made it an attractive choice for this proof-of-principle study.

1. The regions of interest are amplified in a 25 µL reaction volume, using a 96-well format and comprising the following components: 75 ng PicoGreen-quantified template DNA, 1× PfuUltra buffer, 2 mM dNTPs (total), 400 nM each of forward and reverse primers, and 0.5 µL PfuUltra II HS DNA polymerase per reaction.
2. The cycling conditions include: a denaturation step at 95°C for 2 min, 30 cycles consisting of 95°C for 20 s, an annealing temperature specific to each primer for 20 s, and 68°C for 3 min, and ending with a final extension of 68°C for 5 min.
3. The initial PCR products are put through a second PCR reaction to generate a sufficient amount of template DNA. Second-round reaction conditions consist of 2 µL first-round amplification product, 2 mM dNTPs (total), 400 nM each of forward and reverse primers, 1.5 µL PfuUltra II HS DNA polymerase, and 1× PfuUltra buffer in a 100 µL final reaction volume. The thermocycler conditions are the same for both the first- and second-round amplification reactions.
4. Following second-round amplification, PCR products are purified using the Purelink 96-well PCR purification kit.

3.1.2. Quality Control and Quantification Using PicoGreen

1. The 20× TE buffer, included in the Quant-iT PicoGreen kit, is diluted to a 1× working concentration (see Note 2).
2. The eight-point standard curve is created by a serial dilution of the stock Roche DNA in 1.5 mL sample tubes with 1× TE buffer to a final concentration series of 0, 3.125, 6.25, 12.5, 25, 50, 100, and 200 ng/µL, respectively. Each point of the standard curve is assayed in triplicate and the median concentration is used for generating the curve.
3. The Costar 96-well, black assay plate is labeled with the sample/plate name.
4. The samples to be assayed are kept on ice throughout the procedure and 99 µL of 1× TE is added to all reaction wells of the Costar plate, including those assigned to standards.
5. Add 1 µL of DNA to each well containing 99 µL of 1× TE, including the diluted standard Roche DNA in the designated wells in the bottom two rows of the plate (see Note 3). Keep the plate covered with a plate seal while preparing the PicoGreen reagent until it is ready to be added to the plate.
6. Dilute the stock PicoGreen reagent (150 µL of stock reagent to 19.85 mL 1× TE). Prepare the dilution in a plastic Falcon tube and cover the tube with aluminum foil to prevent photodegradation. PicoGreen reagent has a light orange color.
7. Pour the diluted PicoGreen reagent into a large trough and add 100 µL PicoGreen reagent to each well of the Costar plate containing 1× TE and DNA. Mix the 200 µL solution with a multichannel pipette. Remove bubbles by popping them with sterile pipette tips, as they can influence the fluorescent signal detected by the reader.
8. Cover the plate with aluminum foil and let stand for 5 min at room temperature.
9. Read the fluorescent signal in a plate reader. DNA-bound fluorescent dye is excited at 480 nm and intensity is measured at 520 nm.
10. Sort concentrations for each sample and calculate median concentrations for all amplicons (see Note 4).
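The standard-curve arithmetic in this subheading can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-fold dilution series and the use of triplicate medians come from the protocol, whereas the linear fit of fluorescence against concentration, the function names, and the idea of fitting the medians of the raw readings are assumptions made for the example; plate-reader software may do this differently.

```python
from statistics import median


def dilution_series(top_conc, n_points):
    """Two-fold dilution series, lowest first, with a 0 blank as the first point."""
    diluted = [top_conc / 2 ** k for k in range(n_points - 1)]
    return [0.0] + sorted(diluted)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_curve(concs, triplicate_readings):
    """Fit fluorescence against concentration using the median of each triplicate.

    Assumption: the response is linear over the range of the standards.
    """
    medians = [median(reps) for reps in triplicate_readings]
    return fit_line(concs, medians)


def to_concentration(fluorescence, slope, intercept):
    """Convert a sample reading to the concentration units of the standards."""
    return (fluorescence - intercept) / slope


# Eight-point series from the protocol: 0, 3.125, ..., 200 (ng/µL of the standard).
standards = dilution_series(200.0, 8)
```

Because standards and samples are diluted identically in the wells (1 µL into 99 µL of 1× TE), the fitted line maps a sample reading directly onto the concentration scale of the undiluted standards.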

3.2. Indexing Adaptor Preparation

1. Hydrate the bar-coded adaptor oligos (Table 1) with 10 mM Tris–HCl, pH 7.8 to a 100 µM stock concentration.
2. Prepare the 10× annealing buffer containing 100 mM Tris–HCl, pH 7.8 and 0.5 M NaCl.
3. For each uniquely indexed adaptor, prepare an oligo mix from the 100 µM forward and reverse stock with 10× annealing buffer for a final annealing buffer concentration of 1×. Each forward and reverse adaptor oligo should have a final concentration of 350 nM (see Note 5).
4. Subsequently, a step-down annealing reaction is performed using the thermocycler of choice, where the indexed adaptor mix is incubated at 95°C for 2 min, followed by a series of cooling steps of 1°C/min from 95 to 25°C (see Note 6).
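The step-down annealing profile in step 4 can be written out explicitly before it is entered into a thermocycler. The short Python sketch below is illustrative only: it assumes that "1°C/min" is programmed as a 60 s hold at each whole degree from 94 down to 25°C, and the function name and the tuple format are hypothetical, not tied to any instrument's software.

```python
def step_down_program(start_c=95, end_c=25, initial_hold_s=120, step_s=60):
    """Return a list of (temperature in degrees C, hold time in seconds) steps.

    Assumption: cooling at 1 degree C/min is approximated by 1-degree steps
    held for 60 s each, after the initial 2 min hold at the start temperature.
    """
    program = [(start_c, initial_hold_s)]
    for temp in range(start_c - 1, end_c - 1, -1):
        program.append((temp, step_s))
    return program


program = step_down_program()
total_minutes = sum(seconds for _, seconds in program) / 60
```

Under these assumptions the program contains 71 steps and runs for 72 min in total, which is a convenient figure when scheduling adaptor preparation for many indexes.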

3.3. Indexed Library Preparation

3.3.1. Amplicon Pooling and Fragmentation

1. Pool captured amplicons, based on concentration obtained from the PicoGreen quantitation, to give 5 µg of total template DNA per sample (see Note 7).
2. Dilute template to 100 µL with 10 mM Tris (EB buffer) and pipette into a Covaris microTube with AFA fiber and Snap-Cap.
3. Using the following settings on the Covaris S2 system, shear the samples to about 500 bp length: duty cycle, 5%; intensity, 3; cycles per burst, 200; total time, 90 s.
4. Concentrate postfragmentation samples to 75 µL using a speed vacuum (see Note 8).
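The pooling in step 1 reduces to a volume calculation per amplicon. The following Python sketch assumes the equal-mass scheme described in Note 7 (for example, ten amplicons contributing 500 ng each to a 5 µg pool); the function name, the dictionary layout, and the example concentrations are hypothetical.

```python
def pooling_volumes(conc_ng_per_ul, total_ng=5000.0):
    """Volume (µL) of each amplicon needed so that all contribute equal mass.

    conc_ng_per_ul: mapping of amplicon name -> PicoGreen concentration (ng/µL).
    total_ng: total template mass wanted in the pool (5 µg = 5000 ng).
    """
    if not conc_ng_per_ul:
        raise ValueError("no amplicons supplied")
    per_amplicon_ng = total_ng / len(conc_ng_per_ul)
    volumes = {}
    for name, conc in conc_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"amplicon {name} has no measurable DNA")
        volumes[name] = per_amplicon_ng / conc
    return volumes


# Made-up concentrations for two amplicons of one sample.
example = pooling_volumes({"amplicon_1": 100.0, "amplicon_2": 50.0}, total_ng=1000.0)
```

If amplicon lengths differ widely, equal mass is not the same as equal molarity; scaling each target mass by amplicon length would be one way to adjust for that, although the protocol does not specify it.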

3.3.2. End-Repair and A-Tailing

1. Perform blunt end-repair on fragmented pools using T4 DNA polymerase, DNA polymerase I Klenow fragment, and T4 polynucleotide kinase (see Note 9).
2. Purify end-repaired samples using the Purelink 96-well PCR purification kit as described and elute in 32 µL of EB buffer (see Note 10).
3. Perform dATP incorporation on the blunt-ended amplicons with Klenow fragment (3′→5′ exo–), as described in the Illumina Paired-End Sequencing Sample Preparation Guide (8).
4. Purify the A-tailed DNA using the Purelink 96 PCR purification kit and elute in 50 µL EB buffer. Concentrate the pure product using a speed vacuum and rehydrate the sample to 10 µL with EB buffer.

3.3.3. Indexed Adaptor Ligation and Size Selection

1. Dilute stock T4 DNA ligase to a 1 U/µL working stock with T4 DNA ligase buffer containing 10 mM ATP.
2. To each purified A-tailed sample, add a uniquely indexed, previously annealed adaptor in a 96-well format with 5 U of T4 DNA ligase and T4 DNA ligase buffer with 10 mM ATP for a total reaction volume of 50 µL (see Note 11).
3. Perform the ligation reaction overnight in a thermocycler with an initial 2 h ligation step at 20°C, followed by a 16 h ligation step at 16°C (see Note 12).
4. Following ligation, pool bar-coded amplicons for all individuals into a single tube. This mixture is now referred to as a library.
5. Purify each library using a Purelink PCR purification column and elute in 30 µL of EB buffer.
6. Prepare a 2% TAE gel with GelStar DNA stain and load 30 µL purified library plus 20 µL loading buffer onto the agarose gel. Include a low molecular weight DNA ladder. Run the gel at 110 V for 90 min.
7. Cut bands representing an insert size determined by the Covaris-sheared fragments from the gel and purify the bands using Qiagen Minelute Gel Extraction columns as described in the kit manual (see Note 13).

3.3.4. Enrichment and Sequencing of Indexed Library

1. PCR-enrich the purified library with the Phusion High-Fidelity PCR Master Mix with HF Buffer in several replicate reactions to obtain sufficient library for multiple sequencing runs and to avoid sequencing preferentially amplified fragments. The reaction volume is 50 µL, which includes 25 µL of the Phusion High-Fidelity PCR Master Mix, 1–10 µL of ligation product (based on total input DNA amount for sample preparation), and 1 µL each of the forward and reverse PCR primers. Use the Illumina sample preparation guide (8) for suggested PCR primers.
2. Cycle under the following conditions: 30 s at 98°C; 12 cycles of 10 s at 98°C, 30 s at 71°C, and 30 s at 72°C; followed by 5 min at 72°C and a final hold at 4°C.
3. Pool replicate PCR reactions into a single tube and purify using a Purelink PCR purification column.
4. Load the PCR-enriched, purified library on a 2% TAE agarose gel for selection against primer-dimers. The gel extraction follows the same procedure as the postligation gel extraction described earlier.
5. Determine the concentration of the library using quantitative PCR (qPCR) (see Note 14). The reaction conditions are an initial denaturation at 95°C for 5 min, followed by 40 cycles of 95°C for 20 s and 65°C for 20 s. The reaction mixture is prepared in a total volume of 20 µL with 10 µL iQ SYBR Green Supermix, 5 µL sterile water, 4 µL qPCR primer mix (each primer at 0.5 µM in the final primer mix), and 1 µL of PCR-enriched, diluted library.
6. Dilute the quantified, ready-to-sequence library to 10 nM prior to loading onto the flowcell.

3.4. Analysis of Sequence Data

1. Because the standard Illumina Genome Analyzer pipeline does not natively support indexed runs, a few modifications are required prior to sequence data analysis. These changes involve editing the cycle number from 2 to 7 in the forward direction, and a similar change in the reverse direction, which will depend on the number of cycles being run. This is necessary because the default pipeline assumes the second base will be a random base and builds the X_matrix.txt files based on this. However, with an indexed run, the first five bases are not random, and therefore, the pipeline has to be programmed to look for the first real random base, which is the seventh in this particular case. These changes are made in all Makefiles that are generated in the Firecrest and Bustard directories after the following command is run:

   /path/to/pipeline/GAPipeline-X.X.X/bin/goat_pipeline.py /path/to/run_directory

   Once the edits are in place, start the pipeline with a recursive make command:

   cd /path/to/firecrest_directory
   nohup make recursive -j 8 &

2. The GA Pipeline output is reformatted to allow for “per base” processing, which involves creating a file where each line is a reference location and the columns contain information on bases matched and/or mismatched in either direction for the total number of reads that span/cover that base position.

   1     2     3    4                                     5    6    7   8   9/10
   563   132   g    gnl|2|hg18_chr7_102910426102910790    19   12   0   0
   564   133   g    gnl|2|hg18_chr7_102910426102910790    18   12   1   0   A
   565   134   a    gnl|2|hg18_chr7_102910426102910790    19   8    0   1   T
   566   135   c    gnl|2|hg18_chr7_102910426102910790    18   9    0   0
   567   136   g    gnl|2|hg18_chr7_102910426102910790    18   9    0   0
   568   137   a    gnl|2|hg18_chr7_102910426102910790    19   8    0   1   C


In the above example, the columns are as follows:

   1. Line number
   2. Base position within that particular region
   3. Reference base
   4. Name of region
   5. Number of matches in forward direction
   6. Number of mismatches in forward direction
   7. Number of matches in reverse direction
   8. Number of mismatches in reverse direction
   9. Visual representation of matches and mismatches in forward direction
   10. Visual representation of matches and mismatches in reverse direction

3. A Bayes factor algorithm is applied for variant discovery. This algorithm is implemented in R. For each base and for each person (bar-code), the numbers of matches and mismatches are counted. At each position, two models are compared:

\[ K = \frac{p(x \mid M_2)}{p(x \mid M_1)} \]

The first model ($M_1$) defines the probability that the error rate at that position, considering all indexes, follows a binomial distribution:

\[ M_1 = \mathrm{binom}\left(X_i, Y_i, \frac{\sum X_i}{\sum Y_i}\right) \]

The second model ($M_2$) defines the probability that the error rate for an index follows a three-bin distribution as expected from a diploid variant with prior probabilities of 0.98, 0.49, and 0.015. However, the algorithm also allows for these priors to be adjusted. For each base position, a first pass of the algorithm bins each index into one of three categories (AA, AB, and BB) by selecting the one with the highest binomial probability value:

\[ \max\left[\mathrm{binom}(X_i, Y_i, \mathrm{dist}_{AA}),\ \mathrm{binom}(X_i, Y_i, \mathrm{dist}_{AB}),\ \mathrm{binom}(X_i, Y_i, \mathrm{dist}_{BB})\right] \]

where $\mathrm{dist}_{AA} = 0.980$, $\mathrm{dist}_{AB} = 0.490$, and $\mathrm{dist}_{BB} = 0.015$. For each index, the priors are adjusted by 0.001 depending on which distribution it fits. During the second pass, the algorithm calculates the mean of the three binomial probabilities, which are then used to calculate the ratio ($K$). This ratio is the likelihood score for a base position to be a variant.
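The model comparison described in this step can be illustrated with a small, simplified sketch. It is written in Python rather than R and is not the authors' implementation. Several points are assumptions made for the example: $X_i$ is taken to be the number of reads matching the reference for index $i$ and $Y_i$ the total number of reads covering the position, binom is taken to be the binomial probability mass function, and the per-index adjustment of the priors by 0.001 is omitted, so the fixed values 0.980, 0.490, and 0.015 are used throughout. All function names are hypothetical.

```python
import math

# Expected reference-match fractions for the three diploid genotype bins
# (values from the text; the prior-adjustment step is NOT modelled here).
DIST = {"AA": 0.980, "AB": 0.490, "BB": 0.015}


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success rate p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def call_genotype(x, y, dist=DIST):
    """First pass: assign an index to the bin with the highest binomial probability."""
    return max(dist, key=lambda genotype: binom_pmf(x, y, dist[genotype]))


def bayes_factor(counts, dist=DIST):
    """K = p(x | M2) / p(x | M1) for one base position.

    counts: list of (X_i, Y_i) pairs, one per index (assumed: matches, total reads).
    M1: every index shares the pooled match rate sum(X_i) / sum(Y_i).
    M2: each index contributes the mean of the three genotype-bin probabilities.
    Sums of logarithms are used to limit floating-point underflow.
    """
    pooled = sum(x for x, _ in counts) / sum(y for _, y in counts)
    log_m1 = sum(math.log(binom_pmf(x, y, pooled)) for x, y in counts)
    log_m2 = sum(
        math.log(sum(binom_pmf(x, y, p) for p in dist.values()) / 3)
        for x, y in counts
    )
    return math.exp(log_m2 - log_m1)


# Made-up counts: four indexes at 20x coverage, the third looking heterozygous.
k_variant = bayes_factor([(20, 20), (19, 20), (10, 20), (20, 20)])
# Made-up counts: four indexes consistent with sequencing error only.
k_reference = bayes_factor([(20, 20), (19, 20), (20, 20), (20, 20)])
```

With these invented counts, the position containing a heterozygous-looking index yields K well above 1, whereas the position showing only a single mismatching read yields K below 1, which is the qualitative behavior expected of a variant score.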

\[ K = \frac{\prod_{i=1 \ldots n} \left( \left[ \mathrm{binom}(X_i, Y_i, \mathrm{dist}_{AA}^{new}) + \mathrm{binom}(X_i, Y_i, \mathrm{dist}_{AB}^{new}) + \mathrm{binom}(X_i, Y_i, \mathrm{dist}_{BB}^{new}) \right] / 3 \right)}{\prod_{i=1 \ldots n} \mathrm{binom}\left(X_i, Y_i, \sum X_i / \sum Y_i\right)} \]

3.5. Validation of Analysis Results

For validation, we recommend examining a few simple metrics. The GA Pipeline produces output files that contain reads from all indexes. The first step is to parse this output line by line to get a count of reads for each index, as well as a count for all non-index 5-mers, in order to produce some basic statistics on index performance. The output of the GA Pipeline can then be separated into smaller files, one for each index. One can then generate further statistics describing aligned vs. unaligned reads for each index. After SNP calling has been completed using the method described in the section above, comparing a set of known SNPs from dbSNP or similar databases to the experimentally identified markers can also provide validation.
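The index-counting step described above can be sketched as follows. This is a minimal Python illustration, not the Perl or Bash scripts used by the authors: it assumes the reads are already available as plain sequence strings with the five-base index at the start of each read, and the index sequences shown are invented placeholders rather than the adaptors of Table 1.

```python
from collections import Counter


def count_indexes(reads, known_indexes, index_len=5):
    """Tally reads per known index and per unexpected leading 5-mer.

    reads: iterable of read sequences (assumed to start with the index).
    known_indexes: set of expected index sequences (hypothetical here).
    Returns (known_counts, unknown_counts) as two Counters.
    """
    known_counts = Counter()
    unknown_counts = Counter()
    for read in reads:
        tag = read[:index_len].upper()
        if tag in known_indexes:
            known_counts[tag] += 1
        else:
            unknown_counts[tag] += 1
    return known_counts, unknown_counts


def fraction_assigned(known_counts, unknown_counts):
    """Fraction of all reads whose leading bases match an expected index."""
    assigned = sum(known_counts.values())
    total = assigned + sum(unknown_counts.values())
    return assigned / total if total else 0.0


# Invented reads and index sequences, for illustration only.
reads = ["ACGTATGGCC", "ACGTATTTAA", "TTGCATCCGG", "NNNNNTAAAA"]
known, unknown = count_indexes(reads, {"ACGTA", "TTGCA"})
```

A low assigned fraction, or one index that is strongly under-represented relative to the others, would suggest a problem with adaptor ligation or pooling for that sample.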

4. Notes

1. Targeted sequencing of dozens of individuals in a single sequencing run is slowed considerably by target capture. The most widely available method of target capture is PCR amplification followed by equimolar pooling of amplified regions. However, the cost and time associated with PCR amplification make it an inefficient and costly approach. Efforts are currently underway toward developing faster, cheaper, and more scalable methods for capturing regions of interest, including solid-, liquid-, and hybrid-phase target selection methods (9–12). However, difficulties associated with optimization and reproducibility currently prevent these methods from being true alternatives. As shown in Fig. 1, target capture is the most time- and resource-consuming step of the library preparation process, even compared with optimizing PCR primers, performing possibly hundreds of PCR reactions, and pooling captured amplicons. In our laboratory, sample preparation was performed as recommended by Illumina, but adjusted for high-throughput application. Illumina provides reagents for ten samples, which essentially allows for the preparation of either a single library of ten pooled samples or ten unique libraries. Because the goal of indexed sequencing is to pool dozens of individuals into a single library, several Illumina sample preparation kits would be required, which is uneconomical in large-scale studies. Indexing requires the ligation of adaptors with unique five-base index sequences to both ends of the DNA fragment to be sequenced. These modified adaptors and sample preparation reagents are purchased from third-party vendors. The oligonucleotide sequences for the adaptors, PCR primers, and sequencing primers were obtained by permission from Illumina.

Fig. 1. Schematic describing the preparation of indexed libraries.

2. If a large number of samples are to be assayed over a period of time, the 1× TE buffer can be made up in a large batch and stored in a labeled glass container at room temperature for an extended period of time.
3. The last two rows are dedicated to the diluted Roche standard. For clarity, a standard layout can be created for use in subsequent assays, which will decrease confusion and errors. A standardized layout also requires only a single protocol in a fluorescent dye reader such as the Bio-Tek KC4.
4. Following PCR amplification, determine the concentration of each sample to ensure that amplicons are pooled in equimolar


concentration. Doing so helps to lower the likelihood of uneven coverage between the target regions. The following instructions are based on high-throughput sample quantitation and thus are optimized for large-scale studies requiring several 96-well plates. The most critical part of the assay is the creation of the standard curve. The Roche DNA standard comes at a 200 ng/µL stock concentration, and, therefore, the DNA to be quantitated should be at a lower concentration. Prior to standard curve preparation, this can be checked by a quick assessment of the sample DNA concentration by absorbance reading at 260 nm. The sample DNA concentration should fall somewhere in the middle of the standard curve; thus, after the approximate concentration of the DNA is determined, all amplicons to be pooled should be diluted with sterile water to an approximate concentration of 100 ng/µL.
5. The indexed adaptor mix is prepared by combining 3.5 µL of 100 µM forward, indexed adaptor oligo with 3.5 µL of 100 µM reverse, indexed adaptor oligo, 10 µL of 10× annealing buffer, and 83 µL of sterile water. This step is performed for each of the bar-coded adaptors that are used to uniquely tag an individual DNA sample.
6. In this reaction, the forward and reverse, indexed adaptor oligos are hybridized to each other to create a double-stranded oligomer with a 3′ overhang on either strand, which allows A-T ligation to the A-tailed DNA fragment to be sequenced.
7. Essentially, if ten amplicons were amplified for each individual, we would want to combine the ten amplicons for Sample #1 so that each of the ten amplicons contributes 500 ng to the total of 5 µg library preparation template. If the study is based on 40 individuals over ten amplicons, we would expect to start library preparation with 40 samples, each containing 5 µg total template DNA with 500 ng of each of the ten amplicons combined. These 40 samples would then be sheared separately and labeled by a unique bar-code during the adaptor ligation step.
8. Random fragmentation of the PCR amplicons is a critical step in library preparation. The goal is to create indiscriminate DNA fragments of specific size with sticky ends. The randomization ensures that sequence traces do not preferentially align to regions determined by the location of the breaks in the DNA strands. The available methods include nebulization, enzymatic digestion, adaptive focused acoustics (AFA), and sonication. Nebulization is a rapid approach that uses compressed air in a closed chamber to randomly fragment DNA. Nebulization is not easily scalable, the fragment size distribution is relatively broad, and a significant portion of template


DNA is lost during fragmentation. The enzymatic approach is easily scalable and provides a fast and cheap approach for large-scale studies. Digestion by DNase I can be optimized for a narrow fragment size; however, reproducibility is heavily dependent upon reaction conditions and quality of template DNA. Sonication is a viable alternative; however, the limited number and inconsistent performance of high-throughput instruments make it problematic for processing large numbers of samples.
9. End-repair and A-tailing of randomly fragmented DNA samples are performed using reaction conditions recommended by Illumina (8). Modifications include the use of enzymes and buffers purchased from third-party vendors; however, reagents and buffers should have the same working concentrations in the reactions as suggested by the Illumina sample preparation guide. Between reaction steps, the reaction products are purified from the enzymes and buffers. We have investigated a few purification products to accommodate high-throughput purification requirements, including the Qiaquick 96 and Minelute 96 (both Qiagen; Germantown, MD), and the Purelink 96 PCR Purification kit (Invitrogen; Carlsbad, CA). Both Qiagen kits rely on a vacuum source to pull liquid through a filter matrix. The final elution in the Qiaquick 96 kit requires the elution buffer to pass through the matrix and the pure DNA to be washed into an elution plate. Using the Qiaquick 96 kit, we found some inconsistent volume recovery across the wells of the elution plate, which may be due to an uneven seal of the vacuum manifold. Similar problems were encountered with the Minelute 96 kit; however, this method requires elution from the top of the filter matrix. The Purelink kit allows for centrifugation-based purification, which, in our experience, results in a consistently uniform recovery of purified products across all wells of a purification plate. However, the availability of various purification methods (e.g., magnetic bead-based) and kits allows the researcher to optimize this very critical step according to personal preference.
10. During the end-repair and A-tailing steps, once the reaction product is purified, it is concentrated using a speed vacuum and rehydrated in the recommended volume for the next step in the sample preparation protocol.
11. A 10:1 molar ratio of adaptor to genomic DNA insert concentration is recommended for the ligation reaction. Thus, if sample preparation was started with 5 µg of DNA, 10 µL of bar-coded adaptor mix is added to the ligation reaction. When the input DNA is 2.5 µg, only 5 µL of bar-coded adaptor mix is added to the ligation reaction.


12. In this step, each DNA sample receives a unique bar-code. Because the targeted amplicons have already been pooled for each sample, all amplicons for a particular sample are ligated to a specific bar-coded adaptor. After this step, the samples can be pooled together in a single tube, thus creating a library. At this point, the desired coverage of the sequenced region becomes an important consideration. The number of samples pooled postligation and the size of the total targeted region (pooled amplicons) determine the final coverage.
13. Essentially, if Covaris fragmentation was performed to obtain a median fragment size of 500 bp, the gel cut after ligation should be 500 ± 100 bp. Removing a tighter band is recommended for increased read quality.
14. The use of a serially diluted PhiX Control sample (which is available for purchase from Illumina) is suggested for the creation of a standard curve for quantifying the library. The PhiX control has a known concentration of 10 nM and, under the same reaction conditions as the library, can be used as a reference for quantitation. Essentially, the 10 nM PhiX library should be diluted 10-, 100-, and 1,000-fold, and each dilution should be assayed in triplicate on a qPCR assay plate along with the library. The volume of the PhiX sample in the reaction mix is 1 µL, just like the library. The PCR-enriched library should be diluted 1:100, 1:1,000, and 1:10,000 prior to the assay and assayed in triplicate as well. We recommend loading a negative control reaction on the assay plate. Depending on the thermocycler used, a Ct value and/or actual concentrations can be obtained. Those Ct or concentration values of the diluted library sample that fall on the PhiX standard curve can be used along with the dilution factor to calculate an actual nanomolar concentration of the library. Because the sequencing protocol of Illumina is optimized for a 10 nM starting concentration for a library, we recommend that it is diluted to 10 nM before starting cluster generation on a sequencing flowcell.
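The quantitation outlined in Note 14 can be sketched as a small calculation. The Python example below is illustrative only: it assumes the usual linear relationship between Ct and the logarithm of template concentration, and all function names and Ct values are invented for the example; qPCR instrument software normally performs the equivalent fit.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def phix_standard_curve(concs_nm, cts):
    """Fit Ct against log10(concentration in nM) for the PhiX dilutions."""
    return fit_line([math.log10(c) for c in concs_nm], cts)


def library_concentration(ct, dilution_factor, slope, intercept):
    """Undiluted library concentration (nM) from the Ct of a diluted aliquot."""
    diluted_nm = 10 ** ((ct - intercept) / slope)
    return diluted_nm * dilution_factor


def volume_for_target(stock_nm, target_nm=10.0, final_volume_ul=10.0):
    """Volume of library (µL) to bring to final_volume_ul for the target concentration."""
    return target_nm * final_volume_ul / stock_nm


# PhiX at 10 nM diluted 10-, 100-, and 1,000-fold; Ct values are made up.
slope, intercept = phix_standard_curve([1.0, 0.1, 0.01], [10.0, 13.3, 16.6])
stock = library_concentration(ct=11.65, dilution_factor=1000, slope=slope, intercept=intercept)
```

Only library dilutions whose Ct falls within the range spanned by the PhiX standards should be passed to library_concentration, in line with the recommendation in Note 14; in practice the median Ct of each triplicate would be used.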

References 1. International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861 2. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678

3. Zondervan KT, Cardon LR (2007) Designing candidate gene and genome-wide case-control association studies. Nat Protoc 2:2492–2501 4. Milosavljevic A et al (2005) Pooled genomic indexing of rhesus macaque. Genome Res 15:292–301 5. Parameswaran P, Jalili R, Tao L, Shokralla S, Gharizadeh B, Ronaghi M, Fire AZ (2007) A pyrosequencing-tailored nucleotide barcode


design unveils opportunities for large-scale sample multiplexing. Nucleic Acids Res 35:e130 6. Meyer M, Stenzel U, Myles S, Prufer K, Hofreiter M (2007) Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 35:e97 7. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods 5:235–237 8. Paired-end sequencing sample preparation guide. Illumina, Inc. Part# 1005063 Rev. B

9. Albert TJ et al (2007) Direct selection of human genomic loci by microarray hybridization. Nat Methods 4(11):891–892 10. Porreca GJ et al (2007) Multiplex amplification of large sets of human exons. Nat Methods 4(11):931–936 11. Gnirke A et al (2009) Solution hybrid selection with ultra-long oligonucleotides for massively parallel, targeted sequencing. Nat Biotechnol 27(2):182–189 12. Ng SB et al (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–276

Part III Functional Characterization of Susceptibility Alleles and Loci

Chapter 8

Site-Directed Mutagenesis

Patricia E. Carrigan, Petek Ballar, and Sukru Tuzmen

Abstract

The technique of site-directed mutagenesis has been used to characterize gene and protein structure–function relationships, protein–protein interactions, binding domains of proteins, or active sites of enzymes for the last three decades. In this technique, a nucleotide sequence of interest is experimentally altered using synthetic oligonucleotides. The most commonly used approach is to use an oligonucleotide that is complementary to part of a single-stranded DNA template, but containing an internal mismatch to direct the mutation. In addition to single point mutations, this approach may also be used to construct multiple mutations, insertions, or deletions. As a result of its broad applicability in disease gene characterization studies, numerous commercial kits are now available, making this technique quick, straightforward, and reliable. In this chapter, we detail the steps involved in site-directed mutagenesis and highlight the essentials of this versatile technique based upon our experience.

Key words: DNA, Polymerase chain reaction, Site-specific, Mutagenesis, Template, Point mutation, Insertion, Deletion, Amino acid, Oligonucleotide, Restriction site, Efficiency, ssDNA, Transformation

1. Introduction

Mutagenesis is a powerful DNA methodology that is used to alter the nucleotide sequence of DNA to study its effect on gene or DNA function. Mutagenesis can be conducted in vivo (in studies of model organisms or cultured cells) or in vitro (typically using plasmid constructs), and mutations can be either generated at a specific site in a predetermined way (site-directed mutagenesis) or randomly incorporated in the DNA sequence of interest (1). In in vivo mutagenesis, gene targeting offers well-crafted, site-directed mutagenesis within living cells. In contrast, exposing male mice to high levels of a powerful mutagen such as ethyl nitrosourea and subsequently mating them produces a form of random mutagenesis, which can be

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_8, © Springer Science+Business Media, LLC 2011


important in generating new mutants (1). In vitro mutagenesis, on the other hand, can involve essentially random approaches to mutagenesis, which may be valuable in producing libraries of new mutants. In addition, if a gene has been cloned and a functional assay of the product is available, it is also very useful to be able to employ a form of in vitro mutagenesis that results in the alteration of a specific amino acid or small component of the gene product in a predetermined way (1). Several approaches to this technique have been published, which utilize either single- (2–5) or double-stranded DNA as template (6). Methods employing single-stranded DNA (ssDNA) as template are technically labor-intensive and time-consuming due to the prerequisite tasks of subcloning and ssDNA rescue. In contrast, there are several kits on the market using different strategies to perform site-directed mutagenesis with dsDNA as a template, all of which are performed with mutagenic primers. One of the most commonly used systems is the QuikChange™ system, a product of Stratagene (Fig. 1). This system utilizes a PCR procedure, which first denatures the DNA template, then anneals and extends the mutagenic primers with a proprietary DNA polymerase. A unique feature of this system is removal of the parental DNA with Dpn1 endonuclease, which digests methylated or hemimethylated sequences in parental DNA. The next step following mutation of the sequence of interest is transformation into competent cells, which are able to perform nick repair. A similar approach using mutagenic primers and digestion of parental DNA

Fig. 1. Overview of site-directed mutagenesis method using Dpn1 digestion of parental strand (Stratagene’s strategy). Mutant strand synthesis: perform thermal cycling to (1) denature the DNA template, (2) anneal mutagenic primers containing the desired mutation, and (3) extend the primers with PfuUltra DNA polymerase. Dpn I digestion of template: digest parental methylated and hemimethylated DNA with Dpn I. Transformation: transform the mutated molecule into competent cells for nick repair.


Fig. 2. Overview of site-directed mutagenesis method using in vitro methylation (Invitrogen’s strategy).

is used in the GeneTailor™ system of Invitrogen (Fig. 2); in this procedure, however, there is no in vitro digestion step: instead, the template DNA is methylated with DNA methylase before the PCR step. After transformation into host cells, the linear mutated DNA is circularized, and the methylated parental DNA is digested by the McrBC endonuclease of the host cells. Another difference is that the QuikChange™ system requires two mutated primers (forward and reverse), whereas in the GeneTailor™ system, only one of the two primers needs to contain the target mutation. Neither the QuikChange™ nor the GeneTailor™ system requires special vectors, host strains, or restriction sites. Other site-directed mutagenesis systems, such as GeneEditor™ (Promega, Inc.), use multiple transformations and can only be used with cloning vectors containing ampicillin resistance. This system uses an antibiotic selection method based on selection oligonucleotides that anneal to a region of the ampicillin resistance gene (TEM-1 β-lactamase gene). The wild-type (wt) TEM-1 enzyme degrades ampicillin readily, while degrading some other


β-lactam antibiotics much less efficiently. Several mutations in the β-lactamase gene alter the substrate specificity of the enzyme, increasing resistance to other β-lactam antibiotics. The selection oligonucleotides are designed to create a mutant β-lactamase, and selection is done with a mixture of other β-lactam antibiotics. The efficiency of the system is augmented by an initial transformation into repair-minus BMH 71-18 mutS cells. This strain avoids selection against the desired mutation. A second transformation into JM109 competent cells ensures proper segregation of plasmids, resulting in a high proportion of mutant plasmids. In this system, alkaline denaturation of dsDNA is required prior to annealing of the mutagenic oligonucleotide and the selection oligonucleotide. Following hybridization, the oligonucleotides are extended with DNA polymerase. In this process, it is important to isolate template DNA from a modification (+) Escherichia coli K12 strain (e.g., DH5α, XL-1 Blue, JM109, etc.), or it will be restricted by the modification (+) BMH 71-18 mutS used in this system (Fig. 3).

Fig. 3. Overview of site-directed mutagenesis method using mix antibiotic selection (Promega’s strategy). Panel labels: plasmid of interest with Amp resistance (insert, Ampr); alkaline denaturation; (1) mutagenic primer for gene of interest; (2) mutagenic primer for ampicillin resistance gene; synthesize the mutant strand; transformation; antibiotic selection; mutant (Ampr + new resistance).


Fig. 4. Overview of primers and PCR steps of overlap extension for site-directed mutagenesis (adapted from ref. 7).

Another commonly used and cheaper method is PCR-driven overlap extension (Fig. 4). The overlap extension method enables the amplification of the target gene from genomic DNA or cDNA without a requirement of cloning the gene of interest into a plasmid (7). Segments of the target gene are amplified from template DNA using two flanking master primers (A and D) that label the 5′ ends of both strands and two internal primers (B and C) that introduce the mutation of interest (asterisk in Fig. 4). Internal primers B and C must both contain the desired mutation and must also create overlapping nucleotide sequences. The first PCR yields the AB and CD segments. During the second PCR, these segments are denatured, and the strands with complementary 3′ ends created by mutagenic primers B and C anneal and are extended. The AD fragment can range from about 100 nucleotides up to two kilobases, depending on the fidelity and function of the polymerase used. Although strands of the AB and CD segments also rehybridize, without primers B and C to increase their copy number throughout the PCR cycles, they are not the major products of this PCR. The primary limitation on accuracy and on the size of the AD product is the efficiency of the DNA polymerase used for PCR. This method is useful when usable restriction enzyme sites are located close to the target region (to be mutated) of the wild-type (wt) gene cloned in the vector of interest. In this case, primers A and D can be designed accordingly, and the PCR product AD can be subcloned into the vector after the original region of the wt gene is removed with restriction enzymes. In any case, the final PCR product should be sequenced to confirm the accuracy of the mutations created.
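The way the AB and CD segments combine into the AD product can be illustrated with a short string-based sketch. It models only the top-strand sequences and is not a simulation of PCR; the function name `join_by_overlap`, the minimum overlap of 15 bases, and all sequences are hypothetical and chosen purely for illustration.

```python
def join_by_overlap(ab, cd, min_overlap=15):
    """Merge two top-strand fragments whose ends share an overlapping sequence.

    The 3' end of AB and the 5' end of CD are assumed to carry the same
    overlap (introduced by internal primers B and C), so AD is AB followed
    by the part of CD that lies beyond the overlap.
    """
    longest = min(len(ab), len(cd))
    for k in range(longest, min_overlap - 1, -1):
        if ab[-k:] == cd[:k]:
            return ab + cd[k:]
    raise ValueError("fragments do not share an overlap of the required length")

# Invented sequences: upstream region, mutated overlap, downstream region.
upstream = "ATGGCCAAGCTTGACCTG"
overlap = "GATGATTTTTAAGGGAT"   # would be contributed by primers B and C
downstream = "CCGGAATTCTAA"

ab_fragment = upstream + overlap
cd_fragment = overlap + downstream
ad_fragment = join_by_overlap(ab_fragment, cd_fragment)
print(len(ab_fragment), len(cd_fragment), len(ad_fragment))
```

A check of this kind can be run on the designed primer and template sequences before ordering oligonucleotides, to confirm that the planned overlap is long enough and unambiguous.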


In vitro site-directed mutagenesis is an invaluable technique for studying protein structure–function relationships and gene expression, and for performing vector modifications. In this chapter, the commonly used site-directed mutagenesis procedure of Stratagene, as well as the overlap extension method, will be discussed in detail. The goal of this chapter is to provide a thorough and clear description of these methods. We also mention two additional commercially available kits offered by Promega, Inc. (Madison, WI), and Invitrogen/Life Technologies (Carlsbad, CA), suggesting several options to choose from, depending on the purpose of the experiment. We use a leucine (Leu) to phenylalanine (Phe) conversion in the interleukin 2 gene (IL2) as a test example to illustrate the site-directed mutagenesis method with the understanding that the technique is amenable for any mutation desired in a sequence of interest.

2. Materials

2.1. Reagents for Stratagene-Patented Site-Directed Mutagenesis

1. PfuUltra® high-fidelity DNA polymerase.
2. 10× Reaction buffer.
3. Dpn1 endonuclease.
4. dNTP mix.
5. HPLC-purified primers.
6. Deionized water.
7. Template DNA.
8. Competent cells (e.g., XL1-Blue, DH5α, and JM109).
9. Oligonucleotides with desired mutation.
10. pUC18 control plasmid for transformation control.
11. pWhitescript control plasmid.
12. Oligonucleotide control primers in order to create a point mutation on the pWhitescript control plasmid.
Control primer #1: 5′-CCA TGA TTA CGC CAA GCG CGC AAT TAA CCC TCA C-3′
Control primer #2: 5′-GTG AGG GTT AAT TGC GCG CTT GGC GTA ATC ATG G-3′

2.2. Reagents for Overlap Extension Method

1. HPLC-purified primers.
2. Deionized water.
3. Template DNA.
4. High-fidelity PCR system (DNA polymerase and buffer with MgCl2).
5. dNTP mix.
6. T4 DNA ligase.
7. Restriction enzymes with their compatible buffers.
8. DH5α competent cells.
9. Plates with appropriate selection antibiotics.
10. Agarose, ethidium bromide, and 1× TAE buffer.

3. Methods

3.1. Site-Directed Mutagenesis Using the Stratagene Method

The method presented below outlines the design of oligonucleotides for single and multiple amino acid changes per reaction, PCR cycling conditions, Dpn1 enzymatic digestion of the PCR product, transformation of the Dpn1-digested PCR product into the bacterial strain XL1-Blue, isolation of the newly transformed DNA, and confirmation of the successful incorporation of the mutation into the target sequence.

3.1.1. Mutagenic Primer Design

The mutagenic oligonucleotide primers must be designed individually according to the desired mutation. Both of the mutagenic primers must contain the desired mutation and anneal to the same sequence on opposite strands of the plasmid. Primers should be between 25 and 45 bases in length, with a melting temperature (Tm) of ≥78°C. Primers longer than 45 bases may be used, but longer primers increase the likelihood of secondary structure formation, which may affect the efficiency of the mutagenesis reaction. The following formula is commonly used for estimating the Tm of primers (see Note 1):

Tm = 81.5 + 0.41(%GC) − 675/N − % mismatch

where N is the primer length in bases, and the values for %GC and % mismatch are whole numbers.

The desired mutation (deletion or insertion) should be flanked by at least 10–15 bases of unmodified sequence on both sides. AT-rich regions result in low Tm and/or secondary structure in the oligonucleotide. To stabilize annealing at the 3′ end of the oligonucleotide, the primers should optimally have a minimum GC content of 40% and should terminate in one or more C or G bases. If possible, identify a restriction enzyme site in or close to the desired mutagenic site in the original sequence when designing primers. Creating a new restriction site or disrupting an existing restriction site while mutating the sequence is highly recommended. This is especially useful for prescreening the bacterial clones for positive clones.
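The primer design rules in this subheading lend themselves to a quick numerical check. The sketch below applies the Tm formula given above to the control primers listed in Subheading 2.1; the function names are hypothetical, and treating the control primer as carrying a single mismatched base is an assumption made only for illustration.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    s = primer.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)

def estimate_tm(primer, n_mismatch):
    """Tm = 81.5 + 0.41(%GC) - 675/N - %mismatch, with whole-number percentages."""
    n = len(primer)
    pct_gc = round(gc_percent(primer))
    pct_mismatch = round(100.0 * n_mismatch / n)
    return 81.5 + 0.41 * pct_gc - 675.0 / n - pct_mismatch

def reverse_complement(seq):
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def ends_in_gc(primer):
    """True if the 3' end of the primer is a G or C."""
    return primer[-1].upper() in "GC"

# Control primers from Subheading 2.1, written without spaces.
control_1 = "CCATGATTACGCCAAGCGCGCAATTAACCCTCAC"
control_2 = "GTGAGGGTTAATTGCGCGCTTGGCGTAATCATGG"

# Assumption for illustration: one mismatched base (a point mutation).
tm = estimate_tm(control_1, n_mismatch=1)
print(f"length={len(control_1)}, %GC={gc_percent(control_1):.0f}, Tm={tm:.1f}")
print("primers are reverse complements:", reverse_complement(control_1) == control_2)
print("3' end is G/C:", ends_in_gc(control_1), ends_in_gc(control_2))
```

The reverse-complement comparison reflects the requirement that both mutagenic primers anneal to the same sequence on opposite strands of the plasmid.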


3.1.1.1. Additional Primer Considerations

The mutagenesis protocol uses 125 ng of each oligonucleotide primer. To convert nanograms to picomoles of oligo, use the following equation:

\[ X\ \text{pmol of oligo} = \frac{\text{ng of oligo} \times 1{,}000}{330 \times \text{number of bases in oligo}} \]

For example, for 125 ng of a 25-mer:

\[ \frac{125\ \text{ng of oligo} \times 1{,}000}{330 \times 25\ \text{bases}} \approx 15\ \text{pmol} \]

Primers need not be 5′-phosphorylated, but must be purified either by fast polynucleotide liquid chromatography (FPLC) or by polyacrylamide gel electrophoresis (PAGE). Failure to purify the primers results in a significant decrease in mutation efficiency. It is important to keep primer concentration in excess. Stratagene suggests varying the amount of template while keeping the concentration of the primer constantly in excess.

3.1.2. Site-Directed Mutagenesis PCR Amplification (See Note 2)
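Before the reactions are assembled in the steps that follow, the primer amounts can be checked numerically. The sketch below implements the nanogram-to-picomole conversion from Subheading 3.1.1.1 and the stock volume needed for 125 ng of primer; the function names are hypothetical, and the 100 ng/µL stock is the control primer concentration quoted in this chapter.

```python
def ng_to_pmol(ng, n_bases):
    """pmol of oligo = ng x 1,000 / (330 x number of bases)."""
    return ng * 1000.0 / (330.0 * n_bases)

def primer_volume_ul(target_ng, stock_ng_per_ul):
    """Volume of primer stock (in microliters) that contains target_ng."""
    return target_ng / stock_ng_per_ul

pmol_25mer = ng_to_pmol(125, 25)       # worked example from the text
pmol_34mer = ng_to_pmol(125, 34)       # 34-mer control primers
volume = primer_volume_ul(125, 100)    # 100 ng/uL stock

print(f"125 ng of a 25-mer = {pmol_25mer:.1f} pmol")
print(f"125 ng of a 34-mer = {pmol_34mer:.1f} pmol")
print(f"volume of a 100 ng/uL stock for 125 ng = {volume} uL")
```

Because the mass is fixed at 125 ng, longer primers contribute fewer picomoles, which is worth keeping in mind when the primer is meant to stay in excess over the template.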

1. Synthesize two complementary oligonucleotides containing the desired mutation, flanked by unmodified nucleotide sequence. Purify these oligonucleotide primers before use in the following steps (see Subheading 3.1.1). To change a Leu to a Phe and an aspartic acid (Asp) to a lysine (Lys) in IL2, the following primer sequences were used:
Forward primer: 5′-cagatgattttTaaGggatta-3′
Reverse primer: 5′-taatccCttAaaaatcatctg-3′
2. Prepare the control reaction as indicated below:
5 µL of 10× reaction buffer
2 µL (10 ng) of pWhitescript 4.5-kb control plasmid (5 ng/µL)
1.25 µL (125 ng) of oligonucleotide control primer #1 [34-mer (100 ng/µL)]
1.25 µL (125 ng) of oligonucleotide control primer #2 [34-mer (100 ng/µL)]
1 µL of dNTP mix
38.5 µL of ddH2O (to bring the final reaction volume to 50 µL)
Then add: 1 µL of PfuUltra HF DNA polymerase (2.5 U/µL).
3. Prepare the sample reaction(s) as indicated below (see Note 3):
5 µL of 10× reaction buffer
X µL (10–100 ng) of dsDNA template
X µL (125 ng) of oligonucleotide primer #1
X µL (125 ng) of oligonucleotide primer #2
1 µL of dNTP mix
ddH2O to a final volume of 50 µL
Then add: 1 µL of PfuUltra HF DNA polymerase (2.5 U/µL) (see Note 4).

3.1.3. PCR Parameters for the Mutagenesis Reaction

1. Cycle each reaction using the parameters outlined in Table 1. For the control reaction, use a 2.5-min extension time (see Notes 5 and 6).
2. Adjust segment 2 of the cycling parameters according to the type of mutation desired (see Table 2).
3. Following temperature cycling, place the reaction on ice for 2 min to cool the reaction to ≤37°C.
4. Optional: Check the amplification on agarose gel (see Note 7).
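The cycling rules in steps 1 and 2 (with the values given in Tables 1 and 2 and the plasmid-length rule of Note 5) can be written down as a small helper. This is a sketch only: the function name, the dictionary keys, and the example vector and insert sizes are hypothetical.

```python
# Number of cycles for segment 2, by type of mutation (values from Table 2).
CYCLES_BY_MUTATION = {
    "point_mutation": 12,
    "single_amino_acid_change": 16,
    "multiple_amino_acid_deletion_or_insertion": 18,
}

def cycling_program(vector_kb, insert_kb, mutation_type):
    """Build the two-segment program of Table 1 for a given plasmid.

    Plasmid length is vector plus insert (Note 5); extension is 1 min per kb
    at 68 degrees C.
    """
    plasmid_kb = vector_kb + insert_kb
    extension = f"{plasmid_kb:g} min"
    return [
        {"segment": 1, "cycles": 1, "steps": [(95, "30 s")]},
        {
            "segment": 2,
            "cycles": CYCLES_BY_MUTATION[mutation_type],
            "steps": [(95, "30 s"), (55, "1 min"), (68, extension)],
        },
    ]

# Hypothetical 3.2-kb vector carrying a 1.8-kb insert, single amino acid change.
program = cycling_program(3.2, 1.8, "single_amino_acid_change")
for segment in program:
    print(segment)
```

Keeping the rules in one place makes it easy to regenerate the program when the insert size or the type of mutation changes.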

3.1.4. Dpn1 Digestion of the Amplified Products

1. Add 1 µL of the provided Dpn I restriction enzyme (10 U/µL) directly to each amplification reaction.
2. Gently and thoroughly mix each reaction mixture by pipetting the solution up and down several times. Briefly spin down the reaction mixtures and then immediately incubate at 37°C for 1 h to digest the parental (i.e., the nonmutated) supercoiled dsDNA (see Note 8).

Table 1 Illustration of the cycling parameters for the Stratagene’s strategy

| Segment | Cycles | Temperature (°C) | Time |
| 1 | 1 | 95 | 30 s |
| 2 | 12–18 | 95 | 30 s |
| | | 55 | 1 min |
| | | 68 | 1 min/kb of plasmid length (a) |

Cycling parameters for the QuikChange II site-directed mutagenesis method
(a) For example, a 5-kb plasmid requires 5 min at 68°C per cycle

Table 2 The cycling parameters according to the type of mutation desired

| Type of mutation desired | Number of cycles |
| Point mutations | 12 |
| Single amino acid changes | 16 |
| Multiple amino acid deletions or insertions | 18 |

3.1.5. Transformation of XL1-Blue Competent Cells

1. Gently thaw the XL1-Blue competent cells on ice (see Notes 9 and 10). For each control and sample reaction to be transformed, aliquot 50 µL of the ultra-competent cells to a prechilled 14-mL BD Falcon polypropylene round-bottom tube.
2. Transfer 1–3 µL of the Dpn I-treated DNA from each control and sample reaction to separate aliquots of the ultra-competent cells (see Note 11). Swirl the transformation reactions gently to mix and incubate the reactions on ice for 30 min.
3. Heat-pulse the tubes in a 42°C water bath for 45 s, and then incubate the tubes on ice for 2 min (see Note 12).
4. Add 0.5 mL of preheated (42°C) NZY+ broth to each tube, then incubate the tubes at 37°C for 1 h with shaking at 225–250 rpm.
5. Plate the appropriate volume of each transformation reaction, as indicated in Table 3, on agar plates containing the appropriate antibiotic for the plasmid vector. For the mutagenesis and transformation controls, spread cells on LB-ampicillin agar plates containing 80 µg/mL X-gal and 20 mM IPTG (see Subheading “Preparing the Agar Plates for Color Screening”).
6. Incubate the transformation plates at 37°C for >16 h.

3.1.6. Expected Results

1. Transformation control
The expected colony number from the transformation of the pWhitescript 4.5-kb control mutagenesis reaction is >100 colonies. Greater than 85% of the colonies should contain the mutation and appear as blue colonies on agar plates containing IPTG and X-gal (see Note 13). If transformation of the pUC18 control plasmid was performed, >50 colonies (>10^9 cfu/µg) should be observed, with >98% having the blue phenotype.
2. Sample transformation
The expected colony number depends upon the base composition and length of the DNA template utilized. For suggestions on increasing colony number, see “Notes”. The insert of interest should be sequenced to verify that selected clones contain the desired mutation(s).

Table 3 Transformation reaction plating volumes

| Reaction type | Volume to plate |
| pWhitescript mutagenesis control | 250 µL |
| pUC18 transformation control | 5 µL (in 200 µL of NZY+ broth) (a) |
| Sample mutagenesis | 250 µL on each of two plates (entire transformation reaction) |

(a) Place a 200-µL pool of NZY+ broth on the agar plate, pipet the 5 µL of the transformation reaction into the pool, then spread the mixture

3.1.7. Mutation Confirmation

After the DNA has been isolated from the transformed cells, it will need to be sequenced to confirm that the Leu (TTG) was changed to Phe (TTT). Because IL2 was cloned into pGem11Zf(+), T7 and SP6 primers can be used for sequencing the change in nucleotides.
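Once sequencing reads are available, the codon change can be confirmed programmatically by comparing the wild-type and mutant coding sequences codon by codon. The sketch below uses the standard genetic code; the function name and the short in-frame sequences are invented for illustration and are not the IL2 sequence.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codon_changes(wild_type, mutant):
    """List (codon number, wt codon, wt residue, mutant codon, mutant residue)."""
    if len(wild_type) != len(mutant) or len(wild_type) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    changes = []
    for i in range(0, len(wild_type), 3):
        wt_codon = wild_type[i:i + 3].upper()
        mut_codon = mutant[i:i + 3].upper()
        if wt_codon != mut_codon:
            changes.append(
                (i // 3 + 1, wt_codon, CODON_TABLE[wt_codon],
                 mut_codon, CODON_TABLE[mut_codon])
            )
    return changes

# Hypothetical in-frame fragments carrying a TTG (Leu) to TTT (Phe) change.
wt_fragment = "ATGTTGGAT"
mutant_fragment = "ATGTTTGAT"
print(codon_changes(wt_fragment, mutant_fragment))
```

Any change reported outside the intended codon would point to an unwanted secondary mutation and a clone that should be discarded.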

3.1.8. Recipes of Required Solutions

3.1.8.1. LB Agar (Per Liter)

1. 10 g of NaCl
2. 10 g of tryptone
3. 5 g of yeast extract
4. 20 g of agar
5. Add deionized H2O to a final volume of 1 l
6. Adjust pH to 7.0 with 5 N NaOH
7. Autoclave
8. Pour into Petri dishes (~25 ml/100-mm plate).

3.1.8.2. LB-Ampicillin Agar (Per Liter)

1. 1 l of LB agar 2. Autoclave 3. Cool to 55°C 4. Add 100 mg of filter-sterilized ampicillin 5. Pour into Petri dishes (25 ml/100-mm plate).

3.1.8.3. NZY+ Broth (Per Liter)

1. 10 g of NZ amine (casein hydrolysate) 2. 5 g of yeast extract 3. 5 g of NaCl 4. Add deionized H2O to a final volume of 1 l 5. Adjust pH to 7.5 using NaOH 6. Autoclave 7. Add the following filter-sterilized supplements prior to use: 12.5 ml of 1 M MgCl2 12.5 ml of 1 M MgSO4 20 ml of 20% (w/v) glucose (or 10 ml of 2 M glucose).

3.1.8.4. Preparing the Agar Plates for Color Screening

Add 80 µg/mL 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (X-gal), 20 mM isopropyl-1-thio-β-D-galactopyranoside (IPTG), and the appropriate antibiotic to the LB agar. Dissolve IPTG in sterile dH2O and X-gal in dimethylformamide. Alternatively, spread 100 µL of 2% X-gal and 100 µL of 10 mM IPTG on LB agar plates 30 min before plating the transformation. Do not mix the IPTG and X-gal solutions before pipetting them onto plates, since they might precipitate.

3.1.8.5. TE Buffer

1. 10 mM Tris–HCl (pH 7.5) 2. 1 mM EDTA.

3.2. Site-Directed Mutagenesis by the Overlap Extension Method

3.2.1. Primer Design

Internal primers (B and C) contain the mutagenic and surrounding nucleotides. These primers are 18–35 nucleotides long. When the aim is an insertion into the gene of interest, the desired insertion sequence should be placed at the 3′ end of each internal primer. For deletions, the overlapping ends of the internal primers should span the junction of the coding sequence, excluding the target nucleotides. Flanking primers (A and D) contain both restriction enzyme sites and DNA coding sequence. The flanking primers should be approximately 18–35 nucleotides long, and at least ten nucleotides of each primer should match the DNA coding sequence (see Note 14).

3.2.2. PCR to Generate AB and CD Fragments

Prepare the sample and control reactions as indicated in Table 4. Cycle each reaction using the parameters outlined in Table 5 (see Note 15).

Table 4 PCR components for negative controls and AB and CD fragments

| Components | AB | CD | Negative control for AB | Negative control for CD |
| 10× PCR buffer with MgCl2 | 5 µL | 5 µL | 5 µL | 5 µL |
| 10× dNTP | 5 µL | 5 µL | 5 µL | 5 µL |
| Template DNA | 50–125 ng | 50–125 ng | – | – |
| Primer A | 50 pmol | – | 50 pmol | – |
| Primer B | 50 pmol | – | 50 pmol | – |
| Primer C | – | 50 pmol | – | 50 pmol |
| Primer D | – | 50 pmol | – | 50 pmol |
| High-fidelity Taq polymerase mix | 3.5 U | 3.5 U | 3.5 U | 3.5 U |
| Water | To 50 µL | To 50 µL | To 50 µL | To 50 µL |


Table 5 The cycling parameters for amplification of fragments

| Cycle number | Temperature (°C) | Time |
| 1 | 95 | 2 min |
| 30 | 95 | 1 min |
| | 50–60 | 1 min |
| | 72 | 1 min/kb |
| 1 | 72 | 7 min |

Table 6 PCR components for negative control and AD fragment

| Components | AD fragment | Negative control |
| 10× PCR buffer with MgCl2 | 5 µL | 5 µL |
| 10× dNTP | 5 µL | 5 µL |
| AB fragment | 50–125 ng | – |
| CD fragment | 50–125 ng | – |
| Primer A | 50 pmol | 50 pmol |
| Primer D | 50 pmol | 50 pmol |
| High-fidelity Taq polymerase mix | 3.5 U | 3.5 U |
| Water | To 50 µL | To 50 µL |

3.2.3. Isolation and Purification of Products AB and CD

Isolation of AB and CD fragments is performed by gel extraction method. For this purpose, 1% (w/v) agarose gel (containing ethidium bromide) should be prepared. After placing the polymerized gel into an electrophoresis apparatus and covering the gel with 1× TAE buffer, PCR samples, and DNA ladder should be loaded into the gel. Run the gel at 80–100 V and then confirm the expected size of AB and CD segments by comparing with DNA ladder. Carefully excise the bands of interest with a clean razor blade and place them into microfuge tubes. Purification of the segments can be performed using any gel extraction method.

3.2.4. PCR for AD Fragment

See Table 6 for PCR preparation of the AD fragment. In PCR of AD fragment, same cycling conditions can be used as in reaction for AB and CD fragments (outlined in Table 5) (see Note 15). It is important to measure and calculate the amount of AB and CD fragments by spectrophotometer (see Note 16).


Table 7 Digestion reaction components for control and AD fragment

| Components | AD fragment (insert) | Vector |
| 10× Buffer (enzyme specific) | 5 µL | 5 µL |
| AD fragment (insert) | ≅Half volume of AD product | – |
| Vector | – | 1–5 µg |
| Enzyme 1 | 2 U/µg DNA | 2 U/µg DNA |
| Enzyme 2 | 2 U/µg DNA | 2 U/µg DNA |
| Water | Up to 50 µL | Up to 50 µL |

3.2.5. Isolation and Purification of AD Fragment

AD fragment isolation and purification should be performed as previously described for the isolation and purification of the AB and CD fragments.

3.2.6. Digestion and Ligation of the AD Product

1. The digestion reaction to cut purified AD fragment should be set up depending on the requirements of the appropriate restriction enzymes. While double digestion is a commonly used method (see reaction in Table 7), some enzymes require sequential digestion. All reactions must be performed on ice. 2. After digesting the desired bands, purify both the digested AD product and the vector DNA to remove unwanted excessive nucleotides. Purification may be performed using any gel extraction method or PCR purification kit. 3. By following the protocol recommended by the manufacturer, ligation of digested vector DNA and AD fragment should be performed. While some kits require overnight ligation at 14°C, for others 30-min ligation at room temperature would be sufficient (see Note 17). For a control reaction, perform the same ligation reaction without adding the insert.

3.2.7. Transformation and Plating Bacteria

Transformation of the ligation reaction into DH5α cells should be done using the manufacturer’s recommended protocol. For transformation, 4–10 µL of the ligation reaction can be used. After transformation, plate the bacteria on plates with the appropriate antibiotic selection. Incubate the transformation plates at 37°C for >16 h.

3.2.8. Verifying the Insert Sequence of the AD Fragment

There are several methods to verify the insertion of the AD fragment, such as hybridization of a radiolabeled oligonucleotide specific for the mutated sequence, or isolation of DNA after transformation with commercially available kits (e.g., from Qiagen, Inc.) followed by sequencing of the DNA.


4. Notes

1. The Tm should be calculated using the nearest-neighbor method. For primers longer than 20 nucleotides, Tm should be 3°C higher than the Tm of the lower Tm primer. Primers should optimally have a Tm between 65 and 72°C.
2. Ensure that the plasmid DNA template is isolated from a dam+ E. coli strain. The majority of the commonly used E. coli strains are dam+. Plasmid DNA isolated from dam (−) strains (e.g., JM110 and SCS110) is not suitable. To maximize temperature-cycling performance, use thin-walled tubes, which ensure ideal contact with the temperature cycler’s heat blocks. The following protocols were optimized using thin-walled tubes.
3. A series of sample reactions using various amounts of dsDNA template ranging from 5 to 50 ng (e.g., 5, 10, 20, and 50 ng of dsDNA template) should be set up while keeping the primer concentration constant.
4. It is highly recommended to add Pfu Turbo polymerase after the temperature reaches 95°C. Hot start modification in the polymerase prevents the amplification of nonspecific products and unwanted degradation of primers before the first amplification cycle.
5. When the plasmid length is determined, it is important to add the length of the gene to the length of the empty plasmid.
6. Another option commonly used for PCR cycling is:

| Cycles | Temperature (°C) | Time |
| 1 | 95 | 1 min |
| 18 | 95 | 50 s |
| | 60 | 50 s |
| | 68 | 1 min/kb of plasmid length |
| 1 | 68 | 7 min |

7. If desired, amplification may be checked by electrophoresis of 10 µL of the product on a 1% agarose gel. A band may or may not be visualized at this stage. In either case, proceed with Dpn I digestion and transformation.
8. Addition of Dpn1 buffer (10×) to the reaction mixture is always recommended. Reaction mixture components: 44 µL of PCR product, 5 µL of Dpn1 buffer, and 1 µL of Dpn1 enzyme.
9. It is important to store the XL1-Blue competent cells at −80°C to maintain efficiency.


10. In some cases, using DH5α competent cells gives higher efficiency.
11. As an optional control, verify the transformation efficiency of the XL1-Blue competent cells by adding 1 µL of 0.1 ng/µL pUC18 control plasmid to a 50-µL aliquot of competent cells.
12. The heat pulse has been optimized for transformation in 14-mL BD Falcon polypropylene round-bottom tubes. However, any 14–15-mL round-bottom tube will suffice.
13. The mutagenesis efficiency (ME) for the pWhitescript 4.5-kb control plasmid is calculated using the following formula:

\[ \text{ME} = \frac{\text{Number of blue colony forming units (cfu)}}{\text{Total number of colony forming units (cfu)}} \times 100\% \]

14. As in all primer design, a GC content of at least 50% must be maintained, and secondary structure formation within primers must be avoided.
15. Annealing temperatures vary depending on primers. In general, higher temperatures increase the efficiency and accuracy of PCR.
16. Equal amounts of the AB and CD templates should be added to the PCR.
17. Using an excess of insert relative to the vector generally increases the efficiency of ligation. Several insert-to-vector ratios may be tried (1:1, 3:1, and 5:1). The amount of insert for a 1:1 ratio is calculated as follows:

\[ \text{Nanogram insert} = \frac{\text{Nanogram vector} \times \text{insert size (kb)}}{\text{Vector size (kb)}} \]

18. Troubleshooting I: Problem: Low mutagenesis efficiency or low colony number with the control reaction.
a. Super-competent cells: ensure cells are stored at −80°C.
b. Thermal cycler: adjust cycling parameters for the control reaction and repeat the protocol for samples.
c. Agar plates: prepare them again, carefully checking the IPTG, X-gal, and antibiotic concentrations.
d. Insufficient growth period: to visualize the blue phenotype of the control reaction, plates must be incubated for at least 16 h at 37°C.
e. Multiple freeze–thaw cycles of dNTP mix: aliquot the dNTP mix as single-use aliquots and store at −20°C.
19. Troubleshooting II: Problem: Low transformation efficiency or low colony number.
a. Insufficient mutant DNA synthesized in the reaction: increase the amount of Dpn1-treated DNA used in transformation up to 4 µL.
b. The quality and quantity of DNA template: visualize the DNA template on an agarose gel. The template DNA must be at least 80% supercoiled. An excessive amount of target plasmid in PCR can result in background transformants.
c. Creating a large mutation might give low numbers of colonies. However, most of the colonies will contain mutagenized plasmid.
d. Precipitate the Dpn1-digested PCR product with ethanol and resuspend in a decreased volume of water before transformation.
20. Troubleshooting III: Problem: False positives.
a. Poor primer quality: to verify, radiolabel the primers and check for degradation on an acrylamide gel. Resynthesize the primers if required.
b. Increase the annealing temperature in the PCR cycle up to 68°C to increase the stringency of the reaction.
21. Troubleshooting IV: Problem: Poor yield of AB and CD segments.
a. The polymerase enzyme might not be functioning properly. Use a new tube of enzyme.
b. Check the accuracy of primers.
c. Primers might not be functioning due to a high annealing temperature. Lower the annealing temperature used in PCR.
d. Extension time might be too short.
e. Template DNA quality: visualize the DNA template on an agarose gel. The template DNA must be at least 80% supercoiled.
22. Troubleshooting V: Problem: Multiple or incorrect size of AB and CD segments.
a. Primer quality: verify the correct design and sequence of primers.
b. Nonspecific priming might occur due to a low annealing temperature. Increase the annealing temperature used in PCR.
c. MgCl2 concentration might be too high.
d. This problem might occur if cDNA is used and differentially spliced genes are present. Cut out the correct size product and ignore other products. After obtaining mutated DNA, sequencing the DNA is required.
23. Troubleshooting VI: Problem: Improper AD segment size.
a. Concentration of AB and CD template might be too high.
b. Annealing temperature might be incorrect.


24. Troubleshooting VII: Problem: Additional unwanted mutations in AD segment.
a. Original DNA template might have a preexisting mutation.
b. The Taq polymerase used in PCR might be of poor quality.

Acknowledgements
The authors would like to acknowledge Pinar Tuzmen for her critical review and her continued support. Additionally, the authors acknowledge Agilent Technologies, Stratagene Products Division, Life Technologies Corporation, Invitrogen Products Division, and Promega Corporation for their permission to reference their Site Directed Mutagenesis kits, and for the use of protocols, figures, and related information included in the manuals of the kits as the source of the materials and methods.

References
1. Strachan T, Read AP (1999) Human molecular genetics, 2nd edn. BIOS Scientific Publishers, Oxford
2. Kunkel TA (1985) The mutational specificity of DNA polymerase-beta during in vitro DNA synthesis. Production of frameshift, base substitution, and deletion mutations. J Biol Chem 260:5787–5796
3. Sugimoto M, Esaki N, Tanaka H, Soda K (1989) A simple and efficient method for the oligonucleotide-directed mutagenesis using plasmid DNA template and phosphorothioate-modified nucleotide. Anal Biochem 179:309–311
4. Taylor JW, Ott J, Eckstein F (1985) The rapid generation of oligonucleotide-directed mutations at high frequency using phosphorothioate-modified DNA. Nucleic Acids Res 13:8765–8785
5. Vandeyar MA, Weiner MP, Hutton CJ, Batt CA (1988) A simple and rapid method for the selection of oligodeoxynucleotide-directed mutants. Gene 65:129–133
6. Papworth C, Bauer JC, Braman J, Wright DA (1996) Site-directed mutagenesis using double-stranded plasmid DNA templates. Strategies 9:3–4
7. Heckman KL, Pease LR (2007) Gene splicing and mutagenesis by PCR-driven overlap extension. Nat Protoc 2:924–932

Chapter 9
Gene Expression Profiling of Tissues and Cell Lines: A Dual-Color Microarray Method
Sonsoles Shack

Abstract
Since its origin in the mid-1990s, gene expression profiling by microarray has become a productive and useful tool in basic science and preclinical research. Current dual-color, high-density cDNA oligo arrays contain 60-mer detectors for the whole human genome. With this powerful technology, expression of RNA samples from cell lines or tissue can be assessed, revealing specific gene expression signatures. The technique includes three major steps: (1) isolation and purification of RNA from cells or tissues, (2) labeling of total RNA, and (3) hybridization with Agilent cDNA microarrays. Conveniently, this technique can be performed with as little as 50 ng of purified total RNA; however, it is important to keep in mind that the quality of the RNA template, namely the level of sample degradation and the presence of contaminants that are carried over from the starting material or introduced during RNA isolation, can significantly impact the efficiency of the labeling reaction and the reliability of the hybridization. In this chapter, the details of each step of this technique are explained thoroughly, while highlighting the key issues that can prevent a failed hybridization.

Key words: Gene expression profiling, Microarrays, RNA, RNA extraction, cRNA, DNA, cDNA, Cyanine dyes, Cyanine-3, Cyanine-5, Moloney Murine Leukemia Virus, VivaSpin, T7 Polymerase, Reverse transcription, In vitro transcription, Hybridization, Agilent, Agilent scanner, Feature extraction

1. Introduction
As described in the first publications on the subject, microarrays were prepared by high-speed robotic printing of complementary DNAs (cDNA) on glass (1–3). The current generation of high-quality arrays avoids many of the sources of error inherent in the previous process by direct chemical synthesis of oligonucleotides on the surface of a glass slide. The arrays used today contain oligonucleotides, approximately 60 bp in length, complementary to

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_9, © Springer Science+Business Media, LLC 2011


selected regions of genes that represent the full human genome (4–6). The rationale of this powerful technology is based on the principle of nucleic acid hybridization, wherein fluorescently labeled RNA sequences bind with complementary DNA (immobilized on the glass slide), and the number of labeled sequences hybridized to the detectors is proportional to the original expression level of each RNA species, allowing this level to be estimated on the basis of fluorescence detected at each hybridization spot. In this chapter, we explain in detail how high-quality RNA can be isolated (from cells and frozen tissues), quantitated, amplified, labeled, and finally hybridized with high-density 60-mer cDNA microarrays. To start with a good RNA template is perhaps the most critical element in gene expression profiling. Specifically, good RNA template quality, integrity, and lack of macromolecular contamination are essential for efficient labeling and reliable hybridization results. RNA extraction, using the method described here, is superior to the classical alcohol precipitation method, because it eliminates the precipitation step, which enables the irreversible attachment of some types of contaminating molecules, especially polysaccharides to the nucleic acid template, and thus prevents the optimal functioning of the enzymes that synthesize the cDNA and build the fluorescently labeled cRNA strands. Here, we focus on the Agilent Quick Amp method of labeling and the hybridization to cDNA Agilent expression microarrays. We prefer this platform because it offers a two-color method, in which two samples can be compared, such as sample vs. reference, treated vs. untreated cells, or diseased vs. normal tissue. Finally, by means of mathematical and statistical tools provided in the Feature Extraction software, we can build a database and determine which genes show statistically significant changes in expression. 
With this approach, gene expression profiling provides invaluable insight into the complex changes in gene expression patterns, such as those in multifaceted diseases like cancer. These discoveries can later be validated with other readily available technologies, such as quantitative PCR, protein analysis by Western blotting, or immunohistochemistry, and thus become a record of the status of the cell or tissue at the time of collection.

2. Materials
2.1. Total RNA Extraction and Purification (Materials Common to Extraction from Both Cells and Tissues)

1. TRIzol® (Invitrogen; Chicago, IL). Store at 4°C. Warning: TRIzol contains phenol (see Note 1).
2. Chloroform. Warning: toxic organic solvent and chemical hazard; handle with the same precautions as TRIzol.
3. 200 Proof ethanol, from Warner Graham Co. or similar.


4. 70% Ethanol (prepared with DEPC-treated or molecular-grade, nuclease-free water).
5. TE pH 8.0 (molecular grade).
6. DNase-, RNase-free tubes, not autoclaved (see Note 2).
7. Chemical fume hood.
8. 65°C Heat block (see Note 3).
9. RNeasy Mini Kit or Micro kit for total RNA isolation (Qiagen, Valencia, CA).
10. Tabletop microcentrifuge (not refrigerated).
11. Agilent spectrophotometer and quartz cuvette.
12. RNAse Away solution (Molecular BioProducts, cat. #: 7002).
13. Nuclease-free, filtered pipet tips, ranging from 2 to 1,000 µL.
14. Pipetors to fit the above tips, handled with gloved hands and preferably designated for RNA-related work only.
15. Individually wrapped, sterile serological pipettes, sizes 2–25 mL.
16. Electronic pipetor for serological pipettes.
17. Vivaspin 500 columns, 30,000 MWCO PES (Sartorius, cat. #: VS0122).
18. Ice and ice bucket or tray.
19. Powder-free nitrile gloves (to minimize particulates contaminating samples during RNA purification and labeling).
20. Vortexer.
21. Agilent 8453 UV–visible spectrophotometer (Santa Clara, CA).
22. −80°C Freezer.

2.1.1. For Cell Lines only

1. Cell cultures – ideally in logarithmic growth phase and subconfluent (see Note 4). 2. Cell lifters (Fisher Scientific; Pittsburgh, PA).

2.1.2. For Frozen Tissues only

1. Fresh frozen tissue from biopsy or surgical material (see Note 5). 2. Heavy-duty razor blades or disposable scalpels. 3. Sterile plastic disposable dishes (Petri dish or cell culture), 60 or 100 mm in diameter. 4. Small weigh dish (preferably antistatic). 5. Borosilicate glass tubes for use with Covaris S-2 tissue processor, 16 mm × 100 mm (Fisher Scientific, cat. #: 14962-26F).


6. Covaris S-2 tissue processor, connected to a computer, and Multitemp III Thermostatic circulating water bath (Amersham Biosciences, cat. #: 18-1102-77).
7. Dry ice.
8. Wide and shallow ice tray for dry ice.

2.2. RNA Amplification and Synthesis of Fluorescent cRNA

1. 50–2,000 ng of total RNA purified as described in Subheading 3.
2. Sterile 0.2-mL PCR tubes, RNase- and DNase-free.
3. Quick Amp Labeling Kit (Agilent Technologies, cat. #: 51900447).
4. Cyanine-3-CTP and Cyanine-5-CTP fluorescent dyes, 10 mM (Agilent Technologies or Perkin Elmer).
5. Thermal cycler such as Peltier Thermal Cycler PTC-200, from MJ Research.
6. Mini RNeasy Kit from Qiagen.
7. Molecular-grade, nuclease-free water or DEPC water.
8. 200 Proof ethanol.
9. Nuclease-free 1.5-mL centrifuge tubes.
10. Nuclease-free 200-µL PCR tubes or tube strips.
11. Nuclease-free pipet tips with filters (see previous section).
12. Pipetors to fit above tips (see previous section).
13. Nonrefrigerated tabletop microcentrifuge.
14. Vortexer.

2.3. Hybridization of Labeled RNAs with 60-mer cDNA Microarrays

1. General disposable items described in previous sections: powder-free nitrile gloves, filter pipet tips, 0.2-mL PCR tubes, etc. 2. DEPC-treated or molecular-grade, nuclease-free water. 3. Ethanol, 200 proof, anhydrous. 4. Hybridization kit (Agilent Technologies, cat. #: 51885242). 5. PTC-200 Peltier Thermal cycler (MJ Research) or similar. 6. Agilent 60-mer cDNA microarrays (Agilent Technologies). 7. Gasket slides to fit the format of microarrays (Agilent Technologies). 8. Slanted-tip, teflon-coated tweezers (Bel Art Scienceware, cat. #: H379420000). 9. Hybridization chamber (Agilent Technologies, cat. #: G2534A). 10. Hybridization oven (Agilent Technologies, cat. #: G2545A).


11. Hybridization chamber rotator rack (Agilent Technologies, cat. #: G2530-60029). 12. Microarray slide holder (Agilent Technologies, cat. #: G250560525). 13. Gene Expression Wash Buffer 1 (Agilent Technologies, cat. #: 5188-5242). 14. Gene Expression Wash Buffer 2 (Agilent Technologies, cat. #: 5188-5326). 15. Staining dishes with covers (Thermo Shandon, cat. #: 121 – one complete staining assembly; and cat. #:122, four staining dishes, glass 110 × 92 × 67 mm). 16. Glass slide racks (Wheaton, cat. #: 900204). 17. 60-mm Dishes (Falcon, cat. #: 353002) or similar. 18. Water bath set up at 37°C. 19. Magnetic stir plate. 20. Magnetic stir bars. 2.4. Scanning of Microarrays and Data Analysis

1. Microarray scanner (Agilent Technologies, cat. #: G2505B). 2. Agilent scanner control software. 3. Feature Extraction software. 4. Microsoft Excel (Microsoft Office). 5. File Maker Pro or similar database software of choice.

3. Methods Below is an optimized and detailed method to purify RNA from cell lines and frozen tissues, followed by RNA labeling, and hybridization to a cDNA microarray. As noted, RNA integrity is vital to the success of this protocol. RNA can become degraded even before purification, often due to the presence of abundant RNAses in certain tissues or cells, or by the condition of a tissue at the time it was frozen. Unfortunately, once RNA is degraded, little or nothing can be done with that specimen to achieve reliable gene expression profiling results. As explained in detail below, quality control (QC) assessments will be performed at several stages, from the RNA extraction to the hybridization. Ultimately, it will be the hybridization QC analysis that will show if the process passed or failed. This final QC assessment, carried out on the quantitative results extracted from the fluorescent images of the hybridized slides, should include an evaluation of the median signal intensities (if they are above a minimal threshold) and of the reproducibility of the signal across the array, by comparing


the signals of replicates (ten replicates of ten genes) distributed throughout the array surface (5). In our experience, it is a good idea to first practice with the cells or tissue of interest, to become familiar with the most critical steps of the technique, test all reagents, and perform quality control assessment after each step. Practice is particularly important for RNA extraction. Yet even after becoming familiar with the process, we advise running a preliminary test of one or two samples when starting a new study (even more so if samples are from tissue) to assess the quality and yield of the sample and determine if they pass the RNA quality and labeling QCs. Moreover, careful planning is essential because special handling procedures may be necessary with some specimens, as in the case with rare or low abundance samples (i.e., those from fine needle aspiration biopsies) that require the use of the RNA Micro kit to optimize yield. 3.1. Harvesting Samples for RNA Extraction 3.1.1. Cell Lysates

1. Wearing powder-free nitrile gloves, thoroughly clean the working bench or chemical hood surface (see Note 1).
2. Prepare all supplies needed: place a bottle of TRIzol® in a large container of ice, and set up pipettes, tips, cell lifters, and labeled nuclease-free tubes.
3. Place one or two plates of adherent cells at a time on ice, and tilt the plates for a minute, while covered, so the medium pools at the bottom. Aspirate or pipette out the medium, and then place the plates horizontally on the ice (see Note 6).
4. Add the appropriate amount of TRIzol® (see Note 7). Make sure TRIzol® coats the entire plate surface.
5. Thoroughly scrape the culture plate with a cell lifter. Tilt the plate again, scrape the lysate down, and, using a filtered tip, transfer it to a 1.5-mL microcentrifuge tube that can withstand organic solvents. Do not vortex! (see Note 8).
6. If unable to proceed to RNA extraction right away, or if doing time points, freeze TRIzol® lysates at −80°C for up to 2 months.

3.1.2. Frozen Tissue

1. If the Covaris S-2 tissue processor is to be used for tissue homogenization, it will be necessary to turn on the system and start the degassing process first as shown below, as it usually takes 30 min to be ready for use. 2. Turn on the Multitemp III circulating waterbath, the Covaris S-2 tissue processor, and the computer program SonoLab. 3. Fill the water chamber of the Covaris unit with distilled water to the fill line. 4. Turn on the degassing process, which will also chill the water to about 17–19°C. Degassing will run during the entire homogenization process until the last sample is processed.


5. While wearing powder-free nitrile gloves, thoroughly clean the fume hood surface or work area with RNAse Away, and maintain a clean and dust-free work surface at all times.
6. Prepare a large tray with dry ice, and place on it a labeled borosilicate glass tube, a Petri (or tissue culture) plastic 60- to 100-mm dish, a pair of hemostats or serrated forceps, and scalpels or heavy-duty razor blades. These need to be kept ice cold at all times during the procedure.
7. Transfer the frozen tissue from the −80°C or liquid nitrogen freezer to a smaller tray or bucket with dry ice. Do not allow tissue to be out of the dry ice for any length of time, except to be transferred to the prechilled Petri dish (see Note 9).
8. Place frozen tissue in the prechilled Petri dish, and while holding it with the prechilled forceps, start cutting using the prechilled razor blades or scalpel, as if cutting "shavings." Every few seconds, chill razor and forceps again on a dry ice chip so they remain very cold. The resulting tissue pieces should not be thicker than 1 mm. Approximately 100–200 mg of tissue (about 20 mm³) should be sufficient for a yield of well over 10 µg of RNA. Place the uncut tissue portion back in its vial in dry ice until it is returned to the freezer.
9. Place a weigh dish on the dry ice, and wait a few seconds for it to chill. Tilt the plate containing the tissue "shavings" and tap it against the dry ice so they gather on the plate edge. Then, swiftly tap the plate again above the weigh dish so the tissue pieces will quickly fall onto it. Take great care that the tube is thoroughly chilled and upright before transfer, or else the tissue fragments will quickly melt against the tube wall as they enter it (see Note 10).
10. Once tissue is transferred into the tube, move to the Covaris station and open the program for tissue homogenization. Use the Covaris Sonolab program settings: 2–5%dc500mV100cb.tmt for 5 s; 2–20%dc500mV50cb.tmt for 15 s; 2–20%dc500mV100cb.tmt for 15 s; and 2–5%dc500mV100cb.tmt for 5 s.
11. Add 500 µL of RLT buffer (from the Qiagen RNeasy kit) to the glass tube, still in dry ice, containing the minced tissue, put the cap on, and quickly move it from the dry ice to the tube holder inside the Covaris S-2 water chamber, and start the cycle immediately (see Note 11).
12. When the cycle has finished, remove the tube, place on wet ice, and add 500 µL of TRIzol. Mix well (see Note 12). At this point, the sample can either remain on ice to continue with the RNA extraction or it can be frozen at −80°C until ready to proceed.
13. When all samples have been homogenized, shut down the Covaris system according to the manufacturer's instructions and proceed to RNA extraction.


3.2. RNA Extraction

1. First, thoroughly clean the surface of the work bench or fume hood with RNase Away or 70% ethanol. Keep the area dust-free and use nitrile gloves at all times.
2. Heat the TRIzol® or RLT/TRIzol® lysates in a 65°C heat block with water-filled wells for 5 min (see Note 3).
3. Shake samples vigorously for 5–10 s by hand; do not vortex (see Note 8).
4. Immediately, and while the TRIzol® lysates are still hot, add 200 µL of chloroform per 1 mL of TRIzol® reagent.
5. Vigorously shake the tubes for 30 s, but do not vortex.
6. Cool on ice for 5 min to allow for good phase separation. From this point on, keep all samples and reagents at room temperature.
7. Centrifuge samples at 10,000 × g for 10 min at room temperature. Samples will separate into three phases: the aqueous phase (clear) on the top, the thin interphase (white) in the middle, and the organic phase (pink) on the bottom (see Note 13).
8. Very carefully and slowly remove the aqueous phase with a pipette, making sure not to disturb or approach the interphase. Place the aqueous phase into a new, labeled 1.5-mL microcentrifuge tube (see Note 14).
9. Add 1 volume of room temperature 70% ethanol to the aqueous phase (see Note 15).
10. Add up to 700 µL of the aqueous phase/ethanol mixture to a Qiagen RNeasy Mini Column. Spin the Mini Column at 16,000 × g for 30 s. Put the flow-through back in the column and spin it one more time. This step will retain any RNA that was not captured the first time around. Discard flow-through. Repeat with the remaining aqueous phase/ethanol mixture.
11. To wash the column, add 700 µL Qiagen RW1 wash buffer to the Mini Column. Centrifuge at 16,000 × g for 30 s.
12. Transfer the Mini Column to a new 2-mL collection tube, discarding the previous collection tube and flow-through.
13. Add 500 µL Qiagen RPE wash buffer (reconstituted). Centrifuge at 16,000 × g for 30 s. Discard flow-through.
14. Add 500 µL Qiagen RPE wash buffer again. Centrifuge at 16,000 × g for 30 s. Discard flow-through. Spin column at 16,000 × g for 1 min to dry the column thoroughly.
15. Place the Mini Column in a new Qiagen 1.5-mL collection tube with cap. Add 30 µL RNase-free water to the center of the column without touching the membrane. Incubate at room temperature for 1–5 min (see Note 16). Centrifuge at 16,000 × g for 30 s. Do not discard eluate; it contains purified RNA.


16. Again add 30 µL of RNase-free water to each Mini Column. Incubate at room temperature for 1–5 min. Centrifuge at 16,000 × g for 1 min. Retain eluate, vortex to mix well, and, from this step on, keep RNA on ice.
17. To clean samples further, place eluted RNA in a VivaSpin 500 column, add 300 µL of DEPC-treated or nuclease-free water, and spin for 2–3 min or until the desired volume is reached. Repeat if desired. Then remove the sample from the VivaSpin column and discard flow-through. This step can also be used to concentrate samples, as RNA loss from this step is negligible.
18. Quantitate total RNA using a suitable spectrophotometer, such as the Agilent 8453 UV–visible spectrophotometer if available (see below).
19. Store RNA in aliquots in a −80°C freezer and avoid multiple freeze–thaw cycles.

3.3. RNA Quantitation and Quality Control Before Labeling

1. Absorbance measurement by spectrophotometry: RNA quantitation can be performed by diluting 2–4 µL of purified RNA in 100 µL of TE pH 8.0 or 1 mM NaOH. A260 should be between 0.1 and 1 to be within the linear range. A260 represents the absorbance of the RNA, and its concentration can easily be obtained with the formula: A260 × 40 × dilution factor × 10⁻³ = µg/µL of RNA (see Note 17). The ratio of the absorbances A260 and A280 should be 1.8–2.1 for the purity to be acceptable. A ratio outside this range indicates that the RNA is contaminated with proteins, salts, or other molecules.
2. Assessment of RNA integrity can be performed by running an agarose gel and comparing the intensity of the bands that correspond to 18S and 28S RNAs. The intensity of the 28S band should be twofold that of the 18S band. Alternatively, a more accurate analysis can be performed with the Agilent BioAnalyzer. Results from the BioAnalyzer will include a gel-like picture of the samples, an electropherogram or graph, and a calculated densitometry measurement of the 28S/18S ratio, which should ideally be 1.4–2.0. Frozen tissues from biopsy frequently have lower ratios, and can still provide reliable data when the ratio is 1 or higher.
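As a worked illustration of the quantitation and quality checks above, the following minimal Python sketch applies the concentration formula and the stated acceptance ranges. It is not part of the original protocol: the function names are hypothetical, and the dilution factor is assumed here to be (sample volume + diluent volume) / sample volume, which the text does not state explicitly.

```python
def dilution_factor(sample_ul: float, diluent_ul: float) -> float:
    """Assumed definition: final volume divided by sample volume."""
    return (sample_ul + diluent_ul) / sample_ul


def rna_concentration_ug_per_ul(a260: float, dilution: float) -> float:
    """A260 x 40 x dilution factor x 10^-3, giving ug/uL of RNA."""
    return a260 * 40.0 * dilution * 1e-3


def a260_in_linear_range(a260: float) -> bool:
    """The protocol asks for A260 between 0.1 and 1."""
    return 0.1 <= a260 <= 1.0


def purity_acceptable(a260: float, a280: float) -> bool:
    """The A260/A280 ratio should fall between 1.8 and 2.1."""
    return 1.8 <= a260 / a280 <= 2.1


def integrity_acceptable(ratio_28s_18s: float, minimum: float = 1.0) -> bool:
    """Ideal 28S/18S is 1.4-2.0; a ratio of 1 or higher can still be usable."""
    return ratio_28s_18s >= minimum


if __name__ == "__main__":
    # Hypothetical reading: 2 uL of RNA diluted in 100 uL of TE.
    factor = dilution_factor(2.0, 100.0)
    conc = rna_concentration_ug_per_ul(0.5, factor)
    print(f"dilution factor {factor:.0f}, concentration {conc:.2f} ug/uL")
    print("linear range:", a260_in_linear_range(0.5))
    print("purity OK:", purity_acceptable(0.5, 0.25))
```

Keeping each check as its own small function makes it easy to log which criterion a sample failed before committing it to labeling.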

3.4. RNA Labeling 3.4.1. cDNA Synthesis from Total RNA Template

1. Clean the working bench with RNAse Away. Keep surfaces and materials dust- and lint-free and use gloves at all times.
2. Following the Quick Amp labeling kit protocol (7), thaw the first set of components and spin them down briefly: the T7 Promoter Primer and the reagents for the first master mix, except for the enzymes (Moloney Murine Leukemia Virus (MMLV) RT and RNase inhibitor).


3. To a 200-µL PCR tube (or tube strip), add 50 ng to 2 µg of purified total RNA, the appropriate amount of T7 Promoter Primer (1.2 µL for up to 500 ng of RNA, or 1.2–2 µL for 500 ng to 2 µg of RNA), and DEPC water to a total of 11.5 µL.
4. Mix and spin down tubes briefly, and incubate in a thermal cycler to denature RNA: 65°C for 10 min (always use heated lid), and then 4°C for 5 min. Place tubes in a tube rack at room temperature.
5. To dissolve the 5× first strand buffer, warm the vial to 80°C in a heat block for 1–2 min. Vortex and briefly spin, then keep at room temperature.
6. Add the master mix components in the order indicated and maintain everything except the enzymes at room temperature (see Table 1, as indicated in the Agilent Quick Amp protocol). Avoid forming bubbles (see Note 18).
7. Add enzymes (MMLV RT and RNase inhibitor) last. Vortex lightly at a low-medium setting, pulsing several times to avoid forming bubbles, then spin and add 8.5 µL to each denatured RNA + T7 Promoter sample in the PCR tubes.
8. Incubate in a thermal cycler as follows: 40°C for 2 h, 65°C for 15 min (to inactivate MMLV RT), and 4°C for 5 min. Spin samples and proceed to subsequent steps.

3.4.2. In Vitro Transcription: Fluorescent cRNA Synthesis from cDNA Template and Incorporation of Cyanine-3- or Cyanine-5-CTP

1. Thaw the first components for in vitro transcription (Table 2), except the enzymes, and warm the 50% PEG solution at 40°C for 1 min. Spin down all component tubes briefly and keep at room temperature. 2. Make the master mix for the in vitro transcription, by adding components in the order shown in the table. Take enzymes

Table 1
cDNA synthesis master mix

Component                    Volume (µL/reaction)
5× First strand buffer       4.0
0.1 M DTT                    2.0
10 mM dNTP mix               1.0
MMLV RT                      1.0
RNase inhibitor              0.5
Total volume                 8.5

The components and volumes for a sample reaction are shown. Note that for multiple samples, it is recommended to increase the total volume of the master mix by about 10% to ensure there is enough mix for all samples


Table 2
In vitro transcription master mix

Component                    Volume (µL/reaction)
Nuclease-free water          15.3
4× Transcription buffer      20.0
0.1 M DTT                    6.0
NTP mix                      8.0
50% PEG                      6.4
RNase inhibitor              0.5
Inorganic pyrophosphatase    0.6
T7 RNA polymerase            0.8
Total volume                 57.6

The components and volumes for a sample reaction are shown. Note that for multiple samples, it is recommended to increase the total volume of the master mix by about 10% to ensure there is enough mix for all samples
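The per-reaction volumes in Tables 1 and 2, together with the recommended 10% excess for multiple samples, can be scaled as in the following minimal Python sketch. The function and variable names are hypothetical and the rounding to 0.01 µL is an assumption made for readability; the volumes themselves are those listed in the two tables, kept in the order of addition with the enzymes last.

```python
# Per-reaction volumes in microliters, in the order of addition (Tables 1 and 2).
CDNA_SYNTHESIS_MIX_UL = {
    "5x First strand buffer": 4.0,
    "0.1 M DTT": 2.0,
    "10 mM dNTP mix": 1.0,
    "MMLV RT": 1.0,
    "RNase inhibitor": 0.5,
}

IVT_MIX_UL = {
    "Nuclease-free water": 15.3,
    "4x Transcription buffer": 20.0,
    "0.1 M DTT": 6.0,
    "NTP mix": 8.0,
    "50% PEG": 6.4,
    "RNase inhibitor": 0.5,
    "Inorganic pyrophosphatase": 0.6,
    "T7 RNA polymerase": 0.8,
}


def scale_master_mix(per_reaction_ul, n_reactions, overage=0.10):
    """Scale each component for n_reactions plus a fractional overage."""
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    factor = n_reactions * (1.0 + overage)
    # Rounding to 0.01 uL is an illustrative choice, not a protocol requirement.
    return {name: round(volume * factor, 2)
            for name, volume in per_reaction_ul.items()}


if __name__ == "__main__":
    for name, volume in scale_master_mix(CDNA_SYNTHESIS_MIX_UL, 4).items():
        print(f"{name:<26}{volume:>7.2f} uL")
```

For four reactions, for example, the cDNA synthesis mix comes to 8.5 × 4 × 1.1 = 37.4 µL in total, of which 8.5 µL is still dispensed into each sample.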

briefly to ice tray, add last, and just before starting the reaction. Do not vortex T7 Polymerase.
3. Gently vortex the master mix on a low setting by pulsing and spin down briefly. Add 57.6 µL of mix to each sample containing the cDNA synthesized in the previous step.
4. Next, add to each tube either 2.4 µL of 10 mM Cyanine-3-CTP (pink) or 2.4 µL of 10 mM Cyanine-5-CTP (blue) (see Notes 19 and 20).
5. Place sample tubes in a thermal cycler at 40°C for 2 h.

3.4.3. Purification of Amplified and Labeled RNA

As recommended by Agilent, use Qiagen's RNeasy Mini Kit. Perform the following steps at room temperature.
1. To each sample, add 20 µL of nuclease-free water.
2. In a new set of 1.5-mL tubes, add 350 µL of buffer RLT. Then move each sample from the 200-µL PCR tube to the 1.5-mL tube and mix thoroughly with the RLT.
3. Add 250 µL of 100% ethanol and mix once more. Do not centrifuge tubes.
4. Load the 700 µL onto a labeled RNeasy mini column in a 2-mL collection tube.
5. Centrifuge the sample for 30 s at 16,100 × g. Because the labeled RNA is retained by the membrane, color should be visible if labeling was successful (pink for Cyanine-3 and blue for Cyanine-5).


6. Pass the flow-through through the column a second time (see Note 21). Spin again. This step will allow the membrane to retain RNA that may not have been captured the first time. Discard flow-through and collection tube.
7. Transfer the column to a new 2-mL collection tube. Add 500 µL of RPE and centrifuge columns at 16,000 × g for 30 s. Carefully discard flow-through, and place the column back into the same collection tube.
8. Repeat, adding 500 µL of RPE, and centrifuge columns at 16,000 × g for 30 s. Discard flow-through and collection tube, and place the column in a new collection tube.
9. Spin again for 1 min to dry columns.
10. Place columns in the 1.5-mL collection tube where the final labeled sample will be collected. Add 30 µL of RNase-free water, and let it rehydrate the silica-gel membrane for at least 1 min. Spin columns at 16,000 × g for 30 s.
11. Leave flow-through in the tube, add another 30 µL of water, wait for 1–5 min, and spin columns at 16,000 × g for 30 s.
12. Discard the RNeasy column. The total flow-through is the labeled RNA and it should be blue (if labeled with Cyanine-5) or pink (if labeled with Cyanine-3). As these are very clean preparations, they can be stored in the dark at 4°C for several days prior to use. If the samples will be used over longer periods of time, aliquot to avoid repeated freeze–thaw cycles, and store at −80°C. Transfer 4 µL to another tube to quantitate yield and labeling efficiency.

3.5. Quantitation and Quality Control of Labeled cRNA

1. Dilute the labeled RNA aliquot (4 µL) with TE pH 8.0 (76 µL), and measure the following absorbances in an Agilent spectrophotometer: A260 (RNA), A550 (Cyanine-3), and A650 (Cyanine-5). Use the formulas shown below to calculate both yield (concentration) and labeling efficiency of each labeled cRNA. Simply enter values for A260 and A550 or A650 as indicated (see Note 22).
See detailed calculations below for a Cyanine-5-labeled sample:
1. Sample cRNA concentration (ng/µL) = [A260 − (0.038 × A650)] × 10 × 40
2. Sample cRNA mol/L ntp = (cRNA concentration × 0.000001 × 1,000)/330
3. Sample cRNA mol/L dye = A650/25,000
4. Sample efficiency incorporation = (cRNA mol/L dye/cRNA mol/L ntp) × 100
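The calculations for a labeled sample can be sketched in Python as follows. This is an illustrative sketch only: the function name, dictionary layout, and example absorbances are hypothetical, while the constants are taken directly from the chapter's formulas (0.038 and 25,000 for Cyanine-5; 0.07 and 15,000 for the Cyanine-3 reference sample).

```python
# Dye-specific constants as given in the chapter's formulas.
DYE_CONSTANTS = {
    "cy5": {"a260_correction": 0.038, "dye_divisor": 25_000.0},  # read at A650
    "cy3": {"a260_correction": 0.07, "dye_divisor": 15_000.0},   # read at A550
}


def labeled_crna_metrics(a260: float, a_dye: float, dye: str) -> dict:
    """Return yield and dye incorporation following the chapter's formulas.

    a_dye is A650 for Cyanine-5 or A550 for Cyanine-3.
    """
    constants = DYE_CONSTANTS[dye]
    # cRNA concentration (ng/uL) = [A260 - (correction x A_dye)] x 10 x 40
    conc_ng_per_ul = (a260 - constants["a260_correction"] * a_dye) * 10 * 40
    # mol/L of nucleotide = (concentration x 0.000001 x 1,000) / 330
    ntp_mol_per_l = conc_ng_per_ul * 0.000001 * 1000 / 330
    # mol/L of dye = A_dye / divisor
    dye_mol_per_l = a_dye / constants["dye_divisor"]
    # efficiency of incorporation = (dye / nucleotide) x 100
    efficiency = dye_mol_per_l / ntp_mol_per_l * 100
    return {
        "concentration_ng_per_ul": conc_ng_per_ul,
        "ntp_mol_per_l": ntp_mol_per_l,
        "dye_mol_per_l": dye_mol_per_l,
        "efficiency": efficiency,
    }


if __name__ == "__main__":
    # Hypothetical readings for a Cyanine-5-labeled sample.
    result = labeled_crna_metrics(a260=0.5, a_dye=0.05, dye="cy5")
    for key, value in result.items():
        print(f"{key}: {value:.6g}")
```

Passing the dye as a parameter keeps a single code path for sample and reference, so the two channels cannot drift apart through copy-and-paste edits.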


For a Cyanine-3-labeled sample (control or reference):
5. Control cRNA concentration (ng/µL) = [A260 − (0.07 × A550)] × 10 × 40
6. Control cRNA mol/L ntp = (cRNA concentration × 0.000001 × 1,000)/330
7. Control cRNA mol/L dye = A550/15,000
8. Control efficiency incorporation = (cRNA mol/L dye/cRNA mol/L ntp) × 100

3.6. Hybridization of cRNA Samples with Agilent cDNA Microarrays

1. Following the Agilent protocol for the desired microarray format, add (in a 0.2-mL PCR tube): the required amount of Cyanine-5- and Cyanine-3-labeled probes (see Note 23), 25× fragmentation buffer, 10× blocking agent, and nuclease-free water to the required volume.

3.6.1. Fragmentation of cRNA Samples and Preparation of Hybridization Solution

2. Mix by vortexing, centrifuge briefly, and incubate for 30 min at 60°C in a thermal cycler to fragment RNA before hybridization.
3. While incubation takes place, prepare all components for the assembly (hybridization chambers, gasket slides, and microarrays), and write down the serial numbers of the arrays and their corresponding sample identifiers.
4. After incubation is completed, spin briefly, and add the required (equal) amount of the 2× GE HI-RPM Hybridization Buffer. Mix carefully either by pipetting or with short pulses at a very low setting in a vortex mixer, so as not to introduce bubbles.
5. Spin again briefly and load onto the microarray immediately (see below).

3.6.2. Preparation of the Hybridization Assembly

1. Place the first gasket slide on the base of the first hybridization chamber, making sure that the label of the gasket slide is facing up, and that it is well placed and flush with the chamber base.
2. Carefully pipette out the hybridization solution from the first sample tube, avoiding bubbles. Slowly dispense it on the center of the gasket slide, while dragging it throughout the surface of the gasket slide with the pipette tip, and leaving a space of about 1–2 mm all the way around between the solution and the gasket. Do not touch the gasket.
3. Without moving the chamber and gasket slide, place the microarray over it as soon as possible, as follows:
4. Remove one microarray from the package while only touching the barcode area, and keep the "active side" down ("Agilent side is down," serial number up). Keeping it very horizontal, align


array with the four guide posts on the chamber base, and lower array carefully against the gasket slide. 5. Place stainless steel chamber cover onto the sandwiched slides, and then slide on the clamp assembly until it comes to the middle of the chamber. Hand-tighten clamp onto the chamber. 6. While holding the chamber assembly vertically, slowly rotate it 2–3 times to allow the hybridization solution to wet the gasket and the microarray, and inspect for bubble formation. A large mixing bubble is necessary for adequate mixing. If smaller stationary bubbles are detected, gently tap the chamber against your hand or other surface, as it is critical that all bubbles are dislodged and moving. Continue loading and assembling the rest of the Agilent microarray hybridization chambers (see Note 24). 7. When all the chambers are fully assembled, load them into the hybridization rotator rack, in a hybridization oven set at 65°C, making sure to balance the loaded hybridization chambers with others or with an empty chamber. Incubate for 17 h rotating at 10 rpm. 8. In preparation for the next day, place 1 L of Gene Expression Wash Buffer 2 in a new clean bottle (it can be designated for this purpose in the future), and incubate in a 37°C water bath overnight. 3.6.3. Microarray Washing and Scanning

Cyanine-5 is susceptible to degradation by ozone. The microarray wash and scanning procedures must therefore be done in an ozone-controlled room where ozone levels are 5 ppb or less.
1. Prepare three staining dishes: fill the first and second dishes with Wash Buffer 1. Do not add the warmed Wash Buffer 2 until ready to transfer slides to the third dish (see Note 25). Also, have a pair of blunt-end forceps nearby. Use gloves at all times.
2. Turn the scanner on, and after a few minutes, start Agilent Scan Control to warm up the lasers.
3. Place a plastic slide rack and a magnetic stir bar into the second slide-staining dish. Add enough buffer to cover the slide rack. Place this dish on a magnetic stir plate at a low setting.
4. Add a magnetic stir bar to the third (empty) dish.
5. Remove a couple of hybridization chambers at a time from the oven, and place them on a flat surface. Loosen the thumbscrew by turning it counterclockwise. Remove the clamp and cover.

Gene Expression Profiling of Tissues and Cell Lines


6. Grab the array-gasket sandwich by the edges and lift it carefully, keeping the numeric barcode facing up. Quickly transfer the sandwich into the first dish and fully submerge it.
7. While the sandwich is completely submerged in Gene Expression Wash Buffer 1, pry it open from the barcode end only, using the blunt-end forceps, and push the gasket slide down to the bottom of the dish while maintaining a good hold on the edges of the microarray.
8. Holding the microarray slide carefully by the edges and/or the barcode, remove it and quickly insert it into the submerged slide rack in the second glass dish. Minimize exposure of the slide to air.
9. When all slides in the group have been placed into the slide rack, stir at 400 rpm (on low) for 1 min.
10. During this step, remove Gene Expression Wash Buffer 2 from the 37°C water bath and pour it into the third glass staining dish.
11. Transfer the slide rack to the third staining dish containing warm Gene Expression Wash Buffer 2. Stir at 400 rpm for 1 min (see Note 26).
12. Remove the slide rack, taking care to minimize droplets on the slides by using the “slow pull” method: lift the rack out of the buffer at a slow and steady speed, so that it takes 5–10 s to lift the entire slide rack out of the buffer.
13. Place one slide at a time in a scanner slide holder, making sure each is positioned properly, with the active (“Agilent”) side inside the holder and the numeric barcode facing out.
14. Place the slide holders in the scanner slots, starting with slot #1 and always leaving the “Home” slot empty. Scan slides immediately to minimize the impact of environmental oxidants on signal intensities.
15. Once the Agilent Scanner control software is open, choose the number of slots to be scanned and change default settings if necessary, such as the number of bits (8 or 16) or the resolution (5 or 10 µm).
16. As soon as the lasers are warmed up and the scanner status shows “Scanner Ready,” scanning can begin. Barcodes will be read automatically, and all files will contain the barcode number. Each scan will take a few minutes, depending on the density of the microarray.

3.6.4. Microarray Data Processing

1. After scanning, the Agilent Scanner control software will generate an output image file. The Feature Extraction software provided by Agilent is a necessary tool for data processing.


2. The Feature Extraction software will generate a series of files, of which the txt file contains the actual data necessary to create a database.
3. We generally transfer the raw data (txt file) to Excel and apply macros that list the Cyanine-5/Cyanine-3 ratios (representing the relative expression of each mRNA species) and their statistical significance. We also obtain a detailed QC of the Green and Red median intensities (which need to be ≥100), as well as the reproducibility of duplicate signals throughout the array. The microarray is acceptable if the fold change of the median value of the replicates' log2 ratio is below 1.45.
4. If the microarray has failed this last QC measure, it should be considered unreliable. In this case, the microarray can be repeated with an increased amount of labeled sample, particularly if the intensities of one or both of the samples were below 100. Low intensities can greatly affect the reproducibility of the replicates. Because samples are easily under-loaded (resulting in signal intensities that are too low) and it is relatively difficult to saturate the signals, it is far better to load more sample. This is especially true of the Cyanine-3-labeled RNA, as its signal is generally weaker than that of Cyanine-5. In our laboratory, we load 1.5- to 2-fold more Cyanine-3- than Cyanine-5-labeled cRNA as a rule.
5. Once these parameters have been examined, and if the hybridization passes this last QC, a database can be created with the software of choice, such as FileMaker Pro.
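The QC checks in steps 3 and 4 can be sketched in a few lines of Python as an alternative to spreadsheet macros. This is only an illustrative sketch, not the macros used in our laboratory: the input layout (a list of records with hypothetical keys `probe`, `cy5`, and `cy3`) is an assumption, and so is the reading of the replicate criterion as 2 raised to the median spread of the log2 ratios among duplicate features of the same probe. Only the two thresholds (100 and 1.45) come from the text.

```python
import math
import statistics

MIN_MEDIAN_INTENSITY = 100.0  # threshold stated in the text
MAX_REPLICATE_FOLD = 1.45     # threshold stated in the text


def log2_ratio(cy5, cy3):
    """log2 of the Cyanine-5/Cyanine-3 signal ratio for one feature."""
    return math.log2(cy5 / cy3)


def qc_array(features):
    """Summarize QC for one array.

    `features` is a list of dicts with the hypothetical keys
    'probe', 'cy5', and 'cy3'; duplicated probes appear more than once.
    """
    red_median = statistics.median(f["cy5"] for f in features)
    green_median = statistics.median(f["cy3"] for f in features)

    # Group log2 ratios by probe so duplicate features can be compared.
    by_probe = {}
    for f in features:
        by_probe.setdefault(f["probe"], []).append(log2_ratio(f["cy5"], f["cy3"]))

    # Assumed reproducibility metric: median spread among duplicates,
    # converted from log2 units back to a fold change.
    spreads = [max(v) - min(v) for v in by_probe.values() if len(v) > 1]
    replicate_fold = 2 ** statistics.median(spreads) if spreads else float("nan")

    passed = (
        red_median >= MIN_MEDIAN_INTENSITY
        and green_median >= MIN_MEDIAN_INTENSITY
        and bool(spreads)
        and replicate_fold < MAX_REPLICATE_FOLD
    )
    return {
        "red_median": red_median,
        "green_median": green_median,
        "replicate_fold": replicate_fold,
        "passed": passed,
    }
```

An array whose median intensities fall below 100 in either channel, or whose duplicates disagree by a fold change of 1.45 or more under this reading, would be flagged for repetition with more labeled sample.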

4. Notes
1. TRIzol® causes burns and is toxic if swallowed or inhaled. Avoid skin contact. Use a chemical fume hood and personal protective equipment (gloves and lab coat), and dispose of it using appropriate chemical disposal procedures.
2. While autoclaving sterilizes efficiently, the water vapor from the autoclave chamber can have detrimental effects, as it contains metals, rust, or other particulates from previous cycles.
3. Always incubate tubes in a thermal cycler or heating block; do not use a water bath, which can contain unwanted particles as well as bacteria or fungi.
4. For cells growing in suspension, centrifuge the cells in a refrigerated centrifuge, and add TRIzol® to the pellet.
5. Tissue must be flash-frozen in liquid nitrogen no longer than 30 min after excision to avoid RNA degradation.


Frozen tissues must be kept at −80°C or in a liquid nitrogen freezer until ready to process.
6. Do not wash cells with PBS, as this only gives RNases time to start degrading the RNA.
7. Use 1 mL of TRIzol® for 5–10 × 10⁶ cells; if collecting large quantities of RNA, the procedure can be scaled up accordingly, using the RNeasy Midi kit instead.
8. Vortexing lysates will shear DNA, which may then contaminate the RNA fraction.
9. Never let tissue warm up or thaw. The slightest rise in temperature will cause RNA degradation.
10. We have found that an antistatic weigh dish is the easiest tool for transferring the frozen cut tissue to the glass tube. Also, avoid “sweeping” tissue shavings with the scalpel or razor blade, as this will likely charge the tissue fragments with static electricity, making it very difficult to pour them into the weigh dish and tube.
11. Because the RLT buffer freezes on contact with the tube, it is important to deliver it to the bottom of the tube, right where the tissue is. Melting will occur quickly as homogenization starts.
12. If the fragments were cut too big, another cycle may be done before TRIzol® addition. Tissue fragments will still be visible even though homogenization has occurred successfully. Homogenizing repeatedly is not recommended, as it releases large carbohydrate molecules that will adhere to the RNA strands, thus interfering with amplification and labeling of the RNA.
13. It is very important to maintain reagents and samples at room temperature until the end of the procedure to avoid RNA precipitation.
14. If the interphase is picked up or disturbed at all, return the sample to the tube, spin it down again, and collect the upper layer again.
15. If extracting large-volume samples, do this step in a 15-mL centrifuge tube, and add ethanol drop-wise to the upper layer while vortexing at low speed. However, if working with small specimens and an upper layer no larger than 500 µL, ethanol can be added quickly and then pipetted up and down immediately.
16. For very small samples where low yield is anticipated, increasing the incubation time to 10 min can slightly increase RNA recovery.
17. A concentration of 40 µg/mL of single-stranded RNA gives an absorbance of 1 at 260 nm (A260) in TE, pH 8.0.


18. Bubbles lead to enzyme denaturation, resulting in less than optimal reaction efficiency.
19. Sample replicates or “dye swap” replicates are not usually necessary. We have extensively evaluated “dye swap” replicates in our laboratory, and have found no difference in the resulting data.
20. Both cyanine dyes are light sensitive, but ambient indoor lighting is not of sufficient intensity at the excitation wavelength to cause photo-oxidation, particularly during the short period in which they are exposed to it.
21. Carefully remove the column from the collection tube to collect the flow-through, but avoid touching the bottom of the column. It must be kept clean and RNase-free.
22. An efficiency of less than 0.3 indicates RNA labeling below the acceptable range. Labeling can be repeated. Alternatively, low cyanine dye incorporation can be compensated for by adding more probe to the microarray, as long as the cyanine dyes are equivalent in both labeled probes and the samples contain at least 10–20 pmol of cyanine dye/µg of cRNA. Cyanine-3-labeled probes generally show weaker signal intensities in microarray hybridizations; therefore, our laboratory adds 1.5–2× more Cyanine-3 cRNA as a rule.
23. Always compensate for differences in labeling efficiencies, aiming for an ideal efficiency of 0.6–0.9.
24. It is recommended to prepare only a few hybridization assemblies at a time.
25. These glass staining dishes should be designated for this purpose. Never wash these dishes with any type of detergent, since this will result in unwanted fluorescence in the microarrays.
26. Although the buffer was at 37°C, the predicted temperature of the second wash step is around 31°C because of rapid cooling by the room-temperature dish, rack, and microarray slides.

References
1. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–470
2. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet 14(4):457–460

3. Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM (1999) Expression profiling using cDNA microarrays. Nat Genet 21(1 Suppl):10–14
4. Kronick MN (2004) Creation of the whole human genome microarray. Expert Rev Proteomics 1(1):19–28
5. Minor JM (2006) Microarray quality control. Methods Enzymol 411:233–255

6. Wolber PK, Collins PJ, Lucas AB, De Witte A, Shannon KW (2006) The Agilent in situ-synthesized microarray platform. Methods Enzymol 410:28–57


7. Agilent Technologies (2008) Two-color microarray-based gene expression analysis (Quick Amp Labeling) protocol (G4140-90050 v.5.7). Agilent Technologies, Santa Clara, California

Chapter 10
Methods for MicroRNA Microarray Profiling
Aarati R. Ranade and Glen J. Weiss

Abstract
MicroRNAs (miRNAs) are single-stranded, small RNA molecules of 21–23 nucleotides that are primarily involved in the regulation of gene expression. A number of recent studies have shown that miRNAs play a vital role in signaling pathways important for human oncogenesis and may hold promise as potential biomarkers for cancer diagnosis, prognosis, and therapeutics. Microarray-based expression analysis is an emerging and powerful strategy for initially identifying candidate miRNAs, which can then be correlated with specific biological processes such as carcinogenesis, and eventually developed as a molecular signature for a disease state. This chapter presents a protocol for miRNA microarray profiling using the Agilent platform, which is one of the most widely utilized technologies currently available.

Key words: MicroRNA, Microarray, Profiling, Method, Biomarker discovery, Carcinogenesis

1. Introduction
MicroRNAs (miRNAs) are single-stranded small RNA molecules of 21–23 nucleotides that play a role in the regulation of gene expression (1, 2). Several recent reports have also documented the involvement of miRNAs in the regulation of cell signaling pathways, apoptosis, metabolism, cardiogenesis, and neural development, and, not surprisingly, dysregulation of miRNA expression has been linked to many types of cancer (3–7). In general, miRNAs function by posttranscriptional silencing: they repress translation of target mRNAs and promote their degradation. Several hundred miRNAs have already been characterized. Each individual miRNA can potentially regulate several hundred targets, and target recognition occurs mainly through limited base-pairing interactions between the 5′-end of the miRNA

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_10, © Springer Science+Business Media, LLC 2011


(nucleotides 2–8, known as the seed region) and the complementary sequence in the 3′ untranslated regions (3′ UTRs) of the cognate mRNA. The identification and characterization of miRNAs is a rapidly growing area of research because of their strong involvement in a variety of processes such as development, cell proliferation and death, and oncogenesis (8–10). Microarray-based expression analysis is an ideal strategy for identifying both candidate miRNAs that correlate with biological pathways and potential biomarkers. Several commercially available kits support high-throughput miRNA expression profiling. These include bead-based methods using flow cytometry, qRT-PCR, hybridization platforms, oligonucleotide microarrays, and multispectral imaging. Manufacturers of the various platforms for miRNA expression profiling include Agilent, Exiqon, Genosensor, LC Sciences, Oxford Gene Technology, and Febit. This chapter presents a simple protocol, covering sample preparation and analysis, for one of the more commonly used platforms, the Agilent miRNA microarray platform. In this method, RNA isolated from the tissue of choice is first labeled and purified, then hybridized to microarray slides. Scanning and feature extraction are then used to measure miRNA expression.

2. Materials
1. RNA isolation kit (e.g., Qiagen miRNeasy Mini Kit).
2. DNase/RNase-free water.
3. Nuclease-free 1.5 ml Eppendorf tubes.
4. Calf intestine alkaline phosphatase (CIP).
5. 10× alkaline phosphatase (CIP) buffer.
6. 10× T4 RNA ligase buffer.
7. T4 RNA ligase.
8. Micro Bio-Spin 6 columns (Bio-Rad; Hercules, CA).
9. miRNA labeling reagent and hybridization kit (Agilent, part no. 5190-0408).
10. Gene expression wash buffer kit.
11. Agilent microarray scanner (Agilent, part no. G2565BA).
12. Hybridization chamber, stainless (Agilent, part no. G2534A).
13. Hybridization chamber gasket slides (Agilent, part no. G2534-60014).


14. Hybridization oven with 20 rpm capability.
15. Hybridization oven rotator for Agilent microarray hybridization chambers.
16. Microcentrifuge.
17. Magnetic stir plate.
18. Magnetic stir bar.
19. Circulating water bath or heat blocks set to 16, 37, and 100°C.
20. Pipettes (10, 20, 200, and 1,000 µl).
21. Powder-free gloves.
22. Vortex mixer.

3. Methods
3.1. Labeling of RNA

Probe labeling uses total RNA isolated from the tissue of choice. Avoid fractionation methods that discard low-molecular-weight RNA, and elute the RNA in 1× TE (pH 7.5) or DNase/RNase-free water (see Note 1). Total RNA is labeled to generate fluorescent miRNA by ligation of one cyanine 3-pCp molecule to the 3′ end of the RNA molecule. The cyanine 3-pCp is provided in Agilent’s miRNA labeling reagent and hybridization kit. All the following steps are carried out on ice unless otherwise specified. Labeling of total RNA consists of three major steps: dephosphorylation, denaturation, and ligation.

3.1.1. Dephosphorylation

1. Dilute total RNA to the desired concentration (e.g., 25 ng/µl) in 1× TE (pH 7.5) or DNase/RNase-free water.
2. Transfer 4 µl (100 ng) of diluted total RNA to a 1.5 ml Eppendorf tube; place on ice.
3. Prepare the CIP master mix (MM). For one reaction, you will need the following reagents and volumes:
(a) 10× CIP buffer: 0.7 µl/reaction.
(b) RNase-free water: 1.6 µl/reaction.
(c) CIP: 0.7 µl/reaction.
The quantity of CIP MM to prepare depends on the number of reactions.
4. Add 3 µl of CIP MM to each sample and mix by pipetting.
5. To dephosphorylate the sample, incubate at 37°C for 30 min. (At this point, samples may be stored at −80°C until needed.)
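Scaling the CIP master mix to the number of reactions is simple arithmetic, and a short script can help avoid pipetting errors. The following is a minimal sketch, not part of the protocol itself: the per-reaction volumes are those listed in step 3, whereas the 10% overage to cover pipetting losses and the function name `master_mix` are assumptions introduced for illustration.

```python
from decimal import Decimal

# Per-reaction volumes in microliters, as listed in step 3.
CIP_MIX_UL = {
    "10x CIP buffer": Decimal("0.7"),
    "RNase-free water": Decimal("1.6"),
    "CIP": Decimal("0.7"),
}


def master_mix(per_reaction_ul, n_reactions, overage=Decimal("0.10")):
    """Scale per-reaction volumes to n_reactions.

    `overage` is an assumed safety margin for pipetting losses;
    it is not specified in the protocol.
    """
    scale = Decimal(n_reactions) * (1 + overage)
    return {name: volume * scale for name, volume in per_reaction_ul.items()}


if __name__ == "__main__":
    for name, volume in master_mix(CIP_MIX_UL, 8).items():
        print(f"{name}: {volume} ul")
```

The per-reaction volumes sum to 3 µl, which matches the 3 µl of CIP MM added to each sample in step 4.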


3.1.2. Denaturation

1. Add 5 µl of 100% DMSO to each sample.
2. Incubate at 100°C for 10 min (see Note 2).
3. Transfer samples to an ice water bath (see Note 3). Immediately proceed to ligation.

3.1.3. Ligation

1. Warm the 10× T4 RNA ligase buffer at 37°C, and vortex until all precipitate is dissolved (see Note 4).
2. Prepare the ligation master mix. Important: use the ligation master mix within 15 min of preparation. For one reaction, you will need the following reagents and volumes:
(a) 10× T4 RNA ligase buffer: 2 µl/reaction.
(b) RNase-free water: 2 µl/reaction.
(c) pCp-Cy3: 3 µl/reaction.
(d) T4 RNA ligase (20 U/µl): 1 µl/reaction.
3. Immediately add 8 µl of ligation master mix to each sample for a total reaction volume of 20 µl.
4. Gently mix by pipetting, and then spin down at 1,000 × g.
5. Incubate at 16°C for 2 h.
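The volumes added during labeling can be checked against the stated 20 µl total with a simple running ledger. The sketch below only restates the volumes given in Subheadings 3.1.1–3.1.3; the function name `running_volumes` is hypothetical.

```python
from decimal import Decimal

# Per-reaction ligation master mix, in microliters (step 2 above).
LIGATION_MIX_UL = {
    "10x T4 RNA ligase buffer": Decimal("2"),
    "RNase-free water": Decimal("2"),
    "pCp-Cy3": Decimal("3"),
    "T4 RNA ligase": Decimal("1"),
}

# Additions to each sample tube, in the order given in the protocol.
ADDITIONS_UL = [
    ("diluted total RNA", Decimal("4")),
    ("CIP master mix", Decimal("3")),
    ("100% DMSO", Decimal("5")),
    ("ligation master mix", sum(LIGATION_MIX_UL.values())),
]


def running_volumes(additions):
    """Return (reagent, cumulative volume) pairs for a list of additions."""
    total = Decimal("0")
    ledger = []
    for name, volume in additions:
        total += volume
        ledger.append((name, total))
    return ledger


if __name__ == "__main__":
    for name, total in running_volumes(ADDITIONS_UL):
        print(f"after {name}: {total} ul")
```

The cumulative volumes are 4, 7, 12, and 20 µl, so a tube that looks noticeably short after ligation points to a missed addition or to evaporation during the 100°C step.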

3.2. Purification of Labeled RNA

Labeled RNA is purified using the Micro Bio-Spin 6 columns (Bio-Rad, part no. 732-6221) provided with the kit. The column is conditioned as described below:
1. Micro Bio-Spin 6 column preparation:
(a) Invert the column several times to resuspend the settled gel and to remove any air bubbles.
(b) Snap off the tip and place the column into a 2 ml microcentrifuge tube supplied by the manufacturer.
(c) Remove the green cap from the column. If the buffer does not begin to drip into the 2 ml tube, press the cap back on and remove it again to start the flow. Let the buffer drain for 2 min.
(d) Discard the drained buffer from the 2 ml microcentrifuge tube and place the column back into the tube.
(e) Spin the tube containing the column for 2 min at 1,000 × g in a centrifuge.
(f) Remove the column from the 2 ml tube and place it into a sterile 1.5 ml microcentrifuge tube.
2. Add 30 µl of RNase-free water or 1× TE, pH 7.5, to the labeled sample for a total volume of 50 µl.
3. Without disturbing the gel bed in the Micro Bio-Spin 6 column, pipette the 50 µl sample onto the gel bed.


4. Elute the purified sample by centrifuging the column for 4 min at 1,000 × g.
5. Discard the column and place the tube containing the eluate on ice. The eluate contains the labeled miRNA sample.
6. Check that the final eluate is translucent and slightly pink. The volume should be close to 50 µl and uniform across all samples.

3.3. Hybridization
3.3.1. Preparation of the Hybridization Sample

1. Dry down the labeled sample using a speed-vacuum at 45°C on the medium heat setting for 60 min.
2. Resuspend the dried sample in 18 µl of nuclease-free water.
3. Add 4.5 µl of 10× GE blocking agent to each sample.
4. Add 22.5 µl of 2× Hi-RPM hybridization buffer to each sample; the total volume is 45 µl.
5. Vortex gently to mix the sample.
6. Incubate in a water bath at 100°C for 5 min, and then transfer to an ice water bath for 5 min, followed by centrifugation at 1,000 × g (see Note 5).
7. To hybridize, load the sample into the Agilent microarray hybridization chamber and incubate at 55°C for 20 h at 20 rpm in a rotating hybridization oven. The preparation of the hybridization assembly is described below.
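The volumes in steps 2–4 bring both concentrated reagents to 1× in the final 45 µl mix, which can be confirmed with the usual dilution relationship (final concentration = stock concentration × stock volume / total volume). The following sketch is illustrative only; the function name `final_concentration` is hypothetical, and exact fractions are used simply to avoid floating-point rounding.

```python
from fractions import Fraction

# Volumes in microliters, from steps 2-4.
WATER_UL = Fraction("18")
BLOCKING_AGENT_UL = Fraction("4.5")   # supplied as a 10x stock
HYB_BUFFER_UL = Fraction("22.5")      # supplied as a 2x stock

TOTAL_UL = WATER_UL + BLOCKING_AGENT_UL + HYB_BUFFER_UL


def final_concentration(stock_x, stock_volume_ul, total_volume_ul):
    """Final concentration (in 'x' units) after dilution into the total volume."""
    return Fraction(stock_x) * Fraction(stock_volume_ul) / Fraction(total_volume_ul)


if __name__ == "__main__":
    print("total volume (ul):", float(TOTAL_UL))
    print("blocking agent (x):", float(final_concentration(10, BLOCKING_AGENT_UL, TOTAL_UL)))
    print("hybridization buffer (x):", float(final_concentration(2, HYB_BUFFER_UL, TOTAL_UL)))
```

If the resuspension volume is changed for any reason, the same calculation shows how the other two volumes must change to keep both reagents at 1×.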

3.3.2. Preparation of the Hybridization Assembly

1. Load a clean gasket slide into the hybridization chamber base (Agilent SureHyb chamber) with the label facing up and aligned with the rectangular base of the chamber.
2. Slowly dispense the hybridization sample into the central zone of the gasket without touching the pipette tip or the hybridization sample to the gasket walls. Avoid introducing bubbles during sample loading.
3. Place an array “active side” down onto the gasket slide, so that the “Agilent”-labeled barcode is facing down and the numeric barcode is facing up.
4. Place the hybridization chamber cover onto the sandwiched slides and fix the assembly by sliding on the clamp and tightening it.
5. Rotate the assembly to assess the mobility of the bubble. If required, gently tap the assembly by hand to dislodge any stationary bubbles.
6. Place the assembled slide chamber in a hybridization oven set to 55°C at 20 rpm for 20 h.

3.4. Microarray Wash

1. Prewarm gene expression wash buffer overnight at 37°C in a water bath (see Note 6).


Table 1. Wash conditions for the microarray slides after hybridization

Dish no.   Procedure     Wash buffer   Time (min)   Temperature
1          Disassembly   Buffer 1      –            Room temperature
2          First wash    Buffer 1      5            Room temperature
3          Second wash   Buffer 2      5            37°C

2. Place slide-staining dish no. 3 into a 1.5 L glass dish three-quarters filled with water. Warm to 37°C by storing overnight in an incubator set to 37°C.
3. Wash the microarray slides in buffers 1 and 2 as described in Table 1:
(a) Disassemble the hybridization chamber by loosening the screws. Slide off the clamp assembly and remove the chamber cover.
(b) With gloved fingers, remove the array-gasket sandwich by grabbing the slides from both ends.
(c) Without letting go of the slides, submerge the array-gasket sandwich in slide-staining dish no. 1 containing gene expression wash buffer 1.
(d) With the sandwich completely submerged, separate the gasket slide from the microarray slide using blunt forceps.
(e) Remove the microarray slide and place it into a slide rack in slide-staining dish no. 2 containing gene expression wash buffer 1 at room temperature for 5 min.
(f) Transfer the slide rack to slide-staining dish no. 3 containing gene expression wash buffer 2 at 37°C for 5 min.
(g) Scan the slides immediately to minimize the impact of environmental oxidants on signal intensities.
The microarray wash procedure described above must be done in environments where ozone levels are 50 ppb or less (see Note 7).

3.5. Scanning Slides and Using the Feature Extraction Software

1. Place the slides into the slide holder such that the numeric barcode side is visible.
2. Place the assembled slide holders into the scanner carousel.
3. Scan the slides using the scanner settings for miRNA scans.
4. Microarray data analysis begins with feature extraction (FE) from the scanned microarray images:
(a) Open the Agilent Feature Extraction software and add the images (.tif) to be extracted to the FE project.


(b) Extract the .tif images using the FE software and save them as .txt files.

3.6. Data Analysis

1. The data files in .txt format are analyzed using GeneSpring GX software by Agilent.
2. When a data file is imported into GeneSpring GX, the following information is imported: control type, probe name, signal, and feature columns.
3. Enter the name for a new experiment, select the appropriate experiment type (single color expression data, two color expression data, etc.), and choose the desired workflow. For miRNA expression data, choose the workflow to find differentially expressed miRNAs.
4. The guided workflow will take you through experiment creation and data analysis, while advanced analysis allows access to the full set of analysis tools:
(a) GeneSpring GX gives a summary report in which the distribution of normalized intensity values for each sample is displayed as a box–whisker plot.
(b) The second step in the guided workflow is experimental grouping. It requires adding parameters that define the grouping and replicate structure of the experiment.
(c) The third step is quality control (QC) on samples, which helps to evaluate the reproducibility and reliability of your microarray data.
(d) The next step is called “filter probesets,” in which the entries are filtered based on their flag values (present, marginal, and absent). If flag values are absent, the data entries are filtered based on their signal intensity value.
(e) GeneSpring then performs significance analysis. Depending upon the experimental grouping, either a t test or ANOVA is performed.
(f) The entries satisfying the significance analysis are passed on to fold change analysis, which identifies genes whose expression differs between a control and an experimental group. The fold change is the absolute ratio of the average normalized intensities of the two sample groups.
(g) The Gene Ontology (GO) classification scheme allows you to quickly categorize genes by biological process, molecular function, and cellular component. To determine whether the entries identified in the previous step are significantly overrepresented in a particular GO category, a statistical test is performed and a p-value is assigned to each category. Entries corresponding to each category that satisfies the p-value cutoff are displayed.


5. In this way, GeneSpring GX calculates and displays the differentially expressed miRNAs, whose targets can then be identified.
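For readers who wish to see what the fold change and GO enrichment steps of the workflow compute, the two calculations can be sketched in plain Python. This is not GeneSpring GX code and makes two labeled assumptions: that the normalized intensities are on a log2 scale, so that the fold change is 2 raised to the absolute difference of the group means, and that the GO test is a one-sided hypergeometric test, which is a common choice for enrichment analysis but may differ from the test GeneSpring applies. The function names are hypothetical.

```python
import math
import statistics


def fold_change(control_log2, experimental_log2):
    """Absolute fold change between two groups.

    Assumes the inputs are log2-scale normalized intensities,
    so the ratio of group averages becomes a difference of means.
    """
    diff = statistics.mean(experimental_log2) - statistics.mean(control_log2)
    return 2 ** abs(diff)


def hypergeom_enrichment_p(population, annotated, selected, hits):
    """One-sided hypergeometric p-value, P(X >= hits).

    population: all genes considered
    annotated:  genes in the population that belong to the GO category
    selected:   genes in the list being tested
    hits:       selected genes that belong to the GO category
    """
    upper = min(annotated, selected)
    total = math.comb(population, selected)
    tail = sum(
        math.comb(annotated, k) * math.comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    print(fold_change([8.0, 8.2, 7.8], [10.0, 10.1, 9.9]))
    print(hypergeom_enrichment_p(population=10, annotated=4, selected=3, hits=2))
```

In practice, many GO categories are tested at once, so the resulting p-values are normally corrected for multiple testing before a cutoff is applied.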

4. Notes
1. It is important to determine the purity and integrity of the input RNA prior to labeling and hybridization to increase the likelihood of a successful experiment. Low 260/280 nm ratios (2.0) result in inaccurate quantification of the total RNA and may lead to spurious microarray findings.
2. Incubating the sample for less than 5 min or more than 10 min is not advisable, as it will affect the labeling efficiency of the sample.
3. Use an ice water bath (not just crushed ice) to ensure that the samples remain properly denatured.
4. Make sure that the 10× T4 RNA ligase buffer has cooled to room temperature before use; otherwise, the ligase activity could be impaired, which will adversely affect labeling efficiency.
5. Do not leave samples in the water bath for more than 5 min, as this adversely affects the labeling efficiency.
6. The gene expression wash buffer should be warmed in a water bath at 37°C overnight.
7. Wash all dishes, racks, and stir bars with deionized water. Do not use detergents, as some detergents may leave fluorescent residue on the dishes.

References
1. Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116:281–297
2. Fabbri M, Croce CM, Calin GA (2008) MicroRNAs. Cancer J 14:1–6
3. Bandres E, Agirre X, Ramirez N, Zarate R, Garcia-Foncillas J (2007) MicroRNAs as cancer players: potential clinical and biological effects. DNA Cell Biol 26:273–282
4. Cho WC (2007) OncomiRs: the discovery and progress of microRNAs in cancers. Mol Cancer 6:60
5. Dalmay T, Edwards DR (2006) MicroRNAs and the hallmarks of cancer. Oncogene 25:6170–6175

6. Esquela-Kerscher A, Slack FJ (2006) Oncomirs – microRNAs with a role in cancer. Nat Rev Cancer 6:259–269
7. McManus MT (2003) MicroRNAs and cancer. Semin Cancer Biol 13:253–258
8. Poy MN, Eliasson L, Krutzfeldt J, Kuwajima S, Ma X, Macdonald PE (2004) A pancreatic islet-specific microRNA regulates insulin secretion. Nature 432:226–230
9. Yekta S, Shih IH, Bartel DP (2004) MicroRNA-directed cleavage of HOXB8 mRNA. Science 304:594–596
10. Zhao Y, Samal E, Srivastava D (2005) Serum response factor regulates a muscle-specific microRNA that targets Hand2 during cardiogenesis. Nature 436:214–220

Chapter 11
Allelic Expression Profiling to Dissect Genome-Wide Association Study Signals
Jonathan D. Gruber

Abstract
Genome-wide association studies are providing exciting new insight into the genetics of complex disease, but oftentimes, the genomic regions associated with the trait of interest are large enough to contain several equally plausible candidate genes. Commonly, no obvious, putatively functional polymorphisms are found to segregate. In most cases, therefore, functional evaluation of possible regulatory mechanisms is necessary to narrow the list of potential candidates. One approach to functional characterization of such variants is allelic expression (AE) profiling, which provides an assessment of transcriptional differences between two homologous transcripts. In AE, a heterozygous, transcribed single nucleotide polymorphism is used to quantify the relative transcript abundance between two gene copies. A ratio that differs significantly from 1:1 suggests that the sample may be heterozygous for a cis-acting regulatory allele. The pattern of observed cis-regulatory variation in the profiled candidate genes can thus narrow the list of candidates under an association signal substantially. In addition, AE is also an accessible and economical strategy, as it relies heavily upon standard techniques and equipment likely to be present in any disease-mapping laboratory.

Key words: Genome-wide association study, Differential allelic expression, Allele-specific expression, Allele-specific gene expression, Allelic imbalance, Cis-regulatory variation, Candidate genes, eQTL, Complex disease

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_11, © Springer Science+Business Media, LLC 2011

1. Introduction
In the past few years, human geneticists have found themselves asking the “What now?” question that naturally follows completion of a genome-wide association study (GWAS). Such studies typically yield several chromosomal regions containing markers with statistically significant evidence for association with a complex disease or trait (1). Like methods that identify quantitative trait loci (QTLs), the genomic positions of association signals provide some insight into the architecture of complex traits even


when the causal genes or variants remain unknown. Unfortunately, the haplotype block-like structure of the genome also imposes a limit on the resolution of association signals, and more often than not, results from a typical GWAS do not implicate a specific gene and are even less likely to pinpoint biologically consequential mutations. Identification of causal variants will be one of the primary challenges of the postgenome, post-GWAS era. Even identifying the affected genes, regardless of whether causal variants are immediately characterized, enables much future work. Identifying these genes will enhance insight into affected pathways, thereby providing the fundamentals for intelligent drug design (2). Knowing the relevant genes also opens opportunities to create new animal models of disease physiology. The process is likely to be self-perpetuating; that is, knowing the identity of a subset of influential genes informs future searches for other consequential genes in the genome. And what about the exact causal sites? Someday, a catalog of such variants and their effects will genuinely move us closer to understanding the relationship between genotype and phenotype, though it remains unclear whether gene identification will lead to the pinpointing of causal sites or vice versa. In theory, many approaches could be taken to investigate a region showing evidence for association with a complex trait, but no single approach has yet been shown to be more informative or efficient than the others. Common approaches include deep resequencing of the region of interest in search of potential causal sites, mapping in a different population and/or with a denser marker set, analysis of nucleotide conservation across evolutionary taxa, and characterization of the expression of genes in the associated region. This chapter will focus on this last approach, using the specific strategy of allelic expression (AE) profiling.
In AE, a heterozygous, transcribed single nucleotide polymorphism (SNP) acts as a proxy for the relative transcript abundance between an individual’s two gene copies. A ratio that differs significantly from 1:1 suggests that the individual is heterozygous for a cis-acting regulatory allele. Patterns of observed cis-regulatory variation in profiled candidate genes can narrow the list of candidates under an association signal substantially. In addition, AE is also an accessible and economical strategy, as it relies heavily upon standard techniques and equipment likely to be present in any disease-mapping laboratory. The principle behind AE studies (Fig. 1) is based upon the simple expectation that dissimilar expression between two homologous transcripts (due to differential transcription, splicing, or degradation) implies a locally acting functional difference between an individual’s maternal and paternal homologues. The basic structure of any AE experiment, therefore, is to determine whether the two alleles of a gene are expressed at a ratio other than 1:1 in one or more individuals. Deviation from the expected 1:1 ratio is alternatively referred to by at least three different names, including

Allelic Expression Profiling to Dissect Genome-Wide Association Study Signals


Fig. 1. The principle of allelic expression. (a) When the complete regulatory region of an individual’s paternal and maternal gene copies are functionally identical (i.e., homozygotes, depicted top and bottom), they should be equally represented in RNA. Therefore, cDNA departures from a 1:1 ratio (center) indicate the presence of heterozygous, cis-acting regulatory variation. In a panel of risk/protective haplotype heterozygotes, such a pattern is potentially indicative of disease-associated regulatory variation. (b) Relative signal from a heterozygous transcribed SNP provides an indicator of relative transcript abundance. When the ratio of allele abundance in cDNA differs significantly from that of gDNA (open triangles vs. filled circles), a cis-regulatory heterozygote is inferred; when cDNA and gDNA do not significantly differ (open squares vs. filled circles), either no heterozygous variation is acting or it is undetectably subtle.

Gruber

(1) differential AE (DAE), (2) allele-specific [gene] expression (AS[G]E), and (3) allelic [expression] imbalance (A[E]I), although “allelic imbalance” is also used with an entirely separate meaning, for somatic changes in copy number (cf. ref. 3). Deviation from a 1:1 ratio between two alleles of a gene implies cis-regulatory variation. In general terms, a cis-acting difference may be either genetic (a sequence variant) or epigenetic (the effects of methylation and other chromatin-modifying pathways). However, epigenetic mechanisms cannot be solely responsible for an association signal, which is genetic by definition. Importantly, in order to observe differential expression between homologues, the two copies must themselves be distinguishable, generally by means of a heterozygous transcribed SNP. AE is uniquely sensitive to allele-specific effects of cis-regulatory variation and is thus highly appropriate for fine-scale searches in which the disease-associated polymorphism is known to act locally in the genome. It can only do so, however, if the actual causal variant affects the regulation, as opposed to the structure, of a gene product. While a truly informed estimate of the relative importance of these two alternatives will remain unavailable for the time being, circumstantial evidence gathered from GWAS performed to date suggests that a substantial fraction of causal variants underlying complex disease are noncoding (1, 2). For instance, a meta-analysis of GWAS data for Crohn’s disease (4) revealed 32 trait-associated signals, of which only nine were correlated with a nonsynonymous coding SNP. Moreover, the majority of genetic variants are located in noncoding regions, where it is possible that effects on transcriptional regulation may be exerted. When this is the case, different levels of expression may be observed between risk and protective alleles (e.g., ref. 5).
Typically, the goal of an AE study is simply to identify genes that are subject to cis-acting variation. In this chapter, most of the technical details of AE studies are covered in that context, with the goal of narrowing a list of disease-association candidate genes. However, if additional genetic information is available about the sample panel, AE can also be used to map cis-acting polymorphisms (see Note 1). And although the motivation is presumed to be the dissection of association signals, any researcher who wishes to determine whether a gene is affected by cis-acting variation should find these considerations useful.

2. Materials

2.1. cDNA Synthesis

1. RQ1 RNase-free DNase (Promega, Fitchburg, WI), and accompanying buffer (10×) and stop solution.

2. ReadyMade random hexamers (Integrated DNA Technologies, Coralville, IA) resuspended in water at 500 ng/µl.
3. Deoxynucleotide solution mix, i.e., dNTPs (New England Biolabs, Ipswich, MA), 10 mM each nucleotide.
4. SuperScript II reverse transcriptase (Invitrogen, Carlsbad, CA), with accompanying first-strand buffer (5×) and DTT (0.1 M).

2.2. AE Profiling by Pyrosequencing

1. For each assay, one sequencing primer and a pair of PCR primers amplifying a 100–400 bp amplicon surrounding the SNP (see Note 2).
2. Streptavidin Sepharose High Performance beads (GE, Piscataway, NJ).
3. PyroMark Q96 plate low, PyroMark Q96 cartridge, PyroMark Q96 vacuum prep troughs, PyroMark Gold Q96 reagents kit, PyroMark binding buffer, PyroMark denaturation solution, PyroMark wash buffer 10×, PyroMark annealing buffer (Qiagen, Valencia, CA). Use solutions at 1×.
4. Ethanol (70%).

3. Methods

By AE profiling, the researcher will assess the strength of evidence for cis-acting variation at each candidate gene. The strongest conceivable evidence is DAE observed consistently across several transcribed marker SNPs, preferentially in individuals heterozygous for the risk/protective genotype (see Fig. 2). Careful preparation for an efficient AE experiment (see Note 3) is critical and involves choosing the genes to be interrogated and the transcribed SNPs that can track the genes (see Note 4), as well as assembling a panel of replicated RNA and genomic DNA (gDNA) samples (see Notes 5 and 6). The clearest interpretation follows a biological scenario in which a single biallelic, disease-associated site is the only local regulatory variation, and it has a discernibly Mendelian relationship to the expression of just one gene. As this situation is far from guaranteed, achieving the optimal outcome of a systematic dissection of an association signal depends on a bit of serendipity in both the locus’s biology and the technical details. An AE study targeting all candidate genes is designed to maximize the opportunity to recognize the ideal situation, if it exists. However, even if biological complexity or technical limitations prevent the ideal realization of the strategy, upon completion of the experiment, the researcher will have substantially more information with which to prioritize further investigation of the candidate genes (Fig. 2).

Fig. 2. The proposed structure of disease-association dissection by AE. Ideally, only the disease-associated candidate gene would show any evidence of DAE, but in many cases, a more detailed comparison of the pattern of DAE in candidate genes will be necessary. The strength of the candidate may be estimated by the decision tree presented, under the assumption that risk/protective haplotype heterozygotes should be most likely to show DAE in situations where the regulatory variation and disease risk are related.

3.1. cDNA Synthesis

1. Before cDNA synthesis, RNA integrity is verified, either on an agarose gel or by NanoDrop.
2. RNA is treated with DNase in a 10 µl reaction by mixing 1 µl RQ1 RNase-free DNase (1 U/µl), 1 µl RQ1 RNase-free DNase buffer (10×), and RNA in water, adjusting enzyme to 1 U/µg of RNA. The sample is incubated for 30 min at 37°C, at which time 1 µl Stop Solution is added. The sample is then incubated at 65°C for 10 min and kept on ice (see Note 7).
3. cDNA synthesis is performed by adding 0.5 µl random hexamers (500 ng/µl) and 0.5 µl dNTPs (10 mM each) to 5 µl RNA. The sample is incubated for 5 min at 65°C and snap-cooled on ice. To the mix, 2 µl 5× SuperScript II first-strand buffer and 1 µl DTT (0.1 M) are added. The mix is incubated for 2 min at 25°C, at which point 1 µl SuperScript II, diluted 1:1 in double-distilled water, is added (see Note 8) and mixed by pipetting. After brief centrifugation, the sample is incubated at 25°C for 10 more minutes, followed by 42°C for 50 min and 70°C for 15 min.

4. Synthesized cDNA is diluted for use in assay-specific PCR, typically 10- to 100-fold, depending on each gene’s abundance in the samples.

3.2. AE Profiling by Pyrosequencing

1. For this protocol, the researcher is expected to have access to a PyroMark Q96 ID (“pyrosequencer”) with vacuum prep workstation. However, many SNP genotyping technologies are amenable to AE quantification (see Note 9).
2. In a 50 µl reaction, the assay template is amplified from gDNA and cDNA samples (see Note 6) by standard PCR (see Note 10).
3. The relative efficiency of the reaction is checked on an agarose gel and, if necessary, the concentration of amplicon is equalized by diluting the PCR reactions. All samples are diluted to 40 µl with double-distilled water.
4. From a well-shaken source bottle of streptavidin sepharose beads, 3 µl/sample are mixed into 37 µl/sample binding buffer (1×) to form a slurry.
5. The slurry is poured into a trough, and 40 µl of bead/binding buffer slurry is added to each 40 µl sample using a multichannel pipettor, with occasional stirring of the slurry (see Note 11). The plate is sealed with film, with the backing saved for a later step.
6. The plate is affixed to a vortex or shaker (such as the Thermomixer, Eppendorf, Hauppauge, NY) and shaken vigorously, but not so hard that sample contacts the sealing film, for at least 10 min at room temperature (~25°C).
7. While the plate is shaking, sequencing primer stock (100 µM) is diluted into 1× annealing buffer at a 4 µl:1 ml ratio. A 40 µl aliquot of this diluted primer is added to the wells of a PyroMark Q96 plate low.
8. The vacuum station is prepared by filling the reagent trays with 70% ethanol, denaturation solution, and wash buffer.
9. Double-distilled water is run through the vacuum tool for 20 s.
10. The plate is removed from shaking and the sealing film carefully removed.
11. The vacuum tool is used to aspirate the samples’ beads and liquid from the plate, capturing the beads on the vacuum pins.
12. The vacuum tool is placed in the 70% ethanol for 10 s, then denaturation solution for 5 s, and wash buffer for 10 s. Each count begins when liquid can be seen flowing through the vacuum tube.
13. The tool is held up at an angle so that as much liquid as possible can drain out of the head.

14. Critically, the vacuum pump is turned OFF before the vacuum tool is placed in the PyroMark Q96 plate low, which contains sequencing primer in 1× annealing buffer (step 7 above).
15. The beads are released from the vacuum pins by gentle swirling and up-and-down motions of the pins in the wells. A slight grayness can be observed in wells where sample has successfully been deposited.
16. The samples are incubated at 90°C for 2 min and then allowed to cool to room temperature (see Note 12).
17. As the plate cools, necessary assay and sample information is entered into the Pyrosequencing software.
18. Substrate and enzyme from the PyroMark Gold Q96 reagent kit are resuspended in 620 µl double-distilled water and pipetted into the appropriate positions of the PyroMark Q96 cartridge. A 200 µl aliquot of each deoxynucleotide (provided in the kit) is also added to the cartridge at the appropriate positions.
19. The plate is locked into the PyroMark Q96 machine, followed by the cartridge.
20. The run is started using the machine’s software.
21. When the run has completed (see Note 13), peak height data (see Note 14) are exported for analysis.

3.3. Analysis

1. The analysis steps outlined below are appropriate for any platform so long as the relative intensity of the signal from each SNP allele can be exported.
2. AE is calculated for each SNP:
(a) The quantitative data corresponding to the signal associated with each SNP allele, referred to as “a” and “b,” are retrieved. This is straightforward on most platforms (see Note 14).
(b) The inherent bias between the two alleles’ assay signal, k, is estimated by calculating the mean a/b ratio observed over all samples of heterozygous gDNA (6).
(c) The AE ratio is calculated by converting each observed a/b ratio by the formula $AE_{a/b} = (a/b)/k$. Both cDNA and gDNA data should be similarly converted, and as a check the gDNA data should now exhibit a 1:1 ratio (on average).
(d) Each individual $AE_{a/b}$ measure is recoded to a binary {0,1} (or {R,P}) system according to which SNP allele is genetically coupled to the Risk or Protective haplotype. (This requires phasing based on family data, or it can be inferred from strong LD documented in the HapMap or elsewhere.) It is sometimes necessary to take the inverse of some adjusted AE ratios so that an analogous, comparable quantity describes each individual (rather than some $AE_{R/P}$ and others $AE_{P/R}$).
3. Evidence for DAE is evaluated:
(a) For each individual, a t test is performed comparing the technical replicates of the converted cDNA $AE_{R/P}$ ratio to the replicates of the gDNA “$AE_{R/P}$” ratio. This tests the null hypothesis that the difference between the means of the two groups is zero. In other words, a significant result in this test indicates that the ratio of cDNA AE in the tissue of this individual is significantly different from 1:1, which is evidence for a cis-acting difference between the individual’s chromosomes (see Note 15).
(b) When possible, multiple SNPs in a single gene are examined for concordance: are they telling a consistent story for each individual? Is DAE significant at both sites? If the SNPs’ phase is known, is there consistency in which haplotype shows greater abundance? Do most of the samples that show DAE at one SNP also show it at the other SNP?
(c) For each SNP, evidence that observed DAE is consistent with a disease-associated SNP is evaluated. Specifically, the average gDNA “AE” and average cDNA AE of all informative individuals who are also heterozygous for the risk and protective haplotypes are compared by a t test. It is critical that the SNP alleles have been correctly recoded with respect to the risk/protective haplotypes. By design, this test fails to reach significance if individuals tend to deviate from 1:1 AE but in different directions. We expect that the high-risk allele would consistently have either higher or lower expression compared with the protective allele, not both. For the mathematically inclined, a repeated measures two-way ANOVA (think of gDNA and cDNA as before-and-after a “template treatment”) is technically the more appropriate test.
4. Taking all SNPs, genes, and samples into account, the overall experiment is evaluated. Repeat for all candidate genes and prioritize future investigations, incorporating the logic of Fig. 2 into a larger framework of biological background knowledge about each gene. In some cases, the ideal situation will occur: variation is observed only at one gene and in 100% of the risk/protective heterozygotes, exclusively. In other cases, several genes may retain plausibility, but the patterns observed in the AE experiment help to focus future work toward the best candidates.
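The per-SNP calculations in steps 2 and 3(a) can be sketched in a few lines of Python. This is a minimal illustration only, not the author's analysis code: the function names (`allele_bias`, `adjust`, `orient`, `pooled_t`), the flag `a_tags_risk`, and all numerical values are hypothetical, and the pooled-variance t statistic is used here simply as one common form of the two-sample t test.

```python
from math import sqrt
from statistics import mean, variance


def allele_bias(gdna_ratios):
    """Estimate the assay bias k as the mean a/b ratio over heterozygous gDNA."""
    return mean(gdna_ratios)


def adjust(ratios, k):
    """Convert raw a/b ratios to bias-corrected AE ratios, AE = (a/b)/k."""
    return [r / k for r in ratios]


def orient(ae_ratios, a_tags_risk):
    """Express AE as risk/protective, inverting when allele b tags the risk haplotype."""
    return list(ae_ratios) if a_tags_risk else [1.0 / r for r in ae_ratios]


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return t, df


# Hypothetical raw a/b peak-height ratios (illustrative values, not real data).
gdna_all = [1.10, 1.05, 1.15, 1.10, 1.12, 1.08]  # all heterozygous gDNA samples
gdna_ind = [1.10, 1.05, 1.15, 1.10]              # one individual's gDNA replicates
cdna_ind = [1.65, 1.70, 1.60, 1.65]              # same individual's cDNA replicates

k = allele_bias(gdna_all)
gdna_ae = orient(adjust(gdna_ind, k), a_tags_risk=True)
cdna_ae = orient(adjust(cdna_ind, k), a_tags_risk=True)

t, df = pooled_t(cdna_ae, gdna_ae)
print(f"k = {k:.3f}, mean cDNA AE = {mean(cdna_ae):.3f}, t = {t:.2f}, df = {df}")
```

A p-value would be obtained by referring `t` to a t distribution with `df` degrees of freedom in any statistics package. The caveats of Note 15 (unequal variances between cDNA and gDNA, non-normal ratios) apply to this sketch as much as to any other implementation.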

4. Notes

1. The data collected in an AE study can be used to map cis-regulatory polymorphisms, although doing so requires a good deal of additional information, generally requires a larger panel than other AE analyses, and is far from guaranteed to determine a narrower association interval than the risk haplotype. In essence, one is performing association mapping in haploids: is the more strongly expressed gene copy found on the same chromosome as one of the alleles of a nearby SNP more frequently than expected by chance? There are two reasons that this avenue of analysis may not be worthwhile. First, a number of SNPs in strong LD probably form the basis for the risk/protective haplotype designation in the initial GWAS, so it is entirely possible that mapping will simply recapitulate the set of candidate SNPs that were the strongest candidates all along. Second, this analysis depends on knowing the phase of the transcribed marker SNP alleles and the alleles of all SNPs to be tested. This may be extremely difficult information to obtain in most cases. One exception is when AE is being assessed in a panel of CEPH individuals for whom phase information has been previously determined in the HapMap project. This strategy was employed and is excellently illustrated in Supplemental Fig. 1 of ref. 7.
2. One of the PCR primers is 5′ biotinylated. The sequencing primer is on the opposite strand from the biotinylated primer, and its 3′ end should be 0–4 bases from the SNP, positioned so that neither SNP allele is “A” and neither is part of a homopolymer. It is also prudent to use unlabeled primers to test and optimize each assay’s PCR on gDNA and cDNA test templates. This minimizes the use of expensive, labeled oligonucleotides in the preliminary stage until it is determined whether the gene is expressed at detectable levels in the tissue of interest. If an assay resists optimization, attempts should be made to verify that it is expressed.
If not, it may be a poor candidate for influencing the disease.
3. The optimal organization of a panel of individuals for AE analysis depends on the priorities of the laboratory and the particular parameters of the expression quantification platform. Basically, the time and labor efficiency of the experiment increases as the efficiency of sample use decreases. The major decision to make is whether the entire sample panel will be processed for all assays rather than “cherrypicking” only the informative heterozygotes. Preassay cherrypicking is recommended only in situations when the tissue of choice is extremely limited or when each assay reaction is too expensive to waste on (uninformative) homozygotes. Not only is cherrypicking a massive organizational challenge, but it is also likely to result in different handling times for different samples, which could introduce artifacts. Furthermore, in order to know which samples are informative prior to cherrypicking, the gDNA of each individual must be sequenced or genotyped in advance. In contrast, with a large panel and several intermediate frequency SNPs per gene, at least one assay per gene with a number of heterozygotes is nearly assured, so the uninformative samples may be identified and culled post hoc.
4. To fairly evaluate candidates, all genes that might be regulated by sequences in the disease-associated region and are expressed in the tissue of interest should be included in the study. The existence of associations in gene deserts (e.g., refs. 8, 9) and the known possibility of enhancers acting at long distances (10) imply that this list may be longer than simply the genes whose exons fall within the disease-associated region. Ultimately, deciding which genes to exclude or attempt to include will always be a judgment call. In each gene to be queried, one or more transcribed sequence polymorphisms (generally SNPs) must be used to distinguish the two homologous transcripts in heterozygous samples. Unfortunately, the major technical limitation on AE surveys is the need for these heterozygotes at transcribed SNPs, meaning that genes that simply lack common sequence variants may not be readily addressable by AE. For human polymorphism data, dbSNP (http://www.ncbi.nlm.nih.gov/snp/) is a comprehensive resource that can be searched with highly specific, clearly explained fields. SNPs that are in strong linkage disequilibrium (LD) with the risk/protective haplotypes are most likely to provide informative individuals (11). (Note that any site meeting that criterion, transcribed or not, is itself an excellent candidate causal site from a population genetic perspective.)
For AE assay SNPs, the conventional wisdom has favored exonic SNPs, which can differentiate between the spliced messenger RNA (mRNA) of each of the two homologues. However, one group has found considerable success using intronic SNPs in samples originating as total RNA (7, 11, 12). They reasoned that most regulatory polymorphisms probably influence transcription itself and thus, any differences that exist in spliced mRNA will often be apparent in unspliced pre-mRNA. This may greatly expand the list of AE-addressable genes (11) because intronic SNPs are far more likely to be found at intermediate frequency. Finally, several groups have noted occasional conflicts between the results obtained by multiple SNPs in the same gene (7, 13, 14). Whenever possible, multiple SNPs in LD should be chosen for each gene so that this situation can be noted; at the very least, this observation suggests a biological or technical wrinkle in the canonical interpretation of cis-acting variation.
5. Optimizing the sample set is the most critical aspect of any project intended to dissect a GWAS region with AE. Both the type of sample tissue studied and the composition of the sample panel should be carefully considered.
Sample type: the underlying goal of an AE study is to profile a particular characteristic of gene expression, and the considerations that apply generically to gene expression profiling must be applied to AE as well. Chief among these, and potentially the greatest challenge to researchers of human complex disease, is obtaining samples of the tissue in which abnormal physiology underlies the disease. Given this difficulty, is studying a tissue that is easier to obtain an option? Consider the suspicion that would sully microarray observations made in a disease-unrelated tissue. But is it possible that relative AE is less context-specific than total expression? Unfortunately, a number of previous studies have observed a gene’s DAE (the signature of cis-acting variation) in one tissue but not another, or in some cases in multiple tissues but with dissimilar patterns (11, 13, 15–17). As might be predicted by the modular architecture of gene expression, this suggests that AE is context dependent and therefore must be studied in a tissue that is relevant to disease physiology. However, this does not necessarily mean that all studies must be performed on inherently onerous samples. For instance, CEPH and other collections of lymphoblastoid cell lines (LCLs) are commercially available (Coriell, Camden, NJ), extensively genotyped as part of the HapMap (http://www.hapmap.org), and even previously characterized in a GWAS of gene expression phenotypes (18).
For diseases that are believed to be influenced by the activity of peripheral blood lymphocytes such as asthma (19) and Crohn’s disease (20), AE work on LCLs seems prudent. Note that when LCLs or osteoblast cell lines are to be used, it is particularly wise to examine the aforementioned high-throughput study (18), which may offer relevant data on some or all of the genes in the interval under consideration. In many cases, however, the researcher is left to obtain biopsy samples that originate in cadavers or live subjects in order to have a meaningful AE experiment, similar to other strategies for furthering GWAS results (e.g., ref. 21).
Panel composition: the advantages of utilizing an AE approach include a lack of requirement for separate panels of cases and controls and a reduced panel size of informative samples: few studies have used a panel larger than 50–100 individuals. Strictly speaking, to maximize one’s chances of identifying cis-acting variation contributing to an association signal, one should assemble an AE panel in which a high proportion of individuals are heterozygous for the risk and protective haplotypes. In addition, homozygotes for the risk and/or protective haplotypes (if ones can be found that are nonetheless heterozygous at the interrogated, transcribed assay SNP) are not expected to manifest disease-associated DAE and so can serve as negative controls. With these conditions satisfied, the case/control composition of the panel is unlikely to bias the survey unless a strongly penetrant causal site underlies the association signal, but these are unusual (22) and could be estimated in the original GWAS. Unfortunately, the number of informative samples (i.e., those heterozygous for a marker SNP) will vary from SNP to SNP. While the inference of DAE can technically be made on the basis of a single informative individual, interpretation of such results must be done carefully when the panel is extremely small. Most studies have used a minimum of five to eight heterozygous individuals per gene (7, 11, 12, 23). In general, the samples should be unrelated individuals.
6. AE studies are, in essence, attempts to quantify a varying phenotype using an imperfect meterstick. Therefore, the researcher must carefully consider the structure of sample replication and processing according to the following principles.
(a) Biological replication (each genotype represented by multiple individuals). In general, the level of biological replication will be limited by the availability of suitable samples, and the number of informative individuals will vary from SNP to SNP. However, this is a gene expression study, so samples from different individuals should be handled identically to avoid confounding AE with sample prep artifacts.
Ideally, parallel RNA extractions should be performed for each individual; otherwise the ability to make inferences between individuals’ AE profiles may be compromised. If multiple extractions are to be performed (advisable, see (b)), do not perform all of a sample’s extractions on a single occasion, but rather use a block structure in which each block consists of as many unique samples as possible, spreading replicate extractions across blocks. It is generally accepted that fewer replicate gDNA extractions are necessary, but for motivation to have equal numbers of gDNA and cDNA samples, see Note 15. gDNA does not necessarily need to be extracted from the same valuable, disease-relevant tissue as cDNA.
(b) Technical replication (each individual represented by multiple samples). At least three to five technical replicates should be performed per sample × template combination (cDNA, gDNA). In keeping with the principles of sound experimental design, and because greater AE variation has been noted among samples derived from replicate RNA extractions than from replicate cDNA syntheses or PCRs (12), the replication should be performed at the point of RNA and DNA extraction when possible, with replicates processed separately through reverse transcription, PCR, and expression quantification steps.
(c) Technical consistency. While sample processing should be kept nearly identical for all samples, it is wise to minimize the variation in overall signal strength by individually adjusting the extracted RNA samples to a consistent level. Later, it may be necessary to adjust gDNA or cDNA concentrations before PCR so that yields are similar regardless of template. Naturally, the PCR conditions and “genotyping instrument” settings must be identical for all samples of a given assay! (This principle may seem obvious to the investigator, but may understandably escape a core-facility technician to whom samples have been outsourced, particularly because “normal” sequencing and SNP genotyping are robust, if not downright immune, to instrument tweaking.)
7. RNA samples, as well as certain reagents such as DTT, are sensitive to freeze–thaw cycles and should be prealiquoted to mitigate this problem.
8. Reverse transcriptase is expensive and it is often worth optimizing the minimum volume of RT needed per cDNA synthesis reaction.
9. Unlike most other options for post-GWAS investigation, AE studies require only the equipment and expertise that a typical genetic mapping-oriented laboratory is likely to already possess. Because the process is the quantification of the relative abundance of signal from two different SNP alleles, almost any sequencing or SNP-genotyping platform is appropriate.
Most popular are traditional Sanger sequencing (7, 11, 12, 24) and pyrosequencing (6, 25–27), but SNaPshot (23, 28, 29), oligo ligation assay (14), Taqman (16), and fluorescence polarization (7, 12) have all been employed for gene-by-gene AE. (Techniques for high-throughput AE quantification of many SNPs simultaneously have been published (30, 31) but are probably not cost-effective or appropriate here for the purpose of refining a single association signal.) From a practical standpoint, hands-on familiarity with a particular method (presumably accumulated while genotyping or sequencing gDNA) is the most important technology selection criterion. Methods that demonstrably introduce minimal technical variation also deserve priority (see Note 15). Finally, sequencing (Sanger or pyro)-based methods also have the slight advantage that peculiar results are easier to troubleshoot because of the information provided by the surrounding, invariant bases.
10. Pyrosequencing PCR may be performed as a standard PCR. It is often suggested that 50 cycles of PCR be performed to deplete any excess (biotinylated) primer, which can occasionally interfere with the sequencing reaction. Furthermore, different preparations of Taq polymerase have different tendencies to misbehave for pyrosequencing (or any expression assay) PCR. Specifically, some preparations readily produce unwanted accessory amplicons when cDNA is used as PCR template, even when gDNA works perfectly. This disparity can create artifacts in the AE study. Should it occur, switching to a different Taq (in particular, switching to a commercial Taq when “homemade” was being used) may remedy the situation immediately.
11. Overloaded or compacted bead pellets will cause problems in vacuum processing downstream. Do not pipet from the very bottom of the trough where beads have settled.
12. A custom holder for PyroMark Q96 plate low is provided with the PyroMark machine. This holder is preheated on the 90°C heat block. After the plate is placed in the holder, it is covered with a nonstick paper such as the backing of the PCR plate sealing film (Subheading 3.2, step 5) and is kept in place by a heavy block on top of the plate.
13. If the run experiences problems that can be solved by simply repeating the sequencing, the same samples can be salvaged. Vacuum processing them as described will strip away the extended sequencing primer and allow pyrosequencing to be performed anew.
14. Many manufacturers do not go out of their way to simplify the signal intensity measurements that underlie the SNP genotyping software.
In some pyrosequencing software versions, directly exporting peak heights must be done in a particularly buggy section of the PSQ software. First, select all wells of the plate, then click the “peak heights” tab. All the information that will be exported will be visible, so if no information is displayed, click another tab and then go back. Click the peak heights menu triangle and select export to save a tab-delimited text file. For Sanger sequencing, one option is to utilize the software packages Peakpicker (24) or QSVanalyser (32).
15. Two unfortunate statistical vagaries, variance heterogeneity and departures from normality, are particularly likely to cloud the interpretation of the AE results and therefore should be kept in mind when evaluating data. First, the t test (and ANOVA) assumes that two compared groups are each populated with random draws from two underlying distributions whose variances are equal (and under the null hypothesis, those distributions also have the same mean). This assumption is particularly likely to be violated in AE studies because the cDNA class is subject to biological noise as well as technical error, whereas gDNA is less variable because it has approximately zero biological noise. Because the t test is most robust to variance heterogeneity violations when sample sizes are equal (33), an equal number of gDNA and cDNA reactions should be performed. Second, the t test assumes that the error in quantifying AE follows a normal distribution, but this is not always true for the ratio describing AE. Unfortunately, to my knowledge, there is no reliable transformation to force AE ratios (or frequencies) into normality. Some groups have favored log-transformation of the ratio (6, 27, 34). Another alternative is principal components analysis (PCA), in which one extracts from each datum its deviation from the central tendency of the data (14). However, Monte Carlo simulations of the process of collecting relative AE data suggest that no obvious transformation (ratio vs. fraction, log, square root, arcsin, PCA) is universally optimal for coercing AE data to normality; in fact, such transformations rarely perform better than simple fractions or ratios (J. Gruber, unpublished results). These simulations also indicate that the variation in both relative and total intensity affects departures from normality. Aside from changing to a new platform, controlling the error in relative intensity is largely beyond the investigator’s control; however, efforts can and should be made to make uniform the total signal intensity among all cDNA and gDNA samples.
In general, the use of mundane descriptors of AE (i.e., the simple fraction or ratio) and interpretation with a degree of skepticism is recommended, keeping in mind that over the entire experiment, any evidence for AE at each gene must be integrated into the context of the other results and prior biological knowledge.

References

1. Donnelly P (2008) Progress and challenges in genome-wide association studies in humans. Nature 456:728–731
2. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9:356–369
3. Purdie KJ, Lambert SR, Teh MT, Chaplin T, Molloy G, Raghavan M et al (2007) Allelic imbalances and microdeletions affecting the PTPRD gene in cutaneous squamous cell carcinomas detected using single nucleotide polymorphism microarray analysis. Genes Chromosomes Cancer 46:661–669
4. Barrett J, Hansoul S, Nicolae D, Cho J, Duerr R, Rioux J et al (2008) Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nat Genet 40:955–962
5. McCarroll S, Huett A, Kuballa P, Chilewski S, Landry A, Goyette P, Zody M et al (2008) Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn’s disease. Nat Genet 40:1107–1112
6. Wittkopp PJ, Haerum BK, Clark AG (2004) Evolutionary changes in cis and trans gene regulation. Nature 430:85–88
7. Pastinen T, Ge B, Gurd S, Gaudin T, Dore C, Lemire M et al (2005) Mapping common regulatory variants to human haplotypes. Hum Mol Genet 14:3963–3971
8. Ghoussaini M, Song H, Koessler T, Al Olama AA, Kote-Jarai Z, Driver KE et al (2008) Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst 100:962–966
9. Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D et al (2007) Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet 3:e58
10. Bondarenko VA, Liu YV, Jiang YI, Studitsky VM (2003) Communication over a large distance: enhancers and insulators. Biochem Cell Biol 81:241–251
11. Verlaan DJ, Ge B, Grundberg E, Hoberman R, Lam KCL, Koka V, Dias J, Gurd S, Martin NW, Mallmin H, Nilsson O, Harmsen E, Dewar K, Kwan T, Pastinen T (2009) Targeted screening of cis-regulatory variation in human haplotypes. Genome Res 19:118–127
12. Pastinen T, Sladek R, Gurd S, Sammak A, Ge B, Lepage P et al (2004) A survey of genetic and epigenetic variation affecting human gene expression. Physiol Genomics 16:184–193
13. Campbell C, Kirby A, Nemesh J, Daly M, Hirschhorn J (2008) A survey of allelic imbalance in F1 mice. Genome Res 18:555–563
14. Gruber JD, Long AD (2009) Cis-regulatory variation is typically poly-allelic in Drosophila. Genetics 181:661–670
15. Cowles C, Hirschhorn J, Altshuler D, Lander E (2002) Detection of regulatory variation in mouse genes. Nat Genet 32:432–437
16. Lo S, Wang Z, Hu Y, Yang H, Gere S, Buetow K et al (2003) Allelic variation in gene expression is common in the human genome. Genome Res 13:1855–1862

17. Wilkins JM, Southam L, Price AJ, Mustafa Z, Carr A, Loughlin J (2007) Extreme context specificity in differential allelic expression. Hum Mol Genet 16:537–546
18. Dixon A, Liang L, Moffatt M, Chen W, Heath S, Wong K et al (2007) A genome-wide association study of global gene expression. Nat Genet 39:1202–1207
19. Moffatt M, Kabesch M, Liang L, Dixon A, Strachan D, Heath S et al (2007) Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448:470–473
20. Rioux J, Xavier R, Taylor K, Silverberg M, Goyette P, Huett A et al (2007) Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat Genet 39:596–604
21. Lyssenko V, Lupi R, Marchetti P, Del Guerra S, Orho-Melander M, Almgren P et al (2007) Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes. J Clin Invest 117:2155–2163
22. Altshuler D, Daly M (2007) Guilt beyond a reasonable doubt. Nat Genet 39:813–815
23. Bray N, Buckland P, Owen M, O’Donovan M (2003) Cis-acting variation in the expression of a high proportion of genes in human brain. Hum Genet 113:149–153
24. Ge B, Gurd S, Gaudin T, Dore C, Lepage P, Harmsen E et al (2005) Survey of allelic expression using EST mining. Genome Res 15:1584–1591
25. de Meaux J, Goebel U, Pop A, Mitchell-Olds T (2005) Allele-specific assay reveals functional variation in the chalcone synthase promoter of Arabidopsis thaliana that is compatible with neutral evolution. Plant Cell 17:676–690
26. de Meaux J, Pop A, Mitchell-Olds T (2006) Cis-regulatory evolution of chalcone-synthase expression in the genus Arabidopsis. Genetics 174:2181–2202
27. Wittkopp PJ, Haerum BK, Clark AG (2008) Regulatory changes underlying expression differences within and between Drosophila species. Nat Genet 40:346–350
28. Johnson AD, Gong Y, Wang D, Langaee TY, Shin J, Cooper-Dehoff RM et al (2009) Promoter polymorphisms in ACE (angiotensin I-converting enzyme) associated with clinical outcomes in hypertension. Clin Pharmacol Ther 85:36–44
29. Valle L, Serena-Acedo T, Liyanarachchi S, Hampel H, Comeras I, Li Z et al (2008) Germline allele-specific expression of TGFBR1 confers an increased risk of colorectal cancer. Science 321:1361–1365
30. Palacios R, Gazave E, Goñi J, Piedrafita G, Fernando O, Navarro A et al (2009) Allele-specific gene expression is widespread across the genome and biological processes. PLoS One 4:e4150
31. Serre D, Gurd S, Ge B, Sladek R, Sinnett D, Harmsen E et al (2008) Differential allelic expression in the human genome: a robust approach to identify genetic and epigenetic cis-acting mechanisms regulating gene expression. PLoS Genet 4:e1000006
32. Carr IM, Robinson JI, Dimitriou R, Markham AF, Morgan AW, Bonthron DT (2009) Inferring relative proportions of DNA variants from sequencing electropherograms. Bioinformatics 25(24):3244–3250
33. Markowski CA, Markowski EP (1990) Conditions for the effectiveness of a preliminary test of variance. Am Stat 44:322–326
34. Wittkopp PJ, Haerum BK, Clark AG (2008) Independent effects of cis- and trans-regulatory variation on gene expression in Drosophila melanogaster. Genetics 178:1831–1835

Chapter 12

Quantitative Polymerase Chain Reaction Using the Comparative Cq Method

Kimberly Yeatts

Abstract

Quantitative PCR (qPCR) is considered the gold standard for molecular DNA quantification and can be used for a wide range of techniques from comparing gene expression levels to quantifying DNA copy number variation. The strengths of this assay include sensitivity to a wide range of expression levels, a low starting template requirement, which is important when samples are scarce, and quick turnaround time. However, there are many variables to consider when performing qPCR, including (1) starting materials (e.g., tissues, cells, or genomic DNA), (2) fluorescent detection (e.g., SYBR green dye, fluorescent probes, or multiplexed assays), and (3) analysis methods (e.g., simple equations with single reference genes or complex algorithms with multiple reference genes). This chapter will introduce the process of designing an experiment while avoiding common mistakes and present tools for performing qPCR in a practical, simple, and efficient manner.

Key words: Quantitative PCR, qPCR, Real-time PCR, RT-PCR, RNA extraction, Reverse transcription, Comparative CT method, Comparative Cq method, ΔΔCT, Delta-delta-CT

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_12, © Springer Science+Business Media, LLC 2011

1. Introduction

Quantitative PCR (qPCR) is a molecular biology technique for quantifying the starting concentration of nucleic acid by measuring the accumulation of DNA in the exponential phase of a polymerase chain reaction (PCR) (1). This technique is crucial for characterizing genes involved with disease and can be used in a variety of applications including measuring RNA transcripts expressed in cells, determining genomic copy number variation (CNV), validating RNA expression microarray results, and measuring the success of RNA interference experiments (2–4). There are many variations of the qPCR technique. In this chapter, we will outline one of the oldest and most basic methods, the Comparative Cq Method, which is also known as the ΔΔCq (delta-delta-Cq) method (5), and was previously known as the ΔΔCT method before universal nomenclature was implemented (6). After acquiring a thorough understanding of this method, incorporation of more advanced techniques appropriate for each specific study is relatively straightforward.

We will use the Comparative Cq Method to perform relative quantification, as opposed to absolute quantification (7), which uses a standard curve to extrapolate the DNA quantity. Relative quantification compares gene expression of “Samples of Interest” to a “Calibrator” sample. For instance, if comparing gene expression levels in cells cultured with different drugs, the drug-treated cells would be “Samples of Interest,” and untreated cells would be the “Calibrator.” Relative quantification also requires the selection of a target gene and a reference gene. The target gene is the gene of interest. The reference gene is used to normalize the target gene based on the amount of RNA added to the reverse transcription reaction. Therefore, the expression level of the reference gene should be previously known to remain constant in all samples. Selection of a stable reference gene is one of the most important design considerations for any qPCR project (8).

After the experimental design has been determined, the first step in the laboratory protocol is to obtain a DNA template. For gene expression studies, this is done by extracting RNA from the cells or tissue of interest, and then converting the RNA into cDNA with the reverse transcription enzyme (9). Next, a PCR containing cDNA template, primers, a fluorogenic probe, and real-time master mix is run in a real-time thermal cycler. As the amount of PCR product accumulates, the fluorescence emitted from the probe increases proportionally. A single PCR includes three distinct phases of PCR amplification: the exponential phase, the linear phase, and the plateau (see Fig. 1).
Using the real-time instrument software, an arbitrary threshold line is placed in the exponential phase of the PCR. The point at which the fluorescent amplification crosses the threshold is known as the Cq value. Thus, a sample with more copies of the target gene will have more probe fluorescence, cross the threshold sooner, and have a smaller Cq value compared to a sample with fewer copies (see Fig. 1). Finally, the sample Cq values are analyzed using the comparative Cq method. The result of this analysis will be a number that represents the quantity of template DNA present in the sample, relative to the other samples that are concurrently analyzed. In addition to a variety of qPCR protocols, there are also many choices for real-time instrumentation. The focus of this protocol will be on the instrumentation and software of Applied Biosystems, Inc. This company has many real-time instrument options, including (1) choice of optical system (e.g., argon-ion laser or Tungsten-halogen lamp), (2) different throughput

Fig. 1. Real-time PCR amplification plots for two samples. The horizontal axis shows the cycle number of the PCR. The vertical axis shows the delta Rn on a logarithmic scale, where delta Rn represents the reporter signal divided by the passive reference dye signal minus the baseline signal. The three phases of the PCR, shown as solid black and gray curved lines, are labeled as exponential, linear, and plateau. The dashed black line represents the threshold value. The dotted gray lines point to the spot where the amplification signals cross the threshold, and therefore indicate the Cq values.

capabilities (e.g., 96- or 384-well format), and (3) choice of cycling speed (e.g., normal or fast). However, it is important to keep in mind that there are many other outstanding real-time instruments on the market. For example, the Cepheid SmartCycler® system is a modular instrument that can run up to 96 different samples, each with a different protocol, in under 40 min. The SmartCycler® does not require universal cycling conditions, so assay development is more flexible than with other systems. Bio-Rad also has a line of five real-time instruments that range from a reasonably priced 48-well, two-target detection system to a more advanced 96-well, five-target detection system. Bio-Rad’s real-time instruments are all add-ons made to attach to a thermal cycler base. Roche’s LightCycler® 480 technology offers 96- or 384-well throughput, and can complete runs in about 40 min. The LightCycler® uses a Xenon lamp for a broad spectrum of light detection, and also incorporates Therma-Base technology for optimal heat transfer. Eppendorf’s Mastercycler® ep realplex thermal cyclers have two- and four-channel capability. This technology utilizes individual LEDs for optical detection in a 96-well format. Lastly, the Rotor-Gene Q real-time PCR cycler from Qiagen has a unique circular sample format that utilizes a centrifugal rotary design to spin each sample in a chamber of moving air, which maximizes temperature uniformity. The Rotor-Gene Q system, with a six-channel optical capability and an LED light source, can process up to 100 samples in 60 min. In summary, there are many qPCR instrument systems available for a wide variety of budgets and experimental designs, each with different sample-throughput and optical detection solutions. Selection of real-time instrumentation, therefore, should be dictated by the specific needs of the investigator and the relevant experimental parameters.
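The comparative Cq calculation outlined in this Introduction can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's own worked example: it uses the widely used 2^(−ΔΔCq) form, which assumes that the target and reference assays both amplify at close to 100% efficiency (one doubling per cycle), and the function name and Cq values are hypothetical.

```python
def relative_quantity(cq_target_sample, cq_reference_sample,
                      cq_target_calibrator, cq_reference_calibrator):
    """Relative quantity of a target gene by the comparative Cq method.

    Assumes roughly 100% amplification efficiency for both assays.
    """
    # Normalize the target gene to the reference gene within each sample.
    delta_cq_sample = cq_target_sample - cq_reference_sample
    delta_cq_calibrator = cq_target_calibrator - cq_reference_calibrator
    # Compare the sample of interest with the calibrator.
    delta_delta_cq = delta_cq_sample - delta_cq_calibrator
    # A smaller Cq means more template, hence the negative exponent.
    return 2.0 ** (-delta_delta_cq)


# Hypothetical Cq values: drug-treated cells versus untreated calibrator.
fold_change = relative_quantity(
    cq_target_sample=24.0, cq_reference_sample=18.0,
    cq_target_calibrator=26.0, cq_reference_calibrator=18.0,
)
print(f"Target expression relative to calibrator: {fold_change:.1f}-fold")
```

In this made-up case the target crosses the threshold two cycles earlier in the treated cells after normalization, which corresponds to a fourfold higher relative quantity. The calibrator compared with itself always returns 1.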

2. Materials

2.1. RNA Extraction

1. Cell cultures (see Note 1).
2. Vacuum apparatus inside fume hood for media removal.
3. TRIzol® (Invitrogen; Chicago, IL). Store at 4°C. Warning: TRIzol contains phenol, is toxic with skin contact and if swallowed, and also causes burns.
4. RNeasy Mini Kit (Qiagen; Valencia, CA) (see Note 2).
5. 1× PBS (chilled to 4°C).
6. Cell scrapers (Fisher Scientific; Pittsburgh, PA).
7. Sterile, RNase-free tubes and tips.
8. 65°C water bath.
9. Chloroform.
10. 70% ethanol (prepared with DEPC-treated water).
11. RNase-free DNase Set (Qiagen; Valencia, CA) (see Note 3).
12. Nanodrop™ Spectrophotometer (Thermo Scientific; Wilmington, DE).

2.2. Reverse Transcription

1. SuperScript™ III First-Strand Synthesis System for RT-PCR (Invitrogen; Chicago, IL) (see Note 4). Store at −20°C.
2. Sterile 0.2 ml PCR tubes, RNase- and DNase-free.

2.3. Quantitative PCR

1. 20× Taqman® Gene Expression Assays (Applied Biosystems, Inc.; Foster City, CA) (see Note 5). Store at −20°C and protect from light. Minimize freeze–thaw cycles. Contains 0.5 Cq

Check pipetting technique and accuracy Mix reagents thoroughly Improve assay efficiency

NTC has Cq 70°C

For amplification of 5′ or 3′ ends, lambda vector-specific primers should hybridize 50 nt upstream or downstream of the cDNA cloning site, respectively.

1. Combine 10⁷ plaque-forming units of a λ cDNA library in 1–5 µl of storage medium.
2. Add 4.5 µl of 10× PCR buffer.
3. Add 5 µl of 10 mM dNTPs and 50 pmol of each lambda (V5′P or V3′P) and gene-specific primer (GS5′P or GS3′P).
4. Add sterile water to a final volume of 45 µl.
5. Incubate reaction at 95°C for 10 min in a thermocycler to denature phage particles.
6. Incubate reaction for 5 min at 75°C. Within this step, add 5 µl of a premade master polymerase mix: 0.5 µl of 10× PCR buffer, 0.5 µl of polymerase, and 4 µl of sterile water.
7. Continue with standard PCR reaction cycles. For example, 30 cycles of denaturation at 94°C for 45 s, annealing at 63°C for 30 s, and extension at 72°C for 3 min.
8. Continue with the steps 15–26 of Subheading 3.1.4 for cloning and characterization of cDNA fragment of interest.
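The reaction assembly in the steps above involves a small amount of arithmetic that can be checked with a short Python sketch. It is illustrative only: the 10 µM primer stock concentration and the 5 µl library volume are assumptions chosen for the example, and the helper names are hypothetical.

```python
def primer_volume_ul(amount_pmol, stock_um):
    """Volume of primer stock to pipette; 1 µM equals 1 pmol/µl."""
    return amount_pmol / stock_um


def water_volume_ul(final_volume_ul, component_volumes_ul):
    """Water needed to bring the listed components to the final volume."""
    water = final_volume_ul - sum(component_volumes_ul)
    if water < 0:
        raise ValueError("components already exceed the final volume")
    return water


# Volumes in µl before the polymerase mix is added (final volume 45 µl).
components = {
    "lambda cDNA library": 5.0,                            # assumed; protocol allows 1-5 µl
    "10x PCR buffer": 4.5,
    "10 mM dNTPs": 5.0,
    "vector primer": primer_volume_ul(50, 10.0),           # assumed 10 µM stock
    "gene-specific primer": primer_volume_ul(50, 10.0),    # assumed 10 µM stock
}
water = water_volume_ul(45.0, components.values())

# After adding the 5 µl polymerase mix (which carries 0.5 µl of 10x buffer),
# the buffer should end up at 1x in the 50 µl reaction.
final_buffer_x = (4.5 + 0.5) * 10 / 50.0

print(f"Sterile water to add: {water} µl")
print(f"Final buffer concentration: {final_buffer_x}x")
```

Under these assumptions the reaction needs 20.5 µl of water, and the buffer split between the two additions works out to exactly 1× in the final 50 µl.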

RNA Mapping Protocols: Northern Blot and Amplification of cDNA Ends


4. Notes

1. Make sure that there are no air bubbles in the gel or trapped between the wells which could possibly connect single wells or lead to inconsistent sample runs. Air bubbles can be carefully removed with a plastic pipette tip before the gel sets.
2. The 18S and 28S ribosomal RNA bands should appear as sharp bands. If the ribosomal bands in a given lane are not sharp, but appear as a smear towards smaller-sized RNAs, it is likely that the RNA sample suffered major degradation during preparation. The 28S ribosomal RNA band should be present at approximately twice the intensity of the 18S rRNA band. As the 28S rRNA is more labile than the 18S rRNA, equal intensities of the two bands generally indicate that some degradation has occurred.
3. UV light can damage the eyes and skin. Always wear suitable eye and face protection.
4. Preused hybridization buffer should be denatured at 95°C for 5 min before use.
5. To obtain a high yield of DIG-labeled PCR product, always optimize the PCR parameters (cycling conditions and concentrations of template, MgCl2, and primers) for each template and primer set in the absence of DIG-dUTP before attempting incorporation of DIG.
6. An excessive probe concentration in the hybridization solution causes cloudy hybridization background. If this happens, reduce probe concentration to 0.5–1 µl/ml DIG Easy Hyb buffer.
7. Use enough buffer to completely cover the membrane during prehybridization and hybridization incubations. The amount needed will depend on the shape and capacity of the container used for the incubations. Uneven distribution of probe during hybridization produces an irregular and cloudy background, and it is caused by using too little hybridization solution. Use at least 3.5 ml of hybridization solution per 100 cm² of membrane. If roller bottles (hybridization tubes) are used for incubation, add at least 6 ml of hybridization solution per bottle.
8. Do not allow the membrane to dry at any time from the beginning of prehybridization through the final detection. If the membrane dries or sticks to a second membrane, the assay will have a high background.
9. Several uses and centrifugation steps of the antidigoxigenin-AP conjugate can cause a certain loss of material, which must be compensated by use of larger amounts.


Alvarez and Nourbakhsh

10. Do not use plastic wrap to cover the membrane during the detection step: use hybridization bags, acetate sheet protectors, or two sheets of transparency film.
11. Luminescence continues for at least 24 h and signal intensity remains almost constant during the first few hours. Multiple exposures at different times can be taken to achieve the desired signal strength.
12. Prepare enough PCR Master Mix for all PCR reactions, plus one extra reaction, to ensure sufficient volume. The same Master Mix can be used for both 5′- and 3′-RACE reactions.
13. Because the necessary number of cycles depends on the abundance of the transcript, you may need to determine the optimal cycling parameters for your gene empirically.
14. If fragments >3 kb are expected, add 1 min for each additional 1 kb.

References

1. Holtke HJ, Ankenbauer W, Muhlegger K, Rein R, Sagner G, Seibl R et al (1995) The digoxigenin (DIG) system for nonradioactive labeling and detection of nucleic acids – an overview. Cell Mol Biol 41:883–905
2. Rueger B, Thalhammer J, Obermaier I, Gruenewald-Janho S (1997) Experimental procedure for the detection of a rare human mRNA with the DIG System. Front Biosci 2:C1–C5
3. Sambrook J, Russell D (2001) Molecular cloning: a laboratory manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
4. Hloch P, Hoffmann K, Kruchen B, Rueger B (2001) The DIG system – a high sensitive substitute of radioactivity in Northen blot analysis. Biochemica 2:24–25
5. Roche Applied Science (2003) DIG application manual for filter hybridization, 3rd edn. Roche Applied Science, Indianapolis, IN
6. Rost A, Kohler T, Heilmann S, Lehmann J, Remke H, Rotzsch W (1995) A rapid and simple method to prepare digoxigenin-labeled DNA-probes by using PCR-generated DNA-fragments. Eur J Clin Chem Clin Biochem 33:A59
7. Finckh U, Lingenfelter PA, Myerson D (1991) Producing single-stranded DNA probes with the Taq DNA polymerase: a high yield protocol. Biotechniques 10:35–39
8. Frohman MA, Dush MK, Martin GR (1988) Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci USA 85:8998–9002

Chapter 15

High-Content RNA Interference Assay: Analysis of Tau Hyperphosphorylation as a Generic Paradigm

RiLee H. Robeson and Travis Dunckley

Abstract

High-content analysis methods provide the opportunity to interrogate specific cellular end points in living cells. When coupled with high-throughput RNA interference (ht-RNAi) loss of function screens, high-content analyses are a powerful discovery tool for the identification of new genes and pathways involved in specific disease-relevant cellular functions. The most common readout is a fluorescence measurement, usually based on a green fluorescent protein reporter (or some derivative thereof) or a fluorescently labeled antibody. Here, we describe a specific approach to the development of a high-content assay for the hyperphosphorylation of tau protein that is compatible with RNAi screens. The goal of this chapter is to provide a generic paradigm, using hyperphosphorylation of tau protein as an example, to serve as a blueprint for the investigation of additional cellular end points or protein functions for those interested in performing high-content screens.

Key words: RNA interference, siRNA, High-content assay, Immunofluorescence, Tau, Alzheimer’s disease

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_15, © Springer Science+Business Media, LLC 2011

1. Introduction

High-content RNA interference (RNAi) assays allow the power of genome-wide RNAi to be applied in a more traditional forward genetics approach for the identification of novel genes impacting key cellular functions of interest. The advancement of single-cell imaging technologies to facilitate high-content screening enables researchers to overcome a key limitation that has historically been encountered when studying mammalian cells as a model system, which is the lack of facile genetics. Genome-wide RNAi essentially performs single-gene knockdown experiments on a genome-wide scale, enabling high-throughput, loss-of-function screens (ht-RNAi). High-content assays, typically using green fluorescent protein reporters or fluorescently labeled antibodies, allow quantitative measures of key cellular functions or proteins. Merging these high-content assays with ht-RNAi provides obvious utility for identifying key genes involved in virtually any pathway of interest for which an assay can be developed.

Neurofibrillary tangles (NFTs), which are composed primarily of aggregates of hyperphosphorylated tau protein, have emerged as key mediators of neurodegeneration (1). Tau functions as a microtubule-organizing protein that increases microtubule stability by suppressing dynamic instability (2). Aggregation of tau into neurofibrillary tangles leads to microtubule instability and loss of a functional microtubule cytoskeleton, contributing to neuronal cell dysfunction and cell death. Much of the work related to neurofibrillary tangles and Alzheimer’s Disease (AD) has focused on the hyperphosphorylation of tau protein. In AD, tau protein becomes hyperphosphorylated and aggregates into paired helical filaments (PHF), the main component of NFTs (3–8). It is widely recognized that hyperphosphorylation is a prerequisite for aggregation into NFTs. In vitro, numerous Ser/Thr and Tyr kinases phosphorylate tau protein (9). In addition, multiple protein phosphatases have been implicated in removing phosphate groups from AD-relevant phosphorylation sites (10, 11). However, the in vivo role of these kinases and phosphatases in the etiology of neurofibrillary tangle formation remains unclear. Furthermore, the signals that may activate these kinases during NFT formation and AD progression are presently unknown.

Despite the central and established role of neurofibrillary tangles in neurodegeneration, little is known about the specific molecular events and altered signaling pathways that lead to the aggregation of tau into NFTs. Identification of the genes involved in hyperphosphorylation of tau may provide valuable therapeutic targets to treat AD and other tau-based neurodegenerative disorders. Here, we describe the details of a high-content immunofluorescence assay for the detection of a specific phosphorylated form of tau protein that is affected early in the progression of AD. Where appropriate, we highlight the general principles for consideration in the development of high-content assays.

2. Materials

2.1. Cell Culture

1. H4 neuroglioma cell line (ATCC; Manassas, VA).
2. Dulbecco’s Modified Eagle’s Medium (DMEM) (Invitrogen; Carlsbad, CA).
3. Fetal Bovine Serum, Certified (Invitrogen; Carlsbad, CA).
4. L-glutamine, 200 mM 100× (Invitrogen; Carlsbad, CA).
5. Penicillin–Streptomycin, Liquid (Invitrogen; Carlsbad, CA).
6. 500 mL Vacuum Filter/Storage Bottle System, 70-mm Diameter 0.22-µm PES Pore Membrane, Sterile (Fisher Scientific; Pittsburgh, PA).
7. Geneticin Selective Antibiotic, Liquid (Invitrogen; Carlsbad, CA).
8. Corning 225 cm² Angled Neck Cell Culture Flask with Vent Cap (VWR; West Chester, PA).
9. Trypsin–EDTA (0.05% Trypsin, with EDTA 4Na) (1×) (Invitrogen; Carlsbad, CA).
10. Thermo Scientific 1× Phosphate-Buffered Saline (Fisher Scientific; Pittsburgh, PA).

2.2. siRNA Transfection

1. Transfection reagent, such as SilentEffect (BioRad; Hercules, CA), Dharmafect (Thermo Scientific; Waltham, MA), or Lipofectamine (Invitrogen; Carlsbad, CA).
2. Lethal siRNA control (Qiagen; Valencia, CA).
3. Nonsilencing or scramble siRNA control (Qiagen; Valencia, CA).
4. Corning 384-well Black Clear Bottom plates (Fisher Scientific; Pittsburgh, PA).
5. Opti-MEM I Reduced-Serum Medium (1×), liquid (Invitrogen; Carlsbad, CA).
6. Cell Titer-Blue Viability Assay (Promega; Madison, WI).

2.3. High-Content Tau Phosphorylation siRNA Screening Assay

1. siRNA Library (Qiagen; Valencia, CA).
2. Corning 96-Well Flat Clear Bottom Black Polystyrene TC-Treated Microplates, with Lid, Sterile (VWR; West Chester, PA).
3. Transfection reagent such as SilentEffect (BioRad; Hercules, CA), Dharmafect (Thermo Scientific; Waltham, MA), or Lipofectamine (Invitrogen; Carlsbad, CA).
4. Lethal siRNA control (Qiagen; Valencia, CA).
5. Nonsilencing or scramble siRNA control (Qiagen; Valencia, CA).
6. Opti-MEM I Reduced-Serum Medium (1×), liquid (Invitrogen; Carlsbad, CA).
7. Deionized H2O; 10% Bleach; 70% Isopropanol (VWR; West Chester, PA).
8. 10× Tris Buffer Saline (Fisher Scientific; Pittsburgh, PA).
9. Tween 20 Detergent (Calbiochem; San Diego, CA).
10. Paraformaldehyde (Sigma-Aldrich; St. Louis, MO).
11. Corning 25-mL and 100-mL Reagent Reservoirs, Polystyrene, Sterile (VWR; West Chester, PA).
12. Paper towels; 50-mL centrifuge tubes (VWR; West Chester, PA).
13. Normal Goat Serum and Sodium Azide (both Sigma-Aldrich; St. Louis, MO).
14. 30% Albumin from Bovine Serum (Sigma-Aldrich; St. Louis, MO).
15. Igepal CA-630 (Sigma-Aldrich; St. Louis, MO).
16. Mouse anti-human primary phospho Tau antibody (multiple vendors depending on epitope needed).
17. Polyclonal Rabbit Anti-Human Tau primary antibody (DakoCytomation; Glostrup, Denmark).
18. Fluorescein (FITC) AffiniPure Goat Anti-Rabbit IgG (H + L) (Jackson Immuno Research; West Grove, PA).
19. Cy5 AffiniPure Goat Anti-Mouse IgG, Fcγ Fragment Specific (Jackson Immuno Research; West Grove, PA).
20. Hoechst DNA Stain (Invitrogen; Carlsbad, CA).
21. Aluminum foil.
22. Parafilm (VWR; West Chester, PA).

2.4. Western Blot Validation of Screening Hits

1. Target siRNAs (Qiagen; Valencia, CA).
2. Costar 6-Well Flat-Bottom Sterile Plate (Fisher Scientific; Pittsburgh, PA).
3. Transfection reagent such as SilentEffect (BioRad; Hercules, CA), Dharmafect (Thermo Scientific; Waltham, MA), or Lipofectamine (Invitrogen; Carlsbad, CA).
4. Lethal siRNA control (Qiagen; Valencia, CA).
5. Nonsilencing or scramble siRNA control (Qiagen; Valencia, CA).
6. Opti-MEM I Reduced-Serum Medium (1×), liquid (Invitrogen; Carlsbad, CA).
7. Complete Lysis-M EDTA-Free Lysis Buffer Kit (Roche; Indianapolis, IN).
8. Phosphatase Inhibitor Cocktail 1 (Sigma-Aldrich; St. Louis, MO).
9. Phosphatase Inhibitor Cocktail 2 (Sigma-Aldrich; St. Louis, MO).
10. 10× Tris Buffer Saline (Fisher Scientific; Pittsburgh, PA).
11. Cell Scraper 16-cm 2-Pos.-Blade (Sarstedt; Nümbrecht, Germany).
12. 1.5-mL centrifuge tubes (VWR; West Chester, PA).
13. Thermo Scientific Pierce BCA Protein Assay Kit (Fisher Scientific; Pittsburgh, PA).
14. 96 Well Flat Bottom Clear Plate (VWR; West Chester, PA).
15. NuPAGE 4–12% Bis-Tris Gel 1.5 mm × 10 well gel (Invitrogen; Carlsbad, CA).
16. NuPAGE MOPS SDS Running Buffer, 20× (Invitrogen; Carlsbad, CA).
17. NuPAGE Transfer Buffer, 20× (Invitrogen; Carlsbad, CA).
18. Methanol (Fisher Scientific; Pittsburgh, PA).
19. NuPAGE LDS Sample Buffer, 4× (Invitrogen; Carlsbad, CA).
20. NuPAGE Sample Reducing Buffer (Invitrogen; Carlsbad, CA).
21. Nitrocellulose Membrane Filter Paper Sandwich 0.2-µm Pore (Invitrogen; Carlsbad, CA).
22. MagiMark XP Western Protein Standard (Invitrogen; Carlsbad, CA).
23. Novex Sharp Pre-stained Protein Standard (Invitrogen; Carlsbad, CA).
24. Generic milk powder.
25. Albumin from Bovine Serum (Sigma-Aldrich; St. Louis, MO).
26. Sodium Azide (Sigma-Aldrich; St. Louis, MO).
27. Peroxidase AffiniPure Goat Anti-Rabbit IgG (H + L) (Jackson Immuno Research; West Grove, PA).
28. Peroxidase-conjugated AffiniPure Goat Anti-Mouse IgG, Fc Fragment Specific 0.8 mg/mL Antibody Concentration (Jackson Immuno Research; West Grove, PA).
29. Ampac Flexibles Heat-Sealable Multipurpose Pouches, Capacity 4 oz. and 4.5 mil thickness (Fisher Scientific; Pittsburgh, PA).
30. Super Signal West Femto Maximum Sensitivity Substrate (Thermo Scientific; Waltham, MA).
31. ReBlot Plus Mild/Strong Antibody Stripping Solution, 10× (Millipore; Billerica, MA).

3. Methods

3.1. Cell Culture

H4 neuroglioma cells are grown in standard Dulbecco’s Modified Eagle’s Medium (DMEM) supplemented with 10% fetal bovine serum (FBS) and L-glutamine (see Note 1). Penicillin and streptomycin can be used for routine cell passage but should be excluded from the media when cells are plated for siRNA transfection as these antibiotics can significantly affect transfection efficiencies (see Note 2 and following text). Additional cell lines may require other nutrients or growth factors. However, a general consideration for high-content assays that are to be adapted to siRNA screens is that the cell line of choice should be adherent and easily transfectable with siRNAs at greater than 90% efficiency.

3.2. siRNA Transfection

To determine experimental conditions to achieve greater than 90% efficiency, the most straightforward approach is to perform a systematic test of a variety of available lipid-based transfection reagents. Common reagents include SilentEffect, Dharmafect, and Lipofectamine. These (and other) reagents also have multiple available formulations, leading to literally dozens of possible transfection reagents. Most cell lines will be transfectable with some version of these reagents. To identify the appropriate experimental conditions, one can systematically test multiple transfection reagents by transfecting the cell line of interest with a lethal siRNA at several different ratios of transfection reagent to siRNA. One then identifies the specific transfection reagent and the ratio of transfection reagent and siRNA that results in the maximum number of cell deaths in the presence of lethal siRNA and the least cell deaths in the presence of nonsilencing or random control sequence siRNA (see Note 3). Toxicity of the transfection reagent and siRNA can be measured using a standard Cell Titer Blue assay. This assay is easily performed in a 384-well plate format.
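The selection logic described above can be written down explicitly. The sketch below is illustrative only: the viability readings are invented, the reagent labels stand in for whichever formulations are tested, and scoring each condition by the difference between nonsilencing and lethal viability is an assumed, simple way of combining the two criteria, not a rule given in this protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    reagent: str
    ratio: str                      # transfection reagent : siRNA ratio label
    viability_lethal: float         # viability with lethal siRNA (fraction of untreated)
    viability_nonsilencing: float   # viability with nonsilencing siRNA

    @property
    def window(self) -> float:
        """Assumed score: low reagent toxicity combined with high lethal kill."""
        return self.viability_nonsilencing - self.viability_lethal


def best_condition(conditions):
    """Return the tested condition with the largest assay window."""
    return max(conditions, key=lambda c: c.window)


# Hypothetical Cell Titer Blue readings, normalized to untreated wells.
conditions = [
    Condition("reagent A", "1:1", 0.40, 0.95),
    Condition("reagent A", "2:1", 0.15, 0.90),
    Condition("reagent B", "1:1", 0.30, 0.70),
    Condition("reagent B", "2:1", 0.08, 0.55),
]

best = best_condition(conditions)
print(f"Best: {best.reagent} at {best.ratio} (window {best.window:.2f})")
```

In this made-up data set, reagent B at 2:1 kills the most cells with lethal siRNA but is also toxic on its own, so the condition with the widest window is reagent A at 2:1. The window is only a proxy for transfection efficiency and would still need to be checked against the 90% target.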

3.3. High-Content Tau Phosphorylation siRNA Screening Assay

The focal point of a high-content screen is the detection assay (see Note 4). Considerable effort needs to be focused on ensuring that the assay accurately reflects the quantity or activity of the protein of interest. Thus, the assay should be specific for the protein or function of interest. Furthermore, the assay needs to be sensitive to subtle changes in the quantity or activity of the protein of interest to allow identification of siRNAs that alter the expression or function of the protein. Here, we describe an antibody-based immunocytochemistry assay. However, any assay with a visible or otherwise measurable readout could be applicable to development as a high-content assay.

3.3.1. Day One

1. Thaw siRNA plates. Store sealed siRNA plates at 4°C overnight.
2. Spin down plates briefly at 200 × g to collect siRNA at the bottom of wells.
3. Prepare transfection reagent by making a 7.5:1 dilution of SilentEffect in Opti-MEM. Prepare enough to allow for pipetting error and priming of the multidrop/microfill. Store at 4°C or on ice until ready to use.
4. Prewash microfill or multidrop consecutively with deionized H2O, bleach (5–10%), deionized H2O, 70% 2-propanol, and

High-Content RNA Interference Assay


then twice with deionized H2O. Spray down the tube and prime with blank Opti-MEM.
5. Count cells using complete DMEM minus Pen Strep for actual assay (cells grown with 10 mL of Pen Strep). For 1 L, use 100 mL of FBS, 10 mL of 100× L-glutamine, and 890 mL of DMEM. First, thaw the L-glutamine and FBS, and then mix all the contents in a 1-L filter apparatus. Filter the contents and label the bottle.
6. Wash cells once with sterile PBS and aspirate.
7. Add 5 mL of trypsin and incubate at 37°C for 3–5 min.
8. Add 5 mL of media, wash the bottom of the flask, mix well, and transfer to a 15-mL tube.
9. Count by adding 10 µL of mixed cell suspension to hemocytometer slide. Calculate how much of the cell suspension is needed to achieve a concentration of 3,000 cells/well, plus enough extra for priming. Take the correct amount of cells from cell suspension, spin down at appropriate speed for cell line, and aspirate supernatant. Resuspend in 1 mL of assay media – keep until ready for plating.
10. Add transfection solution (see Note 5). Prime multidrop/microfill with transfection solution and then add 50 µL per well on microfill or multidrop. Start timer when adding to first plate; let sit at RT for 30 min. Stir bottle occasionally, keeping it at an angle!
11. Add necessary amount of media to final cell suspension to obtain the desired cell concentration.
12. Mix cells well (do not swirl, but shake bottle so that cells do not get forced only to the outside).
13. Prime multidrop/microfill with cell suspension and add 3,000 cells/well in 50 µL media/well. Periodically mix cells to guarantee even cell distribution throughout assay.
14. Tap plate on two sides once cells have been added to the plate to ensure even cell distribution. Let sit at room temperature for ~10–20 min.
15. Check cell distribution through microscope. When the cells are evenly distributed, incubate for 4 days at 37°C on level surface; do not stack plates, as stacking causes the well conditions to vary (see Note 6).

3.3.2. Day Four

16. Thaw premade 4% paraformaldehyde (PFA) in water bath for ~10–15 min. Use ~40 mL for each run.
17. Keep 1× Tris-buffered saline (TBS) at 4°C for several minutes; have four boats, paper towel, and one 50-mL tube handy.


18. Make up blocking buffer by vacuum-filtering 25 mL of normal goat serum, 5 mL of 2% sodium azide, 3.3 mL of 30% BSA, 5 mL of 0.1 IgePal CA-630, 50 mL of 10× TBS, and 411.7 mL of deionized H2O.
19. Thaw and spin down phospho-tau antibody and place on ice. Prepare working dilution in 1× TBS in 1.5-mL Eppendorf tube. Prepare a 1:1,000 final dilution.
20. Check cell confluency in plates.
21. Wash media off and use the same washing technique thereafter for consistency throughout the rest of the assay. Dump media into sink, using care until cells are fixed. Gently tap onto paper towel and wash with 100 µL of 1× TBS per well, adding buffer to the side of wells with autopipette at the slowest speed. Repeat procedure for remaining plates.
22. Remove 1× TBS and tap gently on paper towel. Wash again with 100 µL of 1× TBS and leave 100 µL of 1× TBS in each well so that the cells do not dry out.
23. In fume hood, remove TBS, tap gently on paper towel, and add 4% PFA: 30 µL/well with autopipette at the slowest speed. Gently rock plate for even distribution and then let sit for 15 min at room temperature.
24. Wash off PFA in the same manner as above, washing twice with 100 µL of 1× TBS per well.
25. Block by adding 50 µL of blocking buffer per well and incubating for 1 h. Rock plate for even distribution; remove buffer and tap gently, but do not wash.
26. Add phospho-tau antibody: 30 µL/well to side of wells – gently tap plates to ensure complete coverage of the bottom of the well. Place Parafilm over plates to limit evaporation and then place plates at 4°C overnight.

3.3.3. Day Five (see Note 7)

27. Label two 50-mL tubes: (1) tau primary and (2) secondary cocktail.
28. Have ready four boats, foil, blocking buffer, 1× TBS + 0.1% Tween, and 1× TBS.
29. Prepare a 1:200 tau primary antibody dilution in blocking buffer and place on ice.
30. Prepare secondary cocktail by combining in blocking buffer:
(a) FITC-Goat anti-rabbit (GAR) (1:200); thaw and spin down
(b) Cy5-Goat anti-mouse (GAM) (1:200); thaw and spin down
(c) Hoechst DNA stain (1:1,000)
31. Mix in 50-mL tube, wrap tube in foil, and keep on ice.


32. Remove phospho-tau antibody and wash three times with 1× TBS + 0.1% Tween.
33. Add tau primary antibody (30 µL/well). Keep at room temperature for 1 h.
34. Wash twice with 1× TBS + 0.1% Tween, as described previously.
35. Add 30 µL of secondary cocktail to each well. Cover with foil – the less exposure to light, the brighter the final signal will be – and let sit at room temperature for 30 min.
36. Wash plates once with 1× TBS + 0.1% Tween and then twice with 1× TBS.
37. Leave final wash of 100 µL of 1× TBS per well on cells.
38. Cover plates with Parafilm, wrap in foil, and place at 4°C until ready to read.

3.3.4. Day Six (see Note 8)

1. Turn on lasers on IN Cell 3000 and let them warm up for ~1 h.
2. Set up z-stack. Make up fresh Flat Field to normalize the read; this must be done prior to every run.
3. Run z-stack on buffer well (make note of well used) and pick which micron to run at (i.e., the one that produces the clearest image). Image at least 9 µm with the z-stack, with a 1-µm difference per image.
4. Set up protocol for the run: (1) pick desired scan length, (2) set optimal micron height determined by z-stack results, (3) set red and green laser to read first, then blue laser to read second to prevent overlap in fluor excitations. Label plates on IN Cell list before run begins; these must be in the same order as the plates to be read. Add any other important details in comments and load plates by hand, one at a time, keeping the remaining plates in the dark until ready to read.
5. Capture wells – 1 image per well. When run is complete, transfer data to IN Cell 3000 server to finalize analysis (see Note 9).

Data returned from the IN Cell 3000 will be fluorescence measurements for phospho-tau and total tau levels, as well as total cell counts in each well of the 96-well plate. Cell counts will provide an indication of which siRNAs are toxic to the cells. Toxic siRNAs should be removed from further analyses. siRNA-treated samples should be normalized relative to the nonsilencing siRNA samples that are on the same plate. Once normalized, one can compare siRNAs across different plates. While multiple statistical approaches can be used to prioritize the data, a relatively straightforward approach is to calculate the mean of all siRNAs tested, calculate the standard deviation (SD) of the entire data set, and prioritize siRNAs that are >2 SD from the mean, followed by those that are >1 SD from the mean.
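The analysis described in this paragraph (drop toxic siRNAs, normalize to the nonsilencing wells on the same plate, then rank by distance from the mean in SD units) might look roughly like the following Python sketch. The record layout, the `CONTROL` label, and the toxicity cutoff `min_cell_fraction` are assumptions made for illustration; the protocol does not specify a numerical toxicity threshold.

```python
from statistics import mean, stdev

CONTROL = "nonsilencing"  # hypothetical label for control wells


def normalize(wells, min_cell_fraction=0.5):
    """Return {siRNA: signal relative to same-plate nonsilencing controls}.

    wells: list of dicts with keys "plate", "sirna", "signal", "cells".
    siRNAs whose cell count falls below min_cell_fraction of the control
    mean are treated as toxic and dropped (the cutoff is an assumption).
    """
    normalized = {}
    for plate in {w["plate"] for w in wells}:
        on_plate = [w for w in wells if w["plate"] == plate]
        controls = [w for w in on_plate if w["sirna"] == CONTROL]
        ctrl_signal = mean(w["signal"] for w in controls)
        ctrl_cells = mean(w["cells"] for w in controls)
        for w in on_plate:
            if w["sirna"] == CONTROL:
                continue
            if w["cells"] < min_cell_fraction * ctrl_cells:
                continue  # toxic siRNA: exclude from further analysis
            normalized[w["sirna"]] = w["signal"] / ctrl_signal
    return normalized


def call_hits(normalized, cutoff_sd=2.0):
    """Return {siRNA: z-score} for siRNAs more than cutoff_sd SDs from the mean."""
    values = list(normalized.values())
    mu, sd = mean(values), stdev(values)
    return {s: (v - mu) / sd
            for s, v in normalized.items()
            if abs(v - mu) > cutoff_sd * sd}
```

Calling `call_hits(normalize(wells))` gives the >2 SD tier; rerunning with `cutoff_sd=1.0` gives the broader tier. With only one measurement per siRNA this is a ranking aid rather than a statistical test, which is why replication in the hit pick matters.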


3.4. Target Validation

3.4.1. Western Blot Validation of Screening Hits

3.4.1.1. To Transfect with siRNA to Target Gene of Interest (see Note 10)

The first step in target validation should be a “hit pick” of all siRNAs that are >2 SD from the mean in the original screen. If this cutoff level does not provide enough hits, the stringency can be reduced to >1.5 or >1 SD from the mean. In this process, those siRNAs are replated on separate plates and retested in the high-content assay. siRNAs that replicate can then be moved forward to further validation using a secondary measure, such as Western blotting, to confirm effects on phospho-tau and total tau, and to confirm that the siRNA is knocking down expression of the intended target. In addition to validating the screened siRNA, a second nonoverlapping siRNA sequence to the target of interest should be tested to reduce the possibility of off-target effects, wherein the screened siRNA could be affecting tau by knocking down expression of an unintended target gene.

1. Plate siRNA to target gene in two to four wells of a 6-well plate (final siRNA concentration should be ~25 nM after cells, media, and transfection reagent are added).
2. Plate nonsilencing siRNA in two wells of the same 6-well plate (25 nM final concentration).
3. Resuspend cells from flask and plate at 50,000 cells per well in all six wells.
4. Place plate in incubator for 4 days.
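The hit-pick rule described at the start of this section (start at >2 SD and relax to >1.5 or >1 SD if too few siRNAs pass) can be written as a small loop. The sketch below is illustrative only: the function name, the `min_hits` parameter, and the input format (a mapping of siRNA name to normalized signal) are assumptions, since the protocol does not say how many hits count as "enough".

```python
from statistics import mean, stdev


def hit_pick(scores, min_hits, cutoffs=(2.0, 1.5, 1.0)):
    """Return (cutoff_used, sorted hit names).

    scores: {siRNA name: normalized signal from the primary screen}.
    Cutoffs are tried from most to least stringent; the first one that
    yields at least min_hits siRNAs is used. If none does, the hits at
    the least stringent cutoff are returned.
    """
    values = list(scores.values())
    mu, sd = mean(values), stdev(values)
    picked = []
    for cutoff in cutoffs:
        picked = sorted(name for name, v in scores.items()
                        if abs(v - mu) > cutoff * sd)
        if len(picked) >= min_hits:
            return cutoff, picked
    return cutoffs[-1], picked
```

Because the mean and SD are computed once from the whole screen, relaxing the cutoff only adds siRNAs to the list; hits found at the stricter level are always retained.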

3.4.1.2. To Prepare Protein Lysates

1. Turn on refrigerated microfuge and cool down to 4°C.
2. Chill 1× TBS and place labeled collection tubes (2 per sample) on ice.
3. Prepare lysis buffer by adding 1 protease tablet (EDTA-free and chilled at 4°C) to 10 mL of Complete Lysis Buffer (Roche). Let dissolve for 2 min at room temperature and then vortex gently.
4. Add 100 µL of solution A phosphatase inhibitor (which contains cantharidin, bromotetramisole, and microcystin LR) and 100 µL of solution B phosphatase inhibitor (which contains sodium orthovanadate, sodium molybdate, sodium tartrate, and imidazole) to Complete Lysis Buffer (1:100). Keep at 4°C. Allow ~30 min for phosphatase cocktail A to thaw, then mix gently and keep on ice.
5. Place 6-well plate(s) on ice for ~5 min. Make sure that the plate is level.
6. Aspirate media and then wash twice with 1 mL of cold 1× TBS. It is important to keep the surface covered, or else the cells will dry out (see Note 11). If working on more than one plate, keep the total time spent by cells in lysis buffer constant and keep cells in cold 1× TBS if necessary.


7. Add 30–50 µL of lysis buffer to well(s) and scrape cells thoroughly with cell scraper. The less lysis buffer added, the higher the resulting protein concentration.
8. Tilt plate, scrape lysates toward the bottom of the well, and collect cells in 1.5-mL centrifuge tubes.
9. Place tubes on ice for 30 min, flicking every 10 min.
10. Centrifuge tubes at 20,000 × g, at 4°C for 15 min to remove cellular debris.
11. Collect supernatant and transfer to new prechilled and labeled 1.5-mL centrifuge tube(s). Record volume collected.
12. Determine protein concentration using a standard BCA assay.
13. Store lysates at −80°C.

3.4.1.3. To Run Protein Gel

1. Preheat heating block to ~95°C (at least above 80°C).
2. Prepare samples according to BCA results: label 1.5-mL tubes; add blue LDS Sample Buffer (9.3 µL each) to produce a final 1× concentration. Add appropriate amount of blank lysis buffer to give a final volume of 37 µL. Add sample reducing buffer (3.7 µL each) to produce final 1× concentration, and finally, add the required amount of sample to get the desired micrograms of protein loaded. Mix well.
3. Put tubes on heating block for 10 min (see Note 12). Make sure that caps are closed tightly.
4. Prep gel(s) by opening the pouch containing the gel and draining the solution, gently rinsing with deionized water, and removing the plastic seal at the bottom of the gel.
5. Place gel in container and pour running buffer in center to top; tap to remove any bubbles and check for leaks. The NuPAGE running buffer is prepared using 950 mL of deionized water and 50 mL of 20× NuPAGE Running Buffer.
6. Remove comb evenly and gently so as not to tear well walls.
7. Wash out wells by pipetting buffer up and down.
8. Fill halfway on the outside with running buffer.
9. Load 8–10 µL of ladder in the first and last wells and then 36 µL of each sample in the remaining wells.
10. Let gel run for ~50 min at 170–180 V.
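The sample preparation in step 2 is a fixed-volume mix, so the lysate and blank lysis buffer volumes follow directly from the BCA concentration. The sketch below works through that arithmetic; the function and parameter names are hypothetical, and the units (µL, µg/µL) are the assumption that makes the stated volumes consistent with a 4× sample buffer brought to 1×.

```python
def gel_sample_volumes(conc_ug_per_ul, target_ug,
                       final_ul=37.0, lds_ul=9.3, reducing_ul=3.7):
    """Return the volumes (µL) of each component for one gel sample.

    conc_ug_per_ul: lysate protein concentration from the BCA assay.
    target_ug: micrograms of protein to place in the sample tube.
    The fixed buffer volumes and final volume are those given in step 2.
    """
    if conc_ug_per_ul <= 0:
        raise ValueError("protein concentration must be positive")
    lysate_ul = target_ug / conc_ug_per_ul
    lysis_buffer_ul = final_ul - lds_ul - reducing_ul - lysate_ul
    if lysis_buffer_ul < 0:
        raise ValueError("lysate too dilute for the requested protein amount")
    return {
        "lysate": lysate_ul,
        "blank_lysis_buffer": lysis_buffer_ul,
        "lds_sample_buffer": lds_ul,
        "reducing_buffer": reducing_ul,
    }


# Example with made-up numbers: a 2 µg/µL lysate, 20 µg of protein.
volumes = gel_sample_volumes(conc_ug_per_ul=2.0, target_ug=20.0)
print(volumes)
```

With the stated buffer volumes, at most 24 µL of the 37 µL can be lysate, so the most dilute sample in a set limits the protein amount that can be matched across lanes.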

3.4.1.4. To Transfer Gel to Membrane and Block

1. Prepare NuPAGE transfer buffer by combining 200 mL of methanol, 50 mL of 20× NuPAGE Transfer Buffer, and 750 mL of deionized water. Make sure that the buffer corresponds to one recommended for the membrane.
2. Soak sponges and filters/membranes in a container with transfer buffer.


3. Remove first gel and crack open so that gel is on the larger piece; discard the other plastic piece.
4. Place first filter paper over gel.
5. Flip gel over and push out gel by pushing through the foot slot onto filter paper; discard plastic piece.
6. Cut off foot and place second filter paper on top of gel and flip. Make sure that there are no bubbles.
7. Remove top filter paper and place back into transfer buffer.
8. Place two sponges in apparatus and place gel (filter side down) on sponges; flush with the bottom edge of the apparatus.
9. Place wet membrane on gel, removing any bubbles.
10. Place filter paper on top, remove any bubbles, and place one sponge on top if running another gel, or 3 or 4 sponges if running only one gel.
11. Repeat steps 3–5 for a second gel, but place two sponges on top of final filter paper.
12. Squeeze apparatus together over the container that held the sponges, place inside the gel box apparatus, and secure with clamp.
13. Fill interior compartment with transfer buffer and tap on table to remove any bubbles.
14. Fill outside compartment completely with approximately 600 mL of deionized water, but do not let any water in the interior compartment.
15. Place lid on unit and run on low voltage (i.e., ~22 V) for 2 h.
16. Stop electrophoresis, remove membranes, and label wells and ladders.
17. Place membrane in pouch, seal two sides, and add blocking buffer: 5% milk (2.5 g of generic milk powder and 50 mL of 1× TBS + 0.1% Tween) or 5% BSA for phospho-membranes (2.5 g of BSA flakes and 50 mL of 1× TBS + 0.1% Tween).
18. Remove air and seal open side.
19. Place at 4°C overnight or for 1 h at room temperature on rotator.

3.4.1.5. To Stain Membrane with Antibodies (see Note 13)

1. The following can be done in pouches or containers as long as the membrane is covered: preblock membrane with 5% milk/BSA for 1 h at room temperature (or 4°C overnight), as described above.
2. Prepare primary antibody according to predetermined optimal dilution.
3. Use 5 mL of 5% milk/BSA with antibody and 50 µL of 2% sodium azide.


4. Empty pouches, add antibody, and reseal.
5. Leave on rocker for 2 h at room temperature or at 4°C overnight.
6. Wash membranes in 1× TBS + 0.1% Tween for 30 min, changing out the 1× TBS + 0.1% Tween every 10 min.
7. Prepare secondary antibody: 1:25,000 Anti-Mouse/Goat HRP; 50 mL of 5% milk/BSA plus 2 µL of secondary antibody. Invert to mix.
8. Remove last wash and add secondary antibody.
9. Place on rocker for 45 min at room temperature (1 h maximum).
10. Repeat wash.
11. Prepare developer shortly before adding, as it is photosensitive: 1:1 Reagent A and Reagent B in 15-mL conical tube. Only 2–3 mL in total is needed to cover the entire membrane.
12. Remove last wash and place membrane face up; cover evenly with developer.
13. Incubate at room temperature for 5 min.
14. Blot membranes carefully onto Kimwipe, wrap each membrane individually in plastic wrap, and fold over to make as flat as possible with no bubbles.

3.4.1.6. To Image

1. All imaging is done using an AlphaImager (Alpha-Innotech) equipped for chemiluminescence detection. It is important to image each membrane separately.
2. Open Program.
3. Center blot in imager and then turn off light.
4. Make sure that wheel is on position 1.
5. Set binning mode at High/Low.
6. Image for short time.
7. Save image once; the image will be inverted and at optimal shading.
8. Increase exposure time, if needed, to get optimal exposure.

3.4.1.7. To Strip and Reprobe

1. Wash membrane in 1× TBS + 0.1% Tween for 5 min on high.
2. Prepare ReBlot Stripping buffer: 2 mL of buffer + 18 mL of deionized water.
3. Place membranes in new container with fresh stripping buffer.
4. Rotate at room temperature on low to medium speed for 15–20 min, depending on strength of strip needed; however, it is preferable to not exceed 20 min.


5. Remove stripping buffer.
6. Wash on high for 5 min in 1× TBS + 0.1% Tween.
7. Transfer membrane to a new container and block at room temperature for 1 h or overnight at 4°C on rocker in appropriate blocking buffer.
8. Continue process as described above.
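Several of the buffers in the Western blot steps above are made by diluting a concentrated stock to 1× (the 10× ReBlot stripping solution and the 20× NuPAGE running and transfer buffers). As a quick check of those recipes, the stock and diluent volumes can be computed as follows; `dilute_stock` is a hypothetical helper written for this illustration, not part of any kit documentation.

```python
def dilute_stock(stock_x, final_volume_ml, final_x=1.0):
    """Return (stock_ml, diluent_ml) to reach final_x from a stock_x concentrate."""
    if stock_x <= final_x:
        raise ValueError("stock must be more concentrated than the final buffer")
    stock_ml = final_volume_ml * final_x / stock_x
    return stock_ml, final_volume_ml - stock_ml


# 10x ReBlot stripping solution, 20 mL working volume.
reblot = dilute_stock(10, 20)

# 20x NuPAGE running buffer, 1 L working volume.
running = dilute_stock(20, 1000)

# 20x NuPAGE transfer buffer, 1 L working volume; 200 mL of the diluent
# is methanol, and the remainder is deionized water.
transfer_stock_ml, transfer_diluent_ml = dilute_stock(20, 1000)
methanol_ml = 200.0
transfer_water_ml = transfer_diluent_ml - methanol_ml

print(reblot, running, transfer_stock_ml, methanol_ml, transfer_water_ml)
```

These values reproduce the volumes stated in the protocol (2 mL + 18 mL; 50 mL + 950 mL; 50 mL + 200 mL methanol + 750 mL water), which is a useful sanity check when scaling a recipe to a different working volume.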

4. Notes

Here, we have described a specific high-content, high-throughput siRNA assay for the detection of phospho-tau and total tau expression. This protocol should provide a general starting point for any antibody-based high-content screen, although, clearly, specific details will need to be established on an application-dependent basis. For example, variables to be considered depending upon the desired target for study will include the choice of cell line (i.e., those in which the protein of interest will be highly expressed), choice of primary antibody (which must be specific to target), and choice of transfection reagent (which will be dictated by cell line). It is also critical to validate the specificity and sensitivity of the assay prior to screening. While this can be accomplished in a number of ways, the most straightforward approach is to generate a dose–response curve using dilutions of siRNA to the target protein (i.e., tau in our specific example). Doing so will also provide the requisite data to determine the optimal assay duration for identification of siRNAs with the strongest effects. For instance, in our example here, we determined the assay duration for tau to be 4 days. This will likely vary depending upon the end points used and should be determined for each novel end point. One should be able to similarly use the above template to work through the remaining variables to enable development of a high-content assay.

1. When using the H4 neuroglioma cell line, maintain cells at ~80–85% confluency to ensure that cells are in a rapid growth phase for better consistency.
2. Do not repeatedly heat and cool media. Make sure that media is evenly mixed before each use. Add the geneticin to each flask of cells fresh when passaging cells – do not add to media bottle – and regularly check for mycoplasma contamination.
3. Pick the optimal transfection reagent, i.e., the one that kills the fewest cells with the scramble control and the most with the lethal siRNA control. Once the optimal transfection reagent is determined, buy a large batch of the optimal lot to


use throughout the screening process – this eliminates variations in transfection efficiency due to lot differences.
4. When preparing for screening efforts:
(a) Freeze down a large batch of cells being used at an early passage (if possible) to use throughout screening efforts to maintain consistency.
(b) Screening plates can be preprinted, sealed, and stored at −80°C until ready for use, but thaw screening plates only once.
(c) Use same transfection lot throughout entire screening process.
5. When adding transfection reagent or seeding the cells for the screen:
(a) H4 cells are very sticky, so be sure that cells are well dispersed into single cells before adding them to the plates to prevent them from being patchy or too confluent, which will make analysis of the final images very difficult.
(b) If adding transfection reagent and cells by hand, be sure not to cross-contaminate wells and add the cells at a low speed to prevent an accumulation of cells in the center of the wells.
(c) Regularly mix transfection reagent solution and cell suspension during the process of adding them to the plates – this will help with seeding consistency throughout multiple plates.
(d) Once materials are added to the plates, tap gently on two sides to ensure better distribution and prevent any liquid sticking to the side of the wells.
6. When placing the plates in the incubator:
(a) Try not to stack multiple plates on top of one another in the incubator – this will cause edge effects in the plates where the cells in the outer wells will grow at a different rate than the inner wells because they are exposed to more air/heat/moisture.
(b) Make sure that all plates are level and do not disturb until you are ready to fix the plates.
7. When fixing and staining plates:
(a) Always add all of your liquid to the side of the well on the wall. Never add reagent directly on top of cells; doing so will wash cells off the plate and skew results.
(b) Keep liquid addition methods as consistent as possible – do not change any incubation times during the screening process. This will skew your results.


(c) Never let your cells dry out; washing and antibody addition must be done as efficiently as possible to limit dry time.
(d) Once cells are fixed on the plates, you can use a higher dispensing speed on your autopipette as long as your cells are adherent and you add liquid to the side of the well.
(e) Flick liquid out of plates somewhat consistently and always tap plates on paper towels to ensure the maximal removal of the liquid – your antibody dilutions will vary well to well if too much liquid is left.
(f) Make up antibodies fresh before each use – do not reuse in the screening process.
(g) Once the secondary antibodies have been added, keep plates in the dark as much as possible – putting a silver foil seal on the lid will help limit the light exposure during the rest of the staining and washing process.
8. When reading the plates:
(a) Keep plates in the dark until ready to read – imaging does not have to be immediate, but know that the signal intensity will decrease over time even if the plates are kept in complete darkness and that evaporation will occur.
(b) Make sure that the labels on the plates match the labels listed on the analysis computer and maintain the correct order.
9. This is a brief protocol for the IN Cell Analyzer 3000 (General Electric). There are numerous analysis protocols available, whose utility will be determined by the user's needs. The current confocal platforms are the point scanning four color ImageXpress ULTRA (Molecular Devices, Union City, USA), the spinning disk (Nipkow disk) Pathway 855 and 435 from BD Biosciences (formerly Atto Biosciences, Rockville, Maryland), OPERA (PerkinElmer Inc., Waltham, MA), and the slit scanning IN Cell 3000 (GE/Amersham Biosciences, Cardiff, UK).
10. When adding transfection reagent or seeding the cells for lysates:
(a) H4 cells are very sticky, so be sure that cells are well dispersed into single cells before adding them to the plates to prevent them from being patchy or too confluent, which will make analysis of the final images very difficult.
(b) When adding the cells, add gently and evenly across the whole well to prevent a concentration of cells in one area of the well.
(c) Regularly mix transfection reagent solution and cell suspension during the process of adding them to the


plates – this will help with seeding consistency throughout multiple plates.
(d) Once materials are added to the plates, tap gently on two sides to ensure better distribution and prevent any liquid sticking to the side of the wells.
(e) Try not to stack plates and make sure that plates are level within the incubator.
11. When collecting lysates:
(a) Surface of the wells must be covered with liquid as much as possible, or else the cells will dry out.
(b) If working on more than one plate, it is important to keep the total time spent by cells in Lysis Buffer constant; therefore, keep cells in cold 1× TBS, if need be.
(c) Determine the amount of lysis buffer to add based on the confluency of the sample wells.
(d) When aspirating off the last wash of 1× TBS, be sure to remove as much as possible, or else the lysate concentration will be diluted.
(e) Always keep lysates on ice.
(f) If intending to use lysates multiple times, mix sample well, run BCA, mix sample again, and make aliquots to avoid submitting the sample to repeated freeze/thaw cycles.
12. When running the Western blot:
(a) Samples can be prepared up to the boiling point and then stored at −20°C until ready to use.
(b) Load samples slowly to ensure that no cross-contamination occurs.
(c) Depending on the size of the protein, a different percentage gel and transfer membrane may be needed.
(d) Western blotting techniques vary from lab to lab; it is important to be consistent with your techniques throughout the validation process to limit variability.
13. When staining membranes for antibodies:
(a) Always optimize your antibody dilutions prior to validation efforts.
(b) Do not use 5% milk on any membrane that will be probed with any type of phosphoantibody.
(c) Washing speed can be varied depending on the strength of the antibody being used.
(d) Always probe your membranes with a loading control to confirm loading consistency across the wells.


References

1. Johnson GV, Bailey CD (2002) Tau, where are we now? J Alzheimers Dis 4(5):375–398
2. Trinczek B, Biernat J, Baumann K, Mandelkow EM, Mandelkow E (1995) Domains of tau protein, differential phosphorylation, and dynamic instability of microtubules. Mol Biol Cell 6(12):1887–1902
3. Kidd M (1963) Paired helical filaments in electron microscopy of Alzheimer’s disease. Nature 197:192–193
4. Terry R (1963) The fine structure of neurofibrillary tangles in Alzheimer’s disease. J Neuropathol Exp Neurol 22:629–642
5. Grundke-Iqbal I, Iqbal K, Tung YC, Quinlan M, Wisniewski HM, Binder LI (1986) Abnormal phosphorylation of the microtubule-associated protein tau (tau) in Alzheimer cytoskeletal pathology. Proc Natl Acad Sci USA 83(13):4913–4917
6. Gustke N, Steiner B, Mandelkow EM et al (1992) The Alzheimer-like phosphorylation of tau protein reduces microtubule binding and involves Ser-Pro and Thr-Pro motifs. FEBS Lett 307(2):199–205
7. Ihara Y, Nukina N, Miura R, Ogawara M (1986) Phosphorylated tau protein is integrated into paired helical filaments in Alzheimer’s disease. J Biochem 99(6):1807–1810
8. Kosik KS, Orecchio LD, Bakalis S, Neve RL (1989) Developmentally regulated expression of specific tau sequences. Neuron 2(4):1389–1397
9. Johnson GV, Hartigan JA (1999) Tau protein in normal and Alzheimer’s disease brain: an update. J Alzheimers Dis 1(4–5):329–351
10. Liu F, Grundke-Iqbal I, Iqbal K, Gong CX (2005) Contributions of protein phosphatases PP1, PP2A, PP2B and PP5 to the regulation of tau phosphorylation. Eur J Neurosci 22(8):1942–1950
11. Liu F, Iqbal K, Grundke-Iqbal I, Rossie S, Gong CX (2005) Dephosphorylation of tau by protein phosphatase 5: impairment in Alzheimer’s disease. J Biol Chem 280(3):1790–1796

Part IV Alternative Approaches

Chapter 16 Integrative Systems Biology Approaches to Identify and Prioritize Disease and Drug Candidate Genes

Vivek Kaimal, Divya Sardana, Eric E. Bardes, Ranga Chandra Gudivada, Jing Chen, and Anil G. Jegga

Abstract

Although a number of computational approaches have been developed to integrate data from multiple sources for the purpose of predicting or prioritizing candidate disease genes, relatively few of them focus on identifying or ranking drug targets. To address this deficit, we have developed an approach to specifically identify and prioritize disease and drug candidate genes. In this chapter, we demonstrate the applicability of integrative systems-biology-based approaches to identify potential drug targets and candidate genes by employing information extracted from public databases. We illustrate the method in detail using examples of two neurodegenerative diseases (Alzheimer’s and Parkinson’s) and one neuropsychiatric disease (Schizophrenia).

Key words: Candidate gene prioritization, Disease gene ranking, Drug target ranking, Integrative genomics, Systems biology, Alzheimer’s disease, Parkinson’s disease, Schizophrenia

1. Introduction

The majority of common diseases, common traits, and pharmacological drug responses are genetically intricate, polygenic, multifactorial, and often result from an interaction of genetic, environmental, and physiological factors. High-throughput, genome-wide studies like linkage analysis and gene expression profiling, although useful for classification and characterization, do not provide sufficient information to identify specific disease causal genes or drug targets. Both of these approaches typically result in the identification of hundreds of potential candidate genes and cannot effectively reduce the number of target genes to a manageable figure for further validation.

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_16, © Springer Science+Business Media, LLC 2011


Kaimal et al.

In spite of recent advances in DNA sequencing and genotyping technologies, which have enabled the discovery of novel disease-causing mutations and disease-associated DNA sequence variants, transforming this knowledge into a set of validated drug targets continues to be a major challenge. Given the fact that the “druggable” genome constitutes only 12–20% of all human genes (1–3), it is not surprising that only 2% of all predicted human gene products have been successfully targeted with small-molecule drugs thus far (4). Additionally, the overlap between druggable genes and known disease genes is only about 25% (5). Traditional prioritization approaches for drug target identification based on a review of the literature can rapidly become overwhelming. Alternatively, computational integration of different criteria can be performed to create a ranking function to identify and rank potential drug targets. The question then becomes: how can we find targets that are not only disease-specific but also druggable, and/or how can they be expected to be modulated with drugs? Using two neurodegenerative diseases (Alzheimer’s and Parkinson’s) and one neuropsychiatric disease (Schizophrenia) as test cases, we demonstrate here the applicability of integrative systems-biology-based approaches to identify and prioritize disease and drug candidate genes incorporating known disease–gene and drug–gene knowledge (Fig. 1). As shown in Table 1, several gene prioritization methods have been developed to overcome the limitations of high-throughput, genome-wide studies like linkage analysis and gene expression profiling, both of which typically result in the identification of hundreds of potential candidate genes (6–16). While all of these tools are based on the assumption that similar phenotypes are caused by genes with similar or related functions (7, 16–19), they differ by the strategy adopted for calculating similarity and by the data sources utilized (20).
Except for ENDEAVOUR (10, 20) and ToppGene (15, 16), most of the existing approaches mainly focus on the combination of only a few data sources. ToppGene, for example, is more powerful as it also uses mouse phenotype data for human disease gene prioritization, which improves candidate gene prioritization (15). We have previously shown (15) that ToppGene performs better than SUSPECTS (14), PROSPECTR (8), and ENDEAVOUR (10), three commonly used methods in candidate gene prioritization. Most of the current computational disease candidate gene prioritization methods (6–15) rely on functional annotations, gene expression data, or sequence-based features. The coverage of the gene functional annotations, however, is a limiting factor. Currently, only a fraction of the genome is annotated

Integrative Systems Biology Approaches to Identify and Prioritize Disease


Fig. 1. Schematic representation of workflow to rank disease and druggable genes and interactomes based on functional annotation similarity (ToppGene) or network analysis (ToppNet).

with pathways and phenotypes (15). While two thirds of all the genes are annotated by at least one functional annotation, the remaining one third has yet to be annotated. Recent biotechnological advances such as the high-throughput yeast two-hybrid screen have facilitated building proteome-wide, protein–protein interaction networks (PPINs) or “interactome” maps in humans (21, 22). The shift in focus to systems biology in the postgenomic era has generated further interest in PPINs and biological pathways. While PPIs have been used widely to identify novel disease candidate genes (23–27), several recent studies (25, 26, 28–30) report also using them for candidate gene prioritization. Interestingly, because biological networks have been found to be comparable to communication and social networks (31) through commonalities such as scale-freeness and small-world properties, we reasoned that the algorithms used for social and

Online availability

Additional data files available from journal web site at http://genomebiology.com/2004/5/7/R47 http://cgg.ebi.ac.uk/services/dgp/ http://www.genetics.med.ed.ac.uk/prospectr/

Article supplementary material at http://physiolgenomics.physiology.org/cgi/content/full/00095.2003/DC1/

http://www.mf.uni-lj.si/bitola/ Article supplementary data available at http://www.sanbi.ac.za/tiffin_et_al/ http://www.cmbi.ru.nl/GeneSeeker/

http://www.bioinformatics.polimi.it/GFINDer/ http://www-micrel.deis.unibo.it/~tom/

BITOLA (55) Tiffin et al. (8)

GeneSeeker (56, 57)

GFINDer (58, 59) TOM (60)

http://www.cmbi.ru.nl/MimMiner/

Article supplementary material on journal Web site at http://www.jmedgenet.com/supplemental/

Freudenberg and Propping (6) OMIM phenome map (61)

Protein-protein interactions (48)

Approaches using functional relatedness between candidate genes

http://www.ogic.ca/projects/g2d_2/

Genes2Diseases (53, 54)

Approaches using links between genes and phenotypes

DGP (52) PROSPECTR (9)

Smith and Eyre-Walker (19) Huang et al. (51)

Bortoluzzi et al. (50)

Approaches based on disease gene properties

Approach

N/A N/A

Sequence Sequence

Phenotype, sequence, GO, protein interactions Protein interactions

Phenotype, GO

Expression, phenotype, literature mining Expression, phenotype Expression, GO

Literature mining Expression, literature mining

Known genes and disease loci

N/A

N/A

N/A Known genes and/or disease loci

N/A

Phenotype GO terms Known genes Concept Disease

N/A N/A

Sequence, expression Sequence, expression

Sequence, GO, literature mining

N/A

Training set (input)

Expression

Data types used

Table 1 List of current bioinformatics approaches and tools to rank human disease candidate genes

244 Kaimal et al.

http://www.esat.kuleuven.be/endeavour/

http://toppgene.cchmc.org

http://toppgene.cchmc.org

SUSPECTS (14) Prioritizer (62)

Endeavour (10)

ToppGene (16)

ToppNet (30)

Known genes Disease loci

Sequence, expression, GO Expression, GO, protein interactions Sequence, expression, GO, pathways, literature mining Mouse phenotype, expression, GO, pathways, literature mining Protein interactions Known genes

Known genes

Known genes

Disease loci

GO

The first column has the source or the name of the tool (including reference, if available). The second column has the URL of the corresponding web application, when available. If there is no web application, information regarding either the project home page or links to the corresponding supplementary material are provided. The third column is the genomic annotation types/features used by each of the methods. The last column has details of the training or the input data, if used (Note: this list is extensive, but not exhaustive, and we apologize for any oversights)

Additional data files available from journal web site at http://genomebiology.com/2003/4/11/R75 http://www.genetics.med.ed.ac.uk/suspects/ http://www.prioritizer.nl/

POCUS (7)


Web networks should be equally applicable to biological networks. As a result of this line of reasoning, we developed ToppNet (30). In this chapter, we describe the ToppGene Suite (http://toppgene.cchmc.org), a unique, one-stop online assembly of computational software tools that enables biomedical researchers to (a) perform candidate gene prioritization based on functional annotation similarity between training and test set genes (ToppGene), (b) perform candidate gene prioritization based on protein interactions network analysis (ToppNet), and (c) identify and rank candidate genes in the training set interactome based on both functional annotations and PPIN analysis (ToppGeNet).

2. Materials and Methods

The ToppGene knowledgebase combines 17 gene features available from the public domain. It includes both disease-dependent and disease-independent information: known disease genes, previous linkage regions, association studies, human and mouse phenotypes, known drug genes, microarray expression results, gene regulatory regions (transcription factor target genes and microRNA targets), protein domains, protein interactions, pathways, biological processes, and literature cocitations, among others. All these sources are combined specifically to prioritize disease or drug response candidate genes.

2.1. ToppGene: Functional Annotations-Based Candidate Gene Prioritization (see Note 1)

As a first step, ToppGene generates a representative profile of the training genes using as many as 17 features and identifies overrepresented terms from the training genes. The test set genes are then compared to this representative profile: the overrepresented terms from the training genes for all categorical annotations, and the average vector for the expression values. For each test gene, a similarity score to the training profile is derived for each of the 17 features, so that the gene is summarized by 17 similarity scores. In the case of a missing value (for instance, lack of one or more annotations for a test gene), the score is set to −1; otherwise, it is a real value in (0, 1). Different methods are used for similarity measures of categorical (e.g., GO annotations) and numeric (i.e., gene expression) annotations. A fuzzy-based similarity measure is applied for categorical terms (see Popescu et al. (32) for additional details), whereas for numeric annotations, i.e., the microarray expression values, the similarity score is calculated as the Pearson correlation of the two expression vectors. The 17 similarity scores are combined into an overall score using statistical meta-analysis,


and a p-value of each annotation of a test gene G is derived by random sampling of the whole genome. The p-value of the similarity score $S_i$ is defined as:

\[
p(S_i) = \frac{\text{count of genes having score higher than } G \text{ in the random sample}}{\text{count of genes in the random sample containing annotation}}
\]

Fisher’s inverse chi-square method, which states that

\[
-2 \sum_{i=1}^{n} \log p_i \rightarrow \chi^2(2n)
\]

(assuming the $p_i$ values come from independent tests), is then applied to combine the p-values from multiple annotations into an overall p-value. The final similarity score of the test gene is then obtained by 1 minus the combined p-value. Greater detail explaining the development of this method as well as validation and comparison with other related applications has been previously published (15, 16).

2.2. ToppNet: Network Analysis-Based Candidate Gene Prioritization (see Note 2)
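Before turning to ToppNet, the p-value combination described for ToppGene in Section 2.1 can be illustrated with a minimal Python sketch. It is not the ToppGene implementation: the function names, the treatment of −1 as "annotation missing" in the random sample, and the small floor applied to p-values of zero are all assumptions made for illustration. The chi-square tail probability is computed with the closed form that holds for an even number of degrees of freedom (2n), so only the standard library is needed.

```python
import math

def empirical_p_value(test_score, random_scores):
    """Empirical p-value of one similarity score.

    Fraction of annotated genes in the random sample that score higher
    than the test gene. A score of -1 marks a missing annotation
    (assumed convention) and is excluded from the denominator.
    """
    annotated = [s for s in random_scores if s != -1]
    if not annotated:
        return None  # feature cannot be evaluated
    higher = sum(1 for s in annotated if s > test_score)
    return higher / len(annotated)

def fisher_combined_p(p_values, floor=1e-12):
    """Combine independent p-values with Fisher's inverse chi-square method.

    -2 * sum(log p_i) is referred to a chi-square distribution with 2n
    degrees of freedom. `floor` is an assumption of this sketch that
    avoids log(0) when an empirical p-value is exactly zero.
    """
    ps = [max(p, floor) for p in p_values if p is not None]
    n = len(ps)
    if n == 0:
        return 1.0
    half = -sum(math.log(p) for p in ps)  # equals (chi-square statistic) / 2
    # Survival function of chi-square with 2n d.o.f.:
    # exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total

def similarity_score(p_values):
    """Final score as described in the text: 1 minus the combined p-value."""
    return 1.0 - fisher_combined_p(p_values)

# Hypothetical p-values for three annotation features; None = missing annotation
print(similarity_score([0.01, 0.20, None, 0.05]))
```

Features with missing annotations are simply skipped here, so a sparsely annotated gene is scored on fewer features; how the actual tool treats such genes is described in the cited publications (15, 16).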

ToppNet gene prioritization is based on the analysis of the protein–protein interaction network (PPIN). Stemming from the observation that biological networks share many properties with Web and social networks (31), ToppNet uses extended versions of three algorithms from White and Smyth (33) – PageRank with Priors, HITS with Priors, and K-step Markov – to rank disease candidate genes by estimating their relative importance in the PPIN to disease-related genes. The first algorithm, PageRank with Priors, is based on White and Smyth’s PageRank algorithm (33). It mimics the random surfer model wherein a random Internet surfer starts from one of a set of root nodes, R, and follows one of the links randomly in each step. In this process, the surfer jumps back to the root nodes at probability b, thus restarting the whole process. Intuitively, the PageRank with Priors algorithm generates a score that is proportional to the probability of reaching any node in the Web surfing process. This score indicates or measures the relative “closeness” or importance to the root nodes. The second algorithm, HITS with Priors, is an extension of HITS (Hyperlink-Induced Topic Search), which is a link analysis algorithm developed by Jon Kleinberg to rate Web pages. It determines two values for a page: “authority,” which estimates the value of the content of the page, and “hubness,” representing the value of its links to other pages (34). In the Web surfing model, the surfer still starts from one of the root nodes. In the odd steps, he/she can either follow a random “out-link” or jump back to a root node, and in the even steps, he/she can instead follow an “in-link” or jump back to a root node. Similar to the PageRank with Priors, HITS with Priors also estimates the relative probability of reaching a node in the network. The third algorithm in ToppNet is the K-Step Markov method, which mimics a surfer who starts with


one of the root nodes. The surfer follows a random link in each step, but the surfer returns to the root node after K steps and restarts surfing. For more details about the protein interaction datasets used, algorithmic details and validation, see our recently published study (30).

2.3. ToppGeNet: Prioritization of Neighboring Genes in the Protein Interactome

ToppGeNet differs from ToppGene and ToppNet in that the test set is derived from the protein interactome. In other words, for a training set of known disease genes, the test set is generated by mining the protein interactome and compiling the genes interacting either directly or indirectly (based on user input) with the training set. After any overlapping or common genes between test and training sets are removed, interactome-based test set genes can be prioritized using either a functional annotation-based method (ToppGene) or PPIN-based method (ToppNet).
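The following sketch illustrates, under stated assumptions, how an interactome-derived test set can be built and then ranked with a PageRank-with-Priors style random walk of the kind described in Section 2.2. It is not the ToppNet code: the toy network, the node names (T1, T2, A to D), the function names, the back-probability of 0.3, and the fixed iteration count are all hypothetical. The network is treated as undirected, the training genes act as the root nodes, and the update rule is the standard one in which a walker follows a random edge with probability 1 − b and jumps back to a root node with probability b.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected interaction network as an adjacency map."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def interactome_test_set(graph, training, depth=1):
    """Genes within `depth` interaction steps of the training genes,
    with the training genes themselves removed."""
    seen = set(training)
    frontier = set(training)
    for _ in range(depth):
        frontier = {n for g in frontier for n in graph.get(g, ())} - seen
        seen |= frontier
    return seen - set(training)

def pagerank_with_priors(graph, roots, back_prob=0.3, iterations=100):
    """Random walk that restarts at the root (training) nodes with
    probability `back_prob`; returns a score per node."""
    roots = set(roots) & set(graph)
    if not roots:
        raise ValueError("no training gene is present in the network")
    nodes = list(graph)
    prior = {n: (1.0 / len(roots) if n in roots else 0.0) for n in nodes}
    score = dict(prior)
    for _ in range(iterations):
        updated = {}
        for v in nodes:
            walk = sum(score[u] / len(graph[u]) for u in graph[v])
            updated[v] = (1.0 - back_prob) * walk + back_prob * prior[v]
        score = updated
    return score

def rank_test_set(graph, training, depth=1, back_prob=0.3):
    """Test-set genes ordered from most to least 'close' to the training set."""
    scores = pagerank_with_priors(graph, training, back_prob)
    test = interactome_test_set(graph, training, depth)
    return sorted(test, key=lambda g: scores[g], reverse=True)

# Hypothetical toy network: T1 and T2 are training genes
toy = build_graph([("T1", "A"), ("T1", "B"), ("T2", "B"), ("B", "C"), ("C", "D")])
print(rank_test_set(toy, {"T1", "T2"}, depth=1))
```

In this toy network, node B interacts with both training genes and therefore ranks ahead of node A, which interacts with only one. Setting `depth=2` also admits indirect interactants such as node C into the test set.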

2.4. Identifying and Ranking Novel Disease and Druggable Genes for Alzheimer’s Disease (AD), Parkinson’s Disease (PD), and Schizophrenia

To illustrate the various features and utility of the ToppGene Suite to identify and rank potential novel disease and druggable genes, we have used three neurological disorders as test cases. We used different applications (ToppGene, ToppNet, and ToppGeNet) from our ToppGene Suite for this purpose (see Fig. 1 for a schematic representation of the overall workflow). The following sections describe the workflow and the results:

2.4.1. Compiling Disease and Drug Training Set Genes for AD, PD, and Schizophrenia

Known disease-associated genes were obtained by combining gene lists from OMIM (35), the Genetic Association Database (36), GWAS (37), and disease biomarkers from the Comparative Toxicogenomics Database (38). As shown in Table 2, for drug target genes of the three diseases, we first started with the marketed or approved drugs for each disease and then extracted their validated target genes from the DrugBank database (39). The network view in Fig. 2 shows the shared genes and drugs and the validated target genes of these three diseases. We then used these as the training sets to rank the disease candidate genes and the disease interactome of AD, PD, and schizophrenia. As can be seen from Table 2 and Fig. 2, for a given disease, there is very little overlap between the disease genes and the drug target genes, suggesting that the encoded proteins of many disease genes could be difficult to target directly. Since genes or proteins typically interact with one another to function, a complementary strategy can be to exploit this functional interconnectivity of intracellular networks. In other words, the druggable search space can be extended to include potential targets lying either upstream or downstream of, or in parallel with, known disease genes. To this effect, we have ranked the disease gene neighborhoods of AD, PD, and schizophrenia by mining the interactome in the current study.
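The limited overlap between disease genes and drug target genes can be checked directly from the gene lists in Table 2. The sketch below does this for Alzheimer’s and Parkinson’s disease with plain set intersections; the gene symbols are copied from Table 2 (drug targets pooled across the listed drugs), while the variable and function names are illustrative only.

```python
# Disease genes and pooled DrugBank target genes as listed in Table 2
ad_disease_genes = set("""
A2M ACE AD10 AD5 AD6 AD7 AD8 AD9 APBB2 APOE APP BACE1 BAX BCL2 BDNF BLMH
CALM1 CASP3 CD33 CPT1B CRH DISC1 FAM113B GAB2 HFE HMOX1 MPO NOS3 PAXIP1
PCDH11X PLAU PPARG PSEN1 PSEN2 PVRL2 SLC2A6 SLC30A4 SLC30A6 SORL1 TF
TOMM40 TPP1 ZNF224
""".split())
ad_drug_targets = set("HTR2A ACHE GRIN3A GRIN2A GRIN2B BCHE".split())

pd_disease_genes = set("""
ADH1C ATP13A2 CYP2D6 CYP2E1 DBH DCTN1 DGKQ DLG2 DRD4 FBXO7 FGF20 GAK
GIGYF2 GSTA4 HFE HTRA2 LRRK2 MAPT NDUFV2 NR4A2 PARK10 PARK12 PARK2 PARK3
PARK7 PINK1 SEMA5A SLC18A2 SLC6A3 SNCA SNCAIP STAP1 TBP UCHL1
""".split())
pd_drug_targets = set("""
DRD1 DRD2 DRD3 DRD4 DRD5 CALY ADCY7 CREB1 CYP3A4 SST COMT ADORA1 BCAT1
CACNA1B CACNA2D1 CACNA2D2 GRIN3A GRIN2A GRIN2B GRIN3B CYP2B GRIN1 GRIN2D
HRH1 MAOB CHRM1
""".split())

def overlap(disease_genes, drug_targets):
    """Genes that are both known disease genes and known drug targets."""
    return sorted(disease_genes & drug_targets)

print("AD overlap:", overlap(ad_disease_genes, ad_drug_targets))
print("PD overlap:", overlap(pd_disease_genes, pd_drug_targets))
```

With these lists, the Alzheimer’s disease sets share no gene and the Parkinson’s disease sets share only DRD4, which is the kind of gap that motivates extending the search to the interacting neighbors of the disease genes.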

2.4.2. Compilation of Test Set Genes for AD, PD, and Schizophrenia

The AlzGene (40), PDGene (http://www.pdgene.org), and SczGene (http://www.schizophreniaforum.org/res/sczgene) databases provide a comprehensive, unbiased, and regularly updated collection of genetic association studies performed using AD, PD, and schizophrenia phenotypes, respectively; the genes cataloged in these databases formed the first type of test set. The second type of test set comprised immediate protein interactants of the known disease genes (i.e., the “disease interactome”).

Table 2 Alzheimer’s disease, Parkinson’s disease, and schizophrenia-associated disease genes and drug genes used as training sets for ranking the respective disease candidate genes and disease interactome

Alzheimer’s disease (43 genes)
Disease genes: A2M, ACE, AD10, AD5, AD6, AD7, AD8, AD9, APBB2, APOE, APP, BACE1, BAX, BCL2, BDNF, BLMH, CALM1, CASP3, CD33, CPT1B, CRH, DISC1, FAM113B, GAB2, HFE, HMOX1, MPO, NOS3, PAXIP1, PCDH11X, PLAU, PPARG, PSEN1, PSEN2, PVRL2, SLC2A6, SLC30A4, SLC30A6, SORL1, TF, TOMM40, TPP1, ZNF224
Drug genes (from DrugBank), listed as drug: known target genes
Donepezil: HTR2A, ACHE
Galantamine: ACHE
Memantine: GRIN3A, GRIN2A, GRIN2B
Rivastigmine: ACHE, BCHE
Tacrine: ACHE

Parkinson’s disease (34 genes)
Disease genes: ADH1C, ATP13A2, CYP2D6, CYP2E1, DBH, DCTN1, DGKQ, DLG2, DRD4, FBXO7, FGF20, GAK, GIGYF2, GSTA4, HFE, HTRA2, LRRK2, MAPT, NDUFV2, NR4A2, PARK10, PARK12, PARK2, PARK3, PARK7, PINK1, SEMA5A, SLC18A2, SLC6A3, SNCA, SNCAIP, STAP1, TBP, UCHL1
Drug genes (from DrugBank), listed as drug: known target genes
Amantadine: DRD1, DRD2
Apomorphine: DRD1, DRD2, DRD3, DRD4, DRD5, CALY
Bromocriptine: ADCY7, CREB1, CYP3A4, DRD2, SST
Entacapone: COMT
Gabapentin: ADORA1, BCAT1, CACNA1B, CACNA2D1, CACNA2D2
Memantine: GRIN3A, GRIN2A, GRIN2B
Orphenadrine: GRIN3A, GRIN3B, CYP2B, GRIN1, GRIN2D, HRH1
Pergolide: DRD1, DRD2
Ropinirole: DRD2, DRD3, DRD4
Selegiline: MAOB
Tolcapone: COMT
Trihexyphenidyl: CHRM1

Schizophrenia (70 genes)
Disease genes: ACSM1, AGBL1, AKT1, APOL1, APOL2, APOL4, BTN2A2, BTN3A1, BTN3A2, CCDC60, CCL2, CHI3L1, CHRNA7, CLINT1, COMT, CP, CSF2RA, CTXN3, DAO, DAOA, DISC1, DISC2, DRD3, DTNBP1, FXR1, GPC1, HIST1H2AG, HIST1H2BJ, HLA-DQA1, HLA-E, HTR2A, IL3RA, MTHFR, NOTCH4, NRG1, NRGN, PDE4D, POM121L2, POU3F2, PRODH, PRSS16, PTBP2, RELN, RGS4, ROBO1, ROBO2, RTN4R, SCZD1, SCZD10, SCZD11, SCZD12, SCZD13, SCZD2, SCZD3, SCZD4, SCZD6, SCZD7, SCZD8, SLC12A2, SLC17A1, SLC17A3, SLCO6A1, SYN2, TAAR6, TCF4, TNIK, TRAF3, VRK2, ZNF184, ZNF804A
Drug genes (from DrugBank), listed as drug: known target genes
Amantadine: DRD1, DRD2
Apomorphine: DRD1, DRD2, DRD3, DRD4, DRD5, CALY
Aripiprazole: DRD2, HTR1A, HTR2A
Bromocriptine: ADCY7, CREB1, CYP3A4, DRD2, SST
Clozapine: DRD1, DRD2, DRD4, HRH1, HTR1A, HTR2A, HTR2C, CALY, HRH4
Entacapone: COMT
Gabapentin: ADORA1, BCAT1, CACNA1B, CACNA2D1, CACNA2D2
Memantine: GRIN3A, GRIN2A, GRIN2B
Olanzapine: CHRM1, CHRM2, CHRM3, CHRM4, CHRM5, ADRA1B, DRD1, DRD2, DRD3, DRD4, HRH1, HTR2A, HTR2C
Orphenadrine: GRIN3A, GRIN3B, CYP2B, GRIN1, GRIN2D, HRH1
Pergolide: DRD1, DRD2
Quetiapine: ADRA1B, DRD1, DRD2, HRH1, HTR1A, HTR2A, HTR2B, HTR2C
Risperidone: ADRA1B, ADRA1A, ADRB1, DRD2, HRH1, HTR2A
Ropinirole: DRD2, DRD3, DRD4
Selegiline: MAOB
Tolcapone: COMT
Trihexyphenidyl: CHRM1
Ziprasidone: ADRA1A, ADRB1, DRD2, DRD3, DRD4, HRH1, HTR1A, HTR1D, HTR2A, HTR2C, HTR7

2.4.3. Prioritization of Disease Candidate Genes and Disease Interactome (Fig. 1)

The ToppGene and ToppNet applications of the ToppGene Suite (http://toppgene.cchmc.org) were used to prioritize candidate genes separately. The top ten genes (listed based on the harmonic mean of the ToppGene and ToppNet ranks) using known disease genes and known drug genes are listed in Tables 3 and 4, respectively. The ToppGeNet application was used to rank the disease interactomes of AD, PD, and schizophrenia separately using ToppGene and ToppNet (see Tables 5 and 6 for the top ten ranked genes from each of the three disease interactomes).
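The harmonic mean used to order the genes in Tables 3 to 6 can be reproduced with a few lines of Python. The sketch below uses four Alzheimer’s disease rows from Table 3; the function name is illustrative, and the only assumption is that both ranks are positive integers.

```python
def harmonic_mean_rank(toppgene_rank, toppnet_rank):
    """Harmonic mean of two positive ranks (smaller is better)."""
    return 2.0 * toppgene_rank * toppnet_rank / (toppgene_rank + toppnet_rank)

# (gene, ToppGene rank, ToppNet rank) from the Alzheimer's disease rows of Table 3
rows = [("TP53", 2, 9), ("CASP8", 1, 16), ("NOS1", 4, 2), ("COMT", 128, 1)]

# Order genes by the harmonic mean of their two ranks
ranked = sorted(rows, key=lambda r: harmonic_mean_rank(r[1], r[2]))
for gene, tg, tn in ranked:
    print(f"{gene}\t{tg}\t{tn}\t{harmonic_mean_rank(tg, tn):.2f}")
```

This yields 1.88 for CASP8, 1.98 for COMT, 2.67 for NOS1, and 3.27 for TP53, matching the values listed in Table 3. Because the harmonic mean is dominated by the smaller of the two ranks, a gene ranked first by one method stays near the top even when the other method ranks it far lower (COMT: 128 and 1).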


Fig. 2. Network representation of known disease-genes, approved drugs, and known drug target genes of AD, PD, and schizophrenia. The size of the nodes is proportional to the degree or number of edges or connections to other node(s).

Table 3 Top ten ranked candidate genes for Alzheimer’s disease, Parkinson’s disease, and schizophrenia using functional enrichment (ToppGene) and network-based (ToppNet) approaches

Alzheimer’s disease (572 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
CASP8*        1               16             1.88
COMT*+        128             1              1.98
NOS1*         4               2              2.67
TP53*         2               9              3.27
ESR1*+        48              3              5.65
TNF+          3               137            5.87
LRPAP1        346             4              7.91
LRP1+         27              5              8.44
SERPINE1*+    5               29             8.53
SNCA          6               17             8.87

Parkinson’s disease (440 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
TH*           1               53             1.96
B2M           260             1              1.99
DRD2*+        2               12             3.43
TFRC          250             2              3.97
DRD3*+        3               40             5.58
TF            113             3              5.84
DRD1*+        4               138            7.77
HSPA8         151             4              7.79
FYN*+         19              5              7.92
DRD5*+        5               109            9.56

Schizophrenia (768 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
DRD2*+        1               67             1.97
PIK3R1+       232             1              1.99
DRD4*+        2               99             3.92
MAPK14*       127             2              3.94
FYN+          174             3              5.9
APOL3         3               632            5.97
GRB2          228             4              7.86
SLC6A4*+      4               355            7.91
IL3+          172             5              9.72
CHRNA4*+      5               440            9.89

The training sets used for ranking were the respective known disease genes. Genes marked with an asterisk are part of the druggable genome, while those marked with a plus sign have an FDA-approved drug

Table 4 Top ten ranked candidate genes for Alzheimer’s disease, Parkinson’s disease, and schizophrenia using functional enrichment (ToppGene) and network-based (ToppNet) approaches

Alzheimer’s disease (572 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
NOS1*         1               20             1.9
APP           38              1              1.95
SNCA          2               22             3.67
LAMB1+        368             2              3.98
GNA11         108             3              5.84
CHRNA3*       3               445            5.96
FYN*+         19              4              6.61
CHRNA7*+      4               157            7.8
PLAT*+        220             5              9.78
CHRNA4*+      5               366            9.87

Parkinson’s disease (440 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
HTR2A*+       1               51             1.96
ACE*+         94              1              1.98
DLG2*         39              2              3.8
CHRNA4*       2               180            3.96
ADORA2A*+     3               6              4
FYN*+         22              3              5.28
KCNJ6         26              4              6.93
SLC6A3*+      4               113            7.73
GNAI3         47              5              9.04
PRKCG*        13              7              9.1

Schizophrenia (768 candidate genes as test set)
Gene          ToppGene rank   ToppNet rank   Mean rank (harmonic)
DLG4          49              1              1.96
CHRNA7*+      1               226            1.99
GRIN2C*+      2               66             3.88
GRB2          243             2              3.97
HTR1B*+       7               4              5.09
CACNA1A*+     3               69             5.75
GNAO1         105             3              5.83
SLC6A3*+      4               240            7.87
GRIA2*        5               67             9.31
DLG1          259             5              9.81

The training sets used for ranking were pooled known drug-target genes for each of the three diseases from DrugBank. Genes marked with an asterisk are part of the druggable genome, while those marked with a plus sign have an FDA-approved drug

Table 5 Top ten ranked candidate genes of the Alzheimer’s disease, Parkinson’s disease, and schizophrenia interactomes using functional enrichment (ToppGene) and network-based (ToppNet) approaches

Alzheimer’s disease (467 immediate interactants of the 43 known Alzheimer’s disease genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
CASP8*        63             1               1.97
COMT*+        1              145             1.99
BCL2L1*       20             2               3.64
CRHR1*        2              147             3.95
NOS1*         12             3               4.8
CRHR2*        3              216             5.92
PTPN11*       4              55              7.46
MAPT          89             4               7.66
PTPN6*        5              155             9.69
CASP9*        177            5               9.73

Parkinson’s disease (295 immediate interactants of the 34 known Parkinson’s disease genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
CTNNB1        1              39              1.95
TH*           197            1               1.99
PSEN1         100            2               3.92
RHOA          2              122             3.94
POR*          3              20              5.22
APP           22             3               5.28
SNCB          106            4               7.71
GRB10         4              177             7.82
HTT           102            5               9.53
FANCG         5              233             9.79

Schizophrenia interactome (314 immediate interactants of the 70 known schizophrenia genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
ABP1          1              85              1.98
GRIN1*+       134            1               1.99
SLIT2         2              41              3.81
HTT           194            2               3.96
ACE*          32             3               5.49
APOA1         3              33              5.5
AR*           77             4               7.6
LSM8          4              302             7.9
ERBB2*        76             5               9.38
PDZK1         5              155             9.69

The disease interactome comprised known immediate interacting genes of the known disease genes. The training sets used for ranking were the respective known disease genes. Genes marked with an asterisk are part of the druggable genome, while those marked with a plus sign have an FDA-approved drug

Table 6 Top ten ranked candidate genes of the Alzheimer’s disease, Parkinson’s disease, and schizophrenia interactomes using functional enrichment (ToppGene) and network-based (ToppNet) approaches

Alzheimer’s disease (467 immediate interactants of the 43 known Alzheimer’s disease genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
APP           17             1               1.89
NOS1*         1              50              1.96
COL4A1        310            2               3.97
GRM5*+        2              265             3.97
SNCA          3              53              5.68
LAMA1+        379            3               5.95
SRC*+         73             4               7.58
GRM7*         4              97              7.68
RICS          53             5               9.14
KCNQ2*        5              352             9.86

Parkinson’s disease (295 immediate interactants of the 34 known Parkinson’s disease genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
DLG4          5              1               1.67
SLC6A3*+      1              201             1.99
GABRA1*+      2              21              3.65
TRIP13        233            2               3.97
NOS1*         3              64              5.73
GRB2          102            3               5.83
SRC           55             4               7.46
TH*           4              231             7.86
STUB1         80             5               9.41
DLG1          39             6               10.4

Schizophrenia interactome (314 immediate interactants of the 70 known schizophrenia genes)
Gene          ToppNet rank   ToppGene rank   Mean rank (harmonic)
DLG4          1              3               1.5
GNAQ          13             1               1.86
EGFR*+        2              26              3.71
GNAI1         42             2               3.82
HTT           3              66              5.74
GNAI2         9              5               6.43
GRB2          56             4               7.47
ACE*+         5              15              7.5
PTAFR*        4              129             7.76
GNAO1         16             6               8.73

The disease interactome comprised known immediate interacting genes of the known disease genes. The training sets used for ranking were pooled known drug-target genes for each of the three diseases from DrugBank. Genes marked with an asterisk are part of the druggable genome, while those marked with a plus sign have an FDA-approved drug

3. Notes

1. Candidate gene prioritization approaches based on functional similarity have some limitations, which are listed below:
(a) By using a training set, it is assumed that the disease genes yet to be discovered will be consistent with what is already known about a disease and/or its genetic basis. This assumption may not always be true.
(b) It is important to note that the annotations and analyses provided, and the prioritization by these approaches, can only be as accurate as the underlying online sources from which the annotations are retrieved. Only one fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined.
(c) Using an appropriate training set is critical. In an earlier study, while cross-validating ToppGene, we observed that using larger training sets (>100 genes) decreases the sensitivity and specificity of the prioritization compared to smaller training sets of 7–21 genes (15).
(d) Almost all of the current disease gene identification and prioritization approaches are gene-centric. In other words, the approaches are based on coding sequences. However, it has been speculated that complex traits result more often from noncoding regulatory variants than from coding sequence variants (41–43), yet analyzing noncoding regulatory variants is fraught with problems. For instance, functional consequences of coding region variants are typically readily assessed as missense, nonsense, splicing, and other polymorphisms. On the other hand, interpreting the consequences of noncoding sequence variants is more complicated because the relationships among promoter, intergenic, or noncoding sequence variation, gene expression level, and trait phenotype are less well understood than the relationship between coding DNA sequence and protein function.
(e) Because these methods are primarily based on gene annotation, they tend to be biased towards selecting better annotated genes. For instance, a “true” candidate gene can be missed if it lacks sufficient annotations. Furthermore, some clinical understanding of the molecular basis or disease etiology is needed to aid the clinically informed binary evaluation, and this process could be partly subjective and researcher-specific. The effectiveness of this approach therefore depends critically on how well the disease under investigation is defined both molecularly and physiologically.

2. Although studies have shown that usage of PPIs enables identification of novel candidate disease genes, there are several practical limitations associated with network-based candidate gene prioritization approaches:
(a) High-throughput PPI sets, especially yeast two-hybrid sets, are inherently noisy and contain many interactions with no biological relevance (44–47). Surprisingly, only 5.8% of the human, fly, and worm yeast two-hybrid interactions have been confirmed by the HPRD (Human Protein Reference Database) (48).
(b) There is a possibility of inherent bias towards well-studied proteins in the interactome.
(c) Some of the human protein interactome data is derived by extrapolating high-throughput interactions from other species. Although previous studies have shown that PPIs are quite conserved across species (49), there is a possibility for species-specific protein interactions.

(d) Two interacting proteins need not lead to similar disease phenotypes when mutated – for instance, they may have different but overlapping functions, or one may be more dispensable than the other (48). Additionally, disease proteins may lie at different points in a molecular pathway and need not interact with each other directly.
(e) Disease mutations need not always involve proteins (e.g., telomerase RNA component in congenital autosomal dominant dyskeratosis) (48).

References

1. Russ AP, Lampel S (2005) The druggable genome: an update. Drug Discov Today 10:1607–1610
2. Hopkins AL, Groom CR (2002) The druggable genome. Nat Rev Drug Discov 1:727–730
3. Plewczynski D, Rychlewski L (2009) Metabasic estimates the size of druggable human genome. J Mol Model 15:695–699
4. Yildirim MA, Goh KI, Cusick ME, Barabasi AL, Vidal M (2007) Drug-target network. Nat Biotechnol 25:1119–1126
5. Sakharkar MK, Sakharkar KR, Pervaiz S (2007) Druggability of human disease genes. Int J Biochem Cell Biol 39:1156–1164
6. Freudenberg J, Propping P (2002) A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 18(Suppl 2):S110–S115
7. Turner FS, Clutterbuck DR, Semple CA (2003) POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol 4:R75
8. Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA (2005) Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res 33:1544–1552
9. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS (2005) Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinform 6:55
10. Aerts S, Lambrechts D, Maity S et al (2006) Gene prioritization through genomic data fusion. Nat Biotechnol 24:537–544
11. Thornblad TA, Elliott KS, Jowett J, Visscher PM (2007) Prioritization of positional candidate genes using multiple web-based software tools. Twin Res Hum Genet 10:861–870
12. Zhu M, Zhao S (2007) Candidate gene identification approach: progress and challenges. Int J Biol Sci 3:420–427
13. Tiffin N, Adie E, Turner F et al (2006) Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Res 34:3067–3081
14. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS (2006) SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22:773–774
15. Chen J, Xu H, Aronow BJ, Jegga AG (2007) Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinform 8:392
16. Chen J, Bardes EE, Aronow BJ, Jegga AG (2009) ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res 37:W305–W311
17. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL (2007) The human disease network. Proc Natl Acad Sci USA 104:8685–8690
18. Jimenez-Sanchez G, Childs B, Valle D (2001) Human disease genes. Nature 409:853–855
19. Smith NG, Eyre-Walker A (2003) Human disease genes: patterns and predictions. Gene 318:169–175
20. Tranchevent LC, Barriot R, Yu S et al (2008) ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Res 36:W377–W384
21. Rual JF, Venkatesan K, Hao T et al (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437:1173–1178
22. Stelzl U, Worm U, Lalowski M et al (2005) A human protein–protein interaction network: a resource for annotating the proteome. Cell 122:957–968
23. George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA (2006) Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res 34:e130
24. Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8:333–346
25. Kohler S, Bauer S, Horn D, Robinson PN (2008) Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet 82:949–958
26. Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Mol Syst Biol 4:189
27. Xu J, Li Y (2006) Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 22:2800–2805
28. Chen JY, Shen C, Sivachenko AY (2006) Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pac Symp Biocomput 11:367–378
29. Ortutay C, Vihinen M (2009) Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies. Nucleic Acids Res 37:622–628
30. Chen J, Aronow BJ, Jegga AG (2009) Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinform 10:73
31. Junker BH, Koschutzki D, Schreiber F (2006) Exploration of biological network centralities with CentiBiN. BMC Bioinform 7:219
32. Popescu M, Keller JM, Mitchell JA (2006) Fuzzy measures on the gene ontology for gene product similarity. IEEE/ACM Trans Comput Biol Bioinform 3:263–274
33. White S, Smyth P (2003) Algorithms for estimating relative importance in networks. In: KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, 266–275
34. Kleinberg J (1999) Authoritative sources in a hyperlinked environment. J ACM 46:604–632
35. Hamosh A, Scott A, Amberger J, Bocchini C, McKusick V (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33:D514–D517
36. Becker KG, Barnes KC, Bright TJ, Wang SA (2004) The genetic association database. Nat Genet 36:431–432
37. Hindorff LA, Sethupathy P, Junkins HA et al (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367
38. Davis AP, Murphy CG, Saraceni-Richards CA, Rosenstein MC, Wiegers TC, Mattingly CJ (2009) Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res 37:D786–D792
39. Wishart DS, Knox C, Guo AC et al (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36:D901–D906
40. Bertram L, McQueen MB, Mullin K, Blacker D, Tanzi RE (2007) Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet 39:17–23
41. King MC, Wilson AC (1975) Evolution at two levels in humans and chimpanzees. Science 188:107–116
42. Korstanje R, Paigen B (2002) From QTL to gene: the harvest begins. Nat Genet 31:235–236
43. Mackay TF (2001) Quantitative trait loci in Drosophila. Nat Rev Genet 2:11–20
44. Giot L, Bader JS, Brouwer C et al (2003) A protein interaction map of Drosophila melanogaster. Science 302:1727–1736
45. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98:4569–4574
46. Li S, Armstrong CM, Bertin N et al (2004) A map of the interactome network of the metazoan C. elegans. Science 303:540–543
47. Uetz P, Giot L, Cagney G et al (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403:623–627
48. Oti M, Snel B, Huynen MA, Brunner HG (2006) Predicting disease genes using protein–protein interactions. J Med Genet 43:691–698
49. Huynen MA, Snel B, van Noort V (2004) Comparative genomics for reliable protein-function prediction from genomic data. Trends Genet 20:340–344
50. Bortoluzzi S, Romualdi C, Bisognin A, Danieli GA (2003) Disease genes and intracellular protein networks. Physiol Genomics 15:223–227
51. Huang H, Winter EE, Wang H et al (2004) Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes. Genome Biol 5:R47
52. Lopez-Bigas N, Ouzounis CA (2004) Genome-wide identification of genes likely to be involved in human genetic disease.
Nucleic Acids Res 32:3108–3114 Perez-Iratxeta C, Bork P, Andrade MA (2002) Association of genes to genetically inherited diseases using data mining. Nat Genet 31:316–319

Integrative Systems Biology Approaches to Identify and Prioritize Disease 54. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA (2005) G2D: a tool for mining genes associated with disease. BMC Genet 6:45 55. Hristovski D, Peterlin B, Mitchell JA, Humphrey SM (2005) Using literature-based discovery to identify disease candidate genes. Int J Med Inform 74:289–298 56. van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG (2003) A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 11:57–63 57. van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, Vriend G (2005) GeneSeeker: extraction and integration of human disease-related information from webbased genetic databases. Nucleic Acids Res 33:W758–W761 58. Masseroli M, Galati O, Pinciroli F (2005) GFINDer: genetic disease and phenotype location statistical analysis and mining of

59.

60.

61.

62.

259

dynamically annotated gene lists. Nucleic Acids Res 33:W717–W723 Masseroli M, Martucci D, Pinciroli F (2004) GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res 32:W293–W300 Rossi S, Masotti D, Nardini C et al (2006) TOM: a web-based integrated approach for identification of candidate disease genes. Nucleic Acids Res 34:W285–W292 van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA (2006) A textmining analysis of the human phenome. Eur J Hum Genet 14:535–542 Franke L, Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C (2006) Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 78:1011–1025

Chapter 17

Identification of a Common Variant Affecting Human Episodic Memory Performance Using a Pooled Genome-Wide Association Approach: A Case Study of Disease Gene Identification

Traci L. Pawlowski and Matthew J. Huentelman

Abstract

Genome-wide association studies (GWAS) are an important tool for discovering novel genes associated with disease or traits. Careful design of case–control groups greatly facilitates the efficacy of these studies. Here we describe a pooled GWAS study undertaken to find novel genes associated with human episodic memory performance. A genomic locus for the WW and C2 domain-containing 1 protein, KIBRA (also known as WWC1), was found to be associated with memory performance in three cognitively normal cohorts from Switzerland and the USA. This result was further supported by correlation of KIBRA genotype and differences in hippocampal activation as measured by functional magnetic resonance imaging (fMRI). These findings provide an excellent example of the application of GWAS using a pooled genomic DNA approach to successfully identify a locus with strong effects on human memory.

Key words: Genome-wide association study, Single-nucleotide polymorphism, DNA pooling, Genomics

1. Introduction

Genome-wide association studies (GWAS) have become a powerful tool, allowing both hypothesis-free experimentation and the discovery of novel trait or disease gene associations. This approach has recently come of age with the combined contributions from the completion of the draft human genome sequence, the initiation of the International HapMap Project (1–4), the identification and public deposition of information related to millions of single-nucleotide polymorphisms (SNPs) and their

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_17, © Springer Science+Business Media, LLC 2011


corresponding genomic locations, and the improvement and lowered cost of genome-spanning SNP genotyping technologies. The GWAS approach is much faster than candidate gene sequencing, more powerful than traditional linkage analysis, and uniquely allows a whole-genome survey of large cohorts. With the release of ultra-high density SNP arrays (i.e., those exceeding 300,000 markers), scanning the entire genome in a relatively short period of time has become feasible (5). In addition, these SNP arrays allow measurement of copy number variation (CNV), which has significant implications for detecting other genomic factors that may contribute to disease susceptibility.

GWAS provide genotype data spanning the entire genome, which, with appropriate statistical tests, can be used to identify disease or trait loci. Instead of using traditional linkage analysis to look for a positional candidate gene, this approach utilizes an affected vs. unaffected study design with association analysis (5, 6), thereby facilitating the discovery of disease-causing genes for an entire population. It is important to note that these studies are powered primarily to detect variants that are relatively common within the study population, based, somewhat erroneously, on what is typically referred to as the common disease–common variant hypothesis (7). By early 2008, more than 150 associations between common SNPs and disease traits had been identified, although the large majority of these still require further study and independent replication (5).

GWAS require relatively large sample sizes to provide adequate power to detect genetic association following correction for multiple testing. However, SNP microarrays were designed to interrogate one sample per chip, making the cost of scanning very large cohorts prohibitive for many researchers.
One way to reduce this cost is to pool genomic DNA and genotype the pools rather than the individuals, which makes the survey of a large cohort very cost-effective without sacrificing data accuracy. This technique also provides a reasonably priced way to achieve sufficient statistical power in a GWAS experiment. Here we present a case study using pooled DNA samples with a GWAS approach to identify a gene underlying human episodic memory performance (8) and Alzheimer’s disease (9). Although this was one of the first studies to utilize an ultra-high density SNP array, it is worth remembering that the field of human genomics evolves quickly and much has changed since this study was initiated. Throughout this chapter we point to additional articles and information that will help the reader navigate the many considerations that must be taken into account during the design of a modern-day GWAS.


2. Methods

2.1. Assessment of Genetic Background

Study population substructure can result in spurious SNP associations (10–13). In a typical individually genotyped GWAS, an effort to collect individuals of a specified ethnic background can be made a priori, and because each person is genotyped, those individuals contributing to substructure can be identified and excluded from use during the analysis stage. However, with a pooled GWAS design, one must identify these individuals prior to pooling to avoid untoward and biasing effects on the eventual rank-ordered association list (see Note 1). In our study, we accounted for substructure by first individually genotyping 351 subjects from Swiss cohort 1 at 318 unlinked SNPs. Population structure was then investigated using the STRUCTURE program (14). We estimated the ancestry of study subjects under the a priori assumption of K = 2 discrete subpopulations. Structured association analysis revealed moderate allele-frequency divergence among populations. Identical results were obtained under the a priori assumption of 3 ≤ K ≤ 6 discrete subpopulations. Individuals identified as outliers were excluded from contributing to the pooled DNA.

2.2. Creation of Equimolar Pools of Genomic DNA

A pooled approach to GWAS has as its primary limitation the inability to retrospectively stratify the genotyped pooled cohort by a secondary phenotypic trait. However, due to its lower cost and increased speed from genotyping to analysis, it is a worthwhile approach when resources are limited for a particular trait or disease investigation, or as a preliminary screen aimed at identifying the so-called low-hanging fruit for a particular disease. Several groups have investigated these pros and cons, which are thoroughly discussed elsewhere (15–18). It has been demonstrated that estimation of allele frequencies in DNA pools is highly accurate when care is taken to ensure the pooling of equal amounts of DNA from each individual (15–17, 19).

In the case study presented here, we quantitated individual genomic DNA concentrations of 351 subjects in triplicate with the PicoGreen® reagent Quant-iT™ dsDNA Assay Kit (Invitrogen), and then averaged the three readings to obtain the final working concentration. Any DNA sample whose triplicate measures exceeded a 15% coefficient of variation was remeasured. We then constructed pools by assigning individuals to each pool based on quartile ranking in performance on a verbal episodic memory task. Groupings were based on 5-min free recall performance (i.e., bottom 25%, bottom 50%, top 50%, and top 25% performers). We used 120 ng of genomic DNA from each individual in the construction of pools and each pool was created de novo a total of three times. Individual sample quality is also an important consideration; therefore, prior to pooling, each sample was also assessed qualitatively using 1% TAE gel electrophoresis. Any noticeably degraded samples were excluded from the pooling phase. Once pools were created they were diluted to 50 ng/µl. The pools were then genotyped in duplicate on both the Nsp I and Sty I Early Access 500K Mendel arrays from Affymetrix (Santa Clara, CA).

2.3. Statistical Analysis

In a GWAS using pooled DNA samples, the resulting data cannot be analyzed with the same statistical approaches utilized for an individually genotyped sample set. Because the arrays are hybridized to pooled samples, there is an increased number of heterozygous and/or “no call” genotypes, because the fluorescent intensity at each genotyping probe does not conform to the statistical model that was used to create the genotype-calling algorithm. For pooled GWAS, the data are essentially an approximation of the allele frequencies (e.g., allelotyping) of the pooled samples for each respective marker. Various groups have proposed solutions for pooled GWAS analysis and prioritization of SNPs for follow-up study (15, 16, 18, 20), and one such example is the freely available GenePool software, developed by members of our team (21).

We used the following steps for data analysis in our study. While the relative allele signal (RAS) score we describe here is relevant to Affymetrix arrays, user manuals for other technologies will delineate the data extraction process in a platform-specific way (see Note 1). In the current example, we estimated allelic frequency for each marker based on the corresponding RAS score (note: the RAS metric has since been abandoned by Affymetrix in favor of a model-based genotyping algorithm (20)). Generally, RAS = A/(A + B), where A is the median match/mismatch difference for the major allele and B is the corresponding value for the minor allele (Affymetrix Technical Manual). Because both sense and antisense directions are probed, there are two RAS values, RAS1 (sense) and RAS2 (antisense). Because RAS1 and RAS2 are predictions of allelic frequency with distinct variability, we treated the two values as two different experiments. RAS values were generated from the Affymetrix GeneChip Human Mapping 500K array data with a custom Perl script.
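The RAS definition above, and the way a RAS-derived allele frequency from two pools might feed a SNP-specific chi-square statistic, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Perl script: the function names, the averaging of replicate arrays, the 2 x 2 allele-count table built from pool size and estimated frequency, and all signal values are assumptions made for the example. The sketch ignores pooling-specific measurement error.

```python
import math
from statistics import mean


def ras(a_signal, b_signal):
    """Relative allele signal, RAS = A / (A + B)."""
    total = a_signal + b_signal
    if total <= 0:
        raise ValueError("A + B must be positive")
    return a_signal / total


def pooled_chi2(freq_1, n_1, freq_2, n_2):
    """Chi-square (df = 1) on the 2 x 2 allele-count table implied by two pools.

    Assumption: each pool of n individuals contributes 2n chromosomes and the
    RAS-derived frequency is taken at face value as the allele frequency.
    """
    a, b = 2 * n_1 * freq_1, 2 * n_1 * (1 - freq_1)
    c, d = 2 * n_2 * freq_2, 2 * n_2 * (1 - freq_2)
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denom


def p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical (A, B) signals from two replicate arrays, one strand only.
top_freq = mean(ras(a, b) for a, b in [(620.0, 380.0), (600.0, 400.0)])
bottom_freq = mean(ras(a, b) for a, b in [(450.0, 550.0), (470.0, 530.0)])

chi2 = pooled_chi2(top_freq, 88, bottom_freq, 88)  # 88 per pool is illustrative
bonferroni_alpha = 0.05 / 500_000
print(f"chi2 = {chi2:.2f}, P = {p_value_df1(chi2):.2e}, threshold = {bonferroni_alpha:.0e}")
```

In practice the same calculation would be repeated for RAS1 and RAS2 separately, since the two strands are treated as independent experiments.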
To guard against pooling-related false positives, we used two stringent statistical approaches in combination to identify markers showing significant differences in allele frequency between case and control pools. First, SNP-specific χ² values were calculated using the RAS-derived allelic frequencies to compare the following: top 50% vs. bottom 50% performers (entire sample) and top 25% vs. bottom 25% performers (distribution extremes). Because RAS1 and RAS2 values were treated independently, statistics for each SNP were calculated a total of four times. SNPs were considered significant in at least one of these four comparisons according to the following criteria: χ² ≥ 28, df = 1 (corresponding to P ≤ 0.05 Bonferroni-corrected for 500,000 comparisons), variation coefficient of RAS-derived allelic frequencies 0.2. Second, a genome-wide windowing approach was employed to pick out significant, physically contiguous clusters of SNPs. A Student’s t test was used to compare RAS1 and RAS2 separately across all ~500,000 SNPs for the top 25% vs. the bottom 25% performers and the top 50% vs. the bottom 50% performers in the Swiss population (see Note 2). Finally, we calculated a median t test score at sliding window sizes of 3, 5, 10, 20, and 40 SNPs for both RAS1 and RAS2 across the entire genome and graphed the results to identify significant groupings of SNPs at various window sizes.

2.4. Individual Genotyping and Analysis

In the example presented here, we individually genotyped each sample to confirm the two markers showing significant differences in allele frequency in cohort 1, namely rs17070145 in the KIBRA gene and rs6439886 in the calsyntenin 2 gene (CLSTN2). SNP genotyping was performed using both Pyrosequencing (Qiagen; Uppsala, Sweden) on a PSQ96 MA machine and the Amplifluor method (http://www.kbiosciences.com). Another popular method for low-throughput SNP genotyping is the TaqMan® allelic discrimination assay (Applied Biosystems; Foster City, CA).

We performed statistical analysis of the genotype data obtained in individual samples using the Haploview v3.2 software (http://www.broad.mit.edu/mpg/haploview) for the assessment of linkage disequilibrium and for haplotype reconstruction. Multifactorial analyses of variance were performed for the simultaneous assessment of the influence of age, sex, education, and genotype on cognitive test performance. All tests were two-tailed. Carriers of the T allele of rs17070145 had significantly better memory performance on word recall at 5 min and 24 h after word presentation than C allele homozygotes. SNP rs6439886 showed similar results, but the differences in memory performance were smaller (see Note 3).
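The carrier comparison described above (T allele carriers vs. C allele homozygotes) amounts to a dominant coding of the T allele. The following Python sketch shows only that grouping step; the function names and all data are hypothetical, and it does not reproduce the multifactorial analysis of variance, which also adjusts for age, sex, and education.

```python
from statistics import mean


def is_t_carrier(genotype):
    """True for CT or TT at the SNP, False for CC (dominant coding of T)."""
    return "T" in genotype.upper()


def recall_by_carrier_status(records):
    """Mean recall score for T carriers and for C homozygotes.

    records: iterable of (genotype, score) pairs, e.g. ("CT", 8).
    """
    carriers = [score for genotype, score in records if is_t_carrier(genotype)]
    homozygotes = [score for genotype, score in records if not is_t_carrier(genotype)]
    return mean(carriers), mean(homozygotes)


# Hypothetical genotypes and 5-min free recall scores, for illustration only.
example = [("CC", 6), ("CC", 7), ("CT", 8), ("TT", 9), ("CT", 7), ("CC", 5)]
carrier_mean, homozygote_mean = recall_by_carrier_status(example)
print(f"T carriers: {carrier_mean:.2f}, CC: {homozygote_mean:.2f}")
```

A difference in unadjusted group means is only a first look; any formal test should include the covariates named in the text.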

2.5. Replication Cohort Genotyping

It is important to verify findings from a GWAS study in additional cohorts in a hypothesis-driven fashion. In this example, Pyrosequencing was again utilized to examine the genotype of the second cohort. We found statistically significant evidence for association between rs17070145 and episodic memory, but not between memory performance and alleles at marker rs6439886. Thus, only rs17070145 was examined in a third cohort. Carriers of the T allele performed better on a visual episodic memory task compared to individuals carrying only C alleles. This study design of discovery with the first cohort, testing of significant SNPs with a second cohort, and further confirmation of a significant SNP with a third cohort is a cost-effective and statistically sound strategy (6).

2.6. Validation of Findings of Association in Replicate Populations

The results of a GWAS are strengthened by independent replication. In the case of the KIBRA GWAS findings, two other studies supported the initial findings, while one did not (see Note 4). A cohort of 64 healthy elderly subjects was genotyped at rs17070145 and administered a verbal learning and memory task. Those who carried only the C allele were found to have significantly poorer performance compared to T allele carriers (22). Another study of 312 older adults observed delayed recall scores of C allele carriers to be lower than those carrying the T allele (23). A third study, comprising two separate cohorts of 319 and 365 individuals, respectively, did not observe statistically significant evidence for association between rs17070145 genotype and memory performance (24).

In addition to validation studies, it is also helpful to extend findings of SNP associations for a trait (like memory) to diseases with potential effects on that trait. Two studies recently reported association between KIBRA alleles and Alzheimer’s disease (AD), although with opposing results. One study, led by a research team in Spain, reported association between age-at-onset of AD and marker rs17070145, i.e., the same SNP previously linked to episodic memory (25). However, this study found the T allele associated with an elevated risk of AD, which is not consistent with findings of association between this allele and enhanced episodic memory. In contrast, we observed statistically significant evidence for association between the C allele at KIBRA and increased risk for AD in 1,026 cases and controls, many of whom were both clinically characterized and neuropathologically verified after death (9). The disparity between the two studies may reflect geographical location, classification criteria for cases and controls, or other factors.
In addition to genetic findings, we also found significant transcriptional dysregulation of KIBRA and multiple biochemical interacting partners in laser-dissected neurons from multiple brain regions in clinically characterized and neuropathologically verified cases and matched controls. Finally, we found that cognitively normal carriers of only the C allele had significantly reduced positron emission tomography (FDG-PET) measurements of the cerebral metabolic rate for glucose in brain regions known to be preferentially affected by AD compared to T allele carriers (26, 27).

2.7. Biological Evidence Supporting Genetic Association

A logical next step following the observation of statistically significant findings of association from GWAS, in combination with validation in independent replication panels, is to investigate supporting biological evidence. In the case study presented here, we first examined expression of KIBRA RNA and protein levels in the brain. Using RT-PCR, we found that a truncated version of KIBRA was highly expressed in memory-related structures of the brain, while the full-length transcript was expressed at extremely low levels. We utilized functional magnetic resonance imaging (fMRI) to delineate activation differences between T allele carriers and noncarriers (N = 15 in each group) during performance of memory tasks. We found that carriers of only the C allele showed more activation of memory-related brain areas to reach the same level of memory performance as carriers of the T allele, which is consistent with the findings of genetic association. Together, findings obtained from GWAS using pooled DNA, replication in independent study samples, validation in a disease model related to the trait of interest, and functional characterization are consistent with a role for KIBRA in memory function and susceptibility to AD.

3. Notes

1. Each manufacturer has a process specific to their array platform. In general, a hybridization station, wash station, and scanner are needed. After scanning is completed, each manufacturer has specific software for extracting and analyzing data. For manufacturer-specific information and protocols see the Affymetrix (http://www.affymetrix.com/products services/ arrays/genomewide snp6/genome wide snp 6.affx#1) or Illumina (http://www.illumina.com/pages.ilmn?ID=261) Web sites. A major barrier to the initiation and successful completion of a GWAS is the requirement of both significant equipment and computing infrastructure. Due to space constraints we will not discuss the pros and cons of each SNP genotyping platform. However, an excellent discussion of positive and negative attributes can be found elsewhere (28–30). In lieu of performing the actual array hybridization in your laboratory, many contractors exist who can expertly process your samples, extract the data, and return it to you for analysis.
2. Statistical power is important. Before undertaking a study, decide how many samples will be necessary and whether a transmission disequilibrium test or case–control design is more appropriate for your study. Excellent reviews of study design can be found in the following references (6, 31–33).
3. It is important to remember that thoughtful study design is critical. Careful attention to quality control measures such as call rate, batch effects, Hardy–Weinberg equilibrium and minor allele frequency, as well as cohort phenotyping, will help avoid biased data (34).
4. Careful interpretation of results can explain differences in replication studies. Differences in phenotyping, cohort size, cohort population, and statistical methods can skew results.
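Two of the quality control measures named in Note 3, minor allele frequency and Hardy–Weinberg equilibrium, can be computed directly from genotype counts when samples are genotyped individually. The Python sketch below is illustrative only: the function names and the genotype counts are hypothetical, and it uses the standard one-degree-of-freedom chi-square goodness-of-fit test, which is one of several ways to test Hardy–Weinberg proportions and is not a method described by the authors.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the rarer allele from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic markers).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hwe_p_value(n_aa, n_ab, n_bb):
    """Upper-tail P value; the test has one degree of freedom."""
    return math.erfc(math.sqrt(hwe_chi2(n_aa, n_ab, n_bb) / 2))


# Hypothetical genotype counts for one SNP in 351 individually genotyped subjects.
counts = (150, 160, 41)
print(f"MAF = {minor_allele_frequency(*counts):.3f}, HWE P = {hwe_p_value(*counts):.3f}")
```

Markers that fail thresholds chosen by the investigator for either measure are commonly set aside before association testing.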

Acknowledgments

We would like to thank our colleagues and collaborators at: The Division of Psychiatry Research and Center for Integrative Human Physiology at the University of Zurich, Switzerland; The Banner Alzheimer’s Institute, Phoenix, AZ; and the Sun Health Research Institute, Sun City, AZ. We are also grateful to the individuals and families whose participation made this research possible. This work is supported by the Arizona Alzheimer’s Disease Consortium, the State of Arizona, and NIH grant R01NS059873.

References

1. International HapMap Consortium (2003) The International HapMap Project. Nature 426(6968):789–796
2. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437(7063):1299–1320
3. Frazer KA, Ballinger DG, Cox DR et al (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449(7164):851–861
4. Thorisson GA, Smith AV, Krishnan L, Stein LD (2005) The International HapMap Project Web site. Genome Res 15(11):1592–1593
5. Altshuler D, Daly MJ, Lander ES (2008) Genetic mapping in human disease. Science 322(5903):881–888
6. Hirschhorn JN, Daly MJ (2005) Genomewide association studies for common diseases and complex traits. Nat Rev Genet 6(2):95–108
7. Jobling MA, Hurles ME, Tyler-Smith C (2004) Human evolutionary genetics: origins, peoples and disease. Garland Science Publishing, London/New York, p 523
8. Papassotiropoulos A, Stephan DA, Huentelman MJ et al (2006) Common Kibra alleles are associated with human memory performance. Science 314(5798):475–478
9. Corneveaux JJ, Liang WS, Reiman EM et al (2008) Evidence for an association between KIBRA and late-onset Alzheimer’s disease. Neurobiol Aging 30:322–324
10. Campbell CD, Ogburn EL, Lunetta KL et al (2005) Demonstrating stratification in a European American population. Nat Genet 37(8):868–872
11. Clayton DG, Walker NM, Smyth DJ et al (2005) Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37(11):1243–1246
12. Marchini J, Cardon LR, Phillips MS, Donnelly P (2004) The effects of human population structure on large genetic association studies. Nat Genet 36(5):512–517
13. Tian C, Gregersen PK, Seldin MF (2008) Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet 17(R2):R143–R150
14. Pritchard JK, Rosenberg NA (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 65(1):220–228
15. Craig DW, Huentelman MJ, Hu-Lince D et al (2005) Identification of disease causing loci using an array-based genotyping approach on pooled DNA. BMC Genomics 6:138
16. Docherty SJ, Butcher LM, Schalkwyk LC, Plomin R (2007) Applicability of DNA pools on 500 K SNP microarrays for cost-effective initial screens in genomewide association studies. BMC Genomics 8:214
17. Knight J, Saccone SF, Zhang Z, Ballinger DG, Rice JP (2009) A comparison of association statistics between pooled and individual genotypes. Hum Hered 67(4):219–225
18. Sebastiani P, Zhao Z, Abad-Grau MM et al (2008) A hierarchical and modular approach to the discovery of robust associations in genome-wide association studies from pooled DNA samples. BMC Genet 9:6
19. Jawaid A, Sham P (2009) Impact and quantification of the sources of error in DNA pooling designs. Ann Hum Genet 73(Pt 1):118–124
20. Meaburn E, Butcher LM, Liu L et al (2005) Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics 6(1):52
21. Pearson JV, Huentelman MJ, Halperin RF et al (2007) Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet 80(1):126–139
22. Schaper K, Kolsch H, Popp J, Wagner M, Jessen F (2008) KIBRA gene variants are associated with episodic memory in healthy elderly. Neurobiol Aging 29(7):1123–1125
23. Almeida OP, Schwab SG, Lautenschlager NT et al (2008) KIBRA genetic polymorphism influences episodic memory in later life, but does not increase the risk of mild cognitive impairment. J Cell Mol Med 12(5A):1672–1676
24. Need AC, Attix DK, McEvoy JM et al (2008) Failure to replicate effect of Kibra on human memory in two large cohorts of European origin. Am J Med Genet B Neuropsychiatr Genet 147B(5):667–668
25. Rodriguez-Rodriguez E, Infante J, Llorca J et al (2009) Age-dependent association of KIBRA genetic variation and Alzheimer’s disease risk. Neurobiol Aging 30(2):322–324
26. Reiman EM, Chen K, Alexander GE et al (2004) Functional brain abnormalities in young adults at genetic risk for late-onset Alzheimer’s dementia. Proc Natl Acad Sci USA 101(1):284–289
27. Reiman EM, Chen K, Alexander GE et al (2005) Correlations between apolipoprotein E epsilon4 gene dose and brain-imaging measurements of regional hypometabolism. Proc Natl Acad Sci USA 102(23):8299–8302
28. Anderson CA, Pettersson FH, Barrett JC et al (2008) Evaluating the effects of imputation on the power, coverage, and cost efficiency of genome-wide SNP platforms. Am J Hum Genet 83(1):112–119
29. Diskin SJ, Li M, Hou C et al (2008) Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res 36(19):e126
30. Nishida N, Koike A, Tajima A et al (2008) Evaluating the performance of Affymetrix SNP Array 6.0 platform with 400 Japanese individuals. BMC Genomics 9:431
31. Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7(10):781–791
32. Hamer D, Sirota L (2000) Beware the chopsticks gene. Mol Psychiatry 5(1):11–13
33. McGinnis R, Shifman S, Darvasi A (2002) Power and efficiency of the TDT and case-control design for association scans. Behav Genet 32(2):135–144
34. Neale BM, Purcell S (2008) The positives, protocols, and perils of genome-wide association. Am J Med Genet B Neuropsychiatr Genet 147B(7):1288–1294

Chapter 18

RNAi-Based Functional Pharmacogenomics

Sukru Tuzmen, Pinar Tuzmen, Shilpi Arora, Spyro Mousses, and David Azorsa

Abstract

Experimental alteration of gene expression is a powerful technique for functional characterization of disease genes. RNA interference (RNAi) is a naturally occurring mechanism of gene regulation, which is triggered by the introduction of double-stranded RNA into a cell. This phenomenon can be synthetically exploited to down-regulate expression of specific genes by transfecting mammalian cells with synthetic short interfering RNAs (siRNAs). These siRNAs can be designed to silence the expression of specific genes bearing a particular target sequence in high-throughput (HT) siRNA experimental systems and may potentially be presented as a therapeutic strategy for inhibiting transcriptional regulation of genes. This can constitute a strategy that can inhibit targets that are not tractable by small molecules such as chemical compounds. Large-scale experiments using low-dose drug exposure combined with siRNA also represent a promising discovery strategy for the purpose of identifying synergistic targets that facilitate synthetic lethal combination phenotypes. In light of such advantageous applications, siRNA technology has become an ideal research tool for studying gene function. In this chapter, we focus on the application of RNAi, with particular focus on HT siRNA phenotype profiling, to support cellular pharmacogenomics.

Key words: RNAi, siRNA, Drug target, Phenotypic screening, Genomics, Functional pharmacogenomics

1. Introduction

The field of pharmacogenomics aims to identify genomic factors that determine individual response to specific drugs. The use of genomic information to personalize therapy is presently being evaluated in numerous clinical trials and is likely to become an integral factor in the practice of individualized medicine. Until the recent advances in RNAi-based technologies, genome-compatible approaches had not been utilized

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_18, © Springer Science+Business Media, LLC 2011


to establish causal links between gene function and drug response phenotypes. Consequently, functional translation of such information has been the rate-limiting step in understanding the molecular mechanisms underlying drug response. Personalized medicine is largely driven by the development and application of new genomic technologies to rapidly characterize genetic sequence variations and gene expression profiles on a genome-wide scale. Such experiments are typically accompanied by extensive association studies aimed at identifying genes or gene sets that predict drug response. The associations are then extensively validated and evaluated for their ability to predict response. Often, the function of these associated genes and their putative role in drug response is unknown. A better understanding of which genes are not only associated with drug response, but also causally involved would add immense value to pharmacogenomics data, permitting a more intelligent exploitation of such genes as diagnostic markers, and as putative drug targets to improve drug response. In addition to accelerating the functional validation of gene lists from association-based pharmacogenomics data, high-throughput screening approaches for genome-scale functional analysis can potentially be a more relevant starting point for pharmacogenomics analysis. In such strategies, genes that are causally involved in modulating drug response are identified, and their putative clinical associations with drug response are subsequently validated. With this rationale, global phenotype analysis enables the investigation of a wide range of genetic factors including associations with specific gene and transcript sequence variations, expression of mRNA or protein, protein modifications, and many other genetic and epigenetic variations, which are often not used in primary pharmacogenomics analysis. 
Additionally, this approach can identify functionally relevant genes, which after undergoing in-depth functional validation may also have utility as drug targets. Therefore, there is clearly an unmet need to develop and apply genome-compatible strategies and technologies to identify functional modulators of drug response.

2. RNAi: The Silencing Mechanism

RNA interference (RNAi) is a naturally occurring mechanism for sequence-specific, post-transcriptional gene silencing triggered by double-stranded RNA (dsRNA) (1–7). Once the dsRNA enters a cell, it is cleaved by the Dicer enzyme, which produces dsRNA molecules that are 21–23 nt in size. These are referred to as small interfering RNAs (siRNAs) and function to mediate gene silencing via interaction with a protein complex called RNA-induced silencing complex or RISC (Fig. 1). RNAi technology involves the

RNAi-Based Functional Pharmacogenomics


Fig. 1. The RNAi pathway: a powerful tool for gene knockdown in mammalian cells. In the siRNA pathway, (a) long double-stranded RNA (dsRNA) is cleaved by Dicer, a member of the RNase III family, into ~21 nt siRNAs. The siRNAs, generated either by Dicer cleavage (a) or synthetic construction (b), are then introduced into cells, where they integrate into the RNA-induced silencing complex, RISC (c). Once unwound, the antisense strand of the siRNA chaperones RISC to the mRNA containing its complementary sequence, which triggers the destruction of the target by endonucleolytic cleavage.

use of siRNAs as potent mediators of an intrinsic posttranscriptional gene silencing mechanism. This mechanism is highly conserved among multicellular organisms as diverse as flies, yeast, plants, worms, and humans (4, 8–16). In Caenorhabditis elegans, RNAi has been applied to generate “somatic knockdowns” for the functional analysis of thousands of genes (17–21). In mammalian cells, siRNAs can induce RNAi without activation of nonspecific, dsRNA-dependent pathways (4, 14, 22–25). Chemically or enzymatically synthesized siRNAs, as well as vector-based short hairpin RNAs (shRNAs), have been extensively applied to the study of function of individual mammalian genes (4, 26–28). Further, siRNAs act as guides for the RISC (4, 29–31). Here we present a paradigm for functional pharmacogenomics based on genome-wide RNAi analysis to discover causal relationships between gene silencing and cellular phenotypes resulting from drug exposure. New tools such as RNAi technology, availability of genome-wide reagents for triggering RNAi,


and genome-compatible screening can effectively be applied to rapidly discover gene knockdown events that alter readily identifiable phenotypes. In this review, we will describe this new approach including a discussion of the methodological challenges and technical requirements. Through specific examples, we will also illustrate how such research can be used to advance and accelerate drug target discovery and clinical drug development.
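As a small illustration of the complementarity described in this section, the sketch below derives the antisense (guide) strand for an mRNA target site and checks that it falls within the 21–23 nt size range quoted for siRNAs. The target sequence and the function names are hypothetical and serve only to make the base-pairing rule concrete; this is not a design tool.

```python
# Illustrative only: the target sequence is made up, not taken from any gene.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def antisense_guide(target_mrna_site: str) -> str:
    """Return the antisense (guide) strand, written 5'->3', for an mRNA site."""
    site = target_mrna_site.upper().replace("T", "U")  # accept DNA-style input
    return "".join(COMPLEMENT[base] for base in reversed(site))


def is_sirna_length(strand: str, shortest: int = 21, longest: int = 23) -> bool:
    """Check the 21-23 nt size range described for Dicer products."""
    return shortest <= len(strand) <= longest


target_site = "GCAUGCUAGGCUAACGUUAGC"  # hypothetical 21-nt mRNA site
guide = antisense_guide(target_site)
print(guide, is_sirna_length(guide))
```

Taking the reverse complement twice returns the original site, which is a quick sanity check that the guide strand pairs with its intended target.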

3. Resources for Genome-Scale RNAi Analysis

To date, many efforts have led to the creation of large siRNA and shRNA libraries targeting human genes on a genomic scale using RNAi (26, 32–36). Large genome-wide siRNA libraries, as well as focused libraries, are commercially available from suppliers such as Qiagen (Germantown, MD), Ambion (Austin, TX), Invitrogen (Carlsbad, CA), Sigma (St. Louis, MO) and Dharmacon (Lafayette, CO). Similarly, shRNA libraries are commercially available from sources including Sigma (St. Louis, MO), Open Biosystems (Huntsville, AL), System Biosciences (Mountain View, CA), and GeneCopoeia (Rockville, MD).

4. Challenges of RNAi as Therapeutics

Although siRNAs are potent inhibitors of gene expression and have enhanced our comprehension of gene function in a variety of cell lines and organisms, several challenges of the siRNA technology remain to be addressed. Specifically, the major problems associated with RNAi analysis include sequence specificity, delivery, and off-target effects, which are described in depth below. Based upon the nature of these limitations, development of RNAi-based therapeutics should still be considered to be premature.

4.1. Sequence Specificity

Amarzguioui et al. demonstrated that changes at the 3′ end of the guide siRNA strand are well-tolerated (37, 38); in contrast, variation at either the 5′ end or the middle of the siRNA has adverse effects (39–41). The character of the 2-nt 3′-overhang of the siRNA is the predominant determinant of which strand participates in the RNAi pathway, and siRNAs with a unilateral 2-nt 3′-overhang on the antisense strand are more effective than siRNAs with 3′-overhangs at both ends, due to preferential loading of the antisense strand into the RISC (42). Several groups have proposed guidelines for choosing siRNAs that are likely to knock down gene expression (1, 43). There are also Web-based online software systems available


for computing highly functional siRNA sequences with maximal target-specificity in mammalian RNAi (44). It is recommended that systematic testing be performed to ensure that the siRNA sequence under consideration targets a single gene. A BLAST search of the chosen sequence should be performed against sequence databases such as EST and/or UniGene libraries using the National Center for Biotechnology Information (NCBI) Web site (45). These studies should include at least two siRNA sequences that have been carefully checked by BLAST, as well as scrambled and mismatched controls, and all the siRNAs should be used at the lowest active concentrations. In addition, it is important to note that siRNAs produce mRNA knockdowns rather than knockouts, and consequently, care must be taken in the interpretation of negative outcomes given that even minute and undetectable changes in the level of protein expression might still be sufficient to stimulate a cellular functional response (46). On the other hand, plasmid-based RNAi expression systems (shRNAs) can be considered an alternative or additional approach to synthetic siRNAs, but there are both advantages and disadvantages associated with this system (47). For example, chemically synthesized siRNA transfection is more efficient than plasmid-based shRNA transfection, as more cells will experience gene expression down-regulation after siRNA transfection. The onset of knockdown after siRNA transfection is immediate, as opposed to shRNA-based strategies, which require transcription and Dicer processing. Plasmid-based RNAi expression systems allow plasmids to be readily regenerated, and the duration of silencing can be extended using this system. Nevertheless, the utility of plasmid-based strategies is limited in cell lines that are difficult to transfect and that cannot be grown for long periods of time in culture, including primary cells (45).

4.2. Delivery

Efficient, reproducible, and rapid siRNA delivery is essential for effective siRNA library screening. Delivery constitutes a major hurdle, especially for siRNA-directed silencing by lipid-based transfection, mainly because siRNAs cannot readily cross the mammalian cell membrane. As a result, protocols often need to be optimized individually for the efficient delivery of siRNAs into different cell lines. A multitude of reagents are available today for the transfection of siRNAs into mammalian cells. The number of commercial vendors has increased significantly over a short period of time and continues to grow as more researchers adopt a functional genomics approach. The most commonly used approach is lipid-based (cationic and polyamine) transfection (48). Additionally, newer technologies for siRNA transfection are starting to emerge, such as amphiphilic chemicals and nanoparticles (23, 49, 50). Both lipid-based transfection and electroporation have been adapted for


Fig. 2. Cationic lipid-based siRNA transfection for functional chemogenomic studies. Comparison of direct and reverse transfection methods utilizing cationic lipid and siRNAs synergistically with low doses of chemical agents (i.e., 0.25 nM doxorubicin).

high-throughput siRNA delivery. Simultaneous transfection and plating of cells (i.e., reverse transfection) permits a large number of sample plates to be handled in a short period of time, as shown in Fig. 2 (51). Electroporation can deliver hundreds of siRNAs quickly when performed with multiwell-plate electroporators (52). However, although electroporation can deliver siRNAs more quickly in a multiwell format, it has the major drawback of toxicity from the process itself. Systems that are subjected to extremely large electric fields can undergo dielectric breakdown, in which the electric field disrupts their molecular structure. Varying the parameters of the induced electric field across the membrane (value, duration of the pulse, delay between each pulse) can lead to different effects: a slight increase in membrane conductivity, incomplete or totally reversible electrical breakdown, or destruction of the cell membrane and eventual death of the cell (53). We have found that electroporation may not be consistent from well to well, thus introducing a noise factor in the experiment that may make the results difficult to reproduce. The necessity of initial optimization of transfection conditions for each cell line,


and protocol optimization for individual targets, makes high-throughput target identification and validation very challenging. Nevertheless, once the optimal delivery method (a transfection reagent and the particular conditions for efficient uptake of siRNA while maintaining high cell viability) for the chosen cell type is identified, these challenges can be overcome.

4.3. Off-Target Effects

Although RNAi offers specificity as one of its greatest advantages, off-target effects often appear as a potential drawback. Consequently, a rigorous screening strategy must be implemented for all candidate siRNAs to avoid, or at least limit, off-target effects (54). The mechanisms by which this technology creates off-target effects are varied. For example, several researchers have demonstrated that siRNAs cause off-target effects by unintentionally knocking down unrelated genes (54–56). Alternatively, some cationic liposomes induce or prompt interferon response (57, 58) and viral vectors containing an shRNA expression cassette can stimulate an unintended immune response (59). The nature of the synthetic siRNAs may trigger the induction of the dsRNA cellular defense mechanism and although interferon induction might be beneficial for therapy in some cases, cytotoxicity is frequently an outcome of this uncontrolled induction of the innate defense mechanism (59). Systematic studies investigating whether interferon is induced in response to siRNAs have revealed that certain siRNAs do indeed produce this effect (60–62). Interestingly, transfection of synthetic siRNA of 21–23 nt into cells produced gene silencing in a potent and sequence-specific manner, but without triggering the nonspecific interferon response associated with dsRNA longer than 30 nt (63, 64). These mechanisms notwithstanding, off-target effects can be reduced by modifying the siRNA by careful design and screening of each candidate siRNA, and using an appropriate delivery vehicle with minimal potential adverse effects (59). These studies made it possible and practical to chemically synthesize siRNAs, which could then be applied as a molecular biology tool for gene silencing in mammalian cells.

5. RNAi in Cancer Research

RNAi technology has been rapidly adopted as a research tool for analysis of cancer-related gene function. Several studies have shown the feasibility of using RNAi both to down-regulate gene expression in hematopoietic cells and to knock down specific cancer-related genes, including potential therapeutic targets, such as BCR-ABL, VEGF, k-ras, ERBB1, MDR1, HSP25, BRAF, and


Skp2 (65–76). For example, Brummelkamp et al. (77) employed a novel retroviral RNAi vector system to functionally target the oncogenic version of KRAS2 involved in pancreatic cancer progression. Targeted reduction of this gene led to a loss of anchorage-independent growth and decreased the tumorigenic capacity of the pancreatic cancer cell line, CAPAN-1, in nude mice. Recently, using an RNAi-based approach, it was found that topoisomerase levels determine chemotherapy response in vitro and in vivo (78). Together, these studies illustrate the utility of RNAi in the identification and validation of potential targets for anticancer therapies and the prediction of therapeutic response. Important issues that remain to be addressed, however, are the identification of the most effective sequences that can be targeted by siRNAs and the development of high-throughput systems for validation and analysis of the effects of RNAi against specific targets in mammalian cells. There is also a need to further characterize the ability of siRNAs to induce RNAi in vivo. Progress is being made in the development of RNAi-based strategies, which in turn has already facilitated the broad implementation of siRNA libraries targeting human genes on a genome-wide scale (3, 33, 79–84). These efforts have provided the opportunity to functionally screen hundreds to thousands of gene targets for their phenotypic effects on cancer cell growth, apoptosis, and other features of cancer (80, 85). Thus, RNAi is a mechanism that is proving relatively easy to exploit as a functional genomics tool and has enormous potential as a therapeutic strategy (59, 86).

6. RNAi Microarray Technology

The emergence of novel technologies in molecular biology has enabled the identification of new targets that have been correlated with human disease. Technical advances in the RNAi field, combined with the completion of the human genome project, have also enabled the high-throughput, genome-wide RNAi analysis of various organisms. Several groups have utilized high-throughput RNAi (HT-RNAi) technology to systematically study hundreds to thousands of genes for knockdown phenotypes (33, 85). HT-RNAi is also routinely used for rapid, genome-wide screening for genes involved in specific biological processes. The integration of this technology with platforms such as microarray-based analysis has enhanced the speed, accuracy, and throughput of such genome-wide screens (87–91). One of the key advantages of RNAi microarrays for genome-scale screening is that they permit miniaturization far beyond what is physically possible today in well-based systems. At present, most


cell culture-based assays of gene function make use of 6-, 12-, 24-, 96-, or 384-well plate formats. Although RNAi experiments can be conducted in a 384-well plate format, a “well-less” microarray platform would allow for a significantly higher throughput (87, 92, 93). Although the size and density of the spots are limited by the need to have at least several dozen, if not 100, cells on each spot, pilot studies indicate that it is feasible to array and analyze as many as 5,000–10,000 individual siRNA spots per slide. Another advantage of array-based miniaturization is the reduction in reagent costs. In particular, chemically synthesized siRNAs are expensive and are produced in relatively small quantities. We estimate that at least 50 times less siRNA can be used in a microspot on an RNAi microarray compared with experiments in a typical 96-well plate. The RNAi microarray format will also reduce the cost of the phenotypic assays because only a single microscope slide needs to be subjected to analysis. The “well-less” nature of the platform also improves the uniformity of the experimental culturing and assay conditions, because all processes take place on the same surface. Because this platform does not have plate-to-plate variation, the data generated will be less noisy than those resulting from assays run in a 96- or 384-well format. Finally, large human siRNA libraries akin to those developed for C. elegans and Drosophila are available, making genomic-scale analyses of gene function possible in living mammalian cells using the RNAi microarray platform (87, 93).
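The throughput and reagent arguments above can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: the library size of 10,000 siRNAs is an assumed round number, control wells and replicates are ignored, the lower bound of 5,000 spots per slide is used, and the per-well siRNA amount is expressed in arbitrary units.

```python
import math


def containers_needed(n_sirnas: int, capacity: int) -> int:
    """Number of plates or slides needed to hold one copy of the library."""
    return math.ceil(n_sirnas / capacity)


def microspot_amount(per_well_amount: float, fold_reduction: float = 50.0) -> float:
    """Upper estimate of siRNA per microspot, given 'at least 50 times less'."""
    return per_well_amount / fold_reduction


library_size = 10_000  # assumed library size, for illustration only
formats = {
    "96-well plate": 96,
    "384-well plate": 384,
    "RNAi microarray slide": 5_000,  # lower bound of the 5,000-10,000 range
}

for name, capacity in formats.items():
    print(f"{name}: {containers_needed(library_size, capacity)}")

# siRNA per spot relative to one arbitrary unit per 96-well experiment
print(microspot_amount(1.0))
```

Under these assumptions the same library that fills more than a hundred 96-well plates fits on two slides, which is the miniaturization the text describes.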

7. High-Throughput RNAi Drug Sensitivity Screening

The value of high-throughput cell-based screening technology is that it not only enables a fast method for screening the phenotype associated with knocking down individual genes, but it also enables investigation of gene interactions in parallel. The establishment of an efficient HT-RNAi platform involves successful development of several stages including assay development, assay validation, high-throughput screening and data analysis (Fig. 3). Once established, HT-RNAi can be used to identify and functionally validate target hits. A flow chart representing the steps of HT-RNAi used for the identification of proliferation modulators is shown in Fig. 4. One application of HT-RNAi is to conduct multidimensional analysis of drug modulator genes across multiple inputs (different chemical or drug exposures), in various systems (different cell lines or cell line variants) and across multiple outputs (cellular phenotypes or molecular endpoints). When each of these dimensions is multiplied, we achieve highly multidimensional RNAi data, which can be extremely valuable in both


Fig. 3. Flow of information in a high-throughput RNAi screening platform. Assay development and optimization are the key steps prior to initiating any scale of RNAi screen. This laborious step takes into account many stringent optimization conditions, including transfection reagent tests as well as drug dose–response analyses. Following this, assay validation is carried out with a small-scale RNAi screen, which leads to the full-library RNAi high-throughput screen (HTS). Data analysis and validation of target genes is the next crucial step, in which statistical evaluation for target identification, expert curation of the identified targets through pathway analysis, and validation of targets through quantitative real-time PCR (qPCR) are applied.

identifying chemo-modulators, as well as better understanding the chemo-selectivity of chemotype modulation and its relationship to various genomic parameters and contexts. Data clustering is a data exploration technique in which nonrandom patterns or structures requiring further explanation are recognized and placed into a small number of homogeneous groups or clusters. Applications in which the investigator wishes to view the chemo-sensitivity resulting from knocking down genes across multiple screens (experiments) are too numerous to list here. One example is the study conducted by Whitehurst et al., in which chemosensitizer loci in cancer cells were identified through synthetic lethal screening (94). One of the objectives of the authors is to cluster genes based on statistical behavior such that some functional relationship may be predicted, or the characteristics of RNAi effect patterns for all clusters may be extracted for phenotype fingerprinting/profiling purposes (Fig. 5). High-throughput screening technologies are also providing the means with which to perform large-scale and whole-genome RNAi phenotype studies that generate massive amounts of data (27, 33). In addition to enhancing our current understanding of global gene function and regulation, these data will also present substantial challenges in the areas of data management and analysis. For this reason, it becomes important to focus on effective informatics methods in order to draw novel biological conclusions (95).
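One simple way to compare RNAi phenotype profiles across screens, in the spirit of the correlation coefficients reported for paired conditions in Fig. 5, is a pairwise Pearson correlation. The sketch below is a minimal illustration rather than the authors' analysis pipeline: the condition names and relative growth values are invented, and each list holds the relative growth measured for the same five siRNAs in the same order.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def pairwise_correlations(profiles):
    """Correlate every pair of conditions; keys are (condition_a, condition_b)."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }


# Hypothetical relative growth (fraction of surviving cells) per siRNA.
profiles = {
    "vehicle": [1.00, 0.95, 0.40, 0.90, 0.70],
    "low dose": [0.98, 0.60, 0.38, 0.88, 0.65],
    "high dose": [0.50, 0.20, 0.30, 0.45, 0.10],
}

for pair, r in pairwise_correlations(profiles).items():
    print(pair, round(r, 3))
```

A correlation matrix of this kind can serve as the similarity measure for a clustering step, although the choice of clustering method is left open here.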


Fig. 4. RNAi high-throughput screening for functional “hit” identification. A flow chart of the steps that need to be taken into consideration when evaluating a potential functional target.

8. Drug Target Validation vs. Drug Discovery

Genomic technologies have quickly accelerated the identification of genes associated with physiological processes and pathological states. The challenge has traditionally been to functionally evaluate these candidate genes to determine which ones would make appropriate and pharmacologically relevant drug targets. In this context, a gene is first associated with a disease process, or with response to therapy, and then the expression or function of a gene is perturbed experimentally to determine if a functional and/or causal link exists between the gene or signaling pathways, and the desired pharmacological outcome or phenotype (96). To that end, RNAi technology is now widely applied as a readily available research tool to accelerate the functional validation of the multitude of new candidate drug targets emerging from “omics” data (97). Our own laboratory has used this approach to evaluate


Fig. 5. Multidimensional RNAi phenotype analysis. Analysis of various doses of drugs can be used to generate multidimensional RNAi phenotype plots. Each scatter plot in the figure shows a different pair of treatments (i.e., four subeffective doses of doxorubicin following siRNA transfection) plotted by phenotype (relative growth = relative percentage of surviving cells following treatment). In each case, a correlation coefficient is calculated to indicate the global similarity between two paired conditions. The highest correlation between treatments following silencing events was comparison of 8.6 and 0.86 nM doxorubicin, which had a correlation of 0.852. The most diverse global RNAi phenotypes were between the most different concentrations (H2O vehicle control of 0 and 86 nM).

candidate drug targets which were found to be amplified and overexpressed genes in cancer genomes through comparative genomic hybridization (CGH) microarray and cDNA microarray analysis. In such experiments, small focused siRNA libraries (i.e., Cancer Gene Library of 278 siRNAs targeting 139 classic oncogenes, and tumor suppressor genes) were created against prioritized candidates and used to rapidly determine the ones that were also necessary for cell growth and survival (98). The functional data add tremendous value to the genomic information, and permit the rapid prioritization of pharmacologically vulnerable targets to be distinguished from disease-associated genes that would not make appropriate drug targets. However, this approach is limited in that it focuses on genes that are associated with the disease. Although disease-related genes make sense as candidate drug targets, it is also possible that


certain genes and pathways which are not involved in the etiology of the disease may represent points of vulnerability that can be therapeutically exploited. The hypothesis behind this concept is that certain gene targets would become vulnerable under certain contexts, such as exposure to a drug or in the context of a cell in a pathological state, but would not be associated with the disease state. There are many examples in which a disease state alters the underlying genetic regulatory networks of a cell, leading to a different cellular responsiveness to external stimuli (99). It stands to reason that this would create vulnerabilities, and new points for therapeutic intervention that would not otherwise be sensitive in cells under normal conditions. Furthermore, these genes may not be related to the original etiological factors that led to the disease state, but now represent effective targets simply because of their new cellular and molecular context. This concept is particularly relevant to oncology, where the disease results from a slow multistep process, often involving a number of genetic alterations interacting with multiple environmental factors. Therefore, the etiology of many cancers is often poorly understood, the pathogenic process can be very complex and variable, and the end-stage disease is highly heterogeneous. Genomics-based cancer drug discovery has largely focused on the identification of genes that cause the disease state as candidate drug targets (100). Despite a very large body of knowledge about the genes that are in some way associated with cancer initiation, promotion, and progression, targeted cancer therapies have had limited success. We propose an alternative approach that consists of focusing on vulnerabilities in cancer cells that arise from the cancer process. 
In this paradigm, the most important aspect is not the disease-causing genes, but the specific modifications to the normal genetic regulatory network that occur in the context of a cancer cell with a highly altered genome. With HT-RNAi technology, it is possible to rapidly analyze genes in the genome to determine if they are needed for growth and survival in cancer cells but are not critical for survival of normal cells. Such genes may not be directly involved in the cancer process itself but nevertheless represent selectively vulnerable drug targets for therapeutic intervention. In a similar fashion, the cellular response to a specific drug may be controlled and modulated differently in various cell types, under different physiological and pathological conditions. Systematic RNAi phenotype screening can reveal the genes that functionally regulate the response to specific drugs in cancer cells, providing a means for development of new combination chemotherapies that exploit these chemo-selective vulnerabilities. An example of an RNAi phenotype profiling study for identifying modulators of the response of MCF-7 breast cancer cells to doxorubicin is shown in Fig. 6.
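The idea of a sensitizing knockdown can be sketched as a comparison of cell survival with and without the drug for each siRNA. The example below is a hypothetical illustration, not the scoring used for Fig. 6: the siRNA labels, growth values, and the 0.5 cutoff are all assumptions chosen to show the logic.

```python
def call_sensitizers(screen, threshold=0.5):
    """Return siRNAs whose drug/vehicle growth ratio falls below the threshold.

    `screen` maps an siRNA label to (growth_in_vehicle, growth_with_drug),
    both expressed relative to untransfected or control-transfected cells.
    """
    hits = []
    for sirna, (vehicle, drug) in screen.items():
        if vehicle <= 0:
            continue  # no surviving cells in vehicle: ratio is undefined
        if drug / vehicle < threshold:
            hits.append(sirna)
    return sorted(hits)


# Hypothetical relative growth values: (vehicle, low-dose drug).
screen = {
    "siRNA_A": (1.00, 0.95),  # no effect alone, no sensitization
    "siRNA_B": (0.90, 0.30),  # mild effect alone, strong effect with drug
    "siRNA_C": (0.20, 0.18),  # toxic on its own, not drug-specific
    "siRNA_D": (0.85, 0.40),  # sensitizer just under the cutoff
}

print(call_sensitizers(screen))
```

Normalizing to the vehicle condition separates knockdowns that are toxic in their own right (siRNA_C here) from those that specifically change the response to the drug.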


Fig. 6. RNAi phenotype profiling of >10,000 siRNAs in one cell line (MCF-7) with a two-drug exposure context. To discover genes which modulate the sensitivity to drug exposure, we have developed a functional chemogenomics assay using the number of surviving cells as an endpoint. We generated proof-of-concept experiments using doxorubicin to determine if known modulators of doxorubicin sensitivity can be identified in cancer cells. Highlighted circles indicate gene targeting events that sensitize MCF-7 breast cancer cells to a low dose (~0.25 nM) of doxorubicin.

9. Key Benefits of RNAi

In summary, the key benefits of RNAi technology are listed below:

● Relative to other functional approaches, RNAi provides several advantages for investigating the function of genes, gene–gene interactions, and gene–environment interactions.

● Knocking down gene expression provides a more biologically relevant experimental outcome compared to gene overexpression.

● In terms of knockdown approaches, use of siRNA to trigger RNAi has proven to be more potent, more specific, and more reliable than other approaches such as antisense technology.

● The ease of use and high success rate in designing effective and specific siRNA reagents make this technology compatible with the creation of large, genome-wide libraries for systematic and global-scale studies.

10. Perspectives of the Application of HT-RNAi for Pharmacogenomics

Genome-scale analysis of mRNA and protein expression has produced enticing clues about which genes may be associated with drug response and/or therapy failure in the clinic. Unfortunately, regulatory genetic networks that control drug response are extremely complex, heterogeneous, and virtually impossible to model based upon our current understanding. Therefore, most attempts to develop response biomarkers are based primarily on association data from genomics analysis, with little regard to the functional role that genes have in modulating drug response. Consequently, these genes and gene signatures may initially show extremely strong associations with drug response, but when applied to large clinical trials they often do not have sufficient predictive power to be useful. As an alternative, we have approached the problem from a systems analysis perspective, assuming a “black-box” model for the regulatory genetic network of a cell, and applied genome-wide RNAi for multidimensional analysis to reveal insightful functional relationships between synthetic genotypes and chemical compounds in cancer cells. This approach is based on recent advances in RNAi technology, and requires extensive expertise, infrastructure, and resources to perform HT-RNAi on a large scale. HT-RNAi screening has emerged to offer tremendous insights and knowledge behind the mechanisms and contexts of vulnerability of any particular drug therapeutic. Use of this information can then be applied to better position the therapeutic drug clinically by suggesting new drug combinations that may have enhanced effectiveness or by identifying biomarkers that will indicate patient groups that will likely respond better to the therapeutic. Consequently, it is very critical that the acquisition of the data from RNAi studies be as accurate as possible. Historically, the caveat of using RNAi has been the inability to obtain consistent and repeatable results. 
The factors that have contributed to this are multifold. The challenges provided earlier such as siRNA delivery and off-target effects still exist today. However, with recent significant improvements in these areas, these aspects that contribute to experimental noise can be filtered out to provide a robust experimental protocol (101). Moreover, the use of appropriate siRNA controls will also provide the ability to reduce the selection of false-positive results from these screens. For instance,


the inclusion of a positive control siRNA sequence that is known to be toxic to cells can be used to determine the overall transfection efficiency of the experiment. Conversely, a negative control sequence, which should not result in the knockdown of any gene, is equally important. With these controls, potential transient variables such as transfection reagent cytotoxicity that may contribute to false-positive selection can be identified and filtered out during the analysis (101). These controls can then also be used to assess the quality of the HT-RNAi screen over a large genome-wide library. Plate-to-plate and run-to-run variations within the screen provide an overall assessment of the robustness of the screen. Application of HT-RNAi to functional chemogenomics is a powerful strategy for “predictive pharmacogenomics” since it permits the discovery of candidate biomarkers that are causally related to drug response.
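The use of positive and negative controls to judge screen quality can be illustrated with two common summary statistics. The text does not name a specific metric; the Z′-factor used below is a widely used measure of control separation in high-throughput screening and is introduced here as an assumption, as are the function names and all control values.

```python
from statistics import mean, stdev


def z_prime(positive, negative):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1 - 3 * (stdev(positive) + stdev(negative)) / separation


def coefficient_of_variation(values):
    """Relative spread (sd / mean), e.g. of negative-control means per plate."""
    return stdev(values) / mean(values)


# Hypothetical relative cell survival in control wells on one plate.
toxic_positive_control = [0.10, 0.12, 0.08]    # siRNA known to kill cells
nonsilencing_negative_control = [1.00, 0.95, 1.05]

print(round(z_prime(toxic_positive_control, nonsilencing_negative_control), 3))

# Hypothetical negative-control means from four plates in the same run.
plate_means = [1.00, 0.97, 1.03, 0.99]
print(round(coefficient_of_variation(plate_means), 3))
```

Values of Z′ close to 1 indicate well-separated controls, and a small coefficient of variation across plates suggests low plate-to-plate variation.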

Acknowledgments The authors acknowledge the members of the Pharmaceutical Genomics Division at the Translational Genomics Research Institute. A special acknowledgement goes to Dr. H. Yin, Dr. Q. Que, D. Chow, Dr. G. Basu, Dr. N. Meurice, and Dr. J. Kiefer for their critical feedback. References 1. Elbashir SM, Harborth J, Weber K, Tuschl T (2002) Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods 26:199–213 2. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391:806–811 3. Hannon GJ (2002) RNA interference. Nature 418:244–251 4. Huppi K, Martin SE, Caplen NJ (2005) Defining and assaying RNAi in mammalian cells. Mol Cell 17:1–10 5. Tuschl T (2001) RNA interference and small interfering RNAs. Chembiochem 2:239–245 6. Nakayashiki H, Nguyen QB (2008) RNA interference: roles in fungal biology. Curr Opin Microbiol 11:494–502 7. Travella S, Keller B (2009) Down-regulation of gene expression by RNA-induced gene silencing. Methods Mol Biol 478:185–199

8. Tomari Y, Du T, Zamore PD (2007) Sorting of Drosophila small silencing RNAs. Cell 130:299–308
9. Matranga C, Zamore PD (2007) Small silencing RNAs. Curr Biol 17:R789–R793
10. Zamore PD (2004) Plant RNAi: how a viral silencing suppressor inactivates siRNA. Curr Biol 14:R198–R200
11. Roguev A, Bandyopadhyay S, Zofall M, Zhang K, Fischer T, Collins SR, Qu H, Shales M, Park HO, Hayles J, Hoe KL, Kim DU, Ideker T, Grewal SI, Weissman JS, Krogan NJ (2008) Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast. Science 322:405–410
12. Volpe T, Schramke V, Hamilton GL, White SA, Teng G, Martienssen RA, Allshire RC (2003) RNA interference is required for normal centromere function in fission yeast. Chromosome Res 11:137–146
13. Hamilton B, Dong Y, Shindo M, Liu W, Odell I, Ruvkun G, Lee SS (2005) A systematic RNAi screen for longevity genes in C. elegans. Genes Dev 19:1544–1555
14. Caplen NJ, Mousses S (2003) Short interfering RNA (siRNA)-mediated RNA interference (RNAi) in human cells. Ann N Y Acad Sci 1002:56–62
15. Haley B, Tang G, Zamore PD (2003) In vitro analysis of RNA interference in Drosophila melanogaster. Methods 30:330–336
16. Stroschein-Stevenson SL, Foley E, O’Farrell PH, Johnson AD (2009) Phagocytosis of Candida albicans by RNAi-treated Drosophila S2 cells. Methods Mol Biol 470:347–358
17. Grishok A (2005) RNAi mechanisms in Caenorhabditis elegans. FEBS Lett 579:5932–5939
18. Grishok A, Hoersch S, Sharp PA (2008) RNA interference and retinoblastoma-related genes are required for repression of endogenous siRNA targets in Caenorhabditis elegans. Proc Natl Acad Sci USA 105:20386–20391
19. Kamath RS, Ahringer J (2003) Genome-wide RNAi screening in Caenorhabditis elegans. Methods 30:313–321
20. Kim JK, Gabel HW, Kamath RS, Tewari M, Pasquinelli A, Rual JF, Kennedy S, Dybbs M, Bertin N, Kaplan JM, Vidal M, Ruvkun G (2005) Functional genomic analysis of RNA interference in C. elegans. Science 308:1164–1167
21. Sonnichsen B, Koski LB, Walsh A, Marschall P, Neumann B, Brehm M, Alleaume AM, Artelt J, Bettencourt P, Cassin E, Hewitson M, Holz C, Khan M, Lazik S, Martin C, Nitzsche B, Ruer M, Stamford J, Winzi M, Heinkel R, Roder M, Finell J, Hantsch H, Jones SJ, Jones M, Piano F, Gunsalus KC, Oegema K, Gonczy P, Coulson A, Hyman AA, Echeverri CJ (2005) Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature 434:462–469
22. Mariotti M, Castiglioni S, Maier JA (2009) Inhibition of T24 human bladder carcinoma cell migration by RNA interference suppressing the expression of HD-PTP. Cancer Lett 273:155–163
23. Medarova Z, Kumar M, Ng SW, Yang J, Barteneva N, Evgenov NV, Petkova V, Moore A (2008) Multifunctional magnetic nanocarriers for image-tagged SiRNA delivery to intact pancreatic islets. Transplantation 86:1170–1177
24. Tuschl T, Zamore PD, Lehmann R, Bartel DP, Sharp PA (1999) Targeted mRNA degradation by double-stranded RNA in vitro. Genes Dev 13:3191–3197

25. Sun X, Rogoff HA, Li CJ (2008) Asymmetric RNA duplexes mediate RNA interference in mammalian cells. Nat Biotechnol 26:1379–1382
26. Aleku M, Schulz P, Keil O, Santel A, Schaeper U, Dieckhoff B, Janke O, Endruschat J, Durieux B, Roder N, Loffler K, Lange C, Fechtner M, Mopert K, Fisch G, Dames S, Arnold W, Jochims K, Giese K, Wiedenmann B, Scholz A, Kaufmann J (2008) Atu027, a liposomal small interfering RNA formulation targeting protein kinase N3, inhibits cancer progression. Cancer Res 68:9788–9798
27. Rines DR, Gomez-Ferreria MA, Zhou Y, DeJesus P, Grob S, Batalov S, Labow M, Huesken D, Mickanin C, Hall J, Reinhardt M, Natt F, Lange J, Sharp DJ, Chanda SK, Caldwell JS (2008) Whole genome functional analysis identifies novel components required for mitotic spindle integrity in human cells. Genome Biol 9:R44
28. McManus MT, Sharp PA (2002) Gene silencing in mammals by small interfering RNAs. Nat Rev 3:737–747
29. Filipowicz W (2005) RNAi: the nuts and bolts of the RISC machine. Cell 122:17–20
30. Ohrt T, Mutze J, Staroske W, Weinmann L, Hock J, Crell K, Meister G, Schwille P (2008) Fluorescence correlation spectroscopy and fluorescence cross-correlation spectroscopy reveal the cytoplasmic origination of loaded nuclear RISC in vivo in human cells. Nucleic Acids Res 36:6439–6449
31. MacRae IJ, Ma E, Zhou M, Robinson CV, Doudna JA (2008) In vitro reconstitution of the human RISC-loading complex. Proc Natl Acad Sci USA 105:512–517
32. Berns K, Hijmans EM, Mullenders J, Brummelkamp TR, Velds A, Heimerikx M, Kerkhoven RM, Madiredjo M, Nijkamp W, Weigelt B, Agami R, Ge W, Cavet G, Linsley PS, Beijersbergen RL, Bernards R (2004) A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature 428:431–437
33. Ganesan AK, Ho H, Bodemann B, Petersen S, Aruri J, Koshy S, Richardson Z, Le LQ, Krasieva T, Roth MG, Farmer P, White MA (2008) Genome-wide siRNA-based functional genomics of pigmentation identifies novel genes and pathways that impact melanogenesis in human cells. PLoS Genet 4:e1000298
34. Schlabach MR, Luo J, Solimini NL, Hu G, Xu Q, Li MZ, Zhao Z, Smogorzewska A, Sowa ME, Ang XL, Westbrook TF, Liang AC, Chang K, Hackett JA, Harper JW, Hannon GJ, Elledge SJ (2008) Cancer proliferation gene discovery through functional genomics. Science 319:620–624
35. Silva JM, Marran K, Parker JS, Silva J, Golding M, Schlabach MR, Elledge SJ, Hannon GJ, Chang K (2008) Profiling essential genes in human mammary cells by multiplex RNAi screening. Science 319:617–620
36. Paddison PJ, Hannon GJ (2003) siRNAs and shRNAs: skeleton keys to the human genome. Curr Opin Mol Ther 5:217–224
37. Amarzguioui M, Holen T, Babaie E, Prydz H (2003) Tolerance for mutations and chemical modifications in a siRNA. Nucleic Acids Res 31:589–595
38. Amarzguioui M, Prydz H (2004) An algorithm for selection of functional siRNA sequences. Biochem Biophys Res Commun 316:1050–1058
39. Boden D, Pusch O, Lee F, Tucker L, Shank PR, Ramratnam B (2003) Promoter choice affects the potency of HIV-1 specific RNA interference. Nucleic Acids Res 31:5033–5038
40. Das AT, Brummelkamp TR, Westerhout EM, Vink M, Madiredjo M, Bernards R, Berkhout B (2004) Human immunodeficiency virus type 1 escapes from RNA interference-mediated inhibition. J Virol 78:2601–2605
41. Gitlin L, Karelsky S, Andino R (2002) Short interfering RNA confers intracellular antiviral immunity in human cells. Nature 418:430–434
42. Sano M, Sierant M, Miyagishi M, Nakanishi M, Takagi Y, Sutou S (2008) Effect of asymmetric terminal structures of short RNA duplexes on the RNA interference activity and strand selection. Nucleic Acids Res 36:5812–5821
43. Paddison PJ, Hannon GJ (2002) RNA interference: the new somatic cell genetics? Cancer Cell 2:17–23
44. Ui-Tei K, Naito Y, Saigo K (2007) Guidelines for the selection of effective short-interfering RNA sequences for functional genomics. Methods Mol Biol 361:201–216
45. Dykxhoorn DM, Novina CD, Sharp PA (2003) Killing the messenger: short RNAs that silence gene expression. Nat Rev Mol Cell Biol 4:457–467
46. Jones SW, Souza PM, Lindsay MA (2004) siRNA for gene silencing: a route to drug target discovery. Curr Opin Pharmacol 4:522–527
47. Rossi JJ (2008) Expression strategies for short hairpin RNA interference triggers. Hum Gene Ther 19:313–317

48. Han SE, Kang H, Shim GY, Suh MS, Kim SJ, Kim JS, Oh YK (2008) Novel cationic cholesterol derivative-based liposomes for serum-enhanced delivery of siRNA. Int J Pharm 353:260–269
49. Sun TM, Du JZ, Yan LF, Mao HQ, Wang J (2008) Self-assembled biodegradable micellar nanoparticles of amphiphilic and cationic block copolymer for siRNA delivery. Biomaterials 29:4348–4355
50. Medarova Z, Pham W, Farrar C, Petkova V, Moore A (2007) In vivo imaging of siRNA delivery and silencing in tumors. Nat Med 13:372–377
51. Kuuselo R, Savinainen K, Azorsa DO, Basu GD, Karhu R, Tuzmen S, Mousses S, Kallioniemi A (2007) Intersex-like (IXL) is a cell survival regulator in pancreatic cancer with 19q13 amplification. Cancer Res 67:1943–1949
52. Ovcharenko D, Jarvis R, Hunicke-Smith S, Kelnar K, Brown D (2005) High-throughput RNAi screening in vitro: from cell lines to primary cells. RNA 11:985–993
53. Favard C, Dean DS, Rols MP (2007) Electrotransfer as a non viral method of gene delivery. Curr Gene Ther 7:67–77
54. Anderson E, Boese Q, Khvorova A, Karpilow J (2008) Identifying siRNA-induced off-targets by microarray analysis. Methods Mol Biol 442:45–63
55. Jackson AL, Bartz SR, Schelter J, Kobayashi SV, Burchard J, Mao M, Li B, Cavet G, Linsley PS (2003) Expression profiling reveals off-target gene regulation by RNAi. Nat Biotechnol 21:635–637
56. Scacheri PC, Rozenblatt-Rosen O, Caplen NJ, Wolfsberg TG, Umayam L, Lee JC, Hughes CM, Shanmugam KS, Bhattacharjee A, Meyerson M, Collins FS (2004) Short interfering RNAs can induce unexpected and divergent changes in the levels of untargeted proteins in mammalian cells. Proc Natl Acad Sci USA 101:1892–1897
57. Heidel JD, Hu S, Liu XF, Triche TJ, Davis ME (2004) Lack of interferon response in animals to naked siRNAs. Nat Biotechnol 22:1579–1582
58. Ma Z, Li J, He F, Wilson A, Pitt B, Li S (2005) Cationic lipids enhance siRNA-mediated interferon response in mice. Biochem Biophys Res Commun 330:755–759
59. Uprichard SL (2005) The therapeutic potential of RNA interference. FEBS Lett 579:5996–6007
60. Robbins MA, Rossi JJ (2005) Sensing the danger in RNA. Nat Med 11:250–251

61. Hornung V, Guenthner-Biller M, Bourquin C, Ablasser A, Schlee M, Uematsu S, Noronha A, Manoharan M, Akira S, de Fougerolles A, Endres S, Hartmann G (2005) Sequence-specific potent induction of IFN-alpha by short interfering RNA in plasmacytoid dendritic cells through TLR7. Nat Med 11:263–270
62. Kim DH, Longo M, Han Y, Lundberg P, Cantin E, Rossi JJ (2004) Interferon induction by siRNAs and ssRNAs synthesized by phage polymerase. Nat Biotechnol 22:321–325
63. Caplen NJ (2002) A new approach to the inhibition of gene expression. Trends Biotechnol 20:49–51
64. Caplen NJ, Parrish S, Imani F, Fire A, Morgan RA (2001) Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci USA 98:9742–9747
65. Scherr M, Battmer K, Winkler T, Heidenreich O, Ganser A, Eder M (2003) Specific inhibition of bcr-abl gene expression by small interfering RNA. Blood 101:1566–1569
66. Scherr M, Steinmann D, Eder M (2004) RNA interference (RNAi) in hematology. Ann Hematol 83:1–8
67. Scherr M, Venturini L, Eder M (2009) Knock-down of gene expression in hematopoietic cells. Methods Mol Biol 506:207–219
68. Bausero MA, Bharti A, Page DT, Perez KD, Eng JW, Ordonez SL, Asea EE, Jantschitsch C, Kindas-Muegge I, Ciocca D, Asea A (2006) Silencing the hsp25 gene eliminates migration capability of the highly metastatic murine 4T1 breast adenocarcinoma cell. Tumour Biol 27:17–26
69. Logashenko EB, Vladimirova AV, Repkova MN, Venyaminova AG, Chernolovskaya EL, Vlassov VV (2004) Silencing of MDR 1 gene in cancer cells by siRNA. Nucleosides Nucleotides Nucleic Acids 23:861–866
70. Nagy P, Arndt-Jovin DJ, Jovin TM (2003) Small interfering RNAs suppress the expression of endogenous and GFP-fused epidermal growth factor receptor (erbB1) and induce apoptosis in erbB1-overexpressing cells. Exp Cell Res 285:39–49
71. Stierle V, Laigle A, Jolles B (2005) Modulation of MDR1 gene expression in multidrug resistant MCF7 cells by low concentrations of small interfering RNAs. Biochem Pharmacol 70:1424–1430
72. Sumimoto H, Yamagata S, Shimizu A, Miyoshi H, Mizuguchi H, Hayakawa T, Miyagishi M, Taira K, Kawakami Y (2004) Gene therapy for human small-cell lung carcinoma by inactivation of Skp-2 with virally mediated RNA interference. Gene Ther 12:95–100
73. Wilda M, Fuchs U, Wossmann W, Borkhardt A (2002) Killing of leukemic cells with a BCR/ABL fusion gene by RNA interference (RNAi). Oncogene 21:5716–5724
74. Withey JM, Harvey AJ, Crompton MR (2006) RNA interference targeting of Bcr-Abl increases chronic myeloid leukemia cell killing by 17-allylamino-17-demethoxygeldanamycin. Leuk Res 30:553–560
75. Wu H, Hait WN, Yang JM (2003) Small interfering RNA-induced suppression of MDR1 (P-glycoprotein) restores sensitivity to multidrug-resistant cancer cells. Cancer Res 63:1515–1519
76. Zhang L, Yang N, Mohamed-Hadley A, Rubin SC, Coukos G (2003) Vector-based RNAi, a novel tool for isoform-specific knock-down of VEGF and anti-angiogenesis gene therapy of cancer. Biochem Biophys Res Commun 303:1169–1178
77. Brummelkamp TR, Bernards R, Agami R (2002) Stable suppression of tumorigenicity by virus-mediated RNA interference. Cancer Cell 2:243–247
78. Burgess DJ, Doles J, Zender L, Xue W, Ma B, McCombie WR, Hannon GJ, Lowe SW, Hemann MT (2008) Topoisomerase levels determine chemotherapy response in vitro and in vivo. Proc Natl Acad Sci USA 105:9053–9058
79. Frankish H (2003) Consortium uses RNAi to uncover genes’ function. Lancet 361:584
80. Gobeil S, Zhu X, Doillon CJ, Green MR (2008) A genome-wide shRNA screen identifies GAS1 as a novel melanoma metastasis suppressor gene. Genes Dev 22:2932–2940
81. Ito M, Kawano K, Miyagishi M, Taira K (2005) Genome-wide application of RNAi to the discovery of potential drug targets. FEBS Lett 579:5988–5995
82. Chen M, Du Q, Zhang H, Wang X, Liang Z (2007) High-throughput screening using siRNA (RNAi) libraries. Expert Rev Mol Diagn 7:281–291
83. Paddison PJ, Silva JM, Conklin DS, Schlabach M, Li M, Aruleba S, Balija V, O’Shaughnessy A, Gnoj L, Scobie K, Chang K, Westbrook T, Cleary M, Sachidanandam R, McCombie WR, Elledge SJ, Hannon GJ (2004) A resource for large-scale RNA-interference-based screens in mammals. Nature 428:427–431
84. Sachse C, Krausz E, Kronke A, Hannus M, Walsh A, Grabner A, Ovcharenko D, Dorris D, Trudel C, Sonnichsen B, Echeverri CJ (2005) High-throughput RNA interference strategies for target discovery and validation by using synthetic short interfering RNAs: functional genomics investigations of biological pathways. Methods Enzymol 392:242–277
85. Kimura J, Nguyen ST, Liu H, Taira N, Miki Y, Yoshida K (2008) A functional genome-wide RNAi screen identifies TAF1 as a regulator for apoptosis in response to genotoxic stress. Nucleic Acids Res 36:5250–5259
86. Dykxhoorn DM, Lieberman J (2005) The silent revolution: RNA interference as basic biology, research tool, and therapeutic. Annu Rev Med 56:401–423
87. Mousses S, Caplen NJ, Cornelison R, Weaver D, Basik M, Hautaniemi S, Elkahloun AG, Lotufo RA, Choudary A, Dougherty ER, Suh E, Kallioniemi O (2003) RNAi microarray analysis in cultured mammalian cells. Genome Res 13:2341–2347
88. Ortega-Paino E, Fransson J, Ek S, Borrebaeck CA (2008) Functionally associated targets in mantle cell lymphoma as defined by DNA microarrays and RNA interference. Blood 111:1617–1624
89. Semizarov D, Kroeger P, Fesik S (2004) siRNA-mediated gene silencing: a global genome view. Nucleic Acids Res 32:3836–3845
90. Vanhecke D, Janitz M (2004) High-throughput gene silencing using cell arrays. Oncogene 23:8353–8358
91. Wheeler DB, Carpenter AE, Sabatini DM (2005) Cell microarrays and RNA interference chip away at gene function. Nat Genet 37(Suppl):S25–S30
92. Castel D, Debily MA, Pitaval A, Gidrol X (2007) Cell microarray for functional exploration of genomes. Methods Mol Biol 381:375–384
93. Fjeldbo C, Misund K, Günther C, Langaas M, Steigedal T, Thommesen L, Laegreid A, Bruland T (2008) Functional studies on transfected cell microarray analysed by linear regression modelling. Nucleic Acids Res 36:e97
94. Whitehurst AW, Bodemann BO, Cardenas J, Ferguson D, Girard L, Peyton M, Minna JD, Michnoff C, Hao W, Roth MG, Xie XJ, White MA (2007) Synthetic lethal screen identification of chemosensitizer loci in cancer cells. Nature 446:815–819
95. Zhang X, Yang X, Chung N, Gates A, Stec E, Kunapuli P, Holder D, Ferrer M, Espeseth A (2006) Robust statistical methods for hit selection in RNA interference high-throughput screening experiments. Pharmacogenomics 7:299–309
96. Thomas RK, Weir B, Meyerson M (2006) Genomic approaches to lung cancer. Clin Cancer Res 12:4384s–4391s
97. Ghosh D, Poisson L (2009) “Omics” data and levels of evidence for biomarker discovery. Genomics 93:13–16
98. Tuzmen S, Kiefer J, Mousses S (2007) Validation of short interfering RNA knockdowns by quantitative real-time PCR. Methods Mol Biol 353:177–203
99. Mehler MF (2008) Epigenetic principles and mechanisms underlying nervous system functions in health and disease. Prog Neurobiol 86:305–341
100. Meng F, Dong B, Li H, Fan D, Ding J (2009) RNAi-mediated inhibition of Raf-1 leads to decreased angiogenesis and tumor growth in gastric cancer. Cancer Biol Ther 8:174–179
101. Echeverri CJ, Beachy PA, Baum B, Boutros M, Buchholz F, Chanda SK, Downward J, Ellenberg J, Fraser AG, Hacohen N, Hahn WC, Jackson AL, Kiger A, Linsley PS, Lum L, Ma Y, Mathey-Prevot B, Root DE, Sabatini DM, Taipale J, Perrimon N, Bernards R (2006) Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Methods 3:777–779

Chapter 19

Genetic Predisposition to β-Thalassemia and Sickle Cell Anemia in Turkey: A Molecular Diagnostic Approach

A. Nazli Basak and Sukru Tuzmen

Abstract

The thalassemia syndromes are a diverse group of inherited disorders characterized by insufficient synthesis or absent production of one or more of the globin chains. They are classified into α, β, γ, δβ, δ, and εγδβ thalassemias depending on the globin chain(s) affected. The β-thalassemias are the group of inherited hemoglobin disorders characterized by reduced synthesis (β⁺-thalassemia) or absence (β⁰-thalassemia) of β-globin chain production (1). Although hemoglobinopathies such as β-thalassemia and sickle cell anemia are single-gene disorders, and the β-globin gene family is far less complex than the genetics of multifactorial disorders such as cancer, they are far from being fully resolved in terms of cure. Currently, there are no definitive therapeutic options for patients with β-thalassemia and sickle cell anemia, and new insights into the pathogenesis of these devastating diseases are urgently needed. Here we describe in detail the molecular diagnostic approaches that contribute to the population-specific mutational analysis of the β-globin gene. We also present molecular diagnostic strategies that are applicable to β-thalassemia, sickle cell anemia, and other genetic disorders.

Key words: Beta globin gene, β-Thalassemia, Sickle cell anemia, Hemoglobinopathies, PCR, Diagnostics, Molecular pathology, Turkey

1. Introduction

1.1. β-Thalassemia

Diseases related to the hemoglobin (Hb) molecule can result from either structural anomalies, which result in “hemoglobinopathies,” or deficiencies in the synthesis of one or more of the polypeptide chains of the molecule, which yield thalassemias (2). The thalassemias are a diverse group of Hb disorders, and more than 200 mutations, resulting in varying levels of β-globin gene (1) expression, are known to produce β-thalassemia (Fig. 1). It is widely recognized that β-thalassemia mutations provide selective protection and a heterozygous advantage against malaria, and some of these mutations have reached high gene frequencies in the tropical regions where this infectious disease is endemic. In these populations, a limited number of alleles account for the majority of β-thalassemias, and only a small percentage of the disease phenotype is due to a few rare mutations (3, 4). According to recent findings, the multilayered complexity in the phenotype of β-thalassemia is the result of marked molecular diversity not only at the β-globin locus but also at the level of several other genes. The heterogeneity at the β-globin locus, which is the most reliable and predictive factor of disease phenotype reported so far, is simplified, to a certain extent, by the fact that the mutations are ethnic-origin specific. Fifty of the >200 β-thalassemia mutations account for 90–95% of β-thalassemias worldwide. In the broad group of Mediterranean countries, ~35 mutations have been reported thus far; however, allele frequencies vary among countries (2, 5). As many as 250 million people (4.5% of the world population) may be heterozygous for a defective globin gene (1), and >300,000 lethally affected homozygotes are born each year (6).

Fig. 1. Beta globin gene: >200 mutations reported thus far (adapted from ref. (5)).

Johanna K. DiStefano (ed.), Disease Gene Identification: Methods and Protocols, Methods in Molecular Biology, vol. 700, DOI 10.1007/978-1-61737-954-3_19, © Springer Science+Business Media, LLC 2011

1.2. Sickle Cell Anemia

Sickle cell disease (SCD) is an autosomal recessive disorder that affects Mediterranean, Hispanic, Arabic, African, and some Asian populations. More than 300,000 individuals worldwide suffer from this disease. The SCD mutation is well known to confer increased resistance to malaria on carriers of the sickle cell trait, that is, on heterozygotes (7). The molecular basis for sickle cell disease is an A to T transversion in the sixth codon of the human β-globin gene (8–11). This simple transversion changes a polar glutamic acid residue to a nonpolar valine in the βS-globin chain on the surface of HbS (α2βS2) tetramers. The valine creates a hydrophobic projection that fits into a natural hydrophobic pocket formed on Hb tetramers after deoxygenation (12, 13). The interaction of tetramers results in the formation of HbS polymers/fibers that cause red blood cells (RBCs) to become rigid and nondeformable, which increases their tendency to occlude small capillaries (8, 14–18). Such vaso-occlusive events cause severe tissue damage that can result in strokes, splenic infarction, kidney failure, liver and lung disorders, painful crises, and other complications. Cycles of erythrocyte sickling also cause cells to become fragile and prone to lysis, which produces chronic anemia (14, 19). SCD is normally a relatively benign disorder in the first few months of life because human fetal hemoglobin (HbF) has potent antisickling properties. HbF, which comprises 70–90% of total hemoglobin at birth, is gradually replaced by HbS during the first few months of life. Rising HbS levels result in the onset of disease between 3 and 6 months of age (8, 14, 19, 20).
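The codon change described above can be traced in a few lines. This is a minimal sketch, not part of the chapter's protocols: it assumes the standard opening of the β-globin coding sequence (codons 1 to 8, numbered from the first residue of the mature chain), and the helper names `codons`, `substitute`, and `translate` are hypothetical.

```python
# First eight codons of the normal beta-globin coding sequence
# (Val-His-Leu-Thr-Pro-Glu-Glu-Lys); the initiator ATG is omitted so that
# codon numbering matches the mature protein.
NORMAL = "GTGCACCTGACTCCTGAGGAGAAG"

# Only the codons needed for this illustration, not a full genetic code.
CODON_TABLE = {
    "GTG": "Val", "CAC": "His", "CTG": "Leu", "ACT": "Thr",
    "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}


def codons(seq):
    """Split a coding sequence into consecutive triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def substitute(seq, codon_number, position_in_codon, new_base):
    """Return seq with one base replaced (codon and position are 1-based)."""
    index = (codon_number - 1) * 3 + (position_in_codon - 1)
    return seq[:index] + new_base + seq[index + 1:]


def translate(seq):
    """Translate a sequence using the small table above."""
    return [CODON_TABLE[c] for c in codons(seq)]


# A-to-T transversion at the second base of codon 6 (GAG -> GTG).
SICKLE = substitute(NORMAL, 6, 2, "T")

print("normal codon 6:", codons(NORMAL)[5], "->", translate(NORMAL)[5])
print("sickle codon 6:", codons(SICKLE)[5], "->", translate(SICKLE)[5])
```

The single base change leaves the other seven codons untouched and replaces glutamic acid with valine at position 6 only.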

2. Diagnosis of β-Thalassemia

Clinical evaluation of the patient with thalassemia is usually informed by the individual medical history, particularly the family history of the disease and the geographic region of familial origin, and by characteristic physical features upon examination. These have been well summarized elsewhere, especially in the context of the differential diagnosis of other congenital anemias (2). Table 1 summarizes the most important aspects of the evaluation of medical history and physical examination, and also lists the typical diagnostic laboratory tests, as well as the DNA analyses for determination of specific β-globin gene mutations (1, 2).

Table 1. Diagnosis and DNA analysis of the β-thalassemic patient (adapted from ref. (2))

3. Molecular Diagnosis of β-Thalassemia and Sickle Cell Anemia

The advent of PCR technology (21) opened a new era in molecular diagnostics. This technology, combined with the use of oligonucleotide probes/primers to detect individual mutations, led to methods that have been very beneficial to scientists not only in the globin field but also in other genetic disease areas such as cystic fibrosis (CF) (22, 23), Duchenne muscular dystrophy (DMD) (24–28), hemophilia (29, 30), and cancer (31–33). PCR-based techniques provide a rapid, efficient, and simple approach for carrier detection and prenatal diagnosis of β-thalassemia and SCD, and these techniques are widely utilized, particularly in countries where such diseases are endemic (2). Within the context of this review chapter, methodologies involved in the identification of SCD and β-thalassemia lesions caused by point mutations are addressed. Deletions, on the other hand, which are not a common cause of β-thalassemia and SCD, can be identified by using PCR-based methods to amplify across deletion breakpoints by the technique of gap PCR (34) or, in the case of large deletions, by Southern blot analysis (35). Other β-globin gene-related deletions are listed elsewhere (36).

The strategy for elucidating genetic mutations typically involves initial screening for the hot-spot mutations predominating in the at-risk population. If known mutation(s) are not responsible, then other methods are utilized to identify the causal variants. There are several PCR-based methodologies currently in use as diagnostic tools for elucidating the genetic diseases of hemoglobin. Here we classify these techniques under two main categories: known or known/unknown mutations (Fig. 2). As shown in Fig. 2a, methodologies for assessing known mutations include amplification refractory mutation system (ARMS) (37, 38), heteroduplex analysis (39), restriction enzyme analysis, especially for SCD (40), isotopic dot-blot analysis (41), and reverse, nonisotopic, dot-blot analysis (42, 43). As shown in Fig. 2b, known/unknown mutations are evaluated using techniques such as denaturing gradient gel electrophoresis (DGGE) (44, 45), manual and automated sequencing (46), and high-density oligonucleotide probe array analysis (47, 48). Despite the heterogeneous segregation of β-thalassemia mutations in individual geographical locations, a small number of molecular lesions are prevalent in each at-risk population (Fig. 3). This is very useful in diagnostic applications because it restricts the number of β-thalassemia mutations to be investigated. In diagnostic situations where none of the common β-thalassemia mutations are encountered, combinations of techniques such as DGGE and sequencing are the methods of choice (49). In this review chapter, we illustrate the results obtained from the Turkish population utilizing the common diagnostic methods that elucidate individual patterns of β-thalassemia mutations and SCD.
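Restriction enzyme analysis for SCD works because the sickle mutation removes a recognition site. The sketch below illustrates the idea on a short stretch of sequence. It rests on several assumptions that are not taken from the chapter: DdeI (recognition site CTNAG, cut taken as falling after the first C) is used as the example enzyme, the 24-base fragment is far shorter than a real amplicon, and the function names are hypothetical.

```python
import re

# Codons 1-8 of the normal and sickle beta-globin coding sequence.
NORMAL = "GTGCACCTGACTCCTGAGGAGAAG"
SICKLE = "GTGCACCTGACTCCTGTGGAGAAG"

# DdeI recognition site C^TNAG; the lookahead also finds overlapping sites.
DDEI_SITE = re.compile(r"(?=CT[ACGT]AG)")


def cut_positions(seq):
    """Positions at which the enzyme would cut (just after the first C)."""
    return [match.start() + 1 for match in DDEI_SITE.finditer(seq)]


def digest(seq):
    """Return the fragments produced by cutting at every recognition site."""
    edges = [0] + cut_positions(seq) + [len(seq)]
    return [seq[start:end] for start, end in zip(edges, edges[1:])]


for label, seq in (("normal", NORMAL), ("sickle", SICKLE)):
    sizes = [len(fragment) for fragment in digest(seq)]
    print(f"{label}: fragment sizes {sizes}")
```

In this toy fragment the normal allele is cut once and the sickle allele is left intact. On a gel, a heterozygote would therefore show both the cut and the uncut patterns, which is what makes the approach usable for carrier detection.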

Fig. 2. The techniques utilized for the analyses of (a) known mutations and (b) known/unknown mutations (adapted from ref. (2)). Panel labels: DGGE analysis; manual sequencing using 33P-labelled ddNTPs; automated sequencing using dRhodamine dye terminators; oligonucleotide microarray (DNA chip analysis).

Fig. 3. Population distribution of prevalent β-thalassemia mutations (adapted from ref. (2)). Mutations and frequencies shown for each population panel:

Cd 39 (C-T) (40.0%); IVS-I-110 (G-A) (23.0%); IVS-I-1 (G-A) (10.5%); IVS-I-6 (T-C) (9.8%)

IVS-I-110 (G-A) (43.4%); Cd 39 (C-T) (18.3%); IVS-I-1 (G-A) (12.8%); IVS-I-6 (T-C) (7.4%)

IVS-I-110 (G-A) (39.3%); IVS-I-6 (T-C) (10.1%); FSC-8 (-AA) (5.5%); IVS-I-1 (G-A) (5.0%); IVS-II-745 (C-G) (5.0%); IVS-II-1 (G-A) (4.7%)

Cd 41/42 (4bp del) (44.2%); Cd 17 (A-T) (19.2%); -28 (A-G) (19.2%); IVS-II-654 (C-T) (6.4%)

IVS-I-110 (G-A) (77.0%); IVS-I-6 (T-C) (6.7%); IVS-I-1 (G-A) (6.3%); IVS-II-745 (C-G) (5.8%); Cd 39 (C-T) (1.9%)

IVS-I-5 (G-C) (38.3%); 619 bp del (19.2%); Cd 8/9 (+G) (16.4%); IVS-I-1 (G-T) (10.0%); Cd 41/42 (4bp del) (9.7%)

IVS-I-5 (G-C) (54.0%); IVS-II-654 (C-T) (12.0%); Cd 15 (G-A) (7.0%)

4. Pharmacogenomic Intervention

Pharmacological regulation of globin gene expression as the basis for therapeutic intervention in certain genetic diseases of hemoglobin, such as sickle cell anemia and the thalassemias, has been the subject of considerable investigation. Compounds capable of increasing the expression levels of fetal hemoglobin, and thereby replacing the impaired adult hemoglobins, have been suggested as suitable treatments for sickle cell anemia and β-thalassemia (8, 14, 19, 50–56). To date, however, no robust treatment regimens have been established. Consequently, the development of robust and simple assays that can correlate the changes in expression between the β and γ globin chains has been proposed as a way to screen for new chemical compounds that could increase HbF levels and hence act as therapeutic compounds. Techniques used for such analyses have included dual reporter assay systems (57), measurement of hemoglobins (58, 59), and measurement of mRNA levels by RNase protection assays (51) or by qPCR analysis (60). Worldwide studies suggest that the best hope for therapeutic development for these diseases may lie in the discovery of agents that target the physiologic effects of the altered pathways and processes rather than their individual gene components. However, until such agents are developed, the best strategies to reduce the risk of the genetic diseases of hemoglobin are to be found in DNA-based molecular diagnostic approaches and population-wide education, combined with cost-effective services for the treatment and prevention of these disorders.
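As a rough illustration of how qPCR readouts could be turned into a screening metric for HbF-inducing compounds, the sketch below converts threshold cycle (Ct) values into a γ-globin share of total γ plus β transcript. This is not the assay of ref. (60): it assumes perfect doubling per cycle (efficiency of 2), a single reference gene, and entirely hypothetical Ct values and function names.

```python
def relative_quantity(ct_target, ct_reference, efficiency=2.0):
    """Target abundance relative to the reference gene (efficiency ** -dCt)."""
    return efficiency ** (ct_reference - ct_target)


def gamma_fraction(ct_gamma, ct_beta, ct_reference):
    """Share of gamma-globin mRNA in the combined gamma + beta signal."""
    gamma = relative_quantity(ct_gamma, ct_reference)
    beta = relative_quantity(ct_beta, ct_reference)
    return gamma / (gamma + beta)


# Hypothetical Ct values for an untreated and a compound-treated culture.
untreated = gamma_fraction(ct_gamma=26.0, ct_beta=20.0, ct_reference=18.0)
treated = gamma_fraction(ct_gamma=23.0, ct_beta=20.0, ct_reference=18.0)

print(f"untreated gamma share: {untreated:.3f}")
print(f"treated gamma share:   {treated:.3f}")
print(f"fold change in share:  {treated / untreated:.1f}")
```

Expressing the result as a ratio of the two globin transcripts, rather than as γ alone, means a compound that simply raises or lowers overall globin expression does not register as a hit. In practice, primer efficiencies would need to be measured rather than assumed.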

5. β-Thalassemia and Sickle Cell Anemia in Turkey

In Turkey, as in many Mediterranean countries, β-thalassemia is a major public health concern. Throughout the country, the gene frequency is estimated to be 2.1%, and it increases to 10% in certain regions. For SCD, we have found the estimated gene frequency to be 4.6% in Turkey. In this country, the estimated number of β-thalassemia carriers is 1,300,000 and the number of homozygous β-thalassemia patients is approximately 4,000. The number of affected births is higher than expected, as the birthrate is still very high in Turkey, and the proportion of consanguineous marriages is above 60% in the eastern parts of the country (5). In contrast to many other Mediterranean countries, however, β-thalassemia in Turkey is very heterogeneous at the clinical level, and transfusion-dependent β-thalassemia is the predominant form (2, 61).

The molecular diagnostic approach described herein was initiated in 1987 to elucidate the molecular basis of β-thalassemia and SCD in Turkey. At that time, the molecular basis of β-thalassemia was largely unknown. The main purpose of the work was to establish a comprehensive prenatal diagnosis strategy, based on DNA analysis, similar to successful practices in many at-risk populations (5). Between the years 1987 and 2006, more than 1,500 unrelated patients with homozygous β-thalassemia, not pre-selected and representing more than 3,000 β-thalassemia chromosomes, were subjected to DNA analysis (5). Originally, the conventional dot-blot hybridization technique was employed, which is the preferred method for mass-screening of a β-thalassemia population (62). This method was later replaced by the β-Globin Strip Assay (Vienna Lab, Vienna, Austria), based on reverse hybridization, which also enables the detection of SCD (HbS). In both cases, the results were confirmed by the amplification refractory mutation system (ARMS) and/or restriction enzyme analysis (2, 62). The samples not defined by the above approaches were subjected to direct DNA sequencing (2, 62, 63). Here we include previously unpublished additional results obtained between the years 2006 and 2008, and report the collective end results of over 3,100 chromosomes (Table 2). In terms of the heterogeneity of Turkish β-thalassemia mutations, these findings are consistent with the previously published data (5). The ratio of β⁰:β⁺-thalassemia mutations is 1:1, but the majority of β⁺-thalassemia cases bear the severe IVS-I-110 lesion; hence, most of these mutations give rise to β-thalassemia major in homozygous or compound heterozygous combinations.

In addition to the 13 common mutations, several rare and five novel β-thalassemia mutations were reported between the years 1987 and 2008 (Table 2) in the framework of this project, totaling 39 mutations including SCD. The substantial molecular heterogeneity of β-thalassemia in Turkey can be explained by the unique geographical location and rich history of this area. Historically, Turkey served as an important crossroad among civilizations, cultures, and continents for several centuries (64, 65). However, this study shows that despite this high degree of molecular heterogeneity, the advent of PCR-based techniques and improved methodologies of early fetal sampling have made heterozygote screening and prenatal diagnosis feasible in Turkey (2, 5, 49, 62, 64). We first investigated the distribution of known mutations in six geographical regions of Turkey by organizing patients according to geographical region of origin (5, 64, 66). As shown in Fig. 4, we found that the results from Istanbul and the Western parts of


Table 2. The six most common mutations add up to 70.3%, and the overall frequency of the first 12 mutations including HbS is 83.5%

Mutation

Frequency (%)

Mutation

Frequency (%)

IVS-I-110 (G→A)

39.2

IVS-II-654 (C→T)