English Pages 562 [563] Year 2023
Methods in Molecular Biology 2672
Tony Heitkam · Sònia Garcia Editors
Plant Cytogenetics and Cytogenomics Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Plant Cytogenetics and Cytogenomics Methods and Protocols
Edited by
Tony Heitkam Institute of Botany, TU Dresden, Dresden, Germany
Sònia Garcia Botanical Institute of Barcelona, Barcelona, Spain
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-3225-3 ISBN 978-1-0716-3226-0 (eBook) https://doi.org/10.1007/978-1-0716-3226-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Cover Illustration Caption: The cover inset shows a detail from Figure 1 of Chapter 19. A chromosome painting cocktail consisting of 171 BAC clones was hybridized (red, green, and yellow) onto the meiotic chromosome At1 from Arabidopsis thaliana. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. 
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface

Although mitotic chromosomes were first observed over a century ago, the study of nuclei continues to offer insights into the biology and evolution of chromosomes and genomes. The plain observation of chromosomes by light microscopy evolved to the widespread use of in situ hybridization methods, which allowed the detection of specific DNA sequences directly along the chromosomes. This led to a new era of cytogenetics coupled with molecular biology. While the first hybridization experiments carried out by Pardue and Gall in 1969 still used radioactive labels, these were replaced with fluorescent dyes about 10 years later, contributing to the success and widespread use of the technique.

Initially, cytogenetics was understood as the branch of genetics studying the genetic material in the cell nucleus. However, the concept evolved to include the study of nucleus architecture and genome size, among other subjects. As technology advances, the field is also growing, with new questions arising and old questions waiting for re-inspection. We see a higher precision in chromosome inspection and deep dives into model organisms and their chromosomes, but also more breadth in fully exploring organismal and chromosomal diversity. And most importantly, we observe the convergence of different research subjects, such as cytogenetics, phylogenetics, bioinformatics, and genomics.

As we are now in the age of gene editing, genomics, and big data, the cytogenetics field is also being re-positioned. It is now our generation’s task to actively identify the open questions to which the cytogenetics field can contribute, and the opportunities are plentiful. Genome sequence data alone cannot provide insights into the genome’s organization and packaging into chromosomes. But these are needed to understand the principles of gene regulation, recombination, and speciation. Hence, integrating a cytogenetics base into the portfolio of emerging technologies, especially genomics, will serve to gain a better understanding of plant genome and chromosome evolution, the mechanisms of structural variations, and the role of chromatin packing. We foresee that insights in these areas will significantly accelerate plant breeding and pave the way toward more productive and resilient crops that can better withstand the challenges of climate change.

To be able to navigate this changing field successfully, we provide an update on the most used procedures for the study of plant chromosomes and nuclei. From the classical basic karyological approaches to the most recent genomics-informed and computational methods, this book aims to be a practical resource for plant scientists interested in molecular and evolutionary biology, breeding, systematics, and plant -omics in general. Hence, next to providing an up-to-date account of the most widespread methods by specialists in the fields, we also included emerging methods that will likely be adopted by more labs in the future. To account for the strong need to integrate cytogenetics with bioinformatics and computational genomics, this new edition also contains eight computationally oriented chapters. This goes along with a renaming into Plant Cytogenetics and Cytogenomics for this new edition.
To help the reader in placing the field and its advancements in time, Trude Schwarzacher, Qing Liu, and Pat Heslop-Harrison start with a historical perspective and frame the overall book in Part I (Chapter 1). Starting with Waldeyer’s suggestion of the term “chromosome” in 1888, they trace the route of plant cytogenetics until reaching today and predicting tomorrow. From here on, the book is devoted to in-depth views of generalist and specialist techniques. For easy reference, we have divided the book into seven sections, each addressing a specific field of interest in cytogenetics and cytogenomics.

Part II deals with methods for the estimation of genome size and ploidy. Through an extensive protocol, João Loureiro and colleagues (Chapter 2) review flow cytometry as the gold standard for assessing genome size and ploidy level in plants, offering useful advice on how to deal with seeds or desiccated materials as well as methodological aspects regarding sampling and storage, seldom addressed in such contributions. Rafael Martín-Martín and colleagues share their expertise on how to estimate nuclear DNA contents in difficult materials such as seaweed by fluorimetry (Chapter 3). The remaining two chapters of this section are already a demonstration of the usefulness (and new prevalence) of genomics in a plant cytogenetics context: while Uljana Hesse (Chapter 4) shows how to exploit K-mer data to estimate genome sizes, Juan Viruel and co-authors (Chapter 5) deploy target capture sequence data for addressing ploidy levels.

In Part III, a series of protocols to make chromosome preparations is presented. First, Patrice S. Albert and James A. Birchler share a technique in which root tips are treated with nitrous oxide gas before putting them into fixative, resulting in both well-spread chromosomes and a high mitotic index (Chapter 6). Continuing after fixation, Alexis Maravilla, Marcela Rosato, and Josep A. Rosselló offer their long-term experience on the classical squash technique to prepare mitotic chromosomes (Chapter 7), as do Nicola Schmidt and colleagues on the methodology to obtain chromosome preparations by the dropping technique (Chapter 8). We also present here a protocol for laser microdissection to isolate specific cells or chromosomes, by Tomáš Janíček, Roman Hobza, and Vojtěch Hudzieczek (Chapter 9), one of the most precise and efficient dissection techniques. To end this book section, Petr Cápal and colleagues (Chapter 10) show how flow cytometry can be deployed to analyze and manipulate plant chromosomes, in which a liquid stream allows us to classify chromosomes depending on their optical properties.

The next section, Part IV, addresses banding and staining techniques. Here we find several classical, yet still useful, protocols in cytogenetics. The first is a C-banding protocol, revised by Adam J. Lukaszewski (Chapter 11), where constitutive heterochromatin bands are effective for karyotyping or meiotic chromosome pairing analyses, among others. The second, by Ana Emília Barros e Silva and Marcelo Guerra, deals with a CMA/DAPI double staining protocol (Chapter 12) while also advising about the most common sources of misinterpretation. The third also approaches a “classic”: silver nitrate staining, by Claudio Palma-Rojas, Pedro Jara-Seguel, and Cristian Araya-Jaime (Chapter 13), highlighting the simplicity and utility of the method. A protocol on chromatin immunostaining to analyze the epigenetics of nuclei, by Nobuko Ohmido and Aqwin Polosoro (Chapter 14), closes this section.

Part V focuses on in situ hybridization and other methods using fluorescent labels. Not surprisingly, this section is the largest of this book. To make sure that all of the methods result in beautiful chromosome images, Hans de Jong, José van de Belt, and Paul Fransz open by offering advice on the processing of fluorescence chromosome images, without the need to know complex software (Chapter 15). Hanna Weiss-Schneeweiss and Tae-Soo Jang guide us through a highly efficient, simple, and non-toxic (formamide-free) genomic in situ hybridization protocol (Chapter 16), while Gülru Yücel, Magdalena Senderowicz, and Bożena Kolano review the use of ribosomal DNA probes for comparative cytogenetics (Chapter 17), as these are still the most widely used chromosomal landmarks. Martin Lyčka and colleagues present several methods to identify telomere motifs and measure their length (Chapter 18), combining wet-lab and in silico data. Next, Terezie Mandáková and Martin A. Lysak share their expertise in chromosome painting to visualize large chromosomal regions (Chapter 19). Introducing genome editing technologies to plant cytogenetics, Bhanu Prakash Potlapalli and colleagues show a CRISPR/Cas9-based in situ labelling method (Chapter 20). This CRISPR method circumvents the aggressive multiple washing steps of traditional FISH and hence reduces damage to the chromatin structure. Jana Lunerová and Radka Vozárová show us how to prepare meiotic chromosomes and then share their methods for performing fluorescence in situ hybridization and immunodetection onto meiotic spreads (Chapter 21). A second contribution by Paul Fransz, José van de Belt, and Hans de Jong on the preparation of extended DNA fibers for in situ hybridization is provided next (Chapter 22): this method offers means to detect physical chromosomal rearrangements in the era of high-throughput sequencing technologies. A cutting-edge method to visualize chromosome territories is presented by Kateřina Perničková and David Kopecký (Chapter 23), combining flow sorting, GISH, confocal microscopy, and 3D modelling software. Finally, Martina Dvořáčková and Jiří Fajkus show a straightforward protocol to visualize the nucleolus (Chapter 24) by the incorporation of 5′ ethynyl uridine into bulk RNA.
Part VI clearly shows how genomics-informed methods have come to stay in plant cytogenetics. Guanqing Liu and Tao Zhang explain how to design oligonucleotide probes through a bioinformatics pipeline based on NGS data (Chapter 25), while the next chapter, by Ludwig Mann and Sophie Maiwald, deals with the visualization of such probes on pseudochromosomes (Chapter 26). Denisa Beránková and Eva Hřibová present how to deploy bulked oligonucleotides for chromosome painting, by showing how to amplify and label these probes (Chapter 27). The next protocol, by Hana Šimková, Zuzana Tulpová, and Petr Cápal (Chapter 28), introduces us to flow sorting-assisted optical mapping, in which flow cytometry enables efficient purification of metaphase chromosomes to construct optical maps, facilitating genome assemblies and analyses of genome structural variation. The three-dimensional organization of mitotic chromosomes is addressed by a chromosome conformation method presented by Petr Cápal (Chapter 29), and finally, Aleš Kovařík and colleagues show how to benefit from NGS data and the bioinformatics pipeline RepeatExplorer2 to analyze the genomic organization of 5S ribosomal DNA (Chapter 30).

The last section of the book, Part VII, presents a range of useful software and online tools for plant cytogenetics and genomics. Shoaeib Mahmoudi and Ghader Mirzaghaderi explain how to use KaryoMeasure to draw informative ideograms (Chapter 31). Marcial Escudero and colleagues review the use of ChromEvol, a program that models chromosome number change along a phylogeny (Chapter 32). The final chapter, by María Luisa Gutiérrez and colleagues (Chapter 33), collects a series of online resources useful for plant cytogenetics and genomics, including databases and analytical tools, among others.

To prepare the book that you have in your hands, we have counted on the contribution of exactly 100 authors, expert scientists in the topics of plant cytogenetics and cytogenomics.
These researchers use the following protocols on a regular basis, and now they share their most valued tricks and tips, accessible through the Notes section of each chapter. This is also a global book, collecting plant cytogenetics expertise across a wide geographical range including European, North and South American, Asian, and African countries, reflecting the wide interest in the topic around the globe. With the advancement of sequencing technologies and the growing amount of genomics data, placing the DNA sequence information in a chromosome/nucleus context is more needed than ever, and we expect that this protocol collection will also help to reach this goal. We hope that you enjoy this book!

Tony Heitkam, Dresden, Germany
Sònia Garcia, Barcelona, Spain
Contents

Preface
Contributors

PART I  INTRODUCTION

1 Plant Cytogenetics: From Chromosomes to Cytogenomics
Trude Schwarzacher, Qing Liu, and J. S. (Pat) Heslop-Harrison

PART II  SIZING THE NUCLEUS: METHODS FOR GENOME SIZE AND PLOIDY LEVEL ESTIMATION

2 The Use of Flow Cytometry for Estimating Genome Sizes and DNA Ploidy Levels in Plants
João Loureiro, Martin Čertner, Magdalena Lučanová, Elwira Sliwinska, Filip Kolář, Jaroslav Doležel, Sònia Garcia, Sílvia Castro, and David W. Galbraith

3 Nuclear DNA Content Estimation of Seaweed by Fluorimetry Analysis
Rafael P. Martín-Martín, Noemí Salvador-Soler, Jordi Rull Lluch, and Amelia Gómez Garreta

4 K-Mer-Based Genome Size Estimation in Theory and Practice
Uljana Hesse

5 A Bioinformatic Pipeline to Estimate Ploidy Level from Target Capture Sequence Data Obtained from Herbarium Specimens
Juan Viruel, Oriane Hidalgo, Lisa Pokorny, Félix Forest, Barbara Gravendeel, Paul Wilkin, and Ilia J. Leitch

PART III  GETTING TO THE CHROMOSOMES: METHODS FOR CHROMOSOME FIXATION, PREPARATION, AND MANIPULATION

6 Nitrous Oxide-Induced Metaphase Arrest: A Technique for Somatic Chromosome Analysis
Patrice S. Albert and James A. Birchler

7 Preparation of Mitotic Chromosomes with the Squash Technique
Alexis J. Maravilla, Marcela Rosato, and Josep A. Rosselló

8 Preparation of Mitotic Chromosomes with the Dropping Technique
Nicola Schmidt, Beatrice Weber, Jessica Klekar, Susan Liedtke, Sarah Breitenbach, and Tony Heitkam

9 Laser Capture Microdissection: From Genomes to Chromosomes, from Complex Tissue to Single-Cell Analysis
Tomáš Janíček, Roman Hobza, and Vojtěch Hudzieczek

10 Flow Cytometric Analysis and Sorting of Plant Chromosomes
Petr Cápal, Mahmoud Said, István Molnár, and Jaroslav Doležel

PART IV  PUTTING ON COLOR: BANDING AND STAINING TECHNIQUES

11 C-Banding of Plant Chromosomes
Adam J. Lukaszewski

12 CMA/DAPI Banding of Plant Chromosomes
Ana Emília Barros e Silva and Marcelo Guerra

13 Silver Nitrate Staining of Nucleolar Organizer Regions (Ag-NORs) in Plant Chromosomes
Claudio Palma-Rojas, Pedro Jara-Seguel, and Cristian Araya-Jaime

14 Chromatin Immunostaining of Plant Nuclei
Nobuko Ohmido and Aqwin Polosoro

PART V  LABELING DNA: IN SITU HYBRIDIZATION AND OTHER METHODS USING FLUORESCENT LABELS

15 Critical Steps in DAPI and FISH Imaging of Chromosome Spread Preparations
Hans de Jong, José van de Belt, and Paul Fransz

16 Formamide-Free Genomic In Situ Hybridization (ff-GISH)
Hanna Weiss-Schneeweiss and Tae-Soo Jang

17 The Use of Ribosomal DNA for Comparative Cytogenetics
Gülru Yücel, Magdalena Senderowicz, and Bożena Kolano

18 Identification of the Sequence and the Length of Telomere DNA
Martin Lyčka, Petr Fajkus, Leon P. Jenner, Eva Sýkorová, Miloslava Fojtová, and Vratislav Peska

19 Chromosome Painting Using Chromosome-Specific BAC Clones
Terezie M. Mandáková and Martin A. Lysak

20 CRISPR-FISH: A CRISPR/Cas9-Based In Situ Labeling Method
Bhanu Prakash Potlapalli, Takayoshi Ishii, Kiyotaka Nagaki, Saravanakumar Somasundaram, and Andreas Houben

21 Preparation of Male Meiotic Chromosomes for Fluorescence In Situ Hybridization and Immunodetection with Major Focus on Dogroses
Jana Lunerová and Radka Vozárová

22 Extended DNA Fibers for High-Resolution Mapping
Paul Fransz, José van de Belt, and Hans de Jong

23 Visualizing Chromosome Territories and Nuclear Architecture of Large Plant Genomes Using Alien Introgressions
Kateřina Perničková and David Kopecký

24 Visualization of the Nucleolus Using 5′ Ethynyl Uridine
Martina Dvořáčková and Jiří Fajkus

PART VI  LEVERAGING DATA ONTO NUCLEI: GENOMICS-INFORMED METHODS

25 Bioinformatic Prediction of Bulked Oligonucleotide Probes for FISH Using Chorus2
Guanqing Liu and Tao Zhang

26 Visualization of Oligonucleotide-Based Probes Along Pseudochromosomes Using RIdeogram, KaryoploteR, and Circlize (Circos)
Ludwig Mann and Sophie Maiwald

27 Bulked Oligo-FISH for Chromosome Painting and Chromosome Barcoding
Denisa Beránková and Eva Hřibová

28 Flow Sorting–Assisted Optical Mapping
Hana Šimková, Zuzana Tulpová, and Petr Cápal

29 Chromosome Conformation Capture of Mitotic Chromosomes
Petr Cápal

30 Analysis of 5S rDNA Genomic Organization Through the RepeatExplorer2 Pipeline: A Simplified Protocol
Sònia Garcia, Joan Pere Pascual-Díaz, Alice Krumpolcová, and Aleš Kovařík

PART VII  SOFTWARE AND ONLINE PLANT CYTOGENETICS & GENOMICS RESOURCES

31 Tools for Drawing Informative Idiograms
Shoaeib Mahmoudi and Ghader Mirzaghaderi

32 Using ChromEvol to Determine the Mode of Chromosomal Evolution
Marcial Escudero, Enrique Maguilla, José Ignacio Márquez-Corro, Santiago Martín-Bravo, Itay Mayrose, Anat Shafir, Lu Tan, Carrie Tribble, and Rosana Zenil-Ferguson

33 Online Resources Useful for Plant Cytogenetics and Cytogenomics Research
María Luisa Gutiérrez, Roi Rodríguez-González, Joan Pere Pascual-Díaz, Inés Fuentes, and Sònia Garcia

Name Index
Subject Index
Contributors

J. S. (PAT) HESLOP-HARRISON • Department of Genetics and Genome Biology, University of Leicester, Leicester, UK; Key Laboratory of Plant Resources Conservation and Sustainable Utilization / Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China; Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, China; South China National Botanical Garden, Guangzhou, China
PATRICE S. ALBERT • Division of Biological Sciences, University of Missouri, Columbia, MO, USA
CRISTIAN ARAYA-JAIME • Departamento de Biología, Universidad de La Serena, La Serena, Chile; Instituto de Investigación Multidisciplinario de Investigación y postgrado, Universidad de La Serena, La Serena, Chile
ANA EMÍLIA BARROS E SILVA • Laboratório de Citogenética Vegetal, Departamento de Biociências, Centro de Ciências Agrárias, Universidade Federal da Paraíba, Areia, Paraíba, Brazil
JOSÉ VAN DE BELT • Wageningen University & Research, Laboratory of Genetics, Wageningen, The Netherlands
DENISA BERÁNKOVÁ • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic
JAMES A. BIRCHLER • Division of Biological Sciences, University of Missouri, Columbia, MO, USA
SARAH BREITENBACH • Faculty of Biology, Institute of Botany, Dresden, Germany
PETR CÁPAL • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic; Institute of Experimental Botany of the Czech Academy of Sciences, Centre of the Region Haná for Biotechnological and Agricultural Research, Olomouc, Czech Republic
SÍLVIA CASTRO • Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal
MARTIN ČERTNER • Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic; Czech Academy of Sciences, Institute of Botany, Průhonice, Czech Republic
JAROSLAV DOLEŽEL • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic
MARTINA DVOŘÁČKOVÁ • Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
MARCIAL ESCUDERO • Department of Plant Biology and Ecology, University of Seville, Seville, Spain
JIŘÍ FAJKUS • Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
PETR FAJKUS • Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czech Republic; Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
MILOSLAVA FOJTOVÁ • Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
FÉLIX FOREST • Royal Botanic Gardens, Kew, Richmond, UK
PAUL FRANSZ • Wageningen University & Research, Laboratory of Genetics, Wageningen, The Netherlands; Swammerdam Institute of Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
INÉS FUENTES • Institut Botànic de Barcelona (IBB-CSIC), Barcelona, Catalonia, Spain
DAVID W. GALBRAITH • School of Plant Sciences, BIO5 Institute, Arizona Cancer Center, Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA; Henan University, School of Life Sciences, State Key Laboratory of Crop Stress Adaptation and Improvement, State Key Laboratory of Cotton Biology, Key Laboratory of Plant Stress Biology, Kaifeng, China
SÒNIA GARCIA • Institut Botànic de Barcelona (IBB-CSIC, Ajuntament de Barcelona), Barcelona, Catalonia, Spain; Institut Botànic de Barcelona (IBB, CSIC – Ajuntament de Barcelona), Barcelona, Spain; Institut Botànic de Barcelona (IBB-CSIC), Barcelona, Catalonia, Spain
AMELIA GÓMEZ GARRETA • Laboratori de Botànica, Facultat de Farmàcia i Ciències de l’Alimentació; Institut de Recerca de la Biodiversitat (IRBio) & Centre de Documentació de Biodiversitat Vegetal (CeDocBiV), Universitat de Barcelona, Barcelona, Spain
BARBARA GRAVENDEEL • Naturalis Biodiversity Center, Evolutionary Ecology, Leiden, Netherlands; Radboud Institute for Biological and Environmental Sciences, Leiden University, Leiden, Netherlands
MARCELO GUERRA • Laboratório de Citogenética e Evolução Vegetal, Departamento de Botânica, Centro de Biociências, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
MARÍA LUISA GUTIÉRREZ • Institut Botànic de Barcelona (IBB-CSIC), Barcelona, Catalonia, Spain
TONY HEITKAM • Faculty of Biology, Institute of Botany, Dresden, Germany
ULJANA HESSE • Department of Biotechnology, University of the Western Cape, Bellville, South Africa
ORIANE HIDALGO • Royal Botanic Gardens, Kew, Richmond, UK; Institut Botànic de Barcelona (IBB, CSIC-Ajuntament de Barcelona), Barcelona, Catalonia, Spain
ROMAN HOBZA • Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
ANDREAS HOUBEN • Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Seeland, Germany
EVA HŘIBOVÁ • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic
VOJTĚCH HUDZIECZEK • Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
TAKAYOSHI ISHII • Arid Land Research Center (ALRC), Tottori University, Hamasaka, Tottori, Japan
TAE-SOO JANG • Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria; Department of Biological Science, College of Bioscience and Biotechnology, Chungnam National University, Daejeon, South Korea
TOMÁŠ JANÍČEK • Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
PEDRO JARA-SEGUEL • Departamento de Ciencias Biológicas y Químicas, Facultad de Recursos Naturales, Universidad Católica de Temuco, Temuco, Chile; Núcleo de Estudios Ambientales, Facultad de Recursos Naturales, Universidad Católica de Temuco, Temuco, Chile
LEON P. JENNER • Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
HANS DE JONG • Wageningen University & Research, Laboratory of Genetics, Wageningen, The Netherlands
JESSICA KLEKAR • Faculty of Biology, Institute of Botany, Dresden, Germany
BOŻENA KOLANO • Plant Cytogenetics and Molecular Biology Group, Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia in Katowice, Katowice, Poland
FILIP KOLÁŘ • Department of Botany, Faculty of Science, Charles University, Prague, Czech Republic; Czech Academy of Sciences, Institute of Botany, Průhonice, Czech Republic
DAVID KOPECKÝ • Centre of Plant Structural and Functional Genomics, Institute of Experimental Botany of the Czech Academy of Sciences, Olomouc, Czech Republic
ALEŠ KOVAŘÍK • Department of Molecular Epigenetics, Institute of Biophysics, Academy of Sciences of the Czech Republic, Brno, Czechia
ALICE KRUMPOLCOVÁ • Department of Molecular Epigenetics, Institute of Biophysics, Academy of Sciences of the Czech Republic, Brno, Czechia; Department of Experimental Biology, Faculty of Science, Masaryk University, Brno, Czech Republic
ILIA J. LEITCH • Royal Botanic Gardens, Kew, Richmond, UK
SUSAN LIEDTKE • Faculty of Biology, Institute of Botany, Dresden, Germany
GUANQING LIU • Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Agricultural College of Yangzhou University, Yangzhou, China; Key Laboratory of Plant Functional Genomics of the Ministry of Education/Joint International Research Laboratory of Agriculture and Agri-Product Safety, The Ministry of Education of China, Yangzhou University, Yangzhou, China
QING LIU • Key Laboratory of Plant Resources Conservation and Sustainable Utilization / Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China; Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, China; South China National Botanical Garden, Guangzhou, China
JORDI RULL LLUCH • Laboratori de Botànica, Facultat de Farmàcia i Ciències de l’Alimentació; Institut de Recerca de la Biodiversitat (IRBio) & Centre de Documentació de Biodiversitat Vegetal (CeDocBiV), Universitat de Barcelona, Barcelona, Spain
JOÃO LOUREIRO • Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal
MAGDALENA LUČANOVÁ • Czech Academy of Sciences, Institute of Botany, Průhonice, Czech Republic; Department of Botany, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
ADAM J. LUKASZEWSKI • University of California, Riverside, CA, USA
JANA LUNEROVÁ • Department of Epigenetics, Institute of Biophysics of the Czech Academy of Sciences, Královopolská, Brno, Czech Republic
MARTIN LYČKA • Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
MARTIN A. LYSAK • CEITEC – Central European Institute of Technology, Masaryk University, Brno, Czech Republic
ENRIQUE MAGUILLA • Department of Molecular Biology and Biochemical Engineering, Universidad Pablo de Olavide, Seville, Spain
SHOAEIB MAHMOUDI • Office of Research Affairs, University of Kurdistan, Sanandaj, Iran
SOPHIE MAIWALD • Faculty of Biology, Technische Universität Dresden, Dresden, Germany
TEREZIE M. MANDÁKOVÁ • CEITEC – Central European Institute of Technology, Masaryk University, Brno, Czech Republic
LUDWIG MANN • Faculty of Biology, Technische Universität Dresden, Dresden, Germany
ALEXIS J. MARAVILLA • Jardín Botánico, ICBiBE, Universitat de València, València, Spain
JOSÉ IGNACIO MÁRQUEZ-CORRO • Department of Molecular Biology and Biochemical Engineering, Universidad Pablo de Olavide, Seville, Spain
SANTIAGO MARTÍN-BRAVO • Department of Molecular Biology and Biochemical Engineering, Universidad Pablo de Olavide, Seville, Spain
RAFAEL P. MARTÍN-MARTÍN • Laboratori de Botànica, Facultat de Farmàcia i Ciències de l’Alimentació; Institut de Recerca de la Biodiversitat (IRBio) & Centre de Documentació de Biodiversitat Vegetal (CeDocBiV), Universitat de Barcelona, Barcelona, Spain
ITAY MAYROSE • School of Plant Sciences and Food Security, The George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
GHADER MIRZAGHADERI • Department of Agronomy and Plant Breeding, University of Kurdistan, Sanandaj, Iran
ISTVÁN MOLNÁR • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic; Agricultural Institute, Centre for Agricultural Research, ELKH, Martonvásár, Hungary
KIYOTAKA NAGAKI • Institute of Plant Science and Resources, Okayama University, Chuo, Kurashiki, Japan
NOBUKO OHMIDO • Graduate School of Human Development and Environment, Kobe University, Kobe, Japan
CLAUDIO PALMA-ROJAS • Departamento de Biología, Universidad de La Serena, La Serena, Chile
JOAN PERE PASCUAL-DÍAZ • Institut Botànic de Barcelona (IBB, CSIC – Ajuntament de Barcelona), Barcelona, Spain; Institut Botànic de Barcelona (IBB-CSIC), Barcelona, Catalonia, Spain
KATEŘINA PERNIČKOVÁ • Centre of Plant Structural and Functional Genomics, Institute of Experimental Botany of the Czech Academy of Sciences, Olomouc, Czech Republic
VRATISLAV PESKA • Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
LISA POKORNY • Royal Botanic Gardens, Kew, Richmond, UK; Institut Botànic de Barcelona (IBB, CSIC-Ajuntament de Barcelona), Barcelona, Catalonia, Spain; Real Jardín Botánico (RJB-CSIC), Madrid, Spain
AQWIN POLOSORO • Graduate School of Human Development and Environment, Kobe University, Kobe, Japan; Research Center for Genetic Engineering, National Research and Innovation Agency, Bogor, Indonesia
BHANU PRAKASH POTLAPALLI • Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Seeland, Germany
ROI RODRÍGUEZ-GONZÁLEZ • Institut Botànic de Barcelona (IBB-CSIC), Barcelona, Catalonia, Spain
MARCELA ROSATO • Dpt. Genética, Fisiología y Microbiología (UD Genética), Facultad de Ciencias Biológicas, Universidad Complutense de Madrid, Madrid, Spain
JOSEP A. ROSSELLÓ • Jardín Botánico, ICBiBE, Universitat de València, València, Spain
MAHMOUD SAID • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic; Field Crops Research Institute, Agricultural Research Centre, Giza, Cairo, Egypt
NOEMÍ SALVADOR-SOLER • Instituto de Ciencias Biomédicas, Universidad Autónoma de Chile, Temuco, Chile
NICOLA SCHMIDT • Faculty of Biology, Institute of Botany, Dresden, Germany
TRUDE SCHWARZACHER • Department of Genetics and Genome Biology, University of Leicester, Leicester, UK; Key Laboratory of Plant Resources Conservation and Sustainable Utilization / Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China; Center of Conservation Biology, Core Botanical Gardens, Chinese Academy of Sciences, Guangzhou, China; South China National Botanical Garden, Guangzhou, China
MAGDALENA SENDEROWICZ • Plant Cytogenetics and Molecular Biology Group, Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia in Katowice, Katowice, Poland
ANAT SHAFIR • School of Plant Sciences and Food Security, The George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, Israel
HANA ŠIMKOVÁ • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic
ELWIRA SLIWINSKA • Laboratory of Molecular Biology and Cytometry, Department of Agricultural Biotechnology, Bydgoszcz University of Science and Technology, Bydgoszcz, Poland
SARAVANAKUMAR SOMASUNDARAM • Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Seeland, Germany
EVA SÝKOROVÁ • Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno, Czech Republic
LU TAN • Panxi Crops Research and Utilization Key Laboratory of Sichuan Province, Xichang University, Xichang, China
CARRIE TRIBBLE • School of Life Sciences, University of Hawai‘i at Mānoa, Honolulu, HI, USA
ZUZANA TULPOVÁ • Institute of Experimental Botany of the Czech Academy of Sciences, Centre of Plant Structural and Functional Genomics, Olomouc, Czech Republic
JUAN VIRUEL • Royal Botanic Gardens, Kew, Richmond, UK
RADKA VOZÁROVÁ • Department of Epigenetics, Institute of Biophysics of the Czech Academy of Sciences, Královopolská, Brno, Czech Republic
BEATRICE WEBER • Faculty of Biology, Institute of Botany, Dresden, Germany
HANNA WEISS-SCHNEEWEISS • Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria
PAUL WILKIN • Royal Botanic Gardens, Kew, Richmond, UK
GÜLRU YÜCEL • Plant Cytogenetics and Molecular Biology Group, Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia in Katowice, Katowice, Poland; Department of Agricultural Biotechnology, Faculty of Agriculture, Ondokuz Mayıs University, Samsun, Türkiye
ROSANA ZENIL-FERGUSON • Department of Biology, University of Kentucky, Lexington, KY, USA
TAO ZHANG • Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Agricultural College of Yangzhou University, Yangzhou, China; Key Laboratory of Plant Functional Genomics of the Ministry of Education/Joint International Research Laboratory of Agriculture and Agri-Product Safety, The Ministry of Education of China, Yangzhou University, Yangzhou, China; Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, Yangzhou University, Yangzhou, China
Part I Introduction
Chapter 1
Plant Cytogenetics: From Chromosomes to Cytogenomics
Trude Schwarzacher, Qing Liu, and J. S. (Pat) Heslop-Harrison
Abstract
Chromosomes have been studied since the late nineteenth century in the disciplines of cytology and cytogenetics. Analysis of their numbers, features, and dynamics has been tightly linked to the technical development of preparation methods, microscopes, and the chemicals used to stain them, with the latest developments described in this volume. At the end of the twentieth and the beginning of the twenty-first centuries, DNA technology, genome sequencing, and bioinformatics revolutionized how we see, use, and analyze chromosomes. The advent of in situ hybridization has shaped our understanding of genome organization and behavior by linking molecular sequence information with physical location along chromosomes and genomes. Microscopy is the best technique to determine chromosome number accurately. Many features of chromosomes in interphase nuclei, or of pairing and disjunction at meiosis, which involve the physical movement of chromosomes, can only be studied by microscopy. In situ hybridization is the method of choice to characterize the abundance and chromosomal distribution of the repetitive sequences that make up the majority of most plant genomes. These most variable components of a genome are found to be species-specific and occasionally chromosome-specific, and give information about evolution and phylogeny. Multicolor fluorescence hybridization and large pools of BAC or synthetic probes can paint chromosomes, and we can follow them through evolution involving hybridization, polyploidization, and rearrangements, which is important at a time when structural variations in the genome are being increasingly recognized. This volume discusses many of the most recent developments in the field of plant cytogenetics and gives carefully compiled protocols and useful resources.
Key words Plant genomes, Cytogenomics, Karyotype, Chromosome banding, Fluorescent in situ hybridization, Immunocytochemistry, Flow cytometry, Structural variation, Genome evolution, Chromosome painting
1 What Cytogenetics Can Answer
Genome size and chromosome number are fundamental properties of the genome of each species and are among the prime targets of cytogenetic research. Among plants, the chromosome number varies from 2n = 4 to 2n = c. 1044. Genome sizes (the total number of base pairs in the nucleus) vary even more widely, from about 150 Mbp to 150,000 Mbp. Both measures have significant implications for chromosome and genome organization and behavior
Tony Heitkam and Sònia Garcia (eds.), Plant Cytogenetics and Cytogenomics: Methods and Protocols, Methods in Molecular Biology, vol. 2672, https://doi.org/10.1007/978-1-0716-3226-0_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023
Trude Schwarzacher et al.
[1]. Neither growth habitat nor morphological study of a plant species gives much indication of genome size and chromosome number, with large and small values being spread among annuals and perennials, small herbs and large trees, from deserts to swamps, and tropics to polar climates. Cytogenetic approaches are uniquely able to allow analysis of chromosome number, mostly using microscopy, while genome size is most efficiently and accurately measured by flow cytometry or sometimes fluorimetry, both topics covered in this volume (Chapters 2 and 3). These results complement those from RNA and DNA sequencing, and genome assembly approaches (Chapters 4 and 5). Genome size and chromosome number measurements are essential before molecular analysis of DNA and genome sequencing, and background cytogenetic information can be considered a first critical stage of -omics studies, with additional results allowing efficient extension to, and interpretation of, additional genotypes or higher taxa, and identification of events occurring during hybridization and evolution in new or ancient lines.
Polyploidy or whole genome duplication, with multiplication of the number of genomes (sub-genomes), is involved in the ancestry of all plants [2, 3]. It has occurred multiple times in the evolution of most phyla and leads to immediate multiplication of genome size and chromosome number. About half of species are recognized as evolutionarily recent polyploids (including within ferns), although gymnosperms are an exception, with few polyploid species. Species or groups of species may frequently include tetraploids, hexaploids, and higher ploidies. Odd ploidies are also known, particularly in apomictic and vegetatively propagated taxa (e.g., triploid dessert banana, Musa acuminata [4]; triploid saffron, Crocus sativus [5]) and also in some sexual or mixed systems (pentaploid dog rose, Rosa canina [6]; di- to octoploid Urochloa [7]).
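The ploidy notation used for such taxa can be unpacked with simple arithmetic: a somatic count written as 2n = 3x = 24 implies a basic chromosome number of x = 8. The following minimal Python sketch is illustrative only. The function name `basic_number` is hypothetical, and the counts used are those given in this chapter for saffron (Fig. 4c) and Aegilops (Fig. 3a), together with diploid rye.

```python
def basic_number(somatic_count: int, ploidy: int) -> int:
    """Return the basic chromosome number x, where 2n = ploidy * x."""
    if ploidy <= 0:
        raise ValueError("ploidy must be a positive integer")
    x, remainder = divmod(somatic_count, ploidy)
    if remainder:
        # A count that is not a whole multiple of the ploidy level cannot be
        # explained by euploidy alone (for example, an aneuploid plant).
        raise ValueError("somatic count is not a whole multiple of the ploidy level")
    return x


# (somatic chromosome number 2n, ploidy level) as quoted in this chapter
examples = {
    "Secale cereale (diploid)": (14, 2),
    "Crocus sativus (triploid)": (24, 3),
    "Aegilops (tetraploid, Fig. 3a)": (28, 4),
}

for name, (count, ploidy) in examples.items():
    print(f"{name}: 2n = {ploidy}x = {count}, x = {basic_number(count, ploidy)}")
```

A count that fails the divisibility check is not necessarily an error in the data; it may point to aneuploidy or to a wrongly assumed ploidy level, which is where meiotic pairing analysis and genome size measurement become informative.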
Counts of chromosome numbers and measurements of genome size, combined with analysis of meiotic pairing in species and backcrosses, are powerful approaches to examine the occurrence of evolutionarily recent polyploidy and the ancestry of taxa. Ancient whole genome duplications, on the other hand, are usually analyzed by whole genome sequencing and assembly.
Beyond chromosome number and morphology, and genome size measurement, cytogenetic techniques identify many aspects of genome organization at the chromosomal level, both within species (structural variations) and between species (evolutionary changes). Identification and localization of DNA sequences using in situ hybridization has proven a powerful approach to define the nature and sequence content of the chromosomes, as well as the behavior of chromosomes within the nucleus and throughout mitotic and meiotic division. These topics are covered throughout this book and illustrated in Figs. 1, 2, 3, and 4.
Introduction to Plant Cytogenetics
Fig. 1 Chromosomes during the cell cycle. (a–e) Barley (Hordeum sp., 2n = 14) chromosomes from untreated root tips stained with DAPI. At interphase (a), chromatin is decondensed and distributed throughout the nucleus. The lighter opaque area in the center denotes the nucleolus. At prophase (b), chromatin condenses, and chromosomes and their ends become visible. At metaphase (c), the chromosomes arrange with their centromeres on the metaphase plate; chromatids separate at anaphase and move toward the poles (d). At telophase, two daughter nuclei are formed (e). (f) Root tips of rye (Secale cereale, 2n = 14) treated with ice water to destroy microtubules and arrest division at metaphase. Metaphase chromosomes (top) are condensed and spread easily, and their morphology can be seen with gaps at the centromeres. Euchromatin is slightly DAPI stained, while the brightly stained subterminal heterochromatin forms chromocenters at interphase (bottom). (g) Overview of DAPI-stained root tip cells from Musella lasiocarpa (2n = 18). Three metaphase cells (M) and several interphases with brighter stained chromocenters of the centromeric heterochromatin. Bar = 10 μm
Fig. 2 Conventional chromosome staining. (a) Metaphase chromosomes of the European orchid, Cephalanthera damasonium (2n = 36), after Feulgen staining that shows clear morphology of chromosomes with gaps at the primary constriction denoting the centromeres, and secondary constrictions at the nucleolus organizing regions (NORs, arrows) at the end of the long arms of two large chromosome pairs. (b, c) C. longifolia (2n = 32), after differential Giemsa C-banding to show the darker stained intercalary heterochromatin bands and the centromeres. Drawings of chromosomes from two metaphases arranged as karyotype indicating the bimodal organization with three large chromosome pairs, twelve small pairs, and one medium pair with heteromorphic C-bands (see Schwarzacher and Schweizer [13]). Bar = 6 μm for (a, b), 10 μm for (c)
2 Studying Chromosomes: From Simple Staining to Multicolor Fluorescence Cytogenetics
The term “chromosome” was coined by Waldeyer in 1888 [8] and is derived from the Greek words “chroma” (color) and “soma” (body). Chromosomes have no color themselves but readily take up dyes, so they can be stained for analysis, which, because of their size (typically 1–20 μm long), requires a microscope. Hence, the analysis of chromosomes has been tightly linked to the development of chemical stains and of microscope optics throughout the last two centuries, in particular the compound light microscope. In its simplest form, this is a system of two converging lenses (now invariably each a group of lenses) used to look at objects at short distances: the objective enlarges and inverts the object into a ‘real’ image, and the second lens, called the eyepiece or ocular, is used to view that image. A maximum 100× objective and a typical 10× or 20× ocular allow chromosomes to be magnified 1–2 thousand times; the image can be further enlarged through film or image analysis (see Chapter 15), but resolution is
Fig. 3 Fluorescent in situ hybridization to mitotic plant chromosomes. (a) FISH karyotype of Aegilops ventricosum (2n = 4x = 28) using repetitive DNA probes: dpTa1 (a 340 bp tandem repeat from Triticum aestivum) labeled with digoxigenin-dUTP, pSc119.2 (a 120 bp tandem repeat from Secale cereale) labeled with biotin-dUTP, and pTa71 (the 45S rDNA from T. aestivum) labeled after linearization with biotin-dUTP (for probe description see [50]). Digoxigenin was detected by FITC-anti-digoxigenin (green fluorescence) and biotin by Texas Red-streptavidin (red fluorescence). Chromosomes were counterstained with DAPI (blue fluorescence). Chromosomes from the N genome (top row) show the strongly labeled 45S rDNA site on chromosome 5 and several smaller signals from pSc119.2 and dpTa1 at intercalary and terminal positions. The D genome chromosomes (bottom row) have many strong and characteristic bands generated by the green fluorescing dpTa1 probe. (b) Chromosomes of Ensete glaucum (2n = 18) probed with fluorochrome-associated synthetic oligonucleotides, Egcen-FAM (green) hybridizing to all centromeres and 5S rDNA-Cy3 (red) on the long arm of chromosome (see [44]). (c) Chromosomes of Elaeis guineensis (2n = 32) with FISH using massive oligonucleotide pools, ATTO488 (green), ATTO550 (red), and ATTO647 (far red, displayed in yellow), that identify all individual chromosomes with distinctive combinations of signals (see [47]). Bar = 5 μm
Fig. 4 FISH and immunostaining to meiotic chromosomes. (a) Pachytene chromosomes, conventionally fixed in alcohol:acetic acid, of a wheat-Thinopyrum elongatum alien translocation line after FISH using total genomic DNA from Thinopyrum (labeled with digoxigenin-dUTP and detected with FITC, green) and pTa71 (see legend to Fig. 3a) for the 45S rDNA (labeled with biotin-dUTP and detected with Alexa 594, red; see Ali et al. [50]). (b) Male meiotic metaphase I of rye (Secale cereale, 2n = 14) after FISH with two telomere-associated probes (see [49]) labeled with green and red (overlaps show in yellow). Seven bivalents with terminal chiasmata are visible. (c) Pachytene synaptonemal complex spread of saffron, Crocus sativus (2n = 3x = 24), fixed with paraformaldehyde and stained with the lateral element-specific antibody ASY-1 (red; see [34]). Unpaired axes show strong signal with persisting ASY-1 caused by the three homologs failing to pair, while a few paired regions are dark, having lost the antigen. Bar = 10 μm
conventionally limited by the wavelength of light to 0.1 μm. Electron microscopy and super-resolution optical microscopy further enhance the resolution and magnification available.
Eukaryotic cells divide into daughter cells through mitotic cycles in several stages: interphase with G1, S, and G2 phase, when DNA is transcribed and replicated to form two identical chromatids, and the division itself with prophase, metaphase, anaphase, and telophase, during which chromosomes condense, chromatids separate, and chromosomes decondense again (Fig. 1a–e). At metaphase, chromosomes become visible and their morphology is relatively conserved between species despite the size differences [1] (Fig. 1f, g). The position of the centromere (primary constriction, and the place where microtubules attach via proteins during division) and the presence and location of the nucleolar organizer regions (NORs, secondary constrictions) are characteristics of individual chromosome types, and their complement, the karyotype, is significant and constant for each species. Protocols to prepare chromosomes are covered in Chapters 6, 7, 8, 9, and 10, and aspects essential for achieving good results are discussed below.
2.1 Chromosome Banding
Chromosomes are divided into euchromatin, which is made up of genes, regulatory and low-copy sequences, and interspersed repeats and stains lightly with DAPI or other DNA stains, and heterochromatin, which is often tightly coiled, generally inactive, and concentrated in certain regions of the chromosomes, mostly containing highly repetitive satellite or tandemly repeated sequence motifs. Unlike euchromatin, heterochromatin remains largely condensed during interphase, producing chromocenters (Fig. 1f, g).
Well before the controversy over whether DNA, proteins, or even other molecules were important for heredity was resolved (in the late 1950s), chromosome numbers and morphology were studied, and plants led the way through the work of Darlington [9]. Early studies used carmine and eosin-based stains or Feulgen staining [10], which color chromosomes uniformly along their length but show chromosome morphology well, with clear gaps at the primary and secondary constrictions (Fig. 2a). Phase-contrast microscopy of unstained chromosomes also shows these features, including chromocenters in interphase nuclei, and is almost always used to check slide quality and number of metaphases before in situ hybridization.
More discriminating chromosome analysis became possible through C-banding, the differential Giemsa staining of heterochromatin, in human [11] and in plants [12, 13] (see Chapter 11 and Fig. 2b, c), as well as through fluorescent chromosome banding [14, 15] (see Chapter 12). Fluorescent dyes bind to the DNA in the major or minor groove and do so either uniformly or in a base pair specific manner (AT rich versus GC rich), and thus produce negative, neutral, and positive bands, characterizing certain fractions of chromosomes, including different types of
heterochromatin or the R- and G-bands of mammalian chromosomes. The quantitative nature of the fluorochrome binding to the DNA is also exploited in DNA size measurements and flow cytometry to sort chromosomes (Chapters 2, 3 and 10). Rather than looking at the DNA composition, proteins attached to the DNA can be visualized by silver nitrate or using specific antibodies (see Chapters 13 and 14).
2.2 In Situ Hybridization
While staining methods show chromosomes and chromosome regions as a whole, individual and specific DNA sequences along chromosomes can be visualized by DNA:DNA in situ hybridization [16]. Briefly (in the most common form), a single-stranded labeled DNA probe is allowed to hybridize to denatured, single-stranded DNA within the chromosomes and nuclei on a microscope slide preparation in situ (in contrast to isolated DNA bound to a membrane, as in Southern hybridization). The location and abundance of the labeled DNA sequences are subsequently inferred by detecting the label with color precipitates or fluorescence (Fig. 3); thus, the chromosome structure can be correlated with the molecular and genetic DNA information. In situ hybridization has changed how we use cytogenetics in both fundamental plant and animal science, and in applications in medicine and agriculture. It has revolutionized our understanding of the composition, organization, and evolution of DNA sequences and genomes and is now applied in many different variations of the original method (see Chapters 11, 12, 13, and 14).
In situ hybridization was first described by Gall and Pardue [17] and John et al. [18] in 1969. Both groups used radioactively labeled ribosomal RNA and showed that the sequences were localized to the nucleoli of Xenopus laevis oocytes and kidney cells, as well as human HeLa cells. Subsequently, Pardue and Gall [19] used radioactive probes for satellite DNA (different density fractions separated by high-speed centrifugation), which were concentrated in the heterochromatic centromeric regions of all acrocentric mouse chromosomes. The use of radioactive nucleotides to label DNA was inconvenient (quite apart from safety concerns), as preparations had to be covered by film or dipped into photographic emulsion and exposed over days to weeks to accumulate signal.
The spread of the radioactive signal and nonspecific background resulted in low spatial resolution, although sensitivity was good and made the detection of single-copy probes possible [20], such that radioactive in situ hybridization was used successfully to map genes on human chromosomes [21] and repeated DNA sequences in plants [22].
The subsequent development of nonradioactive labels is responsible for the revolution of in situ hybridization and its application to a wide range of scientific questions, from mapping DNA sequences to chromosomes to following their three-dimensional
organization within the nucleus and during division. The first nonradioactive in situ hybridization used the naturally occurring molecule biotin (vitamin H) to label the probe and its affinity to avidin (conjugated to an enzyme) to detect hybridization sites by colored, light-stable and temperature-stable precipitates developed from soluble and usually colorless enzyme substrates [23, 24]. These were sensitive methods, but color differentiation on chromosomes and nuclei was difficult (most colors looked grey-brown or grey-blue under the microscope) and spatial resolution was low (>1 μm). Colorimetric detection is now mainly used for RNA in situ hybridization [16], where lower resolution for detection at the cellular level is appropriate and higher sensitivity is achieved by accumulation of precipitate over several days.
Fluorescent in situ hybridization (FISH) is now the method of choice for chromosome work and cytogenetic analysis. Initially, fluorescein isothiocyanate (FITC, fluorescing yellow-green with blue light excitation) attached to avidin was used to detect the biotin-labeled probes, and chromosomes were counterstained with propidium iodide (fluorescing orange with UV excitation), offering much better resolution and clear differentiation of hybridization sites and chromosomes [25, 26]. Since then, many different fluorophores have been developed for detection of probe hybridization sites that cover the entire spectrum (see Subheading 4.1) and are available linked to avidin (or its derivatives, such as streptavidin) to detect biotinylated probes. With the development by Roche [27] of digoxigenin (a steroid from Digitalis purpurea, foxglove) as a labeling system, and of anti-digoxigenin antibodies first linked to FITC, tetra-methyl-rhodamine-isothiocyanate (TRITC, fluorescing orange-red), or amino-methyl-coumarin-acetic-acid (AMCA, fluorescing blue), multiprobe experiments can be performed (see Chapters 15, 16, 17, 18, 19, 20, 21, 22, 23, and 24).
Biotin- and digoxigenin-labeled probes can be easily combined and are a good starting point to learn FISH (see Figs. 3a and 4a, b). The nucleic acid probes can also be ‘directly’ labeled with fluorochromes attached to nucleotides, avoiding the antibody step (see Fig. 3b, c).
The denaturation of the double-stranded DNA, wrapped around nucleosomes in chromosomes, to give single-stranded DNA for hybridization with the probe is an important aspect of most FISH protocols, although non-denaturing FISH [28] and alternative approaches such as PNA (peptide-nucleic-acid) complexes [29] are also occasionally used. Denaturation methods ideally make the DNA in all sequence and protein contexts single-stranded, but do not destroy the chromosome morphology or lose material. Several factors influence the stability of the DNA within chromosomes, including temperature, pH, monovalent cations (in most cases sodium ions), formamide, as well as base composition and fragment length of the DNA [16, 30]. In the past,
either acid or alkali denaturation of chromosomes has been exploited. Now, temperature along with formamide and low sodium ion concentrations, around neutral pH, are used most frequently. Formamide and low ion concentration reduce the stability of DNA and allow denaturation of the probe and chromosomal DNA at 70–90 °C and reannealing at 37 °C overnight. Early in situ hybridization experiments denatured chromosome preparations by immersing the slides in 50–100 mL of 70% formamide at 70–72 °C and dehydrating them in ice-cold ethanol before applying the denatured probe. Alternatively, slides are denatured together with a few tens of microlitres of hybridization probe using a thermal block or modified thermocycler that allows accurate and reproducible conditions [31]. Post-hybridization washing steps to remove unhybridized probe usually involve control of cation concentration and temperature, sometimes with addition of formamide, and need to be closely controlled to avoid loss of chromosomes from the slide or distortion of their morphology, as well as to ensure the retention of the desired hybridized probe. The toxicity of formamide and its fumes means it is often avoided, and its autofluorescence also means additional washes are required, so protocols have been developed to exclude it (Chapter 16).
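The combined effect of sodium concentration, formamide, base composition, and fragment length on duplex stability is often summarized by an empirical melting temperature formula for long DNA:DNA hybrids. The Python sketch below uses one commonly quoted form of that approximation; it is not taken from this chapter. The coefficients vary between sources (the formamide term in particular), the formula was derived for DNA in solution rather than in fixed chromatin, and the function names, the assumed sodium concentration for 2× SSC, and the example probe are all illustrative assumptions. The stringency estimate relies on the rule of thumb that each 1% of mismatch lowers the melting temperature by about 1 °C.

```python
import math


def melting_temperature(na_molar: float, gc_percent: float,
                        formamide_percent: float, length_bp: float) -> float:
    """Approximate Tm (deg C) of a long DNA:DNA duplex.

    Empirical approximation (an assumption, not this chapter's formula):
    Tm = 81.5 + 16.6*log10[Na+] + 0.41*(%GC) - 0.61*(%formamide) - 500/length
    """
    if na_molar <= 0 or length_bp <= 0:
        raise ValueError("sodium concentration and length must be positive")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent
            - 0.61 * formamide_percent
            - 500.0 / length_bp)


def approx_stringency(tm: float, temperature: float) -> float:
    """Rule-of-thumb stringency (%): about 1% mismatch tolerated per deg C below Tm."""
    return 100.0 - (tm - temperature)


# Hypothetical example: 300 bp probe fragments with 45% GC, hybridized at
# 37 deg C in 50% formamide and 2x SSC (taken here as roughly 0.39 M Na+).
tm = melting_temperature(na_molar=0.39, gc_percent=45.0,
                         formamide_percent=50.0, length_bp=300.0)
print(f"Estimated Tm: {tm:.1f} deg C")
print(f"Estimated stringency at 37 deg C: {approx_stringency(tm, 37.0):.0f}%")
```

Comparing the result with formamide set to zero illustrates why formamide allows hybridization and washing at temperatures that preserve chromosome morphology: the same stringency without it would require a considerably higher temperature.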
3 Preparing Chromosomes: Metaphase, Interphase, Meiosis, and Fibers
A critical step for successful cytogenetic experiments, from simple staining to advanced FISH, is the fixation of dividing tissues and preparation of chromosomes either on slides or in suspension. As most chromosome analyses rely on metaphases, dividing tissue is essential and should be collected from healthy, disease-free, and rapidly growing plants. Root tips from young seedlings, freshly appearing roots at the edge of pots of older plants, or hydroponic cultures are all particularly suitable. Great care needs to be taken when germinating seeds and watering young seedlings. The quality of water is important and bottled drinking water is usually most suitable, as ‘tap’ water often contains chlorine or heavy metal ions from pipes, and purified (including distilled) water may be contaminated with residues of purification media or have an extreme pH. Meristematic cells from young shoots, leaves or emerging buds, or liquid tissue culture cells can also be used, but finding dividing cells within callus tissue is difficult.
For most applications, high metaphase counts are important, and material needs to be pretreated not only to accumulate metaphases and condense the chromosomes, but also to destroy the spindle microtubules, allowing better spreading and separation of the chromosomes (compare Fig. 1c, f). Colchicine is an effective spindle microtubule inhibitor and results in heavily condensed
chromosomes, and is therefore often used for chromosome counting. Other pretreatments, such as 8-hydroxyquinoline (best for dicotyledonous and tropical species), alpha-bromonaphthalene, or ice water (particularly for temperate grasses), give more extended chromosomes where probe hybridization sites can be more easily analyzed along chromosomes [16, 32]. Alternative reagents, including spindle poisons used as herbicides [33] or nitrous oxide (Chapter 6), are also effective. For meiotic studies (Fig. 4 and Chapter 21) or 3D analysis of interphase nuclei (see Chapters 23 and 24), spindle arresting agents are not used, so that they do not interfere with spatial organization, pairing, and disjunction or destroy associated proteins.
Most protocols use fresh ethanol:acetic acid fixation, which preserves chromosomes well while removing some of the chromatin proteins that would otherwise hinder access to the DNA packaged within the chromosome (see Chapters 6, 7, and 8). For immunostaining (Fig. 4c), paraformaldehyde is used as fixative to preserve proteins [34] (Chapter 21). After fixation, material is either hydrolyzed with acid [10] or digested with proteolytic enzymes [15] to soften the tissue and remove cell walls. Dividing tissue needs careful dissection before the cells, nuclei, and chromosomes are squashed under a cover slip (Chapter 7), or is made into a suspension to be dropped on the slides or analyzed directly (Chapters 8 and 9).
To obtain high-resolution spatial arrangement of probes, pachytene chromosomes, which on average are 20–30 times longer than mitotic metaphase chromosomes, have been used, particularly in tomato [35, 36] and Arabidopsis or Brassica [37, 38]. With a resolution of 50–100 kb, detailed organization of repetitive sequence classes at centromeres, or ordering of BACs to support physical mapping and genome assemblies, was possible.
Even higher resolution of sequence organization over tens of kb can be achieved using DNA fibers from isolated interphase nuclei, extended to their full molecular length and examined in the light microscope after in situ hybridization [39] (see Chapter 22), allowing resolution of a few kb and defining a linear relationship between length in μm and kb of DNA. However, fibers have lost the chromosomal context and the structural landmarks of chromosomes, potentially making interpretation difficult. Long, single-molecule DNA sequencing, such as with Oxford Nanopore, spans several tens to hundreds of kb of genomic DNA and is therefore an alternative approach to analyzing the organization of sequences on chromosomal fibers or in BAC clones.
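The linear relationship between fiber length and DNA content can be made concrete with a short calculation. The following is a minimal sketch that assumes fully extended B-form DNA with a rise of 0.34 nm per base pair (roughly 2.94 kb per μm); this constant and the function names are illustrative assumptions rather than values from the protocols in this volume, and real fiber preparations would normally be calibrated against a probe of known length.

```python
# Hedged sketch: convert fiber-FISH signal lengths (micrometres) to kilobases.
# Assumption: fully extended B-form DNA, 0.34 nm rise per base pair.
NM_PER_BP = 0.34  # textbook value for B-DNA; calibrate for real preparations


def kb_per_um(nm_per_bp: float = NM_PER_BP) -> float:
    """Kilobases of DNA per micrometre of fully stretched fiber."""
    bp_per_um = 1000.0 / nm_per_bp  # 1 um = 1000 nm
    return bp_per_um / 1000.0       # bp -> kb


def fiber_length_to_kb(length_um: float, nm_per_bp: float = NM_PER_BP) -> float:
    """Estimate the DNA length (kb) spanned by a signal of length_um."""
    if length_um < 0:
        raise ValueError("length_um must be non-negative")
    return length_um * kb_per_um(nm_per_bp)


if __name__ == "__main__":
    print(f"{kb_per_um():.2f} kb per um of stretched fiber")
    print(f"A 10 um signal spans about {fiber_length_to_kb(10.0):.1f} kb")
```

Under this assumption, a signal a few μm long corresponds to roughly 10 kb, which is consistent with the resolution of a few kb stated above.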
4 Dyes, Clones, and Synthetic Probes
4.1 Fluorescent Dyes
Fluorescent dyes or fluorophores need to be excited by light of a defined wavelength to become visible, and they emit light of a longer wavelength and lower photon energy. Initially, high-energy mercury short-arc lamps or selected lasers were used that produce light at specific wavelengths only, and fluorochromes were developed to fit those. With the advent of tuneable lasers and LED illumination, light of any wavelength can be produced, and fluorochromes are now available with fluorescence from blue to far red, e.g., the cyanine (Cy) dyes, Alexa fluorochromes (Molecular Probes, Invitrogen), or the ATTO series (ATTO-TEC, Sigma); for technical specifications of dyes, see the many websites of companies selling fluorophores or microscopes and filters (e.g., https://www.chroma.com/spectra-viewer). Most fluorophores, however, have a finite life and fade under the high-energy light required for excitation, even with antifade solutions. Chemical improvements have achieved higher stability and sensitivity of fluorochromes under variable experimental conditions (pH or solubility in water), and high-quality lenses and filters for microscopes, as well as advanced image capture systems, have made fluorescence techniques much more user-friendly in recent years. While fluorescent dyes are normally used, semiconducting quantum dots, which rely on UV excitation and emit photons of a color that depends on the dot, have also been tested but have not yet proved as convenient.
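The statement that emitted light has a longer wavelength and lower photon energy follows from E = hc/λ. The sketch below illustrates this; the function names are hypothetical, and the example wavelengths (about 358 nm excitation and 461 nm emission, approximate textbook values for DNA-bound DAPI) are included only for illustration and should be checked against a supplier's spectra viewer for any real dye.

```python
# Hedged sketch: photon energy (E = h*c/lambda) and Stokes shift of a fluorophore.
PLANCK_J_S = 6.62607015e-34      # Planck constant (J s)
LIGHT_SPEED_M_S = 299_792_458.0  # speed of light in vacuum (m/s)
JOULE_PER_EV = 1.602176634e-19   # joules per electronvolt


def photon_energy_ev(wavelength_nm: float) -> float:
    """Energy of one photon (eV) at the given wavelength (nm)."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength_nm must be positive")
    wavelength_m = wavelength_nm * 1e-9
    return PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m / JOULE_PER_EV


def stokes_shift_nm(excitation_nm: float, emission_nm: float) -> float:
    """Difference between emission and excitation maxima (nm)."""
    return emission_nm - excitation_nm


if __name__ == "__main__":
    # Approximate, illustrative maxima for a UV-excited, blue-emitting DNA stain.
    excitation_nm, emission_nm = 358.0, 461.0
    print(f"Excitation photon: {photon_energy_ev(excitation_nm):.2f} eV")
    print(f"Emission photon:   {photon_energy_ev(emission_nm):.2f} eV")
    print(f"Stokes shift:      {stokes_shift_nm(excitation_nm, emission_nm):.0f} nm")
```

The emission photon always carries less energy than the excitation photon, which is what allows filter sets to separate the two.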
4.2 Cloned or PCR-Amplified FISH Probes
Initially made possible by the emergence of routine molecular biology techniques in the 1980s, cloned DNA sequences became the norm for obtaining FISH probes such as the examples in Figs. 3a and 4b. The clones are amplified in E. coli and, after plasmid mini-prep isolation, the cloned inserts are excised with restriction enzymes. Most plasmid clones work best with inserts of a few hundred bp to 1–2 kb, although the often-used clone pTa71 [40], containing the 18S, 5.8S, and 25S rRNA genes and intergenic spacers from wheat, has an insert of 9.2 kb, and the entire linearized plasmid is used for labeling (see Fig. 3a). Small amounts of short plasmid clones can be amplified by PCR, yielding plenty of insert DNA, using the M13 sequencing primers incorporated in most plasmid vectors (note that the resulting fragment will be slightly longer than the insert as the M13 primers are located outside the universal cloning sites). Purified insert DNA is subsequently labeled by incorporating dUTPs linked to haptens or fluorochromes using DNA polymerases in reactions such as random priming or nick translation, which are available as kits from many molecular reagent providers. In some cases, labeled nucleotides are incorporated during PCR, or fluorochromes are attached to the ends of DNA sequences [16]. Other novel approaches include using CRISPR/Cas9 to label probes (see Chapter 20).
Plasmid clone DNA probes used for FISH experiments are most often repetitive DNA for reasons discussed above and can have several origins, including random or restriction enzyme fragments, or PCR products obtained with specific or degenerate primers designed to amplify a sequence of interest that is already known or was found by de novo bioinformatic analysis, using total genomic DNA as template. Using cloned sequences has the great advantage that inserts can be sequenced to verify exactly what a specific probe is. However, PCR products from total genomic template DNA can also be used as probes directly without cloning, as long as sharp bands are visible on test gels and contamination with underlying random sequences can be assumed to be minimal. In some cases, there might actually be an advantage in using PCR products of repeats directly, as multiple variants of a repeat are present rather than the single variant selected in a clone. We now routinely use specific primers [41, 42] to amplify the 18S rDNA from genomic DNA of our study species for labeling instead of the wheat clone pTa71. More recently, instead of labeling DNA fragments with enzymatic methods, oligonucleotides can be synthesized with a fluorochrome attached to either the 5′ (most commonly) or 3′ end, or to both ends. The technique works well for consensus sequences of tandem repeats and results in well-defined signals [43, 44] (Fig. 3b). Massive oligonucleotide pools, consisting typically of 20,000 different 50-mers, in total 1,000,000 bp of synthesized DNA, have been developed for the identification of specific chromosomal regions and chromosome barcoding [45]; they rely on bioinformatic tools that use genome assembly information to find unique regions (of several hundred kb) strategically placed to identify each chromosome arm by specific patterns (Fig. 3c) and involve several steps to identify and remove repetitive elements to avoid cross-hybridization [46, 47] (Chapters 25, 26, and 27).
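The logic of selecting unique oligonucleotides from an assembled region can be illustrated with a toy filter. The sketch below is an assumption-laden simplification, not the pipeline used in the cited studies: it counts each candidate window together with its reverse complement within a single input sequence and keeps non-overlapping windows that occur exactly once. Real designs screen against the whole genome and apply further criteria (for example melting temperature and similarity thresholds); all function names here are hypothetical.

```python
# Hedged sketch: pick non-overlapping oligos that are unique within a sequence.
from collections import Counter
from typing import List, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse complement; unknown characters become N."""
    return "".join(_COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def canonical(oligo: str) -> str:
    """Strand-independent key: the smaller of an oligo and its reverse complement."""
    return min(oligo, reverse_complement(oligo))


def select_unique_oligos(sequence: str, oligo_len: int = 50) -> List[Tuple[int, str]]:
    """Return (start, oligo) pairs that occur once and do not overlap."""
    seq = sequence.upper()
    n_windows = len(seq) - oligo_len + 1
    if n_windows < 1:
        return []
    counts = Counter(canonical(seq[i:i + oligo_len]) for i in range(n_windows))
    selected = []
    next_free = 0  # first position not covered by an already selected oligo
    for i in range(n_windows):
        if i < next_free:
            continue
        oligo = seq[i:i + oligo_len]
        if "N" not in oligo and counts[canonical(oligo)] == 1:
            selected.append((i, oligo))
            next_free = i + oligo_len
    return selected


def pool_size_bp(n_oligos: int, oligo_len: int) -> int:
    """Total synthesized bases in a pool of equally long oligos."""
    return n_oligos * oligo_len


if __name__ == "__main__":
    print(pool_size_bp(20_000, 50))                  # pool size quoted in the text
    print(select_unique_oligos("ACACACACAC", 4))     # tandem repeat: nothing kept
    print(select_unique_oligos("AAACCCGGTA", 4))     # unique toy sequence
```

A tandem repeat yields no candidates because every window recurs, which mirrors why repetitive elements have to be removed before synthesis to avoid cross-hybridization.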
4.3 Genome and Chromosome Painting
FISH of repetitive DNA is used widely for comparative cytogenetics to analyze the organization of sequences at centromeres or telomeres [48, 49] or to determine the genomic and chromosomal distribution of repeats including novel sequence classes identified through bioinformatic tools (Fig. 3, Chapters 17, 18, and 19). An alternative method uses total genomic DNA as probe in genomic in situ hybridization or GISH [26] (Chapter 16) and allows identification of the origin of genomes in hybrid and polyploid, wild and crop species. It is widely used to identify alien introgressions in breeding lines, particularly if combined with repetitive DNA probes for further characterization of chromosomes and involved rearrangements [50, 51] (see Fig. 4a). Specifically, it also allows the visualization of chromosome territories, nuclear architecture, and function (see Chapters 23 and 24).
If genomic DNA from flow-sorted or microdissected chromosomes is used, only certain chromosomes are labeled, and their evolutionary rearrangements can be traced or interphase domains identified; this technique works particularly well in mammalian cytogenetics [52] but has not been achieved routinely in plants, probably due to the rapid homogenization of repetitive DNA sequences dispersed over all chromosomes within the genomes of plant species. The use of large-insert clones, particularly BACs (bacterial artificial chromosomes, typically carrying 50 to 150 kb of plant chromosomal DNA), as probes has proved effective for identifying homoeologous chromosome domains across taxonomic groups [53–55] (Chapter 19). Massive pools of synthetic oligonucleotides were used to label arms of Musa chromosomes [56] (Chapter 27), and it is likely that synthesis technologies, combined with genome assemblies and the ability to screen out repetitive regions before synthesis, will be increasingly used. Repeats common across a species may prevent the use of BAC probes (which may include the repeat) to identify chromosomes, although they may still identify ancestral genomes [57]. Overall, the skills to develop and retain BAC resources for FISH are being lost, since current genome sequencing strategies to identify specific sequences can be transferred between labs more easily. Nevertheless, long-read sequencing may be able to generate full-length, complete sequences of the BAC resources at hand.
5 Chromosome Biology and Genomics
The contents of this volume show the wide range of modern approaches leading to informative results in cytogenetics and cytogenomics, essential but often overlooked parts of genomics and genome biology research. Cytogenetics and microscopy of dividing cells are the optimum way to determine chromosome number and morphology, including the relative sizes of chromosomes and the locations of centromeres. Advanced cytogenetic techniques, using fluorescent banding and in situ hybridization, further allow inference of many aspects of genome and species evolution, including repetitive DNA sequence abundance and distribution, polyploidy, and chromosome rearrangements such as fusion or fission, loss and gain of chromosomes, and other structural variations. In many cases, cytogenetics is the most effective, if not the only, method to answer these questions unambiguously, particularly if applied in conjunction with and in support of molecular analysis and genome assemblies. When repetitive sequences were identified and described by molecular techniques in the 1980s and 1990s, only limited knowledge could be gained about their abundance and distribution on chromosomes; it took FISH to show whether they are organized as
arrays at limited and often species-specific chromosomal regions or dispersed throughout the genomes [1, 58]. To this day, FISH continues to supplement molecular and bioinformatic identification of repeats, for example when using RepeatExplorer on raw unassembled reads [42] (Chapter 30) and other methods highlighted in many chapters of this book. The first whole genome projects benefitted from FISH to order BACs and contigs, to assemble scaffolds, to orientate them toward telomeres or centromeres, and to integrate physical and genetic maps, e.g., in tomato [59, 60] and sugar beet [61]. Whole genome assemblies have expanded massively in plants using next-generation short DNA fragment sequencing but were generally geared to assembling genes and low-copy sequences, and the better the repeats were masked, the better the resulting assembly became. This of course left repeats only poorly captured. If present, tandem repeat arrays tend to be collapsed: because repeat units vary only slightly, they are placed on top of each other, and thus only a small fraction of the total number of copies is recorded. In most cases, though, dispersed transposable elements, which make up 50% or more of small genomes and up to 95% of large genomes [1], can be readily identified, annotated, and categorized thanks to their common features, such as long or short terminal repeats and a number of well-defined protein domains. However, problems arise when scaffolds end within a dispersed repeat, where several alternative contigs carry the adjacent part of the repeat as possible continuations, and further assembly becomes impossible. Long-range single-molecule sequencing (currently led by Oxford Nanopore Technologies and PacBio Sequel, but also involving more chromosome-based sequencing; Chapters 28 and 29), resulting in read lengths of several tens of kb, can overcome some of these problems and, with appropriate algorithms, can generate de novo end-to-end chromosomal assemblies.
However, Nanopore reads, for example, tend to have higher error rates and need high sequencing depth and proofing with short-read sequencing, RNAseq for gene annotation, and methods such as Hi-C chromatin interaction mapping to check for assembly errors. Even so, we found that while the short tandem repeats at the centromeres of Ensete glaucum [44] (Fig. 3b) were assembled well, the 9 kb 45S rDNA units were underrepresented and collapsed due to the length of the units and the high similarity between them. For both sequence types, we found that FISH is essential to verify the assembly data and to allow unequivocal correlation of assembled pseudo-chromosomes with cytological chromosomes. Cytology and chromosome biology, in turn, have benefitted enormously from the revolution in sequencing and genome assembly projects, which identify new repetitive DNA elements and generate a host of new probes to paint or mark chromosomes and chromosome regions. These require advanced bioinformatic tools and
algorithms, and thus a fruitful collaboration of in silico dry lab and in situ wet lab work, as described in this book, results in our current understanding of the central role of the chromosome.
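The collapse of tandem repeat arrays described in this section leaves a signature in read depth: reads from all genomic copies pile up on the few units present in the assembly. The sketch below shows one simple, commonly used way to gauge this by comparing depth over the repeat with depth over single-copy sequence. It is an illustrative assumption, not the procedure applied to Ensete glaucum in [44]; the function name and the depth values are hypothetical, and FISH remains necessary to confirm where the copies actually reside.

```python
# Hedged sketch: estimate true repeat copy number from read depth.
from statistics import median
from typing import Sequence


def estimate_copy_number(
    repeat_depths: Sequence[float],
    single_copy_depths: Sequence[float],
    assembled_copies: int = 1,
) -> float:
    """Scale the assembled copy count by the repeat/single-copy depth ratio."""
    baseline = median(single_copy_depths)  # depth expected for one genomic copy
    if baseline <= 0:
        raise ValueError("single-copy depth must be positive")
    repeat_depth = median(repeat_depths)   # robust to a few outlier positions
    return assembled_copies * repeat_depth / baseline


if __name__ == "__main__":
    # Made-up depths: two assembled units at ~3000x, single-copy genes at ~30x.
    copies = estimate_copy_number([3000, 3100, 2900], [30, 29, 31], assembled_copies=2)
    print(f"Estimated genomic copies: {copies:.0f}")
```

Medians are used rather than means so that a few positions with aberrant coverage do not dominate the estimate.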
6 Conclusion
It becomes clear that cytogenetics and cytogenomics are fundamental to many evolutionary and genome functional questions: determination of chromosome number and genome size, including in surveys of related taxa, is essential to understand species evolution. Chromosome translocations have shaped speciation, and it is vital to describe and understand them in plant breeding contexts. If species, accessions, or lines to be crossed carry rearrangements, this may lead to infertility in hybrids and is a frequent challenge in plant breeding. Here, cytology offers many possibilities: analysis of meiotic pairing will show the presence of translocations through trivalents, or the translocations may be detected by studying chromosome morphology, by C-banding or repetitive DNA probes on chromosomes, or even by using massive oligonucleotide pools as probes. Cytogenetics brings together techniques from many cell and molecular biology approaches and from the analysis of DNA sequences. In contrast to sequencing, chromosome preparation is not amenable to automation, and the constraint will always be the number of suitable high-quality slides, which require a skilled cytologist. However, when successful, a single slide with many hundreds of nuclei and a few metaphases offers an enormous amount of very powerful information, essentially analyzing single events. To avoid the often tedious finding and recording of cells and metaphases needed to synthesize results, automation of slide processing, microscopy, and karyotyping is possible and can speed up analysis (Chapters 31 and 32). Important recent developments have come from bioinformatic tools exploiting large genome sequencing data, along with advances in optical mapping and gene editing protocols with sequence targeting.
Plant cytology has changed in the last decade and has become much more computer- and web-based (Chapter 33), but as this book shows, it is as exciting and as relevant as when the first chromosomes were described more than 100 years ago.
Acknowledgments
We thank the many students, post-docs, visitors, and collaborators who, over the last 40 years, have involved us with many different species and with questions we were able to ask and answer by looking at chromosomes. We acknowledge funding from the Overseas Distinguished Scholar Project of South China Botanical Garden, Chinese Academy of Sciences (Y861041001) to JSHH.
References
1. Heslop-Harrison JS, Schwarzacher T (2011) Organisation of the plant genome in chromosomes. Plant J 66:18–33. https://doi.org/10.1111/j.1365-313X.2011.04544.x
2. Soltis PS, Marchant DB, Van de Peer Y, Soltis DE (2015) Polyploidy and genome evolution in plants. Curr Opin Genet Dev 35:119–125. https://doi.org/10.1016/j.gde.2015.11.003
3. Alix K, Gérard PR, Schwarzacher T, Heslop-Harrison JS (2017) Polyploidy and interspecific hybridization: partners for adaptation, speciation and evolution in plants. Ann Bot 120:183–194. https://doi.org/10.1093/aob/mcx079
4. Heslop-Harrison JS, Schwarzacher T (2007) Domestication, genomics and the future for banana. Ann Bot 100:1073–1084. https://doi.org/10.1093/aob/mcm191
5. Schmidt T, Heitkam T, Liedtke S, Schubert V, Menzel G (2019) Adding color to a century-old enigma: multi-color chromosome identification unravels the autotriploid nature of saffron (Crocus sativus) as a hybrid of wild Crocus cartwrightianus cytotypes. New Phytol 222:1965–1980. https://doi.org/10.1111/nph.15715
6. Herklotz V, Ritz CM (2017) Multiple and asymmetrical origin of polyploid dog rose hybrids (Rosa L. sect. Caninae (DC.) Ser.) involving unreduced gametes. Ann Bot 120:209–220. https://doi.org/10.1093/aob/mcw217
7. Tomaszewska P, Vorontsova MS, Renvoize SA, Ficinski SZ, Tohme J, Schwarzacher T, Castiblanco V, de Vega JJ, Mitchell RAC, Heslop-Harrison JS (2021) Complex polyploid and hybrid species in an apomictic and sexual tropical forage grass group: genomic composition and evolution in Urochloa (Brachiaria) species. Ann Bot 129 (in press). https://doi.org/10.1093/aob/mcab147
8. Waldeyer W (1888) Über Karyokinese und ihre Beziehung zu den Befruchtungsvorgängen. Arch Mikr Anat 32:1–222
9. Darlington CD (1937) Recent advances in cytology, 2nd edn. Churchill, London
10. Darlington CD, LaCour LF (1976) The handling of chromosomes, 6th edn. Wiley, New York
11. Arrighi FE, Hsu TC (1971) Localization of heterochromatin in human chromosomes. Cytogenetics 10:81–86. https://doi.org/10.1159/000130130
12. Marks GE (1975) The Giemsa-staining centromeres of Nigella damascena. J Cell Sci 18:19–25. https://doi.org/10.1242/jcs.18.1.19
13. Schwarzacher T, Schweizer D (1982) Karyotype analysis and heterochromatin differentiation with Giemsa C-banding and fluorescent counterstaining in Cephalanthera (Orchidaceae). Plant Syst Evol 141:91–113. https://doi.org/10.1007/BF00986411
14. Schweizer D (1981) Counterstain-enhanced chromosome banding. Hum Genet 57:1–14. https://doi.org/10.1007/BF00271159
15. Schwarzacher T, Ambros P, Schweizer D (1980) Application of Giemsa banding to orchid karyotype analysis. Plant Syst Evol 134:293–297. https://doi.org/10.1007/BF00986805
16. Schwarzacher T, Heslop-Harrison JS (2000) Practical in situ hybridization. BIO Scientific Publisher Limited, Oxford
17. Gall JG, Pardue ML (1969) Formation and detection of RNA-DNA hybrid molecules in cytological preparations. Proc Natl Acad Sci USA 63:378–383. https://doi.org/10.1073/pnas.63.2.378
18. John HA, Birnstiel ML, Jones KW (1969) RNA-DNA hybrids at the cytological level. Nature 223:582–587. https://doi.org/10.1038/223582a0
19. Pardue ML, Gall JG (1970) Chromosomal localization of mouse satellite DNA. Science 168:1356–1358. https://doi.org/10.1126/science.168.3937.1356
20. Harper ME, Saunders GF (1981) Localization of single copy DNA sequences on G-banded human chromosomes by in situ hybridization. Chromosoma 83:431–439. https://doi.org/10.1007/BF00327364
21. Ferguson-Smith MA (1991) Invited editorial: putting the genetics back into cytogenetics. Am J Hum Genet 48:179–182
22. Hutchinson J, Lonsdale LM (1982) The chromosomal distribution of cloned highly repetitive sequences from hexaploid wheat. Heredity 48:371–376. https://doi.org/10.1038/hdy.1982.49
23. Langer PR, Waldrop AK, Ward DA (1981) Enzymatic synthesis of biotin labeled polynucleotides: novel nucleic acid affinity probes. Proc Natl Acad Sci USA 78:6633–6637. https://doi.org/10.1073/pnas.78.11.6633
24. Rayburn AL, Gill BS (1985) Use of biotin-labeled probes to map specific DNA sequences on wheat chromosomes. J Hered 76:78–81. https://doi.org/10.1093/oxfordjournals.jhered.a110049
25. Pinkel D, Straume T, Gray JW (1986) Cytogenetic analysis using quantitative, high-sensitivity, fluorescence hybridization. Proc Natl Acad Sci USA 83:2934–2938. https://doi.org/10.1073/pnas.83.9.2934
26. Schwarzacher T, Leitch AR, Bennett MD, Heslop-Harrison JS (1989) In situ localization of parental genomes in a wide hybrid. Ann Bot 64:315–324. https://doi.org/10.1093/oxfordjournals.aob.a087847
27. Eisel D, Seth O, Grünewald-Janho S, Kruchen B (2008) DIG application manual for non-radioactive in situ hybridization, 4th edn. Roche Diagnostics GmbH, Mannheim
28. Cuadrado Á, Jouve N (2010) Chromosomal detection of simple sequence repeats (SSRs) using nondenaturing FISH (ND-FISH). Chromosoma 119:495–503. https://doi.org/10.1007/s00412-010-0273-x
29. Volpi EV, Bridger JM (2008) FISH glossary: an overview of the fluorescence in situ hybridization technique. BioTechniques 45:385–409. https://doi.org/10.2144/000112811
30. Meinkoth J, Wahl G (1984) Hybridization of nucleic acids immobilized on solid supports. Anal Biochem 138:267–284. https://doi.org/10.1016/0003-2697(84)90808-X
31. Heslop-Harrison JS, Schwarzacher T, Anamthawat-Jonsson K, Leitch AR, Shi M, Leitch IJ (1991) In situ hybridization with automated chromosome denaturation. Technique 3:109–116
32. Schwarzacher T (2016) Preparation and fluorescent analysis of plant metaphase chromosomes. In: Caillaud MC (ed) Plant cell division. Methods in molecular biology, vol 1370. Humana Press, New York, pp 87–103. https://doi.org/10.1007/978-1-4939-3142-2_7
33. Doležel J, Lucretti S, Schubert I (1994) Plant chromosome analysis and sorting by flow cytometry. Crit Rev Plant Sci 13:275–309. https://doi.org/10.1080/07352689409701917
34. Staginnus C, Gregor W, Mette MF, Teo CH, Borroto-Fernández EG, Machado MLC, Matzke M, Schwarzacher T (2007) Endogenous pararetroviral sequences in tomato (Solanum lycopersicum) and related species. BMC Plant Biol 7:24. https://doi.org/10.1186/1471-2229-7-24
35. Szinay D, Chang SB, Khrustaleva L, Peters S, Schijlen E, Bai Y, Stiekema WJ, Van Ham RCHJ, de Jong H, Klein Lankhorst RM (2008) High-resolution chromosome mapping of BACs using multi-colour FISH and pooled-BAC FISH as a backbone for sequencing tomato chromosome 6. Plant J 56:627–637. https://doi.org/10.1111/j.1365-313X.2008.03626.x
36. Koornneef M, Fransz P, de Jong H (2003) Cytogenetic tools for Arabidopsis thaliana. Chromosom Res 11:183–194. https://doi.org/10.1023/A:1022827624082
37. Mandáková T, Pouch M, Brock JR, Al-Shehbaz IA, Lysak MA (2019) Origin and evolution of diploid and allopolyploid camelina genomes was accompanied by chromosome shattering. Plant Cell 31:2596–2612. https://doi.org/10.1105/tpc.19.00366
38. de Jong JH, Fransz P, Zabel P (1999) High resolution FISH in plants – techniques and applications. Trends Plant Sci 4:258–263. https://doi.org/10.1016/S1360-1385(99)01436-3
39. Gerlach WL, Bedbrook JR (1979) Cloning and characterization of ribosomal RNA genes from wheat and barley. Nucleic Acids Res 7:1869–1885. https://doi.org/10.1093/nar/7.7.1869
40. Chang K-D, Fang S-A, Chang F-C, Chung M-C (2010) Chromosomal conservation and sequence diversity of ribosomal RNA genes of two distant Oryza species. Genomics 96:181–190. https://doi.org/10.1016/j.ygeno.2010.05.005
41. Liu Q, Li XY, Zhou XY, Li MZ, Zhang FJ, Schwarzacher T, Heslop-Harrison JS (2019) The DNA landscape in Avena: chromosome and genome evolution defined by major repetitive DNA classes in whole-genome sequence reads. BMC Plant Biol 19:226. https://doi.org/10.1186/s12870-019-1769-z
42. Tang Z, Yang Z, Fu S (2014) Oligonucleotides replacing the roles of repetitive sequences pAs1, pSc119.2, pTa-535, pTa71, CCS1, and pAWRC.1 for FISH analysis. J Appl Genet 55:313–318. https://doi.org/10.1007/s13353-014-0215-z
43. Wang Z, Rouard M, Biswas MK, Droc G, Cui D, Roux N, Baurens FC, Ge XJ, Schwarzacher T, Heslop-Harrison PJ, Liu Q (2022) A chromosome-level reference genome of Ensete glaucum gives insight into diversity and chromosomal and repetitive sequence evolution in the Musaceae. GigaScience 11:giac027. https://doi.org/10.1093/gigascience/giac027
44. Han YH, Zhang T, Thammapichai P, Wen J, Jiang JM (2015) Chromosome-specific painting in Cucumis species using bulked oligonucleotides. Genetics 200:771–779. https://doi.org/10.1534/genetics.115.177642
45. Agrawal N, Gupta M, Banga SS, Heslop-Harrison JS (2020) Identification of chromosomes and chromosome rearrangements in crop brassicas and Raphanus sativus: a cytogenetic toolkit using synthesized massive oligonucleotide libraries. Front Plant Sci 11:598039. https://doi.org/10.3389/fpls.2020.598039
46. Zaki NM, Schwarzacher T, Singh R, Madon M, Wischmeyer C, Hanim Mohd Nor N, Zulkifli MA, Heslop-Harrison JS (2021) Chromosome identification in oil palm (Elaeis guineensis) using in situ hybridization with massive pools of single copy oligonucleotides and transferability across Arecaceae species. Chromosom Res 29:373–390. https://doi.org/10.1007/s10577-021-09675-0
47. Heslop-Harrison JS, Murata M, Ogura Y, Schwarzacher T, Motoyoshi F (1999) Polymorphisms and genomic organization of repetitive DNA from centromeric regions of Arabidopsis thaliana chromosomes. Plant Cell 11:31–42. https://doi.org/10.1105/tpc.11.1.31
48. Vershinin AV, Schwarzacher T, Heslop-Harrison JS (1995) The large scale genomic organization of repetitive DNA families at the telomeres of rye chromosomes. Plant Cell 7:1823–1833. https://doi.org/10.1105/tpc.7.11.1823
49. Forsström PO, Merker A, Schwarzacher T (2002) Characterisation of mildew resistant wheat-rye substitution lines and identification of an inverted chromosome by fluorescent in situ hybridisation. Heredity 88:349–355. https://doi.org/10.1038/sj.hdy.6800051
50. Ali N, Heslop-Harrison JS, Ahmad H, Graybosch RA, Hein GL, Schwarzacher T (2016) Introgression of chromosome segments from multiple alien species in wheat breeding lines with wheat streak mosaic virus resistance. Heredity 117:114–123. https://doi.org/10.1038/hdy.2016.36
51. Wienberg J, Stanyon R, Jauch A, Cremer T (1992) Homologies in human and Macaca fuscata chromosomes revealed by in situ suppression hybridization with human chromosome specific libraries. Chromosoma 101:256–270. https://doi.org/10.1007/BF00346004
52. Niemelä T, Seppänen M, Badakshi F, Rokka VM, Heslop-Harrison JS (2012) Size and location of radish chromosome regions carrying the fertility restorer Rfk1 gene in spring turnip rape. Chromosom Res 20:353–361. https://doi.org/10.1007/s10577-012-9280-5
53. Mandáková T, Lysak MA (2016) Painting of Arabidopsis chromosomes with chromosome-specific BAC clones. Curr Protoc Plant Biol 1:359–371. https://doi.org/10.1002/cppb.20022
54. Braz GT, He L, Zhao H, Zhang T, Semrau K, Rouillard JM, Torres GA, Jiang J (2018) Comparative Oligo-FISH mapping: an efficient and powerful methodology to reveal karyotypic and chromosomal evolution. Genetics 208:513–523. https://doi.org/10.1534/genetics.117.300344
55. Šimoníková D, Němečková A, Čížková J, Brown A, Swennen R, Doležel J, Hřibová E (2020) Chromosome painting in cultivated bananas and their wild relatives (Musa spp.) reveals differences in chromosome structure. Int J Mol Sci 21:7915. https://doi.org/10.3390/ijms21217915
56. Bertioli DJ, Vidigal B, Nielen S, Ratnaparkhe MB, Lee T-H, Leal-Bertioli SCM, Kim C, Guimarães PM, Seijo G, Schwarzacher T, Paterson AH, Heslop-Harrison JS, Araujo ACG (2013) The repetitive component of the A genome of peanut (Arachis hypogaea) and its role in remodeling intergenic sequence space since its evolutionary divergence from the B genome. Ann Bot 112:545–559. https://doi.org/10.1093/aob/mct128
57. Sepsi A, Fábián A, Jäger K, Heslop-Harrison JS, Schwarzacher T (2018) ImmunoFISH: simultaneous visualisation of proteins and DNA sequences gives insight into meiotic processes in nuclei of grasses. Front Plant Sci 9:1193. https://doi.org/10.3389/fpls.2018.01193
58. Anamthawat-Jonsson K, Heslop-Harrison JS (1993) Isolation and characterization of genome-specific DNA sequences in Triticeae species. Mol Gen Genet 240:151–158. https://doi.org/10.1007/BF00277052
59. Stack SM, Royer SM, Shearer LA, Chang S-B, Giovannoni JJ et al (2009) Role of fluorescence in situ hybridization in sequencing the tomato genome. Cytogenet Genome Res 124:339–350. https://doi.org/10.1159/000218137
60. Shearer LA, Anderson LK, de Jong H, Smit S, Goicoechea JL, Roe BA, Hua A, Giovannoni JJ, Stack SM (2014) Fluorescence in situ hybridization and optical mapping to correct scaffold arrangement in the tomato genome. Genes Genomes Genet 4:1395–1405. https://doi.org/10.1534/g3.114.011197
61. Paesold S, Borchardt D, Schmidt T, Dechyeva D (2012) A sugar beet (Beta vulgaris L.) reference FISH karyotype for chromosome and chromosome-arm identification, integration of genetic linkage groups and analysis of major repeat family distribution. Plant J 72:600–611. https://doi.org/10.1111/j.1365-313X.2012.05102.x
Part II Sizing the Nucleus: Methods for Genome Size and Ploidy Level Estimation
Chapter 2
The Use of Flow Cytometry for Estimating Genome Sizes and DNA Ploidy Levels in Plants
João Loureiro, Martin Čertner, Magdalena Lučanová, Elwira Sliwinska, Filip Kolář, Jaroslav Doležel, Sònia Garcia, Sílvia Castro, and David W. Galbraith
Abstract
Flow cytometry has emerged as a uniquely flexible, accurate, and widely applicable technology for the analysis of plant cells. One of its most important applications centers on the measurement of nuclear DNA contents. This chapter describes the essential features of this measurement, outlining the overall methods and strategies and going on to provide a wealth of technical details to ensure the most accurate and reproducible results. The chapter aims to be equally accessible to experienced plant cytometrists and to those newly entering the field. Besides providing a step-by-step guide for estimating genome sizes and DNA-ploidy levels from fresh tissues, special attention is paid to the use of seeds and desiccated tissues for such purposes. Methodological aspects regarding field sampling, transport, and storage of plant material are also given in detail. Finally, troubleshooting information for the most common problems that may arise during the application of these methods is provided.
Key words Best practices, DAPI, Desiccated tissues, DNA-ploidy level, Flow cytometry, Genome size, Plant nuclei isolation, Plant tissues, Propidium iodide, Seeds
1 Introduction
Flow cytometry (FCM) allows the rapid analysis of the optical properties of particles, generally cells or their contents. Particles flow in a narrow liquid stream to intersect a focused beam of laser light. Through the interaction between the light and the particles, important information regarding their size, shape, and complexity can be obtained. FCM was initially developed during the 1960s and 1970s for biomedical research applications, such as hematology, oncology, or immunology. However, applications in plant sciences soon emerged. FCM was first used to estimate nuclear DNA amounts in Vicia faba [1], albeit through a complicated procedure involving protoplast isolation and lysis. It was not
Tony Heitkam and Sònia Garcia (eds.), Plant Cytogenetics and Cytogenomics: Methods and Protocols, Methods in Molecular Biology, vol. 2672, https://doi.org/10.1007/978-1-0716-3226-0_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023
Fig. 1 Diagram of a rapid protocol for preparation of suspensions of cell nuclei according to Galbraith et al. [2]. (1) Typically, 1 mL of the isolation buffer is sufficient. The DNA fluorochrome may also be added after step 3. (2) Fresh tissues are chopped with a razor blade (a plastic holder is advisable for safely handling the razor blade) or a sharp scalpel. Squeezing the tissues is neither effective nor advisable. (3) The tissue homogenate is filtered through a small piece of nylon mesh or a specialized filter. Following the addition of fluorochromes that bind specifically to DNA, flow cytometric analysis of the fluorescence intensities of the stained nuclei produces histograms of relative DNA content, which comprise peaks representing nuclei in different phases of the cell cycle
until Galbraith et al. [2] proposed their easy protocol for the preparation of intact nuclei suspensions for FCM (Fig. 1) that this technology started to take off as a widely used method in plant sciences. Given the challenges of the analysis of intact cells in plants (mostly due to their rigid cell walls), FCM is more frequently used to analyze subcellular plant organelles, mainly nuclei, but also mitochondria, plastids, and chromosomes [3]. The most common uses of FCM in plant-related fields are the analysis of cell nuclei to estimate ploidy level and genome size [4]. Additional applications include the analysis of the cell cycle, the estimation of the extent of endoreduplication, DNA base composition measurements, and flow cytometric seed screening for exploring reproductive pathways. For a review of best practices for each application, see Sliwinska et al. [5].
Genome Size and DNA-Ploidy Estimations using Flow Cytometry
Fig. 2 Chart showing the number of papers produced by year in the period 1985–2020 with the query “genome size” AND “flow cytometry” AND “plant” in the Scopus database (retrieved 05.10.2021)
Plant genome size assessments by FCM have particularly boosted research in cytogenetics, a field which, since the 1980s, has experienced continuous growth (Fig. 2). Genome size is associated with a myriad of anatomical, morphological, physiological, and environment-related traits, which go beyond the scope of this chapter, but have contributed significantly to the popularity of measuring this parameter. Given the relative ease of sample preparation and its high accuracy, the estimation of genome size soon became either the direct object of numerous plant cytogenetics papers or provided complementary information. At present, data have been archived for ca. 12,500 plant species, most of them obtained by FCM, as reported by the Plant DNA C-values Database (https://cvalues.science.kew.org/). Similar resources have been developed for specific plant families, such as the Genome Size in Asteraceae Database (GSAD, https://www.asteraceaegenomesize.com), for which FCM is also the main technique used to generate reliable genome size data [6]. It is foreseen that FCM will continue to be the method of choice for measuring plant genome sizes. Although bioinformatic tools based on whole genome sequence data have been developed as an alternative, they often underestimate genome size due to the presence of repetitive DNA elements and polyploidy [7]. Measurement of ploidy levels is one of the applications of FCM currently finding extensive deployment in agronomy and plant biotechnology. FCM is particularly useful for routine inference of seed quality and for testing polyploid breeding success. For
example, FCM is an easy, well-established method to determine the success of artificial induction of polyploidy by chemical agents [8]. Ploidy analyses are also very important at the last stages of haploid induction, when a large number of plants need to be screened for potential haploids, and FCM provides a very efficient way to achieve this [9]. FCM is also an excellent method to investigate endopolyploidy (endoreduplication), which occurs when several rounds of nuclear DNA duplication take place in some somatic tissues without an intervening mitosis, resulting in cells with DNA contents of 4C (although 4C may also include nuclei in G2 phase), 8C, 16C, 32C, and so on. Endopolyploidy is tissue-specific and is more commonly encountered in certain plant families (e.g., Aizoaceae, Brassicaceae, Crassulaceae), while being absent from other groups (e.g., lycophytes and liverworts) [10]. The fluorescence histograms produced by FCM, showing successively increasing populations of nuclei of 4C, 8C, 16C (and so on) DNA content values, provide hallmarks of endopolyploidy. FCM is an indirect method in the sense that it requires the presence and processing of a reference standard of known genome size [5, 11]. The reference standard and the sample of the species to be analyzed are mixed and processed together. The resulting histogram of nuclear DNA contents will ideally show two peaks of nuclei in the G0/G1 phase of the cell cycle (one representing the standard, and one for the species under study), although small G2 peaks are usually also present. Comparison of the relative positions of the two G0/G1 peaks determines the genome size and ploidy level of the sample under study. Fresh young leaves are typically the best plant materials for flow cytometric analysis and are collected from living plants in the field or grown in a greenhouse or growth chamber. In some cases, the presence of secondary metabolites can interfere with the FCM assays. 
In particular, anthocyanins [12] and tannic acid [13] are well-known for interfering with DNA staining by intercalating fluorochromes. Other compounds such as berberine exhibit autofluorescence [14]. Thus, a stoichiometric error in DNA amount quantification can be introduced by the presence of secondary metabolites. If interference by secondary metabolites is suspected, it is recommended to use plant tissues that may contain lower amounts of these compounds or lack them altogether, for example young plant tissues (seedlings), petals, etiolated shoots [15], or even seeds [16]. If fresh or young material is not available for measurements, some researchers have nevertheless successfully obtained reliable genome sizes (see [15] for a review) or ploidy estimates [17] from desiccated plant materials such as silica gel-dried samples or quickly processed herbarium vouchers. The main goals of the present chapter are to provide a comprehensive and updated review of the method of FCM as applied to the determination of genome size and DNA-ploidy level using plant
material, including fresh and desiccated tissues and seeds. We provide information regarding the most commonly used reagents, equipment, reference standards, and source materials, including how such materials should be sampled in the field, transported, and stored, in addition to advice on best practices, according to our experience of ca. 40 years working with FCM in plants.
1.1 Terminology of Genome Size and Ploidy Level
By tradition, the majority of reference standards have their genome size given in picograms (pg) of DNA [11]. This implies that the genome size of an unknown sample is determined in pg DNA, and the overall length of DNA molecules (the number of base pairs) can be approximated using the formula: number of base pairs = 0.978 × 10⁹ × DNA content (pg) [82]. Since the analyses can be done using nuclei isolated from haploid, diploid, or polyploid cells, an unambiguous report on nuclear genome size requires proper terminology. While the term “genome” has been used widely when referring to genetic information of an organism, when reporting genome size, it is imperative to either specify the number of copies of the genetic information that are being considered (e.g., based on the number of basic chromosome sets, x) or ascertain whether it was estimated from sporophytic (2n) or gametophytic (n) plant tissues. In their proposal to unify the terminology of genome size, Greilhuber et al. [18] introduced the term “holoploid genome” to describe the genetic information contained in G0/G1 nuclei of plant gametes or gametophytes (n), with the quantity of DNA it contains to be expressed as the 1C-value, or the 1C genome size. Analogously, G0/G1 nuclei of sporophytic somatic tissue have by default a 2C DNA amount, and this value should be divided in half to obtain the 1C genome size. Genome size is most often reported as the 1C-value; this approach allows unambiguous DNA content reports also for species of unknown ploidy or with unavailable chromosome counts. If the ploidy level is known for the plant material under study, the 2C DNA amount may be divided by ploidy to produce the mean quantity of DNA corresponding to a basic chromosome set (x). This parameter, describing the DNA amount attributable to a “monoploid genome”, is then expressed by the 1Cx-value or the 1Cx genome size and can be conveniently used to compare genome size variation across different ploidy levels. 
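The relationships between the 2C, 1C, and 1Cx values and the conversion from picograms to base pairs can be illustrated with a short calculation. The following Python sketch is only an illustration of the arithmetic described above; the function name `genome_size_report` and the example values (a tetraploid with 2C = 4.0 pg) are hypothetical and are not taken from any dataset in this chapter.

```python
# Conversion factor quoted in the text: 1 pg DNA ~ 0.978 x 10^9 bp.
BP_PER_PG = 0.978e9


def genome_size_report(two_c_pg, ploidy=None):
    """Derive the 1C value (pg and Mbp) and, if ploidy is known, the 1Cx value.

    two_c_pg: 2C DNA amount of sporophytic G0/G1 nuclei, in pg.
    ploidy:   number of basic chromosome sets (x); optional.
    """
    one_c_pg = two_c_pg / 2.0  # holoploid genome size (1C)
    report = {
        "2C_pg": two_c_pg,
        "1C_pg": one_c_pg,
        "1C_Mbp": one_c_pg * BP_PER_PG / 1e6,
    }
    if ploidy is not None:
        # monoploid genome size (1Cx): 2C divided by the ploidy level
        report["1Cx_pg"] = two_c_pg / ploidy
    return report


# Hypothetical tetraploid sample with 2C = 4.0 pg.
example = genome_size_report(4.0, ploidy=4)
print(example)
```

When the ploidy level is unknown, the function simply omits the 1Cx value, mirroring the recommendation to report the 1C-value in such cases.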
Although it seems natural to expect a linear relationship between chromosome number and the amount of nuclear DNA, in principle, this relation is constantly observed only within a species. Different species within a genus or even higher taxonomic units with the same ploidy may differ in genome size and hence nuclear DNA amount. Correspondingly, different species with the same DNA amount may differ in ploidy. Therefore, flow cytometric ploidy estimations should always involve comparisons of DNA amounts between plants belonging to the same species. However, as documented by Suda et al. [19], differences in monoploid
genome size (1Cx-value) can be observed between plants differing in ploidy within the same species. In such cases, ploidy estimates of various cytotypes using FCM may not be correct. The uncertainty holds also true for intraspecific aneuploidy, which may be observed in species with bimodal karyotypes, or allopolyploids whose parental species have different chromosome sizes. Because of this, it is critical to distinguish between the ploidy or aneuploidy level as determined by chromosome counting and by FCM. As suggested by Suda et al. [19], following the proposal by Hiddemann et al. [20], this can be done by designating the estimates determined by FCM as “DNA-ploidy”, or “DNA-aneuploidy”.
1.2 The Importance of Ploidy Level and Genome Size
FCM can readily be employed to analyze crude tissue homogenates that are easily prepared and excels in the ability to identify populations of cell nuclei in the homogenates. This makes FCM an unequalled method to measure nuclear DNA amounts either in relative, or absolute units. Since the nuclear DNA amount in individuals of the same species is proportional to ploidy level, or the number of basic chromosome sets (x), flow cytometric estimation of relative DNA amounts is established as the preferable alternative to low throughput chromosome counting or measurement of stomatal sizes. In fact, the dramatic increase in sample throughput offered by FCM has revolutionized many areas of plant biology and greatly stimulated progress. For example, the method has been used to characterize hidden genetic diversity within plant genera [21] and to support the classification of accessions held in gene banks [22, 23]. At the population level, FCM enables the discovery of cytotypes and the identification of hybrids [24–28]. Identification of reproductive pathways, including the type of apomixis, can be done after ploidy analysis of seed tissues (embryo and endosperm; [29]). Another exciting application is the analysis of the impact of ploidy on the interaction between plants and their pollinators [30, 31]. Similar success has been obtained with the application of FCM to plant breeding, where it supports the production of haploids and dihaploids [32], and the development of autopolyploid cultivars [8] and cultivars following interspecific hybridization [33]. Seed and biotechnology companies routinely employ FCM in the production of triploid seeds [34] and validating genetic stability during in vitro mass propagation [35–38]. Equally important impacts have emerged from flow cytometric estimation of nuclear DNA amounts in absolute units, or genome size. 
Even if various measures may need to be implemented to ensure reliable results, FCM still excels in simplicity, throughput and precision when compared to other methods. The knowledge concerning genome size in large groups of plants makes it possible to reveal the modes of plant genome evolution [39] and to estimate the copy number of various classes of DNA sequences in individual species. The size of the nuclear genome is a fundamental parameter
for the design of genome sequencing projects and for accurately estimating its costs [40]. This information is also needed to assess the completeness of the assembly after the project is finished. As for ploidy analyses, genome size estimations also find important uses in taxonomy, evolution, and ecology [41].
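The comparison of sample and reference standard G0/G1 peaks described in the Introduction reduces to a simple ratio. The sketch below illustrates this arithmetic in Python under stated assumptions: the function names, the peak positions (in arbitrary fluorescence channels), and the standard's 2C value are all hypothetical, and the DNA-ploidy inference assumes a conspecific reference plant of known ploidy and a constant monoploid genome size within the species, which, as noted above, does not always hold.

```python
def sample_2c_pg(sample_peak_mean, standard_peak_mean, standard_2c_pg):
    """2C DNA amount of the sample from the ratio of G0/G1 peak means."""
    return sample_peak_mean / standard_peak_mean * standard_2c_pg


def dna_ploidy(sample_2c, reference_2c, reference_ploidy):
    """DNA-ploidy relative to a conspecific plant of known ploidy.

    Returns the raw (unrounded) estimate and the nearest integer level,
    so that deviations from an integer value remain visible.
    """
    raw = sample_2c / reference_2c * reference_ploidy
    return raw, round(raw)


# Hypothetical run: standard G0/G1 peak at channel 200 (2C = 2.5 pg),
# sample G0/G1 peak at channel 400.
two_c = sample_2c_pg(400.0, 200.0, 2.5)

# Hypothetical conspecific diploid reference with 2C = 2.5 pg.
raw_level, level = dna_ploidy(two_c, 2.5, 2)
print(two_c, raw_level, level)
```

Keeping the unrounded estimate alongside the integer level makes it easier to spot cases in which the monoploid genome size differs between cytotypes or in which aneuploidy may be involved.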
2 Materials
In this section, we list the reagents and equipment, including the composition of the nuclear isolation buffers, needed to isolate nuclei from different plant tissues for genome size and DNA-ploidy level estimations using FCM. Detailed information on plant tissues potentially suitable for such applications and on reference standards is also provided.
2.1 Plant Material
2.1.1 Plant Material Suitable for DNA Content Estimation
In theory, any plant material providing intact nuclei in sufficient quantity should be suitable for nuclear DNA content estimation using FCM. While young leaves, not quite fully expanded, are recommended as first-choice material for most species [42], high quality results have also been obtained with other tissues and organs, including petals, sepals, young stems (or flower stalks), petioles, dry seeds, pollen, gynoecia, or even roots [15]. If the results of the measurements using a specific plant material are unsatisfactory, higher precision FCM measurements and easier interpretation of the FCM outputs may sometimes be achieved by selecting alternative tissues and/or sampling the tissues at a particular developmental stage (see Čertner et al. [15] for a review). In general, immature plant organs and meristems may exhibit distinct mitotic activity, whereas the use of senescing tissues may carry an increased risk of the presence of endopolyploid cells (see Note 1), pathogen infestation, or plant-controlled tissue degradation. Mature and older plant organs may also contain higher amounts of secondary metabolites that interfere with FCM measurements (DNA staining in particular, see Note 2). Both problems can be successfully avoided already at the stage of tissue selection and sampling. Any sampled tissue should be intact and free of parasites, pathogens, and epiphytes to eliminate biological contamination. Since only small amounts of somatic tissue are sufficient for DNA content estimation using FCM, the process of sample collection does not generally lead to unrecoverable plant damage. Whenever possible, fresh plant material is preferred for nuclear DNA content estimation, and the time lag between tissue sampling and conducting FCM measurements should be minimized. There are basically two alternative sources of plant material for FCM. 
First, ex situ cultivation of plant material under constant and controlled conditions minimizes potential environmental biases and may help attain FCM measurements of higher quality (see Note
3). Plants grown in a greenhouse or growth chamber provide a stable source of material and sampling can be easily repeated. For high reproducibility of published FCM estimates, it is always important to carefully detail the conditions under which the plant material was raised. Second, field collection of material introduces a more profound degree of variation in tissue quality due to alterations in environmental conditions, a potentially greater risk of plant misidentification (emphasizing the importance of making voucher specimens), and higher demands on suitable storage conditions for fresh-tissue samples during their transport to the laboratory. For many species, when fresh-tissue samples are refrigerated and an appropriate humidity level is maintained, they can be stored for several days (up to approximately 2 weeks) without compromising the quality of FCM analysis. Longer storage times are possible either with the sampling of specialized plant reproductive propagules (e.g., seeds, bulbs, or tubers) or by the implementation of an appropriate material preservation strategy such as tissue desiccation or preservation of nuclei suspensions (reviewed by Čertner et al. [15]). However, such long-term preservation usually comes at a cost of lower quality of FCM analysis that may not be compatible with the high standards required for some applications, e.g., for estimation of absolute genome size.
2.1.2 Seeds
Seeds are convenient in terms of their ease of transport and since they allow long-term storage of plant material. FCM analyses do not necessarily have to be conducted on seedlings and/or older grown plants but can be successfully applied to dry seeds, which saves time and cultivation costs. For sexually reproducing angiosperms, a mature seed usually contains the 2C embryo as well as the 3C endosperm, a specialized storage tissue (see Note 4); both structures (especially the endosperm) may also contain endopolyploid cells. In gymnosperms, the overall seed structure is very similar. However, there is no double fertilization, the haploid (1C) megagametophyte being the functional equivalent of the endosperm. Consequently, when whole seeds are analyzed, not only the targeted G0/G1 embryo nuclei but also endosperm nuclei are detected, leading to the presence of additional peaks in FCM histograms, which can be confusing. This problem can be avoided by analyzing only the dissected embryo or its parts (usually the embryo axis/radicle because cotyledons often contain endopolyploid cells; [16]). Dry seed material has been successfully used for determination of nuclear DNA content (e.g., [16, 43, 44]) as well as for screening of reproductive modes in facultatively apomictic groups (e.g., [29, 45, 46]). Whereas in some plant species the DNA content estimates from dry seeds are comparable in quality to those from leaves [16] or can even be more reliable (e.g., when the leaves contain high amounts of secondary metabolites interfering with fluorescent staining; [14]), seeds should be generally avoided for
precise genome size measurements because of possibly different metabolic and epigenetic effects on chromatin structure than for fresh tissues, which can affect the accessibility of DNA for fluorochromes. Other limitations include the only seasonal availability of seeds, an inability to repeat the FCM measurement (entire seeds are usually used for each analysis), and the fact that field-collected (open-pollinated) seeds may contain a higher proportion of hybrid, odd-ploidy, or aneuploid individuals than is the case for adult plants in the same populations, i.e., variants that otherwise would soon be removed by natural selection (e.g., [27]).
2.1.3 Plant Material Preservation: Desiccated Tissues
Material preservation allows the storage of plant tissue samples for prolonged periods of time (months up to several years) before FCM analysis. Not only is this approach convenient for gradual processing of large numbers of collected samples, but it also enables research in remote and understudied geographical areas. On the other hand, tissue preservation usually decreases the precision of FCM measurements as compared to those obtained with fresh tissue. The resulting quality of FCM estimates then depends on optimization of a particular material preservation strategy, and on sample storage time and storage conditions. Genome size estimation using preserved plant tissues should be avoided, unless the authors can rigorously assess the minimal effect of material preservation on fluorescence intensity (hence DNA content), either on a subset of samples or in a separate calibration data set (e.g., [47]). Particular preservation strategies differ substantially in their advantages and limitations (reviewed in [15]). The most commonly applied preservation method is quick tissue desiccation, which is a simple procedure that produces good quality FCM results. Both silica gel-dried and air-dried plant material can be used. Suda and Trávníček [48, 49] described the use of DAPI staining for reliable ploidy level estimation in desiccated plant material. This has become the most common FCM protocol for use with desiccated tissues (e.g., [50–54]), although promising attempts to employ PI staining for genome size measurements from silica gel-dried starting material have also appeared [55]. Plant desiccation is a routine way of field sample preservation, and recently collected herbarium vouchers can be used for FCM-based investigations [48, 56, 57]. 
On the other hand, FCM analysis of desiccated material typically results in a decrease in fluorescence intensity as compared to that of fresh samples (usually up to 10%; e.g., [49, 55, 58]) and produces histograms with higher peak CVs (coefficients of variation) and more prominent levels of background debris [49]. Furthermore, the quality of FCM measurements also appears to decrease with aging of desiccated samples, although this decrease can be slowed by deep freezing (-80 °C), prolonging the possible storage time up
to several years [48]. The quality of FCM analysis from desiccated material also seems to vary across species [55] and organ or tissue types; for example, plants with soft and thin leaves often cause problems [48, 59]. A preliminary screening across several alternative tissue types is therefore highly recommended. Desiccated pollen may be a good choice when it is difficult to get high-quality results with somatic tissue and is also convenient for storage [60, 61].
2.1.4 Plant Material Preservation: Other Approaches
Other, less commonly employed material preservation strategies in plant FCM include the use of frozen tissue, chemical fixatives, and nuclear suspensions stored in protective solutions. FCM histograms of reasonable quality have been achieved using rapidly frozen plant tissues [56, 62], and this approach has also been used for ploidy level screening [63, 64], estimation of absolute genome size [65, 66], and for establishing the patterns of DNA synthesis [67]. Rapid freezing (e.g., using liquid nitrogen) of the sampled tissue and maintaining it in the frozen state until FCM sample preparation is strongly recommended. That, of course, makes sample preparation in the field and transport to the laboratory more challenging. As with desiccated material, the FCM analysis of frozen tissue frequently results in lower resolution outputs, and a decay in histogram quality of aging samples is also apparent. Frozen samples of fresh plant material can be stored for months or years. In contrast to animal and human FCM research, tissue preservation using chemical fixatives (ethanol- or formaldehyde-based) has only rarely been applied in plant studies [68, 69]. The protocol involves laborious and time-consuming enzymatic digestion of cell walls, which may additionally result in decreased fluorescence intensity of nuclei [70]. Moreover, it seems likely that DNA staining by intercalating dyes (such as PI) can be perturbed by chemical fixation, as previously seen for the Feulgen reaction [71]. Tissue treatment with chemical fixatives is thus considered inappropriate for absolute genome size estimation using FCM, being only useful, ultimately, for screening of ploidy variation in model species. The storage of isolated nuclear suspensions in protective solutions is a promising alternative to physical and chemical preservation when high quality measurements are required [72, 73]. For example, Kolář et al. 
[59] showed in multiple plant species that nuclei kept at -18 °C in Otto I solution mixed with a 60% glycerol solution remained intact for at least a few weeks, and thereafter provided estimates of nuclear DNA content equivalent to those obtained from fresh samples analyzed immediately after collection. Kobrlová et al. [74] successfully employed this approach to study genome size variation in tropical flora of Borneo, obtaining high precision FCM measurements (mean sample CVs output.histo
Setting the command line options -C (count canonical k-mers), -m (k-mer size), and -h (maximum k-mer coverage) is essential; -s (memory) and -t (number of threads) must be adjusted based on the computational environment at hand. Important note: jellyfish does not accept the short names for fasta (fa) or fastq (fq) files.
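To make the terms "canonical k-mer" and "k-mer histogram" concrete, the following Python sketch counts canonical k-mers in a toy set of reads and derives the two-column histogram (k-mer coverage, number of unique k-mers). It is a didactic illustration only and is not a substitute for jellyfish; the function names are hypothetical, a canonical k-mer is taken here to be the lexicographically smaller of a k-mer and its reverse complement, and skipping k-mers that contain characters other than A, C, G, or T is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k):
    """Count canonical k-mers and return sorted (coverage, unique k-mers) pairs."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption of this sketch: k-mers with ambiguous bases are skipped.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    # Number of distinct k-mers observed at each coverage value.
    histogram = Counter(counts.values())
    return sorted(histogram.items())


# Two toy reads that are reverse complements of each other.
toy_histogram = kmer_histogram(["ACGTA", "TACGT"], k=3)
print(toy_histogram)
```

Because the two toy reads are reverse complements, every k-mer from one read collapses onto a canonical k-mer from the other, which is the effect that counting canonical k-mers is meant to achieve for double-stranded DNA.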
Uljana Hesse
Fig. 10 Truncated and full-length histograms of the DM dataset (40x holoploid genome coverage; k = 21) and the effect of truncation on NK-total
KMC
Another efficient k-mer counting program is KMC [47], which has the advantage of handling several fastq files at once. By default, it counts canonical k-mers and requires (1) making a temporary folder; (2) generating the database using “kmc”; and (3) generating the two-column histogram using “kmc_tools transform”. For single file analyses, the following command lines can be used:
mkdir ./tmp
kmc -k21 -ci1 -cs2000000 -t24 -m100 ./input.fastq ./db_prefix ./tmp
kmc_tools transform ./db_prefix histogram ./output.histo -cx2000000
For multiple file analyses, the command lines are:
mkdir ./tmp
ls *.fastq > FILES
kmc -k21 -t24 -m100 -ci1 -cs2000000 @FILES reads tmp/
kmc_tools transform reads histogram reads.histo -cx2000000
K-Mer-Based Genome Size Estimation
Essential command line options include -k (k-mer size), -ci (minimum k-mer coverage), and -cs/-cx (maximum k-mer coverage); -m (memory) and -t (number of threads) can be adjusted.
BBMap tools
The bbmap toolbox from BB-tools has a number of programs that not only generate a histogram, but can also calculate the genome size. The program KmerCountExact stores all k-mer sequences and their exact counts. Since this requires a substantial amount of memory, the input data size is limited. The khist.sh script from BBNorm only saves the k-mer counts and is therefore more memory-efficient than KmerCountExact. The resulting histogram has three columns: the k-mer coverage, the total number of k-mers per k-mer coverage, and the unique number of k-mers per k-mer coverage. For other genome prediction programs, only columns 1 and 3 are needed and the header must be removed. This can be easily achieved using, for example, awk. Essential command line options include the k-mer length (k) and the maximum k-mer coverage (histmax/histlen). The default maximum k-mer coverage for both programs is 100000 (this is 10x the default value of jellyfish, which could explain reported variability in k-mer-based genome size estimates). KmerCountExact also permits specification of histogram parameters (histcolumns=2 prints only columns 1 and 3; histheader=f removes the header). Memory usage can be adjusted through the -Xmx flag (e.g., -Xmx80g, which would request 80 GB of RAM).
kmercountexact.sh -Xmx80g in=input.fastq k=21 histmax=2000000 histcolumns=2 histheader=f khist=output.hist
khist.sh -Xmx80g in=/path/input.fastq k=21 histlen=2000000 hist=/path/output.hist
awk '{if ($1 >= 1) {print $1, $3}}' /path/output.hist > /path/output2.hist
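For readers who prefer to post-process histograms outside the shell, the column selection performed by the awk command above, and the dependence of the total k-mer number on the maximum k-mer coverage, can be sketched in Python. This is an illustrative sketch only: the function names and the header text in the toy input are hypothetical, and the `max_coverage` argument simply discards rows above the cutoff to mimic a truncated histogram, which may differ from how individual counting programs handle k-mers above their maximum coverage.

```python
def to_two_columns(lines):
    """Keep coverage (column 1) and unique k-mers (column 3); drop header lines."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip headers and malformed lines: column 1 must be an integer.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        coverage, unique_kmers = int(fields[0]), int(fields[2])
        if coverage >= 1:
            rows.append((coverage, unique_kmers))
    return rows


def total_kmers(rows, max_coverage=None):
    """Sum of coverage x unique k-mers, optionally ignoring rows above a cutoff."""
    return sum(
        coverage * unique_kmers
        for coverage, unique_kmers in rows
        if max_coverage is None or coverage <= max_coverage
    )


# Toy three-column histogram; the header text is a hypothetical placeholder.
toy_lines = [
    "#coverage\ttotal\tunique",
    "1\t100\t100",
    "2\t40\t20",
    "10\t50\t5",
]
toy_rows = to_two_columns(toy_lines)
print(toy_rows, total_kmers(toy_rows), total_kmers(toy_rows, max_coverage=2))
```

Comparing the total with and without a cutoff gives a quick impression of how much of the k-mer volume sits in high-coverage (repeat-derived) k-mers for a given dataset.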
A number of other programs can also be used to generate k-mer histograms. Kmerfreq [36] is fast but not always suitable for k-mer-based genome-size analyses, since it generates k-mer frequency histograms for short k-mers only (k ≤ 19). Kmergenie and KAT both produce k-mer histograms and predict the genome size and will be discussed below.
3.3 Genome Size Prediction Using K-mer Frequency Histograms
The formula
A widely used method for estimating the genome size remains application of the formula LG = NK-total/CK, which can be performed in an Excel (or similar) spreadsheet. In most cases this is not very complicated. When using a two-column histogram, first the total number of k-mers per k-mer coverage group is calculated by multiplying the k-mer coverage (value in column 1) by the
Fig. 11 Peak calling for diploid, triploid and tetraploid plant species. DM, M6, RH: diploid potato; SJ autotriploid potato, ST autotetraploid potato, GB allotetraploid cotton
corresponding number of unique k-mers (value in column 2). NK-total is the sum of all the products. Next, the trajectory of the unique k-mer frequency curve is investigated to identify the valley(s) and peak(s) of the histogram. The lowest point of the first valley is the point of overlap between the error peak and the first visible biological peak. At high sequencing depth, nearly all k-mers from the k-mer coverage groups at or below that point are likely error k-mers; their total number is usually subtracted from NK-total. The next step is to identify the average k-mer coverage of the largest biological peak. If the peak is more or less evenly bell-shaped (even if it overlaps with other peaks), its mean will be the k-mer coverage value that corresponds to the highest point of the peak (i.e., the coverage with the highest number of unique k-mers); it should be located at the center of the bell. The next step is peak calling, i.e., determining whether the average k-mer coverage represents CK, ½CK, ¼CK, etc., which should be based on the theoretical ploidy-specific number of peaks as explained above (Fig. 11; see Note 6). Provided that k-mer histograms were generated from a high-quality dataset of appropriate sequencing depth (min. 20-40x of the holoploid genome; see Note 2), the following guidelines apply: (1) In diploid organisms, the heterozygous peak may be just a small bump on the left shoulder of the homozygous peak, but the homozygous peak is always well defined. A small bump to the right of a single peak represents repeats. (2) In autopolyploid genomes, any of the peaks may be dominant, and the histogram usually resembles a rather uneven mountain consisting of the coalescent peaks. Peak 1 is often quite large, since single copy mutations can appear at random on
any of the homologous genome copies. This results in substantial overlap between peak 1 and the error peak, i.e., the first valley will be rather high even at substantial sequencing depth. (3) In allopolyploid genomes, the odd-numbered peaks (i.e., peaks 1, 3, 5) may be inconspicuous next to the dominating even-numbered peaks (peaks 2, 4, 6). Specifically, peaks 3 and 5 are rarely if ever seen: this would require substantial k-mer numbers, which is contradictory to the genomic composition of these organisms. As a result, the even-numbered peaks show better resolution. After peak calling, CK is calculated by dividing the estimated average k-mer coverage by the peak-specific fraction. For example, when using peak 1 of a triploid genome, the estimated k-mer coverage represents 1/3CK, i.e., CK = (1/3CK) ÷ (1/3); when using peak 2 of a tetraploid genome, the estimated k-mer coverage represents 2/4CK, i.e., CK = (2/4CK) ÷ (2/4); when using peak 3 of a tetraploid genome, the estimated k-mer coverage represents 3/4CK, i.e., CK = (3/4CK) ÷ (3/4). Once CK is known, the holoploid (1C) genome size can be calculated using the formula LG = NK-total / CK. A number of programs for k-mer-based genome size estimation have been developed. When the sequencing data are of insufficient depth or quality to produce well-defined bell-shaped peaks, the peaks can be modeled by fitting diverse statistical distributions to the actual data. The mean(s) of the distribution(s) can then be calculated using the respective mathematical functions to determine CK. Each unique k-mer, as well as the different k-mer coverage groups, represent discrete random variables. Therefore, many genome size prediction frameworks model the k-mer peaks using a Poisson distribution (a discrete probability distribution) to calculate CK. The Poisson distribution implies that the variance of the data is equal to the mean. 
This is usually not the case with real-life sequencing data, where sequencing errors and biases, repeats, and additional peaks add variance. To improve the fit of the model to the real-life k-mer peaks, several programs employ negative binomial distributions, which account for a larger variance. Below, I will introduce the main programs published to date. As with any program, the results are only as good as the input data, and outputs must be interpreted critically, particularly when it comes to peak calling (see Note 7). To avoid error propagation, command lines are only provided for those programs that generated acceptable genome size estimates for the test datasets in this study.
GCE (2013)
GCE was one of the first programs developed for automated genome size prediction [29]. It estimates CK based on Poisson distribution(s), fitted to the biological peak(s) of the histogram. The program requires looking at the histogram before running your analyses: you need to determine the total number of
k-mers, decide whether your data represent a homozygous or heterozygous genome, and if you decide on the latter, you must provide the approximate average k-mer coverage value for the homozygous peak. Although this implies some manual investigation, it significantly improves understanding of your data, precision of the genome size estimate, and interpretation of the results. The program can also analyze polyploid genomes, but tends to underestimate their genome sizes: if the average k-mer coverage value for the leftmost biological peak is provided, the polyploid genome size will be calculated. The holoploid (1C) genome size will be half that value, and the monoploid (1Cx) genome size can then be determined by dividing that number by the ploidy level.
Homozygous mode:
gce -f input.histo -g # > output.table 2> output.log
Heterozygous and polyploid genomes:
gce -f input.histo -g # -H 1 -c # > output.table 2> output.log
Here, -f is the input histogram file with two columns: k-mer coverage and number of unique k-mers; -g is the total number of k-mers; -H activates heterozygous genome analyses; and -c is the approximate average coverage value for the chosen peak.
KSA (2013)
The kmerspectrumanalyzer [48] estimates the genome size by fitting over-dispersed Poisson (negative binomial) distributions to the biological peaks of the histogram using maximum likelihood analyses. It ignores k-mers from the error peak, identifies the first biological (principal) peak, and models up to 29 additional peaks at integer multiples of the principal peak to account for repeats. The program was developed and shown to work with bacterial genomes (Escherichia coli). When tested with default parameters using the plant datasets, the results were far outside the range of the respective flow cytometry genome size estimates. This is most likely associated with the very different structures of these genomes (heterozygosity, polyploidy, repeats).
Kmergenie (2013)
Kmergenie was originally developed to identify the best k-mer length for de Bruijn graph-based genome assembly using short read sequencing data, but it can also estimate genome size [49]. The program aims to identify the k-mer length that generates the highest number of non-erroneous unique k-mers from a given dataset. It uses fastq or fasta files as input and generates approximate histogram files over a range of k-mer lengths specified by the user. Since the histograms are generated through k-mer sampling, they are conspicuously uneven at higher k-mer
K-Mer-Based Genome Size Estimation
coverage levels. Although the sampling threshold can be increased, several rounds of testing would be required to determine optimal parameter settings for generating acceptable histograms for a given dataset. Moreover, the maximum k-mer coverage value is 10,000x, which is too low for repeat-rich genomes. This explains why the program consistently underestimated genome sizes of the plant datasets used in this study.
BBMap tools (2014)
Diverse programs from the bbmap toolbox (e.g., kmercountexact.sh and khist.sh) invoke estimation of genome size when asked to provide the “.peaks” output file. The approach for genome size estimation differs from that in other programs: there is no modeling of peaks. Instead, the log-log-scaled histogram is inspected for local maxima and adjacent local minima to identify the peaks. The unique k-mers of that region are summed up to provide the k-mer volume per peak. By default, the peak with the largest number of unique k-mers is considered to represent the homozygous peak and its average is used as CK. Peaks at integer multiples to the left of this one are considered heterozygous peaks (i.e., 1/2CK, 1/3CK, 1/4CK, etc.) and are used to calculate the ploidy level; peaks at integer multiples to the right are considered to represent repeats (2CK, 3CK, etc.). The total genome size is estimated by calculating the total number of bases under each peak (which is the total number of k-mers, multiplied by the respective copy number that this peak represents) and dividing it by the ploidy level. Consequently, in most cases the holoploid (1C) genome size should be reported. However, overlapping, somewhat arrhythmic or small peaks may be ignored, which affects calculation of ploidy levels and identification of the homozygous peak. This may lead to x-fold differences in genome size estimates, and it is always advisable to verify which one (the monoploid, holoploid, or polyploid genome size) has been calculated.
Nonetheless, for the plant genome datasets tested in this study, BBNorm was found to be one of the most accurate genome size prediction programs. To produce the “.peaks” file with the genome size estimates, kmercountexact.sh and khist.sh can be run as follows:
kmercountexact.sh -Xmx80g in=input.fastq k=21 histmax=2000000 histcolumns=2 histheader=f khist=output.hist peaks=output.peaks
khist.sh -Xmx80g in=input.fastq k=21 histlen=2000000 hist=output.hist peaks=output.peaks
If the ploidy is set using the “ploidy=” option, it will be used to calculate the haploid genome size, but manual inspection of the histograms is still essential to verify correct peak calling.
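The peak-calling principle described above (local maxima flanked by local minima, without fitting any distribution) can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions and not the BBTools implementation: the function name call_peaks, the direct neighbour comparison without smoothing or log scaling, and the toy histogram are all hypothetical.

```python
# Simplified illustration of peak calling by local maxima and minima.
# NOT the BBTools algorithm; function name and toy data are hypothetical.

def call_peaks(hist):
    """hist: list of (k-mer coverage, number of unique k-mers), sorted by coverage.

    Returns one dict per biological peak with its coverage, the flanking
    valleys, and the number of unique k-mers between those valleys."""
    cov = [c for c, _ in hist]
    n = [v for _, v in hist]
    inner = range(1, len(n) - 1)
    # A valley is lower than its left neighbour and not higher than its right one.
    valleys = [i for i in inner if n[i] < n[i - 1] and n[i] <= n[i + 1]]
    # A peak is higher than its left neighbour and not lower than its right one.
    maxima = [i for i in inner if n[i] > n[i - 1] and n[i] >= n[i + 1]]
    peaks = []
    for m in maxima:
        left = max((v for v in valleys if v < m), default=None)
        if left is None:
            continue  # no valley to the left: still part of the error peak
        right = min((v for v in valleys if v > m), default=len(n))
        peaks.append({
            "coverage": cov[m],
            "left_valley": cov[left],
            "right_valley": cov[right] if right < len(n) else None,
            "unique_kmers": sum(n[left:right]),  # k-mer volume of this peak
        })
    return peaks


# Toy histogram: error peak at low coverage, then two biological peaks.
toy_counts = [900, 300, 80, 20, 30, 60, 100, 60, 30, 12, 20, 45, 70, 110, 70, 40, 15, 5]
toy_hist = list(zip(range(1, len(toy_counts) + 1), toy_counts))

peaks = call_peaks(toy_hist)
# As in the text: the peak with the most unique k-mers is taken as homozygous.
homozygous = max(peaks, key=lambda p: p["unique_kmers"])
for p in peaks:
    print(p)
print("assumed homozygous peak (CK):", homozygous["coverage"])
```

In the toy data, the smaller peak sits at half the coverage of the peak with the largest k-mer volume, which is the pattern that would be read as a heterozygous peak. Real histograms are noisy, so any such result still needs to be checked against the plotted histogram.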
CovEST (2015)
In the above programs, all low-coverage k-mers are considered erroneous and the corresponding k-mer groups are excluded from subsequent data analysis. Consequently, these programs require substantial sequencing depth for genome size estimation (min. 10x coverage of the holoploid genome). Moreover, at low genome coverage, a high number of valid genomic k-mers are discarded together with the error k-mers, which decreases the amount of data available for modeling the peaks. CovEST aims to address low genome coverage and high proportions of sequencing errors in the datasets by including error and repeat rates as model parameters when estimating CK [50]. The parameters are estimated after fitting a mixture of truncated Poisson distributions to the peaks of the histogram using maximum likelihood analyses. In 2018, the two main models (the “full error model” invoked by the “basic” mode and the “repeats and error model” invoked by the “repeats” mode) were extended to account for sequence polymorphisms associated with heterozygosity in diploid organisms [51]. It is important to note that depending on the mode, the first or the second biological peak may be used for genome size estimation of diploid species, yielding either the polyploid (2C) or the monoploid (1Cx) genome size. When analyzing polyploid species, the polyploid genome size will be produced and the 1C and 1Cx genome sizes must be calculated separately. Essential command line options include the model (-m), and the parameters for k-mer length (-k) and read length (-r):
covest -m repeat -k # -r # input.hist > output.o 2> error.e
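Why low coverage is problematic for programs that discard low-coverage k-mers can be illustrated with a small Python sketch. It assumes an idealised single-copy, error-free genome with Poisson-distributed k-mer counts, and it uses the standard relation between read coverage C and k-mer coverage, CK = C × (L − k + 1) / L, where L is the read length (which is also why both k-mer length and read length matter for coverage estimates). The function names and example values are hypothetical, and the sketch is not the CovEST model.

```python
import math

def kmer_coverage(read_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads: CK = C * (L - k + 1) / L."""
    return read_coverage * (read_length - k + 1) / read_length

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def fraction_discarded(lam, cutoff):
    """Among genomic k-mers observed at least once, the fraction whose count
    is <= cutoff and would be discarded together with the error k-mers."""
    low = sum(poisson_pmf(x, lam) for x in range(1, cutoff + 1))
    return low / (1.0 - poisson_pmf(0, lam))

# Hypothetical example: 150 bp reads, k = 21, k-mers seen only once are discarded.
for read_cov in (2, 10, 40):
    ck = kmer_coverage(read_cov, read_length=150, k=21)
    lost = fraction_discarded(ck, cutoff=1)
    print(f"{read_cov}x reads -> CK = {ck:.2f}, valid k-mers discarded: {lost:.2%}")
```

Under these assumptions, a large share of valid genomic k-mers falls below the cutoff at very low coverage, whereas the loss becomes negligible at high coverage.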
FindGSE (2018) FindGSE estimates genome size through iteratively fitting k-mer frequencies with a skew normal distribution model [52]. Skew normal distribution models allow independent skewing of the left and right side of the distribution, which permits model adjustment to account for sequencing biases. By default, the program assumes that the genome is homozygous and will estimate the genome size based on the first biological peak. For diploid heterozygous genomes, the approximate CK value for the homozygous peak must be provided to improve accuracy of genome size prediction. The value must be submitted using the options exp_hom and should fulfill the condition CK < value output.o 2> error.e
GenomeScope 2.0 was published in 2020 [54]. It was the first program that targeted estimation of genome sizes in polyploid organisms. To account for the higher ploidy levels, the model was expanded to fit two negative binomial distributions per ploidy level. The program requires substantial sequencing depths that provide sufficient peak resolution. Still, the output may at times be difficult to interpret: in most cases, the monoploid (1Cx) genome size is estimated, but, e.g., for the homozygous potato genome ½ of 1Cx
was consistently reported. Hence, for homozygous genomes, the original version of GenomeScope appears to be more appropriate. GenomeScope 2.0 can also be run on a web server (http://qb.cshl.edu/genomescope/genomescope2.0/) or via the command line using R. It requires setting the ploidy level (-p), the k-mer length (-k), the maximum k-mer coverage (-m), and the path to the output directory that it will automatically create. For lower coverage datasets (average read coverage below 50x of the holoploid genome), activating the option --num_rounds 1 can improve estimation results.
genomescope.r -i input.hist -o /path-to-output-directory/ -p 2 -k 13 -m 2000000 --num_rounds 1
RESPECT (2021)
Recent advances in sequencing technologies and the resulting drop in costs have jumpstarted analyses of diverse genomes from non-model organisms. Some of them are very large and/or have high levels of ploidy, and high genome coverage is difficult to achieve. However, even low-coverage sequencing data can be used to conduct a wide range of biologically relevant analyses, such as assembly of organelle genomes, repeat mining, and DNA-based taxonomic classification [55]. These so-called genome skim datasets usually comprise only 0.5–2 Gbp and commonly amount to less than 5x genome coverage. The program RESPECT [9] uses k-mer histograms from such datasets to calculate the ratios of observed k-mer counts and then estimate k-mer repeat spectra. It optimizes the k-mer repeat spectra based on empirical estimates of k-mer ratios derived from diverse fully assembled genomes. These estimated spectra are then used to determine genomic parameters, including genome length, average genome coverage, and repeat content. For best results, the authors advise preprocessing the data, i.e., conducting adapter and quality trimming, removing duplicated reads, and filtering out contaminating sequences (e.g., technical sequences, reads from other organisms, etc.). The program can then be run using the following command line specifying just the k-mer length (-k):
respect -i input.fastq -o /path-to-output-directory/ -k 21
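Whether a dataset falls into the genome skim range can be checked with simple arithmetic before running the program. The following Python sketch assumes that a rough prior value for the holoploid genome size is available (e.g., a flow cytometry value from the same or a related species); the function names and example numbers are hypothetical and not part of RESPECT.

```python
def genome_coverage(total_bases, genome_size):
    """Average genome coverage = sequenced bases / assumed genome size."""
    return total_bases / genome_size

def subsample_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so that coverage does not exceed the target.

    Returns 1.0 if the dataset is already at or below the target coverage."""
    coverage = genome_coverage(total_bases, genome_size)
    return min(1.0, target_coverage / coverage)

# Hypothetical example: datasets of different size, assumed 1C genome size 867 Mbp.
assumed_1c = 867e6
for total_bases in (0.5e9, 2e9, 40e9):
    cov = genome_coverage(total_bases, assumed_1c)
    keep = subsample_fraction(total_bases, assumed_1c, target_coverage=1.0)
    print(f"{total_bases / 1e9:.1f} Gbp -> {cov:.1f}x coverage, "
          f"keep {keep:.1%} of reads for 1x")
```

The resulting fraction could then be passed to a read subsampling tool of choice; the assumed genome size only needs to be approximate for this purpose.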
In this study, several real-life datasets from plants were analyzed to visualize the effects of diverse factors (genome complexity, dataset characteristics, analysis methods, and parameter settings) that affect k-mer-based genome size estimation (Table 3; see Notes 8–12). These analyses are not meant to compare the different programs, which should be done using a much larger and more diverse number of datasets. However, they do show that relatively accurate k-mer-based genome size estimation is possible even for biologically complex genomes. Main pitfalls in k-mer-based genome size analyses include: (1) truncating the histograms; (2) incorrect interpretation of the genome size estimates generated by a program, which may represent the mono-, holo-, or polyploid genome size; and (3) choosing a program not designed to analyze low coverage and/or biologically complex data. Information on the ploidy level of the organism and the samples used for sequencing and flow cytometry analyses is crucial for result interpretation, and the possibility of unintended endopolyploidy should be taken into consideration when the computational and laboratorial genome size estimates do not align despite following best practice procedures.

Table 3 K-mer-based genome size estimates for test datasets analyzed in this study

Test datasets and flow cytometry (FCM) genome sizes
Dataset | Species | Ploidy | Notes | Polyploid genome size (2C) FCM | Holoploid genome size (1C) FCM | Monoploid genome size (1Cx) FCM
DM | S. tuberosum | 2 | homozygous | 1734 Mbp | 867 Mbp | 867 Mbp
RH | S. tuberosum | 2 | heterozygous | 1734 Mbp | 867 Mbp | 867 Mbp
SJ | S. juzepczukii | 3 | autopolyploid | 2932 Mbp | 1466 Mbp | 977 Mbp
ST | S. tuberosum | 4 | autopolyploid | 3244-3782 Mbp | 1622-1891 Mbp | 811-946 Mbp
SC | S. curtilobum | 5 | autopolyploid | 5023 Mbp | 2512 Mbp | 1005 Mbp
CB | G. barbadense | 4 | allopolyploid | ≈4908 Mbp | ≈2450 Mbp | ≈1227 Mbp
EE | E. esula | 6 | autoallopolyploid | 4.2 Gbp | 2.1 Gbp | 700 Mbp

K-mer-based estimates (Mbp; predicted genome size type in parentheses)
Dataset | Formula* (using the 1st peak) | GCE* | BBNorm* | CovEST* | FindGSE* | GenomeScope v1 (2017)* | GenomeScope v2 (2020)* | RESPECT
DM | 877 (1C) | 853 (1C) | 871 (1C) | 538 (bad estimate) | 857 (1C) | 843 (1C) | 421 (1/2Cx) | 845** (1C = 1Cx value)
RH | 1718 (2C) | 825 (1C) | 1637 (2C) | 858 (1C) | 816 (1C) | 778 (1C) | 781 (1Cx) | 977** (1C = 1Cx value)
SJ | 2226 (bad estimate) | 709 (1C) | 2256 (bad estimate) | 2779 (2C) | 2296 (bad estimate) | 1944 (bad estimate) | 698 (1Cx) | 1900*** (2Cx value)
ST | 3353 (2C) | 882 (1C) | 3445 (2C) | 2218 (bad estimate) | 3758 (2C) | 725 (1C) | 752 (1Cx) | 1694*** (1C = 2Cx value)
SC | no estimate (1) | no estimate (1) | no estimate (1) | no estimate (1) | no estimate (1) | no estimate (1) | no estimate (1) | 2330*** (2Cx value)
CB | 2331 (1C) | NA | 2367 (1C) | NA | NA | NA | 567 (1/2Cx) | 2167*** (1C = 2Cx value)
EE | 3492 (bad estimate) | NA | 3948 (2C) | NA | NA | NA | 688 (1Cx) | 1459*** (2Cx value)

(1) insufficient data
* calculated using 40x holoploid genome coverage data
** best values obtained using 1x holoploid genome coverage data
*** best values obtained using 2x holoploid genome coverage data
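The peak-based calculation of CK and the conversions between polyploid (2C), holoploid (1C), and monoploid (1Cx) genome sizes discussed in this section can be sketched in Python as follows. The function names and the total k-mer count are hypothetical, and the ploidy conversions rest on the definitions used throughout (1C is half of 2C; 1Cx is 2C divided by the ploidy level). Which genome size a k-mer-based estimate actually represents still has to be established by inspecting the histogram.

```python
def ck_from_peak(peak_coverage, peak_number, ploidy):
    """CK from the average coverage of a biological peak.

    The peak represents the fraction peak_number / ploidy of CK, so
    CK = peak coverage / (peak_number / ploidy)."""
    return peak_coverage * ploidy / peak_number

def genome_size(total_kmers, ck):
    """Formula from the text: LG = NK-total / CK."""
    return total_kmers / ck

def sizes_from_2c(size_2c, ploidy):
    """Derive the holoploid (1C) and monoploid (1Cx) sizes from the 2C value."""
    return {"2C": size_2c, "1C": size_2c / 2, "1Cx": size_2c / ploidy}

# Peak examples from the text (the coverage values themselves are hypothetical).
print(ck_from_peak(10, peak_number=1, ploidy=3))  # peak 1 of a triploid
print(ck_from_peak(20, peak_number=2, ploidy=4))  # peak 2 of a tetraploid
print(ck_from_peak(30, peak_number=3, ploidy=4))  # peak 3 of a tetraploid

# Hypothetical total k-mer count divided by CK = 40.
print(genome_size(34_680_000_000, 40))

# Flow cytometry 2C value (Mbp) and ploidy of dataset SJ from Table 3.
print(sizes_from_2c(2932, ploidy=3))
```

Applying sizes_from_2c to the flow cytometry values in Table 3 reproduces the listed 1C and 1Cx values after rounding, which may serve as a quick consistency check when comparing program outputs with laboratory data.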
4
Notes
1. Data quality should be verified by default since it may affect the genome size estimate. This can be easily achieved using the program FastQC: the “per base sequence quality,” “overrepresented sequences,” and “adapter content” tabs will show whether quality and/or adapter trimming is necessary; the “per base sequence content” can visualize sequencing biases (variability in the first 5–10 bp can be ignored); and the “sequence duplication levels” tab provides information on whether deduplication of the dataset should be considered.
2. Accurate genome size prediction usually requires substantial sequencing depth (min. 20x coverage of the holoploid (1C) genome for diploid species, min. 40x for higher ploidy levels). The only program that can handle low-coverage (skim) datasets is RESPECT. Here, the data must not exceed 5x genome coverage. Avoid mixing datasets from different genotypes, even if they belong to the same species. If the genome coverage is very low and mixed data have to be used, first estimate the genome size using RESPECT on selected datasets, then combine the data and use the formula or another program to verify the result.
3. Before joining files from different datasets (e.g., forward and reverse reads; files from different sequencing runs), it is advisable to inspect the histograms of each file separately to verify that none of them are biased. Troubleshooting (Fig. 12): (1) If you have lots of data but cannot visualize a peak, you probably have a high proportion of duplicated reads (check your FastQC report). You can try deduplicating the dataset, but in most cases, it will not be suitable for genome size estimation (even when using RESPECT). (2) If your k-mer curve follows an unexpected trajectory (e.g., it has additional
Fig. 12 K-mer frequency histograms of high-quality and biased sequencing data generated from short-insert WGS libraries (oyster genome project [58]). (a) K-mer curves of raw forward (F) and reverse (R) read datasets from three sequencing runs. The k-mer curves for the datasets SRR943134 follow the expected trend. The k-mer curves of SRR943128 remained flat although the amount of sequencing data was similar in all three datasets. For this dataset, FastQC analysis disclosed a very high proportion of duplicated reads (b–e). The k-mer curves of SRR943126 showed sequencing biases (trajectories for F and R datasets differed substantially, the valley between the error and the first biological peak was very high, the tail to the right of the peak was “fat”). For this run, the FastQC results indicated substantially lower read quality, and taxonomic analysis of the reads showed a higher proportion of unidentified reads than in run SRR943134 (f, g)
peaks between the error and the first biological peaks or the tail to the right is surprisingly “fat”), you may have sequencing biases caused by (a) biological or technical contaminants, or (b) sequencing library- or run-specific problems. A high proportion of taxonomically unidentified reads for NCBI’s SRA datasets (provided on the “Analysis” tab for each sequencing run) may be a first indicator for contaminants. Besides rigorous quality filtering, consider merging read pairs or using only one of the read pair files to avoid generating k-mers from overlapping read pairs. Keep in mind that the reverse reads are often of poorer quality than the forward ones.
4. For genome size estimation, k-mer lengths between 17 bp and 25 bp should be chosen. For very heterozygous and polyploid species, the lower k-mer values (17 bp or 19 bp) may improve peak resolution.
5. When generating k-mer histograms, the maximum k-mer coverage value should be adjusted based on the study organism. This threshold is defined by the presence of high copy number repeats in the genome. For example, the As-16mer43bp repeat, which was found in the genome of oats over three million times [33], makes up more than 180 Mbp of that 12.6 Gbp
genome (1.45%). To account for this repeat, the maximum k-mer coverage must be set to at least 3.5 million. Previous studies argued against large maximum k-mer coverage thresholds, as this would result in counting k-mers from the genomes of the cell organelles (plastids and/or mitochondria). Their DNA is usually part of the total DNA sample and may generate a substantial number of k-mers since these organelles are often present in multiple copies per cell. However, for repeat-rich genomes, this bias appears to be negligible. A truncated histogram is identified by a large value in the last cell of the unique k-mer counts column (column 2; e.g., Fig. 10: 236653). Increase the maximum k-mer coverage threshold to reduce this value to a low double-digit number.
6. Knowledge of the ploidy level of the organism is important for peak calling and accurate genome size estimations. If the ploidy level has not been established for the organism in question, it can sometimes be approximated by looking at the ploidy and/or genome size values of closely related species in diverse databases (e.g., the Chromosome Count Database [56] and the genome size databases mentioned above). Ploidy levels can also be estimated using the program smudgeplot [54], which, however, requires substantial genome coverage and computational power for the analysis, as well as expertise in result interpretation. To some degree, the ploidy level can be approximated from the histogram plots, taking into consideration the form of the curve and the amount of sequencing data. If in doubt, laboratorial verification of the ploidy values and genome sizes is advisable, particularly for species with highly variable ploidy levels such as potato [57].
7. For all programs, peak calling must be verified manually by inspecting the histograms. The result may represent the mono-, holo-, or polyploid genome size, or even fractions thereof, i.e., the 1C genome size may have to be calculated separately.
It is advisable to estimate the genome size using several different approaches. If the outputs from various k-mer analysis methods differ substantially from each other and/or laboratorial results, review data quality and peak calling, and ask program developers for assistance in result interpretation.
8. The formula works well for diploid species, but less so for polyploid species.
9. The early modeling programs (KSA, CovEST) were not designed to estimate genome sizes of polyploid organisms and will underestimate their genome sizes. GCE, BB-tools, and FindGSE will perform well, as long as information on a biological peak is provided.
10. When using GenomeScope, data from diploid species should be analyzed using the first version (2017), and data from polyploid species should be analyzed using the second one (2020). In both versions, the maximum k-mer coverage value must be set to fit the species as the default value of 10,000 is often too low. Make sure the histogram itself is not truncated.
11. The program RESPECT accurately predicted the 1C genome size of diploid organisms using just 1x holoploid genome coverage data. For polyploid organisms, 2x holoploid genome coverage data gave better results. Most importantly, for polyploid genomes RESPECT consistently calculated the 2Cx genome size, i.e., the 1C genome size must usually be calculated separately for all but tetraploid plant species (in tetraploids 1C = 2Cx).
12. It appears that all programs may have difficulties predicting the genome sizes of polyploid genomes with odd ploidy levels (triploid and pentaploid potato species).
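The check for truncated histograms described in Note 5 can be automated with a few lines of Python. This is a minimal sketch that assumes a whitespace-separated two-column histogram (k-mer coverage, number of unique k-mers) in which the last row accumulates all k-mers at or above the maximum coverage; the function names, the example rows, and the cutoff of 50 (one possible reading of "a low double-digit number") are hypothetical.

```python
import io

def read_histogram(lines):
    """Parse a two-column histogram: k-mer coverage, number of unique k-mers."""
    hist = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip empty lines and headers
        hist.append((int(fields[0]), int(fields[1])))
    return hist

def is_truncated(hist, max_last_bin=50):
    """True if the last row holds more unique k-mers than max_last_bin,
    i.e., many high-coverage k-mers were lumped into the final bin."""
    return hist[-1][1] > max_last_bin

# Hypothetical tail of a histogram generated with a maximum coverage of 10,000;
# in practice, pass an open file handle instead of the StringIO object.
example = io.StringIO("9998 3\n9999 2\n10000 236653\n")
hist = read_histogram(example)
if is_truncated(hist):
    print(f"Histogram truncated at coverage {hist[-1][0]}: "
          f"{hist[-1][1]} unique k-mers in the last bin; "
          "increase the maximum k-mer coverage and recount.")
```

Running this check on every histogram before genome size estimation is an inexpensive way to avoid the first pitfall named above.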
Acknowledgments
All biocomputational analyses were conducted at the Centre for High Performance Computing (CHPC, Cape Town, South Africa). I would like to sincerely acknowledge Brian Bushnell for advice on peak calling of tetraploid species and Rei Kajitani for kindly providing the data for the inset of Fig. 5b.
References
1. Bennett MD, Leitch IJ (2005) Genome size evolution in plants. In: The evolution of the genome. Academic, pp 89–162
2. Gregory TR (2005) Genome size evolution in animals. In: The evolution of the genome. Academic, pp 3–87
3. Kullman B, Tamm H, Kullman K (2005) Fungal Genome Size Database
4. Pellicer J, Leitch IJ (2020) The plant DNA C-values database (release 71): an updated online repository of plant genome size data for comparative studies. New Phytol 226(2):301–305
5. Gregory TR (2021) Animal Genome Size Database. http://www.genomesize.com
6. Blommaert J (2020) Genome size evolution: towards new model systems for old questions. Proc R Soc B 287(1933):20201441
7. Manekar SC, Sathe SR (2018) A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience 7(12):giy125
8. Reynolds G, Strnadova-Neeley V, Lachowiec J (2021) MinHash k-mer sketching highlights allopolyploid subgenome sequence differentiation. In: ISCB-Africa ASBCB. https://glfrey.github.io/files/Gillian_Reynolds_ISCB2020.pdf
9. Sarmashghi S, Balaban M, Rachtman E, Touri B, Mirarab S, Bafna V (2021) Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. PLoS Comput Biol 17(11):e1009449
10. Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, Puiu D, Roberts M, Wegrzyn JL, de Jong PJ, Neale DB et al (2014) Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics 196(3):875–890
11. Wang K, Wang J, Zhu C, Yang L, Ren Y, Ruan J, Fan G, Hu J, Xu W, Bi X, Zhu Y et al (2021) African lungfish genome sheds light on the vertebrate water-to-land transition. Cell 184(5):1362–1376
12. Greilhuber J, Doležel J, Lysák MA, Bennett MD (2005) The origin, evolution and proposed stabilization of the terms ‘genome size’ and ‘C-value’ to describe nuclear DNA contents. Ann Bot 95(1):255–260
13. Leisner CP, Hamilton JP, Crisovan E, Manrique-Carpintero NC, Marand AP, Newton L, Pham GM, Jiang J, Douches DS, Jansky SH, Buell CR (2018) Genome sequence of M6, a diploid inbred clone of the high-glycoalkaloid-producing tuber-bearing potato species Solanum chacoense, reveals residual heterozygosity. Plant J 94(3):562–570
14. Graebner RC, Chen H, Contreras RN, Haynes KG, Sathuvalli V (2019) Identification of the high frequency of triploid potato resulting from tetraploid × diploid crosses. HortScience 54(7):1159–1163
15. Hendrix B, Stewart JM (2005) Estimation of the nuclear DNA content of Gossypium species. Ann Bot 95(5):789–797
16. Chao WS, Horvath DP, Anderson JV, Foley ME (2005) Potential model weeds to study genomics, ecology, and physiology in the 21st century. Weed Sci 53(6):929–937
17. Pham GM, Hamilton JP, Wood JC, Burke JT, Zhao H, Vaillancourt B, Ou S, Jiang J, Buell CR (2020) Construction of a chromosome-scale long-read reference genome assembly for potato. GigaScience 9(9):giaa100
18. Zhou Q, Tang D, Huang W, Yang Z, Zhang Y, Hamilton JP, Visser RG, Bachem CW, Robin Buell C, Zhang Z, Zhang C et al (2020) Haplotype-resolved genome analyses of a heterozygous diploid potato. Nat Genet 52(10):1018–1023
19. Kyriakidou M, Anglin NL, Ellis D, Tai HH, Strömvik MV (2020) Genome assembly of six polyploid potato genomes. Sci Data 7(1):1–6
20. Sun H, Jiao WB, Krause K, Campoy JA, Goel M, Folz-Donahue K, Kukat C, Huettel B, Schneeberger K (2021) Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar. bioRxiv
21. Wang M, Tu L, Yuan D, Zhu D, Shen C, Li J, Liu F, Pei L, Wang P, Zhao G, Ye Z et al (2019) Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat Genet 51(2):224–229
22. Horvath DP, Patel S, Doğramaci M, Chao WS, Anderson JV, Foley ME, Scheffler B, Lazo G, Dorn K, Yan C, Childers A, Schatz M, Marcus S (2018) Gene space and transcriptome assemblies of leafy spurge (Euphorbia esula) identify promoter sequences, repetitive elements, high-quality markers, and a full-length chloroplast genome. Weed Sci 66(3):355–367
23. Hardigan MA, Laimbeer FPE, Newton L, Crisovan E, Hamilton JP, Vaillancourt B, Wiegert-Rininger K, Wood JC, Douches DS, Farré EM, Veilleux RE, Buell CR (2017) Genome diversity of tuber-bearing Solanum uncovers complex evolutionary history and targets of domestication in the cultivated potato. Proc Natl Acad Sci 114(46):E9999–E10008
24. Li X, Waterman MS (2003) Estimating the repeat structure and length of DNA sequences using ℓ-tuples. Genome Res 13(8):1916–1922
25. Zhao Z, Ng YK, Fang X, Li S (2016) Eliminating heterozygosity from reads through coverage normalization. In: IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 174–177
26. PGS Consortium (2011) Genome sequence and analysis of the tuber crop potato. Nature 475(7355):189–195
27. Stoler N, Nekrutenko A (2021) Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform 3(1):lqab019
28. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB (2013) Characterizing and measuring bias in sequence data. Genome Biol 14(5):1–20
29. Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W (2013) Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv preprint arXiv:1308.2012
30. Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, Itoh T (2014) Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res 24(8):1384–1395
31. Stevens KA, Woeste K, Chakraborty S, Crepeau MW, Leslie CA, Martínez-García PJ, Puiu D, Romero-Severson J, Coggeshall M, Dandekar AM, Kluepfel D, Neale DB, Salzberg SL, Langley CH (2018) Genomic variation among and within six Juglans species. G3: Genes, Genomes, Genetics 8(7):2153–2165
32. Biscotti MA, Olmo E, Heslop-Harrison JP (2015) Repetitive DNA in eukaryotic genomes. Chromosom Res 23(3):415–420
33. Liu Q, Li X, Zhou X, Li M, Zhang F, Schwarzacher T, Heslop-Harrison JS (2019) The repetitive DNA landscape in Avena (Poaceae): chromosome and genome evolution defined by major repeat classes in whole-genome sequence reads. BMC Plant Biol 19(1):1–17
34. Li G, Wang L, Yang J, He H, Jin H, Li X, Ren T, Ren Z, Li F, Han X, Zhao X et al (2021) A high-quality genome assembly highlights rye genomic characteristics and agronomically important genes. Nat Genet 53(4):574–584
35. Zhu L, Wu H, Li H, Tang H, Zhang L, Xu H, Jiao F, Wang N, Yang L (2021) Short tandem repeats in plants: genomic distribution and function prediction. Electron J Biotechnol 50:37–44
36. Wang H, Liu B, Zhang Y, Jiang F, Ren Y, Yin L, Liu H, Wang S, Fan W (2020) Estimation of genome size using k-mer frequencies from corrected long reads. arXiv preprint arXiv:2003.11817
37. SRA toolkit: https://hpc.nih.gov/apps/sratoolkit.html (SRA Toolkit Development Team)
38. BB-tools: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/ (Brian Bushnell)
39. BB-tools user guide: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/reformat-guide/
40. Sandhya S, Srivastava H, Kaila T, Tyagi A, Gaikwad K (2020) Methods and tools for plant organelle genome sequencing, assembly, and downstream analysis. In: Legume Genomics. Humana, New York, pp 49–98
41. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2015)
42. Bolger A, Giorgi F (2014) Trimmomatic: a flexible read trimming tool for Illumina NGS data. Bioinformatics 30(15):2114–2120
43. Song L, Florea L, Langmead B (2014) Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 15(11):1–13
44. Kelley DR, Schatz MC, Salzberg SL (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biol 11(11):1–13
45. Wood DE, Lu J, Langmead B (2019) Improved metagenomic analysis with kraken 2. Genome Biol 20(1):1–13
46. Marçais G, Kingsford C (2012) Jellyfish: a fast k-mer counter. Tutorialis e Manuais 1:1–8
47. Kokot M, Długosz M, Deorowicz S (2017) KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17):2759–2761
48. Williams D, Trimble WL, Shilts M, Meyer F, Ochman H (2013) Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC Genomics 14(1):1–11
49. Chikhi R, Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1):31–37
50. Hozza M, Vinař T, Brejová B (2015) How big is that genome? Estimating genome size and coverage from k-mer abundance spectra. In: International symposium on string processing and information retrieval. Springer, Cham, pp 199–209
51. Krampl W (2018) Prediction of properties of polymorphic genomes from sequencing data. Diploma Thesis. Comenius University in Bratislava, Slovakia
52. Sun H, Ding J, Piednoël M, Schneeberger K (2018) FindGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34(4):550–557
53. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC (2017) GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33(14):2202–2204
54. Ranallo-Benavidez TR, Jaron KS, Schatz MC (2020) GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11(1):1–10
55. Bohmann K, Mirarab S, Bafna V, Gilbert MTP (2020) Beyond DNA barcoding: the unrealized potential of genome skim data in sample identification. Mol Ecol 29:2521–2534
56. Rice A, Glick L, Abadi S, Einhorn M, Kopelman NM, Salman-Minkov A, Mayzel J, Chay O, Mayrose I (2015) The Chromosome Counts Database (CCDB)–a community resource of plant chromosome numbers. New Phytol 206(1):19–26
57. Berdugo-Cely JA, Martínez-Moncayo C, Lagos-Burbano TC (2021) Genetic analysis of a potato (Solanum tuberosum L) breeding collection for southern Colombia using Single Nucleotide Polymorphism (SNP) markers. PLoS One 16(3):e0248787
58. Zhang G, Fang X, Guo X, Li LI, Luo R, Xu F, Yang P, Zhang L, Wang X, Qi H, Xiong Z et al (2012) The oyster genome reveals stress adaptation and complexity of shell formation. Nature 490(7418):49–54
Chapter 5
A Bioinformatic Pipeline to Estimate Ploidy Level from Target Capture Sequence Data Obtained from Herbarium Specimens
Juan Viruel, Oriane Hidalgo, Lisa Pokorny, Félix Forest, Barbara Gravendeel, Paul Wilkin, and Ilia J. Leitch
Abstract
Whole genome duplications (WGD) are frequent in many plant lineages; however, ploidy level variation is unknown in most species. The most widely used methods to estimate ploidy levels in plants are chromosome counts, which require living specimens, and flow cytometry estimates, which necessitate living or relatively recently collected samples. Newly described bioinformatic methods have been developed to estimate ploidy levels using high-throughput sequencing data, and these have been optimized in plants by calculating allelic ratio values from target capture data. This method relies on the maintenance of allelic ratios from the genome to the sequence data. For example, diploid organisms will generate allelic data in a 1:1 proportion, with an increasing number of possible allelic ratio combinations occurring in individuals with higher ploidy levels. In this chapter, we explain step-by-step this bioinformatic approach for the estimation of ploidy level.
Key words Flow cytometry, Phylogenomics, Polyploidy, Sequence capture, Whole Genome Duplication
Tony Heitkam and Sònia Garcia (eds.), Plant Cytogenetics and Cytogenomics: Methods and Protocols, Methods in Molecular Biology, vol. 2672, https://doi.org/10.1007/978-1-0716-3226-0_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

The ploidy level of a plant has traditionally been estimated by making chromosome counts and/or, more recently, by using flow cytometry approaches, which estimate genome size and assume polyploids will have proportionally larger genomes than their haploid/diploid relatives (e.g., [1–5]). Although widely used and providing meaningful and useful results in most cases of recent polyploid events, these approaches have several limitations. First, chromosome counting requires access to living tissue to obtain cells that are actively dividing and, while estimating ploidy using flow cytometry may be possible using dried material, and can give surprisingly good results (even for samples stored for over 30 years,
e.g., in Cupressaceae, [6, 7]), the applicability of this method is strongly constrained by the quality of the material (e.g., see [6, 8], and references therein). This, unfortunately, limits the use of herbarium material for ploidy screening by flow cytometry, as the success rate is often low (e.g., ploidy level could only be accurately estimated in 5.9% of tested Dioscorea samples collected up to 15 years ago, [8]). Second, inferring ploidy level from flow cytometry data requires previous knowledge of genome size and/or chromosome number in the studied species or its close relatives. Third, the relationship between genome size and ploidy level in different species can be blurred by additional factors, such as changes in the abundance of repetitive elements that may take place following polyploidization (e.g., in Helianthus, [9, 10]). In some cases, even with a wealth of genome size and chromosome count data available, genome restructuring between closely related species may make it impossible to reliably assign ploidy levels based on genome size data alone (e.g., in Echinops, [11, 12]).

To overcome these challenges, bioinformatic tools that estimate ploidy level from high-throughput sequencing (HTS) data are now being developed [8]. In this chapter, we present a pipeline and a list of software required to estimate the ploidy level of plant accessions sequenced using a target capture approach. The methodology, described in [8], detects within-sample allelic multiplications for single nucleotide polymorphism (SNP) positions in polyploids and, therefore, reflects recent rather than ancient polyploidization events.

In diploid organisms, all (single copy) nuclear genes will have two copies per cell, one from each parent (or one copy in a haploid individual).
The number of different copies (alleles) of each of these genes will be proportionately multiplied in polyploids (i.e., three in triploids, four in tetraploids, etc.). At a heterozygous SNP position in a diploid individual, the two alleles will be present in a ratio of one (1:1). However, this allelic ratio will change in a triploid sample to two (2:1). Increasing numbers of allelic ratio combinations will occur in samples with higher ploidy levels, e.g., one (2:2) and three (3:1) in tetraploids (for more details, see the Supplementary Materials in [8]). The number of reads obtained from target capture sequencing data for each allele will therefore reflect these ratios.

Target capture is particularly effective with herbarium material [13, 14], which opens exciting prospects, such as large-scale ploidy screening and ploidy level determination in species for which suitable material for karyological and cytogenetic techniques is not available. Furthermore, as our approach is based on gene copy number, the method described here does not require prior ploidy information and is not impacted by processes such as the expansion of repetitive elements, unlike ploidy estimations based on genome size estimates from flow cytometry. Nevertheless, one
should be aware that the method presented below does come with its own limitations: (i) first or early generation autopolyploids will behave like diploids (e.g., all allelic ratios in the first generation of an autotetraploid will be one (2:2)), and their ploidy level will therefore be underestimated; and, (ii) while the method can differentiate between diploids and polyploids, distinguishing between different polyploid levels is not yet completely reliable due to the overlaps of the different allelic ratios and the influence of sequencing errors and/or noise in the data. In addition, some genes may expand in copy number, blurring the allele frequency plots (note that such gene duplications can be detected by investigating the allelic ratios for a targeted gene in several diploid species). Below, we consider all samples having a median allelic ratio value below two as diploid, and those with median allelic ratios over two as polyploids. Allelic ratio statistics can be produced per sample, targeted gene, as well as per SNP position, if further exploration of the data is needed. HTS and target capture approaches, in particular, are booming, generating monumental amounts of data from which ploidy data could be extracted. Methods of ploidy determination based on HTS data are, therefore, a valuable complement to cytogenetic and flow cytometric techniques. Together these approaches provide exciting new opportunities for increasing our understanding of polyploidization over time and space in a wider diversity of plant species.
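As an illustration of the allelic ratios discussed in this Introduction, the following shell sketch lists the heterozygous allele combinations expected for a given ploidy level, together with the resulting allelic ratio and the frequency of the more abundant allele. It is not part of the published pipeline: the function name expected_allelic_ratios is hypothetical, and the calculation assumes an idealized case in which read counts scale exactly with allele copy number.

```shell
# Print the heterozygous allele combinations expected for a given ploidy,
# with the allelic ratio (major/minor copies) and the frequency (%) of the
# major allele. Idealized: assumes read counts follow copy number exactly.
expected_allelic_ratios() {
    LC_ALL=C awk -v n="$1" 'BEGIN {
        # the major allele is carried by ceil(n/2) up to n-1 of the n copies
        for (major = int((n + 1) / 2); major < n; major++) {
            minor = n - major
            printf "%d:%d ratio=%.2f major_freq=%.1f%%\n",
                   major, minor, major / minor, 100 * major / n
        }
    }'
}
```

Calling expected_allelic_ratios 4 prints the 2:2 and 3:1 combinations (ratios of one and three), whereas a diploid yields only 1:1. Under this idealized model, a median allelic ratio below two is therefore what a diploid sample would be expected to show.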
2 Data and Software Installation

2.1 Data

1. HTS data generated for the study group are required, ideally for orthologous markers. In the case of target capture, lineage-specific custom (e.g., [15]) or phylogenetically broad universal (e.g., [16]) bait kits can be used, and enriched genomic libraries can be sequenced with an Illumina platform (ideally, paired-end reads). We recommend assessing the quality of raw paired-end reads (FASTQ file format) with either FastQC (available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or MultiQC ([17], available from https://multiqc.info/) and trimming these reads to remove adapters and any low-quality regions detected (i.e., those with a Phred quality score below 30), using Trimmomatic ([18], available at http://www.usadellab.org/cms/?page=trimmomatic) when working with four-color Illumina sequencing platforms, or fastp ([19], see https://github.com/OpenGene/fastp) when working with two-color platforms.
2. A reference file, in FASTA format, containing one sequence per gene targeted in the bait kit. This reference file can be created
for the study group, e.g., using the HybPiper pipeline ([20], available at https://github.com/mossmatters/HybPiper).

2.2 Software

Analyses can all be run in a Linux environment:

1. Burrows-Wheeler Aligner (BWA, [21]), available at http://bio-bwa.sourceforge.net/, or another 'map to reference' method.
2. SAMtools [22], available at http://www.htslib.org/.
3. nQuire [23], available at https://github.com/clwgg/nQuire. Installation requires the gcc compiler and the libz, libm, and libpthread system libraries.
4. R project for statistical computing [24], available at https://www.r-project.org/, which can be run in the RStudio environment [25]. RStudio also works on operating systems other than Linux and is available at https://www.rstudio.com/.
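Before starting, it can help to confirm that the programs listed above are installed and visible on the PATH. The following is a minimal sketch using only the POSIX command -v builtin. The function name missing_tools is hypothetical, the tool names are those used on the command lines in this chapter, and Rscript is assumed to be the command-line front end installed with R.

```shell
# Print the name of every given tool that cannot be found on the PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call:
#   missing_tools bwa samtools nQuire Rscript
```

An empty result means that every tool was found. Any name that is printed needs to be installed, or its location added to the PATH, before running the pipeline.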
3 Bioinformatic Pipeline

The ploidy estimation analysis starts with quality-filtered paired-end FASTQ reads and a FASTA reference file containing a sequence for each of the targeted genes for one of the species in the study group. Optimal performance of this pipeline is dependent on sequencing coverage. The number of paired-end reads needed will depend on several factors, such as the number of targeted markers, the type of bait kit used (custom vs. universal), and/or the sequencing platform used, all of which can affect sequence coverage. As a guide, sequences from custom kits targeting ~300 markers can give good results with 200K–500K paired-end reads per sample [15], whereas coverage for universal kits should reach at least 1M to 2M reads per sample [16].

The bioinformatic pipeline presented here has four main steps (Fig. 1):

1. Map the trimmed, paired-end reads (FASTQ) to the file containing the reference sequences (FASTA) using mapping software (e.g., BWA), where the FASTQ and FASTA files are provided as input, and a SAM file is returned as the output.
2. Use SAMtools to sort and transform the SAM file from the previous step into a sorted BAM file.
3. Use nQuire, with the sorted BAM file as input, to obtain SNPs from (denoised) mapped reads and to calculate allelic frequencies.
4. Calculate allelic ratios and plot the results for each sample, gene, and SNP using R.
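The first three steps above can be sketched as a dry run that prints the commands for one sample without executing them, which makes it easy to check file names before launching the analysis. This is only an illustrative wrapper: the function name print_ploidy_commands and the output name SampleID_view.txt are hypothetical, redirecting the output of nQuire view to a file is an assumption, and the commands themselves are those recommended in the subheadings of this section.

```shell
# Dry run: print the commands of steps 1-3 for one sample instead of running them.
# Arguments: sample ID, reference FASTA, forward reads, reverse reads.
# The body runs in a subshell so the variables do not leak into the caller.
print_ploidy_commands() (
    sample="$1"; ref="$2"; r1="$3"; r2="$4"
    printf '%s\n' \
        "bwa index ${ref}" \
        "bwa mem ${ref} ${r1} ${r2} > ${sample}.sam" \
        "samtools sort ${sample}.sam -o ${sample}_sorted.bam" \
        "nQuire create -b ${sample}_sorted.bam -o ${sample}_nQuire -x" \
        "nQuire denoise -o ${sample}_sorted_denoised ${sample}_nQuire.bin" \
        "nQuire view ${sample}_sorted_denoised.bin > ${sample}_view.txt"
)

# Example call:
#   print_ploidy_commands SampleID ref.fasta read1.fastq read2.fastq
```

The reference only needs to be indexed once, so the first printed command can be skipped for all but the first sample. Step 4 (the allelic ratio calculation in R) is not included in this sketch.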
Fig. 1 Outline of the bioinformatic pipeline described here to investigate ploidy levels in sequencing samples obtained using target capture approaches. Flowchart: FASTQ raw data -> FastQC, Trimmomatic -> FASTQ quality-filtered data; together with a FASTA reference file (e.g., from HybPiper) -> BWA -> SAM file -> SAMtools -> sorted BAM file -> nQuire -> allelic frequency -> R -> allelic ratio

3.1 Map to Reference
Recommended commands:

bwa index ref.fasta
bwa mem ref.fasta read1.fastq read2.fastq > SampleID.sam
Trimmed paired-end reads will be mapped to the reference file, here using BWA [21] as an example. BWA requires indexing the reference FASTA file (see Note 1) using the option index, which will generate additional files in the working folder. BWA will automatically detect these files when a reference FASTA file that has been indexed is specified. Information on the successfully mapped reads will be recorded in a Sequence Alignment/Map output file, i.e., a SAM file.
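Because the index only has to be built once per reference, a small check can avoid re-indexing for every sample. The sketch below tests for the five files that bwa index writes next to the reference (extensions .amb, .ann, .bwt, .pac, and .sa). It assumes the classic BWA program cited here rather than other aligners, and the function name bwa_index_present is hypothetical.

```shell
# Succeed (exit status 0) only if all index files written by `bwa index`
# are present next to the reference FASTA given as the first argument.
bwa_index_present() {
    for ext in amb ann bwt pac sa; do
        [ -e "$1.$ext" ] || return 1
    done
}

# Usage: index only when needed, then map.
#   bwa_index_present ref.fasta || bwa index ref.fasta
#   bwa mem ref.fasta read1.fastq read2.fastq > SampleID.sam
```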
3.2 BAM Transformation

Recommended commands:

samtools sort SampleID.sam -o SampleID_sorted.bam

A BAM file is the compressed binary version of a SAM file. SAMtools [22] performs this conversion automatically when the output file is given the extension .bam. The alignments in the BAM file also need to be ordered by their position on the reference sequences, which is what the option sort does by default.

3.3 nQuire Analysis
Recommended commands:

nQuire create -b SampleID_sorted.bam -o SampleID_nQuire -x

Option create in nQuire will generate a file in binary format containing SNP information, here named SampleID_nQuire.bin.
Recommended commands:

nQuire denoise -o SampleID_sorted_denoised SampleID_nQuire.bin

Next, option denoise will report the total number of SNPs in the sample, as well as how many were discarded during this denoising step, mainly due to mismapping [23]. It is recommended to run all remaining steps for both "noisy" (Fig. 2a) and "denoised" (Fig. 2b) binary format files to assess the impact denoising has on the estimates.

Recommended commands:

nQuire histo SampleID_sorted_denoised.bin

The option histo plots a histogram on screen showing the number of SNPs and their allelic frequency as percentages. Allelic frequency percentages for annotated SNPs are plotted between 20% and 80% because below and above these values allelic frequencies cannot be separated from sequencing error, which is why nQuire automatically drops any SNPs outside this range. The distributions depicted in Fig. 2 illustrate the output for a tetraploid individual, with allelic frequencies for SNPs concentrated around 50% (i.e., allelic ratio 2:2) and around 75% and 25% (i.e., allelic ratios of 3:1 and 1:3). The nQuire software offers various statistics to determine whether the data best fit a diploid, triploid, or tetraploid model (see below for more information).

Recommended commands:

nQuire lrdmodel SampleID_sorted_denoised.bin
Using option lrdmodel on the denoised .bin file generates information on eight parameters: (i) file, filename; (ii) free, free model maximized log-likelihood; (iii) dip, diploid fixed model maximized log-likelihood; (iv) tri, triploid fixed model maximized log-likelihood; (v) tet, tetraploid fixed model maximized log-likelihood; (vi) d_dip, diploid delta log-likelihood; (vii) d_tri, triploid delta log-likelihood; and (viii) d_tet, tetraploid delta log-likelihood. When comparing each of the three fixed models (diploid, triploid, and tetraploid), the lowest delta value (relative to the free model) will likely correspond to the ploidy level that best fits the observed distribution of the allelic frequencies. Further details can be found in the nQuire manual.

Fig. 2 Screen log resulting from running option histo in nQuire for a tetraploid sample. (a) Distribution of allelic frequencies for a "noisy" binary file. (b) Distribution of allelic frequencies from a denoised binary file. Both panels show the number of SNPs plotted against allelic frequency (%)

Recommended commands:

nQuire histotest SampleID_sorted_denoised.bin
Option histotest gives values for (i) the sum of squared residuals (SSR) of empirical versus ideal histograms (Norm SSR), (ii) the slope of the regression of y-values, with its standard error (y-y slope), and (iii) the R2 of that regression (r^2) for the diploid, triploid, and tetraploid models. As explained in the nQuire manual, we would expect low "Norm SSR", positive "y-y slope" (with low standard error), and high "r^2" values when the distribution of our (empirical) data is similar to an ideal normal distribution fixed for diploid, triploid, or tetraploid samples (see Note 2).

Recommended commands:

nQuire view SampleID_sorted_denoised.bin

Option view provides information on the coverage and the number of counts per allele, for each of the SNPs, with regard to the genes in the reference FASTA file (see Note 3). This is the information that will be used to calculate the allelic ratio for any given sample (see Subheading 3.4 below).
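Returning to the lrdmodel statistics described earlier in this subheading, the rule of choosing the fixed model with the lowest delta can be applied automatically when many samples are screened. The sketch below is a hedged illustration rather than part of nQuire. The function name best_fixed_model is hypothetical, and the layout it reads (whitespace-separated columns in the order file, free, dip, tri, tet, d_dip, d_tri, d_tet, with an optional header line, and file names without spaces) is an assumption that should be checked against the output of the installed nQuire version.

```shell
# Read lrdmodel-style output on stdin and print, for each sample,
# the fixed model with the smallest delta log-likelihood.
# ASSUMED column order: file free dip tri tet d_dip d_tri d_tet
best_fixed_model() {
    LC_ALL=C awk '
        $1 == "file" { next }                 # skip a header line if present
        NF >= 8 {
            best = "diploid"; min = $6 + 0    # d_dip
            if ($7 + 0 < min) { best = "triploid";   min = $7 + 0 }   # d_tri
            if ($8 + 0 < min) { best = "tetraploid"; min = $8 + 0 }   # d_tet
            printf "%s %s\n", $1, best
        }'
}

# Example call:
#   nQuire lrdmodel SampleID_sorted_denoised.bin | best_fixed_model
```

The result is only a first indication. As noted in the Introduction, distinguishing between different polyploid levels is not yet completely reliable, so the histogram and the allelic ratios should be inspected as well.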
If the user is not interested in obtaining the statistics generated by nQuire for diploid, triploid, and tetraploid models, and wishes to directly calculate the allelic ratio values, it is possible to achieve this by only running options create, denoise, and view.

3.4 Calculating the Allelic Ratios and Estimating the Level of Ploidy
The output from nQuire view is a table that can be imported into R as a data frame to calculate the ratio between the fourth and fifth columns of the output, which correspond to the alleles with the highest and the lowest number of counts, respectively. We can use these allelic ratios to calculate (i) the median and (ii) the average of allelic ratios in the sample, as well as (iii) the percentage of SNPs with an allelic ratio < 2.

Recommended commands:

read.csv("./Output_of_nQuire_view", header=F, sep='\t')->SampleID
as.data.frame(SampleID)->SampleID
as.character(SampleID$V1)->SampleID$V1
# allelic ratio per SNP: counts of the most frequent allele (V4) over the least frequent (V5)
SampleID$V6<-SampleID$V4/SampleID$V5
# number of SNPs with an allelic ratio below 2, and at or above 2
sum(SampleID$V6<2)->sum1
sum(SampleID$V6>=2)->sum2
sum1+sum2->sum3
# percentage of SNPs with an allelic ratio < 2
sum1*100/sum3->diploidSNPs
median(SampleID$V6)
mean(SampleID$V6)
diploidSNPs
After running these commands for the tetraploid sample (see above; Fig. 2), 25% of the SNPs had an allelic ratio < 2, the median was 2.73, and the average was 2.51. These numbers lend support to this sample being polyploid. When dealing with diploid samples, we expect to have a higher percentage of SNPs with allelic ratios