English Pages XV, 352 [356] Year 2021
Methods in Molecular Biology 2181
Ernesto Picardi Graziano Pesole Editors
RNA Editing Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
RNA Editing Methods and Protocols
Edited by
Ernesto Picardi and Graziano Pesole Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari “A. Moro”, Bari, Italy
Editors
Ernesto Picardi
Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy
Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari “A. Moro”, Bari, Italy
Graziano Pesole
Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy
Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari “A. Moro”, Bari, Italy
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-0786-2 ISBN 978-1-0716-0787-9 (eBook) https://doi.org/10.1007/978-1-0716-0787-9 © Springer Science+Business Media, LLC, part of Springer Nature 2021 All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Dedication
Dedicated to Marie Öhman
Foreword Sometimes, only one person is missing, and the whole world seems depopulated.—Alphonse de Lamartine, Meditations Poetiques
Marie Öhman was passionate about her science. When she talked about some new project or result from her group, her face would light up and she would begin usually by saying “We have some really cool results . . ., have you ever seen this?” The answer usually was no. Marie was always pushing the frontiers, using new technology to solve the mysteries of RNA editing by ADARs. I can imagine her going through the contents of this book, eager to read chapters on topics such as “RNA editing in single cells” or “RNA editing in neurological and neurodegenerative disorders,” just to name two.
Marie was an excellent scientist. Her many publications as well as her contributions to the field of RNA editing attest to this fact. However, beyond this, Marie was an outstanding colleague. It was on her desk you prayed that your grant proposal or manuscript would end up. You knew that she would be fair and you had the suspicion that she would also be lenient. Once at a conference, Marie and I were discussing a poster when we realized that we both had the manuscript that it was based on, to review. I had accepted it, while Marie had rejected it. I immediately pointed out to her that the authors would probably guess that the reverse happened. She gave me a huge smile and laughingly said “I know!” When my next manuscript got a very “sharp” review, I had to smile. Yes, no one thought she could do any wrong.
Marie was a very active member of the RNA community in Sweden and she worked relentlessly to promote others. This was not something she paid lip service to but truly believed. While on sabbatical within her group, I was flabbergasted at the number of committees she was a member of: she was on scientific panels, PhD committees, and promotion committees all over Sweden. People trusted her, as they knew Marie would do the best for everyone concerned.
The year before she died, Marie and I organized a conference with Anders Virtanen in Uppsala on “RNA and Disease.” What struck me was how supportive Marie was of younger colleagues, particularly of women. When discussing whom to invite for oral presentations, Marie whipped out her computer and suggested a list of junior group leaders. That she had a list of names and emails of junior colleagues was one thing, but that she also knew what they did and into which session they would fit really surprised me. Observing her made me feel guilty; I live in a country with a population of comparable size but it would have taken me days to come up with similar suggestions. Marie’s group held her in high esteem; she supported them, motivated them, and made their working environment as fun as possible. She had the knack of helping them achieve their full potential under her guidance. For many female students, she was their role model, successfully managing to achieve international recognition for her research while also juggling her family life, willing and brave enough to show her feminine side and her empathy for others. Scientific research is like a great wall; all of our contributions are bricks added to this wall: some add a lot, and others a few. However, it takes a special type of researcher such as Marie to form the cement that glues us all together, and it is this bond that gives strength to the scientific community.
Unfortunately, this unique ability is underrated and not appreciated in our competitive world. We should value it more and learn from Marie how to be good citizens, and how to unselfishly contribute to the collective good and not always focus on the progression of our own career. To paraphrase John F. Kennedy, “Ask not what science can do for you, but what you can do for science.” Mary A. O’Connell
Preface
The term “RNA editing” was originally coined by Benne and co-authors in 1986 to describe the uridine (U) insertion and deletion process in mitochondrial transcripts of kinetoplastid protozoa. Over the years, the term has expanded to include a variety of other non-transient RNA modifications occurring post- or co-transcriptionally in different organisms including prokaryotes, fungi, animals, plants, and viruses. Nowadays, there are at least two mechanistically and evolutionarily unrelated RNA editing systems, the “insertion/deletion” and “substitution” editing. While the former is limited to kinetoplastid and slime mold RNAs, the latter is prominent in plant organelles and especially in mitochondria, where specific cytidines (C) are converted to uridines (U) by deamination (even though reverse U-to-C changes have also been reported). In mammals, RNA editing changes primary transcripts by C-to-U or A-to-I modifications through the action of the APOBEC family of deaminases and members of the adenosine deaminase acting on RNA (ADAR) family, respectively. Numerous A-to-I editing events have also been reported in coding and noncoding transcript regions of invertebrates and fungi. The advent of deep transcriptome sequencing technologies such as RNA-seq has greatly improved the investigation of RNA editing at a genomic scale, allowing its profiling in a variety of organisms and experimental conditions. RNA editing has a plethora of functional implications. In humans, for instance, its deregulation has been linked to a variety of nervous system diseases (such as epilepsy, schizophrenia, major depression, and amyotrophic lateral sclerosis), immune disorders, and cancers. This book, comprising 19 chapters, was conceived to cover the state-of-the-art methodologies to investigate RNA editing through wet and dry approaches. We hope that the content of this book meets readers' expectations, and we would like to thank all chapter authors for their effort and excellent contributions.
Finally, we would like to dedicate this book to the memory of Marie Öhman for her passion and dedication to the wonderful and fascinating world of RNA editing. A special thanks is addressed to Mary O’Connell for writing the Foreword of this book in memory of Marie Öhman.
Bari, Italy
Ernesto Picardi Graziano Pesole
Contents
Dedication
Foreword
Preface
Contributors
1 Substitutional RNA Editing in Plant Organelles
   Mizuho Ichinose and Mamoru Sugita
2 Computational Detection of Plant RNA Editing Events
   Alejandro A. Edera and M. Virginia Sanchez-Puerta
3 Discovering RNA Editing Events in Fungi
   Huiquan Liu and Jin-Rong Xu
4 C-to-U RNA Editing: From Computational Detection to Experimental Validation
   Taga Lerner, Mitchell Kluesner, Rafail Nikolaos Tasakis, Branden S. Moriarity, F. Nina Papavasiliou, and Riccardo Pecori
5 Live-Cell Quantification of APOBEC1-Mediated RNA Editing: A Comparison of RNA Editing Assays
   Martina Chieca, Serena Torrini, and Silvestro G. Conticello
6 Adenosine-to-Inosine RNA Editing Enzyme ADAR and microRNAs
   Kang Yuting, Dan Ding, and Hisashi Iizasa
7 Quantitative Analysis of Adenosine-to-Inosine RNA Editing
   Turnee N. Malik, Jean-Philippe Cartailler, and Ronald B. Emeson
8 Discovering A-to-I RNA Editing Through Chemical Methodology “ICE-seq”
   Masayuki Sakurai, Shunpei Okada, Hiroki Ueda, and Yuxi Yang
9 ALU A-to-I RNA Editing: Millions of Sites and Many Open Questions
   Amos A. Schaffer and Erez Y. Levanon
10 RNA Editing in Human and Mouse Tissues
   Harini Srinivasan, Eng Piew Louis Kok, and Meng How Tan
11 Bioinformatics Resources for RNA Editing
   Maria Angela Diroma, Loredana Ciaccia, Graziano Pesole, and Ernesto Picardi
12 High-Throughput Sequencing to Detect DNA-RNA Changes
   Claudio Lo Giudice, Graziano Pesole, and Ernesto Picardi
13 Detection of A-to-I Hyper-edited RNA Sequences
   Roni Cohen-Fultheim and Erez Y. Levanon
14 Proteome Diversification by RNA Editing
   Eli Eisenberg
15 MicroRNA Editing Detection and Function: A Combined In Silico and Experimental Approach for the Identification and Validation of Putative Oncogenic Targets
   Valentina Tassinari, Valeriana Cesarini, Domenico Alessandro Silvestris, Andrea Scafidi, Lorenzo Cucina, and Angela Gallo
16 RNA Editing in Interferonopathies
   Loredana Frassinelli, Silvia Galardi, Silvia Anna Ciafrè, and Alessandro Michienzi
17 The Role of RNA Editing in the Immune Response
   Sadeem Ahmad, Xin Mu, and Sun Hur
18 RNA Editing in Neurological and Neurodegenerative Disorders
   Pedro Henrique Costa Cruz and Yukio Kawahara
19 New Frontiers for Site-Directed RNA Editing: Harnessing Endogenous ADARs
   Tobias Merkle and Thorsten Stafforst
Index
Contributors
SADEEM AHMAD • Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA, USA; Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, USA
JEAN-PHILIPPE CARTAILLER • Department of Molecular Physiology & Biophysics, Vanderbilt University School of Medicine, Nashville, TN, USA
VALERIANA CESARINI • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
MARTINA CHIECA • Core Research Laboratory, ISPRO - Institute for Cancer Research, Prevention and Clinical Network, Firenze, Italy; Department of Medical Biotechnologies, Università di Siena, Siena, Italy
LOREDANA CIACCIA • Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari, Bari, Italy
SILVIA ANNA CIAFRÈ • Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy
RONI COHEN-FULTHEIM • Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
SILVESTRO G. CONTICELLO • Core Research Laboratory, ISPRO - Institute for Cancer Research, Prevention and Clinical Network, Firenze, Italy; Institute of Clinical Physiology, CNR, Pisa, Italy
PEDRO HENRIQUE COSTA CRUZ • Department of RNA Biology and Neuroscience, Graduate School of Medicine, Osaka University, Osaka, Japan
LORENZO CUCINA • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
DAN DING • Department of Microbiology, Shimane University Faculty of Medicine, Izumo, Shimane, Japan; Ningxia Medical University, Yinchuan, Ningxia, China
MARIA ANGELA DIROMA • Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy
ALEJANDRO A. EDERA • Facultad de Ciencias Agrarias, IBAM, Universidad Nacional de Cuyo, CONICET, Almirante Brown, Argentina
ELI EISENBERG • Raymond and Beverly Sackler School of Physics and Astronomy and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
RONALD B. EMESON • Training Program in Neuroscience, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Molecular Physiology & Biophysics, Vanderbilt University School of Medicine, Nashville, TN, USA; Departments of Pharmacology, Biochemistry and Psychiatry & Behavioral Sciences, Vanderbilt University School of Medicine, Nashville, TN, USA
LOREDANA FRASSINELLI • Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy
SILVIA GALARDI • Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy
ANGELA GALLO • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
CLAUDIO LO GIUDICE • Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy
SUN HUR • Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA, USA; Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, USA
MIZUHO ICHINOSE • Center for Gene Research, Nagoya University, Nagoya, Japan; Institute of Transformative Bio-Molecules (ITbM), Nagoya University, Nagoya, Japan
HISASHI IIZASA • Department of Microbiology, Shimane University Faculty of Medicine, Izumo, Shimane, Japan
YUKIO KAWAHARA • Department of RNA Biology and Neuroscience, Graduate School of Medicine, Osaka University, Osaka, Japan
MITCHELL KLUESNER • Department of Pediatrics, University of Minnesota, Minneapolis, MN, USA; Center for Genome Engineering, University of Minnesota, Minneapolis, MN, USA; Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
ENG PIEW LOUIS KOK • Genome Institute of Singapore, Agency for Science Technology and Research, Singapore, Singapore
TAGA LERNER • Division of Immune Diversity, Program in Cancer Immunology, German Cancer Research Centre (DKFZ), Heidelberg, Germany; Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
EREZ Y. LEVANON • Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
HUIQUAN LIU • State Key Laboratory of Crop Stress Biology for Arid Areas, College of Plant Protection, Northwest A&F University, Yangling, Shaanxi, China
TURNEE N. MALIK • Training Program in Neuroscience, Vanderbilt University School of Medicine, Nashville, TN, USA
TOBIAS MERKLE • Interfaculty Institute of Biochemistry, University of Tübingen, Tübingen, Germany
ALESSANDRO MICHIENZI • Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy
BRANDEN S. MORIARITY • Department of Pediatrics, University of Minnesota, Minneapolis, MN, USA; Center for Genome Engineering, University of Minnesota, Minneapolis, MN, USA; Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
XIN MU • Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA, USA; Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, USA
MARY O’CONNELL • Central European Institute of Technology, Centre for Molecular Medicine, Masaryk University, Brno, Czech Republic
SHUNPEI OKADA • Research Institute for Biomedical Sciences, Tokyo University of Science, Chiba, Japan
F. NINA PAPAVASILIOU • Division of Immune Diversity, Program in Cancer Immunology, German Cancer Research Centre (DKFZ), Heidelberg, Germany
RICCARDO PECORI • Division of Immune Diversity, Program in Cancer Immunology, German Cancer Research Centre (DKFZ), Heidelberg, Germany
GRAZIANO PESOLE • Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy; Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari, Bari, Italy
ERNESTO PICARDI • Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy; Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari, Bari, Italy
MASAYUKI SAKURAI • Research Institute for Biomedical Sciences, Tokyo University of Science, Chiba, Japan
M. VIRGINIA SANCHEZ-PUERTA • Facultad de Ciencias Agrarias, IBAM, Universidad Nacional de Cuyo, CONICET, Almirante Brown, Argentina; Facultad de Ciencias Exactas y Naturales, Universidad Nacional de Cuyo, Mendoza, Argentina
ANDREA SCAFIDI • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
AMOS A. SCHAFFER • Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
DOMENICO ALESSANDRO SILVESTRIS • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
HARINI SRINIVASAN • School of Chemical and Biomedical Engineering, Nanyang Technological University, Singapore, Singapore; Genome Institute of Singapore, Agency for Science Technology and Research, Singapore, Singapore
THORSTEN STAFFORST • Interfaculty Institute of Biochemistry, University of Tübingen, Tübingen, Germany
MAMORU SUGITA • Center for Gene Research, Nagoya University, Nagoya, Japan
MENG HOW TAN • School of Chemical and Biomedical Engineering, Nanyang Technological University, Singapore, Singapore; Genome Institute of Singapore, Agency for Science Technology and Research, Singapore, Singapore
RAFAIL NIKOLAOS TASAKIS • Division of Immune Diversity, Program in Cancer Immunology, German Cancer Research Centre (DKFZ), Heidelberg, Germany; Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
VALENTINA TASSINARI • RNA Editing Lab, Oncohaematology Department, IRCCS Ospedale Pediatrico “Bambino Gesù”, Rome, Italy
SERENA TORRINI • Core Research Laboratory, ISPRO - Institute for Cancer Research, Prevention and Clinical Network, Firenze, Italy; Department of Medical Biotechnologies, Università di Siena, Siena, Italy
HIROKI UEDA • Biological Data Science Division, Research Center for Advanced Science and Technology (RCAST), University of Tokyo, Tokyo, Japan
JIN-RONG XU • Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN, USA
KANG YUTING • Department of Microbiology, Shimane University Faculty of Medicine, Izumo, Shimane, Japan; Ningxia Medical University, Yinchuan, Ningxia, China
YANG YUXI • Research Institute for Biomedical Sciences, Tokyo University of Science, Chiba, Japan
Chapter 1
Substitutional RNA Editing in Plant Organelles
Mizuho Ichinose and Mamoru Sugita
Abstract
RNA editing by cytidine (C) to uridine (U) conversion occurs frequently in land plant mitochondria and plastids. Target cytidines are recognized by nuclear-encoded pentatricopeptide repeat (PPR) proteins in a sequence-specific manner. In the moss Physcomitrella patens, all PPR editing factors possess the DYW-deaminase domain at the C-terminus. Here, we describe methods for the direct sequencing of cDNA to detect RNA editing events and the RNA electrophoresis mobility shift assay (REMSA) to analyze the specific binding of PPR editing factors to their target RNA.
Key words C-to-U editing, Mitochondria, Pentatricopeptide repeat, Physcomitrella patens, Plastid, RNA-protein interaction
1 Introduction
RNA editing is a posttranscriptional process that introduces changes in RNA sequences encoded by nuclear, mitochondrial, or plastid genomes, and occurs in a wide range of organisms [1]. RNA editing is categorized as either insertion/deletion or conversion of nucleotides. Uridine (U)-insertion/deletion occurs in mitochondrial transcripts of kinetoplastid protozoans, such as Trypanosoma brucei. Similarly, mitochondrial RNAs in the slime mold Physarum polycephalum are heavily edited by the insertion of mono- and dinucleotides at specific sites. On the other hand, nuclear-encoded transcripts undergo different types of editing, such as cytidine (C)-to-uridine (U) conversion and adenosine (A)-to-inosine (I) editing in fruit flies, octopuses, and humans. APOBEC1 (apoB mRNA editing cytidine deaminase 1) is involved in C-to-U RNA editing of apoB mRNA, and ADARs (adenosine deaminases acting on RNA) convert A to I in double-stranded RNA [1]. This A-to-I editing occurs not only in protein-coding regions of mRNAs, but also frequently in noncoding regions that contain intramolecular inverted Alu repeats [2].
Ernesto Picardi and Graziano Pesole (eds.), RNA Editing: Methods and Protocols, Methods in Molecular Biology, vol. 2181, https://doi.org/10.1007/978-1-0716-0787-9_1, © Springer Science+Business Media, LLC, part of Springer Nature 2021
In the plant kingdom, C-to-U RNA editing usually occurs in plastid and mitochondrial transcripts in land plants but not in green algae [3, 4]. In some plant taxa, U-to-C “reverse” editing is also found in both organelles. Several hundred C-to-U RNA editing sites are present in mitochondrial genomes, and 30–60 sites in plastid genomes, of flowering plants. Surprisingly, there are thousands of C-to-U editing sites in plastid and mitochondrial transcripts in lycophytes (an early-diverging lineage of vascular plants). C-to-U RNA editing occurs mostly in translated regions of organellar mRNAs, and occasionally also in untranslated regions, introns, and structural RNAs. Most C-to-U changes in protein-coding regions restore evolutionarily conserved codons. C-to-U RNA editing in plant organelles requires nuclear-encoded pentatricopeptide repeat (PPR) proteins [5, 6]. PPR proteins constitute a large family comprising 100 to over 1000 members in land plants [7]. Most PPR proteins required for RNA editing consist of tandem repeats of canonical 35-amino acid PPR (P) motifs and their L (long, 35 or 36 amino acids) and S (short, 31 amino acids) variants, followed by conserved E (extension) and DYW domains. They are usually called PPR-DYW proteins. Some editing PPR proteins lack the DYW domain but interact with either other PPR-DYWs or a DYW protein with no PPR motifs to edit target site(s) [8]. To date, nearly 80 PPR editing factors have been identified in plants [9]. In the moss Physcomitrella patens, all 11 known C-to-U editing events in mitochondria are assigned to eight specific PPR-DYW proteins, each of which addresses one or two sites [10]. The DYW domain contains a conserved region similar to the active-site motif, C/HxE(x)nPCxxC, of cytidine deaminases from bacteria, yeast, plants, and animals [11]. The DYW domain has recently been demonstrated to be essential for C-to-U editing [12].
PPR editing proteins bind to specific cis-elements in a one-PPR-motif-to-one-nucleotide manner and are responsible for single or multiple editing sites with similar cis-element sequences. A code for PPR-RNA recognition has been elucidated [13–15]: the combination of amino acids at positions 5 and 35 (the last position) of each PPR motif recognizes a specific RNA base. Bioinformatics tools and databases for surveying plant RNA editing are publicly available [9, 16], and a PPR–target prediction tool is useful for identifying PPR editing factors [17]. Here we present experimental procedures to demonstrate that Physcomitrella patens PPR-DYW proteins are responsible for RNA editing at specific sites [10, 18].
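The one-motif-to-one-nucleotide recognition described above can be pictured as a simple lookup from the residue pair at positions 5 and 35 to an RNA base. The sketch below is illustrative only. The five residue pairs are combinations commonly cited in the PPR-code literature and are an assumption here, not values taken from this chapter. The function names and the example motif list are hypothetical, and the experimentally supported code in refs [13–15] should be consulted for real predictions.

```python
# Illustrative subset of a PPR code: (residue at position 5, residue at position 35) -> base.
# ASSUMPTION: these five pairs are commonly cited combinations, not taken from this chapter.
PPR_CODE = {
    ("T", "N"): "A",
    ("T", "D"): "G",
    ("N", "D"): "U",
    ("N", "S"): "C",
    ("N", "N"): "Y",  # IUPAC code for a pyrimidine (C or U)
}

# Concrete bases accepted by each predicted symbol ("N" = no prediction, accepts anything).
ACCEPTS = {"A": "A", "G": "G", "U": "U", "C": "C", "Y": "CU", "N": "ACGU"}


def predict_target(motif_pairs):
    """Return one predicted symbol per PPR motif, in N- to C-terminal order."""
    return "".join(PPR_CODE.get(pair, "N") for pair in motif_pairs)


def count_matches(predicted, rna):
    """Count positions where the RNA base is compatible with the prediction."""
    if len(predicted) != len(rna):
        raise ValueError("prediction and RNA must have the same length")
    return sum(base in ACCEPTS[sym] for sym, base in zip(predicted, rna))


# Hypothetical protein with four motifs; the last pair is not in the table.
motifs = [("T", "N"), ("N", "D"), ("N", "N"), ("V", "T")]
predicted = predict_target(motifs)
print(predicted, count_matches(predicted, "AUCG"))
```

Scoring a candidate cis-element this way only ranks sequences by compatibility; binding still has to be confirmed experimentally, for example by REMSA.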
2 Materials
2.1 Detection of RNA Editing Events by Direct Sequencing
1. Total cellular RNA isolated from 4-day-old P. patens protonemata.
2. M-MLV reverse transcriptase (e.g., ReverTra Ace®, TOYOBO).
3. Random primer (e.g., hexadeoxyribonucleotide mixture; pd(N)6, TaKaRa).
4. High-fidelity DNA polymerase for PCR (e.g., PrimeSTAR GXL polymerase, TaKaRa).
5. Primers designed to amplify an approximately 0.5 kb fragment including the editing site:
   ccmF-edi-up: 5′-CTTATCTTTATGGTTGTGCTTTGTG-3′.
   ccmF-edi-low: 5′-AACAAAGCACACCACAGAAAGATTC-3′.
6. Phenol/chloroform/isoamyl alcohol (25:24:1).
7. Ethanol.
8. BigDye Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific).
2.2 RNA Electrophoresis Mobility Shift Assay
1. pBAD/TOPO ThioFusion Expression Kit (Invitrogen).
2. Escherichia coli BL21 competent cells.
3. LB medium: Dissolve 5 g tryptone, 5 g NaCl, and 2.5 g yeast extract in 500 mL distilled water. Sterilize by autoclaving.
4. Ampicillin.
5. 20% L-Arabinose: Dissolve 2 g L-arabinose in 10 mL distilled water and sterilize with a 0.22 μm filter.
6. Ni-NTA agarose (QIAGEN).
7. 2× Lysis buffer: 100 mM Tris–HCl (pH 8.0), 1 M KCl, 20 mM MgCl2, 1% TritonX-100, 20% glycerol.
8. Extraction buffer: 1× Lysis buffer with 40 mM imidazole and 1 mM DTT.
9. Elution buffer: 1× Lysis buffer with 250 mM imidazole and 1 mM DTT.
10. 100 mg/mL Lysozyme.
11. Micro Bio-Spin™ Chromatography Columns (Bio-Rad, #7326204).
12. Slide-A-Lyzer™ Dialysis cassette (Thermo Fisher Scientific).
13. Dialysis buffer: 20 mM Hepes-KOH (pH 8.0), 150 mM NaCl, 10% glycerol.
14. Synthetic oligo RNA (purified by RNase-free HPLC), ccmFCRNA2: 5′-UGGUUGGUAAGUAGAGAUGUUUCCACAGGUGCUCCUUUUUCUCAUG-3′ (ccmFC-C103 and -C122 editing sites are underlined).
15. T4 polynucleotide kinase (TaKaRa).
16. [γ-32P] ATP (3000 Ci/mmol).
17. 4 M Ammonium acetate.
18. 100% Ethanol.
19. 80% Ethanol.
20. PCR primers to amplify the 192 nt region of cox3 for preparation of DNA template for in vitro transcription:
    cox3-F3: 5′-ATGTAATACGACTCACTATAGGGGGCCACTGGGTTTCATGGTTTTCATG-3′ (T7 promoter sequence is underlined).
    cox3-R: 5′-TTAATTACCTCCCCACCAATAAATAG-3′.
21. T7 RNA polymerase (TaKaRa).
22. NTP mix (ATP, CTP, GTP, UTP, each 5 mM).
23. Recombinant RNase Inhibitor (TaKaRa).
24. RNA loading dye: 95% Formamide, 0.02% SDS, 1 mM EDTA, 0.02% bromophenol blue (BPB), 0.02% xylene cyanol (XC).
25. 5× TBE buffer: 445 mM Tris, 445 mM boric acid, 10 mM EDTA (see Note 1).
26. 30% Acrylamide/bis-acrylamide solution (29:1).
27. Denaturing 7 M urea-5% polyacrylamide gel (10 × 10 cm): 4.2 g Urea, 2 mL 5× TBE, 1.7 mL 30% acrylamide/bis-acrylamide solution. Make up to 10 mL with distilled water. Dissolve urea by microwave and cool down on ice. Add 20 μL 20% ammonium persulfate and 5 μL TEMED.
28. Low Range ssRNA Ladder (NEB).
29. RNase-free ethidium bromide solution.
30. RNA extraction buffer: 179 μL RNase-free water, 20 μL 3 M sodium acetate (pH 5.2), 1 μL 0.5 M EDTA, 2 μL 10% SDS.
31. 2× Binding buffer: 125 mM NaCl, 70 mM Hepes-KOH (pH 8.0), 8 mM DTT, 15% glycerol (see Note 2).
32. 1 mg/mL BSA.
33. 3MM Whatman paper (GE Healthcare).
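As a quick sanity check on the denaturing gel recipe in item 27, the final concentrations can be worked out from the listed amounts. This is a minimal sketch: the helper name `final_conc` is hypothetical, the molar mass of urea (60.06 g/mol) is a standard value not given in the text, and the small volumes of ammonium persulfate and TEMED are ignored.

```python
UREA_MW = 60.06  # g/mol, standard molar mass of urea (not stated in the recipe)


def final_conc(stock_conc, stock_volume, total_volume):
    """Dilution rule C1*V1 = C2*V2: concentration after making stock_volume up to total_volume."""
    return stock_conc * stock_volume / total_volume


total_ml = 10.0  # final gel volume from item 27

# 4.2 g urea in 10 mL -> mol/L
urea_molar = (4.2 / UREA_MW) / (total_ml / 1000.0)
# 2 mL of 5x TBE in 10 mL -> fold concentration
tbe_fold = final_conc(5.0, 2.0, total_ml)
# 1.7 mL of 30% acrylamide/bis-acrylamide in 10 mL -> percent
acrylamide_pct = final_conc(30.0, 1.7, total_ml)
# Tris concentration in the gel, from 445 mM in the 5x stock (item 25)
tris_mM = final_conc(445.0, 2.0, total_ml)

print(f"urea: {urea_molar:.2f} M")
print(f"TBE: {tbe_fold:.1f}x (Tris {tris_mM:.0f} mM)")
print(f"acrylamide: {acrylamide_pct:.1f}%")
```

The numbers agree with the stated 7 M urea and 5% polyacrylamide, with TBE at 1× in the gel.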
3 Methods
3.1 Detection of RNA Editing by Direct Sequencing
Various methods can be applied to detect editing events (see Note 3). Because the moss P. patens has only 11 editing sites in the mitochondria, we use a simple method here: direct sequencing of amplified cDNA fragments to detect editing events in PPR-DYW gene knockout mosses (see Fig. 1).
1. DNA-free total cellular RNA (1 μg) is reverse transcribed by M-MLV reverse transcriptase using a random primer according to the manufacturer’s instructions.
2. PCR amplification of targets containing the editing sites is performed with a high-fidelity DNA polymerase, specific target primers (ccmF-edi-up and ccmF-edi-low), and cDNA.
Fig. 1 RNA editing defects in PPR-DYW knockout mosses. Direct sequence chromatograms of PCR-amplified genomic DNA and/or cDNA from the wildtype (WT), PpPPR_65 KO (Δ65), and PpPPR_71 KO (Δ71) mosses are shown. Gray-shaded areas indicate the two ccmFC editing sites (ccmFC-C103 and -C122). RNA editing efficiencies are shown as a percentage (source: Ichinose et al. 2013 Plant & Cell Physiol., reproduced and modified with permission of Oxford University Press)
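The caption of Fig. 1 reports editing efficiencies as percentages but does not state how they were calculated. One common approach, assumed here rather than taken from the source, is to compare the heights of the T (edited) and C (unedited) peaks at the editing site in the cDNA chromatogram. The function name and the peak heights below are hypothetical.

```python
def editing_efficiency(c_peak, t_peak):
    """Percent editing at one site: T peak height over the sum of C and T peak heights.

    In cDNA, an edited site (U in RNA) reads as T and an unedited site reads as C.
    """
    total = c_peak + t_peak
    if total <= 0:
        raise ValueError("peak heights must sum to a positive value")
    return 100.0 * t_peak / total


# Hypothetical peak heights (arbitrary units) read from a chromatogram viewer.
wild_type = editing_efficiency(c_peak=120, t_peak=1080)
knockout = editing_efficiency(c_peak=1500, t_peak=0)
print(f"WT: {wild_type:.0f}% edited, KO: {knockout:.0f}% edited")
```

Peak-height ratios are only semi-quantitative, because dye-terminator incorporation varies with sequence context, so they are best used to compare the same site across samples.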
3. The PCR product is purified to remove free primers and dNTPs by phenol-chloroform extraction and ethanol precipitation at room temperature (see Note 4).
4. Sequencing is performed with 50 ng of PCR product, the BigDye Terminator v3.1 Cycle Sequencing Kit, and the ccmF-edi-up primer.
5. The sequencing chromatogram is analyzed and editing sites are identified.
3.2 RNA Electrophoresis Mobility Shift Assay (REMSA)
REMSA is a common and sensitive technique for studying protein-RNA interactions. It is based on the observation that the electrophoretic mobility of a protein-RNA complex is typically less than that of free RNA. We perform REMSA to confirm whether PPR-DYW editing factors recognize the target editing site. An example of REMSA is shown in Fig. 2.
3.2.1 Purification of His-Tagged Recombinant Proteins
We use the pBAD/TOPO ThioFusion Expression system (Invitrogen) for expressing recombinant PPR proteins because thioredoxin and His-tag do not bind substrate RNAs.
1. Amplify the region encoding PPR omitting the N-terminal transit peptide by PCR (see Note 5).
2. Clone into the pBAD/Thio-TOPO vector by the TOPO® Cloning reaction according to the manufacturer’s instructions. All reagents are provided by the pBAD/TOPO ThioFusion Expression Kit (Invitrogen).
3. Transform the plasmid into the protease-deficient E. coli strain BL21.
4. Inoculate a single colony into 2 mL LB medium containing 50 μg/mL ampicillin and grow overnight at 37 °C with shaking (starter culture).
5. Add 0.5 mL of the starter culture to 50 mL LB with ampicillin and shake at 37 °C until OD600 = 0.4 to 0.6 (see Note 6).
6. Cool down to room temperature by placing on ice.
7. Induce expression by adding 0.5 mL of 20% L-arabinose (final conc. 0.2%).
8. Shake overnight at 16 °C. (Perform the remaining steps at 4 °C.)
9. Harvest cells in a 50 mL tube and centrifuge at 3000 × g for 3 min at 4 °C.
10. Wash pellet with cold H2O and recentrifuge.
11. Suspend pellet in 5 mL of extraction buffer.
12. Add 50 μL of 100 mg/mL lysozyme (final conc. 1 mg/mL) and shake for 1 h.
Fig. 2 Detection of binding of the recombinant PpPPR_71 (r-71) to the target RNA. (a) REMSA was performed with the recombinant proteins (r-Trx and r-71) and 32P-labeled ccmFC RNA. The concentration of recombinant proteins is indicated above each lane. The positions of the protein-RNA complex and free RNA are indicated by white and black arrowheads, respectively. (b) REMSA with the r-71 and labeled ccmFC RNA in the presence of cold competitor RNAs (ccmFC and cox3). The concentration of competitor RNAs is indicated above each lane. Non-labeled RNAs for competition were preincubated with the r-71 (50 nM) for 10 min before the labeled ccmFC RNA was added (source: Tasaki et al. 2010 Plant J., reproduced and modified with permission of John Wiley & Sons Ltd)
13. Sonicate the cell lysate for 6 × 10 s with 10-s pauses at 300 W.
14. Centrifuge at 3000 × g for 10 min and collect supernatant into a 15 mL tube.
15. Add 100 μL of Ni-NTA agarose equilibrated with extraction buffer.
16. Rotate for 1 h.
17. Centrifuge at 1000 × g for 2 min and discard supernatant.
18. Add 4 mL of extraction buffer. Repeat this washing procedure twice. After the last centrifugation, remove supernatant and resuspend the Ni-NTA agarose in 0.5 mL of extraction buffer.
19. Place Ni-NTA agarose in a Micro Bio-spin column.
20. Set the column into a 2 mL tube.
21. Centrifuge at 10,000 × g for 2 min and transfer the Micro Bio-spin column into a new 1.5 mL tube.
22. Add 100 μL of elution buffer.
23. Allow to stand for 5 min and centrifuge at 10,000 × g for 2 min.
24. Repeat this elution procedure with 50 μL of elution buffer.
25. Add the eluate to a Slide-A-Lyzer™ Dialysis cassette and dialyze in dialysis buffer overnight with mixing.
26. Transfer the protein solution into a new tube, check protein concentration, and store at −80 °C until use (see Note 7).
3.2.2 Preparation of Radiolabeled RNA Probes
We design synthetic oligo RNAs which include 30–40 nt of the upstream sequence and 5 nt of the downstream sequence surrounding the editing site.
1. Each synthetic oligo RNA is 5′-end labeled with [γ-32P] ATP under the following conditions (see Note 8).
Reaction mixture (10 μL):
  10 μM Oligo RNA                            1 μL
  10× T4 polynucleotide kinase buffer (a)    1 μL
  [γ-32P] ATP                                2 μL
  T4 polynucleotide kinase                   1 μL
  H2O                                        5 μL
Incubate for 1 h at 37 °C.
(a) Supplied with purchased T4 polynucleotide kinase
2. Add 1 μL of 4 M ammonium acetate and 40 μL of ethanol. Mix by tapping.
3. Let stand at −30 °C for 30 min.
4. Centrifuge at maximum speed for 15 min at 4 °C. Discard supernatant.
5. Wash pellet with 100 μL of 80% ethanol.
6. Centrifuge for 2 min and discard supernatant.
7. Resuspend pellet in 100 μL of sterile water.
8. Quantify the radioactivity of RNA probes in a liquid scintillation counter and make 100 μL of 0.5 nM RNA probe.
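The dilution in step 8 follows from C1 × V1 = C2 × V2. The following is a minimal sketch, not part of the original protocol: the function name dilution_volumes is hypothetical, and the 100 nM stock is an assumption that corresponds to full recovery of the 10 pmol of oligo RNA in 100 μL of water. In practice the stock concentration should be estimated from the scintillation counts.

```shell
# dilution_volumes STOCK_CONC TARGET_CONC FINAL_VOLUME
# Prints the stock volume and the diluent volume (same unit as FINAL_VOLUME)
# needed to reach TARGET_CONC, from C1*V1 = C2*V2.
dilution_volumes() {
    awk -v c1="$1" -v c2="$2" -v v2="$3" 'BEGIN {
        if (c1 <= 0 || c2 > c1) { print "invalid concentrations" > "/dev/stderr"; exit 1 }
        v1 = c2 * v2 / c1
        printf "%.2f %.2f\n", v1, v2 - v1
    }'
}

# Assumed 100 nM stock (hypothetical value); target: 100 uL of 0.5 nM probe
dilution_volumes 100 0.5 100   # prints: 0.50 99.50 (uL stock, uL water)
```

With a lower recovery the stock volume increases proportionally; for example, a 50 nM stock would require 1 μL per 100 μL of probe solution.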
3.2.3 Preparation of a Competitor RNA by In Vitro Transcription
1. Perform a PCR reaction with gene-specific primers containing a T7 promoter sequence (cox3-F3 and cox3-R).
2. Purify the PCR product by ethanol precipitation and dissolve pellet in RNase-free water.
3. Add the following components:
Reaction mixture (20 μL):
  10× T7 RNA polymerase buffer (a)    2 μL
  50 mM DTT (a)                       2 μL
  NTP mix (each 5 mM)                 2 μL
  RNase inhibitor                     20 U
  Purified PCR product                50 ng
  T7 RNA polymerase                   50 U
  RNase-free water                    Up to 20 μL
(a) Supplied with purchased T7 RNA polymerase
4. Incubate for 1 h at 42 °C.
5. Add 5 U of RNase-free DNase I and incubate for 10 min at 37 °C to remove the template DNA.
6. Increase reaction volume to 50 μL with RNase-free water.
7. Add 5 μL of 4 M ammonium acetate and 200 μL of ethanol and keep at −30 °C for 30 min.
8. Centrifuge at maximum speed for 15 min at 4 °C and discard supernatant.
9. Dissolve pellet in 5 μL of RNA loading dye.
10. Heat for 90 s at 90 °C and then incubate on ice.
11. Prepare a denaturing 7 M urea-5% polyacrylamide gel.
12. Pre-run for 10 min at 30 mA with 1× TBE and rinse the wells with running buffer by pipetting.
13. Load the sample and low-range ssRNA ladder on the denaturing gel.
14. Run the gel for 20 min at 30 mA (see Note 9).
15. Stain the gel in the RNase-free ethidium bromide solution.
16. Excise the RNA fragment corresponding to the expected size from the gel with a clean scalpel and transfer it to a 1.5 mL tube.
17. Add 200 μL of RNA extraction buffer and crush the gel with a pestle.
18. Rotate overnight at 4 °C.
19. Centrifuge at 20,000 × g for 5 min at 4 °C.
20. Transfer supernatant to a new 1.5 mL tube.
21. Measure the concentration of 1 μL solution.
22. Purify the solution by ethanol precipitation.
23. Dissolve pellet in RNase-free water to the desired concentration.
3.2.4 Preparation of a Native Polyacrylamide Gel
1. Assemble the gel (20 × 20 cm) by using 1 mm spacer and comb.
2. Prepare a 6% native polyacrylamide gel. To make a 20 mL gel:
  4 mL 30% acrylamide/bis-acrylamide solution
  4 mL 5× TBE buffer
  12 mL H2O
  50 μL 20% Ammonium persulfate
  15 μL TEMED
3. Immediately pour the acrylamide solution into the gel assembly and let it polymerize.
4. Assemble the electrophoresis apparatus with the gel, fill the tank with 1× TBE, and connect with the power supply.
5. Pre-run for 20 min at 30 mA and at 4 °C.
3.2.5 Protein-RNA Binding Reaction
1. Make protein dilutions using a dialysis buffer.
2. Set binding reaction:
Reaction mixture (18 μL):
  Protein              5 μL
  2× Binding buffer    10 μL
  1 mg/mL BSA          2 μL
  H2O                  1 μL
Incubate for 10 min at room temperature. (For competitor: Add 1 μL of competitor RNA instead of water (step 2) and incubate for 10 min at room temperature before adding probe.)
3. Add 2 μL of 0.5 nM radiolabeled RNA probe.
4. Incubate for 15 min at room temperature.
5. Carefully load 15 μL of reaction mixture on the native-PAGE gel (see Note 10).
6. Load 5 μL of RNA loading dye in unused well as the loading control.
7. Run the gel for 1 h at 30 mA.
8. Transfer the gel to 3MM Whatman paper. Cover the top of the gel with plastic wrap and dry at 80 °C in a vacuum dryer for 1 h. Switch off the heater and allow the gel to cool down for 10 min.
9. Place gel in a cassette with an imaging plate. Exposure time may range between 1 h and overnight. Scan the imaging plate using a phosphor imager.
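The final composition of the 20 μL binding reaction (18 μL mixture plus 2 μL probe) can be checked by summing the contribution of each stock. The sketch below is illustrative only: the function name mix_conc is hypothetical, and it assumes that the 5 μL of protein is in undiluted dialysis buffer (20 mM Hepes-KOH, 150 mM NaCl, 10% glycerol) and that BSA and probe are dissolved in water.

```shell
# mix_conc TOTAL_VOLUME CONC:VOL [CONC:VOL ...]
# Final concentration of one solute contributed by one or more stocks.
mix_conc() {
    total_volume=$1
    shift
    printf '%s\n' "$@" |
        awk -F: -v t="$total_volume" '{ s += $1 * $2 } END { printf "%g\n", s / t }'
}

# 10 uL of 2x binding buffer + 5 uL of protein in dialysis buffer, 20 uL total
echo "NaCl (mM):      $(mix_conc 20 125:10 150:5)"   # 100
echo "Hepes-KOH (mM): $(mix_conc 20 70:10 20:5)"     # 40
echo "Glycerol (%):   $(mix_conc 20 15:10 10:5)"     # 10
echo "DTT (mM):       $(mix_conc 20 8:10)"           # 4
echo "BSA (mg/mL):    $(mix_conc 20 1:2)"            # 0.1
echo "Probe (nM):     $(mix_conc 20 0.5:2)"          # 0.05
```

Under these assumptions the probe is present at 0.05 nM, well below the nanomolar protein concentrations used in Fig. 2.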
4 Notes
1. If bands appear smeared, try THE buffer (34 mM Tris, 66 mM Hepes, 0.1 mM EDTA, pH 8.3) instead of TBE buffer [19].
2. Tris–HCl buffer can be used instead of Hepes-KOH buffer.
3. Poisoned primer extension (PPE) assay [20], single-nucleotide extension polymorphism typing [21], and RNA-seq analysis [22] are used to detect RNA editing events in plant organelles.
4. PCR product can also be purified by the MinElute PCR Purification Kit (QIAGEN) or illustra ExoProStar (GE Healthcare).
5. We use the TargetP website [23] to predict the cleavage site of transit peptides. The recombinant PPR protein should start at least 15 amino acids upstream from the first PPR motif.
6. In REMSA, sufficient recombinant proteins are usually obtained in a 50 mL culture. If the yield of recombinant protein is low, scale up to a 500 mL culture.
7. We confirm the purity of proteins through SDS-PAGE and Coomassie Blue staining. Proteins can normally be stored at −80 °C for a month, but some proteins lose RNA binding activity at −80 °C. In that case, proteins are kept at 4 °C and used within 10 days for REMSA.
8. Instead of [γ-32P] ATP, fluorescence tags, such as fluorescein (FAM), Cy5, and Cy3, can be used for RNA labeling [19].
9. Adjust the running time based on the length of RNA probes. The mobility of BPB and XC dyes corresponds to 35 nt and 130 nt nucleic acid, respectively, in the 5% denaturing gel.
10. Do not add RNA loading dye because formation of the RNA-protein complex will be impaired.
References
1. Gott JM, Emeson RB (2000) Functions and mechanisms of RNA editing. Annu Rev Genet 34:499–531
2. Levanon EY, Eisenberg E, Yelin R, Nemzer S, Hallegger M, Shemesh R, Fligelman ZY, Shoshan A, Pollock SR, Sztybel D, Olshansky M, Rechavi G, Jantsch MF (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nat Biotechnol 22:1001–1005
3. Ichinose M, Sugita M (2017) RNA editing and its molecular mechanism in plant organelles. Genes (Basel) 8:5
4. Cahoon AB, Nauss JA, Stanley CD, Qureshi A (2017) Deep transcriptome sequencing of two green algae, Chara vulgaris and Chlamydomonas reinhardtii, provides no evidence of organellar RNA editing. Genes (Basel) 8:80
5. Kotera E, Tasaka M, Shikanai T (2005) A pentatricopeptide repeat protein is essential for RNA editing in chloroplasts. Nature 433:326–330
6. Zehrmann A, Verbitskiy D, van der Merwe JA, Brennicke A, Takenaka M (2009) A DYW domain–containing pentatricopeptide repeat protein is required for RNA editing at multiple sites in mitochondria of Arabidopsis thaliana. Plant Cell 21:558–567
7. Cheng S, Gutmann B, Zhong X, Ye Y, Fisher MF, Bai F, Castleden I, Song Y, Song B, Huang J, Liu X, Xu X, Lim BL, Bond CS, Yiu SM, Small I (2016) Redefining the structural motifs that determine RNA binding and RNA editing by pentatricopeptide repeat proteins in land plants. Plant J 85:532–547
8. Takenaka M, Zehrmann A, Verbitskiy D, Härtel B, Brennicke A (2013) RNA editing in plants and its evolution. Annu Rev Genet 47:335–352
9. Lenz H, Hein A, Knoop V (2018) Plant organelle RNA editing and its specificity factors: enhancements of analyses and new database features in PREPACT 3.0. BMC Bioinf 19:255
10. Ichinose M, Sugita C, Yagi Y, Nakamura T, Sugita M (2013) Two DYW subclass PPR proteins are involved in RNA editing of ccmFc and atp9 transcripts in the moss Physcomitrella patens: first complete set of PPR editing factors in plant mitochondria. Plant Cell Physiol 54:1907–1916
11. Salone V, Rüdinger M, Polsakiewicz M, Hoffmann B, Groth-Malonek M, Szurek B, Small I, Knoop V, Lurin C (2007) A hypothesis on the identification of the editing enzyme in plant organelles. FEBS Lett 581:4132–4138
12. Oldenkott B, Yang Y, Lesch E, Knoop V, Schallenberg-Rüdinger M (2019) Plant-type pentatricopeptide repeat proteins with a DYW domain drive C-to-U RNA editing in Escherichia coli. Commun Biol 2:85
13. Barkan A, Rojas M, Fujii S, Yap A, Chong YS, Bond CS, Small I (2012) A combinatorial amino acid code for RNA recognition by pentatricopeptide repeat proteins. PLoS Genet 8:e1002910
14. Yagi Y, Hayashi S, Kobayashi K, Hirayama T, Nakamura T (2013) Elucidation of the RNA recognition code for pentatricopeptide repeat proteins involved in organelle RNA editing in plants. PLoS One 8:e57286
15. Takenaka M, Zehrmann A, Brennicke A, Graichen K (2013) Improved computational target site prediction for pentatricopeptide repeat RNA editing factors. PLoS One 8:e65343
16. Lo Giudice C, Hernández I, Ceci LR, Pesole G, Picardi E (2019) RNA editing in plants: a comprehensive survey of bioinformatics tools and databases. Plant Physiol Biochem 137:53–61
17. Yan J, Yao Y, Hong S, Yang Y, Shen C, Zhang Q, Zhang D, Zou T, Yin P (2019) Delineation of pentatricopeptide repeat codes for target RNA prediction. Nucleic Acids Res 47:3728–3738
18. Tasaki E, Hattori M, Sugita M (2010) The moss pentatricopeptide repeat protein with a DYW domain is responsible for RNA editing of mitochondrial ccmFc transcript. Plant J 62:560–570
19. Kindgren P, Yap A, Bond CS, Small I (2015) Predictable alteration of sequence recognition by RNA editing factors from Arabidopsis. Plant Cell 27:403–416
20. Hayes ML, Hanson MR (2007) Chapter 21. Assay of editing of exogenous RNAs in chloroplast extracts of Arabidopsis, maize, pea, and tobacco. Methods Enzymol 424:459–482
21. Takenaka M, Brennicke A (2009) Multiplex single-base extension typing to identify nuclear genes required for RNA editing in plant organelles. Nucleic Acids Res 37:e13
22. Picardi E, Horner DS, Chiara M, Schiavon R, Valle G, Pesole G (2010) Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing. Nucleic Acids Res 38:4755–4767
23. Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300:1005–1016
Chapter 2
Computational Detection of Plant RNA Editing Events
Alejandro A. Edera and M. Virginia Sanchez-Puerta
Abstract
Computers can systematically exploit RNA-seq data, allowing us to efficiently detect RNA editing sites on a genome-wide scale. This chapter introduces a very flexible computational framework for detecting RNA editing sites in plant organelles. This framework comprises three major steps: RNA-seq data processing, RNA read alignment, and RNA editing site detection. Each step is discussed in sufficient detail to be implemented by the reader. As a case study, the framework will be used with publicly available sequencing data to detect C-to-U RNA editing sites in the coding sequences of the mitochondrial genome of Nicotiana tabacum.
Key words RNA editing, Land plant, RNA sequencing, Mitochondrial genome, RNA read alignment, Nicotiana tabacum
1 Introduction
In only one decade our understanding of gene expression has been significantly deepened thanks to next-generation RNA sequencing (RNA-seq). Data produced by such technology are composed of millions of reads derived from sequencing RNA fragments, which are stored as files that can be processed by computers, allowing fast and systematic transcriptomic analyses. RNA-seq provides information not only about gene expression but also about functional and structural RNA features, making it an excellent means for studying transcriptomic processes, such as RNA editing in land plant organelles, using computational tools [1–5]. In this chapter, we develop a computational framework for detecting RNA editing events using RNA-seq data. This framework comprises three major steps: RNA-seq data processing, RNA read alignment, and RNA editing site detection (also known as variant calling). Taking RNA-seq data as input, the framework processes the RNA-seq data by removing sequencing adapters and low-quality reads, if any. The processed RNA-seq data are subsequently aligned onto reference DNA sequences, which are also given as input. As a result, the aligned RNA reads can be used as a proxy of the expression profile of the reference sequences (Fig. 1). Based on these alignments, RNA editing sites, for example C-to-U editing events, can be detected as nucleotide discrepancies between reference positions and their corresponding positions in the aligned reads (Fig. 1b, c). The detection power of this technique is high, as RNA-seq data are usually abundant and genome-wide. However, for reference regions with a low number of aligned reads, RNA editing sites could be incorrectly detected (Fig. 1d) or undetected (Fig. 1e). To illustrate the framework, we will use it to detect C-to-U RNA editing sites with data available in public databases. In addition to this detection, a number of analyses are carried out as valuable complements for editing site detections.
Ernesto Picardi and Graziano Pesole (eds.), RNA Editing: Methods and Protocols, Methods in Molecular Biology, vol. 2181, https://doi.org/10.1007/978-1-0716-0787-9_2, © Springer Science+Business Media, LLC, part of Springer Nature 2021
Fig. 1 Schematic illustration of the framework for detecting RNA editing sites from RNA-seq data. RNA editing sites are detected from paired-end RNA reads aligned on a reference sequence. The reference sequence, composed of two exons, is depicted as a blue rectangle at the bottom, where positions containing cytidines are indicated by uppercase Cs and the exon-exon junction by a vertical bar dividing the blue rectangle. Forward and reverse reads aligned on the reference sequence are depicted as cyan and pink arrows, respectively. Read positions containing cytidines or uridines are depicted as uppercase Cs and Us, respectively. The framework detects C-to-U RNA editing sites when reference cytidines have 10 or more reads aligned showing more than 10% of uridines. (a) A cytidine detected as unedited. (b, c) Two detected C-to-U RNA editing sites, with high (b) and low (c) editing extents, respectively. (d) An incorrectly detected editing site, because the reference cytidine has few reads aligned and one of them has an artifactual base in a low-quality read position (depicted as an italic U). (e) An undetected editing site, because fewer than 10 reads are aligned
2 Materials
2.1 Hardware Requirements
The run time of many programs for processing next-generation sequencing data can be reduced, ideally linearly, through parallel execution on multicore processors. The number of available cores in a computer using a GNU/Linux system can be obtained with this command:
$ cat /proc/cpuinfo | grep "processor" | wc -l
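As a hedged alternative to the command above, the pattern can be anchored to the start of the line so that other lines that happen to contain the word "processor" are not counted. In the sketch below, count_cores is a hypothetical helper name; nproc is part of GNU coreutils and reports the processing units available to the current process.

```shell
# count_cores [FILE]: count the logical processors listed in a cpuinfo-style
# file (default: /proc/cpuinfo), matching only lines starting with "processor".
count_cores() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# nproc may report fewer units than the machine total when the process is
# restricted (e.g., by CPU affinity), so it is guarded and optional here.
if command -v nproc >/dev/null 2>&1; then
    nproc
fi
```

The value obtained this way is a natural upper bound for the thread option of parallel programs, such as the -p option of Bowtie 2 used later in this chapter.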
On the other hand, the use of appropriate disks can significantly contribute to reduce run times, as they offer better access time for processing large files. SSD disks are much faster than conventional HDD disks because the former do not persist and retrieve information mechanically. Even faster performances can be achieved using NVMe SSD disks. In addition, virtualization techniques, such as RAID (Redundant Array of Independent Disks), can further reduce file processing time, as many physical disks are combined into a single logic unit in which read and write disk operations are performed in parallel. For all the experiments executed throughout this chapter, we have used \1/g’ NC_006581.fas
And, using a FASTA file viewer, such as AliView [8] available at https://github.com/AliView/AliView, we will rename orfX, orf25, and orfB as mttB, atp4, and atp8, respectively; fix the badly parsed name of the gene nad1; and remove all the ORFs and duplicated sequences (i.e., nad3 and sdh3). As a result, the FASTA file should contain 35 coding sequences (Table 1), which will be used as reference sequences.
3 Methods
3.1 RNA-seq Data Processing
Next-generation sequencing may introduce artifacts [9], such as sequencing adapters fused to read ends and sequencing errors resulting from chemistry biases [10]. Since these artifacts can significantly impact downstream analyses, the quality of the sequencing data should be analyzed to identify potential issues. Quality analysis tools include NGS QC Toolkit [11], PRINSEQ [12], and FastQC [13]. We will analyze the quality of the tobacco RNA-seq data (see Subheading 2.3) using FastQC. The following command calls FastQC using as inputs the forward and reverse FASTQ files that we previously downloaded:
$ FastQC/fastqc rna-seq/SRR2064998_1.fastq.gz rna-seq/SRR2064998_2.fastq.gz
This command generates an HTML report for each FASTQ file, each containing ten quality control metrics (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help), which can be inspected with a web browser, for example Firefox:
$ firefox rna-seq/SRR2064998_1_fastqc.html rna-seq/SRR2064998_2_fastqc.html
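To make the per-base quality metric concrete, the mean quality score at each read position can be computed directly from a FASTQ file. This is a minimal sketch and not a replacement for FastQC: the function name mean_quality_per_position is hypothetical, and it assumes four-line FASTQ records with Phred+33 encoded qualities.

```shell
# mean_quality_per_position: read FASTQ on stdin and print, for each read
# position, "position<TAB>mean Phred score" (Phred+33 encoding assumed).
mean_quality_per_position() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 0 {                       # 4th line of each record: qualities
             n = length($0)
             if (n > maxlen) maxlen = n
             for (p = 1; p <= n; p++) {
                 sum[p] += ord[substr($0, p, 1)] - 33
                 cnt[p]++
             }
         }
         END { for (p = 1; p <= maxlen; p++) printf "%d\t%.1f\n", p, sum[p] / cnt[p] }'
}

# Toy input with two 3-bp reads: "I" encodes Q40, "5" encodes Q20, "!" encodes Q0
printf '@r1\nACG\n+\nIII\n@r2\nACG\n+\n5I!\n' | mean_quality_per_position
# prints: 1 30.0 / 2 40.0 / 3 20.0 (one tab-separated pair per line)
```

For a compressed file, the input can be streamed, for example with gzip -dc rna-seq/SRR2064998_1.fastq.gz | mean_quality_per_position.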
The reports show that no sequencing adapters were found, the per-base quality score of forward and reverse reads is on average high (>Q28) for all read positions (Fig. 2a), and the read-length distribution is centered on 100 bp and 72 bp for forward and reverse reads, respectively (data not shown). However, the reports indicate some quality issues; for example, the GC content is not uniform at the 5′ and 3′ ends of forward reads (Fig. 2b). In addition, an unexpectedly high level of sequence duplication is reported (data not shown). Usually these duplicated sequences can come from ribosomal RNA (rRNA), as it comprises most of the cellular RNA [14]. However, the reports indicate that such duplications are not rRNA. If they had been rRNA, tools such as SortMeRNA [15] or riboPicker [16] could have been used for removing rRNA-derived reads. Nevertheless, this high level of sequence duplication is not unexpected for RNA-seq data, as gene expression generally occurs in short genomic regions, which are usually highly transcribed.

Table 1 RNA read alignment statistics
Reference sequence     Number of positions   Number of cytidines   Avg. RNA read depth
atp1                   1530                  309                   3627
atp4                   597                   114                   59
atp6                   1188                  215                   38
atp8                   471                   94                    82
atp9                   234                   51                    83
ccmB *                 621                   157                   20
ccmC *                 753                   181                   16
ccmFc *                1317                  298                   12
ccmFN *                1725                  405                   19
cob                    1182                  239                   63
cox1                   1584                  334                   122
cox2                   783                   156                   136
cox3                   798                   166                   98
mat-R                  1977                  523                   177
mttB                   840                   195                   21
nad1                   978                   194                   44
nad2                   1467                  310                   42
nad3                   357                   74                    59
nad4                   1488                  313                   66
nad4L *                303                   51                    15
nad5                   2010                  412                   36
nad6                   687                   125                   39
nad7                   1185                  244                   41
nad9                   573                   108                   41
rpl2                   996                   227                   70
rpl5                   555                   113                   25
rpl16                  516                   92                    227
rps3                   1692                  325                   265
rps4                   1050                  212                   251
rps10                  363                   71                    155
rps12                  378                   72                    76
rps13                  351                   55                    127
rps14 *                360                   72                    16
rps19                  285                   43                    2141
sdh3                   327                   73                    514
Totals                 31,521                6623                  252
>=20 reads on average  26,442                5459                  301
Shown are some statistics calculated from aligning the RNA reads (SRR2064998) on the mitochondrial coding genes of N. tabacum (NC_006581) used as reference sequences. For each reference sequence, shown are the number of reference positions, the number of reference cytidines, and the average number of aligned RNA reads. The reference sequences with average read depths below 20 reads (shaded in gray in the original table) are marked with an asterisk. Total numbers of positions and cytidines across all the reference sequences are shown at the bottom of the table, as well as the average of the average read depths. These totals are also shown after removing the marked reference sequences (>=20 reads on average).

To fix the nonuniform GC content at the ends of forward reads, their 5′ and 3′ ends were clipped and the processed reads were stored in a directory named “pre” (see Note 1). We can use FastQC to analyze the quality of these processed RNA-seq data:
$ FastQC/fastqc pre/SRR2064998_1.fastq.gz pre/SRR2064998_2.fastq.gz
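The clipping step mentioned above can be sketched with a small awk filter. This is an illustration under stated assumptions rather than the procedure of Note 1: the function name clip_fastq is hypothetical, the numbers of bases removed in the example are arbitrary, and four-line FASTQ records are assumed.

```shell
# clip_fastq HEAD TAIL: remove HEAD bases from the 5' end and TAIL bases from
# the 3' end of every read (FASTQ on stdin, four-line records assumed).
# Sequence and quality lines (2nd and 4th of each record) are clipped alike.
clip_fastq() {
    awk -v h="$1" -v t="$2" '
        NR % 2 == 0 { print substr($0, h + 1, length($0) - h - t); next }
        { print }'
}

# Toy example with hypothetical clip sizes: 2 bases at the 5' end, 3 at the 3' end
printf '@r1\nACGTACGTAC\n+\nIIIIIHHHHH\n' | clip_fastq 2 3
# prints the record with sequence GTACG and qualities IIIHH
```

Because no read is discarded, the forward and reverse files stay synchronized, which paired-end aligners require. Reads shorter than HEAD + TAIL would become empty and should be handled separately.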
The new reports show that the quality control metrics for reverse reads did not change (Fig. 2c, d), as expected. However, the clipping changed the length distribution of forward reads, now centered on 72 bp (data not shown), as well as the per-base quality and the GC content at the ends of forward reads (Fig. 2c, d).

Fig. 2 Quality control metrics generated from the tobacco RNA-seq data. Shown are two quality control metrics, the per-base quality score (a, c) and the per-base nucleotide content (b, d), generated by FastQC from the forward and reverse reads before (a, b) and after (c, d) being processed. (a, c) Per-base quality scores. The distribution of quality scores (y-axis) is shown for each read position (x-axis), and is represented by its 10–90 percentile range, its interquartile range, its mean, and its median. The 10–90 percentile range is depicted by a vertical yellow rectangle containing an inner rectangle, which is darker, that depicts the interquartile range. The mean and median values are depicted as black curves and red horizontal lines, respectively. The quality score range (y-axis) is segmented into three regions: low quality when y < 20 (red background), intermediate quality when 20 ≤ y < 28 (yellow background), and high quality when y ≥ 28 (green background). (b, d) Per-base nucleotide content. The relative nucleotide content (y-axis) is shown for each read position (x-axis), in which A, C, G, and T contents are depicted by green, blue, yellow, and red curves, respectively

3.2 RNA Read Alignment
To align the tobacco RNA-seq data onto the reference sequences, a read aligner must be used, for example Bowtie [17], BWA [18], Magic-BLAST [19], STAR2 [20], and TopHat2 [21], among others. Depending on the research problem, some aligners can be better than others [22, 23]. We will use Bowtie, as it is fast and highly flexible. Although Bowtie has been designed for DNA sequencing data (i.e., it is not splice-aware), it can be configured to deal with splicing events. For example, TopHat2, a splice-aware aligner, is built upon Bowtie [21]. In addition, dealing with splicing events in plant organelle genomes is relatively easy, as such events are well documented and highly conserved across plant species. Two versions of Bowtie are available: Bowtie 1 (http://bowtie-bio.sourceforge.net/index.shtml) and Bowtie 2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). Bowtie 2 is recommended for reads longer than 50 bp, as is the case for the tobacco RNA-seq data.
Bowtie performs read alignments using an index calculated from the sequences used as reference, which helps to keep memory consumption low [17]. This index is built from the FASTA file containing the reference sequences (see Subheading 2.4):
$ bowtie2-2.3.5.1-linux-x86_64/bowtie2-build NC_006581.fas NC_006581
The index consists of six files, all with the basename NC_006581 but with different extensions: .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. Once the index is built, we can use Bowtie 2 to align the tobacco RNA-seq data onto the reference sequences:
$ bowtie2-2.3.5.1-linux-x86_64/bowtie2 -x NC_006581 -1 pre/SRR2064998_1.fastq.gz -2 pre/SRR2064998_2.fastq.gz -S NC_006581.sam --local --no-unal --no-mixed --fr --nofw -p 128
Although all the parameters used to call Bowtie 2 are well documented in Bowtie’s manual (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml), they are briefly discussed in Note 2. The aligning mode is an important parameter in Bowtie 2, which we set to local (--local option) when calling Bowtie 2, to make it able to align more reads at the ends of the reference sequences (see Note 3). The resulting read alignments are stored in a SAM file, named NC_006581.sam, which contains 65,076 read pairs. Since many downstream programs expect as input a BAM file (i.e., a compressed version of a SAM file), the output SAM file can be compressed into a BAM file using SAMTools [24]:
$ samtools view -S -b NC_006581.sam > NC_006581.bam
In addition, some programs also require the BAM file to be sorted (and subsequently indexed):
$ samtools sort NC_006581.bam -o NC_006581.srt.bam
$ samtools index NC_006581.srt.bam
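The number of aligned read pairs reported above can be checked from the SAM records themselves. The following sketch is an assumption-laden illustration (count_pairs is a hypothetical name): it counts primary alignments whose FLAG field has the "first in pair" bit (64) set, which gives the number of pairs when both mates of every pair are aligned, as enforced here by --no-mixed and --no-unal.

```shell
# count_pairs: count SAM alignment records (stdin) flagged as first in pair
# (bit 64), skipping header lines, secondary (bit 256), and supplementary
# (bit 2048) alignments. Bits are tested arithmetically for portability.
count_pairs() {
    awk -F'\t' '!/^@/ && int($2 / 64) % 2 == 1 &&
                int($2 / 256) % 2 == 0 && int($2 / 2048) % 2 == 0 { n++ }
                END { print n + 0 }'
}

# Toy SAM input (hypothetical reads): two pairs with flags 83/163
printf '@SQ\tSN:atp1\tLN:1530\n' > /dev/null   # header lines start with "@"
printf '%s\n' \
    "$(printf 'r1\t83\tatp1\t10\t42\t5M\t=\t1\t-14\tACGTA\tIIIII')" \
    "$(printf 'r1\t163\tatp1\t1\t42\t5M\t=\t10\t14\tACGTA\tIIIII')" \
    "$(printf 'r2\t83\tatp1\t30\t42\t5M\t=\t20\t-15\tACGTA\tIIIII')" \
    "$(printf 'r2\t163\tatp1\t20\t42\t5M\t=\t30\t15\tACGTA\tIIIII')" |
    count_pairs   # prints: 2
```

On the real data, the records can be streamed with samtools view NC_006581.srt.bam | count_pairs; samtools view -c -f 64 NC_006581.srt.bam applies a similar flag filter, although it does not exclude secondary or supplementary records.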
The resulting sorted BAM file can be inspected using tools for visualizing BAM files, for example, Tablet [25], which can be downloaded from https://ics.hutton.ac.uk/tablet.
3.3 RNA Editing Site Detection
3.3.1 BAM File Processing
Using the sorted BAM file obtained from Bowtie 2 and the reference sequences in the FASTA file, we will use the program BAM Readcount to extract the read nucleotides aligned on each reference position:
$ bam-readcount -f NC_006581.fas NC_006581.srt.bam > NC_006581.rc
The output NC_006581.rc details the number of read bases aligned on each reference position as well as the read base composition as the number of A, C, G, and U nucleotides. Since reference positions having no aligned reads are not included in the output files of BAM Readcount, which is an issue for some analyses, we created a new file from NC_006581.rc, named NC_006581.rc2, that includes such positions (see Note 4). To facilitate RNA editing site detection, we can arrange the information contained in NC_006581.rc2 into a new file named NC_006581.tsv:
$ awk '{split($6,a,":");split($7,c,":");split($8,g,":");split($9,t,":");b=$2%3;printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, b, $3, a[2]+c[2]+g[2]+t[2], a[2], c[2], g[2], t[2]}' NC_006581.rc2 > NC_006581.tsv
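A worked example on a single line may help to follow the rearrangement. In the sketch below, readcount_to_tsv is a hypothetical wrapper around the same awk program, and the input line is invented: its per-base fields are truncated to base:count for brevity, whereas real BAM Readcount output carries additional colon-separated statistics after the count.

```shell
# readcount_to_tsv: same awk program as above, reading from stdin.
# Fields 6-9 of BAM Readcount hold the A, C, G, and T statistics; the second
# colon-separated value of each is the number of reads with that base.
readcount_to_tsv() {
    awk '{ split($6, a, ":"); split($7, c, ":"); split($8, g, ":"); split($9, t, ":")
           b = $2 % 3
           printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
                  $1, $2, b, $3, a[2] + c[2] + g[2] + t[2], a[2], c[2], g[2], t[2] }'
}

# Invented line: position 6 of atp1, reference C, 10 reads with C and 40 with T
printf 'atp1\t6\tC\t50\t=:0\tA:0\tC:10\tG:0\tT:40\tN:0\n' | readcount_to_tsv
# prints (tab-separated): atp1 6 0 C 50 0 10 0 40
```

Note that the third output field is the position modulo 3, so position 6, a third codon position, is reported as 0.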
In the file NC_006581.tsv, each entry is associated with a reference position and has nine tab-separated fields: the reference name, the position number, the codon position (computed as the position number modulo 3, i.e., 1, 2, and 0 for first, second, and third codon positions, respectively), the nucleotide in the reference position (A, C, G, T), the total number of reads aligned on the reference position, and the number of A, C, G, and U nucleotides that were aligned on the reference position.
3.3.2 Filtering Reference Sequences
The fifth field of the file NC_006581.tsv can be used, for example, to analyze the number of reads aligned on each reference position (i.e., the read depth) and to calculate the average read depth across each reference sequence (Table 1). Since RNA editing site detection is not reliable for reference sequences with low read depth, as false editing sites could be detected if reads contain some low-quality bases (Fig. 1d), we will discard reference sequences having average read depths below 20 reads (shaded coding sequences in Table 1):
$ awk '($1 != "ccmB" && $1 != "ccmC" && $1 != "ccmFc" && $1 != "ccmFN" && $1 != "nad4L" && $1 != "rps14")' NC_006581.tsv > NC_006581.tsv2
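Instead of listing the discarded genes by hand, the average read depth can be computed from the fifth field and used directly as the filter. The following is a sketch with hypothetical function names (mean_depth, filter_by_depth) and invented toy data; it reads the file twice, first to accumulate the per-reference averages and then to print the retained entries.

```shell
# mean_depth: per-reference average of field 5 (read depth), TSV on stdin.
mean_depth() {
    awk -F'\t' '{ sum[$1] += $5; n[$1]++ }
                END { for (r in sum) printf "%s\t%.2f\n", r, sum[r] / n[r] }' | sort
}

# filter_by_depth FILE MIN: print the entries of FILE whose reference
# sequence has an average read depth of at least MIN (two passes over FILE).
filter_by_depth() {
    awk -F'\t' -v min="$2" 'NR == FNR { sum[$1] += $5; n[$1]++; next }
                            sum[$1] / n[$1] >= min' "$1" "$1"
}

# Toy data (invented): geneA averages 20 reads, geneB averages 10 reads
demo=$(mktemp)
printf 'geneA\t1\t1\tA\t30\t30\t0\t0\t0\ngeneA\t2\t2\tC\t10\t0\t2\t0\t8\n' > "$demo"
printf 'geneB\t1\t1\tC\t5\t0\t5\t0\t0\ngeneB\t2\t2\tT\t15\t0\t0\t0\t15\n' >> "$demo"
mean_depth < "$demo"          # geneA 20.00, geneB 10.00
filter_by_depth "$demo" 20    # only the two geneA entries are printed
rm -f "$demo"
```

On the chapter data this would be called as filter_by_depth NC_006581.tsv 20 > NC_006581.tsv2; whether it retains exactly the same genes as the explicit list depends on the unrounded averages.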
As a result, the file NC_006581.tsv2 contains 29 reference sequences, which make a total of 26,442 reference positions, of which 5,459 contain cytidines (Table 1). These 29 coding sequences have on average 300 reads aligned (Table 1, Fig. 3). The highest read depths are observed in the genes atp1, rps19, and rps3 (Fig. 3). 3.3.3 C-to-U RNA Editing Site Detection
To detect C-to-U RNA editing sites, we will look for reference cytidines having uridines in the corresponding positions of their aligned reads (Fig. 1b, c). To avoid detecting false editing sites
Computational Detection of Plant RNA Editing Events
[Fig. 3 plot: radial read-depth profile of the Nicotiana tabacum reference sequences. Center label: 29 CDS regions; 26,442 nucleotides; 5,459 cytidines; Avg Read Depth: 300.86 reads]
Fig. 3 RNA read depths of the tobacco reference sequences. The number of reads aligned on each reference position is represented as a radial gray line. The average read depth across all the reference sequences is indicated at the center, having a value of 300.86 reads. To enhance visualization, the read depths are truncated when they are higher than the average read depth across all the reference sequences
(Fig. 1d), an editing site will be identified with a threshold-based method: a reference cytidine is called an editing site whenever it has more than ten aligned bases and more than 10% of them are uridines (Fig. 1b, c). We will carry out this threshold-based detection using the 5,459 reference cytidines in NC_006581.tsv2:
$ awk '($4 == "C")' NC_006581.tsv2 > NC_006581.refc
from which we will detect the C-to-U editing sites as follows:
$ awk '($5 > 10 && $9/$5 > .10)' NC_006581.refc > NC_006581.esites
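The two filtering steps can also be combined and parameterized, which makes it easier to try other depth or fraction cutoffs. The following is a minimal sketch under the assumption that the input has the nine-column layout of NC_006581.tsv2; the function name detect_sites and its default arguments are hypothetical.

```shell
# detect_sites FILE [MIN_DEPTH] [MIN_FRACTION]
# Keep reference cytidines (column 4 == "C") whose depth (column 5) exceeds
# MIN_DEPTH and whose T/U fraction (column 9 / column 5) exceeds MIN_FRACTION.
# The depth test comes first, so positions with zero depth are never divided.
detect_sites() {
  awk -F'\t' -v d="${2:-10}" -v f="${3:-0.10}" \
    '$4 == "C" && $5 > d && $9 / $5 > f' "$1"
}

# Example calls on the real table (not run here):
#   detect_sites NC_006581.tsv2             # defaults: depth > 10, fraction > 0.10
#   detect_sites NC_006581.tsv2 10 0.05     # a more permissive 5% threshold
```

With the default arguments, the function applies the same criteria as the two commands above.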
This threshold-based method found a total of 359 C-to-U editing sites. However, the editing site detection depends on the threshold value chosen. The threshold of 10% has been proposed based on phylogenetic evidence [26], but other thresholds can be explored by rerunning the last command with threshold values ranging from 1% to 100%. The result shows that threshold values less than
Alejandro A. Edera and M. Virginia Sanchez-Puerta
Table 2 Comparison of C-to-U RNA editing sites detected by the Fisher-based and threshold-based methods

                          Threshold-based detection (ground truth)
Fisher-based detection    Edited    Unedited    Total
Edited                    331       2           333
Unedited                  28        5098        5126
Total                     359       5100        5459

Sensitivity: 92.20%
5% are highly sensitive while values higher than 80% are highly conservative (Fig. 4). This suggests that the threshold of 10% is not too conservative, being sufficiently sensitive to detect weak editing sites (Fig. 1c).
Alternatively, RNA editing sites can be detected using statistical methods [4, 27]. We can compare the detection power of our threshold-based method with a statistical method based on Fisher's exact test [4]. Repeating the editing site detection with the Fisher-based method identified a total of 333 C-to-U editing sites among the reference cytidines (see Note 5). Comparing these statistically detected editing sites with those detected by the threshold-based method (used as ground truth) shows strong agreement between the two methods: a sensitivity of 92.20% (Table 2), that is, 331 of the 359 threshold-based sites were also detected by the Fisher-based method. Of the 28 editing sites that the Fisher-based method was unable to detect (Table 2), the so-called false negatives, about 85% are cytidines having 10–20% of uridines or 10–20 reads aligned (data not shown). This indicates that the statistical detection method is more conservative than the threshold-based method, in particular for cytidines having weak editing evidence (Fig. 1c) and few reads aligned (Fig. 1d, e). As the statistical and threshold-based methods have similar detection power but the threshold-based method is methodologically simpler, we will continue using the C-to-U editing sites detected by the threshold-based method with a threshold of 10%.
3.4 Analyzing Detections
3.4.1 Biased Distribution Among the Three Codon Positions
There are different ways to assess the reliability of the editing sites that were detected with the threshold-based method from the tobacco RNA-seq data. For example, the detected C-to-U editing sites should exhibit the patterns that C-to-U editing sites generally show in organellar coding sequences of land plants, such as a differential editing extent among the three codon positions [26, 28]. In addition, C-to-U editing sites are unevenly distributed in coding regions, with the majority found at second codon positions, followed by first and third codon positions [29]. We
[Fig. 4 plot: number of editing sites (y-axis, 0 to 600) versus threshold in % (x-axis, 0 to 100)]
Fig. 4 Sensitivity of the threshold value used to identify C-to-U RNA editing sites. Shown is the number of C-to-U RNA editing sites (y-axis) detected for different threshold values ranging from 1% to 100% (x-axis). For a given threshold, the number of detected C-to-U RNA editing sites is divided according to their codon positions: first (blue), second (green), and third (red) codon positions. A vertical black line indicates a threshold value equal to 10%
can inspect how the detected editing sites are distributed among the three codon positions using the following command:
$ cut -f3 NC_006581.esites | sort | uniq -c
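The same tally can be repeated for several thresholds, which gives roughly the kind of counts summarized in Fig. 4. The following is a minimal sketch rather than the authors' plotting procedure: the function name codon_tally is hypothetical, the input is assumed to be the file of reference cytidines, and a codon position stored as either 3 or 0 is counted as a third position.

```shell
# codon_tally FILE THRESHOLD_PERCENT
# Print "threshold first second third": the number of detected sites per
# codon position (column 3) among positions with depth > 10 (column 5) and
# a T/U percentage (100 * column 9 / column 5) above the threshold.
codon_tally() {
  awk -F'\t' -v pct="$2" '
    $5 > 10 && 100 * $9 / $5 > pct {
      k = ($3 % 3 == 0) ? 3 : $3   # accept 3 or 0 as the third codon position
      n[k]++
    }
    END { printf "%s\t%d\t%d\t%d\n", pct, n[1], n[2], n[3] }' "$1"
}

# Example sweep on the real file (not run here):
#   for pct in 1 5 10 20 40 80; do codon_tally NC_006581.refc "$pct"; done
```

Each output line can then be compared across thresholds to check whether the ranking of the three codon positions changes.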
This command shows that the detected editing sites are unevenly distributed among the three codon positions: 111, 213, and 35 sites are in the first, second, and third codon positions, respectively (Fig. 4). Moreover, this uneven distribution among the three codon positions is not a threshold-introduced artifact, as the same biased distribution is observed for all the threshold values (Fig. 4).
3.4.2 Comparing Detections with Other Methods
Alternatively, the detection reliability of our method can be assessed by comparing the editing sites it identified with those reported in previous studies. The C-to-U RNA editing sites of N. tabacum mitochondrial genes have previously been detected with a computational methodology rather similar to our threshold-based method [30]. This previous study also used next-generation sequencing data, deposited under the accession SRX403934, and reported 334 editing sites in the 29 coding sequences that we are analyzing here. Using these 334 editing sites as ground truth, the editing sites detected by our method with a threshold of 10% agree closely with them, as indicated by a sensitivity of 88.32% (Table 3). The two detections disagree on only 103 sites (Table 3): 39 sites detected by Grimes et al. [30] but not by us (false negatives) and 64 sites that we detected but Grimes et al. [30] did not (false positives). To estimate
Table 3 Comparison of C-to-U RNA editing sites detected by the threshold-based method and those reported in a previous study

                             Grimes et al. (2014)
Threshold-based detection    Edited    Unedited    Total
Edited                       295       64          359
Unedited                     39        5061        5100
Total                        334       5125        5459

Sensitivity: 88.32%
how many of these false negatives and false positives could be true C-to-U editing sites, we can analyze their phylogenetic conservation. Editing sites in organellar coding sequences are highly conserved across land plants, because the amino acid changes introduced by C-to-U RNA editing tend to restore phylogenetically conserved amino acids [29, 31]. The phylogenetic conservation of editing sites can be analyzed with PREP-Mt (http://prep.unl.edu), a program that predicts C-to-U RNA editing sites in well-known organellar coding sequences of land plants [31]. It predicts cytidines as C-to-U editing sites by converting cytidines to thymidines and then evaluating whether this restores highly conserved amino acids [31]. If the false negatives or false positives are true C-to-U editing sites, they should be phylogenetically conserved, and so predicted as C-to-U editing sites by PREP-Mt. However, PREP-Mt can only predict non-synonymous editing sites (i.e., editing sites that change the encoded amino acid). Synonymous C-to-U editing sites include sites in third codon positions and in first codon positions of the codons CUA and CUG (which encode leucine).
In the last comparison (Table 3), we found 39 false negatives and 39 false positives that are non-synonymous editing sites. A total of 92% of the non-synonymous false negatives and 51% of the non-synonymous false positives were predicted as C-to-U editing sites by PREP-Mt using a cutoff of 0.2. These predictions show that the threshold-based method was unable to detect many phylogenetically conserved C-to-U editing sites. To understand why these sites were undetected, we analyzed the number of reads that were aligned on the non-synonymous false negatives and the non-synonymous false positives (Fig. 5). In comparison to the non-synonymous false positives, virtually all the non-synonymous false negatives have very low numbers of reads aligned, on average below 20 reads (gray-shaded region in Fig. 5).
This indicates that the threshold-based method failed to detect the majority of the false-negative sites because of their very low
[Fig. 5 plot: panels "False negatives" and "False positives"; y-axis: read number; x-axis: editing site positions grouped by gene]
Fig. 5 Number of reads aligned on false negatives and false positives. Shown are false-negative (gray shaded on the left) and false-positive (on the right) C-to-U editing sites (x-axis) obtained from comparing the threshold-based method with a previous study. Only non-synonymous false positives and false negatives are shown. The number of reads (y-axis) aligned on each false negative and false positive is depicted as a vertical bar, whose color indicates first (blue) or second (green) codon position. Asterisks indicate sites predicted by PREP-Mt. The average number of aligned reads is depicted by a horizontal red line for false negatives and false positives, respectively
number of aligned reads (Fig. 1e). Regarding the sites that cannot be explained in this way, we hypothesize that the majority of them are probably editing sites that are tissue specific [4, 32, 33] or development dependent [34].
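The comparisons summarized in Tables 2 and 3 amount to counting shared and unshared sites between two lists. The following minimal sketch shows one way to do this, assuming each list holds one site per line as a reference name and a position separated by a tab. The function name compare_sites and the file names in the example call are hypothetical. Sensitivity is computed as 100 × TP / (TP + FN); with the counts in Table 3, this is 100 × 295 / 334 = 88.32.

```shell
# compare_sites DETECTED TRUTH
# Each file lists one site per line as "reference<TAB>position".
# Prints the sites shared by both lists (TP), those only in TRUTH (FN),
# those only in DETECTED (FP), and the sensitivity in percent.
compare_sites() {
  awk -F'\t' '
    NR == FNR { det[$1 FS $2] = 1; next }   # first file: detected sites
    { if (($1 FS $2) in det) { tp++; seen[$1 FS $2] = 1 } else fn++ }
    END {
      for (k in det) if (!(k in seen)) fp++
      sens = (tp + fn > 0) ? 100 * tp / (tp + fn) : 0
      printf "TP=%d FN=%d FP=%d sensitivity=%.2f\n", tp, fn, fp, sens
    }' "$1" "$2"
}

# Example call (file names are placeholders, not run here):
#   cut -f1,2 NC_006581.esites > ours.txt
#   compare_sites ours.txt previous_study.txt
```

The sketch assumes that the first list is not empty and that neither list contains duplicated sites.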
4 Notes
1. Read Quality Processing
Many quality processing tools for next-generation sequencing data are available: AfterQC [35], BBMap (https://sourceforge.net/projects/bbmap), Cutadapt [36], Fastp [37], and Trimmomatic [38], among others. To process the tobacco RNA-seq data, we will use Trimmomatic. We will create a new directory named pre to store the processed RNA-seq data:
$ mkdir pre && cd pre
and in this directory we will install Trimmomatic as follows:
$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip
Using Trimmomatic (which requires Java; on Debian-based systems it can be installed with sudo apt-get install default-jdk), we will clip the ends of the forward reads as follows:
$ java -jar Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 128 ../rna-seq/SRR2064998_1.fastq.gz SRR2064998_1.fastq.gz HEADCROP:13 CROP:72
With this command, 13 bases are clipped from the 5′ read ends (HEADCROP:13) and 15 bases from the 3′ read ends, where the latter clipping is done by cropping the reads to a length of 72 bp (CROP:72). The -threads option allows parallel execution, in this case using 128 cores. Finally, to keep the paired FASTQ files in the same directory, a symbolic link is created pointing to the reverse reads:
$ ln -s ../rna-seq/SRR2064998_2.fastq.gz .
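The effect of the two trimming steps on a single read can be illustrated with a toy sequence. This is only a sketch of the arithmetic, not a replacement for Trimmomatic: the function name crop_read and the made-up read are hypothetical, and a read length of 100 bases is assumed because it is what the clipped lengths given above imply (13 + 72 + 15).

```shell
# crop_read SEQ HEAD LEN
# Drop HEAD bases from the 5' end, then keep at most LEN bases, mimicking
# HEADCROP:HEAD followed by CROP:LEN applied to one read sequence.
crop_read() {
  printf '%s\n' "$1" | awk -v h="$2" -v n="$3" '{ print substr($0, h + 1, n) }'
}

# A made-up 100-base read: 13 x "A", then 72 x "C", then 15 x "G".
read_seq=$(awk 'BEGIN {
  for (i = 0; i < 13; i++) printf "A"
  for (i = 0; i < 72; i++) printf "C"
  for (i = 0; i < 15; i++) printf "G"
  print ""
}')

# Length of the read after both steps; only the 72 central "C" bases remain.
crop_read "$read_seq" 13 72 | awk '{ print length($0) }'
```

For this toy read, the 13 leading and 15 trailing bases are removed, leaving the 72 central bases.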
2. Bowtie 2 Parameters
We called Bowtie 2 using its four mandatory arguments: -x, -1, -2, and -S, which indicate the index basename, the files containing the forward and reverse reads, and the name of the output file, which is in SAM format (see http://samtools.github.io/hts-specs/SAMv1.pdf for more details about the SAM format), respectively. In addition, we used other arguments to control the performance of Bowtie 2. We called Bowtie 2 using the local mode (--local) instead of the default mode, the so-called end-to-end mode. While the end-to-end mode aligns the full length of a read on a reference, the local mode aligns the center part of the read on the reference, leaving the read ends unaligned if they do not match the reference sequence. In general, both modes will obtain similar read alignments. For example, using the coding region of an intron-containing gene as reference, both modes will ideally align reads derived from spliced transcripts and from the exonic regions of unspliced transcripts (case #1 in Fig. 6), and will not align reads completely derived from introns or UTRs (case #2 in Fig. 6). However, in this scenario and in contrast to the end-to-end mode, the local mode will be able to align additional reads: those containing short portions of
Fig. 6 Schematic alignment of RNA reads derived from immature and mature transcripts using coding regions as reference. Shown are seven illustrative cases of reads derived from unspliced (left panel) and spliced (center panel) transcripts of an intron-containing gene with two exons. Using the local mode of Bowtie 2, the reads are aligned on a reference containing the coding regions of this gene (right panel). Paired-end reads are depicted as forward (cyan) and reverse (pink) reads, respectively. Read ends enclosed by yellow boxes indicate that they are clipped by the local mode of Bowtie 2 during read alignment
UTRs or introns at their ends (case #3 in Fig. 6), unless such portions are too large (case #4 in Fig. 6). In addition to the alignment mode, we called Bowtie 2 with other arguments. The --no-mixed option prevents Bowtie 2 from aligning individual read mates when it fails to align them as a pair (case #5 in Fig. 6), so the SAM file produced by Bowtie 2 will only contain paired reads. As a disadvantage, the --no-mixed option may discard correct single reads, for example, those whose mates cannot be aligned because they span UTRs or introns (case #6 in Fig. 6). The options --fr --nofw tell Bowtie 2 to align reads in a certain orientation to guarantee a strand-specific alignment [39]. This orientation expects the reverse read of each pair to be upstream of its mate, with the downstream mate having a sequence complementary to the reference. As a result, when using the options --fr --nofw, reads aligned in unexpected orientations are discarded (case #7 in Fig. 6). The --no-unal option prevents Bowtie 2 from reporting unaligned reads in the output SAM file, which considerably reduces its size. Finally, the -p option indicates the number of cores that Bowtie 2 will use for parallel execution.
3. End-to-End vs. Local Mode for RNA Read Alignment
We compared the read alignments obtained with the end-to-end and local modes of Bowtie 2 when using the coding sequences of N. tabacum as reference. For the end-to-end read alignments, we called Bowtie 2 with the same arguments used for the local read alignments but replacing the --local option by the --end-to-end option. The local mode consistently aligned more reads than the end-to-end mode in virtually all the reference sequences (Fig. 7a). The largest differences are observed in genes with high transcriptional levels, such as atp1, rps19, and rps3 (Fig. 3). The read-depth differences between both modes mainly occur at the ends of the reference
[Fig. 7 plot: (a) cumulative read depth difference (y-axis, in K) for each reference sequence (x-axis: atp1, atp4, atp6, atp8, atp9, ccmB, ccmC, ccmFc, ccmFN, cob, cox1, cox2, cox3, mat-R, mttB, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, nad9, rpl2, rpl5, rpl16, rps3, rps4, rps10, rps12, rps13, rps14, rps19, sdh3); (b) read depth (y-axis) along the gene position in bp (x-axis) for atp1, atp8, atp9, and cox1]
Fig. 7 Comparison between the end-to-end and local aligning modes used by Bowtie 2. (a) Cumulative read depth differences. A cumulative read depth is calculated for each reference sequence as the sum of the read depths of all its reference positions. Two cumulative values are obtained for each reference sequence by using the RNA read alignments obtained by the end-to-end and local aligning modes, respectively. The difference between the two cumulative values (y-axis) is depicted as a vertical bar for each reference sequence (x-axis). y-axis values are expressed in thousands (K). (b) Read depth differences between end-to-end and local modes for some reference sequences. Shown is the read depth for four references. The read depth refers to the number of reads (y-axis) aligned on each reference position (x-axis) with the end-to-end mode (red) and the local mode (red), respectively
sequences (Fig. 7b), as the end-to-end mode is unable to align reads whose ends contain portions of UTRs or introns. To increase the read depths obtained by the end-to-end mode, the reference sequences can be extended at their ends using their flanking regions. In this way, the end-to-end mode and the local mode should not show significant read-depth differences for intronless genes.
4. Filling Output File of BAM Readcount
BAM Readcount does not output reference positions that have no aligned reads. This can be addressed by creating a “blank” BAM-Readcount output file from the FASTA file containing all the positions of the reference sequences. This blank
file, named NC_006581.blank, contains all the (31,521) reference positions initialized with zero aligned reads:
$ cat NC_006581.fas | sed 's/^>\([^\n]\+\)/>\1!/g' | tr -d '\n' | tr '>' '!' | sed 's/^!//g' | awk 'BEGIN{K=""; P=0; T=":0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00"} {split($0, a, "!"); for (i=1; i