English Pages xvi+331 Year 2015
Methods in Molecular Biology 1384
Emilio Marengo Elisa Robotti Editors
2-D PAGE Map Analysis Methods and Protocols
Series Editor
John M. Walker
School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
Edited by
Emilio Marengo and Elisa Robotti
Department of Sciences and Technological Innovation, Università del Piemonte Orientale, Alessandria, Italy
ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-4939-3254-2
ISBN 978-1-4939-3255-9 (eBook)
DOI 10.1007/978-1-4939-3255-9
Library of Congress Control Number: 2015954566
Springer New York Heidelberg Dordrecht London
© Springer Science+Business Media New York 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Humana Press is a brand of Springer
Springer Science+Business Media LLC New York is part of Springer Science+Business Media (www.springer.com)
Dedication
To my father who was for me an example of rectitude, honesty and energy. To my mother for teaching me to love life in all its forms. To Elena, my future. To all who love me, for their sustain.
Emilio Marengo
To my mother and my father for their love and sustain and for teaching me to never give up. To Marco, my present. To Matteo, my future.
Elisa Robotti
Foreword
It seems only yesterday when, in 1975, a bright star rose suddenly above the horizon of Separation Science. Three labs reported a two-dimensional (2-D, charge coupled orthogonally to mass separation) technique for the analysis of complex protein systems, simultaneously and independently, although most of the credit went just to O’Farrell [1–3]. Perhaps this was because his system was the most elaborate: in fact, he was able to resolve and detect about 1100 different proteins from lysed E. coli cells on a single 2-D map and suggested that the maximum resolution capability might have been as high as 5000 different proteins. Apart from the meticulous attention to detail, major reasons for the advance in resolution obtained by O’Farrell, compared to earlier workers, included the use of samples labeled with ¹⁴C or ³⁵S to high specific activity, and the use of thin (0.8 mm) gel slabs for the second dimension, which could then be dried down easily for autoradiography. This detection method was able to reveal protein zones corresponding to one part in 10⁷ of the sample (usually 1–20 µg was applied initially, since higher loads caused zone spreading, although up to 100 µg could be loaded). Coomassie blue, in comparison, was about 3 orders of magnitude less sensitive and could reveal only about 400 spots. For the first dimension, O’Farrell adopted gel rods of 13 cm in length and 2.5 mm in diameter. The idea was to run samples fully denatured, in what became known as the “O’Farrell lysis buffer” (9 M urea, 2 % Nonidet P-40, 2 % β-mercaptoethanol, and 2 % carrier ampholytes, in any desired pH interval). For the second SDS-PAGE dimension, O’Farrell [1] used the discontinuous buffer system of Laemmli [4] and, for improved resolution, a concave exponential gradient of polyacrylamide gel (usually in the intervals 9–15 or 10–14 %T, although wide porosity gradients, e.g., 5–22.5 %T, were also suggested).
It is thus seen that, since its very inception, O’Farrell carefully selected all the best conditions available at the time; it is no wonder that his system was adopted as such in the avalanche of reports that soon followed (as of this writing, his paper has received about 20,000 citations!). O’Farrell’s 2-D mapping protocol became the basic methodology for what we would call today “proteomics” of any tissue or biological fluid, where thousands of components were suspected to be present. It remained the gold standard for such investigations at least for the following 25 years, up to the third millennium. When I stated that the O’Farrell 2-D mapping introduced in 1975 had been a bright star I affirmed a widely held opinion, but it was not a polar star for us navigators in the starry sky represented by the polypeptide display in the 2-D gel. It was wishful thinking, at best. There was indeed a major impediment in this methodology, namely the erratic spot profile obtained by performing the first dimension in conventional IEF with soluble carrier ampholytes (CA), à la Svensson-Vesterberg, if you like [5]. There were no fixed stars in the firmament of 2-D maps: the apparent pI values kept changing, from batch to batch of CAs and, of course, from brand to brand, as manufactured by different companies (a chaotic synthesis, as you might remember) [6]. The situation was so frustrating that the Andersons recommended carbamylation train standards for mapping the pH gradient course and even preparing large volumes of stock solutions of CAs, obtained by carefully blending the various commercial products. Help was soon at hand, since in 1982 Bjellqvist et al. [7] launched another supernova in the sky of bioanalysis: immobilized pH gradients (IPGs), which were soon demonstrated to be able to overcome all these problems, while affording
exquisite resolution when run in narrow and ultra-narrow pH ranges. IPGs went largely unnoticed for about a decade, even though they brought about some out-of-(terrestrial) space results in bio-separations, including a resolution limit of ΔpI = 0.001 for IPGs, vs. a maximum resolving power of conventional IEF in soluble carrier ampholytes of ΔpI = 0.01, one order of magnitude less. Together with that, IPGs brought “democracy” for the first time in electrokinetic processes. Up to their introduction, 2-D maps had been conducted only in linear pH gradients, which penalize acidic proteins, jammed in the overcrowded zone of the pH 4–6 region, where >60 % of all proteins focus. Already in 1985, we were able to describe a broad range, nonlinear IPG, strongly flattened in the crowded region, with a sharp upward turn at alkaline pH values [8]; these ranges are by far the most popular in today’s 2-D map analyses. Needless to say, IPGs proved to offer a loading ability much superior to that of conventional IEF. Gels could be massively overloaded without isoelectric precipitations or smears. This unique property could be exploited in 2-D maps for detecting low-abundance proteins; whereas the typical protein load in 18 × 20 cm gel slabs was up to 0.5 mg, with IPGs the load could be incremented up to 10 mg per gel [9]! Two-dimensional maps represent only one half of the proteomics panorama of the present day (excluding from the count mass spectrometry, which of course has an enormous significance in this field). The other half is a chromatographic approach developed by Yates III and coworkers [10, 11], consisting of an online 2-D ion-exchange column coupled to RP-HPLC, termed MudPit, for separating tryptic digests of 80S ribosomes from yeast. The acidified peptide mixture is loaded onto a strong cation exchanger (SCX) column; discrete eluate fractions are then fed onto an RP (reversed-phase) column, whose effluent is injected directly into a mass spectrometer.
This iterative process is repeated 12 times by using increasing salt gradient elution from the SCX bed and an increasing organic solvent concentration from the RP beads (typically a C18 phase). In a total yeast lysate, the MudPit strategy allowed the identification of almost 1500 proteins [11]. There are major differences, though, between these two methods, in that the first one (2-D gel mapping) consists of separating intact proteins, as found in the original tissue in which they were expressed, whereas in the second, chromatographic approach, only digested species are analyzed, which means that subtle differences in expression (e.g., deamidation, proteolytic cleavage products originating in vivo) are usually lost. Ideally, though, a lab should utilize both approaches, since it is claimed that the advantage of MudPit (an unfortunate acronym for such a powerful technique, since it literally means “hole full of mud”) would be the ability to detect also scarcely soluble membrane proteins and very acidic or basic species whose pI values would be outside the range of IEF/IPG. It turns out, though, that a kind of dichotomy has developed, by which in the USA MudPit is mostly adopted (on the grounds that it is mostly an instrumental approach, involving little labor from lab technicians) whereas on this side of the Atlantic 2-D gel approaches are still much in vogue. What should I state about the present book? In principle, there are so many books already published describing in detail all methodological approaches, tips and hints on 2-D gel slabs, that an additional one would seem to be pleonastic. Yet, by looking at the table of contents and at the list of chapters (no less than 15), it is easy to note that this book is a very special one in the panorama of manuals published up to the present.
Whereas all the others are “cookbooks” giving just about only recipes, this one, on the contrary, gives mostly and perhaps only theory, a field too much ignored in all treatises on 2-D gel maps. So, I believe that this is a most useful and unique approach, in that it would help users to avoid common pitfalls due to ignorance of the basic theoretical mechanisms underlying the technique, including data handling and proper tools for spot analysis. Of course potential users had
better have a minimum of mathematical background in order to be able to understand all the theories here proposed. Milano, 30th April 2015
Pier Giorgio Righetti Honorary Professor, Politecnico di Milano Milano, Italy
References
1. O’Farrell PH (1975) High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250:4007–4021
2. Klose J (1975) Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 26:231–243
3. Scheele GA (1975) Two-dimensional gel analysis of soluble proteins. Characterization of guinea pig exocrine pancreatic proteins. J Biol Chem 250:5375–5385
4. Laemmli UK (1970) Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227:680–681
5. Righetti PG (1983) Isoelectric Focusing: Theory, Methodology and Applications. Elsevier, Amsterdam, pp 1–386
6. Righetti PG, Simò C, Sebastiano R, Citterio A (2007) Carrier ampholytes for IEF, on their fortieth anniversary (1967–2007), brought to trial in court: The verdict. Electrophoresis 28:3799–3810
7. Bjellqvist B, Ek K, Righetti PG, Gianazza E, Goerg A, Postel W, Westermeier R (1982) Isoelectric focusing in immobilized pH gradients: principle, methodology and some applications. J Biochem Biophys Methods 6:317–339
8. Gianazza E, Giacon P, Sahlin B, Righetti PG (1985) Non-linear pH courses with immobilized pH gradients. Electrophoresis 6:53–56
9. Righetti PG, Gelfi C (1984) Immobilized pH gradients for isoelectric focusing. III. Preparative separations in highly diluted gels. J Biochem Biophys Methods 9:103–111
10. McCormack AL, Schieltz DM, Goode B, Yang S, Barnes G, Drubin D, Yates JR 3rd (1997) Direct analysis and identification of proteins in mixtures by LC/MS/MS and database searching at the low-femtomole level. Anal Chem 69:767–776
11. Washburn MP, Wolters D, Yates JR 3rd (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotech 19:242–247
Preface
As commonly acknowledged, two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) is one of the most widespread techniques for the separation of complex protein extracts, above all in the research field, for the identification of candidate biomarkers of different biological effects (pathologies, drug effects, ripening effects, etc.). Although it is quite time-consuming and labor-intensive, it is still one of the most exploited techniques for the separation of protein extracts due to its versatility and its low cost as compared to high-throughput high-resolution methods. SDS 2-D PAGE can also be considered nowadays as a phase of sample pre-treatment to enhance or facilitate the subsequent analysis based on mass spectrometry, and remove the masking effect of the most abundant proteins. The achievement of the final 2-D map is based on a multistep experimental procedure involving: sample preparation and pretreatment, isoelectric focusing, interfacing of the first and second dimension, separation according to the molecular mass, and staining and destaining of the final 2-D map. Once the final maps are obtained, they have to undergo a final step usually named “differential analysis,” providing sets of candidate biomarkers by comparison of maps from different classes of samples (e.g., control vs. pathological, control vs. drug treated, etc.). This final step is in turn a complex multistep procedure that relies on image analysis tools to provide information about the differences existing between the groups of investigated samples, i.e., the candidate biomarkers. In most cases the comparison of different 2-D maps is accomplished by dedicated software packages guiding the operator through a wizard-like procedure for noise removal (background, artifacts, etc.)
and image warping, with the final aim of aligning gel images and matching protein spots across gels, to open the way toward the final quantification of spot intensities across all gel images. Once protein spots are matched and quantified, statistical methods can be applied to identify the relevant upregulation and downregulation. The present volume focuses on the in-depth analysis of 2-D maps by bioinformatics tools, with regard to both the image analysis process used to detect and quantify protein spots and the statistical analysis carried out to identify candidate biomarkers (i.e., spots upregulated or downregulated across samples). Two main approaches to the analysis of 2-D map images are available:
– The first involves a step of spot detection on each gel image, to provide a final list of spots present in all the investigated gels, each characterized by its volume: the final differential analysis is then performed on the spot volume dataset obtained.
– The second is instead based on the direct differential analysis of the 2-D map images following a pixel-based procedure.
In the traditional approach spot boundaries are identified on each gel and spots are matched across multiple gels using a reference or master gel. However, this approach suffers from an important drawback: when the matching method fails, missing values are introduced in the spot volume data table; they can be due to the absence of a spot on the gel (in this case the missing value is a true zero) or to a failure in matching (in this case instead the missing value cannot be substituted by zero). Another problem regards the
definition of spot boundaries, which is particularly challenging in the case of overlapping spots. The pixel-based approach can overcome these problems, but both approaches rely on proper gel alignment; moreover, the pixel-based method is computationally intensive. The volume is structured in four parts. The first part examines the problem of 2-D map reproducibility and map modeling. The validity of the results obtained by the final differential analysis depends heavily on the choices made during the experimental planning. This aspect is addressed in the first part of the book, to provide general good practices for a correct experimental design. In this part, the problem of spot overlapping is also addressed, and the main software packages available for 2-D map analysis are presented. The second part is instead devoted to spot-based methods: algorithms for map denoising, background removal, and normalization are presented; the problems of image warping and of spot detection and matching are discussed, and the most widespread algorithms available are described in detail. The third part is mainly devoted to the description of classical and multivariate statistical methods that can be applied to spot volume datasets for the identification of candidate biomarkers. Finally, the last part focuses on direct image analysis tools through pixel-based approaches. The mathematical and statistical procedures are described from a theoretical point of view, to provide the basis for their correct application, but examples of applications are also provided.
The book is in fact intended to be of use both to specialists in 2-D map image analysis and to researchers who carry out 2-D map analysis through a wizard-like procedure: for the former it is meant as a compendium of the most recent applications, for the latter as a guide to understanding all the main steps of image analysis, so as to avoid errors and misinterpretations during the image analysis process.
Alessandria, Italy
Emilio Marengo Elisa Robotti
Contents

Foreword
Preface
Contributors

PART I 2D-MAPS REPRODUCIBILITY AND MAPS MODELING

1 Sources of Experimental Variation in 2-D Maps: The Importance of Experimental Design in Gel-Based Proteomics
Cristina-Maria Valcu and Mihai Valcu
2 Decoding 2-D Maps by Autocovariance Function
Maria Chiara Pietrogrande, Nicola Marchetti, and Francesco Dondi
3 Two-Dimensional Gel Electrophoresis Image Analysis via Dedicated Software Packages
Martin H. Maurer

PART II IMAGE ANALYSIS TOOLS TO PROVIDE SPOT VOLUME DATASETS

4 Comparative Evaluation of Software Features and Performances
Daniela Cecconi
5 Image Pretreatment Tools I: Algorithms for Map Denoising and Background Subtraction Methods
Carlo Vittorio Cannistraci and Massimo Alessio
6 Image Pretreatment Tools II: Normalization Techniques for 2-DE and 2-D DIGE
Elisa Robotti, Emilio Marengo, and Fabio Quasso
7 Spot Matching of 2-DE Images Using Distance, Intensity, and Pattern Information
Hua-Mei Xin and Yuemin Zhu
8 Algorithms for Warping of 2-D PAGE Maps
Marcello Manfredi, Elisa Robotti, and Emilio Marengo
9 2-DE Gel Analysis: The Spot Detection
Simona Martinotti and Elia Ranzato
10 GENOCOP Algorithm and Hierarchical Grid Transformation for Image Warping of Two-Dimensional Gel Electrophoretic Maps
Elisa Robotti, Emilio Marengo, and Marco Demartini
11 Detection and Quantification of Protein Spots by Pinnacle
Jeffrey S. Morris and Howard B. Gutstein
12 A Novel Gaussian Extrapolation Approach for 2-D Gel Electrophoresis Saturated Protein Spots
Massimo Natale, Alfonso Caiazzo, and Elisa Ficarra

PART III STATISTICAL METHODS APPLIED TO SPOT VOLUME DATASETS TO IDENTIFY CANDIDATE BIOMARKERS

13 Multiple Testing and Pattern Recognition in 2-DE Proteomics
Sebastien C. Carpentier
14 Chemometric Multivariate Tools for Candidate Biomarker Identification: LDA, PLS-DA, SIMCA, Ranking-PCA
Elisa Robotti and Emilio Marengo

PART IV DIFFERENTIAL ANALYSIS FROM DIRECT IMAGE ANALYSIS TOOLS

15 The Use of Legendre and Zernike Moment Functions for the Comparison of 2-D PAGE Maps
Emilio Marengo, Elisa Robotti, and Marco Demartini
16 Nonlinear Dimensionality Reduction by Minimum Curvilinearity for Unsupervised Discovery of Patterns in Multidimensional Proteomic Data
Massimo Alessio and Carlo Vittorio Cannistraci
17 Differential Analysis of 2-D Maps by Pixel-Based Approaches
Emilio Marengo, Elisa Robotti, and Fabio Quasso

Index
Contributors
MASSIMO ALESSIO: Proteome Biochemistry, San Raffaele Scientific Institute, Milan, Italy
ALFONSO CAIAZZO: WIAS Berlin, Berlin, Germany
CARLO VITTORIO CANNISTRACI: Biomedical Cybernetics Group, Biotechnology Center (BIOTEC), Technische Universität Dresden, Dresden, Germany
SEBASTIEN C. CARPENTIER: Department of Biosystems, Faculty of Bioscience Engineering, K.U. Leuven, Leuven, Belgium; SYBIOMA: Facility for Systems Biology Based Mass Spectrometry, Leuven, Belgium
DANIELA CECCONI: Mass Spectrometry & Proteomics lab, Department of Biotechnology, University of Verona, Verona, Italy
MARCO DEMARTINI: Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy
FRANCESCO DONDI: Department of Chemical and Pharmaceutical Sciences, University of Ferrara, Ferrara, Italy
ELISA FICARRA: Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
HOWARD B. GUTSTEIN: Anesthesiology and Perioperative Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
MARCELLO MANFREDI: Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy; High Resolution Mass Spectrometry Lab, ISALIT SRL, Spin-off of University of Piemonte Orientale, Alessandria, Italy
NICOLA MARCHETTI: Department of Chemical and Pharmaceutical Sciences, University of Ferrara, Ferrara, Italy
EMILIO MARENGO: Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy
SIMONA MARTINOTTI: DiSIT (Dipartimento di Scienze e Innovazione Tecnologica), University of Piemonte Orientale, Alessandria, Italy
MARTIN H. MAURER: Department of Physiology and Pathophysiology, University of Heidelberg, Heidelberg, Germany; Mariaberg Hospital for Child and Adolescent Psychiatry, Gammertingen-Mariaberg, Germany
JEFFREY S. MORRIS: Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
MASSIMO NATALE: Global ICT, Unicredit SpA, Milano, Italy
MARIA CHIARA PIETROGRANDE: Department of Chemical and Pharmaceutical Sciences, University of Ferrara, Ferrara, Italy
FABIO QUASSO: Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy
ELIA RANZATO: DiSIT (Dipartimento di Scienze e Innovazione Tecnologica), University of Piemonte Orientale, Alessandria, Italy
ELISA ROBOTTI: Department of Sciences and Technological Innovation, University of Piemonte Orientale, Alessandria, Italy
CRISTINA-MARIA VALCU: Department of Behavioral Ecology & Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
MIHAI VALCU: Department of Behavioral Ecology & Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
HUA-MEI XIN: College of Physics & Electronics, Shandong Normal University, Shandong, China; CREATIS, INSA Lyon, University of Lyon, Villeurbanne, France
YUEMIN ZHU: CREATIS, INSA Lyon, University of Lyon, Villeurbanne, France
Part I 2D-Maps Reproducibility and Maps Modeling
Chapter 1
Sources of Experimental Variation in 2-D Maps: The Importance of Experimental Design in Gel-Based Proteomics
Cristina-Maria Valcu and Mihai Valcu
Abstract
The success of proteomic studies employing 2-D maps largely depends on the way surveys and experiments have been organized and performed. Planning gel-based proteomic experiments involves the selection of equipment, methodology, treatments, types and number of samples, experimental layout, and methods for data analysis. A good experimental design will maximize the output of the experiment while taking into account the biological and technical resources available. In this chapter we provide guidelines to assist proteomics researchers in all these choices and help them to design quantitative 2-DE experiments.
Key words: Biological variation, Optimal sample size, Power analysis, Replication, Sampling, Sample pools, Technical variation
1 Introduction
Expression proteomics aims to measure levels of protein expression and identify differences in protein abundance with biological relevance in specific experimental settings (i.e., correlated with specific phenotypes or induced by specific treatments). To this end, gel-based differential display experiments are often employed to compare protein expression patterns between different biological samples. The technique is relatively demanding in terms of time and resources; hence, the scale of such studies is often limited. Unfortunately, inappropriate experimental designs can introduce biases, while undersized experiments often produce unreliable results. Rigorous experimental planning can limit systematic errors and maximize the information gained from an experiment, while reaching a compromise between the reliability of the desired results and the resources available. Experimental design thus lays the foundation for reproducible research.
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_1, © Springer Science+Business Media New York 2016
The experimental approach has to be adapted to effectively address the specific biological question under investigation. Experimental design is the process of planning surveys and experiments such as to maximize their quantitative and qualitative output. It includes selecting the biological material and methodologies and planning how experiments will be conducted and how the data will be analyzed. Given the wide range of possible applications in gel-based proteomics, it is difficult to prescribe a cookbook-like protocol for optimal experimental design. The aim of this chapter is rather to assist proteomics researchers in planning gel-based experiments by discussing advantages and disadvantages of different experimental approaches and by outlining general guidelines for designing quantitative 2-DE experiments.
2 Materials
2.1 R Packages and Functions
1. power.t.test function: power calculations for one- and two-sample t tests [1] (this function will be used as an example throughout this chapter).
2. power.anova.test function: power calculations for balanced one-way analysis of variance tests [1].
3. pamm package: power analysis for random effects in mixed models [2].
4. samr package: significance analysis of microarrays [3].
5. ssize package: sample size calculations for microarray experiments [4].
6. clippda package: a package for clinical proteomic profiling data analysis [5].
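To give a rough idea of what such power calculations compute, the following Python sketch estimates the number of replicates per group needed to detect a given difference between two groups. It is only an illustration under stated assumptions: it uses the normal approximation rather than the noncentral t distribution used by R's power.t.test, so it tends to return slightly smaller values, and the function name `samples_per_group` and the example effect size and standard deviation are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(delta, sd, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided, two-sample comparison.

    delta: smallest difference in mean spot volume worth detecting
    sd:    standard deviation within each group (same units as delta)
    Normal approximation; not a replacement for power.t.test in R.
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical example: effect sizes expressed in units of the standard deviation.
    for effect in (1.0, 0.5):
        print(effect, samples_per_group(delta=effect, sd=1.0))
```

Halving the detectable difference roughly quadruples the required number of replicates, which is why the target effect size should be fixed before the experiment is sized.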
2.2 Other Software
1. Russell Lenth’s power and sample size software [6].

3 Methods
Prerequisites for a successful experimental design are a well-defined aim of the study and clearly formulated scientific hypotheses. Experimental design includes aspects regarding the choice of materials and methods, planning and conducting the experiments, and analyzing the data:
1. Selecting the (sub)proteome relevant for the study and the most suitable type of biological material.
2. Selecting the appropriate methodology and necessary equipment (see Subheading 3.3.1).
3. Defining the factors and the factor levels (treatments, controls).
4. Selecting the type of replicates (biological, technical) (see Subheading 3.4.2).
5. Selecting the sampling scheme (see Subheading 3.4.1) and experiment layout (see Subheading 3.4.5).
6. Planning how experiments will be conducted: batches, experimenters, randomization (see Subheadings 3.4.3 and 3.4.5).
7. Selecting methods for the statistical analysis (see Subheading 3.3.6).
8. Establishing the protocols and standardizing the procedures (see Subheadings 3.3.2 and 3.3.3).
9. Estimating the optimal sample size (see Subheading 3.5).
10. Performing a cost-benefit analysis to assist the choice among candidate experimental designs (see Subheading 3.5.5).

3.1 Sources of Variation in 2-D Maps
Variation in proteomics data has three components: biological, technical, and random. The nature and significance of these sources of variation are quite different; hence, they have to be managed differently:
1. Biological variation is relevant by itself and has to be accounted for in any experiment, i.e., the samples need to reflect the biological variation of the original population so that the experimental observations on samples can be used to make assertions about the population (see Subheading 3.2). This is achieved by applying appropriate sampling schemes (see Subheading 3.4.1). More variable samples require more replicates in order to achieve a reliable estimation of protein abundance and identify significant differences in protein levels between samples. Optimal sample size therefore has to be estimated for each proteomics experiment (see Subheading 3.5).
2. Technical variation has to be minimized in order to facilitate the detection of differences between samples. Optimization (see Subheading 3.3.2) and standardization of experimental conditions (see Subheading 3.3.3) are prerequisites for reproducible results, especially in techniques like 2-DE which involve multiple steps, each of them likely to introduce experimental error. The layout of the experiment is selected to ensure minimal technical variation (see Subheading 3.4.5).
3. Random variation is inherent to all experimental systems and can be tackled through appropriate replication (see Subheadings 3.4.2 and 3.5).
The specific impact of biological and technical sources of variation on 2-D-based proteomics is discussed in more detail in the following two headings.
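How these components combine can be illustrated with a small Python sketch. It assumes a balanced nested design in which each biological replicate is run on the same number of technical replicate gels, independent variance components, and random variation lumped into the technical term; under these assumptions the variance of a group mean is the biological variance divided by the number of biological replicates plus the technical variance divided by the total number of gels. The function name and the variance values are hypothetical, chosen only for illustration.

```python
def variance_of_group_mean(var_bio, var_tech, n_bio, n_tech):
    """Variance of the mean spot volume of one group in a balanced nested design.

    var_bio:  variance between biological replicates
    var_tech: variance between technical replicate gels of the same sample
    n_bio:    number of biological replicates
    n_tech:   number of technical replicate gels per biological replicate
    """
    if n_bio < 1 or n_tech < 1:
        raise ValueError("replicate counts must be at least 1")
    return var_bio / n_bio + var_tech / (n_bio * n_tech)


if __name__ == "__main__":
    # Hypothetical variance components on a log-volume scale.
    var_bio, var_tech = 0.04, 0.01
    # Two ways of spending the same 12 gels per group.
    for n_bio, n_tech in ((4, 3), (12, 1)):
        v = variance_of_group_mean(var_bio, var_tech, n_bio, n_tech)
        print(f"{n_bio} biological x {n_tech} technical: variance = {v:.5f}")
```

With these illustrative numbers, spending the same number of gels on additional biological replicates lowers the variance of the mean more than adding technical replicates, because technical replication cannot reduce the biological term.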
Cristina-Maria Valcu and Mihai Valcu
3.2 Biological Variation
The criteria typically considered when choosing the type of biological sample include the accessibility and availability of the sample, the possibility to apply treatments/stimuli and to control experimental conditions, the complexity of the proteome and of the experimental procedures required for sample preparation, and the level of biological variation. The discussion below focuses on how the level of biological variation can affect the experimental design and the output of gel-based proteomic experiments. Biological variation is inherent to all living entities and arises from genetic factors in interaction with environmental factors. The level of variation increases from simple to complex organisms and from in vitro-derived material to living organisms. Less complex and less variable samples require fewer replicates and hence afford faster and less costly experiments. Variability can be reduced by using in vitro-derived samples or, in the case of living organisms, by selecting genetically homogeneous groups of individuals (e.g., laboratory lines) and maintaining them in controlled environmental conditions (cage or glasshouse experiments). It should, however, be kept in mind that the findings will mostly be valid only for those particular samples and conditions.
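The link between sample variability and the number of replicates can be illustrated with a textbook normal-approximation formula for comparing two group means. The sketch below is only an illustration and not the procedure of Subheading 3.5: the function name `replicates_per_group`, the default significance level and power, and the assumption of a single, approximately normally distributed (e.g., log-transformed) spot volume without multiple-testing correction are ours.

```python
import math
from statistics import NormalDist


def replicates_per_group(sd, delta, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided comparison of two means.

    sd    -- standard deviation of the (log) spot volume within a group (assumed equal)
    delta -- smallest difference between group means that should be detected
    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # quantile for the two-sided test
    z_beta = z.inv_cdf(power)             # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical standard deviations; the difference to detect is one unit.
for sd in (0.5, 1.0, 2.0):
    print(sd, replicates_per_group(sd, delta=1.0))
```

With a difference of one unit to detect, halving the standard deviation from 1.0 to 0.5 lowers the estimate from 16 to 4 replicates per group, whereas doubling it to 2.0 raises it to 63.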
3.2.1 Biological Material Obtained In Vitro
Cell cultures are often preferred as experimental models because they can provide large amounts of biological material under standardized and reproducible conditions and allow a wide variety of experimental treatments. Variation among such samples is typically low; hence, the number of samples required is also lower than when working on living organisms. Primary cell cultures are heterogeneous populations of cells, directly derived from living tissues, and thus very close to in vivo cells regarding their phenotype and between-cell variability. However, they are difficult to obtain and standardize, have limited life spans, and can only provide limited amounts of material. Also, upon isolation of cells from their original tissue and cultivation in vitro, their phenotype (and proteomic profile) begins to drift from the one expressed in vivo. Primary cell cultures independently isolated are true biological replicates. To obtain more biological material, cell lines can be cultivated for a (cell type-specific) number of passages before they cease dividing and senesce. Immortalized cell lines can tolerate a larger number of passages, but due to clonal drift, they diverge even more phenotypically from the original in vivo cells [7]. Early passages (i.e., young cultures) are therefore preferred, while aged cell cultures close to the maximum number of tolerated passages should be avoided. Established cell lines exhibit low biological variation between different samples, enable reproducible experiments with a lower number of replicates, and can provide relatively large amounts of biological material. However, the degree to which the
Experimental Design in Gel-Based Proteomics
experimental observations can be transferred to in vivo systems is comparatively limited. Performing all experiments on biological material from a particular passage (or at least from close passages) can increase the experimental power by reducing the biological variability between samples. This, however, has the side effect of partly restricting the validity of the results to that particular passage of the cell culture (see Subheading 3.3.3). A better practice is to allow a higher (controlled) biological variance between samples, which increases the relevance of the results and helps avoid false-positive results. Obtaining true biological replicates is not possible for immortalized cell lines. However, aliquots of immortalized cell lines can be cultured, allowed to drift independently, and used as pseudobiological replicates (see Note 1).
3.2.2 Biological Material Derived from Living Organisms
In vivo-derived samples are optimal for many research topics because their proteomes represent genuine reflections of the biology of the sampled organisms. They pose specific experimental challenges, mostly related to sample collection and preparation. Protein expression is cell type specific, but tissues often contain several types of cells (see Note 2). This can either mask changes in protein expression occurring in particular cell types or dilute the proteins of interest among hundreds of proteins irrelevant for the study. It is therefore beneficial to restrict the samples as much as possible to the relevant fraction (e.g., by isolating specific cell types from fresh tissues or through laser capture microdissection, or by isolating organelles) and thus focus on the sub-proteome of interest. The proteomic profiles of living tissues depend upon the physiological condition and the environmental factors prior to and at the time of sample collection. In human samples, for example, a large proportion of the biological variation can be explained by differences in the lifestyle of individuals [8]. In plants, a significant part of the biological variation of the leaf proteome is due to the orientation of the leaf and the time of day [9]. Living organisms (whether microorganisms, plants, or animals) should therefore experience similar environments (apart from the treatment applied or the experimental factors) prior to sampling. The conditions in which samples are collected and preserved prior to proteomic analysis should also be strictly controlled (see Subheading 3.3.3). The relative contribution of genetic vs. environmental and physiological factors to protein expression variation can be varied through selecting samples with a particular structure (e.g., gender, age) and maintaining the investigated organisms in standardized laboratory conditions (e.g., standardized cultures, glasshouse/cage experiments).
Genetically uniform samples and controlled conditions reduce the level of biological variation (and the required sample size), but they also restrict the relevance of the results to
the particular group structure and experimental conditions employed (see Subheading 3.3.3). Conversely, genetically diverse groups and diverse conditions require larger sample sizes but afford a wider relevance of results.
1. Laboratory lines of organisms. They are genetically homogeneous and hence afford increased reproducibility of the results and require smaller sample sizes than wild populations. The level of genetic variation retained by laboratory strains largely depends on their level of inbreeding (see Note 3). The genetic uniformity of laboratory lines allows relatively small differences in protein expression between groups to be identified as statistically significant; however, some of these differences may be strain specific. Diversifying the genetic background of the sample can partly alleviate this risk. In the case of animal experiments, possible solutions are to employ unrelated individuals (e.g., mice belonging to different litters), to use outbred stocks, or to repeat the experiments on different laboratory strains (see Note 4).
2. Sampling from wild populations. Wild populations are more genetically variable and experience more diverse environmental conditions. Hence, they require larger sample sizes than laboratory strains. If appropriate sampling schemes are applied, the results will be widely relevant across genotypes and experimental conditions. Ideally, several populations should be sampled in order to cover the between-population heterogeneity and obtain results generally valid at species level. The limitations associated with the study of wild individuals mainly concern the possibility to apply treatments and standardize the experiments. The level of interindividual variation differs between species (or even populations); therefore, the optimal sample size must be determined for each species/population and not extrapolated from related species.
3. Cell- and tissue-type-specific variation.
Different cells and tissue types may require specific conditions of sampling, preservation, and processing; therefore, sample preparation should be optimized for each type of sample to ensure an efficient and reproducible protein extraction and to minimize technical variation. The level of variation in protein abundance may also be sample type specific (see Note 5); hence optimal sample sizes should be estimated for each type of sample. In living organisms, protein expression patterns of most tissues are temporally variable and depend on internal and external factors. The degree of this dependence is tissue specific. Animal blood is a typical example of a highly variable
sample, but most other tissues will also change their proteomic profile in response to particular internal and external stimuli to different extents. In order to avoid undesired and potentially confounding variations in protein expression, the sample collection should be performed under standardized conditions with respect to all factors which can induce changes in the protein expression.
3.3 Technical Variation
Technical variation can originate from the different steps of the experimental procedure, from variation in reagent quality, and from handling (see Note 6). Since differences in protein abundance between samples can only be identified above the level of technical variance, it is important to minimize technical noise and thus make the detection of smaller differences possible. Below, we discuss some important steps toward minimizing technical variation.
3.3.1 Choice of Methods
The methods, equipment, and software used to produce and analyze 2-D maps can influence the level of technical variation to different degrees. The appropriate methodology should be selected taking into account the aim of the study and the available resources. The most important aspects to be considered are listed below:
1. Methods (e.g., carrier ampholyte vs. immobilized pH gradients (IPGs), horizontal vs. vertical sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE)).
2. Equipment employed in performing 2-DE (e.g., isoelectric focusing (IEF), PAGE chamber).
3. IPG range and SDS-PAGE gel concentration (uniform vs. gradient) (see Note 7).
4. Type of IPG and SDS-PAGE gel (precast vs. self-made) (see Note 8).
5. Method for protein detection and quantification (staining vs. labeling) (see Notes 9 and 10).
6. Application of the sample to IPGs (rehydration vs. cup loading, anodic vs. cathodic cup loading).
7. Image acquisition equipment and image analysis software (see Note 11).
8. Reagent quality (see Note 12).
3.3.2 Optimized Protocols
In order to minimize technical variation, 2-DE protocols need to be empirically optimized for each type of sample and experimental goal. Identifying the main sources of variation can help to prioritize the optimization procedures, which can otherwise be time and resource consuming (see Note 13). Below are the principal experimental steps requiring optimization:
1. Sample collection needs special attention in order to avoid sample degradation or modification (see Note 14). Samples should be processed quickly (see Note 15), frozen in liquid nitrogen as soon as possible, and then stored at −70 °C. Protease and phosphatase inhibitors should be added early in the experimental procedure (see Note 16).
2. Sample preparation can be an important source of technical variation and is usually the step which requires the most attention during optimization. Because each experimental stage can introduce technical variation in the experiment, sample preparation should be kept as simple as possible. Protein extraction protocols can be quite diverse for different types of samples, and the composition of the sample buffer has to be adapted to the type of sample. Protein extracts should be stored at −70 °C and freeze-thaw cycles should be avoided.
3. The composition of the IEF buffer and the sample application technique also require optimization. Running conditions (for both dimensions) may need to be adjusted for some types of samples (see Note 17).
4. Although published staining methods are generally well established, the protocols might need optimization due, e.g., to differences in reagent quality or in the type of shaking equipment used.
5. The parameters used for image acquisition should be optimized and maintained constant for all gels in an experiment. A protocol needs to be established for image analysis (the degree to which the image analysis procedure can be modified and optimized depends on the software).
6. Special attention should be paid throughout the experimental procedure to avoid contamination (see Note 18).
3.3.3 Standardized Procedures
Any treatment applied to the samples can induce unwanted variations in 2-DE maps. This technical noise can be reduced by standardizing the experimental procedure with regard to as many factors as possible (e.g., handling of cell cultures or living organisms prior to and during sampling (see Note 19), storage conditions, sample preparation, IEF and SDS-PAGE running conditions, batches of reagents and consumables, image acquisition and analysis). In vitro experiments, for example, should be conducted under controlled conditions since even apparently minor factors, such as differences in splitting ratios or seeding densities, can influence the growth curve and implicitly the abundance of particular proteins in the cells. Experiments which employ living animals also require strictly controlled conditions, and standardization can minimize experimental variation and increase statistical power. However, over-standardization of experimental conditions can result in
“laboratory-specific” results which no other research group can subsequently reproduce. A possible solution is to introduce systematic variation in the experimental conditions (e.g., regarding age, gender, body mass). Although heterogenization increases the within-experiment variation, it also increases the between-experiment reproducibility, thus avoiding spurious results [10]. Handling can account for a significant proportion of the variation in 2-D maps; hence the same person should preferably handle a given step for all samples which are to be directly compared in an experiment. If this is not feasible, blocking should be applied with respect to the experimenter (see Subheading 3.4.5, step 3), and the identity of the person performing the experiment should be accounted for during data analysis. User-induced variation has been reported for most image analysis software; hence a single person should also perform this task. In order to reduce experimenter-induced bias, it is desirable to conceal the identity of the samples during as many experimental steps as possible (i.e., to perform blind experiments).
3.3.4 Multiplexing
Through multiplexing (co-migration of two or more samples labeled with spectrally resolvable fluorescent dyes, e.g., difference gel electrophoresis (DIGE)) (see Fig. 1), the technical variation due to 2-DE is removed for the co-migrated samples and their 2-DE maps perfectly overlap, facilitating spot detection (see Note 20). Sample losses due to incomplete sample entry into the IPG or incomplete transfer to the second dimension are identical for the multiplexed samples; therefore, the quantification of protein abundances is more reliable. With DIGE, two or three samples can be multiplexed, reducing the number of gels required for an experiment (see Fig. 1). Alternatively, one of the labels can be used for an internal standard (see Subheading 3.3.5). Due to between-dye differences in labeling efficiency and signal strength, designs using two dyes (Cy3 and Cy5) exhibit lower technical noise [11]. However, at larger sample sizes, three-dye designs compensate for this effect by reducing the number of gels required, thus introducing smaller gel-to-gel and batch-to-batch variance into the experiment [11]. Samples should be assigned to the dyes randomly (see Fig. 1).
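The random assignment of samples to dyes and gels can be scripted so that it is reproducible and documented. The following sketch mirrors the three-dye layout of Fig. 1 (Design A), with Cy2 reserved for the internal standard; the function name `three_dye_design`, the sample labels, and the use of a seed are illustrative assumptions and not part of any DIGE software.

```python
import random


def three_dye_design(group1, group2, seed=None):
    """Randomly assign samples to Cy3/Cy5 and pair them on gels (hypothetical helper)."""
    if len(group1) % 2 or len(group2) % 2:
        raise ValueError("each group needs an even number of samples")
    rng = random.Random(seed)          # seeded generator, so the layout can be reproduced
    cy3, cy5 = [], []
    for group in (group1, group2):
        members = list(group)
        rng.shuffle(members)           # random choice of the half labeled with Cy3
        half = len(members) // 2
        cy3 += members[:half]
        cy5 += members[half:]
    rng.shuffle(cy3)                   # random pairing of Cy3- and Cy5-labeled samples
    rng.shuffle(cy5)
    return [{"gel": i + 1, "Cy2": "St", "Cy3": a, "Cy5": b}
            for i, (a, b) in enumerate(zip(cy3, cy5))]


# Hypothetical sample labels: six samples per group, as in Fig. 1.
group1 = [f"G1-{i}" for i in range(1, 7)]
group2 = [f"G2-{i}" for i in range(1, 7)]
design = three_dye_design(group1, group2, seed=2015)
for gel in design:
    print(gel)
```

Each sample appears on exactly one gel, and half of each group carries each dye; a reciprocal-labeling layout (Design B) would require a second pass with the dyes swapped.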
3.3.5 Internal Standards
Multiplexing makes possible a further development of the experimental design: the use of internal standards, which minimizes the impact of technical between-gel and between-run variation on the detection and quantification of protein spots on the 2-D map. An internal standard is formed by pooling equal amounts of protein from all samples in the experiment. The standard is labeled, mixed, and separated on the same 2-DE gel with the samples (see Note 21). It is used to align the gels, thus improving spot identification;
[Figure 1: Design A, random assignment of six samples per group to Cy3 or Cy5, random pairing, and multiplexing with the Cy2-labeled internal standard (St) on six gels; Design B, random pairing, multiplexing, and reciprocal labeling on twelve gels]
Fig. 1 Multiplexing and using internal standards. (a) Multiplexed three-dye design. Equal amounts of protein from each sample in the experiment are labeled with Cy2 and pooled to form the internal standard. Half of the samples in each group are randomly selected and labeled with Cy3, the other half with Cy5. Cy3- and Cy5-labeled samples are randomly paired. Each pair of samples is mixed with the internal standard and separated on a 2-DE gel. (b) Multiplexed three-dye design with reciprocal labeling. Fractions of all samples are labeled with Cy2, Cy3, and Cy5. The Cy2-labeled fractions are pooled to form the internal standard. Cy3- and Cy5-labeled fractions are randomly paired. Each pair is mixed with the internal standard and separated on a 2-DE gel
to normalize spot volume; and to calculate standardized abundance, thus enabling a more accurate quantification of the proteins. The most popular approach is DIGE with two or three dyes. In two-dye designs, one of the labels is used for the internal standard and the other for the samples. In three-dye designs, the internal standard is typically labeled with Cy2; half of the samples in each group are labeled with Cy3 and the other half with Cy5. The samples should be assigned to the dyes and multiplexed randomly (see Design A in Fig. 1). In the case of three-dye designs, reciprocal labeling of the samples (dye-swap) can help avoid false-positive results caused by dye-specific properties [12, 13] (see Design B in Fig. 1). Another possible approach to using internal standards is spiking protein samples with an Alexa-labeled internal standard (ALIS) prior to 2-DE, followed by the normalization of spot volumes from each gel against the average or median spot volume of a selection of representative proteins in the internal standard of that gel [14].
3.3.6 Data Analysis
Planning data analysis is part of the experimental design and should be done prior to performing the experiments. Statistical tools often pose specific requirements on the data, which need to be taken into consideration when planning the experiments. Moreover, knowledge of the methods used for data analysis is required in order to estimate optimal sample sizes (see Subheading 3.5).
3.4 Strategies for Experimental Design
The aim of experimental design is to facilitate causal inferences between dependent (response or measured; for 2-DE maps: spot volume) and independent (explanatory or predictor) variables. To achieve this it is necessary to:
1. Identify confounding variables and isolate their effect from that of the targeted factors.
2. Identify the sources of variability and reduce their effect so as to improve the precision of measurements and allow the detection of differences between the treatment outcomes.
Depending on the aim of the study and the biological and technical resources available, once a variable has been identified as potentially confounding, an attempt should be made to fix it (e.g., by performing experiments at constant temperature and humidity). If this is not possible, stratification should be applied (e.g., regarding age, gender, body weight), or else randomization should be ensured.
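When a potentially confounding variable cannot be fixed, stratification followed by randomization within each stratum can be scripted. The sketch below shows one possible way of doing this; the function `stratified_assignment`, the group names, and the example animals are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, groups=("control", "treatment"), seed=None):
    """Randomly assign units to groups separately within each stratum (illustrative)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)    # e.g., one stratum per sex
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)                     # randomize the order within the stratum
        offset = rng.randrange(len(groups))      # random group receives any surplus unit
        for i, unit in enumerate(members):
            assignment[unit] = groups[(i + offset) % len(groups)]
    return assignment


# Hypothetical animals: four females and four males.
animals = [(f"mouse{i}", "female" if i % 2 else "male") for i in range(1, 9)]
assignment = stratified_assignment(animals, stratum_of=lambda a: a[1], seed=7)
for animal, group in sorted(assignment.items()):
    print(animal, group)
```

Because the shuffling is done per stratum, each sex contributes the same number of animals to the control and treatment groups, so sex cannot be confounded with the treatment.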
3.4.1 Sampling Scheme
Inferential statistics uses the measurements made on a sample to draw conclusions about the statistical population of origin (parent population). The aim of sampling is to select a subset of units (i.e., a sample) from a statistical population in a way that makes this inference possible. The sampling scheme is chosen depending on the number of factors included in the experiment and on the experimental layout (see Subheading 3.4.5). The following are the sampling schemes most commonly applied in biological research:
1. Random sampling: each element in the population has an equal chance to be included in the sample (see Fig. 2a).
2. Systematic sampling: the elements of a population are sorted according to a criterion of choice, and then elements at regular intervals are included in the sample. If the starting point is randomized, the elements of the population will have equal chances to be included in the sample (see Fig. 2b).
3. Stratified sampling: the elements in a population are grouped into distinct categories. Each category is then randomly sampled. The proportion of the sample units selected from each category reflects the relative proportion of the categories in the original population (see Fig. 2c).
[Figure 2: Scheme A, random sampling; Scheme B, systematic sampling; Scheme C, stratified sampling; Scheme D, cluster sampling; Scheme E, matched sampling, with the members of matched pairs A to E randomly assigned to Group 1 and Group 2]
Fig. 2 Sampling schemes frequently applied in biological research. (a) Random sampling. (b) Systematic sampling. (c) Stratified sampling. (d) Cluster sampling. (e) Matched sampling. Filled fish represent males, empty fish represent females, filled circles indicate samples randomly selected, empty circles indicate samples selected systematically, and letters indicate matched samples
4. Cluster sampling: when the elements of a population are naturally grouped in clusters (i.e., grouped distribution), a subset of the clusters is randomly selected, and then elements from the selected clusters are in turn randomly selected (see Fig. 2d).
5. Matched random sampling: pairs of elements are selected from the population according to a criterion of choice, and then the units of the pairs are randomly assigned to control or treatment groups (see Fig. 2e). This type of sampling is most often used in clinical trials where patients are typically matched by age and gender.
3.4.2 Replication
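The first four sampling schemes of Subheading 3.4.1 (random, systematic, stratified, and cluster sampling; see Fig. 2) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the function names, the example population of 100 labeled units, and the grouping into "ponds" are invented for the example, and matched sampling is left out because it additionally involves assignment to groups.

```python
import random


def random_sample(population, n, rng):
    """Random sampling: every element has the same chance of being selected."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng, key=None):
    """Systematic sampling: sort, then take elements at regular intervals from a random start."""
    ordered = sorted(population, key=key)
    step = len(ordered) // n             # assumes n <= len(population)
    start = rng.randrange(step)          # randomized starting point
    return ordered[start::step][:n]


def stratified_sample(population, n, rng, stratum_of):
    """Stratified sampling with proportional allocation (rounding may shift the total slightly)."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(population))
        sample += rng.sample(members, k)
    return sample


def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Cluster sampling: select clusters at random, then elements within each selected cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in rng.sample(clusters[name], n_per_cluster)]


# Hypothetical population: 100 individuals, 30 males and 70 females, living in 5 ponds.
rng = random.Random(42)
population = [(i, "male" if i < 30 else "female") for i in range(100)]
ponds = {f"pond{p}": [u for u in population if u[0] % 5 == p] for p in range(5)}

print(random_sample(population, 10, rng))
print(systematic_sample(population, 10, rng, key=lambda u: u[0]))
print(stratified_sample(population, 10, rng, stratum_of=lambda u: u[1]))
print(cluster_sample(ponds, 2, 5, rng))
```

In this example the stratified sample always contains three males and seven females, reflecting the 30:70 composition of the population, whereas the composition of the simple random sample varies from draw to draw.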
Replication is a basic tool to ensure the reproducibility of the results and to increase the experimental power to detect significant changes in protein expression (see Subheading 3.5). The replicates employed can be either biological or technical (see Fig. 3). In order to categorize the type of replicate unambiguously, we define the experimental/sample unit as the unit which is randomly assigned and exposed to a treatment independently of other similar units [15]. Biological replicates are different sample units, while technical replicates are repeated measurements of the same sample unit. The purposes of biological and technical replicates are fundamentally different. Biological replicates encompass both the biological and the technical noise in the experiment and allow differences in protein abundance to be identified above the biological variability of the population (see Note 22). Technical replicates give an indication of the measurement error in the experiment and allow a reliable quantification of individual samples. In the absence of biological replicates (see Fig. 3a), technical replicates only depict differences between the compared sample units; hence they increase the false-positive rate. The practical utility of technical replication depends on the relative levels of biological and technical variation. When technical variance is relatively low (e.g., in DIGE), biological replicates alone (see Fig. 3c) will increase the chance to detect smaller differences in protein abundance above the biological variation [16]. When technical noise is high relative to biological variation (see Note 23), performing technical replicates (see Note 24) can help to filter out artifacts and outliers [17]. In this case, nested designs (see Fig. 3b) using several biological replicates, each with two or more technical replicates, can increase the confidence in the findings (see Subheading 3.4.5, step 2).
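The dependence of the value of technical replication on the relative size of the two variance components can be made concrete with the standard result for a balanced nested design: with n biological replicates, each measured m times, the variance of a group mean is var_bio/n + var_tech/(n*m). The sketch below is an illustration under that assumption; the function name and the numerical variance components are invented for the example and are not taken from refs. 16 or 17.

```python
def variance_of_group_mean(var_bio, var_tech, n_bio, n_tech=1):
    """Variance of a group mean for n_bio biological replicates with n_tech gels each."""
    return var_bio / n_bio + var_tech / (n_bio * n_tech)


# Hypothetical variance components (biological, technical) in arbitrary units.
scenarios = {
    "low technical noise": (1.0, 0.1),
    "high technical noise": (0.1, 1.0),
}
for name, (var_bio, var_tech) in scenarios.items():
    single = variance_of_group_mean(var_bio, var_tech, n_bio=6, n_tech=1)
    duplicated = variance_of_group_mean(var_bio, var_tech, n_bio=6, n_tech=2)
    more_bio = variance_of_group_mean(var_bio, var_tech, n_bio=12, n_tech=1)
    print(f"{name}: 6x1 = {single:.3f}, 6x2 = {duplicated:.3f}, 12x1 = {more_bio:.3f}")
```

With these invented numbers, duplicating each of six samples barely changes the variance when technical noise is low (0.183 vs. 0.175) but nearly halves it when technical noise is high (0.183 vs. 0.100). For the same twelve gels, twelve biological replicates give the smallest variance in both scenarios (0.092), so technical replicates are mainly useful when additional biological samples are not available or when outlier gels need to be identified.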
3.4.3 Randomization
Not all sources of technical variation in an experiment can be identified, quantified, and controlled. Normalization procedures can partly correct for some sample-to-sample differences, but the only practical way to avoid bias and systematic errors in proteomic
[Figure 3: Design A, Groups 1 and 2 with technical replicates only; Design B, Groups 1 and 2 with biological replicates, each with technical replicates; Design C, Groups 1 and 2 with biological replicates only]
Fig. 3 Replication. (a) Experimental design with technical replicates. (b) Experimental design with biological and technical replicates (when technical variation is high relative to biological variation). (c) Experimental design with biological replicates (when technical variation is low relative to biological variation)
experiments is randomization. Randomization is required for a correct estimation of the precision of the results [18] and it reduces the influence of factors which have not been explicitly accounted for in the experimental design (e.g., instrumental and experimenter drift). Randomization should be performed at each step of the experiment (see Fig. 4): assignment of sample units to the control or treatment groups; placement of the flasks in the incubator, of the plants in the glasshouse, or of the animals in cages; order of sample collection, processing, and loading; assignment of Cy labels to samples (for DIGE); allocation of samples to batches (blocking, see Subheading 3.4.5, step 3); allocation of samples to IPGs; order of gels in the chamber; order in which IPGs are placed on top of the second-dimension gels; order of gel processing (staining, scanning); etc.
3.4.4 Sample Pooling
Ideally each sample unit should be derived from a different biological individual or an independently isolated cell line. There are however experimental circumstances when sample pooling may be required or even advisable. One such case is when the amount of biological material available from one individual is insufficient for performing 2-DE; pooling several samples is then necessary in
[Figure 4: twelve sample units are randomly assigned to Group 1 and Group 2; their order is then re-randomized before protein extraction (Batches 1 and 2), IEF (Batches 1 and 2), SDS-PAGE, and image acquisition]
Fig. 4 Randomization in 2-DE experiments. Randomization is applied at every experimental step
order to reach the minimal quantity required by the method of choice. Another example is the comparison of populations from genetically diverse species, when the required sample sizes are so large that the analysis becomes impractical (see Note 25). Last, but not least, very large sample sizes also imply high experimental costs and duration, which might exceed the available resources. In all these cases, and when the aim of the analysis is to identify general differences in protein expression between populations at the expense of information regarding inter-individual variation, pooling can be a practical way to reduce the experimental effort. In order to minimize technical variation, pooling should be done early in the experimental procedure. Pooling relies on the assumption of biological averaging, i.e., that the measurement on the pooled sample equals the average of the measurements on the individual samples (see Note 26) [19, 20]. Pooled samples are formed by combining equal amounts of proteins extracted from independent samples. When pooling is necessary, partial pooling strategies are desirable (in Fig. 5 design
[Figure 5: Design A, all samples of Group 1 and of Group 2 combined into one total pool per group and analyzed as technical replicates; Design B, samples of each group randomly assigned to partial pools, which are analyzed as biological replicates]
Fig. 5 Sample pooling strategies. (a) Total pooling. (b) Partial pooling
B is preferred over design A) because some of the inter-individual variation is retained between pools (see Note 27) [19]. The partial pools should contain material originating from the same number of independent, randomly assigned samples. Changes in protein expression will then be identified above the biological variation remaining after sample pooling. There are both benefits and shortcomings associated with sample pooling. Pooling of independent, randomly selected samples reduces the biological variance by forming an average sample and increases the experimental power to detect changes in protein expression above the biological variance of these average samples [16]. Pooling also reduces the contribution of biological outliers, and when a reasonably high number of independent samples are pooled, it normalizes the distribution of spot volumes between gels, allowing for the use of parametric statistics in the analysis of data [20]. However, sample pooling results in the loss of information regarding inter-individual variation in protein abundances, and it can give false confidence in the significance of the results. Moreover, correlation of protein expression with variables measured at individual level (e.g., patient data) is no longer possible. As the number of pooled samples increases, the number of required pools (and gels) decreases, but the number of biological
samples increases. The increase is not linear and depends on the ratio between biological and technical variation [19]. The benefits of pooling are more substantial when the biological variation is large [19]. Sample pooling is also advantageous when the number of possible replicates is limited [16]. The decision whether to pool depends on the relative costs of obtaining biological samples vs. gel costs [19] (see also Subheading 3.5.5).
3.4.5 Experiment Layout
The aim of experiments is to provide evidence in support of a causal relationship by controlling all factors other than the one(s) tested. From a practical perspective, the sources of variation can be divided into sources relevant for the experiment and sources irrelevant for the experiment. Experimentally relevant variation is induced by the experimental treatment or factors, while variation unimportant for the experiment can originate from both known and unknown factors. Experimental design aims to minimize the latter in order to facilitate the detection and the precise measurement of the former. Both types of variation can be biological or technical in essence. Below we introduce the basic schemes of experimental layout applied in gel-based proteomics (for an introduction to experimental design see ref. 21):
1. Factorial design. The experimental layout depends primarily on the number of experimentally relevant factors. In single factor experiments, two or more groups of independent (or paired) samples are subjected to different levels of a single factor (see Fig. 6, Design A). Multifactorial designs allow several factors to be examined simultaneously (see Fig. 6, Designs B and C) [21]. Full factorial designs include all the combinations of the factor levels (see Fig. 6, Designs B and C). This allows the effects of the factors as well as of their interactions to be measured. Full factorial designs with many factors produce complex experiments which may exceed the available resources. In such cases, simplified versions of the factorial design known as fractional factorial designs can be employed (see Fig. 6, Design D).
Fractional factorial designs retain only a fraction of the combinations of a full factorial design, in a manner that ensures that the design is balanced (all factors, factor levels, and factor interactions are represented by the same number of observations) and orthogonal (each factor can be evaluated independently of all other factors in the experiment) (see Note 28).
2. Nested design. In some cases the levels of one factor (e.g., B) may be similar but not identical for different levels of another factor (e.g., A). In this case the factorial design becomes a nested or hierarchical design (see Fig. 7) [22]. Figure 7 exemplifies a two-stage nested design with a three-level factor (B) nested in a two-level factor (A) and four biological replicates. Typical examples of nested factors include: technical replicates nested in biological replicates, gels nested in 2-DE runs, and individual samples nested in experimental batches.
Cristina-Maria Valcu and Mihai Valcu
Design A.
Group  A
1      -
2      +
Design B.
Group  A  B
1      -  -
2      +  -
3      -  +
4      +  +
Design C.
Group  A  B  C
1      -  -  -
2      +  -  -
3      -  +  -
4      +  +  -
5      -  -  +
6      +  -  +
7      -  +  +
8      +  +  +
Design D.
Group  A  B  C
1      -  -  -
4      +  +  -
6      +  -  +
7      -  +  +
Fig. 6 Factorial design. (a) Experimental design with one two-level factor. (b) Full factorial design with two two-level factors. (c) Full factorial design with three two-level factors. (d) Fractional factorial design with three two-level factors. Combinations of factors represented with filled circles are included in the experiment
Factor A (2 level factor):  1                    2
Factor B (3 level factor):  1      2      3      1      2      3
Observations:               1.1.1  1.2.1  1.3.1  2.1.1  2.2.1  2.3.1
                            1.1.2  1.2.2  1.3.2  2.1.2  2.2.2  2.3.2
                            1.1.3  1.2.3  1.3.3  2.1.3  2.2.3  2.3.3
                            1.1.4  1.2.4  1.3.4  2.1.4  2.2.4  2.3.4
Fig. 7 Nested design including one two-level factor (a) and one three-level factor (b). The experiment includes four independent biological replicates
Experimental Design in Gel-Based Proteomics
[Figure 8 panels: Design A, non-randomized; Design B, randomized; Design C, non-randomized blocks; Design D, randomized blocks. Each panel shows how numbered samples from Group 1 and Group 2 are distributed between Batch 1 and Batch 2.]
Fig. 8 Randomized and block designs. (a) Non-randomized design. The batches contain samples from single groups. (b) Randomized design. Samples are randomly assigned to batches. The batches may contain a nonbalanced number of samples from different groups. (c) Non-randomized block design. The batches contain a balanced number of samples from different groups. (d) Randomized block design. Samples are randomly assigned to batches. The batches contain a balanced number of samples from different groups
3. Block design. Variation irrelevant for the experiment can originate from known or unknown sources. Controlling variation from known sources ultimately resides in ensuring that all variables other than the ones targeted by the experiment are equally represented in all groups. One possible approach is blocking (see Fig. 8c, d) [21]. The principle of blocking used in experiments is similar to that of stratifying used in sampling. With blocking, the sample units are divided into homogeneous blocks on the basis of a factor which is known to alter the biological response to the treatment (see Note 29). The sample units are then randomly assigned to treatments within each block (randomized block design). This reduces the contribution of the blocking factor to the measurements’ variation, thus allowing a better estimation of the sources of variation relevant to the study. Blocking can be applied to virtually any known variable likely to introduce variation in 2-DE maps and influence protein expression response to the treatment: biological features of the samples (e.g., age, gender), technical procedure (e.g., batch of samples, 2-DE run), or experimenter. For example, when the samples cannot all be processed in parallel and are instead processed in batches, each batch should contain balanced numbers of all the sample types in the experiment (see Fig. 8c, d). A special case of randomized block design is the matched pairs design, where the blocking factor (or factors) is used to group the sample units into pairs. The units of each pair are then randomly assigned to the treatments (similarly to matched random sampling).
4. Split-plot design. In some factorial designs, it is not possible to fully randomize all factors. The experiment can in this case employ an extension of the randomized block design to more treatments (variables), known as split-plot design. In this design, a comparatively hard-to-change factor is applied to large units called whole plots, which are in turn divided into subplots (or split plots) to which different levels of the easy-to-change factor are randomly assigned (see Fig. 9) (see Notes 30 and 31) [23]. Typical situations requiring a split-plot design occur, for example, when the order of sample processing cannot be completely randomized (e.g., the samples have to be processed in the order of their collection) or when one of the factors is difficult to change (e.g., agricultural practices applied at large scale).
[Figure 9 layout: four whole plots labeled A: 1, A: 2, A: 1, A: 2, each divided into four subplots labeled with the levels B: 1 to B: 4 of the subplot factor.]
Fig. 9 Split-plot design. (a) Whole-plot factor, (b) subplot factor
5. Randomizing. Experimental variation can also originate from unknown sources. The only way to control such variation is to randomize each step of the experimental procedure (see Fig. 4). Since randomization (see Fig. 8b) can by chance produce unbalanced experimental designs in which the compared groups are represented in different batches by very different numbers of samples, in extremis resembling a non-randomized experiment (see Fig. 8a), randomized block designs (see Fig. 8d) are often preferred.
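The layouts described above can be generated programmatically. The following is a minimal sketch in base R, assuming three hypothetical two-level factors (A, B, C, coded -1/+1) and eight hypothetical sample units. The half fraction is selected with the relation A × B × C = -1, an assumption chosen here because it reproduces the four groups listed for Design D in Fig. 6; the batch assignment follows the randomized block principle. The names `full_design`, `half_fraction`, and `assign_blocks` are illustrative only.

```r
# Full factorial design for three two-level factors (cf. Fig. 6, Design C).
# Levels are coded -1 ("-") and +1 ("+"); factor names are illustrative.
full_design <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
full_design$group <- seq_len(nrow(full_design))

# Half fraction: keep the runs for which A * B * C == -1.
# This defining relation is an assumption made so that the result
# matches the groups listed for Design D in Fig. 6.
half_fraction <- full_design[with(full_design, A * B * C == -1), ]

# Randomized block design: within each group, sample units are randomly
# split over the batches so that every batch holds the same number of
# units from each group.
assign_blocks <- function(group, n_batches) {
  batch <- integer(length(group))
  for (g in unique(group)) {
    idx <- which(group == g)
    batch[idx] <- sample(rep(seq_len(n_batches), length.out = length(idx)))
  }
  batch
}

set.seed(42)  # fixed seed only to make the illustration reproducible
samples <- data.frame(
  id    = 1:8,
  group = rep(c("control", "treatment"), each = 4)
)
samples$batch <- assign_blocks(samples$group, n_batches = 2)

# Randomize the processing order within each batch as well
samples$order <- ave(seq_len(nrow(samples)), samples$batch,
                     FUN = function(i) sample(length(i)))

print(half_fraction)
print(table(samples$group, samples$batch))
```

In the half fraction every factor still appears twice at each level, which is the balance property mentioned for fractional factorial designs; the cross-tabulation confirms that each batch contains the same number of control and treatment units.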
3.5 Optimal Sample Size
The number of observations required to detect a particular difference between groups depends on the magnitude of the difference between the groups and on the level of variation within them, i.e., heterogeneous groups require larger samples for detecting the same difference than relatively homogeneous groups. It is important to estimate the optimal sample size because an experiment employing too few replicates will lack the power to detect the targeted differences (see Subheading 3.5.1 for the theoretical background). In some cases, even small increases of the sample size can greatly improve the statistical power of the experiment [11, 24]. However, increasing the number of replicates helps only up to a certain point, beyond which additional samples bring only marginal benefits. For this reason, an experiment employing too many replicates will waste precious resources without improving the experimental output. Optimal sample size has to be estimated for each type of sample and experimental setup, and power analysis can be used to optimize the experimental design given the resources available (see Subheadings 3.5.1–3.5.2). Power analysis helps to balance false negatives against false positives: with very large sample sizes very small differences can be detected, but a relatively larger share of the detected differences will be false positives (see Notes 32 and 33). It should be noted that increasing the number of technical replicates has comparatively little impact on the experimental power (except for very noisy techniques, see Note 23); typically most benefits are gained from increasing the number of biological replicates [25].
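The diminishing return of additional replicates can be illustrated numerically. The sketch below uses the `power.t.test` function from R's `stats` package (discussed in Subheading 3.5.3); the effect size and SD are hypothetical values on a log2 scale, chosen only for illustration, and the variable names are not taken from the chapter.

```r
# Hypothetical inputs (assumptions, not taken from the chapter's data)
delta     <- 1.0   # targeted difference, e.g., a two-fold change on a log2 scale
sd_within <- 0.5   # within-group standard deviation on the same scale
alpha     <- 0.05  # significance level

# Power of a two-sample, two-sided t-test for a range of sample sizes
n_per_group <- 3:12
power_by_n <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = delta, sd = sd_within, sig.level = alpha)$power
})

# Gain in power obtained by adding one more replicate per group
gain <- c(NA, diff(power_by_n))
result <- data.frame(n     = n_per_group,
                     power = round(power_by_n, 3),
                     gain  = round(gain, 3))
print(result)

# Smallest sample size in the grid that reaches the target power of 0.8
n_opt <- n_per_group[which(power_by_n >= 0.8)[1]]
print(n_opt)
```

The `gain` column shows how the benefit of each additional replicate shrinks as the sample size grows, which is the pattern described in the paragraph above.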
3.5.1 Statistical Power
The majority of gel-based proteomic studies employ univariate statistics for detecting differences in protein expression between groups; hence the discussion below will focus on the univariate approaches (see Note 34). Univariate statistical tests calculate the probability (p) of observing differences at least as large as those measured if the compared samples originated from the same population (the null hypothesis, H0). When the p-value is smaller than the conventional threshold α (type I error rate), H0 is rejected and the differences between samples are assigned corresponding statistical significance levels (significant for p < 0.05, very significant for p < 0.01, highly significant for p < 0.001). Two types of errors are possible in taking statistical decisions (see Table 1): Type I error (α, the incorrect rejection of a true H0, i.e., the samples are wrongly declared to originate from different statistical populations) and Type II error (β, the failure to reject a false H0, i.e., the samples are wrongly declared to originate from the same statistical population). The Type I error rate is fixed by the chosen significance threshold (α), while the Type II error rate can be controlled through experimental design, by varying the sample size.
Table 1 Types of errors in taking statistical decisions
Test outcome    Actual difference
                H0 true            H0 false
Reject H0       Type I error, α    Power, 1 − β
Accept H0       1 − α              Type II error, β
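The error rates in Table 1 can be made concrete by simulation. The sketch below, with hypothetical parameter values, repeatedly draws two groups from normal distributions and applies a two-sample t-test: when H0 is true the rejection rate estimates the Type I error rate, and when H0 is false it estimates the power (1 − β). The helper name `rejection_rate` is illustrative.

```r
set.seed(123)        # fixed seed for reproducibility of the illustration
n_sim     <- 2000    # number of simulated experiments (assumed value)
n         <- 6       # replicates per group (assumed value)
sd_within <- 1       # common within-group SD (assumed value)
delta     <- 2       # true difference when H0 is false (assumed value)
alpha     <- 0.05    # significance threshold

# Proportion of simulated experiments in which H0 is rejected
rejection_rate <- function(true_diff) {
  p <- replicate(n_sim, {
    control   <- rnorm(n, mean = 0, sd = sd_within)
    treatment <- rnorm(n, mean = true_diff, sd = sd_within)
    t.test(control, treatment, var.equal = TRUE)$p.value
  })
  mean(p < alpha)
}

type1_rate <- rejection_rate(0)       # H0 true: estimates alpha
sim_power  <- rejection_rate(delta)   # H0 false: estimates 1 - beta
type2_rate <- 1 - sim_power           # estimates beta

# Analytical power for the same settings, for comparison
analytic_power <- power.t.test(n = n, delta = delta, sd = sd_within,
                               sig.level = alpha)$power

print(c(type1 = type1_rate, power = sim_power,
        type2 = type2_rate, analytic = analytic_power))
```

With enough simulations the estimated Type I error rate should lie close to the chosen α, and the simulated power close to the value returned by `power.t.test`.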
The power of a test (the complement of the type II error rate, i.e., 1 − β) represents its ability to correctly detect differences between the samples (to correctly reject H0 and identify true positives). The power depends on the significance level, the difference between samples (effect size), sample variance, and sample size. Significance levels are conventionally set at one of the abovementioned thresholds. Effect sizes represent the lowest differences between samples which the researcher regards as biologically meaningful and wishes to identify (see Notes 35 and 36); sample variance can be estimated in a pilot study or from similar published studies (see Note 37); the sample size can be varied to achieve the desired experimental power. The target power in gel-based proteomics is typically 0.8, i.e., if the experiment were repeated, a true difference between the samples would be detected on average four times out of five. To increase the power, the researcher can minimize the technical variation (see Subheading 3.3), target larger differences between samples, accept higher false-positive rates (significance thresholds, α), or increase the number of replicates.
3.5.2 Optimal Sample Size Estimation and Power Analysis
The methodology for performing power analysis depends on the particular statistical test employed to analyze the data. This is one of the reasons why the selection of the statistical tools used in the analysis stage is part of the experimental design. Power analysis methods have been developed for most statistical tests (see Notes 38 and 39) [26]. Some of these methods involve exact calculations based on mathematical formulae; others use approximate estimations, while still others employ simulations (see Note 40). There are two general approaches to estimating optimal sample sizes: based on a representative variance value (see Subheading 3.5.3) and based on the variances of all proteins in the data set (see Subheading 3.5.4). Most commercial and free statistical software packages offer tools for performing power analysis and estimating optimal sample sizes. The majority of these power estimators are designed for single variables and therefore cannot be directly applied to 2-DE data, which comprise multiple variables (proteins) with potentially different biological and technical variance. To apply them, a representative value of the variance needs to be selected, e.g., the mean or median of the distribution of protein variances [27] or a level of variance encompassing a certain proportion of the detected proteins, e.g., 75 % [8] or 90 % [24]. Statistical software packages dedicated to microarray or proteomics data are designed to apply sample size estimators to multiple variables (see Subheading 2, for examples). Below we exemplify how a basic R function can be used to estimate optimal sample size for 2-DE experiments both using a value of variance representative for the dataset (see Subheading 3.5.3) and using the variances of all proteins in the dataset (see Subheading 3.5.4).
3.5.3 Power Analysis Based on a Representative Variance Value
Among the statistical approaches to the analysis of 2-DE data, the t-test and ANOVA are the most frequently used (see Notes 38 and 41). R [1] offers power analysis functions for both tests (power.t.test and power.anova.test, respectively). Here we exemplify how the optimal sample size can be estimated through power analysis using the power.t.test function (see Note 42). The function parameters include sample size (n), effect size (delta), standard deviation (sd), significance level (sig.level), power (power), and the type of test (type: two sample/one sample/paired; alternative: two sided/one sided). The function can be used to compute:
(A) Sample size needed to detect a specified effect size (when n is set to NULL).
(B) Statistical power achieved with a specified sample size (when power is set to NULL).
(C) Effect size detected with a specified sample size (when delta is set to NULL).
In order to perform power analysis:
1. Between-sample variance is estimated from previous studies or in a pilot study (see Notes 37 and 43).
2. The power.t.test function is applied in variant (B), i.e., to compute the power, using, for example, the median normalized spot volume and a value of variance which encompasses 90 % of the spots.
3. The statistical power achieved for different sample sizes is plotted against the sample size as in Fig. 10.
4. The optimal sample size can be established from this graph as the lowest sample size for which the target power of 0.8 is reached.
Figure 10 displays the power achieved by experiments with varying numbers of tree leaf samples, to detect effect sizes of 1.5-, 2-, and 3-fold change at a significance level (α) of 0.05. Large differences in protein abundance (e.g., three-fold change) can be detected with a power of 0.8, using sample sizes as low as 4, while smaller differences (e.g., two-fold change) would require 10 samples to be detected. Experimental power increases nonlinearly with sample size. For example, increasing the sample size from 5 to 6 would result in an almost 3 % increase in the power to detect three-fold changes at a significance level of 0.05. However, a further increase of the sample size to 7 will only produce a power increase of about 1 %. For a target effect size of three-fold change, the power of 0.8 is reached at four samples. If the resources are sufficient, it would however be beneficial to further increase the sample size up to 6, as this small increase in the sample size would allow the power to reach a value of 0.99.
[Figure 10 plot: power (y-axis, 0.25 to 1.00) against sample size (x-axis, 5 to 20), with one curve per fold change (1.5, 2, 3).]
Fig. 10 Estimating optimal sample size from a representative variance value. Power analysis was performed for European beech (Fagus sylvatica) leaves (6 independent pools of equal amount of biological material from 15 saplings from a glasshouse experiment) using the median normalized spot volume and an SD value encompassing 90 % of the spots
3.5.4 Power Analysis Based on all Proteins’ Variances
There are two different approaches to estimating the sample size using the variances of all the proteins on the 2-DE map, depending on whether or not the protein abundance measure (i.e., normalized spot volume) meets the assumptions of parametric statistical tests (see Note 38).
Case 1. When the data meet the requirements for performing a t-test, the power.t.test function can also be used to estimate the optimal sample size based on the variances of all proteins on the 2-DE map.
1. Between-sample variance is estimated from previous studies or in a pilot study (see Notes 37 and 43).
2. The power.t.test function is applied in variant (A), i.e., to compute the sample size (n set to NULL), to all proteins.
3. The proportion of proteins requiring sample sizes below specified thresholds is plotted against the sample size as in Fig. 11.
4. The optimal sample size can be established from this graph depending on the desired effect size and power as well as on the available resources.
Figure 11 displays the proportion of spots requiring specific sample sizes to achieve a power of 0.8, to detect effect sizes of 1.5-, 2-, and 3-fold change at a significance level (α) of 0.05 for two types of samples: pools of tree leaves (panel A) and hepatoblastoma cells (panel B). For both samples large effect sizes can be detected for most proteins with relatively small sample sizes, while small effect sizes can only be detected for a limited proportion of the proteins and require a larger sample size. As expected, the proportion of proteins reaching the desired power to detect a given fold change increases with the sample size. For example, in the case of HepG2 cells, when the sample size increases from 5 to 6, 4 % additional proteins will reach a power of 0.8 to detect two-fold changes at a significance level of 0.05. Adding two more sample units will increase this proportion by another 4 %. However, further addition of samples can only bring a marginal benefit in terms of experimental power. Depending on the available resources, the optimal sample size might be estimated in this case between 5 (90 % of spots) and 8 (98 % of spots).
A comparison of panels A and B shows how the actual sample size depends on the biological variation of the samples. Protein expression is much more variable between leaves of European beech saplings than between HepG2 cultured cells. To reduce the biological variation, beech saplings were maintained in standardized glasshouse conditions and 6 independent pools were formed, each containing leaf material from 15 randomly selected plants. Conversely, to increase variation among HepG2 cultured cells, the samples were allowed to drift for 8 independent passages. Even so, the biological variation retained after pooling beech leaves exceeds the variation characterizing HepG2 cells. As a consequence, more than 20 samples are needed, for example, to detect a two-fold change for 90 % of the proteins in the leaves, as compared to only five samples in the case of HepG2 cells.
[Figure 11 plots, panels a and b: proportion of spots (%, y-axis, 0 to 100) against sample size (x-axis, 5 to 20), with one curve per fold change (1.5, 2, 3).]
Fig. 11 Estimating optimal sample size from the variances of all proteins. (a) Power analysis was performed for European beech (Fagus sylvatica) leaves (6 independent pools of equal amount of biological material from 15 saplings from a glasshouse experiment) using the normalized spot volumes of all 2-DE spots. (b) Power analysis was performed for HepG2 cells (5 independent biological replicates) using the normalized spot volumes of all 2-DE spots
Case 2.
When the distribution of the normalized spot volumes is unknown or cannot be normalized, as well as for more elaborate experimental designs, in which 2-DE data are analyzed using statistical methods for which no mathematical formula has been developed, simulation approaches can provide quite accurate estimations for optimal sample sizes (see Fig. 12).
1. Between-sample variance is estimated from previous studies or in a pilot study (see Notes 37 and 43).
2. For each protein, the mean and SD of the normalized spot volumes measured for control and treatment groups are used to construct two Gaussian distributions (μC, σC) and (μT, σT), respectively (see Fig. 12).
3. The two distributions are standardized, and then the treatment distribution is translated with a specified effect size δ.
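[Figure 12, panel A. Monte Carlo simulation for estimating optimal sample size: the mean and SD of the control group (MC, SDC) and of the treatment group (MT, SDT) define two Gaussian distributions (µC, σC) and (µT, σT); the distributions are standardised (centred at 0) and the treatment distribution is shifted by δ; samples drawn from the two distributions are compared with a statistical test; the statistical decision (accept/reject H0) is recorded; the procedure is repeated >1000 times; % [p < 0.05] = p(reject H0 | HA true).]
B. R-style script
# inputs: control, treatment = matrices of normalized spot volumes
#         (one row per protein spot, one column per sample)
# define fixed parameters
proteins    <- nrow(control)  # number of protein spots
test        <- t.test         # choose statistical test (must return $p.value)
effect_size <- 2              # set effect size
power       <- 0.8            # set power
simulations <- 1000           # set number of simulations
alpha       <- 0.05           # set significance level
n_max       <- 50             # upper limit for n (safeguard added to stop the search)
sample_size <- rep(NA, proteins)
# routine
for (i in seq_len(proteins)) {
  sd_c <- sd(control[i, ])    # SD protein i in control group
  sd_t <- sd(treatment[i, ])  # SD protein i in treatment group
  n <- 2                      # initial sample size (reset for each protein)
  current_power <- 0          # initial power (reset for each protein)
  # perform simulations for protein i
  while (current_power < power && n < n_max) {
    n <- n + 1
    p_values <- numeric(simulations)
    for (j in seq_len(simulations)) {
      sim_c <- rnorm(n, mean = 0, sd = sd_c)
      sim_t <- rnorm(n, mean = effect_size, sd = sd_t)
      p_values[j] <- test(sim_c, sim_t)$p.value
    }
    beta <- sum(p_values > alpha) / simulations  # proportion of non-rejections
    current_power <- 1 - beta
  }
  # sample size for protein i (NA if the target power was not reached)
  sample_size[i] <- if (current_power >= power) n else NA
}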
Fig. 12 Sample size estimation using a Monte Carlo simulation
4. Two random samples (n) are extracted from each of the two distributions and compared using a statistical test. This step is repeated at least 1000 times and the proportion of rejections for the given α threshold is used to compute the power. The sample size is incremented until the desired power is reached.
5. The sample size for which this power is reached represents the optimal sample size for that protein.
The routine is applied to all proteins, and then the proportion of proteins requiring sample sizes below particular thresholds is plotted against the sample size similarly to Fig. 11. The estimation of the optimal sample size then follows the same procedure as described before.
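The aggregation step shared by Case 1 and Case 2, i.e., turning per-protein sample sizes into the proportion of proteins that reach the target power at each sample size, can be sketched as follows. The per-protein SDs are hypothetical (a real analysis would use pilot-study estimates), the t-test-based `power.t.test` is used for brevity (Case 1), and `required_n` is an illustrative helper name.

```r
# Hypothetical per-protein SDs of log2 normalized spot volumes
# (in practice these would be estimated in a pilot study)
protein_sd <- seq(0.3, 1.5, length.out = 60)

delta  <- log2(2)   # targeted effect: two-fold change on a log2 scale
alpha  <- 0.05      # significance level
target <- 0.8       # target power
n_grid <- 2:20      # sample sizes per group to be examined

# Smallest n in n_grid reaching the target power for one protein
# (NA if the target is not reached within the grid)
required_n <- function(sd_i) {
  pw <- sapply(n_grid, function(n) {
    power.t.test(n = n, delta = delta, sd = sd_i, sig.level = alpha)$power
  })
  n_grid[which(pw >= target)[1]]
}

n_required <- sapply(protein_sd, required_n)

# Proportion of proteins (%) whose required sample size is <= n
proportion <- sapply(n_grid, function(n) {
  100 * mean(!is.na(n_required) & n_required <= n)
})

print(data.frame(n = n_grid, proportion = round(proportion, 1)))
```

Plotting `proportion` against `n_grid` (e.g., with `plot(n_grid, proportion, type = "s")`) gives a curve of the kind shown in Fig. 11. For Case 2, `n_required` would instead hold the sample sizes returned by the simulation routine.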
3.5.5 Cost-Benefit Analysis
There are two aspects of experimental design which can be optimized through cost-benefit analysis: the optimal sample size and sample pooling. Power analysis is a very useful instrument for cost-benefit analysis regarding optimal sample size (see Note 44) since it can help find the optimal balance between Type I and Type II error rates, given the resources available (see Subheading 3.5.1, see Note 32). To determine whether pooling is beneficial, the costs of the experiment can be estimated for different scenarios (non-pooled samples vs. pools of different numbers of individuals) using the following formula:
$$C_t = N_s C_s + N_g C_g \tag{1}$$
where $N_s$ is the number of individuals, $N_g$ the number of gels required, $C_g$ the cost associated with each gel, $C_s$ the cost associated with obtaining each sample, and $C_t$ the total cost of an experiment [19, 27]. The most cost-effective scenario, i.e., the one which minimizes experimental costs for the same experimental power, can thus be identified (see Note 45). For an experiment using non-pooled samples, the costs will thus be:
$$C_t(\text{non-pooled}) = N_s C_s + N_s C_g \tag{2}$$
If $p$ is the number of samples pooled for each biological replicate, the cost of an experiment using pooled samples can be calculated as:
$$C_t(\text{pooled}) = N_s C_s + \frac{N_s C_g}{p} \tag{3}$$
Pooling becomes beneficial when (see Note 46):
$$C_t(\text{pooled}) < C_t(\text{non-pooled}) \tag{4}$$
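Equations 1–4 can be evaluated for concrete scenarios as in the following sketch. All cost figures and sample numbers are hypothetical; in particular, the number of individuals assumed for each scenario would in practice come from a power analysis ensuring that the scenarios achieve the same power. The function names are illustrative.

```r
# Eq. 1: total cost from numbers of individuals (n_s) and gels (n_g)
total_cost <- function(n_s, n_g, c_s, c_g) {
  n_s * c_s + n_g * c_g
}

# Eq. 2: non-pooled samples, one gel per individual
cost_non_pooled <- function(n_s, c_s, c_g) {
  total_cost(n_s, n_g = n_s, c_s = c_s, c_g = c_g)
}

# Eq. 3: pools of p individuals, one gel per pool
cost_pooled <- function(n_s, p, c_s, c_g) {
  total_cost(n_s, n_g = n_s / p, c_s = c_s, c_g = c_g)
}

# Hypothetical costs per sample and per gel (arbitrary currency units)
c_s <- 10
c_g <- 50

# Hypothetical scenarios assumed to give the same power:
# 12 individuals on 12 gels vs. 24 individuals in 8 pools of 3
ct_non_pooled <- cost_non_pooled(n_s = 12, c_s = c_s, c_g = c_g)
ct_pooled     <- cost_pooled(n_s = 24, p = 3, c_s = c_s, c_g = c_g)

# Eq. 4: pooling is beneficial when the pooled design is cheaper
pooling_beneficial <- ct_pooled < ct_non_pooled
print(c(non_pooled = ct_non_pooled, pooled = ct_pooled))
print(pooling_beneficial)
```

With these assumed figures pooling is cheaper because gels are much more expensive than samples; when samples are costly relative to gels, the same comparison can favor the non-pooled design.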
4 Notes
1. The feasibility of this approach depends on the cell line (in some cases this may result in overaged cultures).
2. The same applies to ex vivo samples.
3. Although the level of genetic variation can be quite different between strains, laboratory lines typically regarded as outbred and maintained under strictly controlled conditions may exhibit levels of interindividual variation in protein abundance as low as those of inbred lines (e.g., in mice [24]).
4. When the samples contain individuals with different degrees of relatedness, their parentage should be accounted for during the analysis.
5. Tissues expressing a high proportion of specific proteins are more likely to exhibit a larger variation [24, 28].
6. Technical variance is typically considered to be similar between samples [11].
7. Unless the pI of the target proteins is known, 2-DE maps should cover as much as possible of the pH range. Narrower pH gradients generally afford a better resolution than wide gradients (unfortunately they also increase the duration and costs of the experiments if the entire pH scale is to be covered). At the very least, acidic and basic proteins should be focused separately, since they typically require a different placement of the cup (near the cathode for acidic IPGs and near the anode for basic IPGs), different IEF conditions, and sometimes even different buffers (e.g., 10 % isopropanol in the rehydration buffer may improve the focusing of basic proteins).
8. Different brands of commercial IPG strips may differently influence the reproducibility of the 2-DE maps.
9. Staining methods are typically noisier than labeling methods.
10. Among staining methods, those with equilibrium end point, broad linear dynamic ranges, and staining intensity independent from protein properties are preferable [17, 29].
11. High-quality images are essential for attaining a precise and reproducible quantification of proteins.
12. High-quality reagents are essential for obtaining reproducible 2-DE maps with high resolution. There may be differences between batches of reagents; therefore, it is preferable to prepare all buffers for an experiment from the same batch of reagents.
13. The relative contribution of the different experimental steps to the total experimental noise is different and often specific to the type of sample [12, 20]. For example, biological fluids (e.g., serum, plasma) are sensitive to collection and storage conditions [30–32]; biological materials with tough cell walls (e.g., plant tissues, bacteria) are difficult to disrupt reproducibly; biological materials with high protease activity (e.g., pancreatic tissue) are sensitive to proteolysis throughout the sample preparation procedure.
14. Removal of animal samples may require surgery or dissection. The anesthesia or euthanasia procedures involved should be chosen with care, because they can rapidly modify the abundance of specific proteins as well as induce post-translational modifications.
15. Special care should be taken in the case of samples collected post mortem, which degrade rapidly.
16. To avoid proteolysis and dephosphorylation of proteins extracted from cell cultures, protease inhibitors (e.g., 2 mM PMSF) and phosphatase inhibitors (e.g., 0.2 mM Na3VO4, 1 mM NaF) can be added in the washing and lysis buffers. In the case of animal tissues removed through dissection, the organs can be flushed in situ through both heart ventricles with phosphate-buffered saline (PBS) containing the same protease inhibitors and phosphatase inhibitors. This will also minimize blood contamination of the organs.
17. The current protocol for the IEF often has to be adapted for samples with high content of contaminating nonprotein compounds (e.g., plant samples), for samples extracted in different buffers, or when different rehydration buffers are used.
18. Contaminants can originate from the environment (dust, keratins) or from other tissues of the sampled organism (e.g., blood for animal samples). The first category of contaminants can be avoided by performing all experiments in clean (ideally dust-free) environments and using protective clothes and gloves. The second category depends on the type of sample and can be minimized through a careful isolation of the analytes.
19. For example, unless these are experimental treatments or factors, plant leaves should be sampled at the same time of the day, in similar light, temperature, and humidity conditions; animal samples should be collected at the same time of the day, at similar postprandial intervals, etc.
20. Labeling with minimal Cy dyes targeting the lysine residues is recommended when relatively large amounts of sample are available, while precious samples available only in small amounts (e.g., obtained by laser capture microdissection) can be labeled with saturation dyes [33].
21. When multiplexing is not possible, a mixed sample can be run on a separate gel. This can facilitate the alignment of single stain gels and, in the case of complex samples containing more than one proteome (e.g., infected tissues containing both host and pathogen proteins [34]), it can ease the assignment of spots to the sample of origin.
22. The level of biological variation can be specifically used as criterion in the identification of differentially expressed proteins. The criteria typically employed in detecting differences in protein abundances are either a p-value threshold or a p-value threshold with a fixed effect size (most commonly 1.5- or 2-fold change). These criteria do not take into account that different proteins exhibit different levels of variance. To account for these differences, the criterion can instead consist of a p-value threshold together with an effect size exceeding the intrinsic level of variation of the spots modeled on the control samples [34].
23. Technical variation is typically much lower than biological variation. However, when the experimental procedure cumulates several steps susceptible to increased variation (e.g., multiple-step protein extraction, protein pre-fractionation, silver staining), the overall technical noise can exceed the biological variation.
24. Technical replicates should be performed for the experimental step which introduces the largest technical noise, e.g., if sample preparation introduces most technical variation, the technical extraction replicates should be included in the experiment.
25. The alignment of large numbers of gels obtained in many different 2-DE runs can be problematic, and spot identification becomes less accurate with increasing gel numbers.
26. The assumption has been shown to hold in experiments employing internal standards but might fail for a proportion of the spots in experiments using single stains [19]. In samples with very high biological variation, the assumption might also not hold due to Jensen’s inequality effect, i.e., the average of the pool is higher than the average of the individuals [19].
27. Several small pools are preferable to fewer large pools.
28. One approach to selecting the best combination of factors and factor levels is the Taguchi method. Although this approach remains debated in the statistical community, it has been applied in 2-DE-based proteomics, for example, to method optimization [35].
29. It is possible to block more than one factor as in the Latin square design (2 blocking factors), Graeco-Latin square design (3 blocking factors), or hyper-Graeco-Latin square design (more than 3 blocking factors) [21].
30. There can be several whole-plot or sub-plot factors. Sub-plots can in their turn have a block design, a factorial design, etc.
31. A variant of this design is the strip-plot design, where the second factor is also applied in whole plots (strips) perpendicular to the whole plots of the first factor [36].
32. Choosing appropriate levels for Type I (false-positive) and Type II (false-negative) errors depends on the relative costs of these errors, e.g., in the primary screening stages of drug development, it is less costly to select more putative targets for further investigation than to miss potential drug candidates; hence higher Type I error rates can be tolerated; the opposite applies to late stages of preclinical or clinical drug testing [37].
33. In the case of experiments on animal or human subjects, ethical considerations also guide the choice of sample sizes. In clinical trials, under- or over-sized experiments would prevent the access of patients to effective treatments or subject them unnecessarily to potential side effects of the treatment, respectively.
34. Methods for power analysis are also available for multivariate statistics [37]. 35. At very large sample sizes, very small differences can be detected as statistically significant, but such changes are not necessarily biologically meaningful. 36. Reasonable thresholds depend on the biological process under investigation and on the protein species. For example, a small change in abundance might not be biologically significant for a structural protein, but might be highly relevant for a transcription factor. 37. Since technical variation depends on protocols, equipment, reagents, and experimenter, the best estimates for experimental noise will be obtained in pilot studies performed on the same type of samples, in the same conditions, and with the same design as the final experiment being planned. The same transformations as planned for the final study should also be applied to the data obtained in the pilot study. 38. It should always be verified that the data meet the assumptions of the chosen test. Caution should also be taken that data do not depart from normality after the transformation [38]. Parametric tests (e.g., Student’s test) require that the samples are independent and drawn from populations with equal variances and that the variable is normally distributed in each group. Non-parametric tests relax these assumptions to different degrees, while some parametric tests model the variance explicitly (e.g., generalized least square (GLS) models). Whether the data meet these assumptions or not can depend on the nature of the sample as well as on the experimental design. For example, samples are independent when separated on different 2-DE gels (as in single stain gels or two-dye DIGE) but not when separated on the same gel (as in three-dye DIGE) [39]. Statistical models including the gel as factor should be used to compare protein abundances in this case. 
The assumption of normality should hold on theoretical grounds for pooled samples (due to the central limit theorem) and has been shown to hold for DIGE [11] but should be verified for other experimental setups or staining methods, or else non-parametric statistics should be used to analyze 2-DE data. Unequal variance can derive from heteroskedasticity or from experimental conditions. In the case of DIGE data, for example, the first type of issue can be solved through log transformation of the data [11]. When this is not effective, statistical models including spot volume as factor should be used for data analysis. When the unequal variance results from the experimental conditions (e.g., when a treatment induces a response with higher or lower variance than that of the control), statistical tests for unequal variance should be employed (e.g., Welch t-test).
39. Sample sizes are typically estimated assuming balanced designs, i.e., with equal numbers of sample units assigned to control and treatment. Unbalanced designs achieve lower experimental power than balanced designs for the same number of sample units. For example, an experiment comparing 5 control vs. 7 treatment sample units will have less power than one comparing 6 control vs. 6 treatment sample units.
40. Monte Carlo simulations are employed when the theoretical distribution of the data is known. When the data distribution is not known, bootstrap simulations can be used; however, they require large sets of empirical measurements.
41. When the conditions for applying the t-test are not met, a simulation approach (e.g., as described in Subheading 3.5.4) can be undertaken using the median normalized spot volume and an SD value encompassing 90 % of the spots.
42. The power.t.test function assumes equal variance, but other sample size estimators (e.g., Lenth's power tool) allow the variance to vary between the two groups.
43. When unequal variance is expected between control and treatment, the variance should be estimated for both groups. When variance is known to be homogeneous between control and treatment, it suffices to estimate the variance of the controls.
44. The estimated optimal sample size can in practice be smaller or larger than the one afforded by the project resources. When the optimal sample size exceeds the resources available, one possible solution is to narrow the scope of the study by fixing additional factors (e.g., by restricting it to a narrower age class or to specific environmental conditions). Although this will also restrict the validity of the results to the targeted subpopulation, this approach is preferable to an underpowered study, which is unlikely to yield any useful results.
Conversely, when the optimal sample size is lower than the resources available, the scope of the study can be broadened to include a more diverse sample or more experimental conditions, and thus render the results more widely applicable [6].
45. In order to estimate the number of pools required for each scenario, a measure of the variance between different pools is required, which can be used to estimate the optimal sample size (see Subheading 3.5.2).
46. Whether pooling is beneficial for a particular study depends on the technical reproducibility and can therefore vary among different laboratories.
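Notes 38 and 39 can be illustrated numerically. The following is a minimal sketch in Python (not the R functions referred to in this chapter); the function names and the example spot volumes are hypothetical and serve only as an illustration. It compares the standard error of the difference between two group means for the balanced (6 vs. 6) and unbalanced (5 vs. 7) designs of Note 39, assuming a common SD, and computes the Welch t statistic with Welch-Satterthwaite degrees of freedom for groups with unequal variance.

```python
import math

def se_diff(sd, n1, n2):
    """Standard error of the difference between two group means (common SD assumed)."""
    return sd * math.sqrt(1.0 / n1 + 1.0 / n2)

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variance, group 1
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)  # sample variance, group 2
    se2 = v1 / n1 + v2 / n2                        # squared SE of the mean difference
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Note 39: same total number of sample units, balanced vs. unbalanced allocation
balanced = se_diff(1.0, 6, 6)
unbalanced = se_diff(1.0, 5, 7)

# Hypothetical normalized spot volumes; the treatment group is more variable
control = [1.00, 1.10, 0.95, 1.05, 0.90, 1.00]
treatment = [1.20, 1.60, 0.90, 1.80, 1.10, 1.50]
t_stat, df = welch_t(control, treatment)
```

The balanced design gives the smaller standard error, which is why it reaches higher power for the same number of sample units. The p-value for the Welch statistic would have to be taken from a t distribution with the computed degrees of freedom, which the Python standard library does not provide.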
References
1. R Development Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
2. Martin J (2012) pamm: Power analysis for random effects in mixed models. R package version 0.7
3. Tibshirani R, Chu G, Narasimhan B et al. (2011) samr: Significance analysis of microarrays. R package version 2.0
4. W GR (2009) ssize: Compute and plot power, required sample-size, or detectable effect size for gene expression experiment. R package version 1.28.0
5. Nyangoma S (2014) clippda package: A package for clinical proteomic profiling data analysis. R package version 1.14.0
6. Lenth R (2001) Some practical guidelines for effective sample size determination. Am Statistician 55:187–193
7. Alge CS, Hauck SM, Priglinger SG et al. (2006) Differential protein profiling of primary versus immortalized human RPE cells identifies expression patterns associated with cytoskeletal remodeling and cell survival. J Proteome Res 5:862–878
8. Karp NA, Lilley KS (2007) Design and analysis issues in quantitative proteomics studies. Proteomics 7:42–50
9. Jorge I, Navarro RM, Lenz C et al. (2005) The Holm oak leaf proteome: analytical and biological variability in the protein expression level assessed by 2-DE and protein identification tandem mass spectrometry de novo sequencing and sequence similarity searching. Proteomics 5:222–234
10. Richter SH, Garner JP, Auer C et al. (2010) Systematic variation improves reproducibility of animal experiments. Nat Methods 7:167–168
11. Karp NA, Lilley KS (2005) Maximising sensitivity for detecting changes in protein expression: experimental design using minimal CyDyes. Proteomics 5:3105–3115
12. Corzett TH, Fodor IK, Choi MW et al. (2006) Statistical analysis of the experimental variation in the proteomic characterization of human plasma by two-dimensional difference gel electrophoresis. J Proteome Res 5:2611–2619
13. Marouga R, David S, Hawkins E (2005) The development of the DIGE system: 2D fluorescence difference gel analysis technology. Anal Bioanal Chem 382:669–678
14. Wheelock ÅM, Morin D, Bartosiewicz M et al. (2006) Use of a fluorescent internal protein standard to achieve quantitative two-dimensional gel electrophoresis. Proteomics 6:1385–1398
15. Lee JK (2011) Statistical bioinformatics: for biomedical and life science researchers. John Wiley & Sons, Hoboken, NJ
16. Karp NA, Spencer M, Lindsay H et al. (2005) Impact of replicate types on proteomic expression analysis. J Proteome Res 4:1867–1871
17. Hunt SM, Thomas MR, Sebastian LT et al. (2005) Optimal replication and the importance of experimental design for gel-based quantitative proteomics. J Proteome Res 4:809–819
18. Chich J-F, David O, Villers F et al. (2007) Statistics for proteomics: experimental design and 2-DE differential analysis. J Chromatogr B 849:261–272
19. Karp NA, Lilley KS (2009) Investigating sample pooling strategies for DIGE experiments to address biological variability. Proteomics 9:388–397
20. Valcu C-M, Valcu M (2007) Reproducibility of two-dimensional gel electrophoresis at different replication levels. J Proteome Res 6:4677–4683
21. Ruxton G, Colegrave N (2011) Experimental design for the life sciences. Oxford University Press, Oxford
22. Krzywinski M, Altman N, Blainey P (2014) Points of Significance: Nested designs. Nat Methods 11:977–978
23. Kowalski SM, Potcner KJ (2003) How to recognize a split-plot experiment. Quality Progress 36:60–66
24. Valcu C-M, Reger K, Ebner J et al. (2012) Accounting for biological variation in differential display two-dimensional electrophoresis experiments. J Proteomics 75:3585–3591
25. Horgan GW (2007) Sample size and replication in 2D gel electrophoresis studies. J Proteome Res 6:2884–2887
26. Aberson CL (2011) Applied power analysis for the behavioral sciences. Routledge, New York, NY
27. Zhang S-D, Gant TW (2005) Effect of pooling samples on the efficiency of comparative studies using microarrays. Bioinformatics 21:4378–4383
28. Bahrman N, Zivy M, Damerval C et al. (1994) Organisation of the variability of abundant proteins in seven geographical origins of maritime pine (Pinus pinaster Ait.). Theor Appl Genet 88:407–411
29. Westermeier R, Marouga R (2005) Protein detection methods in proteomics research. Biosci Rep 25:19–32
30. Ferguson RE, Hochstrasser DF, Banks RE (2007) Impact of preanalytical variables on the analysis of biological fluids in proteomic studies. Proteomics-Clin Appl 1:739–746
31. Pasella S, Baralla A, Canu E et al. (2013) Preanalytical stability of the plasma proteomes based on the storage temperature. Proteome Sci 11:10
32. Hulmes JD, Bethea D, Ho K et al. (2004) An investigation of plasma collection, stabilization, and storage procedures for proteomic analysis of clinical samples. Clin Proteomics 1:17–31
33. Shaw J, Rowlinson R, Nickson J et al. (2003) Evaluation of saturation labelling two-dimensional difference gel electrophoresis fluorescent dyes. Proteomics 3:1181–1195
34. Valcu C-M, Junqueira M, Shevchenko A et al. (2009) Comparative proteomic analysis of responses to pathogen infection and wounding in Fagus sylvatica. J Proteome Res 8:4077–4091
35. Khoudoli GA, Porter IM, Blow JJ et al. (2004) Optimisation of the two-dimensional gel electrophoresis protocol using the Taguchi approach. Proteome Sci 2:6
36. Lansky D (2001) Strip-plot designs, mixed models, and comparisons between linear and non-linear models for microtitre plate bioassays. Dev Biol 107:11–23
37. Castelloe JM (2000) Sample size computations and power analysis with the SAS system. In: Proceedings of the Twenty-Fifth Annual SAS User's Group International Conference. Citeseer, p 265–225
38. Valcu M, Valcu C-M (2011) Data transformation practices in biomedical sciences. Nat Methods 8:104–105
39. Karp NA, McCormick PS, Russell MR et al. (2007) Experimental and statistical considerations to avoid false conclusions in proteomics studies using differential in-gel electrophoresis. Mol Cell Proteomics 6:1354–1364
Chapter 2
Decoding 2-D Maps by Autocovariance Function
Maria Chiara Pietrogrande, Nicola Marchetti, and Francesco Dondi
Abstract
This chapter describes a mathematical approach based on the study of the 2-D autocovariance function (2-D ACVF), useful for decoding the complex signals resulting from the separation of protein mixtures. The method makes it possible to obtain fundamental analytical information hidden in 2-D PAGE maps by spot overlapping, such as the number of proteins present in the sample and the mean standard deviation of the spots, which describes the separation performance. In addition, it is possible to identify ordered patterns potentially present in spot positions, which can be related to the chemical composition of the protein mixture, such as post-translational modifications. The procedure was validated on computer-simulated maps and successfully applied to reference maps obtained from literature sources.
Key words 2-D PAGE (2-D polyacrylamide gel electrophoresis) maps, Chemometric methods, Bidimensional autocovariance function, Spot overlapping, Bioinformatics
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_2, © Springer Science+Business Media New York 2016
1 Introduction
Two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) separation of proteins, combined with mass spectrometry, is considered the classical and principal tool for proteomic studies, used to achieve a comprehensive identification and quantification of almost every protein present in a complex biological sample (animal or plant tissue) [1, 2]. However, in the last 10 years the importance of two- and multidimensional separation techniques (mainly 2-D liquid chromatography) has increased considerably, as witnessed by recent advances in shotgun or top-down approaches [3, 4]. Despite the enormous improvements in chemical technologies and separation efficiency [5–8], and some impressive innovations in robotics and algorithms for spot detection, protein databases, and data handling [9, 10], a comprehensive separation and elucidation of all the proteins present in the sample is still far from being achieved [11–14]. This is due to the intrinsic complexity of cells and biological fluids, which can contain thousands of proteins, present in a wide range of relative abundances and displaying great differences in structure and size. The consequence is that co-migrating proteins can easily be present in the same spot, which lowers the quality of the analytical information contained in the map [13, 14]. Therefore, the plethora of data obtained from each analytical run requires a proper signal processing procedure for decoding the complexity of the 2-D map, in order to extract all the analytical information contained therein, in particular information highly relevant to proteomics, such as the number of proteins and the chemical composition of the sample, i.e., protein occurrence, identity, abundance, and chemical structure.
The mathematical-statistical method described here is based on the study of the 2-D autocovariance function (2-D ACVF), computed on the experimental digitized map, i.e., the experimental 2-D ACVF (2-D EACVF) [15–18]. This method makes it possible to estimate the complexity of the mixture (number of components, abundance distribution) and the separation performance. Moreover, the study of the 2-D EACVF plot makes it possible to identify the potential presence of structured patterns of protein spots, which can be related to specific protein structural modifications [17].
1.1 Theory
The experimental map is typically acquired in a digitized form consisting of a gridded surface of $N_x \times N_y$ nodes, all equally spaced. As an example, a computer-generated map representing experimental 2-D PAGE gels is reported in Fig. 1a: it contains 200 proteins with an elliptic spot size of $\sigma_x = 0.026$ pI and $\sigma_y = 0.0006$ logMr. The 2-D EACVF is computed on the digitized map as:

$$C_{k,l} = \frac{1}{N_x N_y} \sum_{i=1}^{N_x-k} \sum_{j=1}^{N_y-l} \left(f_{i,j}-\bar{f}\right)\left(f_{i+k,j+l}-\bar{f}\right), \qquad k = -M_x, \ldots, -1, 0, 1, \ldots, M_x; \quad l = -M_y, \ldots, -1, 0, 1, \ldots, M_y \tag{1}$$

where $N_x$ and $N_y$ are the total number of points of the digitized map along the two separation axes, $f_{i,j}$ represents the map intensity at the point $(i, j)$, $\bar{f}$ is the average intensity calculated over all the sampled points, $k$ and $l$ are the lags between subsequent points in the map along the two separation axes over which the 2-D EACVF is calculated, and $M_x$ and $M_y$ are the maximum number of lags used for the 2-D EACVF calculation. Each lag used for computation can be converted into $\Delta x = \Delta pI$ and $\Delta y = \Delta\log M_r$ on the basis of the sampling interdistances between subsequent points along the X and Y axes.
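The computation of Eq. 1 can be sketched in a few lines of code. The following is a minimal illustration in Python rather than the Fortran used by the authors; the function name and the toy map are hypothetical. It assumes the cyclic (wrap-around) procedure mentioned in Note 5, so that every lag is averaged over the same number of points, whereas Eq. 1 as printed truncates the sums at $N_x - k$ and $N_y - l$.

```python
import math

def eacvf_2d(f, max_k, max_l):
    """Cyclic 2-D autocovariance of a gridded map f[i][j] for |k| <= max_k, |l| <= max_l."""
    nx, ny = len(f), len(f[0])
    mean = sum(sum(row) for row in f) / (nx * ny)
    d = [[v - mean for v in row] for row in f]       # mean-centred intensities
    c = {}
    for k in range(-max_k, max_k + 1):
        for l in range(-max_l, max_l + 1):
            total = 0.0
            for i in range(nx):
                shifted = d[(i + k) % nx]            # wrap around along x
                for j in range(ny):
                    total += d[i][j] * shifted[(j + l) % ny]  # wrap around along y
            c[(k, l)] = total / (nx * ny)
    return c

# Hypothetical toy map: a single Gaussian spot on a 16 x 12 grid
nx, ny, sx, sy = 16, 12, 1.5, 1.0
toy_map = [[math.exp(-(i - 8) ** 2 / (2 * sx ** 2) - (j - 6) ** 2 / (2 * sy ** 2))
            for j in range(ny)] for i in range(nx)]
acvf = eacvf_2d(toy_map, max_k=4, max_l=3)
```

The value at the origin, `acvf[(0, 0)]`, is the variance of the map intensities and is the largest value of the function. The direct double loop is adequate for small grids; for full-size digitized maps a faster implementation would be needed.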
The 2-D EACVF can be plotted vs. the interdistances along the two separation axes (ΔpI and ΔlogMr) to obtain a 2-D EACVF plot: Fig. 1b reports the contour plot of the 2-D EACVF computed on the 2-D map shown in Fig. 1a. It is characterized by a main peak centered at the origin with a bidimensional Gaussian shape (enlarged detail in Fig. 1b). Besides the peak in the center, some minor fluctuations of the autocovariance function are observed.
Theoretical expressions for the 2-D ACVF (2-D TACVF) were derived as a function of the separation parameters. The theoretical model requires a frequency function for describing the distribution of spot positions over the X, Y area. Two limiting cases are considered, describing (1) a completely disordered separation and (2) an ordered map [15]. In both cases the single protein spots are represented by a bivariate Gaussian distribution described by the standard deviations along the two separation axes, $\sigma_x$ and $\sigma_y$: both circular ($\sigma_x = \sigma_y$) and elliptical ($\sigma_x \neq \sigma_y$) spots are assumed (see Note 1).
1. Completely disordered separation is characterized by protein positions randomly distributed on the 2-D space (Poissonian retention pattern) (see Note 2): in this case the 2-D TACVF is given by the following equation [15]:

$$C(\Delta x, \Delta y) = \frac{V_T^2\left(\sigma_h^2/a_h^2 + 1\right)}{4\sigma_x\sigma_y\pi X Y m}\, \exp\left[-\frac{(\Delta x)^2}{4\sigma_x^2} - \frac{(\Delta y)^2}{4\sigma_y^2}\right] \tag{2}$$

where $V_T = 2\pi\sigma_x\sigma_y m a_h$ is the total volume of the signal computed on the three coordinates $(x, y, f)$, $m$ is the number of detectable components, $X$ and $Y$ are the lengths of the separation space, $\sigma_x$ and $\sigma_y$ are the standard deviations of single protein spots along the two separation axes, and $\sigma_h^2/a_h^2$ describes the distribution of spot intensities, where $a_h$ is the mean value of protein abundance and $\sigma_h^2$ its variance.
2. An ordered pattern in a 2-D PAGE map may be formed by ordered sequences of protein spots, where the position of the n-th term of the series is described by:

$$x(n) = a_x + b_x n \tag{3a}$$

$$y(n) = a_y + b_y n \tag{3b}$$

where $a_x$, $a_y$, $b_x$, and $b_y$ are constants (see Note 3). In this case, the expression of 2-D TACVF is [15]:
[Figure 1: panel a, axes pI (pH) vs. logMr (kDa); panel b, axes ΔpI (pH) vs. ΔlogMr (kDa), annotated with $C_{0,0}$, the half-height widths $1.665\sigma_x$ and $1.665\sigma_y$, and γ]
Fig. 1 Computation of the experimental 2-D EACVF on a 2-D generated map. (a) Simulated map containing 200 proteins and spot dimensions $\sigma_x = 0.026$ pI and $\sigma_y = 0.0006$ logMr. (b) 2-D EACVF plot versus interdistance along the two separation axes ΔpI and ΔlogMr; enlarged detail: 3-D view of the 2-D EACVF plot region for short
$$C(\Delta x, \Delta y) = \sum_{k=0}^{n_{max}} \frac{V_T^2}{4\sigma_x\sigma_y\pi X Y \left(n_{max} - k + 1\right)} \left(\frac{\sigma_h^2}{a_h^2} + 1\right) \exp\left[-\frac{(\Delta x - b_x k)^2}{4\sigma_x^2} - \frac{(\Delta y - b_y k)^2}{4\sigma_y^2}\right] \tag{4}$$

where $n_{max}$ is the highest value of $n$, i.e., the number of proteins of the series. In this case, the 2-D TACVF plot shows well-defined cones located at interdistances $k b_x$ and $k b_y$: they are called deterministic since they correspond to repeated interdistances among the terms of the series. Their height decreases with $k$, but their shape is independent of $k$.
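As an illustration of the random model, Eq. 2 can be evaluated directly. The sketch below is a minimal Python example; the function name is hypothetical, and the map lengths X and Y, the mean abundance, and the dispersion ratio are assumed values chosen for illustration (the spot dimensions and number of proteins are those of the simulated map of Fig. 1a).

```python
import math

def tacvf_random(dx, dy, m, sx, sy, a_h, disp, X, Y):
    """Theoretical 2-D ACVF of a disordered map (Eq. 2); disp is sigma_h^2 / a_h^2."""
    v_t = 2.0 * math.pi * sx * sy * m * a_h               # total signal volume V_T
    pre = v_t ** 2 * (disp + 1.0) / (4.0 * sx * sy * math.pi * X * Y * m)
    return pre * math.exp(-dx ** 2 / (4.0 * sx ** 2) - dy ** 2 / (4.0 * sy ** 2))

# Assumed parameters: X and Y (lengths of the separation space) are illustrative only
params = dict(m=200, sx=0.026, sy=0.0006, a_h=1.0, disp=1.0, X=7.0, Y=0.16)

c_origin = tacvf_random(0.0, 0.0, **params)
# Lag along the pI axis at which the central peak falls to half of its height
dx_half = 2.0 * math.sqrt(math.log(2.0)) * params["sx"]
c_half = tacvf_random(dx_half, 0.0, **params)
```

The ratio `c_half / c_origin` equals 0.5, which shows why the half-height width of the central peak is $2\sqrt{\ln 2}\,\sigma_x \approx 1.665\,\sigma_x$: the autocovariance of a Gaussian spot of standard deviation $\sigma_x$ is itself Gaussian, with variance $2\sigma_x^2$.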
2 Materials
2.1 PC-Generated SDS 2-D PAGE Maps
1. Take into consideration datasets of the pI and logMr coordinates from the SWISS-2DPAGE database [19].
2. Retrieve the pI and logMr values of identified spots of real reference maps of human tissues from the SWISS-2DPAGE database [19].
2.2 Software
1. Use the dedicated software Melanie (Geneva Bioinformatics, GeneBio S.A., Geneva, Switzerland) to detect the spots of the digitized maps and to measure their volumes.
2. Write the numerical calculation algorithms in Fortran.
3 Method
3.1 Calculation of PC-Generated 2-D Maps
1. Generate 2-D maps with known separation properties in order to validate the method [16–18].
2. Describe each spot by two position coordinates (pI and logMr) and by a third coordinate $f_{i,j}$ representing the spot intensity.
3. Generate the pI and logMr coordinates that follow the same position distribution present in real maps, by applying the rejection algorithm: the flowchart of the proposed algorithm is reported in Fig. 2a. This distribution was evaluated from 1956 identified spots in reference maps of human tissues retrieved from the SWISS-2DPAGE database [19].
4. The rejection algorithm is a simple but efficient method for generating random coordinates whose distribution function p(x) is known within a given range of x but not analytically defined. The cumulative distribution function (the indefinite integral of the probability density function p(x)) is not required. Consider a function f(x) which has finite area and encloses p(x) (see lower inset in Fig. 2a); f(x) is called the comparison function. The integral of f(x) has to be computable, and A is the area below the curve. First, generate a random uniform deviate between 0 and A and pick the corresponding x0 value from the cumulative distribution function of f(x) (transformation method, upper inset in Fig. 2a). Then, independently generate a second random uniform deviate between 0 and f(x0). If this is lower than p(x0), then x0 is accepted; otherwise, x0 is rejected and the procedure is repeated from the beginning. More information on transformation and rejection methods can be found in dedicated textbooks [21].
5. In addition, build up reference 2-D maps using the pI and logMr values of identified spots retrieved from real reference maps of human tissues (SWISS-2DPAGE database) [19].
6. Generate spot intensity values using a random distribution that has been shown to be the most probable for a high number of components [11, 20].
7. Describe spots with elliptical shape, selecting proper $\sigma_x$ and $\sigma_y$ values to represent different experimental conditions, i.e., map dimension, pH gradient, and scanner resolution (see Note 4).
Fig. 1 (continued) interdistances ($\Delta pI < 4\sigma_x$ and $\Delta\log M_r < 4\sigma_y$) from which the parameters $m$, $\sigma_x$, and $\sigma_y$ are estimated. Reproduced from ref. [16] with kind permission of WILEY-VCH Verlag GmbH
Fig. 2 Flowchart of the algorithms used for computation. (a) Rejection algorithm for generating the pI and logMr coordinates that follow the same position distribution present in real maps. (b) Calculation of the 2-D EACVF on the digitized map
3.2 Numerical Calculation of 2-D EACVF on Digitized 2-D Maps
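Before turning to the EACVF calculation, the rejection step used in Subheading 3.1 (step 4) can be sketched as follows. This is a minimal Python illustration, not the authors' Fortran code. It assumes the simplest possible comparison function, a constant f(x) = f_max over the range [lo, hi], for which the transformation step reduces to a linear mapping; the bimodal target density `p_pi` is hypothetical and does not reproduce the distribution derived from the SWISS-2DPAGE spots.

```python
import math
import random

def rejection_sample(p, lo, hi, f_max, n, rng):
    """Draw n values distributed as p(x) on [lo, hi], using a constant comparison function."""
    area = f_max * (hi - lo)                  # area A below the comparison function
    accepted = []
    while len(accepted) < n:
        u1 = rng.uniform(0.0, area)           # first deviate, between 0 and A
        x0 = lo + u1 / f_max                  # inverse cumulative of the constant f(x)
        u2 = rng.uniform(0.0, f_max)          # second deviate, between 0 and f(x0)
        if u2 < p(x0):                        # accept x0 only if the point lies below p
            accepted.append(x0)
    return accepted

def p_pi(x):
    """Hypothetical (unnormalized) bimodal pI density; its maximum is below 0.75."""
    return (0.7 * math.exp(-(x - 5.5) ** 2 / (2 * 0.6 ** 2))
            + 0.3 * math.exp(-(x - 8.0) ** 2 / (2 * 0.5 ** 2)))

rng = random.Random(0)                        # fixed seed for reproducibility
samples = rejection_sample(p_pi, 4.0, 10.0, 0.75, 2000, rng)
```

A constant comparison function is easy to invert but wastes many trials where p(x) is small; a comparison function that follows p(x) more closely raises the acceptance rate. The target density does not need to be normalized, which is convenient when it is only known as a histogram of observed spot positions.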
1. Calculate the 2-D EACVF on the digitized map, according to Eq. 1: the flowchart of the proposed algorithm is reported in Fig. 2b (see Note 5).
2. Plot it versus the interdistance along the two separation axes to obtain the 2-D EACVF plot: as an example, Fig. 1b shows the 2-D EACVF plot computed on the simulated map of Fig. 1a. In general, the 2-D EACVF plot clearly shows a maximum at short interdistances, lower than the average spot sizes $4\sigma_x$ and $4\sigma_y$ (short-term correlations, shown in the inset in Fig. 1b), followed by lower values at higher interdistances (long-term correlations, $\Delta pI > 4\sigma_x$ and $\Delta\log M_r > 4\sigma_y$).
3.3 Estimation of the Separation Parameters
1. Here a simplified version of the 2-D ACVF approach is described: it is based on the measurement of fewer points of the experimental 2-D ACVF for estimating the separation parameters, i.e., the number of detectable components ($m$) and the average spot widths ($4\sigma_x$ and $4\sigma_y$), and for identifying the potential presence of ordered structures [16].
2. The first part of the 2-D EACVF represents the mean spot size averaged on all the spots present in the map, described by $4\sigma_x$ and $4\sigma_y$. The pre-exponential term

$$\frac{V_T^2\left(\sigma_h^2/a_h^2 + 1\right)}{4\sigma_x\sigma_y\pi X Y m} \tag{5}$$

is the same in both random and ordered models (Eqs. 2 and 4).
3. Compute the mean spot standard deviations, $\sigma_x$ and $\sigma_y$, from the width of the 2-D EACVF bidimensional Gaussian peak at half height (enlarged detail in Fig. 1b):

$$h_{x,1/2} = 2\sqrt{\ln 2}\,\sigma_x = 1.665\,\sigma_x \quad \text{and} \quad h_{y,1/2} = 2\sqrt{\ln 2}\,\sigma_y = 1.665\,\sigma_y \tag{6}$$

The average spot dimensions, $\sigma_x$ and $\sigma_y$, are an estimation of the system performance (see Note 6).
4. Compute the number of proteins present in the mixture, $m$, from the 2-D EACVF value computed at the origin, $C_{0,0}$ (i.e., at $\Delta pI = 0$, $\Delta\log M_r = 0$), using the following equation (enlarged detail in Fig. 1b; see Note 7):

$$m = \frac{\ln 2}{\pi}\,\frac{V_T^2\left(\sigma_h^2/a_h^2 + 1\right)}{h_{x,1/2}\,h_{y,1/2}\,C_{0,0}\,X Y} = 0.22\,\frac{V_T^2\left(\sigma_h^2/a_h^2 + 1\right)}{h_{x,1/2}\,h_{y,1/2}\,C_{0,0}\,X Y} \tag{7}$$

This computation requires the detection of the spots and the evaluation of their intensities (i.e., volumes) to compute the total volume, $V_T$, and the relative dispersion ratio of the spot maxima, $\sigma_m^2/a_m^2$.
5. Detect the $p_{max}$ spot maxima $h_{m,j}$, $\forall j = 1, \ldots, p_{max}$, present in the map using an algorithm based on the comparison of seven successive points for each dimension. In this approach, the maximum (fourth point) is detected when the first three values are increasing and the last three decreasing. Alternatively, the dedicated software Melanie II [22] can be employed to yield a more accurate measure of the volumes $V_{m,j}$ of the detected $p_{max}$ spots.
6. In the computation of Eq. 7, the most critical parameter is the estimation of the protein abundance dispersion ratio (see Note 8).
7. Compute the degree of separation achieved in the map, the separation extent $\gamma = p_{max}/m$, as the ratio between the total number of detected spots, $p_{max}$, and the number of proteins, $m$ [11, 12, 17, 20, 23]. This value is usually lower than 1, as a consequence of the spot overlapping present in real 2-D PAGE maps (see Note 9).
Table 1 Computation of the separation parameters on digitized 2-D maps generated by computer simulations (first–sixth rows) and reference 2-D maps retrieved from the SWISS-2DPAGE database (seventh–ninth rows)

| $m$ | $\sigma_x$ | $\sigma_y$ | $\sigma_h^2/a_h^2$ | $m_{est}$ | $e_{rel}$ (%) | $\sigma_{x,est}$ | $\sigma_{y,est}$ | $\sigma_m^2/a_m^2$ | $p_{max}$ | γ (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.026 | 0.0006 | 1 | 203 ± 14 | 7 | 0.026 | 0.0006 | 0.93 | 187 | 92 |
| 200 | 0.009 | 0.0002 | 1 | 196 ± 14 | 7 | 0.009 | 0.0002 | 0.97 | 184 | 94 |
| 500 | 0.026 | 0.0006 | 1 | 459 ± 21 | 5 | 0.026 | 0.0005 | 0.92 | 381 | 83 |
| 500 | 0.009 | 0.0002 | 1 | 485 ± 22 | 5 | 0.009 | 0.0002 | 0.97 | 451 | 93 |
| 750 | 0.009 | 0.0002 | 1 | 710 ± 26 | 4 | 0.009 | 0.0002 | 0.95 | 660 | 93 |
| 1000 | 0.009 | 0.0002 | 1 | 919 ± 3 | 3 | 0.009 | 0.0003 | 0.93 | 836 | 91 |
| HEPG2: 99 | 0.009 | 0.0002 | 1 | 100 ± 10 | 10 | 0.009 | 0.0002 | 0.99 | 99 | 99 |
| DL-1: 108 | 0.009 | 0.0002 | 1 | 104 ± 10 | 10 | 0.009 | 0.0002 | 0.99 | 103 | 99 |
| PLASMA: 626 | 0.009 | 0.0002 | 1 | 601 ± 24 | 4 | 0.009 | 0.0002 | 0.96 | 571 | 95 |

The true values of the number of proteins ($m$) and spot shape ($\sigma_x$, $\sigma_y$) are compared to the estimated results ($m_{est}$, $\sigma_{x,est}$, and $\sigma_{y,est}$): $e_{rel}$ % is the percentage relative error of $m_{est}$ related to $m$. The theoretical value of the protein abundance dispersion ratio, $\sigma_h^2/a_h^2$, is compared with its experimental approximation, the maximum dispersion ratio, $\sigma_m^2/a_m^2$. From the total number of detected spots, $p_{max}$, the degree of separation achieved, γ, was computed.
HEPG2: hepatoblastoma-derived cell line (HEPG2_HUMAN) [19]
DL-1: colorectal adenocarcinoma cell line (DLD1_HUMAN) [19]
PLASMA: human plasma (PLASMA_HUMAN) [19]
8. As an example, the method was applied to synthetic 2-D maps describing the separation pattern usually present in 2-D PAGE gels and to reference 2-D maps retrieved from the SWISS-2DPAGE database (Table 1) [16]. The elliptical spot shape with $\sigma_x = 0.009$ pI and $\sigma_y = 0.0002$ logMr (second, fourth–ninth rows) corresponds to common 2-D PAGE maps (see Note 4). Spot abundance was described by an exponential distribution yielding $\sigma_h^2/a_h^2 = 1.0$ (fourth column). The results obtained on the investigated maps (Table 1) show that the 2-D autocovariance function method gives a correct estimation of the mean spot shape (compare seventh and eighth columns with second and third columns) and of the number of proteins present in the sample (compare fifth column with first column; see Note 10). The obtained values of the degree of separation (γ, 11th column) show that if 500 proteins are present in a sample, only 83 % of them can be separated in the worst conditions, corresponding to lower efficiency ($\sigma_x = 0.026$ pI and $\sigma_y = 0.0006$ logMr, third row), but separation increases up to 93 % with greater efficiency ($\sigma_x = 0.009$ pI and $\sigma_y = 0.0002$ logMr, fourth row).
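The estimation steps of this subheading (Eqs. 6 and 7 and the separation extent γ) can be checked with a short round-trip calculation. The following Python sketch uses hypothetical function names; the map lengths X and Y and the mean abundance are assumed values. It first generates the quantities that would be read from the EACVF plot using Eq. 2, then recovers $\sigma_x$ and $m$ from them, and finally computes γ from the $p_{max}$ and $m_{est}$ values in the first row of Table 1.

```python
import math

HALF_WIDTH_FACTOR = 2.0 * math.sqrt(math.log(2.0))   # 1.665 in Eq. 6

def sigma_from_half_width(h_half):
    """Spot standard deviation from the half-height width of the EACVF peak (Eq. 6)."""
    return h_half / HALF_WIDTH_FACTOR

def number_of_proteins(v_t, disp, hx, hy, c00, X, Y):
    """Number of proteins m from the EACVF value at the origin (Eq. 7)."""
    return math.log(2.0) * v_t ** 2 * (disp + 1.0) / (math.pi * hx * hy * c00 * X * Y)

def separation_extent(p_max, m):
    """Degree of separation, gamma = p_max / m."""
    return p_max / m

# Forward step with assumed true values (X, Y, and a_h are illustrative only)
m_true, sx, sy, a_h, disp, X, Y = 200, 0.026, 0.0006, 1.0, 1.0, 7.0, 0.16
v_t = 2.0 * math.pi * sx * sy * m_true * a_h
c00 = v_t ** 2 * (disp + 1.0) / (4.0 * sx * sy * math.pi * X * Y * m_true)  # Eq. 2 at the origin
hx, hy = HALF_WIDTH_FACTOR * sx, HALF_WIDTH_FACTOR * sy

# Inverse step: recover the separation parameters from the "measured" quantities
sx_est = sigma_from_half_width(hx)
m_est = number_of_proteins(v_t, disp, hx, hy, c00, X, Y)

# Separation extent for the first row of Table 1 (p_max = 187, m_est = 203)
gamma = separation_extent(187, 203)
```

With exact inputs the inverse step returns the true number of proteins, and the γ value for the first row of Table 1 rounds to 92 %. On real maps the accuracy is limited mainly by the estimate of the abundance dispersion ratio, as stated in step 6.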
3.4 Estimation of the Separation Pattern: Detection of Presence of Spot Ordered Sequences
1. Analyze the 2-D EACVF region describing long-term correlation (interdistances higher than $4\sigma_x$ and $4\sigma_y$, Fig. 1b) to extract information on the retention pattern of the 2-D maps. If the 2-D map exhibits some ordered structures formed by spots located at constant interdistances (Δx, Δy) repeated in different regions of the map, the 2-D EACVF plot shows well-shaped deterministic cones at the repeated Δx, Δy values (Fig. 3b shows the 2-D EACVF plot computed on the digitized map of Fig. 3a). The visual detection of such deterministic peaks makes it possible to identify the existence of the sequences, singling them out from the signal complexity [16–18] (see Note 11). If the repeated interdistances (Δx, Δy) correspond to sequences of protein spots, the Δx and Δy values are related to the parameters $b_x$ and $b_y$ of the series (Eqs. 3a and 3b) and the intensities of the 2-D EACVF cones to the number of terms of the spot sequence, $n_{max}$ (Eq. 4) [16].
2. A common feature of the 2-D PAGE maps of biological samples is the presence of trains of protein spots that are consistent with the separation of protein isoforms differing by a constant change in amino acid charges or molecular weight, or produced by co- and post-translational modifications (PTMs) such as glycosylation and phosphorylation [1, 24] (see Note 12).
3. As an example, the 2-D PAGE map of a hepatoblastoma-derived cell line (HEPG2_HUMAN from the SWISS-2DPAGE database [19]) was investigated (Table 1, seventh row, and Fig. 3a). The 2-D EACVF plot computed on the map (Fig. 3b) shows some well-defined deterministic cones that identify some interdistance repetitiveness present in the map (see Note 13). This behavior is more clearly shown by projecting the 2-D EACVF values along each separation axis, i.e., ΔpI and ΔMr (black line in insets in Fig. 3b). For comparison, the figures show the plots of the 2-D EACVF computed on a simulated map containing the same number of components ($m = 99$) with a disordered retention pattern (red line). Well-defined deterministic peaks are evident along the pI axis (inset at the top): one peak at ΔpH = 0.2, which is consistent with protein isoforms generated by acetylation, and another at ΔpH = 0.5, which is consistent with protein alkylation [1, 2, 24]. Also along the second-dimension separation axis (inset at the bottom), some deterministic peaks are evident due to repeated increments of the protein molecular masses, i.e., at ΔMr values of 160 and 330 Da, which can be related to protein acylation or glycosylation (see Note 14).
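The appearance of deterministic peaks can be reproduced on a one-dimensional toy profile, analogous to the projections of the 2-D EACVF along one separation axis. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the profile is a synthetic train of five equal Gaussian spots at a constant spacing of 20 sampling points, and the autocovariance is computed cyclically (see Note 5). Local maxima are searched only at lags beyond the short-term region ($4\sigma$).

```python
import math

def acvf_1d(f, max_lag):
    """Cyclic 1-D autocovariance of the profile f for lags 0..max_lag."""
    n = len(f)
    mean = sum(f) / n
    d = [v - mean for v in f]
    return [sum(d[i] * d[(i + k) % n] for i in range(n)) / n
            for k in range(max_lag + 1)]

def deterministic_peaks(c, min_lag):
    """Lags of positive local maxima of the autocovariance at lags >= min_lag."""
    return [k for k in range(max(min_lag, 1), len(c) - 1)
            if c[k] > c[k - 1] and c[k] > c[k + 1] and c[k] > 0.0]

# Synthetic train of spots: 5 equal Gaussian spots, constant spacing b = 20 points
n_points, sigma, spacing = 200, 2.0, 20
centers = [40 + spacing * n for n in range(5)]
profile = [sum(math.exp(-(i - c0) ** 2 / (2.0 * sigma ** 2)) for c0 in centers)
           for i in range(n_points)]

c = acvf_1d(profile, max_lag=70)
peaks = deterministic_peaks(c, min_lag=int(4 * sigma) + 1)
```

In this toy case the peaks fall at multiples of the spacing (lags 20, 40, and 60), and their heights decrease with increasing lag because fewer pairs of spots are separated by the larger interdistances. In a real map the same peaks are superimposed on the fluctuations produced by the disordered spots, which is why the comparison with a simulated disordered map is useful.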
[Figure 3: panel a, axes pI (pH) vs. Mr (kDa); panel b, EACVF vs. ΔpI (pH) and ΔMr (kDa), with insets marking peaks at 0.2 pH and 0.5 pH (top) and at 160 Da and 330 Da (bottom)]
Fig. 3 2-D EACVF method on a real 2-D PAGE map for identification of trains of spots. (a) Digitized 2-D PAGE map of a hepatoblastoma-derived cell line (HEPG2) from the SWISS-2DPAGE database. (b) Plot of the 2-D autocovariance function computed on the map. Enlarged insets: 2-D EACVF values over the ΔpI (top) and ΔMr (bottom) separation axes; comparison between the 2-D EACVF computed on the HEPG2 map (black line) and on a simulated map containing the same number of components with a disordered retention pattern (red line)
4 Notes
1. The theoretical background of this approach is that a multicomponent 2-D map is considered as a series of 2-D spots with a random distribution of position and height. For the sake of simplicity, here we assume that the spots are modeled by bidimensional Gaussian peaks; thus, the signal is expressed as:

$$f(x, y) = \sum_{i=-\infty}^{\infty} h_i \exp\left[-\frac{(x - x_{0,i})^2}{2\sigma_x^2} - \frac{(y - y_{0,i})^2}{2\sigma_y^2}\right] \tag{8}$$
where hi is the random height, x0,i and y0,i the random positions of the spots, and σ x and σ y the widths of the elliptic 2-D spots along the separation axes x and y. 2. There are theoretical and experimental evidences that in 2-D PAGE maps of biological samples, the disordered pattern is the most probable distribution described by an exponential function, as a consequence of the mixture complexity itself. It derives from the combination of two independent distributions along the separation axes and yields a fully random pattern with spot overcrowding in the central region of the map [1, 2, 11–14, 23, 24]. 3. Usually, the real 2-D maps exhibit a combination of general disordered spots with superimposition of some ordered structures. They are formed by spots located at constant interdistances (ΔpI, ΔlogMr) repeated in different regions of the map, or ordered sequences of spots, such as the case of trains of spots, which correspond to protein isoforms showing a monodimensional shift of the spots to more basic/acid pI values. 4. As an example, two specific cases were considered. The elliptical spot shape with σ x ¼ 0.009 pI and σ y ¼ 0.0002 logMr (second, fourth–ninth rows in Table 1) describes a common 2-D PAGE map where a tissue homogenate sample (ca. 1 mg of total protein) is loaded on a standard gel size of 18-cm strip of broad pH range (pH 3–7) and analyzed with standard (1 mm) scanner resolution. Spots with σ x ¼ 0.026 pI and σ y ¼ 0.0006 logMr correspond to the same 2-D PAGE map analyzed with lower (3 mm) scanner resolution (first and third rows in Table 1). 5. A crucial point of the numerical calculation of 2-D EACVF is computing each node using the same number of points to assure the same degree of precision. For this purpose, use the cyclic calculation procedure by merging the beginning and the end of the separation axes by using negative k or l indices [15]. This means that the 2-D map is handled as it is
wrapped around itself and the right side of the separation space is continuously linked to the left one [25].
6. The estimation of the system performance is the basis for selecting proper experimental conditions in order to optimize the efficiency of the 2-D separation, i.e., improving the gel structure, increasing the gel size, or exploiting narrower and narrower pH gradients in the first dimension IEF separation or both. Another way of reducing ΔpI and ΔlogMr is to increase scanner resolution, i.e., from standard (1 mm) to high ( New.” This will open the “Analysis Wizard,” where the main analysis parameters will be defined and an automated analysis will be started. The wizard will guide you through the following steps 2–12, but you can also re-edit each individual step by hand without using the wizard. Once you have created a new experiment, you can redefine individual parameters by hand. Import the gels to be analyzed in this experiment: “File > Add/Remove Gels.”
2. To detect and separate individual spots, first select an appropriate spot detection algorithm, e.g., “2005 Detection” (see Note 2). Then number and annotate all spots in the gel by “Analysis > Detect spots.” The program will calculate the geometric characteristics of the protein spots such as area, optical density, or spot volume. You can select the parameters in the “Measurements window” (see Note 3).
3. Exclude artifacts by “Analysis > Spot Filtering.” It is not advisable to delete gel artifacts detected as spots, since deleted spots are added to the background. The background is defined as the whole area outside a spot (see next step). A filtered spot, by contrast, is not regarded as background, but it is also not a spot (it is no longer visible and does not add to calculations). You can also exclude spots of a certain size, for example, spots smaller than 100 px² (see Note 4).
4. Subtract the gel background with the setting “mode of non-spot” and a margin of 45 pixels (see Note 5).
5. (Optional) You can create a virtual gel averaged over all sample gels in the respective experimental group by “Edit > Averaged Gels” (see Note 6).
6. Create a reference gel by “Edit > Choose Reference Gel” (see Note 7).
7. “Analysis > Warping” will overlay the individual sample gels onto the reference gel.
8. “Analysis > Matching” will assign every sample spot in the different sample gels to the corresponding spot in the reference gel.
9. Modify the reference gel by adding new spots, “Analysis > Add to Reference Gel.” After this step, you must re-match all gels to
Two-Dimensional Gel Electrophoresis Image Analysis via Dedicated. . .
the modified reference gel to update the current experiment (step 8).
10. By “Edit > Synchronize Spot Numbers,” all (matched) spots will receive the identical number in all gels. You should repeat this step whenever spot matching is modified.
11. For inter-gel comparison, spot parameters are normalized by “Analysis > Normalization” (see Note 8).
12. (Optional) Calibrate for isoelectric point and molecular weight markers. If desired, you can add calibrations to the 2-D images, for example, for the isoelectric point and molecular weight, by “Analysis > 2-D Calibration.” This is useful if molecular weight markers were run in one lane of the gel.
13. Statistically analyze the results. You can use the three built-in statistical tests, which are the t-test, the paired t-test, and analysis of variance (ANOVA). Choose the settings under “Tools > Options > Comparisons.” Alternatively, you can export the data and compare them by other statistical tests, when appropriate [12, 13].
14. (Optional) Direct automated spot picking devices by “Edit > Spot picking” (see Note 9).
15. (Optional) Include additional data such as mass spectra. This feature is not included in the software version used for this manuscript (see Note 9).
16. (Optional) Create a Web-based federated database. To publish the data on the internet, you can create a federated database by “Tools > Web Page Builder” (see Note 10).
17. “Edit > Export to file” or “Edit > Export to Excel” will save the data in additional files, so that you can compare the data by statistical and bioinformatic methods of your own choice.
18. Make sense of the data. This is the most challenging step of the whole analysis, and there is no general protocol for it. Bioinformatic tools for proteome analysis are available [14–17]. The appropriate analysis depends on your original question and study rationale. The interested reader is referred to Table 2 and the literature [14, 18].
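As a hedged illustration of steps 13 and 17, the following minimal sketch compares the exported values of one matched spot between two experimental groups outside the gel software. The data are invented for illustration, the function name `welch_t` is hypothetical, and Welch's unequal-variance t statistic is used here as one possible choice; it is not claimed to be the test implemented in the software.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, degrees_of_freedom) of Welch's unequal-variance t-test."""
    na, nb = len(group_a), len(group_b)
    # Squared standard errors of the two group means
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical exported data: relative spot volumes (%) of one matched spot
control = [0.42, 0.45, 0.40, 0.43]
treated = [0.61, 0.58, 0.66, 0.63]

t, df = welch_t(control, treated)
fold_change = mean(treated) / mean(control)
print(f"t = {t:.2f}, df = {df:.1f}, fold change = {fold_change:.2f}")
```

The p-value would then be read from a t distribution with `df` degrees of freedom, and some correction for multiple testing is usually needed when many spots are compared.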
4 Notes
1. Image file specifications: a single digitized gel image should have a data range of at least 1 byte = 8 bits (2⁸ = 256 grayscale values) or, better, 2 bytes = 16 bits (2¹⁶ = 65,536 grayscale values).
Martin H. Maurer
Table 2 Selected Internet tools for the analysis of proteomic data
Categories used in the table: Pathway analysis; Literature mining; Software
Name | Description | URL
BioCarta | Interactive graphical display of pathways | www.biocarta.com/genes/index.asp
Pathway Interaction Database | Biomolecular interactions and cellular processes assembled into authoritative human signaling pathways | pid.nci.nih.gov
Human Protein Atlas | Protein expression atlas in humans | www.proteinatlas.org
GeneCards | Hyperlinked integrated database of human genes | www.genecards.org
PathGuide | Compilation of pathway resources | www.pathguide.org
Gene Ontology Tree Machine | Compares gene lists to GO terms | bioinfo.vanderbilt.edu/gotm/
iHOP (Information Hyperlinked over Proteins) | Literature-based information tool | www.ihop-net.org
Coremine Medical | Building biomedical mindmaps | www.coremine.com
Chilibot | Search for relationship between two (or many) genes/proteins or lists thereof | www.Chilibot.net
FABLE | Mines the biomedical literature for information about human genes and proteins | www.fable.chop.edu
BioProfiling.de | Analytical web portal for high-throughput biology | www.bioprofiling.de/
BaCelLo | Prediction of subcellular localization | gpcr.biocomp.unibo.it/bacello/
The E-Cell Project | Computational biology modeling of complex cellular systems | www.e-cell.org/
Cytoscape | Visualizing and integrating molecular interaction networks and gene expression profiles | www.cytoscape.org
DAVID (Database for Annotation, Visualization and Integrated Discovery) 2008 | Integrates functional genomic annotations with graphical summaries. Annotation and summary of lists of identifiers for Gene Ontology, protein domain, and biochemical pathway membership | david.abcc.ncifcrf.gov
MeV (MultiExperiment Viewer) | Analysis, visualization, and data mining of large-scale genomic data and microarray tool | www.tm4.org/mev.html
Gene list Venn diagrams | Creates “Venn diagrams” (overlapping circles) of gene lists | genevenn.sourceforge.net/
L2L Microarray Analysis Tool | Database of published microarray gene expression data and comparing own data to published microarray results | depts.washington.edu/l2l/
GOMiner | Biological interpretation of “omic” data | discover.nci.nih.gov/gominer/index.jsp
This list assembles some important freely available Internet tools which I found useful for further proteomics analysis. The list is neither comprehensive, nor does it contain commercially available links or software (all links were last accessed on 11 January 2014)
The resolution of the image should be at least 100 to 150 μm, or 300 dpi (dots per inch). Whatever setting is chosen, the image parameters should be documented and kept constant across all gel images in the experiment. Most 2-D programs allow the import of graphic files in common formats such as TIFF, JPEG, BMP, GIF, or PNG, but it is strongly recommended not to use compressed file formats.
2. Spot detection algorithms: most software programs apply “watershed algorithms” for spot detection [4, 19], which interpret the gel image as a topographical map with the pixel intensities as levels above zero. Spots appear as “mountains” in this map, and the boundary between overlapping spots is found by determining the virtual “watershed” [20]. For a discussion of other spot detection algorithms, see [3]. In the PG200 software, you can select other spot detection algorithms besides the 2005 Spot Detection: the 2003 Standard Detection, Closer Boundaries (2003), Extra Splitting (2003), High Sensitivity (2003), Ultra Sensitivity (2003), and Wider Boundaries (2003). The algorithms are not specified in detail in the software manual. Based on our intra-lab testing, we recommend the automated 2005 Spot Detection algorithm (Maurer, unpublished data).
3. Measurement parameters: characteristics of spot geometry include the spot area, which is defined as the number of pixels within the spot boundaries; the optical density (OD) of a spot, which is its maximum pixel intensity; the relative optical density (%OD), which is the optical density of a spot divided by the sum of the optical densities of all spots in the gel, multiplied by 100; the spot volume, which is defined as the integral of all pixel intensities within the spot boundary; and the relative spot volume,
which is the spot volume divided by the sum of all spot volumes in the gel, multiplied by 100. Differences in the usage of spot geometry parameters are discussed in the literature [21–23].
4. Spot filtering: it is advisable to exclude very small signals below a certain threshold, e.g., signals smaller than 100 pixels². Manual editing tools allow the automatically detected spots to be reshaped, for example, to unite separated spots or to split spots, but manual interventions should be kept to a minimum, because they decrease the reproducibility of the inter-gel comparisons [24]. Moreover, regions in which it is difficult to detect protein spots automatically should also be filtered. One such region is the bottom line, at about 14 kDa, where protein spots are mostly unseparated within the bromophenol blue line. Another is the streaking lines of unfocused proteins from the first dimension, corresponding to the electrode supporting surface. These regions produce a large number of “false-positive” spots.
5. Background subtraction: the software offers other methods for background subtraction, such as no background, lowest on boundary, and average on boundary. Within a margin of 45 pixels, the software searches for background values around the detected and marked spot. This is an empirical value. The margin should not be too close to the spot, since the background will then be too high, and should not be too far away from the spot, since this will increase the calculation time. With regard to the signal-to-noise ratio, empirical studies in our laboratory found that a good signal intensity to work with is about 20 % above the gel background (Maurer et al., unpublished data). Moreover, the manual definition of the gel background may decrease the reproducibility of the inter-gel comparisons; thus, an algorithm should be chosen that sets the background to the average or modal value of all non-spot areas (also called “mode of non-spot”). Other valid methods of background subtraction include the subtraction of the global image minimum, meaning that the lowest pixel value in the gel is subtracted, and the polynomial estimation of the gel background in a specified reference area of the gel. Algorithms based on the calculation of the lowest or average pixel values along the spot boundary (lowest on boundary and average on boundary) should be handled with care, since spots often overlap in these areas or unspecific staining is unevenly distributed. In this context, it is important to remember that spots which have been manually deleted will contribute to the background calculation. Instead of deleting spots, they should be
filtered (see above, Note 4). Thus, only the spot representation is suppressed, but not their influence on any spot-related calculation.
6. Although averaging gels may be a useful tool for creating complex and comprehensive reference gels, because it adds data from all gels in the experimental set, it is only of limited value in the quantitative interpretation of proteomic results, since the data of individual spots will sometimes not be considered in an averaged reference gel.
7. The quality of the whole experiment is influenced by the quality of the reference gel, as it is the basis for spot matching. Therefore, special care should be taken in choosing and modifying the reference gel. The number of spots represented as well as the resolution and separation of two spots should be considered (Fig. 2).
Fig. 2 Parameters influencing the choice of the reference gel. The “ideal” reference gel combines both a maximum number of spots and an optimal resolution, which includes a good separation of the spots. In the example gels: in (a), neither number nor resolution is good, whereas in (b) the spot resolution is acceptable, but only a few spots are stained in the gel. In (c), on the other hand, there is a larger number of spots, but they are overlapping and blurry. Therefore, (d) is my choice as reference gel image in the first instance, but keep in mind that the reference gel is virtual and is modified in subsequent steps by the software (Reproduced with kind permission from Springer Science + Business Media: Expression Profiling in Neuroscience, Two-dimensional protein analysis of neural stem cells, 2012, p. 116, M. H. Maurer, Fig. 2)
8. The normalization procedures include total spot volume, single spot, a preselected value, match ratio, and total volume ratio. I recommend the total spot volume, since this will result in the most reliable values for normalized spots. Other normalization methods may be useful in the individual experimenter’s hands.
9. Additional features can be added to the dataset, but in the context of analyzing 2-D gel data, these features are not mandatory. For example, directing the spot picking device is useful if you have an integrated workstation for post-image gel processing, but for data analysis of the gel image, this step is not necessary. The addition of mass spectra will describe the protein sample, but it is not required for the software analysis of the 2-D gel image.
10. The creation of a web-based federated database may be a desirable step in the publication process of proteomic data [25], since it provides an easy way to share data. However, the care and update of a federated database require continuous work; thus, many federated databases are no longer available [14], or the software used for their creation is outdated. If you wish to create a federated database, I recommend the use of open-source software, for example, the MAKE 2D-DB package [26]. In a recent development, more and more databases accept the direct submission of MS datasets [27].
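The definitions in Notes 3, 5, and 8 can be made concrete with a small worked example. The sketch below is only an illustration under stated assumptions: the tiny image, the spot masks, and all function names are hypothetical, the background is taken as the most frequent value of all non-spot pixels (one reading of “mode of non-spot”), and the parameters are computed after background subtraction, which may differ from what a given software package does internally.

```python
from collections import Counter


def mode_of_non_spot(image, spot_masks):
    """Most frequent pixel value among pixels not covered by any spot."""
    covered = set().union(*spot_masks)
    values = [v for r, row in enumerate(image)
              for c, v in enumerate(row) if (r, c) not in covered]
    return Counter(values).most_common(1)[0][0]


def spot_parameters(image, mask, background):
    """Area, optical density (maximum) and volume (sum) of one spot."""
    pixels = [image[r][c] - background for (r, c) in mask]
    return {"area": len(pixels), "od": max(pixels), "volume": sum(pixels)}


# Hypothetical 5 x 6 gel image with two spots, A and B
image = [
    [10, 10, 10, 10, 10, 10],
    [10, 40, 60, 10, 10, 10],
    [10, 50, 90, 10, 30, 10],
    [10, 10, 10, 10, 50, 10],
    [10, 10, 10, 10, 10, 11],
]
spots = {
    "A": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "B": {(2, 4), (3, 4)},
}

background = mode_of_non_spot(image, spots.values())
params = {name: spot_parameters(image, mask, background)
          for name, mask in spots.items()}

# Total spot volume normalization: relative values in percent
total_volume = sum(p["volume"] for p in params.values())
total_od = sum(p["od"] for p in params.values())
for p in params.values():
    p["rel_volume"] = 100.0 * p["volume"] / total_volume
    p["rel_od"] = 100.0 * p["od"] / total_od

for name, p in params.items():
    print(name, p)
```

Because the relative spot volumes of a gel always sum to 100, they can be compared between gels that differ in overall staining intensity, which is the rationale for the total spot volume normalization.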
Acknowledgments
The author has been supported by grants of the European Union within the Framework Programme 7, the German Ministry of Research and Education (BMBF) within the National Genome Research Network (NGFN-2), the German Research Foundation DFG, the intramural program of the Medical Faculty of the University of Heidelberg, the Steuben-Schurz Society, and the Estate of Friedrich Fischer. The author thanks Professor em. Dr. Wolfgang Kuschinsky, Heidelberg, for his sustained support and dedicates this work to him on the occasion of his 70th birthday.
References
1. Maurer MH, Kuschinsky W (2007) Proteomics. In: Lajtha A (ed) Handbook of neurochemistry and molecular neurobiology, 3rd edn. Springer, New York, pp 737–769
2. Berth M, Moser FM, Kolbe M et al. (2007) The state of the art in the analysis of two-dimensional gel electrophoresis images. Appl Microbiol Biotechnol 76:1223–1243
3. Maurer MH (2006) Software analysis of two-dimensional electrophoretic gels in proteomic experiments. Curr Bioinformatics 1:255–262
4. Pleissner KP, Hoffmann F, Kriegel K et al. (1999) New algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis gel databases. Electrophoresis 20:755–765
5. Lemkin PF (1999) Comparing 2-D electrophoretic gels across Internet databases. Methods Mol Biol 112:393–410
6. Oye OK, Jorgensen KM, Hjelle SM et al. (2013) Gel2DE - a software tool for correlation analysis of 2D gel electrophoresis data. BMC Bioinformatics 14:215
7. Natale M, Maresca B, Abrescia P et al. (2011) Image analysis workflow for 2-D electrophoresis gels based on ImageJ. Proteomics Insights 4:37–49
8. Morris JS, Clark BN, Gutstein HB (2008) Pinnacle: a fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data. Bioinformatics 24:529–536
9. Li F, Seillier-Moiseiwitsch F (2011) RegStatGel: proteomic software for identifying differentially expressed proteins based on 2D gel images. Bioinformation 6:389–390
10. Maurer MH (2012) Two-dimensional protein analysis of neural stem cells. In: Karamanis Y (ed) Expression profiling in neuroscience. Humana Press, Totowa, NJ, pp 101–117
11. Adobe Systems Incorporated (1992) TIFF Revision 6.0. http://partners.adobe.com/public/developer/tiff/index.html. Accessed 2013-12-18
12. Maurer MH, Feldmann RE Jr, Brömme JO et al. (2005) Comparison of statistical approaches for the analysis of proteome expression data of differentiating neural stem cells. J Proteome Res 4:96–100
13. Maurer MH (2005) Comparison of large proteomic datasets. Curr Proteomics 2:179–189
14. Maurer MH (2004) The path to enlightenment: making sense of genomic and proteomic information. Genomics Proteomics Bioinformatics 2:123–131
15. Chakravarti DN, Chakravarti B, Moutsatsos I (2002) Informatic tools for proteome profiling. Biotechniques Suppl 4–10:12–15
16. Kanehisa M, Bork P (2003) Bioinformatics in the post-sequence era. Nat Genet 33(Suppl):305–310
17. Boguski MS, McIntosh MW (2003) Biomedical informatics for proteomics. Nature 422:233–237
18. Maurer MH (2012) Web-based tools for the interpretation of chain-like protein spot patterns. Curr Proteomics 9:18–25
19. Bettens E, Scheunders P, Van Dyck D et al. (1997) Computer analysis of two-dimensional electrophoresis gels: a new segmentation and modeling algorithm. Electrophoresis 18:792–798
20. Maurer MH (2004) Simple method for three-dimensional representation of 2-DE spots using a spreadsheet program. J Proteome Res 3:665–666
21. Seillier-Moiseiwitsch F, Trost DC, Moiseiwitsch J (2002) Statistical methods for proteomics. Methods Mol Biol 184:51–80
22. Appel RD, Vargas JR, Palagi PM et al. (1997) Melanie II - a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18:2735–2748
23. Dowsey AW, Dunn MJ, Yang GZ (2003) The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 3:1567–1596
24. Mahon P, Dupree P (2001) Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis 22:2075–2085
25. Appel RD, Bairoch A, Sanchez JC et al. (1996) Federated two-dimensional electrophoresis database: a simple means of publishing two-dimensional electrophoresis data. Electrophoresis 17:540–546
26. Mostaguir K, Hoogland C, Binz PA et al. (2003) The Make 2D-DB II package: conversion of federated two-dimensional gel electrophoresis databases into a relational format and interconnection of distributed databases. Proteomics 3:1441–1444
27. Vizcaino JA, Reisinger F, Cote R et al. (2011) PRIDE and “Database on Demand” as valuable tools for computational proteomics. Methods Mol Biol 696:93–105
Part II Image Analysis Tools to Provide Spot Volume Datasets
Chapter 4
Comparative Evaluation of Software Features and Performances
Daniela Cecconi
Abstract
Analysis of two-dimensional gel images is a crucial step in determining changes in protein expression, but at present it still represents one of the bottlenecks in 2-DE studies. Over the years, different commercial and academic software packages have been developed for the analysis of 2-DE images. Each of these shows different advantageous characteristics in terms of quality of analysis. In this chapter, the characteristics of the different commercial software packages are compared in order to evaluate their main features and performances.
Key words Software packages, Spot detection, Warping, Gel matching, Data analysis
1 Introduction
2-DE is the oldest technology widely used in proteomics: it was first described in 1975 [1, 2]. Since its introduction, a number of specialized image analysis software packages have been developed to facilitate rapid and accurate comparisons of different proteome profiles. Currently, the best-known commercial software packages for 2-DE image analysis are: DeCyder 2D Differential Analysis (GE Healthcare, Little Chalfont, UK), Delta2D (Decodon, Greifswald, Germany), Dymension (Syngene, Cambridge, UK), Image Master 2-D (GE Healthcare, Little Chalfont, UK), Melanie (Geneva Bioinformatics, Geneva, Switzerland), PDQuest (Bio-Rad Laboratories, Hercules, CA, USA), Phoretix 2-D Advanced (Nonlinear Dynamics, Newcastle Upon Tyne, UK), Progenesis SameSpots (Nonlinear Dynamics, Newcastle Upon Tyne, UK), Proteomweaver (Bio-Rad Laboratories, Hercules, CA, USA), and Redfin Solo (Ludesi, Malmoe, Sweden). In addition, albeit less frequently used, some academic packages are also available, for example, Pinnacle (see Chapter 11), Flicker, and RegStatGel.
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_4, © Springer Science+Business Media New York 2016
Fig. 1 The two main approaches followed by 2-DE analysis software packages
As shown in Fig. 1, these programs follow two main workflows: the traditional packages perform spot detection prior to gel matching and analysis of expression profiles, whereas fourth-generation packages first apply a warping procedure and then perform spot detection across all gels, leading to one spot map that is superimposed on each warped image [3]. Table 1 lists the currently most popular traditional and fourth-generation software packages with their corresponding web addresses. These 2-DE analysis packages span a broad range of options, capabilities, and prices, which makes it somewhat difficult to choose the software best suited to a specific need. The available comparative studies can help with this choice. Several comparisons of these software packages have been published over the years. For example, in an early study from 2002, Raman et al. [4] compared Melanie (v. 3.03) and the pixel-based software Z3 (v. 1.51 and 2.0; this software has since been discontinued). The authors found that Z3 performed better than Melanie in spot detection, while, as far as gel matching is concerned, the one
Table 1 Most popular commercial software packages for 2-D gel analysis
Software | Company | Availability
Traditional software packages
DeCyder 2D Differential Analysis | GE Healthcare | http://www5.gelifesciences.com
Dymension | Syngene | http://www.syngene.com/dymension
Image Master 2D Platinum | GE Healthcare | http://www5.gelifesciences.com
Melanie | Geneva Bioinformatics | http://www.genebio.com/products/melanie/
Phoretix 2D Advanced | Nonlinear Dynamics | http://www.nonlinear.com
PDQuest | Bio-Rad Laboratories | http://www.bio-rad.com/en-us/product/pdquest2-d-analysis-software
Proteomweaver | Bio-Rad Laboratories | http://www.bio-rad.com/it-it/sku/soft-lit-5484proteomweaver-2-d-analysis-software
Fourth-generation software packages
Delta 2D | Decodon | https://www.decodon.com/delta2d.html
Progenesis SameSpots | Nonlinear Dynamics | http://www.nonlinear.com
Redfin Solo | Ludesi | http://www.ludesi.com/redfin/
that outperforms depends on the type of image distortion (for geometric distortions Z3 gave better results, whereas in nongeometric distortions the two packages performed comparably). In the same year, Nishihara and Champion [5] compared Z3 (v 1.5) with PDQuest (v. 6.2.1) and Progenesis (version not specified) concluding that the three software packages detected similar numbers of spots among the gel replicates. Rosengren et al. [6] in a comparison between PDQuest (v. 7.0.1) and Progenesis (v. 2002.1) confirmed that there was no significant difference between PDQuest and Progenesis, and that both packages were sensitive to the adjustable parameters with respect to the detection of true and false positive spots. They also analyzed three gel sets previously used by Raman et al. [4], and showed that spot quantification was more accurate with Progenesis and PDQuest than with Melanie and Z3. Phoretix 2D (v. 2004) and PDQuest (v. 7.2) were analyzed by Wheelock et al. [7]. The authors showed that Phoretix 2D outperformed PDQuest, which exhibited higher software-induced variance in spot quantification. In an examination of PDQuest (v 7.1.1) and Progenesis (v 2003.02), Arora et al. [8] demonstrated that even if PDQuest is more consistent than Progenesis in spot detection, it requires
more user intervention. Moreover, despite individual strengths and weaknesses, PDQuest and Progenesis are comparable in terms of spot identification, spot quantification, and gel matching. Karp et al. [9] compared DeCyder (v. 5.02), Progenesis (v. 2006), and Progenesis SameSpots (v. 2006), showing that the latter strongly reduces the level of missing values, in particular in post-stained gels. Five different software packages, Delta 2D (v. 3.3), ImageMaster 2D Platinum (v. 6), PDQuest (v. 8.0), Progenesis (v. 2005), and ProteoMine (v. 2005), were considered by Clark et al. [10]. The authors found that the matching performance of these packages worsens with increasing gel numbers, with the exception of Delta 2D. Moreover, owing to its more effective gel matching, Delta 2D had the shortest overall analysis time. Kang et al. [11] compared DeCyder (v. 6.5), Progenesis SameSpots (v. 3.0), and Dymension (v. 3), demonstrating that Progenesis SameSpots performed best in terms of matching accuracy, while the protein fold changes obtained were substantially different in each package, indicating that quantification is software dependent. More recently, Millioni et al. [12] compared Delta2D (v. 4) and Proteomweaver (v. 3.0) and demonstrated, by using experimental gel images, that Delta 2D required less time to complete the analysis and also fewer user interventions. Thus, there are many studies on different software packages; however, a comprehensive evaluation of all of them is still not available. In addition, it should be noted that the types of samples evaluated in these publications, as well as the number of gels considered, were in some cases not reflective of a real proteomics experiment [10]. These comparisons have been performed using either artificial gels [4, 6] or actual gels [9, 11] (but in some cases with relatively few replicates [7, 8, 13]), by means of gel replicates generated by artificial image distortion [4, 6, 12], or by investigating only selected spots within a gel [5]. Starting from these premises, we discuss below the main steps of the image analysis (i.e., preprocessing of the gel images, spot detection and quantification, gel matching, and data analysis), trying to highlight the advantages and disadvantages of the traditional and fourth-generation software packages listed in Table 1.
2 Preprocessing of the Gel Images
At this first stage of the image analysis, three major problems must be considered when choosing the software [14]: (1) preprocessing of the gel; (2) reliable detection of spots that do not appear as ideal rounded shapes; and (3) overlapping spots. The purpose of preprocessing is the removal of image
artifacts to improve the general quality of the images. Unfortunately, the noise filtering and the background subtraction performed by the different software applications tend to affect the quantification of the spots [15] and the variability of the analysis [16], respectively. In particular, the denoising filters of both types of programs (traditional and fourth generation) tend to erode spot edges and alter the intensity values of the spot pixels [17], which influences their quantification. As regards background subtraction, by contrast, the fourth-generation programs appear preferable, since they introduce less variability into the analysis than the traditional ones. Because identical spot boundaries are drawn across all gel images, the area used for background subtraction remains constant across all gels, which reduces the inter-gel variability [18].
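The effect of a denoising filter on spot intensities can be illustrated with a deliberately simple example. The sketch below assumes a synthetic Gaussian spot and uses a 3 × 3 mean filter as a stand-in for a denoising step; the commercial packages use their own, undisclosed or different, filters, and all names and values here are hypothetical.

```python
from math import exp


def gaussian_spot(size=15, height=100.0, sigma=1.5):
    """Synthetic, noise-free spot centered in a size x size image."""
    c = size // 2
    return [[height * exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]


def mean_filter_3x3(image):
    """Replace every interior pixel by the mean of its 3 x 3 neighborhood."""
    n_rows, n_cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels are left unchanged
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = sum(window) / 9.0
    return out


raw = gaussian_spot()
smoothed = mean_filter_3x3(raw)

peak_raw = max(max(row) for row in raw)
peak_smoothed = max(max(row) for row in smoothed)
volume_raw = sum(sum(row) for row in raw)
volume_smoothed = sum(sum(row) for row in smoothed)

print(f"peak:   {peak_raw:.1f} -> {peak_smoothed:.1f}")
print(f"volume: {volume_raw:.1f} -> {volume_smoothed:.1f}")
```

In this toy case the total volume is almost unchanged, while the peak intensity drops markedly and the signal spreads outward. Any quantity that depends on the maximum intensity or on a fixed intensity threshold at the spot edge is therefore altered by the filter.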
3 Spot Detection
Spot detection is a challenging and important task in obtaining proteomic information; after spot detection, characteristic information about each spot is extracted to quantify protein expression and to aid the gel matching process. The algorithms applied by the commercial software packages to perform spot detection can be divided into two classes: parametric and nonparametric. Parametric spot detection, which is generally used by traditional programs, transforms the spot shape into the “ideal” elliptic shape by fitting to Gaussian distributions. Nonparametric spot detection, which is instead characteristic of the fourth-generation software, applies no constraints on spot shape, so that the spots are detected with the actual shape they have on the gel [19]. The Gaussian transformation works well for smaller spots, but not so well for larger ones (which generally tend to take irregular shapes). Moreover, it has been demonstrated that the spot-based Gaussian fitting algorithm used in PDQuest introduces about twofold higher levels of software-related variability than the nonparametric pixel-based algorithms of Phoretix 2D Expression (median CV 8 % versus 4 %) [7]. Gaussian modeling is also ineffective for saturated spots, so it does not allow their proper quantification (see Chapter 12). On the other hand, parametric spot detection allows the image to be replaced by a list of spot centers, which greatly reduces the complexity of the data file of the gel image. It has been shown that Melanie reduces a map with a thousand spots to a file of 28 kb, i.e., approximately 1 % of the size required by the nonparametric pixel-based approach [20]. Another advantage of parametric spot detection is that it handles overlapping spots very well, so in the case of overlapping spots the traditional software packages work better than the fourth-generation programs [15].
As mentioned above, traditional software packages perform spot detection on each gel and subsequently compare the different proteome profiles through gel matching. This kind of approach typically requires extensive user intervention to correct the spots that the software has or has not automatically detected (for example, to split overlapping spots, to remove small deposits of dye or impurities in the gel that were detected as spots, or to add spots that were not identified because they are too small or too faint). In a completely different way, the fourth-generation warping-based software packages align the images as a first step, by aligning spot positions across all gel images. The spots are then detected simultaneously across all the images, leading to one spot map that is superimposed on each warped gel, so that the spot outlines are identical across all gels. In this case the user intervention is less invasive, since spot editing is performed only on the fusion image and then transferred to each replicate [21]. It should be noted, however, that the warping-based software packages tend to amalgamate neighboring spots into single spots, particularly in areas of high spot density. To correct these spots, the user has to find a way to draw a pattern of multiple spot species that is consistent for all images. For this reason, warping-based programs may detect a lower number of spots, owing to the merging of distinct spots in regions of the gel where many spots are present [9]. Notably, in a comparison of five commercially available software packages, Delta 2D, Image Master Platinum, PDQuest, Progenesis, and Proteomine, it was shown that, with the manufacturer-recommended operative settings, PDQuest and Progenesis automatically detected significantly more spots than the other three software packages, Delta 2D included [10].
3.1 Missing Values
In 2-DE not all spots can be reproduced in all gels, which results in incomplete datasets. The origin of the missing values can be technical (i.e., over-focusing of proteins, poor transfer from the first to the second dimension, and gel-to-gel variation in staining) or biological (PTMs that change the pI of the protein, and isoforms) [22]. Introducing a missing value is a correct choice in the second case, but leads to serious problems in the first. Missing values affect not only the matching of the gels, but also the downstream statistical analysis. Incomplete datasets can reduce the efficiency and statistical power of a quantitative 2-DE analysis. This is especially a problem when multivariate statistical analysis is planned, since it requires complete datasets. Missing values can be eliminated only by excluding all spots with missing values or by estimating the missing quantitative values from the existing spots [23]. An even better way to eliminate the problem of missing values is to use fourth-generation software packages. These software packages in fact use a common set of spot boundaries for all the
2-DE Image Analysis Software
gels. The fourth-generation software packages overlap all the gels to create a master gel that contains all the possible spots; in this way all the spots are matched, and their positions are then transferred to the original images. This makes it possible to obtain 100 % spot matching across all images. At present, the only drawback of this approach is that missing spots must still be checked by the user, to establish whether the spot is really missing or is qualitatively different.
4
Spot Quantification
After spot detection, characteristic information about each spot is extracted to quantify protein expression (by calculating the integrated optical density, or scaled volume) and to aid the gel matching process. There are two basic methods for spot quantification: model-based quantitation and image segmentation. Model-based quantitation models the spot intensity as a Gaussian distribution or some variant thereof, while image segmentation partitions the image into nonoverlapping segments, classifying each pixel as belonging to a spot or to the background. Both methods have a weak point: model-based approaches may not fit the spot boundary perfectly, while segmentation approaches can be affected by variability due to unfiltered background or noise pixels. Taking all the above considerations into account, both types of software packages have strengths and weaknesses. Neither solves all the problems that can arise in the early steps of image analysis or in spot detection and quantification.
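The segmentation approach can be illustrated with a minimal Python sketch. It assumes that spots are brighter than the background and that a single global threshold separates them, which is a simplification; the function name `spot_volumes` and the toy image are hypothetical and are not taken from any of the packages discussed. Connected above-threshold pixels are grouped into one segment, and their background-corrected intensities are summed to give one volume per spot.

```python
def spot_volumes(image, threshold, background=0):
    """Return one integrated volume per connected above-threshold segment."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    volumes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue  # background pixel, or pixel already assigned to a spot
            stack, volume = [(r, c)], 0
            seen[r][c] = True
            while stack:  # flood fill over 4-connected neighbours
                y, x = stack.pop()
                volume += image[y][x] - background
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            volumes.append(volume)
    return volumes


# Hypothetical toy image containing two separate spots.
toy_image = [
    [0, 0, 0, 0, 0],
    [0, 5, 6, 0, 0],
    [0, 0, 0, 0, 9],
    [0, 0, 0, 0, 8],
]
volumes = spot_volumes(toy_image, threshold=1)
```

Because every pixel is assigned to exactly one segment or to the background, any noise pixel that exceeds the threshold becomes a spurious "spot", which is the weakness of segmentation noted above.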
5
Gel Matching
In order to compare the proteomic profiles and identify the differentially expressed proteins, it is essential that the images are properly aligned and superimposed. An important prerequisite for efficient gel matching is therefore the image warping step, which can be carried out by the two approaches mentioned above: spot-based and pixel-based methods [14]. Spot-based methods, performed by traditional software packages, start from the list of detected spots and usually involve so-called landmark points, which act as anchors for the distortion and the matching of the different images. The centers of the detected spots are used for the matching step. Usually a “reference” or “standard” gel is chosen, and the other gels are aligned to it in pairwise fashion. Landmarks can be chosen by the user or selected automatically by some programs. In both cases the landmarking process is prone to errors; indeed, it is sometimes difficult to determine whether the change in spot
location depends on experimental distortions of the gel or on biological modifications of the protein. In contrast, the pixel-based method [24], which characterizes the fourth-generation packages, performs the image warping directly on the raw data, by treating the image as a surface formed by the pixel intensities. The method requires alignment and background correction of the images before the analysis. In this case, additional features, such as spot shape, streaks, and reproducible background features, are used to correct for geometric distortions [25]. This algorithm follows the concept of recursive gel matching: each gel of a matching set is used once as the “reference gel” during the matching process. The main advantages of this method are reduced user intervention and improved analytical results. However, these fourth-generation software packages also match gels better when landmarks are defined. It has been reported that, using only five landmarks, Progenesis SameSpots outperformed the DeCyder and Dymension software packages, correctly matching 92 % of spots [11]. On the other hand, Valledor et al. [22] compared Progenesis SameSpots (v. 4.0.3), PDQuest (v. 8.01), and Delta 2D (v. 4.0), and found that SameSpots and PDQuest provided the best matching quality after some manual spot corrections and rematching. In general, warping makes spot matching both faster and more accurate. Thus the warping performed by fourth-generation programs prior to any image analysis reduces user time, improves the accuracy, and reduces the subjectivity of the subsequent matching.
6
The Data Analysis
The final step of image analysis involves statistical analyses to identify the significantly modulated spots. In general, the workflow of commercial packages can be summarized as follows: normalization, log transformation of the data, and a statistical test to detect differential expression. The range of tests provided by the commercial software packages has increased over the years, so that multivariate analyses are now available in addition to univariate statistical tests. Standard univariate statistical approaches may be appropriate if the experiment is designed to examine alterations of only a few proteins, though they are inherently sensitive to type I error (false positives) [20]. Multivariate approaches have the additional benefit of providing information about the relationships between samples and variables, allowing the identification of proteins that are involved in the same biochemical pathway [26]. For these reasons it is very important to consider carefully which statistical tests are included in the software and, if necessary, to evaluate the possibility of exporting the data and processing them separately with other statistical packages.
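The three workflow steps (normalization, log transformation, statistical test) can be sketched in Python for data exported from an image analysis package. This is only an illustration under stated assumptions: normalization to the total spot volume of each gel and Welch's unequal-variance form of the t statistic are common choices but are not necessarily what any given package implements, and all function names and numbers below are hypothetical.

```python
import math
from statistics import mean, variance


def normalise_and_log(spot_volumes):
    """Divide each spot volume by the gel total, then take log2 (assumed scheme)."""
    total = sum(spot_volumes)
    return [math.log2(v / total) for v in spot_volumes]


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical raw volumes of three spots on three control and three treated gels.
control_gels = [[120, 300, 580], [100, 310, 590], [110, 290, 600]]
treated_gels = [[240, 280, 480], [260, 270, 470], [250, 300, 450]]

# Compare the first spot between the two groups after normalization and log2.
control = [normalise_and_log(gel)[0] for gel in control_gels]
treated = [normalise_and_log(gel)[0] for gel in treated_gels]
t_stat, dof = welch_t(control, treated)
```

Converting the statistic into a p-value requires the t distribution, which the Python standard library does not provide; a statistical package would be used for that step.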
The commercial software packages that currently offer only univariate statistical tests are Image Master, Melanie, and Proteomweaver. These programs offer only the Student t-test, ANOVA, and two nonparametric tests, the Mann–Whitney and Kolmogorov–Smirnov tests. In addition to the abovementioned tests, PDQuest and DeCyder also include multivariate partial least squares analysis, while Redfin Solo provides only ANOVA and principal component analysis (PCA). Currently, the software packages with the most complete statistical analyses are Delta 2D and SameSpots. Delta 2D includes: clustering (hierarchical clustering, support trees, and K-means clustering), Student t-test, ANOVA (with corrections for multiple testing by Bonferroni, adjusted Bonferroni, and Westfall–Young, and control of the false discovery rate), two-factor analysis of variance, nonparametric tests (such as Wilcoxon, Mann–Whitney, Kruskal–Wallis, Mack–Skillings, and Fisher exact test), and PCA. Progenesis SameSpots, in addition to the univariate Student t-test, offers PCA, correlation analysis, power analysis, and q-values (false discovery rate adjusted p-values). This topic is treated in more depth in Chapters 13 and 14.
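Two of the multiple-testing corrections named above can be sketched in a few lines of Python. The Bonferroni adjustment multiplies each p-value by the number of tests. For the false discovery rate, the sketch uses the Benjamini–Hochberg step-up procedure, which is one common way of obtaining FDR-adjusted p-values; it is an assumption that this matches the q-values reported by any particular package. The function names and p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four spots.
p_values = [0.01, 0.04, 0.03, 0.20]
p_bonferroni = bonferroni(p_values)
p_fdr = benjamini_hochberg(p_values)
```

For these four values the Bonferroni adjustment leaves only the first spot below 0.05, whereas the FDR-adjusted values for the second and third spots are about 0.053, which shows how much less conservative FDR control is when many spots are tested.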
7
Conclusions
In conclusion, although a wide range of image analysis software is available, and despite the improvements in performance introduced by the fourth-generation software, image analysis remains a limiting step in a proteomics experiment. Even with some recent encouraging developments, the algorithms for spot detection, image warping, and gel matching have not improved to the point at which no human supervision is needed. The hope for the future is that further improvements will speed up the image analysis step and make the results independent of the operator and of the chosen software.
References
1. Klose J (1975) Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 26(3):231–243
2. O’Farrell PH (1975) High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250(10):4007–4021
3. Millioni R, Puricelli L, Sbrignadello S, Iori E, Murphy E, Tessari P (2012) Operator- and software-related post-experimental variability and source of error in 2-DE analysis. Amino Acids 42(5):1583–1590
4. Raman B, Cheung A, Marten MR (2002) Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie. Electrophoresis 23(14):2194–2202
5. Nishihara JC, Champion KM (2002) Quantitative evaluation of proteins in one- and two-dimensional polyacrylamide gels using a fluorescent stain. Electrophoresis 23(14):2203–2215
6. Rosengren AT, Salmi JM, Aittokallio T, Westerholm J, Lahesmaa R, Nyman TA, Nevalainen OS (2003) Comparison of PDQuest and Progenesis software packages in the analysis of two-dimensional electrophoresis gels. Proteomics 3(10):1936–1946
7. Wheelock AM, Buckpitt AR (2005) Software-induced variance in two-dimensional gel electrophoresis image analysis. Electrophoresis 26(23):4508–4520
8. Arora PS, Yamagiwa H, Srivastava A, Bolander ME, Sarkar G (2005) Comparative evaluation of two two-dimensional gel electrophoresis image analysis software applications using synovial fluids from patients with joint disease. J Orthop Sci 10(2):160–166
9. Karp NA, Feret R, Rubtsov DV, Lilley KS (2008) Comparison of DIGE and post-stained gel electrophoresis with both traditional and SameSpots analysis for quantitative proteomics. Proteomics 8(5):948–960
10. Clark BN, Gutstein HB (2008) The myth of automated, high-throughput two-dimensional gel analysis. Proteomics 8(6):1197–1203
11. Kang Y, Techanukul T, Mantalaris A, Nagy JM (2009) Comparison of three commercially available DIGE analysis software packages: minimal user intervention in gel-based proteomics. J Proteome Res 8(2):1077–1084
12. Millioni R, Miuzzo M, Sbrignadello S, Murphy E, Puricelli L, Tura A, Bertacco E, Rattazzi M, Iori E, Tessari P (2010) Delta2D and Proteomweaver: performance evaluation of two different approaches for 2-DE analysis. Electrophoresis 31(8):1311–1317
13. Mahon P, Dupree P (2001) Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis 22(10):2075–2085
14. Aittokallio T, Salmi J, Nyman TA, Nevalainen OS (2005) Geometrical distortions in two-dimensional gels: applicable correction methods. J Chromatogr B Analyt Technol Biomed Life Sci 815(1–2):25–37
15. Rogers M, Graham J, Tonge RP (2003) Using statistical image models for objective evaluation of spot detection in two-dimensional gels. Proteomics 3(6):879–886
16. Tsakanikas P, Manolakos ES (2009) Improving 2-DE gel image denoising using contourlets. Proteomics 9(15):3877–3888
17. Cannistraci CV, Montevecchi FM, Alessio M (2009) Median-modified Wiener filter provides efficient denoising, preserving spot edge and morphology in 2-DE image processing. Proteomics 9(21):4908–4919
18. Silva E, O’Gorman M, Becker S, Auer G, Eklund A, Grunewald J, Wheelock AM (2010) In the eye of the beholder: does the master see the SameSpots as the novice? J Proteome Res 9(3):1522–1532
19. Rogers M, Graham J, Tonge RP (2003) Statistical models of shape for the analysis of protein spots in two-dimensional electrophoresis gel images. Proteomics 3(6):887–896
20. Wheelock ÅM, Wheelock CE (2008) Bioinformatics in gel-based proteomics. In: Rakwal R, Agarwal GK (eds) Plant proteomics: technologies, strategies, and applications. Wiley, Hoboken, NJ
21. Luhn S, Berth M, Hecker M, Bernhardt J (2003) Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics 3(7):1117–1127
22. Valledor L, Jorrin J (2011) Back to the basics: maximizing the information obtained by quantitative two dimensional gel electrophoresis analyses by an appropriate experimental design and statistical analyses. J Proteomics 74(1):1–18
23. Zellner M, Graf A, Zehetmayer S, Winkler W, Staes A, Gevaert K, Vandekerckhove J, Marchetti-Deschmann M, Miller I, Bauer P, Allmaier G, Oehler R (2012) How many spots with missing values can be tolerated in quantitative two-dimensional gel electrophoresis when applying univariate statistics? J Proteomics 75(6):1792–1802
24. Faergestad EM, Rye M, Walczak B, Gidskehaug L, Wold JP, Grove H, Jia X, Hollung K, Indahl UG, Westad F, van den Berg F, Martens H (2007) Pixel-based analysis of multiple images for the identification of changes: a novel approach applied to unravel proteome patterns [corrected] of 2-D electrophoresis gel images. Proteomics 7(19):3450–3461
25. Dowsey AW, English J, Pennington K, Cotter D, Stuehler K, Marcus K, Meyer HE, Dunn MJ, Yang GZ (2006) Examination of 2-DE in the Human Proteome Organisation Brain Proteome Project pilot studies with the new RAIN gel matching technique. Proteomics 6(18):5030–5047
26. Marengo E, Robotti E, Bobba M, Milli A, Campostrini N, Righetti SC, Cecconi D, Righetti PG (2008) Application of partial least squares discriminant analysis and variable selection procedures: a 2D-PAGE proteomic study. Anal Bioanal Chem 390(5):1327–1342
Chapter 5
Image Pretreatment Tools I: Algorithms for Map Denoising and Background Subtraction Methods
Carlo Vittorio Cannistraci and Massimo Alessio
Abstract
One of the critical steps in two-dimensional electrophoresis (2-DE) image pre-processing is denoising, which can strongly affect both spot detection and pixel-based methods. The Median Modified Wiener Filter (MMWF), a new nonlinear adaptive spatial filter, has proved to be a good denoising approach for practical use with 2-DE. MMWF is suitable for global denoising, and for the simultaneous removal of spikes and Gaussian noise, because its best setting does not depend on the type of noise. The second critical step arises because 2-DE gel images may contain high levels of background, generated by the laboratory experimental procedures, that must be subtracted for accurate measurement of the proteomic optical density signals. Here we discuss an efficient mathematical method for background estimation that can be applied even before 2-DE image spot detection and is based on the 3D mathematical morphology (3DMM) theory.
Key words Denoising, Background subtraction, Image processing, Noise reduction filter, Spatial filtering, Two-dimensional gel electrophoresis
1
Introduction
1.1 Filters for Denoising
Noise removal by means of computerized image processing techniques is a fundamental early stage in the differential analysis of two-dimensional electrophoresis (2-DE) images [1–3]. A good denoising procedure should prevent both the overestimation of the image background and the formation of misleading spots, and should allow the extraction of faint spots and accurate estimation of spot properties, which in turn improves the differential analysis of spots [2, 4]. 2-DE images are affected by different kinds of noise, in particular by statistical “Gaussian noise” (due to imperfect image acquisition), and by “Spike noise” (due to small irregular features
Both authors have equally contributed. Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_5, © Springer Science+Business Media New York 2016
originated during the laboratory experimental procedures) [1, 2]. The aim of the denoising phase is to remove small noise features, spikes, and other imperfections, while leaving larger features (i.e., spots) unaffected [1, 3, 5]. Noise suppression methods employed in commercial software for 2-DE image analysis are based on spatial filtering [2, 6, 7], which uses local n × n kernel/window filters such as median, adaptive, Gaussian, and polynomial convolution filters [1, 8–10]. These filters use the intensity values of the pixels located in the kernel area (the neighborhood) to recalculate the value of the processed pixel at the centre of the kernel. To compute the new pixel value, the median filter (MF) uses the median of the neighborhood pixels; adaptive filters tailor the computation to the local noise variance in the kernel; Gaussian and polynomial filters are implemented as a convolution of a Gaussian or polynomial kernel with the image. However, these filters tend to distort spot edges and alter the intensity values of the spot pixels [2]. Their advantage is that they normally require the setting of only one or two parameters, such as the size of the filter window, which is generally related to the minimum size of the spots [3, 7, 10]. Recently, it has been reported that Gaussian noise is efficiently removed by means of spatial-frequency domain filters, namely wavelet denoising (WD) and contourlet denoising (CD) [2, 9]. A wavelet is a wavelike oscillation with an amplitude that starts at zero, increases, and then decreases back to zero. Put simply, WD implements a spatial-frequency decomposition of the image, based on a selected wavelet function; denoising is obtained by applying thresholding strategies to this decomposition [2, 9]. One of the main drawbacks of wavelets is that they can capture only limited directional information, i.e., vertical, horizontal, and diagonal.
CD, which is a multiresolution, adaptive, directional image decomposition method based on contour segments, was introduced to overcome this limitation [2]. However, these filters require complex setting strategies, which makes spatial-frequency denoising unsuitable for laboratory practice and for computational pipelines for automated pre-processing. In real laboratory practice the types and combination of noise present in the 2-DE images are not known a priori, and usually a large number of images are analysed automatically. It is therefore unfeasible to change the filter setting according to the mixture of noise types present in different images. Given the need for a single filter setting, the ideal filter is the one that, for a given window (the best window setting), performs most effectively in the simultaneous removal of different kinds of noise. A novel nonlinear adaptive spatial filter, named median-modified Wiener filter (MMWF), has been designed [11] for the suppression of the different kinds of noise that can coexist in 2-DE images (e.g., small noise features, spikes, Gaussian noise). In spatial
2-D Maps Denoising and Background Subtraction
filtering, which includes linear and nonlinear local filters, the relevant parameter is the window size (the n × m pixel area around a given processed pixel); to recalculate the value of the processed pixel at the window center, the filter uses the pixel intensity values of the window area [1, 8, 10] (see Note 1). To reduce the “salt and pepper” noise typical of 2-DE images, the nonlinear median filter (MF) is frequently employed, in which the value of an output pixel is determined by the median (a nonlinear operator) of the neighborhood pixels in the window area [9]. Since the median is less sensitive than the mean to extreme values, MF removes these outliers without reducing the sharpness of the image [1, 8, 9, 12]. An alternative approach is the adaptive Wiener filter (WF), a linear filter that tailors itself to the local image variance: for large variance it performs little smoothing, while for small variance it performs greater smoothing, resulting in better preservation of the signal edges [12]. The mean μ and variance σ² around each pixel of the image, evaluated over the local window, are:

$$\mu = \frac{1}{NM} \sum_{n,m \in \eta} a(n,m) \tag{1}$$

$$\sigma^2 = \frac{1}{NM} \sum_{n,m \in \eta} a^2(n,m) - \mu^2 \tag{2}$$

where N-by-M is the size of the local neighborhood area η contained in the window, and a(n, m) denotes the intensity of each pixel contained in the area η. The new pixel value b calculated by WF is thus:

$$b_w(n,m) = \mu + \frac{\sigma^2 - \nu^2}{\sigma^2} \left( a(n,m) - \mu \right) \tag{3}$$

where ν² is the noise variance (see Note 2). The nonlinear adaptive MMWF modifies WF by replacing the local window mean μ around each pixel with the local window median μ̃ around each pixel:

$$b_{mmwf}(n,m) = \tilde{\mu} + \frac{\sigma^2 - \nu^2}{\sigma^2} \left( a(n,m) - \tilde{\mu} \right) \tag{4}$$
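Equations 1–4 can be turned into a short pure-Python sketch. This is an illustration only, not the authors' MATLAB function. The function names, the edge-replication border handling, the clamping of the gain to the range 0–1, and the default estimate of the noise variance ν² as the mean of all local variances are assumptions made here so that the example is self-contained.

```python
from statistics import mean, median


def window_values(image, r, c, size):
    """Pixel values in the size-by-size window centred on (r, c), replicating edges."""
    half = size // 2
    rows, cols = len(image), len(image[0])
    return [
        image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
    ]


def adaptive_filter(image, size=3, noise_var=None, use_median=True):
    """Eq. 4 (MMWF) when use_median is True, Eq. 3 (WF) otherwise."""
    rows, cols = len(image), len(image[0])
    centre = [[0.0] * cols for _ in range(rows)]
    local_var = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = window_values(image, r, c, size)
            mu = mean(values)  # Eq. 1
            # Eq. 2, clamped at zero to absorb floating-point round-off.
            local_var[r][c] = max(mean(v * v for v in values) - mu * mu, 0.0)
            centre[r][c] = median(values) if use_median else mu
    if noise_var is None:
        # Assumption: estimate the noise variance as the mean local variance.
        noise_var = mean(v for row in local_var for v in row)
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            denom = max(local_var[r][c], noise_var)
            # Assumption: gain (sigma^2 - nu^2) / sigma^2 clamped to [0, 1].
            gain = max(local_var[r][c] - noise_var, 0.0) / denom if denom > 0 else 0.0
            row.append(centre[r][c] + gain * (image[r][c] - centre[r][c]))
        result.append(row)
    return result


# Hypothetical 5 x 5 flat region (intensity 10) with a single spike of 200.
flat_with_spike = [[10] * 5 for _ in range(5)]
flat_with_spike[2][2] = 200
mmwf = adaptive_filter(flat_with_spike, size=3, noise_var=1e9, use_median=True)
wf = adaptive_filter(flat_with_spike, size=3, noise_var=1e9, use_median=False)
```

With a very large assumed noise variance the gain vanishes, so the MMWF output falls back to the local median (10 at the spike and around it), whereas the WF output falls back to the local mean (about 31 at the spike and at its eight neighbours), which spreads the spike over its neighborhood.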
The MMWF thus merges the features of MF and WF, minimizing their respective defects. It has been reported that MF achieves optimal suppression of spikes (Fig. 1); for larger window sizes, however, MF not only removes noise but also erodes the edges of isolated spots, causing spot erosion and distortion [8, 9, 11]. On the other hand, when two spots are very close, the median operator returns in
Fig. 1 Comparison of median filter (MF), Wiener filter (WF), Gaussian filter (GF), and median-modified Wiener filter (MMWF) denoising effects on spot and spike in 2-DE gel region (original image)
Fig. 2 Comparison of median filter (MF), Wiener filter (WF), Gaussian filter (GF), and median-modified Wiener filter (MMWF) denoising effects on close-set spots in 2-DE gel region (original image) also visualized as 3D landscape
output a bigger value for the pixels at the centre of the valley between the two spots. This paradoxical behaviour has been reported by Cannistraci et al.: in this particular case the classical edge erosion of MF is replaced by a dilatation that fills the space between two or more close-set spots, masking their presence in the post-processed image. Altogether, these side effects can strongly influence the analysis that follows the pre-processing of 2-DE images (Fig. 2).
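Both side effects of the median filter can be reproduced on a one-dimensional intensity profile. The Python sketch below is illustrative only: the profiles, the edge-replication padding, and the function name `median_filter_1d` are hypothetical, and spots are represented as high values.

```python
from statistics import median


def median_filter_1d(signal, window):
    """Median filter with edge replication; window must be odd."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(signal))]


# Two close-set spots separated by a one-pixel valley (value 1).
two_close_spots = [0, 0, 9, 9, 9, 1, 9, 9, 9, 0, 0]
# One isolated spot, three pixels wide.
isolated_spot = [0, 0, 0, 0, 5, 5, 5, 0, 0, 0, 0]

filled = median_filter_1d(two_close_spots, 3)  # the valley at index 5 becomes 9
eroded = median_filter_1d(isolated_spot, 7)    # the spot disappears completely
kept = median_filter_1d(isolated_spot, 3)      # the spot is left unchanged
```

With a window of 3 the valley between the two spots is filled, so the two spots fuse into one. With a window of 7, which is wider than the isolated spot, the spot is eroded away entirely, while a window of 3 leaves it intact.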
In contrast, in the adaptive WF the variance-dependent term $(\sigma^2 - \nu^2)/\sigma^2$ acts as a sort of edge detector [8, 9, 13]; thus on 2-DE spot edges the filter output has almost the same value as the filter input, which preserves the original spot morphology even in the case of close-set spots (Fig. 2). WF yields the best result among the spatial filters for low levels of Gaussian noise [9, 11]; however, the edge detector is not very precise at high noise levels, where the removal efficiency of WF drops drastically [9, 11]. Moreover, WF causes a fuzzyfication of the spot neighborhood (Fig. 1): if a small part of the spot enters the window, the mean of the window pixels, and hence the filter output, is bigger than the filter input, which blurs the spot border. The presence of spot-like signals generated by spike-noise fuzzyfication, together with imprecise spot definition, can impair the subsequent steps of the computational pipeline, altering the proteomic information contained in the 2-DE gels. MMWF combines the features and capacities of MF and WF and reduces their respective defects. MMWF removes spike noise almost as effectively as MF but avoids spot fuzzyfication; moreover, it retains the ability of WF to preserve the morphology of close-set spots (Figs. 1 and 2). These properties are due to the drop-off effect [11], which preserves the distinguishability of edges in an image and means that MMWF does not suffer from the erosion problem for window sizes smaller than approximately the minimum spot diameter in the image. A comparative study of the removal of different kinds of noise, spike denoising (small noise features and spikes) and Gaussian denoising with three different settings of standard deviation (Gaussian 10, Gaussian 20, Gaussian 30, as in ref. 9), has been conducted on real 2-DE gels with different window settings [11].
The performance of alternative denoising techniques (median, Wiener, median-modified Wiener, Gaussian, and polynomial Savitzky–Golay [14, 15] filters, and wavelet denoising) was evaluated to establish the best denoising approach for each noise category (spike or Gaussian) and, more relevant for 2-DE image analysis, for the simultaneous suppression of the four different kinds of noise [11]. For map images with an approximate average size of 2000 × 2000 pixels, the best filter for spike denoising was MF with a 3 × 3 window, and the second best was the MMWF algorithm with window sizes of 7 × 7 and 9 × 9 [8, 9, 11]. For Gaussian denoising, the best performer was WD, while the second-best performance was achieved equally by MMWF (7 × 7 window), the Gaussian filter (7 × 7 window), and the polynomial Savitzky–Golay filter (11 × 11 window). Even though MF and WD showed the best performance in spike and Gaussian denoising respectively, they are unsuitable for the efficient removal of different types of noise occurring
contemporaneously, because their best settings depend on the type of noise. Conversely, MMWF, which showed the best performance at the same window size (7 × 7) on different kinds of noise, proved to be the best filter for global denoising (see Note 3).
1.2 Background Subtraction
All 2-DE gel images contain background, at varying levels of intensity, that depends on laboratory experimental procedures (nonhomogeneous staining) and on image capture processing. Therefore, in order to measure accurately the optical density of a spot so that it correctly reflects the expression level of the protein spot, and also before any direct-pixel map analysis, background subtraction is necessary. Several methods of determining the background level for each spot in the gel are included in the commercially available software, but most of them depend on a previous spot detection. The most frequently used fully automated methods for local background subtraction are “lowest on boundary” and “average on boundary.” In these methods, the background value of each spot is calculated from the intensities of the pixels just outside the detected spot edges. In lowest on boundary, the lowest pixel intensity detected on the boundary of each spot is used as the background for the area of that specific spot. Similarly, in average on boundary, the background intensity for the spot is the mean value of the pixels just outside the spot. These two methods are suitable for the analysis of a set of gels with low and fairly homogeneous background. Other methods, based on the generation of a mathematical model of the gel background, are included under patent in several commercial software packages. Many of these methods are applied after spot detection and calculate a background level based on a surface model of the entire gel, taking into consideration all the pixels in the map that are not part of a detected spot (used in software such as PDQuest, BioRad; Delta 2D; Progenesis PG240, NonLinear Dynamics; SameSpot, Totallab). Mathematical models can be implemented using freely available strategies [16], as described below for the 3D mathematical morphology (3DMM) theory.
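The two boundary-based methods can be sketched in Python as follows. The sketch assumes that a spot is given as a set of (row, column) pixel coordinates produced by a previous detection step and that "just outside" means the 4-connected neighbours of the spot pixels; commercial packages may define the boundary differently. All names and values are hypothetical.

```python
from statistics import mean


def boundary_pixels(spot, rows, cols):
    """Pixels just outside the spot: 4-neighbours of spot pixels not in the spot."""
    border = set()
    for r, c in spot:
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in spot:
                border.add((nr, nc))
    return border


def spot_background(image, spot, method="lowest"):
    """Background level of one spot by 'lowest' or 'average' on boundary."""
    border = boundary_pixels(spot, len(image), len(image[0]))
    values = [image[r][c] for r, c in border]
    return min(values) if method == "lowest" else mean(values)


def corrected_volume(image, spot, method="lowest"):
    """Spot volume after subtracting the per-spot background from every pixel."""
    background = spot_background(image, spot, method)
    return sum(image[r][c] - background for r, c in spot)


# Hypothetical 4 x 4 image region with a two-pixel spot.
region = [
    [10, 10, 10, 10],
    [10, 12, 14, 10],
    [16, 90, 80, 12],
    [10, 14, 12, 10],
]
spot = {(2, 1), (2, 2)}
lowest = spot_background(region, spot, "lowest")
average = spot_background(region, spot, "average")
volume = corrected_volume(region, spot, "lowest")
```

In this toy region the six boundary pixels range from 12 to 16, so the two methods give different backgrounds (12 versus about 13.3). The gap widens as the background around a spot becomes less homogeneous, which is why these methods suit gels with low and fairly uniform background.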
A very common and efficient mathematical method for background estimation, which can be applied in a computational pipeline and is suitable for use even before spot detection, is based on the 3DMM theory [1, 17–19]. This method is especially useful in computational differential proteomic approaches that do not adopt spot detection but are instead based on direct pixel analysis of the map images using unsupervised machine learning: linear dimensionality reduction techniques such as principal component analysis (PCA) [3] or nonlinear dimensionality reduction techniques such as minimum curvilinear embedding (MCE) [20] (for further explanations on this topic see Chapter 16).
Fig. 3 3D mathematical morphology for background subtraction. Example of 3D mathematical morphology operations on a real 2-DE gel area: (a) original image; (b) erosion; (c) dilatation; (d) opening (background estimation); (e) result of background removal
Generally, the 3D mathematical morphology function used for background estimation is called the morphological opening of the image. The morphological opening is defined as morphological erosion followed by morphological dilatation. Erosion and dilatation are the two fundamental operations in morphological image processing, on which all other morphological operations, such as opening, are based. For simplicity, the generalized mathematical formulation of these operations is not reported here; instead, we report the formulations already adapted to the context of background estimation in grayscale 2-DE map images. Given a denoised 2-DE image (e.g., Fig. 3a), and considering that it is an 8-bit grayscale image, erosion at a given point n, m of the image is equivalent to a local-minimum operator, corrected according to the values inside a window (more generally called a structure in mathematical morphology) of arbitrary shape centered on n, m (see Note 4). In the example reported in Fig. 3b, erosion is performed using a ball-shaped structure (actually an ellipsoid) with a radius of 15 pixels and a height of 15 pixels (see Note 5):

$$\text{eroded value}(n,m) = \min[0,\ a(n,m,r) - b(n,m,r,h)] = \min[0,\ a(n,m,15) - b(n,m,15,15)] \tag{5}$$

where a(n, m, r) is the local image area of radius r surrounding the pixel in position (n, m), and b(n, m, r, h) is a local kernel in the shape of a ball-shaped structuring element of radius r and maximum height h. The center of the kernel b(n, m, r, h) is located in position (n, m). The value of the kernel b(n, m, r, h) in position (n, m) is equal to h, while every other point of the kernel around (n, m) takes the value corresponding to the height of the ellipsoid above its position. Therefore, the points on the border of the ellipsoid kernel take the lowest values (see Note 6).
The formulation for image dilatation is similar; it differs in the operator, which is the local maximum, and in the use of the sum instead of the difference between each pixel in the original image area a(n, m, r) and the respective pixel in the ball-shaped structure b(n, m, r, h):

$$\text{dilatated value}(n,m) = \max[255,\ a(n,m,r) + b(n,m,r,h)] = \max[255,\ a(n,m,15) + b(n,m,15,15)] \tag{6}$$
The effect of image dilatation on the original image, using the ball-shaped structure described above, is shown in Fig. 3c. When a ball-shaped structure is used to transform the image, the related operations are also called rolling ball transforms in 3DMM. Finally, morphological opening, the operation used for background estimation, is obtained as morphological erosion followed by morphological dilatation. Figures 3d and e show the result of morphological opening and background subtraction, respectively, applied to the original image. Applying 3D mathematical morphology algorithms to problematic 2-DE map images, in order to obtain an efficient background removal, requires sequences of different morphological opening operations, applied with differently shaped structures, in order not only to estimate the background but also to remove artifacts such as elongated strips due to gel running artifacts and improper staining.
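Erosion, dilatation, opening, and background subtraction with a ball-shaped structure can be sketched in pure Python for small images. Several assumptions are made here and are not taken from the chapter: the 0 and 255 in Eqs. 5 and 6 are read as the saturation bounds of the 8-bit range, pixels outside the image are ignored, spots are taken to be brighter than the background, and intermediate results are kept as unclipped floating-point numbers so that only the final subtraction is clipped (a uint8 implementation would saturate at every step). All function names are hypothetical.

```python
import math


def ball(radius, height):
    """Offsets and heights of an ellipsoidal ('ball') structuring element."""
    element = {}
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            d2 = dr * dr + dc * dc
            if d2 <= radius * radius:
                # Height of the ellipsoid above this offset: h at the centre, 0 on the rim.
                element[(dr, dc)] = height * math.sqrt(1 - d2 / (radius * radius))
    return element


def _morph(image, element, pick, sign):
    """Local min (erosion) or max (dilatation) of pixel value -/+ element height."""
    rows, cols = len(image), len(image[0])
    return [
        [
            pick(
                image[r + dr][c + dc] + sign * b
                for (dr, dc), b in element.items()
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def erode(image, element):
    return _morph(image, element, min, -1)


def dilate(image, element):
    # The ball is symmetric, so no reflection of the element is needed.
    return _morph(image, element, max, +1)


def opening(image, element):
    """Erosion followed by dilatation: the background estimate."""
    return dilate(erode(image, element), element)


def subtract_background(image, element):
    """Original minus opening, clipped to the 8-bit range."""
    background = opening(image, element)
    return [
        [min(255.0, max(0.0, a - g)) for a, g in zip(row_a, row_g)]
        for row_a, row_g in zip(image, background)
    ]


# Hypothetical 9 x 9 region: uniform background of 50 with one bright pixel of 200.
image = [[50.0] * 9 for _ in range(9)]
image[4][4] = 200.0
element = ball(2, 2)
opened = opening(image, element)
cleaned = subtract_background(image, element)
```

Because the bright feature is smaller than the structure, the opening reproduces only the flat background, and the subtraction leaves the feature on a background of approximately zero. For the full-size example in the text the call would be `ball(15, 15)`, although a pure-Python loop of this kind is slow on images of 2000 × 2000 pixels.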
2 Materials
1. Hardware requirement. A single-CPU machine with at least 1 GB RAM.
2. Software suitable for computing the denoising filter. Image filtering can be implemented using MATLAB® software v. 7.0 (The MathWorks™, Natick, MA, USA) or a more recent version. Alternatively, the provided MATLAB function can be re-implemented in free software languages such as Python, Java, and C++.
3. Mathematical function. The MATLAB function for applying the Median Modified Wiener Filter to images is freely available for download from: https://sites.google.com/site/carlovittoriocannistraci/5-datasets-and-matlab-code. 3D mathematical morphology functions for image erosion, dilation, and opening are implemented in MATLAB®.
4. Features of the input images. The “ideal” spatial resolution of the input images is 254 dpi (1 pixel = 100 μm); however, images at higher or lower spatial resolution can be successfully used as well (see Note 7). The full gel image size is not relevant and can range approximately from 600 × 1000 to 2000 × 2000 pixels, as usual for commercial 2-D gel images acquired at 254 dpi. The recommended radiometric resolution ranges from 8- to 16-bit grayscale.
3 Methods
3.1 Nonlinear Adaptive Spatial Noise-Removal Filtering
1. Input the original 2-D gel image into the code function.
2. Set the size of the filter window in the code function (see Note 8).
3. Run the code to obtain the filtered image of the 2-D gel map as output.
3.2 3D Mathematical Morphology Background Subtraction
1. Load the denoise-filtered image into the MATLAB workspace and check that it is a uint8 variable (i.e., an unsigned 8-bit grayscale variable). If not, convert the image variable to uint8. Assign the variable the name I.
2. Set the shape, radius, and height (in the presented example r = 15 pixels and h = 15 pixels) of the ball-shaped structure used to perform the 3DMM transformation of the image. In the example case, call the MATLAB function as follows:
se = strel('ball',15,15);
3. To perform image erosion, call the MATLAB function as follows:
eroded_image = imerode(I,se);
where I is the image loaded in step 1.
4. To perform image dilation, call the MATLAB function as follows:
dilated_image = imdilate(I,se);
5. To perform image opening for background estimation, call the MATLAB function as follows:
opened_image = imopen(I,se);
6. To perform background subtraction, call the MATLAB function as follows:
new_image = I - imopen(I,se);
3.3 Next Steps
1. Use the filtered image for the next steps in your computational pipeline.
4 Notes
1. The filter window size is a critical early choice, because spatial filtering can degrade spots smaller than the window size.
2. If the Gaussian noise variance is not given, the average of all the local variances estimated for each window is used [12].
3. The nonlinear nature of the MMWF, due to the median term, makes it the second-best performer in each of the single denoising categories, while its adaptive characteristics, supported by the mathematical framework of the WF formula, make the MMWF versatile and invariant to the different kinds of noise.
4. The setting of the shape and size of the structure is critical because it depends on the type of background that must be removed (e.g., diffuse acrylamide support staining or elongated strips due to gel running artifacts) and on the shape and size of the signals that must be preserved (e.g., round or elongated spots).
5. In these didactic simulations we decided to set the radius to 15 pixels, because this is the radius of the smallest spot we want to preserve in the image. Any spot with a radius smaller than 15 pixels will be totally eroded by the background subtraction.
6. If the new value is lower than zero, it is adjusted to zero, because this is the lowest value in an 8-bit grayscale 2-DE image.
7. It should be taken into consideration that the computational time necessary for image filtering increases with spatial resolution and image size.
8. For the automated analysis of a large number of images of approximately 2000 × 2000 pixels and a spatial resolution of 254 dpi (1 pixel = 100 μm), according to our empirical tests [11], we suggest using the MMWF for global denoising with a window of 7 × 7 or 9 × 9 pixels.
Acknowledgements
M.A. is supported by AIRC Special Program Molecular Clinical Oncology 5 per mille n.9965. C.V.C. is supported by the independent group leader starting grant of the Technische Universität Dresden (TUD).
References
1. Dowsey AW, Dunn MJ, Yang GZ (2003) The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 3:1567–1596
2. Tsakanikas P, Manolakos I (2007) Effective denoising of 2D gel proteomics images using contourlets. Image Processing, 2007, ICIP 2007. IEEE International Conference on 2, 1–22
3. Marengo E, Robotti E, Antonucci F et al. (2005) Numerical approaches for quantitative analysis of two-dimensional maps: a review of commercial software and home-made systems. Proteomics 5:654–666
4. Kaczmarek K, Walczak B, de Jong S et al. (2005) Baseline reduction in two dimensional gel electrophoresis images. Acta Chromatogr 15:82–96
5. Morris JS, Clark BN, Gutstein HB (2008) Pinnacle: a fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data. Bioinformatics 24:529–536
6. Rogers M, Graham J, Tonge PR (2003) Using statistical image models for objective evaluation of spot detection in two-dimensional gels. Proteomics 3:879–886
7. Appel RD, Palagi PM, Walther D et al. (1997) Melanie II–a third-generation software package for analysis of two-dimensional electrophoresis images: I. Features and user interface. Electrophoresis 18:2724–2734
8. Conradsen K, Pedersen J (1992) Analysis of two-dimensional electrophoretic gels. Biometrics 48:1273–1287
9. Kaczmarek K, Walczak B, de Jong S et al. (2004) Preprocessing of two-dimensional gel electrophoresis images. Proteomics 4:2377–2389
10. Appel RD, Vargas JR, Palagi PM et al. (1997) Melanie II–a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18:2735–2748
11. Cannistraci CV, Montevecchi FM, Alessio M (2009) Median-modified Wiener filter provides efficient denoising, preserving spot edge and morphology in 2-DE image processing. Proteomics 9:4908–4919
12. Lim JS (1990) Two-dimensional signal and image processing. Prentice Hall, Upper Saddle River
13. Stein RA, Boyd JE (1988) Evaluation of the variance function for edge enhancement in multidimensional data. IEEE Int Symp Circ Sys 3:2553–2556
14. Chinrungrueng C, Suvichakorn A (2001) Fast edge-preserving noise reduction for ultrasound images. IEEE Trans Nucl Sci 48:849–854
15. Savitzky A, Golay MJE (1964) Smoothing and differentiation of data by simplified least squares procedure. Anal Chem 36:1627–1639
16. Najman L, Talbot H (2010) Mathematical morphology: from theory to applications. ISTE-Wiley, New York
17. Berth M, Moser FM, Kolbe M et al. (2007) The state of the art in the analysis of two-dimensional gel electrophoresis images. Appl Microbiol Biotechnol 76:1223–1243
18. Skolnick MM (1986) Application of morphological transformations to the analysis of two-dimensional electrophoretic gels of biological materials. Comput Vision Graphics Image Process 35:306–332
19. Skolnick MM (1986) Special section on mathematical morphology: introduction. Comput Vision Graphics Image Process 35:281–282
20. Cannistraci CV, Ravasi T, Montevecchi F et al. (2010) Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. Bioinformatics 26:i1–i9
Chapter 6
Image Pretreatment Tools II: Normalization Techniques for 2-DE and 2-D DIGE
Elisa Robotti, Emilio Marengo, and Fabio Quasso
Abstract
Gel electrophoresis is usually applied to identify different protein expression profiles in biological samples (e.g., control vs. pathological, control vs. treated). Information about the effect to be investigated (a pathology, a drug, a ripening effect, etc.) is however generally confounded with experimental variability, which is quite large in 2-DE and may arise from small variations in sample preparation, reagents, sample loading, electrophoretic conditions, staining, and image acquisition. Obtaining valid quantitative estimates of protein abundances in each map before the differential analysis is therefore fundamental to provide robust candidate biomarkers. Normalization procedures are applied to reduce experimental noise and make the images comparable, improving the accuracy of differential analysis. They may deeply influence the final results, and in this respect they have to be applied with care. Here, the most widespread normalization procedures are described for both 2-DE and 2-D difference gel electrophoresis (2-D DIGE) maps.
Key words Normalization, Gel-electrophoresis, 2-D DIGE
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_6, © Springer Science+Business Media New York 2016
1 Introduction
All workflows available for the analysis of 2-D maps are usually based on a series of standard steps, namely:
– Image preprocessing, devoted to improving image quality and involving background and noise subtraction and normalization tools [1].
– Spot detection, identifying the spots present on each image and separating them from background and other features [2–10]. Each spot is then quantified by the sum of the pixel values inside the spot boundary.
– Spot filtering, removing protein spots that do not satisfy specific criteria.
– Image alignment, aimed at aligning corresponding regions in multiple gel images and involving warping tools [2, 11–25].
– Spot matching, devoted to matching corresponding protein spots across multiple gels [3, 17, 18, 26–28].
– Differential analysis. This step is usually accomplished on spot volume datasets, where each sample is characterized by the volumes of the spots detected on its surface [29–36], but it can also be carried out by directly comparing pixel intensities across the different gels [37–41]. Statistically significant changes in protein expression can be detected either by classical statistical methods (Student’s t-test, Mann-Whitney test, etc.) [42] or by multivariate procedures (pattern recognition and classification tools [43–45]).
The available approaches differ in the order in which the steps are combined in the overall procedure. Figure 1 represents the three procedures present in the literature. The first two pipelines are based on a final analysis of spot volume sets (e.g., 2-D maps of control vs. pathological). The first pipeline involves the identification of the spots in each image, followed by the matching of corresponding spots across multiple gels on the basis of inter-spot distances and patterns. This procedure can be accomplished both with and without image preprocessing and alignment [3, 4, 10]: while preprocessing can improve spot identification, alignment is usually avoided since it is bypassed by spot matching. The second pipeline differs from the first one since it requires a step of pixel-based alignment before spot detection, which is accomplished on a master gel (the average image of all aligned gels) [14, 22, 46, 47]. The final spot quantification is then performed either on the original or on the warped images. The third pipeline is completely different, since it compares sets of 2-D maps directly from the pixel intensities. In this approach, therefore, spots do not need to be detected and quantified [48–51].
Fig. 1 Different 2-DE analysis strategies. (a) Image preprocessing, spot detection, spot matching, differential analysis (spots). (b) Image preprocessing, image alignment, spot detection, differential analysis (spots). (c) Image preprocessing, image alignment, differential analysis (pixels)
Normalization is usually a preprocessing step and is therefore applied to the raw images to make them comparable. However, it can also be applied to the final spot volumes. Gel electrophoresis is usually applied to identify different protein expression profiles in biological samples, e.g., in control vs. pathological or control vs. treated samples. Obtaining valid quantitative estimates of protein abundances on each map before differential analysis is therefore fundamental to provide robust and reliable candidate biomarkers. Unfortunately, the identification of statistically relevant differences in protein profiles is often confounded by the experimental variability, which can be quite large in 2-D gel electrophoresis and may arise from small variations in the overall experimental procedure: sample preparation, reagents, sample loading, electrophoretic conditions, staining, and image acquisition. Normalization procedures are applied to reduce the experimental noise and make the images comparable, improving the accuracy of differential analysis. Preprocessing deeply influences the final results, as shown by Wheelock et al. [52], and must be applied with care. Here, the most widespread normalization procedures are described for both 2-DE maps and 2-D DIGE maps.
2 Normalization Procedures for 2-D Gel-Electrophoretic Maps
As stated before, image normalization in 2-D gel electrophoresis is fundamental, due to gel-to-gel variability [53–55]. Different gels usually present important differences not related to the effect investigated (a pathology, a drug effect, ripening, etc.) but rather to the experimental procedures: a different sample loading, protein loss during the experimental procedure (first dimension, equilibration before the second dimension, molecular weight separation, etc.), small variations in the staining procedure (time and temperature of exposure) [53–55]. Image normalization can be used to solve such problems following two main approaches:
– It can be applied directly to the gel images by normalizing pixel intensities between multiple gels, with the purpose of making pixel intensities comparable across all replicate gels.
– It can be applied after spot detection, directly to the spot volume data. This second approach is certainly the most widespread.
2.1 Normalization of Single Gels
Unlike other pretreatment steps, such as denoising and background subtraction, normalization is accomplished across multiple gels rather than on single gels, to make them more comparable to each other. However, some procedures are also available for the correction of pixel intensities in individual gels [56, 57]. One of these approaches is histogram equalization, which can be used to enhance the contrast between background and protein spots prior to spot detection. Since these procedures are applied to each image separately, they introduce a bias in the between-gel variation and should thus be used with caution.
2.1.1 Histogram Equalization
Histogram equalization [58] adjusts image intensities to enhance contrast (i.e., areas of lower local contrast gain contrast) by spreading out the most frequent intensity values. Let r be the gray levels of the image to be enhanced, normalized to the interval [0, 1] (r = 0 representing black and r = 1 representing white). The transformation has the form s = T(r), with 0 ≤ r ≤ 1, and produces a level s for every pixel value r in the original image. T(r) must satisfy two conditions:
– It is single valued and monotonically increasing in the range [0, 1] (see Note 1).
– If r is in the range [0, 1], T(r) is also in the range [0, 1] (see Note 2).
The inverse transformation is therefore given by r = T^{-1}(s), with s in the range [0, 1]. Gray levels in an image can be considered as random variables in the interval [0, 1]. Let p_r(r) and p_s(s) be the probability density functions of the random variables r and s, respectively. If p_r(r) and T(r) are known and T(r) satisfies the first condition, the probability density function p_s(s) of the transformed variable s can be obtained by:

$$p_s(s) = p_r(r)\,\frac{dr}{ds} \tag{1}$$

A widespread transformation function in image processing is:

$$s = T(r) = \int_0^r p_r(w)\,dw \tag{2}$$

where $\int_0^r p_r(w)\,dw$ is the cumulative distribution function (CDF) of the random variable r (see Note 3). Since:

$$\frac{ds}{dr} = \frac{dT(r)}{dr} = \frac{d}{dr}\left[\int_0^r p_r(w)\,dw\right] = p_r(r) \tag{3}$$

it follows that:

$$p_s(s) = p_r(r)\,\frac{dr}{ds} = p_r(r)\,\frac{1}{p_r(r)} = 1 \tag{4}$$

Since p_s(s) is a probability density function, it must be zero outside the interval [0, 1] and is a uniform probability density function within the range [0, 1] (see Note 4). For discrete values, the probability of occurrence of the gray level r_k in an image is given by:

$$p_r(r_k) = \frac{n_k}{n} \tag{5}$$

where k = 0, 1, 2, ..., L − 1 (L being the total number of possible gray levels in the image), n is the total number of pixels in the image, and n_k is the number of pixels that have gray level r_k. The discrete version of the transformation function is therefore given by:

$$s_k = T(r_k) = \sum_{j=0}^{k} p_r(r_j) = \sum_{j=0}^{k} \frac{n_j}{n} \tag{6}$$

where k = 0, 1, 2, ..., L − 1. An output image is obtained by mapping each pixel with level r_k in the input image into a corresponding pixel with level s_k in the output image.
2.1.2 Contrast-Limited Adaptive Histogram Equalization
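As a baseline for the adaptive variants described in this section, the global transformation of Eq. 6 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names `transfer_function` and `equalize` are hypothetical, the image is assumed to be a list of rows of integer gray levels, and the mapping of s_k back to integer gray levels by rounding is an illustrative choice.

```python
def transfer_function(image, levels=256):
    """Return s_k = sum_{j<=k} n_j / n for k = 0..levels-1 (Eq. 6).
    `image` is a list of rows of integer gray levels in [0, levels - 1]."""
    n = sum(len(row) for row in image)      # total number of pixels
    counts = [0] * levels                   # n_k for each gray level
    for row in image:
        for r in row:
            counts[r] += 1
    s, running = [], 0
    for n_k in counts:
        running += n_k
        s.append(running / n)               # cumulative distribution
    return s


def equalize(image, levels=256):
    """Map each gray level r_k to s_k, rescaled to integer gray levels
    (rounding is an assumption made for this sketch)."""
    s = transfer_function(image, levels)
    lut = [round(s_k * (levels - 1)) for s_k in s]
    return [[lut[r] for r in row] for row in image]
```

The values returned by `transfer_function` are non-decreasing and lie in [0, 1], which corresponds to the two conditions required of T(r).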
Generalizations of histogram equalization use multiple histograms to emphasize local contrast rather than overall contrast; they include adaptive histogram equalization and contrast-limited adaptive histogram equalization (CLAHE) [59]. CLAHE [60] uses a grayscale transformation function to enhance the local image contrast, rather than the global image contrast obtained when the original histogram equalization is applied uniformly to the entire image. CLAHE therefore applies histogram equalization separately to small non-overlapping image areas (called tiles). The histogram of each transformed tile approximates the uniform distribution, amplifying faint regions. The main advantage of this method with respect to the classical histogram equalization procedure is that it prevents the amplification of noise and streaks. Moreover, CLAHE imposes a constraint on the resulting contrast, avoiding the possible over-saturation of the resulting image. This constraint is adjusted by the so-called clip limit h, which determines the maximum number of pixels allowed to occupy a bin in the resulting histogram. In cases of over-saturation, where certain histogram bins are occupied by more than $h\,(2^{\text{gray-level depth}} - 1)$ pixels, the excess pixels are redistributed over the rest of the histogram (see Note 5) [61].
2.2 Normalization of Multiple Gels
The normalization across multiple gels can be accomplished by several different approaches [50, 62, 63] that are presented hereafter separately.
2.2.1 Normalization by the Total, Mean, or Median Pixel Value
The most widespread and simplest approach is the normalization of each image by its total, mean, or median pixel intensity. It simply consists of dividing each pixel intensity value of an image by the total, mean, or median pixel value of that image, independently for each image. This procedure is suitable when the between-gel variability is smaller than the within-gel variability: in such cases, setting the total, mean, or median intensity equal for all samples is a suitable approximation [50].
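The division described above can be sketched as follows. This is an illustrative sketch only; the function name `normalize_image` and its `by` argument are hypothetical, and each image is assumed to be a list of rows of pixel intensities.

```python
import statistics


def normalize_image(image, by="total"):
    """Divide every pixel by the total, mean, or median pixel value
    of the same image."""
    pixels = [p for row in image for p in row]
    if by == "total":
        factor = sum(pixels)
    elif by == "mean":
        factor = statistics.mean(pixels)
    elif by == "median":
        factor = statistics.median(pixels)
    else:
        raise ValueError("by must be 'total', 'mean', or 'median'")
    if factor == 0:
        raise ValueError("cannot normalize an image whose factor is zero")
    return [[p / factor for p in row] for row in image]
```

Two images that differ only by a global multiplicative factor (for instance, a different sample loading) become identical after this step, which is the situation the approach is designed for.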
2.2.2 Normalization by the Use of Reference Spots
Another approach consists in the normalization using reference spots, characterized by an equal concentration in all samples [62]. This approach is certainly a valid alternative but suffers from an important drawback: reference proteins are usually characterized by high intensity and sometimes produce saturated spots. An alternative to this approach is represented by the use of standard linear models with nonzero intercept [63, 64]. In such procedures, the standard two-step pretreatment of 2-D map image analysis, consisting of background elimination and normalization, is replaced by a single step where the regression method is applied to the original images. This approach is quite effective since it can be performed automatically and does not require the optimization of input parameters for the background elimination (see Note 6). In the paper by Daszykowski et al. [63], the reference image is the average image calculated over all the gels available. Pixel intensities of the single images are compared with the intensities of the pixels of the average image. In the ideal case where no background is present and the same range of pixel intensities is available for both images (each single image and the average image), the regression line of I_k versus I_target should present unit slope and null intercept. The intensities of image k can therefore be transformed to I_{k,new} by:

$$I_{k,\text{new}} = \frac{I_k - a}{b} \tag{7}$$

where a and b are, respectively, the intercept and slope of the regression line between image k and the target image.
2.2.3 Normalization Based on Principal Component Analysis
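Before turning to the PCA-based approaches, the regression-based transformation of Eq. 7 can be illustrated with a short sketch. It assumes ordinary least squares as the regression method and flat lists of pixel intensities as images; the function names `fit_line` and `normalize_to_average` are hypothetical, and the code is not the procedure of ref. 63 itself.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept a and slope b of y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def normalize_to_average(images):
    """images: list of gels, each a flat list of pixel intensities of
    equal length. The target is the pixel-wise average image."""
    n = len(images)
    target = [sum(px) / n for px in zip(*images)]
    normalized = []
    for img in images:
        a, b = fit_line(target, img)              # I_k = a + b * I_target
        normalized.append([(p - a) / b for p in img])   # Eq. 7
    return normalized
```

In this sketch, gels that are exact linear functions of the average image are mapped onto the average image itself; real gels would only approach it.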
Approaches based on principal component analysis (PCA) have also been applied [55]. In these approaches it is assumed that variations caused by experimental differences can be modelled by the first principal component (PC). This component is then subtracted from the rest of the data, which will then represent the normalized data. In the so-called linear scaling method [55], the spot volume can be approximated by its relationship with a variable accounting for the global intensity, e.g., the total number of spots detected in a gel (NS) or the mean volume of the spots (MV). The simplest way to use this normalization procedure is to calculate, for each gel, the sum of the volumes of a set of spots and convert it into a constant (see Note 7). An alternative to this approach consists in considering the experimental variation as the most important variation between the gels [55]: in this case, when PCA is applied to the spot volume dataset, experimental variation is mainly accounted for by the first PC [65]. For an effect (a treatment, a pathology, a ripening condition, etc.), if the spot volume has a linear relationship with a variable X representing the within-gel experimental variation (e.g., MV or NS), the volume can be expressed as:

$$Y_{ij} = a_i X_j + b_i \tag{8}$$

where Y_ij is the volume of spot i in gel j, a_i is the slope, and b_i is the intercept of the linear model for spot i (i runs from 1 to I, the total number of spots on each gel; j runs from 1 to J, the total number of investigated gels). A scaling factor K_j can be defined for each gel as:

$$K_j = \frac{Y_{**}}{Y_{*j}} = \frac{a_* X_* + b_*}{a_* X_j + b_*} \tag{9}$$

where:

$$Y_{**} = \frac{\sum_i \sum_j Y_{ij}}{IJ};\quad Y_{*j} = \frac{\sum_i Y_{ij}}{I};\quad a_* = \frac{\sum_i a_i}{I};\quad b_* = \frac{\sum_i b_i}{I};\quad X_* = \frac{\sum_j X_j}{J} \tag{10}$$

The corrected volume of spot i in gel j is therefore given by:

$$Y_{ij}^{\text{corr}} = (a_i X_j + b_i)\,K_j \tag{11}$$

If, for each i, b_i = 0, then $Y_{ij}^{\text{corr}} = a_i X_*$ and the volume of spot i in gel j depends only on the parameters characteristic of the spot. If b_i ≠ 0, with b_* = 0, then $Y_{ij}^{\text{corr}} = a_i X_* + b_i X_*/X_j$: the more different X_* is from X_j, the less efficient the correction is for the volume of spot i in gel j.
If PCA is exploited instead, with the subtraction of the information accounted for by PC1, the following result is obtained:

$$Y_{ij}^{\text{corr}} = (Y'_{ij} - t_j l_i)\,s + Y_{i*} \tag{12}$$

where Y'_ij is the actual value of the standardized volume, t_j is the score of gel j on PC1, s is the standard deviation of Y_ij, and l_i is the loading of the i-th spot on PC1. Since Y'_ij = t_j l_i:

$$Y_{ij}^{\text{corr}} = a_i X_* + b_i \tag{13}$$

The method based on PCA is better than the linear one, since the corrected value does not depend on the gel. In this approach, if the variation associated with the effect investigated (pathology, drug effect, ripening effect, etc.) is orthogonal to the experimental variation, the efficiency of the correction depends on the relative value of the two variations: spots with experimental variations greater than the investigated effect will be better corrected. If the variation due to the investigated effect is not orthogonal to the experimental variation, it will contribute to the definition of PC1 and this correction is not applicable (see Note 8).
2.2.4 Quantile Normalization
Linear models [63] are generally sufficient, but in some cases the relationship between pixel intensities and normalization factors is nonlinear [66] (see Note 9). In these cases quantile normalization can be used [54, 67, 68], which relies on the assumption that the underlying distribution of the intensities is common to all gels. A set of data vectors (gels) can be normalized by the following algorithm [68]:
1. Given n vectors (gels) of length p, arrange a matrix X of dimension p × n where each vector is a column.
2. Sort each column of X to give X_sort.
3. Take the means along the rows of X_sort and assign them to each element in the row to get X'_sort.
4. Get X_normalized by rearranging each column of X'_sort to have the same order as the original X.
Chang et al. [54] compared quantile normalization with the normalization based on total and median intensities, and proved it to be more effective.
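The four steps of the algorithm can be sketched in Python as follows. This is an illustrative sketch: the function name `quantile_normalize` is hypothetical, each gel is represented as a list (one column of X), and ties within a gel are broken by position, which is an assumption not specified in the text.

```python
def quantile_normalize(gels):
    """gels: list of n intensity vectors (one per gel), all of length p.
    Returns the normalized vectors in the original element order."""
    n, p = len(gels), len(gels[0])
    # Step 2: sort each column (gel).
    sorted_gels = [sorted(g) for g in gels]
    # Step 3: mean across gels at each rank (row of X_sort).
    rank_means = [sum(g[i] for g in sorted_gels) / n for i in range(p)]
    # Step 4: put the rank means back in the original order of each gel.
    normalized = []
    for g in gels:
        order = sorted(range(p), key=lambda i: g[i])  # ties: by position
        out = [0.0] * p
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization every gel contains exactly the same set of values, so all gels share the same empirical distribution while each element keeps its rank within its gel.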
2.2.5 Log Transformation and Z-Score Normalization
The distribution of spot intensities from 2-DE is usually quite skewed; to make the distribution more symmetric, increasing the relative intensity of faint versus abundant spots, a log transformation is usually applied [54]. Chang et al. [54] performed a log transformation of the intensities prior to quantile normalization. The authors studied the effect of normalization by M-A plots [69]:

$$M = \log\frac{\text{gel1}}{\text{gel2}} \tag{14}$$

$$A = \frac{\log(\text{gel1}) + \log(\text{gel2})}{2} \tag{15}$$

where M represents the log ratio of each spot in the two gels, while A is the average logarithmic intensity of each spot in the two gels. In the ideal case, the M-A plot should scatter around the line M = 0 (see Note 10). Øye et al. [70] compared different normalization procedures: mean, median, and Z-score normalization. In this last case, the normalization procedure replaces each pixel intensity value with the corresponding Z-score, calculated from the mean and standard deviation of all pixel intensities in each image:

$$Z_{jk} = \frac{I_{jk} - \bar{I}_k}{s_k} \tag{16}$$

where Z_jk is the Z-score value for pixel j on image k, I_jk is the original intensity of pixel j on image k, $\bar{I}_k$ is the mean pixel intensity of gel k, and s_k is the standard deviation of pixel intensities on image k.
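Equations 14–16 translate directly into code. The sketch below is illustrative and rests on two assumptions that the text leaves open: base-2 logarithms are used for M and A, and the population standard deviation is used for the Z-score. The function names `ma_values` and `z_score` are hypothetical.

```python
import math
import statistics


def ma_values(gel1, gel2):
    """Return the lists (M, A) for matched spot volumes of two gels
    (Eqs. 14 and 15); log base 2 is an assumption."""
    m = [math.log2(x / y) for x, y in zip(gel1, gel2)]
    a = [(math.log2(x) + math.log2(y)) / 2 for x, y in zip(gel1, gel2)]
    return m, a


def z_score(pixels):
    """Replace each pixel intensity of one image by its Z-score (Eq. 16).
    The population standard deviation is an assumption."""
    mean = statistics.mean(pixels)
    sd = statistics.pstdev(pixels)
    return [(p - mean) / sd for p in pixels]
```

Plotting M against A for a pair of gels before and after normalization shows how closely the cloud of points scatters around M = 0.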
3 Normalization of DIGE Gels
Difference gel electrophoresis (DIGE) [71] allows the comparison of different samples on a single gel. The samples are labelled with different cyanine dyes and then run on the same gel: in the experimental setting with two samples, Cy3 and Cy5 are used for the two samples; in the experimental setting with three samples, Cy2 is used in addition to Cy3 and Cy5 for a control or reference (pool channel). In this way, the protein expression profiles of the two samples can be directly compared without the need for image alignment or spot matching. However, the intensities recorded with this approach suffer from important dye effects [72, 73], and normalization is a fundamental step in the analysis of 2-D DIGE maps. Normalization parameters can depend on pixel intensities and on the spatial localization in the gel [11, 74]. In these cases, a good normalization can be obtained by applying the normalization to subregions of the gel, even if this approach may require a great computational effort. It is important to stress that different normalization procedures can lead to differences in the identification of candidate biomarkers [64] and must be chosen with extreme care. Several recent papers compare different normalization tools [64]: in general, it is not possible to identify a method performing better than the others, so the choice of the normalization method depends on the experimental context. Often, M-A plots are used to evaluate the performance of the different normalization methods [74]. M values are calculated as log2(Cy5/Cy3) when a pool channel is not used, and as log2[(Cy5/Cy2)/(Cy3/Cy2)] if a pool channel is adopted. The m-values can then be defined as log2(Cy5/Cy2) or log2(Cy3/Cy2). The average log2 intensity (A) is calculated as (log2 Cy5 + log2 Cy3)/2. To compare different methods, box plots of M and m-values or scatterplots of M versus A values (M-A plots) can be used [74].
3.1 Normalization Methods Based on Specific Software
The DeCyder software [64] provides two methods for data normalization, “DeCyder no pool” and “DeCyder pool”:
1. The “DeCyder no pool” method is used when no pool channel is included in the experiment. The method consists of channel-specific shifts in the log intensities (log volumes), so that the distribution of the log intensities is centered on zero for each dye channel.
2. The “DeCyder pool” method performs equivalent shifts of each of the two series of log ratios, log Cy5/Cy2 and log Cy3/Cy2, for each gel.
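The idea of a channel-specific shift that centers the log values on zero can be illustrated with the sketch below. It is not the DeCyder algorithm, whose internal details are not given here: the use of the median as the center of the distribution, the base-2 logarithm, and the function names `center_log_volumes` and `center_log_ratios` are all assumptions made for illustration.

```python
import math
import statistics


def center_log_volumes(volumes):
    """'No pool' idea: shift the log volumes of one dye channel so that
    their distribution is centred on zero (centre = median, assumed)."""
    logs = [math.log2(v) for v in volumes]
    shift = statistics.median(logs)
    return [x - shift for x in logs]


def center_log_ratios(sample, pool):
    """'Pool' idea: apply the same kind of shift to the log ratios of a
    sample channel (Cy3 or Cy5) against the pool channel (Cy2)."""
    ratios = [s / p for s, p in zip(sample, pool)]
    return center_log_volumes(ratios)
```

Each dye channel (or each series of log ratios) of each gel would be passed through the corresponding function separately.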
3.2 Normalization by the Total, Mean, or Median Values
In total volume normalization (TVN) [74], the sum of all spot volumes is calculated for each gel image. The median of all the sums is then computed and each spot volume is divided by the sum of the spot volumes on the same gel image, and then multiplied by the median sum, to rescale all values to the original orders of magnitude. Median normalization [74] can be performed by rescaling internal reference volumes, so that the median standardized log abundance (SLA) ratio for each biological replicate gel image is zero.
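The total volume normalization described above can be sketched as follows; the function name `total_volume_normalize` is hypothetical and each gel image is assumed to be given as a list of spot volumes.

```python
import statistics


def total_volume_normalize(gels):
    """gels: list of gel images, each a list of spot volumes.
    Each volume is divided by the total volume of its own gel and
    multiplied by the median of the totals over all gels."""
    totals = [sum(g) for g in gels]
    median_total = statistics.median(totals)
    return [[v / t * median_total for v in g] for g, t in zip(gels, totals)]
```

After this step every gel has the same total spot volume, equal to the median of the original totals, so the values stay on the original order of magnitude.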
3.3 Locally Weighted Regression (Loess) Normalization
Loess (or Lowess) [75] normalization was developed for microarray studies [69, 76–78] but has recently also been applied to DIGE [74, 79]. It is a nonparametric method that empirically fits a normalization curve to the experimental data, combining multiple regression with a k-nearest-neighbor procedure. M-A scatterplots of comparable spot intensities are used as the starting point for the curve-fitting procedure. At each point in the M-A plot, Loess fits a low-degree polynomial to a subset of the data with variable values near the point whose response is being estimated. Weighted least squares is used for fitting, giving more weight to points close to the point whose response is being estimated (see Note 11). To limit the influence of potential outliers, each sample is assigned weights iteratively: each new estimation of the Loess curve reduces the weight of the sample according to its residual with respect to the previous Loess estimation. After the estimation is completed, the Loess curve is used to normalize the pixel intensities across all gels. The subsets of data used are determined by a nearest-neighbor algorithm, where the input parameter called “smoothing” or “window” determines the amount of data used to fit each local polynomial. The smoothing parameter α lies in the range [(λ + 1)/n, 1], where λ is the degree of the local polynomial. The subset of data used in each weighted least squares fit comprises the nα points closest to the point where the response is being estimated (see Note 12). The local polynomials are almost always of first or second degree (see Note 13). The main advantage of this method is that it does not require the specification of a function; the operator must only provide the window and the degree of the local polynomial. On the other hand, it is quite computationally intensive.
3.4 Variants of the Loess Normalization
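As a reference for the variants described in this section, the basic local regression step can be sketched in pure Python. The sketch makes several simplifying assumptions: local polynomials of first degree, tricube weights, and no iterative robustness reweighting. The function names `loess` and `loess_normalize` and the parameter name `span` (the smoothing parameter α) are hypothetical; a validated statistical library should be preferred for real data.

```python
def loess(x, y, span=0.5):
    """Local linear regression with tricube weights, evaluated at every x.
    `span` is the fraction of points used in each local fit."""
    n = len(x)
    k = max(2, int(round(span * n)))          # number of nearest neighbours
    fitted = []
    for x0 in x:
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        d_max = max(abs(x[i] - x0) for i in nearest) or 1.0
        # Tricube weights: close points count more, the farthest gets 0.
        w = [(1 - (abs(x[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my)
                  for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # weighted least-squares line
    return fitted


def loess_normalize(a, m, span=0.5):
    """Subtract the Loess curve fitted to the M-A plot from the M values."""
    curve = loess(a, m, span)
    return [mi - ci for mi, ci in zip(m, curve)]
```

With this sketch, an intensity-dependent trend in the M-A plot is removed so that the normalized M values scatter around zero; the quadratic search for neighbors also shows why the method is computationally intensive on large datasets.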
Some variants of the classical Loess normalization are present in literature. In cyclic Loess [64], data transformed in the M-A space are normalized by fitting a Loess regression curve and then shifting each point by the M value of the Loess line at the same value of A. The final effect is that a Loess curve that runs through the modified data would show M ¼ 0 at all A points. The process is iterated until there are no differences left to normalize between the couple of gels considered (see Note 14). Yang et al. [69] introduced the robust scatterplot smoother Loess [80] in two channel microarray data. In two-channel spatial Loess normalization a Loess smoother function is estimated from the M differences between the log2 dye intensities, log2Cy5 log2Cy3. The estimated Loess function is then subtracted from the M differences and the log2 intensities, log2Cy5 and log2Cy3, can be recovered. In [74], the authors applied two normalization procedures based on Loess normalization. In “loess þ scale’ normalization, they used the function normalizeWithinArrays with default settings and normalizeBetweenArrays using the “scale” found in the R package [81] Limma [82, 83]. The second method is called “2-D loess þ scale” which is a spatial normalization method. The Loess smoother function is estimated from the M differences between the log2 dye intensities, log2Cy5 log2Cy3, as a two-dimensional function of the spot coordinates on the protein gel (see Note 15). The estimated Loess function is then subtracted from the M differences. The method eliminates the phenomenon of one dye showing generally higher values than the other across regions of the gel.
Elisa Robotti et al.
In Fodor et al. [79], the method “DeCyder pool” is applied first; the gel-specific “standardized log abundances,” log Cy5/Cy2 and log Cy3/Cy2, are then adjusted for within-gel intensity dependence with a Loess smoother [80], and the resulting gel-specific quantities (log Cy5/Cy2)′ and (log Cy3/Cy2)′ are scaled to equal scales [69].
3.5 Quantile-Based Methods
Quantile normalization can also be used for 2-D DIGE [68]. The classical procedure has been modified by Kultima et al. [74] to provide two different normalization methods. The first, “SC-quantile,” was developed for single-channel Affymetrix data [68] and ensures that the intensities for the three dyes have the same empirical distribution across the gels. The second, “SC-2-D + quantile,” first applies spatial normalization and then single-channel quantile normalization. Spatial Loess normalization was applied with three dyes such that all three sets of log2 intensities from a spot (log2Cy5, log2Cy3, and log2Cy2) were shifted to the average intensity (log2Cy5 + log2Cy3 + log2Cy2)/3 (see Note 16). After Loess normalization, the single-channel data were also quantile normalized.
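As an illustration of the single-channel quantile step, the following sketch forces every column of a spots-by-channels matrix onto the same empirical distribution. It is a generic quantile normalization, not the code used in [74]; the function name `quantile_normalize` is hypothetical, ties are broken by sort order, and missing values are assumed to have been handled beforehand.

```python
import numpy as np

def quantile_normalize(X):
    """Rows are spots, columns are gels or dye channels (log2 intensities)."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)                    # rank of each spot within its column
    sorted_vals = np.take_along_axis(X, order, axis=0)
    target = sorted_vals.mean(axis=1)                # mean quantile profile across columns
    out = np.empty_like(X)
    # Put the k-th smallest target value where each column had its k-th smallest value
    np.put_along_axis(out, order, target[:, None], axis=0)
    return out
```

After this step every column contains exactly the same set of values, while the rank order of the spots within each column is preserved.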
3.6 Other Methods
In variance stabilizing normalization (VSN), both scaling and offset factors are used to normalize spot volumes [73, 84]. VSN [85] is applied to each red (Cy5) and green (Cy3) channel separately, followed by a median shift of the gel-specific log ratios log Cy5/Cy3 and finally by standardization of the log ratio distributions of all the gels to an equal scale by a robust Z-score (see Note 17). The method “cyclic linear” [64] exploits the same approach as “cyclic loess” normalization, except that a linear fit through the scatterplot is calculated instead of a Loess smoother. “Contrast” normalization [64], like “cyclic loess,” normalizes data using curves fit to scatterplots of data from two gel images [86].
4 Notes
1. If these requirements are satisfied, the inverse transformation exists and the increasing order from black to white in the output image is preserved.
2. The output gray levels are in the same range as the input levels.
3. Since probability density functions are always non-negative and the integral of a probability density function for variables in the range [0, 1] also lies in the range [0, 1], the two conditions are satisfied.
Normalization for 2-DE and 2-D-DIGE
4. Note that T(r) depends on p_r(r), but the resulting p_s(s) is always uniform, independent of the form of p_r(r).
5. In CLAHE, the neighboring transformed tiles are merged by bilinear interpolation to reduce artificially induced boundaries, and the pixel intensity values are updated in accordance with the adapted histograms.
6. The intercept of the regression model accounts for this effect.
7. Usually calculated as the mean sum of the volume on all gels.
8. An advantage of this approach is the possibility of discarding other experimental effects unrelated to the global intensity of the gels (e.g., effect of sex in Taylor and Giometti [65]), by subtracting corresponding axes.
9. An example is given by the cases where intensity saturation is present: this mainly happens with too high a level of staining and/or as a result of the scanning procedure.
10. The disadvantage of M-A plots is that they can only be plotted for two gels at a time, which hampers the evaluation of the effect of gel normalization across multiple gels.
11. The standard weight function used for Loess is the tri-cube weight function.
12. Large values of α produce the smoothest functions, less dependent on fluctuations in the data. With small α values, the regression function will be closer to the original data. Too small an α value will cause the regression function to capture the random error in the data. Useful values of the smoothing parameter typically range between 0.25 and 0.5.
13. Using a zero-degree polynomial turns Loess into a weighted moving average. High-degree polynomials would tend to overfit the data in each subset and are numerically unstable, making accurate computations difficult to achieve.
14. This typically takes 1–2 cycles through all pairs of gel images.
15. For the spatial location normalization, the function ma2D in the marray R package [87] was used. The coordinates of each spot were extracted by means of the DeCyder software. The smoothing factor was calculated as the ratio 100/[number of spots found also in the master image] for each gel. The data were finally scaled between the gels.
16. In practice this was done by calling the ma2D function in the marray R package [87] three times. Instead of the two sets of log2 intensities, log2R and log2G, the authors supplied one set of log2 intensities (log2Cy5 or log2Cy3 or log2Cy2) as well as the set of averages (log2Cy5 + log2Cy3 + log2Cy2)/3 to the function ma2D, so that the log2 intensities were subtracted
from the averages, and the Loess function was fit to and subtracted from the resulting differences. The normalized log2 intensities were then derived by adding the averages (log2Cy5 + log2Cy3 + log2Cy2)/3 back to the output of ma2D.
17. This method does not include the pool channel.
References
1. Rye M, Fargestad EM (2012) Preprocessing of electrophoretic images in 2-DE analysis. Chemom Intell Lab Syst 117:70–79
2. Moller B, Posch S (2009) Robust features for 2-DE gel image registration. Electrophoresis 30:4137–4148
3. Srinark T, Kambhamettu C (2008) An image analysis suite for spot detection and spot matching in two-dimensional electrophoresis gels. Electrophoresis 29:706–715
4. Liu YS, Chen SY, Liu RS, Duh DJ, Chao YT, Tsai YC, Hsieh JS (2009) Spot detection for a 2-DE gel image using a slice tree with confidence evaluation. Math Comput Modell 50:1–14
5. Cutler P, Heald G, White IR, Ruan J (2003) A novel approach to spot detection for two-dimensional gel electrophoresis images using pixel value collection. Proteomics 3:392–401
6. Kazhiyur-Mannar R, Smiraglia DJ, Plass C, Wenger R (2006) Contour area filtering of two-dimensional electrophoresis images. Med Image Anal 10:353–365
7. Wu Y, Lemkin PF, Upton K (1993) A fast spot segmentation algorithm for two-dimensional gel electrophoresis analysis. Electrophoresis 14:1351–1356
8. Bettens E, Scheunders P, VanDyck D, Moens L, VanOsta P (1997) Computer analysis of two-dimensional electrophoresis gels: a new segmentation and modeling algorithm. Electrophoresis 18:792–798
9. Tsakanikas P, Manolakos ES (2011) Protein spot detection and quantification in 2-DE gel images using machine-learning methods. Proteomics 11:2038–2050
10. Langella O, Zivy M (2008) A method based on bead flows for spot detection on 2-D gel images. Proteomics 8:4914–4918
11. Dowsey AW, Dunn MJ, Yang GZ (2003) The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 3:1567–1596
12. Conradsen K, Pedersen J (1992) Analysis of 2-dimensional electrophoretic gels. Biometrics 48:1273–1287
13. Woodward AM, Rowland JJ, Kell DB (2004) Fast automatic registration of images using the phase of a complex wavelet transform: application to proteome gels. Analyst 129:542–552
14. Luhn S, Berth M, Hecker M, Bernhardt J (2003) Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics 3:1117–1127
15. Schultz J, Gottlieb DM, Petersen M, Nesic L, Jacobsen S, Sondergaard I (2004) Explorative data analysis of two-dimensional electrophoresis gels. Electrophoresis 25:502–511
16. Salmi J, Aittokallio T, Westerholm J, Griese M, Rosengren A, Nyman TA, Lahesmaa R, Nevalainen O (2002) Hierarchical grid transformation for image warping in the analysis of two-dimensional electrophoresis gels. Proteomics 2:1504–1515
17. Panek J, Vohradsky J (1999) Point pattern matching in the analysis of two-dimensional gel electropherograms. Electrophoresis 20:3483–3491
18. Kaczmarek K, Walczak B, de Jong S, Vandeginste BG (2002) Feature based fuzzy matching of 2D gel electrophoresis images. J Chem Inf Comput Sci 42:1431–1442
19. Gustafsson JS, Blomberg A, Rudemo M (2002) Warping two-dimensional electrophoresis gel images to correct for geometric distortions of the spot pattern. Electrophoresis 23:1731–1744
20. Smilansky Z (2001) Automatic registration for images of two-dimensional protein gels. Electrophoresis 22:1616–1626
21. Veeser S, Dunn MJ, Yang GZ (2001) Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics 1:856–870
22. Sorzano CO, Arganda-Carreras I, Thevenaz P, Beloso A, Morales G, Valdes I, Perez-Garcia C, Castillo C, Garrido E, Unser M (2008) Elastic image registration of 2-D gels for differential and repeatability studies. Proteomics 8:62–65
23. Daszykowski M, Faergestad EM, Grove H, Martens H, Walczak B (2009) Matching 2D gel electrophoresis images with Matlab ‘Image Processing Toolbox’. Chemom Intell Lab Syst 96:188–195
24. Dowsey AW, English J, Pennington K, Cotter D, Stuehler K, Marcus K, Meyer HE, Dunn MJ, Yang GZ (2006) Examination of 2-DE in the Human Proteome Organisation Brain Proteome Project pilot studies with the new RAIN gel matching technique. Proteomics 6:5030–5047
25. Potra FA, Liu X, Seillier-Moiseiwitsch F, Roy A, Hang Y, Marten MR, Raman B, Whisnant C (2006) Protein image alignment via piecewise affine transformations. J Comput Biol 13:614–630
26. Xin HM, Zhu Y (2009) Multiple information-based spot matching method for 2-DE images. Electrophoresis 30:2477–2480
27. Noma A, Pardo A, Cesar RM (2011) Structural matching of 2D electrophoresis gels using deformed graphs. Pattern Recogn Lett 32:3–11
28. Rogers M, Graham J (2007) Robust and accurate registration of 2-D electrophoresis gels using point-matching. IEEE Trans Image Process 16:624–635
29. Marengo E, Robotti E, Cecconi D, Hamdan M, Scarpa A, Righetti PG (2004) Identification of the regulatory proteins in human pancreatic cancers treated with Trichostatin-A by 2D-PAGE maps and Multivariate Statistical Analysis. Anal Bioanal Chem 379(7-8):992–1003
30. Marengo E, Robotti E, Bobba M, Liparota MC, Rustichelli C, Zamò A, Chilosi M, Righetti PG (2006) Multivariate statistical tools applied to the characterisation of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel electrophoresis. Electrophoresis 27(2):484–494
31. Marengo E, Robotti E, Bobba M, Milli A, Campostrini N, Righetti SC, Cecconi D, Righetti PG (2008) Application of Partial Least Squares Discriminant Analysis and variable selection procedures: a 2D-PAGE Proteomic Study. Anal Bioanal Chem 390(5):1327–1342
32. Robotti E, Demartini M, Gosetti F, Calabrese G, Marengo E (2011) Development of a classification and ranking method for the identification of possible biomarkers in two-dimensional gel-electrophoresis based on Principal Component Analysis and variable selection procedures. Mol Biosyst 7(3):677–686
33. Negri AS, Robotti E, Prinsi B, Espen L, Marengo E (2011) Proteins involved in biotic and abiotic stress responses as the most significant biomarkers in the ripening of Pinot Noir skins. Funct Integr Genomics 11(2):341–355
34. Polati R, Menini M, Robotti E, Millioni R, Marengo E, Novelli E, Balzan S, Cecconi D (2012) Proteomic changes involved in tenderization of bovine Longissimus dorsi muscle during prolonged aging. Food Chem 135(3):2052–2069
35. Marengo E, Robotti E, Bobba M (2008) 2D-PAGE Maps Analysis. In: Vlahou A (ed) Clinical proteomics: methods and protocols, vol 428, Methods in molecular biology. Humana Press, Totowa, NJ, pp 291–325
36. Marengo E, Robotti E, Righetti PG, Campostrini N, Pascali J, Ponzoni M, Hamdan M, Astner H (2004) Study of Proteomic changes associated with healthy and tumoral murine samples in Neuroblastoma by Principal Component Analysis and classification methods. Clin Chim Acta 345(1-2):55–67
37. Marengo E, Robotti E, Gianotti V, Righetti PG, Cecconi D, Domenici E (2003) A new integrated statistical approach to the diagnostic use of proteomic two-dimensional maps. Electrophoresis 24(1-2):225–236
38. Marengo E, Robotti E, Righetti PG, Antonucci F (2003) New approach based on fuzzy logic and Principal Component Analysis for the classification of two-dimensional maps in health and disease: application to lymphomas. J Chromatogr A 1004(1-2):13–28
39. Marengo E, Bobba M, Liparota MC, Robotti E, Righetti PG (2005) Use of Legendre Moments for the Fast Comparison of two-dimensional polyacrylamide gel electrophoresis Maps Images. J Chromatogr A 1096(1-2):86–91
40. Marengo E, Robotti E, Bobba M, Demartini M, Righetti PG (2008) A new method for comparing 2D-PAGE maps based on the computation of Zernike moments and multivariate statistical tools. Anal Bioanal Chem 391(4):1163–1173
41. Marengo E, Cocchi M, Demartini M, Robotti E, Bobba M, Righetti PG (2011) Investigation of the applicability of Zernike moments to the classification of SDS 2D-PAGE maps. Anal Bioanal Chem 400(5):1419–1431
42. Carpentier S, Panis B, Swennen R, Lammertyn J (2008) Finding the significant markers: statistical analysis of proteomic data. In: Vlahou A (ed) Clinical proteomics, vol 428, Methods in molecular biology. Humana Press, Totowa, NJ, pp 327–347
43. Jacobsen S, Grove H, Jensen KN, Sorensen HA, Jessen F, Hollung K, Uhlen AK, Jorgensen BM, Faergestad EM, Sondergaard I (2007) Multivariate analysis of 2-DE protein patterns—practical approaches. Electrophoresis 28:1289–1299
44. Faergestad EM, Rye MB, Nhek S, Hollung K, Grove H (2011) The use of chemometrics to
analyse protein patterns from gel electrophoresis. Acta Chromatogr 23:1–40
45. Grove H, Jorgensen BM, Jessen F, Sondergaard I, Jacobsen S, Hollung K, Indahl U, Faergestad EM (2008) Combination of statistical approaches for analysis of 2-DE data gives complementary results. J Proteome Res 7(12):5119–5124
46. Bandow JE, Baker JD, Berth M, Painter C, Sepulveda OJ, Clark KA, Kilty I, VanBogelen RA (2008) Improved image analysis workflow for 2-D gels enables large-scale 2-D gel-based proteomics studies—COPD biomarker discovery study. Proteomics 8:3030–3041
47. Rye MB, Faergestad EM, Alsberg BK (2008) A new method for assigning common spot boundaries for multiple gels in two-dimensional gel electrophoresis. Electrophoresis 29:1359–1368
48. Faergestad EM, Rye M, Walczak B, Gidskehaug L, Wold JP, Grove H, Jia X, Hollung K, Indahl UG, Westad F, van den Berg F, Martens H (2007) Pixel-based analysis of multiple images for the identification of changes: a novel approach applied to unravel proteome patterns [corrected] of 2-D electrophoresis gel images. Proteomics 7:3450–3461
49. Rye MB, Faergestad EM, Martens H, Wold JP, Alsberg BK (2008) An improved pixel-based approach for analyzing images in two-dimensional gel electrophoresis. Electrophoresis 29:1382–1393
50. Van Belle W, Anensen N, Haaland I, Bruserud O, Hogda KA, Gjertsen BT (2006) Correlation analysis of two-dimensional gel electrophoretic protein patterns and biological variables. BMC Bioinformatics 7:198
51. Daszykowski M, Stanimirova I, Bodzon-Kulakowska A, Silberring J, Lubec G, Walczak B (2007) Start-to-end processing of two-dimensional gel electrophoretic images. J Chromatogr A 1158:306–317
52. Wheelock AM, Buckpitt AR (2005) Software-induced variance in two-dimensional gel electrophoresis image analysis. Electrophoresis 26:4508–4520
53. Dowsey AW, English JA, Lisacek F, Morris JS, Yang GZ, Dunn MJ (2010) Image analysis tools and emerging algorithms for expression proteomics. Proteomics 10:4226–4257
54. Chang J, Van Remmen H, Ward WF, Regnier FE, Richardson A, Cornell J (2004) Processing of data generated by 2-dimensional gel electrophoresis for statistical analysis: missing data, normalization, and statistics. J Proteome Res 3:1210–1218
55. Burstin J, Zivy M, Devienne D, Damerval C (1993) Analysis of scaling methods to minimize experimental variations in 2-dimensional electrophoresis quantitative data - application to the comparison of maize inbred lines. Electrophoresis 14:1067–1073
56. Appel RD, Vargas JR, Palagi PM, Walther D, Hochstrasser DF (1997) Melanie II—a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18:2735–2748
57. Lemkin PF, Lipkin LE (1981) Gellab—a computer-system for 2d-gel electrophoresis analysis. 1. Segmentation of spots and system preliminaries. Comput Biomed Res 14:272–297
58. Gonzalez RC, Woods RE (2008) Digital image processing, 3rd edn. Prentice Hall, Upper Saddle River, NJ
59. Michalis A, Savelonas EA, Mylona DM (2012) Unsupervised 2D gel electrophoresis image segmentation based on active contours. Pattern Recog 45(2):720–731
60. Pizer SM, Amburn EP, Austin JD (1987) Adaptive histogram equalization and its variations. Comput Vision Graph Image Process 39:355–368
61. Pisano ED, Zong S, Hemminger BM, DeLuca M, Johnston RE, Muller K, Braeuning MP, Pizer SM (1998) Contrast limited adaptive histogram equalization image processing to improve the detection of simulated spiculations in dense mammograms. J Digit Imaging 11:193–200
62. Merril CR, Creed GJ, Joy J, Olson AD (1993) Identification and use of constitutive proteins for the normalization of high resolution electrophoretograms. Appl Theor Electrophor 3:329–333
63. Daszykowski M, Wrobel MS, Bierczynska-Krzysik A, Silberring J, Lubec G, Walczak B (2009) Automatic preprocessing of electrophoretic images. Chemometr Intell Lab Syst 97:132–140
64. Keeping AJ, Collins RA (2011) Data variance and statistical significance in 2D-gel electrophoresis and DIGE experiments: comparison of the effects of normalization methods. J Proteome Res 10:1353–1360
65. Taylor J, Giometti CS (1992) Use of principal components analysis for mutation detection with two-dimensional electrophoresis protein separations. Electrophoresis 13(3):162–168
66. Smales CM, Birch JR, Racher AJ, Marshall CT, James DC (2003) Evaluation of individual protein errors in silver-stained two-dimensional
gels. Biochem Biophys Res Commun 306:1050–1055
67. Almeida JS, Stanislaus R, Krug E, Arthur JM (2005) Normalization and analysis of residual variation in two-dimensional gel electrophoresis for quantitative differential proteomics. Proteomics 5:1242–1249
68. Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185–193
69. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30:e15
70. Øye OK, Jørgensen KM, Hjelle SM, Sulen A, Ulvang DM, Gjertsen BT (2013) Gel2DE - a software tool for correlation analysis of 2D gel electrophoresis data. BMC Bioinformatics 14:215–219
71. Unlu M, Morgan ME, Minden JS (1997) Difference gel electrophoresis: a single gel method for detecting changes in protein extracts. Electrophoresis 18:2071–2077
72. Krogh M, Liu Y, Waldemarson S, Valastro B, James P (2007) Analysis of DIGE data using a linear mixed model allowing for protein-specific dye effects. Proteomics 7:4235–4244
73. Kreil DP, Karp NA, Lilley KS (2004) DNA microarray normalization methods can remove bias from differential protein expression analysis of 2D difference gel electrophoresis results. Bioinformatics 20:2026–2034
74. Kultima K, Scholz B, Alm H, Skold K, Svensson M, Crossman AR, Bezard E, Andren PE, Lonnstedt I (2006) Normalization and expression changes in predefined sets of proteins using 2D gel electrophoresis: a proteomic study of L-DOPA induced dyskinesia in an animal model of Parkinson’s disease using DIGE. BMC Bioinformatics 7(1):475
75. Jacoby WG (2000) Loess: a nonparametric, graphical tool for depicting relationships between variables. Elect Stud 19:577–613
76. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74(368):829–836
77. Cleveland WS, Devlin SJ (1988) Locally weighted regression: an approach to regression analysis by local fitting. J Am Stat Assoc 83(403):596–610
78. Berger JA, Hautaniemi S, Jarvinen AK, Edgren H, Mitra SK, Astola J (2004) Optimized LOWESS normalization parameter selection for DNA microarray data. BMC Bioinformatics 5:194
79. Fodor IK, Nelson DO, Alegria-Hartman M, Robbins K, Langlois RG, Turteltaub KW, Corzett TH, McCutchen-Maloney SL (2005) Statistical challenges in the analysis of two-dimensional difference gel electrophoresis experiments using DeCyder(TM). Bioinformatics 21:3733–3740
80. Cleveland WS, Grosse E, Shyu WM (1992) Local regression models. In: Chambers JM, Hastie TJ (eds) Statistical models in S, chapter 8. Wadsworth & Brooks/Cole, Pacific Grove
81. R Development Core Team (2005) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org
82. Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3:Article 3
83. Smyth GK, Speed T (2003) Normalization of cDNA microarray data. Methods 31(4):265–273
84. Karp NA, Kreil DP, Lilley KS (2004) Determining a significant change in protein expression with DeCyder during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics 4(5):1421–1432
85. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(Suppl 1):S96–S104
86. Astrand M (2003) Contrast normalization of oligonucleotide arrays. J Comput Biol 10(1):95–102
87. Dudoit S, Yang YH (2002) Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In: Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (eds) The analysis of gene expression data: methods and software. Springer, New York
Chapter 7
Spot Matching of 2-DE Images Using Distance, Intensity, and Pattern Information
Hua-Mei Xin and Yuemin Zhu
Abstract
The analysis of a large number of two-dimensional gel electrophoresis (2-DE) images requires automatic methods. In such analyses, spot matching plays a fundamental role, in particular for the identification of proteins. We describe a simple method that matches spots in 2-DE images automatically and accurately. The method simultaneously exploits the distance between the spots, their intensity, and the pattern formed by their spatial configuration.
Key words Spot matching, Spot registration, Two-dimensional gel electrophoresis, 2-DE images, Image registration, Proteomics
1 Introduction
Two-dimensional gel electrophoresis (2-DE), originally introduced for analyzing Escherichia coli proteins in 1975 [1], is a powerful technique for protein separation. It separates proteins first according to their isoelectric points and then according to their molecular weight, thus generating a two-dimensional (2-D) image, called a 2-DE image, in which a spot indicates the presence in the initial solution of proteins with a specific mass and charge. The precise comparison of 2-DE images is important because one of the main objectives of 2-DE techniques is to detect differences in protein expression. Before performing this comparison, it is necessary to superpose the images to obtain the best geometric correspondence, so that the spots in one image correspond to those in the other image, with the expectation that superimposed spots contain the same protein across the different maps. In terms of image processing, this is a matching or registration problem. Matching of 2-DE images is therefore a fundamental step in the study of proteins, all the more so as the number of 2-DE images can be large.
In the past, a number of 2-DE image matching techniques were proposed. They can be loosely classified into three categories: landmark-based methods, intensity-based methods, and methods based on both.
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_7, © Springer Science+Business Media New York 2016
Spot-based schemes first detect spots and then consider these detected spots as characteristic points, also called landmarks, to guide the matching [2–7]. Point pairs can be selected manually by the user, or they can be selected automatically in some analysis systems, often for the largest spots. The main advantage of spot-based approaches is that the list of spot features originating from different gels can be used to compensate for the uneven distribution of the gel spots in the image warping step. Their major disadvantage is that the extraction of the corresponding spots is subject to inaccuracy and can be very computationally expensive. Also, the optimal choice of the spots for guiding the matching remains an open question. The choice of the landmarks is generally achieved by estimating the smoothest possible transformation, in the sense of a certain regularity criterion, which does not necessarily correspond to the physical reality.
Pixel-based schemes perform image matching directly on the raw data, by considering the image as a surface formed by pixel intensities [8–11]. They are highly dependent on the similarity definition, transformation function, and optimization procedure used. These methods work at the level of pixels and thus do not need a preliminary step of data reduction. Their main advantage is that more image information is taken into account. However, one of their disadvantages is the heavy computational cost due to the need to consider all the pixels of the image. Another drawback is their limited accuracy, because they do not account for spot landmark constraints in the gel images. A third problem is that the relationship between the intensities of the two images may not be trivial, especially in the case of multimodal images.
The third category of spot matching methods exploits the advantages of both spot- and pixel-based techniques [12–16]. This type of approach makes use of both spot landmark constraints and pixel intensity information. The spot landmark constraints are used to compensate for information mismatch between images. When the focus is put only on the use of spot landmarks, two problems can degrade the performance of this type of approach: inaccurate pairing of spot landmarks and an insufficient number of corresponding spot landmarks. In principle, matching that jointly exploits the advantages of spot- and pixel-based techniques can significantly improve accuracy and robustness. However, due to complex biological, physical, and chemical processes, the location of the same protein spot in different 2-DE images may vary both globally and locally, so it is inherently impossible to match the images perfectly. Even under strictly controlled laboratory conditions, repeated analyses of the same sample normally produce nonidentical gels [17]. This makes it extremely time consuming and difficult to accurately identify the difference between two images,
which, however, carries pertinent information for biologists. In this chapter, we describe a simple and accurate multiple information-based spot matching method for 2-DE images. The method is inspired by the principle of attraction and simultaneously exploits spot distance, spot intensity, and spot pattern information to match spots in two 2-DE images accurately and automatically. Unlike existing techniques, this method does not use distance and intensity information as a measure of similarity in an optimization process. Instead, it uses the distance and intensity information explicitly and analytically to calculate a so-called matching force in a simple and computationally inexpensive process.
2 Materials
Both simulated and real 2-D gel electrophoresis data of the Escherichia coli proteome are used to show the performance of the matching approach presented hereafter.
2.1 Simulated Gel Electrophoresis Data
The simulated gel electrophoresis data are obtained by applying global affine and local B-spline distortions to the original images, with different deformation levels (see Note 1), thus generating pairs of gel images corresponding to different levels of deformation (see Fig. 1) (see Note 2).
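One way to generate such a simulated floating image is sketched below with NumPy and SciPy. This is an illustration under stated assumptions, not the procedure used to produce Fig. 1: the function name `simulate_floating_image`, the default affine matrix, the 4 × 4 control grid, and the maximum shift are hypothetical, and the local B-spline distortion is approximated by cubic-spline interpolation of random displacements defined on a coarse grid.

```python
import numpy as np
from scipy import ndimage

def simulate_floating_image(reference, affine=None, offset=(0.0, 0.0),
                            grid=(4, 4), max_shift=3.0, seed=0):
    """Apply a global affine and a smooth local distortion to a reference image."""
    ref = np.asarray(reference, dtype=float)
    if affine is None:
        # Assumed mild scaling and shear; output pixel o samples input at affine @ o + offset
        affine = np.array([[1.02, 0.01], [-0.01, 0.98]])
    warped = ndimage.affine_transform(ref, affine, offset=offset, order=1, mode="nearest")

    # Local distortion: random shifts on a coarse control grid, interpolated
    # to full resolution with cubic splines
    rng = np.random.default_rng(seed)
    rows, cols = ref.shape
    coarse = rng.uniform(-max_shift, max_shift, size=(2, grid[0], grid[1]))
    yy, xx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    gy = yy * (grid[0] - 1) / max(rows - 1, 1)
    gx = xx * (grid[1] - 1) / max(cols - 1, 1)
    dy = ndimage.map_coordinates(coarse[0], [gy, gx], order=3, mode="nearest")
    dx = ndimage.map_coordinates(coarse[1], [gy, gx], order=3, mode="nearest")

    # Resample the affinely warped image along the displaced coordinates
    return ndimage.map_coordinates(warped, [yy + dy, xx + dx], order=1, mode="nearest")
```

Increasing `max_shift` or moving the affine matrix further from the identity gives image pairs with a higher level of deformation.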
2.2 Real Gel Electrophoresis Data
The real 2-D gel electrophoresis data of the Escherichia coli proteome come from the Laboratory of Bacterial Chemistry, CNRS URP9034, Marseille, France (see Note 3).
2.3 Software Environment
The open-source software ITK (Insight Segmentation and Registration Toolkit), v. 2.4, was used for the calculations (http://www.itk.org/).
Fig. 1 Example of a pair of simulated images (size: 300 × 310). (a) Reference image. (b) Floating image. (c) Image of the difference between (a) and (b)
3 Methods
The two images to be matched are referred to as the reference image and the floating image, respectively. They are matched using the principle of attraction (see Note 4).
3.1 Extraction of the Corresponding Spots
1. Start from the original reference and floating 2-DE images (see Note 5), designated $I_r$ and $I_f$, respectively.
2. Normalize the intensities of the images to [0, 255] (see Note 6).
3. Choose different threshold intervals [L1, U1], [L2, U2], . . ., [Ln, Un] according to the gel electrophoresis data, where L denotes the lower threshold and U the upper threshold (see Note 7).
4. In the ITK software, employ the morphological operators “open” and “close” to fill the segmented spot regions and refine their edges. The operator “open” performs an erosion followed by a dilation, using a predefined structuring element, while the operator “close” performs a dilation followed by an erosion, using a predefined structuring element. The erosion operation finds the local minima in gel images, and the dilation operation finds the local maxima (see Note 8).
5. Extract the spots and automatically calculate the position of their centroids from the segmented binary image. Let c designate an operator that gives the coordinates (x, y) of the centroid of a spot.
6. Extract the intensity at the centroid of each spot from the original grey-level image. $g(c(s_{ri}))$ and $g(c(s_{fj}))$ denote the intensities of the spots at the centroids in $I_r$ and $I_f$, respectively (see Notes 9 and 10).
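The six steps above can be sketched for a single threshold interval as follows. The chapter uses ITK for these operations; this example swaps in `scipy.ndimage` as a stand-in, uses binary rather than grey-level morphology, and the function name `extract_spots`, the 3 × 3 structuring element, and the rounding of the centroid to the nearest pixel are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def extract_spots(image, lower, upper, structure_size=3):
    """Return a list of (x, y, intensity at centroid) for one threshold interval."""
    img = np.asarray(image, dtype=float)

    # Step 2: rescale the intensities to [0, 255]
    span = img.max() - img.min()
    norm = (img - img.min()) / span * 255.0 if span > 0 else np.zeros_like(img)

    # Step 3: keep pixels inside the threshold interval [lower, upper]
    mask = (norm >= lower) & (norm <= upper)

    # Step 4: "open" removes small isolated regions, "close" fills small holes
    structure = np.ones((structure_size, structure_size), dtype=bool)
    mask = ndimage.binary_opening(mask, structure=structure)
    mask = ndimage.binary_closing(mask, structure=structure)

    # Step 5: label connected regions and compute their centroids
    labels, n_spots = ndimage.label(mask)
    if n_spots == 0:
        return []
    centroids = ndimage.center_of_mass(mask.astype(float), labels,
                                       list(range(1, n_spots + 1)))

    # Step 6: read the grey level at each centroid (nearest pixel)
    spots = []
    for y, x in centroids:
        iy, ix = int(round(y)), int(round(x))
        spots.append((float(x), float(y), float(norm[iy, ix])))
    return spots
```

For dark spots on a light background, an interval such as [0, 100] would select the spot regions; calling the function once per interval [Li, Ui] and merging the results would cover the case of several threshold intervals.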
3.2 Spot Matching Using Distance, Intensity, and Pattern
1. Given two sets of 2-D spots (obtained from Subheading 3.1),
$$\Phi_r = \{ s_{ri} \mid i = 1, 2, \ldots, N \} \quad \text{and} \quad \Phi_f = \{ s_{fj} \mid j = 1, 2, \ldots, M \} \qquad (1)$$
where N and M represent the numbers of spots $s_{ri}$ and $s_{fj}$ in $I_r$ and $I_f$, respectively.
2. Based on the principle of attraction, assume that every spot $s_{fj}$ in $I_f$ attracts every spot $s_{ri}$ in $I_r$ with a force $\mathbf{f}$ pointing along the line that links the two spots, defined as:
$$\mathbf{f}(s_{ri}, s_{fj}) = \frac{K(s_{ri}, s_{fj})}{D(s_{ri}, s_{fj})}\,\mathbf{r} \qquad (2)$$
where $K(s_{ri}, s_{fj})$ is a similarity criterion, $D(s_{ri}, s_{fj})$ is a function of the distance between the spots, and $\mathbf{r}$ is the unit vector from $s_{fj}$ to $s_{ri}$.
3. Deform $I_f$ according to the force $\mathbf{f}$, which can be further written as:
$$\mathbf{f}(s_{ri}, s_{fj}) = \frac{K_g^{w_g}\, K_p^{w_p}}{D_d^{w_d} + \lambda}\,\frac{\mathbf{d}}{\|\mathbf{d}\|} \qquad (3)$$
where wd, wg, and wp are the weighting coefficients, which adjust the strength of the distance, intensity, and pattern information; Kg represents the intensity similarity and Kp the pattern similarity (see Notes 10 and 11).
4. Calculate Kg, which represents the intensity similarity, as:
$$K_g = \bar{g}(c(s_{ri})) \, \bar{g}(c(s_{fj})) \left( 1 - \frac{\left| \bar{g}(c(s_{ri})) - \bar{g}(c(s_{fj})) \right|}{\max\left( \bar{g}(c(s_{ri})), \bar{g}(c(s_{fj})) \right)} \right) \qquad (4)$$
with $\bar{g}(c(s_{ri})) = 255 - g(c(s_{ri}))$ and $\bar{g}(c(s_{fj})) = 255 - g(c(s_{fj}))$ (see Note 11).
5. Calculate Kp, which measures the pattern similarity, as:
$$K_p = \exp\left( -\mathrm{diff}(P_{fj}, P_{ri}) \right) \qquad (5)$$
where Pri and Pfj encode the pattern information of spots in Ir and If, respectively. diff(Pfj, Pri) expresses the difference in pattern between floating and reference images inside a neighborhood (see Note 12).
6. Define Kp inside the circular neighborhood Ω of the spots sri and sfj as:
$$K_p = \exp\left( -\frac{\left\| \sum_{n \neq j \in \Omega} \mathbf{v}_{fj,fn} - \sum_{m \neq i \in \Omega} \mathbf{v}_{ri,rm} \right\|}{\delta} \right) \cdot \exp\left( -\frac{1}{2\pi} \left| \operatorname{arctg} \frac{\sum_{n \neq j \in \Omega} y_{fj,fn} - \sum_{m \neq i \in \Omega} y_{ri,rm}}{\sum_{n \neq j \in \Omega} x_{fj,fn} - \sum_{m \neq i \in \Omega} x_{ri,rm}} \right| \right) \qquad (6)$$
in which δ denotes the neighborhood radius of the spots sri and sfj, vri,rm is the shifting vector from spot sri to spot srm, and vfj,fn the shifting vector from spot sfj to spot sfn (see Note 13).
7. Calculate the distance Dd:
$$D_d = \frac{\|\mathbf{d}\|^2}{\delta^2} \qquad (7)$$
where d designates the Euclidean distance vector between the two spots sri and sfj. As in step 6, δ is the neighborhood radius of the spots sri and sfj.
8. Compute the attractive force f between sri and every spot sfj ∈ Φf. If s*fj gives the maximum value of ‖f(sri, sfj)‖, then sri and s*fj are considered corresponding spots.
9. Repeat the calculation in step 8 for each given spot sri ∈ Φr in the gel image, thus yielding two corresponding spot sets
$$\Phi_r^* = \{ s_{ri} \mid i = 1, 2, \ldots, q \} \quad \text{and} \quad \Phi_f^* = \{ s_{fj}^* \mid j = 1, 2, \ldots, q \}$$
with respect to Ir and If, where q designates the number of matched spots (see Note 14). Figure 2 gives an example of spot matching for real 2-DE images (see Note 15).
Fig. 2 Example of matching based on the distance, intensity, and pattern. (a) Reference image. (b) Floating image. (c) Image of the difference between (a) and (b). (d) Result of matching spots (red spots represent spots of the reference image, black spots those of the floating image, and the blue vectors indicate the matching of the spots). The stars represent the spots that remained unmatched
10. The small number of remaining spots without correspondence can subsequently be set aside for further biological analysis (such as the spot shown as a star in Fig. 2d).
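The search in steps 7–9 can be sketched as follows. This is an illustrative outline only: the similarities Kg and Kp are supplied by a hypothetical callback `similarity(i, j)` rather than computed from the images, both are assumed non-negative, and λ = 0.5 is an arbitrary value within the range given in Note 11. Because d/‖d‖ is a unit vector, the modulus of the force reduces to the scalar factor of Eq. 3.

```python
def force_modulus(d, k_g, k_p, delta, w_d=1.0, w_g=1.0, w_p=1.0, lam=0.5):
    """Modulus of the attractive force (Eq. 3) for non-negative K_g and K_p."""
    d_d = (d[0] ** 2 + d[1] ** 2) / delta ** 2           # Eq. 7
    return (w_g * k_g * w_p * k_p) / (w_d * d_d + lam)   # scalar factor of Eq. 3

def match_spots(ref, flo, similarity, delta, lam=0.5):
    """For each reference spot, pick the floating spot with the largest force."""
    matches = {}
    for i, (xr, yr) in enumerate(ref):
        best_j, best_f = None, float("-inf")
        for j, (xf, yf) in enumerate(flo):
            k_g, k_p = similarity(i, j)  # hypothetical callback returning (K_g, K_p)
            f = force_modulus((xr - xf, yr - yf), k_g, k_p, delta, lam=lam)
            if f > best_f:
                best_j, best_f = j, f
        matches[i] = best_j
    return matches

# Hypothetical centroids; equal similarities, so only the distance term decides.
ref_spots = [(10.0, 10.0), (50.0, 50.0)]
flo_spots = [(52.0, 49.0), (11.0, 12.0)]
matches = match_spots(ref_spots, flo_spots, lambda i, j: (1.0, 1.0), delta=20.0)
```

As in the procedure above, this sketch does not enforce a one-to-one assignment: two reference spots may select the same floating spot, and such cases would need a separate check.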
4 Notes
1. Choose different deformation levels according to the gel electrophoresis data. Here, we choose deformation levels ranging from 1 to 6 with increasing deformation. For each 2-DE image, deformation levels 1–6 correspond to mismatched spots representing 10, 30, 50, 60, 80, and 90 % of the total number of spots, respectively. Applying this procedure to a set of 20 2-DE images yields 120 pairs of simulated gel images.
2. Figure 1 gives an example of a pair of simulated images. Figure 1c shows the resulting difference image after the deformation.
3. Real datasets also span deformation levels from 1 to 6. For each deformation level, three pairs of real 2-DE images are considered, in which the corresponding spot pairs have been established using mass spectrometry.
4. The principle of attraction comes from the law of universal gravitation in physics. It states that every point mass attracts every other point mass with a force pointing along the line joining the two points; the force is proportional to the product of the two masses and inversely proportional to the square of the distance between them.
5. For the simulated data: use an original gel image as the reference image, and then generate the floating image from it by applying different deformations. For the real data: select paired images in which the corresponding spot pairs have been established using mass spectrometry; then choose one image of the pair as the reference image and the other as the floating image.
6. Intensity normalization is a necessary step for the matching. It retains the relative differences between different 2-DE images.
7. Automatically determine the threshold values according to the intensities of the spots in the images. Set the pixels whose intensity lies inside the interval between the two threshold values to 255, and the pixels whose intensity lies outside it to 0. The threshold interval is chosen within the intensity range of the images. For example, if the interval is [100, 200], the pixels with intensities between 100 and 200 are set to 255.
8. Choose suitable structuring elements according to the gel electrophoresis data. Here, we use elliptical structuring elements. The morphological operators “open” and “close” can be used as filters to remove both positive and negative noise: “open” removes noisy parts of the object (or small objects), whereas “close” fills in the object (or removes small holes).
9. After segmentation, the intensity and position of the centroid of each detected spot are calculated. These values are used in the subsequent matching process.
10. Here, g(c(sri)) and g(c(sfj)) denote the intensities of the spots at the centroids in Ir and If, respectively, where c(sri) = (xri, yri) are the coordinates of the centroid of the i-th spot sri in the reference image, and c(sfj) = (xfj, yfj) are the coordinates of the centroid of the j-th spot sfj in the floating image.
11. Determine the weighting coefficients wd, wg, and wp based on an a priori database. These parameters adjust the strength of the distance, intensity, and pattern information, respectively. Normally, wd, wg, and wp can be set to 1. λ ∈ [0, 1] is the coefficient used to prevent instability when ‖d‖² is close to zero.
12. The greater the difference diff(Pfj, Pri), the smaller the pattern similarity parameter Kp (i.e., the more dissimilar the two patterns are). Kp reaches its maximum value when diff(Pfj, Pri) is zero.
13. Here, we model the pattern similarity (as a real number) and the phase of the residual vector. The former is equal to the product of the modulus with the scale factor δ, and the latter (with a factor 2π) results from the subtraction between the sum of vfj,fn in the floating image and the sum of vri,rm in the reference image. The shifting vectors are calculated by subtracting the position vectors of the two spots, for example sri and srm.
14. The performance of this process can be indirectly influenced by noise. Errors due to noise on the position as well as on the intensity of the spots reduce the accuracy of the spot pairing to some extent.
15. Figure 2 visually shows the corresponding spots obtained by the presented matching method, in which the red spots represent the proteins that were expressed in the reference image, the black spots represent the proteins that were expressed in the floating image, and each blue vector indicates a pair of matched spots.
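The filtering behavior of “open” and “close” described in Note 8 can be reproduced with a small pure-Python sketch. It uses a square (2k + 1) × (2k + 1) structuring element instead of the elliptical one mentioned above, clips the window at the image border, and does not call ITK; all names and the toy images are assumptions for illustration.

```python
def _window_filter(img, k, pick):
    """Apply pick (min or max) over a (2k+1) x (2k+1) window clipped to the image."""
    rows, cols = len(img), len(img[0])
    return [[pick(img[rr][cc]
                  for rr in range(max(0, r - k), min(rows, r + k + 1))
                  for cc in range(max(0, c - k), min(cols, c + k + 1)))
             for c in range(cols)]
            for r in range(rows)]

def erode(img, k=1):
    return _window_filter(img, k, min)   # local minimum

def dilate(img, k=1):
    return _window_filter(img, k, max)   # local maximum

def opening(img, k=1):
    return dilate(erode(img, k), k)      # erosion followed by dilation

def closing(img, k=1):
    return erode(dilate(img, k), k)      # dilation followed by erosion

def blank(n):
    return [[0] * n for _ in range(n)]

# Hypothetical binary image: a 3 x 3 spot plus one isolated noise pixel.
noisy = blank(7)
for r in range(1, 4):
    for c in range(1, 4):
        noisy[r][c] = 255
noisy[5][5] = 255
opened = opening(noisy)

# Hypothetical binary image: a 3 x 3 spot with a one-pixel hole in the middle.
holed = blank(7)
for r in range(2, 5):
    for c in range(2, 5):
        holed[r][c] = 255
holed[3][3] = 0
closed = closing(holed)
```

Opening removes the isolated pixel while keeping the spot, and closing fills the hole without enlarging the spot.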
Acknowledgement
The authors thank the team of Dr. Long-Fei WU, Laboratory of Bacterial Chemistry, CNRS URP9034, Marseille, France, for providing the 2-DE images and the ground truth. Yuemin ZHU was partly supported by the French ANR under ANR-13-MONU0009-01. Hua-Mei Xin was partly supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars (2009-36).
References
1. O’Farrell PH (1975) High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250:4007–4021
2. Appel RD, Vargas JR, Palagi PM et al. (1997) Melanie II-a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18:2735–2748
3. Pleiner KP, Hoffmann F, Kriegel K et al. (1999) New algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis gel databases. Electrophoresis 20:755–765
4. Efrat A, Hoffmann F, Kriegel K et al. (2002) Geometric algorithms for the analysis of 2-D electrophoresis gels. J Comput Biol 9:299–316
5. Rogers M, Graham J (2007) Robust and accurate registration of 2-D electrophoresis gels using point-matching. IEEE Trans Image Process 16:624–635
6. Möller B, Posch S (2009) Robust features for 2-DE gel image registration. Electrophoresis 30:4137–4148
7. Rohr K, Wörz S (2012) An extension of thin-plate splines for image registration with radial basis functions. 9th IEEE international symposium on biomedical imaging, pp 442–445
8. Smilansky Z (2001) Automatic registration for images of two-dimensional protein gels. Electrophoresis 22:1616–1626
9. Veeser S, Dunn MJ, Yang GZ (2001) Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics 1:856–870
10. Gustafsson JS, Blomberg A, Rudemo M (2002) Warping two-dimensional electrophoresis gel images to correct for geometric distortions of the spot pattern. Electrophoresis 23:1731–1744
11. Lin DT (2010) Autonomous sub-image matching for two-dimensional electrophoresis gels using MaxRST algorithm. Image Vision Comput 28:1267–1279
12. Rohr K, Cathier P, Wörz S (2004) Elastic registration of electrophoresis images using intensity information and point landmarks. Pattern Recogn 37:1035–1048
13. Sorzano COS, Thévenaz P, Unser M (2005) Elastic registration of biological images using vector-spline regularization. IEEE Trans Biomed Eng 52:652–663
14. Shi G, Jiang T, Zhu W et al. (2007) Alignment of two-dimensional electrophoresis gels. Biochem Biophys Res Commun 357:427–432
15. Daszykowski M, Færgestad EM, Grove H et al. (2009) Matching 2-D gel electrophoresis images with Matlab ‘Image Processing Toolbox’. Chemometr Intell Lab 96:188–195
16. Nhek S, Mosleth EF, Hoy M et al. (2012) Nonlinear visualisation and pixel-based alignment of 2-D electrophoresis images. Chemometr Intell Lab 118:97–108
17. Voss T, Haberl P (2000) Observation on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis. Electrophoresis 21:3345–3350
Chapter 8
Algorithms for Warping of 2-D PAGE Maps
Marcello Manfredi, Elisa Robotti, and Emilio Marengo
Abstract
Software-based image analysis of 2-D PAGE maps is an important step in the investigation of the proteome. Warping algorithms, which are employed to register spots among gels, are able to overcome the difficulties due to the low reproducibility of this analytical technique. Over the years, research into new mathematical methods for matching and warping has allowed the development of several easy-to-use software packages for routine applications. This chapter describes common and basic spatial transformations used for the alignment of protein spots present in different gel maps; some recent approaches are also presented.
Key words 2-D PAGE analysis, Warping, Gel matching, Data analysis, Image processing
1 Introduction
The aim of warping methods is to correct the systematic geometric distortions that are often present in digitized 2-D maps. The use of automatic matching and warping software, which is essential in order to compare the proteomic profiles and identify the differentially expressed proteins, strongly affects the quantitative and qualitative variability among replicate gels [1]. Matching and warping algorithms are able to overcome the difficulties that are due to variations in the migration positions of spots among gels. Registration of two images consists in deforming one image (source) into another (template), so that similar structures in the images match each other [2]. Several warping transformations [3–5] can be employed in 2-D PAGE analysis; they can be divided into two main families, according to the information used to drive the registration process:
– Landmark-based methods [6]. They proceed in two steps: (1) image segmentation, to identify geometric structures (e.g., spots), and (2) image registration, which consists in aligning the identified geometric sets and computing the image transformation function.
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_8, © Springer Science+Business Media New York 2016
– Intensity-based methods. They register the two images by optimizing an intensity similarity measure calculated between the raw images (e.g., the sum of squared intensity differences or the cross-correlation).
Spot-based methods were the first to be developed [7–9], but they have been somewhat outdone by intensity-based methods [10–14], which do not need segmentation and exploit image information to a larger extent; on the other hand, intensity-based methods cannot incorporate user-driven information. When rigid registration is applied, intensity-based methods proved to be superior [15], but when nonrigid registration is applied, the large number of degrees of freedom can produce a large number of wrong pairings. In the registration of 2-D maps this can be an important drawback: a spot may be either matched to a wrong one or not matched at all. Therefore, the combination of intensity and landmark information has gained more and more popularity, starting from applications in the medical field [16–22].
The choice of the warping algorithm is a compromise between a smooth transformation and a good match. Electrophoresis images usually require local and nonlinear deformations that can be handled properly by an elastic deformation framework. Some methods solve systems of differential equations to find the transformation field, after deriving an image formation model [14]; others produce a piecewise-bilinear mapping [11, 12]. Many complex algorithms, such as restriction landmark genomic scanning (RLGS) [23, 24], fuzzy clustering [25], and genetic algorithms [26], have been developed to solve these matching problems (see Chapter 10 for more details). In this chapter, common and basic spatial transformations used for warping are presented first; then, some of the most widely used methods for 2-D map warping and matching described in the literature are reviewed.
2 Methods
2.1 Warping Transformations
In warping, a pair of two-dimensional functions, u(x, y) and v(x, y), is identified, which map a position (x, y) in one image (x = column number and y = row number) to a position (u, v) in another image [27] (see Fig. 1).
2.1.1 Affine Transformations
There are many classes of spatial transformations used in image warping: the simplest class is the affine transformation, which consists of translation, rotation, scaling, shear, and combinations of these. It can be described by six parameters, which are usually represented by a (3 × 2) transformation matrix. It can be unambiguously defined by establishing a correspondence between three points in the source image and three points in the target image.
Fig. 1 Representation of image warping: the objective of the algorithm in 2-DE is to match the same spots in different gel images
The most general transformation is:
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & 1 \end{bmatrix} \qquad (1)$$
This transformation has been widely used to superimpose pairs of 2-D maps. Given the centers of the spots (x1, y1), ..., (xm, ym) in the first gel and (u1, v1), ..., (um, vm) in the second gel, a regression algorithm can be used to estimate the linear parameters that minimize the sum of squared differences between the spots:
$$\sum_{i=1}^{m} \left( u_i - u(x_i, y_i) \right)^2 + \sum_{i=1}^{m} \left( v_i - v(x_i, y_i) \right)^2 \qquad (2)$$
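The regression of Eq. 2 can be sketched in pure Python by fitting u(x, y) = c1·x + c2·y + c3 and the analogous expression for v with ordinary least squares. The function names, the normal-equation approach, and the sample coordinates are assumptions chosen for illustration; a numerical library would normally be used instead of the small solver shown here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_affine(src, dst):
    """Least-squares affine fit minimizing Eq. 2.

    src holds spot centers (x, y) of the first gel, dst the centers (u, v) of
    the second gel. Returns (cu, cv) with u = cu[0]*x + cu[1]*y + cu[2] and
    v = cv[0]*x + cv[1]*y + cv[2].
    """
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (A^T A) c = A^T u, built separately for u and v.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * u for r, (u, _) in zip(rows, dst)) for i in range(3)]
    atv = [sum(r[i] * v for r, (_, v) in zip(rows, dst)) for i in range(3)]
    return solve(ata, atu), solve(ata, atv)

# Hypothetical spot centers generated with u = 2x + 0.5y + 3 and v = -x + y - 1.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(3.0, -1.0), (5.0, -2.0), (3.5, 0.0), (5.5, -1.0)]
cu, cv = fit_affine(src, dst)
```

At least three non-collinear spot pairs are needed; with more pairs the fit averages out localization errors.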
1. The translation function translates all the points to new positions by adding the offsets Tu and Tv to u and v, respectively:
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ T_u & T_v & 1 \end{bmatrix} \qquad (3)$$
2. All points in the uv-plane can be rotated around the origin through the counterclockwise angle θ by a rotation transformation:
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (4)$$
3. The points can be scaled by applying the scale factors Su and Sv to the u and v coordinates, respectively (enlargements are defined by scale factors larger than 1):
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} S_u & 0 & 0 \\ 0 & S_v & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (5)$$
4. The shear transformation along the u-axis is:
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} 1 & 0 & 0 \\ H_v & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (6)$$
where Hv is used to make x linearly dependent on v and u. Similarly, the shear transformation along the v-axis is:
$$[x \;\; y \;\; 1] = [u \;\; v \;\; 1] \begin{bmatrix} 1 & H_u & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7)$$
2.1.2 Bilinear and Perspective or Projective Transformations
Both these transformations are characterized by eight parameters and can therefore be used to map an arbitrary quadrilateral in the source image onto another arbitrary quadrilateral in the target image.
1. The general representation of a perspective transformation is:
$$[x' \;\; y' \;\; w'] = [u \;\; v \;\; 1] \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \qquad (8)$$
It is produced when $[a_{13} \;\; a_{23}]^T$ is nonzero. It is used in conjunction with a projection onto a viewing plane, in what is known as a perspective or central projection. Perspective transformations preserve parallel lines only when they are parallel to the projection plane.
2. The general representation of a bilinear transformation is:
$$[x \;\; y] = [uv \;\; u \;\; v \;\; 1] \begin{bmatrix} a_3 & b_3 \\ a_2 & b_2 \\ a_1 & b_1 \\ a_0 & b_0 \end{bmatrix} \qquad (9)$$
A bilinear transformation handles the four-corner mapping problem for non-planar quadrilaterals [27]. It is most commonly used in the forward mapping formulation, where rectangles are mapped onto non-planar quadrilaterals. Bilinear mappings preserve lines that are horizontal or vertical in the source image.
2.1.3 Polynomial Transformations
Polynomial transformations are widely used in many fields of image alignment: the source image coordinates are defined as bivariate polynomials in the target image coordinates. The polynomial coefficients are often calculated from the known positions of a few points of the image (landmarks) using least squares methods. When local control over the mapping is required, piecewise polynomial methods can be employed [27]. Polynomial transformations of order p are specified by either:
$$u = \sum_{i=0}^{p} \sum_{j=0}^{p-i} a_{ij} x^i y^j \quad \text{and} \quad v = \sum_{i=0}^{p} \sum_{j=0}^{p-i} b_{ij} x^i y^j \qquad (10a)$$
or
$$u = \sum_{i=0}^{p} \sum_{j=0}^{p} a_{ij} x^i y^j \quad \text{and} \quad v = \sum_{i=0}^{p} \sum_{j=0}^{p} b_{ij} x^i y^j \qquad (10b)$$
depending on the maximum order of interaction terms included. These transformations include quadratic, biquadratic, cubic, and bicubic as special cases [28].
2.1.4 Thin Plate Splines [29–31]
The process of using thin plate splines in image warping involves minimizing a bending energy function for a transformation over a set of given nodes or landmark points:
$$E = \iint_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 z}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 z}{\partial x \, \partial y} \right)^2 + \left( \frac{\partial^2 z}{\partial y^2} \right)^2 \right] dx \, dy \qquad (11)$$
The basis for solving the problem is given by a kernel function U defined as:
$$U = r^2 \log r^2, \quad \text{where } r^2 = (x_a - x_b)^2 + (y_a - y_b)^2 \qquad (12)$$
where a and b indicate a pair of landmarks. Given a set of source landmark points, a matrix P of size (n × 3), where n is the number of points, must be defined:
$$P = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \ldots & \ldots & \ldots \\ 1 & x_n & y_n \end{bmatrix} \qquad (13)$$
Using the kernel function, another matrix K (n × n) must be defined:
$$K = \begin{bmatrix} 0 & U(r_{12}) & \ldots & U(r_{1n}) \\ U(r_{21}) & 0 & \ldots & U(r_{2n}) \\ \ldots & \ldots & \ldots & \ldots \\ U(r_{n1}) & U(r_{n2}) & \ldots & 0 \end{bmatrix} \qquad (14)$$
Matrix L, of size (n + 3) × (n + 3), is a combination of K and P, where T is the matrix transpose operator and O is a 3 × 3 matrix of zeros:
$$L = \begin{bmatrix} K & P \\ P^T & O \end{bmatrix} \qquad (15)$$
The matrix L allows the bending energy equation to be solved. Applying the inverse of L to the vector Y, defined as $Y = (V \mid 0 \;\; 0 \;\; 0)^T$, where V = (v1, ..., vn) is any n-vector, yields a vector W = (w1, ..., wn) and the coefficients a1, ax, ay:
$$L^{-1} Y = (W \mid a_1 \;\; a_x \;\; a_y)^T \qquad (16)$$
Using the elements of $L^{-1}Y$, it is possible to define a function z(x, y) by the equation:
$$z(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i \, U\left( \left| P_i - (x, y) \right| \right) \qquad (17)$$
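The construction of Eqs. 12–17 can be followed step by step in a short pure-Python sketch: it assembles L from K and P, solves L·(W | a1 ax ay)^T = Y instead of inverting L explicitly, and evaluates z(x, y). The function names, the small Gaussian-elimination solver, and the landmark values are assumptions made for illustration.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def kernel(r2):
    """U = r^2 log r^2 evaluated from the squared distance r2, with U(0) = 0."""
    return 0.0 if r2 == 0.0 else r2 * math.log(r2)

def fit_tps(points, values):
    """Return (W, (a1, ax, ay)) for landmarks `points` and target values V."""
    n = len(points)
    big_l = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            big_l[i][j] = kernel((xi - xj) ** 2 + (yi - yj) ** 2)  # block K
        for k, p in enumerate((1.0, xi, yi)):
            big_l[i][n + k] = p   # block P
            big_l[n + k][i] = p   # block P^T (block O stays zero)
    sol = solve(big_l, list(values) + [0.0, 0.0, 0.0])  # Y = (V | 0 0 0)^T
    return sol[:n], sol[n:]

def tps_eval(points, w, a, x, y):
    """Evaluate z(x, y) of Eq. 17."""
    z = a[0] + a[1] * x + a[2] * y
    for (xi, yi), wi in zip(points, w):
        z += wi * kernel((xi - x) ** 2 + (yi - y) ** 2)
    return z

# Hypothetical landmarks (not collinear) and the values to interpolate.
landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
targets = [0.0, 1.0, 1.0, 2.0, 3.0]
w, a = fit_tps(landmarks, targets)
```

For a 2-D warp the fit is run twice, once with the u coordinates and once with the v coordinates of the target landmarks as V. The surface passes exactly through the landmark values, which is the interpolation property that makes thin plate splines suited to landmark-driven warping.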
2.2 Spot-Based Methods
In spot-based methods, a list of detected spots must be supplied to the software; the warping treats the spots as individual points. When the list of spots is not generated automatically by the software, the process involves the use of landmarks that the user chooses manually in the different maps [32]. Spot-based methods are more conventional than pixel-based methods, because a list of spot features (spot location, area, and volume) can be used in the spot matching step. A disadvantage of this approach is that precise segmentation of all spots is computationally intensive and can lead to false matches. Traditional methods use polynomial functions to align the spots in two gels: the geometric relationship between the reference (R) and the distorted (D) systems can be modeled by a linear combination of M given functions [33]:
$$x^R = \sum_{j=1}^{M} a_j f_j(x^D, y^D) \quad \text{and} \quad y^R = \sum_{j=1}^{M} b_j f_j(x^D, y^D) \qquad (18)$$
where $x^R$ and $y^R$ are the transformed values of $x^D$ and $y^D$. Given N known spot pairs (x, y) from manual or automatic landmarking, the coefficients $a_j$ and $b_j$ are determined by minimizing the following error function:
$$E_N^{RD} = \frac{1}{N} \sum_{i=1}^{N} \left( r_i^{RD} \right)^2 \qquad (19)$$
where $E_N^{RD}$ is the average squared Euclidean distance between R and D and provides a measure of the overall degree of geometric distortion between the pairs of maps. Geometric distortions introduce differences in the coordinates of two identical points in the reference and distorted gels. The distortion vector connecting the two locations can be represented in polar coordinates $(r_i^{RD}, \theta_i^{RD})$, where $r_i^{RD}$ is the Euclidean distance between the two positions and $\theta_i^{RD}$ is the angle relative to the horizontal axis:
$$r_i^{RD} = \sqrt{ \left( x_i^R - x_i^D \right)^2 + \left( y_i^R - y_i^D \right)^2 } \qquad (20)$$
$$\theta_i^{RD} = \arctan\left( \frac{y_i^R - y_i^D}{x_i^R - x_i^D} \right) \qquad (21)$$
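Equations 19–21 translate directly into a few lines of Python. In this sketch the function name and the sample coordinates are hypothetical, and `math.atan2` is used in place of a plain arctangent so that the angle remains defined when the two x coordinates coincide; this is a choice of this example, not something prescribed by the text.

```python
import math

def distortion(ref_pts, dist_pts):
    """Return the polar distortion vectors (r_i, theta_i) and the mean squared error E_N."""
    vectors = []
    for (xr, yr), (xd, yd) in zip(ref_pts, dist_pts):
        r = math.hypot(xr - xd, yr - yd)       # Eq. 20
        theta = math.atan2(yr - yd, xr - xd)   # Eq. 21, quadrant-aware
        vectors.append((r, theta))
    e_n = sum(r * r for r, _ in vectors) / len(vectors)  # Eq. 19
    return vectors, e_n

# Hypothetical matched spot pairs in the reference and distorted gels.
reference = [(3.0, 4.0), (1.0, 1.0)]
distorted = [(0.0, 0.0), (1.0, 0.0)]
vectors, e_n = distortion(reference, distorted)
```

Here the two distortion vectors have lengths 5 and 1, so the error function evaluates to (25 + 1)/2 = 13.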
The basis functions $\{f_1, f_2, \ldots, f_M\}$, used to ensure the smoothness of the warp, can be nonlinear functions of both x and y, or low-order monomial functions $f_j(x, y) = x^{m_j} y^{n_j}$. For polynomial functions of order n, there are M = (n + 1)(n + 2)/2 coefficients to be determined for each dimension. There are cases where a global transformation with low-order polynomials cannot correct the geometric distortions, but several improvements have been developed and new methods to warp the gel images have been tested in recent years. Thin plate splines (TPS), which incorporate an extra function that allows nonlinear bending of the gel coordinates, and hierarchical grid transformation showed a superior warping efficiency [14, 34].
2.2.1 Iterative Closest Point Image Alignment [5]
Iterative closest point (ICP) is a procedure able to overcome the limitations of other automatic segmentation methods, which often fail to match corresponding spots or merge close spots. Standard ICP uses the Euclidean distance, but the authors propose a combination of information potential and Euclidean distance as the distance metric. The registration algorithm has been applied to synthetic and real 2-D maps [5] and gives good results in terms of accuracy and robustness.
1. Image partition. The two images to be aligned (the float image X and the fixed image Y) are partitioned into M1 × M2 equally sized rectangular sub-images Xij and Yij (i = 1, ..., M1 and j = 1, ..., M2). An independent affine transformation Tij is then defined on each sub-image. The float image is the sub-image Xij, whereas the fixed image is a sub-image that contains Yij, extending it by a multiple k of the size of sub-image Yij along the four directions. The global transformation is composed at the end, using the method of Little et al. [35].
2. ICP algorithm [36, 37]. The data and the model shape are both represented as lists of the centers of the protein spots in the two gel images to be aligned. Let P be a “data” shape (decomposed into a point set {pi}) to be aligned to a “model” shape Q. ICP consists of two steps: (a) The closest point in the model shape Q is computed for each point pi (see Note 1). The resulting point set is denoted as {yi} (see Note 2). (b) The transformation T (see Note 3) that minimizes a cost function C related to the point sets is determined:
$$\operatorname{argmin} \; C\left( y_i, T(p_i) \right) \qquad (22)$$
where argmin is the argument of the minimum, i.e., the value of the argument for which the function reaches its minimum. The algorithm then redetermines the closest point set and continues until a local minimum of the mismatch between the two shapes is found.
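The two alternating steps of ICP can be illustrated with a deliberately simplified sketch. It assumes a pure translation instead of the per-sub-image affine transformation, uses the plain Euclidean distance rather than the information-potential metric proposed by the authors, and takes the sum of squared distances as the cost C; the function names and coordinates are hypothetical.

```python
import math

def closest(p, model):
    """Step (a): closest model point to p (squared Euclidean distance)."""
    return min(model, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def icp_translation(data, model, iterations=20, tol=1e-9):
    """Estimate the translation (tx, ty) aligning the data spots to the model spots."""
    tx = ty = 0.0
    for _ in range(iterations):
        moved = [(x + tx, y + ty) for x, y in data]
        pairs = [(p, closest(p, model)) for p in moved]
        # Step (b): for fixed pairs, the translation minimizing the sum of
        # squared distances is the mean of the pairwise differences.
        dx = sum(q[0] - p[0] for p, q in pairs) / len(pairs)
        dy = sum(q[1] - p[1] for p, q in pairs) / len(pairs)
        tx, ty = tx + dx, ty + dy
        if math.hypot(dx, dy) < tol:   # no further improvement: local minimum
            break
    return tx, ty

# Hypothetical spot centers: the data set is the model shifted by (-1, -2).
model_spots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
data_spots = [(-1.0, -2.0), (9.0, -2.0), (-1.0, 8.0), (9.0, 8.0)]
tx, ty = icp_translation(data_spots, model_spots)
```

With a small initial displacement the closest-point assignment is already correct in the first iteration, so the estimate converges immediately; larger displacements may lead to a wrong local minimum.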
3. Rejection strategy. The authors applied a rejection strategy to eliminate outliers, which consists of two steps: (a) They first applied the method developed by Masuda [38], which classifies a pair of spots (Pi, Qj) as an outlier when the distance between the center (xi, yi) of spot Pi and the center (xj, yj) of spot Qj is larger than 2.5σ (σ = standard deviation of the distance). (b) The method by Pulli [39] is then applied. This strategy rejects the worst n% of the pairs, based on the Euclidean distance between the centers of spots Pi and Qj. The optimization procedure is switched from the first step to the second one when the termination condition of the first step is satisfied.
2.2.2 Matching Using Deformed Graphs [40]
Existing methods for matching the relative positions of the points in different gels usually do not take into account the structural information around the points; instead, they extract point features, which are then used for point and gel matching. Existing solutions are based on distances between pairs of points, e.g., the graduated assignment algorithm [41], which exploits Euclidean distances, and bipartite graph matching (BGM) combined with shape context (SC) [42]. The authors propose an extension of a previous work [40] in which a quadratic assignment formulation is proposed, combining SC as the linear term and structural information as the quadratic term of a cost function, which is minimized by a correspondence algorithm based on deformed graphs, to solve the gel matching problem. For complex pairs of gels, the authors couple the ICP strategy with previous knowledge provided by the user, who selects or validates a couple of corresponding spots. The authors compare the results obtained on simulated cases against well-known algorithms.
1. Spot identification. The algorithm proposed in Almansa et al. [43] is used for the detection of points (see Note 4); each spot is then described by its peak.
2. Graph matching. Graph matching dates back to the 1970s, with the work of Barrow and Popplestone [44]. Tsai and Fu [45] used attributed relational graphs (ARGs) as a natural data structure to store both kinds of information, appearance and structure: an ARG is a directed graph in which appearance information is encoded as vertex attributes, and structural relations as edge attributes. Given an input and a model ARG, graph matching finds a mapping from input to model vertices, evaluating the vertex and edge attributes through a cost function. The authors exploit the maximum common subgraph (MCS) method [46] (see Note 5). To represent each gel image, each point corresponding to a detected spot is represented by a vertex. The method matches two gel images (i.e., two graphs): a model graph Gm = (Vm, Em, μ, ν) and an input graph Gi (where vm ∈ Vm is a model vertex, μ(vm) is its attribute, em ∈ Em is a model (directed) edge, and ν(em) is its attribute; similar notations are used for the input graph). MCS maps a subset of input vertices to a subset of model vertices by minimizing the cost function:
$$E = \alpha \sum_{\text{vertices}} d_{sc} + (1 - \alpha) \sum_{\text{edges}} d_s \qquad (23)$$
where $d_{sc}$ (linear term) represents the shape context distance and compares pairs of vertices (vi, vm), directly evaluating their attributes μ(vi) and μ(vm); $d_s$ (quadratic term) is the structural distance, which takes into account geometric penalties and compares the edge attributes using deformed graphs. α ranges from 0 to 1. The equation can be reformulated as:
$$E(P) = \sum_{(v_i, v_m) \in P} E(v_i, v_m) \qquad (24)$$
where P is a set of pairs. E(P) is then minimized. The same equation can be reformulated for each pair (vi, vm) as:
$$E(v_i, v_m) = \alpha \, d_{sc}(v_i, v_m) + (1 - \alpha) \, d_s\left( G_d(v_i, v_m), G_m \right) \qquad (25)$$
The authors exploit ICP to provide a robust estimation to align the input and model graphs.
3. Shape context distance. The idea behind SC [47] is to describe each point (spot) by the distribution of the points in its neighborhood. Using a set of bins in polar coordinates, the number of points in each bin is computed to obtain a 2-D histogram. The normalized histogram at point p is denoted as $h_p(k)$ (k identifies the bin). The distance between the SCs of two points, p in one image and q in the other image, can therefore be computed as:
$$d_{sc}(p, q) = \frac{1}{2} \sum_{k} \frac{\left[ h_p(k) - h_q(k) \right]^2}{h_p(k) + h_q(k)} \qquad (26)$$
where k runs over the bins. The spot matching step leads to a matrix C where each entry $C_{pq} = d_{sc}(p, q)$. Therefore, for each spot in one image, we obtain the similarity with each spot in the other one.
4. Structural information and deformed graphs. Structural information can be used to improve the classification. The authors exploit the spatial relationships between points corresponding to the detected spots. The key idea is that each directed edge has a corresponding vector as its attribute. The relative positions are evaluated by comparing two given vectors, $\vec{v}_1$ and $\vec{v}_2$, in terms of the angle between them and their lengths [48]:
$$c_{vec}(\vec{v}_1, \vec{v}_2) = \delta \, \frac{\left| \cos\theta - 1 \right|}{2} + (1 - \delta) \, \frac{\left| \, |\vec{v}_1| - |\vec{v}_2| \, \right|}{C_s} \qquad (27)$$
where θ is the angle between the two vectors and $|\vec{v}|$ is the vector length. The first term represents the “angular” cost (assigning higher values to “opposite” vectors), while the second term represents the “modular” cost (assigning a value proportional to the difference of the lengths, normalized by a constant $C_s$ to keep the values in the range [0, 1]). δ ranges from 0 to 1. If outliers are present (e.g., due to differences in protein content and detection errors), the two graphs have different numbers of vertices, making the structural evaluation costly. An auxiliary graph, called deformed graph, is used in these cases. The deformed graph $G_d(v_i, v_m)$ is almost an exact copy of $G_m$ (same number of vertices and edges), with the same attributes for the edges, except for those corresponding to the “deformed edges”, obtained by replacing the coordinates of $v_m$ by the coordinates of $v_i$ in the corresponding copy of $v_m$ in the deformed graph (indicated as $v_d$).
5. Structural distance. For a given pair (vi, vm), its corresponding deformed graph $G_d(v_i, v_m)$ is first computed. Then, the structural term $d_s$ is calculated by:
$$d_s\left( G_d(v_i, v_m), G_m \right) = \frac{1}{|E(v_d)|} \sum_{e_d \in E(v_d)} c_{vec}\left( \nu(e_d), \nu(e_m) \right) \qquad (28)$$
where $E(v_d)$ is the set of deformed edges with an endpoint at $v_d$; |S| is the size of set S, $e_m$ is the model edge corresponding to the deformed edge $e_d$, and $\nu(e_d)$ and $\nu(e_m)$ are the vectors corresponding to their edge attributes. When considering structural information, each vi ∈ Vi tends to be associated with the nearest vm ∈ Vm.
2.2.3 Integrated Hierarchical-Based and Optimization-Based Spot Matching [49]
The authors present a segmentation algorithm for protein spot identification and an algorithm for matching protein spots. Here only the spot matching algorithm is presented, which integrates hierarchical- and optimization-based methods. The hierarchical method is first used to find corresponding pairs of protein spots satisfying the local cross-correlation and overlapping constraints. A matching energy function based on local structure similarity, image similarity, and spatial constraints is then optimized. Initially, the intensities of both images are scaled to the same range. Given Ia and Ib as two input images, I′a and I′b are the corresponding images after rescaling. Let sai and sbj be the detected spots of Ia and Ib (i = 1, ..., Na and j = 1, ..., Nb, where Na and Nb are the numbers of spots in Ia and Ib). For each level h, Sa,h and Sb,h include all spots with an intensity at the centroid lower than a threshold Kh (Kh < Kh+1, h > 0):
$$S_{a,h} = \{\, s_{ai} \mid I'_a(c_p(s_{ai})) < K_h \,\} \tag{29a}$$
$$S_{b,h} = \{\, s_{bj} \mid I'_b(c_p(s_{bj})) < K_h \,\} \tag{29b}$$
where cp is the function that gives the 2-D coordinates of the centroid of the input spot, and I′a(cp(sai)) and I′b(cp(sbj)) are image intensities at the spot centroids of the rescaled images. The matching score Mai,bj is computed as:
$$M_{ai,bj} = \alpha R_{ai,bj} + (1 - \alpha)\, O_{ai,bj} \tag{30}$$
where α is a weighting parameter. For each level h, (Rai,bj, Oai,bj) is computed for every spot pair with sai in Sa,h and sbj in Sb,h. At level h, a spot pair (sai, sbj) is considered a pair of corresponding spots (included in Ph) if it satisfies two conditions: i and j identify the spot pair for which the maximum M value is obtained, and the M and O values calculated for the spot pair exceed the two threshold values Mt and Ot, respectively. All spot pairs included in Ph are passed to Ph+1 at level h + 1. In matching at level h + 1, spot correspondence is sought only among spots in Sa,h+1 and Sb,h+1 that are not in Ph. All matching results are in Ph0 (h0 is the highest level (see Note 6)).
1. Hierarchical-based method for initialization of spot matching. For each level h, spots are matched on the basis of the local image cross-correlation coefficient and overlapping constraint of the two images. The local images of spots sai and sbj are within square areas (xai ± Δ, yai ± Δ) and (xbj ± Δ, ybj ± Δ) in Ia and Ib, respectively (cp(sai) = (xai, yai), cp(sbj) = (xbj, ybj)). Δ is a constant. Given lai and lbj (local images of sai and sbj), their correlation coefficient is computed as:
$$R_{ai,bj} = \frac{1}{2}\left( \frac{\sum_x \sum_y \big(l_{ai}(x,y) - \bar{l}_{ai}\big)\big(l_{bj}(x,y) - \bar{l}_{bj}\big)}{L_{ai,bj}} + 1 \right) \tag{31}$$
$$L_{ai,bj} = \sqrt{\sum_x \sum_y \big(l_{ai}(x,y) - \bar{l}_{ai}\big)^2 \big(l_{bj}(x,y) - \bar{l}_{bj}\big)^2} \tag{32}$$
where l̄ai and l̄bj are the mean intensities of lai and lbj, respectively, and x is within [xai − Δ, xai + Δ] and y in the range [ybj − Δ, ybj + Δ]. Rai,bj ranges from 0 to 1 (values close to 1 correspond to highly similar local images). The overlapping constraint is calculated from the overlapping area between the two images. When a spot in one image is shifted to its corresponding spot in the other image, the whole image is also shifted, and the two images overlap. Suppose sai and sbj are matching spots with the image shifting vector Tai,bj = [xai − xbj, yai − ybj]; the overlapping ratio of sai and sbj is computed as:
$$O_{ai,bj} = \frac{\big| I_a \cap \big(I_b + \vec{T}_{ai,bj}\big) \big|}{0.5\,\big(|I_a| + |I_b|\big)} \tag{33}$$
where (Ib + Tai,bj) is the shifted image of Ib, | | indicates the area of the image, and ∩ is the intersection operator (see Note 7).
2. Energy optimization-based method for spot matching. The algorithm just described is efficient in finding correspondences between dark spots and spots with dark neighboring spots, but it can miss other correspondences; this problem is overcome by a further step in which a matching energy function is defined. If Ψ contains the corresponding points from the initial step, Pa and Pb are the ordered sets of spots paired in Ψ. Therefore:
$$\Psi = \{\, (p_{ak}, p_{bk}) \mid p_{ak} \in P_a,\ p_{bk} \in P_b \,\} \tag{34}$$
where k = 1, 2, ..., |Ψ|. Ua and Ub are sets of spots that have not been paired yet. The matching energy function of two unpaired spots uai (belonging to Ua) and ubj (belonging to Ub) is defined as follows. First, two ordered sets of vectors, Vai and Vbj of uai and ubj, are calculated:
$$V_{ai} = \{\, \vec{v}_{ak} = c_p(u_{ai}) - c_p(p_{ak}) \mid p_{ak} \in P_a,\ 1 \le k \le M \,\} \tag{35a}$$
$$V_{bj} = \{\, \vec{v}_{bk} = c_p(u_{bj}) - c_p(p_{bk}) \mid p_{bk} \in P_b,\ 1 \le k \le M \,\} \tag{35b}$$
where pak are the M nearest spots of uai and pbk are the corresponding spots of pak as defined in Ψ; cp is the function giving the 2-D coordinates of the centroids of the spots. The local structure similarity between two spots is computed as:
$$E_{dist(ai,bj)} = \frac{1}{M} \sum_{k=1}^{M} \left[ 1 - \frac{\min\big(\|\vec{v}_{ak}\|, \|\vec{v}_{bk}\|\big)}{\max\big(\|\vec{v}_{ak}\|, \|\vec{v}_{bk}\|\big)} \right] \tag{36}$$
$$E_{orient(ai,bj)} = \frac{1}{M} \sum_{k=1}^{M} \left[ \frac{1}{\pi} \arccos\left( \frac{\vec{v}_{ak} \cdot \vec{v}_{bk}}{\|\vec{v}_{ak}\|\,\|\vec{v}_{bk}\|} \right) \right] \tag{37}$$
where || || indicates the Euclidean norm of the vector. Edist and Eorient are based on distance similarity and orientation similarity of two local spot structures. The other energies are the image energy, Eimage, and the constraint energy, Econst. Eimage is defined from image cross-correlation, while Econst is computed from the overlapping ratio Oai,bj:
$$E_{image(ai,bj)} = 1 - R_{ai,bj} \tag{38}$$
$$E_{const(ai,bj)} = 1 - O_{ai,bj} \tag{39}$$
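The two local structure energies of Eqs. 36 and 37 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name and the array layout (one row per neighbor vector, rows paired by index k) are assumptions made here, and zero-length vectors are not handled.

```python
import numpy as np

def local_structure_energies(v_a, v_b):
    """Return (E_dist, E_orient) for two (M, 2) arrays of paired neighbor vectors.

    Hypothetical helper: v_a[k] and v_b[k] play the role of the vectors
    v_ak and v_bk; all vectors are assumed to have non-zero length.
    """
    v_a = np.asarray(v_a, dtype=float)
    v_b = np.asarray(v_b, dtype=float)
    norm_a = np.linalg.norm(v_a, axis=1)
    norm_b = np.linalg.norm(v_b, axis=1)
    # Eq. 36: one minus the ratio of the shorter to the longer vector, averaged over k.
    e_dist = np.mean(1.0 - np.minimum(norm_a, norm_b) / np.maximum(norm_a, norm_b))
    # Eq. 37: angle between paired vectors, divided by pi so that it lies in [0, 1].
    cos_angle = np.sum(v_a * v_b, axis=1) / (norm_a * norm_b)
    # Clipping guards against rounding errors slightly outside [-1, 1].
    e_orient = np.mean(np.arccos(np.clip(cos_angle, -1.0, 1.0)) / np.pi)
    return e_dist, e_orient
```

Identical local structures give both energies equal to 0, while vectors of equal length pointing in opposite directions give a distance energy of 0 and an orientation energy of 1.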
All energies range between 0 and 1. A vector of the four energies of spots uai and ubj is defined as:
$$\vec{E}_{ai,bj} = \left[ E^2_{dist(ai,bj)},\ E^2_{orient(ai,bj)},\ E^2_{image(ai,bj)},\ E^2_{const(ai,bj)} \right]^T \tag{40}$$
The matching energy of the spots is then defined as:
$$E_{match(ai,bj)} = \sqrt{\vec{E}_{ai,bj} \cdot \vec{W}} \tag{41}$$
$$\vec{W} = \alpha \left[ \alpha_d^2,\ \alpha_o^2,\ \alpha_i^2,\ \alpha_c^2 \right]^T \tag{42}$$
$$\alpha = \frac{1}{(\alpha_d + \alpha_o + \alpha_i + \alpha_c)^2} \tag{43}$$
where W is the vector of weighting parameters associated with the four energies (αd, αo, αi, αc range between 0 and 1). The optimization of the matching energy is based on an algorithm that finds corresponding spots in Ua and Ub (see Note 8).
2.3 Pixel-Based Methods
In pixel-based methods, image warping is performed directly on the raw image, which is considered as a surface formed by pixel intensities [12]. Warping on the raw pixel values uses numerous additional features, like spot shape and intensity spread, that are otherwise lost in the spot detection approach. In this approach the correlation corr(I1, I2) between two intensity surfaces I1 and I2 is maximized:
$$\mathrm{corr}(I_1, I_2) = \frac{\mathrm{cov}(I_1, I_2)}{\sqrt{\mathrm{cov}(I_1, I_1)\,\mathrm{cov}(I_2, I_2)}} \tag{44}$$
where cov(I1, I2) is the covariance between two images:
$$\mathrm{cov}(I_1, I_2) = \frac{1}{|D|} \sum_{(x,y) \in D} \big(I_1(x,y) - \bar{I}_1\big)\big(I_2(x,y) - \bar{I}_2\big) \tag{45}$$
D is the domain of the points to be considered for the registration process. An advantage of using the correlation as a measure of similarity is that it is invariant to changes both of the means Ī1 and Ī2 and of the variances cov(I1, I1) and cov(I2, I2) over the domain D. One of the first pixel-based algorithms for 2-DE analysis was a multiresolution approach with resampling by cubic convolution, in order to remove the global distortions at lower resolutions and the local distortions at higher resolutions. The disparity between the gels was estimated by minimizing the sum of squared differences, which is basically equivalent to the cross-correlation technique [50]. The first commercial software that performed pixel-based registration was Z3 (Compugen, Tel Aviv, Israel) [11]. In this case the spots are segmented and then a score taking into consideration the number of spots, their contrast, and area is attached to each rectangle. This score is used for ruling out rectangles strongly overlapping with higher score rectangles. The collection of all shift vectors is used to generate a global transformation starting from translational (shift), linear (rotation and scaling), and finally a Delaunay transformation (piecewise bilinear mapping).
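The correlation measure of Eqs. 44 and 45 can be illustrated with the following minimal sketch. It assumes that the domain D is the whole image and that both images have the same shape; the function name is hypothetical and constant images (zero variance) are not handled.

```python
import numpy as np

def image_correlation(i1, i2):
    """Correlation between two equally sized intensity surfaces (Eqs. 44-45).

    Assumption of this sketch: the domain D is the full image.
    """
    i1 = np.asarray(i1, dtype=float)
    i2 = np.asarray(i2, dtype=float)
    d1 = i1 - i1.mean()  # deviations from the mean of I1
    d2 = i2 - i2.mean()  # deviations from the mean of I2
    cov12 = np.mean(d1 * d2)  # Eq. 45
    return cov12 / np.sqrt(np.mean(d1 ** 2) * np.mean(d2 ** 2))  # Eq. 44
```

Rescaling the intensities of one image by a positive factor, or adding a constant offset, leaves the value unchanged, which is the invariance property mentioned above.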
Another approach, called pixel-based analysis of multiple images for the identification of changes (PMC), which can be used for multiple gel images, was presented by Færgestad [51]. The method includes a first normalization of the gel images to adjust their pixel intensity to a constant protein amount and a background correction to remove streaks and background staining defects. The normalization is performed by dividing each image by the total intensity of the image (rescaled to a common [0–1] scale across all images); background correction is instead carried out using two approaches based on different principles: the first one consists of repeatedly fitting polynomial curves to a signal, and the second one uses the penalized asymmetric least squares (PALS) approach, extended to 2-D data [52–54] (for more details see Chapter 17). The spot volume analysis is performed by defining the boundaries around the spots for each gel separately and across all images. Then the 1-D pixel vectors of the unfolded images can be analyzed by multivariate data analysis tools (see Chapter 14 for more details), e.g., principal component analysis (PCA) [55] followed by partial least squares regression (PLSR) [55]. The authors show how this approach solves persistent problems like mismatch of protein spots, erroneous missing values, and failure to detect variations in overlapping proteins [51]. In general, after background and noise elimination, several algorithms can be used. Fuzzy warping algorithms are often employed before the matching step [56]. The technique starts with the identification of the maxima of the most intense spots in all images; the spot correspondence is then computed using an iterative approach that first calculates the correspondence based on their distance and then calculates the transform parameters.
The spot correspondence is calculated by centering two-dimensional Gaussian functions at the centers of the k identified spots in the target image, each associated with its width parameter σ, chosen so that the Gaussians overlap. The output of each Gaussian function is then calculated for all the identified spots of the warped image (see Fig. 2). This results in the output matrix, G(k, h) (k and h are the numbers of spots identified in the target image and in the image to be warped, respectively). The elements of this matrix vary in the range 0–1. If the element gi,j is very small, the i-th spot in the target and the j-th spot in the warped image are far from each other. If the two spots are close to each other, their corresponding element gi,j is close to one [56–59]. To find the one-to-one correspondence between the spots, the Sinkhorn iterative standardization [60] is used; the resulting correspondences are then used for the calculation of the transformation parameters. In many cases, the global procedure used for the correspondences estimation is not efficient enough to compensate for local distortions of the warped images,
Fig. 2 Fuzzy matching through the 2-D Gaussian function centered at the maximum of the spot in the target image. Reproduced from ref. 56
but the local piecewise linear transform can be applied for further improvement of the correlation of the images (for more details see Chapter 17). Block-matching techniques [61] have been widely used to process images in several fields; they represent a good approach in scenarios with deformable structures such as 2-DE gels. Recently, a new iterative block-matching technique, based on successive deformation, search, fitting, filtering, and interpolation stages, has been proposed, to measure elastic displacements in 2-D maps. The algorithm uses a multiresolution iterative process integrated with several deformation models that can be used as restrictions, to guarantee the smoothness and freedom of the flow, and to interpolate a dense flow for each pixel from an arbitrary set of points, allowing the registration of a warped image. The technique divides the image into regular regions called blocks, and solves the correspondence problem for each block. An estimation of the deformed image is obtained from the dense field using a bilinear model to obtain non-discrete image values. The deformed image is used in the next iteration to refine the results. The Pearson correlation quotient (R), which measures the strength of linear dependence between two variables, has been used. The Pearson correlation quotient has the advantage of being invariant to the average gray level of the images. The data obtained in the
similarity analysis led to the most probable discrete displacement. This technique represents a general solution, being easy to adapt to different 2-D deformable cases and providing an experimental reference for block-matching algorithms [61].
2.3.1 Warping to Correct for Geometric Distortions of the Spot Pattern [14]
This method consists of two steps:
1. Warping is applied to correct for current leakage along the sides of the gel in the second dimension. The effect of current leakage on the protein pattern is modelled by a simple physical-chemical model. The estimation of the orientation directions is quite difficult in 2-D gels: in this case, an easily identifiable object with a direction is the gel front, at the bottom of the gel, which is deformed from a straight line in the presence of current leakage. The amount of current leakage is therefore estimated from the curve shape of the gel front.
2. Current leakage-corrected images are automatically aligned to correct for other sources of geometric distortions. The authors applied the approach proposed by Glasbey and Mardia [27]: the compromise between good matching and rough distortions is guaranteed by the maximization of a penalized likelihood. The likelihood measures the similarity between aligned images while the penalty is a distortion criterion (uniquely minimized by a null set of functions) (see Note 9).
2.3.2 Multiresolution Image Registration [12]
With the multiresolution approach, both global and local deformations can be handled: using low-resolution images the coarse misalignments are eliminated and finer distortions can be dealt with on the next resolution level. This process is reiterated until the final optimal transformation is reached. Given two images I1 and I2, registration finds a transformation t minimizing the difference between I1 and t(I2), given a similarity measure. This problem is converted into an optimization problem: the parameter vector c that maximizes the function f(c) = corr(I1, tc(I2)) has to be identified. The authors applied piecewise bilinear maps (PBM) to represent the deformation. Multiresolution decomposition can be applied to PBMs resulting in a hierarchy of transformation spaces [62–68], where transformations from spaces with higher dimension model more localized and finer details. The transformation is defined by a map m mapping points in the target image onto points of the source image (m can be used to determine the intensities at different points p in the target image by sampling them from points m(p) in the source image). A PBM is a lattice of maps: the target image is first partitioned into 2^l × 2^l regular squares (l = level of detail). For a given level and a given index (i,j), the vertices ai,j, ai+1,j, ai,j+1, ai+1,j+1 of the corresponding square in the target image and the control points ci,j, ci+1,j, ci,j+1, ci+1,j+1 of the corresponding quadrilateral in the source define the mapping function m. A point q within a square ai,j, ai+1,j, ai,j+1, ai+1,j+1 in the target image is mapped according to a weighted sum of the corresponding control points:
$$m(q) = \omega_{i,j}(u,v)\,c_{i,j} + \omega_{i+1,j}(u,v)\,c_{i+1,j} + \omega_{i,j+1}(u,v)\,c_{i,j+1} + \omega_{i+1,j+1}(u,v)\,c_{i+1,j+1} \tag{46}$$
where ωi,j(u,v) = (1 − u)(1 − v), ωi+1,j(u,v) = u(1 − v), ωi,j+1(u,v) = (1 − u)v, ωi+1,j+1(u,v) = uv, while u and v are the horizontal and vertical ratios of point q within the square. The vertices ai,j in the target image are fixed at a given detail level, while the control points ci,j can vary to represent different maps. For a given detail level a PBM is defined by a parameter vector c = (ci,j) containing the coordinates of the control points. All possible parameter vectors of a certain detail level l define the linear space Tl. By subdividing each square in the target image into four smaller squares, the linear space Tl+1 for the next higher detail level is defined. Each parameter vector in Tl can be subdivided such that it becomes a parameter vector in Tl+1, representing the same mapping. The subdivision scheme is defined by:
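$$c^{l+1}_{i,j} = \begin{cases}
c^{l}_{a,b} & i \text{ even},\ j \text{ even} \\
\tfrac{1}{2}\big(c^{l}_{a,b} + c^{l}_{a+1,b}\big) & i \text{ odd},\ j \text{ even} \\
\tfrac{1}{2}\big(c^{l}_{a,b} + c^{l}_{a,b+1}\big) & i \text{ even},\ j \text{ odd} \\
\tfrac{1}{4}\big(c^{l}_{a,b} + c^{l}_{a+1,b} + c^{l}_{a,b+1} + c^{l}_{a+1,b+1}\big) & i \text{ odd},\ j \text{ odd}
\end{cases} \tag{47}$$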
where a = i/2, b = j/2 (rounded down to integers). By subdividing the map between the target image and the source image, the number of parameters increases approximately by a factor of 4. By varying these new parameters it is possible to control more local distortions for the corresponding transformation in the given image. The authors exploit a similarity measure defined as:
$$\mathrm{corr}(I_1, I_2) = \frac{\mathrm{cov}(I_1, I_2)}{\sigma(I_1)\,\sigma(I_2)} \tag{48}$$
Corr varies in the range [−1, 1] (where 1 corresponds to a perfect alignment of the two images). The optimization procedure to provide the optimal transformation parameters is based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [69].
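The piecewise bilinear map of Eq. 46 and the refinement to the next detail level can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the target image is taken to be the unit square, control points are stored in an array of shape (2^l + 1, 2^l + 1, 2), the function names are hypothetical, and the subdivision uses the midpoint-averaging rule that leaves a bilinear map unchanged.

```python
import numpy as np

def identity_control(level):
    """Control points c_{i,j} of the identity map on the unit square."""
    grid = np.linspace(0.0, 1.0, 2 ** level + 1)
    xs, ys = np.meshgrid(grid, grid, indexing="ij")
    return np.stack([xs, ys], axis=-1)

def pbm_map(q, control, level):
    """Map a target point q = (x, y) in [0, 1]^2 into the source image (Eq. 46)."""
    n = 2 ** level
    x, y = q
    # Index of the square containing q; points on the far border use the last square.
    i = min(int(x * n), n - 1)
    j = min(int(y * n), n - 1)
    u = x * n - i  # horizontal ratio of q inside the square
    v = y * n - j  # vertical ratio of q inside the square
    return ((1 - u) * (1 - v) * control[i, j] + u * (1 - v) * control[i + 1, j]
            + (1 - u) * v * control[i, j + 1] + u * v * control[i + 1, j + 1])

def subdivide(control):
    """Refine control points to the next detail level without changing the map."""
    n = control.shape[0] - 1
    fine = np.zeros((2 * n + 1, 2 * n + 1, 2))
    fine[::2, ::2] = control                                    # i even, j even
    fine[1::2, ::2] = 0.5 * (control[:-1, :] + control[1:, :])  # i odd, j even
    fine[::2, 1::2] = 0.5 * (control[:, :-1] + control[:, 1:])  # i even, j odd
    fine[1::2, 1::2] = 0.25 * (control[:-1, :-1] + control[1:, :-1]
                               + control[:-1, 1:] + control[1:, 1:])  # i odd, j odd
    return fine
```

After subdivision the roughly fourfold larger set of control points represents the same mapping; moving individual fine-level control points then models more local distortions.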
2.3.3 Fast Automatic Registration of Images Using a Complex Wavelet Transform [70]
Wavelet transforms [71–77] are multiresolution methods, giving a measure of the correlation between the image and a chosen “mother wavelet” function, which is stretched and moved; the correlation is then determined for each dilatation and position. The stretching of the wavelet allows the study of different effects: long wavelets highlight large-scale features, while short wavelets highlight small-scale features of the image. The sum of all scales allows a multiresolution representation of the image [71]. If special orthonormal wavelets are chosen, the transform can be perfectly inverted to reconstitute the original data. The authors exploit a newly developed complex wavelet transform [78–80]. 1. Parameterization of the complex wavelet transform. Phase differences between warped but otherwise identical images give information about the relative movement of regions of the images [78–80]. Convolution-based transforms, like the wavelet transforms, store information locally. The structure of a complex wavelet transform can be similar to the discrete wavelet transform using the Mallat algorithm [76], except that the use of complex filters requires complex coefficients and generates a complex output. Since it is difficult to obtain a perfect reconstruction and good frequency features by using filters in a single tree, the fast dual tree wavelet complex transform (DTWT) can be used (exploiting two separate Mallat algorithm trees in parallel), or a Q-shift transform (QSWT). An extension of the DTWT and QSWT to two dimensions is easily achieved by separate filtering along the columns and the rows. The final output consists of six sub-images at each scale. Each of these has its own directionality, in which the phase is most sensitive to displacement. 2. QSWT-warping. The authors exploit the QSWT approach. 
The squared difference surface (SDS) is calculated for each pel in each sub-band at each scale (a pel is the scope of a wavelet coefficient, i.e., the area of the image covered by the coefficient at the scale, such that at scale m a pel is 2^m × 2^m pixels). SDS is a measure of similarity between the current pel in the test image and the displaced versions of the reference image; it is formed by a 2-D scan of the displacements around the center of the reference pel for each sub-band. A surface is provided by summing all the sub-bands. The minimum of the resulting surface gives the best estimate of the magnitude and direction of the displacement between test and reference pels. Computing this for each pel requires a 2-D scan of the surface, followed by curve fitting to identify the exact minimum. Phase always varies from −π to π, independently of the value of the coefficient (i.e., null areas are as important as those containing significant signals). Only the most reliable movement estimates are picked and used to interpolate the complete movement map, while
unreliable estimates are rejected. This procedure is first carried out at the longest scale of the DTWT decomposition, to derive the movement map at the coarsest resolution; then the results are carried through to the next shortest scale, and the procedure is repeated hierarchically down to the shortest scale and finest resolution of the decomposition. The movement bases for each sub-band can be vector summed to produce the overall motion vector for that pel (see Note 10).
3. Scale selection. The longest scales hold information on the background and tend to comprise only low amplitude coefficients; the shortest scales hold information on the spots; those in between hold information on the warping field. For registration, only scales longer than the peak scale should be used: the peak scale carries the information on the spots, while gel warping occurs at scales longer than individual spot sizes. To form a movement map, the longest and shortest scales can be rejected as a form of denoising. Registration is carried out using the remaining scales.
2.3.4 Autonomous Subimage Matching by MaxRST Algorithm [81]
The method, based on a robust Gaussian similarity measure and a maximum relation spanning tree (MaxRST) strategy, provides an autonomous matching method not requiring manual registration or manual selection of landmarks. The Gabriel graph (GG) [82] and relative neighborhood graph (RNG) [23] are exploited to build a proximity graph of the 2-D image; features are then extracted from the GG and RNG graph models for further matching. Significant spot detection is accomplished by the directed gradient Watershed transformation model [83]. Considering a set of points S chosen from a fractional area of the sample gel image, the aim is to search for all the occurrences of point sets T, which are the best match with S in the reference images. The similarity between protein points is calculated from the features of the proximity graph. The MaxRST algorithm is derived from the minimum-cost spanning tree algorithm [84], where the minimum cost is replaced by the maximum Gaussian similarity measure.
1. Global match. The 2-D map image is first represented by GG and RNG. Rather than matching the graph directly, the resultant GG and RNG are built. The initial anchor point pair is found automatically. For each vertex u of the sample proximity graph, the variances of two features are then computed: degree and distance between all points v in the reference graph. The first one is defined, for point pair (u, v), as:
$$\mathrm{Var}_{deg}(u,v) = \sqrt{\big(\deg_{GG}(u) - \deg_{GG}(v)\big)^2 + \big(\deg_{RNG}(u) - \deg_{RNG}(v)\big)^2} \tag{49}$$
where degGG() and degRNG() are the vertex degrees in GG and RNG, respectively. The distance variance of point pair (u,v) is given by:
$$\mathrm{Var}_{distance}(u,v) = \sqrt{\Big[\sum_i d_{GG}(u,u'_i) - \sum_j d_{GG}(v,v'_j)\Big]^2 + \Big[\sum_k d_{RNG}(u,u'_k) - \sum_l d_{RNG}(v,v'_l)\Big]^2} \tag{50}$$
where Σi dGG(u, u′i) and Σj dGG(v, v′j) are the sums of all distances dGG(u, u′i) of the i neighbors of vertex u and distances dGG(v, v′j) of the j neighbors of vertex v, respectively, in the GG. Similar notations are defined for the RNG for neighbors k and l. The initial anchor point pair is determined by:
$$\mathrm{Corr}(u,v) = 1 - \frac{\mathrm{Var}_{deg}(u,v)}{\mathrm{Max}\big(\mathrm{Var}_{deg}(u,v)\big)} \cdot \frac{\mathrm{Var}_{distance}(u,v)}{\mathrm{Max}\big(\mathrm{Var}_{distance}(u,v)\big)} \tag{51}$$
where Max(Vardeg(u, v)) and Max(Vardistance(u, v)) are the maximum values of Vardeg(u, v) and Vardistance(u, v), for all u in the sample image and v in the reference image. If Corr(u,v) ≥ ε, then (u, v) is considered a candidate anchor point pair (see Note 11). Landmarks are chosen automatically based on larger values of correlation and they are collected and sorted in a queue according to their correlation value.
2. Gaussian similarity measure. The n-th feature of spot s in the sample image is indicated as fn(s), and the corresponding feature of spot r in the reference image is indicated as fn(r). μfn(s,r) is the difference function between fn(s) and fn(r):
$$\mu_{f_n}(s,r) = \exp\left( -\frac{\big(f_n(s) - f_n(r)\big)^2}{2\sigma^2} \right) \tag{52}$$
where σ² is the variance of fn, and μfn(s,r) ranges between 0 and 1. Larger values are obtained for greater similarities. All similarity measures are summed up as a weighted mean value [85]:
$$R(s, r; \omega_{f_1}, \omega_{f_2}, \ldots, \omega_{f_n}) = \sum_{i=1}^{n} \omega_{f_i}\, \mu_{f_i}(s,r) \tag{53}$$
where ωfi is the weight of feature i and Σi ωfi = 1 (see Note 12). The matching performance worsens significantly if both distance and angle features are excluded, showing that they are important.
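The Gaussian similarity of Eq. 52 and its weighted combination in Eq. 53 can be sketched as below. The function names, the use of one spread parameter σ per feature, and the example values are assumptions made for illustration only.

```python
import numpy as np

def gaussian_similarity(f_s, f_r, sigma):
    """Eq. 52: similarity in (0, 1] between feature values of spots s and r."""
    return np.exp(-(f_s - f_r) ** 2 / (2.0 * sigma ** 2))

def weighted_similarity(features_s, features_r, sigmas, weights):
    """Eq. 53: weighted mean of the per-feature Gaussian similarities.

    All arguments are sequences with one entry per feature; the weights
    must sum to 1, as required in the text.
    """
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("feature weights must sum to 1")
    mu = gaussian_similarity(np.asarray(features_s, dtype=float),
                             np.asarray(features_r, dtype=float),
                             np.asarray(sigmas, dtype=float))
    return float(np.sum(weights * mu))
```

Two spots with identical feature values obtain R = 1, and R decreases smoothly as any weighted feature drifts apart, which is what makes a fixed acceptance threshold on R usable in the local matching step.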
3. Local match: MaxRST matching. Once the anchor points are identified, the MaxRST local matching is applied to the GG and RNG. The neighbors are the vertices adjacent to the anchor point. The Gaussian similarity measure R of the neighbors between sample and reference graphs is then computed on the basis of five features: GG degree, RNG degree, GG distance, GG angle, and GG distance to root. If the similarity measure is larger than 0.6, the tree is extended. Finally, the neighbor spot pair with the maximum affinity is chosen as the new anchor point pair. The next comparison procedure is then performed. The MaxRST algorithm runs recursively until all the candidate anchor point pairs are determined (see Note 13). The candidate anchor point pairs are determined during the global matching; the matching starts from the anchor point pair with the largest correlation. Starting from s and r, the similarity measure R(s′, r′) of the neighbors of s and r is calculated. The neighbor pairs are placed and sorted into a candidate queue if their similarity exceeds 0.6. If R(s′, r′) has the maximum value, the algorithm moves from node s to s′ in the sample gel graph and from r to r′ in the reference gel graph. Nodes s′ and r′ are considered as the new anchor points and their corresponding neighbors are compared in the same way. The algorithm stops when all the candidate spots in the sample gel graph have been tested. Then, a rectangular area is plotted to cover the path from s to the termination node of the spanning tree and the ratio of the area difference in the corresponding rectangles is computed:
$$D(A_s, A_r) = 1 - \frac{\|A_s \cap A_r\|}{\big(\|A_s\| + \|A_r\|\big)/2} \tag{54}$$
where As and Ar are the matched areas in the sample image and reference image, ∩ is the intersection operator, and ‖·‖ is the size of the matched area. An area difference ratio below 0.2 corresponds to a match; otherwise the next anchor point pair from the candidate queue is chosen. Small values of D(As, Ar) correspond to rigid constraints (see Note 14).
2.3.5 Matching 2-D Maps with Matlab "Image Processing Toolbox"
A fuzzy warping procedure [56–59, 86] is applied by the authors in [87], exploiting the Matlab “Image Processing Toolbox” for gel image analysis.
1. Image matching. If I1 and I2 are two images to be matched (where I1 is the reference; see Note 15), the matching consists of two steps: (1) geometric transform of the grid of image I1, and (2) intensity interpolation. There are two possible geometric transforms:
(a) Inverse transform: x2 = f(x1, y1) + e (or x̂2 = f(x1, y1)); y2 = f(x1, y1) + e (or ŷ2 = f(x1, y1)).
(b) Forward transform: x1 = g(x2, y2) + e (or x̂1 = g(x2, y2)); y1 = g(x2, y2) + e (or ŷ1 = g(x2, y2)),
where x and y are the coordinates (pixel positions) of the image features (subscripts indicate image 1 or 2). The transform maps a set of integers (x, y) to a set of real numbers (x̂, ŷ). The inverse geometric transform represents the coordinates of image I2 as a function of the features of image I1. In the forward transform, some pixels can be mapped to points outside the image, and/or multiple points of I2 can be mapped to one and the same pixel location in the output image [56, 58, 59, 86, 88]. To apply the transform, its form (i.e., its order) must be selected, and the coordinates of the corresponding features (markers) of the warped images must be available. Then, the whole grid of I1 can be transformed to the new coordinates x̂2, ŷ2. As the new coordinates x̂2, ŷ2 do not overlap with the mesh of I2, intensities at x̂2, ŷ2 must be calculated by interpolation.
2. Image matching with Matlab "Image Processing Toolbox." The basic subroutines in Matlab "Image Processing Toolbox" for image matching are:
(a) Selection of control points: [F1, F2] = cpselect (I1, I2)
(b) Correction of the control points coordinates: [F1, F2] = cpcorr (F2, F1, I2, I1)
(c) Calculation of the transform parameters based on the selected markers: tform = cp2tform (F2, F1, 'transform', order)
(d) Image transform (grid transform and intensity interpolation): I2t = imtransform (I2, tform, 'interpolation method').
Warping can be turned into an automatic procedure if fuzzy warping (FW) is included in the scheme. Manual marker selection (cpselect subroutine) is therefore modified and the corresponding features (control points) are identified by fuzzy warping, where feature correspondence and transform parameters are calculated simultaneously.
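The two-step scheme of image matching (inverse geometric transform of the reference grid, followed by intensity interpolation) can be illustrated independently of the Matlab subroutines. The following NumPy sketch is an assumption-laden illustration: the function name and the callable `transform` are hypothetical, bilinear interpolation is chosen as one possible interpolation method, and coordinates falling outside I2 are simply clamped to the image border.

```python
import numpy as np

def warp_inverse(image, transform):
    """Resample `image` (I2) on the pixel grid of the reference image.

    `transform(x1, y1)` is a hypothetical callable returning the real-valued
    coordinates (x2_hat, y2_hat) in I2 for every reference grid position.
    The image must be at least 2 x 2 pixels.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    y1, x1 = np.mgrid[0:h, 0:w]  # integer grid of the reference image
    x2, y2 = transform(x1.astype(float), y1.astype(float))
    # Clamp to the image so that every output pixel receives a value.
    x2 = np.clip(x2, 0, w - 1)
    y2 = np.clip(y2, 0, h - 1)
    # Top-left neighbor of each real-valued position and fractional offsets.
    x0 = np.minimum(np.floor(x2).astype(int), w - 2)
    y0 = np.minimum(np.floor(y2).astype(int), h - 2)
    fx = x2 - x0
    fy = y2 - y0
    # Bilinear interpolation between the four surrounding pixels of I2.
    return ((1 - fx) * (1 - fy) * image[y0, x0] + fx * (1 - fy) * image[y0, x0 + 1]
            + (1 - fx) * fy * image[y0 + 1, x0] + fx * fy * image[y0 + 1, x0 + 1])
```

Because every pixel of the output grid is assigned exactly one interpolated value, the inverse formulation avoids the holes and multiply-assigned pixels that the forward transform can produce.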
3. Fuzzy warping. Fuzzy warping [56–59] can be summed up as follows:
(a) Identify the coordinates of the maxima of the most intense spots to be matched (sets F1 and F2 of dimensionality (m1 × 2) and (m2 × 2)).
(b) Build a similarity matrix S (m1 × m2) whose elements are the outputs of Gaussian functions centered at the features F1, for all features F2.
(c) Augment the S matrix with additional rows and columns, with elements equal to 1/m2 (i.e., construct the correspondence matrix C (m × m), where m = max(m1, m2) + 1).
(d) Perform Sinkhorn standardization [60] on the augmented similarity matrix, C (see Note 16).
(e) Use the elements of matrix C as weights for the calculation of transform parameters.
(f) Perform the transformation of features F1 with the actual transform parameters.
(g) Decrease the width of the Gaussian function (σ) until σ reaches its minimal value (σ minimal).
The elements of the last row and the last column of C describe the probability of a feature without matching and are not normalized (see Note 17).
4. Local transform. A local transform is then applied by the piecewise linear transform, based on the Delaunay triangulation [86, 88] (see Note 18), allowing the exact matching of the corresponding spot maxima. The performance of this approach depends on the distribution of the corresponding features over the image.
2.3.6 Robust Automated Image Normalization Gel Matching [89, 90]
Robust automated image normalization (RAIN) is an image-based method to register 2-D maps exploiting smooth volume-invariant multiresolution B-spline transformations. The method involves a similarity measure based on the sum of squared residuals and the limited memory BFGS as optimizer. 1. B-spline transformations. B-splines [91] are connected piecewise polynomials of order n; at each connection, a smoothness constraint imposes continuity of the spline and its n 1 derivatives. Two B-spline surfaces are needed to model the smooth deformation along both x and y axes. To preserve volume invariance, the intensity of each warped pixel must be weighted by its change in density: this can be achieved by considering the Jacobian, i.e., the partial derivatives of the mapping with respect to the axes [92]. The determinant of the Jacobian represents the total change in volume. Normalizing the intensities of the warped image by the determinant at each pixel,
Warping of 2-D Maps
makes the integrated optical density (IOD) of each protein spot constant (see Note 19). Bias field correction is integrated into the image registration by modelling multiplicative and additive bias fields with separate third-order B-spline surfaces. The multiplicative bias field βU represents a smooth variation of the contrast over the image, while the additive bias field βV represents the same for the image brightness:

$$I_{U,V}(x, y) = \left(I(x, y) + \beta_V(x, y)\right) e^{\beta_U(x, y)} \tag{55}$$
The additive and multiplicative bias fields act in turn on the pixel intensity at location (x, y) in image I. The bias field parameters that maximize the similarity between the sample and reference images must be identified. Matching errors are computed as differences of pixel intensity between the two images; log-transformed values are used so that the error does not depend solely on strongly expressed spots [14]. On the other hand, since log-images enhance the noise due to background as well as faint spots, a weighting factor α is necessary (ranging from 0 for a pure log-image to 1 for the original image). Jr and Js, for sampling the reference and sample images Ir and Is, respectively, are given by (see Note 20):

$$J_r = \ln\left(I_r(x, y) + \alpha\right) \tag{56}$$

$$J_{s,C} = \ln\left(\det\left[\beta'_X, \beta'_Y\right] \left(I_s(\beta_X, \beta_Y) + \beta_V\right) e^{\beta_U} + \alpha\right) \tag{57}$$
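The bias field model and the log transforms of Eqs. 55–57 can be illustrated with a minimal NumPy sketch. It is not the RAIN implementation: the function names are hypothetical, and the sample image is assumed to have been warped already, with the Jacobian determinant and both bias fields supplied as arrays of the same shape (the B-spline evaluation itself is not shown).

```python
import numpy as np


def apply_bias_fields(image, beta_v, beta_u):
    """Eq. 55: add the additive field, then scale by exp(multiplicative field)."""
    return (image + beta_v) * np.exp(beta_u)


def log_image(image, alpha):
    """Eq. 56: offset log transform ln(I + alpha) of the reference image."""
    return np.log(image + alpha)


def sample_log_image(warped_sample, jac_det, beta_v, beta_u, alpha):
    """Eq. 57 for a sample image that has already been warped.

    jac_det holds the Jacobian determinant at each pixel (assumed input).
    """
    corrected = jac_det * apply_bias_fields(warped_sample, beta_v, beta_u)
    return np.log(corrected + alpha)
```

With an identity warp (determinant equal to 1 everywhere) and zero bias fields, `sample_log_image` reduces to `log_image`, which is a convenient sanity check.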
The B-spline parameters C = {X, Y, U, V} must be identified (see Note 21) to find the optimum warp and bias fields.

2. Finding the optimal match. A multiresolution approach is adopted (see Note 22). Reference and sample images are resized to 2048 × 2048 pixels, and details are removed iteratively by merging each square of four pixels into a single pixel, taking the average. A pyramid of images is thus created with a bottom-up approach, from resolution 1024 × 1024 to resolution 16 × 16. RAIN matching proceeds in reverse, determining the optimal matching parameters X, Y, U, and V from level 16 × 16 to level 1024 × 1024. A large number of candidate transformations are evaluated by RAIN. The authors calculate the sum of squared residuals:
$$\mathrm{sim}\left(J_{s,C}, J_r\right) = \sum_{(x, y) \in I_r} \left(J_{s,C}(x, y) - J_r(x, y)\right)^2 \tag{58}$$
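The image pyramid and the similarity measure of Eq. 58 can be sketched as follows. This is an illustrative sketch only, assuming square images whose side is a power of two; the function names are hypothetical and the optimizer that searches over the parameters C is not shown.

```python
import numpy as np


def halve(image):
    """Merge each 2 x 2 block of pixels into one pixel by averaging."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def build_pyramid(image, coarsest=16):
    """Return the reduced levels ordered from coarsest to finest.

    The full-size input itself is not included, mirroring the text,
    where a 2048 x 2048 image yields levels 1024 x 1024 down to 16 x 16.
    """
    levels = []
    current = image
    while current.shape[0] > coarsest:
        current = halve(current)
        levels.append(current)
    return levels[::-1]  # matching proceeds from coarse to fine


def sum_squared_residuals(j_sample, j_reference):
    """Eq. 58: sum over all pixels of the squared difference."""
    return float(np.sum((j_sample - j_reference) ** 2))
```

In a coarse-to-fine scheme of this kind, the parameters found at one level would typically serve as the starting point for the next finer level.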
The optimization of parameters C is critical; since it depends on the presence of local minima, a gradient descent algorithm is exploited (see Note 23). A good bias field correction improves convergence of the registration, but conversely a good
Marcello Manfredi et al.
registration also improves bias field estimation. For this reason, iterations of bias field correction and image registration were interspersed.

2.4 Hybrid Methods

The so-called hybrid methods combine landmark-based and pixel-based approaches to provide image registration methods with the advantages typical of both procedures. Here some of the most recent applications are presented. Other procedures, based on hierarchical grid transformation followed by an optimization step using Genetic Algorithms [26], are presented in Chapter 10.

2.4.1 Elastic Registration Using Intensity Information and Point Landmarks [2]

This method combines intensity and landmark information (landmarks are centers of protein spots determined by fitting spots with Gaussian models). Subpixel positions of the landmarks are also determined and used for elastic image registration. Landmark localization is performed prior to image registration. Landmark and intensity information are combined on the basis of the local correlation coefficient [93] as similarity measure (see Note 24). The PASTAGA (PASha Treating Additional Geometric Attributes) algorithm is exploited for registration [16]. The transformation T between two images I = I(x) and J = J(x), with coordinates x = (x, y), is based on the minimization of the following energy function:

$$E(C_1, C_2, T) = S(I, J, C_1) + \sigma \lVert C_1 - T \rVert^2 + \sigma\gamma \lVert C_2 - T \rVert^2 + \sigma\lambda R(T) \tag{59}$$

where S is the intensity similarity measure (see Note 25) and R is a quadratic regularization energy. C1 and C2 are sets of correspondences: C1 between points of I and J according to the intensity similarity measure and C2 between geometric features segmented in I and J; T is the transformation (see Note 26). Registration parameters are as follows: σ is related to the level of noise in the image, λ is the smoothing strength, and γ is the relative strength of the geometric features compared to intensity information (see Note 27). The registration energy can be minimized with respect to C1 and T, while C2 is kept fixed, by alternating two steps (T is initialized as the identity):
– Find the correspondences C1 by minimizing $S(I, J, C_1) + \sigma \lVert C_1 - T \rVert^2$ (see Note 28).
– Find the transformation T by minimizing $\lVert C_1 - T \rVert^2 + \gamma \lVert C_2 - T \rVert^2 + \lambda R(T)$ (see Note 29).
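As a brief check, the two alternating steps can be read off from Eq. 59 term by term. The sketch below assumes only that σ > 0 and that each step holds the other unknowns fixed; it is a worked restatement, not an additional result from the cited work.

```latex
\begin{align*}
% Step 1: T and C_2 are fixed, so only the terms containing C_1 matter
\hat{C}_1 &= \arg\min_{C_1} \; S(I, J, C_1) + \sigma \lVert C_1 - T \rVert^2 \\
% Step 2: C_1 and C_2 are fixed, so S(I, J, C_1) is a constant;
% dividing the remaining terms by \sigma > 0 does not move the minimizer
\hat{T} &= \arg\min_{T} \; \sigma \lVert C_1 - T \rVert^2
          + \sigma\gamma \lVert C_2 - T \rVert^2 + \sigma\lambda R(T) \\
        &= \arg\min_{T} \; \lVert C_1 - T \rVert^2
          + \gamma \lVert C_2 - T \rVert^2 + \lambda R(T)
\end{align*}
```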
The approach was applied to 2-D maps of different complexity. The authors proved that intensity information alone is not sufficient, but including point landmarks significantly improved the final result.
2.4.2 Elastic Registration by Gaussian Elastic Body Splines [94]
The authors introduce an approach based on an improved physical deformation model using analytic solutions of the Navier equation, a more realistic deformation model than other spline-based approaches [2, 47, 95, 96]. The method accurately models local deformations. The approach also allows landmark correspondences to be included to help registration in regions that are difficult to register by intensity information alone. This methodology is based on an energy-minimizing function JHybrid that includes three energy terms to incorporate intensity and landmark information and to regularize the deformation field u:

$$J_{\mathrm{Hybrid}}(u) = \lambda_I J_{\mathrm{Data},I}(g_1, g_2, u) + J_{\mathrm{Data},L}(p_i, q_i, u) + \lambda_E J_{\mathrm{Elastic}}(u) \tag{60}$$
where g1 and g2 are source and target images, (pi, qi) are the landmark correspondences (i = 1, ..., n), and λI and λE are scalar weights. The authors chose the elastic energy JElastic according to the Navier equation, which represents the regularization of the deformation field. Since the approach is based on the Navier equation, cross-effects among the elastic deformations are dealt with (a contraction in one direction leads to a dilation in orthogonal directions). Intensity information is exploited as much as possible to reduce user interaction in the registration process. When the images are of the same modality, the sum-of-squared intensity differences can be used for the intensity similarity measure JData,I. The approach includes landmark correspondences (pi, qi) to help the registration in regions that are difficult to register (see Note 30). The weighted Mahalanobis distance between corresponding landmarks is therefore used for JData,L, based on approximating Gaussian elastic body splines (GEBS) [97] (see Note 31). An efficient way of minimizing JHybrid is to minimize it alternately with respect to intensity information and to the remaining functions, until convergence of the deformation field u (usually in 10–30 iterations) (see Note 32). Affine differences in the images are dealt with by a pure intensity-based affine registration [98], prior to elastic registration.
3 Notes

1. It is defined as the point that minimizes the distance d between pi and Q.
2. Information potential is introduced as part of the distance metric. Each spot is considered as a set of samples (i.e., intensities of the pixels within the spot) from a certain distribution. Before alignment, intensity normalization is applied.
Corresponding spots usually show similar shapes and intensity values after normalization (i.e., similar distributions). The similarity of two spots can therefore be calculated from their information potential. Ia and Ib are the gel images to be aligned, Pi (i = 1, ..., N1) is a set of spots from Ia and Qj (j = 1, ..., N2) from Ib; the IP-Euc (Information Potential-Euclidean) distance is:

$$\mathrm{IPE}(P_i, Q_j) = (1 - \lambda) E_{euc} + \lambda E_{ip} \tag{61}$$
where λ is in the range [0, 1], Eeuc is based on Euclidean distance, and Eip is based on information potential energy. Larger similarities are obtained by smaller shape differences and larger information potential energies.
3. The authors applied a 2-D affine transformation composed of translation, rotation, shearing and scaling.
4. It is based on the detection of meaningful spots determined by their contrast and shape.
5. Exact graph matching is characterized by the edge-preserving constraint: if two vertices are linked by an edge in the first graph, their corresponding vertices must be linked by an edge also in the second graph [99]. Inexact graph matching instead penalizes the assignments which are not edge-preserving, increasing the value of the cost function. Optimal inexact matching algorithms always find a solution, i.e., the global minimum. In both cases, the mapping function can represent a homomorphism (many-to-one) or an isomorphism (one-to-one), but the latter is too restrictive. A weaker form is subgraph isomorphism, in which an isomorphism holds between one of the two graphs and a vertex-induced subgraph of the other. MCS is an even weaker form, mapping a subgraph of the first graph to an isomorphic subgraph of the second.
6. The authors use these level parameters: K1 = 32, K2 = 64, K3 = 128, and K4 = 256. They use h = 3, α = {0.5, 0.9}, Mt = [0.75, 0.95], and Ot = [0.8, 0.9]. These settings are based on experiments where the correlation and the overlapping functions are equally weighted. When the size of the matching images is very different, larger weight should be attributed to the correlation function.
7. When the sizes of two corresponding images are very different, the coordinates should be normalized before computing the overlapping ratio.
8. Ua and Ub do not necessarily have the same dimension since some spots may not have a correspondence. The authors assume that the number of spots in Ua is smaller than that in Ub. They also set the parameters as follows: αd, αo, αi, αc = 1
(all energies are equally weighted), M = 4 or 5, Et = [0.05, 0.10] and dt is dependent on the image resolution.
9. The authors use a simple Gaussian model for the warped image, after initial logarithmic intensity transformation of the images. They also chose a distortion criterion with a null set, consisting of a set of affine functions. However, other image models and null sets can be adopted.
10. The authors apply the Euclidean distance as similarity measure, with each pixel representing a dimension/variable.
11. Reducing Corr to below 0.9 does not worsen the matching performance but increases the computational time (more anchors are introduced). Setting it to a higher value (0.95) provides a limited number of anchors and degrades the matching results. The authors exploit a value equal to 0.7.
12. Different uniform random linear combinations of weighting proved to provide a similar matching performance. The authors set all weights equal to 0.25.
13. The MaxRST algorithm is described as follows:
If node tree, T = 0
    Acquire a new anchor point pair (s, r) from the candidates queue
    Apply MaxRST
Else, if the anchor point pair queue is not empty
    Compute R(s′, r′) for neighbors s′ and r′
    Assign the next anchor point pair to the points with maximum R
    Apply MaxRST
Else, if the anchor point pairs queue is empty
    Stop
Compare the area As of the sample to Ar in the reference
If the difference ratio D(As, Ar) is […]

[…] 10 mm. In the cases presented here, two or three steps of HGT are generally
GENOCOP and HGT Image Warping
sufficient to reach a good correction of the S image (about 0.8–1.0 mm on the test set along each separation), and the successive steps are able to decrease the error in the test set alignment only if a large number of landmarks is chosen. The number of landmarks does not increase the computational time needed to reach an acceptable warping solution. An acceptable correction of distorted images is obtained using 30 or more landmarks; a smaller number of landmarks corresponds only to global corrections of the images without local fine tuning. The selection of landmarks can be performed visually, with the aid of mass spectrometry, or using automatic matching tools [21, 34]. The selection of landmarks is a crucial step in alignment and warping procedures, since each pair of landmarks indicates a warping direction to the algorithm. The computational time of the algorithm is strongly influenced by the calculation of the error function; a linear modeling of the warped space is fast and simple to implement, but it can introduce discontinuities in the warped images, especially if the algorithm is stopped in the first steps. The introduction of spline models offers a more natural way of modeling the warped space than linear modeling; the obvious drawback is that it is more demanding in terms of computational effort. The entire optimization performed on several pairs of images (stopped after 4–5 steps of HGT) took approximately 10–15 min using a standard PC, but the software can still be engineered and optimized to further reduce the computational costs.
6 Notes

1. It is obvious that a hypothetical perfect superimposition of the two images is characterized by the condition: fx = fy = 0.
2. The simplest case is a = 1. If at least one of the offsets does not belong to the constrained search space, a is decreased until all offsets belong to the search space.
3. The algorithm can be implemented in MATLAB (release 2007 or later).
4. A domain that is half the size of the gel electrophoretic map is acceptable.
5. The optimization step can often be partitioned into several independent sub-steps, since when the four quadrilaterals surrounding a grid point do not contain any landmark, the corresponding grid point cannot be moved.
6. Each group of adjacent nodes to be optimized determines an independent optimization step.
Elisa Robotti et al.
7. In fact the use of soft constraints provides less information for the selection of the optimal warping: quadrilaterals not containing any landmark give the corresponding nodes of the grid a larger freedom to move, but this may result in strong modifications of the shape of the surface. The use of hard constraints alone is anyway advisable, in particular when many landmarks, spanning the whole surface, are selected.
8. The adopted strategy of partitioning the optimization problem into multiple independent smaller optimization steps (characterized by a lower number of nodes and landmarks to be taken into account) reduces the search space of the GENOCOP approach and the computational complexity.
9. Caution must be used in the alteration step to prevent two adjacent nodes from crossing each other. When this happens, the corresponding chromosome must be rejected.
10. Typically in the present case 20 and 30%, respectively.
11. Applying the basis to only the S points increases the speed of the successive transformations of the S into the T points and reduces the memory requirements.
12. Namely when a gradient below a given threshold is obtained (which means that a stationary point has been reached). This is a normal conjugate gradient using the Polak-Ribiere variant with Brent’s interpolation method (“line minimization”).
13. Using Fant’s resample algorithm [35]. Other types of resample algorithms can be used [36].
14. This is useful to reduce the number of solutions that are rejected because of the presence of node exchanges (intersections).
15. The second step of the algorithm was completed in less than 2 min and the successive steps took about 8 more minutes, without introducing important variations in the observed errors defined on the test set.
16. Only when the number of landmarks was increased and set between 30 and 60 did the third step introduce some improvements in the observed error. If a large number of landmarks is used, the fourth step also turns out to be advantageous and the error observed is further reduced along the entire image.
17. The result offered by the implementation of spline interpolation seems more realistic than that of linear interpolation, which can be used as an alternative: the surfaces represented using the grids in Fig. 5b, c are smoother than the examples presented in Fig. 1b, which contain many points of discontinuity.
18. In particular some small areas were present where the alignment of the points of the test set was not satisfactory; these areas correspond to quadrilaterals of the grid that do not contain any landmark, but are nevertheless involved in the warping procedure because they share a moveable point with one or more quadrilaterals that contain at least one landmark.
Acknowledgements

The authors gratefully acknowledge the collaboration of Dr. Alberto Zamò (Policlinico G. B. Rossi, University of Verona) who provided the biological samples used to produce the electrophoretic 2-D maps used in this study. The contents of this chapter are reproduced and adapted from [26] with permission from The Royal Society of Chemistry.

References

1. Challapalli KK, Zabel C, Schuchhardt J et al. (2004) High reproducibility of large-gel two-dimensional electrophoresis. Electrophoresis 25(17):3040–3047
2. Choe LH, Lee KH (2003) Quantitative and qualitative measure of intralaboratory two-dimensional protein gel reproducibility and the effects of sample preparation, sample load, and image analysis. Electrophoresis 24(19–20):3500–3507
3. Valcu CM, Valcu M (2007) Reproducibility of two-dimensional gel electrophoresis at different replication levels. J Prot Res 6(12):4677–4683
4. Görg A, Weiss W, Dunn MJ (2004) Current two-dimensional electrophoresis technology for proteomics. Proteomics 4:3665–3685
5. Westermeier R, Naven T (2002) Proteomics in practise: a laboratory manual of proteome analysis. Wiley, Freidburg
6. Hamdan M, Righetti PG (2005) Proteomics today. Wiley, Hoboken, NJ
7. Marengo E, Robotti E, Cecconi D et al. (2004) Identification of the regulatory proteins in human pancreatic cancers treated with Trichostatin A by 2D-PAGE maps and multivariate statistical analysis. Anal Bioanal Chem 379:992–1003
8. Gustafsson JS, Glasbey CA, Blomberg A, Rudemo M (2004) Statistical exploration of variation in quantitative two-dimensional gel electrophoresis data. Proteomics 4:3791–3799
9. Wheelock AM, Buckpitt AR (2005) Software related alterations in 2D gel image analysis. Electrophoresis 26:4508–4520
10. Grove H, Hollung K, Uhlen AK et al. (2006) Challenges related to analysis of protein spot volumes from two-dimensional gel electrophoresis as revealed by replicate gels. J Prot Res 5(12):3399–3410
11. Aittokallio T, Salmi J, Nyman TA et al. (2005) Geometrical distortions in two-dimensional gels: applicable correction methods. J Chromatogr B 815:25–37
12. Valledor L, Jorrin J (2011) Back to the basics: maximizing the information obtained by quantitative two dimensional gel electrophoresis analyses by an appropriate experimental design and statistical analyses. J Proteomics 74(1):1–18
13. Dowsey AW, English JA, Lisacek F et al. (2010) Image analysis tools and emerging algorithms for expression proteomics. Proteomics 10:4226–4257
14. Raman B, Cheung A, Marten MR (2002) Quantitative comparison and evaluation of two commercially available, two dimensional electrophoresis image analysis software packages, Z3 and Melanie. Electrophoresis 23:2194–2202
15. Li F, Seillier-Moiseiwitsch F (2011) Differential analysis of 2D gel images. In: Abelson JN, Simon MI (eds) Methods in enzymology, vol 487, Computer methods, part C. Elsevier, San Diego, CA, USA, pp 596–609
16. Berth M, Moser FM, Kolbe M et al. (2007) The state of the art in the analysis of two-dimensional gel electrophoresis images. Appl Microbiol Biotechnol 76:1223–1243
17. Garrels JI (1989) The QUEST system for quantitative analysis of two-dimensional gels. J Biol Chem 264:5269–5282
18. Lemkin PF, Lipkin LE (1981) GELLAB: a computer system for 2D gel electrophoresis analysis. II. Pairing spots. Comput Biomed Res 14:355–380
19. Vincens P, Tarroux P (1987) HERMeS: a second generation approach to the automatic analysis of two-dimensional electrophoresis gels. Part III: spot list matching. Electrophoresis 8:100–107
20. Conradsen K, Pedersen J (1992) Analysis of two-dimensional electrophoresis gels. Biometrics 48:1273–1287
21. Daszykowski M, Stanimirova I, Bodzon-Kulakowska A et al. (2007) Start-to-end processing of two-dimensional gel electrophoretic images. J Chromatogr A 1158:306–317
22. Salmi J, Aittokallio T, Westerholm J et al. (2002) Hierarchical grid transformation for image warping in the analysis of two-dimensional electrophoresis gel. Proteomics 2:1504–1515
23. Massart DL, Vandeginste BGM, Deming SM et al. (2001) Chemometrics: a textbook. Elsevier, Amsterdam
24. Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs, 3rd edn. Springer, New York
25. Cooper L, Steinberg D (1970) Introduction to methods of optimization. W.B. Saunders, London
26. Marengo E, Cocchi M, Demartini M et al. (2012) GENOCOP algorithm and hierarchical grid transformation for image warping of two-dimensional gel electrophoretic maps. Mol BioSyst 8:975–984
27. Paul HC, Eilers ID, Currie MD (2006) Fast and compact smoothing on multi-dimensional grids. Comput Stat Data Anal 50:61–76
28. Davis L (1991) Handbook of genetic algorithms. Van Nostrand Reinhold, New York
29. Wright A (1991) Foundation of genetic algorithms, first workshop on the foundations of genetics algorithms and classifier systems. Morgan Kaufmann, San Mateo, CA, pp 205–218
30. Bhandari D, Murthy CA, Pal SK (1996) Genetic algorithm with elitist model and its convergence. Int J Patt Recogn Artif Intell 10:731–747
31. Press HW, Teukolsky SA, Vetterling TW et al. (2002) Numerical recipes in C++. Cambridge University Press, Cambridge
32. Cecconi D, Zamò A, Parisi A et al. (2008) Induction of apoptosis in jeko-1 mantle cell lymphoma cell line by resveratrol: a proteomic analysis. J Prot Res 7:2670–2680
33. Gustafsson JS, Blomberg A, Rudemo M (2002) Warping two-dimensional electrophoresis gel images to correct for geometric distortions of the spot pattern. Electrophoresis 23:1731–1744
34. Kang Y, Techanukul T, Mantalaris A et al. (2009) Comparison of three commercially available DIGE analysis software packages: minimal user intervention in gel-based proteomics. J Prot Res 8:1077–1084
35. Wolberg G (1990) Digital image warping. IEEE Computer Society Press, Los Alamitos, CA
36. Daszykowski M, Færgestad EM, Grove H et al. (2009) Matching 2D gel electrophoresis images with Matlab “Image Processing Toolbox”. Chemom Intell Lab Syst 96:188–195
Chapter 11

Detection and Quantification of Protein Spots by Pinnacle

Jeffrey S. Morris and Howard B. Gutstein

Abstract

Accurate spot detection and quantification is a challenging task that must be performed effectively in order to properly extract the proteomic information from two-dimensional (2-D) gel electrophoresis images. In Morris et al., Bioinformatics 24:529–536, 2008, we introduced Pinnacle, an automatic, fast, effective noncommercial package for spot detection and quantification for 2-D gel images, and subsequently we have developed a freely available GUI-based interface for applying the method to a set of gels. In this chapter, we give an overview of Pinnacle and describe, step by step, how to use the software to obtain spot lists and quantifications to be used for comparative proteomic analysis.

Key words Automated methods, Image processing, Spot detection, Wavelet denoising

Abbreviations

2-D    Two-dimensional
R²     Coefficient of determination
%CV    Percent coefficient of variation

Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_11, © Springer Science+Business Media New York 2016

1 Introduction

A key bottleneck in proteomic research involving 2-D gel electrophoresis has been the lack of automatic, effective, and reproducible methods for analyzing the gel images. It could be argued that these analytical problems have been one of the key factors inhibiting this technology from reaching its potential in biomedical research. Given that the proteomic information in the gel images is contained in spots resulting from migrating proteins on the gel, the critical preprocessing steps involve detecting, matching, and quantifying spots across gels. Sensitive spot detection algorithms are essential, since any spots not detected in this step are lost to subsequent analysis, leading to lost discoveries. Accurate spot matching is crucial, for mismatched spots result in different
proteins put together in follow-up analyses, compromising the ability to detect which ones are associated with outcomes of interest. Accurate spot quantification is also important in order to minimize the measurement error that can reduce the power for detecting significant differences. Until recently, the only available analytical methods were commercial software packages with problematic underlying algorithms and workflows that broke down for studies involving more than a very small number of gels [1, 2]. These packages followed a traditional workflow that involves detecting spots, determining spot boundaries, and estimating spot volumes on each individual gel, and then matching the detected spots across the gels. This workflow has numerous weaknesses. It results in missing data for any spot not detected in all gels, causing analysis challenges and introducing potential bias. It is subject to spot matching errors, since cognate spots will be at slightly different locations on each gel, and the matching step is only done after spot detection. Mismatched spots compromise the ability to detect differentially expressed proteins, since different proteins are mistakenly grouped together for the analysis. Spot boundaries are also difficult to estimate properly, especially in regions with overlapping spots caused by comigrating proteins, increasing the imprecision of spot quantifications. This problem is compounded when different spot boundaries are estimated for each gel. Most software packages provide manual editing capabilities to correct this plethora of errors, but this is a time-consuming process that compromises the objectivity and reproducibility of the analysis. Exacerbating the problem, all of these errors worsen as the number of gels in the study increases. 
This suggested a perverse incentive for researchers to conduct studies with fewer samples, which can lead to insufficient information and capability to detect significant differences, with particular regard to the multiple testing problem inherent to the simultaneous analysis of hundreds or thousands of proteins. The dominance of these expensive yet ineffective software packages was partly fueled by the lack of freely available methods for spot detection and quantification, which was especially problematic for researchers performing infrequent 2-D gel studies, whose laboratories could not afford the software. As a solution to this problem, we developed Pinnacle, a novel method [1] for protein detection and quantification, that utilizes a completely different workflow to avoid these problems. Working on pre-aligned gels, Pinnacle performs spot detection by searching for pinnacles (local maxima in both dimensions, see Fig. 1) on a wavelet-denoised average gel. In the average gel, signals from spots found on many gels are reinforced while white noise is attenuated. Thus, detection based on the average gel can have greater sensitivity than detection done on individual gels, especially for faint spots present on many gels [1, 3]. In stark contrast to previous methods,
spot detection accuracy improves as the number of gels increases. Specificity is further improved by the wavelet denoising [4] that eliminates spurious noise without strongly attenuating the true signal. Note that the gel images must be pre-aligned before applying Pinnacle, using an image warping program so that cognate spots are reasonably well aligned across the gels (see Note 1). After spots are detected on the average gel, quantifications are obtained for each spot of each gel by taking the maximum intensity within a defined neighborhood around the detected “pinnacle” (peak intensity value for each spot) while performing local background correction and normalization using one of several approaches. Pinnacle is quick and automatic, and results in no missing data, and the use of pinnacle intensities instead of spot volumes for the quantification obviates the need to detect spot boundaries, resulting in greater precision and reliability. Theorem 1 in [1] demonstrates that the pinnacle intensity should be strongly correlated with spot volume as long as the given spot has a common shape across gels. In fact, Pinnacle enables one to detect potentially co-migrating proteins (or protein isoforms) if multiple peak values are detected in a region of the gel that other spot-based methods would tend to call a single spot. Using a series of dilution experiments, we compared Pinnacle with two commercially available processing packages that at the time were the state of the art, and found that Pinnacle resulted in significantly better validity (higher R² with true quantification) and reliability (lower %CV on replicate gels) than the commercial packages, and in a comparative analysis we found more significantly different spots across groups [1].

Fig. 1 Illustration of a pinnacle. The heat maps on the left show a region of an average gel image, with “x”s marking the detected pinnacles. The three columns on the right illustrate three of the pinnacles in this field: the pixel intensities along the single row (top) and the single column (bottom) passing through each marked pinnacle are plotted, with a red line indicating the pinnacle location, defined as a location that is a local maximum in both the horizontal and vertical directions. Panel titles: horizontal slices at 210, 218, and 231; vertical slices at 365, 368, and 379. Axes: horizontal location, vertical location
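The core of the detection and quantification workflow described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Pinnacle code: the function names and the `radius` parameter are hypothetical, the gels are assumed to be pre-aligned and stacked in one array of shape (gels, rows, columns), and wavelet denoising, background correction, and normalization are deliberately left out.

```python
import numpy as np


def find_pinnacles(avg_gel):
    """Return (row, col) of interior pixels that are strict local maxima
    along both their row (horizontal) and their column (vertical)."""
    c = avg_gel[1:-1, 1:-1]
    is_max = (
        (c > avg_gel[1:-1, :-2]) & (c > avg_gel[1:-1, 2:])    # left, right
        & (c > avg_gel[:-2, 1:-1]) & (c > avg_gel[2:, 1:-1])  # above, below
    )
    rows, cols = np.nonzero(is_max)
    return np.column_stack((rows + 1, cols + 1))  # undo the border offset


def quantify(gels, pinnacles, radius=2):
    """For every gel, take the maximum intensity in a square window of
    half-width `radius` around each pinnacle found on the average gel."""
    n_gels, height, width = gels.shape
    out = np.empty((n_gels, len(pinnacles)))
    for k, (r, c) in enumerate(pinnacles):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, height)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, width)
        out[:, k] = gels[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

A typical call would be `pinnacles = find_pinnacles(gels.mean(axis=0))` followed by `quantify(gels, pinnacles)`. Because every gel is measured at every pinnacle, the resulting table has no missing entries, and the window tolerates small residual misalignment between gels.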
After our initial development of Pinnacle, various software companies developed alternative workflows that worked on pre-aligned gels to minimize problems from spot matching after detection, and improved spot quantification by defining common spot outlines to use across all gels to quantify spot volumes for each cognate spot. Using dilution series, we compared Pinnacle with the first of these methods, Progenesis SameSpots (Nonlinear Dynamics, Newcastle Upon Tyne, UK), and found that its performance was significantly improved over Progenesis PG240, the previous version of Nonlinear Dynamics’ software, in terms of spot detection and quantification [5]. In these comparisons, Pinnacle still outperformed SameSpots, leading to more spots detected, spot quantifications with higher validity and reliability, and detecting more significantly different spots in a comparative proteomics analysis [5]. The type of algorithm underlying SameSpots is still prevalent in commercial software today. Our original software implementation for Pinnacle involved Matlab scripts, but we have subsequently developed a Windows standalone executable with a graphical user interface for applying the method. This executable is freely available as a self-extracting file on the Web (https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=95). In this
chapter, we describe Pinnacle in more detail, providing a step-by-step explanation of how to input gel images into the software, perform spot detection and quantification using Pinnacle, and then output the spot quantification lists for subsequent analysis.
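To illustrate why a pinnacle intensity can stand in for a spot volume when a spot keeps a common shape across gels, the following minimal Python sketch may help. It is not part of the Pinnacle software: the function names are hypothetical, and a noise-free Gaussian spot shape with fixed widths is assumed purely for illustration. Under these assumptions the ratio of volume to pinnacle is the same constant for every amplitude.

```python
import math

def gaussian_spot(amplitude, sigma_x, sigma_y, size=41):
    """Return a size x size image holding one Gaussian-shaped spot (assumed shape)."""
    c = size // 2
    return [[amplitude * math.exp(-((x - c) ** 2) / (2 * sigma_x ** 2)
                                  - ((y - c) ** 2) / (2 * sigma_y ** 2))
             for x in range(size)]
            for y in range(size)]

def pinnacle(image):
    """Peak intensity of the spot."""
    return max(max(row) for row in image)

def volume(image):
    """Integrated intensity (sum over all pixels)."""
    return sum(sum(row) for row in image)

# Same shape (sigma_x = 3, sigma_y = 2), different protein amounts.
ratios = []
for amplitude in (100.0, 400.0, 1600.0):
    spot = gaussian_spot(amplitude, 3.0, 2.0)
    ratios.append(volume(spot) / pinnacle(spot))

# Each ratio is close to 2 * pi * sigma_x * sigma_y, independent of amplitude.
print(ratios)
```

If the spot shape changed from gel to gel, the ratio would no longer be constant, which is why the common-shape assumption matters.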
2 Materials
The Pinnacle software is a stand-alone Windows package with a graphical user interface, so a Windows-based system is needed to use it. It can be downloaded from https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=95. Here we describe the required format of the gel images to be used with the software.
1. The software requires that the gel images be stored as 16-bit .tif files. All gel images for a common analysis should be stored in a common directory.
2. Very Important: Pinnacle is designed to work with gel images that have already been registered or warped to each other so that the spots are reasonably well aligned on the separate gel images. No registration package is included with Pinnacle, so the user must use another package to perform the registration and warping, save the warped/aligned images as 16-bit .tif files, and then use them for the Pinnacle analysis. Various commercial and noncommercial methods are available for performing this registration; some are automated algorithms, while others require manual intervention. Some of these methods are described in this book (see Chapter 8). We use RAIN [6].
3. Do not edit the .tif images in Photoshop or any other image processing software, as this could alter the quantitative information in the gel image and affect accuracy.
4. When scanning the gels, avoid saturation of the gel images; Pinnacle works best when gels have little to no saturation.
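Before an analysis, it can be useful to check that pixel values fit the 16-bit range and to estimate how much of an image is saturated. The sketch below is only an illustration and not part of Pinnacle. The helper names are hypothetical, the image is assumed to be already loaded as a nested list of integer intensities, and saturation is assumed to mean pixels at the 16-bit ceiling of 65535 (a given scanner may saturate at a lower value).

```python
MAX_16BIT = 2 ** 16 - 1  # 65535, the largest value a 16-bit pixel can hold

def fits_16bit(image):
    """True if every pixel is an integer in the 16-bit range."""
    return all(isinstance(p, int) and 0 <= p <= MAX_16BIT
               for row in image for p in row)

def saturation_fraction(image, ceiling=MAX_16BIT):
    """Fraction of pixels at or above the assumed saturation ceiling."""
    pixels = [p for row in image for p in row]
    return sum(1 for p in pixels if p >= ceiling) / len(pixels)

# Toy 3 x 3 "gel" with three saturated pixels (values are made up).
toy_image = [
    [1200, 65535, 65535],
    [800, 40000, 65535],
    [500, 700, 900],
]
fraction = saturation_fraction(toy_image)
print(fits_16bit(toy_image), round(fraction, 3))
```

A fraction well above zero suggests that the gel should be rescanned with settings that avoid saturation.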
3 Methods
To frame our presentation, we first give a mathematical representation of the model underlying Pinnacle. Let $Y_i(x, y)$ represent the registered scanned gel image $i$ ($i = 1, \ldots, N$) at horizontal location $x$ and vertical location $y$ in the image, with the image sampled on a grid of size $J \times K$. The set of images is assumed to have been registered to each other to establish a matching correspondence in the $(x, y)$ coordinates for the different gels:
$$Y_i(x_j, y_k) = \overbrace{B_i(x_j, y_k)}^{\text{Background}} + \underbrace{N_i}_{\text{Normalization factor}} \, \overbrace{S_i(x_j, y_k)}^{\text{Protein signal}} + \underbrace{e_{ijk}}_{\text{Noise}} \tag{1}$$
The quantity $B_i(x, y)$ represents the background artifact of gel $i$, assumed to follow a smooth spatial stochastic process; $e_{ijk}$ represents white noise, assumed to be independent and identically distributed zero-mean Gaussian with variance $\sigma^2$; and $N_i$ represents a multiplicative normalization factor for gel image $i$, included to adjust for systematic differences in total protein quantity or scanning settings across gels. Our goal in preprocessing is to extract the protein signal $S_i(x, y)$, which is expected to be the result of a convolution of $P$ protein spots centered at locations $(x_s^*, y_s^*)$, $s = 1, \ldots, P$. This can be written as:
$$S_i(x, y) = \sum_{s=1}^{P} Q_{is} \, \phi_s(x, y) \tag{2}$$
where $\phi_s(x, y)$ is a canonical spot template for spot $s$, which can often be well approximated by a 2-D Gaussian density function centered at $(x_s^*, y_s^*)$, and $Q_{is}$ represents the quantification of spot $s$ on gel $i$. We will refer back to this equation and notation throughout the presentation.
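The model in Eqs. 1 and 2 can be made concrete with a small simulation. The following Python sketch is an illustration under stated assumptions rather than anything taken from the Pinnacle code: the function names are hypothetical, every spot uses an isotropic 2-D Gaussian density as its template, and the smooth background is replaced by a simple linear ramp.

```python
import math
import random

def spot_template(x, y, cx, cy, sigma):
    """phi_s(x, y): isotropic 2-D Gaussian density centered at (cx, cy)."""
    return (math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            / (2 * math.pi * sigma ** 2))

def simulate_gel(quants, centers, norm, J=32, K=32, sigma=2.0,
                 noise_sd=1.0, rng=None):
    """Simulate Y_i = B_i + N_i * S_i + e_i on a J x K grid (indexed gel[k][j])."""
    rng = rng or random.Random(0)
    gel = []
    for k in range(K):          # vertical position y_k
        row = []
        for j in range(J):      # horizontal position x_j
            background = 50.0 + 0.5 * j + 0.25 * k   # assumed smooth B_i
            signal = sum(q * spot_template(j, k, cx, cy, sigma)   # Eq. 2
                         for q, (cx, cy) in zip(quants, centers))
            noise = rng.gauss(0.0, noise_sd)                      # e_ijk
            row.append(background + norm * signal + noise)        # Eq. 1
        gel.append(row)
    return gel

# Two gels sharing spot locations but differing in Q_is and N_i.
centers = [(10, 12), (22, 20)]
gel_1 = simulate_gel([500.0, 300.0], centers, norm=1.0)
gel_2 = simulate_gel([250.0, 600.0], centers, norm=2.0)
print(round(gel_1[12][10], 1), round(gel_2[12][10], 1))
```

Setting noise_sd to 0 makes the output deterministic, which is convenient for checking that a quantification routine recovers the known $Q_{is}$ values.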
3.1 Loading Gel Images into Pinnacle, Loading and Saving Project Files
1. To start a new analysis, choose Start New Project in the File menu, which brings up a dialog box for selecting the directory in which the registered gel .tif files reside.
2. Select the folder containing the .tif files; by default the software then loads all the .tif files in this folder into the program. At this point, two windows open by default, an Individual Gel window and an Average Gel window. Subheading 3.2 discusses these windows. Figure 2 contains a screenshot of the File menu and these two windows, tiled vertically.
3. If one wishes to analyze only a subset of the .tif files, or if the directory contains other .tif files that are not registered gel image files, one can choose Select Gel Images in the Action menu and mark the set of .tif files corresponding to the desired gels for the analysis. The average gel will then update to include only the selected gel images.
4. At any time, one can save the current project in the File menu, using Save Project or Save Project As, and can reload that project by choosing Open Existing Project in the File menu. The project file saves all the current settings and results.
Fig. 2 Screenshot showing File menu and the Individual Gel and Average Gel windows
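As a rough illustration of what happens to the average gel when only a subset of images is selected (step 3), the sketch below recomputes pixel-wise summary gels over chosen images. It is not Pinnacle code: the function name and the toy images are hypothetical, and the images are assumed to be registered and of identical size. The mean gives the average gel, while max and median give the maximum gel and the 0.50 quantile gel.

```python
from statistics import mean, median

def summary_gel(gels, selected=None, reducer=mean):
    """Pixel-wise summary of the selected gels (all gels if selected is None)."""
    chosen = gels if selected is None else [gels[i] for i in selected]
    rows, cols = len(chosen[0]), len(chosen[0][0])
    return [[reducer([g[r][c] for g in chosen]) for c in range(cols)]
            for r in range(rows)]

gels = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[14.0, 20.0], [26.0, 44.0]],
    [[90.0, 90.0], [90.0, 90.0]],  # stands in for an unrelated .tif to exclude
]
average_all = summary_gel(gels)                       # all three images
average_subset = summary_gel(gels, selected=[0, 1])   # only the two real gels
maximum_subset = summary_gel(gels, selected=[0, 1], reducer=max)
median_all = summary_gel(gels, reducer=median)
print(average_subset, maximum_subset)
```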
3.2 Viewing Individual Gels or the Average Gel
By default, the software opens two graphical windows, one displaying an individual gel and the other displaying the average gel for the entire gel set, computed as the pixel-wise average across all registered gel images. These windows are useful for exploring the properties of your gel images. They offer considerable flexibility in choosing which individual gels are shown and which statistical summary gels are computed, as well as for zooming in and out, moving around the gel, and changing the contrast to allow the exploration of both faint and more intense spots. The gel windows can be synchronized so that any change, zoom, or move made in one gel window is mirrored in the other gel windows. After pinnacles are detected, they are indicated in these windows by “x” marks. These gel image displays are useful for exploring the pinnacle detection performance and provide a framework for rapid, intuitive hand-editing of the results. In this section we describe some of the settings and capabilities of the graphical windows.
1. By default the Individual Gel window plots a heatmap of the final gel in the set that was loaded. To see other gels, select a different one in the Gel pop-up list. When the Gel pop-up list is selected, turning the mouse scroll wheel will step through the different individual gels. Multiple individual gels can be opened in separate windows by selecting Open Individual Gels in the Window menu, or all individual gels can be opened using Open All Individual Gels (although be careful about doing this if the gel set is very large!). The windows can be arranged in various ways using Arrange Windows in the Window menu.
2. Average Gel: by default the Average Gel window plots a heatmap of the arithmetic, pixel-wise mean gel. Alternatively, the pixel-wise maximum gel or quantile gel can be computed under the Gel pop-up list (see Note 2).
3. Colormaps: various colormaps are available for plotting the heatmaps of the gels. By default, Jet is chosen, which plots the intensities ranging from colder colors to hotter colors (blue → red). This is our personal choice, as the rich contrast of colors appears to make visual inspection of the spots more intuitive. By checking the box Reverse, one can reverse any colormap, e.g., to let white represent lower intensities and black represent higher intensities for the Grey map (see Note 3).
4. Image Contrast: the contrast of any heatmap can be adjusted by changing the minimum and maximum values of the colormap, using the sliders marked Min Value and Max Value. This is especially useful when investigating the spot detection for fainter or overlapping spots. By decreasing the Max Value, fainter spots become clearer and one can see whether they look like “real” spots that should be selected.
However, this will tend to make the more intense spots run together. To visually delineate nearby intense spots, increasing the Max Value will help one to assess whether there are multiple local spots in a particular intense region.
5. Zooming and Navigating: by default, the entire gel is shown in the window, but one can zoom and move around the image to investigate local regions of the gel. One can zoom in on the image by selecting the magnifying glass button, and then left-clicking and dragging the mouse up and to the right. Zooming out can be done by left-clicking and dragging the mouse down and to the left. When zoomed, one can move around the image by selecting the hand button and left-clicking and dragging around. The image can be centered on a particular spot by selecting the button with a box and red cross in the bottom middle of the menu bar and then left-clicking on the desired centering point for the gel image.
6. Synchronizing Gel Images: when changing the colormaps or contrasts of the images, or while zooming or moving around the image, one can synchronize these changes across all gel image windows by checking the box Sync All. If one has multiple images opened but only wants to synchronize over a subset, one can check the Sync box for each image window to be synchronized. This is useful in exploring the pinnacle detection to see whether the spots in the average gel appear to match well with visual spots in individual gels, as a check on the alignment.
7. 3D Plots: instead of heatmaps, the gel images can be displayed as three-dimensional plots by selecting 3D Plot in the Type pop-up list. These 3D plots use the same colormap specified for the heatmap. The perspective can be changed by selecting the rotation button in the menu bar and left-clicking and dragging the mouse around.
8. Saving Gel Images: one can save the currently active gel image by selecting Save Image As under the File menu. Supported formats include .jpg, .tif, .pdf, and .ps.
3.3 Spot Detection
Overview: given a set of $N$ pre-aligned gel images, with $Y_i(x_j, y_k)$ indicating the intensity of the pixel at horizontal position $x_j$ and vertical position $y_k$ within gel image $i$ ($i = 1, \ldots, N$), Pinnacle performs spot detection on the average gel (see Note 4), given by:
$$\bar{Y}(x_j, y_k) = N^{-1} \sum_{i=1}^{N} Y_i(x_j, y_k) = \bar{B}(x_j, y_k) + \bar{S}(x_j, y_k) + \bar{e}_{jk} \tag{3}$$
where $\bar{S}(x, y) = N^{-1} \sum_{i=1}^{N} N_i S_i(x, y)$ is the average signal, $\bar{B}(x, y)$ is the average baseline, and $\bar{e}_{jk} \sim N(0, \sigma^2/N)$ is the noise on the average gel, reduced by a factor of $\sqrt{N}$ relative to the noise level in an individual gel because of the averaging. This will increase the signal-to-noise ratio for spots present on at least a certain proportion of all gels, and thus can lead to increased sensitivity and specificity for spot detection. Further, this means that the accuracy of the spot detection should increase with the number of gels in the analysis. These concepts were illustrated by a rigorous simulation study in the context of peak detection for 1D mass spectrometry [3], and the same concept applies to spot detection here. Next, wavelet denoising is performed to estimate and remove noise ($\bar{e}_{jk}$ in the equation above), and then positions $(x_s^*, y_s^*)$ are identified that are local maxima in both the horizontal and vertical directions (see Fig. 1) and for which $\bar{Y}(x_s^*, y_s^*) > q_\alpha$, where $q_\alpha$ is a minimum pinnacle size threshold, defined as the $\alpha$ quantile among all intensities on the average gel $\bar{Y}(x, y)$. Next, we aggregate pinnacles $(x_s^*, y_s^*)$ and $(x_{s'}^*, y_{s'}^*)$ that are very close to each other, i.e., $|x_s^* - x_{s'}^*| < \delta$ and $|y_s^* - y_{s'}^*| < \delta$, where $\delta$ is a prespecified neighborhood size, by keeping spot $s$ if $\bar{Y}(x_s^*, y_s^*) > \bar{Y}(x_{s'}^*, y_{s'}^*)$ and keeping spot $s'$ otherwise. While our studies show that Pinnacle tends to work very well without any manual editing, the software allows manual editing of the pinnacles, adding or removing them, using a dynamic display that includes the mean gel and individual gels while allowing zooming in on particular regions of the images, moving around the image, and changing the contrast of the heatmaps to accentuate fainter spots or to separate merged spots. The strategy of detecting on the mean gel, along with the graphical interface, makes manual editing of the spots an efficient and straightforward process.
1. Defining Image Region: before spot detection, one can first define an image region S including the regions of the gel where spot detection is desired.
The reason for this is to give the user the option to exclude regions of the gels containing obvious artifacts, which are especially common near the boundaries of the rectangular image. The region can be marked either using a rectangular or lasso tool under Specify Image Region in the Action menu. All spot detection, background correction, and normalization will be done only using the data in this region. 2. Detecting, Saving, and Loading Pinnacle Sets: one can detect pinnacles by choosing Detect Pinnacles within the Action menu. The pinnacles will be detected using the settings indicated in the Pinnacle Settings dialog box in the Settings menu, which will be discussed below. One can save the pinnacle set by choosing Save Pinnacle List in the Action menu. One can load in previously saved pinnacle sets through Load Pinnacle List, and can choose among any of the loaded pinnacle sets in the Pinnacle List drop-down menu in the Average Gel window, which contains a list of all loaded pinnacle lists for the current
project. One can open a window containing the pinnacle list by selecting Pinnacle Window within the Window menu. This window displays (xs*, ys*) for each pinnacle, the intensity in the average gel, and an indicator of whether it was automatically detected or manually added (see step 4 below). 3. Pinnacle detection settings: there are a number of different settings that affect pinnacle detection that can be adjusted in the Pinnacle Settings dialog box in the Settings menu. There are default values for each setting. Figure 3 contains a screen shot of the pinnacle settings window, the pinnacle window, and individual and average gel windows with each detected pinnacle indicated by a red or green “x.” (a) Wavelet denoising settings: Wavelet denoising on the average gel is used to remove spurious “wiggles” that can manifest as false positive pinnacles. It works by transforming the image to the wavelet domain using a 2-D Daubechies wavelet transform with wavelet filter length, thresholding any wavelet coefficients below a specified wavelet threshold, and then performing the inverse wavelet transform to construct the denoised image. Wavelets are multi-resolution basis functions that simultaneously decompose a signal in both the frequency and time domain, and have properties making them useful for compression and denoising, especially for signals with local features like protein spots. Both the wavelet filter length and wavelet threshold can be set in Pinnacle Settings. The wavelet filter length must be an even number (2, 4, 6, 8, 10, or 12, with default 6) (see Note 5). Results tend to be robust to the filter length choice. The wavelet threshold is a nonnegative number (default 3.6) (see Note 6). Making the wavelet threshold larger will result in more denoising but if too large may smooth away true features, while making it smaller will result in less denoising. 
(b) Other pinnacle settings: signal detection also depends on two other parameters, the minimum pinnacle size threshold qα and the neighborhood size δ, described above (see Notes 7 and 8). These parameters are set in the Pinnacle Settings dialog box. In the Minimum Pinnacle Size Threshold box, one specifies α, the percentile (between 0 and 1) indicating the minimum intensity for a pinnacle. The neighborhood size box (positive integer) indicates the minimum distance between two pinnacles. All pinnacles closer than this amount are merged together, taking the maximum value. This allows for slight misalignment of the gels, since slightly misaligned gels may lead to multiple local maxima in the average gel corresponding to the same cognate protein.
Fig. 3 Screenshot showing Pinnacle Settings dialog box and the Pinnacle Window. Note that the pinnacles are evident as red “x”s in the gel image windows
4. Hand-editing detected pinnacles: we have found that automatic spot detection through Pinnacle typically performs quite well, especially if one tweaks the minimum pinnacle size threshold or wavelet threshold settings for a particular set of gels, assessing the performance by visually inspecting the average gel and individual gel images. However, the software also provides the capability to hand-edit the pinnacle lists. This is typically done by zooming in on the average gel, moving around, and varying the contrast (Min Value, Max Value) to inspect the detected pinnacles, and then removing any that do not look like real spots (e.g., artifacts) or adding pinnacles for any regions that appear to be spots in the gel image window but were not detected. Pinnacles can be added by holding Ctrl and left-clicking on a location in either the average gel window or any individual gel window. Each pinnacle is represented in the gel image windows by a red or green “x,” whichever has the best contrast with the gel image at that location. Pinnacles can be removed by holding Ctrl and right-clicking while moving the cursor around in one of the gel windows. The Eraser Size in the Pinnacle Settings dialog box gives the size of the square eraser used to remove pinnacles, effectively “erasing” them from the gel image. Subheading 4 contains some notes and tips on hand-editing detected pinnacles (see Notes 9–12).
3.4 Spot Quantification
Overview: once spot detection is completed and a list of detected spots and corresponding pinnacles $(x_s^*, y_s^*)$, $s = 1, \ldots, P$, has been found on the average gel, we compute relative spot quantifications $Q_{is}$ for each spot and each gel. Rather than estimating the spot boundaries and computing the spot volumes, quantification is based on the pinnacle intensity. We have shown that as long as a cognate spot $s$ has a common canonical spot shape $\phi_s(x, y)$ across gel images, a reasonable assumption in our investigations, the maximum pinnacle intensity and the integrated spot volume measure essentially the same quantity [1]. The benefits of using pinnacles for spot quantification are that they are faster to compute and do not require estimation of spot boundaries, potentially reducing measurement error and increasing the precision of spot quantifications. As discussed in the introduction, the use of common spot boundaries and registered gel images has improved the performance of current software packages, but in our view there is still a benefit to quantifying protein amounts by pinnacle intensities rather than integrated spot volumes, since no spot boundaries must be estimated. Spot quantification is done by first computing the maximum intensity on the individual gel image in a neighborhood around the pinnacle location $(x_s^*, y_s^*)$ on the average gel, i.e., $Y_{is}^* = \max\{Y_i(x, y) : x \in (x_s^* - \delta, x_s^* + \delta),\ y \in (y_s^* - \delta, y_s^* + \delta)\}$, where $\delta$ is the specified neighborhood size (specified in the Pinnacle Settings window). Background correction and normalization are also done to estimate the background $B_{is} = B_i(x_s^*, y_s^*)$ and normalization factor $N_i$, and then the quantification for spot $s$ on gel $i$ is given by $Q_{is} = (Y_{is}^* - B_{is}) / N_i$. Optionally, wavelet denoising can be done to estimate and remove the white noise $e_{is} = e_i(x_s^*, y_s^*)$. Note that this algorithm provides well-defined spot quantifications for each cognate spot on each gel, so there is no missing data.
1. Quantifying Spots: one can obtain the spot quantifications by selecting Quantify Spots from the Action menu. Spot quantification will be done according to the Spot Quantification Settings in the Settings menu (see Fig. 4), as discussed below. In the gel image windows, once the spots have been quantified, the red/green “x” marking each pinnacle is recolored black/white, and the “x” in the individual gel image shows the actual location used for quantification for that gel image (see Fig. 4). After performing the quantification, the user is prompted to provide a prefix name for the .xls files containing the quantification information. Three .xls files are created: “*_spotsettings.xls,” containing the settings used for pinnacle detection; “*_spotlist.xls,” containing for each cognate spot the pinnacle location $(x_s^*, y_s^*)$, an indicator of whether the pinnacle was automatically detected or manually added, and the intensity in the average gel $\bar{Y}(x_s^*, y_s^*)$; and “*_spotquants.xls,” containing for each cognate spot and gel the spot number, an indicator of whether the pinnacle was automatically or manually detected, the gel label, the gel-specific pinnacle location $(x_{is}^*, y_{is}^*)$, and the spot quantification $Q_{is}$. The quantifications can also be shown in the Pinnacle Window under the Window menu by checking the box Include Spot Quantifications in the Pinnacle Window (see Fig. 4). At any time, the spot .xls files can be exported using Export Spot Quantifications in the Action menu.
2. Spot Quantification settings: there are a number of different settings that affect spot quantification that can be adjusted in the Spot Quantification Settings dialog box in the Settings menu (see Fig. 4). There are default values for each setting.
(a) Denoising: denoising can be turned on in order to estimate and remove the white noise errors $e_{ijk}$ in the image. If denoising is done, then the wavelet threshold can be specified (default 3.6). By default, denoising is turned off.
(b) Background Correction: background correction, if turned on, will subtract an estimated baseline value from each pinnacle to estimate and remove $B_{is} = B_i(x_s^*, y_s^*)$. If local is selected (default), then $B_{is}$ is estimated to be either the minimum or the specified Quantile within the rectangle $x \in (x_s^* - \gamma_x, x_s^* + \gamma_x)$, $y \in (y_s^* - \gamma_y, y_s^* + \gamma_y)$, where $\gamma_x$ and $\gamma_y$ are the specified horizontal background correction window
Fig. 4 Screenshot showing Spot Quantification Settings dialog box and the Pinnacle Window with spot quantifications for a zoomed-in region of the gel image
size and vertical background correction window size, respectively. The local minimum is computed when Quantile = 0. When global is selected, the global quantile or minimum within the specified image region for the entire gel is used as the background for all spots.
(c) Normalization: normalization is performed to account for differences in the total protein content or total image intensity in the individual gel images. The constant $N_i$ is estimated by taking either the total image volume, defined as the sum of the gel intensities in the specified image region S, or the sum of the detected pinnacles. We use the total image volume for normalization by default.
3.5 Analyze All
Under the Action menu, Analyze All will do a complete, automatic pinnacle analysis using the current settings, detecting, quantifying, and saving all pinnacles while performing background correction, normalization, and denoising.
4 Notes
1. Failure to align the gels can result in multiple pinnacles corresponding to the same cognate spots, producing spurious spots and inaccurate quantifications.
2. The maximum gel contains the maximum intensity of each pixel across all selected gels, while the quantile gel contains a specified quantile (e.g., 0.50 for median, 0.95 for 95th percentile).
3. Some may prefer the images to look more like the physical gels, in which case the Grey colormap can be chosen, with intensities ranging from darker to lighter colors (black → white).
4. Important: note that Pinnacle is designed to perform spot detection on the average gel, and we do not recommend its application to individual gels.
5. Results are relatively robust to choice of the wavelet filter length (default 6), but if very small values are chosen (e.g., 2) this can lead to denoised gel images with a “blocky” appearance, and if they are chosen very large, this could potentially lead to some attenuation of the intensities of some of the most intense peaks. 6. The wavelet threshold is a nonnegative number (default 3.6). Making the wavelet threshold larger will result in more denoising but if too large, may smooth away true features, while making it smaller will result in less denoising. Note that when working with large gel sets (i.e., large N), Pinnacle can work well even with no wavelet denoising (threshold 0), since the averaging inherent in computing the mean gel already performs some denoising.
7. Increasing the minimum pinnacle size threshold reduces the number of spurious spots, but making it too large may remove faint but true spots, while making this threshold too small can result in more spurious spots. 8. The neighborhood size box allows for slight misalignment of gels, since slightly misaligned gels may lead to multiple local maxima in the average gel corresponding to the same cognate protein. Making this parameter too large will merge together different proteins. 9. It also frequently helps to check sync all in the gel image window, so one can see both the average gel and individual gels in the regions being investigated. 10. When deciding whether a given pinnacle corresponds to a true spot or not, we typically toggle through the individual gels to see whether we believe it is a real spot or not. This also gives insight into whether the gels are aligned well enough. 11. We typically will zoom into a range of about 1/5 of the gel rows and columns and, systematically, pan around the image, adding or removing spots. 12. We have found that this process typically takes on the order of 10–15 min for the entire gel set. Generally, less than 10 % of the spots that were automatically detected are modified manually.
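To make the roles of the minimum pinnacle size threshold and the neighborhood size (Notes 7 and 8) more concrete, the following Python sketch walks through detection on an average gel and the quantification $Q_{is} = (Y_{is}^* - B_{is}) / N_i$ of Subheading 3.4. It is a simplified illustration under several assumptions and not the Pinnacle implementation: all function names are hypothetical, wavelet denoising is omitted, border pixels are skipped, the quantile uses a simple order-statistic convention, nearby pinnacles are merged greedily from the most intense downward, and the local background is the window minimum.

```python
import math

def quantile(values, alpha):
    """Simple order-statistic quantile (one of several possible conventions)."""
    ordered = sorted(values)
    return ordered[int(alpha * (len(ordered) - 1))]

def detect_pinnacles(avg, alpha=0.9, delta=3):
    """Local maxima in both directions above q_alpha, merged within delta."""
    rows, cols = len(avg), len(avg[0])
    q_alpha = quantile([v for row in avg for v in row], alpha)
    candidates = []
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            v = avg[y][x]
            horizontal = v > avg[y][x - 1] and v > avg[y][x + 1]
            vertical = v > avg[y - 1][x] and v > avg[y + 1][x]
            if horizontal and vertical and v > q_alpha:
                candidates.append((v, x, y))
    kept = []
    for v, x, y in sorted(candidates, reverse=True):  # most intense first
        # Drop a candidate if a stronger kept pinnacle lies within delta in x and y.
        if all(abs(x - kx) >= delta or abs(y - ky) >= delta for kx, ky in kept):
            kept.append((x, y))
    return kept

def window(gel, x, y, half):
    """Pixels with |dx| < half and |dy| < half, clipped to the image."""
    rows, cols = len(gel), len(gel[0])
    return [gel[r][c]
            for r in range(max(0, y - half + 1), min(rows, y + half))
            for c in range(max(0, x - half + 1), min(cols, x + half))]

def quantify(gel, pinnacles, delta=3, gamma=5, norm=1.0):
    """(local maximum - local minimum background) / normalization factor."""
    return [(max(window(gel, x, y, delta)) - min(window(gel, x, y, gamma))) / norm
            for x, y in pinnacles]

def toy_gel(amplitudes, centers=((4, 4), (10, 9)), size=15, sigma=1.5,
            background=10.0):
    """Noise-free toy gel: flat background plus Gaussian-shaped spots."""
    return [[background + sum(a * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                                           / (2 * sigma ** 2))
                              for a, (cx, cy) in zip(amplitudes, centers))
             for x in range(size)]
            for y in range(size)]

gels = [toy_gel((100.0, 60.0)), toy_gel((200.0, 120.0))]
average = [[(a + b) / 2 for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(*gels)]
pinnacles = detect_pinnacles(average)
# Normalize each gel by its total image volume, as in the default setting.
quants = [quantify(g, pinnacles, norm=sum(map(sum, g))) for g in gels]
print(pinnacles, quants)
```

Because every pinnacle found on the average gel is quantified on every gel, the resulting table has no missing values. Raising alpha removes fainter candidates, while raising delta merges pinnacles that lie farther apart.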
Acknowledgements
This work was supported by grants from the National Cancer Institute (CA107304, CA160736) and from the National Institute on Drug Abuse (DA18310) and the National Institute on Alcohol Abuse and Alcoholism (AA16157).
References
1. Morris JS, Clark BN, Gutstein HB (2008) Pinnacle: a fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data. Bioinformatics 24:529–536
2. Clark BN, Gutstein HB (2008) The myth of automated, high-throughput two-dimensional gel electrophoresis. Proteomics 8:1197–1203
3. Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R (2005) Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 21:1764–1775
4. Donoho D, Johnstone IM (1994) Ideal spatial adaptation via wavelet shrinkage. Biometrika 81:425–455
5. Morris JS, Clark BN, Wei W, Gutstein HB (2010) Evaluating the performance of new approaches to spot quantification and differential expression in 2-dimensional gel electrophoresis studies. J Proteome Res 9(1):595–604
6. Dowsey AW, Dunn MJ, Yang GZ (2008) Automated image alignment for 2-D gel electrophoresis in a high-throughput proteomics pipeline. Bioinformatics 24:950–957
Chapter 12
A Novel Gaussian Extrapolation Approach for 2-D Gel Electrophoresis Saturated Protein Spots
Massimo Natale, Alfonso Caiazzo, and Elisa Ficarra
Abstract
Analysis of images obtained from two-dimensional gel electrophoresis (2-D GE) is a topic of utmost importance in bioinformatics research, since the commercial and academic software currently available has proven to be neither completely effective nor fully automatic, often requiring manual revision and refinement of computer-generated matches. In this chapter, we present an effective technique for the detection and reconstruction of over-saturated protein spots. First, the algorithm reveals overexposed areas, where spots may be truncated, and plateau regions caused by smeared and overlapping spots. Next, it reconstructs the correct distribution of pixel values in these overexposed areas and plateau regions, using a two-dimensional least-squares fitting based on a generalized Gaussian distribution. Pixel correction in saturated and smeared spots allows more accurate protein quantification, providing more reliable image analysis results. The method is validated by processing highly exposed 2-D GE images and comparing reconstructed spots with the corresponding non-saturated image. The results demonstrate that the algorithm enables correct spot quantification.
Key words Image analysis, Two-dimensional gel electrophoresis, Proteomics, Software tools
1 Introduction
Since the pioneering work of O’Farrell [1], two-dimensional gel electrophoresis (2-D GE) has been demonstrated to be the most comprehensive technique for the analysis of the proteome, allowing the simultaneous analysis of very large sets of gene products. In the post-genomic era, 2-D GE has become a powerful tool that is widely used for the analysis of complex protein mixtures extracted from cells, tissues, or other biological samples [2]. The main goal of 2-D GE is to match protein spots between gels and define differences in the expression level of proteins in different biological states (e.g., healthy vs. diseased or control vs. treated). An important prerequisite to guarantee optimal performance of automated image analysis packages is good image capture, which should allow for the detection of various protein amounts, ranging
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_12, © Springer Science+Business Media New York 2016
from very low to high abundance. However, the width of the dynamic range is still a relevant issue in biological applications, since protein concentrations in biological systems may vary by seven or more orders of magnitude [3]. Currently, there is no acquisition system that can cover this wide range. In practice, the differences in protein concentration in biological systems compromise the detection of weak and low-abundance protein spots, which are often more biologically relevant but generally too faint to be accurately quantified [4]. Moreover, due to the highly complex nature of biological samples, in most cases several proteins have similar pI and MW values, and the corresponding spots migrate to the same portion of the gel image, resulting in complex regions with multiple overlapping spots. In this context, the commercial software currently available often requires manual revision of spot detection and refinement of computer-generated matches [5–7]. The aim of the spot detection step is to define the location, the true boundary, and the intensity (usually the total pixel volume within the boundary) of each spot. In particular, in order to achieve fast and reliable image analysis, the algorithm used for image segmentation must provide well-defined and reproducible spot detection, quantification, and normalization [8]. However, despite the availability of several spot detection algorithms, the accurate definition of a protein spot might be critical in certain regions, where defining the boundaries is challenging due to variable background, spot overlapping, and spot saturation [9–11]. In a saturated spot, intensity values are truncated, preventing the resolution of high-intensity pixels, since the missing area of the spot will not be measured. In other words, a saturated spot does not provide any reliable quantitative data, and it might also compromise the overall normalization.
In particular, when comparing different experimental states for expression changes representative of a particular condition, the inclusion of highly to moderately abundant and saturated spots might bias the normalization step, especially if their variance is a significant portion of the total spot volume [3]. For these reasons, several authors recommend manually deleting the saturated spots before analyzing the gels [12]. In fact, currently available commercial software (such as Delta2D, ImageMaster, Melanie, PDQuest, Progenesis, and REDFIN) and academic packages (such as Pinnacle and Flicker) are not able to deal with specific protein spot distortions found in the gel images [13]. Here, we present a novel two-step algorithm for the detection and reconstruction of over-saturated protein spots. First, it reveals overexposed areas, where spots may be truncated, and plateau regions caused by smeared and overlapping spots. Second, it processes the grayscale distribution of the saturated spots, reconstructing the saturated region by a generalized Gaussian approximation.
The method yields a highly refined and fully automatic spot detection that does not need further manual corrections. Furthermore, the pixel correction in saturated and smeared spots, according to the generalized Gaussian curve, allows more accurate quantification, providing more reliable results from 2-D GE analysis. To validate the proposed method, we processed a highly exposed 2-D GE image containing saturated spots and compared the reconstructed spots to the corresponding spots of the unsaturated image. The results indicate that the method can effectively reconstruct the saturated spots and improve spot quantification.
2 Materials
2.1 2-D GE
1. Isoelectrofocusing buffer of water containing 5 % (w/v) SDS and 2.5 % (w/v) dithiothreitol and then diluted to 330 ml with a solution containing 7 M urea, 2 M thiourea, 4 % (w/v) CHAPS (3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate), 0.5 % (w/v) IPG buffer 3–10 NL, and traces of bromophenol blue.
2. Isoelectrofocusing was performed on 18 cm linear IPG strips 4–7 (GE Healthcare, Uppsala, Sweden).
3. Equilibration steps (15 min each): in the first step, the strips were left in a buffer containing 50 mM Tris–HCl pH 8.8, 6 M urea, 30 % (w/v) glycerol, 2 % (w/v) SDS, 1 % (w/v) dithiothreitol, and traces of bromophenol blue. The same buffer, but supplemented with 2.5 % (w/v) iodoacetamide, was used for the second step.
4. Sodium dodecyl sulphate polyacrylamide gel electrophoresis (SDS-PAGE) was performed on 12.5 % polyacrylamide gel (1.5 mm thickness, 180 mm width, 200 mm length) [14].
5. The gels were stained overnight with RuBPs in fixing solution (30 % ethanol and 10 % acetic acid) according to the manufacturer's instructions.
2.2 Image Scanning
1. In order to collect a set of images useful to test the algorithm, we ran ten plasma samples using 2-D GE, with each sample run in duplicate.
2. The gel images were acquired by a Gel-Doc-it 310 Imaging system (UVP, Upland, CA, USA).
2.3 Image Analysis
Once all gels in the study had been collected and digitized, they were analyzed using ImageJ [15], a Java open-source image analysis library, and the commercial software Melanie [16].
In order to achieve the maximum intensity, the available images were acquired at 16-bit gray scale and saved in TIFF format (see Note 1).
3 Methods
3.1 2-D GE
1. Isoelectrofocusing was performed at 20 °C for a total of 72,000 Vh as follows: after 12 h at 50 V (600 Vh), the voltage was kept constant at 500 V for 1 h (500 Vh), then increased to 1000 V with a linear gradient in 2.5 h (1000 Vh), further increased to 8000 V with a linear gradient in 3 h (13,500 Vh), and finally kept constant at 8000 V for 7 h (56,000 Vh).
2. SDS-PAGE was carried out with constant current (60 mA) at 16 °C until the dye front reached the lower end of the gel (see Note 2).
3.2 Image Scanning
Images were acquired with the following settings: exposure 7 s, gain 3, and three different apertures (6, 7, and 8). Aperture is expressed as F-stop (e.g., F2.8 or f/2.8): the smaller the F-stop number (or f/value), the larger the aperture (see Note 3).
3.3 Image Analysis
After image scanning, the 2-D GE images are noisy and usually affected by several artifacts such as scratches, air bubbles and stain spikes. A 3 × 3 median filter was applied to remove most of the spikes while preserving the shape of the spots (see Note 4). In order to identify the plateau regions, we implemented a morphological filter, inspired by the rolling-ball algorithm described by Sternberg [17], which allows segmentation of the plateau zones. This method is based on a structural element (SE) defined by a circle of given radius (RD) and a grayscale value tolerance (GVT). In particular, for each pixel (x, y) of the image I, the SE is defined as a circular neighborhood of radius RD:

$$ SE = \left\{ (s, t) \in I : \sqrt{(s - x)^2 + (t - y)^2} < RD \right\} \qquad (1) $$

where

$$ I = (0, \mathrm{imagewidth}) \times (0, \mathrm{imageheight}) \qquad (2) $$

denotes the spatial domain of the image. RD and the GVT are defined by a single parameter. For instance, setting the parameter to 10, the RD is 10 pixels and the GVT is 10 % of the grey value of the pixel at position (x, y). The center of the SE is moved along each pixel of the image, and the maximum and minimum grey values of the pixels within the given RD of each point (x, y) are calculated. When the difference between maximum and minimum is less than the GVT, the area defined by the local operator is considered a plateau area (see Fig. 1).

Fig. 1 Saturation zones in 2-D GE images acquired at different exposures. The figure shows images of the same gel acquired at three different exposures: high (a), medium (b), and low (c). The saturation zones detected by our algorithm are indicated in red. Spots boxed in blue were further analyzed in Fig. 2

After localizing the plateau regions, we segmented the image to identify the isolated protein spots containing plateau areas on each gel image. The watershed algorithm [10] was used in this segmentation step. The reconstruction of the saturated spots was performed considering the unsaturated spot to be described by an analytical function, depending on a restricted set of parameters. In particular, we assumed each cross section of the spot intensity along both vertical and horizontal axes to be approximated by a generalized Gaussian distribution. Namely, for each value of the Y-coordinate we considered a function of the form:

$$ f(x, M(y), \sigma(y), x_0, b) = \frac{M(y)}{\sigma(y)} \exp\left( -\frac{|x - x_0(y)|^{b}}{b\,\sigma(y)^{b}} \right) \qquad (3) $$
For b = 2, Eq. 3 defines the kernel of a standard Gaussian distribution centred in x0(y), where σ(y) and M(y) are the standard deviation and the maximum of intensity values, respectively. Note that, unlike the other parameters, b does not depend on y: the approximating Gaussian can have a different maximum, center and variance in different sections, but the exponent b is shared by all of them. The reconstruction problem can be formulated as follows. Given the pixel positions

$$ \{(x_i, y_j)\}, \quad i = 1, \ldots, N_x, \quad j = 1, \ldots, N_y \qquad (4) $$

and the corresponding intensities {I_ij}, we determine the set of parameters (M, σ, x0, b) for which the function defined in Eq. 3 best fits the intensity values in the unsaturated region.
In practice, we have to minimize an error function that defines how good a particular parameter set is. For example, a standard least-squares criterion can be used:

$$ E\left(M(y_j), \sigma(y_j), x_0(y_j), b\right) = \sum_{i=1}^{N_x} \left[ f\left(x_i, M(y_j), \sigma(y_j), x_0(y_j), b\right) - I_{ij} \right]^2 \qquad (5) $$
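The least-squares fit of one cross-section can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' implementation: it assumes NumPy and SciPy, takes the generalized Gaussian to have the form (M/σ)·exp(−|x − x0|^b / (b σ^b)), keeps b fixed at 2 for simplicity (the text treats b as a fitted parameter), and uses a synthetic, noise-free profile clipped at an arbitrary saturation level. The names `gen_gauss` and `fit_section` and the starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares


def gen_gauss(x, M, sigma, x0, b):
    """Generalized Gaussian cross-section (form assumed for Eq. 3)."""
    return (M / sigma) * np.exp(-np.abs(x - x0) ** b / (b * sigma ** b))


def fit_section(x, intensity, saturation_level, b=2.0):
    """Fit M, sigma and x0 to the unsaturated pixels of one section (b kept fixed)."""
    keep = intensity < saturation_level            # drop the clipped pixels
    xs, ys = x[keep], intensity[keep]

    def residuals(p):
        M, sigma, x0 = p
        return gen_gauss(xs, M, sigma, x0, b) - ys  # terms of the sum in Eq. 5

    # Rough starting point (an assumption of this sketch)
    x0_init = np.sum(x * intensity) / np.sum(intensity)
    sigma_init = max((x.max() - x.min()) / 6.0, 1.0)
    M_init = intensity.max() * sigma_init
    fit = least_squares(
        residuals,
        [M_init, sigma_init, x0_init],
        bounds=([0.0, 1e-6, x.min()], [np.inf, np.inf, x.max()]),
        x_scale="jac",
    )
    return fit.x


# Synthetic section: true peak 60000, clipped at 40000
x = np.arange(41, dtype=float)
true_profile = gen_gauss(x, 3.0e5, 5.0, 20.0, 2.0)
clipped = np.minimum(true_profile, 40000.0)

M_hat, sigma_hat, x0_hat = fit_section(x, clipped, saturation_level=40000.0)
reconstructed = gen_gauss(x, M_hat, sigma_hat, x0_hat, 2.0)
print(f"sigma = {sigma_hat:.2f}, x0 = {x0_hat:.2f}, peak = {reconstructed.max():.0f}")
```

Only the pixels below the saturation level enter the residuals, so the clipped top of the spot is extrapolated from its flanks. With real, noisy data the quality of the starting values matters more than it does here.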
However, the error measure can be defined in different ways, e.g., according to different norms or including different weights for the different parameters and/or the different pixels. In particular, we modified Eq. 5 in order to control the variation of the parameters for different sections. In our case the error function in Eq. 6 can be formulated as follows:

$$ \hat{E}\left(M(y_j), \sigma(y_j), x_0(y_j), b\right) = E\left(M(y_j), \sigma(y_j), x_0(y_j), b\right) + \delta_M \left[ M(y_j) - M(y_{j-1}) \right]^2 + \delta_\sigma \left[ \sigma(y_j) - \sigma(y_{j-1}) \right]^2 + \delta_{x_0} \left[ x_0(y_j) - x_0(y_{j-1}) \right]^2 \qquad (6) $$
For positive values of the parameters (δ_M, δ_σ, δ_x0), the problem is then reduced to finding the parameters yielding the minimum of the selected error function. One possibility is to perform an exhaustive search over all the values of a predefined parameter space. However, if the parameter space is large, a more efficient Newton–Raphson algorithm [18] could be employed to find where the gradient equals zero.
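Going back to the first step of this section, the plateau detection (median filtering followed by the maximum-minus-minimum test inside a circular SE) can be sketched as follows. This is a minimal sketch rather than the authors' ImageJ implementation: it assumes NumPy and SciPy, the helper names `disk` and `plateau_mask` are hypothetical, the test image is synthetic, and borders are handled with SciPy's default reflection.

```python
import numpy as np
from scipy import ndimage


def disk(radius):
    """Boolean circular footprint: offsets at distance < radius from the centre."""
    r = int(np.ceil(radius))
    dy, dx = np.ogrid[-r:r + 1, -r:r + 1]
    return dx ** 2 + dy ** 2 < radius ** 2


def plateau_mask(image, param=10):
    """Flag pixels whose circular neighbourhood varies by less than the GVT."""
    img = ndimage.median_filter(image.astype(float), size=3)   # 3 x 3 despeckling
    footprint = disk(param)                                     # RD = param pixels
    local_max = ndimage.maximum_filter(img, footprint=footprint)
    local_min = ndimage.minimum_filter(img, footprint=footprint)
    gvt = (param / 100.0) * img                                 # GVT = param % of the centre value
    return (local_max - local_min) < gvt


# Synthetic spot with true peak 60000, clipped at 40000 to mimic saturation
rows, cols = np.mgrid[0:64, 0:64]
spot = 60000.0 * np.exp(-((cols - 32) ** 2 + (rows - 32) ** 2) / (2 * 8.0 ** 2))
image = np.minimum(spot, 40000.0)

mask = plateau_mask(image, param=3)
print("plateau pixels:", int(mask.sum()))
```

Note that a perfectly uniform, non-zero background would also pass this test, so in a real workflow the mask would presumably be combined with the subsequent spot segmentation.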
4 Notes
1. In several cases researchers would also be interested in processing the image acquired with the longest exposure (see Figs. 1, 2, and 3), since it contains the largest number of spots (and thus also the lower-abundance protein spots). It must be said that, in practice, operators acquire only a single image per gel, choosing the image that gives the largest amount of useful information. In the example considered here, the image shown in Fig. 1 provides more information because more spots are revealed. However, without an extrapolation approach, this image would be discarded due to the larger number of saturated spots.
Fig. 2 Visualization of unsaturated and saturated spots in 2-D view. Upper panel indicates the regions in the blue box as shown in Fig. 1 panel (a)–(c) with high, medium, and low exposure. These regions were visualized in 2-D view in the lower panels accordingly. The profiles in this figure clearly show the Gaussian distribution of unsaturated spots
[Fig. 3 image, panels a–d: 3D surface plots; axes: Pixels on X axis, Pixels on Y axis, Gray value (×10^4)]
Fig. 3 Visualization of unsaturated and saturated spots in 3D view before and after reconstruction. The 3D visualizations of the unsaturated spot number 3 in Fig. 2 panel (c) and of the saturated spot number 3 in Fig. 2 panel (a) are shown in panels (a) and (b), respectively. Pixels are shown on the X- and Y-axes and the gray values on the Z-axis. The shape reconstructions of the unsaturated spot shown in panel (a) and of the saturated spot shown in panel (b) are shown in panels (c) and (d), respectively
2. An important prerequisite to guarantee optimal performance of automated image analysis packages is a good image capture, which should allow for the detection of various protein amounts ranging from very low to high abundance. In particular, when a gel image is acquired, it is important to avoid saturation effects, which occur when grey levels exceed the maximum representable level on the scale used (e.g., in an 8-bit image an overexposed or saturated area contains a large number of pixels with gray-level values at or close to the maximum value of 255). In practice, to avoid saturation during the scanning procedure, the operators should ensure that the more abundant protein spots are represented by pixels slightly below the maximum intensity available, while keeping the dynamic range as wide as possible and acquiring images at 16-bit rather than 8-bit, because a 16-bit image has 65,536 gray levels while an 8-bit image has only 256 gray levels.
3. The orders of magnitude define the volume difference between the smallest and largest spots. For example, 1 order of magnitude means a difference from 1 to 10, while 5 orders means a difference from 1 to 100,000. A linear dynamic range is necessary to examine 2-D GE spots containing varied concentrations of the target proteins, so that the correlation between protein concentration and spot volume can be generated and evaluated. The use of fluorescent protein stains can extend the linear dynamic range by 3–5 orders of magnitude [3].
4. Scratches and air bubbles are unintentionally introduced by the operator, while spikes are caused by the presence of precipitated staining particles or an imperfect polyacrylamide gel matrix. In particular, spikes are a sort of impulsive noise because of their small area of high intensity. The aim of reducing speckles in a 2-D GE image (despeckling) is to remove the noise without introducing any significant distortion in quantitative spot volume data.

References
1. O'Farrell PH (1975) High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250:4007–4021
2. Gorg A, Weiss W, Dunn MJ (2004) Current two-dimensional electrophoresis technology for proteomics. Proteomics 4:3665–3685
3. Miller I, Crawford J, Gianazza E (2006) Protein stains for proteomic applications: which, when, why? Proteomics 6:5385–5408
4. Hortin GL, Sviridov D (2010) The dynamic range problem in the analysis of the plasma proteome. J Proteomics 73:629–636
5. Millioni R, Puricelli L, Sbrignadello S, Iori E, Murphy E, Tessari P (2012) Operator- and software-related post-experimental variability and source of error in 2-DE analysis. Amino Acids 42:1583–1590
6. Clark BN, Gutstein HB (2008) The myth of automated, high-throughput two-dimensional gel analysis. Proteomics 8:1197–1203
7. Wheelock AM, Buckpitt AR (2005) Software-induced variance in two-dimensional gel electrophoresis image analysis. Electrophoresis 26:4508–4520
8. dos Anjos A, Møller AL, Ersbøll BK, Finnie C, Shahbazkia HR (2011) New approach for segmentation and quantification of two-dimensional gel electrophoresis images. Bioinformatics 27:368–375
9. Srinark T, Kambhamettu C (2008) An image analysis suite for spot detection and spot matching in two-dimensional electrophoresis gels. Electrophoresis 29:706–715
10. Rashwan S, Faheem T, Sarhan A, Youssef BAB (2009) A fuzzy-watershed based algorithm for protein spot detection in 2DGE images. In: Proceedings of the 9th WSEAS international conference on signal processing, computational geometry and artificial vision (ISCGAV'09), World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, WI, USA, pp 35–40
11. Daszykowski M, Bierczynska-Krzysik A, Silberring J, Walczak B (2010) Avoiding spots detection in analysis of electrophoretic gel images. Chemometr Intell Lab Syst 104:2–7
12. Berth M, Moser FM, Kolbe M, Bernhardt J (2007) The state of the art in the analysis of two-dimensional gel electrophoresis images. Appl Microbiol Biotechnol 76:1223–1243
13. Maurer MH (2006) Software analysis of two-dimensional electrophoretic gels in proteomic experiments. Curr Bioinform 1:255–262
14. Maresca B, Cigliano L, Corsaro MM, Pieretti G, Natale M, Bucci EM et al. (2010) Quantitative determination of haptoglobin glycoform variants in psoriasis. Biol Chem 391:1429–1439
15. Schneider CA, Rasband WS, Eliceiri KW (2012) NIH Image to ImageJ: 25 years of image analysis. Nat Methods 9(7):671–675
16. Natale M, Maresca B, Abrescia P, Bucci EM (2011) Image analysis workflow for 2-D electrophoresis gels based on ImageJ. Proteomics Insights 4:37–49
17. Sternberg S (1983) Biomedical image processing. IEEE Comput 16:22–34
18. Ben-Israel A (1996) A Newton-Raphson method for the solution of system of equations. J Math Anal Appl 15:243–252
Part III Statistical Methods Applied to Spot Volume Datasets to Identify Candidate Biomarkers
Chapter 13
Multiple Testing and Pattern Recognition in 2-DE Proteomics
Sebastien C. Carpentier

Abstract
After separation through two-dimensional gel electrophoresis (2-DE), several hundreds of individual protein abundances can be quantified in a cell population or sample tissue. However, gel-based proteomics has the reputation of being a slow and cumbersome art. But art is not dead! While 2-DE may no longer be the tool of choice in high-throughput differential proteomics, it is still very effective to identify and quantify protein species caused by genetic variations, alternative splicing, and/or PTMs. This chapter reviews some typical statistical exploratory and confirmatory tools available and suggests case-specific guidelines for (1) the discovery of potentially interesting protein spots, and (2) the further characterization of protein families and their possible PTMs.

Key words 2-DE, Multivariate statistics, Protein correlations, Clustering, Protein isoforms
1 Introduction
Due to regulation at the translational level and post-translational modifications (PTMs), it is important to also analyze the final product of the genetic code, proteins, since they are the actual effectors in the cell. To date, 2-DE is still an intensively used method in differential proteomics. 2-DE has the advantage of being a high-resolution technique which is able to resolve individual protein isoforms including PTMs. Gel-based proteomics has the reputation, however, of being a slow and cumbersome art. Indeed, the throughput of 2-DE is limited, and gel-free approaches provide an answer to the throughput problems associated with gel-based proteomics. Gel-free differential proteomics relies on mass spectrometry for both quantification and identification. The major disadvantage of this approach lies in the disconnection between the protein and its peptides. A protein sample containing several thousands of proteins is digested and all these peptides are analysed at once. This leads to both identification and quantification problems, especially in the case of higher eukaryotes with complex polyploid
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_13, © Springer Science+Business Media New York 2016
genomes and big protein families. Peptides shared between several proteins do not contribute to the conclusive identification of a particular protein. This is the so-called protein inference problem [1]. Protein-specific tryptic peptides need to be measured and identified for final protein identification and quantification. So both gel-free and gel-based methods have their advantages and disadvantages. While 2-DE may no longer be the tool of choice in high-throughput differential proteomics in the future, it is still very effective to identify and quantify protein species caused by genetic variations, alternative splicing and/or PTMs, and it will give a complementary view of the proteome. Protein species that arise from sequence variation (paralogous genes or allelic variants) or from PTMs have many peptides in common when digested, but often differ slightly in pI and/or mass [2, 3]. The complexity in a single spot is much lower than in a gel-free blind high-throughput approach, where finding the low-abundance differential peptides is similar to looking for a needle in a haystack. Proteins are often also correlated, especially when they belong to the same family. Therefore this chapter reviews some typical statistical exploratory tools available for general 2-DE and suggests case-specific guidelines for an approach that can be used for (1) the discovery of potentially interesting protein spots and (2) the further characterization of protein families and their possible PTMs.
2 Materials
1. Plant material of the selected banana varieties Cachaco (CAC) (ABB, ITC 0643) and Mbwazirume (MBW) (AAAh, ITC 0084) was supplied by the International Transit Centre of Bioversity International (Leuven, Belgium).
2. Differential Gel-Electrophoresis (DIGE) labeling of the protein samples was performed with CyDyes from GE Healthcare (Little Chalfont, UK).
3. Gels were run on IPG strips pI 4–7 24 cm (GE Healthcare, Little Chalfont, UK) on an IPGphor 2 (GE Healthcare, Little Chalfont, UK) and Ettan Dalt 6 (GE Healthcare, Little Chalfont, UK) and scanned with a Typhoon scanner (GE Healthcare, Little Chalfont, UK).
4. Peptides were analyzed with a matrix-assisted laser-desorption ionization time-of-flight/time-of-flight (MALDI TOF/TOF) 4800 (ABSCIEX, Framingham, MA, USA).
5. The following software packages were used for data handling: Decyder v.7.0 (GE Healthcare, Little Chalfont, UK), Statistica v.8.0 (Nine Sigma, Cleveland, OH, USA), GPS Explorer (ABSCIEX, Framingham, MA, USA), Mascot Distiller (Matrix Science, Boston, MA, USA), Speclust (Lund University, http://co.bmc.lu.se/speclust/).
3 Methods
The proteins were extracted, separated and labelled according to [4, 5]. Data from two different DIGE experiments were exported from Decyder v.7.0. One example was used to calculate the false discovery rate (FDR) and the other was used to evaluate the protein abundances with principal component analysis (PCA), partial least squares (PLS) (NonLinear Iterative Partial Least Squares, NIPALS, algorithm), and clustering (nearest-neighbor algorithm) using Statistica v.8. Mgf files were generated with Mascot Distiller and transformed afterwards to .peaks files. MS spectra and MS/MS spectra were clustered using Speclust, http://co.bmc.lu.se/speclust/ [6].
3.1 Discovery of Potentially Interesting Protein Spots
3.1.1 Univariate Statistics: Multiple Testing
Two-dimensional gel electrophoresis (2-DE) simultaneously quantifies hundreds of individual protein abundances (protein spots). One way to start the analysis is to state for each individual protein spot a null hypothesis (H0) and to choose the most appropriate statistical test to check this hypothesis after having specified a significance level α (see Note 1). In the case of a proteomics experiment, additional measures are advisable. Univariate statistical tests like the t-test, the Kolmogorov-Smirnov test, ANalysis Of VAriance (ANOVA), or the Kruskal-Wallis test have not been designed to analyze complex datasets containing multiple (correlated) variables. Testing hundreds of variables (protein spots) one by one, each with an accepted risk of false positives (α), increases the chance of reporting false-positive cases (multiple testing issue) [7]. Instead of controlling the false-positive rate of each single test, one needs to control the error rate over all the tests. One way to approach this multiplicity problem is to control the family-wise error rate (FWER), aiming to control the probability of committing any type I error in families of comparisons under simultaneous consideration. The FWER can be defined as the probability of having at least one false-positive decision among all tests [7] (see Note 2). In order to keep FWER ≤ α in a proteomics experiment with m detected and matched spots, one can for example test each individual protein spot at the Bonferroni-adjusted significance level α/m. However, the number of erroneous rejections should also be taken into account and not only the question whether any error was made [7]. Benjamini and Hochberg [7] state that the seriousness of the loss incurred by erroneous rejections is related to the number of hypotheses rejected, and so a better alternative is the expected proportion of errors among the rejected hypotheses, which they term the false discovery rate (FDR).
Controlling the FDR also controls the FWER in the weak sense (i.e., when all null hypotheses are true), while it admits a more powerful procedure (fewer false-negative results).
Fig. 1 Ranked p-values for all the measured variables (blue) and the corrected value ((i/m) q, red). The inset zooms in on the values where (i/m) q gets below p(i). In the case of PCER, 189 protein spots have a p-value below 0.05. The inset shows that k = 40, corresponding to a p-value of 0.004
In the example given here, 502 protein spots have been matched across all six gels (m = 502) and a risk of false positives of 5 % (α-level = 0.05) has been applied. In the case of a per comparison error rate (PCER) approach [7], which ignores the multiplicity problem altogether, Fig. 1 shows that 189 protein spots have a p-value below 0.05 and the H0 hypothesis of 313 spots has not been rejected. In the case of the Bonferroni-adjusted significance level, α/m = 0.0001; only 1 spot is below the new threshold and the H0 of 501 spots has not been rejected. So there is an immense difference between 1 and 189 significantly different protein spots. The FDR procedure of Benjamini and Hochberg considers testing m null hypotheses H1, H2, ..., Hm based on their corresponding p-values p1, p2, ..., pm. The procedure ranks the p-values, p(1) ≤ p(2) ≤ ... ≤ p(m), and denotes by H(i) the null hypothesis corresponding to p(i). The procedure then looks for k, the largest i for which p(i) ≤ (i/m) q, where q is the FDR level to be controlled, and rejects all the H(i), i = 1, 2, ..., k. If one applies the definition introduced by Benjamini and Hochberg [7] to our example, then q = 0.05 and Fig. 1 (inset) shows that k = 40, corresponding to a p-value of 0.004. So we reject H0 for H(i), i = 1, 2, ..., 40. H0 is not rejected for 462 proteins.
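The three decision rules compared above (PCER, Bonferroni, and the Benjamini-Hochberg step-up procedure) can be sketched in a few lines. This is only an illustration, assuming NumPy; the function names are hypothetical and the ten p-values are invented, not the chapter's data.

```python
import numpy as np


def bonferroni_reject(p, alpha=0.05):
    """Reject H0 when p <= alpha / m (controls the FWER)."""
    p = np.asarray(p, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg_reject(p, q=0.05):
    """Step-up procedure: find the largest k with p(k) <= (k/m) q, reject H(1..k)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices that rank the p-values
    ranked = p[order]
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1    # largest rank satisfying the bound
        reject[order[:k]] = True
    return reject


# Invented p-values for ten hypothetical spots
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

n_pcer = int(np.sum(np.asarray(p_values) <= 0.05))       # no correction
n_bonf = int(bonferroni_reject(p_values).sum())
n_bh = int(benjamini_hochberg_reject(p_values).sum())
print(n_pcer, n_bonf, n_bh)
```

On these invented values the uncorrected rule flags five spots, Bonferroni one and the step-up procedure two, which mirrors the ordering 189 > 40 > 1 reported for the real dataset.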
To avoid reporting too many false-positive results, it is necessary to apply a correction for multiple testing. Corrections that control the FWER, such as the Bonferroni adjustment, are too strict, calling for a way to control the expected proportion of falsely rejected H0 hypotheses (the false discovery rate). Benjamini and Hochberg [7] prove that their approach is a practical and powerful approach for multiple testing.
3.1.2 Multivariate Statistics: PCA, PLS, and Clustering
Univariate statistical methods examine the individual protein spots one by one, considering the different proteins as independent measurements; they are not able to detect correlations to the other variables (proteins). Proteins fit within the larger entity of networks and interact with each other. For further reading on experimental design and univariate or multivariate statistics in proteomics the reader is referred to [8, 9] and for multivariate techniques in general to the books of Sharma and Jackson [10, 11]. The field of multivariate analysis consists of those statistical techniques that consider two or more related random variables as a single entity and attempt to produce an overall result taking the relationship among the variables into account [11]. In contrast to a univariate approach, it displays the interrelationships between the large number of variables and it is able to correlate multiple proteins to a specific experimental group and so to detect patterns. The use of multivariate methods in the analysis of 2-DE was already established in the early days of 2-DE [12]. PCA is one of the multivariate possibilities to perform explorative data analysis. PCA condenses the information contained in a huge data set into a smaller number of artificial factors, which explain most of the variance observed. A principal component (PC) is a linear combination calculated from the existing variables (proteins): PC1 = a_1(protein_1) + a_2(protein_2) + ... + a_m(protein_m); PC2 = b_1(protein_1) + b_2(protein_2) + ... + b_m(protein_m). The relation between the original variables (proteins) and the PCs is displayed in the loading plot. This means that if a protein has a high loading score for a specific PC, then this protein greatly contributes to the definition of that PC. Using PCA, trends and grouping of samples may be detected and this technique allows studying which proteins are important to explain trends and grouping in the data through the loading plot.
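The relation between scores, loadings and explained variance can be made concrete with a small sketch. It is an illustration under stated assumptions, not the Statistica workflow used in this chapter: it assumes NumPy, computes the PCs through a singular value decomposition of the mean-centred data (no scaling to unit variance), and uses a small hypothetical abundance matrix of six samples and two proteins. The function name `pca` is hypothetical.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via SVD. Rows of X are samples (gels), columns are proteins (spots)."""
    Xc = X - X.mean(axis=0)                              # centre each protein
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]      # sample coordinates on the PCs
    loadings = Vt[:n_components].T                       # coefficients a_i, b_i of each protein
    explained = (s ** 2 / np.sum(s ** 2))[:n_components] # fraction of variance per PC
    return scores, loadings, explained


# Hypothetical abundances: six samples x two proteins
X = np.array([[5, 5], [11, 10], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

scores, loadings, explained = pca(X)
print("variance explained:", np.round(explained, 3))
print("PC1 loadings:", np.round(loadings[:, 0], 3))
```

Each column of `loadings` holds the coefficients of one PC, so a protein with a large absolute loading contributes strongly to that PC. In this toy matrix the two proteins rise together, so PC1 carries almost all of the variance.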
Explorative PCA does not place strict requirements on the data. The majority of PCA applications are descriptive in nature. In these instances, distributional assumptions are of secondary importance [11]. The only requirement that must be met is that the dataset must be complete, meaning that there must be no missing spot values among the different samples. Techniques for performing multivariate statistical analyses in the presence of incomplete data and/or for estimating missing data can solve this problem. Several methods for estimating missing data have been reported [13–15]. PCA is well suited to detect outlying data and to detect correlations among
the different variables (proteins), but it is not able to determine a threshold level for identifying which proteins are significant in classifying the experimental groups, which would allow an objective removal of variables (proteins) that do not contribute to the class distinction. Several algorithms exist to select a subset of features from the whole data set and to perform a classification. In proteome analysis this corresponds to selecting the proteins that can best discriminate the experimental groups. The use of PLS as a regression technique has been promoted primarily within the area of chemometrics [16]. In contrast to PCA, PLS is a supervised technique mainly applied to link (or regress) a continuous response variable (or dependent variable) to a set of independent variables (e.g., proteins in a gel). In proteomic data the response variable is often a categorical variable (e.g., treatment A, B, C, ...). Partial Least Squares Discriminant Analysis (PLS-DA) offers an algorithm to deal with this typical data structure [16]. The aim of the technique is to identify the set of variables (proteins) that best discriminates between the groups. An analysis of the score and loading (correlation) plots allows defining the proteins which are important in discriminating the different experimental treatments. The variable importance plot (VIP) is an interesting tool for this purpose. A different way to get insight into the samples is hierarchical clustering (see Note 3). Samples that are more related appear on the same branches. The strength of their relation is represented by the length of the branch, where two samples which appear together on a short branch are more related than samples on a long branch. A clustering procedure usually begins by calculating an m × y similarity or distance matrix D, where m is the number of variables (matched spots) and y the number of cases (samples).
Consider the hypothetical example described in Table 1, where there are six samples (cases C_1, ..., C_6) and only two proteins quantified in
Table 1 Matrix of two protein abundances in six samples

          Abundance
Sample    Protein 1    Protein 2
1         5            5
2         11           10
3         15           14
4         16           15
5         25           20
6         30           19
the six samples. As shown in Fig. 2, each sample is represented in a two-dimensional space representing the abundance of protein 1 and 2. In reality, we have m protein spots that are represented in an m-dimensional space. The objective of cluster analysis is to group the samples into homogeneous clusters. Geometrically, the concept of cluster analysis can be very simple. In the example below we use the Euclidian distance as a measure of similarity (Fig. 2). Table 2 displays the Euclidian distances. More advanced algorithms
Fig. 2 Two-dimensional representation of the hypothetical example and the principle of Euclidian distance
Table 2 Euclidian distance matrix

        C_1     C_2     C_3     C_4     C_5     C_6
C_1     0.0     7.8     13.5    14.9    25.0    28.7
C_2     7.8     0.0     5.7     7.1     17.2    21.0
C_3     13.5    5.7     0.0     1.4     11.7    15.8
C_4     14.9    7.1     1.4     0.0     10.3    14.6
C_5     25.0    17.2    11.7    10.3    0.0     5.1
C_6     28.7    21.0    15.8    14.6    5.1     0.0

The smallest non-zero distance (1.4, between C_3 and C_4) indicates the linkage distance of the cluster (Fig. 3)
Fig. 3 Tree diagram for the hypothetical six cases. The numbers indicate the different steps taken in the clustering process
exist, but only the single-linkage or nearest-neighbor method will be explained. In this method, the distance between the clusters (samples) is obtained by calculating the minimum distance between all possible pairs. Table 2 shows the distances between all the different samples and shows that the first cluster that can be found is the cluster between sample C_3 and C_4 (Fig. 3). Then a new matrix is formed where the distance between the C_1, C_2, C_5, and C_6 samples and the new cluster C_3 & C_4 is represented by the nearest of the two neighbors of the new cluster, in this case C_3 or C_4 (Table 3). The minimum distance is again searched and a new cluster is created. Table 3 shows that the second new cluster that can be found is the cluster between sample C_5 and C_6 (Fig. 3). Then a new matrix is formed, where the distance between C_1, C_2, C_3 & C_4 and the new cluster C_5 & C_6 is represented by the nearest of the two neighbors of the cluster, in this case C_5 or C_6 (Table 4). The minimum distance is again searched and a new cluster is created. Table 4 shows that the third new
Table 3 Single-linkage/nearest-neighbor method: five clusters

              C_1     C_2     C_3 & C_4    C_5     C_6
C_1           0.0     7.8     13.5         25.0    28.7
C_2           7.8     0.0     5.7          17.2    21.0
C_3 & C_4     13.5    5.7     0.0          10.3    14.6
C_5           25.0    17.2    10.3         0.0     5.1
C_6           28.7    21.0    14.6         5.1     0.0

The smallest non-zero distance (5.1, between C_5 and C_6) indicates the linkage distance of the cluster (Fig. 3)
Table 4 Single-linkage/nearest-neighbor method: four clusters

             C_1     C_2     C_3 & C_4    C_5 & C_6
C_1          0.0     7.8     13.5         25.0
C_2          7.8     0.0     5.7          17.2
C_3 & C_4    13.5    5.7     0.0          10.3
C_5 & C_6    25.0    17.2    10.3         0.0
The smallest distance is highlighted in grey and indicates the linkage distance of the cluster (Fig. 3)
Table 5 Single-linkage/nearest-neighbor method: three clusters

                   C_1     C_2 & C_3 & C_4    C_5 & C_6
C_1                0.0     7.8                25.0
C_2 & C_3 & C_4    7.8     0.0                10.3
C_5 & C_6          25.0    10.3               0.0
The smallest distance is highlighted in grey and indicates the linkage distance of the cluster (Fig. 3)
cluster that can be found is the cluster between C_2 and C_3 & C_4 (Fig. 3). Then a new matrix is formed where the distance between C_1, C_5 & C_6 and the new cluster C_2 & C_3 & C_4 is represented by the nearest of the two neighbors of the cluster, in this case C_2 or C_3 & C_4 (Table 5). The minimum distance is again searched and a new cluster is created. Table 5 shows that the fourth new
Table 6 Single-linkage/nearest-neighbor method: two clusters

                         C_1 & C_2 & C_3 & C_4    C_5 & C_6
C_1 & C_2 & C_3 & C_4    0.0                      10.3
C_5 & C_6                10.3                     0.0
cluster that can be found is the cluster between C_1 and C_2 & C_3 & C_4 (Fig. 3). Then a new matrix is formed where the distance between C_5 & C_6 and the new cluster C_1 & C_2 & C_3 & C_4 is represented by the nearest of the two neighbors of the cluster, in this case C_1 or C_2 & C_3 & C_4 (Table 6). The minimum distance is again searched and a new cluster is created. Table 6 shows that the fifth and last cluster that can be found is the cluster between C_5 & C_6 and C_1 & C_2 & C_3 & C_4. Figure 3 gives an overview of the five clusters and their linkage distances. It is clear that samples C_3 and C_4 are the most similar, and that samples C_5 and C_6 are the most distant from the rest of the samples. In this simple example, this is already clear from the raw data in Fig. 2, but for more complex data sets such insight cannot be obtained by inspecting the raw data.
Another important objective is to understand the interdependencies of the proteins and to detect patterns. One possible route is to group the proteins via hierarchical clustering. As mentioned above, an m × y similarity or distance matrix D is formed, where m is the number of variables (matched spots) and y the number of cases (samples). The dimensions of this matrix can be swapped: in this case one investigates the similarity between the proteins. Proteins which are more related appear on the same branches. A disadvantage of this view is the number of variables: one cannot see the wood for the trees. Some software packages try to solve this by creating a heat map. Next to the clustering of the proteins, another way to detect patterns among the different proteins is to look at the loading plot of the X variables of a PLS-DA. The loading value indicates the correlation between the original variable (protein) and the new one, namely the latent variable (LV). The loading gives an indication of the extent to which the original variables are important in forming the LV. High and similar loadings for x1 and x2 mean that both these proteins are influential in forming the LV and so both have a high correlation to the LV. If some proteins have similar loading values for multiple latent variables, they are similarly correlated to those LVs and are probably correlated with each other.
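The single-linkage walk-through of Tables 2–6 can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not the software used by the author; the function name single_linkage and all variable names are assumptions made for this example, and only the distances of Table 2 are taken from the text.

```python
# Euclidian distances between the six hypothetical cases (Table 2).
labels = ["C_1", "C_2", "C_3", "C_4", "C_5", "C_6"]
D = [
    [0.0, 7.8, 13.5, 14.9, 25.0, 28.7],
    [7.8, 0.0, 5.7, 7.1, 17.2, 21.0],
    [13.5, 5.7, 0.0, 1.4, 11.7, 15.8],
    [14.9, 7.1, 1.4, 0.0, 10.3, 14.6],
    [25.0, 17.2, 11.7, 10.3, 0.0, 5.1],
    [28.7, 21.0, 15.8, 14.6, 5.1, 0.0],
]


def single_linkage(dist, names):
    """Hypothetical helper: agglomerate with the nearest-neighbor rule.

    Returns a list of (merged cluster label, linkage distance) tuples,
    one per clustering step.
    """
    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance over all member pairs.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        merges.append((" & ".join(names[i] for i in merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


merges = single_linkage(D, labels)
for step, (cluster, distance) in enumerate(merges, start=1):
    print(step, cluster, distance)
```

Run on the Table 2 distances, the sketch merges C_3 & C_4 first (linkage distance 1.4), followed by C_5 & C_6 (5.1), C_2 & C_3 & C_4 (5.7), C_1 & C_2 & C_3 & C_4 (7.8), and finally all six cases (10.3), in line with the minima read from Tables 2–6.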
Let us consider a real example. To discover relevant proteins that can be correlated to stress and stress tolerance, we analyzed the proteome of two different genotypes, a stress-tolerant one (CAC) and a sensitive one (MBW), under two different conditions (control and stress), and we measured “blindly” as many proteins as possible via 2-D DIGE. Which are the relevant proteins and which proteins are not changed in abundance? What is the variability among the samples? In this preliminary exploratory blind high-throughput discovery experiment, we analyzed three biological replicates per condition. After DIGE labeling, 1314 different variables (protein spots) were considered to be matched correctly over the six gels and so over the 12 samples. What is a good statistical technique to characterize the individual plants and to discover interesting protein patterns? In this example we will limit ourselves to the rather “simple” linear techniques PCA, PLS-DA, and hierarchical clustering. The PCA score plot shows that the biggest variability (principal component PC1) in the sample set can be attributed to the genotype (Fig. 4). This means that a different genotype produces a different proteome. But are there any proteins correlated to stress and, more importantly, potentially correlated to stress tolerance? (see Note 4). If one looks at the hierarchical clustering of the samples by the simple method of
Fig. 4 PCA score plot: the genotype CAC is represented in blue, the genotype MBW in yellow. The condition stress is represented as diamonds, control as dots. PC1 separates the genotypes
Fig. 5 Tree diagram for the 12 samples. This method is able to cluster similar samples but the genotypic dependent trend is not visible
single linkage, then it is clear that this method is able to cluster similar samples but even here the genotype-dependent trend is not evident (Fig. 5) (see Note 4). As indicated above, in contrast to PCA, PLS is a supervised technique applied to link continuous variables (the Y variables) to a set of independent variables (the X variables, i.e., the proteins in a gel). Since in our example the Y variables are categorical (treatment control/stress, genotype CAC/MBW), we apply a PLS-DA analysis. The score plot (Fig. 6) shows that, as in PCA, the biggest variability is explained by the genotype, represented by LV1. In this case 49 % of the variability of the dependent Y variables is explained by LV1 and 34 % by LV2. If six LVs are taken into account, 99.7 % of the total variance in Y is explained. In contrast to PCA, we now see a separation of the control and the stress samples by LV2. The (correlation) loading plot of the X variables (Fig. 7) allows one to identify which proteins are important in discriminating the different experimental treatments. The variable importance plot (VIP) is an interesting tool to combine the loading values of the different LVs. The proportion of variability explained in the X variables is then further weighted by
Fig. 6 PLS score plot: the genotype CAC is represented in blue, the genotype MBW in yellow. The condition stress is represented as diamonds, control as dots. The biggest variability is explained by the genotype represented by LV1, the treatment is explained by LV2
Fig. 7 PLS loading plot of the X variables. The 50 most important variables are displayed in red
Fig. 8 VIP scores. The VIP scores for all the variables are ranked from high to low. In red are displayed the 20 most important variables
the variability accounted for by the LVs. If we plot the VIP scores, we see in Fig. 8 a cut-off around the first 20 variables (the 50 most important variables are highlighted in Fig. 7). We have seen from cluster analysis that we clearly needed a selection of the relevant variables (Fig. 5). If we redo the analysis blindly, based only on the VIP 20 (Fig. 9), then we see a pattern similar to that of the PLS-DA score plot (Fig. 6). One can see that the stressed samples are more similar, and that there is a clear separation between the genotypes. The clustering of the proteins shows that the 12 proteins correlated to stress (858, 733, 735, 153, 548, 727, 653, 855, 740, 717, 808, 825) are more similar (Fig. 10). Next to the clustering of the proteins, another way to detect patterns among the different proteins is to look at the loading plot of PLS-DA. Variables with a high negative loading for LV1 are protein spots with a high abundance in the genotype CAC (Fig. 11a) while variables with a high positive loading for LV1 are protein spots with a high abundance in the genotype MBW (Fig. 11b). Variables with a high positive loading for LV2 are protein spots with a high abundance during stress (Fig. 11c) while variables with a high negative loading for LV2 are protein spots with a high abundance in control conditions (Fig. 11d). Variables with a high negative loading for LV1 and a high positive loading for LV2 are
Fig. 9 Tree diagram for the 12 samples based on the 20 most important variables (VIP20). One can see that the stressed samples are more similar and that there is a clear separation between the genotypes
protein spots with a high abundance in the genotype CAC under stress conditions (Fig. 12a). Variables with a high negative loading for LV1 and a high negative loading for LV2 are protein spots with a high abundance in the genotype CAC under control conditions (Fig. 12b). Variables with a high positive loading for LV1 and a high positive loading for LV2 are protein spots with a high abundance in the genotype MBW under stress conditions (Fig. 12c). Variables with a high positive loading for LV1 and a high negative loading for LV2 are protein spots with a high abundance in the genotype MBW in control conditions (Fig. 12d). Especially the last four categories are important in our case, since we want to identify the genotype dependent differences correlated to the treatment.
3.2 The Characterization of Protein Families and Their Possible PTMs
As indicated above, 2-DE is very effective at identifying and quantifying protein species caused by genetic variations (paralogous genes and allelic variants), alternative splicing and/or PTMs. On 2-DE gels, several spot trains of similar proteins are often observed. In many cases, especially when working with non-model organisms, the
Fig. 10 Tree diagram for the 20 most important variables (VIP20). The clustering of the proteins shows that the 12 proteins correlated to stress (858, 733, 735, 153, 548, 727, 653, 855, 740, 717, 808, 825) are more similar
[Fig. 11 image: four panels (a–d); x-axis groups in each panel: CAC control, CAC stress, MBW control, MBW stress]
Fig. 11 Correlation between proteins. (a) Proteins correlated to the genotype CAC. (b) proteins correlated to the genotype MBW. (c) Proteins correlated to the treatment stress. (d) Proteins correlated to the treatment control
[Fig. 12 image: four panels (a–d); x-axis groups in each panel: CAC control, CAC stress, MBW control, MBW stress]
Fig. 12 Correlation between proteins. (a) Proteins correlated to the genotype CAC and to the treatment stress. (b) Proteins correlated to the genotype CAC and to the treatment control. (c) Proteins correlated to genotype MBW and to the treatment stress. (d) Proteins correlated to genotype MBW and to the treatment control
different spots are annotated at the protein family level, but what causes the redundancy is seldom resolved. To characterize protein families and their possible PTMs, hierarchical cluster analysis techniques are very useful [6, 17]. Alm et al. [6] developed a free software tool (Speclust) that matches the mass spectra of the different spots against each other and gives an overview of the shared and unique peak masses. The clustering principle is the same as described above, except that the first step is to align the different peaks according to a mass tolerance. However, the result of spot visualization by staining is a simplification of the actual protein distribution within a gel and shows only the tip of the iceberg [17]. Though a 2-DE spot may look perfectly focused, in reality proteins are present along a horizontal trail whose highest intensity coincides with the highest intensity of the staining (see Note 5). So clusters can be formed with neighboring spots due to the horizontal protein trailing. Taking this into account, clustering is an ideal way to determine the relationships between (neighboring) spots. Additionally, once the relationship between the different spots and their common and unique peptides has been identified, one might end up with several peptides that are not identified due to a PTM, genetic variability
Fig. 13 2-DE gel (24 cm, pI 4–7, Cy2 standard) with inset zoomed in on a train of 8 HSP70 spots. Spots 2, 28, 15, and 21 contain predominantly cytoplasmic isoforms, spots 12 and 24 luminal isoforms, and spots 29 and 3 contain mitochondrial isoforms. The tree diagram generated by Speclust shows the close relation between the acidic and basic cytoplasmic isoforms
or chemical modification that is not predicted from the database. Although Alm et al. [6] did not originally intend the software for this purpose, the same cluster analysis method can also be used here to cluster similar parent masses. MS/MS spectra derived from similar peptides (differing only in one or two amino acids, a PTM, or a chemical modification) contain many similar fragments and will cluster together. In this way similar peptides with a deviating m/z can be identified. An example of a train of eight spots identified as HSP70 is presented in Fig. 13. We observe four spots containing predominantly cytoplasmic isoforms, two acidic (2 and 28) and two more basic (15 and 21), two spots containing luminal isoforms (12 and 24), and two containing mitochondrial isoforms (29 and 3). The tree diagram shows the close relation between the acidic and basic cytoplasmic isoforms. For a detailed analysis of the HSP70 family the reader is referred to [18]. Several isoforms are present, even in one spot. So even on a 24 cm gel spanning 3 pI units, not all isoforms are resolved. As an example the MS/MS spectra of spot 15 (Fig. 13) were clustered with the Speclust software. Figure 14 identifies variants of the same peptide. We have visualized the same peptide with changes due to genetic and chemical modifications. We see four big clusters due to four paralogous genes. Within the clusters we can see genetic differences (allelic variants) and chemical modifications.
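The peak-alignment step that precedes the clustering of mass spectra can be illustrated with a small sketch. This is not the Speclust implementation: the one-to-one matching rule, the Jaccard-type distance, the function names, the tolerance, and the peak masses below are all assumptions chosen for illustration.

```python
def shared_peaks(a, b, tol=0.2):
    """Count peaks of two peak lists that match within `tol` (one-to-one)."""
    a, b = sorted(a), sorted(b)
    i = j = n = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            # Peaks agree within the mass tolerance: count and move on.
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return n


def peak_distance(a, b, tol=0.2):
    """Assumed distance: 1 minus the fraction of shared peaks (Jaccard-type)."""
    n = shared_peaks(a, b, tol)
    return 1.0 - n / (len(a) + len(b) - n)


# Hypothetical peak masses (m/z) for three spots; not data from this chapter.
spot_a = [842.5, 1045.6, 1479.8, 2211.1]
spot_b = [842.6, 1045.5, 1479.9, 1638.9]
spot_c = [927.5, 1107.6, 1638.8, 2211.0]

print(shared_peaks(spot_a, spot_b), peak_distance(spot_a, spot_b))
print(shared_peaks(spot_a, spot_c), peak_distance(spot_a, spot_c))
```

With these invented peak lists, spots a and b share three peaks and spots a and c share one, so a and b would be joined first by any of the hierarchical clustering rules described above. The resulting pairwise distances could be fed to the same agglomerative procedure used for the samples.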
Fig. 14 Tree diagram generated by Speclust of the MS/MS spectra of spot 15. We have visualized the same peptide with changes due to genetic and chemical modifications. We see four big clusters due to four paralogous genes. Within the clusters we can see genetic differences (allelic variants) and chemical modifications
4 Conclusions
2-DE is still a powerful technique. Univariate and multivariate techniques should be combined. To avoid reporting too many false-positive results in the univariate approach, it is necessary to apply a correction for multiple testing. Benjamini and Hochberg [7] showed that their approach is a practical and powerful one for multiple testing. To detect patterns in the protein abundances, multiple techniques are possible. A selection of the most important proteins helps in gaining insight into interesting protein patterns. To further characterize the protein patterns, proteins must be identified via mass spectrometry, and clustering of the mass spectra is of great help.
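As an illustration of the correction for multiple testing, the step-up procedure of Benjamini and Hochberg [7] can be sketched as follows. The function name, the false discovery rate q = 0.05, and the ten p-values are hypothetical and serve only to show the mechanics; in a real 2-DE experiment there would be one p-value per matched spot.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where H_0 is rejected at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= q * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            k_max = rank
    # Reject H_0 for the k_max smallest p-values.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            significant[idx] = True
    return significant


# Hypothetical p-values for ten spots (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

uncorrected = sum(p < 0.05 for p in p_values)
corrected = sum(benjamini_hochberg(p_values, q=0.05))
print(uncorrected, corrected)
```

For these invented values, five spots fall below an uncorrected α of 0.05, whereas only the two smallest p-values survive the correction.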
5 Notes
1. The null hypothesis (H_0) states that there is no difference in protein abundance between the treatments. The test is carried out at an accepted level α of obtaining a false-positive result, i.e., of erroneously rejecting H_0. Commonly used α-levels are 0.05 or 0.01.
2. In the case of a proteomics experiment, a type I error means rejecting H_0 and claiming that a particular spot abundance is significantly different while in reality it is not.
3. Although only one method is explained here, multiple different methods exist, such as the centroid, single-linkage or nearest-neighbor, complete-linkage or farthest-neighbor, and average-linkage methods. For a detailed overview, the reader is referred to Chapter 7 of the book by Sharma [10].
4. As indicated above, a blind or non-supervised technique is ideal to explore your samples and characterize the variability. PCA is able to quantify the variables that are influential for PC1 and so characteristic of a particular genotype, but it fails to show the variables that are related to stress. The single-linkage clustering method is able to cluster similar samples, but even here the genotype-dependent trend is not visible. It is clear that we have analyzed too many proteins that are not correlated to the genotype and/or stress condition. So we need a supervised method that selectively picks the relevant variables.
5. Even with sensitive staining such as DIGE, only the area containing the highest concentration of the focused protein is visualized; the horizontal trails are not visualized when the proteins are well focused [18].
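The linkage rules listed in Note 3 differ only in how the distance between two clusters is derived from the distances between their members. A minimal sketch, assuming the Table 2 distances and a hypothetical helper cluster_distance, compares three of these rules for the clusters C_3 & C_4 and C_5 & C_6 (the centroid method is omitted because it requires the original coordinates).

```python
# Pairwise Euclidian distances between members of the two clusters (Table 2).
dist = {
    ("C_3", "C_5"): 11.7,
    ("C_3", "C_6"): 15.8,
    ("C_4", "C_5"): 10.3,
    ("C_4", "C_6"): 14.6,
}


def cluster_distance(cluster_a, cluster_b, rule):
    """Hypothetical helper: distance between two clusters under a linkage rule."""
    pairs = [dist[(a, b)] for a in cluster_a for b in cluster_b]
    if rule == "single":      # nearest neighbor
        return min(pairs)
    if rule == "complete":    # farthest neighbor
        return max(pairs)
    if rule == "average":     # mean over all member pairs
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage rule: " + rule)


left, right = ["C_3", "C_4"], ["C_5", "C_6"]
for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(left, right, rule))
```

The single-linkage value (10.3) is the entry shown in Table 4; complete linkage would give 15.8 and average linkage about 13.1 for the same pair of clusters, which is why the choice of linkage rule can change the shape of the tree diagram.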
Acknowledgements
The author would like to thank Annick De Troyer and Anne-Catherine Vanhove for technical assistance. Prof. Etienne Waelkens and his group (Laboratory of Protein Phosphorylation and Proteomics, KU Leuven) are gratefully acknowledged for the MALDI-TOF/TOF measurements. Financial support from “CIALCA” and the Bioversity International project “ITC characterization” (research projects financed by the Belgian Directorate-General for Development Cooperation (DGDC)) is gratefully acknowledged.
References
1. Nesvizhskii AI, Aebersold R (2005) Interpretation of shotgun proteomics data: the protein inference problem. Mol Cell Proteomics 4(10):1419–1440
2. Carpentier S, Panis B, Renaut J, Samyn B, Vertommen A, Vanhove A, Swennen R, Sergeant K (2011) The use of 2D-electrophoresis and de novo sequencing to characterize inter- and intra-cultivar protein polymorphisms in an allopolyploid crop. Phytochemistry 72:1243–1250
3. Henry I, Carpentier S, Pampurova S, Van Hoylandt A, Panis B, Swennen R, Remy S (2011) Structure and regulation of the ASR gene family in banana. Planta 234:785–798
4. Carpentier S, Swennen R, Panis B (2009) Plant protein sample preparation for 2DE. In: Walker J (ed) The protein protocols handbook. Humana Press, Totowa, NJ, pp 109–119
5. Carpentier S, Witters E, Laukens K, Deckers P, Swennen R, Panis B (2005) Preparation of protein extracts from recalcitrant plant tissues: an evaluation of different methods for two-dimensional gel electrophoresis analysis. Proteomics 5:2497–2507
6. Alm R, Johansson P, Hjernø K, Emanuelsson C, Ringnér M, Häkkinen J (2006) Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra. J Proteome Res 5:785–792
7. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B 57:289–300
8. Carpentier S, Panis B, Swennen R, Lammertyn J (2008) Finding the significant markers: statistical analysis of proteomic data. In: Vlahou A (ed) Methods in molecular biology, vol 428. Humana Press Inc., Totowa, NJ, pp 327–347
9. Marengo E, Robotti E, Bobba M, Liparota MC, Rustichelli C, Zamò A, Chilosi M, Righetti PG (2006) Electrophoresis 27:484–494
10. Sharma S (1996) Applied multivariate techniques. Wiley, New York. ISBN 0-471-31064-6
11. Jackson JE (2003) A user’s guide to principal components. Wiley, New York
12. Tarroux P (1983) Analysis of protein-patterns during differentiation using 2-D electrophoresis and computer multidimensional classification. Electrophoresis 4:63–70
13. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
14. Scheel I, Aldrin M, Glad IK, Sorum R, Lyng H, Frigessi A (2005) The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 21:4272–4279
15. Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19:2088–2096
16. Wold S (1985) Partial least squares. Encyc Stat Sci 6:581–591
17. Schmidt F, Schmid M, Jungblut PR, Mattow J, Facius A, Pleissner K (2003) Iterative data analysis is the key for exhaustive analysis of peptide mass fingerprints from proteins separated by two-dimensional electrophoresis. J Am Soc Mass Spectrom 14:943–956
18. Vanhove A (2014) The quest for osmotic stress markers in Musa: from protein to gene and back in a non-model crop. Dissertation presented for the degree of Doctor in Bioscience Engineering, KU Leuven, Leuven
Chapter 14
Chemometric Multivariate Tools for Candidate Biomarker Identification: LDA, PLS-DA, SIMCA, Ranking-PCA
Elisa Robotti and Emilio Marengo
Abstract
2-D gel electrophoresis usually provides complex maps characterized by a low reproducibility: this hampers the use of spot volume data for the identification of reliable biomarkers. Under these circumstances, effective and robust methods for the comparison and classification of 2-D maps are fundamental for the identification of an exhaustive panel of candidate biomarkers. Multivariate methods are the most suitable since they take into consideration the relationships between the variables, i.e., effects of synergy and antagonism between the spots. Here the most common multivariate methods used in spot volume datasets analysis are presented. The methods are applied on a sample dataset to prove their effectiveness.
Key words Principal component analysis, Classification, SIMCA, LDA, PLS-DA, Ranking-PCA, Spot volume data
1 Introduction
Biomarkers are usually searched for among the great variety of molecules translated and/or synthesized by a particular cell or tissue in response to different factors; some examples are: the onset of a disease or the administration of a drug in the clinical field; the effect of environmental conditions and pollution on the growth of plants and microorganisms in the field of plant environmental biology and microbiology; and ripening, storage conditions, etc., on the expression profile of food products in food chemistry. The field of biomarker discovery is often closely related to the field of proteomics since the investigated effect (a pathology, a drug, ripening, etc.) usually reveals itself through a different protein pattern between controls and other samples. In such cases, a well-established procedure consists of the so-called top-down approach that starts from protein extracts and separates them by 2-D gel electrophoresis. Differential analysis tools are then applied to compare maps from control and other samples to provide a panel
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_14, © Springer Science+Business Media New York 2016
of candidate biomarkers to be subsequently characterized by mass spectrometry investigations. This approach is certainly quite widespread but suffers from some limitations: first of all the low reproducibility affecting 2-D gel electrophoresis, which hampers the identification of reliable and robust biomarkers. This can be partially overcome by carrying out sets of replications of the same electrophoretic separation or exploiting 2-D difference gel electrophoresis (DIGE), but the replications available are usually few (
[Fig. 5 image, panel e: loading plot of PC2 on the virtual 2-D map; color-scale classes ranging from < -0.09 to > 0.09]
Fig. 5 PCA results: Score plot of the first two PCs (a). Loading plots of PC1: negative (b) and positive (c) loadings; loading plots of PC2: negative (d) and positive (e) loadings. Loading plots are represented on a virtual 2-D map where spots are indicated as circles centered in the corresponding x–y coordinates on a color scale: from light to dark blue, indicating increasing negative loadings and from light to dark red, indicating increasing positive loadings
Table 2 Classification matrix for the models: LDA based on the original variables; LDA based on PCs; SIMCA

LDA on the original variables; LDA on PCs; SIMCA model
          NER %    T0    T1    T2
T0        100      8     0     0
T1        100      0     8     0
T2        100      0     0     7
Total     100      8     8     7
3. Results: classification matrix. Table 2 reports the classification matrix, where the true classes are present on the rows and the calculated classes on the columns. The second column reports the NER % of each class and the overall NER %. In this case all the samples are correctly classified.
4. Results: Mahalanobis distances. Table 3a reports the squared Mahalanobis distances calculated for each sample with respect to the three classes present. The results show that the classification is very good, since each sample shows a small distance from its own class and a far larger distance from the other classes: the classes therefore appear well separated from each other and quite compact.
5. Results: classification functions and significance. Table 4 reports the classification functions, i.e., the coefficients for each variable included in the model on the model of each class. The last two columns indicate for each variable the corresponding p-level and the F-value. All the variables included are statistically significant since they have a p-level
[Fig. 8 image: virtual 2-D maps of the discrimination power (DP), panels a–c; panel c is titled "Discrimination Power T1 vs T2"; axes: Center X (mm), Center Y (mm); legend classes: DP < 1E11, 1E11 < DP < 4E11, DP > 4E11]
Fig. 8 SIMCA results: Plot of the DPs calculated for each spot for each comparison: T0 vs. T1 (a); T0 vs. T2 (b); and T1 vs. T2 (c). DPs are represented on a virtual 2-D map where spots are indicated as circles centred in the corresponding x–y coordinates on a color scale, from light to dark red, indicating increasing positive DPs
negative values along LV1 and the other class at positive scores along the same LV. The loadings are reported as virtual 2-D maps where each discriminant spot is represented by a circle centred in the corresponding x–y coordinates on a color scale: blue-colored spots are up-regulated in controls and down-regulated in the other class while red-colored spots show an opposite behavior.
4.7 Ranking-PCA
1. Arrange the data as in Subheading 4.1.
2. Run the Ranking-PCA algorithm for each comparison separately, autoscaling the original X variables (T0 vs. T1, T0 vs.
Table 7 PLS-DA: percentage of cumulative explained variance of the first seven LVs on the X and Y variables for the three PLS-DA models calculated

        T0 vs. T1           T0 vs. T2           T1 vs. T2
        Y         X         Y         X         Y         X
LV1     96.75     51.45     98.19     61.39     96.76     53.24
LV2     99.60     56.53     99.68     67.93     99.52     58.81
LV3     99.82     61.48     99.97     72.27     99.89     64.01
LV4     99.93     66.20     100.00    74.56     99.98     68.39
LV5     99.98     71.17     100.00    77.96     100.00    72.02
LV6     99.99     74.62     100.00    83.20     100.00    76.24
LV7     100.00    79.18     100.00    86.04     100.00    79.57
Table 8 Classification matrices for the PLS-DA models and for the Ranking-PCA models

PLS-DA models; Ranking-PCA models

T0 vs. T1    NER %    T0    T1
T0           100      8     0
T1           100      0     8
Total        100      8     8

T0 vs. T2    NER %    T0    T2
T0           100      8     0
T2           100      0     7
Total        100      8     7

T1 vs. T2    NER %    T1    T2
T1           100      8     0
T2           100      0     7
Total        100      8     7
T2, T1 vs. T2); select a cross-validation procedure (here LOO cross-validation is applied; see Note 7).
3. Results: explained variance. Table 9 reports the percentage of explained variance for the first five PCs calculated for the three comparisons. The first two PCs are considered significant for all the cases investigated since they account for about 50 % or more of the overall variance (49.42 %, 56.45 %, and 51.46 % for the three comparisons).
4. Results: classification matrix. Table 8 reports the classification matrices for the three comparisons: in all cases all the samples are correctly classified.
5. Results: Ranking-PCA indexes. Figure 10 reports the trend of Ind1, Ind2, and the distance between the centroids of the two classes for the three comparisons. Each index is reported on the y-axis while the variables present in the model are reported on
[Fig. 9 image: PLS-DA score plots (panels a, c, e; axes Factor-1 vs. Factor-2) and LV1 loading plots (panels b, d, f; axes Center X (mm), Center Y (mm); positive and negative loadings). Score-plot axis labels: T0 vs. T1, Factor-1 (51%, 97%) and Factor-2 (5%, 3%); T0 vs. T2, Factor-1 (61%, 98%) and Factor-2 (7%, 1%); T1 vs. T2, Factor-1 (53%, 97%) and Factor-2 (6%, 3%)]
Fig. 9 PLS-DA results: Score plot of the first two LVs calculated for the three comparisons: T0 vs. T1 (a); T0 vs. T2 (c); T1 vs. T2 (e). Loading plot of the first LV calculated for the three comparisons: T0 vs. T1 (b); T0 vs. T2 (d); T1 vs. T2 (f). Loadings are represented on a virtual 2-D map where spots are indicated as circles centred in the corresponding x–y coordinates on a color scale: blue spots show a negative loading on LV1 while red ones show a positive loading on the same LV
Table 9 Ranking-PCA models: percentage of explained variance and cumulative explained variance of the first five PCs calculated for each comparison

        T0 vs. T1                              T0 vs. T2                              T1 vs. T2
        % Expl. Var.    % Cum. Expl. Var.      % Expl. Var.    % Cum. Expl. Var.      % Expl. Var.    % Cum. Expl. Var.
PC1     40.39           40.39                  47.69           47.69                  42.21           42.21
PC2     9.03            49.42                  8.76            56.45                  9.24            51.46
PC3     7.25            56.67                  8.10            64.55                  7.89            59.35
PC4     6.62            63.30                  6.18            70.73                  6.03            65.38
PC5     5.88            69.18                  4.50            75.22                  5.05            70.44
the x-axis in the order in which they are included in the final model. Looking at the plots, about 80 variables appear as the most significant in the comparison between T0 and T1 samples; about 50 in the comparison between T0 and T2 samples and about 40 in the comparison between T1 and T2 samples.
6. Results: score and loading plots. Figure 11 represents the score and loading plots of the first two PCs calculated for the three models investigated. In all the score plots the first PC allows the separation of the two classes of samples: control samples at negative values along PC1 and the other class at positive scores along the same PC. The loadings are reported separately for PC1 and PC2: the loadings are reported on the y-axis while the variables are reported on the x-axis in the order in which they are included in the final model. The loadings along PC1 show a characteristic trend: the first variables added show a more relevant positive/negative loading since they are the most discriminating and the loadings decrease while variables are added, since the spots are included in the final model according to their decreasing discriminating power. Spots with positive loadings on PC1 are up-regulated in T1 (Fig. 11a) or T2 (vs. T0, Fig. 11b; vs. T1, Fig. 11c) and down-regulated in T0 (vs. T1, Fig. 11a; vs. T2, Fig. 11b) or T1 samples (Fig. 11c). Spots with negative loadings on PC1 show an opposite behavior.
4.8 Comparison of the Different Methods
Table 10 reports the number of spots considered as differentially expressed by the multivariate methods applied. All the models provide the correct classification of all the samples (NER% = 100%). LDA based on the original variables, as expected, provides the most parsimonious model, with only 19 biomarkers identified: the maximum number of variables that can be included in the model
Fig. 10 Ranking-PCA results; trend of Ind1, Ind2, and the distance between the two class centroids as discriminant variables are included in the model, for the three comparisons: T0 vs. T1 (a), T0 vs. T2 (b), and T1 vs. T2 (c). The original variables are reported on the x-axis (Added variables) in the order in which they are included in the model; each panel reports Ind1, Ind2, and Distance on the y-axis, in calibration (Calib) and in cross-validation (Cross-valid)
is limited to the number of samples present. This constraint can be overcome by applying LDA on PCs rather than on the original variables (see Note 8). Biomarkers can be identified in SIMCA from the DP values calculated: the greatest difficulty is the identification of the statistically significant markers, since in most cases arbitrary thresholds are applied to the DP values to identify the statistically relevant ones (see Note 9). PLS-DA and Ranking-PCA provide the best results as regards the exhaustiveness of the search, with Ranking-PCA being by far the most exhaustive method.
Fig. 11 Ranking-PCA results; Score plot and loading plot of the first two PCs calculated for each comparison: T0 vs. T1 (a), T0 vs. T2 (b), and T1 vs. T2 (c). The loadings are represented separately for each PC: loadings are reported on the y-axis while the original variables are reported on the x-axis in the order in which they are included in the final model. In the score plots (PC2 vs. PC1) the samples are labeled by class and replicate (e.g., T0-1, T1-8, T2-3); in the loading plots the x-axis is labeled with the spot identifiers (SSP numbers).
Table 10 Number of significant biomarkers identified by each method

Method                      T0 vs. T1   T0 vs. T2   T1 vs. T2
LDA on original variables   19          19          19
SIMCA                       57          106         44
PLS-DA                      88          142         123
Ranking-PCA                 196         228         181
Notes
1. Another drawback that affects both the monovariate and the multivariate approaches is the correlation existing between maps and experimental runs: maps that are run together in fact show a correlated experimental error.
2. Autoscaling is usually applied in PCA when the variables show different scale effects: when this occurs and no scaling is applied, the variables with the largest scale factor weigh most on the calculated PCs.
3. Different algorithms are available for PC calculation [1, 2]. Here NIPALS [1, 2] was adopted.
4. The variable selection procedure in forward search [1, 2] starts with a model with no variables and at each iteration adds the variable that, together with the variables already included in the model, gives the best separation of the classes. The algorithm stops when all the significant variables have been added. The selection procedure is based on a Fisher F-test; the j-th variable is included in the model, with p variables already included, if:

$$F_{+j} = \max_j \left[ \frac{RSS_p - RSS_{p+j}}{s^2_{p+j}} \right] > F_{\text{to-enter}} \tag{15}$$

where:
$s^2_{p+j}$ = variance calculated for the model with p variables plus the j-th variable.
$RSS_p$ = residual sum of squares of the model with p variables.
$RSS_{p+j}$ = residual sum of squares of the model with p variables plus the j-th variable.
The F value thus calculated is compared to a reference value ($F_{\text{to-enter}}$) usually set at values ranging from 2 (more permissive selection, including a larger number of variables in the final model) to 4 (more severe selection). At each iteration the variable showing the largest value of F is included in the model, provided that the F calculated is larger than the selected threshold. In this case $F_{\text{to-enter}}$ = 4 was adopted. Variable selection can also be applied in backward elimination: in this case the initial model includes all the variables and, at each iteration, the least discriminating variable is eliminated.
5. LDA can be effectively run on PCs rather than on the original variables, thus overcoming the problems related to the low number of samples available in proteomic studies: when LDA is run on the original variables, the maximum number of variables that can be added to the final model is limited to the number of samples. The use of PCs overcomes this limitation since a dimensionality reduction tool is applied.
6. The Coomans plot [15] compares two models at a time: the x-axis reports the distance of each sample from the first model to be compared and the y-axis the distance from the second model. The plot shows two lines: a horizontal line indicating the boundary of the class indicated on the y-axis and a vertical line indicating the boundary of the class on the x-axis. Samples in the bottom right quadrant are within the boundary of the class on the y-axis and outside the boundary of the class on the x-axis; samples in the top left quadrant are within the boundary of the class on the x-axis and outside the boundary of the class on the y-axis; samples in the bottom left quadrant are within both classes (if samples are present here, the two classes overlap); samples in the top right quadrant are outside the boundary of both classes, i.e., they are outliers or samples belonging to other classes.
7. In LOO cross-validation, each sample in turn is eliminated from the model and the model built on the remaining samples is used to predict the response of the sample eliminated. The predicted responses are used to evaluate the performance of the model in prediction. An alternative is leave-more-out (LMO) cross-validation, in which more than one sample is eliminated at a time. LOO is usually applied with a small number of samples, while LMO can be applied with a large number of samples.
8. In this case the loadings of the significant PCs included in the model must be combined to provide a coefficient for the contribution of each original variable to the final LDA model, to identify the most discriminant spots.
9. In this case a threshold of 1.00 1011 was applied. A statistically sound method for identifying the statistically relevant markers from SIMCA results is reported in [6].
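The forward search of Note 4 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the routine used in this chapter: the class membership is coded as a -1/+1 response, RSS is the residual sum of squares of a least-squares fit with intercept, $s^2_{p+j}$ is taken as $RSS_{p+j}$ divided by the residual degrees of freedom (n - p - 2), and the names `forward_select` and `dot` and the demo data are hypothetical.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def forward_select(X, y, f_to_enter=4.0):
    """Return the column indices of X in their order of inclusion."""
    n, m = len(X), len(X[0])
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residual of the empty model
    cols = []
    for j in range(m):                       # mean-centred columns (intercept)
        col = [row[j] for row in X]
        mu = sum(col) / n
        cols.append([v - mu for v in col])
    rss = dot(resid, resid)                  # RSS_p with p = 0
    selected = []
    while len(selected) < m:
        dof = n - len(selected) - 2          # assumed: n - (p + 1) slopes - intercept
        if dof <= 0:
            break
        best_f, best_j, best_gain = 0.0, None, 0.0
        for j in range(m):
            if j in selected:
                continue
            norm2 = dot(cols[j], cols[j])
            if norm2 < 1e-12:                # column adds no new information
                continue
            gain = dot(resid, cols[j]) ** 2 / norm2   # RSS_p - RSS_{p+j}
            s2 = (rss - gain) / dof                    # s^2_{p+j}
            f = gain / s2 if s2 > 0 else math.inf
            if f > best_f:
                best_f, best_j, best_gain = f, j, gain
        if best_j is None or best_f <= f_to_enter:
            break                            # no candidate passes F-to-enter
        q = cols[best_j]
        qq = dot(q, q)
        b = dot(resid, q) / qq
        resid = [r - b * v for r, v in zip(resid, q)]
        for k in range(m):                   # orthogonalise the remaining columns
            if k != best_j and k not in selected:
                a = dot(cols[k], q) / qq
                cols[k] = [u - a * v for u, v in zip(cols[k], q)]
        selected.append(best_j)
        rss -= best_gain
    return selected

# Hypothetical data: 8 samples, 3 variables; only the first separates the classes.
X = [[0.10, 0.5, 1.0], [0.20, 0.1, 2.0], [0.15, 0.4, 1.5], [0.05, 0.2, 1.2],
     [0.90, 0.3, 1.1], [1.00, 0.5, 1.9], [0.95, 0.1, 1.4], [1.10, 0.4, 1.3]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
order = forward_select(X, y, f_to_enter=4.0)
```

Because each remaining column is orthogonalised against the variables already selected, the reduction in RSS obtained by adding a candidate can be read directly from its projection on the current residual, so no matrix inversion is needed.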
Acknowledgements
The authors gratefully acknowledge the collaboration of Dr. Daniela Cecconi (University of Verona) who provided the biological samples and the 2D-maps used in this study.
References
1. Massart DL, Vandeginste BGM, Deming SM, Michotte Y, Kaufman L (1988) Chemometrics: a textbook. Elsevier, Amsterdam
2. Vandeginste BGM, Massart DL, Buydens LMC, De Jong S, Lewi PJ, Smeyers-Verbeke J (1988) Handbook of chemometrics and qualimetrics: part B. Elsevier, Amsterdam
3. Frank IE, Lanteri S (1989) Classification models: discriminant analysis, SIMCA, CART. Chemometr Intell Lab Syst 5:247–256
4. Marengo E, Robotti E, Righetti PG, Campostrini N, Pascali J, Ponzoni M, Hamdan M, Astner H (2004) Study of proteomic changes associated with healthy and tumoral murine samples in neuroblastoma by principal component analysis and classification methods. Clin Chim Acta 345:55–67
5. Marengo E, Robotti E, Bobba M, Liparota MC, Rustichelli C, Zamò A, Chilosi M, Righetti PG (2006) Multivariate statistical tools applied to the characterization of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel electrophoresis. Electrophoresis 27:484–494
6. Marengo E, Robotti E, Bobba M, Righetti PG (2008) Evaluation of the variables characterized by significant discriminating power in the application of SIMCA classification method to proteomic studies. J Proteome Res 7:2789–2796
7. Martens H, Naes T (1989) Multivariate calibration. Wiley, London
8. Seasholtz MB, Kowalski B (1993) The parsimony principle applied to multivariate calibration. Anal Chim Acta 277:165–177
9. Booksh KS, Kowalski BR (1997) Calibration method choice by comparison of model basis functions to the theoretical instrumental response function. Anal Chim Acta 348(1–3):1–9
10. Gributs CE, Burns DH (2006) Parsimonious calibration models for near-infrared spectroscopy using wavelets and scaling functions. Chemometr Intell Lab Syst 83(1):44–53
11. Lo Re V III, Bellini LM (2002) William of Occam and Occam's razor. Ann Intern Med 136(8):634–635
12. Robotti E, Demartini M, Gosetti F, Calabrese G, Marengo E (2011) Development of a classification and ranking method for the identification of possible biomarkers in two-dimensional gel-electrophoresis based on principal component analysis and variable selection procedures. Mol Biosyst 7(3):677–686
13. Marengo E, Robotti E, Bobba M, Gosetti F (2010) The principle of exhaustiveness versus the principle of parsimony: a new approach for the identification of biomarkers from proteomic spot volume datasets based on principal component analysis. Anal Bioanal Chem 397(1):25–41
14. Polati R, Menini M, Robotti E, Millioni R, Marengo E, Novelli E, Balzan S, Cecconi D (2012) Proteomic changes involved in tenderization of bovine Longissimus dorsi muscle during prolonged ageing. Food Chem 135:2052–2069
15. Esbensen KH, Guyot D, Westad F, Houmoller LP (2002) Multivariate data analysis - in practice: an introduction to multivariate data analysis and experimental design. CAMO Process Inc., Oslo, Norway
Part IV Differential Analysis from Direct Image Analysis Tools
Chapter 15
The Use of Legendre and Zernike Moment Functions for the Comparison of 2-D PAGE Maps
Emilio Marengo, Elisa Robotti, and Marco Demartini
Abstract
The comparison of 2-D maps is not trivial, the main difficulties being the high complexity of the sample and the large experimental variability characterizing 2-D gel electrophoresis. The comparison of maps from control and treated samples is usually performed by specific software, providing the so-called spot volume dataset, where each spot of a given map is matched to its analogue in the other maps and is described by its optical density, which is supposed to be related to the underlying protein amount. Here, a different approach is presented, based on the direct comparison of 2-D map images: each map is decomposed in terms of moment functions, which are commonly adopted in image analysis problems. The moments calculated are then treated with multivariate classification techniques. Two types of moment functions are presented (Legendre and Zernike moments), while linear discriminant analysis and partial least squares discriminant analysis are exploited as classification tools to provide the classification of the samples. The procedure is applied to a sample dataset to prove its effectiveness.
Key words Moment functions, Legendre moments, Zernike moments, Classification, LDA, PLS-DA, Image analysis
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_15, © Springer Science+Business Media New York 2016
1 Introduction
2-D maps can be effectively used for both diagnostic and prognostic purposes by investigating the differences between 2-D maps from control and other individuals [1–13] and they are also widely applied in the field of drug development, especially for cancer [14]. In such studies, 2-D maps belonging to different individuals are compared to identify the differences between controls and other samples, and to build classification models. Unfortunately, this is not a trivial process [11, 15] due mainly to some drawbacks typical of 2-D PAGE technology:
• The complexity of the specimen, which can result in maps with thousands of spots;
• The complexity of the sample pretreatment, often characterized by many purification/extraction steps, which may cause the appearance of spurious spots in the final 2-D maps;
• The usually small differences between controls and other samples, often difficult to detect in real complex maps.
The classical approach to the comparison of 2-D maps is based on the use of specific software (see Chapter 3 for a general description) [16–18], providing a so-called spot volume dataset, where each map is described in terms of the volume of the spots detected on its surface. Differential analysis is then performed on spot volume data to provide candidate biomarkers or to build classification models. Obtaining spot volume data is, however, quite complex and time consuming; moreover, the result usually depends on the skill of the operator (see Chapters 1 and 3 for more details). Here, a different approach is described, based on the direct analysis of 2-D map images: map images are decomposed in terms of mathematical moments, usually adopted in image analysis, as an alternative to spot volume data [19–21]. Mathematical moments can be considered as a different way to look at images: they capture global features of the image that can be used for its description. The number of calculated moments is usually very large and many of them carry no information related to the classification of the samples, e.g., into control and treated samples; it is therefore useful to apply multivariate methods to select the moments with the highest discrimination power and to build classification models containing only the most discriminant moments. The procedure proposed here makes it possible to bypass all the steps typical of the software packages that provide spot volume data and is less dependent on the operator's ability. In this chapter, Legendre [19, 22–24] and Zernike [20, 21, 25, 26] moments are applied: these two moment functions present the interesting characteristic of being invariant with respect to translations and rotations of the original images.
The moments calculated are then coupled to two different multivariate classification methods: linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA). The proposed procedure could be intended as a rapid diagnostic screening for assigning unknown samples to a class, e.g., controls or pathological samples. To demonstrate the effectiveness of the procedure, both types of moments are applied to a real dataset consisting of 18 samples belonging to two different cell lines (PACA44 and T3M4) of human pancreatic cancer, before and after the treatment with Trichostatin A [14].
2 Theory
2.1 Moment Functions
Moment functions are quite widespread in different areas of image analysis: invariant pattern recognition, object classification, pose estimation, image coding, and reconstruction [27–33]. Mathematical moments represent global features of the shape of an image and provide information about its different geometrical characteristics. Geometric moments are the simplest and were the first to be applied in image analysis; then, other types of moment functions were introduced, such as orthogonal, rotational, and complex moments. They can be effectively used in pattern recognition to characterize different geometric features of an object such as its shape, area, border, location, and orientation. Orthogonal moments, such as Legendre [34–36] and Zernike moments [37], are the most exploited, since they reach an almost null level of redundancy in a set of moment functions; in other terms, they correspond to independent characteristics of the image. Their orthogonality, due to their orthogonal basis functions, guarantees their applicability in the representation of an image by a set of mutually independent descriptors, with a minimum amount of information redundancy. Moreover, orthogonal moments have two important additional properties: (1) they are more robust to image noise and (2) they permit the analytical reconstruction of an image intensity function from a finite set of moments, using the inverse moment transform. Legendre and Zernike moments, due to their orthogonality and invariance to translations and rotations, can be effectively used for the description of sets of 2-D maps belonging to different experimental conditions (e.g., controls and pathological) and they can be used for classification purposes.
The number of moment functions that can be computed from an image is usually very large (>1,000 or even >10,000): in these cases the number of calculated moments can be effectively reduced by coupling multivariate classification methods to variable selection procedures that retain only the moments most responsible for the discrimination of the samples belonging to the different classes.
2.1.1 Legendre Moments
Legendre moments are moments with the Legendre polynomials as kernel functions. The Legendre polynomials form a complete orthogonal set on the interval [−1, 1]: the kernels of the Legendre moments are therefore products of Legendre polynomials defined along the rectangular image coordinate axes, each scaled to [−1, 1]. The two-dimensional Legendre moments of order (p + q) of an image intensity map f(x, y) are defined as:
$$L_{pq} = \frac{(2p+1)(2q+1)}{4} \int_{-1}^{1}\int_{-1}^{1} P_p(x)\,P_q(y)\,f(x,y)\,dx\,dy \tag{1}$$

where x and y range over [−1, 1] and the Legendre polynomial of order p, $P_p(x)$, is given by:

$$P_p(x) = \sum_{\substack{k=0 \\ p-k\ \mathrm{even}}}^{p} (-1)^{\frac{p-k}{2}}\,\frac{1}{2^{p}}\,\frac{(p+k)!\;x^{k}}{\left(\frac{p-k}{2}\right)!\,\left(\frac{p+k}{2}\right)!\,k!} \tag{2}$$

The recurrence relation of Legendre polynomials, $P_p(x)$, is:

$$P_p(x) = \frac{(2p-1)\,x\,P_{p-1}(x) - (p-1)\,P_{p-2}(x)}{p} \tag{3}$$

where $P_0(x) = 1$, $P_1(x) = x$, and p > 1. The region of definition of Legendre polynomials is within the range [−1, 1]; therefore a square image of N × N pixels with intensity function f(i, j) (with 0 ≤ i, j ≤ N − 1) is scaled in the region −1 < x, y < 1. Legendre moments can be expressed in discrete form as:

$$L_{pq} = \lambda_{pq} \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} P_p(x_i)\,P_q(y_j)\,f(i,j) \tag{4}$$

where the normalizing constant is:

$$\lambda_{pq} = \frac{(2p+1)(2q+1)}{N^{2}} \tag{5}$$

$x_i$ and $y_j$ denote the normalized pixel coordinates in the range [−1, 1] and are defined by:

$$x_i = \frac{2i}{N-1} - 1 \tag{6}$$

$$y_j = \frac{2j}{N-1} - 1 \tag{7}$$
The image function can be reconstructed from the calculated moments by applying the following inverse transformation:

$$f(i,j) = \sum_{p=0}^{p_{\max}} \sum_{q=0}^{q_{\max}} L_{pq}\,P_p(x_i)\,P_q(y_j) \tag{8}$$
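The discrete form of the Legendre moments (Eqs. 3–7) can be sketched in plain Python as follows. This is an illustrative sketch only, not the Matlab programs used in this chapter: the function names `legendre_values` and `legendre_moments` and the small constant test image are hypothetical, and the image is assumed to be a square list of lists indexed as `image[i][j]`.

```python
def legendre_values(max_order, x):
    """P_0(x) ... P_max_order(x) from the recurrence relation (Eq. 3)."""
    vals = [1.0, x]
    for p in range(2, max_order + 1):
        vals.append(((2 * p - 1) * x * vals[p - 1] - (p - 1) * vals[p - 2]) / p)
    return vals[:max_order + 1]

def legendre_moments(image, max_order):
    """Matrix of Legendre moments L_pq, 0 <= p, q <= max_order (Eq. 4)."""
    n = len(image)
    # normalised pixel coordinates in [-1, 1] (Eqs. 6 and 7)
    coords = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    table = [legendre_values(max_order, c) for c in coords]   # table[i][p]
    moments = []
    for p in range(max_order + 1):
        row = []
        for q in range(max_order + 1):
            total = sum(table[i][p] * table[j][q] * image[i][j]
                        for i in range(n) for j in range(n))
            # normalising constant lambda_pq (Eq. 5)
            row.append((2 * p + 1) * (2 * q + 1) / n ** 2 * total)
        moments.append(row)
    return moments

# Hypothetical 8 x 8 image of constant intensity, moments up to order 3.
flat = [[1.0] * 8 for _ in range(8)]
moments = legendre_moments(flat, 3)
# Unfolding the (max_order + 1) x (max_order + 1) matrix gives one row of descriptors.
descriptors = [value for row in moments for value in row]
```

With a maximum order of 100 the same unfolding gives 101 × 101 = 10201 descriptors per image. Tabulating the polynomial values once per coordinate avoids recomputing the recurrence for every pixel pair.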
2.1.2 Zernike Moments
Zernike moments [37] have the characteristic of being invariant to rotations and translations of the image. They are based on orthogonal Zernike polynomials defined using polar coordinates inside a unit circle. The two-dimensional Zernike moments of order p with repetition q for an image intensity function f(r, ϑ) are defined as:

$$Z_{pq} = \frac{p+1}{\pi} \int_{\vartheta=0}^{2\pi}\int_{r=0}^{1} V^{*}_{pq}(r,\vartheta)\,f(r,\vartheta)\,r\,dr\,d\vartheta \tag{9}$$

with |r| ≤ 1, where the Zernike polynomials of order p with repetition q, $V_{pq}(r,\vartheta)$, are defined as:

$$V_{pq}(r,\vartheta) = R_{pq}(r)\,e^{iq\vartheta} \tag{10}$$

and the real-valued radial polynomial, $R_{pq}(r)$, is given by:

$$R_{pq}(r) = \sum_{k=0}^{(p-|q|)/2} (-1)^{k}\,\frac{(p-k)!}{k!\,\left(\frac{p+|q|}{2}-k\right)!\,\left(\frac{p-|q|}{2}-k\right)!}\;r^{\,p-2k} \tag{11}$$

where 0 ≤ |q| ≤ p and p − |q| is even. Since Zernike moments are defined in terms of polar coordinates (r, ϑ) with |r| ≤ 1, their computation requires a linear transformation of the image coordinates (i, j) (with i, j = 0, 1, 2, …, N − 1) inside a unit circle. In this way the discrete approximation of the continuous integral of the moments can be expressed as:

$$Z_{pq} = \frac{2(p+1)}{\pi (N-1)^{2}} \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} R_{pq}(r_{ij})\,e^{-iq\vartheta_{ij}}\,f(i,j) \tag{12}$$

where the general image coordinate transformation to the interior of the unit circle is given by:

$$r_{ij} = \sqrt{x_i^{2} + y_j^{2}} \tag{13}$$

$$\vartheta_{ij} = \tan^{-1}\!\left(\frac{y_j}{x_i}\right) \tag{14}$$

and where:

$$x_i = \frac{\sqrt{2}}{N-1}\,i - \frac{1}{\sqrt{2}} \tag{15}$$

$$y_j = \frac{\sqrt{2}}{N-1}\,j - \frac{1}{\sqrt{2}} \tag{16}$$

The image intensity function f(i, j) can be reconstructed from a finite number of n Zernike moments, as:

$$f(r,\vartheta) = \sum_{p=0}^{n} \sum_{q} Z_{pq}\,V_{pq}(r,\vartheta) \tag{17}$$
defined for |q| ≤ p and p − |q| even. Zernike moments are calculated here by the so-called Q-recursive method [38], which reduces the computational time. $R_{p(q-4)}(r)$ is calculated from $R_{pq}(r)$ and $R_{p(q-2)}(r)$ by:

$$R_{p(q-4)}(r) = H_1\,R_{pq}(r) + \left(H_2 + \frac{H_3}{r^{2}}\right) R_{p(q-2)}(r) \tag{18}$$

where the recursion is started, for q = p, from:

$$R_{pp}(r) = r^{p} \tag{19}$$

$$R_{p(p-2)}(r) = p\,R_{pp}(r) - (p-1)\,R_{(p-2)(p-2)}(r) \tag{20}$$

and the coefficients are:

$$H_1 = \frac{q(q-1)}{2} - q\,H_2 + \frac{H_3\,(p+q+2)(p-q)}{8} \tag{21}$$

$$H_2 = \frac{H_3\,(p+q)(p-q+2)}{4(q-1)} + (q-2) \tag{22}$$

$$H_3 = \frac{-4(q-2)(q-3)}{(p+q-2)(p-q+4)} \tag{23}$$
The choice of the algorithm used for the computation of Zernike moments is particularly important for two main reasons: to reduce the computational time and to avoid computational errors. In the calculation of high-order moments, in fact, computational errors can produce random or inaccurate values [38]. Though computationally very complex compared to geometric and Legendre moments [34–36], Zernike moments have proved to be superior in terms of their feature representation capability and low noise sensitivity [39, 40]. Figure 1 represents an image reconstructed with Legendre and Zernike moments of increasing order.
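The Zernike relations above can be checked numerically with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the radial polynomial is evaluated from its explicit sum (Eq. 11), one step of the Q-recursive relation is written in its standard form (Eqs. 18 and 21–23, valid for q ≥ 4 and r > 0), the angle is obtained with `atan2` to resolve the quadrant, and all function names and the constant test image are hypothetical.

```python
import cmath
import math

def radial_poly(p, q, r):
    """Radial polynomial R_pq(r) from the explicit sum (Eq. 11)."""
    q = abs(q)
    total = 0.0
    for k in range((p - q) // 2 + 1):
        num = (-1) ** k * math.factorial(p - k)
        den = (math.factorial(k) * math.factorial((p + q) // 2 - k)
               * math.factorial((p - q) // 2 - k))
        total += num / den * r ** (p - 2 * k)
    return total

def q_recursive_step(p, q, r, r_pq, r_pq2):
    """R_p(q-4)(r) from R_pq(r) and R_p(q-2)(r); requires q >= 4 and r > 0."""
    h3 = -4.0 * (q - 2) * (q - 3) / ((p + q - 2) * (p - q + 4))
    h2 = h3 * (p + q) * (p - q + 2) / (4.0 * (q - 1)) + (q - 2)
    h1 = q * (q - 1) / 2.0 - q * h2 + h3 * (p + q + 2) * (p - q) / 8.0
    return h1 * r_pq + (h2 + h3 / r ** 2) * r_pq2

def zernike_moment(image, p, q):
    """Discrete Zernike moment Z_pq of a square image (Eqs. 12-16)."""
    n = len(image)
    total = 0j
    for i in range(n):
        x = math.sqrt(2) * i / (n - 1) - 1 / math.sqrt(2)
        for j in range(n):
            y = math.sqrt(2) * j / (n - 1) - 1 / math.sqrt(2)
            r = math.hypot(x, y)
            theta = math.atan2(y, x)       # assumption: atan2 instead of tan^-1(y/x)
            total += radial_poly(p, q, r) * cmath.exp(-1j * q * theta) * image[i][j]
    return 2 * (p + 1) / (math.pi * (n - 1) ** 2) * total

# Number of (p, q) pairs with q >= 0 and p - q even, up to order 100.
n_moments = sum(1 for p in range(101) for q in range(p + 1) if (p - q) % 2 == 0)

# Hypothetical 5 x 5 image of constant intensity.
flat = [[1.0] * 5 for _ in range(5)]
z00 = zernike_moment(flat, 0, 0)
# Real and imaginary parts are kept as two separate real descriptors.
descriptors = (z00.real, z00.imag)

# One Q-recursive step reproduces R_40 from R_44 and R_42.
r40 = q_recursive_step(4, 4, 0.5, radial_poly(4, 4, 0.5), radial_poly(4, 2, 0.5))
```

The explicit sum is adequate for low orders; for high orders the alternating factorial terms lose precision, which is the numerical issue that recursive schemes are meant to limit.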
Fig. 1 Reconstruction of an image by Legendre and Zernike moments of increasing order: original image (a), Legendre moments of order 40 (b), 60 (c), 80 (d), 100 (e), 120 (f), Zernike moments of order 40 (g), 60 (h), 80 (i), 100 (j), 120 (k)
2.2 Classification Methods
2.2.1 Linear Discriminant Analysis
LDA is a Bayesian classification method [41–43]. The assignment of a sample x, characterized by d features, to a class j among G classes is based on maximizing the posterior probability P(j|x) for j = 1, …, G, i.e., the probability that sample x belongs to class j. A sample x is then assigned to the class j for which the largest posterior probability is obtained. In LDA, the multidimensional normal distribution is assumed as class probability density function:

$$p(x \mid j) = \frac{1}{(2\pi)^{d/2}\,|S_j|^{1/2}}\; e^{-\frac{1}{2}\,(x-\bar{x}_j)^{T} S_j^{-1} (x-\bar{x}_j)} \tag{24}$$

where $\bar{x}_j$ is the centroid of class j and the covariance matrix $S_j$ is equal for all classes and is approximated by the pooled (between the classes) covariance matrix: in this way all the classes have the same shape, i.e., a shape averaged among all the classes and weighted for class size. More details on LDA are reported in Chapter 14. Maximizing the posterior probability corresponds to minimizing the discriminant scores given by:

$$d_j(x) = (x-\bar{x}_j)^{T} S_j^{-1} (x-\bar{x}_j) + \ln|S_j| - 2\ln P_j \tag{25}$$

where $P_j$ is the prior probability of class j. An unknown sample is assigned to the j-th class for which the distance to the centroid is shortest. The term:

$$(x-\bar{x}_j)^{T} S_j^{-1} (x-\bar{x}_j) \tag{26}$$

represents the (squared) Mahalanobis distance between the sample x and the centroid of class j. Since the number of moment functions that can be computed from an image is usually very large, the set of moments most discriminant between two groups of samples (e.g., control vs. pathological) can be identified by coupling classification methods to variable selection procedures that retain only the moments most responsible for the discrimination of the samples in the classes. When LDA is applied, a stepwise algorithm with forward selection is usually adopted as variable selection procedure (see Note 1 and Chapter 14 for more details).
2.2.2 Partial Least Squares Discriminant Analysis
Partial least squares (PLS) [9, 41, 44] is a multivariate regression method that identifies the relationship between one or more dependent variables (Y) and a group of descriptors (X). The X and Y variables are modeled simultaneously to find the latent variables (LVs) in X that predict the latent variables in Y. More details about PLS are reported in Chapter 14. The optimal number of LVs (i.e., a model that uses the information in X to predict the response Y while avoiding overfitting) is determined by the residual variance in prediction (see Note 2). Here, leave-one-out (LOO) cross-validation is applied to select the optimal number of LVs in the final model. PLS can also be applied for classification purposes by establishing an appropriate Y that is related to the membership of each sample to a class. When two classes are present (i.e., control and treated samples), a binary Y variable is added to the dataset, which is coded so that −1 is attributed to one class (control samples) and +1 to the other one (treated samples). The regression is then carried out between the X-block variables (here Zernike moments) and the Y variable just established. This process of classification is called PLS discriminant analysis (PLS-DA) [9]. When a large number of descriptors (X variables) is present, some technique for variable selection must be used to select the set of original variables giving a final model with a suitable predictive ability. Here, a two-step strategy is used:
• A first simplification of the model is achieved by eliminating groups of nonsignificant X variables, up to a model including a maximum of 200 variables;
• The remaining 200 variables are eliminated one at a time (backward elimination strategy) to provide a final model with an overall minimum error in cross-validation.
3 Materials
1. Case study. The procedures are described by the use of a set of 2-D maps previously investigated [14]. The data belong to two human pancreatic cancer cell lines (PACA44 and T3M4), both treated or not with Trichostatin A (TSA). For the PACA44 cell line, four controls and four drug-treated samples are investigated, while for the T3M4 cell line the dataset consists of five controls and five drug-treated samples.
2. Image acquisition. The 2-D gels of all the samples (PACA44 control and treated with TSA, T3M4 control and treated with TSA) were scanned with a GS-710 densitometer (Bio-Rad Labs, Hercules, CA, USA).
3. Software. Legendre and Zernike moments were computed with self-developed programs [22, 25, 26] in Matlab environment (The Mathworks, v.6.5 and subsequent, Natick, MA, USA); the same software was also used for some graphical representations. Stepwise LDA was performed with Statistica (Statsoft, v. 7, Tulsa, OK, USA). PLS-DA with variable selection was performed by PARVUS (developed by Prof. M. Forina, University of Genova, Italy). Data pretreatment and graphical representations were performed by Matlab, Statistica and Microsoft Excel 2010 (Microsoft Corporation, Redmond, WA, USA).
4 Methods
4.1 Image Pretreatment
1. Reduce images. Legendre moments: reduce each digitalized image to a grid of 200 × 200 pixels. Zernike moments: reduce each digitalized image to a grid of 100 × 100 pixels. Each pixel contains the grayscale intensity at the corresponding position of the image (see Note 3).
2. Set threshold values. Set two threshold values: one on the first derivative of each image (the slope), indicating the presence of a spot distinguishable from the noise; one on the value of the pixel (the cut), indicating the threshold value corresponding to the background.
3. First derivative calculation. This can be achieved by calculating the difference between the values of two adjacent pixels (perform calculations row-wise).
4. Background correction. If the first derivative is less than the slope and the pixel is larger than the cut, the value of the pixel is set to 255 (white) (see Note 4). Here cut = 100 and slope = 15 are adopted. Figure 2 shows an example of a sample corrected using these two values.
5. Image re-scaling. Rescale the images so that the grayscale values ranging from 0 (black) to 255 (white) are turned into values in the range from 0 (white) to 1 (black).
Fig. 2 Example of map pretreatment: before (a) and after (b) background correction with cut = 100 and slope = 15
4.2 Moment Calculation
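The moments are calculated on images pretreated as in Subheading 4.1. As a minimal sketch of that pretreatment (thresholds, row-wise first derivative, background correction, and re-scaling), the following Python code may help; it is not the authors' Matlab program. It assumes that the derivative is compared with the slope in absolute value and that the last pixel of each row uses a backward difference, neither of which is specified in the text; the function names and the test row are hypothetical, and the image reduction step is not covered.

```python
def background_correct(image, cut=100, slope=15):
    """Set to white (255) the flat, light pixels considered as background."""
    corrected = []
    for row in image:
        new_row = []
        for j, value in enumerate(row):
            if j + 1 < len(row):
                derivative = row[j + 1] - value      # difference with the next pixel
            else:
                derivative = value - row[j - 1]      # assumed: backward difference at the edge
            # assumed: the derivative is compared with the slope in absolute value
            if abs(derivative) < slope and value > cut:
                new_row.append(255)
            else:
                new_row.append(value)
        corrected.append(new_row)
    return corrected

def rescale(image):
    """Map 0 (black) ... 255 (white) onto 1 (black) ... 0 (white)."""
    return [[(255 - value) / 255 for value in row] for row in image]

# Hypothetical single-row image: light background with one dark spot in the middle.
image = [[200, 205, 203, 90, 30, 95, 204, 206]]
corrected = background_correct(image, cut=100, slope=15)
scaled = rescale(corrected)
```

In this example the flat light pixels at both ends of the row become white, while the pixels on the edges and inside the spot keep their values; after re-scaling the background is 0 and the darkest pixel has the largest value.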
1. Legendre moments. Calculate the Legendre moments of each image up to a maximum order of 100. A matrix of dimensions 101 × 101 is obtained for each image. Unfold the matrix to obtain an array of 10201 values for each image (the Legendre moments).
2. Zernike moments. Calculate the Zernike moments of each image up to a maximum order p = 100. This procedure provides a total of 2601 moments. The algorithm allows the separate calculation of the real and imaginary parts of the moments, providing a total of 5202 descriptors for each image: 2601 corresponding to the real parts of the moments and 2601 to the imaginary parts. Since the X matrix can only contain real numbers, the imaginary part of each Zernike moment must be represented by its coefficient only (i.e., the numerical coefficient multiplying the imaginary unit i). For example, the complex number (−6.06 − 0.0567i) can be separated into two parts: −6.06 is its real part and −0.0567 is its imaginary part.
4.3 Classification: LDA on Legendre Moments
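The descriptors that enter this classification are the unfolded moments of Subheading 4.2. A minimal Python sketch of how such descriptors can be obtained is given below. It assumes the usual discrete approximation of Legendre moments on pixel centers mapped onto [−1, 1]; the normalization used in the authors' programs [22] may differ. All function names are hypothetical. The last two helpers only illustrate the counting of Zernike moments and the splitting of complex moments into real-valued descriptors.

```python
def legendre_values(max_order, x):
    """Return P_0(x) ... P_max_order(x) using the three-term recurrence."""
    values = [1.0, x]
    for n in range(1, max_order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: max_order + 1]


def legendre_moments(image, max_order):
    """Unfolded (max_order + 1)**2 Legendre moments of a square image."""
    n = len(image)
    # Assumption: pixel centers are mapped onto the interval [-1, 1].
    coords = [(2 * i + 1) / n - 1 for i in range(n)]
    polys = [legendre_values(max_order, c) for c in coords]
    moments = []
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            total = sum(
                polys[i][p] * polys[j][q] * image[i][j]
                for i in range(n)
                for j in range(n)
            )
            moments.append((2 * p + 1) * (2 * q + 1) * total / (n * n))
    return moments


def count_zernike_moments(max_order):
    """Count the (p, q) pairs with 0 <= q <= p <= max_order and p - q even."""
    return sum(
        1 for p in range(max_order + 1) for q in range(p + 1) if (p - q) % 2 == 0
    )


def split_complex(moments):
    """Turn complex moments into real descriptors: real parts, then imaginary parts."""
    return [z.real for z in moments] + [z.imag for z in moments]
```

With a maximum order of 100, `legendre_moments` returns 10201 values and `count_zernike_moments` returns 2601, in agreement with the figures quoted in Subheading 4.2.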
1. Establish a class variable coded so that each sample is characterized by a number expressing the corresponding class (see Note 5).
2. Perform LDA on the final dataset, which consists of the 18 maps, each described by the 10201 calculated moments; the variable just established is the grouping variable.
3. Apply LDA with a forward stepwise variable selection procedure, selecting a proper value for F-to-enter (see Note 1).
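The class coding of step 1 and the leave-one-out evaluation reported later in Table 2 can be sketched in Python as follows. This is a hedged illustration only: a nearest-centroid rule stands in for LDA (which was run in Statistica), no stepwise selection is performed, and all function names are hypothetical.

```python
def nearest_centroid(train_x, train_y, sample):
    """Assign `sample` to the class whose training centroid is closest.

    Stand-in classifier used for illustration; it is NOT LDA.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_classification_matrix(x, y, classify=nearest_centroid):
    """Leave-one-out: each sample is predicted by a model built on the others."""
    labels = sorted(set(y))
    matrix = {true: {pred: 0 for pred in labels} for true in labels}
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        matrix[y[i]][classify(train_x, train_y, x[i])] += 1
    return matrix


def ner_percent(matrix):
    """Non-error rate: percentage of samples on the diagonal of the matrix."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return 100.0 * correct / total


# Class coding as in Note 5: 1 = Paca44, 2 = T3M4, 3 = Paca44TSA, 4 = T3M4TSA.
y = [1] * 4 + [2] * 5 + [3] * 4 + [4] * 5
# Synthetic, well-separated toy descriptors (not the chapter's data).
x = [[10.0 * label + 0.1 * k, 0.0] for k, label in enumerate(y)]
matrix = loo_classification_matrix(x, y)
```

Rows of `matrix` are true classes and columns are assigned classes, which is the layout of the classification matrices in this chapter.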
4.4 Classification: PLS-DA on Zernike Moments
1. Separate the data into two sets: samples belonging to the Paca44 cell line (both control and drug-treated) and samples belonging to the T3M4 cell line (both control and drug-treated).
2. Establish for each separate set a Y variable coded so that each sample is characterized by a number expressing the corresponding class (see Note 6).
3. Autoscale the two datasets separately (see Note 7): the eight maps from the Paca44 cell line and the ten maps from the T3M4 cell line.
4. Perform PLS-DA with variable selection as specified in Subheading 2.
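Steps 2 and 3 can be sketched in Python as follows. The function name `autoscale` is hypothetical, and the use of the sample standard deviation (denominator n − 1) is an assumption, since Note 7 only refers to "the standard deviation". A column with zero variance would cause a division by zero and would have to be removed beforehand.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Column-wise autoscaling: (x - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sds = [stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Y coding as in Note 6 for a set of four controls and four treated samples.
y_paca44 = [-1] * 4 + [1] * 4

# Toy matrix with three samples (rows) and two descriptors (columns).
scaled = autoscale([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

Each dataset is passed to the function on its own, so that the means and standard deviations of one cell line do not influence the scaling of the other.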
4.5 Results: Legendre Moments
1. Selected moments. In this case, LDA selects only six Legendre moments as significant for discriminating the four classes of samples. The discriminant functions are linear combinations of the selected moments Lpq, which are reported in Table 1.
2. Classification performance. The final model, which includes the six selected moments, correctly classifies all samples, as shown by the classification matrix (Table 2). The model is also evaluated in cross-validation by a leave-one-out procedure (see Note 8) to assess its predictive ability: the classification matrix (Table 2) shows that all samples are correctly classified in this case as well. These results confirm the applicability of Legendre moments as global descriptors of 2-DE images for assigning unknown samples to, for example, control or pathological classes in rapid diagnostic screening.

Table 1 Legendre moments selected by the final LDA model
Order of moment
p      q
2      0
2      11
3      10
5      5
86     8
96     0

Table 2 Classification matrix for the LDA model in fitting and cross-validation
Fitting / LOO cross-validation
             NER%   Paca44   Paca44TSA   T3M4   T3M4TSA
Paca44       100    4        0           0      0
Paca44TSA    100    0        4           0      0
T3M4         100    0        0           5      0
T3M4TSA      100    0        0           0      5
Total        100    4        4           5      5

Table 3 Classification matrix for the PLS-DA models: on Paca44 dataset (a) and T3M4 dataset (b)
(a) Paca44 cell line
             NER%   Control   Treated
Control      100    4         0
Treated      100    0         4
Total        100    4         4
(b) T3M4 cell line
             NER%   Control   Treated
Control      100    5         0
Treated      100    0         5
Total        100    5         5

4.6 Results: Zernike Moments
1. Percentage of explained variance. The first LV is considered significant for both datasets (LOO cross-validation), since it explains more than 99 % of the total information contained in the Y variable (see Note 9).
2. Classification matrix. Table 3 reports the classification matrices: the use of one LV in each classification model allows the correct classification of all the samples in each dataset, with a final NER% (non-error rate) of 100 %.
3. Score and loading plots. Figure 3 reports the score and loading plots for both datasets: control samples are represented as circles and treated samples as squares. In both cases, control samples are at large negative scores along LV1, while drug-treated ones are at large positive scores. The corresponding loading plots allow the identification of the Zernike moments responsible for the classification: moments at large negative values along LV1 show large positive values for control samples and large negative ones for drug-treated samples; the moments located at large positive values along LV1 show the opposite behavior. Only the most discriminating moments (selected by the backward elimination procedure) are reported in the loading plots. The selection procedure is very effective, since it drastically reduces the number of moments included in the final model. The significant moments do not share recurring p and q values; as expected, different moments are significant in the two cases investigated.
4. R2 and root mean square error (RMSE) values. Table 4 reports the R2 and RMSE values obtained in fitting and cross-validation for both datasets; the models perform very well in both fitting and validation. The good abilities of the calculated models in both fitting and prediction are also demonstrated by the RMSE values: the fitting errors (RMSEC) are all below 0.1, while the validation errors (RMSECV) are around 0.1. Like Legendre moments, Zernike moment functions could therefore be applied as global descriptors of 2-DE images in rapid diagnostic screening methods.
5. Graphical representations. Figure 4 reports the calculated and predicted Y values vs. the observed Y values. In all cases there is good agreement between the actual values and the calculated or predicted ones. In these plots the most important information is the variation of the calculated and predicted responses along the Y-axis: the positions of both the fitted and the validated responses at negative values for control samples and at positive values for the other class in each dataset show that the models provide a NER% of 100 %.

Fig. 3 Score and loading plots of the first two LVs for the datasets investigated: PACA44 cell line (a) and T3M4 cell line (b). Control samples are represented as circles, while treated samples as squares

Table 4 R2 and RMSE values calculated for fitting (R2, RMSEC) and cross-validation (R2cv, RMSECV) after the application of PLS-DA
                    R2       R2cv     RMSEC    RMSECV
Paca44 cell line    0.9943   0.9894   0.0875   0.1045
T3M4 cell line      0.9935   0.9890   0.0904   0.1093

Fig. 4 Calculated and predicted Y values vs. reference Y values for the datasets investigated: PACA44 cell line (a) and T3M4 cell line (b). Calculated values are represented as circles, while predicted values are shown as squares. Solid regression lines correspond to calculated values, while dotted regression lines correspond to predicted values
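The fit statistics discussed in steps 4 and 5 can be sketched in Python as follows. The definitions below are the common textbook ones and are an assumption: the exact formulas used by PARVUS (for example the denominator of the RMSE) are not given in the text. The function names are hypothetical. With Y coded −1/+1 (Note 6), a sample is taken as correctly classified when its calculated or predicted response has the same sign as its reference value.

```python
from math import sqrt


def r_squared(reference, estimated):
    """Coefficient of determination between reference and estimated Y values."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimated))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1 - ss_res / ss_tot


def rmse(reference, estimated):
    """Root mean square error (assumption: denominator equal to n)."""
    squared = sum((r - e) ** 2 for r, e in zip(reference, estimated))
    return sqrt(squared / len(reference))


def ner_from_sign(reference, estimated):
    """NER% when class membership is read from the sign of the response."""
    hits = sum(1 for r, e in zip(reference, estimated) if (e > 0) == (r > 0))
    return 100.0 * hits / len(reference)


# Toy values only (not the chapter's data): two controls and two treated samples.
reference = [-1, -1, 1, 1]
estimated = [-0.9, -1.1, 1.1, 0.9]
```

Passing the fitted responses gives R2 and RMSEC; passing the responses predicted in leave-one-out gives R2cv and RMSECV.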
5 Notes
1. The variables contained in the final LDA model can be chosen by a stepwise algorithm [41, 42], which iteratively selects the most discriminating variables. When forward selection (FS) is applied, the method starts with a model in which no variable is included and adds one variable at a time until a given stopping criterion is satisfied. The variable included in the model at each step is the one providing the greatest value of a Fisher F ratio, so that the j-th variable is included in the model, with t variables already included, if:

$$F_j^{+} = \max_j \left[ \frac{\mathrm{RSS}_t - \mathrm{RSS}_{t+j}}{s^2_{t+j}} \right] > F_{\text{to-enter}} \qquad (27)$$

where $s^2_{t+j}$ is the variance calculated for the model with t variables plus the j-th variable; $\mathrm{RSS}_t$ is the residual sum of squares of the model with t variables; $\mathrm{RSS}_{t+j}$ is the residual sum of squares of the model with t variables plus the j-th variable. The F value thus calculated is compared to a reference value (F-to-enter), usually set at values ranging from 2 (more permissive selection, including a larger number of variables in the final model) to 4 (more severe selection).
2. Usually it is determined by the application of cross-validation strategies [41, 42]. In these methods the data are partitioned into a training set and a test set: the training set is used for calculating the model, while the test set is used to estimate the prediction performance of the calculated model. The final model is calculated on all the samples available (training and test sets), but the partitioning is used to estimate the prediction performance of the model. Cross-validation techniques differ in how severely the samples are partitioned into training and test sets: in leave-one-out (LOO) cross-validation, one sample at a time is placed in the test set while the others belong to the training set; in leave-more-out (LMO) cross-validation, several samples at a time are placed in the test set while the others belong to the training set. LOO procedures are usually applied when a small number of samples is available.
3. These values therefore range from 0 (black) to 255 (white).
4. Good results can be obtained with cut values of 100–150 and slope values ranging from 10 to 20.
5. For example, 1 = Paca44, 2 = T3M4, 3 = Paca44TSA, 4 = T3M4TSA.
6. For example, −1 = control, +1 = drug-treated samples.
7. Autoscaling [41, 42] consists in applying the following transformation to the data:

$$x'_{i,j} = \frac{x_{i,j} - \bar{x}_j}{s_j} \qquad (28)$$

where $x_{i,j}$ is the value of variable j for object i, $\bar{x}_j$ is the average value of variable j, and $s_j$ is the standard deviation of variable j. After this scaling, all variables acquire the same relevance in the final model: the final values are mean centered and all variables have unit variance.
8. LOO is applied here since the number of samples available is not sufficient for performing a more severe validation.
9. The first LV explains 94.20 % and 99.43 % of the overall variance in the X and Y variables, respectively, in the Paca44 dataset, while it explains 79.77 % and 99.35 % of the overall variance in the X and Y variables, respectively, in the T3M4 dataset.

References
1. Kossowska B, Dudka I, Bugla-Ploskonska G, Szymanska-Chabowska A, Doroszkiewicz W, Gancarz R, Andrzejak R, Antonowicz-Juchniewicz J (2010) Proteomic analysis of serum of workers occupationally exposed to arsenic, cadmium, and lead for biomarker research: a preliminary study. Sci Total Environ 408(22):5317–5324
2. Rodriguez-Pineiro AM, Blanco-Prieto S, Sanchez-Otero N, Rodriguez-Berrocal FJ, Paez de la Cadena M (2010) On the identification of biomarkers for non-small cell lung cancer in serum and pleural effusion. J Proteomics 73(8):1511–1522
3. Poulsen NA, Andersen V, Moller JC, Moller HS, Jessen F, Purup S, Larsen LB (2012) Comparative analysis of inflamed and noninflamed colon biopsies reveals strong proteomic inflammation profile in patients with ulcerative colitis. BMC Gastroenterol 12:N76
4. Bouwman FG, de Roos B, Rubio-Aliaga I, Crosley LK, Duthie SJ, Mayer C, Horgan G, Polley AC, Heim C, Coort SLM, Evelo CT, Mulholland F, Johnson IT, Elliott RM, Daniel H, Mariman ECM (2011) 2D-electrophoresis and multiplex immunoassay proteomic analysis of different body fluids and cellular components reveal known and novel markers for extended fasting. BMC Med Genomics 4:N24
5. Ocak S, Friedman DB, Chen H, Ausborn JA, Hassanein M, Detry B, Weynand B, Aboubakar F, Pilette C, Sibille Y, Massion PP (2014) Discovery of new membrane-associated proteins overexpressed in small-cell lung cancer. J Thorac Oncol 9(3):324–336
6. Fernando H, Wiktorowicz JE, Soman KV, Kaphalia BS, Khan MF, Ansari GAS (2013) Liver proteomics in progressive alcoholic steatosis. Toxicol Appl Pharmacol 266(3):470–480
7. O'Dwyer D, Ralton LD, O'Shea A, Murray GI (2011) The proteomics of colorectal cancer: identification of a protein signature associated with prognosis. PLoS One 6(11):e27718
8. Pitarch A, Jimenez A, Nombela C, Gil C (2006) Decoding serological response to Candida cell wall immunome into novel diagnostic, prognostic, and therapeutic candidates for systemic candidiasis by proteomic and bioinformatic analyses. Mol Cell Proteomics 5(1):79–96
9. Marengo E, Robotti E, Bobba M, Milli A, Campostrini N, Righetti SC, Cecconi D, Righetti PG (2008) Application of partial least squares discriminant analysis and variable selection procedures: a 2D-PAGE proteomic study. Anal Bioanal Chem 390:1327–1342
10. Sofiadis A, Becker S, Hellman U, Hultin-Rosenberg L, Dinets A, Hulchiy M, Zedenius J, Wallin G, Foukakis T, Hoog A, Auer G, Lehtio J, Larsson C (2012) Proteomic profiling of follicular and papillary thyroid tumors. Eur J Endocrinol 166(4):657–667
11. Marengo E, Robotti E, Righetti PG, Campostrini N, Pascali J, Ponzoni M, Hamdan M, Astner H (2004) Study of proteomic changes associated with healthy and tumoral murine samples in neuroblastoma by principal component analysis and classification methods. Clin Chim Acta 345:55–67
12. Marengo E, Robotti E, Bobba M, Liparota MC, Rustichelli C, Zamò A, Chilosi M, Righetti PG (2006) Multivariate statistical tools applied to the characterization of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel electrophoresis. Electrophoresis 27:484–494
13. Haas B, Serchi T, Wagner DR, Gilson G, Planchon S, Renaut J, Hoffmann L, Bohn T, Devaux Y (2011) Proteomic analysis of plasma samples from patients with acute myocardial infarction identifies haptoglobin as a potential prognostic biomarker. J Proteomics 75(1):229–236
14. Marengo E, Robotti E, Cecconi D, Hamdan M, Scarpa A, Righetti PG (2004) Identification of the regulatory proteins in human pancreatic cancers treated with Trichostatin A by 2D-PAGE maps and multivariate statistical analysis. Anal Bioanal Chem 379:992–1003
15. Schmid HR, Schmitter D, Blum O, Miller M, Vonderschmitt D (1995) Lung-tumor cells: a multivariate approach to cell classification using 2-dimensional protein pattern. Electrophoresis 16:1961–1968
16. Maurer MH (2006) Software analysis of two-dimensional electrophoretic gels in proteomic experiments. Curr Bioinform 1:255–262
17. Mahon P, Dupree P (2001) Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis 22:2075–2085
18. Appel RD, Vargas JR, Palagi PM et al (1997) Melanie II: a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18:2735–2748
19. Jensen K, Kesmir C, Sondergaard I (1996) From image processing to classification. 4. Classification of electrophoretic patterns by neural networks and statistical methods enable quality assessment of wheat varieties for breadmaking. Electrophoresis 17:694–698
20. Grus FH, Augustin AJ (2000) Protein analysis methods in diagnosis of sicca syndrome. Ophthalmologe 97:54–61
21. Swets D, Weng J (1996) Using discriminant eigenfeatures for image retrieval. IEEE Trans Pattern Anal Mach Intell 18:831–836
22. Marengo E, Bobba M, Liparota MC, Robotti E, Righetti PG (2005) Use of Legendre moments for the fast comparison of two-dimensional polyacrylamide gel electrophoresis maps images. J Chromatogr A 1096:86–91
23. Teague MR (1980) Image analysis via the general theory of moments. J Opt Soc Am 70:920–930
24. Zernike F (1934) Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode. Physica 1:689–704
25. Marengo E, Robotti E, Bobba M, Demartini M, Righetti PG (2008) A new method of comparing 2D-PAGE maps based on the computation of Zernike moments and multivariate statistical tools. Anal Bioanal Chem 391:1163–1173
26. Marengo E, Cocchi M, Demartini M, Robotti E, Bobba M, Righetti PG (2011) Investigation of the applicability of Zernike moments to the classification of SDS 2D-PAGE maps. Anal Bioanal Chem 400:1419–1431
27. Wee C, Paramesran R, Takeda F (2004) New computational methods for full and subset Zernike moments. Inform Sci 159:203–220
28. Kan C, Srinath MD (2002) Invariant character recognition with Zernike and orthogonal Fourier-Mellin moments. Pattern Recogn 35:143–154
29. Zenkouar H, Nachit A (1997) Images compression using moments method of orthogonal polynomials. Mat Sci Eng B Solid 49:211–215
30. Yin J, De Pierro RA, Wei M (2002) Analysis for the reconstruction of a noisy signal based on orthogonal moments. Appl Math Comput 132:249–263
31. Hu MK (1962) Visual pattern recognition by moment invariants. IRE Trans Inf Theor 8:179–187
32. Khotanzad A, Hong YH (1990) Invariant image recognition by Zernike moments. IEEE Trans Pattern Anal Mach Intell 12:489–497
33. Li BC, Shen J (1991) Fast computation of moment invariants. Pattern Recogn 24:807–813
34. Chong C, Raveendran P, Mukundan R (2004) Translation and scale invariants of Legendre moments. Pattern Recogn 37:119–129
35. Mukundan R, Ramakrishnan KR (1995) Computation of Legendre and Zernike moments. Pattern Recogn 28:1433–1442
36. Zhou JD, Shu HZ, Luo LM, Yu WX (2002) Two new algorithms for efficient computation of Legendre moments. Pattern Recogn 35:1143–1152
37. Mukundan R, Ramakrishnan KR (1998) Moment functions in image analysis. World Scientific, Singapore
38. Chong CW, Raveendran P, Mukundan R (2003) A comparative analysis of algorithms for fast computation of Zernike moments. Pattern Recogn 36:731–742
39. Belkasim SO, Shridhar M, Ahmadi M (1991) Pattern recognition with moment invariants: a comparative study and new results. Pattern Recogn 24:1117–1138
40. Teh CH, Chin RT (1988) On image analysis by the method of moments. IEEE Trans Pattern Anal Mach Intell 10:496–513
41. Massart DL, Vandeginste BGM, Deming SM, Michotte Y, Kaufman L (1988) Chemometrics: a textbook. Elsevier, Amsterdam
42. Vandeginste BGM, Massart DL, Buydens LMC, De Jong S, Lewi PJ, Smeyers-Verbeke J (1998) Handbook of chemometrics and qualimetrics: part B. Elsevier, Amsterdam
43. Frank IE, Lanteri S (1989) Classification models: discriminant analysis, SIMCA, CART. Chemometr Intell Lab Syst 5:247–256
44. Martens H, Naes T (1989) Multivariate calibration. Wiley, London
Chapter 16
Nonlinear Dimensionality Reduction by Minimum Curvilinearity for Unsupervised Discovery of Patterns in Multidimensional Proteomic Data
Massimo Alessio and Carlo Vittorio Cannistraci

Abstract
Dimensionality reduction is widely and successfully employed for the visualization and discrimination of patterns hidden in multidimensional proteomic datasets. Principal component analysis (PCA), which is the preferred approach for linear dimensionality reduction, may present serious limitations, in particular when samples are nonlinearly related, as often occurs in two-dimensional electrophoresis (2-DE) datasets. An aggravating factor is that the robustness of PCA is impaired when the number of samples is small in comparison to the number of proteomic features, which is the case in high-dimensional proteomic datasets, including 2-DE ones. Here, we describe the use of a nonlinear unsupervised learning machine for dimensionality reduction called minimum curvilinear embedding (MCE), which has been successfully applied to datasets from different biological samples. In particular, we provide an example in which we directly compare the performance of MCE with that of PCA in disclosing neuropathic pain patterns hidden in a multidimensional proteomic dataset.

Key words Nonlinear dimensionality reduction, Unsupervised machine learning, Pattern recognition, Multivariate analysis, Principal component analysis, Minimum curvilinearity, Minimum curvilinear embedding, Visualization, High-dimensional data, Two-dimensional gel electrophoresis
1 Introduction
1.1 Why and When to Use Nonlinear Dimensionality Reduction
Visualization and discrimination are widely employed in computational biology for the investigation and analysis of patterns hidden in multidimensional wet-lab data. Supervised methods present several pitfalls [1], the major one being the curse of dimensionality [2]. When the feature dimensionality increases (as in genomics and proteomics, where there are many more genes or proteins than samples), the volume of the feature space increases exponentially and the distribution of the data samples in the high-dimensional volume becomes sparse. This is a drawback in supervised learning, in the statistical estimation of model parameters and, in general, for any method that requires statistical significance, because it amounts to an under-sampling issue: there are few sample points with respect to the volume of the informational space. In general, the more features we have, the more samples we need to train a supervised learning procedure, and small datasets, or datasets with far fewer samples than features, make supervised analysis problematic [3]. Complications intensify when samples are nonlinearly related in the high-dimensional feature space obtained from high-throughput proteomic measures, such as two-dimensional electrophoresis (2-DE) features obtained as pixel regions or spot intensities of a gel map.
In proteomics, the number of samples recruited in a study is generally much smaller than the number of proteomic features extracted from each sample. When the aim is to classify a low number of samples characterized by a very large number of proteomic features, problems with parameter estimation may arise, and unsupervised approaches for dimensionality reduction are a valuable option [4]. Principal component analysis (PCA) is the preferred unsupervised linear dimensionality reduction technique [3–5] for discriminating different homogeneous classes on the basis of the first two or three dimensions (the first two or three principal components). However, this approach is not powerful enough to deal with nonlinear datasets, which often occur in proteomic studies. Here, we describe the use of a nonlinear unsupervised learning machine for dimensionality reduction, called minimum curvilinear embedding (MCE). This approach aims to address the problems mentioned above, which are encountered when nonlinear sample patterns are hidden in the multidimensional proteomic space.

Both authors have equally contributed.
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_16, © Springer Science+Business Media New York 2016

1.2 Minimum Curvilinearity (MC) and Minimum Curvilinear Embedding
Minimum curvilinearity, the principle behind MCE, suggests that curvilinear distances between samples (here the proteomic samples) can be estimated as pairwise transversal distances over their minimum spanning tree (MST), which is constructed according to a selected norm (Euclidean, correlation, etc.) in a high-dimensional feature space (here the proteomic space). All the nonlinear pairwise distances can be stored in a distance matrix called the MC-distance matrix or the MC-kernel, which can be used as input to algorithms for dimensionality reduction, clustering, and classification [4], and more generally in machine learning. The figure at https://sites.google.com/site/carlovittoriocannistraci/what-is-minimum-curvilinearity-distance provides an example of how MC works in the simplified condition in which there are two features (proteins) and the Euclidean norm is selected. In the case of MCE, the MC-kernel is centered (this operation is omitted in the non-centered version of the approach, namely ncMCE) and its singular value decomposition (SVD) [4, 6] is used to project the samples onto a two-dimensional space for visualization and analysis (see the algorithm in Subheading 3 and Note 1). According to this algorithmic description, MCE and ncMCE can be categorized as a form of nonlinear and parameter-free kernel PCA. In addition, an algorithm variation consists in using multidimensional scaling (MDS) in substitution of the SVD, which is the recommended device to adopt (see Note 2).
The approach was originally introduced in its centered version [4], which provided remarkable results in (1) the visualization and discrimination of the proteomic profiles of peripheral neuropathy patients with pain, and the germ-layer gene expression characterization of human organ tissues [4]; (2) the discrimination of microbiota in molecular ecology [7]; and (3) the stage identification of embryonic stem cell differentiation based on genome-wide expression data [8]. In this third example, MCE ranked first in a study of the performance of 12 different approaches (evaluated on ten diverse datasets). More recently, the non-centered version of the algorithm, ncMCE, has been used to visualize clusters of ultra-conserved regions of DNA across eukaryotic species, and as a network embedding technique for predicting links in protein interactomes [6], where it outperformed several other link prediction techniques. The success of MCE on various types of problems, as well as its parameter-free nature, prompted us to apply this algorithm and compare it with PCA for the visualization and unsupervised analysis of patterns hidden in neuropathic proteomic profiles [4].
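The construction of the MC-kernel described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the function name `mc_kernel` is hypothetical, only the Euclidean norm is handled, the MST is built with Prim's algorithm, and the subsequent centering and SVD steps of MCE are not shown (Note 1 remains the reference for the full algorithm).

```python
from math import dist  # available from Python 3.8


def mc_kernel(samples):
    """Pairwise path lengths over the Euclidean minimum spanning tree."""
    n = len(samples)
    # Prim's algorithm on the complete graph of Euclidean distances.
    in_tree = [False] * n
    best = [float("inf")] * n
    parent = [-1] * n
    best[0] = 0.0
    adjacency = [[] for _ in range(n)]
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adjacency[u].append((parent[u], best[u]))
            adjacency[parent[u]].append((u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(samples[u], samples[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    # The MC-distance between two samples is the length of the unique
    # tree path that connects them.
    kernel = [[0.0] * n for _ in range(n)]
    for start in range(n):
        stack = [(start, -1, 0.0)]
        while stack:
            node, previous, length = stack.pop()
            kernel[start][node] = length
            for neighbor, weight in adjacency[node]:
                if neighbor != previous:
                    stack.append((neighbor, node, length + weight))
    return kernel


# Five toy samples lying on an arch in a two-feature space.
arch = [(0, 0), (0, 1), (1, 1.5), (2, 1), (2, 0)]
kernel = mc_kernel(arch)
```

For the two end points of the arch the Euclidean distance is 2, whereas the MC-distance follows the arch through the intermediate samples and is therefore larger; this is the sense in which the kernel captures curvilinear relations.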
2 Materials
2.1 Hardware
1. CPU machine with at least 1 GB RAM.
2.2 Software Suitable for Running Dimension Reduction Computing
The computation can be carried out with MATLAB® software, version 7.0 or more recent (The MathWorks™, Natick, MA, USA). Alternatively, the provided MATLAB function can be reimplemented in free software languages such as Python, Java, and C++ on the basis of the algorithm provided in Subheading 3 and Note 1.
2.3 Computational Functions
The MATLAB function for applying MCE or ncMCE to multidimensional datasets is freely available for download under the name MCE_4june2012 from https://sites.google.com/site/carlovittoriocannistraci/5-datasets-and-matlab-code/minimum-curvilinearity-ii-april-2012. The PCA function is already included in the statistical toolbox of the MATLAB® software under the name princomp.
2.4 The Input Proteomic Dataset
The dataset given as input to the MATLAB functions described in the next section should be organized with the samples on the rows of the dataset matrix and the proteomic measures (features) on the columns.
The proteomic dataset used in the reported example was obtained from 2-DE images generated from cerebrospinal fluid (CSF) samples; the 2-D gel experimental procedure is described in the original proteomics study [9]. Each 2-DE image was denoised by the median modified Wiener filter (MMWF) [10], and the spots were detected by means of Progenesis PG240 v2006 software (Nonlinear Dynamics, Newcastle, UK) (see Note 3). Spot volume was estimated from the optical density (sum of the spot pixels), normalized as a percentage of the total spot optical density of the gel image. From each image a vector of 2050 proteomic features was obtained by means of a strategy previously developed, described in depth, and validated by Pattini et al. [11] (see Note 4). The dataset consists of 42 samples divided into four groups: control healthy subjects (C = 8), peripheral neuropathic patients without pain (NP = 8), peripheral neuropathic patients with pain (P = 7), and motor neuron disease patients (M = 19). This last group, derived from Conti et al. [12], represents a second control group characterized by a neuropathic disease in which the pathology does not cause neuropathic pain (see Note 5). In the dataset that we used, four of the NP patients (NP1, NP3, NP5, and NP6) developed pain 12 months after the CSF collection [4].
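The normalization and the layout of the input matrix described in this subheading can be sketched in Python as follows. The sketch is an illustration only: the function names are hypothetical, and it does not reproduce the feature-extraction strategy of Pattini et al. [11]; it merely shows spot volumes expressed as a percentage of the total spot volume of each gel, with samples arranged on the rows.

```python
def normalize_spot_volumes(volumes):
    """Express each spot volume as a percentage of the gel's total spot volume."""
    total = sum(volumes)
    return [100.0 * v / total for v in volumes]


def build_dataset(gels):
    """Return a matrix with one row per sample (gel) and one column per feature."""
    return [normalize_spot_volumes(volumes) for volumes in gels]


def group_labels(group_sizes):
    """Expand a mapping of group name to size into one label per sample."""
    return [name for name, size in group_sizes.items() for _ in range(size)]


# Group sizes as given in the text; the label order is an arbitrary choice.
labels = group_labels({"C": 8, "NP": 8, "P": 7, "M": 19})

# Toy gels with three spots each (not real measurements).
dataset = build_dataset([[2.0, 3.0, 5.0], [1.0, 1.0, 2.0]])
```

Keeping a label vector of the same length as the number of rows makes it easy to color the samples by group once they are embedded, without the labels ever entering the unsupervised computation.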
3 Methods
3.1 Nonlinear Dimensionality Reduction by MCE or ncMCE
The formal description of the algorithm is reported in tabular form in Note 1; refer to that note to interpret the formalism mentioned in the following steps.
1. Input the proteomic dataset (organized according to the indications of Subheading 2.4) as the first argument of the MATLAB function MCE_4june2012. G is the n × m proteomic dataset matrix (n = number of individuals, m = number of proteomic features).
2. Set the size of the reduced embedding space (d, embedding dimension); for example, write 2 for embedding in two dimensions.
3. Set the norm (me) to adopt for computing the MST (T), for example the Euclidean norm (see Note 1). As for any unsupervised dimension reduction procedure, there are no general rules for deciding a priori which settings are best. However, for the visualization and exploration of patterns, two or three dimensions are sufficient; we advise setting the value 2 (that is, the first two dimensions), because it has been extensively shown that one of the major strengths of ncMCE and MCE is the ability to unfold hidden nonlinear patterns in the first two dimensions (in particular in the first dimension). Regarding the norm, we advise adopting mainly the Euclidean and correlation norms, although we do not exclude that other types of norms might be useful on particular datasets.
4. Run the code MCE_4june2012 to obtain as output the coordinates of the samples embedded in the reduced space: Xc (n × d matrix, the result of the centered MC-kernel embedding, whose rows are the individuals with their coordinates in a d-dimensional reduced space) and Xnc (n × d matrix, the result of the non-centered MC-kernel embedding, whose rows are the individuals with their coordinates in a d-dimensional reduced space). The first output argument of MCE_4june2012 contains the coordinates according to the MCE-SVD-based algorithm; the second output argument contains the coordinates of the individuals according to the ncMCE-SVD-based algorithm. Example. If the input dataset is called x and we want to perform a mapping in two dimensions using the Euclidean norm, we write in MATLAB the instruction:
[Xc,Xnc] = MCE_4june2012(x,2,'euc');
The outputs are the centered and non-centered coordinates, which are provided in the variables Xc and Xnc.
5. To perform the embedding using PCA, with the previous example as reference, we write in MATLAB the instruction:
[pc,s] = princomp(x);
The variable s provides the coordinates of the samples in the principal component space; its first two columns are the coordinates in the reduced space of the first two principal components. The variable pc returns the principal component loadings of the PCA: each column is the collection of the coefficients used for projecting the samples in the new PC space.
3.2 Mapping Samples in the Reduced Dimensional Space
Use the new coordinates of the samples mapped in the reduced dimensional space for plotting the figure that displays the relations between the samples. This action can be accomplished in any visualization environment including the one offered by MATLAB.
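When preparing such a plot it is the sample coordinates (scores) that are drawn, not the loadings. The distinction can be sketched in Python as follows for the first principal component only. The function name `first_pc` is hypothetical, and the power iteration used here is a stand-in chosen to keep the example free of external libraries; it is not how princomp computes its output, and it would fail if the starting vector happened to be orthogonal to the first component.

```python
def first_pc(rows, iterations=200):
    """Return (loadings, scores) of the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - mu for v, mu in zip(row, means)] for row in rows]
    # Sample covariance matrix of the features (m x m).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration towards the dominant eigenvector (the loadings).
    loadings = [1.0] * m
    for _ in range(iterations):
        loadings = [sum(cov[a][b] * loadings[b] for b in range(m)) for a in range(m)]
        norm = sum(v * v for v in loadings) ** 0.5
        loadings = [v / norm for v in loadings]
    # Scores: projection of each centered sample onto the loadings.
    scores = [sum(c * w for c, w in zip(row, loadings)) for row in centered]
    return loadings, scores


# Toy data: four samples (rows) and two perfectly correlated features.
loadings, scores = first_pc([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

The list `scores` has one value per sample and is what would be placed on the first axis of the plot; `loadings` has one value per feature and describes how the features are combined.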
3.3 Comparing PCA and MCE Results on a Case Study
We performed PCA on the dataset composed of the proteomic CSF profiles of the 42 individuals (see Subheading 2). The objective was to determine whether any unsupervised discrimination between the pain and no-pain states of the individuals emerged in a low-dimensional space suitable for visual representation and investigation. The 42 individuals were mapped in the two-dimensional space of the first two principal components (PC1 and PC2), and a clustering algorithm called minimum curvilinear affinity propagation (MCAP) [4] was applied (see Note 6). The computational test consisted in checking whether the two clusters detected without supervision were related to the presence or absence of pain in the subjects. Figure 1 reports the result of PCA followed by MCAP clustering in the reduced space. The prevalent pattern suggested by the discrimination along PC1 (and emphasized by the MCAP clustering) is, however, the separation of the individuals with motor neuron disease (M) from the rest of the samples. A weak separation between peripheral neuropathy patients without pain (NP) and with pain (P) can be recognized; regrettably, the control subjects overlap the same area, and the clustering algorithm groups them all in the same cluster. In conclusion, according to this analysis we could not detect any interesting separation, in particular one related to the presence or absence of pain, in the proteomic profiles of the 42 individuals. Moreover, the exploration of the information accounted for by the subsequent PCs did not provide any meaningful result.

Fig. 1 Linear dimensionality reduction by PCA. The first two principal components are used to investigate the relations between the individuals. No clear segregation pattern between individuals with and without pain emerges from this visualization. The NP subjects whose sample names are reported are the peripheral neuropathic patients who did not present pain at the moment of CSF collection, but developed neuropathic pain after 12 months

In order to verify whether any pain vs. no-pain separation might emerge using a nonlinear machine learning technique for dimensionality reduction, we repeated the same analysis, substituting PCA with MCE for embedding in the two-dimensional reduced space. Figure 2 reports the result of the dimensionality reduction by MCE in two dimensions (the Euclidean norm was used to generate the minimum spanning tree, and SVD was used for the spectral analysis of the MC-kernel), where, as for the PCA analysis, MCAP was applied for clustering the individuals into two groups. The result shows that using the first dimension of
Nonlinear Dimensionality Reduction by Minimum Curvilinearity
295
Fig. 2 Nonlinear dimensionality reduction by MCE. The first two dimensions of embedding are used to investigate the relations between the individuals. From this visualization it is emerging a clear and significant pattern of separation along D1 between individuals with and without pain. The NP subjects for which are reported the sample names are the peripheral neuropathic patients that at the moment of CSF collection did not present pain, but after 12 months developed the neuropathic pain. The M subjects for which are reported the sample names, are the ones with MND and misclassified as patients with pain. Two clear clusters related with the presence or absence of pain (90 % accuracy) are emerging in the MCE visualization
embedding (D1), MCE is able to significantly segregate and order: M samples (subjects with motor neuron disease but without pain, here used as an additional set of controls), C samples (controls healthy subjects without pain), NP samples (peripheral neuropathy patients without pain), and P samples (peripheral neuropathy patients with pain). Interestingly, the unsupervised clustering algorithm detected the presence of two clusters strongly related with presence/absence of pain-state with 90 % accuracy (Fig. 2): the only misclassifications are three M samples in the pain cluster (M10, M15 and M16) and the NP3 sample (that developed pain after 12 month), that was erroneously located in the no-pain cluster, but at least correctly posed within the NP samples. Notably, NP1, NP5, and NP6 all developed pain after 12 months from the CSF sampling, and they were all correctly located in the pain cluster. We can speculate that this last result suggests a potential prognostic capability for the follow-up pain state of certain subjects with peripheral neuropathy. In addition, we cannot exclude that the three misclassified M samples might have any kind of neuropathic pain at the moment of the sample collection, because this information was not available for checking. On the computational biology side, we emphasize how important is to understand the use of nonlinear and parameter-free dimensionality reduction (like the one offered by MCE) for unsupervised detection of hidden patterns in multidimensional proteomic data.
4 Notes

1. Minimum Curvilinearity Embedding (MCE).
Input
G, n × m proteomic dataset matrix (n = number of individuals, m = number of proteomic features);
d, the embedding dimension;
me, a norm to construct the minimum spanning tree T.
Output
Xc, n × n matrix, is the result of centered MC-kernel embedding: the rows of which are individuals with coordinates in a d-dimensional reduced space;
Xnc, n × n matrix, is the result of non-centered MC-kernel embedding: the rows of which are individuals with coordinates in a d-dimensional reduced space.
Description
Compute the distance between individuals (given a selected norm: in our case, the Euclidean norm) in G to generate the n × n matrix A.
Extract the minimum spanning tree T out of A.
Compute the distances between all node pairs over T to obtain the non-centered MC-kernel $D_{nc}$, and the centered MC-kernel $D_c$, i.e., $D_c = -\frac{1}{2} J D_{nc}^{2} J$ with $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$.
Perform "economy-size" singular value decomposition of $D_c$ or $D_{nc}$ using the general formula: $\tilde{D} = U_d \Sigma_d V_d^T$.
Return Xc and Xnc using the general formula: $X = \left(\sqrt{\Sigma_d}\, V_d^T\right)^T$.
*$D^2$ is the matrix of entry-wise squares, $M^T$ indicates the matrix transpose, I is the n × n identity matrix, $\mathbf{1}$ is a column vector of ones, and $\tilde{D}$ is the closest approximation to D by a matrix of rank d.
2. The SVD implementation has the advantage of being faster and more stable.
3. Spot calibration, in accordance with protein physical-chemical coordinates (isoelectric point, pI; relative molecular mass, Mr), enabled the correction of spot location differences between different gels.
4. The goal of this strategy is to obtain proteomic features without gel registration across gel maps. Each gel was treated as a collection of identified spots (detected by using commercial software) whose positions were expressed in terms of the experimental coordinates of pI and Mr rather than in terms of pixels.
The migration space (pI, 3.2–10.4; and Mr, 20–100 kDa) was ideally partitioned into sub-quadrants at a given resolution (e.g., 0.3 units in pI and 3 kDa in Mr). These sub-quadrants were identified consistently and traced the ideal separation area in the pI and Mr space in each gel image, despite contingent alterations or distortions of the single gel. This in turn made the samples comparable in the absence of canonical image matching by means of registration techniques. For each sub-quadrant the relative collection of spots was determined by summation of the spot optical density volumes; the sum of the intensities of the pixels segmented as useful signal was obtained and considered to be a quantitative feature of the single sub-quadrant. Each gel map sample is described as a vector of sorted features, depending on the cumulative spot volumes in the different and equivalent (in the pI-Mr space) gel sub-quadrant areas. Thus, the samples can be explored through a dimensionality reduction approach such as PCA with the aim of exploring their hidden patterns.
5. Peripheral neuropathy occurs either with or without pain; however, some of the patients without pain (NP) can, as the disease progresses, shift to the pathological variant with pain (P).
6. MCAP is the minimum curvilinear variation of the affinity propagation clustering algorithm [13], and it is able to detect both spherical and elongated (non-spherical) clusters; for this reason it was used to detect, in an unsupervised manner, the presence of two clusters of individuals in the reduced space of the first two principal components. The affinity propagation clustering algorithm takes as input measures of similarity between pairs of data points; real-valued messages are exchanged between data points until a high-quality set of centroids and corresponding clusters gradually emerges. In the case of MCAP, the messages exchanged between the data points flow on the MST. In practice, this is realized by giving as input to affinity propagation clustering the measures of similarity between pairs of data points stored in the MC-kernel; these are therefore nonlinear similarity estimations based on the minimum curvilinearity theory.
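The MCE procedure of Note 1 can be sketched in a few lines. What follows is a minimal illustration in Python with NumPy, not the authors' MATLAB code. The function names mc_kernel and mce, the use of Prim's algorithm to build the minimum spanning tree, and the n × d shape of the returned coordinates are assumptions made for this sketch.

```python
import numpy as np

def mc_kernel(G):
    """Non-centered MC-kernel: pairwise distances measured over the
    minimum spanning tree built on the Euclidean distances in G."""
    n = G.shape[0]
    # Euclidean distance matrix A between individuals (rows of G).
    A = np.sqrt(((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=2))
    # Prim's algorithm (an assumption: any MST algorithm would do).
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = A[0].copy()              # cheapest link of each node to the tree
    parent = np.zeros(n, dtype=int)
    D = np.zeros((n, n))            # tree distances, filled as nodes join
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        i = parent[j]
        members = np.flatnonzero(in_tree)
        # j is a new leaf attached to i: every path from j passes through i.
        D[j, members] = D[i, members] + A[i, j]
        D[members, j] = D[j, members]
        in_tree[j] = True
        closer = (A[j] < best) & ~in_tree
        best[closer] = A[j, closer]
        parent[closer] = j
    return D

def mce(G, d=2, centered=True):
    """Embed the rows of G in d dimensions from the MC-kernel."""
    Dnc = mc_kernel(G)
    n = Dnc.shape[0]
    if centered:
        J = np.eye(n) - np.ones((n, n)) / n
        K = -0.5 * J @ (Dnc ** 2) @ J   # Dc = -1/2 J Dnc^2 J
    else:
        K = Dnc
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    # X = (sqrt(Sigma_d) V_d^T)^T, one row per individual.
    return (np.sqrt(s[:d])[:, None] * Vt[:d]).T
```

For three points at the corners of a right-angled triangle with unit legs, the tree distance between the two far corners is 2 (the path through the right-angle vertex) rather than the Euclidean value of about 1.41, which is the nonlinearity the MC-kernel introduces.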
Acknowledgements

M.A. is supported by AIRC Special Program Molecular Clinical Oncology 5 per mille n. 9965. C.V.C. is supported by the independent group leader starting grant of the Technische Universität Dresden (TUD).
References

1. Smialowski P, Frishman D, Kramer S (2009) Pitfalls of supervised feature selection. Bioinformatics 26:440–443
2. Bellman RE (2003) Dynamic programming. Courier Dover, New York
3. Martella F (2006) Classification of microarray data with factor mixture models. Bioinformatics 22:202–208
4. Cannistraci CV, Ravasi T, Montevecchi F et al. (2010) Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. Bioinformatics 26:i1–i9
5. Marengo E, Robotti E, Antonucci F et al. (2005) Numerical approaches for quantitative analysis of two-dimensional maps: a review of commercial software and home-made systems. Proteomics 5:654–666
6. Cannistraci CV, Alanis-Lobato G, Ravasi T (2013) Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics 29:i199–i209
7. Moitinho-Silva L, Bayer K, Cannistraci CV et al. (2014) Specificity and transcriptional activity of microbiota associated with low and high microbial abundance sponges from the Red Sea. Mol Ecol 23:1348–1363
8. Zagar L, Mulas F, Garagna S et al. (2011) Stage prediction of embryonic stem cell differentiation from genome-wide expression data. Bioinformatics 27:2546–2553
9. Conti A, Ricchiuto P, Iannaccone S et al. (2005) Pigment epithelium-derived factor is differentially expressed in peripheral neuropathies. Proteomics 5:4558–4567
10. Cannistraci CV, Montevecchi FM, Alessio M (2009) Median-modified Wiener filter provides efficient denoising, preserving spot edge and morphology in 2-DE image processing. Proteomics 9:4908–4919
11. Pattini L, Mazzara S, Conti A et al. (2008) An integrated strategy in two-dimensional electrophoresis analysis able to identify discriminants between different clinical conditions. Exp Biol Med 233:483–491
12. Conti A, Iannaccone S, Sferrazza B et al. (2008) Differential expression of ceruloplasmin isoforms in the cerebrospinal fluid of Amyotrophic Lateral Sclerosis patients. Proteomics Clin Appl 2:1628–1637
13. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315:972–976
Chapter 17

Differential Analysis of 2-D Maps by Pixel-Based Approaches

Emilio Marengo, Elisa Robotti, and Fabio Quasso

Abstract

Two approaches to the analysis of 2-D maps are available: the first one involves a step of spot detection on each gel image; the second one is based instead on the direct differential analysis of 2-D map images, following a pixel-based procedure. Both approaches strongly depend on the proper alignment of the gel images, but the pixel-based approach overcomes important drawbacks of the spot-volume procedure, namely missing data and overlapping spots. However, this approach is quite computationally intensive and requires the use of algorithms able to separate the information (i.e., spot-related information) from the background. Here, the most recent pixel-based approaches are described.

Key words Pixel-based approach, Gel-electrophoresis, Fuzzy logic, Three-way PCA
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9_17, © Springer Science+Business Media New York 2016

1 Introduction

Two main approaches to the analysis of 2-D maps are available in the literature; both are based on a multi-step procedure of 2-D map image analysis (Fig. 1). The first approach involves a step of spot detection on each gel image, to provide a final list of the spots present on all the gels investigated, each spot being characterized by its volume (i.e., the sum of the optical density of all the pixels contained within the spot boundaries): the final differential analysis (to provide candidate biomarkers of the effect investigated) is then performed on the spot volume dataset. The second approach is instead based on the direct differential analysis of 2-D map images following a pixel-based procedure.

Fig. 1 Approaches to 2-D map analysis: spot based (a) and pixel based (b)

2-D maps are traditionally analyzed by identifying spot boundaries on each gel and matching spots from each gel on a reference or master gel (alternatively, alignment and matching can be made at a pixel level and spots detected on the master gel only) [1]. However, this approach suffers from an important drawback: when the matching method fails, missing values are introduced in the spot volume data table; they can be due to the absence of a spot on the gel (in this case introducing a zero value in the data is correct), or to a failure in matching (in this case a zero value is wrong information and may lead to the identification of false biomarkers) [2, 3]. The problem of missing values can be solved by exploiting common spot boundaries for all images (this is the choice of some commercial software like Progenesis SameSpots from Nonlinear Dynamics and Delta-2-D from Decodon (see Chapter 4 for more details)). Although the problem of missing values can be solved in this way, the problem of defining the boundaries of each spot remains unsolved and is particularly challenging in the case of overlapping spots: changes in overlapping spots that lie within the same boundary may not be detected if the overlapping spots respond differently to the investigated effect (a drug, a pathology, etc.) [4]. The pixel-based approach can overcome these problems since no assumptions are made on spots, and matching is based on single pixels. It is important to stress that both approaches (spot-based and pixel-based) rely on proper gel alignment. A pixel-based approach therefore offers several advantages [2], but it is computationally demanding, and procedures must be employed to separate information (i.e., spot-related information) from background [4]. Here, the most recent pixel-based approaches are described and the main results obtained by the authors are briefly presented.
2 Methods
2.1 Correlation Analysis
2.1.1 Method Pipeline
Van Belle et al. [5] developed a pixel-based method that omits the spot detection step and is therefore not prone to human interpretation of gel-to-gel information. Given a set of gel images, the approach measures the correlation between every pixel position and an external variable (representing either class membership or a continuous variable). The authors tested the approach on a set of simulated 2-D maps with different levels of background, additional noise, and outliers, and proved its effectiveness on profiles of p53 protein isoforms of cell samples from patients with hematological pathologies. The possibility of using gel images without a previous spot detection step is conditional on their correct alignment. A Y variable must be defined, expressing the biological information, e.g., class membership (control vs. pathological, control vs. drug treated, etc.), life expectancy, or the effect of ripening or of a drug. For every pixel coordinate in the set of 2-DE images, a correlation analysis is performed between the pixel data at the corresponding position and the Y variable. The correlation image is created by repeating this process for all the available pixel coordinates. The authors illustrated the method on a simulated set of gels with defined spot characteristics, as a function of an external variable Y. More recently, Øye et al. [6] developed a software tool for correlation analysis based on the procedure presented here. Figure 2 represents the procedure presented by Van Belle et al. [5].
1. Step 1: data preparation. Given a set of n gels, each gel is described as a vector $A_z$, in which z is the gel image ID number; the n external parameters related to the biological external variable are contained in the vector $Y_z$. The gels can be further indicated as $A_{x,y,z}$, in which (x, y) is each pixel position on gel number z. $A_{x,y}$ is a vector containing the intensities of the pixels of all gels:

$$A_{x,y} = \left[ A_{x,y,1} \;\; A_{x,y,2} \;\; \ldots \;\; A_{x,y,n} \right] \tag{1}$$

2. Step 2: alignment and registration. The method is based on a first alignment of all gels: calibration spots can be used; otherwise, other techniques can be exploited (Hough transformation) [7–9] (see Note 1). The aligned images are indicated as $A'_z$.
3. Step 3: intensity normalization. The next step normalizes the intensity values of the gels to allow for inter-gel pixel comparison. Different procedures are available (see Chapter 6 for more details). The authors propose the use of relative gray values (see Note 2).
4. Step 3a: background intensity. The background depends on the capture technique adopted. Background signal can have either an additive effect (it is added to all pixel values), a multiplicative effect (it accumulates with a decaying signal), or a mixture of both [10]. Additive noise can be removed by subtracting the average or the median value:

$$A''_z = A'_z - \bar{A}'_z \tag{2}$$

$$A''_z = A'_z - \operatorname{median}\left(A'_z\right) \tag{3}$$

where $\bar{A}'_z$ is the average value. Multiplicative noise can instead be removed as in:

$$A''_z = \frac{A'_z}{\bar{A}'_z} - 1 \tag{4}$$

All normalization techniques applied in this step have to be applied independently for each gel.

Fig. 2 Procedure for correlation analysis applied by Van Belle et al. [5]. Reprinted from an Open Access paper [5]

5. Step 3b: scaling of gel intensity. Gel intensities then have to be normalized; if calibration spots exist, they can be used for this purpose (see Chapter 6 for more details on normalization tools). If $A'_z$ is the image before normalization and (x, y) is the calibration spot position, then the normalized image can be calculated as:

$$A''_z = \frac{A'_z}{A'_{x,y}} \tag{5}$$

If calibration spots are not available, the sum of all intensities (RMS) can be a valid alternative:

$$A''_z = \frac{A'_z}{\operatorname{RMS}\left(A'_z\right)} \tag{6}$$
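The per-gel corrections of Eqs. 2–6 can be illustrated as follows. This is a minimal sketch in Python with NumPy, assuming that each aligned gel is available as a 2-D array of intensities; the function names are hypothetical and are not taken from the software of Van Belle et al. or Øye et al.

```python
import numpy as np

def remove_additive_background(gel, use_median=False):
    """Eqs. 2 and 3: subtract the mean (or the median) intensity of the gel."""
    offset = np.median(gel) if use_median else gel.mean()
    return gel - offset

def remove_multiplicative_background(gel):
    """Eq. 4: divide by the mean intensity of the gel, then subtract 1."""
    return gel / gel.mean() - 1.0

def scale_by_calibration_spot(gel, row, col):
    """Eq. 5: divide by the intensity at the calibration spot position."""
    return gel / gel[row, col]

def scale_by_rms(gel):
    """Eq. 6: divide by the root mean square of all pixel intensities."""
    return gel / np.sqrt(np.mean(gel ** 2))
```

Each function takes a single gel, in line with the requirement that these corrections be applied independently for each gel.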
6. Step 4: correlation image. The correlation analysis generates a new image showing the correlation between the pixels at a specific position and the Y variable. Each pixel in the correlation image ranges between −1.0 (negative correlation) and +1.0 (positive correlation). Repeating the correlation test for every pixel results in the correlation image C:

$$C_{x,y} = \rho\left(A''_{x,y};\, Y\right) \tag{7}$$

The correlation image can be visualized using a different color scale for positive and negative correlations (e.g., green for positive correlations and brown for negative ones). The authors propose the robust Spearman rank order correlation (ρ-correlation) [11] as a measure of correlation (see Note 3).
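The construction of the correlation image of Eq. 7 can be sketched as below. This is an illustrative Python/NumPy example, assuming the normalized gels are stacked in a three-dimensional array; the function names rank and correlation_image are hypothetical. Setting the correlation to zero where a pixel is constant across gels is a choice made for this sketch, not part of the published method.

```python
import numpy as np

def rank(values):
    """1-based ranks; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def correlation_image(stack, y):
    """Spearman correlation between each pixel position and Y.

    stack: array of shape (n_gels, rows, cols) with the normalized gels.
    y:     sequence of n_gels values of the external variable.
    """
    n, rows, cols = stack.shape
    ry = rank(y)
    ry = ry - ry.mean()
    C = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            rx = rank(stack[:, r, c])
            rx = rx - rx.mean()
            denom = np.sqrt((rx ** 2).sum() * (ry ** 2).sum())
            # Constant pixels have no rank variation: report 0 (assumption).
            C[r, c] = (rx * ry).sum() / denom if denom > 0 else 0.0
    return C
```

The Spearman coefficient is computed here as the Pearson correlation of the ranks, so only the ordering of the intensities across gels matters, not their absolute values.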
7. Step 5: masking. Correlations in the correlation image do not necessarily imply a significant relationship between pixel values and the Y variable. Masking is therefore applied to filter out non-significant correlations. Two filters (hereafter called masks) are applied: the first (named "significance") removes chance correlations, while the second (named "variance") removes correlations that provide little useful information (e.g., datasets containing all zero values).
8. Step 5a: significance. To remove chance correlations, the authors chose the significance test usually associated with the Spearman correlation test. It is calculated as follows:

$$S_{x,y} = 1 - C_{x,y} \sqrt{\frac{n - 2}{1 - C_{x,y}^{2}}} \tag{8}$$

If the result is close to 1, the probability of a chance correlation is very low; otherwise, if it is close to 0, the chance correlation probability is quite large.
9. Step 5b: variance. The second mask avoids significant correlations with a low biological significance due to small variations of gel intensities. The standard deviation [12] measured on the relative, non-ranked, gel intensities is calculated:

$$D_{x,y} = \sqrt{\frac{\sum_{z=0}^{n-1} \left( A''_{x,y,z} - \bar{A}''_{x,y,*} \right)^{2}}{N}} \tag{9}$$

The calculated value is large in case of a varying gel expression, while it is zero when the gel expression is constant.
10. Step 5c: masked correlation image. The two masks (Eqs. 8 and 9) can then be multiplied by the correlation image (Eq. 7) to give the masked correlation image R:

$$R = C \cdot S \cdot D \tag{10}$$
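The variance mask of Eq. 9 and the product of Eq. 10 can be sketched as follows. This is a minimal Python/NumPy illustration; the function names are hypothetical, and the significance mask S of Eq. 8 is taken here as a precomputed input rather than derived in the code.

```python
import numpy as np

def variance_mask(stack):
    """Eq. 9: per-pixel standard deviation across gels (divisor N).

    stack: array of shape (n_gels, rows, cols) with the normalized gels.
    """
    return stack.std(axis=0)

def masked_correlation(C, S, D):
    """Eq. 10: entry-wise product of the correlation image and both masks."""
    return C * S * D
```

A pixel whose intensity is identical on all gels has D = 0 and is therefore set to zero in R, whatever its correlation value.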
The pixel values of R no longer relate to the correct correlation measure, so R is used as an indicator showing positions of possible interest.

2.1.2 Results
1. Altering spot position and sizes. The authors evaluated how the method worked with changes in spot position, size, and shifts, using a simulated set of gels with different spots:
(a) Spot α grows and fades out and spot β changes shape: spots α and β correlate with Y and, in the correlation images, they are at their original position, showing that the correlation image offers correct positional information.
(b) Spot δ shifts from left to right: shifts in spot δ relate to Y, and the correlation image reveals this behavior by showing original and destination positions that correlate positively and negatively, respectively.
(c) Spots γ have a constant amplitude and width: they are not correlated with Y.
2. Spot shape. The correlation image shows spots with diffusion-like alterations, indicating a difference in correlation between the inner and outer areas of the corresponding spots. Similar behavior can be observed for spot β: the initial vertical shape anticorrelates (it disappears) while the later horizontal shape correlates (it appears).
3. Masking the correlation image. In the simulated gels, the background shows an almost constant intensity. If no masks are applied, the background shows a strong correlation or anticorrelation: the area can be constant, resulting in correlations that are represented in the correlation image as +1 or −1; or, in areas with very small alterations (the spot boundaries), the correlation is mathematically correct, but the lack of intensity variation gives little information. The application of the masks yields a correlation image where only areas with relevant spot modulations are indicated: the first mask removes non-significant correlations while the second one removes areas without variance.
4. Effect of different normalizations. Different background removal and scaling techniques were tested. In all cases, the original information was retrieved. The normalization technique proved to be of little importance for qualitative analysis. When gels were normalized by dividing by the average gel intensity, new information appeared that was not directly related to the simulated gels: since spot α increases with Y, the mean intensity of the gel increases as well; therefore, after normalization the spots are no longer constant but decrease in intensity and are highlighted in the correlation image (see Note 4). Quantitatively, normalization strongly influences correlation (see Note 5).
5. White noise in 2-DE images. The addition of white noise [7] to the simulated images attenuates non-significant background correlations. Increasing noise up to 75 % (with respect to the maximum intensity of the image) gives weaker correlations (see Note 6). If white noise does not relate to the Y variable, it does not influence correlation.
6. Effect of randomization of the dataset. Correlation towards a random Y vector was also tested [11], and the method proved to be robust.
7. Outliers. A test with outliers in the Y-values showed a low effect on the final interpretation; even with 13 % outliers, the original information was recovered (see Note 7).

2.2 Pixel-Based Analysis of Multiple Images for the Identification of Changes (PMC)
Færgestad et al. [2] have developed PMC, a pixel-based method for the comparison of 2-D map images. They propose a complete pipeline, from image analysis to the final use of multivariate techniques, for the identification of differentially expressed regions on the gel images. They also compare the adopted procedure with the classical approach based on the evaluation of spot volume datasets. Their approach is briefly described here; the authors demonstrate the effectiveness of the procedure on samples of Longissimus dorsi muscle collected 1, 2, 3, 6, and 10 h after slaughter.
2.2.1 Method Pipeline
1. Image alignment. Gel alignment was performed by the software TT900 S2S (Nonlinear Dynamics, Newcastle Upon Tyne, UK), using one of the gel images as reference. An initial anchor was set on a protein easily identified on both images. The software then automatically suggested a number of additional, uniformly distributed anchors. All anchors were visually inspected and, where necessary, removed by the operator. The resulting anchors were then used for the automatic alignment (see Note 8).
2. Image pretreatment: normalization. Normalization is applied by dividing each image by the total intensity of the image, rescaled in the range [0, 1] across all images.
3. Image pretreatment: background correction. This can be accomplished by two alternative procedures.
Procedure A. Background is eliminated by the penalized asymmetric least squares (PALS) approach, originally proposed by Eilers [13] for 1D signals. The objective function is defined as:

$$Q = \sum_i v_i \left( y_i - f_i \right)^2 + \lambda \sum_i \left( \Delta^d f_i \right)^2 \tag{11}$$

where y and f are the original and the smoothed signals, respectively, $\Delta^d$ is the d-th degree derivative (see Note 9), λ is the penalty coefficient, and v are the weights. The term "asymmetric" refers to the fact that the data points are weighted differently according to whether they lie above or below the signal approximation f (see Note 10):

$$v_i = \begin{cases} p & \text{if } y_i > f_i \\ 1 - p & \text{if } y_i < f_i \end{cases} \tag{12}$$

where p can vary in the range from 0 to 1 (see Note 11).
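The iterative reweighting defined by Eqs. 11 and 12 can be sketched for a 1D signal as follows. This is an illustrative Python/NumPy example: the function name and the default values of lam, p, d, and n_iter are assumptions, and the signal is smoothed directly on the pixel grid with a d-th order difference penalty rather than through a B-spline basis.

```python
import numpy as np

def asymmetric_least_squares(y, lam=1e4, p=0.01, d=2, n_iter=20):
    """Estimate a smooth background f under the 1D signal y.

    Minimizes Q = sum_i v_i (y_i - f_i)^2 + lam * sum_i (Delta^d f_i)^2,
    updating the weights v after each fit according to Eq. 12.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # d-th order difference operator, of shape (n - d, n).
    Dm = np.diff(np.eye(n), n=d, axis=0)
    penalty = lam * Dm.T @ Dm
    v = np.ones(n)                       # start with unit weights
    for _ in range(n_iter):
        # Normal equations of Q for fixed weights: (V + lam D'D) f = V y.
        f = np.linalg.solve(np.diag(v) + penalty, v * y)
        # Points above the fit (spots) get weight p, the others 1 - p.
        v = np.where(y > f, p, 1.0 - p)
    return f
```

With a small p, pixels belonging to spots are almost ignored in the fit, so f follows the background underneath them; a larger lam gives a stiffer background estimate.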
The first term in Eq. 11 is the signal fit, while the second term is the penalty function, introduced to control the smoothness of the background approximation. λ must be optimized. The signal approximation, f, is calculated as a linear combination of B-spline functions to allow the extension to two dimensions:

$$f = \alpha B \tag{13}$$

where B denotes the B-spline basis. The iterative procedure starts with all unit weights, and then the data points are weighted asymmetrically according to Eq. 12 [14, 15].
Procedure B. The authors also applied a method by Lieber and Jansen [16] consisting of repeatedly fitting polynomial curves to a 1D signal. The method was adapted to 2-D images by applying the algorithm first row-wise and then column-wise. The method first fits a polynomial curve of some degree to the signal values: signals above the polynomial curve are removed and a new approximation is performed. Iterations are repeated until convergence. When the function is applied sequentially in both vertical and horizontal directions, the largest value of the polynomial curve is used to build the background image. The authors used a polynomial of fourth degree and 50 iterations.
4. Data compression and reduction. Pixel-based analysis requires the unfolding of the image pixels into a 1D vector for each image, providing a huge number of variables (in the study presented by the authors three million pixels were used), which makes significance testing and data capacity a challenge. The authors therefore applied a data reduction and compression step to make the method feasible in terms of computation time.
5. Modeling. Multivariate techniques were applied by the authors: (a) principal component analysis (PCA) after mean-centering and (b) partial least squares (PLS) using "time after slaughter" as the Y variable, with leave-one-out (LOO) cross-validation. Significant changes among different pixels are tested by evaluating the variability of the regression coefficients by Jack-knifing [17]. A t-test is then performed on the estimates of the regression coefficients from the various cross-validation segments, to test whether they are significantly different from zero: only pixels with stable regression coefficients are retained as significant. To avoid retaining low regression coefficients as significant (see Note 12), the authors fixed a minimum value for the standard deviation of the perturbed regression coefficients (see Note 13). Finally, the spatial location of the significant pixels as a 2-D map was used as extra visual validation: regression coefficients related to significant variations in protein spots will mainly correspond to areas densely populated by spots, defining the corresponding differentially expressed protein spots.

2.2.2 Results
1. Image alignment. Gels within the same batch align well, while the comparison across different batches shows differences in spot migration. Only gels showing similar migrations must be retained. A closer inspection of the 1-D representation of some samples highlighted baseline variations resulting from nonuniform background intensities in the gels: normalization and background correction are therefore mandatory.
2. Normalization. The choice of the pretreatment procedures may affect the final differential analysis [18]. The authors applied a normalization providing the same total intensity for each image, since equal protein amounts are loaded on each gel (see Note 14). This is suitable when there are no significant variations in staining intensity across the gels; otherwise it would increase the signals of the proteins on gels where fewer proteins are present. To solve this problem, the authors recommend applying a quality control procedure to the gel images (only samples with similar staining intensities should be retained). It must be pointed out, however, that this normalization may cause large variations even if a single spot is missing (due to the smaller total intensity).
3. Background correction. The two procedures applied for background correction provided different results. The correction by the line-wise polynomial 1-D approach captures much of the streaks in the horizontal and vertical directions, as well as a smoother background staining over larger regions of the gel. The other approach, applied to both dimensions simultaneously, did not capture these streaks. However, both approaches performed better than no background correction. The authors selected the method based on 1-D polynomial correction.
4. Data compression and reduction. Data reduction was applied by removing 50 pixels around the gel boundary (this area was noisy); then the resolution was reduced by a factor of 0.5 via bicubic interpolation. This resulted in a total of 187,165 variables per sample for the whole gel.
5. Modeling. PCA was applied to the unfolded gel images: PC1 reflected variations related to proteome changes along time after slaughter. Similar scores were obtained by applying PLS using the PCs of the 2-DE pixel vectors as regressors and the log10 transformation of the time after slaughter as response. Loading plots were refolded back to the 2-DE gel images to identify the regions responsible for the differences between the samples: black pixels correspond to a decrease in intensity as time after slaughter increases, and white pixels show the opposite behavior. To highlight areas of the image containing important variations, the pixels found significant by the Jack-knifing test were plotted on the average image of all samples: blue pixels correspond to a negative correlation with time after slaughter, while red pixels show a positive correlation with time after slaughter. The 2-D pattern of significant changes was similar with the application of PALS for background removal and for images without background correction; however, significant pixels outside the spot regions were identified without background correction.
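The data reduction, unfolding, PCA, and refolding of the loadings described above can be sketched as follows. This is an illustrative Python/NumPy example with hypothetical function names. The resolution is halved here by 2 × 2 block averaging, a simpler substitute for the bicubic interpolation used by the authors, and PCA is computed by singular value decomposition of the mean-centered data.

```python
import numpy as np

def reduce_image(img, border=50):
    """Crop `border` pixels on every side, then halve the resolution
    by averaging 2 x 2 blocks (a stand-in for bicubic interpolation)."""
    core = img[border:img.shape[0] - border, border:img.shape[1] - border]
    r = core.shape[0] // 2 * 2           # drop an odd trailing row or column
    c = core.shape[1] // 2 * 2
    return core[:r, :c].reshape(r // 2, 2, c // 2, 2).mean(axis=(1, 3))

def pca_unfolded(images, n_components=2):
    """PCA of gel images unfolded to 1D vectors, after mean-centering.

    Returns the sample scores and the loadings refolded to image shape.
    """
    shape = images[0].shape
    X = np.stack([im.ravel() for im in images])   # one row per gel
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].reshape((n_components,) + shape)
    return scores, loadings
```

Each refolded loading has the same shape as the reduced gel image, so it can be displayed as a map of the regions that drive the corresponding component.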
Significant pixels are clustered on top of or near to protein spots: for some spots, they cover the whole spot and for others are limited to certain regions. This can be due to two main reasons: (a) high-density spots can show the problem of saturation around their maxima, preventing pixels from being detected as significant; (b) significant pixels near the periphery of some spots may indicate that they represent partially unresolved proteins. The PMC approach effectively identifies significant changes in small shoulders of larger proteins and it is also able to detect significant changes in the border of saturated proteins. In spite of saturation in the central part of a protein spot, it may be possible, by PMC, to detect significant variations along the border of the spot. This would depend on how severe the saturation problem is.
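The unfold, model, and refold sequence described in the modeling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `unfold_pca`, the `border` argument and the toy data are assumptions made for illustration, and plain PCA via singular value decomposition stands in for the PCA/PLS models with jack-knifing used in the study.

```python
import numpy as np

def unfold_pca(images, border=0, n_components=2):
    """PCA on unfolded gel images; loadings are refolded to image shape.

    images: array of shape (n_samples, height, width).
    border: pixels trimmed from every edge before unfolding (hypothetical
            stand-in for the removal of the noisy gel boundary).
    """
    imgs = np.asarray(images, dtype=float)
    if border > 0:
        imgs = imgs[:, border:-border, border:-border]
    n, h, w = imgs.shape
    X = imgs.reshape(n, h * w)              # one row of pixel variables per gel
    Xc = X - X.mean(axis=0)                 # column-centre every pixel variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].reshape(n_components, h, w)  # refolded
    return scores, loadings

# Toy data (illustrative only): six 20x20 "gels" in which one spot fades
# along a time-like index t.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 6)
gels = rng.normal(0.0, 0.01, size=(6, 20, 20))
gels[:, 8:12, 8:12] += (1.0 - t)[:, None, None]
scores, loadings = unfold_pca(gels, border=2, n_components=2)
```

Displaying `loadings[0]` as an image localizes the pixels that drive PC1, which is the idea behind plotting refolded loadings on the gel. The sign of a principal component is arbitrary, so the direction of change has to be read together with the scores.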
2.3 Start-to-End Processing of 2-D Map Images
2.3.1 Method Pipeline
The main drawback of the method proposed by Færgestad et al. [2] is data capacity: the large number of pixels to be processed makes multivariate techniques computationally intensive. The same authors propose an improved approach [4] that includes a step identifying the pixels that contain information (spots): only pixels containing information in at least one sample are retained for multivariate analysis, while pixels related to background are eliminated. This approach therefore requires a spot detection step. A similar approach was also applied by Daszykowski et al. [19]. The two methods differ in the choices made within the individual steps. Here, we briefly describe both approaches based on a common general pipeline.
1. Step 1a - image pretreatment: background elimination. Daszykowski et al. [19] apply the procedure already proposed in Ref. 2, based on asymmetric least squares [13]. The authors in Ref. 4 instead apply a uniform background elimination by subtracting the minimum pixel value of each gel image. This procedure is applied only when the final aim is to detect spot pixels on the image. Pixel intensities corrected with this method are not used for multivariate analysis, since this approach gives pixel intensities that are not comparable across different gels. For the application of multivariate tools, the procedure by Lieber and Jansen [16] was applied instead.
2. Step 1b - image pretreatment: normalization. Rye et al. [4] propose the normalization of the pixel intensities as in Ref. 2, i.e., dividing each image by the total intensity of the image, rescaled in the range [0, 1] across all images.
3. Step 1c - image pretreatment: noise elimination. Electrophoretic images contain features of different frequencies; efficient elimination of noise can be accomplished in the wavelet domain [20–23]. The decomposition is carried out in Ref. 19 using 1-D wavelets applied along the two axes of the image (see Note 15).
The original signal is decomposed in the chosen wavelet basis into its approximation and details, using a pair of wavelet filters (low pass and high pass). At the next level of decomposition, these filters are applied to the approximation from level 1, and so on. At the first level of the image decomposition, one set of approximation coefficients and three sets of details are obtained (see Note 16). Each set of approximation coefficients represents the smoothed version of the decomposed image, while the details are related to the eliminated features (see Note 17); the latter can be used to calculate the cutoff value of the noise (see Note 18). The authors propose the Bayes threshold method [24, 25], where the threshold value is estimated from the generalized Gaussian distribution parameters:
$$t = \frac{\sigma^2}{\sigma_{s,i}} \tag{14}$$
where $\sigma$ and $\sigma_{s,i}$ are calculated as:
$$\sigma = 1.4826 \cdot \mathrm{median}(|c_{HH1}|) \tag{15}$$
and:
$$\sigma_{s,i} = \sqrt{\max(\sigma_i^2 - \sigma^2,\, 0)} \quad \text{for } i = 1, 2, \ldots, L \tag{16}$$
where $c_{HH1}$ are the wavelet coefficients from the HH1 level, $\sigma_i^2$ is the variance of the coefficients from the i-th level, and L is the decomposition level (see Note 19). The authors applied this method together with the soft-thresholding approach (see Note 20), obtaining a threshold value for each level. The inverse wavelet transform can then be used for image reconstruction (see Note 21).
4. Step 2 - image alignment.
Procedure A. Gel alignment is performed in Ref. 4 by the software TT900 S2S (Nonlinear Dynamics, Newcastle upon Tyne, UK), using one of the gels as reference for the alignment of all images. An initial anchor is manually set on a protein identified on both images. Then, the software suggests additional, spatially distributed anchors, which have to be visually inspected and, if necessary, removed by the operator (other anchors can be added manually). The resulting anchors are then used for the automatic alignment.
Procedure B. Images are warped in Ref. 19 using fuzzy warping [26–28], choosing the image with the best quality as the target. The first step is the identification of the maxima of the most intense spots in the target and in the image to be warped (see Note 22). An iterative approach is then applied: (a) Spot correspondence is calculated based on the distances between spots. (b) Transform parameters are calculated by weighting each pair of spots by their actual correspondence. To calculate the spot correspondence, k 2-D Gaussian functions are centered at the centers of the k identified spots in the target image (see Note 23). The output of each Gaussian function is calculated for all the spots of the warped image, resulting in a matrix G(k, h), where k and h are the numbers of spots identified in the target and in the image to be warped, respectively (see Note 24), with elements in the range [0, 1] (see Note 25). To find the correspondence between the spots, the Sinkhorn iterative standardization [29] of the rows and columns of G is applied
(see Note 26) to provide a sum of the elements of each row and column close to 1. Pairs of corresponding spots are characterized by the highest elements of this matrix. Then, the spots, weighted by the corresponding value, are used for calculating the transformation parameters (see Note 27). A second-order polynomial is usually the best choice. Image P is then transformed into the target image, and the outputs of the Gaussian functions centered on the spots in the target image are calculated again. In each iteration the width of the Gaussian function is decreased ($s_{iter+1} = 0.8\, s_{iter}$) (see Note 28).
5. Step 3 - intensity interpolation. The transformation in Step 2, Procedure B, expresses the coordinates of the spots in the warped image as a function of the coordinates of the spots in the target: the predicted values of the spot coordinates in the warped image are therefore no longer integers representing pixels. The intensity at the pixel coordinates can be calculated by interpolation (the authors applied a bicubic interpolation [30]).
6. Step 4 - image segmentation. The authors in Ref. 19 propose to identify the spots in an image A representing the average image calculated from all the gels. A binary mask is then obtained for this image. In Ref. 4, instead, the authors identify spots in each individual image and create a binary mask for each image independently. In both cases, however, identification of the spots requires image segmentation, which can be performed by different procedures.
Procedure A. In Ref. 19 segmentation is carried out by image morphology. Single spikes are first removed by a median filter of size 3. Then an image morphology procedure [32, 33] is applied to correct for streaks and nonuniform background. The authors then propose image segmentation by the watershed transform [31] (for more information see Chapter 9).
Procedure B. In Ref. 4, instead, the authors propose to correct the images with structural elements similar to the features that one wants to keep or remove: the structural element for streak identification is a line of 61 pixels, while a circle with a diameter of 40 pixels is used for protein spot regions.
7. Step 5 - mask construction. The binary mask is constructed as follows: all values in an image smaller than a threshold are turned into 0 (background pixels), while the others are turned into 1 (spot pixels). The authors in Ref. 19 calculate the threshold value based on Otsu's method [34] (see Note 29) and apply it to image A. In Ref. 4, instead, the authors state that, after performing image morphology, a single threshold is sufficient to identify the spot regions in the resulting image, and apply a threshold equal to 0.025 to each image independently, to obtain a binary mask for each image.
The final binary mask is therefore based on the union of all the pixels representing protein spot regions in each individual gel: if a pixel has value 1 in only one of the binary images, it will be included in the mask. In both procedures, each final image is described by a vector containing the intensities observed for the pixels of the mask containing unit values.
8. Step 6 - modeling. The features extracted from all images are organized in matrix X; class belonging in two-class problems can be stated by adding a Y variable coded as 1 (sample belonging to the first class) and −1 (sample belonging to the second class). To identify features with statistically significant discrimination power, univariate t-tests or multivariate approaches can be applied (see Chapters 13 and 14 for more details). The authors in Ref. 19 applied discriminant partial least squares (PLS-DA) [35], discriminant radial basis function partial least squares (D-RBF-PLS) [36–39] and a modified version of uninformative variable elimination-PLS (UVE-PLS) [40] (see Note 30), while Rye et al. [4] applied PLS-DA. The model complexity can be determined by cross-validation (CV), possibly exploiting the Monte Carlo method (see Note 31).
(a) Partial least squares (PLS). PLS is a regression method that exploits the information contained in X to predict the Y variable. PLS models both X- and Y-variables simultaneously, to find the latent variables (LVs) in X that will predict the latent variables in Y. When the Y variable accounts for class belonging, the method is called PLS-DA [35].
(b) Uninformative variable elimination partial least squares (UVE-PLS). To improve PLS models, variables not related to Y should be removed; this can be accomplished by UVE-PLS [40]. Matrix X (m × n) is augmented with a matrix of at least 300 uninformative variables, containing (m × k) random numbers (see Note 32).
The stability of the j-th variable, $s_j$, is expressed as the ratio of the mean to the standard deviation of the m regression coefficients obtained for the j-th variable from the PLS models with f factors during LOO cross-validation:
$$s_j = \frac{\mathrm{mean}(B(:, j))}{\mathrm{std}(B(:, j))} \tag{17}$$
where B(:, j) is the j-th column of matrix B with m regression coefficients derived by using LOO cross-validation to choose the optimal model complexity. The first n columns of B contain regression coefficients for the
experimental variables, while the remaining k columns correspond to the uninformative variables. The latter show absolute stability values of the regression coefficients below the cutoff value, c, defined as the maximum stability of the regression coefficients associated with noise ($s_{n+1:n+k}$) (see Note 33). Monte Carlo UVE-PLS. To prevent overfitting, the authors also apply Monte Carlo validation (see Note 34).
(c) RBF-PLS. PLS can be extended to nonlinear versions by transforming the input data into the feature space, defined by the reproducing kernel K [36–39]:
$$K(x, x^T) = \langle \phi(x), \phi(x^T) \rangle \tag{18}$$
where ϕ is the mapping from X to the feature space, F. The most widely used kernel functions are the Gaussian ones:
$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)} \tag{19}$$
The kernel function therefore represents the output of the Gaussian function centered at xi, calculated for xj. The σ parameter requires optimization.
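As an illustration of Eq. 19, the Gaussian kernel matrix for a set of samples can be computed as in the following NumPy sketch. The function name `gaussian_kernel` and the small demonstration matrix are hypothetical; the sketch only builds the kernel matrix and does not reproduce the D-RBF-PLS modeling of Refs. 36–39.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) (Eq. 19).

    X: array of shape (n_samples, n_features), one sample per row.
    sigma: width of the Gaussian function (to be optimized by the user).
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances from ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)        # clip tiny negative values due to round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Illustrative data: three samples described by two features.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X_demo, sigma=1.0)
```

Each row of `K` contains the outputs of the Gaussian centered at one sample, evaluated at all samples. Small values of σ make the kernel matrix approach the identity, while large values drive all elements towards 1, which is why σ requires optimization.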
The same pipeline is applied by Daszykowski et al. [41] in a study devoted to evaluating the differences between pixel-based and spot-based approaches. The authors compare different univariate and multivariate methods for the final analysis of the data (both pixel and spot volume matrices): significance analysis (for univariate analysis), and PLS-DA and PLS with feature selection (for multivariate analysis). The significance of pixels or spots is evaluated by significance analysis of microarray data (SAM) [1, 42], where permutation tests and minimization of the false discovery rate (FDR, the percentage of chance correlations) are exploited for significance evaluation. The multistep procedure can be described as follows:
1. The "relative difference" d(i) for the intensities of the i-th variable is calculated by:
$$d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0} \tag{20}$$
where $\bar{x}_1(i)$ and $\bar{x}_2(i)$ are the average intensities of the i-th variable for class 1 and class 2, and s(i) is the standard deviation given by:
$$s(i) = \sqrt{\alpha \left\{ \sum_{a=1}^{m_1} \left[x_a(i) - \bar{x}_1(i)\right]^2 + \sum_{b=1}^{m_2} \left[x_b(i) - \bar{x}_2(i)\right]^2 \right\}} \tag{21}$$
and:
$$\alpha = \frac{1/m_1 + 1/m_2}{m_1 + m_2 - 2} \tag{22}$$
where $m_1$ and $m_2$ are the numbers of samples in class 1 and class 2, and the positive constant $s_0$ must be optimized (to minimize the coefficient of variation of d(i) versus s(i)) in order to eliminate the dependence of d(i) on the variable intensity.
2. Variables are ranked according to their d(i) values in decreasing order.
3. p permutations are performed, and for each permutation the values $d_p(i)$ are calculated and ranked in decreasing order.
4. Based on the ranked $d_p(i)$ values, the expected relative differences $d_E(i)$ are calculated by:
$$d_E(i) = \frac{\sum_p d_p(i)}{p} \tag{23}$$
where i is the position of the relative difference in the ordered sequence of d for the given permutation.
5. d(i) is plotted against $d_E(i)$.
6. For a fixed threshold, Δ, a variable is significantly positive if d(i) > 0 and $d(i) - d_E(i) > \Delta$, while it is significantly negative if d(i) < 0 and $d(i) - d_E(i) < -\Delta$. The smallest d(i) among the significant positive features is defined as $d_U$, while the largest d(i) among the significant negative features is defined as $d_L$. Therefore, the total number of significant variables, TP, is calculated as:
$$TP(\Delta) = \#\{1 \le i \le n : d(i) > d_U \ \text{or}\ d(i) < d_L\} \tag{24}$$
where # indicates the cardinality operator.
7. For a grid of threshold values, FDR is calculated as:
$$FDR(\Delta) = \hat{\pi}_0 \frac{FP(\Delta)}{TP(\Delta)} \tag{25}$$
where:
$$FP(\Delta) = \mathrm{mean}\{FP_1(\Delta), FP_2(\Delta), \ldots, FP_p(\Delta)\} \tag{26}$$
and:
$$FP_p(\Delta) = \#\{1 \le i \le n : d_p(i) > d_U \ \text{or}\ d_p(i) < d_L\} \tag{27}$$
where # indicates the cardinality operator; FP is the number of falsely identified significant features, i.e., features called significant in the permuted data, for a given threshold value; and TP is the number of significant features in the experimental dataset for the same threshold. $\hat{\pi}_0$ is the estimated proportion of non-significant features, calculated as follows:
$$\hat{\pi}_0 = \min(\pi_0, 1) \tag{28}$$
where $\pi_0 = \#\{d(i) \in (q_{25}, q_{75})\}/(0.5\,n)$, and $q_{25}$ and $q_{75}$ are the 25th and 75th percentiles of all the $d_p$ values of the permuted data (n features × p permutations).
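Steps 1–7 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the SAM implementation used by the authors: the function names `relative_difference` and `sam` are hypothetical, $s_0$ is passed as a fixed user-chosen value instead of being optimized, only one threshold Δ is evaluated rather than a grid, the cut-offs $d_U$ and $d_L$ are treated inclusively so that the variables defining them are counted, and the default of 200 permutations is far lower than the 10,000 used in the study.

```python
import numpy as np

def relative_difference(X, y, s0=0.0):
    """d(i) of Eq. 20 for every column of X; y holds the class labels 1 and 2."""
    X1, X2 = X[y == 1], X[y == 2]
    m1, m2 = len(X1), len(X2)
    alpha = (1.0 / m1 + 1.0 / m2) / (m1 + m2 - 2)                  # Eq. 22
    ss = ((X1 - X1.mean(0)) ** 2).sum(0) + ((X2 - X2.mean(0)) ** 2).sum(0)
    s = np.sqrt(alpha * ss)                                        # Eq. 21
    return (X1.mean(0) - X2.mean(0)) / (s + s0)

def sam(X, y, delta, s0=0.0, n_perm=200, seed=0):
    """Significant-variable mask and estimated FDR for a single threshold delta."""
    rng = np.random.default_rng(seed)
    d = relative_difference(X, y, s0)
    d_sorted = np.sort(d)[::-1]                     # step 2: decreasing order
    d_perm = np.empty((n_perm, X.shape[1]))
    for p in range(n_perm):                         # step 3: permuted labels
        d_perm[p] = np.sort(relative_difference(X, rng.permutation(y), s0))[::-1]
    d_exp = d_perm.mean(axis=0)                     # step 4: Eq. 23
    diff = d_sorted - d_exp
    pos = (d_sorted > 0) & (diff > delta)           # step 6
    neg = (d_sorted < 0) & (diff < -delta)
    d_up = d_sorted[pos].min() if pos.any() else np.inf
    d_low = d_sorted[neg].max() if neg.any() else -np.inf
    # Inclusive cut-offs (assumption): the variables defining d_U and d_L count.
    significant = (d >= d_up) | (d <= d_low)
    tp = significant.sum()                                          # Eq. 24
    fp = ((d_perm >= d_up) | (d_perm <= d_low)).sum(axis=1).mean()  # Eqs. 26-27
    q25, q75 = np.percentile(d_perm, [25, 75])
    pi0 = min(np.sum((d > q25) & (d < q75)) / (0.5 * d.size), 1.0)  # Eq. 28
    fdr = float(pi0 * fp / tp) if tp > 0 else 0.0                   # Eq. 25
    return significant, fdr

# Illustrative data: 10 samples per class, 200 variables, 5 of them shifted.
rng = np.random.default_rng(1)
y_demo = np.repeat([1, 2], 10)
X_demo = rng.normal(size=(20, 200))
X_demo[y_demo == 1, :5] += 6.0
significant, fdr = sam(X_demo, y_demo, delta=1.5, s0=0.1)
d_small = relative_difference(np.array([[1.0], [3.0], [5.0], [7.0]]),
                              np.array([1, 1, 2, 2]))
```

Calling `sam` over a grid of `delta` values and retaining the one with the lowest estimated FDR corresponds to the optimization of Δ.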
8. Δ, the best value distinguishing significant from insignificant variables, is identified by optimizing the false discovery rate (FDR). The authors performed 10,000 permutations.
The principles of significance analysis of individual features can be exploited in multivariate methods as well. FDR can be minimized by using permutation tests to select significant coefficients according to their stability in the PLS model. The stability of the regression coefficients is calculated by LOO cross-validation. The stability of the regression coefficients of the experimental features is then compared with the stability of the regression coefficients of the PLS model calculated for the permuted datasets. Stability therefore replaces parameter α in the univariate significance analysis, whereas the remaining steps are unchanged. Features with stability larger than that of the features for which the minimum FDR is observed are used to build the final PLS model. The authors performed 10,000 randomization tests.
2.3.2 Results
1. Denoising and background subtraction. The effect of image denoising and background correction depends on the image quality; in some cases these steps can be limited to the calculation of the average image only.
2. Warping and alignment. The authors propose to evaluate the most intense spots identified on each gel and on the target image as a pseudocolor image, combining three colors (red, green, and blue). Spots from the target image that have no corresponding spots in the gel image are displayed in green. Overlapping spots in the pseudocolor image of the warped images have a grayscale appearance. The authors report a successful alignment, since the pseudocolor image of two warped images reported in the paper shows a mostly grayscale appearance.
3. Mask construction. The authors built the binary mask using a threshold value equal to 0.1.
4. Modeling. The authors propose to display the correlation between each individual feature and the dependent variable Y in the form of an image, by displaying regression coefficients at each pixel position on a color scale from blue to red: pixels towards red show large positive regression coefficients, while pixels towards blue show large negative coefficients. This way of displaying the results can be further refined by representing only the features with a statistically significant correlation with the dependent variable Y.
2.4 Other Approaches
2.4.1 Three-Way PCA: Method Pipeline
Other pixel-based approaches have been proposed by the group of Marengo et al. and concern: the application of mathematical moment functions, like the Legendre and Zernike moments, to provide alternative image descriptors [43, 44]; the application of three-way PCA to the digitalized images [45]; and the development of a procedure to compare gel images based on fuzzy logic principles [46, 47]. The first approach, based on the Legendre and Zernike moments, is explained in Chapter 15 and will not be further described here. We briefly report here the other two approaches.
1. Digitalization. Each 2-D map can be scanned by a densitometer (see Note 35) and turned into an image of 200 × 200 pixels, in which each pixel corresponds to the value of its optical density. Each image is then turned into a grid of 50 × 50 cells (see Note 36) containing values in the range [0, 1]. The values smaller than a threshold (see Note 37) are substituted with zero to eliminate the information about the background intensity.
2. Data transformation. A normalization is necessary before the calculation of the three-way PCA model, to make all samples comparable. Here maximum scaling is exploited, which means that each map is scaled to its maximum value by:
$$x'_k(i, j) = \frac{x_k(i, j)}{\max(x_k)} \tag{29}$$
where $x_k(i, j)$ is the value corresponding to the cell in position (i, j) in the k-th image and $\max(x_k)$ is the maximum value over all the cells of the k-th image (see Note 38).
3. Three-mode principal component analysis. Three-way PCA [48–54] is a multivariate statistical method that takes into account the three-way structure of the data, which can be represented as a parallelepiped of size I × J × K (conventionally defined as objects, variables and conditions), where I and J are the numbers of rows and columns of the grids and K is the number of samples. Here three-way PCA is based on the
Tucker3 algorithm [48]. The final result is given by three sets of loadings and a core array G, describing the relationship among them. Each of the three sets of loadings can be displayed and interpreted in the same way as the scores of standard PCA. In the case of a cubic core array, a series of orthogonal rotations can be performed on the three spaces of the three modes, searching for a common orientation for which the core array is as body-diagonal as possible. If this condition is sufficiently fulfilled, i.e., if the elements $g_{111}, g_{222}, \ldots$ are the only elements of the core array significantly different from 0, then the rotated sets of loadings can also be interpreted jointly.
2.4.2 Three-Way PCA: Results
1. Case study. The approach was applied to two datasets: (a) rat sera (five maps from control Wistar rat sera and five maps of sera from Wistar rats treated with nicotine); and (b) human lymph nodes (four maps from control human lymph nodes and four maps from human lymph nodes affected by mantle cell lymphoma).
2. Digitalization. Each map is turned into a grid of 50 × 50 cells containing values in the range [0, 1] by applying a threshold value of 0.4.
3. Three-way PCA. Three-way PCA was applied to both normalized datasets; for the rat sera dataset, the first two factors explained 59.8 % of the total variance, while for the human lymph nodes dataset, they explained 50.8 % of the total variance. In both cases three-way PCA succeeded in separating the samples of the two classes, and in both datasets it was also possible to identify the regions of the maps responsible for the differences between the two classes of samples (i.e., regions containing potential biomarkers).
4. Difference analysis. The differences between the two classes of samples can be identified from the averaged samples of each class, in the space defined by variables and objects obtained from three-way PCA. The average maps can be re-projected into the original space, thus obtaining the corresponding 2-D map images, containing only the information accounted for by the first two three-way factors. The rebuilt images can then be compared (they can be subtracted from one another) to identify the discriminant regions of the 2-D maps. This acts as a sort of filter for the useful (discriminant) information contained in the 2-D maps. The map representing the differences between the two average samples can be represented on a color scale, with areas towards red corresponding to areas more intense in one class and areas towards blue corresponding to areas more intense in the other one.
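The preprocessing that precedes the three-way model (gridding, thresholding and the maximum scaling of Eq. 29) and the final subtraction of class-average maps can be sketched in NumPy. Several points are assumptions made only for illustration: cells are obtained here by averaging square pixel blocks (the text does not state how cell values are computed), the function names `to_grid`, `preprocess` and `class_difference` are hypothetical, and the Tucker3 decomposition and re-projection are deliberately omitted, so the difference is taken between plain class averages rather than between maps rebuilt from the first two factors.

```python
import numpy as np

def to_grid(image, cells=50):
    """Average square pixel blocks so that the image becomes cells x cells."""
    h, w = image.shape
    bh, bw = h // cells, w // cells
    trimmed = image[:bh * cells, :bw * cells]
    return trimmed.reshape(cells, bh, cells, bw).mean(axis=(1, 3))

def preprocess(image, threshold, cells=50):
    """Grid the map, zero the background cells, then apply maximum scaling."""
    grid = to_grid(np.asarray(image, dtype=float), cells)
    grid = np.where(grid < threshold, 0.0, grid)    # background set to zero
    peak = grid.max()
    return grid / peak if peak > 0 else grid        # Eq. 29

def class_difference(maps, labels):
    """Average map of class 0 minus average map of class 1."""
    maps, labels = np.asarray(maps), np.asarray(labels)
    return maps[labels == 0].mean(axis=0) - maps[labels == 1].mean(axis=0)

# Illustrative data: four 200x200 maps with a weak random background, one spot
# common to all maps and one spot present only in the first class.
rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 0.2, size=(4, 200, 200))
raw[:, 100:108, 100:108] = 0.8        # common spot
raw[:2, 40:48, 80:88] = 0.9           # spot present in class 0 only
labels = np.array([0, 0, 1, 1])
maps = np.array([preprocess(img, threshold=0.4) for img in raw])
diff = class_difference(maps, labels)
```

Positive cells of `diff` are more intense in the first class and negative cells in the second one, which mirrors the red and blue areas of the difference map.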
2.4.3 Fuzzyfication: Method Pipeline
1. Digitalization. Each map is digitalized, providing a grid of 200 × 200 cells containing the corresponding optical density of each pixel.
2. De-fuzzyfication. Each grid is turned into a binary mask where a 0 value is assigned if the optical density is below a selected threshold (corresponding to background pixels) and a unit value if the optical density is above the threshold (corresponding to spot pixels) (see Note 39). This step corresponds to a "de-fuzzyfication" of each image, eliminating the effect due to the staining and destaining protocol.
3. Fuzzyfication. With de-fuzzyfication, the information about spatial imprecision is lost; it is subsequently reintroduced by the fuzzyfication step. The position, size and shape of the spots on the maps may change across different experimental runs, and different factors (e.g., temperature, staining and destaining conditions) can contribute to the appearance of spurious spots and artifacts. To account for these contributions, spots are turned into fuzzy entities; the influence of the presence of a spot in cell $(x_i, y_j)$ on the neighbor cell $(x_k, y_l)$ is therefore calculated by a 2-D Gaussian function (see Note 40):
$$f(x_i, y_j, x_k, y_l) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x_i - x_k)^2}{\sigma_x^2} - \frac{2\rho\,(x_i - x_k)(y_j - y_l)}{\sigma_x \sigma_y} + \frac{(y_j - y_l)^2}{\sigma_y^2} \right] \right\} \tag{30}$$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian function along each dimension, and ρ is the correlation between the two dimensions x and y (see Note 41). Here, the two parameters $\sigma_x$ and $\sigma_y$ are kept identical (even if this is not a constraint), so that the Gaussian function presents the same uncertainty with respect to both dimensions (see Note 42). The only parameter to be analyzed for its effect is $\sigma = \sigma_x = \sigma_y$. Large values of σ extend the effect of each spot to a larger distance, making the whole virtual map fuzzier. The value of the signal $S_k$ in each cell $(x_i, y_j)$ of the virtual grid is given by the sum of the effects of all neighbor cells containing spots:
$$S_k = \sum_{i', j' = 1}^{n} f(x_i, y_j, x_{i'}, y_{j'}) \tag{31}$$
The sum runs over all the cells of the grid but, depending on the value of the parameter σ, only the neighboring cells produce a significant effect. Each sample is then transformed into a virtual map representing a sort of probability of presence of a spot in each given cell of the grid. The virtual images are then used to calculate the similarity matrix between every pair of samples.
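The fuzzyfication of a binary mask (Eqs. 30 and 31) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `fuzzify` is hypothetical, cell coordinates are taken as integer row and column indices, $\sigma_x = \sigma_y = \sigma$ as in the text, and ρ defaults to 0 as an assumption (the value used by the authors is given in a note not reproduced here).

```python
import numpy as np

def fuzzify(mask, sigma, rho=0.0):
    """Fuzzy map: sum of the 2-D Gaussian of Eq. 30 over all spot cells (Eq. 31).

    mask: binary array, 1 for spot cells and 0 for background cells.
    sigma: common standard deviation along both dimensions.
    rho: correlation between the two dimensions (assumed 0 by default).
    """
    mask = np.asarray(mask, dtype=float)
    n_rows, n_cols = mask.shape
    yy, xx = np.mgrid[0:n_rows, 0:n_cols]       # row and column index grids
    norm = 1.0 / (2.0 * np.pi * sigma ** 2 * np.sqrt(1.0 - rho ** 2))
    fuzzy = np.zeros_like(mask)
    for r, c in zip(*np.nonzero(mask)):         # loop over spot cells only
        dx = (xx - c) / sigma
        dy = (yy - r) / sigma
        q = (dx ** 2 - 2.0 * rho * dx * dy + dy ** 2) / (2.0 * (1.0 - rho ** 2))
        fuzzy += norm * np.exp(-q)
    return fuzzy

# Illustrative mask: a single spot cell in the middle of a 9x9 grid.
mask_demo = np.zeros((9, 9), dtype=int)
mask_demo[4, 4] = 1
fuzzy_narrow = fuzzify(mask_demo, sigma=0.5)
fuzzy_wide = fuzzify(mask_demo, sigma=1.5)
```

With the larger σ the signal of the single spot reaches cells two positions away with an appreciable value, whereas with the smaller σ it stays confined to the spot cell: this is the increase in fuzziness described in the text.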
4. Statistical analysis. The fuzzy maps obtained can be treated by different multivariate methods to identify the statistically relevant changes between the classes of samples. The authors presented two different procedures:
Procedure A. In a first paper [46], fuzzy maps were compared by multidimensional scaling (MDS) [55–63]. This method needs the definition of a square distance/similarity matrix, i.e., a matrix containing the distances/similarities between all pairs of samples. A similarity index was therefore developed for the comparison of pairs of fuzzy maps, calculated after performing a match of the matrices. The match of the k-th and l-th virtual grids allows the computation of the common signal ($SC_{kl}$), namely the sum of all the signals present in both maps, and of the total signal ($ST_{kl}$):
$$SC_{kl} = \sum_{i=1}^{n} \min(S_i^k, S_i^l) \tag{32}$$
$$ST_{kl} = \sum_{i=1}^{n} \max(S_i^k, S_i^l) \tag{33}$$
The proposed similarity index is given by:
$$S_{kl} = \frac{SC_{kl}}{ST_{kl}} \tag{34}$$
A similarity equal to 1 indicates that the two virtual maps are identical. A value of 0 indicates that there is no match between the signals of the two maps. MDS is a multivariate method allowing an effective dimensionality reduction and a powerful graphical representation of the data [55]. Given n objects and a measure of their similarity $S_{ij}$, MDS searches for a low-dimensional space (see Note 43) in which the objects are represented by points, in such a way that the distances between the points fit the original distance/similarity matrix as closely as possible [55] (see Note 44). MDS was performed by the Kruskal iterative method [63]: the configuration of points obtained at convergence of the iterative process was considered as the final solution (see Note 45).
Procedure B. In a second paper [47], fuzzy maps were unfolded to provide a 1-D array of fuzzy values for each sample, and all arrays were arranged in a matrix of dimensions n × p (n being the number of samples and p the number of unfolded values). This matrix underwent PCA after autoscaling. Linear Discriminant Analysis (LDA) was then applied to the scores of the first five PCs, with a variable selection procedure in forward search (see Note 46).
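The similarity index of Eqs. 32–34, and the square similarity matrix required by MDS in Procedure A, can be sketched in NumPy as follows. The function names are hypothetical, and returning a similarity of 1 for two completely empty maps is a convention assumed here to avoid a division by zero; it is not stated in the source.

```python
import numpy as np

def similarity_index(map_k, map_l):
    """S_kl = SC_kl / ST_kl computed over matched cells (Eqs. 32-34)."""
    a = np.asarray(map_k, dtype=float).ravel()
    b = np.asarray(map_l, dtype=float).ravel()
    common = np.minimum(a, b).sum()      # SC_kl, Eq. 32
    total = np.maximum(a, b).sum()       # ST_kl, Eq. 33
    # Assumed convention: two empty maps are treated as identical.
    return common / total if total > 0 else 1.0

def similarity_matrix(maps):
    """Square, symmetric matrix of similarity indices between all map pairs."""
    n = len(maps)
    S = np.eye(n)
    for k in range(n):
        for l in range(k + 1, n):
            S[k, l] = S[l, k] = similarity_index(maps[k], maps[l])
    return S

# Illustrative 2x2 "fuzzy maps".
map_a = np.array([[1.0, 0.0], [0.5, 0.0]])
map_b = np.array([[0.5, 0.0], [0.5, 1.0]])
map_c = np.array([[0.0, 1.0], [0.0, 0.0]])
S = similarity_matrix([map_a, map_b, map_c])
```

For `map_a` and `map_b` the common signal is 1.0 and the total signal is 2.5, giving a similarity of 0.4, while `map_a` and `map_c` share no signal and have a similarity of 0. A distance matrix for MDS can then be derived from `S`, for example as `1 - S`, which is again an illustrative choice.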
2.4.4 Fuzzyfication: Results
1. Case study and digitalization. The method was applied to two different real case studies: (a) rat sera (five maps from control Wistar rat sera and five maps of sera from Wistar rats treated with nicotine); and (b) human lymph nodes (four maps from control human lymph nodes and four maps from human lymph nodes affected by mantle cell lymphoma). Each map was digitalized, providing a grid of 200 × 200 cells containing the corresponding optical density of each pixel. The approach based on MDS was also applied to simulated maps to evaluate the effect of the following parameters on the similarity indexes: changes in size and shape of a single spot; changes in position of a single spot; the effect of two spots, with the largest changing its size and shape; and the effect of two spots, with the smallest changing its position. Simulated maps were 40 × 40 grids, to reduce the computational effort.
2. De-fuzzyfication. Each grid was turned into the binary mask by applying a threshold value of 0.4 (see Note 47).
3. Fuzzyfication. In both studies the effect of the parameter σ was investigated at different levels (0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75), in order to identify the optimal value giving the best separation of the classes. As σ increases, the influence of each signal on the neighbor cells increases, and the final maps become fuzzier, with spots merging into one another.
4. Statistical analysis.
Procedure A. The results obtained for the simulated maps showed that the method is sensitive to changes in spot position and size. Better results are obtained with larger σ values. As regards the application to real 2-D maps, the separation of the two classes of samples was satisfactory with all the σ values investigated, but it was more effective for larger σ values, the best value being σ = 1.25.
Procedure B. As σ increased, the number of variables retained also increased: the fuzzyfication step in fact increases the number of cells with a large probability of containing a signal. The amount of variance explained by the first four PCs also increased with σ, and they were sufficient to describe the overall information for all σ values (cumulative explained variance always >76 %). LDA was applied to the dataset consisting of the eight samples described by five PCs. Discriminant PCs were selected by a stepwise algorithm in forward search (F-to-enter = 4). The classification power (NER%) increased with σ, up to a σ value of 3.50; further fuzzyfication then generated maps that were too confused, and the classification power decreased. The best results were obtained for σ = 2.25. It should be remarked that when PCA and LDA were applied without fuzzyfication, the perfect classification
of all samples was not obtained. Through the analysis of the loadings of the significant PCs, the approach also allowed the identification of the regions of the maps characterized by the most relevant differences between the two classes (i.e., containing the potential candidate biomarkers).
3 Notes
1. Images can also undergo a further step of warping and registration [64].
2. The authors prove that the choice of the normalization technique scarcely influences the final correlation image.
3. ρ-Correlation requires a previous step of ranking of the two vectors to be correlated; a standard linear Pearson correlation is then performed. The ranking process replaces every value in the input vector by its rank. If the same value occurs more than once, the tied values are assigned the average of their ranks.
4. With real gels, qualitative analysis is not hampered, since normalization is performed on individual gels: it can be repeated on any new gel without the need to take the previous gels into account again.
5. If the technique is used as a quantitative method, calibration spots should be used and the camera properties should be known.
6. When noise depends on Y, correct information about negative correlation is observed, but information about positive correlation is lost. This is the case when a camera acquires images at waning signal strength.
7. This is mainly due to the robust correlation, which relies on the ranking of the data rather than on the original values.
8. Further fine-tuning can be achieved by adding anchors manually.
9. Usually the first- or second-order derivatives are employed.
10. Points above the approximation f are weighted by very small weights, whereas points below the approximation f are weighted with weights close to 1.
11. For background elimination p is usually set to a very small value, e.g., 0.001.
12. When the number of variables is large, the unreasonable significance results can be numerous.
13. A general description of t-tests is given in Ref. 65 and adapted to Jackknife significance tests for bilinear regression methods
by Gidskehaug et al. [66]. In this study, the estimated variance of each pixel across cross-validation segments is weighted with one tenth of the mean variance across pixels.
14. If the streaks and background color are also due to unfocused proteins, this normalization should adjust the gels to an equal protein basis; a rationale is given by the fact that background streaks are frequently seen along the tracks of heavily stained proteins, and darker background is typically observed in areas overcrowded by proteins.
15. Another strategy could be the application of 2-D wavelet basis functions.
16. At the second level of image decomposition, there is one set of approximation coefficients and six sets of details.
17. The details calculated for the first level of decomposition are associated with the features of the highest frequency.
18. This is possible since the noise component is always the highest frequency component of the signal.
19. In many cases, the universal threshold [67] is a valid alternative:

$$t = \frac{\sigma \sqrt{2 \log(N)}}{\sqrt{N}} \qquad (35)$$

where σ is the robust estimate of the standard deviation of noise (see Eq. 15) and N is the number of pixels in the image.
20. In soft thresholding, all wavelet coefficients smaller than the threshold are replaced by zeros, whereas the remaining coefficients are shrunk towards zero by the threshold value.
21. This approach requires the optimization of the wavelet basis and the optimization of the decomposition level. However, the optimization process can be carried out for one image only.
22. The number of identified spots can vary in the range 200–500, depending on the quality of the image.
23. Each 2-D Gaussian function is characterized by the width parameter σ, producing overlapping Gaussians.
24. Here k = h.
25. If g_{i,j} is very small, the i-th spot in the target and the j-th spot in the warped image are far from each other. If the two spots are close to each other, the corresponding value in the matrix is close to one.
26. The matrix is augmented with an additional row and column, to find the spots without correspondence.
27. Spots without correspondence are not taken into account for the transform calculation.
28. The procedure should be stopped before the width of the Gaussian functions reaches the predefined value or when there is no significant increase of the correlation of the images. Once the correspondence of the spots is known, the local piecewise linear transform can be applied for further improvement of the correlation of the images.
29. This method chooses the threshold value to minimize the within-class variance of the thresholded black and white pixels.
30. As datasets can contain outliers, robust versions of the modeling techniques are available. To perform linear modeling, the partial robust M-regression (PRM) can be applied [68].
31. When a small number of samples is present, leave-one-out cross-validation is usually applied. When possible, the predictive power of the final model should be evaluated on an independent test set. The authors applied the Kennard and Stone algorithm [69] for the choice of the training set (used to build the model) and test sets: in this way the training set uniformly covers the experimental domain.
32. They are normally distributed, with an order of magnitude of 10^{-10}.
33. It is also possible to consider a more robust version of the cutoff by selecting the ranked stability values of the regression coefficients associated with the noise, corresponding to predefined quantiles, e.g., 0.95.
34. UVE-PLS is run 100 times and at each run the randomly selected model set (containing about 70% of the data objects) is used for building the model and selecting the features. The final subset of the relevant features is then selected, based on the frequency of the feature appearance in the set of the retained features.
35. The authors used a GS-710 densitometer (Bio-Rad Labs, Richmond, CA, USA).
36. The choice of a 50 × 50 grid is not a constraint but it was suggested by computational and memory requirements.
37. Here 0.4 was used for all gels.
38. By applying such a transformation to each two-dimensional map, the maximum signal intensity value of every 2-D PAGE map becomes a unit value; all the samples thus range from 0 to 1, and the dataset becomes independent of the intensity differences due to the staining step. This scaling is suggested by the fact that the large variability of the staining procedure causes a “systematic” error (i.e., maps being consistently darker or lighter). If not removed, this error would account for the major amount of the variation.
39. A threshold equal to 0.4 was used by the authors.
40. A Gaussian function was chosen since the spots can be properly described by a function with the highest intensity in the center of the spot itself and decreasing values for increasing distance from the spot. Moreover, the Gaussian function is a good approximation if the diffusion model works.
41. The parameter ρ is fixed at 0, corresponding to the expected complete independence of the two electrophoretic runs.
42. This is based on the assumption of a similar variability of the two electrophoretic runs.
43. The space is usually Euclidean, but this is not a constraint.
44. There are several different approaches to MDS depending on the method of matching of the similarities, on the metrics, on the method used to compute the similarities and on the way the point configuration is obtained.
45. The search for the coordinates is based on the steepest descent minimization algorithm, where the target function is the so-called stress (S), which is proportional to the sum of squares of differences between calculated and real distances, i.e., a measure of the ability of the configuration of points to reproduce the distance matrix.
46. LDA can be applied with a variable selection procedure based on a stepwise algorithm [70, 71], selecting iteratively the most discriminating variables. When a forward selection (FS) is applied, the initial model includes no variables and one variable at a time is added iteratively, until a predetermined stopping criterion is satisfied. The variable included in the model at each step is the one providing the greatest value of an F-Fisher ratio, so that the j-th variable is included in the model, with t variables already included, if:

$$F_j^{+} = \max_j \left[ \frac{RSS_t - RSS_{t+j}}{s^2_{t+j}} \right] > F_{\text{to-enter}} \qquad (36)$$

where s^2_{t+j} is the variance calculated for the model with t variables plus the j-th variable; RSS_t is the residual sum of squares of the model with t variables; RSS_{t+j} is the residual sum of squares of the model with t variables plus the j-th variable. The F value thus calculated is compared to a reference value (F_to-enter) usually set at values ranging from 2 (more permissive selection, including a larger number of variables in the final model) to 4 (more severe selection).
47. The threshold value must be optimized for each map independently.
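The rank-then-correlate procedure of Note 3 can be illustrated with a minimal Python sketch. The function names `average_ranks` and `spearman_rho` are hypothetical and introduced here only for illustration; NumPy is assumed to be available, and this is not the implementation used by the cited authors.

```python
import numpy as np

def average_ranks(values):
    """Replace each value by its rank (1..n); ties share the average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")  # stable sort of the indices
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # extend j over the run of values tied with sorted_vals[i]
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        # sorted positions i..j carry ranks i+1..j+1; assign their mean
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# Monotonic but non-linear relationship: the rank correlation is still 1
rho = spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])
```

Because only the ranks enter the Pearson step, any monotonic rescaling of pixel intensities leaves the result unchanged, which is consistent with the robustness mentioned in Note 7.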
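Notes 19 and 20 describe the universal threshold (Eq. 35) and soft thresholding of wavelet coefficients. The following Python sketch shows both steps under stated assumptions: the noise estimate σ is passed in by the caller (the robust estimator of Eq. 15 is not reproduced here), the wavelet coefficients are supplied as a plain array, and the function names are hypothetical.

```python
import numpy as np

def universal_threshold(sigma, n_pixels):
    """Threshold of Eq. 35: t = sigma * sqrt(2 log N) / sqrt(N)."""
    return sigma * np.sqrt(2.0 * np.log(n_pixels)) / np.sqrt(n_pixels)

def soft_threshold(coeffs, t):
    """Zero coefficients with magnitude <= t; shrink the others towards zero by t."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

# Illustrative detail coefficients (made-up values, not real gel data)
details = np.array([-3.0, -0.5, 0.2, 2.0])
shrunk = soft_threshold(details, 1.0)
```

In a full denoising pipeline the shrunk detail coefficients would be passed to the inverse wavelet transform; that step depends on the chosen wavelet library and is left out of this sketch.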
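The forward selection rule of Note 46 (Eq. 36) can be sketched in Python as follows. This is an illustration under assumptions that are not taken from the source: the model is an ordinary least-squares fit with an intercept on a numeric response (for example, a 0/1 class code), the variance s² is the residual sum of squares divided by n − (t + 1) − 1 degrees of freedom, and the function names are hypothetical.

```python
import numpy as np

def rss_of_fit(X, y, cols):
    """Residual sum of squares of a least-squares fit (with intercept) on cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Add, one at a time, the variable with the largest F ratio above F-to-enter."""
    n, p = X.shape
    selected = []
    rss_t = rss_of_fit(X, y, selected)          # model with no variables yet
    while len(selected) < p:
        best_j, best_f, best_rss = None, -np.inf, None
        for j in range(p):
            if j in selected:
                continue
            rss_tj = rss_of_fit(X, y, selected + [j])
            dof = n - (len(selected) + 1) - 1   # assumed residual degrees of freedom
            if dof <= 0:
                continue
            s2 = rss_tj / dof
            f_j = (rss_t - rss_tj) / s2 if s2 > 0 else np.inf
            if f_j > best_f:
                best_j, best_f, best_rss = j, f_j, rss_tj
        if best_j is None or best_f <= f_to_enter:
            break                               # stopping criterion of Eq. 36
        selected.append(best_j)
        rss_t = best_rss
    return selected

# Synthetic data: only column 2 carries signal (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=40)
chosen = forward_selection(X, y, f_to_enter=4.0)
```

Lowering `f_to_enter` towards 2 makes the selection more permissive, as described in the note, while values near 4 retain fewer variables.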
References
1. Daszykowski M, Wróbel MS, Bierczynska-Krzysik A, Silberring J, Lubec G, Walczak B (2009) Automatic preprocessing of electrophoretic images. Chemometr Intell Lab Syst 97:132–140
2. Færgestad EM, Rye M, Walczak B, Gidskehaug L, Wold JP, Grove H, Jia X, Hollung K, Indahl UG, Westad F, van den Berg F, Martens H (2007) Pixel-based analysis of multiple images for the identification of changes: a novel approach applied to unravel proteome patterns of 2-D electrophoresis gel images. Proteomics 7:3450–3461
3. Grove H, Hollung K, Uhlen AK, Martens H, Færgestad EM (2006) Challenges related to analysis of protein spot volumes from two-dimensional gel electrophoresis as revealed by replicate gels. J Proteome Res 5:3399–3410
4. Rye MB, Faergestad EM, Martens H, Wold JP, Alsberg BK (2008) An improved pixel-based approach for analyzing images in two-dimensional gel electrophoresis. Electrophoresis 29:1382–1393
5. Van Belle W, Ånensen N, Haaland I, Bruserud O, Høgda K-A, Gjertsen BT (2006) Correlation analysis of two-dimensional gel electrophoretic protein patterns and biological variables. BMC Bioinformatics 7:198
6. Øye OK, Jørgensen KM, Hjelle SM, Sulen A, Ulvang DM, Gjertsen BT (2013) Gel2DE – a software tool for correlation analysis of 2D gel electrophoresis data. BMC Bioinformatics 14:215
7. Gonzalez RC, Woods RE (2002) Digital image processing, 2nd edn. Prentice Hall, Upper Saddle River, NJ, pp 432–438
8. Hough P (1962) Methods and means for recognizing complex patterns. US Patent 3,069,654
9. Conradsen K, Pedersen J (1992) Analysis of two-dimensional electrophoresis gels. Biometrics 48:1273–1287
10. Van Belle W, Sjøholt G, Ånensen N, Høgda KA, Gjertsen BT (2006) Adaptive contrast enhancement of two-dimensional electrophoretic gels facilitates visualization, orientation and alignment. Electrophoresis 27(20):4086–4095
11. Vetterling WT, Flannery BP (2002) Numerical recipes in C++, 2nd edn. Cambridge University Press, Cambridge, UK
12. Kenny J, Keeping E (1962) The standard deviation and calculation of the standard deviation (Volume chap 6.5–6.6), 3rd edn. Princeton NJ, pp 77–80
13. Eilers PHC (2004) Parametric time warping. Anal Chem 76:404–411
14. Eilers PHC, Currie ID, Durban M (2006) Fast and compact smoothing on large multidimensional grids. Comput Stat Data Anal 50:61–76
15. Kaczmarek K, Walczak B, de Jong S, Vandeginste BGM (2005) Baseline reduction in two dimensional gel electrophoresis images. Acta Chromatogr 15:82–96
16. Lieber CA, Jansen AM (2003) Automated method for subtraction of fluorescence from biological Raman spectra. Appl Spectrosc 57:1363–1367
17. Martens H, Martens M (2000) Modified Jackknife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Qual Prefer 11:5–16
18. Wheelock Å, Buckpitt AR (2005) Software-induced variance in two-dimensional gel electrophoresis image analysis. Electrophoresis 26:4508–4520
19. Daszykowski M, Stanimirova I, Bodzon-Kulakowska A, Silberring J, Lubec G, Walczak B (2007) Start-to-end processing of two-dimensional gel electrophoretic images. J Chromatogr A 1158:306–317
20. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11:674–693
21. Hubbard BB (1998) The World according to wavelets. A K Peters, Wellesley, MA
22. Walczak B (ed) (2000) Wavelets in chemistry. Elsevier, Amsterdam
23. Walczak B, Massart DL (1997) Noise suppression and signal compression using the wavelet packet transform. Chemometr Intell Lab Syst 36:81–94
24. Kaczmarek K, Walczak B, de Jong S, Vandeginste BGM (2004) Preprocessing of two-dimensional gel electrophoresis images. Proteomics 4:2377–2389
25. Chang SG, Yu B, Vetterli M (2000) Adaptive wavelet thresholding for image denoising and compression. IEEE Trans Image Process 9:1532–1546
26. Kaczmarek K, Walczak B, de Jong S, Vandeginste BGM (2002) Feature based fuzzy matching of 2D gel electrophoresis images. J Chem Inf Comput Sci 6:1431–1442
27. Kaczmarek K, Walczak B, de Jong S, Vandeginste BGM (2003) Matching 2D gel electrophoresis images. J Chem Inf Comput Sci 43:978–986
28. Walczak B, Wu W (2005) Fuzzy warping of chromatograms. Chemometr Intell Lab Syst 77:173–180
29. Sinkhorn RA (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann Math Stat 35:876–879
30. Kaczmarek K, Walczak B, de Jong S, Vandeginste BGM (2003) Comparison of image-transformation methods used in matching 2D gel electrophoresis images. Acta Chromatogr 13:7–21
31. Beucher S (1992) The watershed transformation applied to image segmentation. Scanning Microsc 6:299–314
32. Skolnick MM (1986) Application of morphological transformations to the analysis of two-dimensional electrophoretic gels of biological materials. Comput Vis Graph Image Process 35:306–332
33. Sternberg SR (1986) Grayscale morphology. Comput Vis Graph Image Process 35:333–355
34. Otsu N (1979) A threshold selection method from gray level histograms. IEEE Trans Syst Man Cybern B 9:62–66
35. Martens H, Næs T (1989) Multivariate calibration. Wiley, Chichester
36. Walczak B, Massart DL (1996) The radial basis functions–partial least squares approach as a flexible non-linear regression technique. Anal Chim Acta 331:177–185
37. Walczak B, Massart DL (1996) Application of radial basis functions–partial least squares to non-linear pattern recognition problems: diagnosis of process faults. Anal Chim Acta 331:187–193
38. Walczak B, Massart DL (2000) Local modelling with radial basis function networks. Chemometr Intell Lab Syst 51:179–198
39. Czekaj T, Wu W, Walczak B (2005) About kernel latent variable approaches and SVM. J Chemometrics 19:341–354
40. Centner V, Massart DL, de Noord OE, de Jong S, Vandeginste BGM, Sterna C (1996) Elimination of uninformative variables for multivariate calibration. Anal Chem 68:3851–3858
41. Daszykowski M, Bierczynska-Krzysik A, Silberring J, Walczak B (2010) Avoiding spots detection in analysis of electrophoretic gel images. Chemometr Intell Lab Syst 104:2–7
42. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98:5116–5121
43. Marengo E, Bobba M, Liparota MC, Robotti E, Righetti PG (2005) Use of Legendre moments for the fast comparison of two-dimensional polyacrylamide gel electrophoresis maps images. J Chromatogr A 1096(1-2):86–91
44. Marengo E, Robotti E, Bobba M, Demartini M, Righetti PG (2008) A new method for comparing 2-D-PAGE maps based on the computation of Zernike moments and multivariate statistical tools. Anal Bioanal Chem 391(4):1163–1173
45. Marengo E, Leardi R, Robotti E, Righetti PG, Antonucci F, Cecconi D (2003) Application of three-way principal component analysis to the evaluation of two-dimensional maps in proteomics. J Proteome Res 2(4):351–360
46. Marengo E, Robotti E, Gianotti V, Righetti PG, Cecconi D, Domenici E (2003) A new integrated statistical approach to the diagnostic use of proteomic two-dimensional maps. Electrophoresis 24(1-2):225–236
47. Marengo E, Robotti E, Righetti PG, Antonucci F (2003) New approach based on fuzzy logic and principal component analysis for the classification of two-dimensional maps in health and disease: application to lymphomas. J Chromatogr A 1004(1-2):13–28
48. Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–311
49. Kroonenberg PM (1983) Three-mode principal component analysis. DSWO Press, Leiden
50. Geladi P (1989) Analysis of multi-way (multimode) data. Chemometr Intell Lab Syst 7:11–30
51. Smilde AK (1989) Three-way analyses problems and prospects. Chemometr Intell Lab Syst 15:143–157
52. Henrion R (1993) Body diagonalization of core matrices in three-way principal component analysis: theoretical bounds and simulations. Chemometr Intell Lab Syst 6:477–494
53. Henrion R (1994) N-way principal component analysis. Theory, algorithms and applications. Chemometr Intell Lab Syst 25:1–23
54. Henrion R, Andersson CA (1999) A new criterion for simple-structure transformations of core arrays in N-way principal component analysis. Chemometr Intell Lab Syst 47:189–204
55. Tulp A, Verwoerd D, Neefjes J (1999) Electromigration for separations of protein complexes. J Chromatogr B 722:141–151
56. Young G, Householder AS (1930) Discussion of a set of points in terms of their mutual distances. Psychometrika 3:19–22
57. Cox TF, Cox MAA (1994) Multidimensional Scaling. Chapman & Hall, London
58. Schoenberg IJ (1935) Remarks to Maurice Fréchet’s article “Sur la définition axiomatique d’une classe d’espace distanciés vectoriellement applicable sur l’espace de Hilbert”. Ann Math 36:724–732
59. Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual distances. Psychometrika 3:19–22
60. Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325–338
61. Shepard RN (1962) The analysis of proximities: multidimensional scaling with an unknown distance function, I. Psychometrika 27:125–140
62. Shepard RN (1962) The analysis of proximities: multidimensional scaling with an unknown distance function, II. Psychometrika 27:219–246
63. Kruskal JB (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29:1–27
64. Wang X, Feng DD (2005) Hybrid registration for two-dimensional gel protein images. Third Asia Pacific bioinformatics conference (APBC2005), paper 241, pp 201–210
65. Allison DB, Cui XQ, Page GP, Sabripour M (2006) Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7:55–65
66. Gidskehaug L, Anderssen E, Alsberg BK (2006) Cross model validated feature selection based on gene clusters. Chemometr Intell Lab Syst 84:172–176
67. Donoho DL (1995) De-noising by soft-thresholding. IEEE Trans Inf Theory 41:613–627
68. Serneels S, Croux C, Filzmoser P, Van Espen PJ (2005) Partial robust M-regression. Chemometr Intell Lab Syst 79:55–64
69. Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11:137–148
70. Massart DL, Vandeginste BGM, Deming SM, Michotte Y, Kaufman L (1988) Chemometrics: a textbook. Elsevier, Amsterdam
71. Vandeginste BGM, Massart DL, Buydens LMC, De Jong S, Lewi PJ, Smeyers-Verbeke J (1988) Handbook of chemometrics and qualimetrics: part B. Elsevier, Amsterdam
Emilio Marengo and Elisa Robotti (eds.), 2-D PAGE Map Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 1384, DOI 10.1007/978-1-4939-3255-9, © Springer Science+Business Media New York 2016