Elena N Ieno Alain F Zuur
A Beginner’s Guide to Data Exploration and Visualisation with R
Published by Highland Statistics Ltd., Newburgh, United Kingdom ([email protected])
ISBN: 978-0-9571741-7-7 First published in February 2015
© Highland Statistics Ltd. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Highland Statistics Ltd., 9 St Clair Wynd, Newburgh, United Kingdom), except for brief excerpts in connection with reviews or scholarly analyses. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methods now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, whether or not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. This book is copyrighted material from Highland Statistics Ltd. Scanning this book all or in part and distributing via digital media (including uploading to the internet) without our explicit permission constitutes copyright infringement. Infringing copyright is a criminal offence, and you will be taken to court and risk paying damages and compensation. Highland Statistics Ltd. actively polices against copyright infringement. Although the authors and publisher (Highland Statistics Ltd., 9 St Clair Wynd, Newburgh, United Kingdom) have taken every care in the preparation and writing of this book, they accept no liability for errors or omissions or for misuse or misunderstandings on the part of any person who uses it. The author and publisher accept no responsibility for damage, injury, or loss occasioned to any person as a result of relying on material included in, implied, or omitted from this book. www.highstat.com
To Norma, Juan Carlos, and Walter, who pushed me to achieve this goal. Thank you! – Elena N Ieno –
To my wife Nandi. Without you this book would have been finished much sooner, but life would have been less colourful. – Alain F Zuur –
Preface

In 2010 we published a paper in the journal Methods in Ecology and Evolution entitled ‘A protocol for data exploration to avoid common statistical problems’. Little did we know at the time that this paper would become one of the journal’s all-time top papers, among both the most downloaded and the most cited, with more than 20,000 downloads during the first 6 months. Based on this success we decided to extend the material in the paper into a book, which is now before you. It is part of our Beginner’s Guide to … book series. We have tried to write this book so that the required level of statistical knowledge is as low as possible. A knowledge of linear regression is all that you need.
Acknowledgements

The material in this book was presented in a large number of courses between 2003 and 2014. We are greatly indebted to all participants who supplied data for this book and helped improve the material and the clarity of the explanations. Yvon-Durocher provided the methane data in Chapter 6. Chris Elphick and Michael Reed provided the Hawaiian data included in Chapter 7. We also appreciate the efforts of those who wrote R (R Development Core Team 2013) and its many packages. We made extensive use of the lattice (Sarkar 2008) and ggplot2 (Wickham 2009) packages. Special thanks to Christine Andreasen for editing this book. The photo on the front cover is © Wayne Lynch/Arcticphoto.com.
Datasets used in this book

All datasets used in this book may be downloaded from www.highstat.com/BGDEV.htm. All R code may also be downloaded from the website for this book. To open the ZIP file with R code, use the password PolarBear2015.
Elena N Ieno, Alicante, Spain
Alain F Zuur, Newburgh, Scotland
February 2015
Contents

Preface
Acknowledgements
Datasets used in this book

1 Introduction
  1.1 Speaking the same language
  1.2 General points
  1.3 Outline of this book

2 Outliers
  2.1 What is an outlier?
  2.2 Boxplot to identify outliers in one dimension
    2.2.1 Simple boxplot
    2.2.2 Conditional boxplot
    2.2.3 Multi-panel boxplots from the lattice package
  2.3 Cleveland dotplot to identify outliers
    2.3.1 Simple Cleveland dotplots
    2.3.2 Conditional Cleveland dotplots
    2.3.3 Multi-panel Cleveland dotplots from the lattice package
  2.4 Boxplots or Cleveland dotplots?
  2.5 Can we apply a test for outliers?
    2.5.1 Z-score
    2.5.2 Grubbs’ test
  2.6 Outliers in the two-dimensional space
  2.7 Influential observations in regression models
  2.8 What to do if you detect potential outliers
  2.9 Outliers and multivariate data
  2.10 The pros and cons of transformations

3 Normality and homogeneity
  3.1 What is normality?
  3.2 Histograms and conditional histograms
    3.2.1 Multipanel histograms from the lattice package
    3.2.2 When is normality of the raw data considered?
  3.3 Kernel density plots
  3.4 Quantile–quantile plots
    3.4.1 Quantile–quantile plots from the lattice package
  3.5 Using tests to check for normality
  3.6 Homogeneity of variance
    3.6.1 Conditional boxplots
    3.6.2 Scatterplots for continuous explanatory variables
  3.7 Using tests to check for homogeneity
    3.7.1 The Bartlett test
    3.7.2 The F-ratio test
    3.7.3 Levene’s test
    3.7.4 So which test would you choose?
    3.7.5 R code
    3.7.6 Using graphs?

4 Relationships
  4.1 Simple scatterplots
    4.1.1 Example: Clam data
    4.1.2 Example: Rabbit data
    4.1.3 Example: Blow fly data
  4.2 Multipanel scatterplots
    4.2.1 Example: Polychaeta data
    4.2.2 Example: Bioluminescence data
  4.3 Pairplots
    4.3.1 Bioluminescence data
    4.3.2 Cephalopod data
    4.3.3 Zoobenthos data
  4.4 Can we include interactions?
    4.4.1 Irish pH data
    4.4.2 Godwit data
    4.4.3 Irish pH data revisited
    4.4.4 Parasite data
  4.5 Design and interaction plots

5 Collinearity and confounding
  5.1 What is collinearity?
  5.2 The sample correlation coefficient
  5.3 Correlation and outliers
  5.4 Correlation matrices
  5.5 Correlation and pairplots
  5.6 Collinearity due to interactions
  5.7 Visualising collinearity with conditional boxplots
  5.8 Quantifying collinearity using VIFs
    5.8.1 Variance inflation factors
    5.8.2 Geometric presentation of collinearity
    5.8.3 Tolerance
    5.8.4 What constitutes a high VIF value?
    5.8.5 VIFs in action
  5.9 Generalised VIF values
  5.10 Visualising collinearity using PCA biplot
  5.11 Causes of collinearity and solutions
  5.12 Be stubborn and keep collinear covariates?
  5.13 Confounding variables
    5.13.1 Visualising confounding variables
    5.13.2 Confounding factors in time series analysis

6 Case study: Methane fluxes
  6.1 Introduction
  6.2 Data exploration
    6.2.1 Where in the world are the sites?
    6.2.2 Working with ggplot2
    6.2.3 Outliers
    6.2.4 Collinearity
    6.2.5 Relationships
    6.2.6 Interactions
    6.2.7 Where in the world are the sites (continued)?
  6.3 Statistical analysis using linear regression
    6.3.1 Model formulation
    6.3.2 Fitting a linear regression model
    6.3.3 Model validation of the linear regression model
    6.3.4 Interpretation of the linear regression model
  6.4 Statistical analysis using a mixed effects model
    6.4.1 Model formulation
    6.4.2 Fitting a mixed effects model
    6.4.3 Model validation of the mixed effects model
    6.4.4 Interpretation of the linear mixed effects model
  6.5 Conclusions
  6.6 What to present in a paper

7 Case study: Oystercatcher shell length
  7.1 Importing the data
  7.2 Data exploration
  7.3 Applying a linear regression model
  7.4 Understanding the results
  7.5 Trouble
  7.6 Conclusions

8 Case study: Hawaiian bird time series
  8.1 Importing the data
  8.2 Coding the data
  8.3 Multi-panel graph using xyplot from lattice
    8.3.1 Attempt 1 using xyplot
    8.3.2 Attempt 2 using xyplot
    8.3.3 Attempt 3 using xyplot
  8.4 Multi-panel graph using ggplot2
  8.5 Conclusions

References
Index
Books by Highland Statistics
1 Introduction

In this chapter we review some simple concepts such as data structure and classification of variables. These general concepts are discussed in many introductory textbooks; however, some topics may be new to readers with little or no experience in data analysis. Subsequent chapters require knowledge of the terminology used in this chapter.

Prerequisite for this chapter: A working knowledge of basic statistical concepts is required, e.g., population/sample, qualitative/quantitative data, sampling techniques, etc.
1.1 Speaking the same language

Data are often collected without the involvement of a statistician. A typical scenario is sketched on the left side of Figure 1.1. In our experience, meeting with a statistician before designing your research is crucial, and so is the involvement of a statistician during the analysis and writing-up process (see the right side of Figure 1.1). Some biologists complain (sometimes rightfully) about the inability of statisticians to understand the biology underlying the data, or about their inability to communicate in nonmathematical terms. We have seen many biologists frightened by the language employed by statisticians. The problem is that many universities do not perceive statistics to be an important part of the education programme; therefore many students are not properly exposed to statistics. Recent advances in software development have contributed enormously to the quality of data analysis; in particular, the availability of R (R Core Team 2013) has played an important role. The pressure to publish is one of the greatest demands on scientific researchers: an academic career depends on publications. Today there are more scientific journals than ever before. Despite the large number of journals, competition to publish is particularly great in the life sciences and ecology. In this book we provide graphical tools to improve communication between the data and yourself, and between yourself and the readers of your manuscript. Graphs speak for themselves, and using them wisely should increase the likelihood of getting your work published in good quality journals.
Figure 1.1. Classical problem-solving strategies.
1.2 General points

Response and covariates – who is who?
First we distinguish between response and explanatory variable(s). The statistical analysis of an experimental or observational study typically involves one or more independent and dependent variables. The dependent variable is the one that we will model as a function of the independent variables. Other synonyms for the term independent variable are covariate, predictor, or simply explanatory variable. The dependent variable is also referred to as the response variable.
Univariate or multivariate analysis?
Depending on the number of response variables that a researcher measures in a study, we make a distinction between the application of univariate statistics and multivariate statistics. We refer to univariate techniques if a single response variable is involved, regardless of the number of explanatory variables that can be added to a model. In contrast, multivariate methods are procedures that comprise more than one response variable, either with or without explanatory variables. Redundancy analysis (RDA), among others, is a typical example of a multivariate statistical technique that requires multiple response and explanatory variables.

Continuous and categorical variables
After defining the explanatory variables, it is useful to identify which of the explanatory variables are quantitative and which are qualitative, as this has implications and restrictions for the data analysis. Examples of quantitative variables are weight, height, temperature, oxygen concentration, etc. These variables can be continuous or discrete. Qualitative data can also be described as ordinal, nominal, or categorical data. Examples are week (with the days of the week as values), gender (male vs. female), country, month, vessel, etc. If the categories (also called levels) can be ordered in some way, we call the variable ordinal (e.g., days of the week). If there is no order then it is categorical.

Coding variables
Even when the data have already been collected, new variables may be included to improve the quality of the analysis or provide support for some underlying relationship. If your data were collected by different observers (or laboratories), it is advisable to add a new covariate that captures this. When merging data gathered from different sampling efforts, a new variable should be added to quantify this. A common example would be the situation in which sampling collectors of different volumes were used to obtain water samples in a river or lake. The coding of time should be done by using different columns for year, month, hour, etc. Similarly, spatial position should be provided in the correct format (typically UTM coordinates or WGS84). This is crucial as it may define the dependency structure of the data (see the Independence section below). Finally, keep the variable names short. Many researchers tend to type the full name of the species in the spreadsheet, which will result in cluttered graphs. Create a principal component analysis biplot with 100 species and see what a referee will think if you submit a paper with such a graph!
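A minimal sketch of this kind of coding in R (the variable names and values below are hypothetical and only serve as illustration):

> MyData <- data.frame(
    Value    = c(3.1, 2.8, 4.0, 3.6),
    Observer = factor(c("lab1", "lab1", "lab2", "lab2")),  #Who collected the sample
    Year     = c(2013, 2013, 2014, 2014),                  #Time coded in separate columns
    Month    = c(6, 7, 6, 7))
> str(MyData)   #Check that Observer is a factor and Year/Month are numeric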
Zeros and missing values
Make sure you do not omit the zeros in the data unless this can be justified. There are statistical techniques that can deal with zero-inflated data (Zuur et al. 2012). Another problem is missing values. Code these as NA in the data file. It is important to realise that a missing value is not a zero!

Provide a brief description of the underlying problem
When preparing your manuscript, do not spend too much time/text/space explaining the biological details of the data. Be as concise as possible and formulate the information in such a way that someone from another discipline can readily understand you. Before addressing sub-questions, be aware of the quality of the data you have. The authors of this book have been involved in projects where the data were highly unbalanced and most of the questions were impossible to answer. Note that for each covariate you need at least 15–20 observations.

Visualise the structure of the data
When appropriate, provide graphics to explain the structure of the data. Figure 1.2 shows an example of nested data in which blood samples from 15 adult female giraffes were analysed for progesterone levels. For each specimen, samples were collected for several weeks after gestation. The number of samples per giraffe ranged between 4 and 8. As you may note, a flowchart like this clarifies the data structure far better than explaining the sampling design in words. In ecology this type of data structure is common and is known as one-way nested data.
Figure 1.2. Setup of the Giraffe data. Measurements were taken on 15 giraffes. From 4 to 7 blood samples were taken on each animal. Progesterone levels from the same animal are likely to be more similar to one another than to values from different animals.

Some datasets may exhibit a more complex sampling design, in which two or more levels of units are integrated or nested one in the other. This is known in the literature as multistage sampling. We could extend the example above where blood samples were used to model progesterone
concentrations in giraffes at several sites in relation to explanatory variables such as age and body condition (Figure 1.3). Due to the nested structure of the data, mixed effects models are required. See Snijders and Bosker (1999), Pinheiro and Bates (2000), Bolker (2008), or Zuur et al. (2009a, 2012, 2013) for further reading.
Figure 1.3. Sketch of the nested structure of the data. A total of 23 sites were sampled. At each site, various specimens (giraffes) were monitored. The number of giraffes per site ranged from 2 to 4. From 2 to 5 blood samples were obtained from each specimen.

Store and keep the data in the right format
It is essential to manage the datasets appropriately in a spreadsheet or database programme. Make sure you remove unnecessary comments from the spreadsheet.

Do not blindly follow the others
At times we meet scientists who are determined to apply a certain technique because they read about it in a paper. Never push for a specific type of analysis if the quality of your data does not support it. Every dataset requires a different thinking process. If a basic chi-square test does the job, then keep it simple and apply this test.
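Returning to the giraffe example, a minimal sketch of what a well-formatted, ‘long’ layout for such nested data might look like (the column names and values here are hypothetical, not from a dataset used in this book):

> Giraffes <- data.frame(
    Site         = factor(c("A", "A", "A", "B", "B", "B")),
    Giraffe      = factor(c("g1", "g1", "g2", "g3", "g3", "g4")),
    Sample       = c(1, 2, 1, 1, 2, 1),
    Progesterone = c(2.1, 2.4, 1.8, 3.0, 2.7, 2.2))
> head(Giraffes)   #Each row is one blood sample nested in a giraffe and a site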
1.3 Outline of this book

In Zuur et al. (2007) we provided a basic introduction to data exploration. We presented a range of exploratory tools and indicated how they could be used to ensure the validity of any follow-up analysis. Other books that contain chapters on data exploration techniques include Montgomery and Peck (1992), Chatfield (1995), Crawley (2002), Dalgaard (2002), Fox (2002), Quinn and Keough (2002), and Borcard et al. (2011).
However, while teaching statistics to ecologists and reviewing manuscripts, the authors of this book have noticed common statistical problems that could have been avoided. That was one of the reasons why we wrote a paper in 2010 that presented a protocol for data exploration; see Zuur et al. (2010). This paper is one of the most downloaded papers in the history of the journal in which it was published, indicating the need and niche for a more detailed text on data exploration and visualisation. In this book we follow the data exploration protocol paper and discuss the important steps in separate chapters.

Detecting outliers is fundamental and should be the first thing one looks at. Graphical tools to identify outliers are discussed in Chapter 2. Homogeneity is an important assumption for regression-type techniques. Heterogeneity may produce estimated parameters with standard errors that are too small. Traditional solutions include data transformation, but we will try to avoid this for as long as possible. A list of problem-solving options that looks at heterogeneity patterns is given in Chapter 3. Normality is discussed in detail in most statistics textbooks and undergraduate statistics courses. In Chapter 3 we emphasise that when applying regression models you do not test normality of the raw data but rather of the residuals.

The inspection of data to determine whether variables are associated constitutes a fundamental part of any data exploration process. An important step is identifying linear or nonlinear relationships, or even more complex relationships, that could, in principle, define what type of mathematical model needs to be formulated. Chapter 4 presents a range of graphical options for this. In ecological studies one tends to have a large number of explanatory variables, and evaluating collinearity (correlated explanatory variables) is a major challenge. In Chapter 5 we explain how to detect collinear variables.

In Chapter 6 we use a detailed case study to show how the ggplot2 package (Wickham 2009) in R can be used for data exploration and visualisation of the results of a linear regression model as well as a linear mixed effects model. In Chapter 7 we show how you can end up with the wrong ecological conclusions if no proper data exploration is applied. And in Chapter 8 we reproduce some graphs (using the lattice and ggplot2 packages) that we have used in scientific publications.

Do not use data exploration to generate a hypothesis. The underlying questions should be formulated a priori based on biological knowledge. If data dredging cannot be avoided, then mention it.
2 Outliers

In this chapter we present various tools to identify outliers. We start with graphical tools such as boxplots and Cleveland dotplots. We also discuss tests for outliers.

Prerequisite for this chapter: A working knowledge of R is required. See, for example, Zuur et al. (2009b).
2.1 What is an outlier?

An outlier is an observation that has a rather large, or rather small, value compared to the bulk of observations for a variable of interest. Such ‘outliers’ can dominate the results of a statistical analysis. One single observation may determine whether a covariate is significant or not in a regression-type model, or dominate the axes in multivariate analysis techniques such as principal component analysis (PCA) or discriminant analysis. Jolliffe (2002) showed that outliers may influence the shape of the higher axes in PCA. Our experience with discriminant analysis seems to suggest that outliers dominate the first few axes (Zuur et al. 2007). Nevertheless, in some scientific fields, researchers may be particularly interested in outliers, for example, the analysis of high water levels in the context of dike enforcement, the occurrence of a specific disease when analysing epidemiological data, or tornado levels. In such analyses, you certainly do not want to remove outliers! Coles (2001) focusses on extreme values. Other sources of information on outliers are Barnett and Lewis (1994) and Chatfield (1995).

Whether you are interested in outliers or not, you need to know if they are present and their effects on the results of the analysis. You then have to decide what to do with the outliers. Potential actions are discussed in the last section of this chapter. The aim of this chapter is to present a set of tools to identify potential outliers. We distinguish the following types of outliers.

1. Outliers in one-dimensional space. We have a variable, either the response variable or a covariate, and it has one or multiple observations that are considerably larger (or smaller) than the majority of the observations.
2. Outliers in two-dimensional space. These are observations that do not follow the general pattern shown for two variables.
3. Influential observations in a regression-type analysis. Once a regression-type analysis (e.g., linear regression, generalised linear modelling, mixed modelling) has been performed, we need to assess the influence of individual observations on the results and determine whether single observations or a group of observations are causing odd behaviour of the model.
4. Odd species. Suppose we have multivariate datasets in which multiple species are measured at multiple sites. Species that behave differently from the other species may be labelled as outliers, and the same holds for a site with a different species composition.
5. Outliers due to the nature of the data. We may be sampling a rare species that is absent from most sites. Finding such a rare species is not a real outlier, but we need to choose the appropriate statistical methodology to analyse it.

We will address some of these outliers below.
2.2 Boxplot to identify outliers in one dimension

Zuur et al. (2007, 2010) defined an observation as a potential outlier if it is substantially larger or smaller compared to the majority of the observations. However, as we will see in this chapter, it is also possible that such an observation is not an outlier. For the moment, we will focus only on identifying points that are considerably smaller or larger than the bulk of the observations. We will classify these as potential outliers. Later in this chapter we will discuss whether these are really outliers and what we should do with them (e.g., transform, remove, leave them as they are, select the appropriate statistical technique, etc.). In this section we will use boxplots to identify observations that are considerably larger or smaller than the other observations. In the next section we will use Cleveland dotplots to do the same.

2.2.1 Simple boxplot

The first tool for identifying data points that are considerably larger or smaller than the majority of the observations is the boxplot. The boxplot visualises both the centre of the data via the median (denoted by Q2) and the spread of the data. Most statistical software packages present the median as a horizontal line or as a point. So how do we quantify the spread? One option is to use the variance or standard deviation. Another option is to use the difference between the 75% and 25% quartiles (denoted by Q3 and Q1, respectively), also called the interquartile range. Most software routines for boxplots draw a box around the median, and the size of the box is equal to Q3 – Q1. Hence, such a box around the median contains half of the observations. To visualise some extra spread, two lines are drawn from the box: one going up and one going down. In the boxplot function in R, these lines extend to the most extreme observations that lie within 1.5 times the interquartile range from the box. To identify potential outliers, points that are beyond these lines are presented as circles in R (other software may use stars or other symbols). Note that other software packages may use different rules for building up the boxplot, so you must read the help files for the software package you choose.
In summary, the essential information in a boxplot is given by five numbers: the minimum, Q1, Q2, Q3, and the maximum; see Keen (2010) for useful references. A boxplot visualises the centre and the spread of the data as well as the extremes.
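As a small numerical illustration (a sketch using simulated data, not one of the datasets in this book), the quartiles and the 1.5 × IQR fences can be computed directly:

> set.seed(123)
> y <- c(rnorm(50, mean = 10, sd = 2), 25)      #Simulated data with one large value
> Q <- quantile(y, probs = c(0.25, 0.5, 0.75))  #Q1, Q2 (median), Q3
> lower <- Q[1] - 1.5 * (Q[3] - Q[1])
> upper <- Q[3] + 1.5 * (Q[3] - Q[1])
> y[y < lower | y > upper]      #Points a boxplot would flag
> boxplot.stats(y)$out          #R's own rule (based on hinges, so it may differ slightly)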
To illustrate the boxplot, we use a dataset from Zuur et al. (2009a), which was originally used in Cruikshanks et al. (2006). The underlying idea behind this research was to develop new tools to identify acid-sensitive water in coastal rivers in Ireland. Cruikshanks and co-workers modelled pH as a function of SDI (sodium dominance index) taken at 257 sites along coastal rivers, the altitude at each site, and whether the sites were forested or not. The boxplot is given in Figure 2.1. Note that there are at least 10 observations that are considerably larger than the majority of the observations! Some scientists would consider these points outliers and remove them, something we would certainly caution against. Instead, we suggest you take a pen and paper and write: ‘Potential outliers, keep an eye on them; investigate further if they cause trouble’. The R code to create Figure 2.1 follows.

> IrishpH <- read.table(file = "IrishpH.txt",   #File name assumed
                        header = TRUE)
> boxplot(IrishpH$Altitude, ylab = "Altitude")
Figure 2.1. Boxplot for altitude of the Irish pH dataset. Note that there are at least 10 observations that are considerably larger than the majority of the observations.
2.2.2 Conditional boxplot

A useful extension of the boxplot is the conditional boxplot. In addition to outlier detection, this tool is also useful for examining the difference between median values of different groups. Conditional boxplots can also be used as diagnostic plots to identify heterogeneity of variance in regression-type models (see Chapter 3 for further details). The conditional boxplot example that we present shows the effect of month and sex on the cephalothorax length of the red swamp crayfish Procambarus clarkii. The data are taken from Ligas (2008). The goal of his work was to gather and interpret existing information on red swamp crayfish populations, with emphasis on the dynamic features (size-frequency distributions, sex ratio, growth of juveniles, etc.), in the Salica River in the southern part of Tuscany (Italy). Six morphometric variables were measured, together with variables such as weight, sex, month, and sexual maturity, for 746 crayfish specimens. We import the data into R and create a boxplot using only the cephalothorax length variable (CTL); see Figure 2.2. There are no observations that are considerably smaller or larger than the majority of the observations. Hence, there are no outliers in one dimension.
Figure 2.2. Boxplot of cephalothorax length of the red swamp crayfish (Procambarus clarkii).

The following R code imports the data.

> Crayfish <- read.table(file = "Crayfish.txt",   #File name assumed
                         header = TRUE)
> boxplot(Crayfish$CTL,
          ylab = "Cephalothorax length",
          main = expression(italic("Procambarus clarkii")))

We now demonstrate how conditional boxplots can be used to visualise whether the length values differ per month and sex. We first create the conditional boxplot of cephalothorax length versus month; see Figure 2.3. Each boxplot uses data from a specific month; the syntax for this is CTL ~ Month. The identify function can be used to print the labels of some of the points. The identify function only works if a graph exists, so do not create the boxplot, delete the graph, and then run the identify function. Press the escape button to stop the identify function.

> boxplot(CTL ~ Month,
          ylab = "Cephalothorax length", xlab = "Month",
          data = Crayfish,
          main = expression(italic("Procambarus clarkii")))
> identify(y = Crayfish$CTL, x = Crayfish$Month)   #Press escape
Figure 2.3. Conditional boxplot showing the relationship between cephalothorax length and month, in addition to labelled outliers.
In Figure 2.2 no outliers were detected, but we can see that, especially in March 2005, some observations are rather different from the other observations for that month. For example, observations 733, 734, and 740 are outside the 90% range of the data from March 2005. This is confusing; one boxplot (Figure 2.2) shows that there are no potential outliers and a more detailed boxplot (Figure 2.3) suggests that we may have some trouble. Our suggestion is to ignore this for the moment. The conditional boxplot shows the two-way relationship between length and month. Perhaps the observations with high length values in March 2005 are due to something else, and using an interaction term in the statistical models may (or may not) take care of it. We suggest taking a pen and paper, and writing: ‘Check residuals of observations 733, 734, and 740 once the model has been fitted’ (idem with observation 380 from August 2004).

We will present another example (using sexual maturity instead of month) to demonstrate how we can visualise differences in sampling effort using a conditional boxplot. Before doing this, and to ensure that sampling effort is approximately similar per maturity level, we calculate the number of observations per level of maturity using the table function. Maturity has been coded with values from 0 to 4.

> table(Crayfish$Maturity)
  0   1   2   3   4
  1 369 326  40  10

Note that the number of observations per maturity level differs considerably, hence we have unbalanced data. We have only one specimen with sexual maturity 0. There is no point in creating a conditional boxplot (or statistical analysis) containing a single observation in one of the levels. So we decided to remove level ‘0’ from the data using the following command.

> Crayfish2 <- Crayfish[Crayfish$Maturity != 0, ]   #Exact subsetting command assumed

The conditional boxplot in Figure 2.4 is then created with the varwidth = TRUE option so that the width of each box reflects the number of observations.

> boxplot(CTL ~ Maturity, data = Crayfish2,
          ylab = "Cephalothorax length",
          xlab = "Sexual maturity",
          main = expression(italic("Procambarus clarkii")),
          varwidth = TRUE)
Figure 2.4. Conditional boxplot of cephalothorax length versus maturity for the red swamp crayfish (Procambarus clarkii). Note the wider boxes for the first two levels of data.

It is also possible to create a conditional boxplot on two variables in the same graph. See the code below, which produces a conditional boxplot of length conditional on month and sex. The graph is not shown here.

> boxplot(CTL ~ Sex * Month, data = Crayfish2,
          ylab = "Cephalothorax length",
          xlab = "Sexual maturity",
          main = expression(italic("Procambarus clarkii")),
          varwidth = TRUE)   #Graph not shown

2.2.3 Multi-panel boxplots from the lattice package

A rather interesting set of multi-panel graphs is available from the lattice (Sarkar 2008) and ggplot2 (Wickham 2009) packages. Functions from these packages are considerably more difficult to learn than standard functions such as boxplot. However, as with many things in R, the effort is worthwhile. In this chapter we only discuss the lattice package. The ggplot2 package is used in Chapter 6. The lattice package is designed to produce trellis graphs in R. A trellis graph shows relationships between variables, conditional on one or more other variables. Trellis graphs are available for an extensive selection of plot types. For further reading, and some insights into the creation and development of trellis graphs, see Cleveland (1985, 1993). Multi-panel boxplots can be implemented with trellis graphs via the function bwplot; we will use the crayfish data from the previous section to illustrate this function. The multi-panel boxplot for the crayfish data is presented in Figure 2.5. It shows a set of panels with easily readable boxplots. With the exception of August, the graph shows a weak sex effect
within each period. To create the graph, let us first define the variable fSex as a factor (i.e., a categorical variable) with the levels Female and Male instead of f and m. This makes interpretation easier.

> Crayfish$fSex <- factor(Crayfish$Sex,    #Exact original call assumed
                          levels = c("f", "m"),
                          labels = c("Female", "Male"))
> library(lattice)

The bwplot is created as follows.

> bwplot(CTL ~ fSex | Month, data = Crayfish,
         ylab = list(label = "Cephalothorax length", cex = 1.5),
         strip = strip.custom(bg = 'white',
                 par.strip.text = list(cex = 1.2)),
         scales = list(rot = 45, cex = 1.3),
         par.settings = list(
           box.rectangle = list(col = 1),
           box.umbrella  = list(col = 1),
           plot.symbol   = list(cex = .5, col = 1)))
Figure 2.5. Conditional boxplot of cephalothorax length versus sex for the red swamp crayfish (Procambarus clarkii). The filled dot in each box is the median, the boxes define the hinges (25%–75% quartiles), and the lines extend 1.5 times the hinge spread. Points outside this interval are represented by dots and may (or may not) be considered outliers.
This is not code that one would type in from scratch; you need to copy and paste it. Once it runs for a particular dataset, you only need to change the first two or three rows for a new dataset. To make the multi-panel boxplot more readable (especially if you have several boxes) you can rotate the text at the bottom. The scales argument makes a rotation of 45 degrees in this case, and adding abbreviate = TRUE to the scales option can be useful, too. The strip argument creates a white background in the strips containing the names of the levels, and the cex option specifies the font size.
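For example, a stripped-down variation on the call above (assuming the same Crayfish data and fSex factor) that abbreviates the axis labels could look like this:

> bwplot(CTL ~ fSex | Month, data = Crayfish,   #Minimal sketch, default styling
         ylab = "Cephalothorax length",
         scales = list(rot = 45, abbreviate = TRUE))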
2.3 Cleveland dotplot to identify outliers in one dimension

2.3.1 Simple Cleveland dotplots

A second graphical tool to identify potential outliers in a variable is the Cleveland dotplot (Cleveland 1993). This is a graph in which the row number of an observation is plotted versus the observed value. Row numbers are typically plotted along the vertical axis and the observed values along the horizontal axis. Hence the y-axis shows the order of the data and the x-axis shows the values. An example of the altitude variable from the Irish pH dataset is presented in Figure 2.6. The first observation of the variable Altitude is plotted as a dot on the bottom row of the dotplot and the last observation will be on the top row. So what is the purpose of this graph? Where exactly should you look? We are only interested in whether there are any points that are considerably larger, or smaller, than the other values. So, are there any points ‘sticking out’ on the right-hand side or the left-hand side? Such observations are potential outliers. In Figure 2.6 we can observe a group of observations with altitude values that are considerably larger than the majority of the observations. This is the same message as was given by the boxplot. The R code to create the Cleveland dotplot in Figure 2.6 follows.

> dotchart(IrishpH$Altitude, main = "Altitude",
           ylab = "order of data", xlab = "range of data")
Figure 2.6. Cleveland dotplot of altitude of the Irish pH data. The majority of the observations were taken at low altitude, but at least 10 samples were observed at higher altitude.

Instead of relying on the dotchart function, it is also possible to create this graph using the plot function. If you inspect the data, you will see that the first column of the object IrishpH contains the sample identification number, called ID; it is a sequence from 1 to N, where N is the number of observations. By plotting the variable ID versus Altitude we get the same graph as in Figure 2.6. If you do not have a variable ID, then create it using

> IrishpH$ID <- 1:nrow(IrishpH)
> plot(y = IrishpH$ID, x = IrishpH$Altitude,
       ylab = "order of data", xlab = "range of data",
       main = "Altitude")
> identify(y = IrishpH$ID, x = IrishpH$Altitude)

The third line includes the identify function. It allows us to identify any points by clicking with the cursor on points in the graph. To abort the identify process, press the escape button.
Figure 2.7. Scatterplot of ID versus Altitude for the Irish pH data. The graph is similar to the one shown in Figure 2.6.

2.3.2 Conditional Cleveland dotplots

After introducing boxplots, we showed in Subsection 2.2.2 the conditional boxplot, which allowed us to look at relationships between continuous and categorical variables. Examples were presented in Figures 2.3 and 2.5. Something similar can be done with the Cleveland dotplot. We call such a plot a conditional Cleveland dotplot. The principle is the same, in that we can visualise the relationship between a continuous variable and a categorical variable. You can apply this exploration technique on the raw data or use it in the model validation process after carrying out a regression analysis. We will present a conditional dotplot using the Irish pH data. We create a dotplot of pH conditional on the categorical variable Forested; see Figure 2.8. The graph indicates that sample sizes differ per forestry category (i.e., forested versus non-forested). The variation in pH values per level is similar and there are no clear outliers. The R code for the conditional dotplot uses the dotchart function again, but this time we include the groups argument; it is set to the categorical variable fForested, which has the values ‘Forested’ and ‘Nonforested’.
Figure 2.8. Dotplot of pH conditional on forested. There are slightly more values with higher pH in the non-forested group.

> IrishpH$fForested <- factor(IrishpH$Forested,   #Level labels assumed
                              levels = c(1, 2),
                              labels = c("Forested", "Nonforested"))
> dotchart(IrishpH$pH,
           groups = IrishpH$fForested,
           main = "pH", xlab = "range of data")

You can extend this graph by adding the extra argument (not shown here) "col = IrishpH$Forested"; this will colour the observations from the different levels of Forested in black and red, because Forested has the values 1 and 2. The colour of 1 is black and the colour of 2 is red.

2.3.3 Multi-panel Cleveland dotplots from the lattice package
If you want to show several variables as dotplots in a single graph, one of the best options (in our opinion) is the multi-panel dotplot. This is done using functions from the lattice package. As an example we are going to explore the variables from the Irish pH data again. The end result is presented in Figure 2.9. The variable Altitude has a couple of observations that are considerably larger than the other values, while the response variable pH seems to contain one sample with a low value.
The advantage of multi-panel dotplots is that you can judge the behaviour of each individual variable at the same time. Should you see outliers and decide to apply a transformation, it is better to make an overall decision while seeing dotplots of all variables at the same time, and if possible, to apply the same transformation to all of them (see Section 2.10). Of course, this is a rather subjective approach; it also depends on the experience of the researcher. Some people prefer looking at one graph at a time, but that process will certainly be slow if you have a considerable number of variables in the data.
Figure 2.9. Multi-panel dotplot for the Irish pH data.

We now discuss how we created the graph. Note that what you see below is advanced R code, and it may look intimidating at first.

> library(lattice)
> SelX <- c("Altitude", "pH", "SDI")   #Selection of variables assumed
> dotplot(as.matrix(IrishpH[, SelX]),
          groups = FALSE,
          layout = c(3, 1),
          strip = strip.custom(bg = 'white',
                  par.strip.text = list(cex = 1.2)),
          scales = list(x = list(relation = "free", draw = TRUE),
                        y = list(relation = "free", draw = FALSE)),
          col = 1, cex = 0.5, pch = 16,
          xlab = list(label = c("Value of the variable"), cex = 1.5),
          ylab = list(label = c("Order of the data from text file"), cex = 1.5))

First we load the lattice package with the library function. If you already created the bwplot earlier in this chapter, then there is no need to repeat this command. The second line defines a vector of character strings called SelX; these are the variables that we want to use in the multi-panel dotplot. The code for the multi-panel dotplot looks intimidating, but in fact, to get it running for your own data, you only need to modify the first line; everything else can stay as it is. The option layout = c(3, 1) ensures that the panels are arranged in a one-by-three format. The command IrishpH[, SelX] means that only the variables in the character string SelX are selected, but you must ensure that the spelling of the variable names in SelX matches those in IrishpH.
2.4 Boxplots or Cleveland dotplots?

In the previous two sections we discussed boxplots and Cleveland dotplots. Both tools are useful for identifying observations that are considerably larger or smaller than the bulk of the observations. Each tool has its advantages and disadvantages. The advantage of the Cleveland dotplot over the boxplot is that you can observe all data points at a glance and distinguish whether you have observations that are considerably smaller or larger than the bulk of the data. However, the definition of smaller and larger remains subjective. In our experience, one tends to judge the presence of outliers much more easily by using dotplots rather than boxplots, because dotplots provide more detailed information. Special care should be taken with points identified as outliers by the boxplot. It is not always the case that these are indeed outliers. Figure 2.10 shows an example in which the boxplot identifies a group of observations as outliers but the dotplot shows these points are not outliers. We used the crayfish data again, but this time we used the length of the propodus of the chela (PLsx). The R code to generate Figure 2.10 is as follows.

> par(mfrow = c(1, 2), mar = c(5, 4, 1, 1))
> dotchart(Crayfish$PLsx,
           ylab = "order of data", xlab = "range of data")
> boxplot(Crayfish$PLsx)

We use the par function here because we want to show two graphs in the same graphical window. The size of the margins around the graphic area can also be controlled by the mar argument. The numbers for the
mar argument refer to the white space at the bottom, left, top, and right side of the graph, respectively. It is perhaps better to compare both graphical tools and make your final decision by relying more on a dotplot. The disadvantage of a dotplot is that people are less familiar with this type of graph.
Figure 2.10. Dotplot and boxplot of length of the propodus of the chela for the crayfish data. We recommend starting the data exploration using boxplots and Cleveland dotplots on both your response and explanatory variables, although our favourite tool is the dotplot.
2.5 Can we apply a test for outliers?

Outlier detection is an essential part of data analysis, and thus far we have used graphical techniques. Although these graphical tools are quite useful, there is a certain level of subjectivity involved when using them to detect outliers. This section describes and compares some formal tests, which can be used as an alternative to visual inspection of the data. Some of these tests are designed to detect the presence of a single outlier and others for multiple outliers. Tests do not provide a golden solution either, because there is controversy about them in the literature. Studies on outlier detection have been conducted extensively by Barnett and Lewis (1994), who described tests according to the data distribution. A good review of the subject matter can be found in Iglewicz and Hoaglin (1993).
2.5.1 Z-score

Before describing formal testing, we mention one simple approach for detecting outliers, the Z-score. It is given by the formula

Zi = (Yi − ȳ) / σ̂

where ȳ and σ̂ are the sample mean and standard deviation, respectively, and Yi is the value of the variable for observation i. An observation i for which the Zi-score exceeds 3 in absolute value may be viewed as an outlier. However, this method is not good for small sample sizes (Schiffler 1988). Another drawback to this approach is that the effect of covariates is completely ignored.
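As a minimal sketch of this calculation (using a hypothetical simulated vector, not a dataset from this book):

> y <- c(rnorm(100, mean = 10, sd = 2), 30)   #Simulated data with one large value
> Z <- (y - mean(y)) / sd(y)                  #Z-scores
> which(abs(Z) > 3)                           #Indices of flagged observations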
2.5.2 Grubbs’ test

Grubbs’ test is a test for checking the presence of a single outlier in a set of observations. It assumes normality of the data, hence we should not use it if there are covariate effects in the data (i.e., before doing a linear regression). The null hypothesis and alternative hypothesis are as follows.

H0: There are no outliers in the dataset.
Ha: There is at least one outlier in the dataset.

The Grubbs’ test statistic is written as
G = (Ymax − ȳ) / σ̂

where ȳ and σ̂ are the sample mean and standard deviation, respectively. The value G is the deviation between the maximum value and the mean, divided by the standard deviation. You can also use this test to determine whether the minimum value is an outlier, in which case G is defined as the mean of the Yi values minus the smallest value, again divided by the standard deviation. The test should not be used when the sample size is smaller than 6, since in that case most of the observations are likely to be classified as outliers.

We illustrate the use of the Grubbs’ test with a worker bee dataset (Maggi et al. 2010). The dataset consists of two variables: the number of Varroa destructor mites, one of the world’s most destructive pests of honeybees, and cell size (width) in mm. We apply the Grubbs’ test to determine whether there is an outlier in the cell size values. We first load the data.

> Wbees <- read.table(file = "Wbees.txt",   #File name assumed
                      header = TRUE)
> min(Wbees$cell)
0.014
> max(Wbees$cell)
0.753

We test for one outlier (the minimum) using the grubbs.test function from the outliers package (you need to download and install this package). The syntax type = 10 checks whether the minimum data point is extreme.

> library(outliers)
> grubbs.test(Wbees$cell, type = 10)

Grubbs test for one outlier
data:  Wbees$cell
G = 9.8516, U = 0.9670, p-value < 2.2e-16
alternative hypothesis: lowest value 0.014 is an outlier

Results show that we can reject the null hypothesis. We conclude that a cell size of width = 0.014 mm is an outlier. However, looking at the Cleveland dotplot in Figure 2.11 gives us the same information.
Figure 2.11. Cleveland dotplot of cell size for the worker bee data.
Figure 2.11 was created using the following R code.

> plot(x = Wbees$cell, y = 1:nrow(Wbees),
       xlab = "Cell size", ylab = "Order of the data",
       pch = 16)

If we wish to test whether the observation with the largest cell size is an outlier, we need to use opposite = TRUE.

> grubbs.test(Wbees$cell, type = 10, opposite = TRUE)

In this case we get a G statistic of 4.38 with a p-value of 0.02, indicating that the largest observation is indeed an outlier. However, based on the Cleveland dotplot we would not have labelled this observation as an outlier because it is not that much larger than the second largest observation. An alternative to the Grubbs’ test when testing for more than one outlier is the Tietjen–Moore test. However, it should be stated that this test requires the specification of the number of outliers, so the alternative hypothesis needs to be of the form ‘There are exactly k outliers in the dataset’. If you are uncertain about the exact number of outliers in your data, then the generalised extreme Studentised deviate (ESD) test can be used. This test only needs an upper bound on the number of alleged outliers. You therefore should change the alternative hypothesis, again, to refer to ‘There are up to r outliers in the dataset’. The ESD test basically runs r separate tests: a test for one outlier, a test for two outliers, and so on up to r outliers. The authors of this book have never used tests to determine whether there are outliers; they use only graphical tools.
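For readers who nevertheless want to see how the stepwise ESD procedure works, here is a rough sketch based on its standard formulation (this is not code from the book, and it assumes the Wbees data above have been loaded):

> esd.outliers <- function(x, r, alpha = 0.05) {
    x.work <- x
    R <- lambda <- numeric(r)
    for (i in 1:r) {
      z         <- abs(x.work - mean(x.work)) / sd(x.work)
      R[i]      <- max(z)                    #Test statistic at step i
      x.work    <- x.work[-which.max(z)]     #Drop the most extreme point
      n.i       <- length(x) - i + 1         #Sample size used at step i
      p         <- 1 - alpha / (2 * n.i)
      t.crit    <- qt(p, df = n.i - 2)
      lambda[i] <- (n.i - 1) * t.crit / sqrt((n.i - 2 + t.crit^2) * n.i)
    }
    max(c(0, which(R > lambda)))             #Estimated number of outliers
  }
> esd.outliers(Wbees$cell, r = 3)   #At most 3 alleged outliers in the cell sizes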
2.6 Outliers in the two-dimensional space

It is also possible to have observations that are not outliers in the one-dimensional space, but become outliers as soon as you set up a plot of two variables. We call such values outliers in the x-y space. That is, you may explore the y- and x-axes and not see any influential observations, but when you plot the variables against one another, an outlier may appear. If there are only two variables, this is easy to detect with the help of a scatterplot (see Chapter 4), but if you run a model with multiple explanatory variables, then it becomes more difficult and time-consuming to visualise outliers in higher dimensions. The example we provide here concerns the relationship between ash-free dry weight (AFD) and length of the wedge clam Donax hanleyanus (Ieno, unpublished data). When plotting y versus x, the increase of weight with length is very strong. However, one observation
does not comply with the rest of the data, indicating an obvious outlier (Figure 2.12). We now provide the R code for the wedge clam data. First, we load the data, then log-transform the variables to linearise the relationship.

> Wclam <- read.table(file = "WedgeClam.txt",   #File name assumed
                      header = TRUE)
> Wclam$LOGAFD    <- log(Wclam$AFD)      #Names of the raw variables assumed
> Wclam$LOGLENGTH <- log(Wclam$Length)
> plot(x = Wclam$LOGLENGTH, y = Wclam$LOGAFD,
       xlab = "Length", ylab = "Weight",
       main = "Wedge clam")
> identify(x = Wclam$LOGLENGTH, y = Wclam$LOGAFD)
Figure 2.12. Scatterplot for the wedge clam data. Observation 108 is an outlier in the x-y space because it does not comply with the relationship.
2.7 Influential observations in regression models
In the previous example we encountered a dataset with an outlier in the x-y space. Obviously, this is easy to visualise if you only have two variables. The problem with this type of outlier arises when you are
dealing with more than one covariate. From the data exploration point of view this is indeed very difficult to assess, but fortunately a tool called the Cook's distance plot is available. It allows the researcher to assess the influence of each individual observation on the fitted values in regression-type models, e.g., multiple linear regression and generalised linear models (Cook and Weisberg 1982). Continuing with the wedge clam example, we present the scenario in which a simple linear regression analysis is carried out, including observation 108. The Cook's distance is used to identify points that have a large influence on the fit. Each observation is left out in turn, and the fitted values obtained from all observations are compared with the fitted values obtained after dropping that particular observation. According to some authors, a Cook's distance value larger than 1 is a typical threshold used to suggest that a given data point might be influential. This is a diagnostic tool; it does not mean that such points should be removed just because doing so improves the fit of your regression model. In any case, care should be taken. The best way to deal with influential observations is to report both sets of results (with and without the influential points) and to investigate how strongly these points affect the conclusions.
Next we show the code for obtaining a Cook's distance plot. The first line of code fits the bivariate linear regression model LogAFDi = α + β × LogLengthi + εi.
> M1 <- lm(LOGAFD ~ LOGLENGTH, data = Wclam)
> plot(M1, which = 4)
The resulting graph is presented in Figure 2.13A. Note that one observation has a rather large Cook's distance value, indicating that it is influential on the fitted values. We also show Cook's distance values for the same linear regression model applied to the data from which we removed the suspicious observation 108 (here dropped by negative indexing). The resulting graph is presented in Figure 2.13B. The R code for Figure 2.13B is as follows.
> M2 <- lm(LOGAFD ~ LOGLENGTH, data = Wclam[-108, ])
> plot(M2, which = 4)
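If you prefer numbers to a plot, the Cook's distance values can also be extracted directly with the cooks.distance function; a short sketch (not in the book) using the model M1 fitted above:
> D <- cooks.distance(M1)
> which(D > 1)                       # observations exceeding the common threshold of 1
> sort(D, decreasing = TRUE)[1:3]    # the three most influential observations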
Figure 2.13. Cook’s distance plot obtained from the linear regression model applied on the wedge clam data, with (A) and without (B) observation 108.
2.8 What to do if you detect potential outliers
In the previous sections we discussed tools to identify outliers in one dimension and in higher dimensions. In various datasets presented earlier in this chapter (e.g., the clam data) we discovered outliers. If you think that a dataset contains outliers, you have several options:
1. Remove the outliers.
2. Do not do anything and apply a statistical technique. This may work, or it may result in disaster.
3. Present the results of the models with and without the outliers.
4. Apply a transformation.
We should make a distinction here between outliers in response variables and in explanatory variables (see Section 2.10). If you decide on the first option, removing observations, then make sure that this decision is justified. You should always double-check whether there are typing mistakes or errors in the data. This is a common problem when handling large datasets. Whenever laboratory experiments are carried out, you should check the data carefully because some observations could be outside the 'normal' range (e.g., below or above the detection limit) or indicate that an error was made in the setup of the experiment. The example presented next is part of an experimental procedure to examine the effects of species density (Cerastoderma edule) on sediment nutrient release (NH4) taken from estuarine waters (Ieno et al. 2006). A Cleveland dotplot of NH4 release is
presented in Figure 2.14. Note that there is one observation with a rather large value. One might speculate whether this is indeed a true observation (it actually has a very low biomass) or whether a failure in the equipment produced a 'wrong' reading. In this type of experimental design, in which everything has been set a priori (homogenised sediment, equal number of organisms in each container, etc.), one would think that this observation is reliable. However, consultation with the technician in charge of the nutrient determination method revealed that the odd value was processed after the maximum number of chemical analyses that can be performed in a single run had been exceeded, and just before the instrument was recalibrated for a new run to test for instrumental drift. To produce data compatible with previous records, we therefore removed that observation. The R code to create Figure 2.14 follows; the data are first imported into an object called Cockles.
> boxplot(Cockles$Concentration,
          ylab = "Concentration",
          main = "NH4 Concentration")
Figure 2.14. Accumulated nutrient concentrations of NH4-N with increasing species biomass. A value of 32.46 mg/l at a low density level is highly unlikely based on previous studies.
When outliers are detected in field data, deleting them may be more complicated, especially if an outlier is a real observation and no errors were encountered at the time of measuring the data. One way of dealing with outliers is by applying a transformation. See, for example, Quinn and Keough (2002), Zar (1999), or Zuur et al. (2007) for detailed discussions as to which transformation to use and when.
In general, when exploring data, you may come across two different types of outliers: outliers on the right-hand side and outliers on the left-hand side. In the first case a transformation works by compressing the data points on the right side more than those on the left side. Depending on how influential the observations are, you can move from a simple square root transformation to a much stronger option such as the logarithmic transformation. Figure 2.15 shows an example in which a log transformation has been applied to a dataset containing one outlier (though one may also argue that the outlier is only marginally larger than the second largest observation, and that therefore no transformation is needed). We load the Heteromastus data, which consist of measurements of length and width of faecal pellets of the polychaete species Heteromastus similis (Ieno, unpublished data), and make a Cleveland dotplot of the lengths.
> dotchart(Heteromastus$Length,
           ylab = "Order of data",
           xlab = "Range of length")
A log transformation is then applied (the base of the logarithm does not matter for removing the effect of the outlier).
> LogHeter <- log(Heteromastus$Length)
> dotchart(LogHeter,
           ylab = "Order of data",
           xlab = "Range of length")
Figure 2.15. Length distribution of the polychaete (Heteromastus similis). Left: Raw data with an outlier on the right-hand side. Right: Log-transformed with no outliers.
The opposite situation may arise when outliers are on the left side of the majority of the observations (i.e., the data are negatively skewed). In this scenario you can reflect the data, apply a transformation (e.g., square root or log transformation), and finally reflect again to restore the original ordering of the variable. To reflect a variable, you create a new variable in which each original value is subtracted from a constant. The constant can be calculated by adding, for example, 1 to the largest value of the original variable; this ensures that no reflected value is zero, which matters if a logarithmic transformation is applied in the next step. We can also use the maximum observed value itself as the constant (in case a square root transformation is applied in the next step). The dataset in Figure 2.16 (upper graph) shows the length distribution in mm of 57 yellow clams (Mesodesma mactroides) taken from a single South American beach (Ieno, unpublished data), in which an outlier on the left side appears. The largest value is 45.1 mm, so each value is subtracted from 45.1 to obtain the reflected new variable (Figure 2.16, middle graph). Finally a square root transformation is applied to the reflected variable to remove the effect of the outlier, as shown in Figure 2.16, bottom graph. Let us focus on the R code to create Figure 2.16. We first load the data and create the first dotplot.
> Clam <- read.table(file = "Clams.txt", header = TRUE)
> dotchart(Clam$Size,
           ylab = "Order of data",
           xlab = "Range of size (mm)")
Before reflecting the data, we have to determine the largest observation.
> max(Clam$Size)
45.1
Next we reflect the data.
> SizeReflect <- 45.1 - Clam$Size
> dotchart(SizeReflect,
           ylab = "Order of data",
           xlab = "Range of size (mm)")
The new Cleveland dotplot shows an outlier, but this time, as the data have been reflected, the outlier appears on the right-hand side. It is then time for a square root transformation to get rid of the outlier.
> Size2 <- sqrt(SizeReflect)
> dotchart(Size2,
           ylab = "Order of data",
           xlab = "Range of size (mm)")
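The final step mentioned above, reflecting back so that large clams again get large values, is not shown in the code. A minimal sketch of one way to do it (our own, with the new constant chosen as the maximum of the transformed values) could be:
> Size3 <- max(Size2) - Size2   # reflect back to the original ordering
> dotchart(Size3,
           ylab = "Order of data",
           xlab = "Reflected square root of size")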
2.9 Outliers and multivariate data
Multivariate analysis also requires data inspection for outliers as part of the data exploration process. There are graphical tools available to aid visual inspection for multiple variables. One important technique is principal component analysis (PCA), one of the most commonly used ordination techniques in ecology. Interested readers should refer to Jolliffe (2002) for an extensive discussion. According to Jolliffe (2002), PCA can be employed for detecting two types of outliers: (1) a variable with one (or more) large values that would already have been spotted in a simple Cleveland dotplot or boxplot, and (2) outliers that cannot be observed with simple univariate methods.
The following example corresponds to a dataset on morphological variation in the ectoparasitic mite (Varroa destructor) affecting different western honeybee populations (Maggi et al. 2009). A simple dotplot (not shown here) reveals that there are two mites with large values of the variable was (the width of the anal shield). Jolliffe (2010) pointed out that this type of outlier can dominate the first few PCA axes. Figure 2.17 shows the PCA biplot applied to all the data. Note that two observations appear far away from the rest; these are the two mites with large was values. Figure 2.18 also displays a PCA biplot, but with the two outliers removed. The lines for the variables are also slightly different. We do not present the mathematics behind the biplot, but both biplots are based on the correlation matrix, and a correlation matrix may (or may not) be influenced by outliers (see Chapter 5).
Figure 2.17. PCA biplot for the V. destructor data. All the data were used. The correlation biplot was used. The numbers are the observations.
Figure 2.18. PCA biplot for the V. destructor data. Two observations were removed. The correlation biplot was used. The numbers are the observations.
The R code to make the two PCA biplots follows. We load the data into an object called Varroa and use the princomp and biplot functions. Because the biplots are based on the correlation matrix, princomp is called with cor = TRUE (the exact call, including which columns of the data are used, is assumed here).
> Varroa.PCA <- princomp(Varroa, cor = TRUE)
> biplot(Varroa.PCA)
We reproduce the same PCA biplot after removing the two observations, which are stored in the object Varroa2.
> Varroa2.PCA <- princomp(Varroa2, cor = TRUE)
> biplot(Varroa2.PCA)
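If you want to confirm numerically which observations lie furthest from the origin in the biplot, the PCA scores can be inspected directly; a short sketch (not in the book) using the Varroa.PCA object from above:
> scores <- Varroa.PCA$scores[, 1:2]
> order(rowSums(scores^2), decreasing = TRUE)[1:2]   # the two most extreme observations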
2.10 The pros and cons of transformations
We conclude this chapter by discussing the major concerns about data transformation. We will not go into different types of transformations in too much detail. Other textbooks and papers have already done so (Mosteller and Tukey 1977; Sokal and Rohlf 1995; Zar 1999; Fox 2002; Zuur et al. 2007, 2010).
We will briefly point out the essentials needed to understand that a data transformation is not always required, nor is it the solution for some particular types of data. If the aim of a transformation is to remove outliers and/or linearise relationships, then a transformation is a sensible choice, as we have already suggested in the previous sections when using boxplots, Cleveland dotplots, and scatterplots. Gaps in the explanatory variables may also benefit from a transformation, especially if a generalised additive model is the required method for the follow-up analysis. It is very common in ecology that the spacing of the samples along an environmental gradient is so irregular that there is no choice but to transform (unless a new categorical variable is created). A further motivation for a transformation may be to achieve normality and homogeneity of the residuals in regression-type analyses (see Chapter 3 for additional details).
However, complications may occur when interpreting results on the back-transformed scale. After obtaining the final model, a back-transformed estimated parameter may be required. If you model the mean of a variable as a function of covariates, it should be noted that the mean on the original scale is not the same as the back-transformed mean of the data on the log scale. A simple example illustrates this. We load the Heteromastus data once more, but focus only on the length measurements taken in August, which we store in a vector called NewH_similis.
> mean(NewH_similis)
0.3452874
> median(NewH_similis)
0.4
The log10 transformation of the original variable is
> LogNewH_similis <- log10(NewH_similis)
> mean(LogNewH_similis)
-0.5057098
> median(LogNewH_similis)
-0.39794
If we want to back-transform the parameter estimates, we take the anti-log and compare the results with the parameters of the original variable. For the mean we have
> 10^(-0.5057098)
0.3120974
The back-transformed value is known as the geometric mean, and the geometric mean will be less than the mean of the raw data. And finally, for the median we have
> 10^(-0.39794)
0.4
In this particular case, this is the same as the median of the original variable. Similar problems may arise when interpreting a log-transformed variable that already contains a logarithm in its definition (e.g., the Shannon–Weaver diversity index). The Irish pH data presented above constitute a good example, as pH values are calculated by the formula pH = –log10[H+]. Needless to say, a log transformation of pH results in a double log transformation of that variable. Another disadvantage is that a transformation might change the nature of the data. This is the case when you run models with interactions (see Chapter 4). The authors of this book have seen examples in which an interaction term in a linear regression model was significant, but when a log transformation was applied the interaction was no longer significant. We recommend that you think twice about the choices you have before applying a transformation. Make sure you select the right distribution family, as this may help you comply with the assumptions and avoid unnecessary transformations of the response variable. The flowchart in Figure 2.19 summarises the pros and cons discussed in this section. The authors of this book only apply transformations on continuous covariates that have outliers.
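Note that back-transforming the mean of the log values gives the geometric mean regardless of the base of the logarithm; a quick check (not in the book), assuming the vector NewH_similis from above:
> 10^mean(log10(NewH_similis))    # geometric mean via log10
> exp(mean(log(NewH_similis)))    # the same value via natural logs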
Figure 2.19. Flowchart showing how to determine when to apply a transformation. Advantages listed in the flowchart: reduce the effect of outliers; obtain a covariate gradient with equally spaced observations; improve linearity between variables; obtain homogeneity and normality. Disadvantages listed: back-transforming can be difficult; the relationships between variables may change; the effect of interactions may change; the entire Y–X relationship is affected. Recommendation: try the analysis first without a transformation; for the response variable it might be better to select a different distribution.
3 Normality and homogeneity
In this chapter we present various tools to identify normality and homogeneity. We start with graphical tools such as conditional boxplots and Cleveland dotplots. We also discuss tests for normality and homogeneity.
Prerequisite for this chapter
• A working knowledge of R is required. See, for example, Zuur et al. (2009b).
3.1 What is normality?
We associate normality with distributions. The normal (or Gaussian) distribution is a continuous probability density function and is one of the most commonly used distributions in the modelling of continuous data. However, not every statistical technique assumes normality, and many scientists are confused about when and to what extent normality is required. There is a clear misconception that normality is always needed. So what do statisticians assume when they formulate this assumption? The answer depends on which statistical method is being used. In linear regression models, the normality assumption works as follows. Figure 3.1A shows a simulated dataset of 100 observations: one covariate and one response variable. We have fitted a bivariate linear regression model; this is the straight line. The normality assumption means that the response variable is normally distributed at each value of the covariate. Note the emphasis on 'each'.
Figure 3.1. A: Simulated dataset visualising the normality assumption. B: Normality at each covariate value.
We visualise the normality assumption in Figure 3.1B: at 10 arbitrarily selected covariate values a large number of observations are simulated from a normal distribution whose mean is given by the fitted line, and these simulated observations are superimposed on the graph. Instead of adding simulated values we can also visualise the normality assumption with a three-dimensional graph and put normal density curves on top of the fitted line (see Zuur et al. 2007 for an example); in effect we have a bell-shaped tunnel on top of the line. Normality in linear regression models is not as important as homogeneity of variance, which is one of the other assumptions; if that assumption is violated, interpretation and inference may not be reliable. Obviously, how early you need to start exploring normality depends on the follow-up analysis and whether or not you are going to perform hypothesis testing. There is a real danger in creating a histogram of the response variable and applying a transformation to achieve normality while ignoring the potential effect of the covariates. Bear in mind that a non-normality pattern in the response variable may be due to the effect of covariate(s)! Linear regression is known to be reasonably robust to non-normality (Quinn and Keough 2002). This does not, however, mean that you should put blinders on or ignore it. Rather, you want to ensure that when you finalise your analysis and do a proper model validation, the residuals look normally distributed.
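The simulation behind Figure 3.1 is not shown in the book; a minimal sketch of the same idea (all numbers chosen arbitrarily) is:
> set.seed(123)
> x <- sort(runif(100, 0, 10))
> y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 2)   # normal scatter around a straight line
> M <- lm(y ~ x)
> plot(x, y)
> abline(M)
At any chosen covariate value, say x = 5, the assumption is that the response values are draws from a normal distribution centred on the fitted line at that value.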
3.2 Histograms and conditional histograms
Exploratory techniques for checking normality are discussed in most statistics textbooks; here we present a brief review. The classic way to portray a distribution is by means of a histogram. A histogram is created from a frequency table and aims to show the centre and spread of the data. The shape of the histogram communicates essential information about the probability distribution of the data. We associate a histogram with visualising bell-shaped curves, but not all bell-shaped distributions correspond to a normal distribution (e.g., the t-distribution; see Zar 1999). Figure 3.2 shows the histogram of the weights of sparrows (unpublished data from Chris Elphick, University of Connecticut, USA). Zuur et al. (2007) used these data to explain discriminant analysis because several morphometric variables were also measured to allow for discrimination between two sparrow species: the seaside sparrow (Ammodramus maritimus) and the salt marsh sharp-tailed sparrow (Ammodramus caudacutus). The y-axis shows the number of observations per class. The shape of the histogram suggests that the data are normally distributed, although a little bit skewed to the left.
Figure 3.2. Histogram of weights of 861 sparrows (Ammodramus caudacutus).
The following code was used to create the histogram. We call the dataset Sparrows and select the rows for which the species code equals SSTS, so that we display the weight distribution of the salt marsh sharp-tailed sparrow only (the name of the species column is assumed here). The hist function is used to create the histogram.
> Sparrows2 <- Sparrows[Sparrows$Species == "SSTS", ]
> hist(Sparrows2$Wt,
       xlab = "Weight in grams",
       main = expression(italic("Ammodramus caudacutus")))
3.2.1 Multipanel histograms from the lattice package
The function histogram in the lattice package can be used to draw multiple histograms; see Chapter 8 of Zuur et al. (2009b) for further details. The object Sparrows also contains covariates such as Age, Sex, and Observer. We could therefore investigate whether the weight distribution of Ammodramus caudacutus is similar across the observations obtained by different observers. An inspection of the data reveals that there are seven observers, so we draw a histogram conditional on the explanatory variable Observer.
The resulting graph is presented in Figure 3.3. We set layout = c(1, 7) so that the panels are arranged vertically. Note that we also change the default by setting nint = 30 to increase the number of bars.
> library(lattice)
> histogram(~ Wt | factor(Observer),
            data = Sparrows2,
            layout = c(1, 7),
            nint = 30,
            xlab = "Weight in grams",
            strip = FALSE,
            strip.left = TRUE,
            ylab = "Frequencies")
Figure 3.3. A multipanel histogram of the Sparrows data conditional on the variable Observer.
When plotting this type of graph, you probably have in mind the idea of quantifying an observer effect. This could be done with a linear regression model that tells us whether there are significant differences among observers, in a model that also contains the Age and Sex effects of the sparrows. So is this graph relevant at this stage? Do we need to worry if the histograms in Figure 3.3 show deviations from normality? The answer is no! Instead, we should continue with the statistical analysis, extract the residuals from the model, and then verify whether the normality assumption holds for the residuals. We will not explain the principle of linear regression in this
book. Rather, we would like to emphasise that normality of the raw data is not of concern here, because the histogram above ignores the effect of the remaining covariates. Figure 3.4 provides the final answer to this matter: it shows the histogram of the residuals from the final model. Any indication of non-normality would have invalidated the results of that analysis, and action should then have been taken. To produce this graph, we first define the model, extract the residuals, and then generate the histogram. We call the function rstandard to extract the standardised residuals (the model formula below is reconstructed from the description: weight modelled as a function of age, sex, and observer). The argument breaks can be used to increase (or decrease) the number of bins; by default R chooses the number of bins automatically.
> M1 <- lm(Wt ~ factor(Age) + factor(Sex) + factor(Observer), data = Sparrows2)
> E1 <- rstandard(M1)
> hist(E1, xlab = "Residuals", main = "")
Figure 3.4. Histogram of residuals. Note that the assumption of normality is reasonably met in this case. But see Section 3.4.
3.2.2 When is normality of the raw data considered?
From the previous example we learned that histograms of residuals should be investigated if linear regression is the primary statistical tool used to analyse the data. So the question is: When do we judge normality of the raw data? Normality of the raw data is a prerequisite for statistical tests such as t-tests. There you only have a response variable split up into two groups and nothing more, as no further covariates are involved. If this is
the case, then normality of the data within each group is expected before performing the test.
3.3 Kernel density plots
Histograms are strongly affected by the number of bins and the bin locations, which can sometimes make it difficult to determine the shape of a distribution. Instead of a histogram one can also use a kernel density curve. In this technique small functions (view them as Lego pieces) are defined for each observation, and these functions are added up, resulting in a smooth representation of the probability density function. The user can select the shape of these functions (the Lego pieces) and also the bandwidth, which controls the amount of smoothing. A kernel density curve can be made with the density function. We first relabel the categorical variable fSpecies so that its levels are the two species names; the rest is a matter of using the density function and elementary plotting tools (the density calls and the per-species subsetting are reconstructed here).
> d <- density(Sparrows$Wt)
> plot(d, xlab = "Weights (in grams)",
       cex.lab = 1.5, cex.main = 1.5, main = "",
       xlim = c(15, 28), ylim = c(0, 0.35), lwd = 5)
> d1 <- density(Sparrows$Wt[Sparrows$fSpecies == "A. maritimus"])
> d2 <- density(Sparrows$Wt[Sparrows$fSpecies == "A. caudacutus"])
> lines(d1, lty = 2, lwd = 2)
> lines(d2, lty = 3, lwd = 2)
> legend("topright",
         legend = expression("Both Species",
                             italic("A. maritimus"),
                             italic("A. caudacutus")),
         lty = c(1, 2, 3), lwd = c(5, 2, 2))
Other useful functions are the densityplot function in the lattice package and the sm.density.compare function in the sm package.
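The amount of smoothing can be changed via the bandwidth; a quick way to see its effect (a sketch, not from the book) is the adjust argument of density, which multiplies the default bandwidth:
> plot(density(Sparrows$Wt, adjust = 0.5), main = "Half the default bandwidth")
> lines(density(Sparrows$Wt, adjust = 2), lty = 2)   # twice the default bandwidth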
3.4 Quantile–quantile plots
Another way of depicting the distribution of a continuous variable is the quantile–quantile plot (Q–Q plot). A Q–Q plot plots the quantiles of two distributions against each other; points that lie on an approximately straight line indicate that the two distributions are similar. We can use this idea to plot the quantiles of the raw data (or residuals) against the quantiles of a standard normal distribution. If the points in such a graph lie roughly on a straight line, then normality of the raw data (or residuals) is a plausible assumption. Let us use the same sparrow example to see how a Q–Q plot works; in this case we assess normality of the residuals shown in Figure 3.4. The function qqnorm produces the Q–Q plot and qqline adds a reference line that passes through the first and third quartiles. If the points follow the line, then we may assume normality.
> qqnorm(E1)
> qqline(E1)
The resulting graph in Figure 3.6 indicates that the residuals deviate from normality in the right tail.
Figure 3.6. Quantile–quantile plot of the residuals.
3.4.1 Quantile–quantile plots from the lattice package
Quantile–quantile plots are also available within the lattice package. The lattice function qqmath can produce several graphs in one graphical window. Figure 3.7 uses the same data as Figure 3.5, in the sense that the weight data have been split by species. The weight distribution of the first species appears to be normally distributed, whereas the second one shows deviation from normality. The following code produces the lattice graph.
> library(lattice)
> qqmath(~ Wt | fSpecies,
         data = Sparrows,
         cex = 1, col = 1,
         ylab = "Weight (in grams)",
         xlab = "Theoretical Quantiles")
Multiple linear regression techniques are reasonably robust against violation of the normality assumption.
Figure 3.7. Quantile–quantile plots showing the weight distribution for both sparrow species.
3.5 Using tests to check for normality
This section discusses tests for detecting non-normality; the aim is not to explain their mathematical background. We will use two datasets: (1) the Sparrows2 data shown before, which has more than 800 weight observations with many ties (observations with the same value), and (2) the Clam data described in Chapter 2, where the length of 57 yellow clams (Mesodesma mactroides) was analysed. Using two very different sample sizes allows us to discuss how test results from large samples can be at odds with the practical leeway offered by the central limit theorem. We first apply the Shapiro–Wilk test. The test statistic W lies between 0 and 1, with values close to 1 indicating normality. However, the distribution of W is highly skewed, and therefore this test may reject the normality assumption even for relatively large values of W. You can run the test with the shapiro.test function. Results for both datasets are given in Table 3.1. For the sparrow data we can reject the normality assumption, but there is no evidence to reject it for the clam data. Another test is the Kolmogorov–Smirnov test, which can be run in R with the ks.test function using the pnorm option to test for normality. The D'Agostino test (use agostino.test from the moments package) is also used to examine significant deviations from normality; it can detect departures from normality due to skewness and/or kurtosis.
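The book does not print the function calls; a sketch of how the three tests might be run on the sparrow weights (not the book's exact code; note that ks.test is given the sample mean and standard deviation here) is:
> shapiro.test(Sparrows2$Wt)
> ks.test(Sparrows2$Wt, "pnorm",
          mean = mean(Sparrows2$Wt), sd = sd(Sparrows2$Wt))
> library(moments)
> agostino.test(Sparrows2$Wt)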
Table 3.1. Results of various tests for normality applied on the sparrow and clam datasets. In all the tests p-values less than 0.05 indicate that the sample data deviate from normality.
Test                   Sparrows                                  Clams
Shapiro–Wilk           W = 0.977 (p < 0.001)                     W = 0.983 (p = 0.581)
Kolmogorov–Smirnov     D = 1.00 (p < 0.001)                      D = 1.00 (p < 0.001)
D'Agostino             skew = –0.358, z = –2.768 (p = 0.006)     skew = –0.015, z = –0.034 (p = 0.972)
Figure 3.2 showed that the Sparrows2$Wt data are likely to follow a normal distribution. You may argue that the data are slightly skewed, but recall that, thanks to the central limit theorem, minor deviations are tolerable. However, all the tests rejected the null hypothesis that the data came from a normal distribution. One possible explanation is that the sample is large (>800 observations), and with large samples even slight departures from normality are easily flagged as statistically significant. In contrast, the Clam data passed most of the normality tests, except the Kolmogorov–Smirnov test. This is not surprising: the sample is small, and normality tests have little power in small samples, so passing them does not rule out that a small sample comes from a non-normal distribution. To illustrate this problem, we produce different histograms using data that we know a priori are normally distributed. The command rnorm generates random numbers from a normal distribution with mean 0 and standard deviation 1. By executing the first line of code below several times, we see that bell-shaped histograms are obtained time and again. This is still the case with a sample size of 800, but as soon as the sample size gets smaller, clear bell-shaped histograms are no longer guaranteed.
> hist(rnorm(10000, mean = 0, sd = 1))
> hist(rnorm(800, mean = 0, sd = 1))
> hist(rnorm(50, mean = 0, sd = 1))
The authors of this book never perform any of these tests; they assess the normality assumption using graphs. In addition, it is well known that some departures from normality are not that serious, because the t-test and linear regression models are reasonably robust to violations of the normality assumption. Do we need to check normality of covariates in regression-type techniques? The answer is no; there is no need to make histograms of covariates.
3.6 Homogeneity of variance
Homogeneity of variance, also called homoscedasticity, is an important assumption in statistical techniques such as linear regression and the Student's t-test. Homogeneity means that the spread of all possible values of the population is the same for every value of the covariate. Violation of homogeneity is very common in ecological data, and ignoring this assumption may invalidate the statistical significance of the estimated parameters, as it increases the type I error even when normality is fulfilled (Quinn and Keough 2002; Zar 1999). Violation of homogeneity makes it difficult to determine the true standard errors, typically resulting in confidence intervals that are too wide or too narrow; ignoring the problem therefore leads to regression parameters with biased standard errors. This means that the F statistic is no longer F distributed and the t statistics do not follow a t distribution either (Wooldridge 2006). As a final result, p-values are not reliable. The consequence of violating homogeneity is more critical than that of violating normality. This section discusses a range of exploratory tools and shows how they can be used to check this assumption. Assessing homogeneity may require conditional boxplots, conditional Cleveland dotplots, or scatterplots; you can compare these graphical tools and decide whether the underlying assumption is met.
3.6.1 Conditional boxplots
A common tool used to explore variation in the data is the conditional boxplot. We introduced simple and conditional boxplots previously when explaining the presence of outliers, but the real advantage of a conditional boxplot is that it breaks up the data of a continuous or discrete variable into classes according to a categorical variable. This allows a quick overview not only of differences among groups (or levels of a covariate) but also of the spread within these different levels. As an example, we use benthic data on macrobenthic communities in a fishery exclusion zone and in a trawled area in the Adriatic Sea, Italy (De Biasi and Pacciardi 2008). We limit the data to the total abundance of the main taxonomic ranks of macroinvertebrates, divided into five classes: Mollusca, Crustacea, Echinodermata, Sipunculida, and Polychaeta. Figure 3.8 clearly shows that the polychaetes are by far the most abundant organisms in the community, but interest here lies in the larger variation of that group compared with the other taxa. Should you apply a simple regression-type model, you would need to take this into account and think carefully about how to proceed with the analysis, as this is a clear example of heterogeneity of variances. We now generate Figure 3.8 step by step. The first command reads the file into an object called Benthos; the second defines fTaxonID as a factor version of the taxon identifier (the original column name is assumed here).
> Benthos$fTaxonID <- factor(Benthos$TaxonID)
> boxplot(Abundance ~ fTaxonID,
          ylab = "Abundance",
          data = Benthos)
You can easily extend this conditional boxplot one step further and plot abundance conditional on both taxon and the Control variable; the bwplot function from the lattice package, already introduced in Chapter 2, can display such multiple panels (a sketch follows below). Just as for normality, we ultimately need to use residuals to assess whether we really have heterogeneity.
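A minimal sketch of such a multipanel version (not the book's code; the name of the Control column is assumed) could be:
> library(lattice)
> bwplot(Abundance ~ fTaxonID | factor(Control),
         data = Benthos,
         ylab = "Abundance")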
3.6.2 Scatterplots for continuous explanatory variables
So far we have shown examples with large variation in some levels of the independent variable; that is, we used only categorical covariates, for which boxplots are useful exploratory tools for assessing homogeneity of variance. The question now is how to visualise violation of homogeneity when a continuous covariate is used. Scatterplots can be useful for this. We deal with scatterplots in more detail in Chapter 4, but here we introduce one ecological example of sexual dimorphism that shows violation of homogeneity. The scatterplot in Figure 3.9 shows differences in body size (measured as the longitudinal length of the lower alveoli) of the southern fur seal (Arctocephalus australis; Denuncio 2007) plotted versus the age of the animals. Different symbols are used for males and females. It will come as no surprise that males are bigger than females! Adult males tend to show differences in size as a result of competition for space, food, and mating, and therefore more variation may be found within the male group than within the female group.
Figure 3.9. Scatterplot of longitudinal length of lower alveoli canine and age conditional on sex for the southern fur seal (Arctocephalus australis). Open circles indicate males and filled circles indicate females. Note the variation in length for the male adult specimens as age increases.
Obviously, this will go against the homogeneity assumption if regression models are applied. A classical solution to this problem would
be a transformation of the response variable to stabilise the variance and thus correct for heterogeneity. Although you would then be able to validate the model and fulfil the homogeneity assumption, the real consequence of applying a transformation is that you could remove this interesting size-related behavioural pattern, which might explain why subordinate male individuals grow more slowly than the competitively stronger adults! Zuur et al. (2009a) provide a wide range of ecological examples in which heterogeneity is incorporated into the model without transforming the raw data or moving to a nonparametric type of analysis.
Let's discuss the R code. First we import the data into an object called Fseal and define a new numerical vector Sex2 that holds the plotting symbols: 1 (open circles) and 16 (filled circles). You could have used the original labelling of 1 and 2, but we wanted open and filled circles instead. The option pch = Fseal$Sex2 completes that step and helps highlight the variation in length due to the sex effect (which sex corresponds to Sex == 1 is assumed in the comments below).
> Fseal$Sex2 <- Fseal$Sex             # initialise (assumed)
> Fseal$Sex2[Fseal$Sex == 1] <- 1     # open circles (males, assumed)
> Fseal$Sex2[Fseal$Sex == 2] <- 16    # filled circles (females, assumed)
> plot(x = Fseal$IS, y = Fseal$VAR26,
       pch = Fseal$Sex2,
       ylab = "Longitudinal length of lower alveoli",
       xlab = "Age",
       main = "Fur Seal")
3.7 Using tests to check for homogeneity
Tests for homogeneity are still among the primary procedures taught in university statistics courses on homoscedasticity. It is tempting to say that 'tests are more accurate' whereas 'graphs are somewhat subjective', but this does not mean that you should simply rely on test results just because a p-value is given.
3.7.1 The Bartlett test
In the late 1930s, Bartlett described a test for assessing homogeneity of variance (Bartlett 1937). The null hypothesis is that the variances in each of the groups (or samples) are the same. This test requires normality of the data.
3.7.2 The F-ratio test
The F-ratio test is used to decide whether the variances in two independent samples are equal. The smaller variance is used in the denominator, and you assess how much the resulting ratio deviates from 1. If the F-ratio is statistically significant, you may assume that equality of variances is not met. When the number of groups is two and the sample size in both
samples is the same, then Bartlett's test is equivalent to the variance ratio test (Zar 1999). Bear in mind that this test is heavily influenced by non-normality.
3.7.3 Levene's test
Another well-known option is Levene's test (Levene 1960). It is used to test the assumption of equal population variances and is less sensitive to departures from normality. If your data are not normally distributed, then this is perhaps a better choice than the Bartlett test.
3.7.4 So which test would you choose?
Three options were given above. A review of the advantages and limitations of the different tests can be found in Lee et al. (2010). They conclude that no single test is the best and that software packages should offer a wide range of choices. The modified Levene's test (also known as the Brown–Forsythe test), which uses the median instead of the mean, was evaluated in their study as superior in terms of power and robustness compared to Levene's test (Lee et al. 2010).
3.7.5 R code
Running these tests in R is relatively easy. We can use the Benthos data and type
> bartlett.test(Abundance ~ fTaxonID, data = Benthos)
> var.test(Benthos$CT, Benthos$Abundance)
The code for Levene's test is equally simple (use the levene.test function from the lawstat package). All tests reject the null hypothesis of homogeneity, which does not come as a surprise. What we should really do is apply these tests to the residuals of a linear regression model and assess those residuals for homogeneity. We applied the Bartlett test to the residuals of the sparrow data, obtained from a linear regression model in which weight is modelled as a function of age, sex, and observer. The Bartlett test indicated that the residual variation does not differ significantly per observer, but it does differ per age class and also per sex. The R code follows. First we fit the linear regression model again and extract the residuals (the lm and rstandard calls are reconstructed from the description and match those used in Section 3.2).
> M1 <- lm(Wt ~ factor(Age) + factor(Sex) + factor(Observer), data = Sparrows2)
> E1 <- rstandard(M1)
> bartlett.test(E1 ~ factor(Sex), data = Sparrows2)
> bartlett.test(E1 ~ factor(Age), data = Sparrows2)
> bartlett.test(E1 ~ factor(Observer), data = Sparrows2)
Only the results of the first test are presented here.
        Bartlett test of homogeneity of variances
data:  E1 by factor(Sex)
Bartlett's K-squared = 142.447, df = 1, p < 0.001
These results indicate that we can reject the null hypothesis that the variances are equal. A rule of thumb states that if the ratio between the largest and smallest variance is larger than 4 we have heterogeneity. That does not seem to hold here.
> tapply(E1, FUN = var, INDEX = Sparrows2$Sex)
   Female      Male
2.0038527 0.6025372
The ratio is about 3.3.
3.7.6 Using graphs?
We now discuss the scenario in which graphs are used to check the homogeneity assumption after a statistical model has been applied. So how do you verify that homoscedasticity is achieved when multiple covariates (especially continuous ones) are in a linear regression model? The answer is to plot the residuals versus each covariate that was used in the model, as well as versus covariates that were not used in the model, and assess whether there is heterogeneity in the residuals; see Figure 3.10. The R code to create Figure 3.10 is based on elementary plotting code.
> boxplot(E1 ~ factor(Sex),
          data = Sparrows2,
          xlab = "Sex",
          ylab = "Residuals")
> boxplot(E1 ~ factor(Age),
          data = Sparrows2,
          xlab = "Age",
          ylab = "Residuals")
> boxplot(E1 ~ factor(Observer),
          data = Sparrows2,
          xlab = "Observer",
          ylab = "Residuals")
Figure 3.10. Residuals plotted versus sex, age, and observer.
The same principle applies if continuous covariates are used. The example below shows the results of a model for the relationship between deep-sea pelagic bioluminescence and depth in the Mid-Atlantic Ridge (Heger et al. 2008). Depth is a continuous covariate. Because we wish to verify the homogeneity assumption, we first need to run the model, extract the residuals, and create the corresponding plot; see Figure 3.11. Note the cone-shaped pattern, which means the homogeneity assumption is violated. The heterogeneity is clearly structured: the variation in sources decreases as depth increases. The structure in the residuals also indicates that the heterogeneity may be due to missing covariates (e.g., location), something that a test does not tell us (and this is a major argument in favour of graphs). The R code below is used to create the graph. Note that we removed the data from stations 52 and 53, as we have only a few observations for them. The subsetting and the gam call (a smoother for depth, using the mgcv package) are reconstructed here from the description.
> BL2 <- BL[!BL$Station %in% c(52, 53), ]
> library(mgcv)
> M1 <- gam(Sources ~ s(Depth), data = BL2)
> E1 <- resid(M1)
> plot(y = E1, x = BL2$Depth,
       ylab = "Residuals",
       xlab = "Depth",
       main = "Bioluminescence data")
> abline(h = 0)
Figure 3.11. Residuals versus depth for the deep-sea pelagic bioluminescence data.
4 Relationships
An essential question any scientist may encounter during the data exploration process is how to determine relationships between variables and which graphical procedures to use for that purpose. Such exploration procedures include scatterplots, pairplots, coplots, and multipanel scatterplots. In this chapter we examine how to implement and interpret these graphical tools.
Required knowledge for this chapter
• A working knowledge of R is required. See, for example, Zuur et al. (2009b) and Chapters 2 and 3 of this book.
4.1 Simple scatterplots
It is important for a researcher to visually inspect data to see whether variables are associated. If variables are related, the second question is how they are related (e.g., linearly or nonlinearly). The third question is how this association can be described with a mathematical model. The type of patterns observed during data exploration will partly determine what type of mathematical model needs to be applied. Questions 1 and 2 are part of data exploration and are discussed in this chapter. Question 3 relates to modelling the data and is discussed in, for example, Zuur et al. (2007, 2009a, 2012, 2014).
4.1.1 Example: Clam data
In Chapter 2 we made use of simple plotting tools to detect outliers and finished the outlier section by introducing outliers in the x-y space. Figure 2.12 showed the standard bivariate scatterplot, which plots one variable along the horizontal axis and the second variable along the vertical axis. A clear straight line could be seen in that relationship. We begin this chapter by examining scatterplots and pairplots in more detail. Scatterplots are essential tools to visualise any type of pattern or association between two variables. Obviously, there are several ways in which variables can be associated: the data points may describe positive or negative linear relationships, a parabola, or a complex oscillating pattern, or we may conclude that there is no relationship at all because all the points are randomly scattered. The first example we introduce is fairly simple. Figure 4.1A shows a graph similar to the one described in Chapter 2, but a second clam species has been added here: the yellow clam (Mesodesma mactroides). It is the convention to plot the dependent variable along the vertical axis and the independent variable (covariate) along the horizontal axis. The aim of
this scatterplot is to explore how well body length describes biomass and whether both species are likely to exhibit the same relationship.
Figure 4.1. A: Scatterplot of adult clam biomass (dry weight) against body length for the yellow clam (Mesodesma mactroides) and the wedge clam (Donax hanleyanus). B: Same as A, but now both variables are log-transformed.
Note that Figure 4.1A shows a clear nonlinear relationship and heterogeneity. This means that we need a statistical technique that can cope with this; linear regression will not take us anywhere! Alternatively, we can apply a logarithmic transformation to the biomass and length data; see Figure 4.1B. We now have clear linear relationships and homogeneity. The two bands of points are most likely due to the two species. Based on Figure 4.1B, from a statistics point of view, the appropriate model to start with is a linear regression model in which log-biomass is modelled as a function of log-length, species identity, and an interaction term. It should be noted that the inclusion of the interaction term must be driven by your underlying biological question and not by the pattern observed in the scatterplot. Based on Figure 4.1A we would instead apply a generalised additive model using length as a smoother (plus an interaction between the smoother and species).
The following R code creates scatterplots for both the original and the log-transformed biomass–length measurements. The data are first imported into the object Mussels; the vectors Length and Biomass concatenate the length and biomass data of the two species, and MyPch holds the plotting symbols (1 and 17, one per species). The plotting code is as follows.
> plot(x = Length,
       y = Biomass,
       xlab = "Length (mm)",
       ylab = "Biomass (mg)",
       pch = MyPch)
> LogLength  <- log(Length)
> LogBiomass <- log(Biomass)
> plot(x = LogLength,
       y = LogBiomass,
       xlab = "Log Length (mm)",
       ylab = "Log Biomass (mg)",
       pch = MyPch)
The legend for the first graph was created using the legend function. The x and y values specify the position of the legend, and the bty = "n" argument suppresses drawing a box.
> legend(x = 5, y = 1.45,
         legend = expression(italic("M. mactroides"),
                             italic("D. hanleyanus")),
         horiz = FALSE, pch = c(1, 17),
         cex = 0.8, bty = "n")
Scatterplots applied to the clam data indicate that we need mathematical tools that can cope with nonlinear patterns and heterogeneity; alternatively, log-transform the data and apply linear regression.
4.1.2 Example: Rabbit data
We now illustrate a different example that does not follow a straight line (not even after a transformation); see Figure 4.2. The data are the ratio of rabbits to partridges hunted in Spain over time (data from Joaquin Vicente, University of Castilla La Mancha, Spain). Visual inspection of the scatterplot suggests that a quadratic linear regression model may be an appropriate starting point. Such a model is of the form Yi = β1 × Yeari + β2 × Yeari² + εi, although we may encounter problems with the nonsymmetric shape of the data. The term 'linear' in 'linear regression' does not necessarily mean a straight-line relationship; it means linear in the parameters.
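Such a quadratic model is easy to fit once the data are loaded (see the code for Figure 4.2 below); a minimal sketch, not from the book, using the variable names that appear in that code:
> M.quad <- lm(RbyP_index ~ year + I(year^2), data = Rabbits)
> summary(M.quad)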
Figure 4.2. Scatterplot of the ratio of rabbits and partridges hunted in Spain versus time.
To create the graph, we first load the rabbit data into an object called Rabbits and then use elementary plotting tools. The smoother was obtained with the loess function (the loess call and the extraction of its fitted values are reconstructed here from that description).
> plot(y = Rabbits$RbyP_index,
       x = Rabbits$year,
       pch = 16,
       ylab = "Rabbit/partridge index",
       xlab = "Year")
> M.loess <- loess(RbyP_index ~ year, data = Rabbits)
> Fit <- fitted(M.loess)
> lines(x = Rabbits$year, y = Fit, lwd = 2)
4.1.3 Example: Blow fly data
The last scatterplot we introduce does not show a linear pattern either. The example represents length measurements of the blow fly Calliphora vicina (Diptera: Calliphoridae) sampled during eight consecutive days; see Figure 4.3. Note the clusters of points showing the development of Calliphora vicina larvae under various experimental conditions (see Ieno et al. 2011 for technical details).
It is clear from the graph that length first rises sharply and then drops off when the larvae enter the pupal stage. This pattern suggests a nonlinear time effect. One approach to model these data is via an additive mixed effects model, which allows for nonlinear patterns and also for correlation between observations from the same batch.
Figure 4.3. Scatterplot of length as a function of time.
Let's create the graph shown in Figure 4.3. We first load the data into an object called Maggot. The code for the scatterplot follows.
> plot(y = Maggot$Length,
       x = Maggot$Series,
       xlab = "Time",
       ylab = "Maggot length")
The following code applies the smoothing method and superimposes the fitted smoothing curve on the plot with the lines command (as above, the loess call and the extraction of its fitted values are reconstructed).
> M.Loess <- loess(Length ~ Series, data = Maggot)
> Fit <- fitted(M.Loess)
> lines(x = Maggot$Series, y = Fit)
Scatterplots applied to the rabbit and blow fly datasets showed nonlinear patterns, which cannot be linearised with a transformation. Hence we need quadratic regression models or generalised additive models to analyse these data.
4.2 Multipanel scatterplots
We know from experience that not every relationship in biology consists of two sets of paired measurements in which one is clearly related to the other. There are always cases in which a researcher would like to see how one response variable relates to multiple covariates. Multipanel scatterplots allow us to examine such relationships. We introduce the xyplot function from the lattice package, illustrated with two different datasets. Readers should be aware that a fair amount of construction work goes into the code. In Chapter 6 we will use the ggplot2 package to produce similar graphs.
4.2.1 Example: Polychaeta data
The first example (Figure 4.4) shows the relationships between the abundance of six different polychaete families and the silt content of the sediment for the benthic data from the Adriatic Sea introduced in Chapter 3. The relationships between the abundances of the families and silt content are rather weak; Capitellidae shows a nonlinear trend along the silt gradient. Also note the differences in absolute abundance per family.
Figure 4.4. Multipanel scatterplot showing polychaete abundance of six families versus silt content in the Adriatic Sea (De Biasi and Pacciardi 2008). A smoothing line has been added to help with visualisation.
The following R code loads the data into an object called Poly and creates a variable SpAbund containing the abundances of the selected families. The as.vector and as.matrix functions ensure that the abundance data form one long vector; hence the 80 abundance observations for each family are stacked. The variable ID contains each family name repeated 80 times, so a row of ID indicates which family a row in SpAbund belongs to. Finally, we create a Silt variable in which the original covariate silt is repeated 6 times, once for each family. The assignments below are reconstructed from this description; the family names in MyVar are not spelled out here.
> MyVar <- c(...)   # the names of the six selected family columns in Poly
> SpAbund <- as.vector(as.matrix(Poly[, MyVar]))
> ID <- rep(MyVar, each = 80)
> Silt <- rep(Poly$Silt, 6)
The xyplot code follows.
> library(lattice)
> xyplot(SpAbund ~ Silt | factor(ID),
         data = Poly,
         xlab = "Silt content",
         ylab = "Abundance",
         scales = list(alternating = TRUE,
                       x = list(relation = "same"),
                       y = list(relation = "same")),
         panel = function(x, y) {
           panel.points(x, y, pch = 16, col = 1)
           panel.loess(x, y, col = 1, span = 0.7)})
Multipanel scatterplots applied to the zoobenthos data indicate weak relationships.
4.2.2 Example: Bioluminescence data
The second example shows the relationship between deep-sea pelagic bioluminescence (measurements of light emission, called 'sources') and depth, observed at 14 stations in the Mid-Atlantic Ridge (Heger et al. 2008); see Figure 4.5. Bioluminescence was
detected at all stations and displayed a general decrease in abundance with depth. There are clear nonlinear patterns and we need generalised additive models (Wood 2006) to analyse the data.
Figure 4.5. xyplot of bioluminescent sources (m–3) as a function of depth for all reference stations (52–76).
The R code to create the graph is as follows. The BL data are loaded once more (we introduced them in Chapter 3), and the code for the xyplot is shown below. The type = "l" argument draws a line between the points.
> library(lattice)
> xyplot(Sources ~ Depth | factor(Station),
         ylab = expression(paste("Sources (", m^-3, ")")),
         xlab = "Depth (m)",
         data = BL,
         col = 1, type = "l", lwd = 2)
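A roughly equivalent multipanel graph can be made with ggplot2, which is covered in Chapter 6; a minimal sketch (not the book's code):
> library(ggplot2)
> ggplot(BL, aes(x = Depth, y = Sources)) +
    geom_line() +
    facet_wrap(~ Station) +
    labs(x = "Depth (m)", y = "Sources")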
4.3 Pairplots
Individual scatterplots can be quite limiting if you want to show multiple relationships at the same time, as the number of scatterplots quickly increases and makes the exploration part of the analysis time-consuming.
4.3.1 Bioluminescence data
An effective data visualisation tool is the pairplot, or scatterplot matrix. Figure 4.6 gives an example in which six pairwise scatterplots are drawn in one graph.
Figure 4.6. Multiple pairwise scatterplots (pairplot) showing the relationship between the abundance (number of sources m–3) of pelagic bioluminescent organisms and environmental variables.
This pairplot shows the relationship between deep-sea pelagic bioluminescence and sea temperature, depth, and fluorescence in the Mid-Atlantic Ridge. Each panel is a scatterplot between two variables, with the variable's name printed in the panels running diagonally through the plot. A smoothing line has been added to help visualise the strength of the relationship, but this is optional and you can modify a pairplot as much as you want (points only, regression lines, etc.). The pairplot in Figure 4.6 suggests the same pattern as seen previously when the data were divided by station, namely a clear nonlinear pattern between pelagic bioluminescence (sources) and depth. As more covariates are included this time, the relationships between sources and temperature and between sources and fluorescence can also be discussed. The pairplot in Figure 4.6 not only contains scatterplots of the response variable versus each covariate (top row, first column) but also scatterplots between covariates. The latter can be used to assess collinearity (see Chapter 5). Let's present the code for the pairplot in Figure 4.6. It is the same dataset used for Figure 4.5, so we do not need to load it again. We apply the function pairs to produce the pairplot. In this case we
decide to plot the response variable (sources) and three covariates (depth, fluorescence, and temperature).
> MyVar <- c(...)   # the names of the Sources, Depth, Fluorescence, and Temperature columns
> pairs(BL[, MyVar], upper.panel = panel.smooth)
You can have different scenarios, such as (a) pairplots between a response variable and multiple explanatory variables (Figures 4.6 and 4.7); (b) pairplots between multiple response variables (Figure 4.8); and (c) pairplots between multiple explanatory variables. A combination of multiple response and explanatory variables is also possible, provided you have a limited number of variables.
4.3.2 Cephalopod data
It is also possible to visualise multiple relationships involving response and continuous explanatory variables while taking into account the effect of a categorical variable; see Figure 4.7. The data presented here consist of cephalopod morphometry, using alcohol-stored specimens from the London Natural History Museum collected over several decades by different expeditions (Neves, unpublished data, University of the Azores). The pairplot shows the relationship between weight and mantel length for the two species (open circles and filled circles), and the same for weight and lower rostral length. For the first combination there are considerable differences between the two species, but this is not the case for the weight–lower rostral length data. A statistical model will send the same message.
Figure 4.7. Pairplot of Histioteuthis spp. and Ctenopteryx sicula showing the relationship of mantle length and weight versus lower rostral length. Mantle length seems to correlate better with lower rostral length.
The R code for this graph follows.
> pairs(Squid[, c(1, 2, 3)], cex = 1.5,
        pch = c(1, 16)[unclass(Squid$Species)],
        labels = c("Weight", "Mantle length", "Lower rostral length"),
        lower.panel = NULL, cex.labels = 2)
The unclass bit converts the categorical variable into a vector with ones and twos, and these are then used to repeat the ones and sixteens for the pch option (point character). The legend function can be used to superimpose a legend for the different types of points.
4.3.3 Zoobenthos data
Finally, we provide an example in which multiple response variables are shown. The following example uses the zoobenthos data. We are going to focus on the abundance of a new selection of polychaete families. We will not select any explanatory variable this time as the aim is to visualise how well the family data correlate to one another. The pairplot is presented in Figure 4.8. Relationships are clearly linear. This means that principal component analysis (Jolliffe 2002) is a suitable statistical technique for these data as it works on correlation matrices, which measure linear relationships.
Figure 4.8. Pairplot of abundance of polychaetes belonging to the families Phyllodocidae, Spionidae, Maldanidae, Glyceridae, Onuphidae, Nephtydae, Lumbrineridae, and Cirratulidae.
The R code for this figure is trivial.
> MyVar <- c("Phyllodocidae", "Spionidae", "Maldanidae", "Glyceridae",
             "Onuphidae", "Nephtydae", "Lumbrineridae", "Cirratulidae")
> pairs(Poly[, MyVar])
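Because the principal component analysis just mentioned works on a correlation matrix, it can be useful to print that matrix as well. A short sketch (assuming the Poly data frame and the MyVar vector defined above):

round(cor(Poly[, MyVar]), 2)   # pairwise Pearson correlations between the families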
4.4 Can we include interactions? An important concept often encountered when building up statistical models is interactions. We talk about interaction if the y–x relationship changes depending on z. If this sentence makes you dizzy, then read on. Interactions are difficult to understand; scientists sometimes get confused about the interpretation of interactions, especially if many covariates are involved. There are three possible types of interactions: (1) between a continuous variable and a factor; (2) between two continuous variables; and (3) between two factors. In this chapter we use data visualisation tools to check whether we can apply models with interaction terms; is the quality of the data good enough to include them? In Chapter 6 we show how the results of regression models with interaction terms can be visualised. 4.4.1 Irish pH data The first example using interactions is based on the Irish pH data from Chapter 2. Recall that the underlying idea behind this research was to develop new tools to identify acid-sensitive waters of coastal rivers in Ireland. The proposal was to model pH as a function of SDI (sodium dominance index taken at 257 sites along rivers), altitude at each site, and whether the sites were forested or not. The relationship between pH and SDI may be affected by the altitude gradient and forestation. This model consists of a three-way interaction containing two continuous explanatory variables and one categorical variable. Because we want to start simple, we first look at two-way interaction and leave the discussion of the threeway interaction for later. The multipanel scatterplot of pH versus log-transformed altitude is presented in Figure 4.9. Let us return to the cryptic sentence ‘the y–x relationship changes depending on z’. In this case the pH–log altitude relationship changes depending on the forested factor (note the change in slope). This is called an interaction between log altitude and the categorical variable forested. The graph does not tell us whether the interaction is significant; we need a statistical model with an interaction term for this. But the graph does indicate that sample size in each level of the categorical covariate Forested is large enough to fit such a model.
Figure 4.9. Multipanel scatterplot of pH versus log altitude conditional on the categorical covariate fForested. The R code for Figure 4.9 follows. The initial three lines of code load the data and log transform the variable altitude (they had outliers). Then we change the labels of the categorical variable. > IrishpH IrishpH$LOGAltitude IrishpH$fForested library(lattice) > xyplot(pH ~ LOGAltitude | fForested, data = IrishpH, strip = strip.custom(bg = 'white', par.strip.text = list(cex = 1.5)), xlab = list(label = "Log altitude", cex = 1.5), ylab = list(label = "pH", cex = 1.5), panel = function(x, y) { M1 coplot(pH ~ LOGAltitude | fForested, data = IrishpH, panel = function(x, y, ...) { tmp Limosa Limosa$Bplumage01 Limosa$Bplumage01[Limosa$Bplumage01>0] Limosa$fBplumage01 Limosa$fplumage xyplot(IntakeRate ~ Time | fplumage, data = Limosa, strip = strip.custom(bg = 'white', par.strip.text = list(cex = 1.5)), xlab = list(label = "Time", cex = 1.5), ylab = list(label = "Intake rate", cex = 1.5), panel = function(x, y) { M1 coplot (pH ~ LOGAltitude | SDI * fForested, number = 4 , data = IrishpH, panel = function(x, y, ...) { tmp TF TF$fsex TF$farea xyplot(Trifur~LT |farea*fsex, data = TF, strip = strip.custom(bg = 'white', par.strip.text = list(cex = 1.5)), xlab = list(label = "Length (cm)", cex = 1.5), ylab = list(label = expression(paste("Prevalence of ", italic("T.tortuosus"), sep = "")), cex = 1.5), panel = function(x, y) { N1 5 ) { tmp interaction.plot(Bio$Nutrient, Bio$Treatment, Bio$Concentration, legend = FALSE, lty = c(1, 2), xlab = "Nturient ID", ylim = c(0,16), ylab = "Concentration", main = "Interaction Plot") > legend("topright", c("Algae", "Non Algae"), bty = "o", lty = c(1,2), cex = 0.8, title = "Treatment") Figure 4.16 shows a flowchart of some of the tools discussed in this chapter.
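Returning to Figure 4.9, a regression line can be added to each panel via the panel argument of xyplot. The sketch below is an illustration only; the panel code used for the figure is not shown in full above, so the plotting details (symbols, line width) are assumptions.

library(lattice)

xyplot(pH ~ LOGAltitude | fForested, data = IrishpH,
       xlab = "Log altitude", ylab = "pH",
       panel = function(x, y) {
         panel.points(x, y, col = 1, pch = 16)   # raw data in each panel
         cf <- coef(lm(y ~ x))                   # panel-specific linear fit
         panel.abline(a = cf[1], b = cf[2], lwd = 2)
       })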
Figure 4.16. Flowchart showing tools discussed in this chapter.
5 Collinearity and confounding One of the major problems many scientists face during data analysis is collinearity. The statistical examples we present in this chapter illustrate what collinearity is, how to identify it, and how best to proceed if it is present. Prerequisite for this chapter: A working knowledge of R and multiple linear regression is required. 5.1 What is collinearity? Suppose that you want to model the numbers of parasites in fish as a function of length and weight, or phytoplankton concentration as a function of temperature at the surface and temperature at 30 metres depth, or plant abundance as a function of soil type and height, etc. In all cases, it is obvious that the explanatory variables are correlated with each other. Correlation between two explanatory variables is also called collinearity. Collinearity does not necessarily have to occur with two explanatory variables; it can also occur when multiple explanatory variables are used. In this case it is referred to as multicollinearity (Montgomery and Peck 2002). Detecting collinearity and finding a solution for it is one of the more challenging aspects of statistical analysis. If you don’t deal with this problem properly, then the remaining bit of the statistical analysis becomes a rather confusing process; certain combinations of explanatory variables may give highly significant parameters, while with a slightly modified set nothing is significant. A common misconception about data analysis is that all recorded explanatory variables must be included in the model as each explanatory variable may represent ‘important’ information within the confines of a research study. This misconception may lead to serious problems due to collinearity. Before introducing graphical tools to detect collinearity, let’s discuss simple numerical tools to quantify collinearity. 5.2 The sample correlation coefficient When two random variables show a tendency to increase or decrease together in value they are said to be correlated. Correlation values are always between –1 and 1 and, depending on the sign, you may have a positive or negative correlation, with 0 an indication that variables are uncorrelated. Correlation can be quantified with a so-called correlation coefficient. There are different types of correlation coefficients, e.g., the ‘Pearson’,
‘Kendall’, and ‘Spearman’ correlation coefficients, with Pearson’s the best-known index [see Becker et al. (1988)]. We first present the mathematical expression to estimate the Pearson correlation coefficient (r) that is widely used for continuous variables:
\[ r = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{s_x} \right) \times \left( \frac{y_i - \bar{y}}{s_y} \right) \]
where N is the sample size, xi and yi are the two variables, and x̄ and sx are the sample mean and standard deviation, respectively. Hence, we subtract the sample mean of each variable and divide by the sample standard deviation, before calculating the cross-product for each observation, adding it all up, and dividing by N – 1. Note that r is the estimated Pearson correlation, but in the remaining text we will refer to it as the Pearson correlation coefficient. The following example demonstrates how to calculate the Pearson correlation coefficient using the cor function in R. We rely on the Trifur.txt file shown in Chapter 4 (sand perch data), which contains two variables that we know a priori should be correlated, namely weight and length of the fish. The first three lines of code load the data, define a variable with variable names, and calculate the Pearson correlation coefficient.
> TF <- read.table(file = "Trifur.txt", header = TRUE)
> MyVar <- c("Weight", "LT")
> cor(TF[, MyVar])
          Weight        LT
Weight 1.0000000 0.9528479
LT     0.9528479 1.0000000
The correlation between a variable and itself is always 1. The correlation between the two covariates Weight and LT is rather high.
5.3 Correlation and outliers
The first question we address is whether we should compute the Pearson correlation coefficient before or after outlier detection. Zuur et al. (2010) presented a protocol in which outliers were given priority in the data exploration process, especially in situations where extreme observations were encountered. The reason is obvious; outliers will influence the value of the sample correlation coefficient between two variables. The following example demonstrates how an outlier affects the value of the sample correlation coefficient. We use the Trifur data again with the explanatory variables LT and Weight, but now an outlier has been created on purpose; we change one weight value of 420 to 4200.
> TF$WeightNew <- TF$Weight
> TF$WeightNew[TF$Weight == 420] <- 4200
> head(TF$WeightNew, 15)
380 470 696 248 500 508 478 400 468 478 298 364 410 642 4200
It is clear that observation 15 is an outlier. The cor function first determines the mean of WeightNew, which is highly influenced by the outlier. When cor calculates the difference between WeightNew and its mean, most of the differences will be negative, and only the outlier has a large positive difference. This will certainly influence the value of the Pearson correlation coefficient! Here is the newly calculated Pearson correlation coefficient:
> MyVar <- c("WeightNew", "LT")
> cor(TF[, MyVar])
          WeightNew        LT
WeightNew 1.0000000 0.3882942
LT        0.3882942 1.0000000
The Pearson correlation now has a rather different value. Outliers can increase, decrease, or even change the sign of a Pearson correlation. What we have learned from this example is that outliers should be dealt with in the first instance, either by removing them or by transforming the affected variable(s). Outliers may change the value and sign of the sample correlation coefficient, and therefore the collinearity between covariates. We suggest dealing with outliers before assessing collinearity.
5.4 Correlation matrices
Now that we are familiar with the Pearson correlation coefficient, the next step is to assess collinearity in the data. Various exploration tools can be used to detect collinearity. The easiest way is by examining a correlation matrix. This is what we did in the previous section when applying the cor function to the variables Weight and LT. You can create correlation matrices with more than two variables and see all the possible combinations of variables. Correlations between explanatory variables larger than 0.80 are said to be critical (this is a rule of thumb), but special care also should be taken when intermediate correlation values between 0.50 and 0.70 are found (Graham 2003). Unfortunately, there is no golden rule for deciding when and what to remove from the dataset; this will depend on how strong the ecological and biological signals are. If you see very strong patterns between your response and covariates, then some collinearity is acceptable. However, if patterns between your response and explanatory
variables are weak, a more conservative approach should be followed. We will return to this point later in this chapter. Continuing with the sand perch data, we can include one more variable and produce a new correlation matrix. The additional explanatory variable is the sagittal length (LS) of the sand perch. The following code produces the correlation matrix.
> MyVar <- c("Weight", "LT", "LS")
> cor(TF[, MyVar])
          Weight        LT        LS
Weight 1.0000000 0.9528479 0.9478347
LT     0.9528479 1.0000000 0.9933593
LS     0.9478347 0.9933593 1.0000000
Note the diagonal with correlations equal to 1; the off-diagonal entries are the correlations between the covariates. It will come as no surprise that the three covariates are highly correlated. As length and weight are collinear, you will not be able to say which of the three covariates is driving the presence–absence of the parasite species T. tortuosus. The discussion section of a paper, thesis, or report should always address this issue.
to facilitate the process, we modified this piece of code slightly, stored it in the HighstatLibV8.R file, and then sourced it from the existing directory. > Spiders source("HighstatLibV8.R") We exclude those rows with missing values and select the variables we would like to display in the graph. > Spiders2 MyVar pairs(Spiders2[,MyVar], lower.panel = panel.cor)
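The panel.cor function sourced from HighstatLibV8.R is an adaptation of the example on the pairs help page. A sketch of such a function is shown below; this is the help-file version, not the exact code in HighstatLibV8.R, and the scaling of the font size is one possible choice.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))                          # absolute Pearson correlation
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if (missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)       # font size proportional to |r|
}

pairs(Spiders2[, MyVar], lower.panel = panel.cor)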
Figure 5.1. Pairplot of selected explanatory variables from the Spider data. The lower panel contains estimated pairwise Pearson correlations and the font size is proportional to the absolute value of the estimated correlation coefficient. The diagonal shows the abbreviations of variables. Note that the Pearson correlation only detects linear relationships, not nonlinear relationships. It may be an option to use nonparametric correlation coefficients, but these still will not pick up nonlinear relationships (which is also collinearity!). We therefore recommend looking at both the scatterplots and correlation coefficients. The pairplot in Figure 5.1 constitutes a clear example of collinearity. How do we proceed from here? One option is to start with the pair of covariates that show the highest correlation coefficient and remove one of them. Repeat this selection process with the remaining subset of variables
until the resulting graph does not show any covariates that are highly correlated. In the example, it is obvious that the first candidate to eliminate is LC or LLC. It comes as no surprise that litter cover and leaf litter cover represent the same environmental information, so it does not matter which one you choose as this information is 100% collinear. We recognize that NLC and HCL are correlated but also that the pair FWDC and CWDC exhibits the same Pearson correlation value (but with a different sign). We remove NCL, for instance. It may be an option to redraw the graph as there are fewer covariates now, making the graph easier to read (though the correlation values will not change). Decide whether fine woody debris cover or coarse woody debris cover will subsequently be dropped. A set of explanatory variables that does not show collinearity may be GVC, HLC, SLC, LC, CWDC, MLD and OM, but the selection of covariates can be done in different ways depending on which one you picked up on in the first instance. Recall that it does not matter which covariate from the pair (e.g., FWDC or CWDC) you choose for the analysis because in your discussion you won’t be able to conclude which covariate is the cause–effect variable for the response variable. It could be FWDC, or CWDC, or even both! 5.6 Collinearity due to interactions When fitting a multiple linear regression model with the continuous main terms and their interaction, there will be collinearity between each explanatory variable and the interaction term. The dataset we use to illustrate this has been used and explained by Crawley (2005). Ozone concentrations at 110 sites in the USA were analysed, using three covariates such as wind, temperature, and radiation. Some interaction terms were also included in Crawley’s analysis. An interaction between two continuous covariates means that their product is calculated and added as an extra covariate. For example, if we want to model ozone concentrations as a function of temperature, wind, radiation and an interaction between temperature and wind, then we need the product of temperature and wind. There is no need to calculate this product ourselves before doing a multiple linear regression (the function lm will do it for us), but for the data exploration we do need to define it ourselves. The pairplot is presented in Figure 5.2. Note the high collinearity between TW (the product of temperature and wind) and wind. This is not something to be concerned about as the p-value of the interaction term is not affected. In case of numerical estimation problems one should centre the main terms before calculating the interaction term. The R code to generate Figure 5.2 is given below. We import the data, calculate the product of temperature and wind, and use the pairs function to create the graph. > Ozone Ozone$TW MyVar pairs(Ozone[, MyVar], lower.panel = panel.cor)
Figure 5.2. Pairplot for the Ozone data. TW is the interaction term between temperature and wind. 5.7 Visualising collinearity with conditional boxplots The spider data constitute a typical example where all the covariates are continuous, but this is not always the case when you analyse ecological data. The dataset may consist of a mixture of continuous and categorical variables. How do you assess collinearity in such a case? The Pearson correlation coefficient is not a good option if one of the variables is a categorical variable that contains more than two levels (though it can be used for binary variables that are coded numerically). A graphical tool to assess collinearity between a continuous and categorical variable is to make a conditional boxplot. We will use the macrobenthos data from Chapter 2. The following R code loads the data and defines the categorical variables fCT and fPeriod. > Benthos Benthos$fCT Benthos$fPeriod MyY Mybwplot(Benthos, MyY, "fCT") > Mybwplot(Benthos, MyY, "fperiod") 5.8 Quantifying collinearity using variance inflation factors 5.8.1 Variance inflation factors We have already explained what collinearity is and how to visualise it. To understand why it is causing problems, we need to delve a little into the underlying mathematics of linear regression. Suppose we have a multiple linear regression model of the form
\[ Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_m X_{im} + \varepsilon_i \]
Yi is the response variable, Xij are the m explanatory variables, and εi is normally distributed noise with mean 0 and variance σ². The underlying method to estimate the parameters βj is ordinary least squares, and a detailed explanation can be found in, for example, Draper and Smith (1998) or Fox (2008), among many others. An expression for the variances of the parameters βj is given by
\[ \mathrm{Variance}(\beta_j) = \frac{1}{1 - R_j^2} \times \frac{\sigma^2}{(n-1)\,S_j^2} \qquad (5.1) \]
We used the same notation as Fox (2008), and the derivation of the expression in (5.1) can be found in many statistics textbooks. The Sj is a function of the Xijs (it involves the sum of squared differences of Xij from its mean over all observations i). The important bit for our discussion is the 1 / (1 – Rj2) bit. The Rj2 is the squared multiple correlation, R2, from the following linear regression model:
\[ X_{ij} = \alpha + \beta_1 X_{i1} + \cdots + \beta_{j-1} X_{i,j-1} + \beta_{j+1} X_{i,j+1} + \cdots + \beta_m X_{im} + \zeta_i \qquad (5.2) \]
The jth explanatory variable is used as a response variable, and all the other explanatory variables as explanatory variables (ζi is normally distributed noise). Suppose the R2 is large, say 0.9.Your first reaction to a high R2 may be positive, but in fact, it is not good at all. It means that the information in the jth explanatory variable is explained by the other explanatory variables. Formulated differently, the information in Xij is contained in the other explanatory variables, which makes it redundant. The first term in Equation (5.1) is called the variance inflation factor (VIF), hence:
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (5.3) \]
If Rj2 is large, then 1 – Rj2 is small, and VIFj is large. So, the higher the VIF, the larger the variance of a regression parameter. But the variances are used to calculate standard errors, and these are used to calculate confidence intervals, or a t-statistic and p-values. In summary, the higher the VIF, the less significant are the parameters, compared to a situation in which there is no collinearity. 5.8.2 Geometric presentation of collinearity Equation (5.1) shows the problem of collinearity in terms of the underlying equations. Figure 13.2 in Fox (2008) gives a geometrical presentation of the problem, and the next few paragraphs are inspired by his figure. We simulated some data using the equation
\[ Y_i = 2 + 0.25 \times X_{i1} + 1.6 \times X_{i2} + \varepsilon_i \qquad (5.4) \]
To get values for the covariate X1, we sampled 30 observations from a normal distribution, and we did the same for X2 and for the noise. The intercept of 2 and the slopes of 0.25 and 1.6 were chosen arbitrarily. The correlation between X1 and X2 is small (< 0.3), hence there is no serious collinearity. A linear regression model with one explanatory variable produces a straight line, and a model with two explanatory variables forms a two-dimensional plane. Figure 5.5 shows the fitted plane in three
dimensions. Note that this plane goes through the cloud of observations; some points will be below the plane, some will be on the plane, and yet others will be above the plane. You can also recognize the slopes of 0.25 and 1.6 in the graph; these are the angles between the plane and the X1 and X2 axes. The plane looks stable in the sense that it cannot easily be rotated without causing a considerably worse fit.
Figure 5.5. Fit of the linear regression model applied to the simulated data. The two-dimensional plane goes through the set of observations and provides the best least squares fit. Collinearity is small.
Figure 5.6, on the other hand, shows a problem. We again used Equation (5.4) to calculate Y values, but we set X2 equal to 0.95 × X1. As a result, X1 and X2 are highly correlated (r = 0.99), hence there is considerable collinearity. The high correlation between X1 and X2 manifests itself as a set of observations along a nearly straight line in the X1 – X2 space. The least squares solution is highly unstable as we can easily rotate the two-dimensional plane. Rotating the plane means that we get nearly the same residual sum of squares, but considerably different estimated regression parameters (hence larger standard errors) compared to the situation in which there is no collinearity.
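The two situations in Figures 5.5 and 5.6 are easy to recreate. The sketch below simulates both data sets and compares the estimated slopes and standard errors; the amount of noise added to X2 in the collinear case is an assumption, as the original simulation details are not given.

set.seed(123)
N  <- 30
X1 <- rnorm(N)

# Nearly uncorrelated covariates (as in Figure 5.5)
X2 <- rnorm(N)
Y  <- 2 + 0.25 * X1 + 1.6 * X2 + rnorm(N)
summary(lm(Y ~ X1 + X2))$coefficients

# Highly collinear covariates (as in Figure 5.6)
X2c <- 0.95 * X1 + rnorm(N, sd = 0.05)
Yc  <- 2 + 0.25 * X1 + 1.6 * X2c + rnorm(N)
summary(lm(Yc ~ X1 + X2c))$coefficients   # note the much larger standard errors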
Figure 5.6. The two-dimensional plane goes through the set of observations and provides the best least squares fit. The correlation coefficient between the two explanatory variables is high (0.99), and nearly all points lie on a line in the X1 – X2 space. The fitted plane is not stable, as rotating it gives nearly the same residual sum of squares.
5.8.3 Tolerance
VIF is also expressed as the inverse of the tolerance 1 – R2 (Quinn and Keough 2002). Both are widely used for collinearity diagnostics. In the case of tolerance, we want to be very cautious if the value is very small, because it may indicate that the explanatory variable under consideration is a linear combination of the explanatory variables already in the equation and that it should not be included in the regression model.
5.8.4 What constitutes a high VIF value?
What do we consider to be a high VIF? Figure 13.1 in Fox (2008) shows the relationship between R2 and the square root of the VIF. His motivation for using the square root is that the standard error involves the square root of the variance in Equation (5.1); to be more precise, the square root of the VIF is the multiplication factor of the standard error due to collinearity, compared to the situation in which there is no collinearity. We reproduced the graph here (Figure 5.7). It shows that for R2 values of about 0.75, the standard error doubles. A square root VIF value of 2 corresponds to a VIF value of 4.
There are various suggested values for a cut-off level for the VIF in the literature. Opinions vary depending on how conservative a researcher would like to be. According to some, a value larger than 10 indicates very strong collinearity (Belsley et al. 1980; Quinn and Keough 2002), while others argue that values larger than 5 or even 3 might be considered quite detrimental to regression models (Montgomery and Peck 1992). Our preferred limit is 5, or even 3. Obviously, the cut-off level also depends on the strength of the relationships. If you have weak signals in your dataset, then collinearity may result in no explanatory variable being significant. The code to create Figure 5.7 uses Equation (5.3) and is rather simple:
> R2 <- seq(0, 0.99, by = 0.01)
> VIF <- 1 / (1 - R2)
> plot(x = R2, y = sqrt(VIF), type = "l")
First we define R2 values from 0 to 1 (excluding 1) and calculate the corresponding VIF values. These are square rooted and plotted versus R2.
Figure 5.7. A plot of R2 versus the square root of the VIF. A similar graph is given in Figure 13.1 in Fox (2008). 5.8.5 VIFs in action So, how can we use VIFs? A first suggestion may be to drop all covariates from the model that have a high VIFj. But think about the sand perch example; in the same way as length is related to weight, so is weight related to length. Both covariates will have a high VIF, and we don’t want to drop both length and weight. We already mentioned that the Trifur example contained another explanatory variable, namely sagittal length (LS). So, we can create three linear regression models:
\[ \begin{aligned}
\mathrm{Length}_i &= \alpha + \beta_1 \,\mathrm{Weight}_i + \beta_2 \,\mathrm{LS}_i + \zeta_i \\
\mathrm{Weight}_i &= \alpha + \beta_1 \,\mathrm{Length}_i + \beta_2 \,\mathrm{LS}_i + \zeta_i \\
\mathrm{LS}_i &= \alpha + \beta_1 \,\mathrm{Length}_i + \beta_2 \,\mathrm{Weight}_i + \zeta_i
\end{aligned} \]
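Each of these auxiliary regressions can be fitted with lm, and the VIF then follows directly from its R2. A sketch for the first one, using the Trifur variables (Length corresponds to the column LT):

M_LT   <- lm(LT ~ Weight + LS, data = TF)      # auxiliary regression for length
VIF_LT <- 1 / (1 - summary(M_LT)$r.squared)    # Equation (5.3)
VIF_LT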
Note that each model has different alphas, betas, and noise. Using the R2s from each model, we can calculate VIFLength, VIFWeight, and VIFLS. We can compare the VIFs, and the explanatory variable with the highest VIF can be dropped from the analysis. In a second round, new VIFs, without the dropped explanatory variable, can be calculated and compared. This process can then be repeated. We wrote our own R routines for calculating VIF values, and these can be found on the website for this book. Alternatively, VIFs are also available in the car package from John Fox. The following code shows the implementation of the VIF routine for the Spider data. Note that all continuous explanatory variables are used, including the spatial coordinates. We decide a priori that the cut-off level will be 3. To run our code, type: > source(file = " HighstatLibV8.R") > MyVar corvif(Spiders2[, MyVar]) If you run this piece of code you will get this message: Error in diag(tmp_cor) M VIF VIF 1.319975 This is the same value that the corvif function produced. 5.9 Generalised VIF values The approach outlined in the previous section needs some modification before it can be applied to a set of explanatory variables containing continuous and categorical variables. The only missing part is how much information in the categorical variables is explained by other categorical variables and the continuous explanatory variables. The problem is that we cannot use any of the categorical explanatory variables as response variables. Hence, with this approach we can only get VIF values for the continuous explanatory variables, but not for the categorical ones. To get VIF values for categorical variables we need a different approach. Fox and Monette (1992) described a method that is called generalised variance inflation factors (GVIF), but their paper is rather technical and requires knowledge of matrix algebra. Its underlying
principle is as follows. The starting point is a linear regression model of the form:
\[ \mathbf{Y} = \alpha + \beta_1 \mathbf{X}_1 + \beta_2 \mathbf{X}_2 + \boldsymbol{\varepsilon} \qquad (5.5) \]
We used a vector notation. Hence, Y contains all the observed values of the response variable. The explanatory variables are in the matrices X1 and X2. If we want to know the VIF of a particular explanatory variable, then this variable is put in X1, and all the other explanatory variables are in X2. We will now revisit the Trifur data as it constitutes a simple example of three continuous and two categorical variables. Recall that the explanatory variables were total length, sagittal length, weight, sex, and area of fish collected. For example, if we want to know the VIF for Weight, then X1 contains only one column with all the values of Weight. If we want to know the VIF for Area (which has three levels), then X1 contains the two dummy variables (hence X1 has two columns and each column contains zeros and ones). It does not matter which level of Area is chosen as baseline. Using matrix algebra, Fox and Monette (1992) show that the VIF value is given by
\[ \frac{\det(\mathbf{R}_{11})\,\det(\mathbf{R}_{22})}{\det(\mathbf{R})} \qquad (5.6) \]
The det stands for determinant, R11 is the correlation matrix for X1, R22 is the correlation matrix for X2, and R is the correlation matrix for all explanatory variables. The expression in Equation (5.6) is called the generalised VIF value; it is denoted by GVIF. Before going any further, let’s get these values. The following R code can be used. First we make a data frame X that contains all continuous explanatory variables and we obtain the VIF values. > X corvif(X) VIF Weight 10.87542 LS 75.65106 LT 83.47936 Then we calculate the VIF and GVIF values for all covariates (continuous and categorical) using our corvif function. > TF$fSex TF$fArea X2 corvif(X2)
            GVIF  Df  GVIF^(1/2Df)
Weight  11.780795   1      3.432316
LS      82.681617   1      9.092943
LT      86.454488   1      9.298091
fSex     1.221040   1      1.105007
fArea    1.631532   2      1.130183
Note that for continuous explanatory variables, the GVIFs are similar to the VIFs. The column labelled Df contains the number of parameters used by each explanatory variable. fArea has three levels, and therefore two dummy variables are used. The categorical variable fSex is binary (0 or 1), hence only one parameter is used. The third column is calculated as
\[ \mathrm{GVIF}^{1/(2 \times \mathrm{Df})} \qquad (5.7) \]
If a continuous explanatory variable is used, or a categorical variable with only two levels, then Df = 1, and the expression in Equation (5.7) becomes the square root of the VIF. If a categorical variable with more than two levels is used, like fArea, then Df is larger than 1. Fox and Monette (1992) advised using Equation (5.7) in this case as "it provides a one-dimensional expression of the decrease in precision of estimation due to collinearity – analogous to taking the square root of the usual variance-inflation factor" [see Fox (2002), p. 217].
5.10 Visualising collinearity using PCA biplot
Finally, there is another approach that can be helpful for dealing with collinearity in continuous covariates: the principal component analysis (PCA) biplot. We are not going to discuss the theory of PCA in this book; interested readers can refer to Legendre and Legendre (1998), Jolliffe (2002), and Zuur et al. (2007), among others. The PCA correlation biplot provides information on relationships between variables. Each variable is represented by a line. The angle between lines gives an indication of the correlation between the variables. Lines pointing in a similar direction correspond to variables that are positively correlated, while lines pointing in opposite directions correspond to negatively correlated variables. Figure 5.8 shows a PCA biplot for the covariates in the Spider data. Collinearity between LC and LLC is very obvious as both lines are overlapping. CWDC and FWDC are also positively correlated. SLC appears to be negatively correlated with both CWDC and FWDC. The group MLD, OM, and NLC are also positively correlated with one another and negatively correlated with HLC. A selection of five or six variables would be best if you don't want to have repeated information when doing the follow-up analysis.
Figure 5.8. PCA correlation biplot for the Spider data. The R code to create Figure 5.8 follows. > MyVar cor_PCA biplot(cor_PCA, choices = 1:2, scale = 0, cex = c(0.5, 1.5), col = 1, arrow.len = 0) Alternatively, it is also possible to apply a PCA on the explanatory variables and use some of the principal components as new explanatory variables. Note that the axes can be seen as index functions. Examples of this approach can be found in Jolliffe (2002) or Zuur et al. (2007), but interpretation of the results is not always trivial. 5.11 Causes of collinearity and solutions Montgomery and Peck (1992) mention four possible causes of collinearity: (1) the data collection method, (2) constraints on the population, (3) model specification, and (4) an over-defined model. In ecology, we often see examples of the first two. Collinearity due to the
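The biplot in Figure 5.8 can be reproduced along the following lines. The use of princomp with cor = TRUE is an assumption (prcomp with scale. = TRUE gives an equivalent result); the graphical settings are taken from the call above.

cor_PCA <- princomp(Spiders2[, MyVar], cor = TRUE)   # PCA on the correlation matrix
biplot(cor_PCA, choices = 1:2, scale = 0,
       cex = c(0.5, 1.5), col = 1, arrow.len = 0)

# The leading principal components could also serve as new, uncorrelated covariates
PC12 <- cor_PCA$scores[, 1:2]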
data collection means that you go into the field, follow a track, and along the track you count, e.g., monkeys and various explanatory variables. If the track goes uphill, the explanatory variable slope will change, as will the type of trees, the type of vegetation, the tree volume, etc. Similar examples we have seen are salinity and temperature changing along a transect, or temperature, oxygen, and depth sampled at different depths in the sea in oceanographic studies. The solution is to choose a sampling design that avoids such problems. As an example of a constraint on the population we mention the length and weight of Trifur data again. Fish with high values for length tend to have high values for weight as well, and vice versa. This is inherent in the structure of the population. The easiest solution is to use either length or weight as the explanatory variable, but not both. As to model specification, a model with polynomial terms (X and X2) may also have high collinearity, especially if the range of X is small. An option is to centre the linear term. An over-defined model means that too many explanatory variables are used, relative to the sample size. To solve this problem, refine the number of explanatory variables. It is easy to tell a biologist to create a better experimental design, but we realise that field conditions may not always allow one to walk in any direction of a jungle, or tell the captain of a boat to go in circles instead of a straight line. Half of the time, field data are measurements obtained by opportunity [e.g., the deep sea data used in Zuur et al. (2009a) were taken whenever no other experiments were being conducted]. However, do realize that there is no point measuring every possible explanatory variable along a transect, line, or gradient if they are correlated. If you really want to use all collinear explanatory variables, then creating an index, or a new function of them, may be an option. For example, if X1, X2, and X3 are collinear, then perhaps a new function X4 = X1 × X2 × X3 can be created, and X4 can be used as the explanatory variable instead of the other three, provided X4 has a biological interpretation. In summary, think carefully before measuring explanatory variables, especially if money or time is involved. Choose an experimental design that minimises collinearity problems, and if collinearity occurs during the analysis, be prepared to drop some explanatory variables. Other signs of collinearity may appear if you apply a linear regression model (or a generalised linear or mixed model), and some of the parameters change sign and importance during a model selection process. Different results of a forward and backwards selection is another indication. Even a regression parameter with the wrong sign may indicate it (Montgomery and Peck 1992). But in all these situations you are already too late; deal with collinearity before starting the actual analysis! Further information on collinearity can be found in Montgomery and Peck (1992). You will see that they (and others) use the expression ‘multicollinearity’. They also show that collinearity not only affects the
variances of the estimated regression parameters but also their covariances. 5.12 Be stubborn and keep collinear covariates in the model? So what are the consequences of keeping collinear variables? The example presented below was published by Loyn (1987) and used by Quinn and Keough (2002) and Zuur et al. (2009a). It relates the density of forest birds in 56 forest patches in southeastern Australia to six explanatory variables: size of the patch area, distance to the nearest patch, distance to the nearest larger patch, grazing level (five levels indicating agriculture intensity), altitude of the patch, and year of isolation. The first three explanatory variables were log-transformed due to outliers. When we started exploring the data we made a boxplot of year of isolation conditional on grazing intensity (Figure 5.9). It showed that forest patches that were isolated around 1900 also have higher grazing intensity. So what happens if we ignore this collinearity and start with a model containing both variables at the same time?
Figure 5.9. Boxplot of year of isolation, conditional on factor graze. The following code loads the Loyn data and creates the boxplot > Loyn boxplot(YR.ISOL ~ factor(GRAZE), ylab = "Years of isolation", xlab = "Graze level", data = Loyn) We first apply the linear regression model containing all the covariates. Because we have a categorical variable (graze) we then execute the drop1 function to get a unique p-value for factor graze. The drop1 function drops one covariate in turn, compares the full model with the model in which one covariate was dropped, and applies an F test.
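A sketch of the model behind the drop1 output shown below follows. The file name, the name of the response variable (ABUND, the bird density), and the presence of the log-transformed covariates in the file are assumptions.

Loyn <- read.table(file = "Loyn.txt", header = TRUE)
M1 <- lm(ABUND ~ LOGAREA + LOGDIST + LOGLDIST + YR.ISOL + ALT + factor(GRAZE),
         data = Loyn)
drop1(M1, test = "F")   # drops each term in turn and applies an F test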
               Df Sum of Sq     RSS    AIC F value
<none>                       1714.43 211.60
LOGAREA         1    770.01  2484.44 230.38  20.660
LOGDIST         1      0.55  1714.98
LOGLDIST        1      5.19  1719.62
YR.ISOL         1      1.81  1716.24
ALT             1      7.47  1721.90
factor(GRAZE)   4    413.50  2127.92

Methane library(lattice) > library(maps) > xyplot(Latitude ~ Longitude | Ecosystem, data = Methane, layout = c(1,3), aspect = "iso", xlab = "Longitude", ylab = "Latitude", panel = function(x,y) { panel.points(x, y, pch = 16, col = 1) mp install.packages("ggplot2") Next we need to load the package. > library(ggplot2)
The first ggplot2 function we will illustrate is qplot, which stands for quick plot. Its syntax is similar to that of the plot function. > qplot(y = Flux, x = Temp, data = Methane) The resulting graph is presented in Figure 6.2. It is a typical ggplot2 graph with the grey background and white lines marking the grid cells (though this can easily be changed). We will discuss the ecological interpretation of the scatterplot later in this chapter. The data are collected from sites with three different ecosystems. Colours in the scatterplot for the three different ecosystems are easily implemented using > qplot(y = Flux, x = Temp, data = Methane, col = Ecosystem)
Figure 6.2. Scatterplot of temperature versus flux. The graph is obtained with the qplot function from the ggplot2 package.
The graph is not shown here. Note that this command is different from the plot function. The variable Ecosystem is a categorical variable and qplot will automatically allocate colours to the different levels. In a similar way different symbols can be implemented using the shape option.
> qplot(y = Flux, x = Temp, data = Methane, col = Ecosystem, shape = Ecosystem)
There are many more options in qplot. Instead of presenting some of these options we show the more advanced usage of ggplot2, namely via layers. This allows the user to build up a graph component by component. The following code produces exactly the same graph as in Figure 6.2.
> p <- ggplot(data = Methane, aes(x = Temp, y = Flux))
> p <- p + geom_point()
> p
The first line creates a graph and specifies what goes along the axes. But there is nothing to plot yet. The aes stands for aesthetics. We can add components (also called layers) to the graph, for example, points. This is via a so-called geom. There is a whole set of these geoms in ggplot2, for example, geom_boxplot, geom_histogram, geom_smooth, and geom_errorbar. Table 4.2 in Wickham (2009) gives an extensive list, and we will see some of these geoms later in this chapter. The geom_point() function on the second line in the code above adds the points and the last line shows the graph on the screen. We would like to improve visual interpretation by adding a smoother and adjusting the labels. This is done with the following code.
> p <- p + geom_smooth()
> p <- p + xlab("Temperature")
> p <- p + ylab("Flux")
> p
MyVar K MyData p p p p p Only the line with the facet_wrap function is new. This function is used to split up the graph in multiple plots, one per level of Var. You can specify the number of columns for the graph (the current graph has two rows and two columns), but facet_wrap tends to do a decent job with respect to this. The online supplemental material has a function Mydotplotggp2.R. We essentially put the four ggplot2 commands in a function and the user has to type > MyDotplot.ggp2(Methane, MyVar) to obtain the multipanel dotplot in Figure 6.5. Results suggest that temperature does not have any obvious outliers.
Figure 6.5. Multipanel dotplot produced with ggplot2. 6.2.4 Collinearity The next point we investigate is collinearity. Originally, the equivalent of the pairs function in ggplot2 was the plotmatrix function, but in later ggplot2 versions this function was replaced by ggpairs from the GGally package (Schloerke et al. 2014). The syntax is as follows. First we load the GGally package (which you need to download and install), then define the covariates (which contain a categorical variable), and run the ggpairs function. The resulting graph is presented in Figure 6.6. > library(GGally) > MyVar ggpairs(Methane[,MyVar], axisLabels = "show") There is quite some information in this graph: Pearson correlation coefficients in the upper diagonal panels, scatterplots in the lower diagonal panels, density curves along the diagonal, the variable names at the boundaries, boxplots shown for the combinations of a continuous and
categorical variable, and conditional histograms as well. This is too much information for our liking. Obviously, the use of this type of diagram is limited to a small number of covariates. Changing default settings (e.g., font size of the correlations, colours) requires Google searching, but in principle everything can be changed. It is also possible to replace the scatterplots by density curves or heat maps. Results indicate that there is moderate collinearity between temperature and latitude, and temperature seems to change per ecosystem.
Figure 6.6. Multipanel scatterplots, boxplots, and histograms for the covariates. 6.2.5 Relationships We already presented ggplot2 code for scatterplots between flux and temperature (Figure 6.3). We now show code to improve the quality of the graph. The smoother in Figure 6.3 is not presented very clearly, mainly because there are a lot of observations that are all printed as black dots. > p #p #p p p p p p The geom_point function with shape = "." is hashed out, but it is useful for a large dataset; observations are presented as small dots. The geom_point function with the alpha option is handy as well. It makes the points more transparent; the alpha = 1/5 means that five points on top of one another is identical to the ordinary black colour. We decided to settle for open circles and grey colour (see Figure 6.7), increase the thickness of the smoother, and suppress the plotting of the standard errors. To visualise the relationship between flux and ecosystem we use a boxplot; see Figure 6.8. Flux values in the rice paddy level seem to be higher. The code for the boxplot follows. We use the geom_boxplot function, and the Ecosystem variable was used in the aes() function. > > > > >
We can easily split the graph into one graph per ecosystem level by adding a facet layer, facet_grid(. ~ Ecosystem); see Figure 6.12.
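A sketch of how the maps in Figures 6.10 to 6.12 can be built with ggplot2 follows; the use of map_data("world") and the exact layer settings are assumptions.

library(ggplot2)
world <- map_data("world")            # requires the maps package

p <- ggplot()
p <- p + geom_polygon(data = world, aes(x = long, y = lat, group = group),
                      fill = "grey85", colour = "grey60")
p <- p + geom_point(data = Methane, aes(x = Longitude, y = Latitude), size = 1)
p <- p + xlab("Longitude") + ylab("Latitude")
p                                     # all sites on one map
p + facet_grid(. ~ Ecosystem)         # one panel per ecosystem level (Figure 6.12)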
Figure 6.10. Map of the world with sampling locations superimposed.
Figure 6.11. Map of the world with sampling locations.
Figure 6.12. Figure 6.11, but with facet grid(. ~ Ecosystem) added to obtain one panel for each ecosystem level.
6.3 Statistical analysis using linear regression 6.3.1 Model formulation The aim of the study is to find a relationship between flux and temperature and ecosystem. To be more precise, we focus on the question whether the relationship between flux and temperature differs per ecosystem level. This means that we need to include an interaction term. Our starting model is
\[ \mathrm{Flux}_i = \mathrm{Intercept} + \mathrm{Temp}_i + \mathrm{Ecosystem}_i + \mathrm{Temp}_i \times \mathrm{Ecosystem}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \qquad (6.1) \]
This is a multiple linear regression model. Fluxi is the observed flux for observation i, Tempi is the corresponding temperature, Ecosystemi is a categorical variable (with three levels), and εi is normally distributed noise. There is one major problem with this model; we have multiple observations from the same site (location), and flux observations at these sites are likely to be similar. This violates the most important underlying assumptions of linear regression models, namely independence. As a result of this violation estimated standard errors obtained by the linear regression model are too small. There are various options to continue. The easiest option is to take average values per site (average flux, average temperature) and use these averages in a linear regression model. But obviously we lose a lot of data if we do this. An alternative analysis is to apply a linear mixed effects model (Pinheiro and Bates 2000; Bolker 2008; Zuur et al. 2009a). This approach is discussed in Section 6.4. 6.3.2 Fitting a linear regression model We recommend applying the linear mixed effects model and not the linear regression model in Equation (6.1); however, we take site averages in this subsection to illustrate how to use some of the tools presented in this book for a linear regression model. In order to obtain the averages per site we use the following R code. > Methane.lm M1 summary(M1) Coefficients: (Intercept) Temp EcosystemRicePaddy EcosystemWetland Temp:EcosystemRicePaddy Temp:EcosystemWetland
Estimate Std. Error t value Pr(>|t|) -0.001982 0.033521 -0.059 0.9529 1.085127 0.095629 11.347 drop1(M1, test = "F") Single term deletions Model:Flux ~ Temp * Ecosystem Df Sum of Sq RSS AIC F value Pr(>F)
7.6022 -345.60 Temp:Ecosystem 2 0.54584 8.1481 -340.79 4.3439 0.01507
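The site averaging and the model fit summarised above could be carried out with a sketch like the following; the name of the site identifier (Site) and the use of aggregate are assumptions.

# Average flux, temperature, and coordinates per site
Methane.lm <- aggregate(cbind(Flux, Temp, Latitude, Longitude) ~ Site + Ecosystem,
                        data = Methane, FUN = mean)

M1 <- lm(Flux ~ Temp * Ecosystem, data = Methane.lm)
summary(M1)
drop1(M1, test = "F")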
As mentioned earlier, we will not explain what multiple linear regression is; we refer the reader to the many textbooks on this topic, e.g., Montgomery and Peck (1992), Draper and Smith (1998), Fox (2002), and Zuur et al. (2007). The drop1 function compares a model with and without the interaction term. The F-statistic indicates that the interaction is significantly different from 0 at the 5% level, though a p-value of 0.015 implies that we need to ensure that all underlying assumptions of the multiple linear regression model are met. To assess whether all
assumptions are indeed met we apply model validation in the next subsection. 6.3.3 Model validation of the linear regression model The underlying assumptions of the multiple linear regression model are homogeneity of variance, independence, and normality, among various other assumptions (Zuur et al. 2007). Homogeneity of variance Tools to assess homogeneity were discussed in Chapter 3. Here we plot the residuals of the linear regression model versus fitted values; see Figure 6.13. Homogeneity of variance means that the vertical spread should be similar along the horizontal axis. Our first impression is that the variation in the residuals is slightly larger for the larger fitted values, but a closer inspection suggests that this impression is only based on five or six points. We are slightly doubtful whether we have homogeneity or minor heterogeneity. Figure 6.13 was created with the following R code. > E1 F1 par(mar = c(5, 5, 2, 2)) > plot(x = F1, y = E1, xlab = "Fitted values", ylab = "Residuals", cex.lab = 1.5) > abline(h = 0) The rstandard function extracts standardised residuals. There are mathematical reasons to prefer standardised residuals over ordinary residuals (obtained with the resid function); see Zuur (2012). The fitted function extracts the fitted values of the regression model. The mar option controls the white space around the graph. The cex.lab is used to increase the font size of the labels, and the abline function draws a horizontal line at 0. Another graph to assess homogeneity is presented in Figure 6.14. We have plotted standardised residuals versus the levels of Ecosystem. The mean value of the residuals per level will be 0 as Ecosystem is used as a categorical covariate in the model (the horizontal bar in the boxplots represents the median and these can be different from 0). The residual variation per level should be similar. However, the results in Figure 6.14 seem to suggest that the variation in the Aquatic level is lower. It is interesting to redraw Figure 6.13 and use different colours for the different Ecosystem levels and see whether the (minor?) heterogeneity by the Ecosystem levels can be spotted. Instead of using a graph to assess heterogeneity we can apply one of the various tests for homogeneity that were discussed in Chapter 3. Another informal way to assess homogeneity is to calculate the residual variances per level: > tapply(E1, FUN = var, INDEX = Methane.lm$Ecosystem) Aquatic RicePaddy Wetland 0.3632011 4.2370747 0.7505010 The ratio between the largest and smallest variance is larger than 4, indicating heterogeneity (see Chapter 3); perhaps we should drop the word ‘minor’ in the phrase ‘minor heterogeneity’ mentioned earlier in this subsection. The following R code was used to create Figure 6.14. > par(mar = c(5, 5, 2, 2)) > boxplot(E1 ~ Ecosystem, xlab = "Ecosystem", ylab = "Residuals", data = Methane.lm, cex.lab = 1.5) > abline(h = 0)
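The residual checks just described can be collected in a short sketch, assuming the fitted model M1 and the site-averaged data frame Methane.lm from above.

E1 <- rstandard(M1)    # standardised residuals
F1 <- fitted(M1)       # fitted values

par(mar = c(5, 5, 2, 2))
plot(x = F1, y = E1, xlab = "Fitted values", ylab = "Residuals", cex.lab = 1.5)
abline(h = 0)

boxplot(E1 ~ Methane.lm$Ecosystem, xlab = "Ecosystem", ylab = "Residuals")
abline(h = 0)

tapply(E1, INDEX = Methane.lm$Ecosystem, FUN = var)   # residual variance per level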
Figure 6.14. Residuals versus the levels of the categorical covariate Ecosystem. Independence The most important assumption underlying regression models is independence. Violation of independence means that the standard errors produced by the software are too small, resulting in type I errors (stating wrongly that there is an effect of a covariate). We can distinguish two types of dependency: dependency due to model misfit and inherent dependency. The first source of dependency is due to using the wrong model, for example, modeling a nonlinear relationship as linear, or omitting an important covariate or interaction. To investigate whether this type of dependency (or model misfit) is present, take the standardised residuals of the model, plot them against each covariate in the model, each covariate not in the model, and inspect the graphs for any patterns. If there are clear patterns, then improve the model. Figure 6.15 shows the residuals of the linear regression model versus temperature. If groups of residuals are all above or below the horizontal line at 0, then we have a model misfit. This is not the case here.
Figure 6.15. Residuals plotted versus temperature for the linear regression model. The more difficult form of dependency is inherent dependency. With this we mean dependency in the response variable that we know a priori. Examples are multiple observations from the same animal, site, person, subject, class, school, observer, etc. Other examples are time series or observations from multiple sites close to one another (spatial correlation). We took averages per site and expected that this would take care of the fact that observations close to one another may have similar flux values. However, some of the sites are rather close to one another. Hence, there may still be spatial correlation. An informal way to detect spatial correlation is to make a scatterplot of latitude versus longitude and superimpose the residuals using different colours for positive and negative residuals, and different point sizes based on the value of the residuals; see Figure 6.16. If there is clustering of points with the same colour or the same size, there is spatial correlation in the residuals. One would hope not to see any patterns in Figure 6.16. In this example it is difficult to judge whether there is spatial correlation. This is partly because the points are scattered rather unevenly on the planet, and there is clustering of spatial locations of the sites, and as a result points are nearly on top of one another. A more formal tool to assess spatial correlation is a variogram.
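One way to obtain a sample variogram of the residuals is sketched below, using the sp and gstat packages; the package choice and the treatment of longitude and latitude as planar coordinates are assumptions.

library(sp)
library(gstat)

MyData <- data.frame(E1        = E1,
                     Longitude = Methane.lm$Longitude,
                     Latitude  = Methane.lm$Latitude)
coordinates(MyData) <- c("Longitude", "Latitude")

V1 <- variogram(E1 ~ 1, data = MyData)
plot(V1)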
Figure 6.16. Position of sites. The size of a dot is proportional to the value of the corresponding residual. Dark and grey colours are used for negative and positive residuals. The R code to create Figure 6.16 follows. We first load the world map and define two variables, MySize and MyCol. The first variable contains the point size and the second one contains the colour of the points. > > > > > > >
world OC names(OC) "ShellLength" "Month" "FeedingType" "FeedingPlot" Results of the str function are not shown here but the covariates Month, FeedingType, and FeedingPlot are imported as factors (they are character strings in the dataset). Next we load the required packages. > library(coefplot2) > library(ggplot2)
7.2 Data exploration As part of the data exploration we make boxplots of shell length conditional on feeding type, month, and feeding plot; see Figure 7.1. None of the three covariates shows a clear effect on shell length. The graphs were produced with the ggplot2 (Wickham 2009) package. The code for the first boxplot is as follows.
> p1 <- ggplot(OC, aes(x = FeedingType, y = ShellLength))
> p1 <- p1 + geom_boxplot()
> p1 <- p1 + xlab("Feeding type")
> p1 <- p1 + ylab("Shell length")
> p1
> table(OC$FeedingType)
Hammerers  Stabbers
      165        32
We stop the data exploration at this point, although results presented later in this chapter show that this is a premature decision.
7.3 Applying a linear regression model
As discussed in the introduction to this chapter, we apply a model in which shell length is modelled as a function of the three categorical covariates feeding type, feeding plot, and month, and all two-way interactions and the three-way interaction term. To fit such a model in R we use the following.
> M1 <- lm(ShellLength ~ FeedingType * FeedingPlot * Month, data = OC)
> drop1(M1, test = "F")
Results of the drop1 function are not presented here, but indicate that the three-way interaction term is significant at the 5% level (F2,185 = 6.09, p = 0.003). We therefore apply no further model selection steps. The output from the summary(M1) function is rather difficult to present due to the long names of the interaction terms. (Intercept) FeedingTypeStabbers FeedingPlotB FeedingPlotC MonthJan FeedingTypeStabbers:FeedingPlotB FeedingTypeStabbers:FeedingPlotC FeedingTypeStabbers:MonthJan FeedingPlotB:MonthJan FeedingPlotC:MonthJan FeedingTypeStabbers:FeedingPlotB:MonthJan FeedingTypeStabbers:FeedingPlotC:MonthJan
Estimate Std. Error t value Pr(>|t|) 2.208 0.076 29.0
> > >
> >
> >
> >
>
E1 MyData P1 > > >
MyData$Fit MyData$SE MyData$se.low MyData$se.up
coefplot2(coefs = MyData$Fit, sds = 0 * MyData$SE, varnames = paste(MyData$FeedingType, MyData$FeedingPlot, MyData$Month, sep = " "), main = "Fitted values",
lower1 = MyData$Fit -1.96 * MyData$SE, upper1 = MyData$Fit +1.96 * MyData$SE, cex.var = 1.2, xlim = c(1.2, 3.5))
7.5 Trouble It is tempting to stop the analysis at this point and write an ecological story around the results. Such a story would contain a remark about the relatively large fitted values for the stabbers at location A in December. However, as a final graph we want to reproduce a graph similar to Figure 7.4, but now we want to superimpose the observed data. The resulting graph is presented in Figure 7.5. Each panel contains the observed data for a specific feeding plot and month combination, and the feeding type data are plotted. The thick dot is the predicted value and 95% confidence intervals are added. Note that for the December data at plot A for stabbers there are no points visible. Well, we do have observations for this combination, namely three. But their values are identical, and the predicted value is printed on top of the three observed values. Hence, these three observations most likely caused the three-way interaction to be significant. The Cook’s distance function did not pick up any single influential observation as we have three of them. It is up to the scientist to decide whether a model in which only three observations determine the significance of a three-way interaction term is useful.
Figure 7.5. Fitted values and 95% confidence intervals obtained by the linear regression model.
Actually, we could have known this problem if we had done a more detailed data exploration. The results from the table function applied on all three categorical covariates show that for stabbers at location A in December we only have two observations.
> table(OC$Month, OC$FeedingPlot, OC$FeedingType)
, , = Hammerers

       A  B  C
  Dec 17 14 26
  Jan 43 31 34

, , = Stabbers

       A  B  C
  Dec  2  5 15
  Jan  4  3  3
We finally discuss how we created Figure 7.5. We first create an empty ggplot graph, then add the x- and y-labels (with text size 15) and the observed data.
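A sketch of how a figure like Figure 7.5 could be built is given below. The helper data frame MyData (one row per combination of feeding type, feeding plot, and month, with the fitted value Fit and its standard error SE) and the exact layer choices are assumptions, not the code used for the figure.

library(ggplot2)

p <- ggplot()
p <- p + xlab("Feeding type") + ylab("Shell length")
p <- p + theme(text = element_text(size = 15))
p <- p + geom_point(data = OC, aes(x = FeedingType, y = ShellLength),
                    position = position_jitter(width = 0.1))
p <- p + geom_point(data = MyData, aes(x = FeedingType, y = Fit), size = 4)
p <- p + geom_errorbar(data = MyData,
                       aes(x = FeedingType,
                           ymin = Fit - 1.96 * SE,
                           ymax = Fit + 1.96 * SE),
                       width = 0.2)
p <- p + facet_grid(Month ~ FeedingPlot)
p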