Visualization and Imputation of Missing Values: With Applications in R (Statistics and Computing) 3031300726, 9783031300721

This book explores visualization and imputation techniques for missing values and presents practical applications using


English Pages 484 [478] Year 2023

Table of contents :
Preface
Limitations
Mathematical Notation
Code Excerpt at GitHub
Acknowledgement
Reference
Contents
1 Topic-Focused Introduction to R and Data sets Used
1.1 Resources and Software for the Imputation of Missing Values
1.1.1 Amelia
1.1.2 mi
1.1.3 mice (and BaBooN)
1.1.4 missMDA
1.1.5 missForest and missRanger
1.1.6 robCompositions
1.1.7 VIM
1.2 The Statistics Environment R
1.3 Simple Calculations in R
1.4 Installation of R and Updates
1.5 Help
1.6 The R Workspace and the Working Directory
1.7 Data Types
1.8 Generic Functions, Methods, and Classes
1.9 A Note on Functions with the Same Name in Different Packages
1.10 Basic Data Manipulation with the dplyr Package
1.10.1 Pipes
1.10.2 dplyr—tibbles
1.10.3 dplyr—Selection of Rows
1.10.4 dplyr—Order
1.10.5 dplyr—Selection of Columns
1.10.6 dplyr—Uniqueness
1.10.7 dplyr—Creating Variables
1.10.8 dplyr—Grouping and Summary Statistics
1.10.9 dplyr—Window Functions
1.11 Data Manipulation with the data.table Package
1.11.1 data.table—Variable Construction
1.11.2 data.table—Indexing/Subsetting
1.11.3 data.table—Keys
1.11.4 data.table—Fast Subsetting
1.11.5 data.table—Calculations in Groups
1.12 Data Sets
1.12.1 Census Data from UCI
1.12.2 Airquality
1.12.3 Breast Cancer
1.12.4 Brittleness Index
1.12.5 Kola C-horizon Data
1.12.6 Colic Horse Data
1.12.7 New York Collision Data
1.12.8 Diabetes
1.12.9 Austrian EU-SILC Data
1.12.10 Food Consumption
1.12.11 Pulp Lignin
1.12.12 Structural Business Statistics Data
1.12.13 Mammal Sleep Data
1.12.14 West Pacific Tropical Atmosphere Ocean Data
1.12.15 Wine Tasting and Price of Wines
1.12.16 Further Data Sets
References
2 Distribution, Pre-analysis of Missing Values and Data Quality
2.1 Introduction
2.2 How Does Missing Data Arise?
2.2.1 Surveys in Official Statistics and Surveys Obtained with Questionnaires
2.2.2 Comment on Structural Zeros and Non-applicable Questions in a Questionnaire
2.2.3 Missing Values from Measuring Experiments
2.2.4 Censored Values
2.2.5 Monotone Missingness
2.3 Missing Value Mechanisms
2.3.1 Missing at Random (MAR)
2.3.2 Missing Completely at Random (MCAR)
2.3.3 Missing Not at Random (MNAR)
2.3.4 Example
2.3.5 Summary on MCAR, MAR, and MNAR
2.4 Limitations for the Detection of the Missing Value Mechanisms
2.5 Kinds of Attributes
2.5.1 Binary and Nominal Variables and Related Distances
2.5.2 Ordered Categorical Variables
2.5.3 Count Variables, Continuous Variables, Semi-continuous Variables, and Related Distances
2.5.4 The Gower Distance
2.6 Data Quality and Consistency of Data
2.6.1 Outliers
2.6.1.1 Outliers in Relation with Other Data Problems
2.6.1.2 What Is an Outlier and When an Outlier Should Be Deleted and Imputed?
2.6.1.3 Univariate Methods
2.6.1.4 Multivariate Methods
2.6.2 Rule-Based Approaches for Checking the Consistency of Data
2.6.3 Localization of Inconsistencies and Errors
References
3 Detection of the Missing Values Mechanism with Tests and Models
3.1 Introduction
3.2 A Simple t-Test for MCAR
3.3 Non-parametric Version
3.4 Extension to the Multiple Case
3.5 Little's Test on MCAR and Extensions
3.6 Further Tests
References
4 Visualization of Missing Values
4.1 Motivation
4.1.1 Why to Apply Visualization Methods
4.1.2 Software
4.2 Rough Summaries and the Aggregation Plot
4.3 Histogram, Barplot, Spinogram, and Spine Plot
4.4 Parallel Boxplots
4.5 Scatterplots
4.6 Scatterplot Matrices
4.7 Scatterplot Faceting
4.8 Parallel Coordinate Plot
4.9 Matrix Plot
4.10 Visualizing Missing Values in Multivariate Categorical Data with Mosaic Plots
4.11 Missing Values in Maps
4.12 Studying Dropouts in Longitudinal Cohort Studies
4.13 Summary
References
5 General Considerations on Univariate Methods: Single and Multiple Imputation
5.1 A Few Words on Listwise Deletion; Deletion of Observation with Missing Values
5.1.1 Bias in Complete Case Analysis
5.2 Univariate Imputation Methods
5.2.1 Imputation with the Mean
5.2.2 Bias for Mean Imputation
5.2.3 Univariate Mean Imputation for Non-continuous Variables
5.3 What Is Single Imputation?
5.4 What Is Multiple Imputation?
5.4.1 General Steps in Multiple Imputation
5.4.2 Benefits
5.4.3 Requirements
5.4.4 How Many Imputations?
5.5 Single Imputation Versus Multiple Imputation
5.5.1 Why Standard Errors in Single Imputation Are Not Quite Correct?
5.6 Pooling of Multiple Imputation Results
5.6.0.1 Example
5.7 General Concepts to Allow for Randomness of Imputations
5.8 Joint Modeling, Distribution Fitting, and Copulas
5.9 The EM Algorithm and Fully Conditional Modeling in the Multivariate Case
References
6 Deductive Imputation and Outlier Replacement
6.1 Correction of Errors, But Where?
6.2 Correct Typos in Continuous Variables
6.3 Adjusting Values After Imputation to Match Restrictions
6.4 Imputing Outliers and Erroneous Values
References
7 Imputation Without a Formal Statistical Model
7.1 Hot-Deck Imputation
7.1.1 Random Hot-Deck
7.1.2 Random or Sequential Hot-Deck with Constraints
7.1.3 Sequential Hot-Deck
7.1.4 An Application of Hot-Deck
7.1.5 Computational Time Revisited
7.1.6 Some Remarks on Hot-Deck Imputation
7.2 k Nearest Neighbor Methods
7.2.1 Distance Calculation Within kNN and Weighting
7.2.2 Illustration of kNN Imputation
7.2.3 Application of kNN
7.2.4 Weighting the Variables
7.2.5 Random k Nearest Neighbor Imputation
7.3 Covariance-Based Methods
7.3.1 Principal Component Analysis
7.3.2 Imputation with Principal Component Analysis
7.3.3 Multiple Imputation with PCA Imputation
7.3.4 A Simple Example to Compare the Approaches
References
8 Model-Based Methods
8.1 Linear Regression
8.2 Robust Linear Regression
8.2.1 Regression M Estimator
8.2.2 Weight Functions
8.2.3 Regression S Estimator
8.2.4 MM-Estimator
8.3 Robustness in Regression Imputation
8.3.1 Why Not Use Expected Values from a Model?
8.4 Making Regression-Based Imputation Methods Fit for Multiple Regression
8.4.1 (Normal) Noise Added to Predicted Values
8.4.2 Bootstrap Residuals and Addition to the Predicted Values of the Missings
8.4.3 Additional Consideration of Model Uncertainty Through Bootstrapping
8.4.4 Additional Consideration of Model Uncertainty Through Bayesian Regression
8.5 Predictive Mean Matching (PMM)
8.6 Weighted Donor Selection with Midastouch
8.7 EM-Based Imputation and Extensions to Non-continuous Variables
8.8 Robust Stepwise Sequential EM-Based Imputation with IRMI
8.9 Enhancement of IRMI: Robust Multiple Imputation Using imputeRobust
References
9 Nonlinear Methods
9.1 Tree-based Methods Using Random Forests
9.1.1 Imputation Using Random Forests
9.2 Tree-Based Methods Using XGBoost
9.2.1 Imputation Using XGBoost
9.3 Generalized Additive Models
9.3.1 Splines
9.3.2 Piecewise Polynomials
9.3.3 GAMs
9.3.4 Imputation with GAMs Using Thin Plate Regression Splines
9.3.5 Imputation with GAMs for Location, Scale, and Shape
9.4 Artificial Neural Network–Based Methods
9.4.1 Fully Conditional Modelling with Artificial Neural Networks
9.4.2 Artificial Neural Networks to Impute Missing Values
9.4.2.1 Choice of Hyperparameters
9.4.2.2 Application in R
9.4.3 Extension to Impute Rounded Zeros in Compositional Data
9.4.3.1 Application in R
9.4.3.2 Performance Overview
9.4.4 Imputation Using GAN (GAIN)
References
10 Methods for Compositional Data
10.1 What are Compositional Data?
10.1.1 Negative Bias
10.1.2 The Simplex Sample Space
10.1.3 Absolute or Relative Information
10.1.4 Log-ratio Coordinate Representation
10.1.5 Requirements of a Compositional Analysis
10.1.6 Are Outliers Also Relevant for Compositional Data?
10.2 Different Types of Missing Information
10.3 Imputation of Missing Values
10.3.1 k-nearest Neighbor Imputation
10.3.2 Iterative Model-Based Imputation
10.3.3 Using the R-package robCompositions for Imputing Missing Values
10.3.4 Comments
10.4 Imputation of Rounded and Count Zeros
10.4.1 Imputation of Rounded Zeros
10.4.2 Model-Based Replacement of Rounded Zeros
10.4.3 An Artificial Neural Network Approach to Impute Rounded Zeros
10.4.4 Rounded Zeros in High-Dimensional Data
10.4.5 Application in R
10.4.6 Count Zeros
10.5 Compositional Approach for Structural Zeros
10.5.1 Avoiding Zeros by Amalgamation
10.5.2 Imputation of Structural Zeros as Auxiliary Step
10.5.3 Application in R
References
11 Evaluation of the Quality of Imputation
11.1 Visual Inspection of Imputed Values
11.1.1 Aggregation Plots of Missings and Imputed Values
11.1.2 Comparing Complete Cases with Imputed Data
11.1.3 An Example of a Univariate Plot: Histogram with Imputed Values
11.1.4 An Example of a Multiple Plot: Marginplot with Imputed Missings
11.2 Parallel Coordinate Plot for Imputed Values
11.3 Biplot Analysis of Imputed Data
11.4 Inspection of Multiple Imputed Data Sets
11.5 Tours
11.6 How to Apply Evaluation Measures
11.7 Evaluation Measures: Precision
11.7.1 Mean Absolute Percentage Error (MAPE)
11.7.2 Normalized Root Mean Squared Error (NRMSE)
11.7.3 Considering Continuous and Semi-Continuous Variables
11.7.4 Percentage of Falsely Classified Entries
11.7.5 Illustration of Precision Measures
11.7.6 Compositional Error Deviation
11.8 Evaluation Measures: Correlation-Based Measures
11.9 Evaluation Measures Based on Estimators
11.9.1 Bias, Variance and Mean Squared Error of an Estimator
11.9.2 Coverage Rate
11.10 Evaluation Based on Prediction Performance
11.11 Final Comments
References
12 Simulation of Data for Simulation Studies
12.1 Introduction
12.2 Type of Simulation
12.2.1 Insertion of Missing Values in Real Data
12.2.2 Model-Based Simulation
12.2.3 Design-Based Simulation
12.3 Inserting Missing Values
12.3.1 Simulating MCAR Data (Model-Based)
12.3.2 Creating Patterns of Missingness
12.3.3 Simulating MAR Data
Probabilities Based on the Magnitude of Values
Example
12.3.4 Simulating MNAR Data
12.4 Simulating Data Using Covariances (Model-Based)
12.5 Simulating an Additional Variable
12.6 Simulating (High-Dimensional) Data with a Latent Model
12.7 A Design-Based Simulation Study Based on a Complex Survey from a Finite Population
12.7.1 Setup of the Structure
12.7.2 Simulation of Additional Variables
12.7.3 Example in R
12.8 Prediction Performance Simulation
12.9 Final Comments
References
Index

Statistics and Computing

Matthias Templ

Visualization and Imputation of Missing Values With Applications in R

Statistics and Computing Series Editor Wolfgang Karl Härdle, Humboldt-Universität zu Berlin, Berlin, Germany

Statistics and Computing (SC) includes monographs and advanced texts on statistical computing and statistical packages.

Matthias Templ Institute for Competitiveness and Communication University of Applied Sciences and Arts Northwestern Switzerland Olten, Switzerland

ISSN 1431-8784
ISSN 2197-1706 (electronic)
Statistics and Computing
ISBN 978-3-031-30072-1
ISBN 978-3-031-30073-8 (eBook)
https://doi.org/10.1007/978-3-031-30073-8
Mathematics Subject Classification: 62-01, 62-02, 62F35, 62H99, 62H12

© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.

Preface

Data often contain missing values, and the reasons for their occurrence are manifold. Missing values arise when measurements fail, when respondents do not answer survey questions, or when analysis results are lost. They also replace erroneous values when measurements are implausible and inconsistent with prior knowledge of the data, i.e., when a value does not comply with predefined editing rules (validity checks).

Examples of missing values in the natural sciences are failed measurement units for groundwater quality or temperature, lost soil samples in geochemistry, or soil samples that would have to be analyzed anew but are exhausted. Examples of missing values in official statistics are respondents who refuse to provide information about their income, or small companies that do not report their sales. In a medical study, a patient may die and therefore drop out of the study.

Missing values can have a strong influence on resulting figures and analyses. For example, indicators in official statistics can be biased and their variance may be underestimated if missing values and their structure (best analyzed with appropriate visualization tools) are ignored. However, such indicators are of great interest to policy makers, who depend on the correct estimation of indicators. To ensure the quality of the estimates, missing values should be imputed using advanced methods.

Planning to analyze data that include missing values involves answering the following questions:

• What kind of missing data is present in the data?
• What mechanism do these missing data follow?
• Which imputation strategy is appropriate for these data?
• Is there an accepted model that these data follow?

• How many missing values are present in the data set? What does a sensitivity analysis show, e.g., when comparing the study results with and without imputed data (possibly using different imputation methods)?
• Is the aim to train and use a predictive model with high predictive power? Or is the main focus on statistical uncertainty, e.g., on estimating some key indicators and their corresponding confidence intervals? Do these aims call for multiple imputation, or is a single imputation the better, more practical way?
• Are there project-related time constraints for the analysis and imputation of missing values using sophisticated visualization and imputation methods?

Clarifying these questions helps to improve your analysis results and data quality, and it is the aim of this book to investigate these questions in detail.

The book was written with considerable freedom from conventions about what should be followed or mentioned for historical or other reasons. We clearly distance ourselves from simple imputation procedures whose methodology consists of replacing all missing values of a variable with an empirical location measure of the observed values, usually the mean or the median, or, in the case of non-quantitative characteristics, the median or mode. In addition, this monograph extends the existing literature in several respects.

This book is not just about multiple imputation: Textbooks and monographs on missing values and imputation methods often follow only the paradigm of multiple imputation. This book, however, is not just about multiple imputation methods, in which single imputation methods are used more than once to create multiple copies of the imputed data set in order to properly estimate the variance of the estimated parameters. While multiple imputation is theoretically sound and often the best way to correctly estimate the variance of an estimator, it is often not applicable in practice.
Reading the previous statement, many researchers will probably think: “What? Is he crazy? Multiple imputation is the standard!” But have you ever seen a data set in statistical software or on a website that is provided as multiple (multiply imputed) copies? One can imagine, for example, that in a large organization data move from department to department and are often linked to other data sets or information. Analysts in the various departments would not be happy with multiply imputed data. In other words, multiple imputation is a theoretically sound concept that works in practice only when the same person prepares and analyzes the data and produces results from it, a start-to-end product, so to speak. This is also often the thinking of computer scientists or people working on a project in isolation. In practice, however, multiply imputed data are often unwieldy for very practical reasons. Furthermore, the variance of the estimators under multiple imputation may differ significantly from that under single imputation only when the rate of missing data is very high. That is why we also focus in great detail on single imputation methods. In any case, a multiple imputation framework relies on single imputation methods, i.e., single imputation methods produce multiply imputed records if their imputations include some randomness.

This book is not just about imputation of missing values in a target variable/response: We want to impute missing values not only in the target variable, but also in all important covariates and predictors. In most cases, the goal is to create a complete data set (or, in the case of multiple imputation, several complete data sets) without any missing values, at least in the most important variables if not in the whole data set.

This book is not just about fancy model-based multiple imputation methods: There is also a lot of emphasis on traditional donor methods such as hot-deck and k-nearest neighbor methods. These methods are simpler than methods based on a statistical model, but they are applicable in many real-world situations and useful for practitioners. In many situations they perform better than, or almost as well as, complex model-based methods, and they can be applied more generally without the need for an explicit statistical model.

This book is not just about imputation methods for standard data sets: We also consider special kinds of data sets, e.g., compositional data, for which other methods are applied than for standard data sets. In addition, we discuss in detail the imputation of data sets with mixed-scale variables. A data set may contain a mixture of continuous, semi-continuous, binary, nominal, ordinal, or count variables. In many domains, data contain variables with these different distributions.
Data from official statistics, social statistics, or econometrics consist of a variety of differently scaled variables, and in other fields such as metabolomics, to name just one additional domain, metadata on a person's age, body mass index, gender, country, etc. provide essential information. For example, demographic variables such as gender, age, region, economic status or industry, nationality, and household size are present in the data, as are variables on income and income components. For the imputation of missing values in a variable, all these variables with different scales can be used, as they provide essential information. Only a few methods can handle different variable types, and we discuss the problems of variables of mixed types.

This book is not just about classical statistical methods: Outliers and/or misclassified objects may influence a (classical) imputation method so that the imputations become arbitrary. We also discuss robust methods for imputation that bound the influence of outliers and misclassified objects.

Why would we apply robust methods? In the real world, data sets are interspersed with outliers (extreme but true values), erroneous values, and/or misclassified objects. These can affect classical statistical methods for imputation. Most well-known methods such as regression imputation, random forest imputation, predictive mean matching, and artificial neural network imputation, at least as they are implemented today, can be strongly influenced by outliers, erroneous values, or misclassified objects and therefore perform poorly in many situations. Removing outliers in a multivariate setting is difficult and never a good idea anyway, but robust statistical methods help us here. Indeed, outliers and robust imputation methods play an important role in this book. The good news is that if a data set does not contain outliers or misclassifications, robust imputation methods give about the same results as non-robust methods, whereas if outliers or misclassifications are present, only robust methods give reliable imputations. So you are on the safe side if you use robust methods. In this book, the use of robust procedures is strongly recommended.

This book is not just about linear relationships in data: In a few cases, neither a transformation of variables, nor the inclusion of quadratic or cubic terms, nor the consideration of interactions between variables in a model, nor the inclusion of newly generated variables (e.g., describing a trend reversal or structural breaks) can establish a linear relationship between a target variable and explanatory variables.
In this case, non-linear methods such as imputation with GAMs, random forests, or deep artificial neural networks are used. Even more common in practice is the case of missing values in several variables and the use of automated procedures to fill them sequentially, variable by variable. The check of whether a linear relationship exists or can be established is then often skipped, and the use of non-linear methods can also be advantageous.

This book is mostly not about joint modeling and GANs: Joint modeling approaches focus on learning the joint distribution of multiple random variables, either through distribution fitting and/or modeling distributions using copulas or, more recently and more successfully when the data are large enough, through adversarial training with generative adversarial networks (GANs). In particular, GANs are successfully used to model large data sets with relatively simple relationships between variables; they are less successful for complex data sets with mixed-scale distributions, logical relationships, hierarchies and clusters (e.g., people in households), outliers, and/or many variables. The focus of this book is therefore more on conditional modeling, i.e., imputation of one variable taking into account other variables, but also on model-free methods such as k-nearest neighbor imputation.

This book is not just about imputation methods: Topics such as the visualization of missing values and imputed values are discussed in detail. Visualization makes it possible to observe the data structure and the mechanism of the missing values, and it gives many insights into the data set. We strongly recommend: Do not apply any test, model, or imputation method before you have visualized your data set to learn about possible data problems and about the (multiple and multivariate) structure of your data.

This book is not just about imputation methods and analysis of missing values: We devote a lot of space to evaluating imputation results, with both visual and numerical methods. In the literature, simulations (e.g., to compare imputation methods) are often applied in an oversimplified manner. We dedicate a chapter to simulation studies and to how to set up a simulation study to evaluate the influence of missing value imputation. In the literature, we also see a limited choice of evaluation metrics. In this book, evaluation metrics are discussed extensively and in the form of a critical review.
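A simulation of the kind described above can be sketched in a few lines of base R. The following is only an illustrative sketch, not code from the book: it generates complete bivariate data, deletes values completely at random, imputes them once with the mean and once with a regression prediction plus normal noise, and then compares a precision measure with two distributional properties. The object names, the sample size, the 30% missing rate, and the normalization of the RMSE by the standard deviation of the true values are all assumptions made for this example.

```r
# Minimal simulation sketch (all settings are illustrative assumptions).
set.seed(123)
n <- 500
x <- rnorm(n, mean = 10, sd = 2)
y <- 5 + 1.5 * x + rnorm(n, sd = 2)        # complete "true" data

# Insert missing values completely at random (MCAR) into y
miss <- sample(n, size = 0.3 * n)
y_obs <- y
y_obs[miss] <- NA

# Method 1: mean imputation
y_mean <- y_obs
y_mean[miss] <- mean(y_obs, na.rm = TRUE)

# Method 2: regression imputation with normal noise added to the predictions
fit <- lm(y_obs ~ x)                        # lm() drops incomplete rows by default
pred <- predict(fit, newdata = data.frame(x = x[miss]))
y_reg <- y_obs
y_reg[miss] <- pred + rnorm(length(miss), sd = summary(fit)$sigma)

# RMSE on the imputed cells, normalized by the sd of the true values (assumed form)
nrmse <- function(true, imputed) sqrt(mean((true - imputed)^2)) / sd(true)

results <- data.frame(
  method   = c("mean", "regression + noise"),
  nrmse    = c(nrmse(y[miss], y_mean[miss]), nrmse(y[miss], y_reg[miss])),
  variance = c(var(y_mean), var(y_reg)),
  cor_xy   = c(cor(x, y_mean), cor(x, y_reg))
)
print(results)
print(c(true_variance = var(y), true_cor = cor(x, y)))
```

In a setup like this, mean imputation shrinks the variance of the completed variable and weakens its correlation with x, which a precision measure alone does not reveal. This is one reason to look at several evaluation metrics rather than a single one.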

Limitations

There are other types of data sets that we do not cover in this book:

• Time series: Only for very specific types of time series would we use time series analysis methods to impute missing values in the time series. In general, if a target variable is a time series, we can still fit a linear regression model to it if we take into account important auxiliary variables that reduce autocorrelation. Instead of using time series methods such as ARIMA, in practice we often use a regression model to explain the time series and also use this model to predict missing values in the target time series. Alternatively, one can use deep learning methods, which we also discuss in this book. We refer to other publications that deal with the imputation of time series, e.g., Moritz and Bartz-Beielstein (2017).
• Unstructured data: We do not discuss the imputation of text or sound data, or the inpainting of pictures or video data. The methods used for unstructured data are mostly based on deep learning approaches and are successfully used when the data are large and well labeled.

• Longitudinal data: Imputation of longitudinal data is not explicitly addressed here. In practice, we often deal with this situation by combining historical information with our current data to extend our data set. In this way, all the methods discussed in our book are applicable.
• Imputation with generative adversarial networks (GANs): GANs are widely used nowadays for imputation, and especially for generating synthetic data. However, great care must be taken in training the parameters, and there is no general solution or library for imputation, at least for mixed-scale data. In other words, a lot of time may be required for parameter tuning for each data set; otherwise the GAN is not competitive with standard imputation approaches and may also converge to a false optimum.

Figure 1 shows important concepts and methods covered in this book and their dependencies.

Step 1 is about the visualization of missing values (and also tests, although they are less successful than visualization tools). The aim is to learn about the distribution of your variables and the multivariate dependencies in your data set in the presence of missing values, and also about the structure of the missing values and the multivariate dependencies of their occurrence. Different types of data sets are also mentioned.

Step 2 is about imputation concepts and imputation methods. We discuss a number of methods (only the most important ones are included in the diagram) and distinguish between model-based methods and imputation methods that are not based on a model. We also distinguish between non-linear and linear methods. Note that the concept of multiple imputation requires single imputation methods, and that a model may incorporate some kind of randomness, whereas other methods are deterministic.
In addition, a second source of uncertainty, which basically accounts for model uncertainty when we work with sample data, can be accounted for by a bootstrap, for example (Bayesian regression would be another method). Some of the methods allow for multiple imputation, and the results then need to be pooled.

Step 3 deals with all aspects after the imputations have been performed. The imputations should be evaluated, both by visualization and by numerical evaluation. Often a simulation setup is required, e.g., when estimating coverage rates. If this visual and numerical evaluation shows that the imputations have been performed well, the data set(s) can be analyzed using statistical methods developed for complete data.
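The interplay of the two sources of uncertainty and the pooling step can be illustrated with a small base R sketch. It is an assumption-laden illustration rather than the book's implementation: model uncertainty is represented by refitting a linear model on a bootstrap sample of the observed rows, imputation uncertainty by adding normal noise to the predictions, and the m results are pooled with the combining rules commonly attributed to Rubin. The simulated data, the choice m = 20, and the target quantity (the mean of y) are hypothetical.

```r
# Sketch: multiple imputation via bootstrap + noise, pooled with Rubin's rules.
set.seed(42)
n <- 300
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 90)] <- NA                      # 30% missing values in y (MCAR)
dat <- data.frame(x = x, y = y)
obs <- !is.na(dat$y)
n_mis <- sum(!obs)

m <- 20                                     # number of imputed data sets (assumed)
est <- numeric(m)                           # estimate of mean(y) per imputed data set
within_var <- numeric(m)                    # estimated variance of that estimate

for (i in seq_len(m)) {
  # Model uncertainty: refit the model on a bootstrap sample of observed rows
  boot <- dat[sample(which(obs), replace = TRUE), ]
  fit <- lm(y ~ x, data = boot)
  # Imputation uncertainty: add normal noise to the predicted values
  imp <- dat
  imp$y[!obs] <- predict(fit, newdata = dat[!obs, ]) +
    rnorm(n_mis, sd = summary(fit)$sigma)
  est[i] <- mean(imp$y)
  within_var[i] <- var(imp$y) / n
}

# Pooling: combine the m estimates and their variances
q_bar <- mean(est)                          # pooled point estimate
u_bar <- mean(within_var)                   # average within-imputation variance
b <- var(est)                               # between-imputation variance
total_var <- u_bar + (1 + 1 / m) * b        # total variance
pooled <- c(estimate = q_bar, se = sqrt(total_var))
print(pooled)
```

Dropping the noise term would turn each run into a deterministic regression imputation, and dropping the bootstrap would ignore the uncertainty of the fitted coefficients. In both cases the between-imputation variance b would understate the uncertainty caused by the missing values.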

Fig. 1: An overview of the contents of this book, including some dependencies

Mathematical Notation

Mathematical symbols are italicized. However, vectors and matrices are shown in Roman script (not italics) and in boldface, with vectors in lowercase and matrices in uppercase. Thus, $\mathbf{x}$ would be a vector and $\mathbf{X}$ would be a matrix. Element $i$ of $\mathbf{x}$ is often denoted $x_i$, and the element in row $i$ and column $j$ of $\mathbf{X}$ is hence $x_{ij}$. A vector $\mathbf{a}$ will usually be a column vector unless otherwise specified. Generic random variables are usually denoted by uppercase letters (e.g., $X$), with possible numeric values represented by the corresponding lowercase letter ($x$). A statistical distribution is denoted in lowercase italicized letters, e.g., $X \sim N(\mu, \sigma^2)$ for a random variable $X$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. To help emphasize the distinction between variables and parameters, lowercase (and Greek) letters designate parameters of a model.

Code Excerpt at GitHub

The GitHub repository https://github.com/matthias-da/vimbook-code contains the essential code seen in this book to save copying and pasting. The code is of limited use without the explanations in the book, as the comments on this code can only be found in the book. There are two exceptions: the code for impGAM and deepImp can only be found in the book. These functions are planned to be released in 2023/24, but are not yet part of the package VIM. Note also that there is not much code on the application of the methods themselves, since the aim of the book is to explain the methods, while the application in R is often trivial (e.g., VIM::irmi(x), with x a data frame with missing values and sensible defaults for the other function arguments).
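As a hedged illustration of how short such a call can be, the sketch below applies VIM::irmi() to the mammal sleep data that ships with VIM. It assumes that the VIM package is installed and is skipped otherwise. Setting imp_var = FALSE (so that no indicator columns are appended) is a choice made for this example; all other arguments are left at their defaults.

```r
# Sketch only: runs the imputation if the VIM package is available.
if (requireNamespace("VIM", quietly = TRUE)) {
  data(sleep, package = "VIM")              # mammal sleep data shipped with VIM
  n_missing_before <- sum(is.na(sleep))
  # Iterative model-based imputation with default settings;
  # imp_var = FALSE returns only the original columns
  imputed <- VIM::irmi(sleep, imp_var = FALSE)
  n_missing_after <- sum(is.na(imputed))
  print(c(before = n_missing_before, after = n_missing_after))
} else {
  message("Package 'VIM' is not installed; skipping the example.")
}
```

Counting the missing values before and after is only a basic sanity check. Whether the imputed values are plausible still has to be assessed with visual and numerical evaluation.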

Acknowledgement

My special thanks go to Alexander Kowarik and Andreas Alfons for their many contributions to the R package VIM. Without them, this book would look very different and would omit many interesting methods. I would also like to thank Markus Ulmer, who was involved in the simulation studies and improved some graphics. His encouragement, critical comments, and burning interest motivated me as I worked on this book. My thanks also go to the Institute for Data Analysis and Process Design at the Zurich University of Applied Sciences in Switzerland, in particular to Andreas Ruckstuhl and Jürg Hosang. The Institute's financial support in the form of funded working hours for the production of the book has contributed to its success. I would also like to thank Veronika Rosteck from Springer for her patience, encouragement, motivation, and professionalism. It is just great to work with her and to have her as a supervising editor. Finally, my thanks go to my wife Barbara, who has always lovingly supported me and who gave me the freedom to spend many evenings and nights working on this book. Thanks also to our cat and her nightly motivational help. Because of her, it was often impossible to stop writing this book.

Olten, Switzerland

Matthias Templ

Reference

Moritz, S., and T. Bartz-Beielstein. 2017. “imputeTS: Time Series Missing Value Imputation in R.” The R Journal 9 (1): 207–18. https://doi.org/10.32614/RJ-2017-009.

Contents

Preface
1 Topic-Focused Introduction to R and Data sets Used
1.1 Resources and Software for the Imputation of Missing Values
1.1.1 Amelia
1.1.2 mi
1.1.3 mice (and BaBooN)
1.1.4 missMDA
1.1.5 missForest and missRanger
1.1.6 robCompositions
1.1.7 VIM
1.2 The Statistics Environment R
1.3 Simple Calculations in R
1.4 Installation of R and Updates
1.5 Help
1.6 The R Workspace and the Working Directory
1.7 Data Types
1.8 Generic Functions, Methods, and Classes
1.9 A Note on Functions with the Same Name in Different Packages
1.10 Basic Data Manipulation with the dplyr Package
1.10.1 Pipes
1.10.2 dplyr—tibbles
1.10.3 dplyr—Selection of Rows
1.10.4 dplyr—Order
1.10.5 dplyr—Selection of Columns
1.10.6 dplyr—Uniqueness
1.10.7 dplyr—Creating Variables
1.10.8 dplyr—Grouping and Summary Statistics
1.10.9 dplyr—Window Functions
1.11 Data Manipulation with the data.table Package
1.11.1 data.table—Variable Construction
1.11.2 data.table—Indexing/Subsetting
1.11.3 data.table—Keys
1.11.4 data.table—Fast Subsetting
1.11.5 data.table—Calculations in Groups
1.12 Data Sets
1.12.1 Census Data from UCI
1.12.2 Airquality
1.12.3 Breast Cancer
1.12.4 Brittleness Index
1.12.5 Kola C-horizon Data
1.12.6 Colic Horse Data
1.12.7 New York Collision Data
1.12.8 Diabetes
1.12.9 Austrian EU-SILC Data
1.12.10 Food Consumption
1.12.11 Pulp Lignin
1.12.12 Structural Business Statistics Data
1.12.13 Mammal Sleep Data
1.12.14 West Pacific Tropical Atmosphere Ocean Data
1.12.15 Wine Tasting and Price of Wines
1.12.16 Further Data Sets
References
2 Distribution, Pre-analysis of Missing Values and Data Quality
2.1 Introduction
2.2 How Does Missing Data Arise?
2.2.1 Surveys in Official Statistics and Surveys Obtained with Questionnaires
2.2.2 Comment on Structural Zeros and Non-applicable Questions in a Questionnaire
2.2.3 Missing Values from Measuring Experiments
2.2.4 Censored Values
2.2.5 Monotone Missingness
2.3 Missing Value Mechanisms
2.3.1 Missing at Random (MAR)
2.3.2 Missing Completely at Random (MCAR)
2.3.3 Missing Not at Random (MNAR)
2.3.4 Example
2.3.5 Summary on MCAR, MAR, and MNAR
2.4 Limitations for the Detection of the Missing Value Mechanisms
2.5 Kinds of Attributes
2.5.1 Binary and Nominal Variables and Related Distances
2.5.2 Ordered Categorical Variables
2.5.3 Count Variables, Continuous Variables, Semi-continuous Variables, and Related Distances
2.5.4 The Gower Distance
2.6 Data Quality and Consistency of Data
2.6.1 Outliers
2.6.2 Rule-Based Approaches for Checking the Consistency of Data
2.6.3 Localization of Inconsistencies and Errors
References
3 Detection of the Missing Values Mechanism with Tests and Models
3.1 Introduction
3.2 A Simple t-Test for MCAR
3.3 Non-parametric Version
3.4 Extension to the Multiple Case
3.5 Little’s Test on MCAR and Extensions
3.6 Further Tests
References
4 Visualization of Missing Values
4.1 Motivation
4.1.1 Why to Apply Visualization Methods
4.1.2 Software
4.2 Rough Summaries and the Aggregation Plot
4.3 Histogram, Barplot, Spinogram, and Spine Plot
4.4 Parallel Boxplots
4.5 Scatterplots
4.6 Scatterplot Matrices
4.7 Scatterplot Faceting
4.8 Parallel Coordinate Plot
4.9 Matrix Plot
4.10 Visualizing Missing Values in Multivariate Categorical Data with Mosaic Plots
4.11 Missing Values in Maps
4.12 Studying Dropouts in Longitudinal Cohort Studies
4.13 Summary
References
5 General Considerations on Univariate Methods: Single and Multiple Imputation
5.1 A Few Words on Listwise Deletion; Deletion of Observation with Missing Values
5.1.1 Bias in Complete Case Analysis
5.2 Univariate Imputation Methods
5.2.1 Imputation with the Mean
5.2.2 Bias for Mean Imputation
5.2.3 Univariate Mean Imputation for Non-continuous Variables
5.3 What Is Single Imputation?
5.4 What Is Multiple Imputation?
5.4.1 General Steps in Multiple Imputation
5.4.2 Benefits
5.4.3 Requirements
5.4.4 How Many Imputations?
5.5 Single Imputation Versus Multiple Imputation
5.5.1 Why Standard Errors in Single Imputation Are Not Quite Correct?
5.6 Pooling of Multiple Imputation Results
5.7 General Concepts to Allow for Randomness of Imputations
5.8 Joint Modeling, Distribution Fitting, and Copulas
5.9 The EM Algorithm and Fully Conditional Modeling in the Multivariate Case
References
6 Deductive Imputation and Outlier Replacement
6.1 Correction of Errors, But Where?
6.2 Correct Typos in Continuous Variables
6.3 Adjusting Values After Imputation to Match Restrictions
6.4 Imputing Outliers and Erroneous Values
References
7 Imputation Without a Formal Statistical Model
7.1 Hot-Deck Imputation
7.1.1 Random Hot-Deck
7.1.2 Random or Sequential Hot-Deck with Constraints
7.1.3 Sequential Hot-Deck
7.1.4 An Application of Hot-Deck
7.1.5 Computational Time Revisited
7.1.6 Some Remarks on Hot-Deck Imputation
7.2 k Nearest Neighbor Methods
7.2.1 Distance Calculation Within kNN and Weighting
7.2.2 Illustration of kNN Imputation
7.2.3 Application of kNN
7.2.4 Weighting the Variables
7.2.5 Random k Nearest Neighbor Imputation
7.3 Covariance-Based Methods
7.3.1 Principal Component Analysis
7.3.2 Imputation with Principal Component Analysis
7.3.3 Multiple Imputation with PCA Imputation
7.3.4 A Simple Example to Compare the Approaches
References
8 Model-Based Methods
8.1 Linear Regression
8.2 Robust Linear Regression
8.2.1 Regression M Estimator
8.2.2 Weight Functions
8.2.3 Regression S Estimator
8.2.4 MM-Estimator
8.3 Robustness in Regression Imputation
8.3.1 Why Not Use Expected Values from a Model?
8.4 Making Regression-Based Imputation Methods Fit for Multiple Regression
8.4.1 (Normal) Noise Added to Predicted Values
8.4.2 Bootstrap Residuals and Addition to the Predicted Values of the Missings
8.4.3 Additional Consideration of Model Uncertainty Through Bootstrapping
8.4.4 Additional Consideration of Model Uncertainty Through Bayesian Regression
8.5 Predictive Mean Matching (PMM)
8.6 Weighted Donor Selection with Midastouch
8.7 EM-Based Imputation and Extensions to Non-continuous Variables
8.8 Robust Stepwise Sequential EM-Based Imputation with IRMI
8.9 Enhancement of IRMI: Robust Multiple Imputation Using imputeRobust
References
9 Nonlinear Methods
9.1 Tree-based Methods Using Random Forests
9.1.1 Imputation Using Random Forests
9.2 Tree-Based Methods Using XGBoost
9.2.1 Imputation Using XGBoost
9.3 Generalized Additive Models
9.3.1 Splines
9.3.2 Piecewise Polynomials
9.3.3 GAMs
9.3.4 Imputation with GAMs Using Thin Plate Regression Splines
9.3.5 Imputation with GAMs for Location, Scale, and Shape
9.4 Artificial Neural Network–Based Methods
9.4.1 Fully Conditional Modelling with Artificial Neural Networks
9.4.2 Artificial Neural Networks to Impute Missing Values
9.4.3 Extension to Impute Rounded Zeros in Compositional Data
9.4.4 Imputation Using GAN (GAIN)
References
10 Methods for Compositional Data
10.1 What are Compositional Data?
10.1.1 Negative Bias
10.1.2 The Simplex Sample Space
10.1.3 Absolute or Relative Information
10.1.4 Log-ratio Coordinate Representation
10.1.5 Requirements of a Compositional Analysis
10.1.6 Are Outliers Also Relevant for Compositional Data?
10.2 Different Types of Missing Information
10.3 Imputation of Missing Values
10.3.1 k-nearest neighbor imputation
10.3.2 Iterative Model-Based Imputation
10.3.3 Using the R-package robCompositions for Imputing Missing Values
10.3.4 Comments
10.4 Imputation of Rounded and Count Zeros
10.4.1 Imputation of Rounded Zeros
10.4.2 Model-Based Replacement of Rounded Zeros
10.4.3 An Artificial Neural Network Approach to Impute Rounded Zeros
10.4.4 Rounded Zeros in High-Dimensional Data
10.4.5 Application in R
10.4.6 Count Zeros
10.5 Compositional Approach for Structural Zeros
10.5.1 Avoiding Zeros by Amalgamation
10.5.2 Imputation of Structural Zeros as Auxiliary Step
10.5.3 Application in R
References
11 Evaluation of the Quality of Imputation
11.1 Visual Inspection of Imputed Values
11.1.1 Aggregation Plots of Missings and Imputed Values
11.1.2 Comparing Complete Cases with Imputed Data
11.1.3 An Example of a Univariate Plot: Histogram with Imputed Values
11.1.4 An Example of a Multiple Plot: Marginplot with Imputed Missings
11.2 Parallel Coordinate Plot for Imputed Values
11.3 Biplot Analysis of Imputed Data
11.4 Inspection of Multiple Imputed Data Sets
11.5 Tours
11.6 How to Apply Evaluation Measures
11.7 Evaluation Measures: Precision
11.7.1 Mean Absolute Percentage Error (MAPE)
11.7.2 Normalized Root Mean Squared Error (NRMSE)
11.7.3 Considering Continuous and Semi-Continuous Variables
11.7.4 Percentage of Falsely Classified Entries
11.7.5 Illustration of Precision Measures
11.7.6 Compositional Error Deviation
11.8 Evaluation Measures: Correlation-Based Measures
11.9 Evaluation Measures Based on Estimators
11.9.1 Bias, Variance and Mean Squared Error of an Estimator
11.9.2 Coverage Rate
11.10 Evaluation Based on Prediction Performance
11.11 Final Comments
References
12 Simulation of Data for Simulation Studies
12.1 Introduction
12.2 Type of Simulation
12.2.1 Insertion of Missing Values in Real Data
12.2.2 Model-Based Simulation
12.2.3 Design-Based Simulation
12.3 Inserting Missing Values
12.3.1 Simulating MCAR Data (Model-Based)
12.3.2 Creating Patterns of Missingness
12.3.3 Simulating MAR Data
12.3.4 Simulating MNAR Data
12.4 Simulating Data Using Covariances (Model-Based)
12.5 Simulating an Additional Variable
12.6 Simulating (High-Dimensional) Data with a Latent Model
12.7 A Design-Based Simulation Study Based on a Complex Survey from a Finite Population
12.7.1 Setup of the Structure
12.7.2 Simulation of Additional Variables
12.7.3 Example in R
12.8 Prediction Performance Simulation
12.9 Final Comments
References
Index

Chapter 1

Topic-Focused Introduction to R and Data sets Used

Abstract

The theoretical concepts explained in the book are illustrated by examples, which make use of the statistical software environment R. In this chapter, a short introduction to some functionalities of R is given. This introduction does not replace a general introduction to R, but it provides the background that is necessary to understand the examples and the R code in the book. First, the available software tools are briefly discussed. A focused introduction to R follows, which also introduces the package VIM used throughout the book. Finally, some interesting data sets are introduced. Most of them are used for demonstration purposes and exercises in this book. Note that most of the methods explained in this book are exclusively available in the R package VIM. The package includes not only imputation methods, such as distance-based, model-based, and neural-network-based methods, but also diagnostic tools for missing and imputed values. Most of the data sets used in this book have also been made available in the R package VIM.

1.1 Resources and Software for the Imputation of Missing Values

This section gives a brief overview of packages related to missing data analysis and the imputation of missing values. Implementing methods in software is essential for exploring and imputing missing values in practice. A variety of software tools have been written for the imputation of missing values; they are discussed in the following sections.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Templ, Visualization and Imputation of Missing Values, Statistics and Computing, https://doi.org/10.1007/978-3-031-30073-8_1


An updated list of packages for imputation is available at the CRAN Task View on Missing Values, https://CRAN.R-project.org/view=MissingData. The task view is organized by topic, namely, the exploration of missing values, likelihood-based methods, single imputation, multiple imputation, and methods for specific data types such as longitudinal data, time series, spatial or spatio-temporal data, matrix completion, graphs, or compositional data. It offers an up-to-date overview of R packages in the area of missing values and imputation. Another CRAN task view focuses on official statistics and survey methodology; its information on imputation can be found in the Imputation section at https://CRAN.R-project.org/view=OfficialStatistics. An additional resource is the R-miss-tastic website: https://rmisstastic.netlify.com (accessed on April 22, 2021).

It is beyond the scope of this book to mention all packages with imputation features, so we list the best-known and most used packages. Table 1.1 shows the total and average daily downloads from the RStudio CRAN mirror from January 2020 until the end of May 2022. The package mice leads this list. In this book, we mainly focus on the package VIM (fourth position on this list), with about 744 daily downloads from the RStudio CRAN mirror, but we also show methods and applications from mice, missRanger, and robCompositions. The reason for focusing on VIM is that it contains computationally fast imputation methods that can handle outliers and mixed variable types and that it has the largest available collection of diagnostic tools for missing values. The most important R packages are briefly presented below.

1.1.1 Amelia

The R package Amelia (Honaker et al., 2011) implements bootstrap-based multiple imputation, using an expectation-maximization (EM) algorithm to estimate the parameters for quantitative data. It imputes assuming a multivariate Gaussian distribution. Its strengths lie in the imputation of longitudinal data. The bootstrap is applied to appropriately introduce model uncertainty into the imputation process, which is what allows the package to produce multiple imputations.
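The following lines are a minimal sketch of how such a bootstrap-EM imputation might be called, assuming the package Amelia is installed. The choice of the airquality data (shipped with R), the selected columns, and m = 5 are illustrative assumptions and are not taken from the book.

```r
library(Amelia)

# Illustrative data: four numeric columns of airquality;
# Ozone and Solar.R contain missing values
x <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

set.seed(123)
# m = 5 imputed data sets; p2s = 0 suppresses the console output
am <- amelia(x, m = 5, p2s = 0)

# The completed data sets are stored as a list of data frames
length(am$imputations)
sapply(am$imputations, function(d) sum(is.na(d)))
```

Each element of am$imputations is one completed copy of the data; an analysis would be run on each copy and the results combined afterwards.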


Table 1.1: Downloads and daily average downloads from the RStudio CRAN mirror in the time span from 1.1.2020 till 27.05.2022 including the trend compared to the previous 2 years. The absolute trend shows the increase or decrease of average daily downloads and the relative one the change in position

Package            Downloads  Daily  Trend (abs., rel.)
mice                 2363945   2699  +
imputeTS              937612   1070  + +
mi                    822182    939  +
VIM                   651729    744  +
norm                  478497    546  + +
Amelia                328761    375  +
jomo                  260891    298
missMDA               221956    253  + +
missForest            212811    243  +
zCompositions         161688    185  +
robCompositions       148872    170
yaImpute              117316    134
mix                    83728     96  +
simputation            75398     86  + +
missRanger             60431     69  + +
softImpute             46525     53  +
imputeR                45937     52  + +
cat                    28277     32  +
hot.deck               27065     31  +
BaBooN                 20232     23
CoImp                  17587     20  + +
imputeMulti            16255     19  + +
filling                15786     18  + +
GenForImp              14859     17  + +
MissMech               14672     17  +
MImix                  14389     16  +
HotDeckImputation       4490      5
MixedDataImpute         2520      3  +
tidyimpute              2422      3  -

1.1.2 mi

The R package mi (Gelman and Hill, 2011) implements multiple imputation by chained equations (iterative regression imputation/EM algorithm, as also applied in mice and VIM). It works in an approximate Bayesian framework and provides functions to analyze multiply imputed data sets with the appropriate degree of sampling uncertainty. A particularly useful feature is that a model can be defined for each variable to be imputed in the iterative chain.


1.1.3 mice (and BaBooN)

The R package mice (Buuren and Groothuis-Oudshoorn, 2011; Buuren and Oudshoorn, 2005) is the most popular package for multiple imputation. It also supports the so-called pooling of estimates for various estimators. Details on multiple imputation and the most important methods are discussed in Sect. 5.5. Salfran and Spieß (2018) extend this package with generalized additive models. The R package BaBooN has received less attention than mice for its multiple imputation features. Its approach targets the imputation of discrete data and is, like mice, based on predictive mean matching (PMM) and their midastouch approach described later.
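As a rough illustration of the impute, analyze, and pool workflow mentioned above, the sketch below uses mice with predictive mean matching. The data set (airquality from base R), the number of imputations, and the regression model are assumptions made only for this example.

```r
library(mice)

# Impute with predictive mean matching; m = 5 completed data sets
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same model on each completed data set
fit <- with(imp, lm(Ozone ~ Wind + Temp))

# Pool the m sets of estimates into one result
pooled <- pool(fit)
summary(pooled)

# A single completed data set can be extracted if needed
first_completed <- complete(imp, 1)
```

The pooled summary reports one set of coefficients whose standard errors account for the variation between the imputed data sets.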

1.1.4 missMDA

The R package missMDA (Josse and Husson, 2016) provides imputation methods for incomplete continuous or categorical data sets based on principal components; more precisely, continuous missing values are imputed with a principal component analysis (PCA), categorical missing values with a multiple correspondence analysis (MCA) model, and mixed data with factor analysis for mixed data. Because it is based on dimensionality reduction methods such as PCA, it takes the similarities between the observations and the relationships between variables into account. These methods are discussed in Sect. 7.3.
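A PCA-based imputation of continuous variables could look roughly as follows, assuming the package missMDA is installed. The data set, the selected columns, and the number of components (ncp = 2) are arbitrary choices for illustration and not recommendations from the book.

```r
library(missMDA)

# Illustrative continuous data with missing values in Ozone and Solar.R
X <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Impute with a PCA model using two components (arbitrary choice here)
res <- imputePCA(X, ncp = 2)

# completeObs holds the data with the missing cells filled in
X_imp <- as.data.frame(res$completeObs)
colSums(is.na(X_imp))
```

In practice the number of components would be chosen with more care, for example by cross-validation with estim_ncpPCA() from the same package.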

1.1.5 missForest and missRanger

Both packages use random forests to impute missing values. missForest (Stekhoven and Bühlmann, 2011) was implemented earlier, whereas missRanger (Mayer, 2019) builds on ranger (Wright and Ziegler, 2017), a faster implementation of random forests, and offers better error handling and user guidance.
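To give an impression of how similar the two interfaces are, the following sketch imputes the same artificially incomplete data with both packages. The use of iris, the 10% missingness rate, the seed, and the number of trees are assumptions made for illustration only.

```r
library(missForest)
library(missRanger)

data(iris)

# Set 10% of the cells to NA completely at random (prodNA is part of missForest)
set.seed(1)
iris_na <- prodNA(iris, noNA = 0.1)

# Random forest imputation with missForest; the completed data are in $ximp
imp_mf <- missForest(iris_na)$ximp

# Random forest imputation with missRanger; num.trees is passed on to ranger
imp_mr <- missRanger(iris_na, num.trees = 100, seed = 1, verbose = 0)
```

Both calls return a data set of the same shape as the input in which the missing cells have been filled in.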

1.1.6 robCompositions

The R package robCompositions (Templ et al., 2011; Filzmoser et al., 2018b) implements methods for imputing missing values and rounded zeros for compositional data. These methods are discussed in Chap. 10.


1.1.7 VIM

In 2006, the first version of the R package VIM (Templ et al., 2019; Templ et al., 2012; Kowarik and Templ, 2016) was made available on CRAN. VIM can be used to explore the data and the structure of the missing and/or imputed values as well as to impute missing values. Depending on the structure of the missing values, the corresponding methods may help to identify the mechanism generating the missing values and make it possible to explore the data including missing values. In addition, the quality of imputation can be explored visually using various univariate, bivariate, multiple, and multivariate plot methods. The imputation methods are computationally fast and include nearest neighbor methods (hotdeck, k-nearest neighbor imputation), model-based imputation methods in an EM fashion, and methods based on neural networks. This package plays a central role in many chapters of this book.
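The three uses described above (exploring the missing values, imputing them, and inspecting the result) can be sketched in a few lines. The example assumes the sleep data set that ships with VIM; the choice of k = 5 and of the two plotted variables is purely illustrative.

```r
library(VIM)

# sleep ships with VIM and contains missing values in several variables
data(sleep, package = "VIM")

# Explore the amount of missing values per variable and their combinations
aggr(sleep, numbers = TRUE, prop = FALSE)

# k-nearest neighbor imputation; logical *_imp columns flag the imputed cells
sleep_knn <- kNN(sleep, k = 5)

# Scatterplot with information on imputed values shown in the margins
marginplot(sleep_knn[, c("Dream", "NonD", "Dream_imp", "NonD_imp")],
           delimiter = "_imp")
```

The indicator columns with the suffix _imp are what allows the plot methods to distinguish observed from imputed values.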

1.2 The Statistics Environment R

R was created by Ross Ihaka and Robert Gentleman in 1993–1994. It is based on S, a programming language developed by John Chambers (Bell Laboratories), and on Scheme. Since 1997, it has been developed internationally and distributed from Vienna via the Comprehensive R Archive Network (CRAN, cran.r-project.org). R is nowadays among the most popular and most widely used software environments in statistics. Furthermore, R is free and open source (under the GPL-2). R is not only software for statistical analysis but also an environment for interactive computing with data, with facilities to produce high-quality graphics. Exchanging code with others is easy, since anyone can download R. This might also be one reason why modern methods are often developed exclusively in R. R is an object-oriented programming language and has interfaces to many other languages and software products such as C, C++, and Java, as well as to databases.

Useful information can be found at:
• Homepage: http://www.r-project.org/ and CRAN http://cran.r-project.org for download
• Lists with frequently asked questions (FAQ) on CRAN
• Manuals and contributed manuals
• Task views on CRAN, especially the task views on missing data and on official statistics and survey methodology. Both task views list packages for imputation and missing values.

The basic installation of R is extendable with about 20,000 add-on packages.


For R programming, it is advisable to write the code in a well-developed script editor. An editor should offer syntax highlighting, code completion, and interactive communication with R. For beginners as well as advanced users, RStudio (http://www.rstudio.org/) is one possibility. It also supports the integration of git, features many useful tools for package development, plotting, etc., and provides interfaces to other languages.

1.3 Simple Calculations in R

You can give your calculator away, because R can be used as an (overgrown) calculator. All operations of a calculator can be used very easily in R. For instance, addition is done with +, subtraction with -, division with /, the exponential function with exp(), the logarithm with log(), the square root with sqrt(), the sine with sin(), etc. All operations work as expected. As an example, the following expression is parsed by R: inner brackets are evaluated first, multiplication and division operators have precedence over the addition and subtraction operators, etc.

    0.5 + 0.2 * log(0.15^2)
    ## [1] -0.258848

R is a functional and object-oriented language. Everything you can call in R is an object with a defined class. Functions are special objects that do something, and they are typically applied to objects. The syntax is shown in the following example, where the add-on package VIM (Templ et al., 2019; Templ et al., 2012; Kowarik and Templ, 2016) is loaded first (using the function library), followed by a data set containing missing values.

    library("VIM")
    data(sleep, package = "VIM")
    countNA(sleep)
    ## [1] 38

Here the function countNA was applied to the data set sleep (actually an object of class data.frame), and the output shows that 38 missing values are present in this data set. Functions typically have function arguments that can be set. The syntax for calling a function has the general structure: res1 threshold & !is.na(eusilc$py010n)]