Mixture and Hidden Markov Models with R (Use R!) 3031014383, 9783031014383

This book discusses mixture and hidden Markov models for modeling behavioral data. Mixture and hidden Markov models are

English Pages 283 [277] Year 2022

Table of contents :
Preface
Chapter Outlines and Reading Guide
Acknowledgments
Settings, Appearance, and Notation
Contents
1 Introduction and Preliminaries
1.1 What Are Mixture and Hidden Markov Models?
1.1.1 Outline
1.2 Getting Started with R
1.2.1 Help!
1.2.2 Loading Packages and Data
1.2.3 Object Types and Manipulation
1.2.3.1 Functions
1.2.3.2 S4 Objects
1.2.4 Visualizing Data
1.2.5 Summarizing Data
1.2.6 Linear and Generalized Linear Models
1.2.7 Multinomial Logistic Regression
1.2.8 Time-Series
1.3 Datasets Used in the Book
1.3.1 Speed-Accuracy Data
1.3.1.1 Description
1.3.2 S&P 500
1.3.2.1 Description
1.3.3 Perth Dams Data
1.3.3.1 Description
1.3.4 Discrimination Learning Data
1.3.4.1 Description
1.3.5 Balance Data
1.3.5.1 Description
1.3.6 Repeated Measures on the Balance Scale Task
1.3.6.1 Description
1.3.7 Dimensional Change Card Sorting Task Data
1.3.7.1 Description
1.3.8 Weather Prediction Task Data
1.3.8.1 Description
1.3.9 Conservation of Liquid Data
1.3.9.1 Description
1.3.10 Iowa Gambling Task Data
1.3.10.1 Description
2 Mixture and Latent Class Models
2.1 Introduction and Motivating Example
2.2 Definitions and Notation
2.2.1 Mixture Distribution
2.2.2 Example: Generating Data from a Mixture Distribution
2.2.3 Parameters of the Mixture Model
2.2.4 Mixture Likelihood
2.2.5 Posterior Probabilities
2.3 Parameter Estimation
2.3.1 Maximum Likelihood Estimation
2.3.2 Numerical Optimization of the Likelihood
2.3.3 Expectation Maximization (EM)
2.3.3.1 EM for a Gaussian Mixture
2.3.3.2 Mixtures of Generalized Linear Models
2.3.3.3 Why Does the EM Algorithm Work?
2.3.4 Optimizing Parameters Subject to Constraints
2.3.5 EM or Numerical Optimization?
2.3.6 Starting Values for Parameters in Mixture Models
2.4 Parameter Inference: Likelihood Ratio Tests
2.4.1 Example: Equality Constraint on Standard Deviations
2.5 Parameter Inference: Standard Errors and Confidence Intervals
2.5.1 Finite Difference Approximation of the Hessian
2.5.2 Parametric Bootstrap
2.5.3 Correcting the Hessian for Linear Constraints
2.6 Model Selection
2.6.1 Likelihood-Ratio Tests
2.6.2 Information Criteria
2.6.2.1 Akaike Information Criterion
2.6.2.2 Bayesian Information Criterion
2.6.2.3 Which to Use?
2.6.3 Example: Model Selection for the Speed1 RT Data
2.7 Covariates on the Prior Probabilities
2.8 Identifiability of Mixture Models
2.9 Further Reading
3 Mixture and Latent Class Models: Applications
3.1 Gaussian Mixture for the S&P500 Data
3.2 Gaussian Mixture Model for Conservation Data
3.3 Bivariate Gaussian Mixture Model for Conservation Data
3.4 Latent Class Model for Balance Scale Data
3.4.1 Model Selection and Checking
3.4.2 Testing Item Homogeneity Using Parameter Constraints
3.5 Binomial Mixture Model for Balance Scale Data
3.5.1 Binomial Logistic Regression
3.5.2 Mixture Models
3.5.3 Model Selection and Model Checking
3.6 Model Selection with the Bootstrap Likelihood Ratio
3.7 Further Reading
4 Hidden Markov Models
4.1 Preliminaries: Markov Models
4.1.1 Definitions
4.1.1.1 Markov Property
4.1.1.2 Transition Matrix
4.1.1.3 Homogeneity
4.1.1.4 Initial State Probabilities
4.1.2 Properties of Markov Models
4.1.2.1 Stationary Distribution
4.1.2.2 Ergodicity
4.1.2.3 Transient and Absorbing States
4.1.2.4 Dwell Time Distribution
4.1.2.5 Markov Models and Grammars: The Golden Mean Model
4.2 Introducing the Hidden Markov Model
4.2.1 Definitions
4.2.2 Relation Between Hidden Markov and Mixture Model
4.2.3 Example: Bernoulli Hidden Markov Model
4.2.4 Likelihood and Inference Problems
4.3 Filtering, Likelihood, Smoothing and Prediction
4.3.1 Filtering
4.3.2 Likelihood
4.3.3 Smoothing
4.3.4 Scaling
4.3.5 The Likelihood Revisited
4.3.6 Multiple Timeseries
4.3.7 Prediction
4.4 Parameter Estimation
4.4.1 Numerical Optimization of the Likelihood
4.4.2 Expectation Maximization (EM)
4.5 Decoding
4.5.1 Local Decoding
4.5.2 Global Decoding
4.6 Parameter Inference
4.6.1 Standard Errors
4.7 Covariates on Initial and Transition Probabilities
4.8 Missing Data
4.8.1 Missing Data in Hidden Markov Models
4.8.2 Missing at Random
4.8.3 State-Dependent Missingness
4.8.3.1 Simulation Study
5 Univariate Hidden Markov Models
5.1 Gaussian Hidden Markov Model for Financial Time Series
5.2 Bernoulli HMM for the DCCS Data
5.3 Accounting for Autocorrelation Between Response Times
5.3.1 Response Times
5.3.2 Models for Response Times
5.3.3 Model Assessment and Selection of RT Models
5.4 Change Point HMM for Climate Data
5.5 Generalized Linear Hidden Markov Models for Multiple Cue Learning
6 Multivariate Hidden Markov Models
6.1 Latent Transition Model for Balance Scale Data
6.1.1 Learning and Regression
6.2 Switching Between Speed and Accuracy
6.2.1 Modeling Hysteresis
6.2.2 Testing Conditional Independence and Further Extensions
6.3 Dependency Between Binomial and Multinomial Responses: The IGT Data
7 Extensions
7.1 Higher-Order Markov Models
7.1.1 Reformulating a Higher-Order HMM as a First-Order HMM
7.1.2 Example: A Two-State Second-Order HMM for Discrimination Learning
7.2 Models with a Distributed State Representation
7.3 Dealing with Practical Issues in Estimation
7.3.1 Unbounded Likelihood
7.4 The Classification Likelihood
7.4.1 Mixture Models
7.4.2 Hidden Markov Models
7.5 Bayesian Estimation
7.5.1 Sampling States and Model Parameters
7.5.1.1 FFBS for the Speed Data
7.5.2 Sampling Model Parameters by Marginalizing Over Hidden States
References
Epilogue
The Production of the Book
Index
Ingmar Visser Maarten Speekenbrink

Mixture and Hidden Markov Models with R

Use R!

Series Editors
Robert Gentleman, 23andMe Inc., South San Francisco, USA
Kurt Hornik, Department of Finance, Accounting and Statistics, WU Wirtschaftsuniversität Wien, Vienna, Austria
Giovanni Parmigiani, Dana-Farber Cancer Institute, Boston, USA

This series of inexpensive and focused books on R is aimed at practitioners. Books can discuss the use of R in a particular subject area (e.g., epidemiology, econometrics, psychometrics) or as it relates to statistical topics (e.g., missing data, longitudinal data). In most cases, books combine LaTeX and R so that the code for figures and tables can be put on a website. Authors should assume a background as supplied by Dalgaard’s Introductory Statistics with R or other introductory books so that each book does not repeat basic material.

How to Submit Your Proposal

Book proposals and manuscripts should be submitted by email to one of the publishing editors in your region; for the list of statistics editors by location, please see https://www.springer.com/gp/statistics/contact-us. All submissions should include a completed Book Proposal Form. For general and technical questions regarding the series and the submission process, please contact Laura Briskman or Veronika Rosteck.

Ingmar Visser University of Amsterdam Amsterdam, The Netherlands

Maarten Speekenbrink University College London London, UK

ISSN 2197-5736
ISSN 2197-5744 (electronic)
Use R!
ISBN 978-3-031-01438-3
ISBN 978-3-031-01440-6 (eBook)
https://doi.org/10.1007/978-3-031-01440-6

© Springer Science+Business Media, LLC, part of Springer Nature 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To Lieke and Pien
I.V.

To Gabriel and Hunter
M.S.

Preface

This book aims to provide a self-contained practical introduction to mixture models and hidden Markov models. The reason for introducing both in one book is that there are very close links between these models. This allows us to introduce important concepts, such as maximum likelihood estimation and the Expectation-Maximization algorithm, in the relatively simpler context of mixture models. Approaching hidden Markov models from a thorough understanding of mixture models involves, we hope, a relatively small conceptual leap.

We aimed to provide a reasonable balance between statistical theory and practice. The objective is to provide enough mathematical detail (but no more!) to allow our target audience to understand the key results that are necessary to apply these models. Our target audience consists of those with a more applied background, in particular researchers, graduate students, and advanced undergraduate students in the social and behavioral sciences: researchers, or future researchers, who see the potential for applying these models and explaining heterogeneity in their data, but who currently lack the tools to fulfill this potential. Those looking for a more purely mathematical treatment of mixture and hidden Markov models we gladly refer to the books by Cappé et al. (2005) and/or Frühwirth-Schnatter (2006).

To familiarize readers with the possibilities of mixture and hidden Markov models, a large part of this book consists of practical examples of applying these models, many of which are taken from our own research in developmental and experimental psychology. Much of our work on these models was driven by the research questions that arose during the study of experimental or developmental (time series) data. Over the years, we have also accumulated examples from other fields, such as climate change and economics. In the examples, we provide as much background knowledge of these different domains as is needed to understand the rationale of the analyses. At the same time, we abstract away from many details and focus on the generalizability of the presented models to research questions in other domains.

The example analyses in this book rely on the R programming language and software environment (R Core Team 2020) and in particular the depmixS4 package (Visser and Speekenbrink 2010). Nowadays, the choice of R hardly needs justification, as it has become the lingua franca of statistics and data science. R is open source, freely available, and has an active user community, so that anyone interested can add and contribute packages implementing new analytical methods. As all required tools are freely available, readers should be able to replicate the example analyses on their own computers, as well as adapt the analyses for their own purposes. To aid in this process, all the code for running the examples in the book is provided online at https://depmix.github.io/hmmr/. Moreover, the datasets and special purpose functions written for this book are available as an R package called hmmr. Section 1.2 provides pointers for getting started with R and covers the basics that are needed to understand and apply the subsequent analyses and examples.
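As a rough sketch of how a reader might check that the two packages named above are in place before running the examples: the package names depmixS4 and hmmr come from the text, but their availability under these names from the reader's package repository is an assumption here, and the object names `pkgs` and `installed` are purely illustrative.

```r
# Package names are taken from the text; that both can be installed from the
# reader's default repository is an assumption, not something stated above.
pkgs <- c("depmixS4", "hmmr")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (!all(installed)) {
  # Report what is missing rather than installing automatically
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "),
          "; try install.packages() with these names.")
} else {
  library(depmixS4)
  library(hmmr)
  # List the names of the datasets that ship with hmmr
  print(data(package = "hmmr")$results[, "Item"])
}
```

The check deliberately avoids installing anything on its own, so it can be run safely on any machine.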

Chapter Outlines and Reading Guide

Chapter 1 provides a brief introduction to R and describes the basic features of the datasets used throughout this book to illustrate the use of mixture and hidden Markov models. Chapters 2 and 4 are mostly theoretical in nature, providing a statistical treatment of mixture and latent class models (Chap. 2), and the extension of those models into hidden Markov models (Chap. 4). Chapter 3 provides a number of worked examples of applications of mixture and latent class models to analyze both univariate and multivariate data. Similarly, Chaps. 5 and 6 provide detailed example analyses which apply hidden Markov models to univariate (Chap. 5) and multivariate (Chap. 6) time series data. Finally, Chap. 7 discusses some extensions of the basic hidden Markov model, as well as alternative estimation techniques, including a brief introduction to Bayesian estimation of these models.

In Chaps. 2 and 4, the first two sections are devoted to conceptually describing and defining mixture and hidden Markov models, respectively. These sections lay the foundations for understanding how these models work and how they can be usefully applied, and they should be read by everyone. Rushed readers wanting to get started right away with applying the models may skip the remainder of those chapters, where we delve deeper into parameter estimation and inference. The examples in Chaps. 3, 5, and 6 are standalone sections that treat data with particular characteristics and describe the models that can be used to answer the research questions of interest. Where warranted, these application sections also refer back to the relevant sections in Chaps. 2 and 4, which offer more technical detail on the topics that arise. Readers who skipped most of Chaps. 2 and 4 can then read the relevant parts of these chapters when the need arises.

Acknowledgments

Writing and producing a book is rarely done in isolation, and this one is no exception. Many people have contributed by asking tough research questions, providing data, and giving LaTeX and Sweave() advice. Below is a list of people who we know have contributed in important ways. We know that this list is likely incomplete, so just let us know if you ought to be on the list so we can include you in future editions.

We would like to thank Achim Zeileis for getting us started with the combination of LaTeX, Sweave() and make files to produce this book. We would like to thank Chen Haibo for providing the S&P-500 data example, Brenda Jansen for sharing her balance scale data, Gilles Dutilh and Han van der Maas for sharing their speed-accuracy data, the Water Corporation of Western Australia for providing the Perth dams data on their website, Bianca van Bers for sharing the dimensional change card sorting task data, Han van der Maas for sharing the conservation of liquid data, Maartje Raijmakers for sharing the discrimination data, and Emmanouil Konstantinidis for sharing the Iowa gambling task data. Finally, we would like to thank John Kimmel, Marc Strauss, and Laura Briskman for inviting us to write this book and for organizing things at Springer.

This book has been taking us a while to complete. On a more personal level, many people have been by our sides during that period. Maarten would like to thank Gabriel and Hunter for being wondering and wonderful human beings, and Ria and Jan for their love and support. Ingmar would like to thank Jaro for her love and support.

Ingmar Visser, Amsterdam, The Netherlands
Maarten Speekenbrink, London, UK
January 2022

Settings, Appearance, and Notation

In producing the examples in this book, R is mainly run at its default settings. A few modifications were made to render the output more easily readable; these were invoked by the following chunk of R-code:

    R> options(prompt = "R> ", continue = "+ ", width = 60,
    +    digits = 4, show.signif.stars = FALSE,
    +    useFancyQuotes = FALSE)

This replaces the standard R prompt > by R>. For compactness, digits = 4 reduces the number of digits shown when printing numbers from the default of 7. Note that this does not reduce the precision with which these numbers are internally processed and stored. We use set.seed(x) whenever we generate data or fit models such that the exact values of data and fitted parameters may be replicated. When fitting models, this is necessary, because random starting values are generated (see Sect. 2.3.6 for more details).

We use a typewriter font for all code; additionally, function names are followed by parentheses, as in plot(), and class names (a concept that is explained in Chap. 1) are displayed as in “depmix”. Furthermore, boldface is used for package names, as in hmmr.

The following symbols are used throughout the book:

    A       Transition matrix
    π       Initial state probability vector
    S       Stochastic state variable
    s       Realization of the state variable
    Y       Stochastic (possibly multivariate) observation variable
    y       Realization of the observation variable
    z       Covariate, possibly multivariate
    θ       Total model parameter vector; θ = (θ_pr, θ_tr, θ_obs)
    θ_pr    Subvector of the parameter vector with parameters of the prior model
    θ_tr    Subvector of the parameter vector with parameters of the transition model
    θ_obs   Subvector of the parameter vector with parameters of the observation model
    T       Total number of time points
    N       Number of states of a model
    f       Probability density function
    P       Probability distribution
    H       Hessian matrix
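Two remarks made in this section on settings can be checked directly in base R: that the digits option changes only how numbers are displayed, not how they are stored, and that calling set.seed() with the same value reproduces the same random draws. The following is a minimal sketch; the object names (`x`, `shown`, `draw1`, `draw2`) and the seed value are illustrative choices, not taken from the book.

```r
# Temporarily set the display precision used in the book; keep the old
# settings so they can be restored afterwards.
old <- options(digits = 4)

x <- pi * 1000
shown <- format(x)                      # character string using 4 significant digits
stored_ok <- identical(x, pi * 1000)    # the stored value is unchanged

# The same seed gives the same pseudo-random numbers
set.seed(1)
draw1 <- rnorm(5)
set.seed(1)
draw2 <- rnorm(5)
reproducible <- identical(draw1, draw2)

options(old)                            # restore the previous settings
```

Restoring the previous options at the end keeps the sketch from altering the rest of an R session.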

Chapter 1

Introduction and Preliminaries

Chapter Overview This chapter first provides a brief overview of the contents of the book. Section 1.2 then provides a brief introduction to R, focusing on summarizing and visualizing data, and modelling data with (generalized) linear models. Section 1.3 introduces the datasets used throughout the book to illustrate the models, including their origin and main research questions.

© Springer Science+Business Media, LLC, part of Springer Nature 2022 I. Visser, M. Speekenbrink, Mixture and Hidden Markov Models with R, Use R!, https://doi.org/10.1007/978-3-031-01440-6_1

1.1 What Are Mixture and Hidden Markov Models?
One of the key characteristics of mixture and hidden Markov models is the dependency of observable responses on hidden, latent, or in other words not directly observable states. In a statistical sense, this means that the properties of the states can only be inferred indirectly, from the responses that are emitted in those states. The dependency of responses (observations) on a hidden state means that the statistical distribution of responses differs between states. In mixture and hidden Markov models, the states are typically discrete-valued. Models with continuous-valued states are more commonly referred to as state-space models. A main difference between discrete-valued and continuous-valued states is that the latter imply a relatively smooth transition between response distributions: states with similar values govern similar response distributions. Discrete-valued states are distinct, but the numbers assigned to them are arbitrary. Discrete-valued states correspond to different distributions of the observable responses, but labeling them by a number or by any other symbol is inconsequential. As such, the latent states can be regarded as points on a nominal measurement scale, rather than an ordinal, interval, or ratio scale. For such discrete-valued states, moving from one state to another implies a non-smooth or discrete switch from one response distribution to another. One way to think of such states is, if there are multiple variables involved, as clusters of co-occurring responses. For instance, if I'm in a supermarket whilst hungry I will probably browse the aisles that contain immediately consumable products, whilst when I am in the same supermarket after lunch and looking for a birthday card my looking behaviour would be rather different. In many applications of mixture and hidden Markov models, the main interest is in characterizing such differences in observable responses, identifying states as meaningful clusters of co-occurring responses learned purely from the data. In hidden Markov models, where the responses are time-series, an additional interest is to characterize the process by which one state (cluster) transitions to the next.

To get some idea of the breadth and scope of mixture models (MMs) and hidden Markov models (HMMs), we list some common applications of these models and the corresponding interpretation of the hidden states.

In applications in psychology, the discrete states can often be interpreted as cognitive strategies or emotional states. For example, a well-known task in developmental psychology is the conservation of liquid task, in which participants are shown a glass filled to a certain level with liquid and asked to indicate the expected level of the liquid if it were poured into another glass with a different width. Participants typically use one of two strategies in indicating the level: the wrong strategy that younger children often apply is to indicate the same level as in the other glass, effectively ignoring that the second glass has a different width. Older children and adults typically apply the correct strategy of adjusting the height in proportion to the change in width of the glass. These two strategies lead to different behavior, and can be represented by different states in an HMM (Schmittmann et al. 2005). Kaplan (2008) provides a review of other applications of hidden Markov models in developmental psychology.
In applications in neuroscience, the discrete states may refer to the activity of individual neurons or brain regions. For example, in an application focusing on sleep stages from electroencephalogram (EEG) measurements, the states of an HMM correspond to various sleep stages, such as rapid eye movement (REM) sleep, deep sleep, and wakefulness (Flexer et al. 2002). Each of these states was associated with typical patterns in the EEG measurements.

In biology, HMMs are used in gene analysis to align measured DNA sequences. A nucleotide sequence that forms a gene may vary in the population due to insertions, deletions or mutations. In such applications, the states of the HMM represent the nucleotides of a gene, and the observations are noisy versions of the same nucleotides, i.e., where the noise is caused by insertions, deletions or mutations (Krogh 1998).

In applications in economics, the states of the HMMs may correspond to periods of expansion and recession, and the interest is in studying the dynamics between these (Ghysels 1994; Hamilton 1989).

Other common applications relate to language processing. In speech recognition, the hidden states are words that are to be recognized, and the observations are the raw sound signals (Rabiner 1989).

In all of these applications of hidden Markov models, the hidden states are characterized precisely through the use of these models. It may also be possible to uncover unanticipated states or strategies by using these models (van der Maas and Straatemeier 2008).

1.1.1 Outline
Hidden Markov models can be seen as a generalization of two types of simpler models. First, HMMs are a generalization of simple Markov models, where the generalization involves additional modelling of the error structure. Second, HMMs are a generalization of mixture models. Here, the generalization involves additional modelling of the sequential nature of the data. Mixture modelling is an interesting topic in its own right. Moreover, we believe that HMMs are easier to understand if one first has a thorough understanding of mixture models, rather than by delving straight into hidden Markov models. Therefore, we first discuss mixture models in the next chapter, and move on to hidden Markov models in Chap. 4. For both models, we describe the algorithms to estimate the model parameters, how to make inferences about these parameters, and how to perform model selection. Chapter 3 contains applications of mixture models, while Chaps. 5 and 6 contain applications of hidden Markov models for univariate and multivariate time-series. Finally, Chap. 7 discusses miscellaneous topics such as higher-order and continuous-time HMMs.

Before getting started with the main material of this book, the remainder of this chapter introduces R and a number of datasets that are used in the rest of the book. First, the next section provides a brief introduction to the basics of R, with examples of basic plotting and analyses using a dataset that is used repeatedly throughout the book. After discussing the basics of R, a number of (other) datasets that are used throughout the book as examples are described, focusing on their origin and the main hypotheses that can be tested using mixture and hidden Markov models.

1.2 Getting Started with R
R is an open-source programming environment for statistical computing and graphics. It is a well-developed, relatively simple, flexible and effective programming language. The R environment offers a large, coherent, integrated collection of intermediate tools for data analysis, and excellent graphical facilities for data analysis and display either on-screen or in print. R is supported by a large community of users and contributors. Many new developments in statistical computing come with an accompanying R package which implements these developments. As a result, the Comprehensive R Archive Network (CRAN; http://cran.r-project.org/) provides access to thousands of add-on packages.


For a first impression of R's "look and feel", we provide an introductory R session in which we briefly analyze a dataset which is used throughout the book as an example for illustrating the use of mixture, Markov, and hidden Markov models. This introduction is by no means sufficient to learn all the ins and outs of R. More extensive introductory texts and manuals can be found on the R-website: http://www.r-project.org/. Instructions on downloading and installing R can also be found there. An excellent introduction to basic statistics with R is Peter Dalgaard's Introductory Statistics with R (Dalgaard 2008).

Having installed R, you will have access to the R interpreter via the console. This allows you to interact with R, for instance using it as a calculator:
R> 2 + 11 # addition
[1] 13
R> 2 * 11 # multiplication
[1] 22
R> 2 / 11 # division
[1] 0.1818
R> 2^(11) # exponentiation
[1] 2048
R> # R follows the usual rules of arithmetic
R> 2 + 11*3
[1] 35
R> (2 + 11)*3
[1] 39
For longer and more complicated analyses, it is generally good practice to write scripts that you can load into R. These scripts can be written in any external program that allows you to save simple text files. The contents of a file can be loaded into R and evaluated through the source function. However, many people prefer to work with an integrated development environment (IDE) such as RStudio (http://www.rstudio.com) or Emacs Speaks Statistics (ESS, http://ess.r-project.org/).


1.2.1 Help!
A nice feature of R is the integrated help system. Whenever you need help in using a function in R or one of the add-on packages, you can type in the name of the function preceded by a question mark, e.g. ?help will open the help file for the help() function. The help() function itself can be used for the same purpose, e.g. help("help") will also open the help file for the help() function. If you don't know the precise name of a function, you can try to find help by typing a search term preceded by two question marks, or equivalently calling the help.search() function with the search term as argument. E.g., ??sum and help.search("sum") will open a list with various documentation (including help pages) which contain the search term provided in either the name, alias, title, concept or keyword entries. If you can't find what you're looking for this way, you can also search online resources using the RSiteSearch() function, which will open up a browser window with search results from the http://search.r-project.org website. Another good search engine for online R material is the http://www.rseek.org website. Finally, you can also use the R-help mailing list by subscribing and/or sending an email with a help request to r-help@r-project.org (make sure you read the posting guide first!).

1.2.2 Loading Packages and Data
This book makes extensive use of the depmixS4 package (Visser and Speekenbrink 2010). The book also comes with its own add-on package hmmr. You can install these packages from within R through the following commands:
R> # download and install the packages
R> install.packages(c("depmixS4","hmmr"))
R> # load the packages into R (one call to library() per package)
R> library(depmixS4)
R> library(hmmr)

Both these packages come with a number of datasets. In the following examples, we will use the speed1 data, which is included in the hmmr package. We can load this dataset using the following command:
R> data(speed1)
The data() function is used to load datasets from add-on packages. You can read in data from other files using e.g. the read.csv() function for data in Comma-Separated Values format and the read.table() function for data where values are separated by spaces or tabs. You can view the first rows of the dataset by calling the head() function:


R> head(speed1)
     RT ACC Pacc
1 6.457 cor    0
2 5.602 cor    0
3 6.254 inc    0
4 5.451 inc    0
5 5.872 inc    0
6 6.004 cor    0

which displays the row names (numbers in this case), and the values of the three variables that are in this dataset: RT, ACC, and Pacc (another numerical variable).

1.2.3 Object Types and Manipulation
In R, loaded data usually comes in the form of a data.frame object. A data frame is a rectangular table where each column refers to a variable, and each row corresponds to an observation. Data frames are extremely useful for data analysis and each variable can be of a different data type. Common types are numeric variables (i.e., numbers, whether integers or real numbers) such as RT and Pacc and factor variables such as ACC. A factor is used to store categorical variables. Effectively, this is a variable with integer values where each integer has an associated label. For example
R> transport
R> c(FALSE, TRUE, TRUE, FALSE) # a logical vector
[1] FALSE  TRUE  TRUE FALSE
R> c(FALSE, TRUE, TRUE, 2) # combining logical and numeric
[1] 0 1 1 2
R> # a numeric vector (even when only using integers)
R> c(0,1,0,2)
[1] 0 1 0 2
R> c(0,1,0,"2") # combining numeric and character
[1] "0" "1" "0" "2"
This shows that vectors are created by converting all elements to the most general mode present (in the order logical < integer < numeric (double) < complex < character). You can also create vectors of a desired mode through the vector function, or convert between modes through the as function or explicit variants such as as.logical, as.numeric, etc.
R> # create a logical vector by repetition (rep(x,times))
R> logi_vec <- rep(c(FALSE,TRUE), length.out=7)
R> logi_vec
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
R> # initialize a double-precision (or "numeric") vector
R> num_vec <- vector("numeric", length=7)
R> num_vec # default value is 0
[1] 0 0 0 0 0 0 0
R> num_vec[1] <- 1/9
R> num_vec
[1] 0.1111 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
R> # regular sequence
R> num_vec <- seq(0, 1, by=0.125)
R> as.numeric(logi_vec)
[1] 0 1 0 1 0 1 0
R> as.character(num_vec)
[1] "0"     "0.125" "0.25"  "0.375" "0.5"   "0.625" "0.75"
[8] "0.875" "1"


Vectors can be added and multiplied as follows:
R> short_vec <- num_vec[1:5]
R> short_vec + short_vec # element-wise addition
[1] 0.00 0.25 0.50 0.75 1.00
R> 2 * short_vec # element-wise multiplication
[1] 0.00 0.25 0.50 0.75 1.00
R> short_vec * short_vec # same as short_vec^2
[1] 0.00000 0.01562 0.06250 0.14062 0.25000
Elements of shorter vectors are recycled in order to make all vectors the required length to perform such operations. This is done by stacking (initial parts) of shorter vectors until the desired length is reached. For instance:
R> # same as c(short_vec,short_vec[1:4])*num_vec
R> short_vec*num_vec
[1] 0.00000 0.01562 0.06250 0.14062 0.25000 0.00000 0.09375
[8] 0.21875 0.37500
Such recycling of shorter vectors is done when performing element-by-element operations. Multiplying vectors by default provides a vector in which each element is the product of the corresponding elements in the arguments (e.g., the first element of the resulting vector is the multiplication of the first elements of the vectors on either side of the multiplication). Matrix multiplication, which does not offer automatic recycling, is done through the matrix multiplication operator %*%:
R> t(short_vec)%*%short_vec
       [,1]
[1,] 0.4688
Matrices can be constructed by combining (numeric) vectors, binding them together by column (cbind) or row (rbind), or directly through the constructor function matrix. Again, when combining vectors of different length, recycling of the elements will be used to make the vectors the required length:
R> cbind(short_vec,2:6)
     short_vec
[1,]     0.000 2
[2,]     0.125 3
[3,]     0.250 4
[4,]     0.375 5
[5,]     0.500 6
R> matrix(short_vec,nrow=4,ncol=2)
      [,1]  [,2]
[1,] 0.000 0.500
[2,] 0.125 0.000
[3,] 0.250 0.125
[4,] 0.375 0.250
R> rbind(short_vec,num_vec)
          [,1]  [,2] [,3]  [,4] [,5]  [,6]  [,7]  [,8]
short_vec    0 0.125 0.25 0.375  0.5 0.000 0.125 0.250
num_vec      0 0.125 0.25 0.375  0.5 0.625 0.750 0.875
           [,9]
short_vec 0.375
num_vec   1.000
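The recycling rule and the contrast with %*% can be checked directly. The following is a minimal sketch in base R; the definitions of num_vec and short_vec are assumptions chosen to reproduce the values printed in the session above.

```r
# Vectors assumed to match the values printed in the session above.
num_vec   <- seq(0, 1, by = 0.125)   # 9 elements
short_vec <- num_vec[1:5]            # 5 elements

# Element-wise product: short_vec is recycled to length 9.
# R issues a warning here because 9 is not a multiple of 5.
recycled <- suppressWarnings(short_vec * num_vec)

# The same result with the recycling written out by hand.
manual <- c(short_vec, short_vec[1:4]) * num_vec
all.equal(recycled, manual)

# %*% does no recycling: the inner product is a 1 x 1 matrix
# equal to the sum of the element-wise squares.
inner <- t(short_vec) %*% short_vec
all.equal(as.numeric(inner), sum(short_vec^2))

# Non-conformable arguments raise an error instead of being recycled.
res <- try(t(short_vec) %*% num_vec, silent = TRUE)
inherits(res, "try-error")
```

Wrapping the element-wise product in suppressWarnings() is only done to keep the output quiet; in interactive work the warning is a useful signal that the lengths do not match evenly.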

A list is one of the most general data types in R. A list is a collection of elements that each can have a different type (including lists):
R> my_list <- list(list(a=1, b=c(1,2,1)), num=c(1,2,1))
R> my_list
[[1]]
[[1]]$a
[1] 1

[[1]]$b
[1] 1 2 1


$num
[1] 1 2 1

Note that the elements of a list can be (optionally) named. Elements of a list can be called by name or index (e.g. my_list[["num"]] and my_list[[2]] both return the second element of the list). A data.frame is actually a special type of list where each element in the list is a vector of the same length. The rectangular shape of a data.frame allows you to subset a data frame by rows and/or columns. For instance, you can obtain the values of just the RT variable by speed1[,"RT"] or speed1$RT. You can get the first six rows of the dataset by calling speed1[1:6,], where 1:6 is shorthand for seq(from=1,to=6,by=1), which creates a sequence of


the numbers 1 to 6. Obtaining a subset of the observations in a dataset (i.e. selecting particular rows in a data frame) is generally done by using a logical indexing vector, or through the subset() function. For instance, the following commands both return a subset of the data where the (log) RT is smaller than 5.3:
R> speed1[speed1$RT < 5.3,]
R> subset(speed1,RT < 5.3)
The selection criteria can be combined, for instance
R> speed1[speed1$RT subset(speed1,RT my_mean my_mean # construct a mixture model
R> mmod is(mmod) # show the class name
[1] "mix"
R> slotNames(mmod) # get all slot names
[1] "response" "prior"    "dens"     "init"     "nstates"
[6] "nresp"    "ntimes"   "npars"
R> slot(mmod,"response") # access the response slot
[[1]]
[[1]][[1]]
Model of type gaussian (identity), formula: RT ~ 1
Coefficients:
(Intercept)
          0
sd 1

[[2]]
[[2]][[1]]
Model of type gaussian (identity), formula: RT ~ 1
Coefficients:
(Intercept)
          0
sd 1

R> is(mmod@response[[1]][[1]])
[1] "NORMresponse" "GLMresponse"  "response"

It is beyond the scope of this introduction to go further into the specifics of the S4 object-oriented system. A main benefit of using an object-oriented system is that it allows you to define classes of objects, and write methods for those classes, that are guaranteed to work for other classes that are inherited from that class. For instance, the following code shows the formal definition of the mix class, and also shows the classes, such as mix.fitted and depmix, that inherit from the mix class:
R> getClass("mix")
Class "mix" [package "depmixS4"]

Slots:

Name:  response    prior     dens     init  nstates
Class:     list      ANY    array    array  numeric

Name:     nresp   ntimes    npars
Class:  numeric  numeric  numeric

Known Subclasses:
Class "depmix", directly
Class "mix.fitted", directly
Class "mix.fitted.classLik", directly
Class "mix.sim", directly
Class "depmix.fitted", by class "depmix", distance 2
Class "depmix.fitted.classLik", by class "depmix", distance 2
Class "depmix.sim", by class "depmix", distance 2

Any methods written for the mix class will also be used (dispatched) for objects of inherited classes, unless the method has another definition for that class. For instance:
R> showMethods("fit")
Function: fit (package depmixS4)
object="GLMresponse"
object="MULTINOMresponse"
object="MVNresponse"
object="NORMresponse"
object="mix"
object="transInit"

which shows the classes of objects for which specific definitions of the fit method have been defined.
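The inheritance and dispatch behaviour described above can be tried out without depmixS4. The following is a minimal sketch using only the methods package; the classes shape, circle and square and the generic describe are hypothetical names invented for this illustration and are not part of depmixS4.

```r
library(methods)

# Hypothetical classes for illustration; these are not depmixS4 classes.
setClass("shape", representation(name = "character"))
setClass("circle", contains = "shape", representation(r = "numeric"))
setClass("square", contains = "shape", representation(side = "numeric"))

# A generic with a method for the parent class ...
setGeneric("describe", function(object, ...) standardGeneric("describe"))
setMethod("describe", "shape", function(object, ...) {
  paste("a shape called", object@name)
})
# ... and a more specific method for one subclass only.
setMethod("describe", "circle", function(object, ...) {
  paste("a circle with radius", object@r)
})

sq <- new("square", name = "sq1", side = 2)
ci <- new("circle", name = "c1", r = 1)

describe(sq)   # no "square" method exists, so the "shape" method is dispatched
describe(ci)   # the "circle" method overrides the inherited one
is(ci)         # the class name followed by the classes it extends
slotNames(ci)  # slots, including the one inherited from "shape"
```

Here square plays the role of a class such as mix.fitted that simply inherits a method, while circle corresponds to a class for which a more specific definition has been supplied.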

1.2.4 Visualizing Data
As we are interested in studying the relationships between three variables in the speed1 data, a good start is to plot the data. Such a plot can be made by first loading the data (as shown earlier) and then calling the plot() function:
R> plot(speed1)
which produces a scatterplot of the variables in the data.frame seen in Fig. 1.1. As can be seen in the scatterplot of the data in Fig. 1.1, the ACCuracy variable is binary and hence the scatterplots for this variable are difficult to read. A more appropriate graphical representation for these variables is provided in Fig. 1.2 as a boxplot for each level of accuracy. This plot was generated by:
R> boxplot(RT~ACC,data=speed1, frame.plot=FALSE)
Note that we set the frame.plot argument to FALSE to suppress drawing a box around the plot. We do that for most figures in this book, but for brevity will not always explicitly show this when we provide the code to produce plots. As the data is a time series, it may be more appropriate to plot it as such:
R> plot(as.ts(speed1), main="Speed data, series 1")
which produces a plot similar to Fig. 1.3. A final useful way of graphical inspection of these data is to plot the marginal distribution of the variables in a hist()ogram or density() plot of the data:
R> layout(matrix(1:2,ncol=2))
R> hist(speed1$RT, main = "Histogram", xlab= "RT")
R> plot(density(speed1$RT), main = "Density")
and the results are in Fig. 1.4. The layout function divides the main plotting area into multiple subregions, making it possible to place multiple plots side-by-side.

Fig. 1.1 Scatterplot of the speed1 data

Fig. 1.2 Boxplot of reaction times (RT) in the speed1 data, separately for each level of accuracy (ACC)

In this book, we use base R to create all plots. The basic plotting functions in R allow for rapid visual inspection of data and are sufficient for our purposes here. More advanced plots can be created with the ggplot2 package (Wickham 2009).

Fig. 1.3 A plot of the variables in the speed1 data as time-series (panels: RT (log ms), ACC, Pacc; horizontal axis: Trials)

Fig. 1.4 Histogram and density plot of reaction times (RT) in the speed1 data

1.2.5 Summarizing Data
Basic summaries of data can be obtained through the summary() function, e.g.:
R> summary(speed1)
       RT         ACC          Pacc
 Min.   :5.04   inc: 40   Min.   :0.000
 1st Qu.:5.57   cor:128   1st Qu.:0.136
 Median :6.12             Median :0.295
 Mean   :6.04             Mean   :0.301
 3rd Qu.:6.40             3rd Qu.:0.455
 Max.   :7.20             Max.   :0.773
which applies this function to all variables in the data frame. For numeric variables, the summary() function provides minimum and maximum values, median and mean, and the 25th and 75th percentiles. These can also be obtained through the corresponding min, max, mean, median, and quantile functions. For factors, such as ACC, the summary function provides a frequency table, which can also be obtained through the table function. Further measures of spread, such as variances and standard deviations, are obtained through the var and sd functions, respectively. Correlations among numeric variables can be computed using the cor() function. This function allows only numeric variables. As the speed1 dataset contains a mix of numeric and categorical variables, we can use the hetcor() function (for heterogeneous correlations) from the polycor package instead:
R> library(polycor)
R> hetcor(speed1)

Two-Step Estimates

Correlations/Type of Correlation:
        RT        ACC       Pacc
RT       1 Polyserial    Pearson
ACC   0.46          1 Polyserial
Pacc 0.643      0.596          1

Standard Errors:
         RT    ACC
RT
ACC  0.0867
Pacc 0.0456 0.0731

n = 168

P-values for Tests of Bivariate Normality:
           RT   ACC
RT
ACC  8.43e-07
Pacc 9.41e-06 0.279

As can be seen, all three variables are significantly correlated. More importantly, both speed and accuracy are significantly influenced by the pay-off for accuracy Pacc.

1.2 Getting Started with R

17

1.2.6 Linear and Generalized Linear Models
To elaborate on the correlation between RT and Pacc, fitting a linear regression model is the natural next step. The following code first fits such a linear model through the lm() function and then summarizes the model:
R> lm1 <- lm(RT ~ Pacc, data = speed1)
R> summary(lm1)

Call:
lm(formula = RT ~ Pacc, data = speed1)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8764 -0.2332 -0.0466  0.1928  1.0914

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.5618     0.0523   106.4
R> lm2 <- lm(RT ~ Pacc + ACC, data = speed1)
R> summary(lm2)

Call:
lm(formula = RT ~ Pacc + ACC, data = speed1)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8977 -0.2361 -0.0282  0.2040  1.0730

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.5282 0.0633 87.34 F)

0.1206 0.88 0.0874 0.64

0.35 0.42

which first compares lm2 to lm1, and then lm3 to lm2. Comparing lm3 to lm1 would be achieved by calling anova(lm1,lm3). The anova function can also be called on a single model, in which case an analysis of variance table is produced for the terms in that model. The analogous analysis of response accuracy uses a logistic regression model, which is a generalized linear model (GLM) that can be estimated with the glm() function:
R> mod_acc <- glm(ACC ~ Pacc, family = binomial(), data = speed1)
R> summary(mod_acc)

Call:
glm(formula = ACC ~ Pacc, family = binomial(), data = speed1)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.423   0.157   0.382   0.672   1.369

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.441      0.328   -1.34     0.18
Pacc           6.645      1.334    4.98  6.3e-07

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 184.42  on 167  degrees of freedom
Residual deviance: 149.70  on 166  degrees of freedom
AIC: 153.7

Number of Fisher Scoring iterations: 5

As seen earlier, response times and accuracy correlate with one another and hence they could be analyzed in relation to one another. Hence, rather than analyzing RT only, a multivariate model can be applied. A complication is the binary nature of the ACC variable. A multivariate linear model could be fitted, although we should probably not put too much trust in the outcome. A multivariate linear model is estimated by using a multivariate response in the lm() function. This will effectively fit separate linear models to each of the response variables. To account for the dependence between the response variables, we can then perform a MANOVA:
R> mlm <- lm(cbind(RT, ACC) ~ Pacc, data = speed1)
R> summary(manova(mlm))
           Df Pillai approx F num Df den Df Pr(>F)
Pacc        1  0.466     71.9      2    165
R> summary(aov(mlm))
 Response RT :
             Df Sum Sq Mean Sq F value Pr(>F)
Pacc          1   15.9   15.90     117

 Response ACC :
             Df Sum Sq Mean Sq F value  Pr(>F)
Pacc          1   5.51    5.51    36.7 9.1e-09
Residuals   166  24.96    0.15
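The logistic regression estimates reported in this section are on the log-odds scale. As a small worked example, the sketch below converts them into predicted probabilities of a correct response with the inverse logit (plogis() in base R). The coefficient values are copied from the summary(mod_acc) output; the grid of Pacc values is an arbitrary choice for illustration.

```r
# Estimates copied from the summary(mod_acc) output in this section.
b0 <- -0.441   # intercept
b1 <-  6.645   # slope for Pacc

# Arbitrary illustrative values of the pay-off for accuracy.
pacc <- c(0, 0.25, 0.5, 0.75)

# Linear predictor (log odds of a correct response) ...
log_odds <- b0 + b1 * pacc
# ... and its inverse logit: 1 / (1 + exp(-log_odds)).
prob_cor <- plogis(log_odds)

data.frame(Pacc = pacc,
           log_odds = log_odds,
           prob_cor = round(prob_cor, 3))
```

Because ACC is a factor whose first level is inc, glm() with a binomial family models the probability of the second level, cor, which is why the values are labelled prob_cor here.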

1.2.7 Multinomial Logistic Regression
When an observed variable $Y$ has more than two possible categorical values, the effect of predictors $\mathbf{z}$ (usually including a constant covariate $z = 1$ to represent the intercept) on this variable can be estimated with multinomial logistic regression, a generalization of logistic regression for binary outcomes. For a variable with $K$ categorical outcomes, a multinomial logistic regression model predicts log odds of all $\binom{K}{2}$ pairs of categories by estimating a model for $K - 1$ of these pairs, which makes the remaining pairs redundant (Agresti 2002). In the baseline-category formulation, we choose one of the categories $J \in \{1, \ldots, K\}$ as the baseline category and formulate linear models for the log odds of the remaining categories over the baseline:

$$\log \frac{P(Y = i \mid \mathbf{z})}{P(Y = J \mid \mathbf{z})} = \boldsymbol{\beta}_i \mathbf{z} \tag{1.1}$$

where $\boldsymbol{\beta}_i$ is a vector with regression coefficients and $\mathbf{z}$ a vector with predictor values. We can also define models for the probabilities directly as

$$P(Y = i \mid \mathbf{z}) = \frac{\exp\{\boldsymbol{\beta}_i \mathbf{z}\}}{\sum_{j=1}^{K} \exp\{\boldsymbol{\beta}_j \mathbf{z}\}}, \tag{1.2}$$

where the parameters for the baseline category are fixed to $\boldsymbol{\beta}_J = (0, \ldots, 0)$. Estimation of multinomial logistic regression models is implemented in the multinom() function in the nnet package. For a (somewhat contrived) example of a multinomial regression model, we will discretize the reaction times in the speed1 data into four categories, and then model this new variable (catRT) as a function of Pacc:
R> require(nnet)
R> catRT
R> multi_mod <- multinom(catRT ~ speed1$Pacc)
R> summary(multi_mod)
Call:
multinom(formula = catRT ~ speed1$Pacc)

Coefficients:
         (Intercept) speed1$Pacc
med-fast     -0.5214       2.132
med-slow     -3.3849      13.468
slow         -4.2468      11.200

Std. Errors:
         (Intercept) speed1$Pacc
med-fast      0.3586       1.697
med-slow      0.6148       2.057
slow          0.8896       2.564

Residual Deviance: 312.8
AIC: 324.8

The default baseline category for the factor catRT is the first level ("fast"), and hence the model estimates the intercepts and slopes for the log odds of the remaining levels over the baseline. For instance,

$$\log \frac{P(Y = \text{med-fast} \mid \mathbf{z})}{P(Y = \text{fast} \mid \mathbf{z})} = -0.52 + 2.13 \times \text{Pacc}.$$

This equation can be interpreted as follows. Conditional upon the reaction time being in either the "fast" or "medium-fast" category, the odds that it is in the "medium-fast" category increase with Pacc. For instance, when Pacc = 0, the odds are exp{−0.52} = 0.59, while when Pacc = 1, the odds are exp{−0.52 + 2.13} = 5. Interpretation of a multinomial logistic regression model may be more straightforward when determining the probabilities of the categories directly, as in (1.2). The following function computes a matrix with predicted probabilities for all categories for values of the predictors specified in the predictors argument, also adding a label for the baseline category specified in the baseline argument: R> pred_prob R> R> R> R>
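One way such predicted probabilities could be computed from the coefficients reported above is sketched below. This is not the authors' pred_prob() function: the helper name softmax_prob, its argument names, and the grid of Pacc values are assumptions made for illustration. It applies Eq. (1.2) with the baseline coefficients fixed to zero.

```r
# Coefficients as reported by summary(multi_mod) above
# (rows: non-baseline categories; columns: intercept and slope for Pacc).
beta <- rbind("med-fast" = c(-0.5214,  2.132),
              "med-slow" = c(-3.3849, 13.468),
              "slow"     = c(-4.2468, 11.200))

# Hypothetical helper (not the book's pred_prob): Eq. (1.2) with the
# baseline category's coefficients fixed to zero.
softmax_prob <- function(beta, predictors, baseline = "fast") {
  z <- cbind(1, predictors)            # add the constant covariate
  eta <- cbind(0, z %*% t(beta))       # linear predictors; baseline is 0
  colnames(eta) <- c(baseline, rownames(beta))
  exp(eta) / rowSums(exp(eta))         # normalise within each row
}

probs <- softmax_prob(beta, predictors = c(0, 0.5, 1))
round(probs, 3)
rowSums(probs)                         # each row sums to 1

# Odds of "med-fast" over "fast" at Pacc = 0 and Pacc = 1;
# these equal exp(-0.5214) and exp(-0.5214 + 2.132).
odds <- probs[c(1, 3), "med-fast"] / probs[c(1, 3), "fast"]
odds
```

Dividing two columns of the probability matrix recovers the pairwise odds discussed above, which is a convenient check that the probabilities and the log-odds formulation describe the same model.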

require(TTR) Sys.setenv(tz=’UTC’) sp500 R> R> R> + R>

library("hmmr") data(speed1) mrt states R> R> R>

m1=0; m2=3 mm sum(log(0.5*dnorm(y,0,1)+0.5*dnorm(y,3,1))) [1] -1928

2.2.5 Posterior Probabilities
The parameters of a mixture model do not directly identify which observations stem from which component distribution. The probabilities associated with each component, $\pi_i$, describe the prior probability that a unit of observation stems from each component, and hence $\pi_i \overset{\text{def}}{=} P(S = i)$. The posterior probability is the probability that a unit of observation stems from a component distribution a posteriori, i.e. given the value of the observation: $P(S_t = i \mid Y_t)$. For mixture models these posterior probabilities are defined as:

$$P(S_t = i \mid y_t) = \frac{\pi_i f(y_t \mid S_t = i)}{\sum_{j=1}^{N} \pi_j f(y_t \mid S_t = j)}, \tag{2.4}$$

where $f(y_t \mid S_t = i) = f_i(y_t)$ is the component density or distribution function evaluated at $y_t$.
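Equation (2.4) translates almost literally into R. The sketch below is an illustration under stated assumptions rather than the book's own code: the helper name posterior, the sample size, and the seed are invented here, and the data are simulated from an equal mixture of N(0, 1) and N(3, 1), the parameter values that appear in the likelihood computation earlier in this chapter.

```r
set.seed(1)
# Simulated data (assumed setup): equal mixture of N(0,1) and N(3,1).
n <- 1000
s <- sample(1:2, n, replace = TRUE, prob = c(0.5, 0.5))  # true components
y <- rnorm(n, mean = c(0, 3)[s], sd = 1)

# Hypothetical helper applying Eq. (2.4) for N Gaussian components.
posterior <- function(y, prior, means, sds) {
  # Matrix with one row per observation and entries pi_i * f_i(y_t).
  joint <- sapply(seq_along(prior),
                  function(i) prior[i] * dnorm(y, means[i], sds[i]))
  joint <- matrix(joint, nrow = length(y))
  joint / rowSums(joint)               # divide by the mixture density
}

post <- posterior(y, prior = c(0.5, 0.5), means = c(0, 3), sds = c(1, 1))
head(round(post, 3))

# The denominator of Eq. (2.4) is the mixture density, so the
# log-likelihood is the sum of its logarithm over observations.
sum(log(0.5 * dnorm(y, 0, 1) + 0.5 * dnorm(y, 3, 1)))

# Proportion of observations whose most probable component is the true one.
mean(max.col(post) == s)
```

An observation exactly halfway between the two means (y = 1.5) gets posterior probability 0.5 for each component, which is a quick sanity check on the function.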

Fig. 2.3 Histogram of the posterior probabilities of component 1 in the 2-component model of the simulated data y

Assigning observational units to components can be useful, e.g. for the purposes of interventions or post-hoc analyses. For a 2-component Gaussian mixture model, the posterior probabilities are easy to compute with the following function: R> post 0. Such minimum


and maximum bounds can be placed on the parameters by calling optim with method="L-BFGS-B". R> + + + + + + + R> R> R> R> + + R>
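As a rough illustration of box-constrained optimization with optim(), the sketch below fits a two-component Gaussian mixture by minimizing the negative log-likelihood with method="L-BFGS-B". It is not the book's implementation: the function name neg_loglik, the parameter ordering, the simulated data, the starting values, and the particular bounds are all assumptions made for this example.

```r
set.seed(2)
# Simulated data (assumed setup): equal mixture of N(0,1) and N(3,1).
y <- c(rnorm(1000, 0, 1), rnorm(1000, 3, 1))

# Negative log-likelihood of a two-component Gaussian mixture.
# par = (p1, mu1, mu2, sd1, sd2); name and ordering are hypothetical.
neg_loglik <- function(par, y) {
  dens <- par[1] * dnorm(y, par[2], par[4]) +
    (1 - par[1]) * dnorm(y, par[3], par[5])
  # Guard against log(0), since L-BFGS-B requires finite function values.
  -sum(log(pmax(dens, .Machine$double.xmin)))
}

# Bounds keep the mixing proportion inside (0, 1) and the standard
# deviations strictly positive; the means are left unbounded.
fit <- optim(par = c(0.4, -0.5, 2.5, 1.5, 1.5), fn = neg_loglik, y = y,
             method = "L-BFGS-B",
             lower = c(0.01, -Inf, -Inf, 0.01, 0.01),
             upper = c(0.99,  Inf,  Inf,  Inf,  Inf))

round(fit$par, 2)   # estimates in the order p1, mu1, mu2, sd1, sd2
fit$convergence     # 0 indicates successful convergence
```

The lower bound of 0.01 on the standard deviations is an arbitrary small positive number; it also keeps the optimizer away from the degenerate solutions in which a component collapses onto a single observation.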

GaussMix2 stationary 0,

(4.7)

where, as earlier, $a_{ij}^{(s)} = A^{(s)}_{ij}$ denotes the $ij$-th element of the $s$-step transition matrix of the model.

A periodic state of a Markov model is a state that can only be returned to after a certain number of steps $j > 1$ (or a multiple thereof). To illustrate, consider the following transition matrix:

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \tag{4.8}$$

and assume that the Markov chain starts in state $S_1 = 1$ of the model. It can then be easily seen that state 1 can only be visited by the process for odd values of $t$: $P(S_t = 1) = 1$ for odd $t$ and $P(S_t = 1) = 0$ for even $t$. If the process started in state $S_1 = 2$, then state 1 can only be visited at even values of $t$. In both cases, state 1 can only be returned to after 2 steps. We then say that the period of state 1 is 2. Note that in this model, state $S = 2$ is also periodic with period 2. An aperiodic state is a state for which the period equals 1.

As noted above, an ergodic Markov model is a model that is both irreducible and aperiodic (i.e. it has no periodic states). As a consequence, there is some value $s$ for which the $s$-step transition matrix $A^s$ has only strictly positive entries. As was shown in the example of the stationary distribution above, for ergodic Markov models this $s$-step transition matrix converges, as $s$ increases, to a matrix with its rows equal to the stationary distribution. This is true regardless of the starting distribution. This property means that the initial state is 'forgotten' by the Markov chain.
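The difference between a periodic and an ergodic chain is easy to see by computing s-step transition matrices numerically. The sketch below uses the switch matrix of Eq. (4.8) together with an ergodic matrix whose entries are arbitrary values assumed for illustration; the helper mat_power is likewise a hypothetical name.

```r
# Hypothetical helper: s-step transition matrix by repeated multiplication.
mat_power <- function(A, s) {
  out <- diag(nrow(A))
  for (k in seq_len(s)) out <- out %*% A
  out
}

# The periodic transition matrix of Eq. (4.8).
A_switch <- matrix(c(0, 1,
                     1, 0), nrow = 2, byrow = TRUE)
mat_power(A_switch, 2)   # identity: back in the starting state after 2 steps
mat_power(A_switch, 3)   # equal to A_switch again: the powers never converge

# An ergodic example with arbitrary, assumed transition probabilities.
A_erg <- matrix(c(0.9, 0.1,
                  0.3, 0.7), nrow = 2, byrow = TRUE)
mat_power(A_erg, 50)     # both rows approach the stationary distribution

# Stationary distribution from the left eigenvector with eigenvalue 1,
# rescaled so that its elements sum to 1.
ev <- eigen(t(A_erg))
stat <- Re(ev$vectors[, 1]) / sum(Re(ev$vectors[, 1]))
stat
```

For the ergodic matrix the rows of the 50-step matrix agree with the eigenvector solution, whichever state the chain started in, which is the 'forgetting' of the initial state described above.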

4.1.2.3 Transient and Absorbing States

Earlier, periodic and aperiodic states were defined. There are two other types of states that are relevant to define: transient and absorbing states. To illustrate, Fig. 4.1 contains four paradigmatic examples of 2-state Markov models. Fig. 4.1a shows an ergodic Markov model with positive and equal transition probabilities between states; the Markov chains this model generates are sequences of flips of a fair coin. Figure 4.1b shows a periodic Markov model, i.e. all states are periodic and


4 Hidden Markov Models

Fig. 4.1 Four paradigmatic examples of 2-state Markov models. Nodes depict states and arrows depict transitions, labelled by the transition probabilities. (a) Ergodic model, coin flips. (b) Switch model. (c) All-or-none model. (d) Golden mean model

the Markov process switches back and forth between the 2 states at every time point t. Figure 4.1c shows a 2-state Markov model with one transient state and one absorbing state, labeled the all-or-none model. Note that the transition probability $P(S_{t+1} = 2 \mid S_t = 2) = 1$, meaning that once the process enters state 2 it never leaves this state again. Hence, state 2 is called an absorbing state. For state 1 the situation is different: with probability 1 the process leaves state 1 at some point, and then never returns. Hence, state 1 is called a transient state. The sequences generated by the all-or-none model start with a number of ones and then switch to generating twos indefinitely. This is a prototypical sequence for e.g. mastering a skill or attaining some knowledge: before learning there is no knowledge (S = 1), and after mastery knowledge is attained (S = 2) and never lost again. Hence the name all-or-none model, which refers to knowledge acquisition that happens suddenly.

The fourth and final example model, depicted in Fig. 4.1d, is labeled the golden mean model, for reasons that will become clear soon. Note that the model is a combination of the coin flip model and the periodic switch model. It has one state with equal transition probabilities and one state which is the opposite of absorbing: once the state is entered, the process leaves it at the next step. This Markov model generates sequences of the form 1, 2, ..., 2, 1, 2, ..., 2, 1, 2, ..., 2, 1, et cetera. That is, ones are interspersed with runs of one or more twos, and there are never two consecutive ones.
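The behaviour of the four example models can be made concrete by simulating short state sequences. The following is a minimal sketch: the function `simulate_chain` is a hypothetical helper, and the probability of 0.5 for leaving state 1 in the all-or-none model is an assumption, since the text does not fix that value.

```r
# Transition matrices (rows: from-state, columns: to-state).
models <- list(
  coin      = matrix(c(0.5, 0.5, 0.5, 0.5), 2, 2, byrow = TRUE),
  switch    = matrix(c(0,   1,   1,   0),   2, 2, byrow = TRUE),
  # Leaving probability 0.5 for state 1 is an assumed value.
  allornone = matrix(c(0.5, 0.5, 0,   1),   2, 2, byrow = TRUE),
  # State 1 is always left immediately; state 2 behaves like a coin flip.
  golden    = matrix(c(0,   1,   0.5, 0.5), 2, 2, byrow = TRUE)
)

# Simulate n states from a Markov chain with transition matrix A.
simulate_chain <- function(A, n, start = 1) {
  s <- integer(n)
  s[1] <- start
  for (t in 2:n) {
    s[t] <- sample.int(nrow(A), 1, prob = A[s[t - 1], ])
  }
  s
}

set.seed(123)
sims <- lapply(models, simulate_chain, n = 30)
sims
```

In the simulated golden mean sequence no two ones are adjacent, the all-or-none sequence never returns to state 1 once it has reached state 2, and the switch sequence alternates strictly.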

4.1.2.4 Dwell Time Distribution

In the golden mean model, the number of twos between consecutive ones has a geometric distribution with p equal to 0.5. Note that this is a general property of states of Markov models: the number of time points spent in a particular state (the

4.1 Preliminaries: Markov Models

Fig. 4.2 Example dwell time distributions for p = .5, p = .8, and p = .95 (horizontal axis: dwell time in consecutive state visits; vertical axis: probability)

so-called dwell time) has a geometric distribution with a probability p that is the probability of remaining in that particular state: $P(\text{dwell time} = k) = p^{k-1}(1 - p)$ for $k = 1, 2, \ldots$, where $p = P(S_{t+1} = i \mid S_t = i) = a_{ii}$. In other words, the values of p are the diagonal elements of the transition matrix A. This distribution is called the dwell time distribution. Some examples of dwell time distributions, for different values of p, are shown in Fig. 4.2.

Before combining mixture model properties with Markov model properties to define the hidden Markov model in the next section, we first briefly discuss some connections between Markov models and grammars. This connection lies at the basis of the first applications of Markov models in psychology.
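The dwell time distribution is easy to evaluate directly. In the sketch below, `dwell_pmf` is a hypothetical helper name; the comparison with `dgeom` relies on base R parameterising the geometric distribution by the number of failures before the first success, so that a dwell time of k corresponds to k - 1 "stays" followed by one "leave".

```r
# P(dwell time = k) for a state with self-transition probability p.
dwell_pmf <- function(k, p) p^(k - 1) * (1 - p)

k <- 1:14
probs <- sapply(c(0.5, 0.8, 0.95), function(p) dwell_pmf(k, p))
colnames(probs) <- c("p=.5", "p=.8", "p=.95")
round(probs, 3)

# Same values via base R: k - 1 stays, each with probability p, then a leave.
all.equal(dwell_pmf(k, 0.8), dgeom(k - 1, prob = 1 - 0.8))

# The expected dwell time is 1 / (1 - p), e.g. 20 time points for p = .95.
sum((1:10000) * dwell_pmf(1:10000, 0.95))
```

The larger the self-transition probability, the flatter the distribution and the longer the expected stay in the state.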

4.1.2.5 Markov Models and Grammars: The Golden Mean Model

Markov models can be used to represent grammars, in particular so-called finite state grammars. The formalization of the notion of grammars is intimately tied in with the development of digital computing machinery in the middle of the twentieth century (Turing 1950/1990), and has had a profound influence on thinking about human psychological capabilities (Chomsky 1959). To illustrate the connection between Markov models and grammars, consider the golden mean model introduced above. The model generates strings of ones and twos, and these strings can be interpreted as the grammatical sentences of a language. The strings of length one that the language has are 1 and 2; the strings of length 2 that the language has are: 12, 21, and 22, but not 11; the strings of length 3 are: 121, 221, 212, 122, and 222. Using the properties of the Markov model, the number of strings of a given length can be easily computed. Consider the adjacency matrix of the golden mean model:

$$\text{Adj} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}, \qquad (4.9)$$

where a 1 indicates that there is a direct path between states and a 0 that there is no path between states. The number of paths of length $m$ is simply computed as the sum of the entries in the adjacency matrix raised to the power $m$:

$$\begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{m} \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \qquad (4.10)$$

Note that a path of length two reflects two state transitions; a path as used here does not include the starting state. Hence, a path of length two corresponds to a string of length three. In R, it is straightforward to compute these numbers of paths:

R> x <- matrix(c(1, 1, 1, 0), 2, 2)
R> sum(x) # path length 1, string length 2
[1] 3
R> sum(x%*%x) # path length 2, string length 3
[1] 5
R> sum(x%*%x%*%x) # path length 3, string length 4
[1] 8
R> sum(x%*%x%*%x%*%x) # path length 4, string length 5
[1] 13

which confirms that the number of strings of length two is 3, and the number of strings of length three is 5. There are two strings of length 1; these strings have path length 0, as they contain only the starting states and no state transitions. The model can generate 2, 3, 5, 8, 13, ... strings of lengths 1, 2, 3, 4, 5, ..., respectively. This sequence of numbers can be recognized as the continuation of the Fibonacci sequence (the full Fibonacci sequence starts with initial numbers 1, 1, ..., which are missing here).

The growth rate of the number of paths/strings in this grammar is determined by the largest eigenvalue of the adjacency matrix. The eigenvalues are computed as the roots of the characteristic polynomial of this matrix:

$$\chi(t) = \begin{vmatrix} t - 1 & -1 \\ -1 & t \end{vmatrix} = t^2 - t - 1.$$

The solutions are

$$t = \frac{1 + \sqrt{5}}{2} \quad \text{and} \quad t = \frac{1 - \sqrt{5}}{2}, \qquad (4.11)$$

and the largest of these two, $t = \frac{1 + \sqrt{5}}{2}$, determines the growth rate of the number of paths/strings. This number is the familiar golden mean ratio, approximately equal to 1.618. In R, the value can be obtained as

R> eigen(x)$values[1]
[1] 1.618

The logarithm of the growth rate is known as the entropy of the model or of the grammar/language. The entropy of the golden mean model is 0.481 nats. Compared to the entropy of the coin flip model, which is 0.693 nats, the entropy is lower, which is due to certain strings being impossible in the golden mean model.
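The link between the string counts, the largest eigenvalue and the entropy can be verified in a few lines. This is a minimal sketch; the helper `n_paths` is an assumption introduced here for illustration.

```r
# Adjacency matrix of the golden mean model (Eq. 4.9).
x <- matrix(c(1, 1, 1, 0), 2, 2)

# Number of paths of length m: sum of the entries of the m-th matrix power.
n_paths <- function(adj, m) {
  P <- diag(nrow(adj))
  for (i in seq_len(m)) P <- P %*% adj
  sum(P)
}

counts <- sapply(0:10, n_paths, adj = x)
counts  # number of strings of length 1, 2, ..., 11

# Ratios of successive counts approach the largest eigenvalue.
counts[-1] / counts[-length(counts)]
growth <- max(eigen(x)$values)

# Entropy in nats: log of the growth rate; log(2) for the coin flip model.
entropy_golden <- log(growth)
entropy_coin <- log(2)
c(golden = entropy_golden, coin = entropy_coin)
```

Because every string of length n is possible in the coin flip model, its count doubles at each step, which is why its entropy equals log(2).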

4.2 Introducing the Hidden Markov Model

Just like a Markov model, a hidden Markov model is characterized by the state sequence $S_{1:T}$. The difference is that in a hidden Markov model, the states are not observable, i.e. they are hidden. However, the states are assumed to “produce” a sequence of observable responses $Y_{1:T}$. Each state i is characterized by a conditional density/distribution $f(Y \mid S = i)$, which specifies the distribution of values of the observable response variable Y whenever the state equals S = i.

Note that in this chapter and in the following sections the presentation of models and methods is limited to single timeseries $Y_{1:T}$. Thus, where in the context of mixture models subscripts t denoted observations from different units such as participants, in the context of hidden Markov models subscript t denotes the time point of an observation from the same unit. As the later applications will show, there are many situations in which data consist of multiple timeseries, e.g. when data of multiple participants in an experiment are observed. To distinguish between different timeseries, they can each be denoted as $Y^{(i)}_{1:T}$, or e.g. $Y^{(i)}_{1:T_i}$ when the series have different lengths. As generalization of the methods from a single to multiple timeseries is generally straightforward, we avoid such complications and focus on single timeseries. Section 4.3.6 explicitly presents the case for computing the likelihood of multiple timeseries under a hidden Markov model.

4.2.1 Definitions

A hidden Markov model is formally defined by two sequences, the state sequence $S_{1:T}$ and the response sequence $Y_{1:T}$, that are linked through conditional dependencies as shown in the directed graph in Fig. 4.3 and Eqs. 4.12 and 4.13:

$$P(S_t \mid S_{1:t-1}) = P(S_t \mid S_{t-1}), \qquad t = 2, 3, \ldots, T, \qquad (4.12)$$

$$f(Y_t \mid S_{1:t}, Y_{1:t-1}) = f(Y_t \mid S_t), \qquad t = 1, 2, \ldots, T. \qquad (4.13)$$

Fig. 4.3 A basic hidden Markov model

Equation 4.12 expresses the Markov property for the state sequence of the hidden Markov model, i.e., the current state $S_t$ only depends on the previous state $S_{t-1}$. This is identical to the Markov property of a (simple) Markov model. Equation 4.13 expresses that the current observation $Y_t$ depends only on the current state of the underlying Markov chain. Making use of these conditional independencies, the joint distribution of observations and hidden states can be expressed as

$$f(Y_{1:T}, S_{1:T}) = P(S_1) f(Y_1 \mid S_1) \times \prod_{t=2}^{T} P(S_t \mid S_{t-1}) f(Y_t \mid S_t). \qquad (4.14)$$

The probability density functions f and probability distribution functions P generally depend on parameters $\theta$. Explicitly incorporating these parameters into the notation, we can think of the hidden Markov model as consisting of three submodels:

1. Prior model: A model of $P(S_1 = j \mid \theta_{\text{pr}})$, the prior probability for each state j, given parameters $\theta_{\text{pr}}$.
2. Transition model: A model of $P(S_t = j \mid \theta_{\text{tr}}, S_{t-1} = i)$, the transition probability of moving from state i on trial t − 1 to state j on trial t, given parameters $\theta_{\text{tr}}$.
3. Observation or response model: A model of $P(Y_t \mid \theta_{\text{obs}}, S_t = i)$, the probability of an observation given state i and parameters $\theta_{\text{obs}}$.

Using these three submodels, important properties such as the likelihood can be computed. Before doing so we first present an example of a simple hidden Markov model and discuss the relationship with the mixture model that was the focus of earlier chapters.
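To make the factorisation in Eq. 4.14 concrete, the sketch below evaluates the joint log-density of a given state sequence and response sequence for a model with binary responses. Everything here is an illustrative assumption: the function name `hmm_joint_logdens`, the Bernoulli response model, and the parameter values `init` (prior model), `A` (transition model) and `p` (response model, the probability of observing a 1 in each state).

```r
# Joint log-density of observations y (0/1) and states s (1, 2, ...),
# following the factorisation of Eq. 4.14.
hmm_joint_logdens <- function(y, s, init, A, p) {
  n <- length(y)
  # Prior model and response model at t = 1.
  ll <- log(init[s[1]]) + dbinom(y[1], 1, p[s[1]], log = TRUE)
  if (n > 1) {
    for (t in 2:n) {
      ll <- ll +
        log(A[s[t - 1], s[t]]) +               # transition model
        dbinom(y[t], 1, p[s[t]], log = TRUE)   # response model
    }
  }
  ll
}

# Hypothetical parameter values.
init <- c(0.5, 0.5)
A <- matrix(c(0.9, 0.1,
              0.2, 0.8), 2, 2, byrow = TRUE)
p <- c(0.5, 0.9)

# Two time points: 0.5 * 0.5 * 0.1 * 0.1 = 0.0025.
exp(hmm_joint_logdens(y = c(1, 0), s = c(1, 2), init = init, A = A, p = p))
```

Working on the log scale avoids numerical underflow for long series; summing this joint density over all possible state sequences would give the likelihood of the observations alone.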

4.2.2 Relation Between Hidden Markov and Mixture Model

Mixture models are a special case of hidden Markov models. If each row of the transition matrix is identical to the vector of initial state probabilities, we have:

$$P(S_t = i \mid S_{t-1} = j) = P(S_t = i \mid S_{t-1} = k) = P(S_t = i) = \pi_i, \qquad (4.15)$$

for all i, j, k, t. This model is thus identical to a mixture model with component probabilities $\pi_i$. HMMs can thus be viewed as a generalization of mixture models, in which a Markovian structure is imposed on the component probabilities. For this reason, HMMs are sometimes also referred to as dependent mixture models (Leroux and Puterman 1992), where the mixture components are dependent on each other over repeated measurements.
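The special case in Eq. 4.15 can be checked numerically: when all rows of the transition matrix equal the component probabilities, the state distribution at every later time point equals those probabilities, whatever the previous state. The values in `pi_mix` below are assumed for illustration.

```r
# Hypothetical component probabilities of a 2-component mixture.
pi_mix <- c(0.3, 0.7)

# Transition matrix whose rows are all equal to pi_mix.
A <- unname(rbind(pi_mix, pi_mix))

# State distribution at the next time point, starting from either state.
from_state_1 <- as.numeric(c(1, 0) %*% A)
from_state_2 <- as.numeric(c(0, 1) %*% A)
rbind(from_state_1, from_state_2)  # both rows equal pi_mix

# Further steps change nothing: A %*% A is A again.
A %*% A
```

Because the next state does not depend on the current one, the states are independent draws from `pi_mix`, which is exactly the mixture model.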

4.2.3 Example: Bernoulli Hidden Markov Model

As a simple example consider a two-state hidden Markov model with binary observations, also called a Bernoulli hidden Markov model. The responses in one state are produced by a fair coin, generating equal numbers of heads and tails, whereas the responses in the second state stem from a biased coin that produces almost only tails. A typical series of responses that could be produced by this process is, with heads and tails replaced by zeroes and ones: 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1. From this series of zeroes and ones it is hard to tell which coin generated each observation. The states of the process are indeed hidden, and not identifiable with certainty from the observations. One can however compute the most likely coin (state) underlying the outcomes (response) from knowledge of the (biased versus unbiased) probabilities of the outcomes of flipping each coin, and the probabilities of using each coin at time t. The series here was in fact created by the following R-code:

set.seed(2) y1 R> R> R> R> R> R> R> + +
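A minimal sketch of how a series of this kind can be simulated in base R. The transition matrix `A`, the response probabilities `p_one` and the starting state are assumptions chosen for illustration, so the output is not expected to reproduce the series shown above.

```r
set.seed(2)
n <- 20

# Assumed transition matrix: both coins tend to be reused on the next flip.
A <- matrix(c(0.9, 0.1,
              0.1, 0.9), 2, 2, byrow = TRUE)

# Assumed probability of observing a 1 (tails) in each state:
# state 1 is the fair coin, state 2 the biased coin.
p_one <- c(0.5, 0.9)

s <- integer(n)  # hidden states
y <- integer(n)  # observed responses

s[1] <- 1        # assumed starting state
y[1] <- rbinom(1, 1, p_one[s[1]])
for (t in 2:n) {
  s[t] <- sample.int(2, 1, prob = A[s[t - 1], ])
  y[t] <- rbinom(1, 1, p_one[s[t]])
}

rbind(state = s, response = y)
```

Printing the states next to the responses shows why the states are hard to recover: a run of ones is typical of the biased coin but can also be produced by the fair coin.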

data(discrimination) y R> R> + + + + +

scaled_alpha R> R> R> R> + + + + + +

nt + + + R> R> R> R>

data(dccslong) data(dccs) grad + +

library(lmtest) ar1 mod3 set.seed(1) R> fm3 summary(fm3)
Initial state probabilities model
pr1 pr2 pr3
  1   0   0


5 Univariate Hidden Markov Models

Transition matrix
        toS1  toS2  toS3
fromS1 0.984 0.016 0.000
fromS2 0.000 0.962 0.038
fromS3 0.000 0.000 1.000

Response parameters
Resp 1 : gaussian
    Re1.(Intercept) Re1.sd
St1           337.0  204.2
St2           169.4   70.7
St3            70.5   39.9

The parameter values clearly indicate that there are large shifts in both means and variances between the different states of the model. The goodness-of-fit statistics of the change-point models are listed in Table 5.6. The model-predicted values of the change-point models are depicted in Fig. 5.13. As is clear from Table 5.6, according to the AIC and BIC criteria the change-point models capture the data much better than the other models. Hence, there is evidence that sudden changes are a more likely mechanism causing the decrease in water catchments in these dams than more gradual changes.

As a final check on the goodness-of-fit of the model, consider the autocorrelation functions in Fig. 5.14. The upper panel depicts the autocorrelation function of the data, whereas the bottom panel depicts the autocorrelation function of the residuals of the change-point model. Clearly, the model captures the autocorrelation in the data quite well, although some residual autocorrelation is left, particularly at longer lags.

Fig. 5.13 Perth dams water inflow and predicted averages for the change-point models with 1 and 2 change points, respectively (horizontal axis: year, 1910 to 2020; vertical axis: inflow in GL)

Fig. 5.14 Autocorrelation function (ACF) for the Perth dams data and for the residuals of the 2-changepoint model (panels: data, residuals; horizontal axis: lag, 0 to 10)

There is still room for improvement of the model, and one of the options is to include an autoregressive component in the change-point model by including the observation at the previous time point as a predictor in the response models. We leave it to the reader to check whether this indeed improves the model. Another possibility for better capturing these data stems from the observation that the means and variances in the change-point model seem to be somewhat proportional to one another. It therefore seems sensible to model the log-transformed data or to use a log-normal distribution that can exploit this property. A more-or-less linear relationship between means and variances is also quite common in psychological data; in particular, response times typically show such a relationship.

5.5 Generalized Linear Hidden Markov Models for Multiple Cue Learning

This section concerns hidden Markov models in which the response models are formulated as generalized linear models. In the preceding sections, we have often looked at models in which a response variable depended on the state, but not on additional auxiliary variables. In this section we will look at hidden Markov models in which the states define different relations between a response variable and auxiliary predictors.

The WPT dataset (see Sect. 1.3.8) concerns people’s ability to learn to associate a binary outcome (the state of the weather: “Rainy” or “Fine”) with patterns of binary cues (the presence or absence of 4 “tarot” cards). On each trial of the task, a participant is presented with a cue pattern and asked for their prediction, after which they are informed of the true state of the weather. From this outcome feedback, they can learn to associate the state of the weather with the cues. This learning can occur gradually, for instance by increasing the associations between individual cues and the outcome that occurred, and decreasing the associations with the outcome that did not occur. An alternative view is that learning might occur in more discrete stages, where people develop more complex response strategies over time. This latter view was proposed by Gluck et al. (2002) and also formed the basis of the analyses in Speekenbrink et al. (2010).

On the basis of participants’ answers in a post-task questionnaire, Gluck et al. distinguished three broad classes of response strategies in the Weather Prediction Task: a Singleton strategy, where participants respond optimally on those trials in which a single card is present, but guess for combinations of multiple cards; Single Cue strategies, in which participants respond optimally to the presence or absence of a particular card, ignoring the other cards; and a Multi-Cue strategy, in which participants respond optimally to all possible cue patterns. In Speekenbrink et al. (2010), hidden Markov models were formulated to analyse participants’ behaviour in terms of (variants of) these strategies. Here, we will take a different approach, and use hidden Markov models to identify possible response strategies that people may use. If we let go of the optimality assumption, then each strategy (apart from the Singleton) identified by Gluck et al. (2002) can be implemented as a logistic regression model:

1. Single Cue 1: r ~ 1 + c1
2. Single Cue 2: r ~ 1 + c2
3. Single Cue 3: r ~ 1 + c3
4. Single Cue 4: r ~ 1 + c4
5. Multi-cue: r ~ 1 + c1 + c2 + c3 + c4

Note that all other states can be viewed as special cases of the Multi-cue strategy state, in which particular slopes for the cues are fixed to 0 (e.g., the Single Cue 1 strategy can be derived from the Multi-cue strategy by fixing the slopes for c2, c3, and c4 to 0). Thus, if any of these strategies is used, we might be able to identify them by fitting a hidden Markov model with different versions of the multicue strategy as states. If e.g. the Single Cue 1 strategy is used, then this would be indicated by the slopes of c2, c3, and c4 being (close to) 0. This is the approach we will take here. We assume that through learning, people will apply more useful response strategies over time. We therefore expect the states to be ordered, in the sense that strategies applied earlier will be less useful than strategies applied later on. As a result, we expect that once someone switches from one strategy to another, the earlier strategy will not be applied again. However, we do not assume that everyone will necessarily go through the same sequence of strategies. For example, while the probability of transitioning from state 2 back to state 1 should be P(St+1 = 1|St = 2) = 0, transitions to other states may all have a positive probability, i.e. P(St+1 ≥ 2|St = 2) ≥ 0. Thus, the transition matrix is constrained such that Aij = 0 whenever i > j . Note that this is less restrictive than the left-to-right structure applied to the change-point model in the last section, which additionally had the condition that Aij = 0 whenever j > i + 1.
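The nesting of the Single Cue strategies within the Multi-cue model can be illustrated with ordinary logistic regression on simulated data. This is a sketch under stated assumptions: the data are simulated here (they are not the WPT data), the variable names `r` and `c1` to `c4` follow the formulas above, and the intercept and slope used to generate the responses are hypothetical.

```r
set.seed(42)
n <- 2000

# Four binary cues, each present on about half of the trials (assumed design).
cues <- as.data.frame(
  matrix(rbinom(4 * n, 1, 0.5), n, 4,
         dimnames = list(NULL, paste0("c", 1:4)))
)

# A simulated responder who uses only cue 1 (hypothetical coefficients).
cues$r <- rbinom(n, 1, plogis(-1 + 2 * cues$c1))

# Multi-cue model and the nested Single Cue 1 model.
multi  <- glm(r ~ 1 + c1 + c2 + c3 + c4, family = binomial(), data = cues)
single <- glm(r ~ 1 + c1, family = binomial(), data = cues)

# Slopes for c2, c3 and c4 should be close to 0 for this responder.
round(coef(multi), 2)

# Likelihood ratio test of the restriction that these three slopes are 0.
anova(single, multi, test = "Chisq")
```

In the hidden Markov model the same logic applies within each state: a state whose estimated slopes for three of the cues are near 0 behaves like a Single Cue strategy.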


In the following, we will fit several hidden Markov models with the number of hidden states ranging between 2 and 8. As the models with a larger number of hidden states become quite complex (in terms of the number of freely estimable parameters), obtaining good estimates may be tricky. To make it more likely to arrive at estimates which are at the global maximum of the likelihood, we will estimate each model starting with a number of random starting values. This procedure is automated in the multistart() function in the depmixS4 package. This function takes as arguments the name of an (unfitted) depmix model, and two integer valued arguments called nstart and initIters. The function relies on the random initial assignment of observations to states (see Sect. 2.3.6) to automatically generate a total of nstart different starting values for the response model parameters. Rather than running the EM algorithm until convergence from each starting value, the EM algorithm is run for initIters iterations. If a model has a higher likelihood at this point than any of the preceding models, it is considered the currently best model. After trying a total of nstart starting values, the EM algorithm is then applied to the currently best model until convergence. Not running the EM algorithm until convergence for each starting value saves considerable computation time. And as large increases in likelihood generally occur at the earlier iterations of EM, this procedure provides a reasonable balance such that it is possible to try many starting values without increasing the computational burden too much. In the following code, we first define the function lrTrStart, which generates starting values for the transition matrix that follow the (partial) left-to-right structure. For all states (except the final one), the starting value for self-transitions is set to aii = .8, while the remaining probability is equally divided over the transitions to states with a higher index. 
The models are constrained to start in the first state by setting P(S1 = 1) = 1 as a starting value. Note that the random assignment of observations to states is used to derive initial values for the parameters of the response models. The starting values of the initial state and transition probabilities are those provided. We then use the provided and randomly determined starting values to estimate hidden Markov models with between 2 and 8 states.
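A minimal sketch of a starting-value generator with the structure described above: self-transitions of .8 for all but the final state, the remaining probability divided equally over states with a higher index, and zeros below the diagonal. The function name `lr_tr_start` and its arguments are hypothetical, and making the final state absorbing is an assumption that follows from the constraint that no transitions to lower-indexed states are allowed.

```r
# Starting values for a (partial) left-to-right transition matrix.
# ns: number of states; stay: starting value for self-transitions.
lr_tr_start <- function(ns, stay = 0.8) {
  A <- matrix(0, ns, ns)
  for (i in seq_len(ns - 1)) {
    A[i, i] <- stay
    # Divide the remaining probability equally over higher-indexed states.
    A[i, (i + 1):ns] <- (1 - stay) / (ns - i)
  }
  # The final state can only transition to itself.
  A[ns, ns] <- 1
  A
}

lr_tr_start(4)

# Starting value for the initial state probabilities: start in state 1.
init_start <- c(1, rep(0, 3))
init_start
```

Because the EM algorithm never moves a probability away from exactly 0, zeros in the starting values act as constraints, so starting values of this form keep the lower triangle of the transition matrix at 0 throughout estimation.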

data(WPT) lrTrStart R> R> + + + + + + + +

library(hmmr) data(balance8) set.seed(12) hm3id R> fhm3id R> hm4id R> fhm4id fitlm summary(fitlm) Df Pillai approx F num Df den Df Pr(>F) Pacc 1 0.408 149.8 2 434 summary.aov(fitlm) Response rt : Df Sum Sq Mean Sq F value Pr(>F) Pacc 1 40.8 40.8 291.73 F) Pacc 1 8.6 8.57 50.06 6e-12 prev 1 0.4 0.40 2.31 0.13 I(Pacc^2) 1 0.0 0.03 0.16 0.69 Residuals 435 74.4 0.17

6.2 Switching Between Speed and Accuracy


These separate ANOVAs indicate that the response times (Response 1) are most strongly influenced by all three variables whereas accuracy (Response 2) is most strongly influenced by the pay-off for accuracy but less so by the accuracy on the previous trial.

Our hypothesis of interest concerns models with two modes, and possibly serial dependence between the modes. In particular, we hypothesize that the pay-off for accuracy does not directly influence response times and accuracies. Rather, we expect two modes of responding, fast guessing and slow and accurate responding, and the pay-off for accuracy determines which mode of responding is active. The first part of the hypothesis concerns the test of the existence of two modes of responding for the two responses simultaneously. The response model for the RTs is identical to that used in Sect. 5.3. Response accuracies are modeled as Bernoulli distributed variables, with probability of success depending on the state:

$$\text{ACC}_t \mid S_t = i \sim \text{Bernoulli}(p_i), \qquad i = 1, 2. \qquad (6.1)$$

The response times and accuracies are modeled as locally independent, meaning that RT and ACC are independent conditional on the state variable:

$$\text{ACC}_t \perp \text{RT}_t \mid S_t. \qquad (6.2)$$
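Local independence means that, within a state, the joint density of a response time and an accuracy is simply the product of the two state-dependent densities. The sketch below illustrates this with hypothetical parameter values for two states; the names `mu`, `sigma`, `p_cor` and `joint_dens` are assumptions introduced here.

```r
# Hypothetical state-dependent parameters (state 1: fast guessing,
# state 2: slow and accurate). Response times are on a log scale.
mu    <- c(5.5, 6.4)   # mean response time per state
sigma <- c(0.2, 0.25)  # standard deviation of response time per state
p_cor <- c(0.5, 0.9)   # probability of a correct response per state

# Joint density of (rt, acc) given the state, under local independence:
# Gaussian density for rt times Bernoulli probability for acc.
joint_dens <- function(rt, acc, state) {
  dnorm(rt, mu[state], sigma[state]) * dbinom(acc, 1, p_cor[state])
}

# A fast, incorrect response is far more plausible under state 1.
joint_dens(rt = 5.5, acc = 0, state = 1:2)

# A slow, correct response is far more plausible under state 2.
joint_dens(rt = 6.4, acc = 1, state = 1:2)
```

Note that the two responses are still dependent marginally: averaging over states, slow responses go together with correct responses, even though they are independent within each state.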

Conditional independence is a typical assumption in latent variable modeling (Bollen 2002). The most basic model to fit is a 2-state hidden Markov model with response times and accuracies as responses. The code below specifies such a model with response times modeled with a Gaussian distribution and response accuracies modeled with a multinomial distribution. Note that a multinomial distribution for two categories and size n = 1 is identical to a Bernoulli distribution, as is the binomial distribution with size n = 1. We use a multinomial rather than a binomial distribution as the former allows the use of an “identity” link function, which provides simple estimates of the probability of success and failure. The final lines of code fit the model and provide the parameter estimates.

R> hm2 set.seed(1) R> fhm2 summary(fhm2)
Initial state probabilities model
pr1 pr2
  0   1


6 Multivariate Hidden Markov Models

Transition matrix
        toS1  toS2
fromS1 0.899 0.101
fromS2 0.084 0.916

Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
    Re1.(Intercept) Re1.sd Re2.inc Re2.cor
St1           5.521  0.202   0.472   0.528
St2           6.392  0.240   0.099   0.901

The parameter estimates clearly support the intended interpretation of the 2-state hidden Markov model, with state 1 being the fast guessing state and state 2 the slow and accurate mode of responding. Below, these states are referred to as state FG (Fast Guessing) and state SC (Stimulus Controlled; meaning that responding is controlled by the contents of the stimulus, while in the FG mode responding is determined only by the presence of the stimulus). The SC state has an accuracy of just over 90% correct responding whereas the FG state has accuracy at guessing levels (roughly 50% correct).

The second part of the hypothesis concerns the effect of Pacc on switching between the modes. Note that the variable Pacc has no explicit role in the 2-state model described above. In the linear model, the assumption is expressed that Pacc influences response times and accuracy directly. In the 2-state model, the core assumption is that there are two modes of behavior, slow and accurate versus fast guessing behavior, and that switching between these modes of behavior is influenced by the pay-off for accuracy. Hence, Pacc is hypothesized to have an indirect influence on RT and ACC through the state variables $S_t$. If this assumption is true, then it should be possible to predict the states $S_t$ from the values of Pacc. More precisely, we should be able to use Pacc to predict switching from FG to SC (when Pacc becomes relatively large), and from SC to FG (when Pacc becomes relatively small). To incorporate this assumption into the hidden Markov model framework, we can include Pacc as a covariate on the transition probabilities. Note that by including such a covariate, the hidden Markov model is no longer homogeneous.
Markovian models with time-varying covariates have been studied before in time series data (Hughes et al. 1999; Hughes and Guttorp 1994), and in the context of longitudinal data (Vermunt et al. 1999; Chung et al. 2007). In the next model, the transition probabilities are modeled by including Pacc as a predictor through a multinomial logistic regression (see Sect. 1.2.7). The following equations define this regression for the 2-state model:

$$\log \frac{1 - a_{11}(t)}{a_{11}(t)} = \beta_{1,0} + \beta_{1,1} \cdot \text{Pacc}_t$$

$$\log \frac{a_{22}(t)}{1 - a_{22}(t)} = \beta_{2,0} + \beta_{2,1} \cdot \text{Pacc}_t. \qquad (6.3)$$

The following code specifies and fits the model by adding the extra argument transition=~Pacc to the previous model specification:

R> hm2tr set.seed(1) R> fhm2tr summary(fhm2tr,which="response")
Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
    Re1.(Intercept) Re1.sd Re2.inc Re2.cor
St1           5.522  0.203   0.474   0.526
St2           6.394  0.237   0.096   0.904

which are very similar to the estimates for the model without the covariate, and their interpretation hence remains the same. The estimated model for the transition probabilities is:

R> summary(fhm2tr,which="transition")
Transition model for state (component) 1
Model of type multinomial (mlogit), formula: ~Pacc
Coefficients:
            St1    St2
(Intercept)   0 -4.223
Pacc          0  9.133
Probalities at zero values of the covariates.
0.9856 0.01445

Transition model for state (component) 2
Model of type multinomial (mlogit), formula: ~Pacc
Coefficients:
            St1    St2
(Intercept)   0 -3.374
Pacc          0 15.805
Probalities at zero values of the covariates.
0.9669 0.03313

Fig. 6.7 Transition probabilities as a function of the pay-off for accuracy (Pacc) between the stimulus-controlled state (SC) and the fast-guessing state (FG). Curves: P(switch from FG to SC) and P(stay in SC) (horizontal axis: Pacc, 0.0 to 1.0; vertical axis: probability, 0.0 to 1.0)

The parameter estimates indicate a sizable effect of Pacc on the transition probabilities from one state to the other. As can be seen in the summary of the transition model, $\hat{\beta}_{1,0} = -4.22$ and $\hat{\beta}_{1,1} = 9.13$ for the transition model of the fast-guessing state. Similarly, $\hat{\beta}_{2,0} = -3.37$ and $\hat{\beta}_{2,1} = 15.8$ for the transition model of the stimulus-controlled state. To illustrate what these parameters entail, the following code plots the transition probabilities as a function of Pacc:
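A minimal base R sketch of such a plot, computed directly from the coefficients printed in the transition model summary rather than from the fitted model object. It assumes that Pacc enters the model unscaled and ranges from 0 to 1; the object names `b1`, `b2`, `p_fg_to_sc` and `p_stay_sc` are introduced here for illustration.

```r
# Coefficients (intercept, slope) taken from the summary output above.
b1 <- c(-4.223, 9.133)    # FG state: log-odds of switching to SC
b2 <- c(-3.374, 15.805)   # SC state: log-odds of staying in SC

pacc <- seq(0, 1, length.out = 101)

# Inverse logit of the linear predictors in Eq. 6.3.
p_fg_to_sc <- plogis(b1[1] + b1[2] * pacc)
p_stay_sc  <- plogis(b2[1] + b2[2] * pacc)

plot(pacc, p_fg_to_sc, type = "l", ylim = c(0, 1),
     xlab = "Pacc", ylab = "Probability")
lines(pacc, p_stay_sc, lty = 2)
legend("bottomright",
       legend = c("P(switch from FG to SC)", "P(stay in SC)"),
       lty = 1:2)

# At Pacc = 0 these reproduce the probabilities reported in the summary.
round(c(p_fg_to_sc[1], p_stay_sc[1]), 4)
```

Both curves rise steeply with Pacc: at low pay-off for accuracy the process rarely enters and rarely stays in the SC state, whereas at high pay-off it switches to SC quickly and stays there.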

pars R> R> R> R> R> R>

pars R> R> R> R> R> R> R> R> R> R> R>

pars R> R> + + + R>

IGT$prevloss 0 IGT$prevloss[IGT$trial==1] R> R> R> + R> R> R> + R> R> R> R> R> R> R> R> R> R> R>

data("discrimination") ntim