English Pages 558 [559] Year 2007
Chemometrics in Spectroscopy
Howard Mark Mark Electronics Suffern, New York USA
Jerry Workman Jr. Thermo Fischer Scientific Inc. Molecular Spectroscopy & Microanalysis Madison, WI USA
Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
Academic Press is an imprint of Elsevier
84 Theobald’s Road, London WC1X 8RR, UK
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
Linacre House, Jordan Hill, Oxford OX2 8DP, UK
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA

First edition 2007
Copyright © 2007 Elsevier Inc. All rights reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: [email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

ISBN: 9780123740243
For information on all Academic Press publications visit our website at books.elsevier.com
Printed and bound in USA 07 08 09 10 11
10 9 8 7 6 5 4 3 2 1
Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org
Dedication
To our families and to our readers ...
– Howard Mark and Jerry Workman
Contents

Preface
Note to Readers
1. A New Beginning ...
2. Elementary Matrix Algebra: Part 1
3. Elementary Matrix Algebra: Part 2
4. Matrix Algebra and Multiple Linear Regression: Part 1
5. Matrix Algebra and Multiple Linear Regression: Part 2
6. Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants
7. Matrix Algebra and Multiple Linear Regression: Part 4 – Concluding Remarks
8. Experimental Designs: Part 1
9. Experimental Designs: Part 2
10. Experimental Designs: Part 3
11. Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions
12. Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations
13. Analytic Geometry: Part 3 – Reducing Dimensionality
14. Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices
15. Experimental Designs: Part 4 – Varying Parameters to Expand the Design
16. Experimental Designs: Part 5 – One-at-a-time Designs
17. Experimental Designs: Part 6 – Sequential Designs
18. Experimental Designs: Part 7 – β, the Power of a Test
19. Experimental Designs: Part 8 – β, the Power of a Test (Continued)
20. Experimental Designs: Part 9 – Sequential Designs Concluded
21. Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple
22. Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple
23. Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple
24. Looking Behind and Ahead: Interlude
25. A Simple Question: The Meaning of Chemometrics Pondered
26. Calculating the Solution for Regression Techniques: Part 4 – Singular Value Decomposition
27. Linearity in Calibration
28. Challenges: Unsolved Problems in Chemometrics
29. Linearity in Calibration: Act II Scene I
30. Linearity in Calibration: Act II Scene II – Reader’s Comments ...
31. Linearity in Calibration: Act II Scene III
32. Linearity in Calibration: Act II Scene IV
33. Linearity in Calibration: Act II Scene V
34. Collaborative Laboratory Studies: Part 1 – A Blueprint
35. Collaborative Laboratory Studies: Part 2 – using ANOVA
36. Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error
37. Collaborative Laboratory Studies: Part 4 – Ranking Test
38. Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods
39. Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text
40. Is Noise Brought by the Stork? Analysis of Noise: Part 1
41. Analysis of Noise: Part 2
42. Analysis of Noise: Part 3
43. Analysis of Noise: Part 4
44. Analysis of Noise: Part 5
45. Analysis of Noise: Part 6
46. Analysis of Noise: Part 7
47. Analysis of Noise: Part 8
48. Analysis of Noise: Part 9
49. Analysis of Noise: Part 10
50. Analysis of Noise: Part 11
51. Analysis of Noise: Part 12
52. Analysis of Noise: Part 13
53. Analysis of Noise: Part 14
54. Derivatives in Spectroscopy: Part 1 – The Behavior of the Derivative
55. Derivatives in Spectroscopy: Part 2 – The “True” Derivative
56. Derivatives in Spectroscopy: Part 3 – Computing the Derivative
57. Derivatives in Spectroscopy: Part 4 – Calibrating with Derivatives
58. Comparison of Goodness of Fit Statistics for Linear Regression: Part 1 – Introduction
59. Comparison of Goodness of Fit Statistics for Linear Regression: Part 2 – The Correlation Coefficient
60. Comparison of Goodness of Fit Statistics for Linear Regression: Part 3 – Computing Confidence Limits for the Correlation Coefficient
61. Comparison of Goodness of Fit Statistics for Linear Regression: Part 4 – Confidence Limits for Slope and Intercept
62. Correction and Discussion Regarding Derivatives
63. Linearity in Calibration: Act III Scene I – Importance of Nonlinearity
64. Linearity in Calibration: Act III Scene II – A Discussion of the Durbin-Watson Statistic, a Step in the Right Direction
65. Linearity in Calibration: Act III Scene III – Other Tests for Nonlinearity
66. Linearity in Calibration: Act III Scene IV – How to Test for Nonlinearity
67. Linearity in Calibration: Act III Scene V – Quantifying Nonlinearity
68. Linearity in Calibration: Act III Scene VI – Quantifying Nonlinearity, Part II, and a News Flash
69. Connecting Chemometrics to Statistics: Part 1 – The Chemometrics Side
70. Connecting Chemometrics to Statistics: Part 2 – The Statistics Side
71. Limitations in Analytical Accuracy: Part 1 – Horwitz’s Trumpet
72. Limitations in Analytical Accuracy: Part 2 – Theories to Describe the Limits in Analytical Accuracy
73. Limitations in Analytical Accuracy: Part 3 – Comparing Test Results for Analytical Uncertainty
74. The Statistics of Spectral Searches
75. The Chemometrics of Imaging Spectroscopy
Glossary of Terms
Index
Colour Plate Section
Preface
This large single volume fulfils the need for chemometric-based tutorials on topics of interest to analytical chemists or other scientists performing modern mathematical and statistical operations for use with analytical measurements. The book covers a very broad range of chemometric topics as indicated in the extensive table of contents. This book is a collection of the series of columns first published in Spectroscopy providing detailed mathematical and philosophical discussions on the use of chemometrics and statistical methods for scientific measurements and analytical methods. In addition, the new revolution in biotechnology and the use of spectroscopic techniques therein provides an opportunity for those scientists to strengthen their use of mathematics and calibration through the use of this book.

Subjects covered include those of interest to many groups of scientists, mathematicians, and practicing analysts for daily problem solving, as well as detailed insights into subjects that are difficult for the nonspecialist to grasp thoroughly. The coverage relies more on concept delineation than on rigorous mathematics, but the descriptive mathematics and derivations are included for the more rigorously minded. Sections on matrix algebra, analytic geometry, experimental design, instrument and system calibration, noise, derivatives and their use in data analysis, linearity and nonlinearity are described. Collaborative laboratory studies, using ANOVA, testing for systematic error, ranking tests for collaborative studies, and efficient comparison of two analytical methods are included. Discussions of topics such as the limitations in analytical accuracy, and brief introductions to the statistics of spectral searches and the chemometrics of imaging spectroscopy, are also included.

The popularity of the Chemometrics in Spectroscopy series (ongoing since the early 1990s) as well as the Statistics in Spectroscopy series and books has been overwhelming and we sincerely thank our readership over the years. We have received emails from many people; one memorable message thanked us for a career change that was made largely because of the renewed interest in statistics and chemometrics stimulated by our thought-provoking columns. We hope you find this collection useful and will continue to read the columns and write to us with your thoughts, comments, and questions regarding this stimulating topic.

Howard Mark
Suffern, NY

Jerry Workman
Madison, WI
Note to Readers
In some cases there were errors, both trivial and significant, in the original column from which a given chapter was taken. Sometimes we found the error ourselves (unfortunately after the column was printed) and sometimes, more embarrassingly, the error was brought to our attention by one of our ever-vigilant readers. For all significant errors, the necessary corrections were made in a subsequent column; in all cases, the corrected version is what is in this book. Sometimes, for the more serious errors, we note that the corresponding column was erroneous, so that any reader who wants to go back to the original will be aware that a comparison with what is presented here will fail.
1 A New Beginning ...
Why do we title this chapter “A New Beginning ...”? Well, there are a lot of reasons. First of all, of course, is the simple fact that that is just the way we do things. Secondly is the fact that we developed this book in much the same way we developed our previous book Statistics in Spectroscopy (SiS). Those of you out there who have followed the series of articles published in Spectroscopy magazine since 1986 know that for the most part, each column in the series was pretty much self-contained and could stand alone, yet also fit into that series in the appropriate place and contributed to the flow of information in that series as a whole. We hope to be able to reproduce that on a larger scale. Just as the series Statistics in Spectroscopy (this is too long to write out each time, so from here on we will abbreviate it SiS) was self-contained and stood alone, so too will we try to make this new series stand alone, and at the same time be a worthy successor to SiS, and also continue to develop the concepts we began there.

Thirdly is the fact that we are finally starting to write again. To you, our readership, it may seem like we have been writing continuously since we began SiS, but in fact we have been running on backlog for a longer time than you would believe. That was advantageous in that it allowed us time to pursue our personal and professional lives, including such other projects as arranging for SiS to be published as a book [1]. The downside of our getting ahead of ourselves, on the other hand, is that we were not able to keep you abreast of the latest developments related to our favorite topic. However, since the last time we actually wrote something, there have been a number of noteworthy developments. Our last series dealt only with the elementary concepts of statistics related to the general practice of calibration used for UV-VIS-NIR and occasionally for IR spectroscopy.

Our purpose in writing SiS was to help provide a small footbridge to cross the gap between specialized chemometrics literature written at the expert level and those general statistics articles and texts dealing with examples and questions far removed from chemistry or spectroscopic practice. Since the beginning of the “Statistics” series in 1986, several reviews, tutorials, and textbooks have been published to begin the construction of a major highway bridging this gap. Most notable, at least in our minds, have been tutorial articles on classical least squares (CLS), principal components regression (PCR), and partial least squares regression (PLSR) by Haaland and Thomas [2, 3]. Other important work includes textbooks on calibration and chemometrics by Naes and Martens [4], and Mark [5]. Chemometric reviews discussing the progress of tutorial and textbook literature appear regularly in Analytical Chemistry, Critical Review issues. Another recent series of articles on chemometric concepts termed “The Chemometric Space” by Naes and Isaksson has appeared [6]. In addition, there is a North American chapter of the International Chemometrics Society (NAmICS) which we are told has
over 300 members. Those interested in joining or obtaining further information may contact Professor Thomas O’Haver at the Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742 (Donald B. Dahlberg, 1993, personal communication).

All the foregoing was true as of when the Chemometrics column began in 1993. Now in 2006, when we are preparing this for book publication, there are many more sources of information about Chemometrics. However, since this is not a review of the field, we forbear to list them all, but will correct one item that has changed since then: to obtain information about NAmICS, or to join the discussion group, contact David Duewer at NIST ([email protected]) or send a message to the discussion group ([email protected]).

Finally, since imitation is the sincerest form of flattery (or so they tell us), we are pleased to see that others have also taken the route of printing longer tutorial discussions in the form of a series of related articles on a given topic. Two series that we have no qualms recommending, on topics related to ours, have appeared in some of the sister publications of Spectroscopy [7–15] (note: there have been recent indications that the series in Spectroscopy International has continued beyond the ones we have listed. If we can obtain more information we will keep you posted – Spectroscopy International has also undergone some transformations and it is not always easy to get copies).

So, overall the chemometrics bridge between the lands of the overly simplistic and severely complex is well under construction; one may find at least a single lane open by which to pass. So why another series? Well, it is still our labor of love to deal with specific issues that plague ourselves and our colleagues involved in the practice of multivariate qualitative and quantitative spectroscopic calibration.
Having collectively worked with hundreds of instrument users over 25 combined years of calibration problems, we are compelled, like bees loaded with pollen, to disseminate the problems, answers, and questions brought about by these experiences. Then what would a series named “Chemometrics in Spectroscopy” hope to cover which is of interest to the readers of “Spectroscopy”?

We have been taken to task (with perhaps some justice) for using the broader title label “Chemometrics in Spectroscopy” for what we have claimed will be discussions of the somewhat narrower range of topics included in the field of multivariate statistical algorithms applied to chemical problems, when the term “Chemometrics” actually applies to a much wider range of topics. Nevertheless, we will use this title, for a number of reasons. First, that is what we said we were going to do, and we hate to not follow through, even on such a minor point. Secondly, we have said before (with all due arrogance) that this is our column, and we have been pretty fortunate that the editors of Spectroscopy have always pretty much let us do as we please. Finally, at this point we consider the possibility that we may very well eventually extend our range to include some of these other topics that the broader term will cover.

As of right now, some of the topics we foresee being able to expand upon over the series will include, but not be limited to

• The multivariate normal distribution
• Defining the bounds for a data set
• The concept of Mahalanobis distance
• Discriminant analysis and its subtopics of
  – Sample selection
  – Spectral matching (Qualitative analysis)
• Finding the maximum variance in the multivariate distribution
• Matrix algebra refresher
• Analytic geometry refresher
• Principal components analysis (PCA)
• Principal components regression (PCR)
• More on Multiple linear least squares regression (MLLSR), also known as Multiple linear regression (MLR) and P-matrix, and its sibling, K-matrix
• More on Simple linear least squares regression (SLLSR), also known as Simple least squares regression (SLSR) or univariate least squares regression
• Partial least squares regression (PLSR)
• Validation of calibration models
• Laboratory data and assessing error
• Diagnosis of data problems
• An attempt to standardize statistical/chemometric terms
• Special calibration problems (and solutions)
• The concept of outliers: theory and practice
• Standardization concepts and methods for transfer of calibrations
• Collaborative study problems related to methods and instruments.

We also plan to include in the discussions the important statistical concepts, such as correlation, bias, slope, and associated errors and confidence limits. Beyond this, it is also our hope that readers will write to us with their comments or suggestions for chemometric challenges which confront them. If time and energy permit, we may be able to discuss such issues as neural networks, general factor analysis, clustering techniques, maximizing graphical presentation of data, and signal processing.
THE MULTIVARIATE NORMAL DISTRIBUTION

We will begin with the concept of the multivariate normal distribution. Think of a cigar, suspended in space. If you cannot think of a cigar suspended in space, look at Figure 1-1a. Now imagine the cigar filled with little flecks of stuff, as in Figure 1-1b (it does not really matter what the stuff is, mathematics never concerned itself with such unimportant details). Imagine the flecks being more densely packed toward the middle of the cigar. Now imagine a swarm of gnats surrounding the cigar; if they are attracted to the cigar, then naturally there will be fewer of them far away from the cigar than close to it (Figure 1-1c). Next take away the cigar, and just leave the flecks and the gnats. By this time, of course, you should realize that the flecks and the gnats are really the same thing, and are neither flecks nor gnats but simply abstract representations of points in space. What is left looks like Figure 1-1d.
Figure 1-1 (panels a–d) Development of the concept of the Multivariate Normal Distribution (this one shown having three dimensions) – see text for details. The density of points along a cross-section of the distribution in any direction is also an MND, of lower dimension.
Figure 1-1d, of course, is simply a pictorial/graphical representation of what a Multivariate Normal Distribution (MND) would look like, if you could see it. Furthermore, it is a representation of only one particular MND. First of all, this particular MND is a three-dimensional MND. A two-dimensional MND will be represented by points in a plane, and a one-dimensional MND is simply the ordinary Normal distribution that we have come to know and love [16]. An MND can have any number of dimensions; unfortunately we humans cannot visualize anything with more than three dimensions, so for our examples we are limited to such pictures.

Also, the MND depicted has a particular shape and orientation. In general, an MND can have a variety of shapes and orientations, depending upon the dispersion of the data along the different axes. Thus, for example, it would not be uncommon for the dispersion along two of the axes to be equal and independent. In this case, which represents one limiting situation, an appropriate cross-section of the MND would be circular rather than elliptical. Another limiting situation, by the way, is for two or more of the variables to be perfectly correlated, in which case the data would lie along a straight line (or plane, or hyperplane as the corresponding higher-dimensional figure is called).

Each point in the MND can be projected onto the planes defined by each pair of the axes of the coordinate system. For example, Figure 1-2 shows the projection of the data onto the plane at the “bottom” of the coordinate system. There it forms a two-dimensional MND, which is characterized by several parameters. The two-dimensional MND is the prototype for all MNDs of higher dimension, and its properties are the key defining properties of any MND.

Figure 1-2 Projecting each point of the three-dimensional MND onto any of the planes defined by two axes of the coordinate system (or, more generally, any plane passing through the coordinate system) results in the projected points being represented by a two-dimensional MND. The correlation coefficients for the projections in all planes are needed to fully describe the original MND.

First of all, the data contributing to an MND itself has a Normal distribution along any of the axes of the MND. We have discussed the Normal distribution previously [16], and have seen that it is described by the expression:

$$ f(x) = a\,e^{-\left(\frac{x-\bar{x}}{\sigma}\right)^{2}} \tag{1-1} $$
The MND can be mathematically described by an expression that is similar in form, but has the characteristic that each of the individual parts of the expression represents the multivariate analog of the corresponding part of equation 1-1. Thus, for example, where $\bar{x}$ represents the mean of the data for which equation 1-1 describes the distribution, there is a corresponding quantity $\bar{X}$ that represents in matrix notation the fact that for each of the axes shown in Figure 1-1, each datum has a value, and therefore the collection of data has a mean value along each dimension. This quantity, represented as a list of the set of means along all the different dimensions, is called a vector, and is represented as $\bar{X}$ (as opposed to $\bar{x}$, an individual mean).

If we project the MND onto each axis of the coordinate system containing the MND, then as stated above, these projections of the data will be distributed as an ordinary Normal distribution, as shown in Figure 1-3. This distribution will itself then have a standard deviation, so that another defining characteristic of the MND is the standard deviation of the projection of the MND along each axis. This must also then be represented by a vector.
Figure 1-3 Projecting the points onto a line results in a point density that is our familiar Normal Distribution.
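The mean vector and the vector of standard deviations can be illustrated numerically. The following is a minimal sketch in Python (standard library only), assuming a three-dimensional MND generated by mixing independent standard Normal values; the names `sample_mnd`, `MEAN`, and `MIXING`, and every numerical value, are illustrative assumptions and are not taken from the text.

```python
import random
import statistics

# Hypothetical parameters, chosen only to give a tilted, cigar-like cloud.
MEAN = [10.0, 20.0, 30.0]
MIXING = [
    [2.0, 0.0, 0.0],
    [1.5, 1.0, 0.0],
    [1.0, 0.5, 0.5],
]


def sample_mnd(n_points, mean, mixing, seed=0):
    """Draw points as mean + mixing * z, where z holds independent N(0, 1) values."""
    rng = random.Random(seed)
    dims = len(mean)
    points = []
    for _ in range(n_points):
        z = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        points.append([
            mean[i] + sum(mixing[i][j] * z[j] for j in range(dims))
            for i in range(dims)
        ])
    return points


def project_onto_axis(points, axis):
    """Projecting onto a coordinate axis keeps one coordinate of each point."""
    return [p[axis] for p in points]


points = sample_mnd(5000, MEAN, MIXING)

# One mean and one standard deviation per axis: each is a vector.
mean_vector = [statistics.fmean(project_onto_axis(points, k)) for k in range(3)]
std_vector = [statistics.stdev(project_onto_axis(points, k)) for k in range(3)]

print("mean vector:", [round(m, 2) for m in mean_vector])
print("std vector: ", [round(s, 2) for s in std_vector])
```

With these assumed parameters the sample means should fall close to `MEAN`, and the standard deviations close to the square root of the sum of squares of each row of `MIXING`, because each coordinate is a weighted sum of independent unit-variance values.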
The final key point to note about the MND, which can also be seen from Figure 1-2, is the fact that when the MND is projected onto the plane defined by any two axes of the coordinate system the data may show some correlation (as does the data in Figure 1-2). In fact, the projection onto any of the planes defined by two of the axes will have some value for the correlation coefficient between the corresponding pair of variables. The amount of correlation between projections along any pair of axes can vary from zero, in which case the data would lie in a circular blob, to unity, in which case the data would all lie exactly on a straight line.

Since each pair of axes defines another plane, many such projections may be possible, depending on the number of dimensions in which the MND exists. Indeed, every possible pair of axes in the coordinate system defines such a plane. As we have noted, we mere mortals cannot visualize more than three dimensions, and so our examples and diagrams will be limited to showing data in three or fewer dimensions, but the mathematical descriptions can be extended, with all generality, to as high a dimensionality as might be needed.

Thus, the full description of the MND must include all the correlations of the data between every pair of axes. This is conventionally done by creating what is known as the correlation matrix. This matrix is a square matrix, in which any given row or column corresponds to a variable, and the individual positions (i.e., the m, n position for example, where m and n represent indices of the variables) in the matrix represent the correlation between the variable represented by the row it lies in and the variable represented by the column it lies in. In actuality, for mathematical reasons, the correlation itself is not used; rather, the related quantity, the covariance, replaces the correlation coefficient in the matrix.
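The relationship between the correlation matrix and the covariance can be sketched in a few lines of Python (standard library only). This is an illustration under stated assumptions, not a prescribed method: the function names and the simulated three-variable data set (two variables built to correlate at about 0.8, plus one independent variable) are hypothetical.

```python
import math
import random


def covariance_matrix(points):
    """Sample covariance between every pair of variables (columns)."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dims)]
    cov = [[0.0] * dims for _ in range(dims)]
    for p in points:
        d = [p[k] - means[k] for k in range(dims)]
        for i in range(dims):
            for j in range(dims):
                cov[i][j] += d[i] * d[j]
    return [[c / (n - 1) for c in row] for row in cov]


def correlation_matrix(cov):
    """Scale each covariance by the two standard deviations involved."""
    dims = len(cov)
    sd = [math.sqrt(cov[k][k]) for k in range(dims)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(dims)]
            for i in range(dims)]


# Hypothetical data: variables 0 and 1 correlated, variable 2 independent.
rng = random.Random(1)
points = []
for _ in range(5000):
    z1, z2, z3 = (rng.gauss(0.0, 1.0) for _ in range(3))
    points.append([z1, 0.8 * z1 + 0.6 * z2, z3])

cov = covariance_matrix(points)
corr = correlation_matrix(cov)

for row in corr:
    print(["%6.3f" % v for v in row])
```

Both matrices are square and symmetric, with one row and one column per variable; the m, n entry of `corr` is the correlation coefficient of the projection onto the plane of axes m and n.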
The elements of the matrix that lie along what is called the main diagonal (i.e., where the column and row numbers are the same) are then the variances (the square of the standard deviation; this shows that there is a rather close relationship between the standard deviation and the correlation) of the data. This matrix is thus called the variance-covariance matrix, and sometimes just the covariance matrix for simplicity.

Since it is necessary to represent the various quantities by vectors and matrices, the operations for the MND that correspond to operations using the univariate (simple) Normal distribution must be matrix operations. Discussion of matrix operations is beyond the scope of this column, but for now it suffices to note that the simple arithmetic operations of addition, subtraction, multiplication, and division all have their matrix counterparts. In addition, certain matrix operations exist which do not have counterparts in simple arithmetic. The beauty of the scheme is that many manipulations of data using matrix operations can be done using the same formalism as for simple arithmetic, since when they are expressed in matrix notation, they follow corresponding rules. However, there is one major exception to this: the commutative rule, whereby for simple arithmetic:

A (operation) B = B (operation) A

e.g.:

A + B = B + A
A − B = −B + A

does not hold true for matrix multiplication:

A × B ≠ B × A
That is because of the way matrix multiplication is defined. Thus, for this case the order of appearance of the two matrices to be multiplied may provide different matrices as the answer.

Thus, instead of f(x) and the expression for it in equation 1-1 describing the simple Normal distribution, the MND is described by the corresponding multivariate expression (1-2):

$$ f(X) = K\,e^{-(X-\bar{X})^{T} A\,(X-\bar{X})} \tag{1-2} $$

where now the capital letter X represents a vector, K is a scalar normalizing constant, and the capital letter A represents the matrix derived from the covariance matrix (strictly, a multiple of its inverse). This is, by the way, a somewhat straightforward extension of the definition (although it may not seem so at first glance) because for the simple univariate case the matrix A degenerates into a single number (1/σ² in the notation of equation 1-1), X becomes x, and thus the exponent becomes simply the square of $x - \bar{x}$ scaled by that number.

Most texts dealing with multivariate statistics have a section on the MND, but a particularly good one, if a bit heavy on the math, is the discussion by Anderson [17]. To help with this a bit, our next few chapters will include a review of some of the elementary concepts of matrix algebra.

Another very useful series of chemometric-related articles has been written by David Coleman and Lynn Vanatta. Their series is on the subject of regression analysis. It has appeared in American Laboratory in a set of over twenty-five articles. Copies of the back articles are available on the web at the URL address found in reference [18].
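The multivariate expression for the MND can be evaluated numerically for a two-dimensional case. The sketch below is hedged accordingly: it assumes that A is one half of the inverse of the covariance matrix and that K is the usual normalizing constant 1/(2π√det Σ), choices that make f a proper probability density but that the text does not spell out. The function names and the example covariance values are hypothetical.

```python
import math


def quadratic_form(d, a):
    """Return d^T A d for a length-2 vector d and a 2x2 matrix a."""
    return sum(d[i] * a[i][j] * d[j] for i in range(2) for j in range(2))


def mnd_density_2d(x, mean, cov):
    """Evaluate K * exp(-(x - mean)^T A (x - mean)) in two dimensions.

    Assumption: A = 0.5 * inverse(cov) and K = 1 / (2 * pi * sqrt(det(cov))).
    """
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inverse = [
        [cov[1][1] / det, -cov[0][1] / det],
        [-cov[1][0] / det, cov[0][0] / det],
    ]
    a = [[0.5 * v for v in row] for row in inverse]
    k = 1.0 / (2.0 * math.pi * math.sqrt(det))
    d = [x[0] - mean[0], x[1] - mean[1]]
    return k * math.exp(-quadratic_form(d, a))


origin = [0.0, 0.0]

# Uncorrelated, unit-variance case: circular cross-sections.
circular = [[1.0, 0.0], [0.0, 1.0]]
print(mnd_density_2d([0.5, -1.2], origin, circular))

# Correlated case: the density is higher along the long axis of the "cigar".
tilted = [[1.0, 0.8], [0.8, 1.0]]
along = mnd_density_2d([1.0, 1.0], origin, tilted)
across = mnd_density_2d([1.0, -1.0], origin, tilted)
print(along > across)
```

In the uncorrelated case the result equals the product of two ordinary Normal densities, which is one way to see that the multivariate form reduces to the familiar univariate one.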
REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991).
2. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1193–1202 (1988).
3. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1202–1208 (1988).
4. Naes, T. and Martens, H., Multivariate Calibration (John Wiley & Sons, New York, 1989).
5. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
6. Naes, T. and Isaksson, T., “The Chemometric Space”, NIR News (PO Box 10, Selsey, Chichester, West Sussex, PO20 9HR, UK, 1992).
7. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(4), 310–314 (1992).
8. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(5), 378–379 (1992).
9. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(6), 448–450 (1992).
10. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(7), 531–532 (1992).
11. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(2), 42–44 (1991).
12. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(4), 41–43 (1991).
13. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(5), 43–46 (1991).
14. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(6), 45–47 (1991).
15. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 4(1), 41–43 (1992).
16. Mark, H. and Workman, J., “Statistics in Spectroscopy – Part 6 – The Normal Distribution”, Spectroscopy 2(9), 37–44 (1987).
17. Anderson, T.W., An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958).
18. Coleman, D. and Vanatta, L., Statistics in Analytical Chemistry, International Scientific Communications, Inc., found at http://www.iscpubs.com/articles/index.php?2.
2 Elementary Matrix Algebra: Part 1
You may recall that in the first chapter we promised that a review of elementary matrix algebra would be forthcoming; so the next several chapters will cover this topic all the way from the very basics to the more advanced spectroscopic subjects.

You may already have discovered that the term “matrix” is a fanciful name for a table or list. If you have recently made a grocery list you have created an n × 1 matrix, or in more correct nomenclature, an $X_{n \times 1}$ matrix, where n is the number of items you would like to buy (rows) and 1 is the number of columns. If you have become a highly sophisticated shopper and have made lists consisting of one column for Store A and a second one for Store B, you have ascended into the world of the $X_{n \times 2}$ matrix. If you include the price of each item and put brackets around the entire column(s) of prices, you will have created a numerical matrix.

By definition, a numerical matrix is a rectangular array of numbers (termed “elements”) enclosed by square brackets [ ]. Matrices can be used to organize information such as size versus cost in a grocery department, or they may be used to simplify the problems associated with systems or groups of linear equations. Later in this chapter we will introduce the operations involved for linear equations (see Table 2-1 for common symbols used).
Table 2-1 Common symbols used in matrix notation

Quantity                           Symbol
Matrix*                            [X] or X
Determinant*                       |X|
Vectors*                           x
Scalars*                           x
Parameters or matrix names         A, B, C, G, H, P, Q, R, S, U, V
Errors and residuals               D, E, F
Addition                           +
Subtraction                        −
Multiplication                     × or •
Division                           ÷ or /
Empty or null set                  ∅
Inverse of a matrix                [X]^−1
Transpose of a matrix              (X)′ or [X]^T
Generalized inverse of a matrix    [X]^−
Identity matrix                    [I] or [1]

* Where X or x are represented by any letter, generally those listed under “Parameters or matrix names” in this table.
The symbols below represent a matrix:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix}$$

Note that $a_1$ and $a_2$ are in column 1, $b_1$ and $b_2$ are in column 2, $a_1$ and $b_1$ are in row 1, and $a_2$ and $b_2$ are in row 2. The above matrix is a 2 × 2 (rows × columns) matrix. The first number indicates the number of rows, and the second indicates the number of columns. Matrices can be denoted as $X_{2 \times 2}$ using a capital, boldface letter with the row and column subscript.
MATRIX OPERATIONS
The following illustrations are useful to describe very basic matrix operations. Discussions covering more advanced matrix operations will be included in later chapters, but for now, just review these elementary operations.
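Matrix addition
To add two matrices, the following operation is performed:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} + \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 + c_1 & b_1 + d_1 \\ a_2 + c_2 & b_2 + d_2 \end{bmatrix}$$

To add larger matrices, the following operation applies:

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} + \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 + e_1 & b_1 + f_1 & c_1 + g_1 & d_1 + h_1 \\ a_2 + e_2 & b_2 + f_2 & c_2 + g_2 & d_2 + h_2 \end{bmatrix}$$

Subtraction
For subtraction, use the following operations:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} - \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 - c_1 & b_1 - d_1 \\ a_2 - c_2 & b_2 - d_2 \end{bmatrix}$$

The same operation holds true for larger matrices such as

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} - \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 - e_1 & b_1 - f_1 & c_1 - g_1 & d_1 - h_1 \\ a_2 - e_2 & b_2 - f_2 & c_2 - g_2 & d_2 - h_2 \end{bmatrix}$$

and so on.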
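The element-by-element rules for addition and subtraction can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function names `matrix_add` and `matrix_sub` and the numbers in `X` and `Y` are assumptions chosen for the example, and a matrix is represented as a list of rows.

```python
def _check_same_shape(x, y):
    """Addition and subtraction are only defined for equal-sized matrices."""
    if len(x) != len(y) or any(len(rx) != len(ry) for rx, ry in zip(x, y)):
        raise ValueError("matrices must have the same number of rows and columns")


def matrix_add(x, y):
    """Return the element-by-element sum of two matrices."""
    _check_same_shape(x, y)
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def matrix_sub(x, y):
    """Return the element-by-element difference of two matrices."""
    _check_same_shape(x, y)
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Arbitrary illustrative values (not taken from the text).
X = [[1, 3], [2, 4]]
Y = [[5, 7], [6, 8]]

print(matrix_add(X, Y))  # [[6, 10], [8, 12]]
print(matrix_sub(X, Y))  # [[-4, -4], [-4, -4]]
```

The shape check reflects the fact that each element needs a partner in the same row and column of the other matrix.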
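Matrix multiplication
To multiply a scalar by a matrix (or a vector) we use

$$A \times \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} = \begin{bmatrix} A \times a_1 & A \times b_1 \\ A \times a_2 & A \times b_2 \end{bmatrix}$$

where A is a scalar value. The product of two matrices (or vectors) is given by

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 c_1 + b_1 c_2 & a_1 d_1 + b_1 d_2 \\ a_2 c_1 + b_2 c_2 & a_2 d_1 + b_2 d_2 \end{bmatrix}$$

In another example, in which an $X_{1 \times 2}$ matrix is multiplied by an $X_{2 \times 1}$ matrix, we have:

$$\begin{bmatrix} a_1 & a_2 \end{bmatrix} \times \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 b_1 + a_2 b_2 \end{bmatrix}$$

denoted by $X_1 \times X_2$ in matrix notation.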
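The two kinds of multiplication described above can be sketched as follows. The helper names `scalar_times_matrix` and `matrix_multiply` and the numeric values are assumptions made for this example only; each element of the product is the sum of products of a row of the first matrix with a column of the second.

```python
def scalar_times_matrix(k, x):
    """Multiply every element of matrix x by the scalar k."""
    return [[k * value for value in row] for row in x]


def matrix_multiply(x, y):
    """Row-by-column product of x (r x n) and y (n x c)."""
    if any(len(row) != len(y) for row in x):
        raise ValueError("columns of the first matrix must equal rows of the second")
    columns_of_y = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_y]
            for row in x]


# Arbitrary illustrative values (not taken from the text).
print(scalar_times_matrix(3, [[1, 2], [3, 4]]))             # [[3, 6], [9, 12]]
print(matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
print(matrix_multiply([[1, 2]], [[3], [4]]))                # [[11]]
```

The last call is the 1 × 2 by 2 × 1 case: the result is a 1 × 1 matrix holding a single sum of products.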
Matrix division
Division of a matrix by a scalar is accomplished by:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \div A = \begin{bmatrix} a_1 / A & b_1 / A \\ a_2 / A & b_2 / A \end{bmatrix}$$

where A is a scalar value.
Inverse of a matrix
The inverse of a matrix is the conceptual equivalent to its reciprocal. Therefore if we denote our matrix by X, then the inverse of X is denoted as $X^{-1}$ and the following relationship holds:

$$X \times X^{-1} = [1] = X^{-1} \times X$$

where [1] is an identity matrix. Only square matrices, which have an equal number of rows and columns (for example, 2 × 2, 3 × 3, 4 × 4, etc.), can have inverses. Several computer packages provide the algorithms for calculating the inverse of square matrices. The identity matrix for a 2 × 2 matrix is

$$[1]_{2 \times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and for a 3 × 3 matrix, the identity matrix is

$$[1]_{3 \times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and so on. Note that the diagonal is always composed of ones for the identity matrix, and all other values are zero. To summarize, by definition:

$$X_{2 \times 2} \times X^{-1}_{2 \times 2} = [1]_{2 \times 2}$$

The basic methods for calculating $X^{-1}$ will be addressed in the next chapter.
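As a small check of the defining relationship, the sketch below builds an identity matrix and confirms that a matrix times its inverse gives that identity in either order. The matrix `X` and its inverse `X_inv` are hypothetical values picked so the arithmetic is easy to follow; they are not from the text, and the helper names are likewise assumptions.

```python
def identity(n):
    """n x n identity matrix: ones on the diagonal, zeros elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matrix_multiply(x, y):
    """Row-by-column product of two matrices stored as lists of rows."""
    columns_of_y = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_y]
            for row in x]


# Hypothetical 2 x 2 matrix and its inverse (illustrative values only).
X = [[2, 5], [1, 3]]
X_inv = [[3, -5], [-1, 2]]

print(matrix_multiply(X, X_inv) == identity(2))  # True
print(matrix_multiply(X_inv, X) == identity(2))  # True
print(matrix_multiply(X, identity(2)) == X)      # True
```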
Transpose of a matrix
The transpose of a matrix is denoted by $X'$ (or, alternatively, by $X^{T}$). For example, for the matrix:

$$[X] = \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{bmatrix}$$

then

$$[X]' = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \\ c_1 & c_2 \end{bmatrix}$$

The first column of [X] becomes the first row of $[X]'$; the second column of [X] becomes the second row of $[X]'$; the third column of [X] becomes the third row of $[X]'$; and so on.
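The column-to-row exchange can be written very compactly. In this sketch the function name `transpose` and the numbers in `X` are assumptions for illustration; the matrix is again a list of rows.

```python
def transpose(x):
    """Turn each column of x into a row of the result."""
    return [list(column) for column in zip(*x)]


# Arbitrary 2 x 3 matrix (illustrative values only).
X = [[1, 2, 3],
     [4, 5, 6]]

print(transpose(X))                  # [[1, 4], [2, 5], [3, 6]]
print(transpose(transpose(X)) == X)  # True
```

Transposing twice returns the original matrix, which is a convenient sanity check.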
ELEMENTARY OPERATIONS FOR LINEAR EQUATIONS
To solve problems involving calibration equations using multivariate linear models, we need to be able to perform elementary operations on sets or systems of linear equations. So before using our newly discovered powers of matrix algebra, let us solve a problem using the algebra many of us learned very early in life. The elementary operations used for manipulating linear equations include three simple rules [1, 2]:
• Equations can be listed in any order for convenience and organizational purposes.
• Any equation may be multiplied by any real number other than zero.
• Any equation in a series of equations can be replaced by the sum of itself and any other equation in the system.
As an example, we can illustrate these operations using the three equations below as part of what is termed an “equation system” or simply a “system” (equations 2-1 through 2-3):

$$1a_1 + 1b_1 = -2 \tag{2-1}$$
$$4a_1 + 2b_1 + c_1 = 6 \tag{2-2}$$
$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-3}$$

To solve this system of three equations, we begin by following the three elementary operations rules above:
• We can rearrange the equations in any order. In our case the equations happen to be in a useful order.
• We decide to multiply equation 2-1 by a factor such that the coefficients of $a_1$ are of opposite sign and of the same absolute value for equations 2-1 and 2-2. Therefore, we multiply equation 2-1 by −4 to yield

$$-4a_1 - 4b_1 = 8 \tag{2-4}$$

• We can eliminate $a_1$ from the second equation by adding equations 2-4 and 2-2 to give equation 2-5:

$$(-4a_1 - 4b_1 = 8) + (4a_1 + 2b_1 + c_1 = 6) \;\Rightarrow\; -2b_1 + c_1 = 14 \tag{2-5}$$

and we bring equation 2-1 back into the system by dividing equation 2-4 by −4 to get

$$a_1 + b_1 = -2 \tag{2-6}$$
$$-2b_1 + c_1 = 14 \tag{2-7}$$
$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-8}$$

Now to eliminate the $a_1$ term from equation 2-8, we multiply equation 2-6 by −6 to yield

$$-6a_1 - 6b_1 = 12 \tag{2-9}$$

Then we add equation 2-9 to equation 2-8:

$$(-6a_1 - 6b_1 = 12) + (6a_1 - 2b_1 - 4c_1 = 14) \;\Rightarrow\; -8b_1 - 4c_1 = 26 \tag{2-10}$$

Now we bring back equation 2-6 in its original form by dividing equation 2-9 by −6, and our system of equations looks like this:

$$a_1 + b_1 = -2 \tag{2-11}$$
$$-2b_1 + c_1 = 14 \tag{2-12}$$
$$-8b_1 - 4c_1 = 26 \tag{2-13}$$

We can eliminate the $b_1$ term from equations 2-12 and 2-13 by multiplying equation 2-12 by −8 and equation 2-13 by 2 to obtain

$$16b_1 - 8c_1 = -112 \tag{2-14}$$
$$-16b_1 - 8c_1 = 52 \tag{2-15}$$

Adding these equations, we find

$$-16c_1 = -60 \tag{2-16}$$

Restore equation 2-7 by dividing equation 2-14 by −8 to yield

$$a_1 + b_1 = -2 \tag{2-17}$$
$$-2b_1 + c_1 = 14 \tag{2-18}$$
$$-16c_1 = -60 \tag{2-19}$$

The solution
Solving for $c_1$, we find $c_1 = (-60/-16) = 3.75$. Substituting $c_1$ into equation 2-18, we obtain $-2b_1 + 3.75 = 14$. Solving this for $b_1$, we find $b_1 = -5.125 \approx -5.13$. Substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$. Solving this for $a_1$, we find $a_1 = 3.125 \approx 3.13$. Finally,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

A system of equations in which the first unknown is missing from every equation after the first, and the second unknown is missing from every equation after the second, is said to be in echelon form. Every equation system composed of linear equations can be brought into echelon form by using elementary algebraic operations. The use of augmented matrices can accomplish the task of solving the equation system just illustrated.
For our previous example, the original equations

$$a_1 + b_1 = -2 \tag{2-20}$$
$$4a_1 + 2b_1 + c_1 = 6 \tag{2-21}$$
$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-22}$$

can be written in augmented matrix form as:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right] \tag{2-23}$$

The echelon form of the equations can also be put into matrix form as follows.
Echelon form:

$$a_1 + b_1 = -2 \tag{2-24}$$
$$-2b_1 + c_1 = 14 \tag{2-25}$$
$$-16c_1 = -60 \tag{2-26}$$

Matrix form:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{array}\right] \tag{2-27}$$
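The elimination and back-substitution carried out by hand in this chapter can be sketched in Python as a check on the arithmetic. This is a minimal illustration rather than the authors' procedure: the function names `to_echelon` and `back_substitute` are assumptions, exact fractions are used to avoid rounding, and the code assumes the system has a unique solution. The third echelon row it produces is −8c = −30, which is the last echelon equation above divided by 2.

```python
from fractions import Fraction


def to_echelon(augmented):
    """Forward elimination on an augmented matrix using the three rules."""
    m = [[Fraction(v) for v in row] for row in augmented]
    n = len(m)
    for col in range(n):
        # Rule 1: reorder rows so the pivot entry is nonzero
        # (raises StopIteration if the system has no unique solution).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            # Rules 2 and 3: add a multiple of the pivot row to row r.
            m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return m


def back_substitute(echelon):
    """Solve an echelon-form augmented matrix from the last row upward."""
    n = len(echelon)
    solution = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(echelon[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (echelon[i][n] - known) / echelon[i][i]
    return solution


# Augmented matrix of the three original equations in this chapter.
augmented = [[1, 1, 0, -2],
             [4, 2, 1, 6],
             [6, -2, -4, 14]]

echelon = to_echelon(augmented)
a1, b1, c1 = back_substitute(echelon)
print([float(v) for v in (a1, b1, c1)])  # [3.125, -5.125, 3.75]
```

The exact values 3.125 and −5.125 round to the 3.13 and −5.13 quoted in the text.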
SUMMARY
In this chapter, we have used elementary operations for linear equations to solve a problem. The three rules listed for these operations have a parallel set of three rules used for elementary matrix operations on linear equations. In our next chapter we will explore the rules for solving a system of linear equations by using matrix techniques.
REFERENCES
1. Kowalski, B.R., Recommendations to IUPAC Chemometrics Society (Laboratory for Chemometrics, Department of Chemistry, BG10, University of Washington, Seattle, WA 98195; 1985), pp. 1–2.
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.
3 Elementary Matrix Algebra: Part 2
ELEMENTARY MATRIX OPERATIONS
To solve the set of linear equations introduced in our previous chapter referenced as [1], we will now use elementary matrix operations. These matrix operations have a set of rules which parallel the rules used for elementary algebraic operations used for solving systems of linear equations. The rules for elementary matrix operations are as follows [2]:
1) Rows can be listed in any order for convenience or organizational purposes.
2) All elements within a row may be multiplied using any real number other than zero.
3) Any row can be replaced by the element-by-element sum of itself and any other row.
To solve a system of equations, our first step is to put zeros into the second and the third rows of the first column, and into the third row of the second column. For our exercise we will bring forward equations 2-1 through 2-3 as (equation set 3-1):

$$\begin{aligned} 1a_1 + 1b_1 &= -2 \\ 4a_1 + 2b_1 + 1c_1 &= 6 \\ 6a_1 - 2b_1 - 4c_1 &= 14 \end{aligned} \tag{3-1}$$

We can put the above set or system of equations in matrix notation as:

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad B = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad C = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

and so,

$$AB = C \quad \text{or} \quad A \bullet B = C$$

Matrix A is termed the “matrix of the equation system”. The matrix formed by [A|C] is termed the “augmented matrix”. For this problem the augmented matrix is given as:

$$[A|C] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right]$$
Now if we were to find a set of equations with zeros in the second and the third rows of the first column, and in the third row of the second column, we could use equations 2-17 through 2-19 [1], which look like (equation set 3-2):

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -16c_1 &= -60 \end{aligned} \tag{3-2}$$

We can rewrite these equations in matrix notation as:

$$G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & -2 & 1 \\ 0 & 0 & -16 \end{bmatrix} \qquad H = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad P = \begin{bmatrix} -2 \\ 14 \\ -60 \end{bmatrix}$$

and the augmented form of the above matrices is written as:

$$[G|P] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{array}\right]$$

For equation set 3-2, we can reduce or simplify the third row in [G|P] by following Rule 2 of the basic matrix operations previously mentioned. As such we can multiply row III in [G|P] by 1/2 to give

$$[G|P] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

We can use elementary row operations, also known as elementary matrix operations, to obtain matrix [G|P] from [A|C]. By the way, if we can achieve [G|P] from [A|C] using these operations, the matrices are termed “row equivalent”, denoted by $X_1 \geq X_2$.
To begin with an illustration of the use of elementary matrix operations let us use the following example. Our original augmented matrix [A|C] above can be manipulated to yield zeros in rows II and III of column I by a series of row operations. The example below illustrates this:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right] \geq \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right]$$

The left-hand augmented matrix is converted to the right-hand augmented matrix by II/II − 4I, that is, row II is replaced by row II minus 4 times row I. Then III/III − 6I, that is, row III is replaced by row III minus 6 times row I. To complete the row operations to yield [G|P] from [A|C] we write

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right] \geq \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

This is accomplished by III/III − 4II, that is, row III is replaced by row III minus 4 times row II. As we have just shown, using two series of row operations we have

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

which is equivalent to equations 2-17 through 2-19, and to equation set 3-2 above; this is shown here as (equation set 3-3).

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -8c_1 &= -30 \end{aligned} \tag{3-3}$$

Now, solving for $c_1 = -30/-8 = 3.75$; substituting $c_1$ into equation 2-18, we find $-2b_1 + 3.75 = 14$, therefore $b_1 = -5.125 \approx -5.13$; and substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$, therefore $a_1 = 3.125 \approx 3.13$; and so,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

Thus matrix operations provide a simplified method for solving equation systems as compared to elementary algebraic operations for linear equations.
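The three row operations used above (II/II − 4I, III/III − 6I, and III/III − 4II) can be replayed directly on the augmented matrix. In this sketch the helper `row_minus_multiple` and the variable names are assumptions introduced for illustration; rows are numbered from zero in the code, so row I is index 0.

```python
def row_minus_multiple(matrix, target, source, factor):
    """Return a copy with row `target` replaced by
    row `target` minus `factor` times row `source` (0-based indices)."""
    result = [row[:] for row in matrix]
    result[target] = [t - factor * s
                      for t, s in zip(matrix[target], matrix[source])]
    return result


# Augmented matrix [A|C] of the equation system.
A_C = [[1, 1, 0, -2],
       [4, 2, 1, 6],
       [6, -2, -4, 14]]

step1 = row_minus_multiple(A_C, 1, 0, 4)    # II/II - 4I
step2 = row_minus_multiple(step1, 2, 0, 6)  # III/III - 6I
G_P = row_minus_multiple(step2, 2, 1, 4)    # III/III - 4II
print(G_P)  # [[1, 1, 0, -2], [0, -2, 1, 14], [0, 0, -8, -30]]

# Back-substitution, working upward from the last row.
c1 = G_P[2][3] / G_P[2][2]
b1 = (G_P[1][3] - G_P[1][2] * c1) / G_P[1][1]
a1 = (G_P[0][3] - G_P[0][1] * b1 - G_P[0][2] * c1) / G_P[0][0]
print(a1, b1, c1)  # 3.125 -5.125 3.75
```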
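CALCULATING THE INVERSE OF A MATRIX
In Chapter 2, we promised to show the steps involved in taking the inverse of a matrix. Given a 2 × 2 matrix $X_{2 \times 2}$, how is the inverse calculated? We can ask the question another way: “What matrix, when multiplied by a given matrix $X_{r \times c}$, will give the identity matrix ([I])?” In matrix form we may write a specific example as:

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \geq \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Therefore,

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [1]$$

or stated in matrix notation as A × B = I, where B is the inverse matrix of A, and [I] is the identity matrix.
By multiplying A × B we can calculate the two basic equation systems to use in solving this problem as:

System 1:
$$\begin{aligned} -2c_1 + 1c_2 &= 1 \\ -3c_1 + 2c_2 &= 0 \end{aligned}$$

System 2:
$$\begin{aligned} -2d_1 + 1d_2 &= 0 \\ -3d_1 + 2d_2 &= 1 \end{aligned}$$

Both systems share the same matrix of coefficients, so they can be combined into a single augmented matrix:

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right]$$

This matrix is reduced to echelon form (a zero in the second row of column one) by

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right] \geq \left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right]$$

The row operation is II/3I − 2II, that is, row II is replaced by three times row I minus two times row II. The next steps are as follows:

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right] \geq \left[\begin{array}{rr|rr} -2 & 0 & 4 & -2 \\ 0 & -1 & 3 & -2 \end{array}\right]$$

with row operations (I/I + II) and then I/−1/2 I. Multiplying row II by −1 in the same way leaves the identity matrix on the left-hand side:

$$\left[\begin{array}{rr|rr} 1 & 0 & -2 & 1 \\ 0 & 1 & -3 & 2 \end{array}\right]$$

Thus $c_1 = -2$, $c_2 = -3$, $d_1 = 1$, and $d_2 = 2$. So $B = A^{-1}$ (inverse of A) and

$$A^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix}$$

So now we check our work by multiplying $A \bullet A^{-1}$ as follows:

$$A \times A^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} = \begin{bmatrix} (-2)(-2) + (1)(-3) & (-2)(1) + (1)(2) \\ (-3)(-2) + (2)(-3) & (-3)(1) + (2)(2) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [1]$$

By coincidence, we have found a matrix which when multiplied by itself gives the identity matrix or, saying it another way, it is its own inverse. Of course, that does not generally happen; a matrix and its inverse are usually different.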
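The same idea, augmenting A with the identity matrix and applying row operations until the identity appears on the left, can be sketched as a general routine. This is an illustrative Gauss-Jordan implementation under stated assumptions, not the exact sequence of row operations used above: the function name `inverse` is hypothetical, each pivot row is scaled to one before the column is cleared, and exact fractions are used so no rounding occurs.

```python
from fractions import Fraction


def inverse(matrix):
    """Invert a square matrix by row-reducing [matrix | identity]."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Rule 1: bring a row with a nonzero entry into the pivot position.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular and has no inverse")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Rule 2: scale the pivot row so the pivot element becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Rule 3: clear the rest of the column with multiples of the pivot row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


A = [[-2, 1], [-3, 2]]
A_inv = inverse(A)
print([[int(v) for v in row] for row in A_inv])  # [[-2, 1], [-3, 2]]
```

For this particular matrix the routine returns A itself, in agreement with the observation that it is its own inverse.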
SUMMARY
Hopefully Chapters 1 and 2 have refreshed your memory of early studies in matrix algebra. In this chapter we have tried to review the basic steps used to solve a system of linear equations using elementary matrix algebra. In addition, basic row operations were used to calculate the inverse of a matrix. In the next chapter we will address the matrix nomenclature used for a simple case of multiple linear regression.
REFERENCES
1. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.
4 Matrix Algebra and Multiple Linear Regression: Part 1
In a previous chapter we noted that by augmenting the matrix of coefficients with a unit matrix (i.e., one that has all the members equal to zero except on the main diagonal, where the members of the matrix equal unity), we could arrive at the solution to the simultaneous equations that were presented. Since simultaneous equations are, in one sense, a special case of regression (i.e., the case where there are no degrees of freedom for error), it is still appropriate to discuss a few odds and ends that were left dangling. We started in the previous chapter with the set of simultaneous equations:

$$1a + 1b + 0c = -2 \tag{4-1a}$$
$$4a + 2b + 1c = 6 \tag{4-1b}$$
$$6a - 2b - 4c = 14 \tag{4-1c}$$

(where we now leave the subscripts off the variables for simplicity, with no loss of generality for our current purposes. Also note that here we write all the coefficients out explicitly, even when the ones and zeroes do not necessarily appear in the original equations; this is so that they will not be inadvertently left out of the matrix expressions, where the “place filling” function must be performed), and we noted that we could express these equations in matrix notation as:

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad B = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \qquad C = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

where the equations then take the matrix form:

$$A \ast B = C \tag{4-2}$$

The question here is, how did we get from equations 4-1 to equation 4-2? The answer is that it is not at all obvious, even in such a simple and straightforward case, how to break up a group of algebraic equations into their equivalent matrix expression. It turns out, however, that going in the other direction is often much simpler and more straightforward. Thus, when setting up matrix expressions, it is often desirable to run a check on the work to verify that the matrix expression indeed correctly represents the algebraic expression of interest. In the current case, this can be done very simply by carrying out the matrix multiplication indicated on the left-hand side of equation 4-2. Thus, expanding the matrix expression AB into its full representation, we obtain

$$\begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \times \begin{bmatrix} a \\ b \\ c \end{bmatrix} \tag{4-3}$$
From our previous chapter defining the elementary matrix operations, we recall the operation for multiplying two matrices: the i, j element of the result matrix (where i and j represent the row and the column of an element in the matrix, respectively) is the sum of cross-products of the ith row of the first matrix and the jth column of the second matrix (this is the reason that the result of multiplying matrices depends upon the order of appearance of the matrices; if the indicated ith row and jth column do not have the same number of elements, the matrices cannot be multiplied).
Now let us apply this definition to the pair of matrices listed above. The first matrix (A) has three rows and three columns. The second matrix (B) has three rows and one column. Since each row of A has three elements, and the single column of B has three elements, matrix multiplication is possible. The resulting matrix will have three rows, each row resulting from one of the rows of matrix A, and one column, corresponding to the single column in the matrix B. Thus the first row of the result matrix will have the single element resulting from the sum of products of the first row of A times the column of B, which will be

$$1a + 1b + 0c \tag{4-4}$$

Similarly the second row of the result matrix will have the single element resulting from the sum of products of the second row of A times the column of B, which will be

$$4a + 2b + 1c \tag{4-5}$$

and the third row of the result matrix will have the single element resulting from the sum of products of the third row of A times the column of B, which will be

$$6a + (-2)b + (-4)c \tag{4-6}$$

or, simplifying:

$$6a - 2b - 4c \tag{4-7}$$

The entire matrix product, then, is

$$AB = \begin{bmatrix} 1a + 1b + 0c \\ 4a + 2b + 1c \\ 6a - 2b - 4c \end{bmatrix}$$

Equations 4-4, 4-5, and 4-6 represent the three elements of the matrix product of A and B. Note that each row of this resulting matrix contains only one element, even though each of these elements is the result of a fairly extensive sequence of arithmetic operations. Equations 4-4, 4-5, and 4-7, however, represent the symbolism you would normally expect to see when looking at the set of simultaneous equations that these matrix expressions replace. Note further that this matrix product AB is the same as the entire left-hand side of the original set of simultaneous equations that we originally set out to solve. Thus we have shown that these matrix expressions can be readily verified through straightforward application of the basic matrix operations, thus clearing up one of the loose ends we had left.
Another loose end is the relationship between the quasi-algebraic expressions that matrix operations are normally written in and the computations that are used to implement those relationships. The computations themselves have been covered at some length in the previous two chapters [1, 2]. To relate these to the quasi-algebraic operations that matrices are subject to, let us look at those operations a bit more closely.
QUASI-ALGEBRAIC OPERATIONS
Thus, considering equation 4-2, we note that the matrix expression looks like a simple algebraic expression relating the product of two variables to a third variable, even though in this case the “variables” in question are entire matrices. In equation 4-2, the matrix B represents the unknown quantities in the original simultaneous equations. If equation 4-2 were a simple algebraic equation, clearly the solution would be to divide both sides of this equation by A, which would result in the equation B = C/A. Since A and C both represent known quantities, a simple calculation would give the solution for the unknown B.
There is no defined operation of division for matrices. However, a comparable result can be obtained by multiplying both sides of an equation (such as equation 4-2) by the inverse of matrix A. The inverse (of matrix A, for example) is conventionally written as $A^{-1}$. Thus, the symbolic solution to equation 4-2 is generated by multiplying both sides of equation 4-2 by $A^{-1}$:

$$A^{-1} A B = A^{-1} C \tag{4-8}$$

There are a couple of key points to note about this operation. The main point is that since the order of appearance of the matrices matters, it is important that the new matrix, the one we are multiplying both sides of the equation by, is placed at the beginning of the expressions on each side of the equation. The second key point is the accomplishment of a desired goal: on the left-hand side of equation 4-8 we have the expression $A^{-1}A$. We noted earlier that the key defining characteristic of the inverse of a matrix is the fact that when it is multiplied by the original matrix (that it is the inverse of), the result is a unit matrix. Thus equation 4-8 is equivalent to

$$[1]B = A^{-1} C \tag{4-9}$$

where [1] represents the unit matrix. Since the property of the unit matrix is that when multiplied by any other matrix, the result is the same as the other matrix, then [1]B = B, and equation 4-9 becomes

$$B = A^{-1} C \tag{4-10}$$

Thus we have symbolically solved equation 4-2 for the unknown matrix B, the elements of which are the unknown variables of the original set of simultaneous equations. Performing the matrix multiplication of $A^{-1}C$ will then provide the values of these unknown variables.
Let us examine these symbolic transformations with a view toward seeing how they translate into the required arithmetic operations that will provide the answers to the original simultaneous equations. There are two key operations involved. The first is the inversion of the matrix, to provide the inverse matrix. This is an extremely intensive computational task, so much so that it is in general done only on computers, except in the simplest cases for pedagogical purposes, such as we did in our previous chapter. In this regard we are reminded of an old, and somewhat famous, cartoon, where two obviously professor-type characters are staring at a large blackboard. On the left side of the blackboard are a large number of mathematical symbols, obviously representing some complicated and abstruse mathematical derivations. On the right side of the blackboard is a similar set of symbols. In the middle of the blackboard is a large blank space, in the middle of which is written, in big letters: “AND THEN SOME MAGIC HAPPENS”, and one of the characters is saying to the other: “I think you need to be a bit more explicit here in step 10.”
To some extent, we feel the same way about matrix inversions. The complications and amount of computation involved in actually doing a matrix inversion are enough to make even the most intrepid mathematician/statistician/chemometrician run for the nearest computer with a preprogrammed algorithm for the task. Indeed, there sometimes seem to be just about as many algorithms for performing a matrix inversion as there are people interested in doing them. In most cases, then, this process is in practice treated as a “black box” where “some magic happens”. Except for the theoretical mathematician, however, there is usually little interest in “being more explicit”, as long as the program gives the right answer. As is our wont, however, our previous chapter worked out the gory details for the simplest possible case, the case of a 2 × 2 matrix.
For larger matrices, the amount of computation increases so rapidly with matrix size that even the 3 × 3 matrix is left to the computer to handle. But how can we tell then if the answer is correct? Well, there is a way, and one that is not too overwhelming. From the definition of the inverse of a matrix, you should obtain a unit matrix if you multiply the inverse of a given matrix by the matrix itself. In our previous chapter [1] we showed this for the 2 × 2 case. For the simultaneous equations at hand, however, the process is only a little more extensive. From the original matrix of coefficients in the simultaneous equations that we are working with, the one called A above, we find that the inverse of this matrix is

$$A^{-1} = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \tag{4-11}$$

How did we find this? Well, we used some of our magic. The details of the computations needed were described in the previous chapter, for the 2 × 2 case; we will not even try to go through the computations needed for the 3 × 3 case we concern ourselves with here. However, having a set of numbers that purports to be the inverse of a matrix, we can verify whether or not it is the inverse of that matrix: all we need to do is multiply by the original matrix and see if the result is a unit matrix. We have done this for the 2 × 2 matrix in our previous chapter. An exercise for the reader is to verify that the matrix shown in equation 4-11 is, in fact, the inverse of the matrix A.
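For readers who would rather let a computer do the exercise, the check can be sketched as below. The helper name `matrix_multiply` is an assumption for this example; the numbers are the coefficient matrix A and the purported inverse quoted in this chapter. All the values are exact binary fractions, so the floating-point products come out exactly.

```python
def matrix_multiply(x, y):
    """Row-by-column product of two matrices stored as lists of rows."""
    columns_of_y = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_y]
            for row in x]


A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# The purported inverse of A quoted in the text.
A_inv = [[-0.375, 0.25, 0.0625],
         [1.375, -0.25, -0.0625],
         [-1.25, 0.5, -0.125]]

unit = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(matrix_multiply(A_inv, A) == unit)  # True
print(matrix_multiply(A, A_inv) == unit)  # True
```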
That was the hard part. It now remains to calculate out the expressions shown in equation 4-10, to find the final values for the unknowns in the original simultaneous equations. Thus, we need to form the matrix product of $A^{-1}$ and C:

$$A^{-1} C = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \times \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix} \tag{4-12}$$

This matrix multiplication is similar to the one we did before: we need to multiply a 3 × 3 matrix by a 3 × 1 matrix; the result will then also have dimensions of three rows and one column. By equation 4-10 this result is the matrix B, and its three rows will thus be the result of these computations:

$$B_{11} = (-0.375)(-2) + (0.25)(6) + (0.0625)(14) = 0.75 + 1.5 + 0.875 = 3.125 \tag{4-13a}$$
$$B_{21} = (1.375)(-2) + (-0.25)(6) + (-0.0625)(14) = -2.75 + (-1.5) + (-0.875) = -5.125 \tag{4-13b}$$
$$B_{31} = (-1.25)(-2) + (0.5)(6) + (-0.125)(14) = 2.5 + 3 + (-1.75) = 3.75 \tag{4-13c}$$

Thus, in matrix terms, the matrix B is

$$B = \begin{bmatrix} 3.125 \\ -5.125 \\ 3.75 \end{bmatrix} \tag{4-14}$$

and this may be compared to the result we obtained algebraically in the last chapter (and found to be identical, within the limits of different roundings used).
At first glance it would seem as though this approach has the additional characteristic of requiring fewer computations than our previous method of solving similar equations. However, the computations are exactly the same, but most of them are “hidden” inside the matrix inversion.
It might also seem that we have been repetitive in our explanation of these simultaneous equations. This is intentional: we are attempting to explicate the relationship between the algebraic approach and the matrix approach to solving the equations. Our first solution (in the previous chapter) was strictly algebraic. Our second solution used matrix terminology and concepts, in addition to explicitly writing out all the arithmetic involved. Our third approach uses symbolic matrix manipulation, substituting numbers only in the last step.
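The final multiplication of the inverse matrix by the right-hand-side column can be sketched in a few lines. The variable names are assumptions for illustration; the numbers are those quoted in this chapter. As a cross-check, the sketch also substitutes the result back into the original three equations.

```python
# Inverse of the coefficient matrix, as quoted in the text.
A_inv = [[-0.375, 0.25, 0.0625],
         [1.375, -0.25, -0.0625],
         [-1.25, 0.5, -0.125]]

# Right-hand sides of the three simultaneous equations.
C = [-2, 6, 14]

# Each unknown is the sum of products of one row of A_inv with C.
unknowns = [sum(coef * value for coef, value in zip(row, C)) for row in A_inv]
print(unknowns)  # [3.125, -5.125, 3.75]

# Cross-check: put the unknowns back into the original equations.
A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]
left_hand_sides = [sum(coef * u for coef, u in zip(row, unknowns)) for row in A]
print(left_hand_sides)  # [-2.0, 6.0, 14.0]
```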
MULTIPLE LINEAR REGRESSION
In Chapters 2 and 3, we discussed the rules related to solving systems of linear equations using elementary algebraic manipulation, including simple matrix operations. The past chapters have described the inverse and transpose of a matrix in at least an introductory fashion. In this installment we would like to introduce the concepts of matrix algebra and their relationship to multiple linear regression (MLR). Let us start with the basic spectroscopic calibration relationship:

Concentration = Bias + (Regression Coefficient 1) × (Absorbance at Wavelength 1) + (Regression Coefficient 2) × (Absorbance at Wavelength 2)

Also written as:

$$\text{Concentration} = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-15}$$

In this example we state that the concentration of an analyte within a sample is a linear combination of two variables. These variables, in our case, are measured in the same units, that is, Absorbance units. In this case the concentration is known as the dependent variable or response variable because its magnitude depends on, or responds to, the changes in Absorbances at Wavelengths 1 and 2. The Absorbances are the x-variables, referred to as independent variables, regressor variables, or predictor variables. Thus an equation such as equation 4-15 attempts to explain the relationship between concentration and changes in Absorbance. This calibration equation or calibration model is said to be linear because the relationship is a linear combination of multiplier terms or regression coefficients as predictors of the concentration (response or dependent variable). Note that the $\beta_1$ and $\beta_2$ terms are called Regression Coefficients, Multiplier Terms, Multipliers, or sometimes Parameters. The analysis described is referred to as Linear Regression, Least-Squares, Linear Least-Squares, or most properly, MLR. In more formal notation, we can rewrite equation 4-15 as:

$$E(c_j) = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-16}$$

where $E(c_j)$ is the expected value for the concentration. Note: the difference between $E(c_j)$ and $c_j$ is the difference between the predicted or expected value $E(c_j)$ and the actual or observed value $c_j$. This can be rewritten as:

$$c_j - E(c_j) = c_j - (\beta_0 + \beta_1 A_1 + \beta_2 A_2) \tag{4-17}$$

and

$$c_j = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \varepsilon_j \tag{4-18}$$

where $\varepsilon_j$ is termed the Prediction Error, Residual Error, Residual, Error, Lack of Fit Error, or the Unexplained Error.
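We can also rewrite the equation in matrix form as:

$$C = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix} \qquad A = \begin{bmatrix} 1 & A_{11} & A_{12} \\ 1 & A_{21} & A_{22} \\ 1 & A_{31} & A_{32} \\ \vdots & \vdots & \vdots \\ 1 & A_{N1} & A_{N2} \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_N \end{bmatrix} \tag{4-19}$$

This equation of the model in matrix notation is written as:

$$C = A\beta + \varepsilon \tag{4-20}$$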
THE LEAST SQUARES METHOD The problem now becomes: how do we handle the situation in which we have more equations than unknowns? When there are fewer equations than unknowns it is clear that there is not enough information available to determine the values of the unknown variables. When we have more equations than unknowns, however, we would seem to have the problem of having too much information; how do we handle all this extra information and put it to use? For example, consider the following set of simultaneous equations: 1a + 1b + 0c = −2
(421a)
4a + 2b + 1c = 6
(421b)
6a − 2b − 4c = 14
(421c)
1a + 3b + −1c = −15
(421d)
This is a set of equations in three unknowns. The first three of these equations are the ones we dealt with above, and we have seen that the solution to the first three equations is a = 3125
(422a)
b = −5125
(422b)
c = 375
(422c)
However, when we replace a, b and c in equation 4-21d by those values, we find that 1 × 3.125 + 3 × (−5.125) + (−1) × 3.75 = −16 rather than the −15 that the equation specifies. If we were to use different subsets of three of these equations at a time, we would obtain different answers depending on which set of three equations we used. There seems to be an inconsistency here, yet in the set of four equations represented by equations 4-21 (a–d) all the equations have the same significance; there are no a priori criteria for eliminating any one of them. This is the situation we must handle. We cannot simply ignore one or more of these equations arbitrarily; dealing with them properly has become known variously as the Least Squares method, Multiple Least Squares, or Multiple Linear Regression.

As spectroscopists, we are concerned with the application of these mathematical techniques to the solution of spectroscopic problems, particularly the use of spectroscopy to perform quantitative analysis, which is done by applying these concepts to a set of linear equations, as we will see. In this least squares method example the object is to calculate the terms $\beta_0$, $\beta_1$ and $\beta_2$ which produce a prediction model yielding the smallest or “least squared” differences or residuals between the actual analyte value $c_j$ and the predicted or expected concentration $E(c_j)$. To calculate the multiplier terms or regression coefficients $\beta_j$ for the model we can begin with the matrix notation:

$$A'A\beta = A'C \tag{4-23}$$
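The inconsistency in equations 4-21a to 4-21d, and the way equation 4-23 resolves it, can be checked numerically. The following is a minimal sketch in plain Python, not the authors' procedure: the helper `solve` and the variable names are illustrative assumptions. It solves the first three equations exactly, substitutes the result into the fourth, and then solves the normal equations built from all four.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# Coefficients of (a, b, c) and right-hand sides of equations 4-21a to 4-21d.
COEFFS = [[1.0, 1.0, 0.0], [4.0, 2.0, 1.0], [6.0, -2.0, -4.0], [1.0, 3.0, -1.0]]
RHS = [-2.0, 6.0, 14.0, -15.0]

# Exact solution of the first three equations only.
first_three = solve(COEFFS[:3], RHS[:3])
# Left-hand side of equation 4-21d evaluated at that solution.
lhs_d = sum(k * x for k, x in zip(COEFFS[3], first_three))

# Normal equations A'A x = A'C built from all four equations.
ata = [[sum(row[i] * row[k] for row in COEFFS) for k in range(3)] for i in range(3)]
atc = [sum(row[i] * y for row, y in zip(COEFFS, RHS)) for i in range(3)]
least_squares = solve(ata, atc)
residuals = [y - sum(k * x for k, x in zip(row, least_squares))
             for row, y in zip(COEFFS, RHS)]

print("first three equations:", first_three, "-> 4-21d gives", lhs_d)
print("least squares:", least_squares)
print("residuals:", residuals)
```

The least squares answer satisfies none of the four equations exactly unless the data happen to allow it; what it guarantees is that the residual vector is orthogonal to every column of the coefficient matrix.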
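When solving for $\hat\beta$ the expression becomes

$$\hat\beta = (A'A)^{-1}A'C \tag{4-24}$$

To illustrate the matrix algebra involved for this problem we write

$$
A'A =
\begin{bmatrix}
\sum_j 1^2 & \sum_j A_{j1}\times 1 & \sum_j A_{j2}\times 1 \\
\sum_j 1\times A_{j1} & \sum_j A_{j1}\times A_{j1} & \sum_j A_{j2}\times A_{j1} \\
\sum_j 1\times A_{j2} & \sum_j A_{j1}\times A_{j2} & \sum_j A_{j2}\times A_{j2}
\end{bmatrix}
=
\begin{bmatrix}
N & A_{\bullet 1} & A_{\bullet 2} \\
A_{\bullet 1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\
A_{\bullet 2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
\tag{4-25}
$$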
Then rewriting in summation notation we have

$$\sum_{j=1}^{N} 1^2 = N, \qquad \sum_{j=1}^{N} A_{j1} = A_{\bullet 1} \qquad\text{and}\qquad \sum_{j=1}^{N} A_{j1}\times A_{j2} = \sum_{j} A_{j1}A_{j2} \tag{4-26}$$
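Note that $A'C$ is also required for the computations (see equation 4-24) and is given as:

$$
A'C =
\begin{bmatrix}
\sum_j 1\times C_j \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j
\end{bmatrix}
=
\begin{bmatrix}
N\bar{C} \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j
\end{bmatrix}
\tag{4-27}
$$

where $\bar{C}$ denotes the mean of the $C_j$.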
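The summation forms of $A'A$ and $A'C$ can be compared with the explicit matrix products. The sketch below is illustrative only: the absorbance and concentration values are invented for the check and do not come from the text, and the function names are assumptions.

```python
def normal_equation_terms(a1, a2, c):
    """Build A'A and A'C from running sums, following equations 4-25 and 4-27."""
    n = len(c)
    s1, s2 = sum(a1), sum(a2)
    s11 = sum(x * x for x in a1)
    s22 = sum(y * y for y in a2)
    s12 = sum(x * y for x, y in zip(a1, a2))
    ata = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    atc = [sum(c),
           sum(x * cj for x, cj in zip(a1, c)),
           sum(y * cj for y, cj in zip(a2, c))]
    return ata, atc


def explicit_products(a1, a2, c):
    """Form A (with a leading column of ones) and multiply out A'A and A'C directly."""
    rows = [[1.0, x, y] for x, y in zip(a1, a2)]
    ata = [[sum(r[i] * r[k] for r in rows) for k in range(3)] for i in range(3)]
    atc = [sum(r[i] * cj for r, cj in zip(rows, c)) for i in range(3)]
    return ata, atc


# Hypothetical data: four spectra, two wavelengths (values invented for illustration).
A1 = [0.10, 0.20, 0.30, 0.40]
A2 = [0.50, 0.30, 0.20, 0.10]
C = [1.0, 2.0, 3.0, 4.0]

ata_sums, atc_sums = normal_equation_terms(A1, A2, C)
ata_full, atc_full = explicit_products(A1, A2, C)
print(ata_sums)
print(atc_sums)
```

The upper-left element of $A'A$ is simply the number of spectra, and the matrix is symmetric, so only six of its nine elements need to be accumulated.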
If we represent our spectroscopic data using the following symbols:

$j$ = Spectrum number
$C_j$ = Actual concentration for each spectrum
$N$ = Rank of each spectrum (1)
$A_{j1}$ = Absorbance at Wavelength 1
$A_{j2}$ = Absorbance at Wavelength 2.

From this information we can calculate the $\hat\beta$ (see equation 4-8) using

$$
C = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_j \end{bmatrix}, \qquad
A = \begin{bmatrix}
1 & A_{11} & A_{12} \\
1 & A_{21} & A_{22} \\
1 & A_{31} & A_{32} \\
\vdots & \vdots & \vdots \\
1 & A_{j1} & A_{j2}
\end{bmatrix}, \qquad
A'C = \begin{bmatrix} N\bar{C} \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j \end{bmatrix}
\tag{4-28}
$$
If we then calculate the inverse of $A'A$, written as $(A'A)^{-1}$, the computations are nearly complete and we finally obtain

$$\hat\beta = (A'A)^{-1}A'C = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} \tag{4-29}$$

which in conclusion gives the completed regression equation

$$E(\hat{C}) = \hat\beta_0 + \hat\beta_1 A_1 + \hat\beta_2 A_2 \tag{4-30}$$
In our next installment, we will review the “how to” of the matrix operations for this example using numerical data. Authors’ note: This initial chapter dealing with matrix algebra and regression has been adapted for spectroscopic nomenclature from Shayle R. Searle’s book, Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), pp. 363–368. Other particularly useful reference sources with page numbers are listed below as [1–3].
REFERENCES
1. Draper, N.R. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981), pp. 70–87.
2. Kleinbaum, D.G. and Kupper, L.L., Applied Regression Analysis and Other Multivariable Methods (Duxbury Press, Boston, 1978), pp. 508–520.
3. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).
5 Matrix Algebra and Multiple Linear Regression: Part 2
In the previous chapter we presented the problem of fitting data when there is more information (in the form of equations relating the several variables involved) available than the minimum amount that will allow for the solution of the equations. We then presented the matrix equations for calculating the least squares solution to this case of overdetermined variables. How did we get from one to the other?

As we described the situation, when there are more equations than unknowns, one possibility is to ignore some of the equations. This is unsatisfactory, for a number of reasons. In the first place, there is no a priori criterion for deciding which equations to ignore, so that any choice is arbitrary. Secondly, by rejecting some of the equations, we are also rejecting and wasting the work that went into the collection of the data represented by those equations. Thirdly, and perhaps most importantly, when we ignore some of the equations, we are also ignoring the (rather important) fact that the lack of perfect fit to all the equations is itself an important piece of information. What the set of equations is telling us in this case is that there is, in fact, not a perfect fit of the data, taken as a whole, to any of the equations in the set. Rather, there is some average equation, that in some sense gives a best fit to all of the data taken as a set, without favoring any particular subset of them. It is this “average” equation that we would like to be able to find.

In the history of the development of mathematics, one important branch was the study of the behavior of randomness. Initially, there were no highfalutin ideas of making “science” out of what appeared to be disorder; rather, the investigations of random phenomena that led to what we now know as the science of Statistics began as studies of the behavior of the random phenomena that existed in the somewhat more prosaic context of gambling.
It was not until much later that the recognition came that the same random phenomena that affected, say, dice, also affected the values obtained when physical measurements were made. By the time this realization arose, it was well recognized that random phenomena were describable only by probabilistic statements; by definition it is not possible to state a priori what the outcome of any given random event will be. Thus, when the attention of the mathematicians of the time turned to the description of overdetermined systems, such as we are dealing with here, it was natural for them to seek the desired solution in terms of probabilistic descriptions. They then defined the “best fitting” equation for an overdetermined set of data as being the “most probable” equation, or, in more formal terminology, the “maximum likelihood” equation. Under the proper conditions (said conditions being that the errors that prevent all the data relationships from being described by a single equation are normally [1, 2] distributed) it can be proven mathematically that the “most probable” equation is exactly the one that is the “least square” equation. While we have discussed this point
briefly in the past [3] it is, perhaps, appropriate at this point to revisit it, in a bit more detail.

The basis upon which this concept rests is the very fact that not all the data follow the same equation. Another way to express this is to note that an equation describes a line (or more generally, a plane or hyperplane if more than two dimensions are involved. In fact, anywhere in this discussion, when we talk about a calibration line, you should mentally add the phrase “... or plane, or hyperplane ...”). Thus any point that fits the equation will fall exactly on the line. On the other hand, since the data points themselves do not fall on the line (recall that, by definition, the line is generated by applying some sort of [at this point undefined] averaging process), any given data point will not fall on the line described by the equation. The difference between these two points, the one on the line described by the equation and the one described by the data, is the error in the estimate of that data point by the equation. For each of the data points there is a corresponding point described by the equation, and therefore a corresponding error. The least square principle states that the sum of the squares of all these errors should have a minimum value; and as we stated above, this will also provide the “maximum likelihood” equation.

It is certainly true that for any arbitrarily chosen equation, we can calculate what the point described by that equation is, that corresponds to any given data point. Having done that for each of the data points, we can easily calculate the error for each data point, square these errors, and add together all these squares. Clearly, the sum of squares of the errors we obtain by this procedure will depend upon the equation we use, and some equations will provide smaller sums of squares than other equations.
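The procedure just described (predict each point, take the error, square, and sum) is short enough to write out. The sketch below is only an illustration: it borrows the three calibration samples tabulated later in this chapter (Table 5-1) and two candidate equations whose coefficients also appear later in the chapter; the function name is an assumption.

```python
# (concentration, absorbance at wavelength 1, absorbance at wavelength 2)
SAMPLES = [(2.0, 0.75, 0.28), (4.0, 0.51, 0.485), (7.0, 0.32, 0.78)]


def sum_of_squares(b0, b1, b2, samples=SAMPLES):
    """Sum of squared errors of the equation conc = b0 + b1*A1 + b2*A2."""
    total = 0.0
    for conc, a1, a2 in samples:
        predicted = b0 + b1 * a1 + b2 * a2
        total += (conc - predicted) ** 2
    return total


# Candidate 1: equation with a constant term (hand-rounded coefficients).
sse_with_constant = sum_of_squares(-1.577, 0.784, 10.674)
# Candidate 2: equation forced through the origin (constant term set to zero).
sse_through_origin = sum_of_squares(0.0, -0.679, 8.9613)

print(sse_with_constant, sse_through_origin)
```

Different candidate equations give different sums of squares; the least square equation is the candidate for which this number is as small as it can be.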
It is not necessarily intuitively obvious that there is one and only one equation that will provide the smallest possible sum of squares of these errors under these conditions; however, it has been proven mathematically to be so. This proof is very abstruse and difficult. In fact, it is easier to find the equation that provides this “least square” solution than it is to prove that the solution is unique. A reasonably accessible demonstration, expressed in both algebraic and matrix terms, of how to find the least square solution is available.

Even though regression analysis (one of the more common names for the application of the least square principle) is a general mathematical technique, when we are dealing with spectroscopic data, so that the equation we wish to fit must be fitted to data obtained from systems that follow Beer’s law, it is convenient to limit our discussion to the properties of spectroscopic systems. Thus we will couch our discussion in terms of quantitative analysis performed using spectroscopic data; then the dependent variable of the least square regression analysis (usually called the “Y” variable by mathematicians) will represent the concentration of analyte in the set of samples used to calibrate the system, and the independent (or “X”) variable will represent absorbance values measured by a suitable instrument in whichever spectral region we are dealing with.

We will begin our discussion by demonstrating that, for a non-overdetermined system of equations, the algebraic approach and the least-square approach provide the same solution. We will then extend the discussion to the case of an overdetermined system of equations. Therefore this chapter will continue the multiple linear regression (MLR) discussion introduced in the previous chapter, by solving a numerical example for MLR. Recalling
the basic ultraviolet, visible, near-infrared, and infrared use of MLR for spectroscopic calibration, we have

Concentration = Constant term (or Bias) + (Regression coefficient 1) · (Absorbance at wavelength 1) + (Regression coefficient 2) · (Absorbance at wavelength 2) + ··· + (Regression coefficient N) · (Absorbance at wavelength N)

Also written in equation form as:

$$\text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} \tag{5-1}$$

By including an error term, we can write the equation as:

$$\text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} + e$$

And also in expanded matrix form as:

$$
c = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix}, \qquad
A = \begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} & \cdots & A_{1N} \\
A_{21} & A_{22} & A_{23} & A_{24} & \cdots & A_{2N} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
A_{M1} & A_{M2} & A_{M3} & A_{M4} & \cdots & A_{MN}
\end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_N \end{bmatrix}
\tag{5-2}
$$

$$
e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_M \end{bmatrix}
\tag{5-3}
$$

and in simplified matrix notation, the equation is

$$c = A\beta + e \tag{5-4}$$
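One common way to get from the matrix form of the calibration model to the normal equations quoted in the previous chapter is to minimize the sum of squared errors directly. The derivation below is a standard textbook sketch rather than the authors' own presentation; it assumes that $A'A$ is invertible and writes $S(\beta)$ for the sum of squared errors.

```latex
\begin{aligned}
S(\beta) &= e'e = (c - A\beta)'(c - A\beta)
          = c'c - 2\beta'A'c + \beta'A'A\beta \\
\frac{\partial S}{\partial \beta} &= -2A'c + 2A'A\beta = 0 \\
A'A\hat\beta &= A'c
\quad\Longrightarrow\quad
\hat\beta = (A'A)^{-1}A'c
\end{aligned}
```

Setting the derivative to zero is what turns the "smallest sum of squares" requirement into a set of linear equations with exactly as many equations as unknown coefficients, however many samples are measured.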
Because we have limited time and space, let us solve our problem using two wavelengths (or frequencies) and a basic calculator. To define the problem, we start with a set of calibration samples with the characteristics listed in Table 5-1.

Table 5-1 Characteristics of the calibration samples

Sample number   Concentration   Signal at wavelength 1   Signal at wavelength 2
1               2.0             0.75                     0.28
2               4.0             0.51                     0.485
3               7.0             0.32                     0.78

The system of equations for solving this problem can be written as

$$2.0 = \beta_0 + \beta_1(0.75) + \beta_2(0.28) \tag{5-5a}$$
$$4.0 = \beta_0 + \beta_1(0.51) + \beta_2(0.485) \tag{5-5b}$$
$$7.0 = \beta_0 + \beta_1(0.32) + \beta_2(0.78) \tag{5-5c}$$
and in simplified matrix form as

$$C = [A]\cdot[\beta] \tag{5-6}$$

and written in matrix form (with the constant term as the third column) as:

$$
C = \begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_0 \end{bmatrix}, \qquad
A = \begin{bmatrix} 0.75 & 0.28 & 1 \\ 0.51 & 0.485 & 1 \\ 0.32 & 0.78 & 1 \end{bmatrix}
\tag{5-7}
$$

The augmented matrix formed by $[A|C]$ is designated as:

$$
[A|C] = \begin{bmatrix} 0.75 & 0.28 & 1 & 2.0 \\ 0.51 & 0.485 & 1 & 4.0 \\ 0.32 & 0.78 & 1 & 7.0 \end{bmatrix}
\tag{5-8}
$$

The first task is to use elementary matrix row operations to manipulate matrix $[A|C]$ to yield zeros in rows II and III of column I. The row operations are to replace row II by row II minus 0.68 times row I, that is, II = II − 0.68 × I; followed by replacing row III by row III minus 0.4267 times row I, that is, III = III − 0.4267 × I. To complete our row operations we must place zeros in columns I and II of row III by replacing row III by row III minus 2.242 times row II, that is, III = III − 2.242 × II. These row operations yield (remember to keep as much precision as possible in your calculations):

$$
\begin{bmatrix} 0.75 & 0.28 & 1 & 2.0 \\ 0 & 0.2946 & 0.32 & 2.64 \\ 0 & 0 & -0.1442 & 0.2274 \end{bmatrix}
\tag{5-9}
$$

In summary, by using two series of row operations, namely II = II − 0.68 I, III = III − 0.4267 I, and III = III − 2.242 II, we have

$$
\begin{bmatrix} 0.75 & 0.28 & 1 & 2.0 \\ 0.51 & 0.485 & 1 & 4.0 \\ 0.32 & 0.78 & 1 & 7.0 \end{bmatrix}
\rightarrow
\begin{bmatrix} 0.75 & 0.28 & 1 & 2.0 \\ 0 & 0.2946 & 0.32 & 2.64 \\ 0 & 0 & -0.1442 & 0.2274 \end{bmatrix}
\tag{5-10}
$$
These two matrices (original and final) are row equivalent because by using simple row operations the right matrix was formed from the left matrix. The final matrix is equivalent to a set of equations as shown below:

$$0.75\beta_1 + 0.28\beta_2 + 1.0\beta_0 = 2.0 \tag{5-11a}$$
$$0.2946\beta_2 + 0.32\beta_0 = 2.64 \tag{5-11b}$$
$$-0.1442\beta_0 = 0.2274 \tag{5-11c}$$

Now solving the system of equations yields $(-0.1442)\beta_0 = 0.2274$, $\beta_0 = -1.577$; solving for $\beta_2$, we find $(0.2946)\beta_2 + 0.32(-1.577) = 2.64$, $\beta_2 = 10.674$; solving for $\beta_1$ yields $(0.75)\beta_1 + 0.28(10.674) + 1(-1.577) = 2.0$, $\beta_1 = 0.784$.
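The row operations and the back-substitution can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are invented, no pivoting is used so that the multipliers match those quoted in the text, and the calculation runs at full double precision, so the results differ in the third decimal place from the hand-rounded figures.

```python
def eliminate(augmented):
    """Forward elimination without pivoting, mirroring the row operations in the text."""
    m = [row[:] for row in augmented]
    n = len(m)
    for col in range(n - 1):
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]  # first multiplier is 0.51 / 0.75 = 0.68
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return m


def back_substitute(upper):
    """Solve an upper-triangular augmented system from the last row upward."""
    n = len(upper)
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(upper[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (upper[r][n] - known) / upper[r][r]
    return x


# Augmented matrix [A|C]; columns are wavelength 1, wavelength 2, constant term, concentration.
AUGMENTED = [
    [0.75, 0.28, 1.0, 2.0],
    [0.51, 0.485, 1.0, 4.0],
    [0.32, 0.78, 1.0, 7.0],
]

upper = eliminate(AUGMENTED)
beta1, beta2, beta0 = back_substitute(upper)  # same column order as the matrix

print([[round(v, 4) for v in row] for row in upper])
print(round(beta0, 4), round(beta1, 4), round(beta2, 4))
```

Because the constant term occupies the third column, the solution vector comes back in the order (wavelength 1 coefficient, wavelength 2 coefficient, constant).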
And so,

$\beta_0 = -1.577$
$\beta_1 = 0.784$
$\beta_2 = 10.674$

Substituting into the original equations and calculating the differences between predicted and actual results, we find the results shown in Table 5-2.

Table 5-2 Results after substituting into the original equations and calculating the differences between predicted and actual results (using manual row operations)

Sample number   $\beta_0 + \beta_1(A_{\lambda 1}) + \beta_2(A_{\lambda 2})$ =   Predicted − Actual = Residual
1               −1.577 + 0.784(0.75) + 10.674(0.28) =                            2.0 − 2.0 = 0
2               −1.577 + 0.784(0.51) + 10.674(0.485) =                           4.0 − 4.0 = 0
3               −1.577 + 0.784(0.32) + 10.674(0.78) =                            7.0 − 7.0 = 0

The foregoing discussion is all based on one important assumption: that the equation describing the relationship between the data does, in fact, include a constant term. If Beer’s law is strictly followed, however, when the concentration of all absorbing constituents is zero, then the absorbance (at all wavelengths, no less) is also zero: that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least squares expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the expression for the prediction equation, corresponding to equation 5-11d as:

$$\text{Conc.} = \beta_1 A_1 + \beta_2 A_2 \tag{5-11d}$$

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least square expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. However, we will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included – if we had more data (even only one more relationship), they would be overdetermined in both cases.

Then, if the equation system is solved with no constant term ($\beta_0$), we have the following results (you can either take our word for it or perform the row operations for yourself. Exercise for the reader: do those row operations.): $\beta_2(0.2946) = 2.64$, $\beta_2 = 8.9613$; and $\beta_1(0.75) + 0.28(8.9613) = 2.0$, $\beta_1 = -0.679$. And so,

$\beta_1 = -0.679$
$\beta_2 = 8.9613$
and the results are shown in Table 5-3.

Table 5-3 Results when there is no constant (bias) term after substituting into the original equations and calculating the differences between predicted and actual results

Sample number   $\beta_1(A_{\lambda 1}) + \beta_2(A_{\lambda 2})$ =   Predicted − Actual = Residual
1               −0.679(0.75) + 8.9613(0.28) =                         2.0 − 2.0 = 0.0
2               −0.679(0.51) + 8.9613(0.485) =                        4.0 − 4.0 = 0.0
3               −0.679(0.32) + 8.9613(0.78) =                         6.78 − 7.0 = −0.23

Another exercise for the reader: Why is a bias term often used in regression for spectroscopic data?
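The no-constant coefficients quoted above reproduce samples 1 and 2 exactly and leave all of the misfit on sample 3. A least squares fit through the origin would instead spread the error over all three samples. The sketch below shows that alternative for comparison; it is an assumption-laden illustration (the function names are invented, and the closed-form 2 × 2 solution of the normal equations is used rather than row operations), not the calculation performed in the text.

```python
# (concentration, absorbance at wavelength 1, absorbance at wavelength 2), from Table 5-1
SAMPLES = [(2.0, 0.75, 0.28), (4.0, 0.51, 0.485), (7.0, 0.32, 0.78)]


def fit_through_origin(samples):
    """Least squares fit of conc = b1*A1 + b2*A2 via the 2 x 2 normal equations."""
    s11 = sum(a1 * a1 for _, a1, _ in samples)
    s12 = sum(a1 * a2 for _, a1, a2 in samples)
    s22 = sum(a2 * a2 for _, _, a2 in samples)
    s1c = sum(a1 * c for c, a1, _ in samples)
    s2c = sum(a2 * c for c, _, a2 in samples)
    det = s11 * s22 - s12 * s12
    b1 = (s1c * s22 - s12 * s2c) / det
    b2 = (s11 * s2c - s12 * s1c) / det
    return b1, b2


def sum_of_squares(b1, b2, samples):
    """Sum of squared residuals for a no-constant equation."""
    return sum((c - (b1 * a1 + b2 * a2)) ** 2 for c, a1, a2 in samples)


b1_ls, b2_ls = fit_through_origin(SAMPLES)
sse_ls = sum_of_squares(b1_ls, b2_ls, SAMPLES)
sse_text = sum_of_squares(-0.679, 8.9613, SAMPLES)  # coefficients quoted in the text

print(round(b1_ls, 4), round(b2_ls, 4))
print(sse_ls, sse_text)
```

By construction the least squares coefficients cannot give a larger sum of squared residuals than any other no-constant equation, including the one obtained from the first two samples alone.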
THE POWER OF MATRIX MATHEMATICS

Now let us see what happens when we use pure, unadulterated matrix power to solve this equation system, such that

$$A'A\hat\beta = A'C \tag{5-12}$$

as equation 4-23 showed us. When solving for the regression coefficients ($\beta$), we have

$$\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \hat\beta = (A'A)^{-1}A'C \tag{5-13}$$

Noting the matrix algebra for this problem (Equation 25 from reference [1])

$$
A'A =
\begin{bmatrix}
\sum_j A_{j0}^2 & \sum_j A_{j1}A_{j0} & \sum_j A_{j2}A_{j0} \\
\sum_j A_{j0}A_{j1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\
\sum_j A_{j0}A_{j2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
=
\begin{bmatrix}
N & A_{\bullet 1} & A_{\bullet 2} \\
A_{\bullet 1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\
A_{\bullet 2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
\tag{5-14}
$$

and substituting the numbers from our current example, we illustrate the following steps:

$$A = \begin{bmatrix} 1 & 0.75 & 0.28 \\ 1 & 0.51 & 0.485 \\ 1 & 0.32 & 0.78 \end{bmatrix} \tag{5-15}$$

and so the transpose of $A$ (which is $A'$) is

$$A' = \begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix} \tag{5-16}$$
and to continue, $A$ transpose ($A'$) times $A$ is

$$
A'A =
\begin{bmatrix}
1\times1 + 1\times1 + 1\times1 & 1\times0.75 + 1\times0.51 + 1\times0.32 & 1\times0.28 + 1\times0.485 + 1\times0.78 \\
0.75\times1 + 0.51\times1 + 0.32\times1 & 0.75\times0.75 + 0.51\times0.51 + 0.32\times0.32 & 0.75\times0.28 + 0.51\times0.485 + 0.32\times0.78 \\
0.28\times1 + 0.485\times1 + 0.78\times1 & 0.28\times0.75 + 0.485\times0.51 + 0.78\times0.32 & 0.28\times0.28 + 0.485\times0.485 + 0.78\times0.78
\end{bmatrix}
=
\begin{bmatrix} 3 & 1.58 & 1.545 \\ 1.58 & 0.925 & 0.707 \\ 1.545 & 0.707 & 0.922 \end{bmatrix}
\tag{5-17}
$$

Next we need to calculate the inverse of $[A'A]$, designated $[A'A]^{-1}$. Because $A'A$ is a 3 × 3 problem, we had better use a computer program suitably equipped to calculate the inverse [2].

$$
\begin{bmatrix} 3 & 1.58 & 1.545 \\ 1.58 & 0.925 & 0.707 \\ 1.545 & 0.707 & 0.922 \end{bmatrix}
\rightarrow
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\tag{5-18}
$$
Exercise for the reader: See if you are able to determine all the row operations required to find the inverse of $A'A$. (We recommend you set aside the better part of an afternoon to work this one through!) The augmented form is written as

$$
\left[\begin{array}{ccc|ccc}
3 & 1.58 & 1.545 & 1 & 0 & 0 \\
1.58 & 0.925 & 0.707 & 0 & 1 & 0 \\
1.545 & 0.707 & 0.922 & 0 & 0 & 1
\end{array}\right]
\tag{5-19}
$$

Thanks to the power of computers we find that the inverse of $A'A$ is

$$
[A'A]^{-1} =
\begin{bmatrix}
348.0747 & -359.3786 & -307.7061 \\
-359.3786 & 373.6609 & 315.6969 \\
-307.7061 & 315.6969 & 274.639
\end{bmatrix}
\tag{5-20}
$$
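For readers who would rather not spend the afternoon on row operations, the inversion can be sketched as a short Gauss-Jordan routine. This is an illustrative sketch, not the program the authors used: the function names are invented, and $A'A$ is formed from the unrounded absorbances, because the matrix is ill-conditioned enough that starting from the three-decimal entries printed above changes the inverse noticeably.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == k else 0.0 for k in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(p, q):
    """Plain nested-loop matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]


# Design matrix: constant column first, then the two absorbances.
A = [[1.0, 0.75, 0.28], [1.0, 0.51, 0.485], [1.0, 0.32, 0.78]]
A_T = [list(col) for col in zip(*A)]
ATA = matmul(A_T, A)

ATA_INV = invert(ATA)
CHECK = matmul(ATA, ATA_INV)  # should be close to the identity matrix

for row in ATA_INV:
    print([round(v, 4) for v in row])
```

Multiplying the result back against $A'A$ and comparing with the identity matrix is a cheap sanity check on any computed inverse.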
Then the next step is to calculate

$$
A'c =
\begin{bmatrix} \sum_j A_{j0}c_j \\ \sum_j A_{j1}c_j \\ \sum_j A_{j2}c_j \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix}
\cdot
\begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix}
=
\begin{bmatrix} 1(2) + 1(4) + 1(7) \\ 0.75(2) + 0.51(4) + 0.32(7) \\ 0.28(2) + 0.485(4) + 0.78(7) \end{bmatrix}
=
\begin{bmatrix} 13 \\ 5.78 \\ 7.96 \end{bmatrix}
\tag{5-21}
$$
To solve for the regression coefficients ($\beta_i$), we are required to calculate $(A'A)^{-1}A'C$ as follows (see equation 5-13):

$$
\beta = (A'A)^{-1}A'C =
\begin{bmatrix}
348.0747 & -359.3786 & -307.7061 \\
-359.3786 & 373.6609 & 315.6969 \\
-307.7061 & 315.6969 & 274.639
\end{bmatrix}
\cdot
\begin{bmatrix} 13.0 \\ 5.78 \\ 7.96 \end{bmatrix}
=
\begin{bmatrix}
348.0747(13) + (-359.3786)(5.78) + (-307.7061)(7.96) \\
(-359.3786)(13) + 373.6609(5.78) + 315.6969(7.96) \\
(-307.7061)(13) + 315.6969(5.78) + 274.639(7.96)
\end{bmatrix}
=
\begin{bmatrix} -1.577 \\ 0.786 \\ 10.675 \end{bmatrix}
=
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
\tag{5-22}
$$

And, checking our work, we arrive at Table 5-4.

Table 5-4 Results after substituting into the original equations and calculating the differences between predicted and actual results (using MATLAB calculations)

Sample number   $\beta_0 + \beta_1(A_{\lambda 1}) + \beta_2(A_{\lambda 2})$ =   Predicted − Actual = Residual
1               −1.577 + 0.786(0.75) + 10.675(0.28) =                            2.002 − 2.0 = 0.002
2               −1.577 + 0.786(0.51) + 10.675(0.485) =                           4.001 − 4.0 = 0.001
3               −1.577 + 0.786(0.32) + 10.675(0.78) =                            7.001 − 7.0 = 0.001

Now, if we took our original set of data, as expressed in equations 5-5a–5-5c, and added one more relationship to them, we come up with the following situation:

$$2.0 = b_0 + b_1(0.75) + b_2(0.28) \tag{5-23a'}$$
$$4.0 = b_0 + b_1(0.51) + b_2(0.485) \tag{5-23b'}$$
$$7.0 = b_0 + b_1(0.32) + b_2(0.78) \tag{5-23c'}$$
$$8.0 = b_0 + b_1(0.40) + b_2(0.79) \tag{5-23d'}$$

Now we have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the $b_0$ term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                                   b0           b1          b2
Eliminating equation 5-23a′:      −9.47843     10.39215    16.86274
Eliminating equation 5-23b′:      −10.86455    10.15801    18.73589
Eliminating equation 5-23c′:      −5.20039     4.1461      14.6100
Eliminating equation 5-23d′:      −1.5777      0.78492     10.675
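The leave-one-out solutions for the four-relationship data set can be generated in a loop rather than by hand. The following is a minimal sketch, assuming a small Gaussian elimination helper (`solve`, an illustrative name) in place of manual row operations; each pass drops one of the four relationships and solves the remaining three exactly.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# (concentration, absorbance 1, absorbance 2) for the four relationships, in order.
SAMPLES = [(2.0, 0.75, 0.28), (4.0, 0.51, 0.485), (7.0, 0.32, 0.78), (8.0, 0.40, 0.79)]


def leave_one_out(samples):
    """Return [b0, b1, b2] for each choice of omitted relationship."""
    results = []
    for skip in range(len(samples)):
        kept = [s for i, s in enumerate(samples) if i != skip]
        design = [[1.0, a1, a2] for _, a1, a2 in kept]
        conc = [c for c, _, _ in kept]
        results.append(solve(design, conc))
    return results


RESULTS = leave_one_out(SAMPLES)
for number, coeffs in enumerate(RESULTS, start=1):
    print("omitting relationship", number, [round(v, 5) for v in coeffs])
```

The spread between the four coefficient sets is a direct measure of how much the answer depends on which relationship is discarded.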
The last entry in this table, the results obtained from eliminating equation 5-23d′, represents of course the results obtained from the original set of three equations, since eliminating equation 5-23d′ from the set leaves us with exactly that same set. However, even though there does not seem to be much difference between the various equations represented by equations 5-23a′–5-23d′, it is clear that the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data, in order to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice – the matrix inversion can be performed using the row operations as we described previously):

Regression results:
b0 = −6.85120
b1 = 6.15659
b2 = 15.50951
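The regression over all four relationships can be reproduced by forming the normal equations and solving them. The sketch below is an illustration rather than the authors' calculation: it solves $A'A\,b = A'C$ by Gaussian elimination instead of inverting $A'A$, and the helper names are assumptions. With these data it returns b1 ≈ 6.157 and b2 ≈ 15.510, with an intercept near −6.85.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# (concentration, absorbance 1, absorbance 2) for all four relationships.
SAMPLES = [(2.0, 0.75, 0.28), (4.0, 0.51, 0.485), (7.0, 0.32, 0.78), (8.0, 0.40, 0.79)]


def least_squares(samples):
    """Fit conc = b0 + b1*A1 + b2*A2 by solving the normal equations A'A b = A'C."""
    rows = [[1.0, a1, a2] for _, a1, a2 in samples]
    conc = [c for c, _, _ in samples]
    ata = [[sum(r[i] * r[k] for r in rows) for k in range(3)] for i in range(3)]
    atc = [sum(r[i] * c for r, c in zip(rows, conc)) for i in range(3)]
    return solve(ata, atc)


b0, b1, b2 = least_squares(SAMPLES)
residuals = [c - (b0 + b1 * a1 + b2 * a2) for c, a1, a2 in SAMPLES]
print(round(b0, 5), round(b1, 5), round(b2, 5))
print([round(r, 4) for r in residuals])
```

When a constant term is included, the residuals of a least squares fit sum to zero, which gives a quick check on the coefficients.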
Note, by the way, that if you thought that the regression solution would simply be the average of all the other solutions, you were wrong. By now some of you must be thinking that there must be an easier way to solve systems of equations than wrestling with manual row operations. Well, of course there are better ways, which is why we will refresh your memory on the concept of determinants in the next chapter. After we have introduced determinants we will conclude our introductory coverage of matrix algebra and MLR with some final remarks.
REFERENCES
1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 45–56; see also Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
2. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991), pp. 21–24.
3. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 271–281; see also Mark, H. and Workman, J., Spectroscopy 7(3), 20–23 (1992).
6 Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants
In the previous chapter [1] we promised a discussion of an easier way to solve equation systems – the method of determinants [2]. To begin, given a 2 × 2 matrix $[A]$ as

$$A = \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \tag{6-1}$$

the determinant of $A$ is designated by

$$|A| = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} \tag{6-2}$$

Note that the brackets [ ] used to denote matrices are converted to vertical lines to denote a determinant. To continue, then the determinant of $A$ is calculated this way:

$$A_{det} = a_1 b_2 - a_2 b_1 \tag{6-3}$$

The determinant is found by cross-multiplying the diagonal elements in a matrix and subtracting one diagonal product from the other, such that

$$A_{det} = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1 \tag{6-4}$$

A numerical example is given as follows: Given $A$, find its determinant:

$$
\text{If } A = \begin{bmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{bmatrix}
\text{ then } A_{det} = \begin{vmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{vmatrix}
= 0.75 \times 0.485 - 0.28 \times 0.51 = 0.364 - 0.143 = 0.221
\tag{6-5}
$$
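The cross-multiplication rule for a 2 × 2 determinant is a one-line function. A minimal sketch follows; the function name `det2` is an assumption made for illustration.

```python
def det2(matrix):
    """Determinant of a 2 x 2 matrix [[a1, b1], [a2, b2]] as a1*b2 - a2*b1."""
    (a1, b1), (a2, b2) = matrix
    return a1 * b2 - a2 * b1


A = [[0.75, 0.28], [0.51, 0.485]]
print(det2(A))
```

At full precision the two diagonal products are 0.36375 and 0.1428, so the determinant is 0.22095 before rounding to three decimals.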
To use determinants to solve a system of linear equations, we look at a simple application given two equations and two unknowns. For the equation system

$$C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-6a}$$
$$C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-6b}$$

we denote $\beta_1$ and $\beta_2$ as unknown regression coefficients. By algebraic manipulation, we can eliminate the $\beta_2$ term from the equation system by multiplying the first equation by $A_{k22}$ and the second equation by $A_{k12}$. By subtracting the two equations, we arrive at equations 6-7a through 6-7d:

$$A_{k22}C_1 = A_{k22}\beta_1 A_{k11} + A_{k22}\beta_2 A_{k12} \tag{6-7a}$$
$$(-)\; A_{k12}C_2 = A_{k12}\beta_1 A_{k21} + A_{k12}\beta_2 A_{k22} \tag{6-7b}$$
$$A_{k22}C_1 - A_{k12}C_2 = A_{k22}\beta_1 A_{k11} - A_{k12}\beta_1 A_{k21} \tag{6-7c}$$

and

$$A_{k22}C_1 - A_{k12}C_2 = (A_{k22}A_{k11} - A_{k12}A_{k21})\,\beta_1 \tag{6-7d}$$

If the $(A_{k22}A_{k11} - A_{k12}A_{k21})$ term is nonzero, then we can divide this term into the above equation (6-7d) to arrive at

$$\beta_1 = \frac{A_{k22}C_1 - A_{k12}C_2}{A_{k22}A_{k11} - A_{k12}A_{k21}} \tag{6-8}$$

Note the denominator can be written as the determinant

$$\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix} \tag{6-9}$$

referred to as the determinant of coefficients. We can also write the numerator as the determinant:

$$\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix} \tag{6-10}$$

and so,

$$\beta_1 = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-11}$$

We can also solve for $\beta_2$ by algebraic manipulation of the equation system. Elimination of the $\beta_1$ term is accomplished by multiplying the first equation by $A_{k21}$ and the second equation by $A_{k11}$ and subtracting the results, dividing by the common term, and lastly, by converting both the numerator and the denominator to determinants, finally arriving at equation 6-12.

$$\beta_2 = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-12}$$
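To summarize what is referred to as Cramer’s rule, we can use the following general expressions given a system of two equations (6-13a and 6-13b) in two unknowns such that

$$C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-13a}$$
$$C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-13b}$$

We can generalize a solution to this system of equations by using the following determinant notation:

$$
D = \begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}, \qquad
D_{\beta_1} = \begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}, \qquad
D_{\beta_2} = \begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}
$$

And so, if $D \neq 0$, then we can solve for $\beta_1$ and $\beta_2$ using the relationships

$$\beta_1 = \frac{D_{\beta_1}}{D} = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-14}$$

and

$$\beta_2 = \frac{D_{\beta_2}}{D} = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-15}$$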
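Cramer's rule for two equations in two unknowns translates directly into code. The sketch below is an illustration with invented function and argument names; as a usage example it takes the first two calibration samples of the previous chapter (concentrations 2.0 and 4.0) with no constant term, which is the two-equation system solved there by row operations.

```python
def cramer_2x2(a11, a12, a21, a22, c1, c2):
    """Solve c1 = b1*a11 + b2*a12 and c2 = b1*a21 + b2*a22 by Cramer's rule."""
    d = a11 * a22 - a12 * a21          # determinant of coefficients
    if d == 0:
        raise ValueError("determinant of coefficients is zero; no unique solution")
    d_b1 = c1 * a22 - a12 * c2         # first column replaced by the constants
    d_b2 = a11 * c2 - c1 * a21         # second column replaced by the constants
    return d_b1 / d, d_b2 / d


beta1, beta2 = cramer_2x2(0.75, 0.28, 0.51, 0.485, 2.0, 4.0)
print(round(beta1, 4), round(beta2, 4))
```

The explicit check on the determinant of coefficients reflects the condition stated above: the ratios are only defined when that determinant is nonzero.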
There are, of course, additional rules for solving larger equation systems. We will address this subject again in later chapters when we discuss multivariate calibration in greater depth.
REFERENCES
1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 445–451.
7 Matrix Algebra and Multiple Linear Regression: Part 4 – Concluding Remarks
Our discussions on MLR in previous chapters are all based on one important assumption: that the equation describing the relationship between the data does include a constant term. If Beer’s law is strictly followed, however, when the concentration of all absorbing constituents is zero, then the absorbance (at all wavelengths, no less) is also zero, that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least square expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the expression for the prediction equation, corresponding to equation 7-1 as:

$$\text{Conc.} = b_1 A_1 + b_2 A_2 \tag{7-1'}$$

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least square expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. We will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included – if we had more data (even only one more relationship) they would be overdetermined in both cases.

If we take our original set of data, as expressed in equations 7-5a–7-5c [1], and add one more relationship to them, we come up with the following situation:

$$2.0 = b_0 + b_1(0.75) + b_2(0.28) \tag{7-2a'}$$
$$4.0 = b_0 + b_1(0.51) + b_2(0.485) \tag{7-2b'}$$
$$7.0 = b_0 + b_1(0.32) + b_2(0.78) \tag{7-2c'}$$
$$8.0 = b_0 + b_1(0.40) + b_2(0.79) \tag{7-2d'}$$
We now have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the $b_0$ term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                                  b0           b1          b2
Eliminating equation 7-2a′:      −9.47843     10.39215    16.86274
Eliminating equation 7-2b′:      −10.86455    10.15801    18.73589
Eliminating equation 7-2c′:      −5.20039     4.1461      14.6100
Eliminating equation 7-2d′:      −1.5777      0.78492     10.675
The last entry in this table, the results obtained from eliminating equation 7-2d′, of course represents the results obtained from the original set of three equations, since eliminating equation 7-2d′ from the set leaves us with exactly that same original set. However, even though there does not seem to be much difference between the various equations represented by equations 7-2a′–7-2d′, clearly the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data, to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice – the matrix inversion can be performed using the row operations as we described previously):

Regression results:
b0 = −6.85120
b1 = 6.15659
b2 = 15.50951
Note, by the way, if you thought that the regression solution would simply be the average of all the other solutions, you were incorrect. With this chapter we will suspend our coverage of elementary matrix operations until a later chapter.
A WORD OF CAUTION

We have noticed recently a growing tendency for the chemical/spectroscopic community to draw the inference that the term “chemometrics” is virtually equivalent to “quantitative analysis algorithms”. This misconception seems to be due to the overwhelming concentration of interest in that aspect of the application of chemometric techniques. This perceived equivalency is, of course, incorrect and nonexistent in reality. The purview of chemometrics is much wider than that single application area, and encompasses a wide variety of techniques, including algorithms not only for quantitative and qualitative chemical analysis, but also for methods for analyzing, categorizing and generally dealing with data in a variety of ways (just look at the topic list included in the Analytical Chemistry reviews issue when Chemometrics is included).

We ourselves have to plead guilty to some extent to promoting this misconception. While discussing and explaining the underlying concepts, we have also inherently spent much time and attention on that single topic, in much the same way that many other authors do.
However, we do recognize and wish to caution our readers to recognize the fact that Chemometrics does in fact include this variety of methodologies alluded to above. We do, in fact, hope to eventually discuss these other concepts. Two items prevent us from just jumping in chin first, however. The first item is that there are, in fact, useful and important things that need to be said about the application of the quantitative analysis algorithms. The second item is the fact that while we are knowledgeable concerning some of the other areas of chemometric interest, we are not and could not possibly be experts in all such areas. We have discussed this between ourselves, and have decided that the only reasonable way to deal with this limitation is to entertain submissions from our readership. Anyone who has particular expertise in a topic that falls under the wider definition of “chemometrics” is welcome to submit one (or more) chapters dealing with that topic. We only request that you try to keep your discussions both simple and complete, using, as we say, only words of one syllable or less.
REFERENCE 1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).
8 Experimental Designs: Part 1
The next several chapters will deal with the philosophy of experimental designs. Experimental design is at the very heart of the scientific method; without proper design, it is well-nigh impossible to glean high-quality information from the experimental data collected. No amount of sophisticated processing or chemometrics can create information not present within the data. Every scientist has designed experiments. So what is there left for us to say about that topic that chemometrics/statistics can shed some light on?

Well, quite a bit actually, since not all experiments are designed equally, but some are definitely more equal than others (to steal a paraphrase). Another way to say it is that every experiment is a designed experiment, but some designs are better than others. In point of fact, the sciences of statistics and chemometrics each have their own approach to how experiments should be designed, each with a view toward making experimental procedures “better” in some sense. There is a gradation between the two approaches; nevertheless, there is also somewhat of a distinction between what might be thought of as classical “statistical experimental design” and the more currently fashionable experimental designs considered from a chemometric point of view. These differences in approach reflect differences in the nature of the information to be obtained from each.

Experimental designs, and in particular “statistical” experimental designs, are used in order to achieve one or more of the following goals:

1) Increase efficiency of resource use, that is, obtain the desired information using the fewest possible experiments (this is usually what is thought of when “statistical experimental designs” are considered). This aspect of experimentation is particularly important when the experiment is large to begin with, or if the experiment uses resources that are rare or expensive, or if the experiment is destructive, so that materials (especially expensive ones) are used up.
2) Determine which variables or phenomena (“factors” in statistical/chemometric parlance) in an experiment are the “important” ones. This has two aspects. The first is whether an effect is large enough that we can be sure it is real, and not due simply to noise (or error) alone (i.e., “statistically significant”). We have treated this question to some extent in our previous chapters and in the book derived from them (both titled “Statistics in Spectroscopy”). The second aspect is, if the effect of a factor is indeed real, is it of sufficiently large magnitude to be of practical importance? While the answer to this question is important to understanding the outcome of the experiment, it is not a statistical question, and we will give it fairly short shrift.
3) Accommodate noise and/or other random error.
4) Allow estimates to be made of the magnitude of the noise and/or other random error, if for no other reason than to have something to compare our results against, so as to tell if they are statistically significant.
5) Allow estimates to be made of the sensitivity to variations in the several factors. This can help decide whether any of the variations seen are of practical importance. A good design also allows these estimates of sensitivity to be made against an error background that is reduced compared to the actual error. This is accomplished by causing the effects of the factors to be effectively “averaged”, thus reducing the effect of error by the square root of the number of items being averaged.
6) Optimize some characteristic of the experimental system.

To achieve these goals, certain requirements are imposed on the design and/or the data to be collected. The maximum amount of information can be obtained when:

1) The standard requirements for the behavior of the errors are met, that is, the errors associated with the various measurements are random, independent, normally (i.e., Gaussian) distributed, and are a random sample from a (hypothetical, perhaps) population of similar errors that have a mean of zero and a variance equal to some finite value of sigma-squared.
2) The design is balanced. This requirement is critical for certain types of designs and unimportant in others. Balance, in the sense used here, means that each value of a given experimental variable (factor) occurs in combination with all of the values of every other factor. For example, common variables in chemical experimentation are temperature and pressure. For a balanced design, experiments should be carried out where the material is held at low temperature, and at both high and low pressure. Additionally, experiments should be carried out where the material is held at high temperature, and at both high and low pressure. If a third variable, such as concentration of a reactant, is to be studied, then high and low pressure and high and low temperature should coexist with both the high and the low concentrations.

The foregoing would seem to imply that a balanced experiment would require all possible combinations of conditions. While all-possible-combinations is certainly one way to achieve this balance, the advantage of “statistical” designs comes from the fact that clever ways have been devised to achieve balance while needing far fewer experiments than the all-possible-combinations approach would require.

As an illustration of this, let us consider the three aforementioned variables: temperature, pressure, and concentration of reactant. An all-possible-combinations design would require eight experiments, with the set of conditions for each experiment shown in Table 8-1 (where H and L represent the high and the low temperatures, pressures, etc.). However, to achieve balance, it is not necessary to carry out eight experiments; balance can be achieved with only four experiments with the conditions suitably set (Table 8-2). Check it out: high reactant concentration occurs in combination with each (high and low) temperature, and with each pressure; similarly for low reactant concentration.
Table 8-1 An all-possible-combinations design of three factors, needing eight experiments and sets of conditions

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             L          H
3                   L             H          L
4                   L             H          H
5                   H             L          L
6                   H             L          H
7                   H             H          L
8                   H             H          H
Table 8-2 Balanced design for three factors, needing only four experiments

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             H          H
3                   H             L          H
4                   H             H          L
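The balance property claimed for Table 8-2 can be checked mechanically. The following is a minimal sketch, not part of the original text: the helper name `is_balanced` and the one-at-a-time comparison design are assumptions introduced purely for illustration. It treats a design as balanced when, for every pair of factors, each level of one factor occurs together with each level of the other.

```python
from itertools import combinations, product

# Table 8-1: the all-possible-combinations design, generated rather than typed in.
# Each run is (temperature, pressure, concentration).
full_design = list(product("LH", repeat=3))

# Table 8-2: the four-experiment design.
half_design = [
    ("L", "L", "L"),
    ("L", "H", "H"),
    ("H", "L", "H"),
    ("H", "H", "L"),
]

# Hypothetical comparison: change one factor at a time from an all-low baseline.
one_at_a_time = [
    ("L", "L", "L"),
    ("H", "L", "L"),
    ("L", "H", "L"),
    ("L", "L", "H"),
]


def is_balanced(design):
    """True if every pair of factors shows all four level combinations."""
    n_factors = len(design[0])
    required = set(product("LH", repeat=2))
    for a, b in combinations(range(n_factors), 2):
        seen = {(run[a], run[b]) for run in design}
        if seen != required:
            return False
    return True


print(len(full_design), is_balanced(full_design))
print(len(half_design), is_balanced(half_design))
print(len(one_at_a_time), is_balanced(one_at_a_time))
```

Under this definition both the eight-run and the four-run designs pass, while the one-at-a-time design, which also uses four runs, does not, because no run combines two high levels.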
You will find the same situation for the other variables. This is not to say that there are no benefits to the larger experimental design, but we are making the point that balance can be achieved with the smaller one, and for those designs where balance is an important consideration, much work (and resources, and MONEY) can be saved.

Balance is not always achievable in practice due to physical constraints on the measurements that can be made. Certain designs do not require balance, and in fact enforcing balance would negate some of the benefits of the design. In particular, there are some designs where future experiments to be performed are determined by the results of the past experiments. To enforce balance here would require extra, unnecessary experimentation that did not contribute to the main goal of the whole venture.

The various designs that have been generated can be classified into one of several categories. One way to classify experimental designs is as follows:

1) Classical designs
2) Screening designs
3) Analytical designs
4) Optimization designs.

In one sense, it is possible to think of the categories involved as “building blocks” for designs, which can then be combined in various ways, depending upon the information that you want to obtain, which in turn determines the nature of the data to collect. These
general categories, by the way, are not mutually exclusive. It is even possible to consider some types of designs as extensions of others or, vice versa, as subsets or special cases of other types of designs. Some of these main categories are

A) Factorial designs
B) Fractional factorial designs
C) Nested designs
D) Blocked designs
E) Response surface designs.
The key to all “statistical experimental” designs is planning. A properly planned experiment can achieve all the goals set forth above, and in fewer runs than you might expect (that’s where achieving the goal of efficiency comes in). However, there are certain requirements that must be met:

The experiment must be executed according to the plan! All the planning in the world is for naught if carrying out the experiment results in blunders (e.g., even something as crude as dropping a key sample on the floor – and look at how often that has been done!). The statistical literature contains examples (unfortunately) where large experiments that cost millions of dollars to perform were completely ruined by carelessness on the part of the personnel actually carrying them out.

As noted above, the variations in the data representing the error must meet the usual conditions for statistical validity: they must be random and statistically independent, and it is highly desirable that they be homoscedastic and Normally distributed. The data should be a representative sampling of the populations that the experiment is supposed to explore.

Blunders must be eliminated, and all specified data must be collected. The efficiency of these experimental designs has another side effect: any missing or defective data have a disproportionate effect relative to the amount of information that can be extracted from the final data set. When simpler experimental designs are used, where each piece of data is collected for the sole purpose of determining the effect of one variable, loss of that piece of data results in the loss of only that one result. When the more efficient “statistical” experimental designs are used, each piece of data contributes to more than one of the final results; thus each one is used the equivalent of many times, and any missing piece of data causes the loss of all the results that are dependent upon it.

These types of experimental designs also have some limitations. The first is the exaggeration of the effect of missing or defective data on the results, as mentioned above. The second is the fact that until the entire plan is carried out, little or no information can be obtained. There are generally few, if any, “intermediate results”; only after all the data are available can any results at all be calculated, and then all of them are calculated at once. This phenomenon is related to the first caveat: until each piece of data is collected, it is “missing” from the experiment, and therefore the results that depend upon it cannot be calculated.

The simplest possible experimental design would almost not be recognized as an “experimental design” at all, but does serve as a prototype situation (as we like to use for pedagogical purposes). The situation arises when there is one variable (factor) to investigate, and the question is, does this factor have an effect on the property studied? We have introduced this situation earlier, in our discussion of hypothesis testing, as in
our previous Statistics in Spectroscopy book [1–3]. We will discuss how we treated this situation previously, then change our point of view to see how we would do it from the point of view of an “experimental design”.
REFERENCES
1. H. Mark and J. Workman, “Statistics in Spectroscopy; Elementary Matrix Algebra and Multiple Linear Regression: Conclusion”, Spectroscopy 9(5), 22–23 (June, 1994).
2. H. Mark and J. Workman, “Statistics in Spectroscopy”, Spectroscopy 4(7), 53–54 (1989).
3. H. Mark and J. Workman, Statistics in Spectroscopy (Academic Press, Boston, 1991), chapter 18.
9 Experimental Designs: Part 2
As we mentioned in the last chapter, “experimental design” in scientific investigations often takes a form in which some of the experimental objects have been exposed to one level of the variable, while others have not been so exposed. This situation is often called the “experimental subject” versus “control subject” type of experiment. In the face of experimental error, or other sources of variability in the readings, both the “experimental” and the “control” readings would be taken multiple times. That provides the information about the “natural” variability of the system against which the difference between the two can be compared. Then, a t-test is used to see if the difference between the “experimental” and the “control” subjects is greater than can be accounted for by the inherent variability of the system. If it is, we conclude that the difference is “statistically significant”, and that there is a real effect due to the “treatment” applied to the experimental subject. Of course there are variations on this theme: the difference between the “experimental” and the “control” subjects can be due to different amounts of something applied to the two types of object, for example.

That is how we have treated this type of experiment previously. We will now consider a somewhat different way to formulate the same experiment, the purpose being to set up the experimental design, and the analysis of the data, in such a way that it can be generalized to more complicated types of experiments. In order to do this, we recognize that the value of any individual reading, whether from the experimental subject or the control subject, can be expressed as the sum of three quantities. These three quantities arise from a careful consideration of the nature of the data.
Given that a particular measurement belongs either to the experimental group or to the control group, the value of the data collected can be expressed as the sum of these three quantities:

1) The grand mean of all the data (experimental + control)
2) The difference between the mean of the data group (experimental or control) and the grand mean of the data
3) The difference between the individual reading and the mean reading of its pertinent group.

This can then be expressed mathematically as:

$$X_{ij} = \bar{\bar{X}} + \left(\bar{X}_i - \bar{\bar{X}}\right) + \left(X_{ij} - \bar{X}_i\right) \tag{9-1}$$
where $X_{ij}$ represents each individual datum, $\bar{X}_i$ represents the mean of the particular data group (experimental or control) that the individual datum belongs to, and $\bar{\bar{X}}$ represents the grand mean of all the data (from both groups).

By rearranging equation 9-1, we can also express it as follows, wherein the fact that it is a mathematical identity becomes apparent:

$$X_{ij} = \bar{\bar{X}} - \bar{\bar{X}} + \bar{X}_i - \bar{X}_i + X_{ij} \tag{9-2}$$

We have previously shown that through the operation called “partitioning the sums of squares”, the following equality holds [1]:

$$\sum_i X_i^2 = \sum_i \bar{X}^2 + \sum_i \left(X_i - \bar{X}\right)^2 \tag{9-3}$$

Note that what we call the grand mean here is simply called the mean in the prior discussion. That is because in the prior discussion there was no further splitting of the data into subgroups. In the current discussion we have indeed split the data into subgroups, and we note that what was previously the total difference from the mean now consists of two contributions: the difference of each subgroup’s mean from the grand mean, and the difference of each datum’s value from its subgroup’s mean. We might expect, and it turns out to be so (again we leave the proof as an “exercise for the reader”), that the sum of squares of the differences of each datum’s value from the grand mean can also be partitioned; thus:

$$\sum_i \sum_j X_{ij}^2 = \sum_i \sum_j \bar{\bar{X}}^2 + \sum_i \sum_j \left(\bar{X}_i - \bar{\bar{X}}\right)^2 + \sum_i \sum_j \left(X_{ij} - \bar{X}_i\right)^2 \tag{9-4}$$

We had previously discussed the situation (from a slightly different point of view) where more than two subgroups of data existed. In that case we noted that we could generate two estimates of sigma, the within-group standard deviation. One estimate is calculated from the pooled within-group standard deviation. The other is calculated from the standard deviation between the means of the various subgroups. This quantity, you recall, is equal to the within-group standard deviation divided by the square root of n, the number of data used in the calculation of each subgroup’s mean.
However, the second calculation is correct only if the differences between the means are due to the random variations of the data itself, and there are no external influences. If such influences exist, then the second calculation (from the between-group means) will estimate a larger value for sigma than the first calculation (the pooled within-group standard deviations). This was then used as the basis of a statistical hypothesis test: if the value of sigma calculated from the between-groups means is statistically significantly larger than the value of sigma calculated from within the groups, then we have evidence to conclude that there are indeed external influences acting upon the data, and we used an F test to determine whether there was more scatter between the means than could be accounted for by the random variations within the subgroups.

In the case at hand, with only two subgroups, we can proceed the same way. The difference is that now, with only two subgroups, there is only one degree of freedom
available for the difference between the subgroups. No matter; an F test with one degree of freedom is possible. Thus, to analyze the data from the model of equation 9-4, we calculate the mean square between the subgroups and the mean square within the subgroups, and perform an F test (rather than a t-test as before) between these two mean squares. We would recommend doing it formally, with an ANOVA table, but this is the basic calculation. The conclusions drawn will be identical to those drawn by use of the t-test. Check it out: the tabled values of F for one and n degrees of freedom are equal to the squares of the values of t for n degrees of freedom.

We might also note here, almost parenthetically, that if the hypothesis test gives a statistically significant result, it would be valid to calculate the sensitivity of the result to the difference between the two groups (i.e., divide the difference in the means of the two groups by the difference in the values of the variable that correspond to the “experimental” and “control” groups).

As an example of using an experimental design together with its associated analysis of variance to obtain a meaningful result, we have here an example based on some real data that we have collected. The problem was interesting: to troubleshoot a method of (wet) chemical analysis. A large quantity of sample was available, and had been well-ground and mixed. Suitable data was collected to permit performing a straightforward one-way analysis of variance.

To start with, 5 g of sample was dissolved in 100 ml of water, and 20 repeat analyses were performed. The resulting values are shown in Table 9-1. The entry in the third row, second column was noted to have been measured under abnormal conditions. Since an assignable cause for this discrepant value was available, the reading was discarded. The statistics for the remaining data were Mean = 5.01, SD = 0.327. This value for the standard deviation was accepted as the best available approximation to the population value for σ.

The next step was to take several different aliquots from a large sample (a different sample than used previously) and collect multiple readings from each of them. Six aliquots were placed in six flasks, one per flask, and six repeat measurements were made on each flask. Each aliquot consisted of 10 g of test sample/100 ml water. The results are shown in Table 9-2. The value for the pooled within-flask standard deviation, while somewhat higher than for the twenty repeat readings, is not so high as to be worrisome. Strictly speaking, we should have done an F test between the variances from the two sets of results to see if there is any extra variance there, but we will ignore that question for now, because the important point here is the highly statistically significant value of the “between” flasks standard deviation, indicating that some extra source of variation was superimposed on the analytical value.
Table 9-1 Results from 20 repeat readings of 5 g of sample dissolved in 100 ml water

5.12   5.60   5.18   4.71
5.28   5.14   4.74   4.72
4.97   3.85   5.39   4.94
5.20   4.69   4.49   4.91
4.50   5.12   5.61   4.99
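The summary statistics quoted in the text for Table 9-1 can be reproduced directly from the tabulated readings. This is a minimal sketch using only the Python standard library; the variable names are illustrative assumptions, and the sample standard deviation (n − 1 in the denominator) is assumed to be the one intended.

```python
from statistics import mean, stdev

# Table 9-1, entered row by row (5 rows x 4 columns)
readings = [
    [5.12, 5.60, 5.18, 4.71],
    [5.28, 5.14, 4.74, 4.72],
    [4.97, 3.85, 5.39, 4.94],
    [5.20, 4.69, 4.49, 4.91],
    [4.50, 5.12, 5.61, 4.99],
]

# Discard the reading in the third row, second column (assignable cause);
# indices are zero-based, so that is position (2, 1).
kept = [
    x
    for r, row in enumerate(readings)
    for c, x in enumerate(row)
    if (r, c) != (2, 1)
]

kept_mean = mean(kept)
kept_sd = stdev(kept)  # sample standard deviation, n - 1 denominator

print(len(kept), round(kept_mean, 3), round(kept_sd, 3))
```

With the discrepant 3.85 removed, nineteen readings remain, and the computed mean and standard deviation agree with the quoted values to within rounding.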
Table 9-2 Results of repeat readings of six aliquots in six flasks (from 10-g samples)

Flask #      1       2       3       4       5       6
           7.25   10.07    5.96    7.10    5.74    4.74
           7.68    9.02    6.66    6.10    6.90    6.75
           7.76    9.51    5.87    6.27    6.29    6.71
           8.10   10.64    6.95    5.99    6.37    6.51
           7.50   10.27    6.54    6.32    5.99    5.95
           7.58    9.64    6.29    5.54    6.58    6.50
Means:     7.64    9.85    6.37    6.22    6.31    6.19
SDs:       0.28    0.58    0.42    0.51    0.41    0.77

Pooled SD = 0.52, “Between” SD = 1.46
Expected “Between” SD = 0.212
F = 47
F (crit) = F (0.95, 5, 30) = 2.53
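The quantities listed beneath Table 9-2 follow from the one-way analysis of variance described earlier. The sketch below is an illustration rather than the authors’ own calculation; the variable names are assumptions, equal group sizes are assumed, and the critical value 2.53 is simply copied from the table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Table 9-2: six repeat readings on each of six flasks
flasks = [
    [7.25, 7.68, 7.76, 8.10, 7.50, 7.58],
    [10.07, 9.02, 9.51, 10.64, 10.27, 9.64],
    [5.96, 6.66, 5.87, 6.95, 6.54, 6.29],
    [7.10, 6.10, 6.27, 5.99, 6.32, 5.54],
    [5.74, 6.90, 6.29, 6.37, 5.99, 6.58],
    [4.74, 6.75, 6.71, 6.51, 5.95, 6.50],
]
n = len(flasks[0])  # readings per flask (all groups the same size)

# Pooled within-flask SD: for equal group sizes, the square root of the
# average of the individual flask variances.
pooled_sd = sqrt(mean([variance(f) for f in flasks]))

# Observed SD of the six flask means.
between_sd = stdev([mean(f) for f in flasks])

# SD the flask means would show if only within-flask noise were present.
expected_between_sd = pooled_sd / sqrt(n)

# Ratio of the two variance estimates of sigma-squared.
F = (between_sd / expected_between_sd) ** 2
F_CRIT = 2.53  # F(0.95, 5, 30), as tabulated in Table 9-2

print(round(pooled_sd, 2), round(between_sd, 2), round(expected_between_sd, 3))
print(round(F, 1), F > F_CRIT)
```

Squaring the ratio of the observed to the expected “between” standard deviation is equivalent to dividing the between-flasks mean square by the within-flasks mean square, which is the form an ANOVA table would show.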
Having found a statistically significant “between” flasks standard deviation, the next step was to formulate hypotheses as to the possible physical causes of this situation. The list we arrived at was the following:

• Inhomogeneous sample
• Drift between sets of readings
• Sampling error
• Something else.
The first physical cause considered was the possibility of an inhomogeneous sample. To eliminate this as a possibility, the sample was ground before aliquots were taken. The sample size was still 10 g of sample per 100 ml of water. In this case, however, time constraints permitted only three replicate readings per flask. The results are shown in Table 9-3. We note that there is still a much larger difference between the different flasks’ readings than can be accounted for by the within-flask repeatability. Therefore we press onward to consider another possible cause of the variation; in this case we consider the possibility of inhomogeneity of the sample at a scale not affected by grinding. For example, the sample might contain small specks of material that are too small to be ground further,

Table 9-3 Results of repeat readings of six aliquots in six flasks (from 10-g samples, ground)
           6.57    5.06    8.07    4.93    4.78    6.23
           6.27    6.27    7.82    5.64    5.50    7.37
           6.35    5.88    8.52    5.19    5.99    5.27
Means:     6.39    5.74    8.19    5.25    5.43    7.29
SDs:       0.16    0.61    0.35    0.36    0.61    1.01

Pooled SD = 0.58, “Between” SD = 1.14
Expected “Between” SD = 0.33
F = 11.3
F (crit) = F (0.95, 5, 12) = 3.10
Table 9-4 Results from using 10× larger (100-gram) samples
           8.29    8.61   10.04    8.86
           8.12    8.72   11.67    9.02
           8.72    8.42   11.38    9.29
           8.54    8.76   10.19    8.63
Means:     8.42    8.63   10.82    8.94
SDs:       0.26    0.15    0.82    0.26

Pooled SD = 0.46, “Between” SD = 1.10
Expected “Between” SD = 0.23
F = 23
F (crit) = F (0.95, 3, 12) = 3.49
but which are large enough to measurably affect the analysis. In this case, the expected distribution of the sampling variation of such particles would be the Poisson distribution [2]. In such a case, if we take a larger sample, we would expect the standard deviation to decrease as the square root of the sample size. Thus, if we take samples ten times larger than previously, the standard deviation of the “between” readings should become approximately one-third of the previous value. Therefore, for the next test, 100 g samples each were dissolved in 1 liter of water. The results are shown in Table 9-4. Note that the “between” standard deviation is almost identical to the previous value; we conclude that inhomogeneity of the sample is not the problem.

The possibility of drift between sets of readings was ruled out by virtue of the fact that many of the steps of the analytical procedure were done simultaneously on the several readings of the different aliquots. The possibility of drift between readings was ruled out by repeating the readings in different orders; the same values were obtained regardless of the order of reading.

This left “something else” as the possible cause of the variability. When we considered the nature of the test, which was sensitive to parts per million of organic materials, we realized that one possibility was contamination of the glassware by the soap used to clean it. We next cleaned all glassware with chromic acid cleaning solution, and reran the tests, with the result as shown in Table 9-5. Removal of the extraneous source of variability did indeed reduce the “between-flasks” variance to a level that is now explainable (in the statistical sense) by the underlying random variations attributable to the within-flask variability.

Table 9-5 Results after cleaning glassware with chromic acid
           4.65    5.98    5.19    4.97    4.62    3.93
           5.03    4.61    3.96    4.43    4.94    4.60
           4.38    4.49    4.92    4.79    3.37    5.95
Means:     4.68    5.16    4.69    4.73    4.31    4.84
SDs:       0.33    0.73    0.64    0.27    0.83    1.03

Pooled SD = 0.69, “Between” SD = 0.27
Expected “Between” SD = 0.39
F = 0.47
F (crit) = F (0.95, 5, 12) = 3.10
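The square-root argument used in comparing Tables 9-3 and 9-4 can be written out explicitly. This is a sketch under the assumption that the “between” scatter is dominated by Poisson sampling of discrete particles; the symbol $s_{\text{between}}$ and the mass arguments are introduced here for illustration only.

```latex
\frac{s_{\text{between}}(100\ \mathrm{g})}{s_{\text{between}}(10\ \mathrm{g})}
  \approx \sqrt{\frac{10\ \mathrm{g}}{100\ \mathrm{g}}}
  = \frac{1}{\sqrt{10}}
  \approx 0.32
```

On that assumption, the “between” standard deviation of 1.14 in Table 9-3 would be expected to fall to roughly $1.14 \times 0.32 \approx 0.36$ for the larger samples, whereas the observed value in Table 9-4 is 1.10, essentially unchanged.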
Table 9-6 Types of experimental designs

Number of levels   Number of factors: Single               Number of factors: Multiple
Two                Experimental versus control subjects    One-at-a-time designs
                                                           Factorial designs
                                                           Fractional factorial designs
                                                           Nested designs
                                                           Special designs
Multiple           Sensitivity testing                     Response surface designs
                   Simple regression                       Multiple regression
End of example

From the prototype experiment, we can generate many variations of the basic scheme. The two main ways that the model shown in equation 9-4 can be varied are to increase the number of factors and to increase the number of levels of each factor. A given factor must have at least two levels (even if one of the levels is an implied zero), and may have any number greater than two. Table 9-6 lists the types of designs that fall into each of these categories. The types of designs used by scientists in simple settings, not usually considered “statistical” designs, are the “experimental versus control” designs (discussed above), the one-at-a-time designs (where each factor is individually changed from its “control” value to its “experimental” value, then restored when the next factor is changed), and the simple regression (often used in calibration work when only one physical variable is affected; in chemistry, electrochemical and chromatographic applications come to mind).

The table is not exhaustive, although it does include a majority of the experimental designs that are used. One-at-a-time designs are the usual “non-statistical” type of experiments that are often carried out by scientists in all disciplines. Not included explicitly, however, are experimental designs that are generated from combinations of listed items. For example, a multi-factor experiment may have several levels of some of the factors but only two levels of other factors. Also, due to the nature of the physical factors involved, the values of some of the factors may not be under the experimenter’s control. Thus, some factors may be nested, while others may not be.
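As a numerical check on two claims made earlier in this chapter, the partition of equation 9-4 and the equivalence of the F test and the t-test when there are only two subgroups, the following sketch may be useful. The readings are hypothetical values invented for illustration, not data from the text, and the pooled-variance form of the two-sample t statistic is assumed.

```python
from math import sqrt
from statistics import mean

# Hypothetical readings, invented for illustration only
control = [4.8, 5.1, 5.0, 4.9, 5.2]
experimental = [5.6, 5.3, 5.7, 5.4, 5.5]
groups = [control, experimental]

all_data = [x for g in groups for x in g]
grand_mean = mean(all_data)

# Left-hand side and the three right-hand terms of equation 9-4
ss_total = sum(x ** 2 for x in all_data)
ss_grand = sum(grand_mean ** 2 for _ in all_data)
ss_between = sum((mean(g) - grand_mean) ** 2 for g in groups for _ in g)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# F test: mean square between (1 degree of freedom for two groups)
# divided by mean square within (N - 2 degrees of freedom)
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
ms_within = ss_within / df_within
F = (ss_between / df_between) / ms_within

# Pooled-variance two-sample t statistic on the same data
n1, n2 = len(control), len(experimental)
t = (mean(experimental) - mean(control)) / sqrt(ms_within * (1 / n1 + 1 / n2))

print(abs(ss_total - (ss_grand + ss_between + ss_within)) < 1e-9)
print(round(F, 6), round(t ** 2, 6))
```

The sketch shows that the calculated F statistic equals the square of the calculated t statistic; the statement in the text about the tabled critical values is the matching fact on the other side of the comparison, which is why the two tests lead to the same conclusion.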
REFERENCES
1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 80–81.
2. Mark, H. and Workman, J., Spectroscopy 5(3), 55–56 (1991).
10 Experimental Designs: Part 3
We continue in this chapter to deal specifically with experimental design issues. When we leave the realm of the simplest designs, we find that the experiments, and the analyses of the data therefrom, acquire characteristics that do not exist in the simpler designs and that go beyond obvious extensions of them. For example, consider a two-factor design with each factor at two levels. This is also a form of all-possible-combinations experiment. One item we note here is that there is more than one way to describe the form of an experiment, and we include a short digression here to explicate this multiplicity of ways of describing an experiment. In this particular case, we have two factors, each at two levels. We can describe it as a listing of values corresponding to each experiment (Table 10-1). Alternatively, we can describe it by the experiment number that corresponds to each combination of factor levels (Table 10-2).

Whichever way we choose to describe the design, it (and the others of this type) has some attractive features. We will illustrate these features with a numerical example. For our example, we will imagine an experiment where the scientist is interested in determining the influence of solvent and of catalyst on the yield of a chemical reaction. The questions to be answered are: does the solvent make a difference, and does the type of catalyst make a difference? The experiment is to consist of trying each of the four available catalysts with each of three solvents, and determining the yield. The experiment can be described by Table 10-3. In a more complicated case, where one of the factors was a physical variable such as temperature, which can be assigned meaningful physical values, and the sensitivity of the yield to temperature was of concern, we would then need to maintain (or control) the information regarding the actual temperatures.

For our first look at this experiment we will examine its behavior under two sets of conditions. The first scenario gives the results obtained under the following assumptions:

1) There is no influence of solvent
2) None of the catalysts have an effect
3) There are no random influences on the experiment.

The second scenario has similar conditions, but with one change:

1) There is no influence of solvent
2) One of the four catalysts has an effect
3) There are no random influences on the experiment.
Table 10-1 All-possible-combinations experiment organized as a list of values

Experiment number   Factor #1   Factor #2
1                   L           L
2                   L           H
3                   H           L
4                   H           H
Table 10-2 All-possible-combinations experiment organized as a table where the body of the table contains the experiment number corresponding to each set of experimental conditions

                 Factor #2
Factor #1        L      H
L                1      2
H                3      4
Table 10-3 Conditions for the experiment consisting of determining the yield of a chemical reaction with different solvents and catalysts

Catalyst number   Solvent #1   Solvent #2   Solvent #3
1                 1            2            3
2                 4            5            6
3                 7            8            9
4                 10           11           12
Conditions 1 and 2 together mean that in the first scenario all results from the experiment will be the same, while in the second scenario all results will be the same except those corresponding to the “effective” catalyst, which will differ. Condition 3 means that we do not need to use any statistical or chemometric considerations to help explain the results. However, for pedagogical purposes we will examine this experiment as though random error were present, in order to be able to compare the analyses we obtain in the presence and in the absence of random effects. The data from these two scenarios might look like that shown in Table 10-4.

For each scenario, the statistical analysis of this type of experimental design would be a two-way analysis of variance. This is predicated on the construction of the experiment, which includes some implicit assumptions. These assumptions are

1) The influence of the factors changing between the rows is independent of the influence of the factors changing between the columns.
Table 10-4 Hypothetical data under two different scenarios, for the experiment examining the effect of solvent and catalyst on yield; with no random variations affecting the data

                   First scenario            Second scenario
                   Solvent number            Solvent number
Catalyst number    1      2      3           1      2      3
1                  25     25     25          25     25     25
2                  25     25     25          25     25     25
3                  25     25     25          35     35     35
4                  25     25     25          25     25     25
2) The influence of the factors changing between the columns is independent of the influence of the factors changing between the rows.
3) Any error (in these first two scenarios assumed zero) is random, has a mean value of zero, and is Normally distributed.

If these assumptions hold, then each quantity in the data table can be expressed as the sum of the following four factors:
1) The grand mean of all the data
2) The influence of the value of the factor corresponding to each row
3) The influence of the value of the factor corresponding to each column
4) The variation superimposed by any random phenomena affecting the data.
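In symbols, the four-part breakdown listed above can be written as a single equation. The notation ($x_{ij}$, $\mu$, $r_i$, $c_j$, $e_{ij}$) is ours, introduced only for illustration, and does not appear in the original text:

$$x_{ij} = \mu + r_i + c_j + e_{ij}, \qquad i = 1,\dots,4 \ (\text{catalysts}), \quad j = 1,\dots,3 \ (\text{solvents})$$

Here $\mu$ is the grand mean, $r_i$ the influence of row $i$, $c_j$ the influence of column $j$, and $e_{ij}$ the random contribution. As a check against the second scenario of Table 10-4, the entry for catalyst 3 and solvent 1 is $35 = 27.5 + 7.5 + 0 + 0$.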
This being the case, the quantities computed for a two-way analysis of variance are the following:
1) The grand mean of all the data
2) The mean of each row, and the difference of each row mean from the grand mean (this estimates the influence of the values of the factor corresponding to the rows)
3) The mean of each column, and the difference of each column mean from the grand mean (this estimates the influence of the values of the factor corresponding to the columns)
4) Any difference between the actual data and the corresponding values calculated from the grand mean and the influences of the row and column factors (this estimates the error variability).

In Table 10-5, we present the standard representation of this breakdown of the data. There are two important points to note about the results in this table: first, the data, shown in the body of the table in Part A, is in fact equal to the sum of the following quantities:
1) the grand mean (shown in Part A)
2) + row differences from the grand mean (shown in Part B)
Table 10-5 Part A – ANOVA for the errorless data from Table 10-4

First scenario
                  Solvent number
Catalyst number   1      2      3      Row means
1                 25     25     25     25
2                 25     25     25     25
3                 25     25     25     25
4                 25     25     25     25
Col. means:       25     25     25     25*

Second scenario
                  Solvent number
Catalyst number   1      2      3      Row means
1                 25     25     25     25
2                 25     25     25     25
3                 35     35     35     35
4                 25     25     25     25
Col. means:       27.5   27.5   27.5   27.5*

* Grand mean
Table 10-5 Part B – RESIDUALS for ANOVA from Table 10-4 after correcting for row and column means

First scenario
                              Solvent number
Catalyst number               1      2      3      Row diffs
1                             0      0      0      0
2                             0      0      0      0
3                             0      0      0      0
4                             0      0      0      0
Mean diff. from grand mean:   0      0      0

Second scenario
                              Solvent number
Catalyst number               1      2      3      Row diffs
1                             0      0      0      −2.5
2                             0      0      0      −2.5
3                             0      0      0      7.5
4                             0      0      0      −2.5
Mean diff. from grand mean:   0      0      0
3) + column differences from the grand mean (shown in Part B)
4) + residuals (shown in the body of Part B).

The second point is that the residuals, representing the error portion of the data, are all zero; the data is accounted for entirely by the systematic variations due to the variations between the rows and the variations between the columns (of course, the column differences happen to be zero in this data). Now the really interesting stuff happens when we do in fact have error in the data. Let us look at what happens to these two scenarios when there is a small amount of random error variability superimposed on the data. Now the experimental conditions for the two scenarios are as follows:

Scenario #3:
1) There is no influence of solvent
2) None of the catalysts have an effect
3) There is a random error superimposed on the experiment.
Scenario #4:
1) There is no influence of solvent
2) One of the catalysts has an effect
3) The same random error exists as in Scenario #3.

For these two situations, let us suppose each error has the value shown in Table 10-6 for the corresponding datum. The values in Table 10-6 were selected randomly, and have a mean of zero and a standard deviation of unity. When these error values are superimposed on the data, we arrive at Table 10-7. When we subject this data to the same ANOVA calculations as the errorless data, we arrive at the results shown in Table 10-8. It is instructive to compare the values in these tables with the corresponding values in the ANOVA tables for the errorless data. In particular, note that in the table corresponding to Scenario 3, even though there are no underlying systematic variations in the data, both the row and the column means are perturbed by the random variations superimposed on the data. How, then, can we differentiate these differences from the ones due to real systematic variations such as are present in Scenario 4? The answer, of course, is to do a statistical hypothesis test, but as it stands, we do not seem to have enough information available for such a test. We can compute variances between rows and also between columns, in order to have the mean squares for the corresponding differences, but what are we going to compare these mean squares to? In particular, what are we going to use
Table 10-6 For Scenarios 3 and 4 each error has the following value for the corresponding datum

                  Solvent number
Catalyst number   1          2          3
1                 0.8416     −0.3583    0.5416
2                 −1.2583    −0.9583    1.4416
3                 −1.3583    0.0416     1.4416
4                 0.4416     −1.0583    0.2416
Table 10-7 Hypothetical data under two different scenarios; for the experiment examining the effect of solvent and catalyst on yield, random variations (from Table 10-6) have zero mean and unity standard deviation

Third scenario
                  Solvent number
Catalyst number   1          2          3
1                 25.8416    24.6416    25.5416
2                 23.7416    24.0416    26.4416
3                 23.6416    25.0416    26.4416
4                 25.4416    23.9416    25.2416

Fourth scenario
                  Solvent number
Catalyst number   1          2          3
1                 25.8416    24.6416    25.5416
2                 23.7416    24.0416    26.4416
3                 33.6416    35.0416    36.4416
4                 25.4416    23.9416    25.2416
Table 10-8 Part A – DATA: ANOVA for the hypothetical data containing error with mean equal 0 and standard deviation (S) equal to unity

Third scenario
                  Solvent number
Catalyst number   1          2          3          Row means
1                 25.8416    24.6416    25.5416    25.3416
2                 23.7416    24.0416    26.4416    24.7416
3                 23.6416    25.0416    26.4416    25.0416
4                 25.4416    23.9416    25.2416    24.875
Col. means:       24.6666    24.4166    25.9166    25*

Fourth scenario
                  Solvent number
Catalyst number   1          2          3          Row means
1                 25.8416    24.6416    25.5416    25.3416
2                 23.7416    24.0416    26.4416    24.7416
3                 33.6416    35.0416    36.4416    35.0416
4                 25.4416    23.9416    25.2416    24.875
Col. means:       27.1666    26.9166    28.4166    27.5*

* Grand mean
Table 10-8 Part B – RESIDUALS for the hypothetical data containing error with mean equal 0 and standard deviation (S) equal to unity

Third scenario
                              Solvent number
Catalyst number               1          2          3          Row diff. from grand mean
1                             0.8333     −0.1166    −0.7166    0.3416
2                             −0.6666    −0.1166    0.7833     −0.2583
3                             −1.0666    0.5833     0.4833     0.0416
4                             0.9        −0.35      −0.55      −0.125
Col. diff. from grand mean:   −0.3333    −0.5833    0.9166

Fourth scenario
                              Solvent number
Catalyst number               1          2          3          Row diff. from grand mean
1                             0.8333     −0.1166    −0.7166    −2.1583
2                             −0.6666    −0.1166    0.7833     −2.7583
3                             −1.0666    0.5833     0.4833     7.5416
4                             0.9        −0.35      −0.55      −2.625
Col. diff. from grand mean:   −0.3333    −0.5833    0.9166
to represent the error, to see if the row mean squares or the column mean squares are larger than can be accounted for by the error of the data? The answer to this question is in the residuals. While the residuals might not seem to bear any relationship to either the original data or the errors (which in this case we know because we created them and they are listed above), in fact the residuals contain the variance present in the errors of the original data. However, the value of the error sum of squares is reduced from that of the original data, because some fraction of the error variation was removed from the total when the row and column means were subtracted from the data itself. This reduction in the sum of squares can be compensated for by making a corresponding adjustment in the degrees of freedom used to calculate the mean square from the sum of squares. In this data the sum of squares of the residuals is 5.24 (check it out). The number of degrees of freedom in these residuals is calculated by starting with the total (which is twelve, one from each piece of data in the experiment) and subtracting one degree of freedom for each independent quantity calculated from and subtracted from the data. What are these? Well, there is one grand mean, four row means, and three column means; but because the row means must average to the grand mean, only r − 1 = 3 of them are independent, and likewise only c − 1 = 2 of the column means. The number of degrees of freedom lost is therefore 1 + (r − 1) + (c − 1) = 1 + 3 + 2 = 6, leaving (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6 for the residuals. The mean square for the residuals is thus 5.24/6, or 0.873, and as a check, the square root of that value, 0.934, is an estimate of the error (which we know is unity).
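The residual calculation described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the third-scenario data as listed in Table 10-7 (rows are catalysts, columns are solvents); the variable names are ours and purely illustrative.

```python
# Two-way ANOVA decomposition for the third-scenario data of Table 10-7.
# Rows are catalysts 1-4; columns are solvents 1-3.
data = [
    [25.8416, 24.6416, 25.5416],
    [23.7416, 24.0416, 26.4416],
    [23.6416, 25.0416, 26.4416],
    [25.4416, 23.9416, 25.2416],
]
n_rows, n_cols = len(data), len(data[0])

grand_mean = sum(map(sum, data)) / (n_rows * n_cols)
row_means = [sum(row) / n_cols for row in data]
col_means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]

# Residual = datum - row mean - column mean + grand mean
residuals = [
    [data[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n_cols)]
    for i in range(n_rows)
]

ss_resid = sum(e * e for row in residuals for e in row)  # residual sum of squares
df_resid = (n_rows - 1) * (n_cols - 1)                   # 12 - 1 - 3 - 2 = 6
ms_resid = ss_resid / df_resid                           # residual mean square
error_sd = ms_resid ** 0.5                               # estimate of the error SD

print(f"sum of squares = {ss_resid:.2f}, df = {df_resid}")
print(f"mean square = {ms_resid:.3f}, error estimate = {error_sd:.3f}")
```

Running the sketch gives a residual sum of squares of about 5.24 on 6 degrees of freedom, and an error estimate of about 0.93, close to the known value of unity.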
11 Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions
Analytic geometry is a branch of mathematics in which geometry is described through the use of algebra. René Descartes (1596–1650) is credited with conceptualizing this mathematical discipline. Recalling the basics, we can express the points of a plane as a pair of numbers with x-axis and y-axis coordinates, designated by (x, y). Note that the x-axis coordinate is termed the "abscissa", and the y-axis coordinate the "ordinate".
THE DISTANCE FORMULA

In two dimensions (x and y), the distance between two points (x1, y1) and (x2, y2) in two-dimensional space (as shown in Figure 11-1) is given by the Pythagorean theorem as

$$D^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 \tag{11-1}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-2}$$

Note: This relationship holds even when x1 or y1 or both are negative (also shown in Figure 11-1). In three dimensions (x, y, z), we describe three lines at right angles to one another, designated as the x, y, z axes. Three planes are represented as xy, yz, and zx, and the distance between two points (x1, y1, z1) and (x2, y2, z2) is given by

$$D^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 + |z_2 - z_1|^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 \tag{11-3}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \tag{11-4}$$
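The two- and three-dimensional distance formulas differ only in the number of squared terms, so one function can cover both. The following is a minimal sketch; the function name `distance` and the sample points are our own choices for illustration.

```python
import math

def distance(p, q):
    """Euclidean distance between two points given as equal-length coordinate tuples."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

d_2d = distance((3, 5), (2, 7))          # sqrt(5), two dimensions
d_3d = distance((2, -1, 4), (4, 1, 2))   # sqrt(12), three dimensions
d_neg = distance((-1, -2), (2, 2))       # 5.0; negative coordinates are handled

print(d_2d, d_3d, d_neg)
```

Because each coordinate difference is squared, the sign of the individual coordinates does not matter, which is the point of the note about negative values.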
Figure 11-1 The distance between two points in a two-dimensional coordinate space is determined using the Pythagorean theorem. [Plot of the points (x1, y1) and (x2, y2) on X and Y axes.]
DIRECTION NOTATION

For two-dimensional problems, given a line with respect to two axes x and y, there is a set of angles α and β that are designated as the x-direction angle and y-direction angle, respectively. Thus, as illustrated by using Figures 11-2a and 11-2b, a clearly defined line segment can be described given the angles α and β on the coordinate axes x and y. The only restriction that applies here is that both angles α and β must be ≥ 0° and ≤ 180°.
THE COSINE FUNCTION

The cosine function applied to Figures 11-2a and 11-2b is given as

$$\cos\alpha = \frac{x_2 - x_1}{d} \tag{11-5a}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} \tag{11-5b}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-6}$$

Figure 11-2 Two illustrations, (a) and (b), of the x-direction angle (α) and y-direction angle (β) for a two-dimensional coordinate system.

Note that cos α and cos β are referred to as the direction cosines of the line segment described. To summarize in expanded notation:

$$\cos\alpha = \frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{11-7a}$$

and

$$\cos\beta = \frac{y_2 - y_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{11-7b}$$
Example: Find the direction cosines and corresponding angles for a line segment AB, where A is (3, 5) and B is (2, 7); check your work using cos²α + cos²β = 1.0, and draw a graphic of the line segment (Figure 11-3). The solution proceeds as follows:

$$x_2 - x_1 = 2 - 3 = -1 \tag{11-8a}$$

and

$$y_2 - y_1 = 7 - 5 = 2 \tag{11-8b}$$

Therefore, the distance (d) is given by

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-9a}$$

$$d = \sqrt{(-1)^2 + 2^2} = \sqrt{5} \tag{11-9b}$$

From the formulas above, we can determine that

$$\cos\alpha = -1/\sqrt{5}$$

and the corresponding angle is given as

$$\alpha = \cos^{-1}\left(-1/\sqrt{5}\right) = 116.57^\circ$$

We also know that

$$\cos\beta = 2/\sqrt{5}$$

therefore the angle is given by

$$\beta = \cos^{-1}\left(2/\sqrt{5}\right) = 26.57^\circ$$

Checking our work using the formula cos²α + cos²β = 1.0, we find that

$$\cos^2(116.57^\circ) + \cos^2(26.57^\circ) = 0.20 + 0.80 = 1.0$$

Figure 11-3 The x-direction angle (α = 116.57°) and y-direction angle (β = 26.57°) for a line segment, where A is (3, 5) and B is (2, 7) (see example in text).
DIRECTION IN 3-D SPACE

To continue our discussion of direction angles, we will use the same nomenclature: x, designated by α; y, designated by β; and z, newly designated by γ. We can determine the cosine of any direction angle, given the corresponding x, y, z coordinates for designated points in space as:

$$\cos\alpha = (x_2 - x_1)/d \tag{11-10a}$$

and

$$\cos\beta = (y_2 - y_1)/d \tag{11-10b}$$

and

$$\cos\gamma = (z_2 - z_1)/d \tag{11-10c}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \tag{11-11}$$

It follows algebraically that

$$\cos^2\alpha + \cos^2\beta + \cos^2\gamma = 1.0 \tag{11-12}$$
Example: Find the direction cosines and corresponding angles for a line segment AB where A is (2, −1, 4) and B is (4, 1, 2). To solve, use

$$x_2 - x_1 = 4 - 2 = 2$$

and

$$y_2 - y_1 = 1 - (-1) = 2$$

and

$$z_2 - z_1 = 2 - 4 = -2$$

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

$$d = \sqrt{2^2 + 2^2 + (-2)^2} = \sqrt{12} = 3.46$$

and

$$\cos\alpha = 2/3.46 = 0.577$$

$$\cos\beta = 2/3.46 = 0.577$$

$$\cos\gamma = -2/3.46 = -0.577$$

To find the direction angles corresponding to the above we use

$$\alpha = \cos^{-1}(0.577) = 54.76^\circ$$

$$\beta = \cos^{-1}(0.577) = 54.76^\circ$$

$$\gamma = \cos^{-1}(-0.577) = 125.24^\circ$$

Checking the calculations, we use cos²α + cos²β + cos²γ = 1.0, or

$$0.333 + 0.333 + 0.333 = 1.00$$
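Both worked examples in this chapter (the 2-D segment from (3, 5) to (2, 7) and the 3-D segment above) can be checked with one small routine. This is a sketch under our own naming (`direction_cosines`, `direction_angles`); it is not code from the original text.

```python
import math

def direction_cosines(a, b):
    """Direction cosines of the segment from point a to point b."""
    deltas = [q - p for p, q in zip(a, b)]
    d = math.sqrt(sum(t * t for t in deltas))
    if d == 0:
        raise ValueError("the two points coincide; direction is undefined")
    return [t / d for t in deltas]

def direction_angles(a, b):
    """Direction angles in degrees, each between 0 and 180."""
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    return [math.degrees(math.acos(max(-1.0, min(1.0, c))))
            for c in direction_cosines(a, b)]

cos_2d = direction_cosines((3, 5), (2, 7))
ang_2d = direction_angles((3, 5), (2, 7))        # about 116.57 and 26.57 degrees
cos_3d = direction_cosines((2, -1, 4), (4, 1, 2))
ang_3d = direction_angles((2, -1, 4), (4, 1, 2))

# The squared direction cosines sum to 1 in any number of dimensions.
print(sum(c * c for c in cos_2d), sum(c * c for c in cos_3d))
print(ang_2d, ang_3d)
```

With unrounded cosines the 3-D angles come out near 54.74° and 125.26°; the slightly different hand-calculated values arise from rounding the cosine to 0.577 before taking the inverse cosine.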
DEFINING SLOPE IN TWO DIMENSIONS

The slope m of a line segment between two points is given as:

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \tan\theta \tag{11-13}$$

where θ is the x-direction angle and 0° ≤ θ < 360°. This well-known expression is also equivalent to the tangent of the x-direction angle for the line segment defined by the two points on the line. Thus the slope of the line given in Figure 11-4 is tan(120°) = −1.73. Just store this information away for the next several chapters as we build a pre-chemometrics view of analytic geometry.
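The equivalence between the two-point slope and the tangent of the x-direction angle can be checked numerically. In this sketch the helper `slope` and the choice of a unit-length segment starting at the origin are our assumptions, made only to have concrete points to work with.

```python
import math

def slope(p1, p2):
    """Slope of the line through p1 and p2; undefined for vertical lines."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)

# A unit-length segment leaving the origin at an x-direction angle of 120 degrees.
theta = math.radians(120)
p1 = (0.0, 0.0)
p2 = (math.cos(theta), math.sin(theta))

m_from_points = slope(p1, p2)
m_from_angle = math.tan(theta)

print(round(m_from_points, 3), round(m_from_angle, 3))
```

Both values agree at about −1.732, matching the slope quoted for a 120° direction angle to the precision shown there.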
Figure 11-4 Illustration of the slope of a line given an x-direction angle of 120°. [Plot: a line on X and Y axes with θ = 120°.]
RECOMMENDED READING

We recommend a standard text on vector analytic geometry. One good example is
1. White, P.A., Vector Analytic Geometry (Dickenson, Belmont, CA, 1966).
12 Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations
We continue with our pre-chemometrics review of analytic geometry, noting that the term "vector" in all cases can be represented by a matrix of r × c dimensions, where r = # of rows and c = # of columns. The operations defined below will be employed in future discussions.
VECTOR MULTIPLICATION (SCALAR × VECTOR)

If M represents a vector with components (or elements) (Mx, My), then sM (where s is a real number, also termed a "scalar") is defined as the vector represented by (sMx, sMy); and the length of sM is |s| times the length of M. One can relate the direction angles of M to those of sM as follows: For the case where s > 0 (s is a positive, real number), then

$$\cos\alpha_{sM} = \cos\alpha_M \tag{12-1a}$$

and

$$\cos\beta_{sM} = \cos\beta_M \tag{12-1b}$$

So the vectors sM and M have exactly the same direction. For the case where s < 0 (where s is a negative, real number), then

$$\cos\alpha_{sM} = -\cos\alpha_M \tag{12-1c}$$

and

$$\cos\beta_{sM} = -\cos\beta_M \tag{12-1d}$$

In this case, the vectors sM and M have exactly opposite directions. (Note: When s = 0, sM is the zero vector, for which no direction is defined.)

Example problem. If M = (1, 5), then 2M (where s = 2) = (2 × 1, 2 × 5) = (2, 10), represented in Figure 12-1 as the line segment from point (0, 0) to (2, 10). (Note: The expression −2M = (−2, −10) is represented by the line segment from point (0, 0) to (−2, −10).)
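The relations between the direction cosines of M and sM can be verified for the example vector M = (1, 5). The helper names below (`scale`, `length`, `direction_cosines`) are ours; this is an illustrative sketch rather than a prescribed implementation.

```python
import math

def scale(s, v):
    """Multiply every component of vector v by the scalar s."""
    return tuple(s * c for c in v)

def length(v):
    return math.sqrt(sum(c * c for c in v))

def direction_cosines(v):
    """Direction cosines of a vector drawn from the origin (v must be non-zero)."""
    n = length(v)
    if n == 0:
        raise ValueError("the zero vector has no defined direction")
    return tuple(c / n for c in v)

M = (1, 5)
for s in (2, -2):
    sM = scale(s, M)
    print(s, sM, length(sM) / length(M), direction_cosines(sM))
print(direction_cosines(M))
```

For s = 2 the direction cosines match those of M, for s = −2 they change sign, and in both cases the length ratio is 2, that is, the absolute value of s.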
Figure 12-1 An example of scalar × vector multiplication: if M = (1, 5), then 2M = (2, 10) and −2M = (−2, −10). [Plot: M as the segment from (0, 0) to (1, 5); 2M as the segment from (0, 0) to (2, 10); −2M as the segment from (0, 0) to (−2, −10).]
VECTOR DIVISION (VECTOR ÷ SCALAR)

Vector division is represented as vector multiplication by using a fractional multiplier term. For example, if s = 1/2, then sM = (0.5, 2.5); if s = −1/2, then sM = (−0.5, −2.5), and so forth.
VECTOR ADDITION (VECTOR + VECTOR)

Given M = (Mx, My), where M = (1, 3); and N = (Nx, Ny), where N = (3, 1), then

$$M + N = (M_x + N_x,\; M_y + N_y) \tag{12-2}$$

The geometric representation is shown in Figure 12-2 for (1 + 3, 3 + 1) = (4, 4).

Figure 12-2 An example of vector + vector addition: If M = (1, 3) and N = (3, 1), then M + N = (4, 4).
VECTOR SUBTRACTION (VECTOR − VECTOR)

Given M = (Mx, My), where M = (1, 3), and N = (Nx, Ny), where N = (3, 1), then

$$M - N = (M_x - N_x,\; M_y - N_y)$$

The geometric representation of M − N = (1 − 3, 3 − 1) = (−2, 2) is shown in Figure 12-3. In our next chapter we will look at the problem of representing higher dimensional space with fewer dimensions; it will be a precursor to discussions of the dimensional aspects of multivariate algorithms.

Figure 12-3 An example of vector − vector subtraction: If M = (1, 3) and N = (3, 1), then M − N = (−2, 2). [Plot: the vectors M, N, −N, and M − N.]
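The division, addition, and subtraction examples of this chapter can be reproduced component by component. The function names here are our own; subtraction is written as addition of −N to mirror the geometric construction, which is one reasonable way to express it rather than the only one.

```python
def scale(s, v):
    """Scalar times vector; division by k is scale(1 / k, v)."""
    return tuple(s * c for c in v)

def add(m, n):
    """Component-wise sum of two vectors of equal dimension."""
    if len(m) != len(n):
        raise ValueError("vectors must have the same dimension")
    return tuple(a + b for a, b in zip(m, n))

def subtract(m, n):
    """M - N, computed as M + (-N)."""
    return add(m, scale(-1, n))

M, N = (1, 3), (3, 1)
total = add(M, N)              # (4, 4)
difference = subtract(M, N)    # (-2, 2)
half = scale(0.5, (1, 5))      # (0.5, 2.5)

print(total, difference, half)
```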
13 Analytic Geometry: Part 3 – Reducing Dimensionality
For this chapter, we will reduce three-dimensional data to one-dimensional data using the techniques of projection and rotation. The (x, y, z) data will be projected onto the (x, z) plane and then rotated onto the x axis. This chapter is purely pedagogical and is intended only to demonstrate the use of projection and rotation as geometric terms.

REDUCING DIMENSIONALITY

The exercise for this chapter is to reduce a point on a vector in 3-D space to a point on a vector in 2-D space, then to further reduce the point on a vector in 2-D space to a point on a vector in 1-D space, all the while maintaining as much information as possible. So (x, y, z) is reduced to (x, z), which is further reduced to (x). This process can be represented in symbolic language as (x, y, z) → (x, z) → x.
3-D TO 2-D BY PROJECTION

Let us calculate some of the angles relative to the vector in 3-D space as shown in Figure 13-1. To calculate these angles, we refer to Chapter 11, and if we proceed with our calculations we find

$$\alpha = \cos^{-1}(0.7071) = 45^\circ \tag{13-1}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} = \frac{2 - 0}{\sqrt{8}} = 0.7071, \qquad \beta = \cos^{-1}(0.7071) = 45^\circ \tag{13-2}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(2 - 0)^2 + (2 - 0)^2} = \sqrt{8}$$
Figure 13-1 A point (x, y, z) = (2, 2, 6) located along a vector in 3-D space. Both the angle α (the angle to the x-axis) and the angle β (the angle to the y-axis), as illustrated in the figure, are shown as a projection of the 3-D vector (2, 2, 6) onto the (x, y) plane, and the calculations for both α and β from what is then a 2-D vector are correct as given in equations 13-1 and 13-2.
Because the third dimension is represented by the z axis, we calculate the x-direction angle on the (x, z) plane as α:

$$\alpha = \cos^{-1}\left(\frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (z_2 - z_1)^2}}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{(2 - 0)^2 + (6 - 0)^2}}\right) = \cos^{-1}(0.3162) = 71.57^\circ \tag{13-3}$$
Now look at Table 13-1, which describes the trigonometric functions of a right triangle (Figure 13-2). If we apply Table 13-1 to this problem, we can calculate the length of a vector using trigonometric functions. Figure 13-3 illustrates the geometric problem of solving for the length of the vector from A to B, that is, from the point (0, 0) to the point (2, 6) on the (x, z) plane. The angle calculated in equation 13-3 is represented in Figures 13-3 and 13-4; the angle shown in Figure 13-1 is not discussed. Because the third dimension is represented by the z-axis, the angle calculated on the (x, z) plane is the x-direction angle α; the correct calculation for this angle (α) is given in equation 13-3.

Table 13-1 Trigonometric functions of a right triangle

sin θ = opposite/hypotenuse     csc θ = hypotenuse/opposite
cos θ = adjacent/hypotenuse     sec θ = hypotenuse/adjacent
tan θ = opposite/adjacent       cot θ = adjacent/opposite

To calculate the length of the projection of vector AB onto the (x, z) plane, we can use sin θ = opp/hyp
Figure 13-2 A right triangle showing adjacent (adj.), hypotenuse (hyp.) and opposite sides relative to angle θ.

Figure 13-3 The geometric problem associated with calculating the length of a vector AB, given a point (x, z) = (2, 6) in 2-D space. Note that the angle θ is equal to 90° − 71.57° = 18.43°.

Figure 13-4 Illustration of two-dimensional reduction to one dimension by an x-directional rotation of 71.57°. [Plot: a vector of length L = 6.33 on the (x, z) plane with α = 71.57°.]
which becomes hyp = opp/sin θ = 2/sin(18.43°) = 6.33. Therefore, we can project the AB vector in 3-D space onto 2-D space by using a projection onto the (x, z) plane, resulting in a point on a vector on the 2-D (x, z) plane, the vector being 6.33 units in length and having an x-direction angle equal to 71.57° (as in Figure 13-4).
2-D INTO 1-D BY ROTATION

By rotating the vector in 2-D space through 71.57° in the x-direction, we can align it with the x axis as a 1-D line 6.33 units in length (as shown in Figure 13-5).

Figure 13-5 By projecting a vector in (x, y, z) space onto a plane in (x, z) space, and by an x-directional rotation of 71.57° in the (x, z) plane, we have the reduction of a point on a vector in 3-D space to a point on a vector in 1-D space. [Plot: a line of length L = 6.33 along the x axis.]
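The projection and rotation steps for the point (2, 2, 6) can be traced numerically. This is a sketch under our own assumptions: projection onto the (x, z) plane is taken to mean dropping the y coordinate, and the rotation is the standard 2-D rotation applied with the negative of the x-direction angle.

```python
import math

point = (2.0, 2.0, 6.0)

# Step 1: project onto the (x, z) plane by dropping the y coordinate.
x, _, z = point
projected = (x, z)

length = math.hypot(x, z)                  # sqrt(40), about 6.32
alpha = math.degrees(math.atan2(z, x))     # x-direction angle, about 71.57 degrees

# Step 2: rotate by -alpha so the projected vector lies along the x axis.
t = math.radians(-alpha)
x_rot = x * math.cos(t) - z * math.sin(t)
z_rot = x * math.sin(t) + z * math.cos(t)

print(projected, round(length, 2), round(alpha, 2))
print(round(x_rot, 2), round(abs(z_rot), 2))
```

After the rotation the z component vanishes (to within round-off) and the x component equals the length of the projected vector. The unrounded length is √40 ≈ 6.32; a hand calculation from the rounded angle 18.43° gives 6.33.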
In our next chapter, we will be applying the lessons reviewed over these past three chapters toward a better understanding of the geometric concepts relative to multivariate regression.
14 Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices
In this chapter, we plan to use the information presented over the past three chapters to illustrate the geometry of vectors and matrices; these concepts will continue to be discussed routinely throughout this series in relation to regression vectors.
ROW VECTORS IN COLUMN SPACE

Let us begin by representing a row matrix M = [1, 2, 3] in column space as shown in Figure 14-1. Note that the row vector M = [1, 2, 3] projects onto the plane defined by columns 1 and 2 as a point (1, 2) or a vector (straight line) with a C1 direction angle (α) equal to

$$\alpha = \cos^{-1}\left(\frac{\mathrm{C1}_2 - \mathrm{C1}_1}{d}\right) = \cos^{-1}\left(\frac{1 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.4472) = 63.43^\circ \tag{14-1}$$

and a C2 direction angle (β) equal to

$$\beta = \cos^{-1}\left(\frac{\mathrm{C2}_2 - \mathrm{C2}_1}{d}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.8944) = 26.57^\circ \tag{14-2a}$$

where

$$d = \sqrt{(\mathrm{C1}_2 - \mathrm{C1}_1)^2 + (\mathrm{C2}_2 - \mathrm{C2}_1)^2} = \sqrt{1^2 + 2^2} = \sqrt{5} \tag{14-2b}$$
COLUMN VECTORS IN ROW SPACE

A matrix consisting of more than one row, such as

$$M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

can be represented by 2-D row space as shown in Figure 14-2. Note that each column in the matrix can be represented by a column vector as shown in the figure.
Figure 14-1 A representation of a row vector M = [1, 2, 3] in column space, and the projection of this vector onto the plane represented by Columns 1 and 2. [Plot: axes Column 1, Column 2, and Column 3, with the angles α and β marked.]

Figure 14-2 The representation of column vectors in row space of matrix $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$. [Plot: axes Row 1 and Row 2, each running from 0 to 4, with the vectors Column 1 and Column 2.]
PRINCIPAL COMPONENTS FOR REGRESSION VECTORS

Figure 14-3a shows the projection of two column vectors, C1 = (1, 3) and C2 = (3, 1), onto their vector sum (or principal component (PC1)). We note that the product (1, 3) × (3, 1) = (1 × 3, 3 × 1) = (3, 3). The vector sum of the two column vectors passes through the point (3, 3), but the projection of each column onto PC1 gives a vector with a length equal to line segments B + C as shown in Figure 14-3b.
Figure 14-3 (a) The representation of two columns of a matrix in row space. The vector sum of the two column vectors is the first principal component (PC1). (b) A close-up view of Figure 14-3a, illustrating the line segments (A, B, C, D, E), direction angles (∠α, ∠β, ∠C, ∠D), and projection of Columns 1 and 2 onto the first principal component.
To determine the geometry for Figures 14-3a and 14-3b, we begin by calculating the length of line segment E (Column 1) by using the Pythagorean theorem as

$$E^2 = \mathrm{Hyp}^2 = (3 - 0)^2 + (1 - 0)^2 = 3^2 + 1^2 = 10; \quad \text{therefore } E = \sqrt{10} = 3.162 \tag{14-3}$$

Then the angle C can be determined using

$$\tan C = \frac{\mathrm{opp}}{\mathrm{adj}} = \frac{1}{3} \tag{14-4a}$$

and

$$\tan^{-1}\left(\frac{1}{3}\right) = 18.435^\circ \tag{14-4b}$$

So ∠C = 18.435°, ∠D = 18.435°, and ∠α + ∠β = 90° − 2 × 18.435°. Thus, ∠α and ∠β are each equal to 26.565°. It follows that the projection of the vectors represented by Columns 1 and 2 onto the vector PC1 yields a right triangle defined by the three line segments C + B, D, and E. The length of PC1 (the hypotenuse) is equal to line segments C + B and is given by

$$\cos\alpha = \frac{\mathrm{adj}}{\mathrm{hyp}} \Rightarrow \cos(26.565^\circ) = \frac{E}{C + B} \Rightarrow 0.8944 = \frac{3.162}{\mathrm{hyp}} \Rightarrow \mathrm{hyp} = 3.5353 \tag{14-5}$$

So the length of the hypotenuse (segments C + B) is 3.5353. We can check our work by calculating the opposite side (D) length as

$$\tan\alpha = \frac{\mathrm{opp}}{\mathrm{adj}} \Rightarrow \tan(26.565^\circ) = \frac{D}{E} \Rightarrow 0.500 = \frac{\mathrm{opp}}{3.162} \Rightarrow \mathrm{opp} = 1.5810 \tag{14-6}$$
And by using the Pythagorean theorem we can calculate the length of the hypotenuse:

$$(3.162)^2 + (1.5810)^2 = (3.5352)^2 \tag{14-7}$$
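The chain of calculations in equations 14-3 through 14-7 can be followed step by step in code. This sketch simply repeats the arithmetic of the text at full floating-point precision; the variable names (`E`, `angle_C`, `alpha`, `hyp`, `D`) mirror the labels used above, and the right triangle with sides E, D and hypotenuse C + B is taken as given from the figure description.

```python
import math

E = math.hypot(3, 1)                          # Eq. 14-3: sqrt(10)
angle_C = math.degrees(math.atan2(1, 3))      # Eq. 14-4: about 18.435 degrees
alpha = (90 - 2 * angle_C) / 2                # each of the two equal angles, about 26.565

hyp = E / math.cos(math.radians(alpha))       # Eq. 14-5: length of segments C + B
D = E * math.tan(math.radians(alpha))         # Eq. 14-6: opposite side
check = math.hypot(E, D)                      # Eq. 14-7: should reproduce hyp

print(round(E, 4), round(angle_C, 3), round(alpha, 3))
print(round(hyp, 4), round(D, 4), round(check, 4))
```

At full precision the hypotenuse is about 3.5355 and the opposite side about 1.5811; the small differences from the hand-calculated 3.5353 and 1.5810 come from rounding intermediate values.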
By representing a row vector in column space, or a column vector in row space, we can illustrate the geometry of regression. These concepts combined with matrix algebra will be useful for further discussions of regression. In Chapters 15–20, we will digress from these topics and revisit experimental design concepts. Readers may wish to study additional materials related to the subject of analytical geometry and regression. We recommend two sources of such information below.
RECOMMENDED READING 1. Beebe, K.R. and Kowalski, B.R., Analytical Chemistry 59(17), 1007A–1017A (1987). 2. Fogiel, M., ed., The Geometry Problem Solver (Research and Education Association, New York, 1987).
15 Experimental Designs: Part 4 – Varying Parameters to Expand the Design
We have discussed experimental designs in previous papers [1–4], and in Chapters 8–10. In those previous chapters, the designs we discussed were, with the exception of one particularly interesting design (representing a special case of a more general type of design that we will discuss later), rather simple and plain, in the sense that the designs included only small numbers of levels of the various factors of interest, and were basically considerations of "all possible combinations" of those factors: the types of experiments that scientists have been designing "forever" without any thought or consideration that they were "statistical experimental designs". Obviously, though, since they represent special cases of wider classes of designs, they must also come under that umbrella. So what is special about the experimental designs that we call "statistical" or "chemometric" designs? Actually, very little, until we take a look at what happens when we need to scale these designs up to larger sample numbers or more complex designs. Before we do that, let us consider the various types of experiments, and the nature of the factors that are used in those experiments. Someone doing an experiment is generally trying to learn about the effect of some phenomenon on some quantity that can be measured. While there are cases that do not fit the description we are about to present, one very common type of experiment involves changing (or allowing the change of) some parameter, and then measuring the effect of that change. If there is only one such parameter, the situation is pretty straightforward, but things start getting interesting when two or more possible parameters are involved. Intuitively, the first instinct is to measure the results that are obtained for all possible combinations of the available values of the parameters. In Chapter 8, we looked at some experiments that involved two parameters (factors), each at two levels.
In Chapter 10, we briefly looked at a three-factor, two-level design, with attention to how it could be represented geometrically. The use of the term "three factor, two level" to describe the design means that each factor was present at two levels, that is, the corresponding parameters were each permitted to assume two values. There are several ways we can expand a design such as this: we can increase the number of factors, the number of levels of each factor, or we can do both, of course. There are other distinctions that can be superimposed on the basic idea of simple, all-possible-combinations designs, such as whether we can control the levels of the factors (if we can, we can then do things that are not possible when we cannot), and whether the "levels" correspond to physical characteristics that can be evaluated and whose values have real physical meaning (temperature, for example, has real physical meaning, while catalyst type does not, even though different catalysts in an experiment may all have different degrees of effectiveness, and reproducibly so).
Another consideration is whether all the factors can be changed independently through their range of possible values, or whether there are limits on the possible values. The most obvious limiting situation is the case of mixtures, where all the components of a mixture must sum to 100%. Other limitations might be imposed by the physical (or chemical) behavior of the materials involved: solubility as a function of temperature, for example, or as a function of other materials present (maximum solubility of salt in water–alcohol mixtures, for example, will vary with the ratio of the two solvents). Other limits might be set by practical considerations such as safety; except for specialized work by scientists experienced in the field, few experimenters would want to work, for example, with materials at concentrations above their explosive limits.
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 9(8), 26–27 (1994).
2. Mark, H. and Workman, J., Spectroscopy 9(9), 30–32 (1994).
3. Mark, H. and Workman, J., Spectroscopy 6(1), 13–16 (1991).
4. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
16 Experimental Designs: Part 5 – One-at-a-time Designs
In Chapter 15, which was based on reference [1], we began our discussions of factorial designs. If we expand the basic n-factor two-level experiment by increasing the number of factors, maintaining the restriction of allowing each to assume only two values, then the number of experiments required is $2^n$, where n is the number of factors. Even for experiments that are easy to perform, this number quickly gets out of hand; if eight different factors are of interest, the number of experiments needed to determine the effect of all possible combinations is 256, and this number increases exponentially. The other obvious way we might want to expand the experiment is to increase the number of levels (values) that some or all of the factors take. In this case, the number of experiments required increases even faster than $2^n$. So, for example, if each factor is at three levels, then the number of experiments needed is $3^n$ (for eight factors, corresponding to our previous calculation, this comes to 6,561 experiments!). In the general case, the number of experiments needed is $\prod_i n_i$, where $n_i$ is the number of levels of the ith factor. It should be clear at this point that the problem with this scenario is the sheer number of experiments needed, which in the real world translates into time, resources, and expense. Something must be done. Several "somethings" have been done. The intuitive experimenter, expert in his particular field of science but untrained in "statistical" designs, simplifies the whole process by throwing out all the combinations, and uses what are known simply as "one-at-a-time" designs [2]. Five variations of this basic design are described there, but basically these are only useful when the random noise or error is small (compared to the expected magnitude of the effects), and involve the experimenter changing one variable (factor) at a time to see which one(s) cause the greatest effect.
Sometimes those are then examined in greater detail, by varying them over larger ranges, and/or at values lying within the original range. This solves the problem of the proliferation of experiments, since the number of experiments needed is now no more than $1 + \sum_i n_i$ instead of $\prod_i n_i$, a much smaller number. It also provides a first-order indication of the effect of each of the factors. The difficulty now is the possibility of throwing out the baby with the bathwater, so to speak, by losing all information about the actual noise level, and information about any possible synergistic or inhibitory interactions between the factors. Thus, when statisticians got into the act, they saw a need to retain the information that was not included in the one-at-a-time plans, while still keeping the total number of experiments manageable; hence the birth of "statistical experimental designs". Several types of "statistical experimental designs" have been developed over the years, with, of course,
innumerable variations. However, they can be placed into a fairly small group of main design types:
1) Factorial
2) Fractional factorial
3) Sequential
4) a) Latin square
   b) Graeco-Latin square
   c) Latin and Graeco-Latin cubes
5) Model-building
6) Response surface.
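The experiment counts quoted earlier in this chapter for full factorial designs, and the saving offered by one-at-a-time plans, are easy to check. In this sketch the function names and the mixed-level example are hypothetical, and the one-at-a-time count assumes a particular layout (one baseline run plus n_i − 1 further runs for each factor), which never exceeds a count of one plus the sum of the numbers of levels. `math.prod` requires Python 3.8 or later.

```python
import math
from itertools import product

def full_factorial_runs(levels):
    """Number of runs when every combination of factor levels is tested."""
    return math.prod(levels)

def one_at_a_time_runs(levels):
    """Assumed layout: one baseline run plus (n_i - 1) further runs per factor."""
    return 1 + sum(n - 1 for n in levels)

eight_two_level = [2] * 8
eight_three_level = [3] * 8
mixed = [2, 3, 4]   # hypothetical design with unequal numbers of levels

print(full_factorial_runs(eight_two_level))     # 256
print(full_factorial_runs(eight_three_level))   # 6561
print(one_at_a_time_runs(eight_two_level))      # 9
print(full_factorial_runs(mixed))               # 24

# Enumerate the full factorial explicitly; each tuple is one experiment,
# giving the level index chosen for each factor.
runs = list(product(*(range(n) for n in mixed)))
print(len(runs), runs[:3])
```

The explicit enumeration makes the trade-off concrete: the full factorial covers every combination, so interactions between factors can be estimated, while the one-at-a-time plan visits only a handful of those combinations.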
By far the most statistical energy has been spent on the design and analysis of factorial designs. Books dealing with such designs (e.g., [3, 4]) spend a good part of their space discussing the variations required to accommodate such considerations as replication, blocking, how to deal with situations where the experiment itself is destructive (so that the same specimen is never available for retesting), whether the experimental conditions can be reproduced at will, and whether the experimental factors (or the desired response) can be assigned meaningful numerical values. Each of these considerations dictates the types of designs that can be considered and how they must be implemented. For our current discussions, however, we have been taking the path of discussing ways to reduce the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. We will discuss this type of design in Chapter 17.
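The experiment counts quoted earlier in this chapter are easy to check numerically. The following is a minimal sketch in Python (not part of the original text); the function names and the mixed-level example are hypothetical, and the one-at-a-time count is read as 1 plus the sum of the level counts.

```python
from math import prod


def full_factorial_runs(levels):
    """Runs in a full factorial design: the product of the level counts."""
    return prod(levels)


def one_at_a_time_runs(levels):
    """One-at-a-time count, read here as 1 plus the sum of the level counts."""
    return 1 + sum(levels)


two_level = [2] * 8      # eight factors, two levels each
three_level = [3] * 8    # eight factors, three levels each
mixed = [2, 3, 4]        # hypothetical design with unequal level counts

for name, levels in [("2-level", two_level), ("3-level", three_level), ("mixed", mixed)]:
    print(name, full_factorial_runs(levels), one_at_a_time_runs(levels))
```

For eight two-level factors this gives 256 full-factorial runs against 17 one-at-a-time runs, and for eight three-level factors 6,561 against 25, which is the saving the one-at-a-time plans buy at the cost of the interaction information.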
REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 10(9), 21–22 (1995). 2. Daniel, C., Journal of American Statistical Association 68(342), 353–360 (1973). 3. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978). 4. Box, G.E.P., Hunter, W.G. and Hunter, J.S., Statistics for Experimenters (John Wiley & Sons, New York, 1978).
17 Experimental Designs: Part 6 – Sequential Designs
We begin our discussion of resource-conserving (for want of a better generic term) experimental design with a look at sequential designs. This is the first of the types of experimental designs that have as one of their goals a reduction in the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. This design is the simple test for comparison of means, using the Z-test or the t-test as the test statistic; we have discussed these in our previous column series and book: “Statistics in Spectroscopy” (now in its second edition [1]). The standard t-test (or Z-test) specifies a predefined number of measurements to be made, either for a single condition or for a pair of conditions (i.e., sample versus “control”). The difference between the two states is compared to the experimental error evidenced in the data, and a decision is made based on whether the difference between the states is “large enough” compared to the noise (or error). For a sequential test, the number of experiments is not predefined. Rather, experiments are performed sequentially (surprise!), and the series is terminated as soon as enough data is available that a decision can be made as to whether the difference is “large enough”. True, it is theoretically possible for such a sequence of experiments to be indefinitely long; in practice, however, it is far more common for the situation to become decidable after fewer experiments than are required for the case of a fixed number of experiments. So how does this “magic” experimental design work? The best available discussion we know of is in reference [2]. The standard concept behind this experimental design is illustrated in Figure 17-1.
As this figure shows, the “universe” is divided into three regions: region A is the region of acceptance of the null hypothesis; region C is the region of acceptance of the alternative hypothesis. The middle region, B, is the region of continuation: as long as values fall into this region, we must continue with the experiments, since there is not enough information to make a decision. Figure 17-2 shows how this works for two typical cases. First a single experiment is performed, and the results noted. If these results put it into the region of continuing the project (virtually inevitable after only one experiment), then a second experiment is performed, and so forth. Figure 17-2 shows typical results for two possible sequences of experiments: the one indicated by the crosses enters the region of acceptance of the alternative hypothesis after seven experiments, the one indicated by the circles enters the region of acceptance of the null hypothesis after nine experiments. Obviously, the actual number of experiments required will depend on both the nature of the experiments and the definition of the two regions of acceptance. The x-axis represents, clearly, the number of experiments that have been carried out. The y-axis represents a function of
Figure 17-1 Standard concept behind sequential experimental design (see text for definition of function f( )). [Plot of f(α, β) against number of experiments, divided into regions A, B, and C.]
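Figure 17-2 Typical results for two possible experimental sequences. [Plot of f(α, β) against number of experiments (axis marked at 5, 10, and 15), divided into regions A, B, and C, with two sequences labeled 1 and 2.]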
the results of the experiments. Important to note at this point is the fact that the quantity plotted along the y-axis is a function, not of the result of a single experiment, but, in one way or another, of the cumulative results of all the experiments done up to that point. The key point, then, is how the lines separating the different regions are defined. The total answer will depend, of course, on which statistic is being plotted and on the details of the nature of the hypothesis test being done (e.g., two-tailed versus one-tailed, etc.). For an illustration we consider the sequential test of the hypothesis of the mean of a sample being the same as that of a given population, with the standard deviation known. In the case of fixed sample size, this would be done using a statistical hypothesis test with the Z statistic as the test statistic, and the probability level set simply to $\alpha$. For a sequential test, both the theory and the computations are a bit more complicated. In the case at hand, the defining limits are constructed as shown in Figure 17-3. The expected value of any given measurement is, of course, $\mu_0$, the population mean. Then the expected value of the sum of n readings, which we label T, equals $n\mu_0$ for each value
Figure 17-3 The relationship between the expected value of the statistic and the lines separating the regions of acceptance and rejection from the region indicating continuation of the experiment. [Plot of f(α, β) against number of experiments, showing regions A, B, and C and the intercept h0.]
of n, and plotting these sums as a function of n gives the central straight line shown in Figure 17-3; this line represents the expected value of the sum, and has a slope equal to $\mu_0$. As can be seen, data that agrees with the null hypothesis will follow this line and eventually move into region A, the region of acceptance of the null hypothesis. The lines separating the regions are defined by their slope and intercepts. If we let $\delta$ represent the minimum difference from $\mu_0$ we wish to detect, then the slope of the lines (which is common to the two lines: they are parallel) equals $\mu_0 + \delta/2$. The y-intercepts, which we designate h, are

$$h_0 = -\ln\!\left(\frac{1-\alpha}{\beta}\right)\frac{\sigma^2}{\delta}$$

$$h_1 = \ln\!\left(\frac{1-\beta}{\alpha}\right)\frac{\sigma^2}{\delta}$$

We note several interesting points about these expressions. First, the positions of the lines of demarcation depend, as we would expect, on both the minimum expected departure from $\mu_0$ we wish to detect and $\sigma$. They also depend upon a quantity that is a logarithm, and the logarithm of the quantity $\beta$, no less, that we have always previously dismissed. While a discussion of $\beta$ properly belongs in the realm of elementary statistics, at this point it is worthwhile to go back to some of those discussions, to examine how this impacts our current interests. We will proceed along with this digression in our next chapter.
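To make the decision rule concrete, here is a minimal sketch in Python (not part of the original text). It assumes a one-sided test for an upward shift $\delta > 0$ with known $\sigma$, with both boundary lines sharing the slope $\mu_0 + \delta/2$ and with intercepts $\mp(\sigma^2/\delta)\ln(\cdot)$ as in Wald's sequential probability ratio test, so that crossing the upper line favors the alternative and crossing the lower line favors the null hypothesis. The function names and the example readings are hypothetical.

```python
import math


def boundary_lines(mu0, delta, sigma, alpha, beta):
    """Common slope and the two intercepts of the decision lines for the running sum."""
    slope = mu0 + delta / 2.0
    scale = sigma ** 2 / delta
    h0 = -math.log((1.0 - alpha) / beta) * scale   # lower line (favors the null)
    h1 = math.log((1.0 - beta) / alpha) * scale    # upper line (favors the alternative)
    return slope, h0, h1


def sequential_test(readings, mu0, delta, sigma, alpha=0.05, beta=0.05):
    """Add readings one at a time and stop as soon as the running sum leaves the middle region."""
    slope, h0, h1 = boundary_lines(mu0, delta, sigma, alpha, beta)
    total = 0.0
    for n, x in enumerate(readings, start=1):
        total += x
        if total >= slope * n + h1:
            return "accept alternative", n
        if total <= slope * n + h0:
            return "accept null", n
    return "continue", len(readings)


# Idealized, noise-free readings sitting exactly on the null mean and on the shifted mean.
print(sequential_test([10.0] * 10, mu0=10.0, delta=1.0, sigma=1.0))
print(sequential_test([11.0] * 10, mu0=10.0, delta=1.0, sigma=1.0))
```

With these hypothetical numbers both sequences terminate at the sixth reading, one in each acceptance region; real data, being noisy, would wander in the continuation region for a variable number of readings first.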
REFERENCES 1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991). 2. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978).
18 Experimental Designs: Part 7 – β, the Power of a Test
In Chapter 17 and reference [1], we started discussing the way a series of experiments could be designed so that the decision to perform another experiment could be based on the outcomes of the experiments already done. We saw there that we needed to be able to tell if we could stop because the result had become statistically significant; and we also saw that we needed a way to tell if we could stop because we had reached the statistically significant conclusion that there is no real difference between the sample and the (hypothetical) reference population. This is necessary, indeed crucial; otherwise we could continue experimenting endlessly, waiting for a statistically significant result when there was no real difference to detect, so that none would be expected. The first stopping criterion is straightforward: it is simply the standard hypothesis test, based on the probabilities, which we have previously discussed, of a sample coming from the hypothesized population $P_0$ [2]. The second stopping criterion, however, seems to fly in the face of our previous discussions on the topic, where we said that you could not prove two populations the same. The difficulty in proving that a sample came from a given population is easier to see if we reword the statement as a double negative, and ask whether we can prove that it did not come from a different population. Now the nature of the difficulty becomes clearer: we have no information about the nature of the “different” population that we want to test against. Now that we can see the problem, we can find a point of attack against it. We can hypothesize a population ($P_a$) with any given characteristics we want, and then consider the consequences of dealing with that alternate population.
In particular, we consider the probabilities of either accepting or rejecting our original null hypothesis (based on $P_0$) if, in fact, our sample came from the alternate population $P_a$. The probability of coming to the incorrect conclusion that the sample came from $P_0$ when it really came from $P_a$ is called the $\beta$ probability (compare with the $\alpha$ probability, which is the probability of drawing the incorrect conclusion that a sample did not come from $P_0$ when it really did). Its complement, $1 - \beta$, is known in statistical parlance as the “power” of the statistical test. Thus, in performing a statistical hypothesis test, we would normally consider only the ordinary tests against the alpha error as a means of determining statistical significance. However, as we have seen, that leaves completely open the number of samples needed. The power of a test gives us a criterion which will allow determining the number of samples. To redefine the term: the power of a statistical test is the probability of obtaining a statistically significant result given that in fact the null hypothesis is false. Ordinarily, to show a nonsignificant result is easy: just use few enough samples. To show that you have obtained a nonsignificant result when there is a high probability of obtaining a significant result for a false hypothesis is convincing indeed, and also gives us the basis for determining the number of samples needed. On the other hand, we do not want to go overboard and use so many samples that we get statistically significant results for
tiny, unimportant differences. As we will see below, the power of the test does allow us to specify the minimum number of samples required, but this number can quickly get out of hand, and show up tiny differences, if we are not careful about how we specify the requirements. The problem with defining criteria for such a test is that it depends on the $\beta$ probability, which is difficult to determine (although we could arbitrarily specify a value, such as 95%). It also depends on the smallest difference you need to detect, the number of samples, the variability of the data (which at least can be determined from the data, the same way it is done for determining $\alpha$), and the probability of detecting the given difference at a specified alpha significance level. Thus what we do is to work backwards, so to speak. Since we want to find the number of samples corresponding to different probabilities for $\alpha$, $\beta$ and D (the difference between the data and $\mu_0$), we first find the difference corresponding to given values of the other quantities. This can be seen more easily in Figure 18-1. To summarize Figure 18-1 in words, the top curve represents the characteristics of a population $P_0$ with mean $\mu_0$. Also indicated in Figure 18-1 is the upper critical limit, marking the 95% point for a standard hypothesis test $H_0$ that the mean of a given sample is consistent with $\mu_0$. A measured value above the critical value indicates that it would be “too unlikely” to have come from population $P_0$, so we would conclude that such a reading came from a different population. Two such possible different, or alternate, populations are also shown in Figure 18-1, and labeled $P_1$ and $P_2$. Now, if in fact a random sample was taken from one of these alternate populations, there is a given probability, whose value depends on which population it came from, that it would fall above (or below) the upper critical limit indicated for $H_0$.
The shaded areas in Figure 18-1 indicate the probabilities for a random sample falling below the critical value for $H_0$, when one of those alternate populations is in fact the correct population from which the sample was taken. As can be seen, these probabilities are 50% for population $P_1$ and roughly 5% for population $P_2$. These probabilities are
Figure 18-1 Characteristics of population $P_0$ with mean $\mu_0$ and alternate populations $P_1$ and $P_2$ (note that the X-axes have been offset for clarity). [Distribution curves for $P_0$, $P_1$, and $P_2$, with the upper critical limit for $P_0$ marked.]
the probabilities of (incorrectly) concluding that the data is consistent with H0 , for the two cases. This same topic is continued in our next chapter.
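The 50% and 5% figures quoted for the two alternate populations can be reproduced with a short calculation. The following is a minimal sketch in Python (not part of the original text). It assumes Normal populations sharing one standard deviation, a one-sided 95% upper critical limit, and (hypothetically, since the figure gives no numbers) that the first alternate population is centered exactly on the critical limit while the second lies a further 1.645 standard deviations above it.

```python
from statistics import NormalDist


def beta_probability(mu0, mu_alt, sigma, alpha=0.05):
    """Probability that a reading from the alternate population falls below
    the upper critical limit set for the hypothesized population."""
    critical = mu0 + NormalDist().inv_cdf(1.0 - alpha) * sigma
    beta = NormalDist(mu_alt, sigma).cdf(critical)
    return critical, beta


mu0, sigma = 0.0, 1.0
critical, _ = beta_probability(mu0, mu0, sigma)
z95 = critical - mu0                      # distance of the limit from mu0, in units of sigma

_, beta_p1 = beta_probability(mu0, critical, sigma)        # centered on the critical limit
_, beta_p2 = beta_probability(mu0, critical + z95, sigma)  # a further 1.645 sigma away

print(round(beta_p1, 3), round(beta_p2, 3))
print("power against the second population:", round(1.0 - beta_p2, 3))
```

Under these assumptions the two printed probabilities are 0.5 and 0.05, and the power against the second population is 0.95.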
REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 11(2), 43 (1996). 2. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
19 Experimental Designs: Part 8 – β, the Power of a Test (Continued)
Continuing from our previous discussion in Chapter 18 from reference [1]: analogous to making what we have called (and is the standard statistical terminology) the $\alpha$ error when the data is above the critical value but is really from $P_0$, this new error is called the $\beta$ error, and the corresponding probability is called the $\beta$ probability. As a caveat, we must note that the correct value of $\beta$ can be obtained only subject to the usual considerations of all statistical calculations: errors are random and independent, and so on. In addition, since we do not really know the characteristics of the alternate population, we must make additional assumptions. One of these assumptions is that the standard deviation of the alternate population $P_a$ is the same as that of the hypothesized population $P_0$, regardless of the value of its mean. The existence of the $\beta$ probability provides us with the tool for determining what is called the power of the test, which is just $1 - \beta$, the probability of coming to the correct conclusion when in fact the data did not come from the hypothesized population $P_0$. This is the answer to our earlier question: once we have defined the alternate population $P_a$, we can determine the probability of a sample having come from $P_a$, just as we can determine the probability of that sample having come from $P_0$. So how does this help us determine n? As we know from our previous discussion of the Central Limit Theorem [2], the standard deviation of the mean of a sample from a population decreases from the population standard deviation as n increases. Thus, we can fix $\mu_0$ and $\mu_a$ and adjust the $\alpha$ and $\beta$ probabilities by adjusting n and the critical value. Normally, it is convenient to adjust the critical value to be equidistant from $\mu_0$ and $\mu_a$, and then adjust n so that that critical value represents the desired probability levels for $\alpha$ and $\beta$.
As an example, we can set alpha and beta levels to the same value, which makes for a simple computation of the number of samples needed, at least for the simple case we have been considering: the comparison of means. If we use the 95% value for both (a very stringent test), which corresponds to a Z-value of 1.96 (as we know), then if we let D represent the difference in means between the two values (sample data and population mean), and S is the precision of the data, we find that

$$D \ge 3.92\, S/\sqrt{n} \qquad (19\text{-}1)$$

so that

$$n = (3.92\, S/D)^2 \approx 15\,(S/D)^2 \qquad (19\text{-}2)$$
In words, we would need 15 samples for 95% confidence on both alpha and beta, to distinguish a difference of the means equal to the precision of the measurement, and the number increases as the square of any decrease in difference we want to detect.
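The sample-size rule for the comparison of means can be evaluated for other differences and probability levels. The following is a minimal sketch in Python (not part of the original text). It follows the convention used above of taking the Z-value 1.96 for both the alpha and beta levels; the function name and its defaults are hypothetical.

```python
from statistics import NormalDist


def samples_needed(difference, precision, alpha=0.05, beta=0.05):
    """n such that D = (z_alpha + z_beta) * S / sqrt(n), solved for n (not rounded)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)   # 1.96 at the 95% level
    z_beta = z.inv_cdf(1.0 - beta / 2.0)     # same convention applied to beta
    return ((z_alpha + z_beta) * precision / difference) ** 2


for d in (1.0, 0.5, 0.25):
    print(d, round(samples_needed(d, 1.0), 1))
```

With D equal to S the unrounded result is about 15.4, and each halving of the difference to be detected multiplies the required number of samples by four.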
To compute the power for a hypothesis test based on standard deviation, we would have to read off the corresponding probability points from a chi-square table; for 95% confidences on both alpha and beta, the square root of the ratio of $\chi^2(0.95, v)$ and $\chi^2(0.05, v)$ ($v$ = the degrees of freedom, close enough to n for now) is the ratio of standard deviations that can be distinguished at that level of power. Similarly to the case of the means, $v$ would also be related to the square of that ratio, but $\chi^2$ would still have to be read from tables (or computed numerically). As an example, for 35 samples, the precision of the instrument could not be tested to be better than

$$\sqrt{48.6/21.6} = \sqrt{2.25} = 1.5 \qquad (19\text{-}3)$$
or 1.5 times the precision of the reference method with that amount of power, and as before, n will increase as the square of any improvement we want to demonstrate. The ratio of $\chi^2(0.95, v)$ to $\chi^2(0.05, v)$ does decrease as $v$ increases, but not nearly as fast as the square increases: it is a losing fight. Thus, the use of the concept of the Power of a Test allows specification of the number of samples (although it may turn out to be very high), and by virtue of that forms the basis for performing experiments as a sequential series.
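For readers without a chi-square table to hand, the ratio can be estimated numerically. The following is a minimal sketch in Python (not part of the original text). It swaps the table lookup for the Wilson-Hilferty approximation to the chi-square quantile, which is an approximation rather than the exact tabulated value, and it takes the degrees of freedom as one less than the number of samples; both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_approx(p, dof):
    """Wilson-Hilferty approximation to the p-th quantile of chi-square with dof degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    k = 2.0 / (9.0 * dof)
    return dof * (1.0 - k + z * sqrt(k)) ** 3


def distinguishable_sd_ratio(n_samples, level=0.95):
    """Square root of the ratio of the upper and lower chi-square points."""
    dof = n_samples - 1
    upper = chi2_quantile_approx(level, dof)
    lower = chi2_quantile_approx(1.0 - level, dof)
    return sqrt(upper / lower)


for n in (10, 35, 100, 400):
    print(n, round(distinguishable_sd_ratio(n), 2))
```

For 35 samples the approximation gives a ratio very close to the tabulated 1.5, and the slow decline of the ratio at larger n illustrates the “losing fight” described above.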
REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996). 2. Mark, H. and Workman, J., Spectroscopy 3(1), 44–48 (1988).
20 Experimental Designs: Part 9 – Sequential Designs Concluded
Our previous two chapters, based on references [1, 2], describe how the use of the power concept for a hypothesis test allows us to determine a value for n at which we can state with both $\alpha$ and $\beta$% certainty that the given data either is or is not consistent with the stated null hypothesis $H_0$. To recap those results briefly, as a lead-in for returning to our main topic [3], we showed that the concept of the power of a statistical hypothesis test allowed us to determine both the $\alpha$ and the $\beta$ probabilities, and that these two known values allowed us to then determine, for every n, what was otherwise a “floating” quantity, D. At this point it should be starting to become clear what is going on. If a given set of $\alpha$, $\beta$ and D allow us to determine n, then similarly, a corresponding set of $\alpha$, $\beta$ and n allow us to determine D. Thus for a given $\alpha$ and $\beta$, n and D are functions of each other, and it then becomes a simple matter (at least in principle; in practice the math involved is extremely hairy) to determine the functionality. In fact the actual situation is considerably more complicated to determine mathematically. In our previous discussions, we have made a number of simplifying assumptions which cannot be used if we wish to calculate correct values for our expressions, and for which the actual situation must be incorporated into the math. The first of these assumptions is the use of the Normal distribution. When we perform an experiment using a sequential design, we are implicitly using the experimentally determined value of s, the sample standard deviation, against which to compare the difference between the data and the hypothesis. As we have discussed previously, the use of the experimental value of s for the standard deviation, rather than the population value of $\sigma$, means that we must use the t-distribution as the basis of our comparisons, rather than the Normal distribution.
This, of course, causes a change in the critical value we must consider, especially at small values of n (which is where we want to be working, after all). The other key assumption that we sort of implied was that the standard deviation used for the comparison is constant. Of course we know that as n changes, the comparison value changes as the square root of n. This is on top of and in addition to the changes caused by the use of the t rather than the Normal (Z) distribution. So how is this related to the nature of the graph used for the sequential experimental design? We forgo the detailed math here, in deference to trying to impart an intuitive grasp of the topic, and we have already presented the equations involved [3]. The limits of the allowable values around the hypothesized value close in on it as n increases. This behavior is shown in Figure 20-1. If, in fact, we were to plot the mean of the population as a function of n, it would be a horizontal line, just as shown. The mean of the actual data would vary around this horizontal line (assuming the null hypothesis was correct), at smaller and smaller distances, as n increased.
Figure 20-1 The limits of the allowable values around the hypothesized value close in on it as n increases. [Plot against n showing the mean ($\mu_0$) with the converging upper and lower critical limits.]
If the null hypothesis was wrong, then the data would vary around a line offset from the line representing $\mu_0$, and get closer and closer to that line instead. Eventually, at some value of n, this line would cross the converging lines representing the critical limits around $\mu_0$, indicating the result. This is the basic picture, shown in Figure 20-2. For a sequential experimental plan, the sequence is terminated at the first significant experiment, as shown. The details differ, however. By convention, instead of plotting the mean, $\mu_0$, as a function of n, the sum of the data, which has a theoretical value of $n\mu_0$, is used. Clearly this line will slope upward with a slope of $\mu_0$, instead of being horizontal, as will the data plot. The rest of the conceptual picture is the same, however. As we saw previously in reference [3], the slope of the line represented by $n\mu_0$ is paralleled by the confidence limits for the sum of the data, as represented by the equations in that
Figure 20-2 If the null hypothesis was wrong, then the data would vary around a line offset from the line representing $\mu_0$ and get closer and closer to that line. [Plot against n showing the mean ($\mu_0$), the data mean ($\bar{x}$), the upper and lower critical limits, and the first significant reading.]
Figure 20-3 The approach of the upper line, representing the $\alpha$ probability, corresponds to the approach of the curved lines to the $n \times \mu_0$ line (representing the null hypothesis). [Plot against n showing the lines $n \times \bar{x}$ and $n \times \mu_0$, the upper and lower critical limits, and the first significant point.]
column; thus, at the point where the line representing the successive mean values from the experimental design crosses the confidence limit in Figure 20-2, so does the line representing the successive sums eventually cross the line specified by the equations in reference [3], and illustrated in Figure 20-3 here. According to the derived equations, as we saw previously, the actual confidence limits representing the $\alpha$ and $\beta$ probabilities are straight lines parallel to each other but not parallel to the line representing $n\mu_0$. The approach of the upper line, representing the $\alpha$ probability, corresponds to the approach of the curved lines, shown in Figure 20-3, to the $n\mu_0$ line (representing the null hypothesis) there. The line representing $\beta$, however, being parallel to the $\alpha$ line, departs from the null hypothesis. This can be interpreted as stating, as we have previously implied, that it is always harder to “prove” the null hypothesis than to disprove it.
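The way the critical limits around the hypothesized mean close in as n increases can be tabulated directly. The following is a minimal sketch in Python (not part of the original text). It deliberately keeps the simplification criticized above, using the Normal (Z) distribution with a known standard deviation rather than the t-distribution, and it idealizes the data as a noise-free running mean sitting exactly on the offset line; the function names and numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def critical_limits(mu0, sigma, n, alpha=0.05):
    """Two-sided limits for the mean of n readings under the null hypothesis."""
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma / sqrt(n)
    return mu0 - half_width, mu0 + half_width


def first_significant_n(mu0, offset_mean, sigma, alpha=0.05, max_n=1000):
    """Smallest n at which a mean sitting on the offset line falls outside the limits."""
    for n in range(1, max_n + 1):
        lower, upper = critical_limits(mu0, sigma, n, alpha)
        if not lower <= offset_mean <= upper:
            return n
    return None  # never significant within max_n readings


for n in (1, 4, 16, 64):
    print(n, [round(v, 3) for v in critical_limits(10.0, 1.0, n)])
print(first_significant_n(10.0, 10.5, 1.0))
```

In this idealized case an offset of half a standard deviation first falls outside the limits at n = 16, while a mean that coincides with the hypothesized value never does; with real, noisy data the crossing point varies from one sequence to the next.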
REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996). 2. Mark, H. and Workman, J., Spectroscopy 11(8), 34 (1996). 3. Mark, H. and Workman, J., Spectroscopy 11(4), 32–33 (1996).
21 Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple
For the next several chapters we will illustrate the straightforward calculations used for multivariate regression (MLR), principal components regression (PCR), partial least squares regression (PLS), and singular value decomposition (SVD). In all cases we will use the same notation and perform all mathematical operations using MATLAB (Matrix Laboratory) software [1, 2]. We have already discussed and shown many of the manual methods for calculating the matrix algebra in references [3–6]. Let us begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \qquad (21\text{-}1)$$

Thus, the integers 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample Spectrum #1, 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample Spectrum #2, and so on. If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \qquad (21\text{-}2)$$

we now have the data necessary to calculate the matrix of regression coefficients b, which is given by

$$b = \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix} = (A'A)^{-1} A' c = A^{+} c = \hat{p} \qquad (21\text{-}3)$$

This b (also known as $\hat{p}$, the prediction vector) is often referred to as the regression vector or set of regression coefficients. Note that $(A'A)^{-1}A'$ is referred to as the pseudoinverse of A, designated as $A^{+}$. Note that there is one regression coefficient for each frequency (or data channel). The matrix of predicted values is easily obtained as Matrix A (the data matrix) × Vector b (the regression coefficients) = Vector c (the predicted values). This is shown in matrix notation as

$$A \times b = c \qquad (21\text{-}4)$$
Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of simple matrix operations as shown in Table 21-1 below:

Table 21-1 Matrix operations in MATLAB to compute equations 21-1 to 21-4

Command line                  Comments
>> A = [1 7;4 10;6 14]        Enter the A matrix
A =                           Display the A matrix
     1     7
     4    10
     6    14
>> c = [4;8;11]               Enter the concentration vector c
c =                           Display the concentration vector c
     4
     8
    11
>> b = inv(A'*A)*A'*c         Calculate the regression vector
                              [Note: The inverse applies only to (A'*A)]
b =                           Display the regression vector b
    0.7722
    0.4662
>> A*b                        Predict the concentrations
ans =                         [Note: A residual (or difference) exists between the
    4.0356                    predicted and the actual concentrations 4, 8, and 11]
    7.7509
   11.1601
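The numbers in Table 21-1 can be cross-checked outside MATLAB. The following is a minimal sketch in plain Python (not part of the original text) that solves the normal equations for the two-channel case by writing out the 2 × 2 inverse by hand; the function names are hypothetical and the code handles only two data channels.

```python
def regression_vector(A, c):
    """b = inv(A'A) * A'c for a data matrix with exactly two columns."""
    # Elements of the symmetric 2 x 2 matrix A'A
    s11 = sum(row[0] * row[0] for row in A)
    s12 = sum(row[0] * row[1] for row in A)
    s22 = sum(row[1] * row[1] for row in A)
    # Elements of the 2-vector A'c
    t1 = sum(row[0] * ci for row, ci in zip(A, c))
    t2 = sum(row[1] * ci for row, ci in zip(A, c))
    det = s11 * s22 - s12 * s12
    # Closed-form inverse of the 2 x 2 matrix applied to A'c
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]


def predict(A, b):
    """Predicted concentrations A*b."""
    return [row[0] * b[0] + row[1] * b[1] for row in A]


A = [[1, 7], [4, 10], [6, 14]]
c = [4, 8, 11]
b = regression_vector(A, c)
c_hat = predict(A, b)
residuals = [ci - pi for ci, pi in zip(c, c_hat)]

print([round(v, 4) for v in b])
print([round(v, 4) for v in c_hat])
print([round(v, 4) for v in residuals])
```

The printed regression vector and predicted concentrations agree with the table to four decimal places, and the third line shows the residuals the table's note refers to.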
REFERENCES 1. MATLAB software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. 2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989). 3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993). 4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994). 5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994). 6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
22 Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple
For the next several chapters in this book we will illustrate the straightforward calculations used for multivariate regression. In each case we continue to perform all mathematical operations using MATLAB software [1, 2]. We have already discussed and shown the manual methods for calculating most of the matrix algebra used here in references [3–6]. You may wish to program these operations yourselves or use other software to routinely make these calculations. As in Chapter 21, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \qquad (22\text{-}1)$$
Thus, 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample spectrum #1; 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample spectrum #2, and so on. We now have the data necessary to calculate the singular value decomposition (SVD) for matrix A. The operation performed in SVD is sometimes referred to as eigenanalysis, principal components analysis, or factor analysis. If we perform SVD on the A matrix, the result is three matrices, termed the left singular values (LSV) matrix or the U matrix; the singular values matrix (SVM) or the S matrix; and the right singular values matrix (RSV) or the V matrix. We now have enough information to find our Scores matrix and Loadings matrix. First of all, the Loadings matrix is simply the right singular values matrix or the V matrix; this matrix is referred to as the P matrix in principal components analysis terminology. The Scores matrix is calculated as

The data matrix A × the Loadings matrix V = Scores matrix T    (22-2)
Note: the Scores matrix is referred to as the T matrix in principal components analysis terminology. Let us look at what we have completed so far by showing the SVD calculations in MATLAB as illustrated in Table 22-1.
Table 22-1 Matrix operations in MATLAB to compute the SVD of data matrix A

Command line                        Comments
>> A = [1 7;4 10;6 14]              Enter the A matrix
A =                                 Display the A matrix
     1     7
     4    10
     6    14
>> [U,S,V] = svd(A);                Perform SVD on the A matrix
>> U                                Display the U matrix or the left singular
U =                                 values (LSV) matrix
   -0.3468   -0.9303   -0.1193
   -0.5417    0.0949    0.8352
   -0.7656    0.3543   -0.5369
>> S                                Display the S matrix or the singular
S =                                 values (SV) matrix
   19.8785         0
         0    1.6865
         0         0
>> V                                Display the V matrix or the right singular
V =                                 values (RSV) matrix (Note: this is also known
   -0.3576    0.9339                as the P matrix or Loadings matrix)
   -0.9339   -0.3576
>> T = A*V                          Calculate the Scores matrix or the T matrix
T =
   -6.8948   -1.5690
  -10.7691    0.1600
  -15.2198    0.5976
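The singular values and the first loading in Table 22-1 can be checked without an SVD routine, because for a two-channel data matrix they follow from the eigenvalues of the 2 × 2 matrix A'A in closed form. The following is a minimal sketch in plain Python (not part of the original text); it is a check for this small example only, not a general SVD algorithm, and the function name is hypothetical. The sign of a loading vector is arbitrary, so the values here may differ in sign from the MATLAB output.

```python
from math import sqrt


def two_channel_svd_summary(A):
    """Singular values and first loading vector of an m x 2 matrix via the eigenvalues of A'A."""
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    trace = g11 + g22
    det = g11 * g22 - g12 * g12
    root = sqrt(trace * trace - 4.0 * det)
    lam1 = (trace + root) / 2.0          # larger eigenvalue of A'A
    lam2 = (trace - root) / 2.0          # smaller eigenvalue of A'A
    # (A'A - lam1*I) v = 0 gives v proportional to (g12, lam1 - g11)
    x, y = g12, lam1 - g11
    norm = sqrt(x * x + y * y)
    return sqrt(lam1), sqrt(lam2), (x / norm, y / norm)


A = [[1, 7], [4, 10], [6, 14]]
s1, s2, v1 = two_channel_svd_summary(A)
scores1 = [r[0] * v1[0] + r[1] * v1[1] for r in A]   # first column of T = A*V, up to sign

print(round(s1, 4), round(s2, 4))
print([round(v, 4) for v in v1])
print([round(t, 4) for t in scores1])
```

The two singular values match the diagonal of S, and the loading and score magnitudes match the first columns of V and T.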
If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \qquad (22\text{-}3)$$

we can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of principal components (i.e., columns in our T and V matrices) as

$$A\ (\text{estimated}) = T \times V' \qquad (22\text{-}4)$$
The set of regression coefficients (i.e., the regression vector) is calculated as

$$b\ (\text{regression vector}) = V \times S^{-1} \times U' \times c \qquad (22\text{-}5)$$
Table 22-2 Matrix operations in MATLAB to compute equations 22-4 to 22-6

Command line                        Comments
>> Aest = T*V'                      Estimate the A data matrix
Aest =                              Display the estimate for A
    1.0000    7.0000
    4.0000   10.0000
    6.0000   14.0000
>> b = V(:,1:2)*inv(S(1:2,1:2))*U(:,1:2)'*c;
                                    Calculate the regression vector
                                    [Note: The inverse operation refers only to the
                                    singular values matrix S. The calculation to
                                    determine b can only be performed using two columns
                                    in each of the V, S, and U matrices; this number is
                                    equivalent to the number of latent variables (or
                                    principal components) used.]
b =                                 Display the regression vector
    0.7722
    0.4662
>> cest = (T*V')*b                  Predict the concentrations
                                    [Note: This computation is equivalent to (Aest × b).]
cest =                              Display the concentration vector
    4.0356                          [Note: For this example of PCR a residual (or
    7.7509                          difference) exists between the predicted and the
   11.1601                          actual concentrations 4, 8, and 11.]
The predicted or estimated values of c are computed as

\[ c_{\text{estimated}} = (T \times V') \times b \tag{22-6} \]
Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 22-2.
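The numbers in Table 22-2 can be cross-checked without an SVD routine. The sketch below is in Python rather than MATLAB and uses only plain lists; the function names are illustrative and do not come from the text. It rests on one assumption: because both principal components are retained, V × S⁻¹ × U′ is the pseudoinverse of A, so the PCR regression vector should coincide with the ordinary least-squares solution of A × b = c.

```python
# Cross-check of Table 22-2 in plain Python (no libraries).
# Assumption: both principal components are kept, so the PCR regression
# vector equals the least-squares solution of A*b = c.

A = [[1.0, 7.0], [4.0, 10.0], [6.0, 14.0]]   # absorbances, samples x channels
c = [4.0, 8.0, 11.0]                          # concentrations


def least_squares_2col(A, c):
    """Solve the 2x2 normal equations (A'A) b = A'c by Cramer's rule."""
    s11 = sum(r[0] * r[0] for r in A)
    s12 = sum(r[0] * r[1] for r in A)
    s22 = sum(r[1] * r[1] for r in A)
    g1 = sum(r[0] * ci for r, ci in zip(A, c))
    g2 = sum(r[1] * ci for r, ci in zip(A, c))
    det = s11 * s22 - s12 * s12
    return [(s22 * g1 - s12 * g2) / det, (s11 * g2 - s12 * g1) / det]


def predict(A, b):
    """Multiply each spectrum (row of A) by the regression vector b."""
    return [sum(a * bj for a, bj in zip(row, b)) for row in A]


b = least_squares_2col(A, c)
cest = predict(A, b)
print([round(v, 4) for v in b])
print([round(v, 4) for v in cest])
```

If fewer components than the rank of A were retained, this shortcut would no longer apply and the truncated SVD expression of equation 22-5 would have to be evaluated directly.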
REFERENCES
1. MATLAB software from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected].
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
23 Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple
For the past three chapters we have described the most basic calculations for MLR, PCR, and PLS. Our intent is to show basic computations for these regression methods while avoiding unnecessary complexity that could confuse rather than instruct. There are of course a number of difficulties in taking this simplistic approach; namely, the assumptions made for our simple cases do not always hold, and poorly behaved matrices are the rule rather than the exception. We have not yet discussed the concepts of rank, collinearity, scaling, or data conditioning. Issues of graphical representation, details of computational methods, and the assessment of model performance are forthcoming. We ask that you bear with us over the next several chapters, as we intend to delve much more deeply into the details and problems associated with regression methods.

For this chapter we will illustrate the straightforward calculations used for PLS regression utilizing singular value decomposition. For PLS a special case of SVD is used. You will notice that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A: the scores and loadings are determined using the concentration values for PLS-SVD, whereas only the data matrix A is used to perform SVD for principal components analysis. SVD and PLS-SVD will be the subject of several future chapters, so we will only introduce their use here and not their derivation. All mathematical operations are completed using MATLAB software [1, 2]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [3–7]. You may wish to program these operations yourselves or use other software to routinely make the calculations.

As in our last installment, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix.
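For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

\[ A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{23-1} \]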
Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on.
If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

\[ c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{23-2} \]

we now have both the data matrix A and the concentration vector c required to calculate PLS-SVD. Both A and c are necessary to calculate the special case of PLS singular value decomposition (PLS-SVD). The operation performed in PLS-SVD is sometimes referred to as the PLS form of eigenanalysis, or factor analysis. If we perform PLS-SVD on the A matrix and the c vector, the result is three matrices, termed the left singular values (LSV) matrix or the U matrix; the singular values matrix (SVM) or the S matrix; and the right singular values matrix (RSV) or the V matrix. We now have enough information to find our PLS Scores matrix and PLS Loadings matrix. First, the PLS Loadings matrix is simply the right singular values matrix or the V matrix; this matrix is referred to as the P matrix in principal components analysis and partial least squares terminology. The PLS Scores matrix is calculated as

\[ \text{The data matrix } A \times \text{the PLS Loadings matrix } V = \text{PLS Scores matrix } T \tag{23-3} \]
Note: the PLS Scores matrix is referred to as the T matrix in principal components analysis and partial least squares terminology. Let us look at what we have completed so far by showing the PLS-SVD calculations in MATLAB as illustrated in Table 23-1. We can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of factors (i.e., columns in our T and V matrices) as

\[ A_{\text{estimated}} = T \times V' \tag{23-4} \]

The set of regression coefficients (i.e., the regression vector) is calculated as

\[ b_{\text{regression vector}} = V \times S^{-1} \times U' \times c \tag{23-5} \]

The predicted or estimated values of c are computed as

\[ c_{\text{estimated}} = T \times V' \times b \tag{23-6} \]

This expression is equivalent to

\[ c_{\text{estimated}} = A_{\text{estimated}} \times b = A \times b \tag{23-7} \]

or can be used to predict a single sample spectrum a using the expression

\[ c_{\text{estimated}} = a_{\text{estimated}} \times b = a \times b \tag{23-8} \]

Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 23-2.
Table 23-1 Matrix operations in MATLAB to compute the PLS SVD calculations of data matrix A (see equations 23-1 through 23-3)

Command line                          Comments
>> A = [1 7 9; 4 10 12; 6 14 16]      Enter the A matrix
A =                                   Display the A matrix
     1     7     9
     4    10    12
     6    14    16
>> c = [4;8;11]                       Enter the c vector
c =                                   Display the c vector
     4
     8
    11
>> [U,S,V] = SVDPLS(A,c,3);           Perform PLS SVD on the A matrix. This is a CPAC [7] version of the PLS SVD algorithm.
>> U                                  Display the U matrix or the left singular values (LSV) matrix
U =
    0.3817   -0.9067   -0.1797
    0.5451    0.0638    0.8359
    0.7465    0.4170   -0.5186
>> S                                  Display the S matrix or the singular values (SV) matrix
S =
   29.5796   -0.2076    0.0000
    0.0000    1.9904   -0.0367
    0.0000    0.0000    0.2038
>> V                                  Display the PLS V matrix or the right singular values (RSV) matrix (Note: this is also known as the P matrix or PLS Loadings matrix)
V =
    0.2446    0.9345    0.2588
    0.6283    0.0506   -0.7764
    0.7386   -0.3525    0.5747
>> T = A*V                            Calculate the PLS Scores matrix or the T matrix
T =
   11.2894   -1.8839   -0.0034
   16.1236    0.0138    0.1680
   22.0801    0.6750   -0.1210
Table 23-2 Matrix operations in MATLAB to compute equations 23-4 through 23-8

Command line                          Comments
>> Aest = T*V'                        Estimate the A data matrix
Aest =                                Display the estimate for A
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000
>> b = V*inv(S)*U'*c                  Calculate the PLS regression vector [Note: The inverse operation refers only to the singular values matrix S. The calculation to determine b is performed using three columns in each of the V, S, and U matrices; this number is equivalent to the number of latent variables (or PLS factors) used.]
b =                                   Display the regression vector
    1.1667
   -0.6667
    0.8333
>> cest = T*V'*b                      Predict the concentrations [Note: This computation is equivalent to (Aest × b).]
cest =                                Display the concentration vector [Note: For this simple example of PLS no residual (or difference) exists between the predicted and the actual concentrations 4, 8, and 11.]
    4.0000
    8.0000
   11.0000
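The zero residual in Table 23-2 can be verified independently of the SVDPLS routine. The following is a minimal sketch in Python (not MATLAB), with a hypothetical helper named `solve`. It assumes that all three PLS factors are retained and that A is square and nonsingular, in which case V × S⁻¹ × U′ is simply the inverse of A and b is the exact solution of A × b = c.

```python
from fractions import Fraction

# Cross-check of Table 23-2 in plain Python with exact rational arithmetic.
# Assumption: all three factors are kept and A is nonsingular, so the
# regression vector b is the exact solution of A*b = c.

A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]   # absorbances, samples x channels
c = [4, 8, 11]                               # concentrations


def solve(A, c):
    """Gauss-Jordan elimination on the augmented matrix [A | c]."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(ci)] for row, ci in zip(A, c)]
    for col in range(n):
        # find a row with a nonzero pivot and move it into position
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


b = solve(A, c)
cest = [sum(a * bj for a, bj in zip(row, b)) for row in A]
# Equation 23-8 applied to one spectrum (here the second row of A)
single = sum(a * bj for a, bj in zip([4, 10, 12], b))

print([round(float(v), 4) for v in b])
print([float(v) for v in cest], float(single))
```

With three samples and three factors the model has as many coefficients as there are samples, which is why the fit is exact; this says nothing about how well such a model would predict new samples.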
REFERENCES
1. MATLAB software Version 4.2 for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected].
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
7. Center for Process Analytical Chemistry, University of Washington, Seattle, WA, m-script library, 1993 (Contact Mel Koch or Dave Veltkamp for current versions).
24 Looking Behind and Ahead: Interlude
We depart from discussion of our usual topics in this chapter. Over the years since beginning writing on this topic, there has been a spate of telephone calls where the callers, after introducing themselves, said something that could generically be rendered as: “By chance I came across a copy of one of your articles, and am interested in reading more about this subject. Are there any more articles like this, and what are they, and how can I get them?” After discussing this between ourselves, we decided that we have reached a point where it is worthwhile to present our readers with a complete set of the chemometrics writings published to date. Those of you who have been reading our work for a long time will recall that the column series “Chemometrics in Spectroscopy” is a continuation of our previous column series, “Statistics in Spectroscopy”. Statistics in Spectroscopy was published from 1986 to 1992, with some preliminary articles in 1985. The columns from the earlier series, “Statistics in Spectroscopy”, have been collected and published in their entirety as a book (with minor editorial changes appropriate to the change in format from a series of columns to a book) of the same name, now in its second edition. So much for the past; what about the discussion? The last few chapters have been presenting the “nuts and bolts” of some of the more common chemometric techniques for performing quantitative chemometric/spectroscopic calibration, even getting down to the level of a “cookbook” of actual code (written for the MATLAB Matrix Algebra multivariate analysis software). The following chapters will deal first with completing a discussion on the various chemometric techniques in current use, and then to go “under the hood” with them to emphasize the underlying mathematical and theoretical framework that these methods rest upon. 
One upcoming topic will be a description of the so-called “statistical design of experiments” methodologies, emphasizing those techniques that tend to be obscure but are more useful than their treatment in mainstream Chemometric discussions would suggest.
25 A Simple Question: The Meaning of Chemometrics Pondered
In a 1997 paper, Steve Brown and Barry Lavine state, “Chemometrics is not a subfield of Statistics. Although statistical methods are employed in Chemometrics, they are not the primary vehicles for data analysis” [1]. Parenthetically, we recommend this article as a very nice nonmathematical introduction for the average chemist as to what Chemometrics is, and how it can be used. As far as the quote is concerned, we have to both agree and disagree. On the one hand, we have to recognize the de facto truth that many users of Chemometric techniques are not aware of the statistical background of the techniques, and indeed, we sometimes suspect that even the developers of those techniques may not be aware of the statistical considerations, or at least may not give them their proper weight. Having said that, we will issue some disclaimers a little further on, because there are some legitimate and justifiable reasons for the existence of this situation. However, ignoring the existence of this situation means that nobody is paying the attention that would eventually lead to the condition being corrected, which would result in a better theoretical understanding of the techniques themselves, with a concomitant improvement in their reliability and definition of their range of applicability.

This leads us to the other hand, which, it should be obvious, is that we feel that Chemometrics should be considered a subfield of Statistics, for the reasons given above. Questions currently plaguing us, such as “How many MLR/PCA/PLS factors should I use in my model?”, “Can I transfer my calibration model?” (or more importantly and fundamentally: “How can I tell if I can transfer my calibration model?”), may never be answered in a completely rigorous and satisfactory fashion, but certainly improvements in the current state of knowledge should be attainable, with attendant improvements in the answers to such questions.
New questions may arise which only fundamental statistical/probabilistic considerations may answer; one that has recently come to our attention is, “What is the best way to create a qualitative (i.e., identification) model, if there may be errors in the classifications of the samples used for training the algorithm?” Part of the problem, of course, is that the statistical questions involved are very difficult, and have not yet been solved completely and rigorously even by statisticians. Another part of the problem is that very few first-class statisticians are interested in, or perhaps even aware of, the existence of our subdiscipline or its problems. Thus of necessity we push on and muddle through in the face of not always having a completely firm, mathematically rigorous foundation on which to base our use of the techniques we deal with (here comes our disclaimer). So we use these techniques anyway because otherwise we would have nothing: if we waited for complete rigor before we did anything, we would likely be waiting a long, long time, maybe indefinitely, for a solution that might never appear, and in the meanwhile be helpless in the face of the real (and real-world) problems that confront us.
But that does not mean that we should not fight the good fight while we are trying to solve current problems, or let that effort distract us. This means two things. The first is to do as we have been doing, and use our imperfect tools and our imperfect understanding of them, to continue to solve problems as best we can. But the second thing we need to do is what we have not been doing, which is to improve our understanding of the tools we use. In this endeavor, more widespread and better understanding and application of the fundamental statistical/probabilistic basis of our chemometric algorithms is crucial. Maybe one of the things we need to accomplish this is to recruit more first-class statisticians into our ranks, so that they can pay proper attention to the fundamentals, and explain them to the rest of us. Also each of us should pay attention and put some effort into learning more about these fundamentals ourselves. Then we could ourselves better understand the phenomena we see occurring in our data and analyses thereof, and then maybe eventually learn how to deal with them properly.

In order to appreciate how understanding new statistical concepts can help us, let us look at an example of where we can better apply known statistical concepts, to understand phenomena currently afflicting us. To this end, let us pose the seemingly innocuous question: “When doing quantitative calibration, why is it that we use the formulation of the problem that makes the constituent values the dependent (i.e., the Y) variable, and make the spectroscopic data the X (or independent) variable, called the Inverse Beer’s Law formulation (sometimes called the P-matrix formulation)?” (For that matter, why is the formulation that we most commonly use called “Inverse Beer’s Law” instead of the direct “Beer’s Law”?) Now, we are sure that everybody reading this chapter thinks they know the answer.
Now, if you are among those readers, then you are wrong already, because there are multiple answers to this question, all of them correct, and each of them incomplete. Let us dispose of the most common answer first. This answer is the one given in most of the discussions about the relative merits of the two formulations, e.g. [2], and is essentially a practical one: we use the Inverse Beer’s Law formulation because by doing so, we need only determine the concentration(s) of the analyte(s) of interest. In the Beer’s Law formulation, you must determine the concentrations of all components in a mixture, whether they are of interest or not. Of course, there is benefit to that also; as Malinowski points out, you can determine the number of components in a mixture and their spectra, as well as their concentrations, by proper application of the techniques of factor analysis in such a case [3].

The second answer is similar, but even more simplistic. Figure 25-1 shows a graphical depiction of a two-wavelength calibration situation: the values on the two wavelength axes determine the point on the calibration plane from which to strike a line to the concentration axis. The situation, however, is symmetric; so why don’t we consider the possibility of using the value along one of the wavelength axes along with the concentration value to determine the value along the other wavelength axis? In theory this could be done, but the reason we do not do it is the same as the answer to the main question above: we do not care; this case is of no interest to us. As chemists, we are interested in determining quantities of chemical interest, and we use the spectroscopic values as a means of attaining this goal; the reverse calculation is of no interest to us as chemists.

Figure 25-1 Symbolic graphical depiction of a two-wavelength calibration. [Figure not reproduced: axes WL 1, WL 2, and CONC, with the calibration plane.]

None of these answers deals with fundamentals. So finally we get to the substantive part of the discussion, the one that connects with our original diatribe concerning the goal and role of Statistics in Chemometric calculations, the one that will give us an answer to our original question that is based on fundamental considerations, and therefore the one that is the purpose of this whole discussion. To fully appreciate the point we have to go back a bit and look at the historical development of spectroscopic quantitative analysis. Back when we were in school and taking academic courses in Analytical Chemistry, spectroscopy was only one of many techniques presented (and one of the “minor” ones, at that). Now, we cannot really compare our experiences with what is being done currently because we are somewhat out of touch with academia, but back then what we now call the Beer’s Law formulation (i.e., making the constituent concentration the X variable) was the one presented and taught, and we were required to use it. Of course, as an academic exercise the system was simplified: there was only one analyte in a pure solvent, so in principle it would seem that we could have put either variable on the X axis. Nowadays, standard practice would impel us to put the analyte concentration on the Y axis even in this simplified situation (whether it belonged there or not).

What has changed between then and now? Well, in fact a considerable amount has changed, in both the nature of the situation surrounding the analysis and the instruments we use to do the measurements. Back in the days of our academic exercises, spectrometers were based on vacuum-tube technology (remember them? – or are we dating ourselves?), were noisy, drifted terribly, and were full of all manner of error sources. The samples we used to calibrate the instrument, on the other hand, were made synthetically, by weighing the analyte on an analytical balance and dissolving it in the fixed volume of a volumetric flask. Both of these items were considered to be the highest-precision, highest-accuracy measuring devices available.
Therefore, in those days, the accuracy of the spectroscopic measurements was considered to be far inferior to the accuracy of the training samples. In those days, Statistics was more highly regarded than it is now, and the analytical chemists then knew the fundamental requirements of doing calibration work. There are several; we need not go into all of them now, but the one that is pertinent to our current discussion is the one that states that, while the Y variable may contain error, the X variable must be known without error. Now, in the real world this is never true, since all quantities are the result of some measurement, which will therefore have error
associated with it. In practice, however, it is sometimes possible to reduce the error to a sufficiently small value that it approximates zero well enough for the calibration calculations to work. What happens if we do not manage to keep the X error “sufficiently small”? Let us examine a situation which is just complicated enough to show the effects; three sets of data are presented in Table 25-1, which we will use along with some of the statistics associated with calibration calculations based on those data. Graphical representations of the three data sets are displayed in Figures 25-2(a) through 25-2(c), so that the respective models can be compared to the data.

Table 25-1 Three sets of data illustrating the effect of errors in X and in Y on the results obtained by calibration

(A) No error
Sample #      X       Y
1             0       0
2             0       0
3            10      10
4            10      10
Intercept = 0; Slope = 1; Correlation coeff = 1; SEE = 0; PRESS = 0

(B) Error in Y
Sample #      X       Y
1             0      -1
2             0       1
3            10       9
4            10      11
Intercept = 0; Slope = 1; Correlation coeff = 0.98058; SEE = 1.4142; PRESS = 2.000

(C) Error in X
Sample #      X       Y
1            -1       0
2             1       0
3             9      10
4            11      10
Intercept = 0.19231; Slope = 0.96154; Correlation coeff = 0.98058; SEE = 1.38675; PRESS = 1.92018

Figure 25-2 Graphical representation of three regression situations. (a) No error. (b) Error in y only. (c) Error in x only. See text for discussion. [Figure not reproduced: each panel plots Y against X and marks the correct model; panel (c) also marks the calculated model.]

We present univariate data, since that shows the effects we wish to illustrate, and is the simplest example that will do so. The biggest advantage to a scenario like this is that we know the “right” answer, because we can make it whatever we want it to be. In this case, the right answer is that the intercept is zero and the slope is 1 (unity). Table 25-1A represents this condition with four samples whose data follow that model without error. The data in Table 25-1A are the prototype data upon which we will build data containing error, and investigate the effects of errors in Y and in X. We use four data points, in coincident pairs, so that when we introduce error, we can retain certain important properties that will result in the same model being the correct one for the data. Along with the data, we show the results of doing the calibration calculations on the data. For Table 25-1A, the slope and the intercept are as we described, the error (which we measure as both the Standard Error of Estimate [SEE] and using cross-validation [the PRESS statistic, using the leave-one-out algorithm]) is zero (naturally), and the correlation coefficient is unity – a necessary concomitant of having zero error.
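The statistics quoted in Table 25-1 can be recomputed in a few lines. The sketch below is in plain Python, and its function names are illustrative only. One point is an inference on our part rather than something the text states: the tabulated PRESS values appear to match the root-mean-square of the leave-one-out prediction errors, not their raw sum of squares, so that is how PRESS is computed here. SEE is taken as the square root of the residual sum of squares divided by n − 2.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def calibration_stats(x, y):
    """Intercept, slope, correlation, SEE, and leave-one-out PRESS (as an RMS)."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    loo_ss = 0.0
    for i in range(n):
        # refit without sample i, then predict the sample that was left out
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        loo_ss += (y[i] - b0 - b1 * x[i]) ** 2
    return {
        "intercept": intercept,
        "slope": slope,
        "r": sxy / math.sqrt(sxx * syy),
        "see": math.sqrt(sse / (n - 2)),   # two fitted coefficients
        "press": math.sqrt(loo_ss / n),    # assumed RMS form, see lead-in
    }


data = {
    "A": ([0, 0, 10, 10], [0, 0, 10, 10]),    # no error
    "B": ([0, 0, 10, 10], [-1, 1, 9, 11]),    # error in Y
    "C": ([-1, 1, 9, 11], [0, 0, 10, 10]),    # error in X
}
results = {name: calibration_stats(x, y) for name, (x, y) in data.items()}
for name, s in results.items():
    print(name, {k: round(v, 5) for k, v in s.items()})
```

Swapping the X and Y lists of data set B is all that is needed to produce data set C, which makes it easy to experiment with other error patterns.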
124
Chemometrics in Spectroscopy
Now in Table 25-1B, we introduce error into the Y variable. We do so by adding +1 to one each of the high and low values, and −1 to each of the other high and low values. This maintains symmetry and keeps the average position of each pair of points the same, which guarantees that the correct model for the data does not change. This is in accordance with theory and is borne out when the calibration calculations are performed: the model is identical, even though the error (SEE) is no longer zero and the correlation coefficient is no longer unity. Go ahead: redo the calculations and check this out for yourself. Now, the purists and the sharper-eyed among us may argue that another requirement of regression theory is that the errors follow a Normal (i.e., Gaussian) distribution and that these errors are not distributed properly. We counter this argument by pointing out that there is not enough data to tell the difference; there is no significance test that can be used to demonstrate that the data either do or do not follow any predetermined distribution.

Finally, and of most interest, are the data in Table 25-1C. Here we have taken the same errors as in Table 25-1B and applied them to the X variable rather than the Y variable. By symmetry arguments, we might expect that we should find the same results as in Table 25-1B. In fact, however, the results are different, in several notable ways. In the first place, we arrive at the wrong model. We know that this model is not correct because we know what the right model is, since we predetermined it. This is the first place where what the statisticians have told us about the results is seen. In statistical parlance, the presence of error in the X variable “biases the coefficient toward zero”, and so we find: the slope is decreased (always decreased) from the correct value (of unity, with this data) to 0.96+. So the first problem is that we obtain the wrong model. The next item we will look at is the correlation coefficient.
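Before turning to that, the size of the slope bias in Table 25-1C can be traced by hand. The worked equations below use the usual least-squares expressions for a straight line; the symbols S_xx, S_xy, and δ are notation introduced here for illustration and are not taken from the text. Write the error-free values as x_i and the added errors as δ_i = (−1, +1, −1, +1).

\[ \text{slope} = \frac{S_{xy}}{S_{xx}}, \qquad S_{xx} = \sum_i (x_i - \bar{x})^2, \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]

For the error-free data of Table 25-1A, S_xx = S_xy = 100. The errors average to zero and satisfy \(\sum_i (x_i - \bar{x})\,\delta_i = (-5)(-1) + (-5)(+1) + (5)(-1) + (5)(+1) = 0\), so adding them to Y leaves both sums unchanged:

\[ \text{Table 25-1B:} \quad S_{xx} = 100, \quad S_{xy} = 100 + \sum_i (x_i - \bar{x})\,\delta_i = 100, \quad \text{slope} = \frac{100}{100} = 1 \]

Adding the same errors to X leaves the numerator unchanged but inflates the denominator by the sum of squares of the errors:

\[ \text{Table 25-1C:} \quad S_{xx} = 100 + 2\sum_i (x_i - \bar{x})\,\delta_i + \sum_i \delta_i^2 = 104, \quad S_{xy} = 100, \quad \text{slope} = \frac{100}{104} = 0.96154 \]

\[ \text{intercept} = \bar{y} - \text{slope} \times \bar{x} = 5 - 0.96154 \times 5 = 0.19231 \]

Since the sum of squares of the X errors can only add to the denominator in this situation, the slope can only shrink, which is the sense of “biases the coefficient toward zero”.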
The correlation coefficient for Table 25-1C is identical to that in Table 25-1B. There is nothing particularly noteworthy about this, except that the correlation coefficient is useless as a means of distinguishing between the two cases: obviously, since we obtain the same result in both situations, we cannot tell from the value of the correlation coefficient which situation we are dealing with. Now we come to the Standard Error of Estimate and the PRESS statistic, which show interesting behavior indeed. Compare the values of these statistics in Tables 25-1B and 25-1C. Note that the value in Table 25-1C is lower than the value in Table 25-1B. Thus, using either of these as a guide, an analyst would prefer the model of Table 25-1C to that of Table 25-1B. But we know a priori that the model in Table 25-1C is the wrong model. Therefore we come to the inescapable conclusion that in the presence of error in the X variable, the use of SEE, or even cross-validation, as an indicator is worse than useless, since it is actively misleading us as to the correct model to use to describe the data.

This is for univariate data; what happens in the case of multivariate (multiwavelength) spectroscopic analysis? The same thing, only worse. To calculate the effects rigorously and quantitatively is an extremely difficult exercise for the multivariate case, because not only are the errors themselves involved, but in addition the correlation structure of the data exacerbates the effects. Qualitatively we can note that, just as in the univariate case, the presence of error in the absorbance data will “bias the coefficient(s) toward zero”, to use the formal statistical description. In the multivariate case, however, each coefficient will be biased by a different amount, reflecting the different amounts of noise (or error, more generally) affecting the data at different wavelengths. As mentioned above, these
effects will be exacerbated by intercorrelation between the data at different wavelengths. The difficulty comes when you realize that it is not simply the correlations between pairs of wavelengths that are operative in this regard, but also the intercorrelation effects of the data when the wavelengths are taken 3, 4, ..., n at a time. This is what has made the problem so intractable. Now, we are sure that there are some readers who will read this and say something along the lines of “well, all you need do is do a PCA/PLS analysis and get rid of all those effects”. Actually, there might be a germ of truth to that – if you can always do all your calibration modeling using only the first two or three PCA or PLS factors. Beyond that you will run into what we might almost call the Law of Conservation of Error (except for the fact that, as we all know, error is much easier to create than destroy!). In special cases, however, such as PCA and PLS, the total error really is constant, so that we quickly get into territory where the noise that you pushed out of the first couple of factors reappears, and affects the higher factors even more than the original noise affected the original data.

So in the long-gone days of our academic lives, the chemical measurements, being based on high-accuracy gravimetric and volumetric techniques, were indeed the proper ones to put on the X axis. Contrast that with the current state of technology: instruments have improved enormously, and rather than making up training samples by simple gravimetric dilutions, we often obtain our training, or reference, values through complicated analytical methodologies, which are themselves fraught with so much error that even in favorable cases, the error can be 5–10% of the analytical value. In our current practice, therefore, the error in the reference lab values really is greater than the error in the absorbance data.
For this reason it is now appropriate to reverse the positions of the concentration and absorbance values relative to their place in the calculation schema. So it is the changing nature of the world and the types of analyses we do that dictate how we go about organizing the calculations we use to do them. This comes from fundamental considerations of the behavior of the modeling process, which the science of Statistics can tell us about.
REFERENCES
1. Lavine, B.K. and Brown, S., Today’s Chemist at Work 6(9), 29–37 (1997).
2. Brown, C.W., Spectroscopy 1(4), 32–37 (1986).
3. Malinowski, E.R., Factor Analysis in Chemistry, 2nd ed. (John Wiley & Sons, New York, 1991).
26 Calculating the Solution for Regression Techniques: Part 4 – Singular Value Decomposition
In Chapters 21–23 and in this chapter, we have described the most basic calculations for MLR, PCR, and PLS. To reiterate, our intention is to demonstrate these basic computations for each mathematical method presently, and then to delve into greater detail as the chapters progress; consider these articles linear algebra bytes. For this chapter we will illustrate the basic calculation and mathematical relationships of different matrices for the calculations of Singular Value Decomposition or SVD. You will note from previous chapters that SVD is used for modern computations of principal components regression (PCR) and partial least squares regression (PLSR), although slightly different forms of SVD are used for each set of computations. Recall that for PCR we simply used SVD, and that for PLS we used a special case of SVD that we called PLS SVD. You will also recall that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores (T) and loadings (V) are determined using the concentration values for PLS SVD, whereas only the data matrix A is used to perform SVD for principal components analysis. All mathematical operations used for this chapter are completed using MATLAB software for Windows [1]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [2–5]. You may wish to program these operations yourselves or use other software to routinely make the calculations.

As in previous installments we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency.
We arbitrarily designate A for our example as

\[ A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{26-1} \]

Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on. Given any data matrix A of arbitrary size (as rows × columns) the matrix A can be written or defined using the computation of Singular Value Decomposition [6–8] as

\[ A = USV' = U \times S \times V' \tag{26-2} \]
where U is the left singular values matrix, V is the loadings matrix, and S is the diagonal matrix containing information on the variance described by each principal component
128
Chemometrics in Spectroscopy
(as the S matrix columns). It is important to note when reviewing the use of SVD in the literature that many references define the scores matrix (T) as U × S. Keep in mind that the scores can be calculated as U×S=A×V=T
(263)
and it holds that the original data matrix A can be reconstructed as U × S × V� = T × V� = A × V × V� = A × I = A
(264)
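Both the scores identity and the reconstruction identity rest on the orthonormality of the columns of U and V. The short derivation below is offered as a sketch using only the symbols already defined (the prime denotes the transpose); it adds nothing beyond standard properties of the SVD.

```latex
\begin{aligned}
V'V &= VV' = I, \qquad U'U = I \\
A V &= (U S V')\,V = U S\,(V'V) = U S = T \\
T V' &= U S V' = A \\
A'A &= (U S V')'(U S V') = V S\,(U'U)\,S V' = V S^{2} V'
\end{aligned}
```

The last line is why the squared singular values are termed eigenvalues: they are the eigenvalues of A′A, and the columns of V are the corresponding eigenvectors.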
We can demonstrate the interrelationships between the different matrices resulting from the SVD calculations by the use of MATLAB, as shown in Table 26-1. By studying the relationships between the various matrices resulting from the computation of SVD, one can observe that there are several ways to compute the same final results, making it somewhat difficult to follow the literature. However, knowing these inner mathematical relationships can help clarify our understanding of the different nomenclature. We will compare and contrast some of the literature and the use of different terms in later installments; right now just tuck this information away for future reference.

Table 26-1 Simple SVD performed on matrix A using MATLAB; other matrix relationships are also shown (see equations 26-1 through 26-4)

Command line:
    >> A = [1 7 9;4 10 12;6 14 16]
Comments: Enter the A matrix

Command line:
    A =
         1     7     9
         4    10    12
         6    14    16
Comments: Display the A matrix

Command line:
    >> [U,S,V] = svd(A)
Comments: Calculate the SVD of A

Command line:
    U =
       -0.3821   -0.9061    0.1814
       -0.5451    0.0624   -0.8361
       -0.7463    0.4183    0.5178
Comments: Display the U matrix, also known as the left singular vectors matrix, and rarely referred to as the scores matrix. The scores matrix is most often denoted as U × S or A × V, which as it turns out are exactly the same.

Command line:
    S =
       29.5803         0         0
             0    1.9907         0
             0         0    0.2038
Comments: Display the S matrix or the singular values matrix. This diagonal matrix carries the information on the variance described by each principal component. Note: the squares of the singular values are termed the eigenvalues.

Command line:
    V =
       -0.2380    0.9312   -0.2762
       -0.6279    0.0694    0.7752
       -0.7410   -0.3579   -0.5681
Comments: Display the V matrix or the right singular vectors matrix; this is also known as the loadings matrix. Note: the columns of this matrix are the eigenvectors corresponding to the positive eigenvalues.

Command line:
    >> U*S*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: U*S*V' is equivalent to the original data matrix A derived using the SVD computation

Command line:
    >> T = A*V
    T =
      -11.3024   -1.8038    0.0370
      -16.1231    0.1243   -0.1704
      -22.0748    0.8328    0.1055
Comments: The scores matrix (often designated as T) can be calculated as A × V

Command line:
    >> U*S
    ans =
      -11.3024   -1.8038    0.0370
      -16.1231    0.1243   -0.1704
      -22.0748    0.8328    0.1055
Comments: As mentioned in the text of the article, the scores matrix T can also be calculated as U × S.

Command line:
    >> T*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: As we have stated, the original data matrix A can be estimated as the scores matrix (T) × the transpose of the loadings matrix (V') as shown.

Command line:
    >> A*V*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: Just another way to estimate the original data matrix A. In this case, V times the transpose of V (itself) is a diagonal matrix with a value of ones along the diagonal, such as shown below. Note: this matrix of ones along the diagonal is called an identity matrix or (I).
        1.0000    0.0000    0.0000
        0.0000    1.0000    0.0000
        0.0000    0.0000    1.0000
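For readers without MATLAB, the identities in Table 26-1 can be checked with a few lines of plain Python. This is only a sketch: it does not compute an SVD, it simply multiplies out the four-decimal U, S, and V values printed in the table (with the signs MATLAB reports) and confirms that they reproduce A to within rounding. The helper names `matmul`, `transpose`, and `max_abs_diff` are ours, not part of any library.

```python
# Four-decimal values as printed in Table 26-1 (rounded, so checks are approximate).
A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]
U = [[-0.3821, -0.9061, 0.1814],
     [-0.5451, 0.0624, -0.8361],
     [-0.7463, 0.4183, 0.5178]]
S = [[29.5803, 0, 0], [0, 1.9907, 0], [0, 0, 0.2038]]
V = [[-0.2380, 0.9312, -0.2762],
     [-0.6279, 0.0694, 0.7752],
     [-0.7410, -0.3579, -0.5681]]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]


def matmul(X, Y):
    """Plain row-by-column matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def max_abs_diff(X, Y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


T_from_US = matmul(U, S)                        # scores as U x S
T_from_AV = matmul(A, V)                        # scores as A x V
A_rebuilt = matmul(T_from_US, transpose(V))     # T x V' should give back A
VtV = matmul(transpose(V), V)                   # should be the identity matrix

print("max |U*S - A*V|  :", max_abs_diff(T_from_US, T_from_AV))
print("max |T*V' - A|   :", max_abs_diff(A_rebuilt, A))
print("max |V'*V - I|   :", max_abs_diff(VtV, IDENTITY))
```

All three differences should come out small, limited only by the four-decimal rounding of the printed values.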
REFERENCES
1. MATLAB software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected].
2. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
3. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
4. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
5. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
6. Mandel, J., American Statistician 36, 15 (1982).
7. Golub, G.H. and Van Loan, C.F., Matrix Computations, 2nd ed. (The Johns Hopkins University Press, Baltimore, MD, 1989), pp. 427, 431.
8. Searle, S.R., Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), p. 316.
27 Linearity in Calibration
Those who know us know that we have always been proponents of the approach to calibration that uses a small number of selected wavelengths. The reasons for this are partly historical, since we became involved in Chemometrics through our involvement in near-infrared spectroscopy, back when wavelength-based calibration techniques were essentially the only ones available, and these methods did yeoman’s service for many years. When full-spectrum methods came on the scene (PCR, PLS) and became popular, we adopted them as another set of tools in our chemometric armamentarium, but always kept in mind our roots, and used wavelength-based techniques when necessary and appropriate, and we always knew that they could sometimes perform better than the full-spectrum techniques under the proper conditions, despite all the hype of the proponents of the full-spectrum methods. Lately, various other workers have also noticed that eliminating “extra” wavelengths could improve the results, but nobody (including ourselves) could predict when this would happen, or explain or define the conditions that make it possible.

The advantages of the full-spectrum methods are obvious, and are promoted by the proponents of full-spectrum methods at every opportunity: the ability to reduce noise by averaging data over both wavelengths and spectra, noise rejection by rejecting the higher factors, into which the noise is preferentially placed, the advantages inherent in the use of orthogonal variables, and the avoidance of the time-consuming step of performing the wavelength selection process. The main problem was to define the conditions where wavelength selection was superior; we could never quite put our finger on what characteristics of spectra would allow the wavelength-based techniques to perform better than full-spectrum methods. Until recently.
What sparked our realization of (at least one of) the key characteristics was an online discussion of the NIR discussion group [1] dealing with a similar question, whereupon the ideas floating around in our heads congealed. At the time, the concept was proposed simply as a thought experiment, but afterward, the realization dawned that it was a relatively simple matter to convert the thought experiment into a computer simulation of the situation, and check it out in reality (or at least as near to reality as a simulation permits). The advantage of this approach is that simulation allows the experimenter to separate the effect under study from all other effects and investigate its behavior in isolation, something which cannot be done in the real world, especially when the subject is something as complicated as the calibration process based on real spectroscopic data.

The basic situation is illustrated in Figure 27-1. What we have here is a simulation of an ideal case: a transmission measurement using a perfectly noise-free spectrometer through a clear, nonabsorbing solvent, with a single, completely soluble analyte dissolved in it. The X-axis represents the wavelength index, the Y-axis represents the measured absorbance. In our simulation there are six evenly spaced concentrations of analyte, with simulated “concentrations” ranging from 1 to 6 units, and a maximum simulated absorbance for the highest concentration sample of 1.5 absorbance units. Theoretically, this situation should be describable, and modeled, by a single wavelength, or a single factor. Therefore in our simulation we use only one wavelength (or factor) to study.

Figure 27-1 Six samples worth of spectra with two bands, without (left) and with (right) stray light. (see Color Plate 1) [Plot: measured absorbance (-0.2 to 1.6) versus wavelength index (1 to 301).]

For the purpose of our simulation, the solute is assumed to have two equal bands, both of which perfectly follow Beer’s law. What we want to study is the effect of nonlinearities on the calibration. Any nonlinearity would do, but in the interest of retaining some resemblance to reality, we created the nonlinearity by simulating the effect of stray light in the instrument, such that the spectra are measured with an instrument that exhibits 5% stray light at the higher wavelengths. Now, 5% might be considered an excessive amount of stray light, and certainly, most actual instruments can easily exhibit more than an order of magnitude better performance. However, this whole exercise is being done for pedagogical purposes, and for that reason, it is preferable for the effects to be large enough to be visible to the eye; 5% is about right for that purpose. Thus, the band at the lower wavelengths exhibits perfect linearity, but the one at the higher wavelengths does not. Therefore, even though the underlying spectra follow Beer’s law, the measured spectra not only show nonlinearity, they do so differently at different wavelengths. This is clearly shown in Figure 27-2, where absorbance versus concentration is plotted for the two peaks.

Now, what is interesting about this situation is that ordinary regression theory and the theory of PCA and PLS specify that the model generated must be linear in the coefficients. Nothing is specified about the nature of the data (except that it be noise-free, as our simulated data is); the data may be nonlinear to any degree. Ordinarily this is not a problem because any data transform may be used to linearize the data, if that is desirable.
In this case, however, one band is linearly related to the concentrations and one is not; a transformation, blindly applied, that linearized the absorbance of the higher-wavelength band would cause the other band to become nonlinear. So now, what is the effect of all this on the calibration results that would be obtained?

Clearly, in a wavelength-based approach, a single wavelength (which would be theoretically correct), at the peak of the lower-wavelength band, would give a perfect fit to the absorbance data. On the other hand, a single wavelength at the higher-wavelength band would give errors due to the nonlinearity of the absorbance. The key question then becomes, how would a full-wavelength (factor-based) approach behave in this situation?
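The two single-wavelength cases can be sketched in a few lines of Python. Two things here are assumptions rather than details given in the text: the stray-light model (a constant 5% of the incident beam reaching the detector, so the measured transmittance is (T + s)/(1 + s)), and the use of an inverse calibration (concentration regressed on absorbance). The function names are ours. Under these assumptions the nonlinear band yields statistics close to the nonlinear-wavelength column of Table 27-1.

```python
import math


def measured_absorbance(true_abs, stray=0.0):
    """Absorbance reported by an instrument with a stray-light fraction `stray`.

    Assumed model: stray light adds equally to sample and reference beams.
    """
    transmittance = 10.0 ** (-true_abs)
    return -math.log10((transmittance + stray) / (1.0 + stray))


def calibrate(absorbance, conc):
    """Regress concentration on absorbance; return (SEE, correlation coefficient)."""
    n = len(conc)
    mean_a = sum(absorbance) / n
    mean_c = sum(conc) / n
    saa = sum((a - mean_a) ** 2 for a in absorbance)
    scc = sum((c - mean_c) ** 2 for c in conc)
    sac = sum((a - mean_a) * (c - mean_c) for a, c in zip(absorbance, conc))
    slope = sac / saa
    intercept = mean_c - slope * mean_a
    sse = sum((c - (intercept + slope * a)) ** 2 for a, c in zip(absorbance, conc))
    see = math.sqrt(sse / (n - 2))          # standard error of estimate
    r = sac / math.sqrt(saa * scc)          # correlation coefficient
    return see, r


conc = [1, 2, 3, 4, 5, 6]
true_peak = [0.25 * c for c in conc]        # 1.5 absorbance units at concentration 6

linear_band = [measured_absorbance(a) for a in true_peak]
nonlinear_band = [measured_absorbance(a, stray=0.05) for a in true_peak]

see_lin, r_lin = calibrate(linear_band, conc)
see_non, r_non = calibrate(nonlinear_band, conc)

print("linear band    : SEE = %.4f, r = %.4f" % (see_lin, r_lin))
print("nonlinear band : SEE = %.4f, r = %.4f" % (see_non, r_non))
```

The linear band fits essentially exactly, while the band measured with stray light flattens at high concentration and leaves a systematic residual that no straight line can remove.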
Figure 27-2 Absorbance versus concentration, without (upper) and with (lower) stray light. [Plot: absorbance (0 to 1.6) versus concentration (1 to 6).]
In the discussion group, it was conjectured that a single factor would split the difference; the factor would take on some character of both absorbance bands, and would adjust itself to give less error than the nonlinear band alone, but still not be as good as using the linear band. Figure 27-3 shows the factor obtained from the PCA of this data. It seems to be essentially Gaussian in the region of the lower-wavelength band, and somewhat flattened in the region of the higher-wavelength band, conforming to the nature of the underlying absorbances in the two spectral regions.

Figure 27-3 First principal component from concentration spectra. [Plot: factor value (0 to 0.2) versus wavelength index (1 to 157).]

Because of the way the data was created, we can rely on the calibration statistics as an indicator of performance. There is no need to use a validation set of data here. Validation sets are required mainly to assess the effects of noise and intercorrelation. Our simulated data contains no noise. Furthermore, since we are using only one wavelength or one factor, intercorrelation effects are not operative, and can be ignored. Therefore the final test lies in the values obtained from the sets of calibration results, which are presented in Table 27-1.

Table 27-1 Calibration statistics obtained from the three calibration models discussed in the text

                Linear wavelength   Nonlinear wavelength   Principal component
SEE             0                   0.237                  0.0575
Corr. Coeff.    1                   0.9935                 0.9996
F                                   305                    5294

Those results seem to bear out our conjecture. The different calibration statistics all show the same effects: the full-wavelength approach does seem to sort of “split the difference” and accommodate some, but not all, of the nonlinearities; the algorithm uses the data from the linear region to improve the model over what could be achieved from the nonlinear region alone. On the other hand, it could not do so completely; it could not ignore the effect of the nonlinearity entirely to give the best model that this data was capable of achieving. Only the single-wavelength model using only the linear region of the spectrum was capable of that.

So we seem to have identified a key characteristic of chemometric modeling that influences the capabilities of the models that can be achieved: not nonlinearity per se, because simple nonlinearity could be accommodated by a suitable transformation of the data, but differential nonlinearity, which cannot be fixed that way. In those cases where this type of differential, or nonuniform, nonlinearity is an important characteristic of the data, then selecting those wavelengths and only those wavelengths where the data are most nearly linear will provide better models than the full-spectrum methods, which are forced to include the nonlinear regions as well, are capable of.

Now, the following discussion does not really constitute a proof of this condition (in the mathematical sense), but this line of reasoning is fairly convincing that this must be so. If, in fact, a full-spectrum method is splitting the difference between spectral regions with different types and degrees of nonlinearity, then those regions, at different wavelengths, themselves must have different amounts of nonlinearity, so that some regions must be less nonlinear than others. Furthermore, since the full-spectrum method (e.g., PCR) has a nonlinearity that is, in some sense, between that of the lowest and highest, then the wavelengths of least nonlinearity must be more linear than the full-spectrum method and therefore give a more accurate model than the full-spectrum algorithm. All that is needed in such a case, then, is to find and use those wavelengths.
Thus, when this condition of differential nonlinearity exists in the data, modeling techniques based on searching through and selecting the “best” wavelengths (essentially we’re saying MLR) are capable of creating more accurate models than full-wavelength methods, since almost by definition this approach will find the wavelength(s) where the effects of nonlinearity are minimal, which the full-spectrum methods (PCA, PLS) cannot do.
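The “split the difference” behavior of a single full-spectrum factor can be explored with a small simulation. What follows is a sketch under several assumptions that the text does not specify: Gaussian band shapes, band centers at wavelength indices 75 and 225 with a width of 20 points, stray light applied to indices above 150 using the model (T + s)/(1 + s), no mean-centering, a first factor extracted by power iteration on the 6 × 6 matrix of spectrum cross-products, and concentration regressed on the resulting scores. It is not the authors' original program, and its numbers need not match Table 27-1; only the ordering of the three errors is of interest.

```python
import math

CONC = [1, 2, 3, 4, 5, 6]
N_POINTS = 301                     # wavelength indices 0..300
LOW_PEAK, HIGH_PEAK = 75, 225      # assumed band centres
STRAY = 0.05                       # 5% stray light on the high-wavelength side


def band(j, centre, width=20.0):
    """Assumed Gaussian band shape with unit peak height."""
    return math.exp(-0.5 * ((j - centre) / width) ** 2)


def spectrum(c):
    """Simulated measured spectrum for concentration c."""
    row = []
    for j in range(N_POINTS):
        true_abs = 0.25 * c * (band(j, LOW_PEAK) + band(j, HIGH_PEAK))
        stray = STRAY if j > 150 else 0.0
        transmittance = 10.0 ** (-true_abs)
        row.append(-math.log10((transmittance + stray) / (1.0 + stray)))
    return row


def first_factor_scores(X, iterations=200):
    """Scores on the first principal component, found by power iteration."""
    n = len(X)
    gram = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)] for i in range(n)]
    u = [1.0] * n
    for _ in range(iterations):
        w = [sum(gram[i][k] * u[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    loading = [sum(u[i] * X[i][j] for i in range(n)) for j in range(len(X[0]))]
    norm = math.sqrt(sum(v * v for v in loading))
    loading = [v / norm for v in loading]
    return [sum(x * v for x, v in zip(row, loading)) for row in X]


def see(predictor, conc):
    """Standard error of estimate for concentration regressed on one predictor."""
    n = len(conc)
    mp, mc = sum(predictor) / n, sum(conc) / n
    slope = (sum((p - mp) * (c - mc) for p, c in zip(predictor, conc))
             / sum((p - mp) ** 2 for p in predictor))
    intercept = mc - slope * mp
    sse = sum((c - (intercept + slope * p)) ** 2 for p, c in zip(predictor, conc))
    return math.sqrt(sse / (n - 2))


X = [spectrum(c) for c in CONC]
see_linear = see([row[LOW_PEAK] for row in X], CONC)
see_nonlinear = see([row[HIGH_PEAK] for row in X], CONC)
see_factor = see(first_factor_scores(X), CONC)

print("SEE, linear wavelength    :", see_linear)
print("SEE, nonlinear wavelength :", see_nonlinear)
print("SEE, first factor         :", see_factor)
```

Because the factor score is a weighted sum over both bands, its error falls between the two single-wavelength results, which is the pattern the chapter describes.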
REFERENCE 1. The moderator of this discussion group was Bruce Campbell. He can be reached for information, or to join the discussion group by sending a message to: [email protected]. New members are welcome.
28 Challenges: Unsolved Problems in Chemometrics
We term the issues we plan to discuss in this chapter as “unsolved” problems, but that may be incorrect. It may be, perhaps, more accurate to call them “Unaddressed Problems in Chemometrics”. Calling them “unsolved” implies that attempts have been made to solve them, but those attempts were unsuccessful, possibly because these problems are too difficult, or possibly because maybe we are not smart enough. Calling them “unaddressed” on the other hand, really gets to the heart of the matter: a number of problems have come to our attention that nobody seems to be paying any heed to. It may very well turn out that some of these problems are too difficult to solve at the current state of the art in Chemometrics, and maybe we are really not smart enough, but at this point we do not know, and we will never know if nobody tries. Our attention was drawn to these problems via various routes. Some arose from our own work on various projects. Some arose from discussions in the online discussion group. Some have been floating around in the backs of our minds for what seems like forever, but only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. Answers – we have none, only questions. We bring up these points to stir up some discussion, and maybe even a little controversy, and certainly with the hope that we can prod some of our compatriots “out there” to tackle some of these. Conspicuous by its absence is the question of calibration transfer, even though we consider it unsolved in the general sense, in that there is no single “recipe” or algorithm that is pretty much guaranteed to work in all (or at least a majority) of cases. Nevertheless, not only are many people working on the problem (so that it is hardly “unaddressed”), but there have been many specific solutions developed over the years, albeit for particular calibration models on particular instruments. 
So we do not need to beat up on this one by ourselves.

So what are these problems?

1) The first one we mention is the question of the validity of a test set. We all know and agree (at least, we hope that we all do) that the best way to test a calibration model, whether it is a quantitative or a qualitative model, is to have some samples in reserve, that are not included among the ones on which the calibration calculations are based, and use those samples as “validation samples” (sometimes called “test samples” or “prediction samples” or “known” samples). The question is, how can we define a proper validation set? Alternatively, what criteria can we use to ascertain whether a given set of samples constitutes an adequate set for testing the calibration model at hand?

A very limited version of this question does, in fact, sometimes appear, when the question arises of how many samples from a given calibration set to keep in reserve for
the validation process. Answers range from one (at a time, in the PRESS algorithm) to half the set, and there is no objective, scientific criterion given for any of the choices that indicates whether that amount is optimum. Each one is justified by a different heuristic criterion, and there is never any discussion of the failings of any particular approach. For example, while the PRESS algorithm is appealing, it does not even test the calibration model: if anything, for n samples it tests n different models, none of which is the one to be used, and so forth. Another shortcoming of PRESS is that if each sample was read multiple times, then a computer program that simply removes one reading at a time does not remove the effect of that sample from the data.

Even so, at best any of these answers treat only one aspect of the larger question, which includes not only how many samples, but which ones? A properly taken random sample is indeed representative of the population from which it comes. So one subquestion here is, how should we properly sample? The answer is “randomly”, but how many workers select their validation samples in a verifiably random manner? How can someone then tell if their test set is valid, and against what criteria?

Some of this goes back to the original question of obtaining a proper and valid set of calibration samples in the first place, but that is a different, although related, problem. We can turn that question around in the same way: what are the criteria for telling if a calibration sample set is a valid set? Maybe both problems have the same solution, but we do not know because nobody is working on either one.

But to pose the question more directly: how can we tell if any set of samples constitutes a valid test set? Even if they were chosen in a proper random manner, are there any independent tests for their validity? What characteristics should the criteria for deciding be based on, and what are the criteria to use?
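The two shortcomings of PRESS mentioned above can be made concrete with a small sketch. The data below are entirely hypothetical (five samples, each read twice, with a one-variable calibration), and the function names are ours. The point is structural: every left-out group is predicted by a different fitted model, and leaving out one reading at a time keeps the other reading of the same sample in the calibration, whereas leaving out whole samples does not.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def press(x, y, groups):
    """PRESS with one group left out at a time; returns (PRESS, list of models)."""
    total, models = 0.0, []
    for g in sorted(set(groups)):
        keep = [i for i, gi in enumerate(groups) if gi != g]
        b0, b1 = fit_line([x[i] for i in keep], [y[i] for i in keep])
        models.append((b0, b1))                     # one model per left-out group
        total += sum((y[i] - (b0 + b1 * x[i])) ** 2
                     for i, gi in enumerate(groups) if gi == g)
    return total, models


# Hypothetical readings: five samples, two absorbance readings per sample.
absorbance = [0.26, 0.27, 0.49, 0.50, 0.77, 0.78, 0.98, 0.99, 1.27, 1.26]
reference = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]          # reference concentrations
sample_id = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
reading_id = list(range(len(absorbance)))

press_by_reading, models_by_reading = press(absorbance, reference, reading_id)
press_by_sample, models_by_sample = press(absorbance, reference, sample_id)

print("leave-one-reading-out:", len(models_by_reading), "models, PRESS =", press_by_reading)
print("leave-one-sample-out :", len(models_by_sample), "models, PRESS =", press_by_sample)
```

Note that neither loop ever evaluates the model fitted to all ten readings, which is the model that would actually be put into service.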
2) The next problem we bring up for discussion is the definition of “validation”. Now, we are sure there are some who will complain that we are arguing terminology rather than substance. However, we think that agreement on what terms mean has substantive consequences, especially in modern times when standards-setting organizations (e.g., ASTM) and government agencies are taking an interest in what we do. As we will see below, there is the question of the time required to validate. On the one hand, if we recognize that verifying the accuracy of a given model at the time that model is created may or may not be a sufficient test of its long-term behavior, then we may need to include long-term testing procedures. On the other hand, if government agencies create regulations for how models are to be validated, which presumably they are likely to do on the basis of what we ourselves decide is required, do we want to be constrained to not being able to declare that we have created a model until months or years have passed? Such questions involve much more than terminology, especially if the government decides that “validation” is, in fact, whatever we claim it is.

As we hinted above, the most common use of the term “validation” involves simply retaining some samples separately from the main set of calibration samples and using those as a more-or-less independent test of the accuracy of the calibration model obtained. However, this definition is not universally agreed to. When the subject came up in the online discussion group, the following comment was made by Richard Kramer of the discussion group [1]:

The issue Howard raises is an important one. However, I disagree with his characterization of validation and with the resulting conclusion. It all depends upon
what one means by the concept of validation. If validation means the ongoing validation of a plurality of alternative models (my preferred meaning), it DOES become the means of selecting one model over others. And importantly, it permits selection of models which exhibit the best performance with respect to time-related properties such as robustness. It is not uncommon to observe that the model which initially appears to be optimum is the one whose performance degrades most rapidly as time passes. Validation over time also provides a means of gaining insight into which portions of the data might contain more confusion than information and would be best discarded. In particular, it can be interesting to look at the data residuals over time. It is not uncommon to find that the residuals in some parts of the data space increase more rapidly, over time, than the residuals in other parts of the data space. Generally excluding (or de-weighting) the former from the model can improve the model’s performance, short term and long term.

Certainly Richard raises valid points, and you can hardly fault his prescription for monitoring and improving the results. However, is that considered, or should that be considered, a requirement for validation, or even a necessary part of the validation process? The response comment to Richard at the time was as follows:

I think Rich & I agree more than we disagree. If you use his definition of validation then what he says follows. However, that definition is not the one in common use – the MUCH more common definition is simply the one that tells you to separate your calibration samples & keep some out of the calibration calculations, then use those to validate. Once you’ve gone to the trouble to collect data over time then your options expand greatly. Not only can you use that data for ongoing validation, you can also include those new readings in the calibration calculations.
There are at least two ways to do this:

1) As Richard implies, one way is to gradually replace the older data with the new as it becomes available. This has been standard practice for a long time, for example in the agricultural industry, where old samples will never be seen again. A grain elevator, e.g., will never again have to measure another sample from the 1989 crop year.

2) The other obvious extension, which is more useful for the case where you may still have to measure samples with the same characteristics as the old ones, is to simply keep adding to and expanding the calibration set as new samples become available. The new samples then not only allow you to test for robustness, but inclusion of such samples will actually make the calibration more robust. I think we all know this intuitively, but I have also been able to prove this mathematically.

So validation may not only involve the time frame required to perform it, it may also involve questions of the models (or at least the number of models) being tested. So there we have it: what exactly is “validation”?

3) The next unsolved problem we bring up is the question of error in the classification of training samples when calibrating an instrument to do identification. We mentioned
this briefly in a recent column, but it is worth some more discussion. The problem appears to arise primarily in medical applications, so as a nonproprietary example, let us imagine we are interested in identifying the degree of burn of a burn victim: that is, whether the subject has a 1st, 2nd or 3rd degree burn. The distinctions are medically important, and furthermore there are qualitative differences between them despite the fact that they arise out of the quantitative difference in the amount of heat involved. In these respects this typifies other medical situations. We could take spectra of the burned areas from subjects who have been burned, but there is a certain amount of subjectivity in assigning the degree of burn in a given case, and occasionally two physicians will disagree on the designation of the degree of burn in some cases. Clearly, if they disagree, they both cannot be correct, so if we use one or the other’s diagnosis, the training classification will also occasionally be in error.

While there is certainly a progression in the intensity and severity of the burn as we go from 1st to 3rd degree burns, we cannot simply use a quantitative scale, for a number of reasons: a quantitative scale of that sort is not agreed to by all physicians, it would be, at best, highly nonlinear, and most importantly, there are real qualitative differences between tissue subjected to the different extents of damage, besides the potential quantitative ones. Because of this, a straightforward quantitative approach would not suffice, even if one could be developed. We need methods to deal with the existence of errors in the training classifications when training instruments to do automated identification.

4) The final problem we bring up is based on the question of modeling based on individual wavelengths versus full-spectrum methods and the modern variations on those themes. Basically the question can be put: “How far should we go in eliminating wavelengths?”.
As we discussed in a recent column, as well as in times past, our backgrounds are from the days of pre-PCA/PLS/PCR/NN calibration modeling, and we there learned the value of wavelength-based models (principally MLR, or P-matrix as it’s sometimes called), which we only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. (Does that sound familiar?) The full-spectrum methods (PLS, PCR, K-matrix, etc.) have their advantages and, as we recently discussed, so do the individual-wavelength methods. The users of the full-spectrum approaches have in recent years taken an empirical, ad hoc approach to the question of wavelength elimination, finding that there was benefit to it, even if there were no explanations of the reasons for that benefit. Our initial reaction was something on the order of: why not go the whole way and eliminate all the wavelengths except those few that are needed to do the analysis (i.e., go to the limit of wavelength elimination, which essentially brings it back to MLR)? However, now that we know what the benefit of MLR-type modeling is, it is clear that eliminating all those wavelengths is counterproductive, because it throws the baby out with the bathwater, so to speak. Ideally, we should like to devise criteria for determining how many wavelengths, and which wavelengths, to keep and which to eliminate, to obtain the optimum balance between the noise-reduction capabilities of the full-spectrum methods and the linearity-maximization capabilities of the individual-wavelength approaches.
Well, there we have it: our list of current unsolved/unaddressed problems. Hop to it, readers!!!
REFERENCE
1. Chemometrics discussion group moderated by Bruce Campbell. He can be reached for information, or to join the discussion group, by sending a message to: [email protected]. New members are welcome.
29 Linearity in Calibration: Act II Scene I
When we first published our chapter “Linearity in Calibration” as an article in Spectroscopy magazine [1] we did not quite realize what a firestorm we were going to ignite, although, truth be told, we did not expect everybody to agree with us, either. But if so many actually took the trouble to send their criticisms to us, then there must also be a large “silent majority” out there that are upset, perhaps angry, and almost certainly misunderstanding what we said. We prepared responses to these criticisms, but they became so lengthy that we could not print them all in a single published column, and thus the topic is included in several smaller chapters.

At this point in our discussion, let us raise the question of the linearity of spectroscopic data as a general topic. There are a number of causes of nonlinearity that most chemists and spectroscopists are familiar with. Let us define our terms. When speaking of “linearity” the meaning of the term depends on your point of view, and your interests. An engineer is concerned, perhaps, with the linearity of detector response as a function of incident radiant energy. To a chemist or spectroscopist, the interest is in the linearity of an instrument’s readings as a function of the concentration of an analyte in a set of samples. In practice, this is generally interpreted to mean that when measuring a transparent, nonscattering sample, the response of the instrument can be calculated as some constant times the concentration of the analyte (or at least some function of the instrument response can be calculated as a constant times some other function of the concentration). In spectroscopic usage, that is normally interpreted as meaning the condition described theoretically by Beer’s Law, that is, the instrument response function is the negative exponential of the concentration:

$$I = k I_o e^{-bC} \tag{29-1}$$

where
$I$ = the radiation passing through the sample
$k$ = the multiplying constant
$I_o$ = the radiation incident on the sample
$b$ = the product of the pathlength and absorptivity
$C$ = the concentration of the analyte.
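It may help to see explicitly how the exponential form of Beer’s Law becomes a straight-line relationship. The short worked step below is a sketch; the symbols T (transmittance) and A (absorbance, base 10) are introduced here for illustration and follow common usage rather than anything defined in the text.

```latex
\begin{aligned}
T &= \frac{I}{k I_o} = e^{-bC} \\
A &= -\log_{10} T = \frac{bC}{\ln 10} \approx 0.4343\, bC
\end{aligned}
```

So it is the logarithm of the instrument response, not the response itself, that is a constant times the concentration, which is the sense of “some function of the instrument response” used above.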
When other types of samples are measured, the resulting data is usually known to be nonlinear (except possibly in a few special cases), so those measurements are of no interest to us here. Thus, in practice, the invocation of “linearity” implies the assumption that Beer’s Law holds, therefore discussions of nonlinearity are essentially about those phenomena that cause departures from Beer’s law.
These include

1) Chemical causes
   a) Hydrogen bonding
   b) Self-polymerization or condensation
   c) Interaction with solvent
   d) Self-interaction
2) Instrumental causes
   a) Nonlinear detector
   b) Nonlinear electronics
   c) Instrument bandwidth broad compared to absorbance band
   d) Stray light
   e) Noncollimated radiation
   f) Excessive signal levels (saturation).

Most chemists and spectroscopists expect that in the absence of these distinct phenomena causing nonlinearity, Beer’s Law provides an exact description of the relationship between the absorbance and the analyte concentration. Unfortunately the world is not so simple, and Beer’s Law never holds exactly, EVEN IN PRINCIPLE. The reason for this arises from thermodynamics.

Optical designers and specialists in heat transfer calculations in the chemical engineering and mechanical engineering sciences are familiar with the mathematical construct known as The Equation of Radiative Transfer, although most chemists and spectroscopists are not. The Equation of Radiative Transfer states that, disregarding absorbance and scattering, in a lossless optical system

$$dE = I \, d\lambda \, d\omega \, da \, dt \tag{29-2}$$

where
$dE$ = the differential energy transferred in differential time $dt$
$I$ = the optical intensity as a function of wavelength (i.e., the “spectrum”)
$d\lambda$ = the differential wavelength increment
$d\omega$ = the differential optical solid angle the beam encompasses
$da$ = the differential area occupied by the beam.

For a static (i.e., unvarying with time) system, we can recast equation 29-2 as:

$$dE/dt = I \, d\lambda \, d\omega \, da \tag{29-3}$$
where dE/dt is the power in the beam. The application of these equations to heat transfer problems is obvious, since by knowing the radiation characteristics of a source and the geometry of the system, these equations allow an engineer, by integrating over the differential terms of equation 29-2 or equation 29-3, to calculate the amount of energy transferred by electromagnetic radiation from one place to another. Furthermore, the first law of thermodynamics assures us that dE/dt will be constant anywhere along the optical beam, since any change would require that the energy in the
beam be either increased or decreased, which would require that energy would be either created or destroyed, respectively. Less obviously, perhaps, the second law of thermodynamics assures us that the intensity, I, is also constant along the beam, for if this were not the case, then it would be possible to focus all the radiation from a hot body onto a part of itself, increasing the radiation flux onto that portion and raising the temperature of that portion without doing work – a violation of the second law. The constancy of beam energy and intensity has other consequences, some of which are familiar to most of us. If we solve equation 29-3 for the product ($d\omega\,da$) we get:

$$d\omega\,da = \frac{dE/dt}{I(\lambda)\,d\lambda} \tag{29-4}$$
All the terms on the right-hand side of equation 29-4 are constants, therefore for any given wavelength and source characteristics, the product ($d\omega\,da$) is a constant, and in an optical system one can be traded off for the other. We are all familiar with this characteristic of optical systems, in the magnification and demagnification of images described by geometric optics. Whenever light is brought to a small focus (i.e., $da$ becomes small) the light converges on the focal point through a large range of angles (i.e., $d\omega$ becomes large) and vice versa. This trade-off of parameters is more obvious to us when seen through the paradigm of geometric optics, but now we see that this is a manifestation of the thermodynamics underlying it all.

We are also familiar with this effect in another context: in the fact that we cannot focus light to an arbitrarily small focal point, but are limited to what we usually call the “diffraction limit” of the radiation in the beam. This effect also comes out of equation 29-4, since there is a physical (or perhaps a geometrical) limit to $d\omega$: $d\omega$ cannot become arbitrarily large, therefore $da$ cannot become arbitrarily small. Again, we are familiar with this effect by coming across it in another context, but we see that it is another manifestation of the underlying thermodynamic reality.

Getting back to our main line of discussion, we can see from equation 29-2 (or equation 29-3) that the differential terms must all have finite values. If any of the terms $d\lambda$, $d\omega$, or $da$ were zero, then zero energy would pass through the system and we could not make any measurements. One thing this tells us, of interest to us as spectroscopists, is that we can never build an instrument with perfect resolution. The mechanistic fundamentals (quantum broadening, Doppler broadening, etc.) have been extensively discussed by one of our colleagues [2].
This effect also manifests itself in the fact that every technology has an “instrument function” that is convolved with the sample spectrum, and each instrument function is explained by the paradigms of the associated technology, but since “perfect” resolution means that $d\lambda = 0$, we see again that this is another result of the same underlying thermodynamics.

More to the point of our discussion regarding nonlinearity, however, is the fact that $d\omega$ cannot be zero. $d\omega$ is related to the concept of “collimation”: for a “perfectly collimated” beam, $d\omega = 0$. But as we have just seen, such a beam can transfer zero energy; so just as with $d\lambda$ and $da$, a perfectly collimated beam has no energy. Beer’s law, on the other hand, is based on the assumption that there is a single pathlength (normally represented by the variable b in the equation A = abc) for all rays through the sample. In a real, physical, measurement system, this assumption is always false, because of the fact that $d\omega$ cannot be zero. As Figure 29-1 shows, the actual
Figure 29-1 Diagram showing the pathlength in a sample for a ray going straight through (to $I_1$) and those going at an angle (to $I_2$). Labels in the figure: $I_0$ (incident beam), $I_1$, $I_2$, $b$ (pathlength), $\theta$ and $\theta_{max}$ (ray angles).
rays have pathlengths that range from $b$ (for those rays that travel “straight through”, i.e., normal to the sample surfaces) to $b/\cos(\theta_{max})$ (for the rays at the most extreme angles). We noted this effect above as item 2e in our list of sources of nonlinearity, and here we see the reason that it is a fundamental limitation. Mechanistically, the nonlinearity is caused by the fact that the absorbance for the rays traveling normally = $abc$, while for the extreme rays it is $abc/\cos(\theta_{max})$. Thus the non-normal rays suffer higher absorbance than the normal ones do, and the discrepancy (which equals $abc\,(1/\cos\theta - 1)$) increases with increasing concentration.

When the medium is completely nonabsorbing, then the difference in pathlength does not affect the measurement. When the sample has absorbance, however, it is clear that ray $I_2$ will have its intensity reduced more than ray $I_1$, due to the longer pathlength. Thus not all rays are reduced by the same amount and this leads to the nonlinearity of the measurement. Mathematically, this can be expressed by noting that the intensity measured when a beam with a finite range of angles passes through a sample is

$$I = I_o \int_0^{\theta_{max}} e^{-bC/\cos\theta}\,d\theta \tag{29-5}$$
rather than the simpler form shown in equation 29-1 (which, we remind the reader, only holds true for “perfectly collimated” beams, which have zero energy). In practice, of course, this effect is very small, normally much smaller than any of the other sources of nonlinear behavior, and we are ordinarily safe in ignoring it, and calling Beer’s law behavior “linear” in the absence of any of the other known sources of nonlinear behavior. However, the point here is that this completes the demonstration of our statement above, that Beer’s law never exactly holds IN PRINCIPLE and that as spectroscopists we never ever really work with perfectly linear data.
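The size and direction of this effect can be explored numerically. The sketch below is only an illustration under stated assumptions: it averages the transmitted fraction uniformly over ray angle from 0 to a hypothetical maximum angle (following the angle integral above, normalized by its value at zero concentration), uses natural-log absorbance, and picks an exaggerated maximum angle of 0.3 rad so the deviation is visible. The function name and all numerical values are invented for the example.

```python
import numpy as np

def apparent_absorbance(conc, b=1.0, theta_max=0.3, n=2000):
    """Natural-log absorbance of a beam spread uniformly over 0..theta_max.

    Midpoint-rule average of exp(-b*C/cos(theta)) over angle; dividing by
    the zero-concentration value is implicit in taking the mean.
    """
    h = theta_max / n
    theta = (np.arange(n) + 0.5) * h          # midpoints of the angle grid
    path = b / np.cos(theta)                  # effective b for each ray
    conc = np.atleast_1d(np.asarray(conc, dtype=float))
    transmitted = np.exp(-np.outer(conc, path)).mean(axis=1)
    return -np.log(transmitted)

conc = np.linspace(0.5, 5.0, 10)
a_apparent = apparent_absorbance(conc)
a_ideal = 1.0 * conc                          # single-pathlength Beer's law, b = 1

excess = a_apparent - a_ideal                 # extra absorbance from oblique rays
slope = np.diff(a_apparent) / np.diff(conc)   # local slope of the "calibration line"

print(excess)
print(slope)
```

Under these assumptions the apparent absorbance always exceeds the single-pathlength value, and the local slope falls slowly as concentration rises, because the longer-path rays contribute progressively less to the transmitted beam. With a realistically small angular spread the curvature becomes very small, consistent with the remark that the effect can ordinarily be ignored.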
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Ball, D.W., Spectroscopy 11(1), 29–30 (1996).
30 Linearity in Calibration: Act II Scene II – Reader’s Comments ...
Some time ago we wrote an article entitled “Linearity in Calibration” [1], in which we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That column generated an active response, so we are discussing the subject in some detail, spread over several columns. The first part of these discussions has been published [2]; this chapter is the continuation of that one. In this chapter we now present the responses we received to the original published article [1] in order of receipt, following which we will comment about them in subsequent chapters. Here, in order of receipt, are the comments:

The first set of comments we received were from Richard Kramer:

[Howard & Jerry],

I’m afraid that this month’s Spectroscopy Column is badly off the mark (pun intended (with apologies)). The errors are twofold with the most serious error so significant that the other error is moot.

1) If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a nonlinear system. A minimum of 2 factors are required. The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength model, it merely illustrated the fact that a single linear factor is not sufficient to model nonlinear data. We could stop here, but, for the sake of completeness ...

2) The second problem is that that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is “In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?” The answer will depend upon the levels of noise and nonlinearity and the number of wavelengths in each model.

Regards,
Richard

We went back and forth a couple of times, but rather than list each of our conversations individually, we will reserve comments until we have looked at all the comments, and then we will summarize our responses to all four respondents together, since several of these response comments say the same things, to some extent.
Second, we received comments from Patrick Wiegand:

Gents,

I have always looked forward to reading your articles on Chemometrics in Spectroscopy. They are truly a valuable resource – I usually cut them out and save them for future reference. However, I think your article “Linearity in Calibration” in the June 1998 issue of Spectroscopy leads the reader to an erroneous conclusion. This conclusion results largely because of the assumptions you make about the application of PLS and PCR.

I know of no experienced practitioner of chemometrics who would blindly use the “full spectrum” when applying PLS or PCR. In the book “Chemometrics” by Beebe, Pell and Seasholtz, the first step they suggest is to “examine the data.” Likewise, Kramer in his new book has two essential conditions: The data must have information content and the information in the data must have some relationship with the property or properties which we are trying to predict. Likewise, in the course I teach at Union Carbide, I begin by saying that “no modeling technique, no matter how complex, can produce good predictions from bad data.”

In your article, you appear to be creating an artificial set of circumstances:

1) You start with a “perfectly noise-free spectrum”
2) You create an excessively high degree of nonlinearity which would never be tolerated by an experienced spectroscopist.
3) You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.
4) You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.

In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR. To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.

Number 2 – I understand that you wanted to use a high degree of nonlinearity so that the absorbance vs. concentration plot will be nonlinear to the naked eye, but you can’t really expect to use this degree of nonlinearity to make a judgmental comparison between two techniques if it is not realistic that it will ever occur in real life.

Number 3 – There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled. If this is not possible, at least only include regions that look like
valid bands – no sense in trying to include low s/n baseline regions. Plots of a linear correlation coefficient vs. wavelength for the property of interest are also useful in choosing the right regions to include in the model. Finally, if the initial model is built using the full-spectrum, an examination of factor plots would reveal areas in which there is no activity.

Number 4 – In cases where there is no choice but to deal with nonlinearity in the spectra, then it will be necessary to use more factors than the number of chemical species in the system. Once again, an experienced practitioner will use other ways of choosing the right number of factors, like a PRESS plot, etc.

Thus your conclusion – that MLR is more capable of producing accurate models than PLS/PCR – is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced. It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances.

All of the above would seem to indicate that I am totally against using MLR. This is not the case. In my practice, I always try the simplest approach first. This means first trying MLR. If that does not work, then I use PLS. If that does not work – well, some people may use neural networks, but I have not yet found a need to do so. I think you are right in saying that there has been a lot of hype over PLS (although not as much as there has been over neural nets!) In many cases MLR works great, and I will continue to use it. To paraphrase Einstein, “Always use the simplest approach that works – but no simpler.”

The third set of comments we received were from Fred Cahn:

I read your article in Spectroscopy (13(6), June 1998) with interest. However, I don’t agree with the conclusions and the way your simulation was carried out and/or presented. While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra. At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelength, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration. See my publication:

Cahn, F. and S. Compton, “Multivariate Calibration of Infrared Spectra for Quantitative Analysis Using Designed Experiments”, Applied Spectroscopy, 42:865–872 (July, 1988).
Fred supplied a copy of the cited paper, and we read it. Again, the comments about it will be included among the general comments.

And finally, the fourth set of comments we received were from Paul Chabot:

Hello,

I recently read your column in the Spectroscopy issue of June 1998, which was dealing with “Linearity in Calibration”. First, I have to tell you that I really like your monthly column. You do a good job at explaining the basics and more of many topics related to chemometrics, and “demistify” the subjects.

As an avid user of PLS, I was concerned when you were comparing MLR to PLS and PCR on your synthetic data set. Even though I agree with you that in some cases, MLR is a much better approach than PLS or PCR, sometimes the use of a full spectrum technique is essential. In this particular case, I do not doubt your results showing that MLR outperforms the full spectrum techniques because the data set was designed to do so. But out of the full spectrum techniques, I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS. Did you notice any difference between PCR and PLS on this data set? I would appreciate it if you could let me know if you tried both approaches and the results you obtained so I don’t have to regenerate the data.

Thank you very much, and keep up the good work,
Paul Chabot

To summarize the comments (including ones presented during subsequent discussions, and therefore not included above):

1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors.
2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data.
3) All four responders indicated that we should have tried PLS.
4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength.
5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA.
As stated in the introduction to this chapter, we present our responses in chapters to follow.
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
31 Linearity in Calibration: Act II Scene III
In Chapter 27, we discussed a previously published paper entitled “Linearity in Calibration” [1]. In the chapter and original paper we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That chapter, when first published as an article, generated a rather active response, so we are discussing the subject and responding to the comments received in some detail, spread over several chapters. The first two parts of our response were included as Chapters 29 and 30, which refer to the papers published as [2, 3]; this Chapter 31 is the continuation of those.

We ended Chapter 30 with a summary of the comments received regarding a previous “Linearity in Calibration” paper. We therefore pick up where we left off by starting this chapter with that same summary (naturally, anyone who wishes to read the full text of the comments will have to go back and reread Chapter 30 derived from reference [3]):

1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors.
2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data.
3) All four responders indicated that we should have tried PLS.
4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength.
5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA.

In addition, each of the responders had some of their own individual comments; we discuss all these below. We now continue with our responses, and discussion of these comments:

It may surprise some to hear this, especially in light of some of the comments we make below, but we agree with the responders more than we disagree. We also believe, for example, in prescreening the data, at least as strongly as Patrick Wiegand does, and we believe his comments regarding the way all (or at least, let’s hope all) experienced chemometricians approach a problem. Indeed, fully half the book that one of us authored [4] was spent on just that point: how to “look at the data”.

However, our experience in the “real world” (as some like to call it) of instrument manufacturers has given us a somewhat different slant on the reality of what actually happens when users get hold of a new super-whiz-bang package of calculation. In many years of experience in the NIR applications department at Technicon Instruments, there was about an hour and a half available to teach both theory and practice of calibration to each group of new users; the rest of the training time was spent teaching the students how to set the instrument up, prepare samples, take reproducible readings,
and learn the rest of the mechanics needed to run the instrument, take readings, and collect the data. How much attention do you think could be paid to the finer points? This seems to be typical of what happens in the majority of cases involving novice users, and it is rare that there is anyone “back at the plant” who can pick up the ball and take them any further.

Even experienced practitioners can be misled, however. As was pointed out, real data contains various types and amounts of variations in both the X and Y variables. Furthermore, in the usual case, neither the constituent values nor the optical readings are spaced at nice, even, uniform intervals. Under such circumstances, it is extremely difficult to pick out the various effects that are operative at the different wavelengths, and even when the data analyst does examine the data, it may not always be clear which phenomena are affecting the spectra at each particular wavelength.

Now we will respond to the various comments, and make some more observations of our own. We will requote the pertinent parts of the communications from the responders, collecting together those on a similar topic and comment on them collectively. Note that some of these quotes were from later messages than those quoted in our previous column, because they were generated during subsequent discussions, and so may not have appeared previously.

We hope nobody takes our reply comments personally. Both some of the comments and some of our responses are energetic, because we seem to have touched on a subject that turned out to be somewhat controversial. So we do not take the responders’ comments personally, but we do enter with zest and gusto into what looks like something turning into a rather lively debate, and we sincerely hope that everybody can take our own comments in that same spirit.

The format of this column is as follows: each numbered section starts with the comments from the various responders dealing with a given aspect of the subject, followed by our response to them collectively. So now let us consider the various points raised, starting with the use of noise-free data:

1) “You start with a ‘perfectly noise-free spectrum’ ” (Patrick Wiegand)

“In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR. To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.” (Patrick Wiegand)

“The second problem is that that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is ‘In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?’ The answer will depend upon the levels of noise and nonlinearity and the number of wavelengths in each model.” (Richard Kramer)

“It isn’t a case of ‘extreme difficulty’. It is a situation where, in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct
for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer)

Response: Of course we used noise-free data. Otherwise we could not be sure that the effects we see are due to the characteristics we impose on the data, rather than the random effects of the noise. When anyone does an actual, physical experiment and takes real readings, the noise level or the signal-to-noise ratio is a consideration of paramount importance, and any experimenter normally takes great pains to reduce the noise as much as possible, for just that reason. Why shouldn’t we do the same in a computer experiment?

On the other hand, PCA and PLS are both known to perform better than MLR when the data is noisy because of the inherent averaging that they include. In this we agree fully; indeed, we also mentioned this characteristic in Chapter 27, as well as in the original column. Richard Kramer hit the nail on the head with his question “In what ways ...?” The important question, then, that needs to be asked (and answered) is, at what point does one phenomenon or the other become dominant, so as to control or determine which algorithm will provide a better model? The next important question is, how can we tell which phenomenon is dominant in any particular case?

Rich Kramer also had the insight to go to the next step, and realized that the only way to determine whether the nonlinearity is “small” or “large” is by having something to compare to, and the natural characteristic to compare it to is the noise. On this score we also agree with Richard and Patrick fully, and this is one place where much research is needed (there are others; and we will get to them in due course): How do you compare the systematic behavior of nonlinearity with the random behavior of noise? The standard application of the science of Statistics provides us with tools to detect systematic effects, but how do we go to the next step and ascertain their relative effects on calibration models? These are among the fundamental behavioral properties of calibrations that are not being investigated, but need to be.

There are important theoretical reasons to reduce the spectral noise when doing calibrations. Nevertheless, if the main advantage of PLS is its behavior in the presence of noisy data (as Patrick Wiegand states), that is poor praise indeed. Noise levels of modern instruments are far below those of the past. In some cases, and NIR instruments come to mind here, the noise levels are so low that they are tantamount to having “zero noise” to start with. This improvement in instrumentation is a good thing, and we sincerely doubt that anybody would recommend using a noisy instrument for the sole purpose of justifying a more sophisticated algorithm.

In any case, even if all the above statements are 100% true, it does not affect our discussion because they are beside the point. The behavior of calibration algorithms in the face of noisy data is an important topic and perhaps should be studied in depth, but it was not at issue in the “Linearity in Calibration” column.

2) “You create an excessively high degree of nonlinearity which would never be tolerated by an experienced spectroscopist.” (Patrick Wiegand)

Response: In the absence of random variation, ANY amount of nonlinearity would give the same results, and if we used less, any differences from the results we presented would be only of degree, not of kind. Any amount of nonlinearity is infinitely greater
than zero. As we explained in the original column, we deliberately chose an unrealistically large amount of nonlinearity for pedagogical purposes; what would be the point of comparing different calibration lines that the naked eye saw as equally straight? The fact that it is “unrealistically” large is immaterial.

3) “You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.” (Patrick Wiegand)

Response: Above, I described the situation as we see it, regarding the traps that both experienced and novice users of these very sophisticated algorithms can fall into. Keep in mind the pedagogy involved as well as the chemometrics: by suitable choice of values for the “constituent”, the peaks at the nonlinear wavelengths could have been made to appear equally spaced, and the linear wavelengths appear stretched out at the higher values. The “clarity” of the nonlinearity is due to the presentation, not to any fundamental property of the data, and this clarity does not normally exist in real data. How is someone to detect this, especially if not looking for it? Attempts to address this issue have been made in the past (see [5]) with results that in our opinion are mixed, at best. And that simulated data was also noise-free.

With real data, a more scientifically valid approach would be to correct the nonlinearity from physical theory. In the current case, for example, a scientifically valid approach would be to convert the data to transmission mode, subtract the stray light and reconvert to absorbance: the nonlinear wavelengths would have become linear again.
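The correction just described (convert to transmission, subtract the stray light, reconvert to absorbance) can be sketched in a few lines of Python. This is only an illustration under assumptions: the stray-light model used here, in which a fixed fraction of the incident intensity reaches the detector in both the sample and reference readings, is one common textbook form and is not necessarily the one used to generate the original synthetic data; the function names and numerical values are hypothetical.

```python
import numpy as np

def add_stray_light(absorbance, stray):
    """Absorbance that would be read if a fraction `stray` of stray light
    reaches the detector in both sample and reference measurements."""
    transmittance = 10.0 ** (-absorbance)
    return -np.log10((transmittance + stray) / (1.0 + stray))

def remove_stray_light(measured_absorbance, stray):
    """Invert the model above: absorbance -> transmission, subtract the
    stray-light contribution, then transmission -> absorbance."""
    measured_t = 10.0 ** (-measured_absorbance)
    true_t = measured_t * (1.0 + stray) - stray
    return -np.log10(true_t)

conc = np.linspace(0.2, 2.0, 10)        # hypothetical "constituent" values
true_abs = 1.0 * conc                   # ideal, linear absorbance
stray = 0.05                            # assumed 5% stray light

measured = add_stray_light(true_abs, stray)     # curved vs. concentration
corrected = remove_stray_light(measured, stray) # linear again

print(np.round(measured, 4))
print(np.round(corrected, 4))
```

Note that the sketch presupposes exactly the knowledge that the objections listed next call into question: which wavelengths are affected, and how large the stray-light fraction is.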
There are, of course, several things wrong with this procedure, all of them stemming from the fact that this data was created in a specific way for a specific purpose, not necessarily to be representative of real data:

a) You would have to know a priori that only certain wavelengths (and which ones) were subject to the “stray light” or whatever source of nonlinearity was present.

b) One of the problems of current chemometric practice is the “numbers game” aspect. No matter how soundly based in physical theory a procedure is, if the numbers it produces are not as good (whatever that might mean in a specific case) as a different, more empirical, procedure, the second procedure will be used, no matter how empirical its basis. The counterargument to that, of course, is something on the order of “Well, we have to get as good results as we can for the user” and there is a certain amount of legitimacy to this statement. However, we know of no other field of scientific study where a situation of this sort is tolerated. Certainly, every field has areas of unknown effects where not all the fundamental physical theory is available, but in all fields other than chemometrics, there are workers investigating these dark areas, to try to fill in the missing knowledge. In chemometrics, on the other hand, for at least the 22 years we have been involved with the field, all we have seen the workers in the field doing are building bigger and higher and more fanciful mathematical superstructures on foundations that few, if any of them, seem to be aware of. We will have more to say about this below.

c) The simple fact that sometimes the nature of the correct physical theory to use is unknown.

d) Finally, the real reason we presented these results the way we did was that the whole purpose of the exercise was to study the effect of this type of variation of
the data, so that simply removing it would not only be trivial, it would also be a counterproductive procedure.

4) “If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a nonlinear system. A minimum of 2 factors are required.” (Richard Kramer)

“PLS should have, in principle, rejected a portion of the nonlinear variance resulting in a better, although not completely exact, fit to the data with just 1 factor. ... The PLS does tend to reject (exclude) those portions of the x-data which do not correlate linearly to the y-block.” (Richard Kramer)

“You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.” (Patrick Wiegand)

“In principle, in the absence of noise, the PLS factor should completely reject the nonlinear data by rotating the first factor into orthogonality with the dimensions of the x-data space which are ‘spawned’ by the nonlinearity. The PLS algorithm is supposed to find the (first) factor which maximizes the linear relationship between the x-block scores and the y-block scores. So clearly, in the absence of noise, a good implementation of PLS should completely reject all of the nonlinearity and return a factor which is exactly linearly related to the y-block variances.” (Richard Kramer)

“While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra.” (Fred Cahn)

“My ‘objection’ is that you did not seem to look at the 2nd factor, which I think is needed to accurately model the spectra after the background is added.” (Fred Cahn)

“I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS.” (Paul Chabot)

Response: Yes, but: The point being that, as our conclusions indicate, this is one case where the use of latent variables is not the best approach. The fact remains that with data such as this, one wavelength can model the constituent concentration exactly, with zero error – precisely because it can avoid the regions of nonlinearity, which the PCA/PLS methods cannot do. It is not possible to model the “constituent” better than that, and even if PLS could model it just as well (a point we are not yet convinced of since it has not yet been tried – it should work for a polynomial nonlinearity but this nonlinearity is logarithmic) with one or even two factors, you still wind up with a more complicated model, something that there is no benefit to.

Richard Kramer suggested that we use two wavelengths (with the MLR approach) to see what happens. Well, here’s what happens: if the second wavelength is also on the linear absorbance band, you get a “divide by zero” error upon performing the matrix inversion due to the perfect collinearity between the data at the two wavelengths.
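Both outcomes of this two-wavelength experiment, the collinear case just described and the nonlinear-band case discussed next, can be reproduced with a small synthetic example. The sketch below does not use the original data set; the concentrations, band strengths and stray-light form of the nonlinearity are assumptions chosen only to mimic the situation, and the variable names are hypothetical. Rather than relying on an exception from a matrix inversion, the collinear case is detected through the rank of the design matrix.

```python
import numpy as np

conc = np.linspace(0.1, 1.0, 10)     # hypothetical "constituent" values
lin_a = 1.0 * conc                   # wavelength on the linear band
lin_b = 0.5 * conc                   # second wavelength on the same linear band
stray = 0.05                         # assumed stray-light fraction
# wavelength on a band made nonlinear by the assumed stray-light model
nonlin = -np.log10((10.0 ** (-1.5 * conc) + stray) / (1.0 + stray))

ones = np.ones_like(conc)            # intercept column

# Case 1: two perfectly collinear wavelengths.
# The design matrix has 3 columns but only rank 2, so X'X cannot be inverted.
x_collinear = np.column_stack([ones, lin_a, lin_b])
rank_collinear = np.linalg.matrix_rank(x_collinear)

# Case 2: one linear and one nonlinear wavelength.
# Least squares assigns the nonlinear wavelength a coefficient of zero
# (to within rounding), because the linear wavelength already fits exactly.
x_mixed = np.column_stack([ones, lin_a, nonlin])
coef, *_ = np.linalg.lstsq(x_mixed, conc, rcond=None)

print(rank_collinear)
print(coef)
```

In this sketch the first entry of `coef` is the intercept, the second belongs to the linear wavelength and the third to the nonlinear one.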
If the second wavelength is on the nonlinear band, the regression coefficient calculated for it is exactly zero (at least to 16 digits, where the computer truncation error becomes important), since it plays exactly no role in the modeling. In other words, not only is it
unnecessary to add a second wavelength to the model, it is impossible to do so if you try; when the model is perfectly correct you can’t force a second wavelength into that model even if you want to.

Richard Kramer, Patrick Wiegand, and Paul Chabot suggested that a one-factor PLS model should reject the data from the nonlinear wavelength and therefore also provide a perfect fit to the “constituent”. I offered to provide the data as an EXCEL spreadsheet to these responders; Paul accepted the offer, and I emailed the data to him. We will see the results at an appropriate stage.

5) “There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled.” (Patrick Wiegand)

Response: That indeed is a good procedure when you can do it (keeping in mind our earlier discussion regarding users’ reactions to the case of a conflict between theoretical correctness and the experimental “numbers game”), and we also make the same recommendation when appropriate. If anything, proper wavelength choice is even more important when using MLR than with either PCA or PLS. But what do you do when the “constituent” is a physical property, with no distinct absorbance band? This consideration becomes particularly pernicious when that property is not itself being calibrated for, but is a variation superimposed on the data that needs a factor (or wavelength) to compensate for it, yet has no absorbance band of its own. The prototype example of this is the “repack” effect found when the measurements are made by diffuse reflectance: “repack” does not have an absorbance band. Other situations arise where that approach fails: when the chemistry is unknown or too complicated (octane rating in gasoline, for example).
Here again, even though a fair amount is known about the chemistry behind octane rating, there is no absorbance band for “octane value”. Another case is where the chemistry is known, but the spectroscopy is unknown, because the pure material is not available. Protein, for example, cannot be extracted from wheat (or at least not and still remain protein), so the spectrum of “pure” protein as it exists in wheat is unknown. Even simpler molecules are subject to this effect: we can measure the spectrum of pure water easily enough, for example, but that is not the same spectrum as water has when it is present as an intimate mixture in a natural product, where the changes in the hydrogen bonding completely change the nature of the spectrum. And these examples are ones we know about!

6) “Finally, the calibration statistics presented in Table 27-1 show a correlation coefficient of 0.9996 for PCR, even when an obviously nonlinear region is used! I am not sure if this is significantly different from the one shown for MLR using only the linear region. To me either model would be acceptable at the stage of method development where the article ended. Besides, it is unlikely that someone would be able to know a priori that the linear region was the better region to use for MLR.” (Patrick Wiegand)

Response: As a purely practical matter, we agree with that interpretation. However, we hope that by now we have convinced you that we are trying to do more than that: we are trying to find out what really goes on inside the “black boxes” of chemometric
calculations. The fact that the value of the PCR correlation coefficient differs significantly from unity becomes clear when you look at the other term of the ANOVA equation: in the MLR case the sum-squared error is zero, in the PCR case it is “infinitely” greater than that. Don’t forget that “significance”, at least in the statistical sense, is defined only when dealing with random variables. This also relates to the earlier comment regarding how to find ways to compare the relative effects of noise and nonlinearity on calibration models.

7) “It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances.” (Patrick Wiegand)

Response: Yes, it certainly would be most interesting to investigate this question. This is closely related to the previous discussion concerning the relationship between noise and nonlinearity, so I would modify the statement of the problem to “At what point does one or another effect dominate the behavior of the calibration?” That is, where is the crossover point? Investigating questions of this sort is called “research”, and a more fundamental question arises: why isn’t anybody doing such investigations? Other, related, questions are also important: having determined this in isolation, how does the data analyst determine this in real data, where unknown amounts of several effects may be present? There is a similarity here to Richard’s earlier point regarding the relationship between the amount of noise and the amount of nonlinearity. Here are more fertile areas for research into the behavior of calibration models.
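
The contrast drawn in our response to comment 6, between a correlation coefficient that looks excellent and a sum-squared error that is anything but zero, can be sketched numerically. What follows is a minimal illustration only, not the calculation from the original column: the concentrations, the unit absorptivity, the 1% stray-light level, the stray-light formula itself and the helper names `observed_absorbance` and `fit_stats` are all assumptions made for convenience.

```python
import numpy as np

# Hypothetical concentrations; NOT the values used in the original column.
conc = np.linspace(0.1, 1.0, 15)

def observed_absorbance(true_abs, stray=0.01):
    """Assumed stray-light model: a fraction `stray` of light bypasses the sample."""
    transmittance = 10.0 ** (-true_abs)
    return -np.log10((transmittance + stray) / (1.0 + stray))

def fit_stats(x, y):
    """Fit y = b0 + b1*x by least squares; return (correlation, sum-squared error)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r = np.corrcoef(x, y)[0, 1]
    return r, float(residuals @ residuals)

linear_band = 1.0 * conc                          # obeys Beer's law exactly
nonlinear_band = observed_absorbance(1.0 * conc)  # same band, distorted by stray light

r_lin, sse_lin = fit_stats(linear_band, conc)
r_non, sse_non = fit_stats(nonlinear_band, conc)
print(f"linear band:    r = {r_lin:.8f}, SSE = {sse_lin:.3e}")
print(f"nonlinear band: r = {r_non:.8f}, SSE = {sse_non:.3e}")
```

Under these assumptions the linear band reproduces the concentrations to within rounding error, while the distorted band still returns a correlation coefficient very close to unity even though its sum-squared error is many orders of magnitude larger. The correlation coefficient alone therefore hides exactly the difference we were trying to expose.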
8) “At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelength, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration.” (Fred Cahn)

Response: We have read the indicated section of that paper [6], and scanned the rest of it. We agree with much of what it says, both in the paper and in Fred Cahn’s messages, but we are not sure we see the relevance to the column. Certainly, nonlinearities in real data can have several possible causes, both chemical (e.g., interactions that make the true concentrations of any given species different from what is expected or might be calculated solely from what was introduced into a sample, and interaction can change the underlying absorbance bands, to boot) and physical (such as the stray light that we simulated). Approximating these nonlinearities with a Taylor expansion is a risky procedure unless you know a priori what the error bound of the approximation is, but in any case it remains an approximation, not an exact solution. In the case of our simulated data, the nonlinearity was logarithmic, thus even a second-order Taylor expansion would be of limited accuracy. Alternative methods, such as correcting the nonlinearity through the application of an appropriate physical theory as we described above, may do as well or even better than a Taylor series approximation, but a rigorous theory is not always available. Even in
cases where a theory exists, often the physical conditions for which the theory is valid cannot be achieved; we demonstrated this in the discussion in Chapters 29 and 30 of the fundamental impossibility of truly achieving “Beer’s Law linearity”. Thus we are left with a situation where even in the best cases we can achieve, there can be residual nonlinearities in the data. The purpose of our column was to investigate the behavior of different modeling methods in the face of nonlinearity.

9) “Thus, my interest in 2 or more factor chemometric models of your simulation is in line with this view of chemometrics. I agree with the need for better physical understanding of instrument responses as well as of the spectra themselves. I would not choose PCR/PLS or MLR to construct such physical models, however.” (Fred Cahn)

Response: We were not trying to use the chemometric techniques to create a physical model in the column. We also agree that physical models should be created in the traditional manner, based on the study of the physical considerations of a situation. Ideally you would start from a fundamental physical law and derive, through logic and mathematics, the behavior of a particular system: this is how all other fields of science work. A chemometric technique then would be used only to ascertain the value (from a series of physical measurements) of an unknown parameter that the mathematical derivation created.

What we were trying to do in the column was to ascertain the behavior of a mathematical (not physical!) system in the face of a certain type of (simulated) physical behavior. There is nothing wrong with trying to come up with empirical methods for improving the practical performance of chemometric calibration, but one of the philosophical problems with the current state of chemometrics is that nobody is trying to do anything else, that is, to determine the fundamental behavior of these mathematical systems.
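
To put a rough number on the remark, made in our response to comment 8, that a second-order Taylor expansion of a logarithmic nonlinearity is of limited accuracy, the sketch below compares an assumed stray-light response with its own second-order expansion about zero absorbance. The stray-light formula, the 5% stray-light level and the absorbance values are our assumptions for illustration, not the settings of the original simulation; the expansion coefficients follow from differentiating the assumed formula.

```python
import numpy as np

LN10 = np.log(10.0)

def observed_absorbance(true_abs, stray):
    """Assumed stray-light model: A_obs = -log10((10**-A + s) / (1 + s))."""
    transmittance = 10.0 ** (-true_abs)
    return -np.log10((transmittance + stray) / (1.0 + stray))

def taylor2(true_abs, stray):
    """Second-order Taylor expansion of the assumed model about A = 0.

    dA_obs/dA     at A = 0 is  1 / (1 + s)
    d2A_obs/dA2   at A = 0 is -ln(10) * s / (1 + s)**2
    """
    first = 1.0 / (1.0 + stray)
    second = -LN10 * stray / (1.0 + stray) ** 2
    return first * true_abs + 0.5 * second * true_abs ** 2

stray = 0.05                                   # hypothetical 5% stray light
true_abs = np.array([0.2, 0.5, 1.0, 1.5, 2.0])  # hypothetical true absorbances
error = np.abs(observed_absorbance(true_abs, stray) - taylor2(true_abs, stray))
for a, e in zip(true_abs, error):
    print(f"A = {a:.1f}: |exact - second-order Taylor| = {e:.4f}")
```

In this sketch the expansion is good at low absorbance and degrades steadily as the absorbance rises, which is the sense in which the approximation needs a known error bound before it can be trusted.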
10) “The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength [sic] model ...” (Richard Kramer)

“... in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer)

“In your article, you appear to be creating an artificial set of circumstances: ...” (Patrick Wiegand)

“Thus your conclusion, that MLR is more capable of producing accurate models than PLS/PCR, is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced.” (Patrick Wiegand)

Response: Artificial? Contrived? Only insofar as any experimental study is based on a “contrived” set of circumstances, contrived to enable the experimenter to separate the phenomenon of interest and study its effects, with “everything else the same”. But that is a minor matter. Richard and Patrick (and how many others, who didn’t respond?) believe that we concluded that “MLR is better than PCA/PLS”. The really critical point here is that that is NOT our conclusion, and anyone who thinks this has misunderstood us. We put the fault for this on ourselves, since the one thing that is clear is that we did not explain ourselves sufficiently.
Therefore let us clarify the point here and now: we are not fighting a “holy war” against PCA/PLS etc. The purpose of the exercise was NOT to “prove that MLR with wavelength selection is better”, but to investigate and explain conditions that cause that to be so, when it happens (which it does, sometimes). As we discussed in the original column, more and more discussions about calibration processes, both oral and in the literature, describe situations where wavelength selection improved the results (in PCR and PLS as well as MLR), but there has previously been no explanation for this phenomenon. Therefore we decided to investigate nonlinearity since we suspected that to be a major consideration, and so it turned out to be. We continue our discussion in the following chapters.
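
The two-wavelength MLR experiment described in our response to comment 4 can be tried in outline with a few lines of code. This is a sketch under stated assumptions rather than the original spreadsheet calculation: the concentrations, the two absorptivities on the linear band, the stray-light formula and the 5% stray-light level are all hypothetical. We also test for collinearity through the rank of the design matrix, because a modern least-squares routine does not necessarily stop with a “divide by zero” error.

```python
import numpy as np

conc = np.linspace(0.1, 1.5, 15)   # hypothetical concentrations

def observed_absorbance(true_abs, stray=0.05):
    """Assumed stray-light model used to distort the nonlinear band."""
    transmittance = 10.0 ** (-true_abs)
    return -np.log10((transmittance + stray) / (1.0 + stray))

ones = np.ones_like(conc)
linear_1 = 1.0 * conc                        # wavelength at the peak of the linear band
linear_2 = 0.6 * conc                        # second wavelength on the same linear band
nonlinear = observed_absorbance(1.0 * conc)  # wavelength on the nonlinear band

# Case 1: both wavelengths on the linear band. The columns are perfectly
# collinear, so the design matrix is rank deficient and X'X cannot be inverted.
collinear_design = np.column_stack([ones, linear_1, linear_2])
rank = np.linalg.matrix_rank(collinear_design)
print(f"case 1: rank {rank} with {collinear_design.shape[1]} columns")

# Case 2: second wavelength on the nonlinear band. The linear wavelength already
# fits the concentrations exactly, so nothing is left for the second one.
mixed_design = np.column_stack([ones, linear_1, nonlinear])
coef, *_ = np.linalg.lstsq(mixed_design, conc, rcond=None)
print(f"case 2: coefficients (intercept, linear, nonlinear) = {coef}")
```

With these assumed data the first case reports fewer independent columns than terms in the model, and in the second case the coefficient of the nonlinear wavelength is zero to within rounding error.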
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
5. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
6. Cahn, F. and Compton, S., Applied Spectroscopy 42, 865–872 (1988).
32 Linearity in Calibration: Act II Scene IV
This chapter continues our discussion started by the responses received to our Chapter 27 when it was first published as a paper entitled “Linearity in Calibration” [1]. So far our discussion has extended over three previous chapters (29 through 31) whose original published citations are given in references [2–4]. In Chapter 31, originally referenced as [4], we stated, “we are not fighting a ‘holy war’ against PCA/PLS etc.” and then went on to discuss what our original column was really about. However, if there is a “holy war” being fought at all, then from our point of view it is against the practice of simply accepting the results of the computer’s cogitations without attempting to understand the underlying phenomena that affect the behavior of the calibration models, regardless of the algorithm used. This has been our “fight” since the beginning, which can be verified by going back and rereading our very first column ever [5].

The authors do not always agree, but we do agree on the following: it is incomprehensible how a person calling himself a scientist can fail to wonder WHY calibration models behave the way they do, and try to relate their behavior to the properties of the data giving rise to them. There are reasons for everything that happens, whether we know what those reasons are or not, and the goal of science is to determine what those underlying reasons or principles are. At least that is the goal of every other field of scientific endeavor that we are aware of; why is Chemometrics exempt?

Real data, as we have seen, is far too complicated to work with to try to obtain fundamental understanding, just as the physical world is often too complicated to study directly in toto.
Therefore work such as was presented in the “Linearity in Calibration” chapter is needed, creating a simplified system where the characteristic of interest can be isolated and studied – just as physical experiments often work with a simplified portion of the physical world for the same reason. This might be categorized as “Experimental Chemometrics”, controlling the nature of the data in a way that allows us to relate the properties of the data to the behavior of the model. Does this mimic the “real world”? No, but it does provide a window into the inner workings of the calibration calculations, and we need as many such windows as we can get. We will go so far as to make an analogy with Chemistry itself. The alchemists of old had an enormous empirical knowledge base, and from that could do all manner of useful things. But we do not consider alchemy a science, and it did not become a science until the underlying principles and phenomena were discovered and codified in a way that all could use. The current state of Chemometrics is more nearly akin to alchemy than Chemistry: we can do all manner of useful things with it, but it is all empirical and there are still many areas where even the most expert and prominent practitioners treat it as a “black box” and make no attempt to understand the inner workings of that black box.
Empiricism is important and even necessary, but hardly sufficient. The ultimate test of whether something is scientific is its ability to predict, and that does NOT mean SEP!!

The irony of the situation is that a good deal of basic knowledge is available. The field of Chemometrics bypasses all the Statistical basics and jumps right into the heavy-duty sophisticated algorithms: everybody just wants to start running before they can even crawl. We commented on this situation in earlier Chapters 29–31 and previous publications [6], and what response we received was on the order of “Why was so much space wasted before getting to the important part?” It is certainly unfortunate that the portion of the discussion that was perceived as “wasted space” was the important part, but was not recognized as such.

The early foundations of Statistics go back to the 1600s or so, to the time when probability theory was recognized as a distinct branch of mathematics. The current problem is that nobody seems to apply the knowledge gained over the intervening span of time, or to be interested in applying that knowledge, or to do fundamental investigations at all. The chemometric community completely ignores the previous mathematical basis underlying its structure. The science of Statistics does, in fact, form a firm foundation that Chemometrics is built on. It is almost shameful that the modern Chemometrics community seems to be content to build ever higher and fancier superstructures on a foundation that is solid enough, but to which it is hardly connected.

Worse, there seems to be an active antipathy to such investigations: just look at the firestorm we aroused by publishing a very small and innocuous study of the fundamental behavior of a particular data system!
In fact, from the response, you would almost think we committed heresy or attacked religious beliefs, in daring to suggest that PCR/PLS was not always the best way to go, much less do some serious research on the subject. Everybody gives lip service to the concept of “fundamental research is good for the long run”, but nobody seems interested in putting that concept into practice, even with the possibility of fairly short-term returns.

Let us look at a couple of examples. In reference [7] we found the following passage:

“But, it would be dangerous to assume that we can routinely get away with extrapolation of that kind. Sometimes it can be done, sometimes it can’t. There is no simple rule that can tell us which situation we might be facing.” (see p. 129 in [7])

And that passage seems to sum up the current state of affairs. Theoretically, a good straight line should be extrapolatable almost indefinitely, yet we all know how risky it is to extrapolate even a little bit beyond the range of our data. Why does practice not conform to theory? The obvious answer is that something is nonlinear. But why can we not detect this? As Rich says, we do not have any simple rules. Well, OK, so we do not have simple rules. Maybe no simple rules exist. But then, why do we not at least have complicated rules to help us make such important decisions? At least then we would have a way to predict (in the scientific sense) something that is worthwhile knowing. As it stands we have nothing, and nobody seems interested in finding out why.

Maybe a new approach is needed. Maybe this is where Fred Cahn’s work is pertinent: if you can approximate the nonlinearity with a Taylor series, then maybe the quality of the fit can provide a diagnostic to form the foundation of a rule on which to base a decision. Maybe something else will work. We do not know, but it is a possible starting
point. Fred, you are in the ideal position to pursue this, how about it: will you accept this challenge?

The above example, of course, is relatively abstract and “academic”, and as such perhaps not of too much interest to the majority. Another example, with more practical application, is transfer of calibration models from one instrument to another. This is an endeavor of enormous current practical importance. Witness that hardly a month passes without at least one article on that topic in one or more of the analytical or spectroscopic journals. Yet all those reports are the same: “Effect of Data Treatment ABC Combined with Algorithm XYZ Compared to Algorithm UVW” or some such; they are all completely empirical studies. In themselves there is nothing wrong with that. The problem is that there is nothing else. There are no critical reviews summarizing all this work and extracting those aspects that are common and beneficial (or common and harmful, for that matter). Even worse, there are no fundamental studies dealing with the relationship of the algorithm’s behavior to the underlying physics, chemistry, mathematics, or instrumental effects.

It is not difficult to see that the calibration transfer problem breaks down into two pieces:
a) The effect of instrumental variation on the data
b) The effect of variations of the data on the model.

Studying the effects of instrumental performance should be the province of the manufacturers. Unfortunately, the perception is that it is to their benefit to release such results only if they turn out to be “good”, and there is little incentive for them to perform studies whose only purpose is to increase scientific knowledge. Thus it is up to academia to pick up this particular ball, if there is any interest in it at all. Fundamental studies in those areas will eventually give rise to real knowledge about how and when calibrations can be transferred, and provide us with trustworthy recipes for doing the transfer.
Such knowledge will also provide us with the confidence of knowing that the underlying science is sound, and thus take us beyond the “my algorithm is better than your algorithm” stage that we are now at.

Furthermore, true fundamental understanding could also be applied in reverse. Then instrument manufacturers could concentrate on those aspects of construction and operation that affect the transferability situation, and be able to verify their capabilities in an unambiguous, scientifically valid and agreed-on manner. This is just one other example of a current problem that COULD be attacked with fundamental studies, with both short- and long-term benefits that are obvious to all.

Connecting to the statistical foundations, as described above, can have other benefits. For example, computing an SEP on a validation set of data is considered the be-all and end-all of calibration diagnostics. This is an important calculation, to be sure, but it has its limitations, as well. For example, the SEP alone has no diagnostic capability: it tells you nothing about what you need to do in order to improve a calibration model. For another, even when you compare SEPs from different models and choose the model with the smallest SEP, that does not necessarily mean you are choosing the best model. We often see “robustness” bandied about in discussions of calibration models, but what diagnostics do we have to quantify “robustness”? Without such a diagnostic, how can we expect to evaluate “robustness” either in isolation or to compare with SEP?
By focusing all our attention on the SEP we have also lost the ability to evaluate calibrations on their own. When calibrating spectrometers to do quantitative analysis, where samples are cheap and easy to come by, this loss is not too serious, but what do you do when a project requires calibration runs that cost a million (or ten million) dollars per run, and minimizing the number of runs is the absolute top priority? In such a case, you will not only not have validation data, you will likely not even have enough calibration data to do a leave-one-out calculation, and then being able to evaluate models from calibration diagnostics alone will be critical.

Statisticians have, in fact, developed diagnostic tests that provide information about such characteristics, but the Chemometric community, in our arrogance, think we know better and ignore all this prior work. The statistical community has also developed many local and semi-local diagnostic tools to help understand and improve calibration models; we really need to get back to the roots on this, as well. There are innumerable unsolved problems in Chemometrics that need to be addressed: real, scientific problems, not just new ways to throw numbers around.
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
5. Mark, H. and Workman, J., Spectroscopy 2(1), 38–39 (1987).
6. Mark, H. and Workman, J., Spectroscopy 13(4), 26–29 (1998).
7. Kramer, R., Chemometric Techniques for Quantitative Analysis (Marcel Dekker, New York, 1998).
33 Linearity in Calibration: Act II Scene V
This chapter is still a continuation of our discussion started by the responses received to Chapter 27 from our initial publication of “Linearity in Calibration” [1]. Up until now our discussion has extended over Chapters 29–32 as original paper publications ([2–5], respectively). At this point, however, we are finally getting toward the end of our obsession with considerations of linearity, at least until we receive another set of comments from our readers. Incidentally, we welcome such feedback, even those that disagree with us or with which we disagree, so please keep it coming. Indeed, it seems that we do not get much feedback unless our readers disagree with us, and feel it strongly enough to feel the need to say so. That is great: there is nothing like a little controversy to keep a book like this interesting. Who said chemometrics and statistics and mathematics were dry subjects, anyway?!

In our original column on this topic [1] we had only done a principal component analysis to compare with the MLR results. One of the comments made, and it was made by all the responders, was to ask why we did not also do a PLS analysis of the synthetic linearity data. There were a number of reasons, and we offered to send the data to any or all of the responders who would care to do the PLS analysis and report the results. Of the original responders, Paul Chabot took us up on our offer. In addition, at the 1998 International Diffuse Reflectance Conference (the “Chambersburg” meeting), Susan Foulk also offered to do the PLS analysis of this data.

Gratifyingly, when Paul and Susan reported their PLS loadings they were identical, even though they used different software packages to do the PLS calculations (PLSIQ and Unscrambler). We are certainly glad we do not have to worry about sorting out differences in software packages (due to different convergence criteria, etc., that sometimes creep into results such as these) on top of the Chemometric issues we want to address.
Figure 33-1 presents the plot of the PLS loadings. Paul and Susan each computed both loadings. Note that the first loading is indistinguishable to the eye from the first PCA loading (see our original column on this topic [1]). Paul and Susan each also computed the two calibration models and performance statistics for both models. Except that various programs did not compute the same sets of performance statistics (although in one case a different computation seemed to be given the same label as SEE), the ones that were reported by both programs had identical values. As expected by all responders, and by your hosts as well, when two-factor models (either PCR or PLS) were computed, the fit of the model to the synthetic data was perfect. Table 33-1 presents a summary of the numerical results obtained, for one-factor calibration models.

Interestingly, when comparing the calibration results we find that the reported correlation coefficients agree among the different programs using the same algorithm, but the SEE values differ appreciably; it would seem that not all programs use the same
[Figure: plot titled “PLS Loadings”; vertical axis: Loading (–0.3 to 0.2); horizontal axis: Index (0 to 300)]
Figure 33-1 PLS loadings from the synthetic data used to test the fit of models to nonlinearity. (see Colour Plate 2)
Table 33-1 Summary of results obtained from synthetic linearity data using one PCA or PLS factor. We present only those performance results listed by the data analyst as Correlation Coefficient and Standard Error of Estimate

Data analyst    Type of analysis    Corr. Coeff.    SEE
Column          PCR                 0.999622439     0.057472
Chabot          PCR                 0.999622411     0.01434417
Chabot          PLS                 0.999623691     0.01436852
Foulk           PLS                 0.999624        0.051319
definition of SEE. This leaves in question, for example, whether the value reported for SEE from PLS by Susan Foulk is really as large an improvement over the SEE for PCR reported by your columnists, or if it is due to a difference in the computation used. Since Paul Chabot reported SEE for both algorithms and his values are more nearly the same, even though his computation seems to differ from both the others, the tentative conclusion is that there is a difference in the computation. Indeed, if we multiply our own value for SEE by the square root of 4/5, we obtain 0.0514045, a value that compares to the SEE obtained by Susan Foulk in more nearly the same way that Paul Chabot’s values compare to each other. This suggests a discrepancy in the determination of the degrees of freedom used in the two algorithms.

Based on the values of the correlation coefficients, then, we find the following comparison between the two algorithms: as several of the responders indicated, the PLS model did provide improved results over the PCR model. On the other hand, the degree of improvement was not the major effect that at least some of the responders expected. As Richard Kramer expected,
“PLS should have, in principle, rejected a portion of the nonlinear variance resulting in a better, although not completely exact, fit to the data with just 1 factor.”

Some of this variance was indeed rejected by the PLS algorithm, but the amount, compared to the Principal Component algorithm, seems to have been rather minuscule, rather than providing a nearly exact fit.

The specifics of nonlinearity are not extensively discussed as a topic in the multivariate calibration literature, to say the least. Textbooks routinely cover the issues of multiple linear regression and nonlinearity, but do not cover the issue with “full-spectrum” methods such as PCR and PLS. Some discussion does exist relative to multiple linear regression, for example in Chemometrics: A Textbook by D.L. Massart et al. [6]; see Section 2.1, “Linear Regression” (pp. 167–175) and Section 2.2, “Nonlinear Regression” (pp. 175–181). The authors state,

“In general, a much larger number of parameters [wavelengths, frequencies, or factors] needs to be calculated in overlapping peak systems [some spectra or chromatograms] than in the linear regression problems.” (p. 176)

The authors describe the use of a Taylor expansion to negate the second and the higher order terms under specific mathematical conditions in order to make “any function” (i.e., our regression model) first-order (or linear). They introduce the use of the Jacobian matrix for solving nonlinear regression problems and describe the matrix mathematics in some detail (pp. 178–181).

There are also forms of nonlinear PCR and PLS where the linear PCR or PLS factors are subjected to a nonlinear transformation during singular value decomposition; the nonlinear transformation function can be varied with the nonlinearity expected within the data. These forms of PCR/PLS utilize a polynomial inner relation as spline fit functions or neural networks. References for these methods are found in [7].
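
Returning for a moment to the SEE values of Table 33-1, the rescaling by the square root of 4/5 discussed above is simple enough to check directly. The sketch below uses only the values printed in the table; the helper `see` and the degrees-of-freedom counts of 4 and 5 passed to it are hypothetical, included only to show how two divisors in that ratio would produce the factor, since we do not know what the individual programs actually divide by.

```python
import math

# Values copied from Table 33-1.
see_columnists_pcr = 0.057472
see_foulk_pls = 0.051319
see_chabot_pcr = 0.01434417
see_chabot_pls = 0.01436852

# Rescale the columnists' SEE by sqrt(4/5), as described in the text.
rescaled = see_columnists_pcr * math.sqrt(4 / 5)
print(f"rescaled columnists' SEE: {rescaled:.7f}")

# Compare how the two pairs of values relate to each other.
print(f"Foulk PLS / rescaled PCR: {see_foulk_pls / rescaled:.4f}")
print(f"Chabot PLS / Chabot PCR:  {see_chabot_pls / see_chabot_pcr:.4f}")

def see(sse, dof):
    """Standard error of estimate for a given sum-squared error and divisor."""
    return math.sqrt(sse / dof)

# Hypothetical divisors: the same SSE divided by 5 instead of 4 gives sqrt(4/5).
sse = 1.0  # arbitrary; it cancels in the ratio
ratio = see(sse, 5) / see(sse, 4)
print(f"SEE ratio for divisors 5 and 4: {ratio:.6f}")
```

Both printed ratios lie within a fraction of a percent of unity, which is consistent with the tentative conclusion that most of the apparent difference between the columnists’ and Susan Foulk’s SEE values comes from the divisor rather than from the algorithm.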
A mathematical description of the nonlinear decomposition steps in PLS is found in [8]. These methods can be used to empirically fit data for building calibration models in nonlinear systems.

The interesting point is that there are cases, such as the one demonstrated in the Linearity in Calibration chapter where nonlinearity is the dominant phenomenon, where MLR will fit the data more closely with fewer terms than either PCR or PLS. One could imagine a real case where an analyte would have a minor absorption band such that the magnitude of the spectral band is within a linear region of the measuring instrument. One could also imagine the major absorption band of this analyte is somewhat nonlinear at the higher concentration ranges. In this special case the MLR would provide a closer fit with fewer terms than either the PLS or the PCR, unless the minor band was isolated prior to model development using the PCR or PLS. This points to a continuing need for spectral band selection algorithms that can automatically search for the optimum spectral information and linear fit prior to the calibration modeling step. But, all other things remaining constant, there are still cases where MLR with an automatic channel selection feature will provide a better fit than either PCR or PLS.

Surprising indeed, to some people! In their day, Principal Components and Partial Least Squares were each considered almost as “the magic answer to all calibration problems”. It took a long time for the realization to dawn that they contain no “magic” and are subject to most of the
same problems as the algorithm previously available (at that time, what we now call MLR). Now we see a surge in other new algorithms: wavelets, neural networks, genetic algorithms, as well as the combining of techniques (e.g., selecting wavelengths before performing a PCA or PLS calculation). While some of the veterans of the “PC wars” (not “political correctness”, by the way) realize that these calibrations can be overfit just as MLR calibrations can, have become wary of the problem, and are more cautious with new algorithms, there is some evidence that a large number, perhaps the majority, of users are not nearly so careful, and are still looking for their “magic answer”. There is a generic caution that needs to be promoted, and that all users should be made aware of when dealing with these more sophisticated methods. That is the simple fact that every new parameter that can be introduced into a calibration procedure is another way to overfit and to hide the fact that it is happening. Worse, the more sophisticated the algorithm, the harder it is to see and recognize that this is going on. With PCR and PLS we introduced the extra parameter of the number of factors: one extra parameter. With wavelets we introduce the order and the locality of each wavelet: two extra parameters. With neural nets, we have the number of nodes in each layer: n extra parameters, and then there is even a metaparameter: the number of layers. No wonder reports of overfitting abound (and don’t forget: those are only the ones that are recognized)! And nary a diagnostic in sight. In a perfect world, a new algorithm would not be introduced until a corresponding set of diagnostic methods had been developed to inform the user how the algorithm was behaving. As long as we are dreaming, let us have those diagnostics be informative, in the sense that if the algorithm was misbehaving, they would point the user in the proper direction to fix it.
REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
5. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
6. Massart, D.L., Vandeginste, B.G.M., Deming, S.N., Michotte, Y. and Kaufman, L., Chemometrics: A Textbook (Elsevier Science Publishers, Amsterdam, 1988).
7. Wold, S., Kettaneh-Wold, N. and Skagerberg, B., Chemometrics and Intelligent Laboratory Systems 7, 53–65 (1989).
8. Wold, S., Chemometrics and Intelligent Laboratory Systems 14 (1992).
34 Collaborative Laboratory Studies: Part 1 – A Blueprint
We will begin by taking a look at the detailed aspects of a basic problem that confronts most analytical laboratories: comparing two quantitative methods performed by different operators or at different locations. This is an area that is not restricted to spectroscopic analysis; many of the concepts we describe here can be applied to evaluating the results from any form of chemical analysis. In our case we will examine a comparison of two standard methods to determine precision, accuracy, and systematic errors (bias) for each of the methods and laboratories involved in an analytical test. As it happens, in the case we use for our example, one of the analytical methods is spectroscopic and the other is an HPLC method. A particularly opportune event also occurred recently, almost simultaneously with our writing these next few chapters: an article [1] appeared in LC-GC, a sister magazine to Spectroscopy, that takes concepts that we discussed and described in some of our early chapters and applies them to a real-life situation (or at least a simulation of a real-life situation; the main difference is that the experiment described deals with macroscopic objects while the “real world” deals in atoms and molecules). In past chapters [2, 3] we also described how probabilistic phenomena give rise to distributions and even included computer programs to allow simulations of this, but given the constraints of time and text space, we were not able to link that to the actual behavior of the physical world nearly as well as Hinshaw does. In the case described, given the venue, the interest is in the chromatography, and for that reason we will not dwell on the application. However, we do strongly urge our readers to obtain a copy of this article and read it for its description of the basis and generation of the distributions that arise from the effects of the random behavior of the physical world.
The probabilistic and statistical experiments described are superb examples of how concepts such as these can be illustrated and brought to life. The statistical tools we describe in the next few chapters, and use for this demonstration, are ones that we have previously described. These tools include statistical hypothesis testing and ANOVA. Our previous descriptions of these topics were generic and rather general; at that time we were interested in presenting the theoretical background and reasoning behind the development of these statistical techniques. Now we will use them in a practical situation, to show how these methods can be used to evaluate various characteristics relating to the precision and accuracy of analytical methods, applying them to real data to simultaneously demonstrate how to use them and the nature of the results that can be obtained. We will use ANOVA to evaluate potential bias in reported results inherent in the analytical methods themselves, or due to the operators (i.e., location of laboratory) performing the methods. For the next series of articles all computations were completed using MathCad Worksheets [4] written by the authors. The objective of this next set of articles is to determine the precision, accuracy, and bias due to choice of analytical
method and/or operator for the determination of an analyte within a set of hypothetical production samples and spiked recovery samples (samples of gravimetrically known composition). The discussion will occupy Chapters 34–39.
EXPERIMENTAL DESIGN
The experimental design used for this hypothetical study is based on a relatively simple factorial model where individual samples are measured as shown in Figure 34-1 and Table 34-1. We have previously discussed factorial designs [5] although, as was the case with ANOVA, our previous discussion was simplified and primarily theoretical, to demonstrate the principles involved, while in the current discussion, we apply these concepts to a more realistic practical situation. For this hypothetical test, samples consist of three production run samples (i.e., Nos. 1–3) with a target analyte value of 3.60 units (percent, grams, pounds, etc.). In addition, three spiked recovery samples with target analyte levels of 3.40, 3.61, and 3.80% respectively are represented by Nos. 4–6. This experimental model allows the methods and locations (labs or operators) to be compared for precision, accuracy, and systematic errors. We will use the designation Lab 1 and Lab 2 to indicate different locations and/or operators performing the identical procedures for METHODS A and B (or I and II). Before considering the design and the analysis of it in detail, let us take a look at the factors that are being included in the design, and their impact on the experimental design and the analysis of this design: we have six samples, two methods of analysis for the constituent of interest, two laboratories, two chemists in each laboratory and five repeat readings of the constituents of each sample by each chemist. Statistical hypothesis
[Figure 34-1: tree diagram. For each sample (n = 6), the design branches by location (Lab 1, Lab 2), then by method (Method I, Method II), then by replicate (r1–r5).]
Figure 34-1 A simple factorial design for collaborative data collection. Each sample analyzed (in this hypothetical case n = 6) requires multiple labs, or operators, using both methods of analysis and replicating each measurement a number of times (r = 5 in this hypothetical case).
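As a rough illustration of the size of the design in Figure 34-1, the following Python sketch enumerates every combination of sample, location, method, and replicate. It is not part of the authors' MathCad worksheets; the variable names are assumptions made for this example, and the chemist factor mentioned in the text is left out because the figure does not show it.

```python
from itertools import product

# Factor levels taken from Figure 34-1 (names are illustrative only).
samples = range(1, 7)                 # six samples
labs = ("Lab 1", "Lab 2")             # two locations
methods = ("Method I", "Method II")   # two analytical methods
replicates = range(1, 6)              # five replicate readings

# Full factorial design: one entry per individual measurement.
design = list(product(samples, labs, methods, replicates))

print(len(design))   # total number of planned measurements
print(design[0])     # first cell of the design
```

Each tuple identifies one planned reading, so a table such as Table 34-1 has one cell per tuple, apart from any readings that were not reported.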
Table 34-1 “As reported” analytical data* for collaborative study

Sample No. – Replicate no.   Lab 1 – Method B   Lab 2 – Method B   Lab 1 – Method A   Lab 2 – Method A
1.1                          3.507              3.507              3.462              3.460
1.2                          3.463              3.497              3.442              3.443
1.3                          3.467              3.503              3.460              3.447
1.4                          3.501              3.473              3.517              —
1.5                          3.489              3.447              3.460              —
Mean                         3.485              3.485              3.468              3.450

2.1                          3.479              3.497              3.446              3.460
2.2                          3.453              3.660              3.448              3.470
2.3                          3.459              3.473              3.455              3.450
2.4                          3.461              3.447              3.456              3.460
2.5                          3.481              3.453              3.455              3.460
Mean                         3.467              3.506              3.452              3.460

3.1                          3.366              3.370              3.318              3.337
3.2                          3.362              3.327              3.330              3.317
3.3                          3.351              3.387              3.328              3.337
3.4                          3.353              3.430              3.322              3.330
3.5                          3.347              3.383              3.323              3.330
Mean                         3.356              3.379              3.324              3.330

4.1                          3.421              3.407              3.366              3.380
4.2                          3.377              3.400              3.360              3.380
4.3                          3.399              3.417              3.361              3.380
4.4                          3.379              3.353              3.362              3.380
4.5                          3.379              3.380              3.370              3.380
Mean                         3.391              3.391              3.364              3.380

5.1                          3.565              3.540              3.538              3.560
5.2                          3.568              3.550              3.539              3.580
5.3                          3.561              3.573              3.544              3.590
5.4                          3.576              3.533              3.540              3.580
5.5                          3.587              3.543              3.543              3.560
Mean                         3.571              3.548              3.541              3.570

6.1                          3.764              3.860              3.741              3.740
6.2                          3.742              3.833              3.740              3.760
6.3                          3.775              3.933              3.739              3.730
6.4                          3.767              3.870              3.742              3.770
6.5                          3.766              3.810              3.744              3.750
Mean                         3.763              3.881              3.741              3.740

* Note: For this hypothetical exercise, Samples 1–3 have a target value of 3.60% absolute; whereas Samples 4–6 are Spiked Recovery Samples with target values of 3.40 (No. 4), 3.61 (No. 5), and 3.80 (No. 6).
testing provides us with an objective method of determining whether or not a given difference in conditions (i.e., factor) has an effect on the readings. We have the following a priori expectations for the behavior of these several factors:
a) Since we know that the samples are of different composition, we expect the measurements of the constituent value to reflect this genuine difference in composition, and therefore to be systematic and constant across all other factors. Any departure from constant differences (beyond the amount expected from random variation due to unavoidable random error of the analysis, of course) can be attributed to an effect of the corresponding factor, or to blunders such as improper mixing or sampling of the material.
b) There may be an effect due to the use of two different laboratories. This effect may or may not be the same for the two different methods of analysis. This can be examined by comparing the results of measurements on the same sample by the same method in each of the two different laboratories. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined.
c) There may be an effect due to the use of two different methods of analysis. This effect may or may not be the same in the two different laboratories. There may or may not be a difference between the two chemists in each laboratory. This can be examined by comparing the results of measurements on the same sample by the two different methods of analysis. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test; if circumstances are appropriate, results from the two chemists in each laboratory and the results from the two laboratories may also be combined. Before doing so, the existence of the appropriate circumstances must first be determined.
d) There may or may not be a difference between the two chemists’ readings of the constituent values in a given laboratory. If we arbitrarily label the chemists in each laboratory as “Chemist #1” and “Chemist #2”, we would not expect a systematic difference between the corresponding chemists in the two different laboratories. This can, however, happen by coincidence. This can be examined by comparing the results of measurements on the same sample by the two different chemists in each laboratory. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined. Many of these aspects will be presented over the next several chapters.
e) We do not expect any systematic effects among the five repeat readings of each sample by each chemist in each laboratory. We do expect random variations, reflecting unavoidable random errors of measurement. These unavoidable random errors of measurement are quantified by the terms “precision” and “accuracy”.
f) We expect the precision and accuracy for each method to be the same at both laboratories. This can be examined by comparing the precision and accuracy of each method in each laboratory, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined.
g) We do not expect the precision and accuracy to be the same for the two methods except by coincidence.
h) We expect the precision and accuracy to be the same for all four chemists for each method, unless we find a difference in precision and/or accuracy between laboratories. This can be examined by comparing the precision and accuracy of each method as performed by each chemist, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined.
The use of the statistical tools of ANOVA and statistical hypothesis testing, described previously in these chapters and whose application is described in further detail below, allows separation of the effects due to the various factors and objective verification as to which ones are statistically significant. In the absence of any systematic effects due to one or more of the factors, our a priori expectation is that any differences seen are due to the effects of unavoidable random errors only, and will therefore be nonsignificant. Therefore, any statistically significant effect found due to differences between sets of readings indicates that the corresponding factor has a real, systematic effect on the readings. By posing the scientific questions about the effects of the factors in the formalism of statistical hypothesis tests [6], we gain the handle we need to extract that information from the mass of data we obtain from this simple-seeming, but (as we see) actually very complicated experimental design. Data analysis for this series was performed using MathCad, and the statistical methods used are described in greater detail in Youden’s monograph [7] and in Mark and Workman [8]. We use the MathCad worksheets both to illustrate how the theoretical concepts can be put to actual use and to demonstrate how to perform the calculations we describe.
The worksheets will be printed along with the chapters in which they are first used. At a later date we are planning to enable you to go to the Spectroscopy home page (http://www.spectroscopymag.com) and find them. If, and when, the actual URLs for the worksheets become available, we will let you know. The primary goal of this series of chapters is to describe the statistical tests required to determine the magnitude of the random (i.e., precision and accuracy) and systematic (i.e., bias) error contributions due to choosing Analytical METHODS A or B, and/or the location/operator where each standard method is performed. The statistical analysis for this series of articles consists of five main parts:
Part 1: Overall comparison of both locations and analytical methods for precision and accuracy;
Part 2: Analysis of Variance testing for both locations and analytical methods to determine if an overall bias exists for location or analytical method;
Part 3: Testing for systematic error in each method by performing a comparison test for a set of measurements versus the known True Value;
Part 4: Performing a ranking test to determine if either analytical method or location affects the results as a systematic error (bias); and
Part 5: Computing the “efficient comparison of two methods” as described by Youden and Steiner in reference [7].
The analyst may use one or more of these statistical test methods to compare analytical results depending upon individual requirements. It is recommended that the easiest
and most fruitful test for the effort expended would be the test method described in Chapter 38. This simple set of tests statistically compares precision, accuracy, and systematic error for two methods with the minimum quantity of analytical effort. Chapter 38 is recommended above Chapters 34–37, but it is useful to work through and understand the earlier chapters before proceeding to Chapter 38. The basic experimental design required for the statistical methods in Chapters 34–37 is demonstrated in Figure 34-1 and the data are presented in Table 34-1. The basic experimental design required for the Chapter 38 statistical methods is given in Figure 34-2 and the corresponding data in Table 34-2. Thus, if you would like to follow along by performing these tests on your own real data, the basic designs are demonstrated here to allow you to collect data before proceeding through the statistical methods described within the next 6 chapters.
[Figure 34-2: tree diagram. The design branches by method (Method A, Method B), then by sample (Sample X, Sample Y), then by replicate (r1–r5).]
Figure 34-2 Simple experimental design for Youden/Steiner comparison of two Methods (data shown in Table 34-2).
Table 34-2 Analytical data entry for comparison of two methods tests

         Method A                  Method B
         Sample X    Sample Y      Sample X    Sample Y
         3.366       3.741         3.421       3.764
         3.380       3.740         3.407       3.860
         3.360       3.740         3.377       3.742
         3.380       3.760         3.400       3.833
Mean     3.372       3.745         3.401       3.800
ANALYTICAL METHODS
Sample collection and handling
Let us say the first three samples tested were collected by Lab 2 from their production facility. These samples were retained from actual production lots. An aliquot from each retained jar was removed and shipped to Lab 1 in appropriate sealed containers. METHOD B testing was started at both laboratories the day following receipt of the samples to rule out any possible aging effects. METHOD A testing was performed in Lab 1 on the following day, while the METHOD A testing in Lab 2 occurred a week later. The second three samples were spiked, produced at Lab 2 using the pure analyte reagent and Control material. An aliquot of each sample was shipped to Lab 1 in appropriate sealed containers. Once again, the METHOD B testing was performed on the same day at both locations. METHOD A testing was done at both sites within a 2-day time period.
METHOD A and B analysis
All six samples at both sites were prepared the same way. Five separate aliquots from each sample were separately sampled and prepared for testing. Each aliquot was then measured three times. Conditions and standard operating procedures for METHODS A and B were carefully specified for both Labs 1 and 2.
RESULTS AND DATA ANALYSIS
Comparing all laboratories and all methods for precision and accuracy
COMPARISON OF PRECISION AND ACCURACY FOR METHODS AND LABORATORIES USING THE GRAND MEAN FOR SAMPLES No. 1–3 (Collabor_GM Worksheet), OR BY USING A SPIKED RECOVERY STUDY FOR SAMPLES No. 4–6 (Collabor_TV Worksheet)
To compute the results shown in Tables 34-3 and 34-4, the precision of each set of replicates for each sample, method, and location is individually calculated using the root mean square deviation equation as shown (Equations 34-1 and 34-2) in standard symbolic and MathCad notation, respectively. Thus the standard deviation of each set of sample replicates yields an estimate of the precision for each sample, for each method, and for each location. The precision is calculated where each $y_{ij}$ is an individual replicate (j) measurement for the ith sample; $\bar{y}_i$ is the average of the replicate measurements for the ith sample, for each method, at each location; and N is the number of replicates for each sample, method, and location. The results of these computations for these data
Table 34-3 Individual sample analysis precision for hypothetical production samples

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 1     0.020              0.025              0.0089             0.0089
Sample 2     0.013              0.088              0.0066             0.010
Sample 3     0.0079             0.037              0.0068             0.012
Pooled       0.015              0.057              0.008              0.010
Table 34-4 Individual sample analysis precision for hypothetical spiked recovery samples

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 4     0.019              0.025              0.0041             0.000
Sample 5     0.010              0.015              0.0026             0.013
Sample 6     0.012              0.047              0.0019             0.016
Pooled       0.014              0.032              0.0030             0.012
are found in Tables 34-3 and 34-4, representing samples 1–3 (hypothetical production samples) and 4–6 (hypothetical spiked samples), respectively.

$$S = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-\bar{y}_i\right)^2}{N-1}} \qquad (34\text{-}1)$$

$$S = \sqrt{\frac{\sum\overrightarrow{\left(Y-\mathrm{mean}(Y)\right)^2}}{N-1}} \qquad (34\text{-}2)$$
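The replicate standard deviation defined above can be sketched in a few lines of Python. This is only an illustration of the formula, not the authors' MathCad worksheet; the function name `replicate_sd` is an assumption made for this example, and the data are the five Lab 1, Method B replicates for Sample 1 from Table 34-1.

```python
import math

def replicate_sd(values):
    """Root mean square deviation of replicates about their own mean,
    with N - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((y - mean) ** 2 for y in values)
    return math.sqrt(sum_sq / (n - 1))

# Sample 1, Lab 1, Method B replicates from Table 34-1.
sample1_lab1_method_b = [3.507, 3.463, 3.467, 3.501, 3.489]

s = replicate_sd(sample1_lab1_method_b)
print(f"{s:.3f}")
```

For these five readings the result rounds to 0.020, which agrees with the Sample 1 entry for METHOD B, Lab 1 in Table 34-3.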
The pooled precision and accuracy for each sample for both analytical methods and locations are calculated using Equations 34-3 and 34-4, representing standard symbolic and MathCad notation, respectively. The pooled precision is calculated where each $y_{ij}$ is an individual replicate measurement for an individual sample; $\bar{y}_i$ is the average of the replicate measurements for each sample, each method, each location; and $N_i$ is the number of replicates for an individual (ith) sample, method, and location. The results of these computations for these data are found in the Pooled rows of Tables 34-3 and 34-4, representing samples 1–3 and 4–6, respectively. The results from Tables 34-3 and 34-4 indicate that there is no trend in error with respect to concentration.
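$$P_s = \sqrt{\frac{\sum_{j=1}^{N_1}\left(y_{1j}-\bar{y}_1\right)^2+\sum_{j=1}^{N_2}\left(y_{2j}-\bar{y}_2\right)^2+\sum_{j=1}^{N_3}\left(y_{3j}-\bar{y}_3\right)^2+\sum_{j=1}^{N_4}\left(y_{4j}-\bar{y}_4\right)^2}{N_1+N_2+N_3+N_4-4}} \qquad (34\text{-}3)$$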
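A minimal Python sketch of the pooling step follows. Equation 34-3 is written for four groups; the sketch assumes the same form extends to any number of groups k, with the denominator becoming the total number of readings minus k. The function name `pooled_sd` is hypothetical, and the data are the Lab 1, Method B replicates for Samples 1–3 from Table 34-1.

```python
import math

def pooled_sd(groups):
    """Pool within-group sums of squares over all groups and divide by
    (total number of readings - number of groups)."""
    sum_sq_total = 0.0
    n_total = 0
    for group in groups:
        mean = sum(group) / len(group)
        sum_sq_total += sum((y - mean) ** 2 for y in group)
        n_total += len(group)
    return math.sqrt(sum_sq_total / (n_total - len(groups)))

# Lab 1, Method B replicates for Samples 1-3 (Table 34-1).
method_b_lab1 = [
    [3.507, 3.463, 3.467, 3.501, 3.489],
    [3.479, 3.453, 3.459, 3.461, 3.481],
    [3.366, 3.362, 3.351, 3.353, 3.347],
]

ps = pooled_sd(method_b_lab1)
print(f"{ps:.4f}")
```

With these data the sketch gives roughly 0.014, which is close to, though not identical with, the pooled entry tabulated for that column; the worksheet may pool or round slightly differently.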
$$P_s = \sqrt{\frac{\sum\overrightarrow{\left(Y1-\mathrm{mean}(Y1)\right)^2}+\sum\overrightarrow{\left(Y2-\mathrm{mean}(Y2)\right)^2}+\sum\overrightarrow{\left(Y3-\mathrm{mean}(Y3)\right)^2}+\sum\overrightarrow{\left(Y4-\mathrm{mean}(Y4)\right)^2}}{N1+N2+N3+N4-4}} \qquad (34\text{-}4)$$

Table 34-5 Individual sample analysis estimated accuracy using grand mean calculation

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 1     0.025              0.029              0.029              0.029
Sample 2     0.014              0.096              0.031              0.017
Sample 3     0.012              0.051              0.037              0.024
Pooled       0.018              0.065              0.033              0.024

To compute the results shown in Table 34-5 for production samples, the accuracy of each set of replicates for each sample, method, and location was individually calculated using the root mean square deviation equation as shown in Equations 34-5 and 34-6 in standard symbolic and MathCad notation, respectively. The root mean square deviation of each set of sample replicates about the Grand Mean yields an estimate of the accuracy for each sample, for each method, and for each location. The accuracy is calculated where each $y_{ij}$ is an individual replicate measurement; $GM_i$ is the Grand Mean of the replicate measurements for the ith sample, both methods, both locations; and N is the number of replicates for each sample, method, and location. The results found in Table 34-5 represent samples 1–3. Note: Each sample had a Grand Mean computed by taking the mean for all measurements made for each of the samples 1–3.

$$S_i = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-GM_i\right)^2}{N-1}} \qquad (34\text{-}5)$$

$$S = \sqrt{\frac{\sum\overrightarrow{\left(Y-GM\right)^2}}{N-1}} \qquad (34\text{-}6)$$
To compute the results shown in Table 34-6 for the Spiked Recovery samples, the accuracy of each set of replicates for each sample, method, and location can be individually calculated using the root mean square deviation equation as shown in Equations 34-5 and 34-6 in standard symbolic and MathCad 7.0 notation, respectively. The root mean square deviation of each set of sample replicates about the true value yields an estimate of the accuracy for each sample, for each method, and for each location. The accuracy is calculated where each $y_{ij}$ is an individual replicate measurement, and the Spiked or true values (TV) are substituted for GM in Equations 34-5 and 34-6. The accuracy is calculated for each sample, each method, and each location; N is the number of replicates for each sample, method, and location. The results found in Table 34-6 represent samples 4 through 6. Note: Each sample had a True Value given by a known analyte spike into the sample.
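The two accuracy estimates differ only in the reference value, so a single helper covers both. The Python sketch below is an illustration under stated assumptions, not the Collabor_GM or Collabor_TV worksheet: the function name `rms_deviation_about` is hypothetical, the Grand Mean of 3.472 for Sample 1 is the value listed in Table 34-7, and 3.40 is the spiked value for Sample 4.

```python
import math

def rms_deviation_about(values, reference):
    """Root mean square deviation of replicates about a fixed reference
    (Grand Mean or true value), with N - 1 in the denominator."""
    n = len(values)
    sum_sq = sum((y - reference) ** 2 for y in values)
    return math.sqrt(sum_sq / (n - 1))

# Sample 1, Lab 1, Method B: accuracy about the Grand Mean (3.472).
sample1 = [3.507, 3.463, 3.467, 3.501, 3.489]
acc_gm = rms_deviation_about(sample1, 3.472)

# Sample 4, Lab 1, Method B: accuracy about the spiked true value (3.40).
sample4 = [3.421, 3.377, 3.399, 3.379, 3.379]
acc_tv = rms_deviation_about(sample4, 3.40)

print(f"{acc_gm:.3f} {acc_tv:.3f}")
```

Rounded to three decimals these come out as 0.025 and 0.022, matching the METHOD B, Lab 1 entries for Sample 1 in Table 34-5 and Sample 4 in Table 34-6.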
Table 34-6 Individual sample analysis accuracy using Spiked Recovery study

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 4     0.022              0.027              0.041              0.022
Sample 5     0.044              0.071              0.077              0.042
Sample 6     0.043              0.083              0.066              0.058
Pooled       0.038              0.065              0.063              0.043
Table 34-7 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Production samples

No.        GM      Precision   Accuracy
Sample 1   3.472   0.0231      0.0278
Sample 2   3.471   0.0479      0.0538
Sample 3   3.347   0.021       0.033
Pooled     3.430   0.033       0.040
Table 34-8 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Spiked Recovery samples

No.        TV      Precision   Accuracy
Sample 4   3.40    0.016       0.029
Sample 5   3.61    0.011       0.061
Sample 6   3.80    0.025       0.064
Pooled     3.603   0.018       0.054
The analytical results for each sample can again be pooled into a table of precision and accuracy estimates for all values reported for any individual sample. The pooled results for Tables 34-7 and 34-8 are calculated using Equations 34-1 and 34-2, where precision is the root mean square deviation of all replicate analyses for any particular sample, and where accuracy is determined as the root mean square deviation between individual results and the Grand Mean of all the individual sample results (Table 34-7) or as the root mean square deviation between individual results and the True (Spiked) value for all the individual sample results (Table 34-8). The use of spiked samples allows a better comparison of precision to accuracy, as the spiked samples include the effects of systematic errors, whereas use of the Grand Mean averages the systematic errors across methods and shifts the apparent true value to include the systematic error. Table 34-8 yields a better estimate of the true precision and accuracy for the methods tested. A simple statistical test for the presence of systematic errors can be computed using data collected as in the experimental design shown in Figure 34-2. (This method is demonstrated in the Measuring Precision without Duplicates sections of the MathCad Worksheets Collabor_GM and Collabor_TV found in Chapter 39.) The results of this test are shown in Tables 34-9 and 34-10. A systematic error is indicated by the test using
Table 34-9 Statistical test for the presence of systematic errors (using samples 1 and 2 only)

F test for bias   F critical for bias
16.53             9.27

Table 34-10 Statistical test for the presence of systematic errors (using samples 4 and 5 only)

F test for bias   F critical for bias
2.261             9.277
Samples 1 and 2, but not for Samples 4 and 5. This indicates that the difference between precision and accuracy is large enough to indicate a bias inherent within the analytical method(s). Since these are the same methods and locations tested, further evaluation is required to determine if a bias actually exists.
REFERENCES
1. Hinshaw, J.V., LC-GC 17(7), 616–625 (1999).
2. Mark, H. and Workman, J., Spectroscopy 2(2), 60–64 (1987).
3. Workman, J. and Mark, H., Spectroscopy 2(6), 58–60 (1987).
4. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0; (1997).
5. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
6. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
7. Youden, W. J. and Steiner, E. H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
8. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
35 Collaborative Laboratory Studies: Part 2 – using ANOVA
In this chapter we describe the use of ANOVA in collaborative study work.
ANOVA TEST COMPARISONS FOR LABORATORIES AND METHODS (ANOVA_s4 WORKSHEET)
Analysis of Variance (ANOVA) is a useful tool for comparing sets of analytical results to determine whether there is a statistically meaningful difference between results for a sample analyzed by different methods, or at different locations by different analysts. The reader is referred to reference [1] and other basic books on statistical methods for discussions of the theory and applications of ANOVA; examples of such texts are [2, 3]. Table 35-1 illustrates the ANOVA results for each individual sample in our hypothetical study. This test indicates whether any of the reported results from the analytical methods or locations is significantly different from the others. From the table it can be observed that statistically significant variation in the reported analytical results is to be expected based on these data. However, there is no apparent pattern in the method or location most often varying from the others. Thus, this statistical test is inconclusive and further investigation is warranted.
Table 35-1 ANOVA: comparing methods and laboratories

No.        F test for bias   F critical for bias   Difference                                                                   Bias
Sample 1   1.81              3.34                  —                                                                            No
Sample 2   1.21              3.34                  —                                                                            No
Sample 3   6.89              3.34                  METHOD B–LAB 1 + METHOD B–LAB 2 vs. METHOD A–LAB 1 + METHOD A–LAB 2          Yes
Sample 4   3.28              3.24                  METHOD A–LAB 1                                                               Yes
Sample 5   10.52             3.24                  METHOD B–LAB 1 + METHOD A–LAB 2 vs. METHOD B–LAB 2 + METHOD A–LAB 1          Yes
Sample 6   24.10             3.24                  METHOD B–LAB 2                                                               Yes
ANOVA test comparisons (using ANOVA_s2 worksheet)
Table 35-2 shows the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD B analytical procedure for analysis. This statistical test indicates that for the higher concentration spiked samples (i.e., 5 and 6 at 3.61 and 3.80% levels, respectively) a significant difference in reported average values occurred. However, Lab 1 was higher for Sample No. 5 and lower for Sample No. 6, so there is no apparent trend in the analytical results reported by the two labs and no indication of a systematic difference between labs using METHOD B. Table 35-3 illustrates the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD A for analysis. This statistical test indicates that for the two lower concentration spiked samples (i.e., 4 and 5 at 3.40 and 3.61% levels, respectively) a significant difference in reported average values occurred. However, this trend did not continue for the highest concentration sample (i.e., Sample No. 6) with a concentration of 3.80%. Lab 1 was slightly lower in reported value for Samples 4 and 5. There is no significant systematic error observed between laboratories using METHOD A. Table 35-4 reports ANOVA comparing the METHOD B procedure to the METHOD A procedure for combined laboratories. Thus the combined METHOD B analyses for each sample were compared to the combined METHOD A analyses for the same sample. This statistical test indicates whether there is a significant bias in the reported results for each method, irrespective of operator or location. An apparent trend is indicated using this statistical analysis, that trend being a positive bias for METHOD B as compared to
Table 35-2 ANOVA: comparing laboratories for METHOD B (Lab 1 vs. Lab 2)

No.        Method     F test for bias   F critical for bias   Difference   Bias
Sample 1   METHOD B   0                 5.32                  —            No
Sample 2   METHOD B   0.98              5.32                  —            No
Sample 3   METHOD B   1.99              5.32                  —            No
Sample 4   METHOD B   0.0008            5.32                  —            No
Sample 5   METHOD B   8.14              5.32                  0.024        Yes
Sample 6   METHOD B   20.91             5.32                  −0.098       Yes
Table 35-3 ANOVA: comparing laboratories for METHOD A spectrophotometry (Lab 1 vs. Lab 2)

No.        Method     F test for bias   F critical for bias   Difference   Bias
Sample 1   METHOD A   1.10              5.99                  —            No
Sample 2   METHOD A   2.52              5.99                  —            No
Sample 3   METHOD A   1.18              5.99                  —            No
Sample 4   METHOD A   76.3              5.32                  −0.016       Yes
Sample 5   METHOD A   29.52             5.32                  −0.029       Yes
Sample 6   METHOD A   1.53              5.32                  —            No
Table 35-4 ANOVA: comparing methods for combined laboratories and operators, all Method B vs. all Method A

No.        Method comparison         F test for bias   F critical for bias   Difference   Bias
Sample 1   METHOD B vs. METHOD A     5.05              4.49                  0.024        Yes
Sample 2   METHOD B vs. METHOD A     1.93              4.49                  —            No
Sample 3   METHOD B vs. METHOD A     15.9              4.49                  0.041        Yes
Sample 4   METHOD B vs. METHOD A     7.06              4.41                  0.019        Yes
Sample 5   METHOD B vs. METHOD A     0.07              4.41                  —            No
Sample 6   METHOD B vs. METHOD A     11.44             4.41                  0.066        Yes
METHOD A. Thus METHOD B would be expected to report a higher level of analyte than METHOD A.
REFERENCES 1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991). 2. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981). 3. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
36 Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error
TESTING FOR SYSTEMATIC ERROR IN A METHOD: COMPARISON TEST FOR A SET OF MEASUREMENTS VERSUS TRUE VALUE – SPIKED RECOVERY METHOD (COMPARET WORKSHEET)

The Student's (W.S. Gosset) t-test is useful for comparisons of the means and standard deviations of different analytical test methods. Descriptions of the theory and use of this statistic are readily available in standard statistical texts, including those in the references [1–6]. Use of this test will indicate whether the differences between a set of measurements and the true (known) value for those measurements are statistically meaningful.

For Table 36-1, the METHOD B test results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD B results are lower than the known analyte values for Sample No. 5 (Lab 1 and Lab 2) and Sample No. 6 (Lab 1). The METHOD B reported value is higher for Sample No. 6 (Lab 2). Average results for this test indicate that METHOD B may result in analytical values trending lower than actual values.

For Table 36-2, the METHOD A results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD A results are lower than the known analyte values for Sample Nos. 4–6 for both Lab 1 and Lab 2. Average results for this test indicate that METHOD A is consistently lower than actual values.
Table 36-1 Comparison of METHOD B test results to true value

No.        Method–Location    t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD B–LAB 1     1.06              2.776                 —            No
Sample 4   METHOD B–LAB 2     0.76              2.776                 —            No
Sample 5   METHOD B–LAB 1     8.37              2.776                 −0.038       Yes
Sample 5   METHOD B–LAB 2     9.06              2.776                 −0.062       Yes
Sample 6   METHOD B–LAB 1     6.73              2.776                 −0.037       Yes
Sample 6   METHOD B–LAB 2     2.94              2.776                 0.061        Yes
Table 36-2 Comparison of METHOD A results to true value

No.        Method–Location    t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD A–LAB 1     19.52             2.776                 −0.036       Yes
Sample 4   METHOD A–LAB 2     9.0               2.776                 −0.018       Yes
Sample 5   METHOD A–LAB 1     59.8              2.776                 −0.069       Yes
Sample 5   METHOD A–LAB 2     6.0               2.776                 −0.036       Yes
Sample 6   METHOD A–LAB 1     68.4              2.776                 −0.058       Yes
Sample 6   METHOD A–LAB 2     7.07              2.776                 −0.050       Yes
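The comparison of a set of replicate measurements with a known spiked value can be sketched in a few lines of Python. This is an illustrative sketch, not the CompareT worksheet itself: the function name `t_vs_true_value` is hypothetical, the five replicates are the rounded X01 values printed in the Collabor_TV worksheet of Chapter 39 (true value 3.40), and the critical value 2.776 is taken from the tables above rather than computed. Because the printed replicates are rounded, the statistic need not reproduce the tabulated value exactly.

```python
import math
import statistics


def t_vs_true_value(measurements, true_value):
    """Return (t statistic, mean difference) for replicates vs. a known value."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)   # sample standard deviation (n - 1 divisor)
    sem = sd / math.sqrt(n)               # standard deviation of the mean
    return (mean - true_value) / sem, mean - true_value


# Rounded replicate results for one spiked sample (assumed from the worksheet listing).
replicates = [3.42, 3.38, 3.40, 3.38, 3.38]
t_stat, diff = t_vs_true_value(replicates, 3.40)

T_CRITICAL = 2.776                        # tabulated two-tailed 5% value, 4 degrees of freedom
biased = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, difference = {diff:+.3f}, bias: {'Yes' if biased else 'No'}")
```

A bias is flagged only when the absolute value of the statistic exceeds the critical value, which mirrors the Yes/No column of the tables.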
REFERENCES
1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0 (1997).
2. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
3. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
4. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
5. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
6. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).
37 Collaborative Laboratory Studies: Part 4 – Ranking Test
RANKING TEST FOR LABORATORIES AND METHODS (MANUAL COMPUTATIONS)

The ranking test for laboratories provides for the calculation of individual ranks for each laboratory or method using the averaged results collected for all replicates and all methods/locations. The summary of averaged analytical results discussed in this series is shown in Table 37-1a. These compiled results are assigned ranks by column from the largest to the smallest reported analytical values. The largest analytical result in each column receives a score of 1, whereas the smallest result receives the largest number. When two results in a column are identical, 0.5 is added to the rank number, and the subsequent number is not used. Note column 1 in Table 37-1a: both row 1 and row 2 have the identical value of 3.485 and are assigned 1.5 as rank score values. Rank 2 is not used because of the tie, and the lower analytical results are given ranks 3 and 4, respectively. The rows are summed, resulting in the rank score shown as column #8 of Table 37-1b.

Table 37-1a Results table for ranking test

                       Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
L1: METHOD B–LAB 1     3.485      3.467      3.356      3.391      3.571      3.763
L2: METHOD B–LAB 2     3.485      3.506      3.379      3.391      3.548      3.861
L3: METHOD A–LAB 1     3.468      3.542      3.324      3.364      3.541      3.741
L4: METHOD A–LAB 2     3.450      3.460      3.330      3.380      3.570      3.740
Table 37-1b Ranked results table

                       Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Score∗
L1: METHOD B–LAB 1     1.5        2          2          1.5        1          2          10
L2: METHOD B–LAB 2     1.5        1          1          1.5        3          1          9
L3: METHOD A–LAB 1     3          3          4          4          4          3          21
L4: METHOD A–LAB 2     4          4          3          3          2          4          20

∗ If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that there is a pronounced systematic error present between the laboratory, or laboratories, with the extreme score. In this particular case the limits are 8–22.
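The ranking rule (largest value receives rank 1, tied values share the average of the ranks they span) can be sketched as follows. The function name `rank_descending` is hypothetical; the two columns and the ranked values are copied from Tables 37-1a and 37-1b.

```python
def rank_descending(values):
    """Rank from largest (1) to smallest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2   # average of the rank positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


sample_1 = [3.485, 3.485, 3.468, 3.450]   # rows L1..L4, Sample 1 column
sample_4 = [3.391, 3.391, 3.364, 3.380]   # rows L1..L4, Sample 4 column
ranks_1 = rank_descending(sample_1)
ranks_4 = rank_descending(sample_4)

# Row scores are the sums of the ranks across samples (ranks as printed in Table 37-1b).
ranked_columns = [
    [1.5, 1.5, 3, 4],
    [2, 1, 3, 4],
    [2, 1, 4, 3],
    [1.5, 1.5, 4, 3],
    [1, 3, 4, 2],
    [2, 1, 3, 4],
]
scores = [sum(column[row] for column in ranked_columns) for row in range(4)]
print(ranks_1, ranks_4, scores)
```

Averaging the rank positions within a tie is what produces the 1.5 entries and leaves rank 2 unused.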
Table 37-1c Approximate 5% two-tail limits for laboratory ranking scores (from Ref. [1])

No. of             Number of samples
locations/tests    3       4       5       6       7       8       9       10
3                  —       4–12    5–15    7–17    8–20    10–22   12–24   13–27
4                  —       4–16    6–19    8–22    10–25   12–28   14–31   16–34
5                  —       5–19    7–23    9–27    11–31   13–35   16–38   18–42
6                  3–18    5–23    7–28    10–32   12–37   15–41   18–45   21–49
7                  3–21    5–27    8–32    11–37   14–42   17–47   20–52   23–57
8                  3–24    6–30    9–36    12–42   15–48   18–54   22–59   25–65
9                  3–27    6–34    9–41    13–47   16–54   20–60   24–66   27–73
10                 4–29    7–37    10–45   14–52   17–60   21–67   26–73   30–80
The score values are compared to a statistical table of values found in reference [1]. This table is partially reproduced as Table 37-1c. If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that there is a pronounced systematic error present between the laboratory, or laboratories, with the extreme score. In this particular case the limits are 8 to 22; therefore there is no significant systematic error in the methods as determined using this test.
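The limit check can be sketched as a small lookup. Only the four-location entries for 4 to 7 samples are copied from the limits table; the dictionary layout and the function name `flag_extreme_scores` are illustrative assumptions.

```python
# (number of locations, number of samples) -> (lower limit, upper limit)
LIMITS = {
    (4, 4): (4, 16),
    (4, 5): (6, 19),
    (4, 6): (8, 22),
    (4, 7): (10, 25),
}


def flag_extreme_scores(scores, n_locations, n_samples):
    """Mark each score that is equal to or outside the tabulated limits."""
    low, high = LIMITS[(n_locations, n_samples)]
    return {name: (score <= low or score >= high) for name, score in scores.items()}


scores = {"L1": 10, "L2": 9, "L3": 21, "L4": 20}   # rank scores from the ranked results table
flags = flag_extreme_scores(scores, n_locations=4, n_samples=6)
print(flags)
```

With four locations and six samples the limits are 8 and 22, so none of the four scores is flagged.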
REFERENCE 1. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
38 Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods
COMPUTATIONS FOR EFFICIENT COMPARISON OF TWO METHODS (COMP_METH WORKSHEET)

The section following shows a statistical test (text for the Comp_Meth MathCad Worksheet) for the efficient comparison of two analytical methods. This test requires that replicate measurements be made on two different samples using two different analytical methods. The test will determine whether there is a significant difference in the precision and accuracy for the two methods. It will also determine whether there is significant systematic error between the methods, and calculate the magnitude of that error (as bias). This efficient statistical test requires the minimum data collection and analysis for the comparison of two methods. The experimental design for data collection has been shown graphically in Chapter 35 (Figure 35-2), with the numerical data for this test given in Table 38-1. Two methods are used to analyze two different samples, with approximately five replicate measurements per sample as shown graphically in the previously mentioned figure.

The analytical results can immediately be plotted using the Youden/Steiner two-sample graphic shown in Figure 38-1. This graphic gives a rapid method for visually determining whether the reported analytical values contain systematic error. The presence of systematic error is indicated by two-sample plot points that fall in the lower left and upper right quadrants of the charts. Points in these quadrants indicate that high analyte value samples are biased to the high end, and low analyte containing samples are biased to the low end. Analytical methods not exhibiting systematic (bias) errors should have two-sample plot points randomly distributed throughout all the quadrants of the chart. Figure 38-1 gives an indication that METHOD A has a negative bias and that METHOD B is more random. However, the range of the axes is much lower for Method A, indicating that the overall bias is quite small, and significantly less than that of Method B.

The calculations for the efficient two-method comparison are shown in Table 38-2 and the equations that follow it. The mathematical expressions are given in MathCad symbolic notation, showing that the difference is taken for each replicate set of X and Y and the mean is computed. Then the sum for each replicate set of X and Y is calculated and the mean is computed. The difference in the sums is computed (as d) and the differences are summed and reported as an absolute value (as Σd). The mean difference is calculated as mean(d). Each X and Y result contains the systematic error of the analytical method for its respective laboratory, noting that the systematic error is assumed to be identical for
Table 38-1 Analytical data entry for comparison of two methods tests

         METHOD A                METHOD B
         Sample X   Sample Y     Sample X   Sample Y
         3.366      3.741        3.421      3.764
         3.380      3.740        3.407      3.860
         3.360      3.740        3.377      3.742
         3.380      3.760        3.400      3.833
Mean     3.372      3.745        3.401      3.800
Figure 38-1 Two-sample charts illustrating systematic errors for Methods A vs. B. [One panel per method. METHOD A: AY plotted against AX, with mean(AX) and mean(AY) marked. METHOD B: BY plotted against BX, with mean(BX) and mean(BY) marked.]
Table 38-2 Calculations for comparison tests

METHOD A:
ADxy := |AX − AY|      mean(ADxy) = 0.374
ATxy := (AX + AY)      mean(ATxy) = 7.117

METHOD B:
BDxy := |BX − BY|      mean(BDxy) = 0.399
BTxy := (BX + BY)      mean(BTxy) = 7.201

d := |ATxy − BTxy|     Σd = 0.337
Mean Difference: mean(d) = 0.084
d2 := BTxy − ATxy
X and Y for each method. When the difference between X and Y is calculated (as d) the systematic error drops out so that the difference (d) between X and Y contains no systematic errors, only random errors. We then estimate the precision by using the difference quantities. The difference between the true analyte concentrations of X and Y
represents the true analyte difference between X and Y without the systematic error, but
with the random errors. The relative precision between the two methods is calculated using Table 38-2 and equations 38-1 and 38-2. The F-statistic used to compare the sizes of the Method A vs. Method B precision values is given by equation 38-5 and is compared to the F-statistic table value (equation 38-7). The null hypothesis (Ho) states that there is no difference in the precision of the two methods, whereas the alternate hypothesis (Ha) indicates that there is a difference in the precision. For the methods compared in this study the precision value for METHOD B is significantly larger than that for METHOD A (i.e., METHOD B is less precise). Method A precision is 0.007, whereas Method B precision is 0.037, representing a factor of 5.3 increase.

When summing the X and Y values, the systematic contribution is found twice. The two used in the denominator is indicative of the error contribution from each independent set of results (i.e., X and Y). Given independent random errors only, the standard deviation of the sum of two measurements X and Y would be identical to the standard deviation of the differences between the two measurements X and Y. In the absence of any systematic error, Sr² and Sd² estimate the same quantity. In the presence of systematic error, Sd² is large compared to Sr². The larger the Sd², the greater is the systematic error contribution. The relative systematic error between the two methods is calculated using Table 38-2 and equations 38-3 and 38-4. The F-statistic used to compare the sizes of the Method A vs. Method B systematic error values is given by equation 38-6 and is compared to the F-statistic table value (equation 38-7). The null hypothesis (Ho) states that there is no difference in the systematic error found in the two methods, whereas the alternate hypothesis (Ha) indicates that there is a difference in the size of the systematic error. For the methods compared in this study there is a significantly larger systematic error for METHOD B as compared to METHOD A.

The test to determine whether the bias is significant incorporates the Student's t-test. The method for calculating the t-test statistic is shown in equation 38-10 using MathCad symbolic notation. Equation 38-8 is used to calculate the standard deviation of the differences between the sums of X and Y for the two analytical methods A and B, whereas equation 38-9 is used to calculate the standard deviation of the mean. The t-table statistic for comparison with the test statistic is given in equations 38-11 and 38-12. The F-statistic and t-statistic tables can be found in standard statistical texts such as references [1–3]. The null hypothesis (Ho) states that there is no systematic difference between the two methods, whereas the alternate hypothesis (Ha) states that there is a significant systematic difference between the methods. It can be seen from these results that the bias is significant between these two methods and that METHOD B has results biased by 0.084 above the results obtained by METHOD A. The estimated bias is given by the Mean Difference calculation.
Measuring the Precision and Standard Deviation of the Methods (Youden/Steiner)

Note that for the calculations of precision and standard deviation (equations 38-1 through 38-4), the denominator expression is given as 2(n − 1). This expression is used because of the two-fold error contribution from the independent errors found in each independent set of results (i.e., X and Y).
Precision (Sr)

\[ ASr := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( ADxy - \mathrm{mean}(ADxy) \right)^{2}} \tag{38-1} \]

\[ BSr := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( BDxy - \mathrm{mean}(BDxy) \right)^{2}} \tag{38-2} \]

ASr = 6.692658 · 10^-3
BSr = 0.037334

Standard deviation (Sd)

\[ ASd := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( ATxy - \mathrm{mean}(ATxy) \right)^{2}} \tag{38-3} \]

\[ BSd := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( BTxy - \mathrm{mean}(BTxy) \right)^{2}} \tag{38-4} \]

ASd = 0.012428
BSd = 0.045387

F statistic calculation (Fs) for precision ratio

Sr² ratio:

\[ PFs := \frac{BSr^{2}}{ASr^{2}} \tag{38-5} \]

PFs = 31.118

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in Precision estimation.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in Precision estimation.

F statistic calculation (Fs) for presence of systematic errors

Sd² ratio:

\[ SFs := \frac{BSd^{2}}{ASd^{2}} \tag{38-6} \]

SFs = 13.337
Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in systematic error.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in systematic error.

F statistic table value (Ft)

\[ df1 := nY - 1, \qquad df1 = 3, \qquad qF(0.95,\ df1,\ df1) = 9.277 \tag{38-7} \]

Student's t-test for the difference in the biases between two methods

Mean Difference: mean(d) = 0.084

\[ s := \sqrt{\frac{1}{df1} \sum \left( d2 - \mathrm{mean}(d) \right)^{2}} \tag{38-8} \]

s = 0.053

\[ sm := \frac{s}{\sqrt{nY}} \tag{38-9} \]

sm = 0.026

Calculate t-test statistic:

\[ Te := \frac{\mathrm{mean}(d)}{sm} \tag{38-10} \]

Te = 3.201

Enter alpha value as α2: α2 := .95

Calculate t-table value:

\[ \alpha_1 := \frac{\alpha_2 + 1}{2} \tag{38-11} \]

α1 = 0.975

\[ t := qt(\alpha_1,\ df1) \tag{38-12} \]

t-table value: t = 3.182
Ho: If Te is less than or equal to the t-table value, then there is NO SYSTEMATIC DIFFERENCE between method results.
Ha: If Te is greater than t, then there is a SYSTEMATIC DIFFERENCE (BIAS) between method results.
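The calculations of this chapter can be retraced with a short Python sketch. It is not the Comp_Meth worksheet: the helper names (`youden_sd`, `method_stats`) are hypothetical, the replicate data are the four rows of Table 38-1, and the two table values (9.277 and 3.182) are copied from the text rather than computed from the F and t distributions.

```python
import math

# Replicate results from Table 38-1 (four replicates per sample and method).
AX = [3.366, 3.380, 3.360, 3.380]
AY = [3.741, 3.740, 3.740, 3.760]
BX = [3.421, 3.407, 3.377, 3.400]
BY = [3.764, 3.860, 3.742, 3.833]


def mean(values):
    return sum(values) / len(values)


def youden_sd(values):
    """sqrt(sum((v - mean)^2) / (2 (n - 1))), the form used for Sr and Sd."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (2 * (len(values) - 1)))


def method_stats(x, y):
    """Return (Sr, Sd, totals) from the X - Y differences and X + Y sums."""
    diffs = [a - b for a, b in zip(x, y)]
    totals = [a + b for a, b in zip(x, y)]
    return youden_sd(diffs), youden_sd(totals), totals


a_sr, a_sd, a_tot = method_stats(AX, AY)
b_sr, b_sd, b_tot = method_stats(BX, BY)

precision_f = b_sr ** 2 / a_sr ** 2       # precision ratio
systematic_f = b_sd ** 2 / a_sd ** 2      # systematic error ratio

# Bias between methods: differences of the sums, B minus A.
d2 = [b - a for a, b in zip(a_tot, b_tot)]
n = len(d2)
mean_d = mean(d2)
s = math.sqrt(sum((v - mean_d) ** 2 for v in d2) / (n - 1))
s_m = s / math.sqrt(n)                    # standard deviation of the mean
t_stat = mean_d / s_m

F_TABLE = 9.277                           # table value quoted in the text (3 and 3 df)
T_TABLE = 3.182                           # table value quoted in the text (3 df)
print(f"ASr={a_sr:.6f} BSr={b_sr:.6f} ASd={a_sd:.6f} BSd={b_sd:.6f}")
print(f"PFs={precision_f:.3f} SFs={systematic_f:.3f} Te={t_stat:.3f}")
print("precision differs:", precision_f > F_TABLE,
      "| systematic error differs:", systematic_f > F_TABLE,
      "| bias significant:", t_stat > T_TABLE)
```

The sign of the X − Y differences does not affect Sr, because only deviations from their mean enter the sum of squares.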
SUMMARY

This set of articles presents the computational details and actual values for each of the statistical methods shown for collaborative tests. These methods include the use of precision and estimated accuracy comparisons, ANOVA tests, Student's t-testing, the Rank Test for Method Comparison, and the Efficient Comparison of Methods tests. From using these statistical tests the following conclusions can be derived:

1. Both analytical methods are quite precise and accurate, therefore the production samples are below target value concentration.
2. The precision for METHOD B is significantly larger than METHOD A, indicating METHOD A is more precise than METHOD B.
3. There is no correlation of analytical error with concentration over the range tested (i.e., 3.40–3.80% analyte).
4. Analytical results comparing METHOD B and METHOD A will show significant variation due to the high precision of both analytical methods.
5. There is no operator/laboratory bias between labs for METHOD B.
6. There is no operator/laboratory bias between labs for METHOD A.
7. There is a significant bias between METHOD B and METHOD A; METHOD B yields higher results.
8. Both METHOD B and METHOD A results trend lower than actual values, but by small quantities (approximately −0.04% at the target value of 3.60%).
9. The laboratory ranking test did not show any laboratory or method outside of confidence limits, therefore neither method nor laboratory is consistently high or low in reported results.
10. METHOD B precision is a factor of 5.3 times greater than that of METHOD A.
11. The systematic error contribution is larger for METHOD B than METHOD A.
12. METHOD B is biased to +0.084 as compared to METHOD A.
ACKNOWLEDGEMENT

The real analytical data used for Chapters 34–38 was graciously provided by Dan Devine of Kimberly-Clark Analytical Science & Technology.
REFERENCES 1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142 (1997). 2. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995). 3. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
39 Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text
The MathCad worksheets used for this Chemometrics in Spectroscopy collaborative study series are given below in hard copy format. Unless otherwise noted, the worksheets have been written by the authors. The text files for the MathCad v7.0 Worksheets used for the statistical tests in this report are attached as Collabor_GM, Collabor_TV, ANOVA_s4, ANOVA_s2, CompareT, and Comp_Meth. References [1–11] are excellent sources of information of the details on these statistical methods.

Collabor_GM

Collaborative Test Worksheet

RAW DATA ENTRY:

X01: 3.51 3.46 3.47 3.50 3.49     X05: 3.48 3.45 3.46 3.46 3.48     X09: 3.37 3.36 3.35 3.35 3.35
X02: 3.51 3.50 3.50 3.47 3.45     X06: 3.50 3.66 3.47 3.45 3.45     X10: 3.37 3.33 3.39 3.43 3.38
X03: 3.46 3.44 3.46 3.52 3.46     X07: 3.45 3.45 3.46 3.46 3.46     X11: 3.32 3.33 3.33 3.32 3.32
X04: 3.46 3.44 3.45               X08: 3.46 3.47 3.45               X12: 3.34 3.32 3.34

Mean values for Data:

n01 := rows(X01)   mean(X01) = 3.485
n02 := rows(X02)   mean(X02) = 3.485
n03 := rows(X03)   mean(X03) = 3.468
n04 := rows(X04)   mean(X04) = 3.45

n05 := rows(X05)   mean(X05) = 3.467
n06 := rows(X06)   mean(X06) = 3.506
n07 := rows(X07)   mean(X07) = 3.452
n08 := rows(X08)   mean(X08) = 3.46

n09 := rows(X09)   mean(X09) = 3.356
n10 := rows(X10)   mean(X10) = 3.379
n11 := rows(X11)   mean(X11) = 3.324
n12 := rows(X12)   mean(X12) = 3.3303

GRAND MEANS FOR EACH ROW (USE IF NO "TRUE VALUE" IS AVAILABLE):

\[ GM1 := \frac{\mathrm{mean}(X01) + \mathrm{mean}(X02) + \mathrm{mean}(X03) + \mathrm{mean}(X04)}{4} \]

\[ GM2 := \frac{\mathrm{mean}(X05) + \mathrm{mean}(X06) + \mathrm{mean}(X07) + \mathrm{mean}(X08)}{4} \]

\[ GM3 := \frac{\mathrm{mean}(X09) + \mathrm{mean}(X10) + \mathrm{mean}(X11) + \mathrm{mean}(X12)}{4} \]

GRAND MEANS FOR EACH ROW:

GM1 = 3.472   GM2 = 3.47115   GM3 = 3.347433
COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision: the same expression is applied to each data set Xk (k = 01 to 12), with nk := rows(Xk):

\[ SDp(Xk) := \sqrt{\frac{1}{nk - 1} \sum \left( Xk - \mathrm{mean}(Xk) \right)^{2}} \]

SDp(X01) = 0.02             SDp(X02) = 0.025
SDp(X03) = 8.888 · 10^-3    SDp(X04) = 8.888 · 10^-3
SDp(X05) = 0.013            SDp(X06) = 0.088
SDp(X07) = 6.557 · 10^-3    SDp(X08) = 0.01
SDp(X09) = 7.918 · 10^-3    SDp(X10) = 0.037
SDp(X11) = 6.812 · 10^-3    SDp(X12) = 0.012

Accuracy: the deviations are taken from the grand mean of the row, GM1 for X01 to X04, GM2 for X05 to X08, and GM3 for X09 to X12:

\[ SDa(Xk) := \sqrt{\frac{1}{nk - 1} \sum \left( Xk - GMj \right)^{2}} \]

SDa(X01) = 0.025    SDa(X02) = 0.029
SDa(X03) = 0.029    SDa(X04) = 0.029
SDa(X05) = 0.014    SDa(X06) = 0.096
SDa(X07) = 0.031    SDa(X08) = 0.017
SDa(X09) = 0.012    SDa(X10) = 0.051
SDa(X11) = 0.037    SDa(X12) = 0.024
Pooled Standard Deviations (As Precision):

Row 1:

\[ SpR1 := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{n01 + n02 + n03 + n04 - 4}} \]

SpR1 = 0.0231474

Row 2:

\[ SpR2 := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{n05 + n06 + n07 + n08 - 4}} \]

SpR2 = 0.0478817

Row 3:

\[ SpR3 := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{n09 + n10 + n11 + n12 - 4}} \]

SpR3 = 0.021

Pooled Standard Deviations (As Accuracy):

Row 1:

\[ SpR1 := \sqrt{\frac{\sum (X01 - GM1)^{2} + \sum (X02 - GM1)^{2} + \sum (X03 - GM1)^{2} + \sum (X04 - GM1)^{2}}{n01 + n02 + n03 + n04 - 4}} \]

SpR1 = 0.0277715

Row 2:

\[ SpR2 := \sqrt{\frac{\sum (X05 - GM2)^{2} + \sum (X06 - GM2)^{2} + \sum (X07 - GM2)^{2} + \sum (X08 - GM2)^{2}}{n05 + n06 + n07 + n08 - 4}} \]

SpR2 = 0.0537719

Row 3:

\[ SpR3 := \sqrt{\frac{\sum (X09 - GM3)^{2} + \sum (X10 - GM3)^{2} + \sum (X11 - GM3)^{2} + \sum (X12 - GM3)^{2}}{n09 + n10 + n11 + n12 - 4}} \]

SpR3 = 0.033
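The precision and accuracy standard deviations, and their pooled forms, can be sketched in Python for Row 1 of the Collabor_GM data. The helper names (`sd_about`, `pooled_sd`) are hypothetical, and the data are the rounded X01 to X04 values printed in the worksheet, so the results need not reproduce the printed worksheet values exactly.

```python
import math


def sd_about(values, center):
    """sqrt(sum((x - center)^2) / (n - 1)) about a chosen center."""
    return math.sqrt(sum((x - center) ** 2 for x in values) / (len(values) - 1))


def pooled_sd(groups, centers):
    """Pool squared deviations over groups; divisor is total n minus number of groups."""
    ss = sum(sum((x - c) ** 2 for x in g) for g, c in zip(groups, centers))
    dof = sum(len(g) for g in groups) - len(groups)
    return math.sqrt(ss / dof)


row1 = [
    [3.51, 3.46, 3.47, 3.50, 3.49],   # X01
    [3.51, 3.50, 3.50, 3.47, 3.45],   # X02
    [3.46, 3.44, 3.46, 3.52, 3.46],   # X03
    [3.46, 3.44, 3.45],               # X04
]
means = [sum(g) / len(g) for g in row1]
grand_mean = sum(means) / len(means)          # GM1: mean of the four group means

sdp = [sd_about(g, m) for g, m in zip(row1, means)]      # precision: about own mean
sda = [sd_about(g, grand_mean) for g in row1]            # accuracy: about grand mean
sp_precision = pooled_sd(row1, means)
sp_accuracy = pooled_sd(row1, [grand_mean] * len(row1))
print(f"GM1={grand_mean:.4f} pooled precision={sp_precision:.4f} "
      f"pooled accuracy={sp_accuracy:.4f}")
```

The accuracy figure is never smaller than the precision figure, because deviations about the grand mean include the offset of each group mean from it.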
Measuring Precision without Duplicates (Youden/Steiner):

RAW DATA ENTRY (Enter single Determinations for Sample X from different laboratories or operators):

Sample X
LAB #1   3.51
LAB #2   3.51
LAB #3   3.46
LAB #4   3.46

nX := rows(X)   mean(X) = 3.484

(Enter single Determinations for Sample Y from different laboratories or operators):

Sample Y
LAB #1   3.48
LAB #2   3.50
LAB #3   3.45
LAB #4   3.46

nY := rows(X)   mean(Y) = 3.47

[Two-sample Chart illustrating systematic errors: Y plotted against X, with mean(X) and mean(Y) marked.]

CALCULATIONS:

Dxy := (X − Y)   Txy := (X + Y)
mean(Dxy) = 0.014   mean(Txy) = 6.955

Precision (Sr):

\[ Sr := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( Dxy - \mathrm{mean}(Dxy) \right)^{2}} \]

Sr = 8.276473 · 10^-3

Measuring the Standard Deviation of the Data (Youden/Steiner):

Standard Deviation (Sd):

\[ Sd := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( Txy - \mathrm{mean}(Txy) \right)^{2}} \]

Sd = 0.033653

Statistical Test for presence of systematic errors (Youden/Steiner):

F-statistic Calculation (Fs):

\[ Fs := \frac{Sd^{2}}{Sr^{2}} \]

Fs = 16.533

F-statistic Table Value (Ft):

df1 := nY − 1   df1 = 3   qF(0.95, df1, df1) = 9.277

Test Criteria:
If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR
If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS)

Standard Deviation estimate for the distribution of systematic errors (Sb2):

\[ Sb2 := \frac{Sd^{2} - Sr^{2}}{2} \]

Sb2 = 5.32 · 10^-4
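The two-sample test without duplicates can be sketched in Python. The helper name `youden_sd` is hypothetical, the single determinations are the rounded Sample X and Sample Y values printed in this worksheet, and the F table value is copied from the worksheet output. Because the printed data are rounded, the computed values are close to, but not identical with, the worksheet results.

```python
import math

X = [3.51, 3.51, 3.46, 3.46]   # Sample X, LAB #1..#4
Y = [3.48, 3.50, 3.45, 3.46]   # Sample Y, LAB #1..#4


def youden_sd(values):
    """sqrt(sum((v - mean)^2) / (2 (n - 1)))"""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (2 * (len(values) - 1)))


sr = youden_sd([x - y for x, y in zip(X, Y)])   # precision, from the differences
sd = youden_sd([x + y for x, y in zip(X, Y)])   # standard deviation, from the sums
fs = sd ** 2 / sr ** 2

F_TABLE = 9.277                                  # qF(0.95, 3, 3) as printed
systematic_error = fs > F_TABLE
sb2 = (sd ** 2 - sr ** 2) / 2                    # estimate for the systematic errors
print(f"Sr={sr:.6f} Sd={sd:.6f} Fs={fs:.3f} Sb2={sb2:.2e} bias={systematic_error}")
```

Laboratory offsets cancel in the differences but add in the sums, which is why a large Sd relative to Sr points to systematic error.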
Collabor_TV

Collaborative Test Worksheet

RAW DATA ENTRY:

X01: 3.42 3.38 3.40 3.38 3.38     X05: 3.56 3.57 3.56 3.58 3.59     X09: 3.76 3.74 3.77 3.77 3.77
X02: 3.41 3.40 3.42 3.35 3.38     X06: 3.54 3.55 3.57 3.53 3.54     X10: 3.86 3.83 3.93 3.87 3.81
X03: 3.37 3.36 3.36 3.36 3.37     X07: 3.54 3.54 3.54 3.54 3.54     X11: 3.74 3.74 3.74 3.74 3.74
X04: 3.38 3.38 3.38 3.38 3.38     X08: 3.56 3.58 3.59 3.58 3.56     X12: 3.74 3.76 3.73 3.77 3.75

Mean Values for Data Rows:

n01 := rows(X01)   mean(X01) = 3.391
n02 := rows(X02)   mean(X02) = 3.391
n03 := rows(X03)   mean(X03) = 3.364
n04 := rows(X04)   mean(X04) = 3.38

n05 := rows(X05)   mean(X05) = 3.571
n06 := rows(X06)   mean(X06) = 3.548
n07 := rows(X07)   mean(X07) = 3.541
n08 := rows(X08)   mean(X08) = 3.574

n09 := rows(X09)   mean(X09) = 3.763
n10 := rows(X10)   mean(X10) = 3.861
n11 := rows(X11)   mean(X11) = 3.741
n12 := rows(X12)   mean(X12) = 3.75

ENTER TRUE VALUES FOR EACH ROW (SPIKED RECOVERY SAMPLES):

TR1 := 3.40   TR2 := 3.61   TR3 := 3.80
COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision: the same expression is applied to each data set Xk (k = 01 to 12), with nk := rows(Xk):

\[ SDp(Xk) := \sqrt{\frac{1}{nk - 1} \sum \left( Xk - \mathrm{mean}(Xk) \right)^{2}} \]

SDp(X01) = 0.019            SDp(X02) = 0.025
SDp(X03) = 0                SDp(X04) = 0
SDp(X05) = 0.01             SDp(X06) = 0.015
SDp(X07) = 2.588 · 10^-3    SDp(X08) = 0.013
SDp(X09) = 0.012            SDp(X10) = 0.047
SDp(X11) = 1.924 · 10^-3    SDp(X12) = 0.016

Accuracy: the deviations are taken from the true value of the row, TR1 for X01 to X04, TR2 for X05 to X08, and TR3 for X09 to X12:

\[ SDa(Xk) := \sqrt{\frac{1}{nk - 1} \sum \left( Xk - TRj \right)^{2}} \]

SDa(X01) = 0.022    SDa(X02) = 0.027
SDa(X03) = 0.041    SDa(X04) = 0.022
SDa(X05) = 0.044    SDa(X06) = 0.071
SDa(X07) = 0.077    SDa(X08) = 0.042
SDa(X09) = 0.043    SDa(X10) = 0.083
SDa(X11) = 0.066    SDa(X12) = 0.058
Pooled Standard Deviations (As Precision):

Row 1:

\[ SpR1 := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{n01 + n02 + n03 + n04 - 4}} \]

SpR1 = 0.0159961

Row 2:

\[ SpR2 := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{n05 + n06 + n07 + n08 - 4}} \]

SpR2 = 0.0114967

Row 3:

\[ SpR3 := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{n09 + n10 + n11 + n12 - 4}} \]

SpR3 = 0.025

Pooled Standard Deviations (As Accuracy):

Row 1:

\[ SpR1 := \sqrt{\frac{\sum (X01 - TR1)^{2} + \sum (X02 - TR1)^{2} + \sum (X03 - TR1)^{2} + \sum (X04 - TR1)^{2}}{n01 + n02 + n03 + n04 - 4}} \]

SpR1 = 0.0289623

Row 2:

\[ SpR2 := \sqrt{\frac{\sum (X05 - TR2)^{2} + \sum (X06 - TR2)^{2} + \sum (X07 - TR2)^{2} + \sum (X08 - TR2)^{2}}{n05 + n06 + n07 + n08 - 4}} \]

SpR2 = 0.0608954

Row 3:

\[ SpR3 := \sqrt{\frac{\sum (X09 - TR3)^{2} + \sum (X10 - TR3)^{2} + \sum (X11 - TR3)^{2} + \sum (X12 - TR3)^{2}}{n09 + n10 + n11 + n12 - 4}} \]

SpR3 = 0.064
Measuring Precision without Duplicates (Youden/Steiner):

RAW DATA ENTRY (Enter single Determinations for Sample X from different laboratories or operators):

Sample X
LAB #1   3.42
LAB #2   3.41
LAB #3   3.37
LAB #4   3.38

nX := rows(X)   mean(X) = 3.394

(Enter single Determinations for Sample Y from different laboratories or operators):

Sample Y
LAB #1   3.56
LAB #2   3.54
LAB #3   3.54
LAB #4   3.56

nY := rows(Y)   mean(Y) = 3.551

CALCULATIONS:

Dxy := (X − Y)   Txy := (X + Y)
mean(Dxy) = −0.157   mean(Txy) = 6.944

[Two-sample Chart illustrating systematic errors: Y plotted against X, with mean(X) and mean(Y) marked.]

Precision (Sr):

\[ Sr := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( Dxy - \mathrm{mean}(Dxy) \right)^{2}} \]

Sr = 0.015805

Measuring the Standard Deviation of the Data (Youden/Steiner):

Standard Deviation (Sd):

\[ Sd := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left( Txy - \mathrm{mean}(Txy) \right)^{2}} \]

Sd = 0.023765

Statistical Test for presence of systematic errors (Youden/Steiner):

F-statistic Calculation (Fs):

\[ Fs := \frac{Sd^{2}}{Sr^{2}} \]

Fs = 2.261

F-statistic Table Value (Ft):

df1 := nY − 1   df1 = 3   qF(0.95, df1, df1) = 9.277

If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR
If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS)

Standard Deviation estimate for the distribution of systematic errors (Sb2):

\[ Sb2 := \frac{Sd^{2} - Sr^{2}}{2} \]

Sb2 = 1.575 · 10^-4
ANOVA_s4
ANOVA (Analysis of Variance) Test This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test. Enter sample data used in test: An element of D represents the data collected with a particular factor.
Data Entry:

  D0      D1      D2      D3
3.421   3.407   3.366   3.380
3.377   3.400   3.360   3.380
3.399   3.417   3.361   3.380
3.379   3.353   3.362   3.380
3.379   3.380   3.370   3.380

Enter level of significance α:    α := 0.05
Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)^2 / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX^2 / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX^2 / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← [ SS_factor   df_factor   SS_factor / df_factor ]
                 [ SS_error    df_error    SS_error / df_error   ]
                 [ SS_total    df_total    0                     ]
    Analysis_1 ← (Analysis_0)_{0,2} / (Analysis_0)_{1,2}
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis
Calculate Mean Values:
mean(D0) = 3.391    mean(D1) = 3.3914    mean(D2) = 3.3638    mean(D3) = 3.38

Conducting an analysis of variance:
For a given set of grouped data D and level of significance α:

ANOVA(D, α) = [ {3,3}   3.281   3.239   0 ]^T

The ANOVA table, ANOVA(D, α)_0:

                   SS               df    MS
Between Groups     2.519 · 10^-3    3     8.396 · 10^-4
Within Groups      4.094 · 10^-3    16    2.559 · 10^-4
Total              6.613 · 10^-3    19    0

The calculated F statistic: ANOVA(D, α)_1 = 3.281485
The critical F statistic: ANOVA(D, α)_2 = 3.238872
The hypothesis test conclusion at the specified level of significance: ANOVA(D, α)_3 = 0
0 = reject hypothesis – there is a significant difference
1 = accept hypothesis – there is not a significant difference
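For readers without Mathcad, the same one-way ANOVA arithmetic can be sketched in Python. This is an illustrative translation only: the function name `one_way_anova` and the dictionary keys are hypothetical, the group data are read column-wise from the worksheet's data entry, and the critical F value is simply copied from the worksheet output rather than computed.

```python
# Grouped data from the worksheet, one list per factor level (D0..D3).
D = [
    [3.421, 3.377, 3.399, 3.379, 3.379],
    [3.407, 3.400, 3.417, 3.353, 3.380],
    [3.366, 3.360, 3.361, 3.362, 3.370],
    [3.380, 3.380, 3.380, 3.380, 3.380],
]
F_CRITICAL = 3.238872  # critical F for alpha = 0.05, (3, 16) df, as quoted above

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and the F statistic."""
    n_total = 0
    sx = 0.0    # sum of all observations (SX)
    sx2 = 0.0   # sum of all squared observations (SX2)
    t = 0.0     # sum over groups of (group sum)^2 / group size (T)
    for g in groups:
        s = sum(g)
        sx += s
        sx2 += sum(v * v for v in g)
        t += s * s / len(g)
        n_total += len(g)
    ss_factor = t - sx * sx / n_total
    ss_error = sx2 - t
    ss_total = sx2 - sx * sx / n_total
    df_factor = len(groups) - 1
    df_error = n_total - len(groups)
    f_stat = (ss_factor / df_factor) / (ss_error / df_error)
    return {
        "ss_factor": ss_factor, "ss_error": ss_error, "ss_total": ss_total,
        "df_factor": df_factor, "df_error": df_error, "f": f_stat,
    }

result = one_way_anova(D)
significant = result["f"] > F_CRITICAL  # True means a significant difference
print(result, significant)
```

Note that `significant` is the logical opposite of the worksheet's `Analysis_3` flag, which is 1 when the calculated F is below the critical value.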
ANOVA_s2
ANOVA (Analysis of Variance) Test This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test. Enter sample data used in test: An element of D represents the data collected with a particular factor.
Data Entry:

  D0      D1
3.421   3.366
3.377   3.360
3.399   3.361
3.379   3.362
3.379   3.370

Enter level of significance α:    α := 0.05
Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)^2 / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX^2 / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX^2 / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← [ SS_factor   df_factor   SS_factor / df_factor ]
                 [ SS_error    df_error    SS_error / df_error   ]
                 [ SS_total    df_total    0                     ]
    Analysis_1 ← (Analysis_0)_{0,2} / (Analysis_0)_{1,2}
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis
Figure 71-2 Relationship of Laboratory CV (as powers of 2) with analyte concentration (as powers of 10, $10^{-\exp}$). (For example, 6 on the abscissa represents a concentration of $10^{-6}$, or 1 ppm, with a CV (%) of $2^4$.)
Table 71-2 Relationship of Laboratory CV (%) (as powers of 2) with analyte concentration (as powers of 10)

CV (%)    Analyte conc.    Absolute conc.    Conc. in ppm
2^0       10^0             Near 100%         10^6
2^1       10^-1            10%               10^5
2^2       10^-2            1.0%              10^4
2^3       10^-3            0.1%              10^3
2^4       10^-6            1 ppm             1
2^5       10^-9            1 ppb             10^-3
2^6       10^-12           1 ppt             10^-6
Limitations in Analytical Accuracy: Part 1
Table 71-3 Relationship of Laboratory CV (as powers of 2) with analyte concentration (as powers of 10)

CV (as 2^exp)    conc. (as 10^exp)    Absolute conc.    Conc. in ppm
0                0                    Near 100%         10^6
1                −1                   10%               10^5
2                −2                   1.0%              10^4
3                −3                   0.1%              10^3
4                −6                   1 ppm             1
5                −9                   1 ppb             10^-3
6                −12                  1 ppt             10^-6
of magnitude that concentration decreases; for low (micro) concentrations, CV doubles for every 3 orders of magnitude decrease in concentration. Note that this represents the between-laboratory variation. The within-laboratory variation should be 50–66% of the between-laboratory variation. Reflecting on Figures 71-1 and 71-2, some have called this relationship Horwitz's trumpet. How interesting that he plays such a tune for analytical scientists. Another form of expression can also be derived, since CV (%) is another term for % relative standard deviation (%RSD), as equation 71-6 (reference [6]):
$\%RSD = 2^{(1 - 0.5 \log C)}$    (71-6)
There are many tests for uncertainty in analytical results and we will continue to present and discuss these within this series.
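As a quick numerical check of equation 71-6, the relationship can be evaluated directly. The sketch below assumes a base-10 logarithm and C expressed as a mass fraction; the function name `horwitz_rsd_percent` is hypothetical.

```python
import math

def horwitz_rsd_percent(c):
    """%RSD predicted by equation 71-6 for mass fraction c (0 < c <= 1)."""
    return 2 ** (1 - 0.5 * math.log10(c))

rsd_pure = horwitz_rsd_percent(1.0)    # C = 10^0 (near 100% analyte)
rsd_ppm = horwitz_rsd_percent(1e-6)    # C = 10^-6 (1 ppm)
print(rsd_pure, rsd_ppm)
```

For 1 ppm the function returns 16, that is $2^4$, which agrees with the example quoted in the caption of Figure 71-2.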
REFERENCES
1. Horwitz, W., Analytical Chemistry 54(1), 67A–76A (1982).
2. Horwitz, W., Laverne, R.K. and Boyer, W.K., Journal – Association of Official Analytical Chemists 63(6), 1344 (1980).
3. ASTM E177 – 86, “Standard Practice for Use of the Terms Precision and Bias in ASTM Test Methods”; Form and Style for ASTM Standards, ASTM International, West Conshohocken, PA.
4. Helland, S., Scandinavian Journal of Statistics 17, 97 (1990).
5. Mark, H. and Workman, J., Statistics in Spectroscopy, 2nd ed. (Elsevier, Amsterdam, 2003), pp. 205–211, 213–222.
6. Personal communication with G. Clark Dehne, Capital University, Columbus, OH 43209-2394 (2004), ASTM International E13 Meeting.
72 Limitations in Analytical Accuracy: Part 2 – Theories to Describe the Limits in Analytical Accuracy
Recall from our previous chapter [1] how Horwitz throws down the gauntlet to analytical scientists, stating that a general equation can be formulated for the representation of analytical precision based on analyte concentration (reference [2]). He states this as equation 72-1:
$CV\% = 2^{(1 - 0.5 \log C)}$    (72-1)
where C is the mass fraction as concentration expressed in powers of 10 (e.g., 0.1% analyte is equal to $C = 10^{-3}$). A paper published by Hall and Selinger [3] points out that this is an empirical formula relating the concentration (c) to the coefficient of variation (CV), which is also known as the precision (σ). They derive the origin of the “trumpet curve” using a binomial distribution explanation. Their final derived relationship becomes equation 72-2:
$CV = \dfrac{c^{-0.15}}{50}$    (72-2)
They further simplify the Horwitz trumpet relationship in two forms as:
$CV\% = 0.006\,c^{-0.5}$    (72-3a)
and
$\sigma = 0.006\,c^{0.5}$    (72-3b)
They then derive their own binomial model relationships using Horwitz’s data with variable apparent sample size:
$CV\% = 0.02\,c^{-0.15}$    (72-4a)
and
$\sigma = 0.02\,c^{0.85}$    (72-4b)
Both sets of relationships depict relative error as decreasing with increasing analyte concentration (an inverse power-law dependence). In yet a more detailed incursion into this subject, Rocke and Lorenzato [4] describe two disparate conditions in analytical error: (1) concentrations near zero; and (2) macro-level concentrations, say greater than 0.5% for argument’s sake. They propose that analytical
error is comprised of two types, additive and multiplicative. So their derived model for this condition is equation 72-5:
$x = \mu e^{\eta} + \varepsilon$    (72-5)
where x is the measured concentration, μ is the true analyte concentration, and η is a Normally distributed analytical error with mean 0 and standard deviation $\sigma_\eta$. It should be noted that η represents the multiplicative or proportional error with concentration and ε represents the additive error demonstrated at small concentrations. Using this approach, the critical level at which the CV is a specific value can be found by solving for x using the relationship shown in equation 72-6:
$(CV \cdot x)^2 = \sigma_\eta^2 x^2 + \sigma_\varepsilon^2$    (72-6)
where x is the measured analyte concentration as the practical quantitation level (PQL), a term used by the U.S. Environmental Protection Agency (EPA). With x expressed in units of the additive error standard deviation (that is, taking $\sigma_\varepsilon = 1$), this relationship is simplified to equation 72-7:
$x = \sqrt{\dfrac{1}{CV^2 - \sigma_\eta^2}}$    (72-7)
where CV is the preselected coefficient of variation to be achieved at the critical level using a specific analytical method, and $\sigma_\eta$ is the standard deviation of the multiplicative or measurement error of the method. For example, if the desired CV is 0.3 and $\sigma_\eta$ is 0.1, then the PQL or x is computed as 3.54. This is the lowest analyte concentration that can be determined given the parameters used. The authors describe the model above as a linear exponential calibration curve as equation 72-8:
$y = \alpha + \beta \mu e^{\eta} + \varepsilon$    (72-8)
where y is the observed measurement data. This model approximates a constant standard deviation model at low concentrations and approximates a constant CV model at high concentrations, where the multiplicative error varies as $e^{\eta}$.
DETECTION LIMIT FOR CONCENTRATIONS NEAR ZERO
Finally, the detection limit (D) is estimated using equation 72-9:
$D = \dfrac{3\sigma_\varepsilon}{\sqrt{r}}$    (72-9)
where $\sigma_\varepsilon$ is the standard deviation of the measurement error measured at low (near zero) concentration, and r is the number of replicate measurements made.
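The PQL example and the detection limit formula in this chapter can be checked numerically. The sketch below is illustrative only: the function names `pql` and `detection_limit` are hypothetical, and the optional `sigma_eps` argument is an assumption that generalizes the worked example, which corresponds to an additive error standard deviation of 1.

```python
import math

def pql(cv, sigma_eta, sigma_eps=1.0):
    """Concentration at which the coefficient of variation equals cv.

    cv        : target coefficient of variation (as a fraction)
    sigma_eta : standard deviation of the multiplicative error
    sigma_eps : standard deviation of the additive error (assumed 1 by default)
    """
    if cv <= sigma_eta:
        raise ValueError("cv must exceed sigma_eta for a finite PQL")
    return sigma_eps / math.sqrt(cv ** 2 - sigma_eta ** 2)

def detection_limit(sigma_eps, r):
    """Detection limit D = 3 * sigma_eps / sqrt(r) for r replicates."""
    return 3 * sigma_eps / math.sqrt(r)

x_pql = pql(0.3, 0.1)                 # worked example from the text
d_limit = detection_limit(1.0, 9)     # illustrative inputs, not from the text
print(round(x_pql, 2), d_limit)
```

The first call reproduces the value of 3.54 quoted in the text; the second shows that, for a fixed additive error, nine replicates reduce the detection limit to one standard deviation.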
REFERENCES
1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Limitations in Analytical Accuracy – Part 1 Horwitz’s Trumpet,” Spectroscopy 21(9), 18–24 (2006).
2. Horwitz, W., Analytical Chemistry 54(1), 67A–76A (1982).
3. Hall, P. and Selinger, B., Analytical Chemistry 61, 1465–1466 (1989).
4. Rocke, D. and Lorenzato, S., Technometrics 37(2), 176–184 (1995).
73 Limitations in Analytical Accuracy: Part 3 – Comparing Test Results for Analytical Uncertainty
UNCERTAINTY IN AN ANALYTICAL MEASUREMENT
By making replicate analytical measurements one may estimate the certainty of the analyte concentration using a computation of the confidence limits. As an example, consider five replicate measurement results: 5.30%, 5.44%, 5.78%, 5.00%, and 5.30%. The precision (or standard deviation) is computed using equation 73-1:
$s = \sqrt{\dfrac{\sum_{i=1}^{r} (x_i - \bar{x})^2}{r - 1}}$    (73-1)
where s represents the precision, Σ means summation of all the $(x_i - \bar{x})^2$ values, $x_i$ is an individual replicate analytical result, $\bar{x}$ is the mean of the replicate results, and r is the total number of replicates included in the group (this is often represented as n). For the above set of replicates s = 0.282. The degrees of freedom are indicated by r − 1 = 4. If we want to calculate the 95% confidence level, we note that the t-value is 2.776. So the uncertainty (U) of our measurement result is calculated as equation 73-2:
$U = \bar{x} \pm t \cdot \dfrac{s}{\sqrt{r}}$    (73-2)
So the example case results in an uncertainty interval from 5.014 to 5.714, a width of 0.7. Therefore, if we have a relatively unbiased analytical method, there is a 95% probability that our true analyte value lies between these upper and lower concentration limits.
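The confidence limits for the five replicates can be reproduced with the Python standard library. This is a minimal sketch: the function name `confidence_limits` is hypothetical, and the t-value of 2.776 is taken from the text rather than computed.

```python
import math
import statistics

replicates = [5.30, 5.44, 5.78, 5.00, 5.30]  # percent analyte
T_VALUE = 2.776  # t-value for 95% confidence and 4 degrees of freedom (from the text)

def confidence_limits(values, t_value):
    """Return (mean, s, lower, upper) for the interval mean +/- t*s/sqrt(r)."""
    r = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)            # sample standard deviation (r - 1)
    half_width = t_value * s / math.sqrt(r)
    return mean, s, mean - half_width, mean + half_width

mean, s, lower, upper = confidence_limits(replicates, T_VALUE)
print(round(s, 3), round(lower, 3), round(upper, 3))
```

The computed limits agree with the values quoted above to within rounding in the last digit.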
COMPARISON TEST FOR A SINGLE SET OF MEASUREMENTS VERSUS A TRUE ANALYTICAL RESULT
Now let us start this discussion by assuming we have a known analytical value, obtained by artificially creating a standard sample using impossibly precise weighing and mixing methods so that the true analytical value is 5.2% analyte. We make one measurement and obtain a value of 5.7%. We then refer to errors using statistical terms as follows:
Measured value: 5.7%
“True” value: μ = 5.2%
Absolute error: Measured value − True value = 0.5%
Relative % error: 0.5/5.2 × 100 = 9.6%
Then we recalibrate our instrumentation and obtain the results: 5.10, 5.20, 5.30, 5.10, and 5.00. Thus our mean value ($\bar{x}$) is 5.14. Our precision as the standard deviation (s) of these five replicate measurements is calculated as 0.114 with n − 1 = 4 degrees of freedom. The t-value from the t table, α = 0.95, degrees of freedom as 4, is 2.776. To determine if a specific test result is significantly different from the true or mean value, we use equation 73-3 as the test statistic $T_e$:
$T_e = \left| \dfrac{\bar{x} - \mu}{s} \right| \cdot \sqrt{n}$    (73-3)
For this example $T_e$ = 1.177. There is no significant difference between the measured values and the expected or true value if $T_e$ ≤ t-value, and there is a significant difference between the set of measured values and the true value if $T_e$ > t-value. We must then conclude here that there is no significant difference between the measured set of values and the true value, as 1.177 ≤ 2.776.
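The test statistic for the recalibrated measurements can be verified as follows. The sketch is illustrative; the function name `t_statistic_vs_true` is hypothetical and the tabulated t-value is copied from the text.

```python
import math
import statistics

measurements = [5.10, 5.20, 5.30, 5.10, 5.00]  # results after recalibration
TRUE_VALUE = 5.2   # "true" analyte value from the text
T_VALUE = 2.776    # tabulated t-value for 4 degrees of freedom (from the text)

def t_statistic_vs_true(values, mu):
    """Test statistic |mean - mu| * sqrt(n) / s for one set versus a known value."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return abs(mean - mu) * math.sqrt(n) / s

te = t_statistic_vs_true(measurements, TRUE_VALUE)
significantly_different = te > T_VALUE
print(round(te, 3), significantly_different)
```

The statistic evaluates to about 1.177, below the tabulated value, in agreement with the conclusion in the text.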
COMPARISON TEST FOR TWO SETS OF MEASUREMENTS
If we take two sets of five measurements using two calibrated instruments and the mean results are $\bar{x}_1 = 5.14$ and $\bar{x}_2 = 5.16$, we would like to know if the two sets of results are statistically identical. So we calculate the standard deviation for both sets and find $s_1 = 0.114$ and $s_2 = 0.193$. The pooled standard deviation $s_{12} = 0.079$. The degrees of freedom in this case is $n_1 - 1$, which equals 5 − 1 = 4. The t-value at α = 0.95, d.f. = 4, is 2.776. To determine if one set of measurements is significantly different from the other set of measurements we use equation 73-4 as the test statistic $T_{e12}$, with s taken as the pooled standard deviation:
$T_{e12} = \left| \dfrac{\bar{x}_1 - \bar{x}_2}{s \cdot \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \right|$    (73-4)
For this example, $T_{e12}$ = 0.398. If there is no significant difference between the sets of measured values, we expect $T_{e12}$ ≤ t-value, and here 0.398 ≤ 2.776. If there is a significant difference between the sets of measured values, we expect $T_{e12}$ > t-value. We must conclude here that there is no significant difference between the sets of measured values.
CALCULATING THE NUMBER OF MEASUREMENTS REQUIRED TO ESTABLISH A MEAN VALUE (OR ANALYTICAL RESULT) WITH A PRESCRIBED UNCERTAINTY (ACCURACY)
If error is random and follows probabilistic (normally distributed) variance phenomena, we must be able to make additional measurements to reduce the measurement noise or variability. This is certainly true in the real world to some extent. Most of us having some basic statistical training will recall the concept of calculating the number of measurements required to establish a mean value (or analytical result) with a prescribed accuracy. For this calculation one would designate the allowable error (e), and a probability (or risk) that a measured value (m) would be different by an amount (d). We begin this estimate by computing the standard deviation of measurements. This is determined by first calculating the mean, then taking the difference of each control result from the mean, squaring that difference, summing the squares, dividing by n − 1, then taking the square root. All these operations are included in equation 73-5:
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$    (73-5)
where s represents the standard deviation, Σ means summation of all the $(x_i - \bar{x})^2$ values, $x_i$ is an individual control result, $\bar{x}$ is the mean of the control results, and n is the total number of control results included in the group. If we were to follow a cookbook approach for computing the various parameters we proceed as follows:
(1) Compute an estimate of (s) for the method (see above)
(2) Choose the allowable margin of error (d)
(3) Choose the probability level as alpha (α), as the risk that our measurement value (m) will be off by more than (d)
(4) Determine the appropriate t value for $t_{1-\alpha/2}$ for n − 1 degrees of freedom
(5) Finally, apply the formula for n (the number of discrete measurements required) for a given uncertainty as equation 73-6:
$n = \dfrac{t^2 \cdot s^2}{d^2} + 1$    (73-6)
Problem Example: We want to learn the average value for the quantity of toluene in a test sample for a set of hydrocarbon mixtures. s = 1, α = 0.95, d = 0.1. For this problem $t_{1-\alpha/2} = 1.96$ (from t table) and thus n is computed as equation 73-7:
$n = \dfrac{1.96^2 \cdot 1^2}{0.1^2} + 1 = 385$    (73-7)
So if we take 385 measurements we conclude with 95% confidence that the true analyte value (mean value) will be between the average of the 385 results $\bar{X} \pm 0.1$.
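The cookbook calculation for the number of measurements reduces to a single expression. In this sketch the function name `required_measurements` is hypothetical, and the inputs are those of the toluene problem example.

```python
def required_measurements(t_value, s, d):
    """Number of measurements n = t^2 * s^2 / d^2 + 1 (not yet rounded)."""
    return (t_value ** 2) * (s ** 2) / (d ** 2) + 1

n_exact = required_measurements(t_value=1.96, s=1.0, d=0.1)
n_reported = round(n_exact)  # the text reports the rounded value
print(n_exact, n_reported)
```

The unrounded result is about 385.16, which the text reports as 385; a more conservative choice, not made in the text, would be to round up to the next whole measurement.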
THE Q-TEST FOR OUTLIERS [1–3]
We make five replicate measurements using an analytical method to calculate basic statistics regarding the method. Then we want to determine if a seemingly aberrant single result is indeed a statistical outlier. The five replicate measurements are 5.30%, 5.44%, 5.78%, 5.00%, and 5.30%. The result we are concerned with is 6.0%. Is this result an outlier? To find out we first calculate the absolute values of the individual deviations:

Compute deviation    Absolute deviation
5.30 − 6.00          0.70
5.44 − 6.00          0.56
5.78 − 6.00          0.22
5.00 − 6.00          1.00
5.30 − 6.00          0.70

Thus the minimum deviation ($D_{Min}$) is 0.22; the maximum deviation is 1.00 and the deviation range (R) is 1.00 − 0.22 = 0.78. We then calculate the Q-Test Value as $Q_n$ using equation 73-8:
$Q_n = \dfrac{D_{Min}}{R}$    (73-8)
This results in a $Q_n$ of 0.22/0.78 = 0.28 for n = 5. Using the Q-Value Table (90% confidence level, Table 73-1) we note that if $Q_n$ ≤ Q-Value, then the measurement is NOT an outlier. Conversely, if $Q_n$ > Q-Value, then the measurement IS an outlier. So since 0.28 ≤ 0.642 this test value is not considered an outlier.
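The Q-test steps described above can be written out as a short Python sketch. It follows the procedure exactly as given in this chapter (deviations of the suspect result from each replicate); the function name `q_statistic` is hypothetical, and the critical value is the 90% entry for n = 5 from Table 73-1.

```python
replicates = [5.30, 5.44, 5.78, 5.00, 5.30]  # percent analyte
suspect = 6.0                                 # result under suspicion
Q90_N5 = 0.642                                # Table 73-1, 90% confidence, n = 5

def q_statistic(values, suspect_value):
    """Q_n = D_min / R, with R the range of the absolute deviations."""
    deviations = [abs(v - suspect_value) for v in values]
    d_min = min(deviations)
    d_range = max(deviations) - d_min
    return d_min / d_range

qn = q_statistic(replicates, suspect)
is_outlier = qn > Q90_N5
print(round(qn, 2), is_outlier)
```

For the example data this gives a Q value of about 0.28, so the suspect result is retained.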
SUMMATION OF VARIANCE FROM SEVERAL DATA SETS
We sum the variance from several separate sets of data by computing the variance of each set of measurements; this is determined by first calculating the mean for each set, then taking the difference of each result from the mean, squaring that difference,
Table 73-1 Q-Value table (at different confidence levels)

n:       3       4       5       6       7       8       9       10
Q90%:    0.941   0.765   0.642   0.560   0.507   0.468   0.437   0.412
Q95%:    0.970   0.829   0.710   0.625   0.568   0.526   0.493   0.466
Q99%:    0.994   0.926   0.821   0.740   0.680   0.634   0.598   0.568
dividing by r − 1 where r is the number of replicates in each individual data set. All these operations are included in equation 73-9:
$s^2 = \dfrac{\sum_{i=1}^{r} (x_i - \bar{x})^2}{r - 1}$    (73-9)
where $s^2$ represents the variance for each set, Σ means summation of all the $(x_i - \bar{x})^2$ values, $x_i$ is an individual result, $\bar{x}$ is the mean of each set of results, and r is the total number of results included in each set. The pooled variance is given as equation 73-10:
$s_p^2 = \dfrac{s_1^2 + s_2^2 + \cdots + s_k^2}{k}$    (73-10)
where $s_k^2$ represents the variance for each data set, and k is the total number of data sets included in the pooled group. The pooled standard deviation $\sigma_p$ is given as equation 73-11:
$\sigma_p = \sqrt{s_p^2}$    (73-11)
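Pooling variances as in equations 73-9 to 73-11 can be illustrated with the two five-replicate sets used earlier in this chapter. The sketch assumes, as equation 73-10 does, that each set has the same number of replicates; the function name `pooled_variance` is hypothetical.

```python
import math
import statistics

data_sets = [
    [5.30, 5.44, 5.78, 5.00, 5.30],  # first replicate set from this chapter
    [5.10, 5.20, 5.30, 5.10, 5.00],  # second replicate set from this chapter
]

def pooled_variance(sets):
    """Average of the per-set sample variances (equal set sizes assumed)."""
    variances = [statistics.variance(s) for s in sets]  # divides by r - 1
    return sum(variances) / len(variances)

sp2 = pooled_variance(data_sets)
sigma_p = math.sqrt(sp2)
print(round(sp2, 5), round(sigma_p, 4))
```

If the sets had different numbers of replicates, a weighting by degrees of freedom would normally be needed; that case is not covered by equation 73-10 as printed.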
REFERENCES
1. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood Limited Publishers, Chichester, 1992), pp. 63–64.
2. Dixon, W.J. and Massey, F.J., Jr, Introduction to Statistical Analysis, 4th ed. (ed. W.J. Dixon) (McGraw-Hill, New York, 1983), pp. 377, 548.
3. Rohrabacher, D.B., “Dixon’s Q-Tables for Multiple Probability Levels,” Analytical Chemistry 63, 139 (1991).
74 The Statistics of Spectral Searches
There are a variety of mathematical techniques used for determining the matching index (or agreement) between an unknown test spectrum (or signal pattern) and a set of known or reference spectra (or multiple signal patterns) [1–12]. The set of known spectra is often referred to as a reference spectral library. In general, a high match score or similarity value is indicative of greater “alikeness” between an unknown test spectrum and single or multiple known reference spectra contained within a reference library. A basic list of the techniques used to compare an unknown test spectrum to a set of known library spectra is found in Table 74-1. Some of the mathematical approaches used will be described in greater detail in this chapter.
COMMON SPECTRAL MATCHING APPROACHES
The ASTM (American Society for Testing and Materials) has published a “Standard Practice for General Techniques for Qualitative Analysis” (Method E 1252-88). The method describes techniques useful for qualitative evaluation of liquids, solids, and gases using the spectral measurement region of 4000 to 50 cm^-1 (above 2500 nm) [1, 2].
MAHALANOBIS DISTANCE MEASUREMENTS
The Mahalanobis Distance statistic (or more correctly the square of the Mahalanobis Distance), $D^2$, is a scalar measure of where the spectral vector a lies within the multivariate parameter space used in a calibration model [3, 4]. The Mahalanobis distance is used for spectral matching, for detecting outliers during calibration or prediction, or for detecting extrapolation of the model during analyses. Various commercial software packages may use D instead of $D^2$, or may use other related statistics as an indication of high leverage outliers, or may call the Mahalanobis Distance by another name. $D^2$ is preferred here since it is more easily related to the number of samples and variables. Model developers should attempt to verify exactly what is being calculated. Both mean-centered and not mean-centered definitions for Mahalanobis Distance exist, with the mean-centered approach being preferred. Regardless of whether mean-centering of data is performed, the statistic designated by $D^2$ has valid utility for qualitative calculations. If a is a spectral vector (dimension f by 1) and A is the matrix of calibration spectra (of dimension n by f), then the Mahalanobis Distance is defined as:
$D^2 = a^t (A A^t)^{+} a$    (74-5a)
Table 74-1 A listing of Classic Spectral Search Algorithms and Terminology
1. Visually overlap test spectrum (t) and reference spectrum (r) to compare spectral shapes for similarity
2. Search and identify (compare individual peaks or sets of spectral peaks)
3. Compare physical data or chemical measurements between samples
4. Use mathematical methods, such as Hit quality index (HQI) value (or ‘similarity’ value vs. library reference sample). An example list of such HQI methods includes
   a. Euclidean distance (d) algorithm: $d = \left[\sum_{i=1}^{n} (t_i - r_i)^2\right]^{1/2}$    (74-1)
   b. First derivative Euclidean distance algorithm: ${}^{1D}d = \left[\sum_{i=1}^{n} ({}^{1}t_i - {}^{1}r_i)^2\right]^{1/2}$    (74-2)
   c. Sum of differences: $sd = \sum_{i=1}^{n} (t_i - r_i)$    (74-3)
   d. Correlation [row matrix (row vector) dot product]: $r = T \bullet R = \sum_{i=1}^{n} T_i R_i$    (74-4)
5. Other approaches: Hamming networks, pattern recognition, wavelets, and neural network learning systems are sometimes discussed but have not been commercially implemented.
For a mean-centered calibration, a and A in equation 74-5a are replaced by $a - \bar{a}$ and $A - \bar{A}$, respectively. If a weighted regression is used, the expression for the Mahalanobis Distance becomes equation 74-5b:
$D^2 = a^t (A R A^t)^{+} a$    (74-5b)
In MLR, if m is the vector (dimension k by 1) of the selected absorbance values obtained from a spectral vector a, and M is the matrix of selected absorbance values for the calibration samples, then the Mahalanobis Distance is defined as equation 74-6a:
$D^2 = m^t (M M^t)^{-1} m$    (74-6a)
If a weighted regression is used, the expression for the Mahalanobis Distance becomes equation 74-6b:
$D^2 = m^t (M R M^t)^{-1} m$    (74-6b)
In PCR and PLS, the Mahalanobis distance for a sample with spectrum a is obtained by substituting the decomposition for PCR, or for PLS, into equation 74-5a. The statistic is expressed as equation 74-7a:
$D^2 = s^t s$    (74-7a)
If a weighted PCR or PLS regression is used, the expression for the Mahalanobis Distance becomes equation 74-7b:
$D^2 = s^t (S^t R S)^{-1} s$    (74-7b)
The Mahalanobis Distance statistic provides a useful indication of the first type of extrapolation. For the calibration set, one sample will have a maximum Mahalanobis Distance, $D^2_{max}$. This is the most extreme sample in the calibration set, in that it is the farthest from the center of the space defined by the spectral variables. If the Mahalanobis Distance for an unknown sample is greater than $D^2_{max}$, then the estimate for the sample clearly represents an extrapolation of the model. Provided that outliers have been eliminated during the calibration, the distribution of Mahalanobis Distances should be representative of the calibration model, and $D^2_{max}$ can be used as an indication of extrapolation.
EUCLIDEAN DISTANCE
There may be some great future algorithm or approach developed using some of these concepts, but for now consider the Euclidean Distance approach (equation 74-8):
$d = \left[(X_{21} - X_{11})^2 + (X_{22} - X_{12})^2 + (X_{23} - X_{13})^2 + \cdots + (X_{2i} - X_{1i})^2\right]^{0.5}$    (74-8)
where Xki are data points from each of two spectra where k is the spectrum or sample number and i is the data point number in the spectrum. The distance is calculated at each data point (from 1 to i), with a comparison between the test spectrum (sub 2) and each reference spectrum (sub k). The distance from a reference spectrum to the test spectrum is calculated as the Euclidean distance.
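A library search based on the Euclidean distance can be sketched as follows. The spectra and the library names here are invented purely for illustration, and the function name `euclidean_distance` is hypothetical; the smallest distance identifies the closest reference spectrum.

```python
import math

def euclidean_distance(test, reference):
    """Square root of the summed squared point-by-point differences."""
    if len(test) != len(reference):
        raise ValueError("spectra must have the same number of data points")
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(test, reference)))

# Hypothetical three-point absorbance spectra (illustrative values only).
test_spectrum = [0.10, 0.40, 0.30]
library = {
    "reference A": [0.10, 0.10, 0.70],
    "reference B": [0.12, 0.38, 0.31],
}

distances = {name: euclidean_distance(test_spectrum, ref)
             for name, ref in library.items()}
best_match = min(distances, key=distances.get)
print(distances, best_match)
```

Unlike the match index discussed next, the Euclidean distance is sensitive to overall intensity differences, so spectra are often normalized before searching; that step is omitted here.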
COMMON SPECTRAL MATCHING (CORRELATION OR DOT PRODUCT)
Techniques for matching sample spectra include the use of Mahalanobis distance and Cross Correlation techniques (“Correlation Matching”), described earlier. The general method for comparing two spectra (test versus reference), where the reference is a known compound or the mean spectrum of a set of known spectra, is given as the MI (Match Index). The MI is computed by comparing the vector dot products between the test and the reference spectra. The theoretical values for these dot products range from −1.0 to +1.0, where −1.0 is a perfect negative (inverse) correlation, and +1.0 is a perfect match. Since for near infrared spectroscopy only positive absorbance values are used to compute the dot products, the values for the match index must fall within the 0.0 to +1.0 range. The mathematics is straightforward and is demonstrated below. The MI is equal to the cosine of the angle (designated as θ) between two row vectors (the test and reference spectra) projected onto a two-dimensional plane, and is equivalent to the correlation (r) between the two spectra (row vectors) as equation 74-9:
$MI = \cos\theta = \dfrac{T \bullet R}{|T|\,|R|}$    (74-9)
where T is the test spectrum row matrix, and R is the reference spectrum row matrix.
Note the following equation 74-10a:
$T \bullet R = \sum_{i=1}^{n} T_i R_i$    (74-10a)
where $T_i$ represents the individual data points for the test spectrum (designated as the absorbance values of spectrum T from wavelengths i = 1 through n), and $R_i$ represents the individual data points for the reference spectrum (designated as the absorbance values of spectrum R from wavelengths i = 1 through n).
And where
$|T|\,|R| = \left(\sum_{i=1}^{n} T_i^2\right)^{1/2} \left(\sum_{i=1}^{n} R_i^2\right)^{1/2}$    (74-10b)
Note that the angle (θ), in degrees, between two vectors can be determined from the MI using
$\theta = \cos^{-1}(MI)$    (74-10c)
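The match index and the corresponding angle can be computed directly from equations 74-9 and 74-10a to 74-10c. The sketch below uses invented vectors for illustration; the function names `match_index` and `angle_degrees` are hypothetical, and the clamping step is an added safeguard against floating-point round-off, not part of the equations.

```python
import math

def match_index(test, reference):
    """Cosine of the angle between two spectra treated as row vectors."""
    dot = sum(t * r for t, r in zip(test, reference))
    norm_t = math.sqrt(sum(t * t for t in test))
    norm_r = math.sqrt(sum(r * r for r in reference))
    return dot / (norm_t * norm_r)

def angle_degrees(mi):
    """Angle between the vectors, from the match index."""
    mi = max(-1.0, min(1.0, mi))  # clamp round-off outside [-1, 1]
    return math.degrees(math.acos(mi))

# Hypothetical spectra: the second is the first scaled by a constant factor.
mi_scaled = match_index([0.2, 0.4, 0.6], [0.1, 0.2, 0.3])
# Two simple vectors separated by 45 degrees.
mi_45 = match_index([1.0, 0.0], [1.0, 1.0])
theta_45 = angle_degrees(mi_45)
print(mi_scaled, mi_45, theta_45)
```

Because the match index depends only on the angle between the vectors, a spectrum and a scaled copy of it give a match index of 1, which is why this measure is insensitive to simple pathlength or concentration scaling.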
The “alikeness” of one test spectrum (or series of spectra) to a reference spectrum can be determined by calculating a point-by-point correlation between absorbance data for each test and reference spectrum. The correlation matching can be accomplished for all data points available or for a preselected set only. The more alike the test and reference spectra are, the higher (closer to 1.00) are the r (correlation coefficient) and $R^2$ (coefficient of determination) values. A perfect match of the two spectra would produce r or $R^2$ values of 1.00000. The sensitivity of the technique can be increased by pretreating the spectra as first- or higher-order derivatives and then calculating the correlation between test and reference spectra. Full spectral data can also be truncated (or reduced) to include only spectral regions of particular interest, a practice which will further improve matching sensitivity for a particular spectral feature of interest. Sample selection using this technique involves selecting samples most different from the mean population spectrum for the full sample set. Those samples with correlations of the lowest absolute values (including negative correlations) are selected first and then samples of second lowest correlation are selected (and so on) until the single sample of highest correlation is found. The distribution of spectra about the mean is assumed to follow a normal distribution with a computable standard deviation. This assumption indicates that a uniformly distributed sample set can be selected based on the correlation between test spectra and the mean spectrum of a population of spectra.
REFERENCES
1. ASTM “Practice for General Techniques for Qualitative Infrared Analysis”, ASTM Committee E 13, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959.
2. ASTM Committee E13.11, “Practice for Near Infrared Qualitative Analysis”, ASTM Committee E 13, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959.
3. Mahalanobis, P.C., Proceedings of the National Institute of Science 2, 49–55 (1936).
4. Mark, H.L. and Tunnell, D., Analytical Chemistry 57, 1449–1456 (1985).
5. Workman, J., Mobley, P., Kowalski, B. and Bro, R., Applied Spectroscopy Reviews 31(1–2), 73 (1996).
6. Whitfield, R.G., Gerber, M.E. and Sharp, R.L., Applied Spectroscopy 41, 1204–1213 (1987).
7. Mahalanobis, P.C., Proceedings of the National Institute of Science 2, 49 (1936).
8. Reid, J.C. and Wong, E.C., “Data-Reduction and Search System for Digital Absorbance Spectra”, Applied Spectroscopy 20, 320–325 (1966).
9. Owens, P.M. and Isenhour, T.L., “Infrared Spectral Compression Procedure for Resolution Independent Search Systems”, Analytical Chemistry 55, 1548–1553 (1983).
10. Tanabe, K. and Saeki, S., “Computer Retrieval of Infrared Spectra by a Correlation Coefficient Method”, Analytical Chemistry 47, 118–122 (1975).
11. Azarraga, L.V., Williams, R.R. and de Haseth, J.A., “Fourier Encoded Data Searching of Infrared Spectra (FEDS/IRS)”, Applied Spectroscopy 35, 466–469 (1981).
12. de Haseth, J.A. and Azarraga, L.V., “Interferogram-Based Infrared Search System”, Analytical Chemistry 53, 2292–2296.
75 The Chemometrics of Imaging Spectroscopy
Imaging spectroscopy is particularly useful toward understanding the structure and functional relationships of materials and biological samples. Spatial images of chemical structure demonstrate physical or chemical phenomena related to a particular structural anatomy. Software packages such as MATLAB and many others provide easily learned methods for image display and mathematical manipulation for matrices of data [1]. Imaging data may be measured using array and camera data comprised of spatial data on an X and Y plane, with the Z-axis being related to frequency and the fourth dimension related to amplitude or signal strength. The figures below illustrate the types of data useful for imaging problems. Figure 75-1a illustrates second-order data comprised of signal amplitude (A), multiple frequencies (λ/ν), and time. In this image model, one is taking spectroscopic measurements over time. Figure 75-1b shows another form of second-order data where spectroscopic amplitude at a single wavelength is combined with spatial information. Figure 75-2 shows third-order data or a hyperspectral data cube where the spectral amplitude is measured at multiple frequencies (spectrum) with X and Y spatial dimensions included. Each plane in the figure represents the amplitude of the spectral signal at a single frequency for an X and Y coordinate spatial image. Such data provide powerful information relating structure to chemical knowledge. Such data may be measured by rastering a spectrometer or microscope over a particular area, or by using an array detection scheme for collecting spectroscopic data. Imaging provides an entirely expanded dimension of spectroscopy and increases the power of spectroscopic techniques to reveal new information regarding investigations into new materials and biological or chemical interactions.
IMAGE PROJECTION OF SPECTROSCOPIC DATA
Table 75-1 demonstrates the rows × columns data matrix that can be obtained by rastering a spectrophotometer across a two-dimensional plane surface of paper with a pattern entered onto the paper, using, for example, water or an invisible ink that has a unique spectral absorption. For illustrative purposes the data shown here are created with a computer. One might imagine spectroscopic data measured at single or multiple wavelengths to obtain a similar data matrix. One may also enhance the signal or spectra using the toolbox of preprocessing techniques to enhance or draw out a clearer image. The data matrix is preprocessed using any signal enhancement technique to obtain the spectroscopic data of greatest interest as it relates to the spatial characteristics of the material or sample surface under study. In this particular case each data point represents the absorbance difference between a nonabsorbing wavelength for the paper surface and an absorbing wavelength for the transparent ink added to the paper. The difference in
[Figure 75-1, panels (a) and (b); axes: A (amplitude), t (time), λ/ν (frequency), X and Y (spatial)]
Figure 75-1 (a) Second-order data (amplitude, multiple frequencies, time); (b) second-order data (amplitude at one frequency, with X and Y spatial dimensions).
[Figure 75-2; axes: A (amplitude), λ/ν (frequency), X and Y (spatial)]
Figure 75-2 Third-order data (Hyperspectral Data Cube: amplitude, multiple frequencies, and X, Y spatial data – each plane represents the amplitude of spectral signal at a single frequency for an X, Y coordinate spatial image).

Table 75-1 Simulated absorbance data depicting an ink pattern on a two-dimensional paper surface with spatial dimensions X and Y

A = [.001 .001 .001 .001 .001 .001 .001 .001 .001 .001 1.1 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001;
.001 .001 .001 .001 .001 .001 .001 .001 .001 .001 1.11 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001;
.001 1.11 .001 .001 .001 .001 .001 .001 .001 .001 1.10 .001 .001 .001 .001 .001 .001 .001 .001 1.10 .001;
.0012 .0012 1.10 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.10 .0012 .0012;
.0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011 .0011 .0011 1.12 .0011 .0011 .0011 .0011 .0011 .0011 1.10 .0011 .0011 .0011;
.0011 .0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011;
.0012 .0012 .0012 .0012 .0012 1.10 .0012 .0012 .0012 .0012 1.10 .0012 .0012 .0012 .0012 1.12 .0012 .0012 .0012 .0012 .0012;
.0010 .0010 .0010 .0010 .0010 .0010 1.10 .0010 .0010 .0010 1.11 .0010 .0010 .0010 1.10 .0010 .0010 .0010 .0010 .0010 .0010;
.0011 .0011 .0011 .0011 .0011 .0011 .0011 1.10 .0011 .0011 1.12 .0011 .0011 1.10 .0011 .0011 .0011 .0011 .0011 .0011 .0011;
.0011 .0011 .0011 .0011 .0011 .0011 .0011 .0011 1.11 .0011 1.11 .0011 1.11 .0011 .0011 .0011 .0011 .0011 .0011 .0011 .0011;
.0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.12 1.12 1.10 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012;
.0011 .0011 .0011 .0011 .0011 .0011 .0011 .0011 1.11 .0011 1.11 .0011 1.11 .0011 .0011 .0011 .0011 .0011 .0011 .0011 .0011;
.0012 .0012 .0012 .0012 .0012 .0012 .0012 1.11 .0012 .0012 1.11 .0012 .0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 .0012;
.0011 .0011 .0011 .0011 .0011 .0011 1.10 .0011 .0011 .0011 1.11 .0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011 .0011 .0011;
.0011 .0011 .0011 .0011 .0011 1.12 .0011 .0011 .0011 .0011 1.10 .0011 .0011 .0011 .0011 1.11 .0011 .0011 .0011 .0011 .0011;
.0011 .0011 .0011 .0011 1.10 .0011 .0011 .0011 .0011 .0011 1.12 .0011 .0011 .0011 .0011 .0011 1.12 .0011 .0011 .0011 .0011;
.0012 .0012 .0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 1.10 .0012 .0012 .0012;
.001 .001 1.11 .001 .001 .001 .001 .001 .001 .001 1.11 .001 .001 .001 .001 .001 .001 .001 1.10 .001 .001;
.0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.11 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.10 .0012;
.0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 1.10 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012 .0012;
.001 .001 .001 .001 .001 .001 .001 .001 .001 .001 1.10 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001]
[Figure: 2-D image map; axes X-dimension and Y-dimension]
Figure 75-3 Two-dimensional contour plot of data matrix A found in Table 75-1.
absorbance between these two wavelengths will be directly related to the amount of ink added to the paper surface. By applying imaging software to the data matrix, an image of the ink content added to the paper will appear. The first graphical representation using MATLAB® software is a two-dimensional contour surface plot of the data from Table 75-1 [2]. This plot (Figure 75-3) can represent multiple levels of z-axis data (absorbance) by the use of contours and color schemes. The MATLAB® commands for generating this image are given in Table 75-2, where A represents the raster data matrix shown in Table 75-1. The second graphical representation using MATLAB® software is a three-dimensional surface plot (Table 75-3, Figure 75-4). This plot visually represents the three-dimensional data, where the X and Y axes are spatial dimensions and the Z axis depicts absorbance. The MATLAB® commands for this graphic are given in Table 75-3, where A represents the raster data matrix given in Table 75-1.
Table 75-2 MATLAB® commands for generating a contour plot of data matrix A found in Table 75-1
>> contour(A)
>> grid
>> title(['2-D Image Map'])
>> xlabel(['X-Dimension'])
>> ylabel(['Y-Dimension'])
Table 75-3 MATLAB® commands for generating a 3-D surface plot of data matrix A found in Table 75-1
>> surf(A)
>> colormap(cool)
>> title(['3-D Image Map'])
>> xlabel(['X-Dimension'])
>> ylabel(['Y-Dimension'])
>> zlabel(['Z-Dimension'])
[Figure: 3-dimensional map; axes X-dimension, Y-dimension, and Z-dimension]
Figure 75-4 Three-dimensional surface plot of data matrix A found in Table 75-1
The third graphical representation using MATLAB® software is a two-dimensional contour map overlaid onto a three-dimensional surface plot (Table 75-4, Figure 75-5). This plot overlays Figure 75-3 onto Figure 75-4. In this three-dimensional graphic, the X and Y axes are the spatial dimensions and the Z axis depicts absorbance (or spectroscopic signal). The MATLAB® commands for this graphic are given in Table 75-4, where A represents the raster data matrix given in Table 75-1.

Table 75-4 MATLAB® commands for generating a 2-D contour plot over a 3-D surface plot
>> C=(A-1);
>> surf(C)
>> axis([0 25 0 25 -1 1])
>> hold
>> contour(A)
>> grid
>> title(['3-D Image Map with Contour'])
>> xlabel(['X-Dimension'])
>> ylabel(['Y-Dimension'])
>> zlabel(['Z-Dimension'])

[Figure: 3-D image map with contour; axes X-dimension, Y-dimension, and Z-dimension]
Figure 75-5 Two-dimensional contour plot overlay onto three-dimensional surface plot of data matrix A found in Table 75-1. (see Color Plate 24)

So, by producing a matrix of data containing a contrast between the signal and the background, one may obtain useful images for study. To use this technique for optimizing image quality, one must process the raw signal to enhance the difference between the component of interest and the background signal. The signal can be enhanced using many of the techniques described in this text: MLR, PCR, PLS, background correction, derivatives, and the like can all be used to improve the signal-to-noise ratio that separates the component of interest from the background signal. Once this contrast is achieved, the simple techniques described here are useful for projecting the image for structure-chemical composition studies or for detecting the presence and location of impurities.
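The contrast step described above can be sketched in MATLAB® as follows. This is only a minimal illustration under stated assumptions, not the procedure used to produce Table 75-1: the data cube `cube`, the plane indices `kInk` and `kRef`, the background level, and the threshold are all hypothetical, and the ink pattern is a simple simulated diagonal line.

```matlab
% Hypothetical hyperspectral cube: 21 x 21 pixels, 3 wavelength planes.
% All names and values below are illustrative assumptions.
nx = 21; ny = 21; nw = 3;
ink = eye(ny, nx);                 % simulated ink pattern (a diagonal line)
background = 0.30;                 % constant background absorbance in every plane
cube = background * ones(ny, nx, nw);
kInk = 2;                          % plane where the ink absorbs (assumed)
kRef = 1;                          % reference plane with no ink absorbance (assumed)
cube(:, :, kInk) = cube(:, :, kInk) + 1.1 * ink;

% Contrast image: absorbance difference between the two wavelength planes.
% The common background cancels, leaving a signal related to the ink amount.
D = cube(:, :, kInk) - cube(:, :, kRef);

% Locate the pixels that contain the component of interest.
threshold = 0.5;                   % arbitrary cut between background and ink
mask = D > threshold;
[rowIdx, colIdx] = find(mask);

% Display the difference image with the same kind of plot used in Table 75-2.
contour(D)
grid
title('Difference image')
xlabel('X-Dimension')
ylabel('Y-Dimension')
```

A simple two-plane difference is used here only because it is the easiest contrast to verify by eye; any of the calibration or pre-processing methods named in the text could take its place when forming `D`.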
REFERENCES
1. Workman, J., NIR News 9(3), 4–5 (1998).
2. MATLAB® software from The Mathworks, Inc., 24 Prime Park Way, Natick, MA 01760.
Glossary of Terms
This set of terms is a supplement to the text. Many of these terms are included to clarify issues discussed in the text. We refer the reader to the text index for more detailed coverage of the statistics and chemometrics terms. Many of these terms refer to the measuring instrument or the process of making a measurement rather than to mathematical concepts.
Action limit, n – the limiting value from an instrument performance test, beyond which the instrument or analytical method is expected to produce potentially invalid results.
Analysis, v – the process of applying a calibration model to an absorption spectrum so as to estimate a component concentration value or property.
Analyzer, n – all piping, hardware, computer, software, instrumentation, and one or more calibration models required to automatically perform analysis of a specific sample type.
Analyzer calibration, n – see multivariate calibration.
Analyzer model, n – see multivariate model.
Analysis precision, n – a statistical measure of the expected repeatability of results for an unchanging sample, produced by an analytical method or instrument for samples whose spectra represent an interpolation of a multivariate calibration. The reader is cautioned to refer to specific definitions for precision and repeatability based on the context of use.
Analysis result, n – the numerical or qualitative estimate of a physical, chemical, or quality parameter produced by applying the calibration model to the spectral data collected by an instrument according to specified measurement conditions.
Analysis validation test, n – see validation test.
Calibration, v – a process used to create a model relating two types of measured data. Also, a process for creating a model that relates component concentrations or properties to absorbance spectra for a set of samples with known reference values.
Calibration model, n – the mathematical expression that relates component concentrations or properties of a set of reference samples to their absorbances. It is used to predict the properties of samples based upon their measured spectrum.
Calibration, multivariate, n – a process for creating a model that relates component concentrations or properties to the absorbances of a set of known reference samples at more than one wavelength or frequency.
Calibration samples, n – the set of samples used for creating a calibration model. Reference component concentration or property values need to be known, or measured by a suitable reference method, in order that they may be related to the measured absorbance spectra during the calibration process.
Calibration transfer, n – a method of applying a multivariate calibration developed on one instrument to data measured on a different instrument, by mathematically modifying the calibration model or by a process of instrument standardization.
Check sample, n – a single pure compound, or a known, reproducible mixture of compounds whose spectrum is constant over time such that it can be used as a quality or validation or verification sample during an instrument performance or function test.
Control limit, n – for validation tests, the maximum difference allowed between a valid analytical result and a reference method result for the same sample. A measured value that exceeds a control limit requires that action be taken to correct the process. Control limits are statistically determined.
Estimate, n – the value for a component concentration or property obtained by applying the calibration model for the analysis of an absorption spectrum; v – this is also a general statistical term referring to an approximation of a parameter based upon theoretical computation.
Inlier, n – see nearest neighbor distance inlier.
Inlier detection methods, n – statistical tests which are conducted to determine if a spectrum resides within a region of the multivariate calibration space which is sparsely populated.
Instrument standardization, v – a procedure for standardizing the response of multiple instruments such that a common multivariate model is applicable for measurements conducted across these instruments, the standardization being accomplished via adjustment of the spectrophotometer hardware or via mathematical treatment of one or a series of collected spectra.
Model validation, v – the process of testing a calibration model to determine bias between the estimates from the model and the reference method, and to test the expected agreement between estimates made with the model and the reference method.
Multivariate calibration, n – an analyzer calibration that relates the spectrum at multiple wavelengths or frequencies to the physical, chemical, or quality parameters; v – the process or action of calibrating.
Multivariate model, n – a multivariate mathematical rule or formula used to calculate a physical, chemical, or quality parameter from the measured spectrum.
Nearest neighbor distance inlier, n – a spectrum residing within a significant gap in the multivariate calibration space, the result for which is subject to possible interpolation error across the sparsely populated calibration space.
Optical background, n – the spectrum of radiation incident on a sample under test, typically obtained by measuring the radiation transmitted through or reflected from the spectrophotometer when no sample is present, or when an optically thin or non-absorbing standard material is present.
Optical reference filter, n – an optical filter or other device which can be inserted into the optical path in the spectrophotometer or probe, producing an absorption spectrum which is known to be constant over time such that it can be used in place of a check or test sample in a performance test.
Outlier detection limits, n – the limiting value for application of an outlier detection method to a spectrum, beyond which the spectrum represents an extrapolation of the calibration model.
Outlier detection methods, n – statistical tests which are conducted to determine if the analysis of a spectrum using a multivariate model represents an interpolation of the model.
Outlier spectrum, n – a spectrum whose analysis by a multivariate model represents an extrapolation of the model.
Performance test, n – a test that verifies that the performance of an instrument is consistent with historical data and adequate to produce valid analysis results.
Physical correction, n – a type of post-processing where the correction made to the numerical value produced by the multivariate model is based on a separate physical measurement of, for example, sample density, sample pathlength, or particulate scattering.
Post-processing, n – performing a mathematical operation on an intermediate analysis result to produce the final result, including correcting for temperature effects, adding a mean property value of the calibration model, or converting the instrument results into appropriate units for reporting purposes.
Pre-processing, n – performing mathematical operations on raw spectral data prior to multivariate analysis or model development, such as selecting wavelength regions, correcting for baseline, smoothing, mean centering, and assigning weights to certain spectral positions.
Primary method, n – see reference method.
Reference method, n – the analytical method that is used to estimate the reference component concentration or property value which is used in calibration and validation procedures.
Reference values, n – the component concentrations or property values for the calibration or validation samples which are measured by the reference analytical method.
Spectrophotometer cell, n – an apparatus which allows a liquid sample or gas to flow between two optical surfaces which are separated by a fixed distance, referred to as the sample pathlength, while simultaneously allowing light to pass through the liquid. There are variations of this, including variable-pathlength cells, multipass cells, and so on.
Test sample, n – a sample, or a mixture of samples, which has a constant spectrum for a limited time period, which is well characterized by the primary method, and which can be used as a QC sample in a performance test. Test samples and their spectra are generally not reproducible over extended periods.
Validation, v – the process by which it is established that an analytical method is suitable for its intended purpose.
Validation samples, n – a set of samples used in validating a calibration model. Validation samples are not generally part of the set of calibration samples. Reference component concentrations or property values are known (measured using a reference method), and are compared to those estimated using the model.
Validated result, n – a result produced by the spectroscopic (or instrumental) method that is equivalent, within control limits, to the result expected from the reference method so that the result can be used in lieu of the direct measurement of the sample by the reference method.
Validation test, n – a test performed on a validation sample that demonstrates that the result produced by the instrument or analytical method and the result produced by the reference method are equivalent to within statistical tests.
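The distinction drawn above between an outlier spectrum (an extrapolation of the model) and a nearest neighbor distance inlier (a spectrum inside a sparsely populated gap) can be illustrated with a short MATLAB® sketch. Everything in it is an assumption made for illustration: the score matrix `T`, the use of the Mahalanobis distance as the outlier statistic, and both limits, which in practice would be set statistically rather than by the simple rules shown.

```matlab
% Hypothetical calibration scores: two clusters with a gap between them.
% All names, data, and limits are illustrative assumptions.
T = [0.0 0.0; 0.2 0.1; -0.1 0.2; 0.1 -0.2; -0.2 -0.1; ...
     4.0 4.0; 4.2 4.1;  3.9 4.2; 4.1  3.8;  3.8  3.9];
n  = size(T, 1);
mu = mean(T, 1);                   % centroid of the calibration set
S  = cov(T);                       % covariance of the calibration scores

% Squared Mahalanobis distance of each row of X from the centroid.
center = @(X) X - repmat(mu, size(X, 1), 1);
mahal2 = @(X) sum((center(X) / S) .* center(X), 2);

% Distance from a new point x to its nearest calibration neighbor.
nnDist = @(x) min(sqrt(sum((T - repmat(x, n, 1)).^2, 2)));

% Assumed limits: a multiple of the largest calibration distance, and the
% largest nearest-neighbor distance found within the calibration set.
outlierLimit = 3 * max(mahal2(T));
d = zeros(n, 1);
for i = 1:n
    others = T([1:i-1, i+1:n], :);
    d(i) = min(sqrt(sum((others - repmat(T(i, :), n - 1, 1)).^2, 2)));
end
inlierLimit = max(d);

xOut = [10 -6];                    % far outside the calibration space
xIn  = [2 2];                      % inside the space, but in the gap

isOutlier = mahal2(xOut) > outlierLimit;
isInlier  = mahal2(xIn) <= outlierLimit && nnDist(xIn) > inlierLimit;
```

The point `xIn` sits exactly at the centroid, so a distance-from-center test alone would accept it; only the nearest-neighbor check reveals that no calibration sample lies near it.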
Index
A (estimated), 110, 114 A/D converter, 274, 277, 306 Ab initio theory, 225 Abscissa (xaxis), 71, 298, 340, 384–6, 479–80 Absorbance noise, 265–6, 277, 282, 286–8, 289, 291, 311, 321–2 Absorbance, 28 Absorptivity, 165, 283, 461, 480, 500 Accuracy, 121, 125, 136, 167, 173–7, 329, 453, 478, 484, 490 Actual result, 37–8, 40, 315 Addition, 6, 10, 78 Alchemy, 159 Algebra, matrix, 9–16, 17–20, 23–31, 33–41, 43–5, 47–9 Algebraic manipulation, 28, 43–4 Algebraic transformation, 26 Algorithms, 26, 48–9, 135–6, 152, 159, 160, 161, 163–6, 461 multivariate, 79 Alikeness, 376, 493, 496 Allpossiblecombinations design of three factors, 53 Allpossiblecombinations experiment, 63–4 Allpossible combinations of factors, 89 Allowable uncertainty, 478 Alpha error, 97 Alphalevel(s), 101 Alphasignificance level, 98 Alternate population, 97–8, 101 Alternative hypothesis test, 93, 392 Alternative hypothesis, Ha, 93, 392 American Pharmaceutical Review, 423 American Society for Testing and Materials International (ASTM), 493 Amount of nonlinearity, 146, 150–2, 155, 447–9, 453, 455, 457 Amplitude, 326, 330, 344, 499–500 Analogtodigital (A/D) conversion, 273
Analysis of noise, 223–6, 227–33, 235–41, 243–52, 253–66, 267–72, 273–9, 281–8, 289–94, 295–307, 309–11, 313–17, 319–23, 325–33 Analysis of variance (ANOVA), 59, 64–5, 171, 179, 210, 213, 215, 431, 450 accuracy, 167 data table, 59 general discussion, 59, 66–8 precision, 167–8 preclude to, 248 for regression, 155 results comparing laboratories, 179–80 statistical design of experiments, 168–72 table showing calculations, 67, 212 table, 59, 67, 212, 215 Test Comparisons for Laboratories and Methods, 179, 180 Analyte, 28, 30, 34, 121, 131, 141, 165, 168, 183, 187, 188, 223, 378, 379, 382–3, 385, 386, 390, 421, 430, 437, 478, 480 concentration, 121, 142, 188, 377, 378, 389, 420, 429, 431, 435–6, 441, 479–81, 483–4, 487 Analytic geometry, 71–6, 77–9, 81–4, 85–8 refresher, 3 Analytical Chemistry: Apages, 477 critical review issues, 1 fundamental reviews, 1, 48–9 Analytical designs, 53 Analytical uncertainty, 487 Anscombe, 421, 425, 427–9, 435, 442 data, 428, 442–3 Anscombe’s plot, 442 Antibiotics, 419 Anticholesterol drugs, 419 AOAC, Association of Official Analytical Chemists, 479 AOTF, 365, 415 Applied spectroscopy, 313, 459 Applied statistics, 59, 376–80, 429
514 Approximation, 155, 232, 328, 340, 348, 350, 355, 368, 456 Array detection, 499 Association of Official Analytical Chemists (AOAC), 479 Astronomical measurements, 224 Atomic absorption, 479 Augmented matrices, 14–15, 17–18, 20, 36 Auxiliary statistics, calibration, 1, 120–5, 133–4, 141, 154, 398, 422 Average analytical value, 479 Average of samples from a population, 52, 54, 59, 94, 390 Average, 33, 48, 52, 183, 185, 245, 247, 262, 306, 326, 358, 372, 479 Balanced design for three factors, 52–3 Band position, in spectroscopy, 132–3 Beer’s law, 34, 37, 47, 120–1, 132, 141–4, 156, 235, 282, 289, 368 Behavior of the derivative, 335, 341, 346, 348 Bestfit line, 361, 440, 451 Bestfit linear model, 453 Betalevel(s), 101 Betweenlaboratory variation, 481 Betweentreatment mean square, 59, 67, 70, 176 Biascorrected standard error (SEP(c)), 378, 382–3, 478–9 Biascorrected standard error (SEV(c)), 477–8 Bias, 3, 124, 167, 171, 177, 180, 187, 189, 379, 478–80 due to location or analytical method, 167–8, 171, 187 Biased estimator, 187–9, 379, 480 Big “if”, 423 Binomial distribution, 296, 483 Bioassay, 479 Biological interactions, 499 Biological samples, 499 Black box, 26, 154, 159 Blackbody radiation, energy density of, 224 Blank sample, 227 Bounds for a data set, 2 C (estimated), 111, 114 Calculating correlation, 381–2 Calculations for Comparison Tests, 188 Calculus, 229, 260, 276, 313, 457–8, 469–71, 473
Index Calibration: auxiliary statistics, 133–4, 154, 421 developing the model, 381 equations, 12, 28 error sources, 121–2, 132–3 linear regression, 28–9, 131, 165 lines, 34, 152, 424, 431–2, 463 sample selection, 494–6 samples, 35, 136–7, 379, 385, 494, 506 set, 135, 137, 377, 379, 389, 463, 495 of spectrometers, 121, 131, 162 in spectroscopy, 2, 28, 35, 117, 418–19, 429, 459, 462 transfer, 135, 161, 460, 506 Central limit theorem, derivation of, 101 Chebyshev polynomials, 437, 440 Chemical causes, 142 Chemical interactions, 463, 499 Chemical measurements: qualitative, 125 quantitative, 125 Chemical variation in sample, 500 Chemometric calibrations, 156, 333 Chemometric designs, 89 Chemometric modeling, 134 Chemometrician, 26, 147, 149, 156, 464, 467, 475 Chemometricsbased approach, 473 Chemometrics, 1–2, 48–9, 89, 117, 119–21, 131, 134, 135, 159–60, 163 Chisquared distribution (2 , 429, 433 set of tables, 102 Chromatography, 167, 418, 420, 479 Classical designs, 53 Coefficient of determination, 375, 379, 385, 398, 496 Coefficient of multiple determination, 28–30, 361, 364–5, 453 Coefficient of variation (CV), 479, 483–4 Coefficients for orthogonalized functions, 452 Collaborative Laboratory Studies, 167–77, 179–81, 183–4, 185–6, 187–92, 193–221 Collaborative study problems, 3, 169, 478 Collinearity, 113, 153 Color schemes, 501 Column vectors in row space, 85 Common Spectral Matching, 493, 495 Commutative rule, 6 Comparing laboratories methods for precision and accuracy, 170, 173–7 Comparing test results for analytical uncertainty, 487
Index Comparison of correlation coefficient: and SEE, 393, 399, 401 and standard deviation, 379–80 Comparison test: for a set of measurements versus true value, 171, 183, 216 for a two sets of measurements, 488 Compliance, 478 Computed transmittance noise, 277 Concentration, 28 ,30, 31, 34, 35, 37, 47, 48, 52, 63, 90, 107, 110, 113, 114, 120, 121, 125, 127, 131, 132, 141, 142, 144, 146, 147, 153, 155, 165, 174, 180, 188, 223, 289, 290, 368, 369, 373, 375, 377, 381, 382, 389, 395, 420, 421, 429, 431, 435, 436, 439, 441, 443, 460, 462, 463, 479, 481, 483, 484, 487 expressed in powers of, 10, 479, 483 units, 28, 132 Confidence interval, 254, 390, 429 Confidence level, 389, 390, 392, 395, 402, 404, 405, 407, 478, 487, 490 Confidence limits: for correlation coefficient, 390 for slope and intercept, 395, 405–6 Constant term, 35–7, 47, 439 Continuous population: distribution of means, 273–4 probability of obtaining a mean value within a given range, 97, 251, 305–7 Contour surface plot, 501 Controlled experiment, 57–9, 62, 93, 159 Correlation coefficient, 5, 6, 123, 124, 147, 154, 155, 163, 164, 232, 375, 379–86, 389–93, 398, 399, 402, 404, 439, 440, 443, 450, 452, 455, 474, 475, 496 confidence levels, 379–80, 390–1, 393, 405 discussion of use, correlation coefficient, population value for, p, 59, 103, 469 methods for computing, 398 Correlation or dot product, 495 Correlation, 3–6, 123–5, 154, 163, 164, 175, 232, 375, 377–87, 389–93, 398, 399, 402, 404, 420, 427, 428, 439, 440, 443, 450, 452, 455, 474, 494–6 Cosine, 72, 73, 74, 437, 495 Counting, 281, 282, 298 Covariance, 6, 7, 232, 474, 475 Covariance of (X, Y), 381, 382 Cramer’s rule, 45 Critical level, 484, 485
515 Critical value, 98, 101, 103, 428, 429 Critical, 1, 41, 48, 52, 98, 101, 103, 104, 156, 161, 162, 212, 215, 219, 428, 429, 475, 484 Cross Correlation techniques, 495 Crossproduct matrix, 475 Crossproduct, 24, 232, 252, 299, 301, 303, 474, 475 CV, coefficient of variation, 479, 483, 484 Daniel and Wood, 440, 444 Data: continuous, 274, 285, 305, 319 discrete, 247, 250, 274, 285, 305, 309, 315, 327, 332, 336, 489 historical, 433 Data conditioning, 113 Data matrix A, 109, 110, 113, 114, 127, 128, 501 Data set: bounds for, 2 synthetic, 148 Dependent (or “Y”) variable, 28, 34, 379, 468, 469 Dependent events, II, 367, 468 Dependent variable (Y variable), 34, 124, 368 Derivative (difference) ratios, 229, 240–1, 284 Derivatives (different spacings), 341, 344, 349, 351 Derivatives of spectra, 335, 409 Descartes, Rene (1596–1650), 71 Designed experiments, 51, 147 Detection limit, 477, 484 for concentrations near zero, 484 Detection, 282, 376, 477, 479, 484, 499 Detector noise, 223, 224, 226–8, 230, 235, 241, 243, 247, 250, 253, 254, 267, 273, 281–9, 292, 293, 295, 309–11, 313, 320, 325, 327, 328, 332 Determinants, 41–5, 440 Deterministic considerations, 478–9 Developing the model, calibration, 154–5, 381 Diagnosis of data problems, 3 Diagonal elements in a matrix, 43 Diagonal product, 43 Differences, successive, 423, 424 Different size populations, 379, 380, 392, 404 Diffuse reflectance, 154, 163, 225, 235 Digitized spectrum, 273–4, 281 Dimensionality, reducing, 81 Direction angles, 74, 75, 77 Direction cosines, 73, 74
516 Direction in 3D space (cosine), 74 Direction notation, 72 Discriminant analysis and its subtopics of, 3 Distance between two points, 71 Distance formula, 71 Distribution of means: continuous population, 273–4 discrete population, 250, 273–4 sampling, 54, 60, 61, 170, 274 Distribution(s), 167, 296, 298, 305, 314, 328, 350, 376, 433, 449, 456 binomial, 296, 483 Chi (, 102 Chisquared ( 2 102 constituent, 459–60 continuous, 247, 309, 319 discrete, 309, 332 F, 210, 213 finite, 248–50, 252, 259, 262–3, 340–1, 357 Gaussian (normal), 52, 124, 433, 449 Gaussian mathematical expression, 52, 124, 335–6 hypergeometric, 4, 33–4 infinite, 248–51, 259, 267 of means for a discrete population, 250, 273–4 multinomial, 296, 357, 436, 442–3, 483–4 Normal (Gaussian) mathematical expression, 3–7, 103, 247–9, 247–50, 267–8, 275, 449 Poisson, 61, 285, 290, 296–9, 304, 309, 315, 319, 327, 328, 332 Poisson, formula for, 283–4, 286, 296–9, 320 of a population, 52, 54, 389–90, 392 of possible measurements showing confidence limits, figure showing, 389–93 probability, 296 of S, 175 of standard deviations for a discrete population, 59, 247, 305 t, 103, 389–90, 392 of variances, 489, 491 of (X JS), 5–7 of X variable, 473–4 of Y variable, 473–4 Dividebyzero computation, 309 Division, 6, 11, 25, 78, 244, 245, 251, 340, 341, 346 Dot product, 494, 495
Index Double blind, 274, 331, 347, 371 Double negative, 97 Draper & Smith, 427–8, 441–2 Drift, 60, 61, 121, 147, 155, 417, 418 between sets of readings, 60, 61 instrument, 61, 155, 417 and other systematic error, 418 DurbinWatson Statistic, 421, 423, 424, 427–9, 431, 432, 435 Echelon form, 14, 15, 20 Effect of instrumental variation on the data, 161 Effect of noise on computed transmittance, 275 Effect of variations of the data on the model, 161 Efficacy, 419 Efficient comparison of two methods, 171, 187 Eigenanalysis, 109, 114 Eigenvectors, 128 Electrochemistry, 420 Electromagnetic spectrum, 142 Electronic noise error, 225 Elementary calculus book, 229 Elementary row operations, 18 Elementary statistics, 95, 285, 306, 379 Elimination, matrix operation, 17, 18, 24, 48 Empty or null set, 9 Energy density of blackbody radiation, 224 Energydistribution product, 330 Error of integral, 329 Error propagation, 289–91 Error source, 121, 223, 231–2, 274, 325, 417–18 Error sources, calibration, 121–2, 132–3 Error(s) combined, 28–9, 121–2, 123–4, 145, 153, 155, 170, 176, 187–9, 370, 409, 410 electronic noise, 225 estimating total, 3, 34, 70, 164, 392, 408, 429 experimental, 57, 93 heteroscedastic, 424 homoscedastic, 54 of interpretation, 421 maximum, 370, 414 nonrandom, 428 peaktopeak, 343, 345, 347–8, 460 population, 98, 101, 103 propagation of, 289–91
Index random, 52, 64, 66, 67, 170, 171, 188, 189, 418, 421, 424, 447, 448, 453, 460, 462, 463 reference method, 28–9, 70, 91, 97, 123 repack, 154 and residuals, 9 sampling, 60 at some stated confidence interval, 3, 34, 70, 164, 392, 408, 429 source of, 232 in spectroscopic data, 31, 34, 38, 120, 131, 141, 353, 355, 359, 367, 377, 499 stochastic, 52, 64–6, 91, 101, 170–1, 188–9, 273–4, 418, 421, 424–5, 447–8, 460, 463, 489 systematic, 167, 168, 176, 177, 188, 190, 200, 201, 208, 209, 219–21 true, 121–2, 231–2, 489 undefined, 251, 277, 305 unsystematic (random), 52, 64–6, 91, 101, 170–1, 188–9, 273–4, 418, 421, 424–5, 447–8, 460, 463, 489 ESR, 335 Euclidean distance (D), 494, 495 Events, dependent, II, 28, 33 Ewing’s terminology, 231 EXCEL™, 241 Excessive signal levels (saturation), 142 Expectation, 170, 171, 230, 259, 260, 265, 270, 276, 311, 315, 341 Expected result, 28, 94–5, 228, 230, 247, 254, 256, 273, 275–7, 285, 295–6, 298–9, 309, 325–7 Expected value of a parameter (E(S), 285, 299–300 Expected value of a parameter S, 228, 299, 432 Experiment: balls in jar, 150, 161 controlled, 54 Experimental chemometrics, 159 Experimental design, 51, 53–5, 57, 59, 62–4, 88, 89, 91, 93, 94, 97, 101, 103, 105, 168, 171, 172, 176, 187, 460, 461 balanced, 52–3 crossed, 93–5, 104–105 efficient, 53–4 fractional factorial, 54, 92 nested designs, 54, 62 nested, 54, 62 oneatatime, 62, 91
517 onefactor, twolevel experiment, 91 seven factors, table showing, 53 threefactor, twolevel crossed experiment, 89, 461 twofactor crossed experiment, 63 twofactor, twolevel crossed experiment, 63 Experimental versus control designs, 62 Expression for relative absorbance noise, 320 Expression for transmittance noise, 289, 320 Extrapolating or generalizing results, 160, 493–5 Fdistribution, 210, 213, 397, 432 Fratio, 432 Fstatistic Calculation (Fs) for precision ratio, 190, 220 F statistic, 189, 190, 191, 200, 208, 209, 212, 215, 220, 221, 478 F test, 58, 59, 421, 431, 433, 478 statistical significance of, 432–3 for the regression, 58–9, 431–3 F values, 431–2 F, t2 statistic, 189 Factor analysis scores, 109, 114 Factor analysis, 3, 109, 114, 120 Factorial, 92, 307 design for collaborative data collection, 168 designs, 54, 91, 92, 168 model experimental design, 168 Factors in statistical/chemometric parlance, 51 Failure to use adequate controls, 57–8 Family of curves of multiplication factor as a function of Er, 251 Fatal flaw, 432 FDA/ICH guidelines, 427, 431, 435, 436 Finite population, 273 First difference (derivative), 269, 350 Fisher’s Z transformation (i.e., the Zstatistic), 389, 390 Food and Drug Administration, 447 Fourier coefficients, 28, 381, 383 Fourier transform infrared (FTIR), 231, 246, 335, 365, 415 100% line, 151, 263, 481 table of standard deviations, 479–81 midinfrared spectrometer, 100% line, 231–2, 246, 270, 365, 415 spectrometers, 231–2 Fractional factorial designs, 54 Fractional factorial, 92
518 Frequency, 107, 109, 113, 127, 224, 315, 499, 500 FWHH, full width at half height, 336 Gammaray spectroscopy, 223, 282 Gauss, Carl Friedrich, 249, 253, 314 Gaussian distribution, 124 ,433, 449 Gaussianshaped bands, 335 Generalized inverse of a matrix, 9 Generalizing results, 160, 493–5 Genetic algorithms (GA), 166 Goodness of fit test, 375–9 Goodness of fit, 375, 379, 381, 389, 395, 398, 429, 441 Gossett, W.S. (Student’s ttest), 183 Graecolatin square design, 92 Grand Mean, 57, 58, 65, 66, 70, 173, 175, 176, 194 H statistic, 98, 103, 189 Ha, alternative hypothesis, 93, 392 Hamming networks, 494 Handbook of Chemistry and Physics, 276 Heterogenous, variance, 59, 229–30, 262–3, 267–8, 313–15 Heteroscedastic error, 424 Higher order differences (derivatives), 165, 372, 373, 496 Hit quality index (HQI), 494 Ho, null hypothesis, 93, 95, 97, 103–5, 189, 392, 404 Homogeneous, variance, 268, 376 Homoscedastic error, 54 Horwitz’s Trumpet, 477 Hydrogen bonding, 142, 154 in NIR and IR, 235 Hyperplane, 4, 34 Hyperspectral data cube, 499 Hypotenuse, 87, 88 Hypothesis test, 54, 58, 59, 67, 94, 97, 98, 102, 103, 167, 171, 212, 215, 389, 392, 393, 405 Chisquare, 102 nomenclature, 389, 392–3 null, 392 Hypothesized population, 97, 101 Hypothetical synthetic data, 448 ICH specifications, 419 Identity matrix, 11, 12, 19, 20 Image projection, 499
Index Imaging, 499, 501, 502 Incorrect choice of factors/wavelengths, 418 Independent error, 189, 424 Independent variable (X variable), 28, 34, 120, 379 Inferences, statistical, 375, 377 Infinitefinite numbers, 248–52, 259, 262–3, 305 Infrared, 35, 147, 223, 226, 230, 495 Ingle and Crouch’s development, 238 Inhomogeneous sample, 60 Instrument: bandwidth broad Compared to absorbance band, 142 noise, 223–6, 243 Instrument (and other) noise, 223–6, 243 Instrumental causes, 142 Integers, population of, 101, 389, 392 Integral, 247–52, 261–4, 266, 275, 276, 296, 298, 299, 307, 327, 328, 330–2, 436, 457, 458 Integrated circuit problem, 274 Integration interval, 249, 250, 259, 328, 329 Interaction: between variables, 91, 461 with solvent, 142 Intercept (k0 , 95, 123, 375, 379, 380, 395–8, 405–407, 429, 452, 453 confidence limits, 375, 396, 397 of a linear regression line, 381, 395 of a regression line, 379 Interference, 246, 459, 477 Interlaboratory tests, 477 International Chemometrics Society (NAmICS), 1, 362 Interpretive spectroscopy, 377 Inverse Beer’s Law, 120 Inverse of a matrix, 11, 19, 21, 25, 26 Kmatrix (multiple linear regression), 3, 138 Known samples, 135 Kowalski, Bruce, 467 KubelkaMunk function for diffuse reflectance, 235 Laboratory data and assessing error, 3 Laboratory error, 477 Lack of fit error, 28 Latin & Graecolatin cubes, 92 Latin Square design, 92 Latin squares, 92 LCGC, 167
Index Learning set, for calibration, 378, 384, 460, 475 Least squared differences, 30 Leastsquares, 28, 357–9, 433, 436, 457, 469, 471, 473, 475 criterion, 421, 436 line, 28 property, 468 Left singular values (LSV) matrix or the U matrix, 109, 114, 127 Level of significance, 210, 212, 213, 215, 405 Limit of detection (LOD), 376 Limit of reliable measurement, 477 Limits in analytical accuracy, 483 Linear leastsquares, 28 Linear regression, 165, 375, 376, 379, 381, 389, 395, 431 calibration, 28–9, 165, 431 Linearity, 132, 138, 141, 163, 164, 417, 418, 420, 421, 423, 424, 428, 429, 431, 433, 435, 436, 439–43, 447, 449, 450, 452, 459, 460, 461, 463, 464 assumption of, 47, 141–4 calibration, 131–4, 141, 145, 146, 148, 149, 159, 163, 165, 417, 423, 431, 435, 447, 455 Loadings matrix V, 110, 114 Log, 1/R, 235, 286 Log(R), 277, 294, 322 Logarithm, 95, 153, 155, 238, 277, 322 Lorentzian distribution, 337–40, 410, 411 Lownoise case, 264, 266, 322, 325, 332 Lower confidence limit(LCL), 389, 390 Lower limit, 327, 328, 390, 391, 395, 404, 407, 408 Luck, concept of, 359 Mahalanobis distance, 3, 493–5 weighted regression, 494 Main diagonal (of matrix), 6, 23 Malinowski, Ed, 120 Mandel, John, 477 Manual wet chemistry, 431, 435 Mass fraction of analyte, 479 Match index, 495 Matching index, 493 MathCAD, 167, 171, 173–6, 187, 189, 193, 210, 213, 375, 379–82, 389, 392, 395, 398 Mathematical constructs, 142 Mathematical statistics, 467
Mathematician, 26, 33, 34, 467 MATLAB (Matrix Laboratory), 40, 107–11, 113, 114, 116, 117, 127, 128, 249, 258, 263, 267, 315, 328, 362, 364, 419, 501, 502 Matrix, 5–7, 9–12, 15, 17–20, 23–31, 33–6, 38, 41, 43, 47–9, 55, 77, 85, 88, 107–11, 113, 114, 117, 120, 127, 128, 138, 153, 165, 362–4, 381, 382, 389, 439, 468–71, 473, 475, 493–5, 499, 501, 502 addition, 10 algebra refresher, 3 algebra, 7, 9, 12, 23, 28, 30, 31, 33, 38, 41, 43, 47, 88, 107, 109, 113, 117, 127, 471, 473 division, 11 form, 15, 17, 19, 23, 29, 35, 36, 120, 439, 469 inversion, 26, 27, 41, 48, 153, 439, 469 multiplication, 6, 7, 11, 23–5, 27, 363, 470 nomenclature, 21 notation, 5, 6, 11, 17–19, 23, 29, 30, 35, 107, 381, 468, 470, 471, 475 operations, 6, 10, 17–19, 24, 25, 28, 31, 48, 108, 111, 114–16, 362, 471 product, 24, 27 row operations, 36 Maximum error, 433, 459–60 Maximum likelihood: equation, 33, 34 estimator, 33–4, 433 method, 433 Maximum variance in the multivariate distribution, 3 MDL, minimum detection limit, 477 Mean: population, 94, 101, 103, 496 of a population (mu), 94, 101, 103, 496 of a sample (X bar), 94 sample, 5, 58–9, 104–105 Mean deviation, 173, 175, 176, 183, 189 Mean square: between-treatment, 59 for within-treatment, 59 regression, 440 residuals, 30, 70, 421 Mean square error (MSE), 59, 67–70, 450, 479 Means and standard deviations from a population of integers of random samples, Computer, 98, 136, 273
Means and standard deviations of a population of integers, computer program, 59–61 Microphonics, 224 Mid-IR, 226 Miller and Miller, 375, 383, 395, 396, 405–408 Minimum detection limit (MDL), 477 MND (Multivariate normal distribution), 2–7 Mode, 152, 282 Model-building, 92 Model for the experiment, equation example, 58–9, 168 Molar absorptivity, 479 Molar concentration, 479–81 Molecular absorption spectroscopy, 479 Monitor, 137, 246, 478 Monte-Carlo calculations, 253 Monte-Carlo numerical computer simulation, 314 Monte Carlo study, 249, 253, 314 Multilinear regression, 23–9, 33–41, 47–9 Multinomial distribution, 296, 437–9, 483 Multiple correlation, 378 Multiple frequencies, 499, 500 Multiple linear least squares regression (MLLSR), 3 Multiple linear regression (MLR), 3, 21, 23, 28, 30, 33, 34, 35, 41, 43, 47, 107, 113, 119, 127, 134, 138, 145–51, 153–7, 163, 165, 166, 418, 441, 459, 460, 494, 502 Multiplication, 6, 7, 9, 11, 23–5, 27, 77, 78, 250, 251, 332, 363, 470 Multiplier terms, 28, 30 Multipliers, 28, 315, 356 Multiplying both sides of an equation, 25 Multivariate distribution, 3 Multivariate linear models, 12 Multivariate normal distribution (MND), 2–7 Multivariate regression, 84, 107, 109 N random samples, Table of standard deviations of, 67–9 N as number of total specimens, 91–2 Narrow band, 336 Near-infrared, 35, 131, 223, 226 detectors, 223 reflectance analysis, 223 Nested designs, 54, 62
Neural networks (NN), 3, 138, 147, 165, 166 learning systems, 494 New materials, 499 News Flash, 455, 459, 460 95% confidence limit, 101, 102, 478, 487, 489 NIR, 1, 131, 149, 151, 235, 295, 335, 365, 415, 417–19, 421, 459–61, 463 NMR, 335 Noise-to-signal ratio of the reference signal, 240 Noise, 223–6, 227–32, 235–41, 243–52, 253–66, 277–9 characteristics, 224, 227, 277, 320 FTIR spectrometer, 231 instrument, 243, 253, 267, 273, 289, 295, 309, 313, 325 level, 91, 151, 230, 235, 245, 246, 249, 251–4, 257, 259, 262–4, 267, 273, 274, 284, 295, 306, 309–11, 321, 325, 327, 330, 341, 357, 369, 373, 374 ratio, 151, 230, 249, 256, 269, 325, 355, 356 spectra, 253, 267, 273, 281, 289, 295, 309, 313, 319, 325, 332, 369 spectrum, 223, 230, 241, 254, 289, 356, 357, 369, 374 variance, 369 Noisy data, figures of, 151, 353 Non-Beer’s law relationship, 120–1 Noncollimated radiation, 142 Nondetector noise, 224 Nondispersive analyzers, 225 Nonlinear detector, 142 Nonlinear dispersion, of spectrometer, nonlinearity, 4, 133–4 Nonlinear electronics, 142 Nonlinearities, 132, 155, 225, 252, 295, 443 Nonsignificant result, 97 Normal (Gaussian) distribution, 449 Normal distribution weighting factor, 249 Normal distribution, 4–7, 103, 247–50, 258, 267, 273, 275, 277, 296, 298, 304–307, 315, 326–8, 330, 331, 337–40, 350, 355, 367, 409, 414, 423, 433, 452, 456, 496 Normal method, 3–4, 103 Normal probability distribution, 296 Normal random number generator, 258 Normality of Residuals, 433 Normally-distributed noise, 275, 277, 278, 295, 301, 309
Normally distributed, 54, 65, 251, 258, 263, 267, 273, 278, 296, 328, 376, 389, 425, 427, 433, 452, 453, 455, 484, 489 Null hypothesis test, 392 Number of experiments needed, 91 Number of measurements required, 489 Number of samples in the calibration set, 389 One-at-a-time designs, 62, 91 One-factor, two-level experiment, figure showing, 133, 154, 163, 252 100% line, from FTIR spectrometer, 231 One-hundred-percent transmittance line, from FTIR spectrometer, 246 One-tailed hypothesis test, figure showing, 94 Operative difference for denominator, 431 Operative difference for numerator, 432 Opposite sides, 83 Optical-null principle, 224 Optimization designs, 53 Ordinary regression theory, 132 Ordinate (y-axis), 71 Ordinate, 269, 384–6, 479 Original population, 94, 97–8, 101, 103, 379 Orthogonal, 153, 440 Chebyshev polynomials, 440 Orthogonalize the variables, 440 Orthogonalized functions, 452 Orthogonalized quadratic term, 452 Outliers, 378, 379, 417, 433, 479, 490, 493, 495 prediction, 378–9, 464, 490 samples, 378 theory and practice, 3 Overdetermined, 33, 34, 37, 47 P (probability), 97–8, 101, 298, 306, 309, 330, 332, 375–6 P-matrix (multiple linear regression), 3, 138 P-matrix formulation, 120 Painkillers, 419 Pairs of values, 389, 455 Parabola, 347 Parameter , 251 Parameters, 4, 28, 89, 143, 165, 166, 299, 301, 359, 379, 420, 436, 437, 449, 484, 489 estimate, 301 or matrix names, 9 population, 97–8, 101, 389–90 statistical, 375, 379, 381, 398 Partial F or t-squared test for a regression coefficient, 58–9, 189, 191, 299
Partial least squares (PLS), 107, 113, 114, 119, 125, 127, 131, 132, 134, 138, 146–57, 159, 160, 163–6, 418, 460, 494, 502 Partial least squares regression (PLSR), 1, 3, 107, 113, 127 Partitioning the sums of squares, 58, 449, 475 Pascal’s triangle, table of, 83 Pathlength, 141, 143, 144, 225 Pattern recognition, 494 Peak picking algorithm, 347 Peak-to-peak error, 132, 344–5 Peak, 132, 148, 152, 153, 165, 252, 336, 337, 343–5, 347, 355, 460 Pedagogic, 26, 54, 64, 81, 132, 152, 243, 250, 341, 375, 449, 452 Percent CV, 479 Perfectly noise-free spectrum, 146, 150 Pharmacopoeia, 419 Physical variation in sample, 377 Pitfalls of statistics, 375–80 Plane, 4, 6, 34, 71, 81–5, 120, 463, 495, 499 PLS singular value decomposition (plsSVD), 114 PLS, partial least squares regression, 1, 3, 107, 113, 127 Point estimates, 228 Poisson distribution, 61, 285, 290, 296, 298, 299, 304–307, 309, 315, 319, 327, 328, 332 formula for, 296, 306 Poisson-noise case, 291 Polynomials, 357, 359, 361, 373, 436–41, 447 Pooled precision, 174 and accuracy, 174 Pooled standard deviation, 197, 198, 206, 488, 491 Poor choice of algorithm and/or data transformation, 418 Population, 52, 54, 59, 94, 97, 98, 101, 103, 136, 273, 379, 380, 389, 390, 392, 404, 468, 496 distribution of, 103–105 error, 52 finite, 136, 273 of integers, 94, 97–8, 101, 103 large, 273, 379, 389, 392–3 mean (μ), 94, 101 original, 94, 97–8, 101, 103, 379 parameters, 97–8, 101, 389–90 of spectra, 496
Population (Continued) value, 59, 103, 468 variance, 59, 103 Potency, 419 Power of the statistical test, 97 Practical quantitation level (PQL), 484 Precision and standard deviation of methods (Comparison), 189, 220 Precision, 36, 101, 102, 121, 167, 168, 170–4, 176, 177, 187–90, 194, 197, 199, 200, 202, 206–208, 216, 220, 237, 239, 243, 250, 254, 258, 269, 291, 294, 307, 442, 452, 459, 473, 477–9, 483, 487, 488 Prediction: error, 28, 382, 383 samples, 135 vector, 107 Prediction error sum of squares (PRESS), 122–4, 136, 147 PRESS statistic, 123, 124 Principal components analysis (PCA), 3, 109, 113, 114, 119, 125, 127, 132–4, 138, 148, 149, 151, 153, 154, 156, 157, 159, 163, 166 Principal components for regression vectors, 86 Principal components regression (PCR), 1, 3, 107, 113, 127, 131, 134, 138, 145–50, 152–7, 163–6, 418, 460, 494, 502 Principal components scores, 109–11 Probabilistic answer, 119–21 Probabilistic calculations, 251–2, 259, 264, 306, 315, 375–6, 487–8 Probabilistic considerations, 119 Probabilistic force, 33, 119, 120, 167, 254, 259, 264, 427, 489 Probabilistic statements, 33, 427 Probability, 94, 97, 98, 101, 102, 105, 160, 251, 252, 262, 271, 274, 296, 298, 305, 306, 309, 315, 320, 328, 330, 332, 375, 376, 427, 428, 463, 487, 489 distribution, normal, 296 sampling, 274 theory, 160, 306 and statistics, connection between, 375 Projection, 4–6, 81–3, 86, 87, 463, 499 Proof that the variance of the sums equals the sums of the variances, 229, 232 Propagation of uncertainties expression, 310 Proportion, 224, 225, 256, 263, 276, 281, 283–5, 287, 317–21, 325, 330, 340, 341, 344, 346, 347, 351, 368, 420, 452, 483, 484
Pseudoinverse, 107, 468 Pseudoinverse, theorem, 107, 468 Q-Test for Outliers, 490 Quadratic nonlinearity, 452 Quadratic polynomial, 442, 447 Qualitative analysis (Spectral matching), 3 Qualitative, chemical measurements, 48 Quantitative analysis, 30, 34, 48, 49, 125, 162, 367, 381 Quantitative, chemical measurements, 254, 418–19 Quasi-algebraic operations, 25 Quintic polynomial, 360 R-squared (R²), 379, 398 Random (stochastic) noise, 91, 146, 150, 224, 254, 370, 418, 424 Random effect(s), 64, 151, 266, 449 of noise, 227, 228 Random error, 52, 64, 66, 67, 170, 171, 188, 189, 418, 421, 424, 447, 448, 453, 460, 462, 463 Random numbers, 252, 263, 267, 467 generator, 258, 263, 267 Random phenomena, 33, 65 Random sample, 52, 98, 136, 267, 273, 450, 455, 489 Random variable, 155, 228, 229, 230, 232, 252, 260, 267, 298, 314, 353, 356 Randomness, 33, 285 behavior of, 33, 285 test for, 33, 285 Rank, 185 Ranking test, 171, 185 for Laboratories and Methods, 185 Rastering, 499 Ratio of the range (Sr) to the SEE, 386 Ratioed spectra, 226 Ratios of upper to lower confidence limits, table of, 98, 389 Real data, 59, 150, 152, 155, 159, 167, 172, 259, 336, 341, 347, 425, 437, 440, 455 Real world samples, 119, 157, 245, 305, 336, 425, 489 Reducing dimensionality, 81 Reference laboratory value, 107, 167, 183, 187, 193–201 Reference method error, 33–5, 70, 119, 121–5, 171–2, 183–4, 273–4, 289–91, 439–41 Reference noise, 231, 282, 310 Reference spectral library, 493
Reflectance (reflection), 154, 163, 223, 225, 226, 227, 235, 282, 283, 378 Reflection (reflectance), 154, 163, 223, 225–6, 227, 235, 282, 283, 378 Regression: algorithms, 26, 48–9 analysis, 7, 34, 421, 440, 450, 468 calculations, 421 coefficients, 28, 30, 38, 40, 43, 107, 110, 114, 469 Regression (MLR), and P-matrix, and its sibling, K-matrix, 3 Regression line, 379, 381, 382, 395, 397, 420 linear equation, 379 Relative error of the absorbance, 289, 290 Relative mean deviation, 183 Reliability, 119 Repack, 154 averaging, 59–61 Repeat readings, 59, 168, 170 Repeatability, 60, 460, 477, 478, 505 Replicate measurement, 173, 174, 175, 187, 484, 487, 488, 490 Replicates, 173–6, 185, 478, 487–91 Representative sample, 54, 136 Reproducibility, 477, 478 Residual Error, 28 Residual sum of squares, 421, 432 Residuals, 433 Resource-conserving experimental design, 93–5 Response surface designs, 54 Response surface, 54, 62, 92 Result, actual, 37–8, 40, 315 Result, expected, 97, 145, 149, 179 Rho, table of exact values, 403 RHS, 263, 474 Right singular values matrix (RSV) or the V matrix, 109, 114 Right triangle, 82, 87 Rocke and Lorenzato, 483 Root mean square error (RMS), of FTIR spectrometer signal, 231, 415 difference, 176–7 FTIR % line, 246, 415 Rotation, 81, 83, 84, 365, 415 Row effects, 36, 70, 85–6 Row equivalent, 18, 36 Row operations, 18, 19, 20, 36, 37, 39, 41, 48 Row vectors in column space, 85 RSSK/Norm, 373, 374
S: calculation of the sample standard deviation, 103 standard deviation of a sample, 101, 103 Sample: blank, 227 mean, 172 nonhomogeneity, 60–1 pathlength, 507, 508 presentation error, 123 representative, 54 selection, 3, 496 calibration, 35 statistic, 93–5, 97 Sampling, 54, 60, 61, 170, 274 Sampling distribution, 54, 60, 61, 170, 274 expression for, 290 Sampling error, 274 Savitzky-Golay convolution functions, 355, 357, 371, 372, 435 Savitzky-Golay/Steinier tables, 359 Scalars, 9 Scaling, 113, 299, 301, 341, 364, 414 Scatter diagrams, 378 Science of Statistics, 33, 125, 151, 160, 467, 473 Scintillation noise, 224, 316, 319, 322, 325, 327, 330, 332 Scores matrix T, 109, 114 Screening designs, 53 Second derivative, 335, 337–8, 340, 343, 345–6, 347–8 of the normal distribution, 338, 409 Second difference (derivative), 347–8 Second law of thermodynamics, 143 Second-order data, 499 SECV, Standard error of cross validation, 419 SED, standard error of difference, 123–4, 163–4 SEE (Standard Error of Estimate), 123, 124, 379, 380, 383, 386, 398, 402, 406 SEL, Standard error of the laboratory, 478 Self-interaction, 142 Self-polymerization or condensation, 142 Sensitivity testing, 62 SEP (Standard Error of Prediction), 161–2, 381–4, 419 Sequential design, 92, 93, 103 Sequential experimental design, 103 Set of regression coefficients, 107, 110, 117 Shot-noise, 223, 289, 296
Signal-to-noise (S/N) ratio, 347 Significance level, 98, 404 Simple correlation, 378 Simple least squares regression (SLSR), 3 Simple linear least squares regression (SLLSR), 3 Simultaneous equations, 23, 24, 25, 26, 27, 29, 439, 469, 470 Sine, 330 Single wavelength, 132, 134, 359, 499 Singular value decomposition (SVD), 127 Singular values matrix (SVM) or the S matrix, 109, 114 Slope (k1), 75, 395 confidence limits, 396 defining in two dimensions, 75–6 of a linear regression line, 395 Solvent interactions, 63, 142 Source of error, 232 Spatial dimension, 499, 501, 502 Spatial information, 499 Special designs, 62 (this citation is only in table) Specimen, 92 Spectra: of noise, 131–3 population of, 496 Spectral matching (Qualitative analysis), 3 Spectral matching approaches, 493 Spectral search algorithms, 494 Spectral searches, 493 Spectrophotometry, 479 Spectroscopic amplitude, 499 Spectroscopic imaging, 499–503 Spectroscopist, 30, 141, 142, 143, 144, 146, 147, 151, 152, 156, 245, 255 Spectroscopy: calibration, 367–74 FTIR, 231, 246, 335, 365, 415 home page, 171 magazine, 1, 141, 467 Spectrum, noise, 151, 278, 369–70 Specular reflection, 124, 223, 225–6, 270, 282 Spiked or true values (TV), 175 Spiked recovery method, 183 Square of the correlation coefficient, 450 Square root of variance (standard deviation), 474 Squares for residuals, 424 Standard calibration set, 379
Standard deviation (s or S), calculation for a sample, 101 of A, 237–9 of a sample (s or S), 101 of difference (SOD), 423, 479 pooled, 58, 60, 197–8, 206, 488, 491 of T, 229, 289 Standard deviation of a population (σ), 59, 94, 101, 103 Standard error of calibration (SEC), 122, 124, 163–4, 380, 385 Standard error of cross validation (SECV), 410 Standard error of estimate (SEE), 123, 124, 379, 380, 383, 386, 398, 402, 406 Standard error of laboratory (SEL), 478 Standard error of prediction (SEP), 161–2, 381–4, 419 Standard error of the laboratory (SEL), 478 Standard error of the mean, 382 Standard error of validation (SEV), 477–8 Standard error of validation (SEV), 123–4 Standard Practice for General Techniques for Qualitative Analysis, 493 Standardization concepts, 3 Statistic, test, 93, 94 Statistical analysis, 8, 17, 180, 423, 425, 491 Statistical conclusion, 441–3 Statistical design of experiments, using ANOVA, 168–72, 473–5 Statistical experimental design, 51, 54, 62, 89, 91 Statistical inferences, 375 Statistical significance, 97, 441, 450, 461 Statistical tests, 171, 192, 193, 375, 378, 443, 447, 449, 506, 507, 508 Statistical variability, 423 Statistically designed experiments, 41, 147 Statistically significant, 51–2, 57–60, 97, 171, 179, 424, 427–9, 439, 441, 443, 447, 455, 477, 478 Statistician, 91, 119–20, 162, 247, 376, 423–6, 433, 464 Statistics: applied, 428, 449–50 general, 1, 379–80, 429 mathematical, 58, 311, 314 pitfalls, 375–80 science of, 33, 125, 151, 160, 467, 473 Steinier, 359, 360, 361, 362
Stochastic (random) noise, 91, 146, 150, 224, 254, 370, 418, 424 Stochastic error, 52, 64–6, 91, 101, 170–1, 188–9, 273–4, 418, 421, 424–5, 447–8, 460, 463, 489 Stray light effects, 132–3, 463 Stray light, 132, 142, 152, 155, 463 Student’s (W.S. Gossett) t-test, 183 Student’s t-statistic, 396 Student’s t-test, mathematical description, 221 Student’s t-value for a regression, 189, 390 Studentized t-test for the residual, 183–4 Subclasses, 378 Subsamples, 478 Subtraction, 6, 10, 70, 79, 449 Sum of differences, 494 Sum of squares, 23, 34, 58, 70, 421, 432, 449, 450, 457, 458, 470, 475 between-groups, 70, 449–50, 457–8 due to error, 34, 70, 470 for regression, 34 for residuals, 421 within-groups, 449 Summation, 490 notation, 30, 381, 382, 395, 396 of variance from several data sets, 490 Super-whiz-bang chemometrics, 149 Survey, 478 Systematic effects, 151, 170, 171, 449 Systematic error (bias), 171, 187, 201, 209, 477 Systematic errors for methods A vs. B, 188, 219 T-distribution, nature of, 93, 103–4 T-statistic, 189, 395 T-table, 189, 191, 192, 216, 222 T-test, 57, 59, 93, 183, 189, 191, 192, 221, 439 T, calculation, 191 T, F1/2, 189 Tangent of the x direction angle, 75 Test for nonlinearity, 431, 435 Test samples, 135, 168, 508 Test spectrum, 493, 495, 496 row matrix, 495 Test statistic, 93, 94, 122, 189, 191, 221, 392, 404, 488 Testing correlation for different size populations, 392, 404 Testing for nonlinearity, 421, 435
Testing for systematic error in a method, 183 Tests for nonlinearity, 133–4 Tests for randomness, 33 Thermal, independent noise, 263 Third-order data, 499 3D to 2D projection, 81 3rd order data, 499, 500 Three-dimensional data, 81, 501 Three-dimensional surface plot, 502, 503 Three factor, two level, 89 Total degrees of freedom, calculation, 59, 70, 216, 487–8 calculation, 475 Training set (Calibration set), 135, 137, 379, 389, 495 Transfer of calibrations, 3 Transmittance multiplication factor, 331 Transmittance, 275 Transpose of a matrix, 12, 28 Trigonometric functions of a right triangle, 82 True derivative, 347 True error, 231 True value, 175, 183, 194, 202, 216, 217, 468 Trumpet curve, 483 2D into 1D by rotation, 84 Two-dimensional contour: map overlay onto a three-dimensional surface plot, 502 plot, 501, 503 Two-dimensional coordinate space, 72 Two-dimensional reduction, 83 Two equations and two unknowns, 43 Two-factor design, 63 Two-factor, two-level crossed experiment, 91–2 Two-sample charts, 188, 219 Two-way analysis of variance (ANOVA), 64, 65 Type of noise, 246, 332 U.S. Environmental Protection Agency (EPA), 484 UCL, upper confidence limit, 390, 391 Unaddressed Problems in Chemometrics, 135 Unbiased estimators, 431, 487 Uncertainty in an Analytical Measurement, 487 Uncontrolled, nonsystematic variable, 187–8 Undefined error, 34, 251 Unexplained Error, 28 Uniformly distributed noise, 275, 277, 278
Unit matrix, 23, 25, 26, 469 Univariate least squares regression, or Simple least squares regression (SLSR), 3 Univariate methods, 419, 421 Univariate statistics, 6–7, 123–4, 419–21, 462–3 Unsolved problems, 135, 162 Unsystematic (random), errors, 52, 64, 66, 170–1, 188–9, 421, 424, 447–9, 453, 460, 462, 463 Upper confidence limit (UCL), 390, 391 Upper critical limit, 98 Upper Limit, 271, 307, 389, 390, 404, 407, 408, 439 Validation, 133, 135–7, 375, 419, 464 of calibration models, 3 parameters, 420 Validity of a test set, 135 Variability, measures of, 61, 98, 260, 277, 429 Variable, 4, 6, 23, 25, 28, 29, 33, 40, 47, 51, 52, 53, 131, 146, 150, 153, 155, 228, 229, 330 apparent sample size, 483 interaction, 460–2 nested, 54, 62 uncontrolled, nonsystematic, 187–8 Variance(s) (σ²), 58, 250 addition of, 357 between groups, 58–9, 212, 215 computation of, 258, 315, 564 definition of, 262, 265, 473 heterogeneous, 356 homogeneous, 268, 376 population (σ²), 52 sample (s² or S²), 431–2, 491 square root of, 474 sum of, 229, 232, 261 techniques, 168, 171, 179–80 terms become infinite at sufficiently small values of Er, 267 of variance, 65, 262, 473 within groups, 58–9 of X, 356, 369, 376 of Y, 376 Variation of pathlength, 225
Variations in temperature, 63, 224–5 Vector(s), 6, 7, 11, 77, 85, 86, 87, 382, 471, 495, 496 addition, 78 division, 78 multiplication, 77, 78 subtraction, 79 Vignetting the beam, 326 Voigtman’s development, 291 Wavelength selection error, 131, 157, 459, 460, 464 Wavelets, 166, 494 Within-laboratory variation, 481 Within-treatment mean square, 70, 175–6 X-axis (abscissa), 71, 82, 93, 121, 125, 131, 285, 336, 337, 342, 343, 350, 364, 378, 415, 424, 425, 449, 450 X-direction angle, 82, 83 X-ray, 223, 281, 282, 285, 298 X-scale, 365, 415 X-variable, 28, 121, 425, 445, 449, 460 X versus Y, 375–7, 379 figure of, 378 X, Y coordinate spatial image, 500 Y-axis (ordinate), 71 Y-direction angle, 72, 73 Y distribution, 376 Y estimate, 34, 120, 122–4, 187–9, 337, 381 Y-intercept, 95, 421 Y variable, 34, 124, 150, 368, 474 Youden, W.J., 171, 172, 187 Youden/Steiner Comparison of Two Methods, 172 Youden’s monograph, 171 Z: calculation of, 403, 404 as number of standard deviations from the hypothesized population mean, 330 Z axis data, 501 Z statistic, 94, 375, 389, 390, 391, 392, 404 Z-test, 93 Zero-crossing, 350 μ (mu), mean of a population, X, 98, 103–104 ρ (rho), population value for correlation coefficient, 103
COLOUR PLATE SECTION
Colour Plate 1 Six samples worth of spectra with two bands, without (left) and with (right) stray light. (see Figure 27-1, p. 132)
[Plot: PLS Loadings; y-axis: Loading]
Colour Plate 2 PLS loadings from the synthetic data used to test the fit of models to nonlinearity. (see Figure 33-1, p. 164)
[Plot: Exact versus approximate solution; y-axis: Absorbance noise; x-axis: %T]
Colour Plate 3 Absorbance noise as a function of transmittance, for the exact solution (upper curve: equation 42-32) and the approximate solution (lower curve: equation 42-33). The noise-to-signal ratio, i.e., ΔE/Er, was set to 0.01. (see Figure 42-2, p. 237)
[Plots: "Integration terms" and "Expansion of integral functions"; curves: f(Er), Normal distribution, Product; x-axis: ΔEr]
Colour Plate 4 The Normal curve, the function f(Er) [= Er/(Er + ΔEr)] from equation 43-62, and their product. (see Figure 43-5, p. 248)
[Plot: Multiplication factor for T as a function of Er; curves from σ = 0.1 to σ = 1.0; y-axis: Multiplication factor; x-axis: Er]
Colour Plate 5 Family of curves of multiplication factor as a function of Er, for different values of the parameter sigma (the noise standard deviation), for Normally distributed error. Values of sigma range from 0.1 to 1.0 for the ten curves shown. (see Figure 43-6, p. 251)
[Plot: y-axis: Transmittance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 6 Transmittance noise as a function of reference S/N ratio, for alternate analysis (equation 44-68a). The sample transmittance was set to unity. The limit for the value of (Es + ΔEs)/(Er + ΔEr) was set to 10,000 for the upper curve and to 1000 for the lower curve. (see Figure 44-7a1, p. 263)
[Plot: y-axis: Transmittance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 7 Expansion of Figure 44-7a1. (see Figure 44-7a2, p. 263)
[Plot: curves: Monte-Carlo (equation 44-76a), Theory (equation 44-19), Approx (equation 44-52b); y-axis: Transmittance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 8 Comparison of empirically determined transmittance noise values with those determined according to the low-noise approximations of equation 44-19 and equation 44-52b. (see Figure 44-8a, p. 264)
[Plot: y-axis: Transmittance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 9 Transmittance noise as a function of reference S/N ratio, at various values of sample transmittance. Blue curve: T = 1. Green curve: T = 0.5. Red curve: T = 0.1. (see Figure 44-9a1, p. 265)
[Plot: curves: T = 1, T = 0.5, T = 0.1; y-axis: Transmittance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 10 Expansion of Figure 44-9a1. (see Figure 44-9a2, p. 265)
[Plot: curves: S/N = 4, S/N = 4.5; y-axis: Transmittance noise; x-axis: Transmittance]
Colour Plate 11 Transmittance noise as a function of transmittance, for different values of reference energy S/N ratio (recall that, since the standard deviation of the noise equals unity, the set value of the reference energy equals the S/N ratio). (see Figure 44-10a, p. 266)
[Plot: curves: Computed, Theory; y-axis: Absorbance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 12 Comparison of computed absorbance noise to the theoretical value (according to equation 44-32), as a function of S/N ratio, for constant transmittance (set to unity). (see Figure 44-11a1, p. 267)
[Plot: curves: Computed, Theory; y-axis: Absorbance noise; x-axis: S/N (Er/ΔEr)]
Colour Plate 13 Expansion of Figure 44-11a1. (see Figure 44-11a2, p. 268)
[Plot: curves: Er = 10, Er = 3; y-axis: SD(A)/A; x-axis: %T]
Colour Plate 14 Family of curves for SD(A)/A for different values of Er. (see Figure 45-10, p. 273)
[Plots: "Variances using 5,000 and 100,000 values" and "Expansion of plot"; curves: Er term, 100,000 values; Es term, 100,000 values; 5,000 values; y-axis: Variance; x-axis: Er]
Colour Plate 15 Values of the variances in the two terms of equation 45-77, using different numbers of values. (see Figure 45-12, p. 275)
[Plot (a): Poisson distribution, 1 ≤ λ ≤ 11; curves from λ = 1 to λ = 11; y-axis: P(X); x-axis: X]