
STATISTICAL COMPUTING ENVIRONMENTS FOR SOCIAL RESEARCH

Robert Stine
John Fox
editors

SAGE Publications
International Educational and Professional Publisher
Thousand Oaks  London  New Delhi

Copyright © 1997 by Sage Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

For information address: SAGE Publications, Inc. 2455 Teller Road Thousand Oaks, California 91320

SAGE Publications Ltd.
6 Bonhill Street
London EC2A 4PU
United Kingdom

SAGE Publications India Pvt. Ltd.
M-32 Market
Greater Kailash I
New Delhi 110 048
India

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Main entry under title:
Statistical computing environments for social research / editors, Robert Stine, John Fox.
  p. cm.
  Includes bibliographical references (p. _) and index.
  ISBN 0-7619-0269-4 (cloth: acid-free paper). — ISBN 0-7619-0270-8 (pbk.: acid-free paper)
  1. Social sciences—Statistical methods—Computer programs. I. Stine, Robert A. II. Fox, John, 1947—
HA32.S678 1996
300'.285—dc20                                                96-9978

This book is printed on acid-free paper.

Production Editor: Gillian Dickens

Contents

1. Editors' Introduction
   ROBERT STINE and JOHN FOX

PART 1: COMPUTING ENVIRONMENTS

2. Data Analysis Using APL2 and APL2STAT
   JOHN FOX and MICHAEL FRIENDLY

3. Data Analysis Using GAUSS and Markov
   J. SCOTT LONG and BRIAN NOSS

4. Data Analysis Using Lisp-Stat
   LUKE TIERNEY

5. Data Analysis Using Mathematica
   ROBERT A. STINE

6. Data Analysis Using SAS
   CHARLES HALLAHAN

7. Data Analysis Using Stata
   LAWRENCE C. HAMILTON and JOSEPH M. HILBE

8. Data Analysis Using S-Plus
   DANIEL A. SCHULMAN, ALEC D. CAMPBELL, and ERIC C. KOSTELLO

PART 2: EXTENDING LISP-STAT

9. AXIS: An Extensible Graphical User Interface for Statistics
   ROBERT STINE

10. The R-code: A Graphical Paradigm for Regression Analysis
    SANFORD WEISBERG

11. ViSta: A Visual Statistics System
    FORREST W. YOUNG and CARLA M. BANN

Index

About the Authors

Editors' Introduction

ROBERT STINE
JOHN FOX

1.1 OVERVIEW

The first—and larger—part of this book describes seven statistical computing environments: APL2STAT, GAUSS, Lisp-Stat, Mathematica, S, SAS, and Stata. The second part of the book describes three innovative statistical computing packages—Axis, R-code, and Vista—that are programmed using Lisp-Stat.

Statistics software has come a long way since the days of preparing batches of cards and waiting hours for reams of printed output. Modern data analysis typically proceeds as a flowing sequence of interactive steps, where the next operation depends upon the results of previous ones. All of the software products featured in this book offer immediate, interactive operation in which the system responds directly to individual commands—either typed or conveyed through a graphical interface. Indeed, the nature of statistics itself has changed, moving from classical notions of hypothesis testing toward graphical, exploratory modeling which exploits the flexibility of interactive computing and high-resolution displays. All of the seven computing environments make use of some form of integrated text/program editor, and most take more or less sophisticated advantage of window-based interfaces.

What distinguishes a statistical computing environment from a standard statistical package?

Programmability. The most important difference is that standard packages attempt to adapt to users' specific needs by providing a relatively small number of preprogrammed procedures, each with a myriad of options. Statistical computing environments, in contrast, are much more programmable. These environments provide preprogrammed statistical procedures—perhaps, as in the case of S or Stata, a wide range of such procedures—but they also provide programming tools for building other statistical applications.

Extensibility. To add a procedure to a traditional statistical package requires—if it is possible at all—a major programming effort in a language, such as C or Fortran, that is not specifically oriented towards statistical computation. Programming environments, in contrast, are extensible: a user's programs are employed in the same manner as programs supplied as part of the computing environment. Indeed, the programming language in which users write their applications is typically the same as that in which most of the environment is programmed. The distinction between users and developers becomes blurred, and it is very simple to incorporate others' programs into one's own.

Flexible data model. Standard statistical packages typically offer only rectangular data sets whose "columns" represent variables and whose "rows" are observations. Although the developers who write these packages are able to use the more complex data structures provided by the programming languages in which they work, these structures—arrays, lists, objects, and so on—are not directly accessible to users of the packages. The packages are geared towards processing data sets and producing printed output, often voluminous printed output. Programming environments provide much greater flexibility in the definition and use of data structures. As a consequence, statistical computing in these environments is primarily transformational: output from a procedure is more frequently a data structure than a printout. Programs oriented towards transforming data can be much more modular than those that have to produce a final result suitable for printed display.


These characteristics of statistical computing environments allow researchers to take advantage of emerging statistical methodologies. In general, nonlinear adaptive estimation methods have gradually begun to replace the simpler traditional schemes. For example, rather than restrict attention to the traditional combination of normally distributed errors and least-squares estimators, robust estimation leads to estimators more suited to the non-Gaussian features typical of much real data. Such robust estimators are generally nonlinear functions of the data and require more elaborate, iterative estimation. Often, however, standard software has not kept pace with the evolving methodology—try to find a standard statistics package that computes robust estimators. As a result, the task of implementing new procedures is often the responsibility of the researcher. In order to use the latest algorithms, one frequently has to program them.

Other comparisons and evaluations of statistical computing tools appear in the literature. For example, in a discussion that has influenced our own comments on the distinction between packages and environments, Therneau (1989, 1993) compares S with SAS. Also, in a collection of book reviews, Baxter et al. (1991) evaluate Lisp-Stat and compare it with S. Researchers who do not expect to write their own programs or who are primarily interested in teaching tools might find the evaluation in Lock (1993) of interest.

1.2 SOME BROAD COMPARISONS

All of the programming environments include a variety of standard statistical capabilities and, more importantly, include tools with which to construct new statistical applications. APL2STAT, Lisp-Stat, and GAUSS are built upon general-purpose interactive programming languages—APL2, Lisp, or a proprietary language. Because these are extensible programming languages that incorporate flexible data structures, it is relatively straightforward to provide a shell for the language that is specifically designed to support statistical computation. The chapters that describe these environments discuss how a shell for each is used. Because of their expressive power, APL2 and Lisp are particularly effective for programming shells, as illustrated for LispStat in the three chapters that comprise the second part of the book.


In keeping with the evolution of their underlying languages, Lisp-Stat and APL2STAT encourage object-oriented programming and offer objects that represent, for example, plots and linear models. One must keep in mind, though, that both of these object systems are customized, nonstandard implementations. Given the general success of object-oriented programming, it is apparent that this trend will sweep through statistical computing as well, and the two chapters offered here give a glimpse of the future of statistical software. (The S package is also moving in this direction.) In addition, Lisp-Stat possesses the most extensive—and extensible—interactive graphics capabilities of the seven products, along with powerful tools for creating graphical user interfaces (GUIs) to statistical software. To illustrate the power and promise of these tools for the development of statistical software, we have included three chapters describing statistical packages that are written in Lisp-Stat.

GAUSS, like the similar MatLab, is a matrix manipulation language that possesses numerous intrinsic commands for computing regressions and performing other statistical analyses. Because so many of the computing tasks of statistics amount to some form of matrix manipulation, GAUSS provides a natural environment for doing statistics. Its specialization to numerical matrices and PC platforms provides the basis for GAUSS's well-known speed. Many users of GAUSS see it as a tool for statistics and use it for little else.

In contrast, relatively few Mathematica users do data analysis. Mathematica is a recent symbolic manipulation language similar to Axiom, Derive, MACSYMA, and Maple. As shown in the accompanying chapter, one can use Mathematica as the basis for analyzing data, but its generality makes it comparatively slow. None of the other computing environments, however, comes with its ability to simplify algebraic expressions.

Although S incorporates a powerful programming language with flexible data structures, it was designed—in contrast to Lisp or APL—specifically for statistical computation. Most of the statistical procedures supplied with S are written in the same language that users employ for their own programs. Originating at Bell Laboratories (the home-away-from-home of John Tukey and the home of William Cleveland), S includes many features of exploratory data analysis and modern statistical graphics. It is probably the most popular computing tool in use by the statistical research community. As a result, originators of new statistical methods often make them available in S, usually without cost. S-plus, a commercial implementation of S, augments the basic S software and adapts it to several different computing platforms.

Stata is similar in many respects both to S and to Gauss. Like S, Stata has a specifically statistical focus and comes with a broad array of built-in statistical and graphical programs. Like Gauss, Stata has an idiosyncratic programming language that permits the user to add to the package's capabilities.

SAS, of course, is primarily a standard statistics package, offering a wide range of preprogrammed statistical procedures which process self-described rectangular data sets to produce printouts. Most SAS users employ the package for statistics, although SAS has grown to become a data-management tool as well. SAS originated in the days of punch cards and batch jobs, and the package retains this heritage: one can think of the most current window-based implementations of SAS as consisting of a card-reader window, two printer windows (for "listing" and "log" output), and a graphics screen. There is even an anachronistic CARDS statement for data in the input stream. But SAS is programmable in three respects: (1) the SAS data step is really a simple programming language for defining and manipulating rectangular data sets; (2) SAS incorporates a powerful (if slightly awkward) macro facility; and (3) the SAS interactive matrix language (IML), which is the focus of the chapter on SAS in this book, provides a computing environment with tools for programming statistical applications which can, to a degree, be integrated with the extensive built-in capabilities of the SAS system.

Table 1.1 offers a comparison of some of the features of the seven computing environments. This table includes the suggested retail price for a DOS or Windows implementation, an indication of platforms for which the software is available, the time taken for several common computing tasks, and several programming features. All of the programs are available for DOS or Windows-based personal computers. The table indicates if the software is available for Macintosh and Unix systems. The timings, supplied by the editors and authors of the chapters in this book, were taken on several different machines (all 486 PCs) and may not, therefore, be wholly comparable. Moreover, it is our experience that particular programs can perform


TABLE 1.1 Comparison of the Seven Statistical Computing Environments in Terms of Price for a DOS/Windows Implementation; Availability on Alternative Platforms; Speed for Several Computing Tasks on a 66-MHz 486 Machine; and Some Programming Features

Feature   APL2STAT   Lisp-Stat   Mathematica   Gauss   S-plus   Stata      SAS
Price     $0/$630    $0          $795          $495    $1450    $338/944   $3130 + $1450

The robust-regression estimate of $\boldsymbol{\beta}$ minimizes the fitting criterion

$$\sum_{i=1}^{n} \rho(y_i - \mathbf{x}_i'\boldsymbol{\beta}), \tag{1}$$

where the function $\rho$ measures the lack of fit at the $i$th observation, $y_i$ is the dependent-variable value, and $\mathbf{x}_i$ is the vector of values of the independent variables. (For clarity, this notation suppresses the potential dependence of $\rho$ on $\mathbf{x}_i$; the function $\rho$ can be made to depend on $\mathbf{x}_i$ to limit the leverage of observations.) The fitting criterion of robust regression generalizes the usual least-squares formulation in which $\rho(y - \mathbf{x}'\boldsymbol{\beta}) = (y - \mathbf{x}'\boldsymbol{\beta})^2$. With suitable assumptions, we can differentiate with respect to $\boldsymbol{\beta}$ in (1) and obtain a set of equations that generalizes the usual normal equations,

$$\sum_{i=1}^{n} \psi(y_i - \mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i = \mathbf{0}, \tag{2}$$

where $\psi(e) = d\rho(e)/de$. The function $\psi$ is known as the influence function of the robust regression (not to be confused with the influence values computed in regression diagnostics). One must resort to a very general minimization program to solve (2) directly. A clever trick, however, converts this system to a more tractable form and reveals the effect of the influence function. In particular, assume that we can write $\psi(e) = eW(e)$ so that (2) becomes a weighted least-squares problem,

$$\sum_{i=1}^{n} e_i \mathbf{x}_i w_i = \mathbf{0}, \tag{3}$$

where $e_i = y_i - \mathbf{x}_i'\boldsymbol{\beta}$ is the $i$th residual and $w_i = W(e_i)$ is the associated weight. The weights are clearly functions of the unknown coefficients, leading to an apparent dilemma: the weights $w_i$ require the coefficients $\boldsymbol{\beta}$, and the coefficients require the weights.

The most popular method for resolving this dilemma and computing $\boldsymbol{\beta}$ is iteratively reweighted least squares. The essential idea of this approach is simple: start the iterations by fitting an ordinary least-squares regression, that is, treat the weights as constant, $w_i = 1$, and solve for $\boldsymbol{\beta}$ from (3). Label the initial estimate of $\boldsymbol{\beta}$ as $\boldsymbol{\beta}_0$. Given $\boldsymbol{\beta}_0$, a new, data-dependent set of weights is $w_i = W(y_i - \mathbf{x}_i'\boldsymbol{\beta}_0)$. Returning to (3), the new set of weights defines a new vector of estimates, say $\boldsymbol{\beta}_1$. The new coefficients lead to a newer set of weights, which in turn give rise to a still newer set of estimates, and so forth.

Typically, the influence function $\psi$ is chosen so that the weight function $W$ down-weights observations with large absolute (standardized) residuals. Least squares aside, the most popular influence functions are the Huber and the biweight, for which the weight functions are, respectively,

$$W_H(e) = \begin{cases} 1, & \text{if } |e| \le 1 \\ 1/|e|, & \text{if } |e| > 1 \end{cases}$$

and

$$W_B(e) = \begin{cases} (1 - e^2)^2, & \text{if } |e| \le 1 \\ 0, & \text{if } |e| > 1. \end{cases}$$

Qualitatively, the Huber merely bounds the effect of outliers, whereas the biweight eliminates sufficiently large outliers. Because robust regression reduces to least squares when $W(e) = 1$ for all $e$, least squares does not limit the effect of outliers. As defined, neither of these weight functions is practical, because both ignore the scale of the residuals. In practice, one evaluates the weight for an observation with residual $e_i$ as $w_i = W(e_i/(cs))$, where $c$ is a tuning constant that controls the level of robustness and $s$ is a measure of the scale of the residuals. A popular robust scale estimate is based on the median absolute deviation (MAD) of the residuals, $s = \mathrm{MAD}/0.6745$. Division by 0.6745 gives an estimate of the standard deviation when the error distribution is normal. For the robust estimate of $\boldsymbol{\beta}$ to be 95% efficient when sampling from a normal population, the tuning constant for the weight function is $c = 1.345$ for the Huber (equivalent to $1.345/0.6745 \approx 2$ MAD's), and $c = 4.685$ for the biweight (equivalent to $4.685/0.6745 \approx 7$ MAD's). Figure 1.1 shows a plot of the Huber and biweight functions $W_H$ and $W_B$ defined with these tuning constants.


Figure 1.1. Biweight and Huber Robust-Regression Weight Functions

A remaining subtle feature of programming a robust regression is deciding when to terminate the iterations. It can be proven that iterations based on the Huber weight function must converge. Those for the biweight, however, may not: one need not obtain a unique solution to (2) when using the biweight. Keeping the potential for divergence in mind, iteration proceeds until the fitted regression coefficients are unchanged, to some acceptable degree of approximation, from one iteration to the next.

Programming a robust regression within the confines of a statistics package presents some interesting challenges. First, one must be able to use the coefficients estimated by the initial regression in subsequent computations, and then be able to program a sequence of weighted least-squares regressions that terminates when some condition is satisfied. Second, one must decide how much flexibility to permit in the choice of estimators. Because the choice of the influence function is crucial, one might decide to let this function also be an argument of the computing procedure. Users can then program their own influence functions rather than be restricted to a limited domain of choices. Table 1.1 shows the environments that have this degree of flexibility.
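The iteration just described is easy to express in any of these environments. As a language-neutral illustration, the following is a minimal sketch in Python with NumPy, not code from any of the seven packages; the function names, tolerance, and iteration cap are our own choices. It implements iteratively reweighted least squares with the weight function passed as an argument, which is exactly the flexibility discussed above.

import numpy as np

def huber_weight(u):
    # Huber weight: 1 inside the tuning region, 1/|u| outside (c applied by the caller).
    au = np.abs(u)
    return np.where(au <= 1.0, 1.0, 1.0 / np.maximum(au, 1e-12))

def bisquare_weight(u):
    # Biweight: (1 - u^2)^2 inside the region, 0 outside.
    return np.where(np.abs(u) <= 1.0, (1.0 - u**2)**2, 0.0)

def irls(y, X, weight=huber_weight, c=1.345, tol=1e-6, max_iter=50):
    # Iteratively reweighted least squares for a robust linear regression.
    # y: (n,) response; X: (n, p) model matrix (include a constant column).
    b = np.linalg.lstsq(X, y, rcond=None)[0]      # start from ordinary least squares
    for _ in range(max_iter):
        resid = y - X @ b
        s = np.median(np.abs(resid)) / 0.6745     # robust MAD scale estimate
        w = weight(resid / (c * s))               # down-weight large residuals
        rw = np.sqrt(w)
        # Weighted least squares via the square-root-of-weights trick.
        b_new = np.linalg.lstsq(X * rw[:, None], y * rw, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < tol * np.max(np.abs(b) + tol):
            return b_new
        b = b_new
    return b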

BOOTSTRAP RESAMPLING

As previously noted, bootstrap resampling provides an alternative to asymptotic expressions for the standard errors of estimated regression coefficients. The method assesses sampling variation empirically by repeatedly sampling observations (with replacement) from the observed data, and calculating regression coefficients for each of these bootstrapped samples.

One can build the bootstrap samples for a regression problem in two ways, depending upon whether the model matrix $\mathbf{X}$ is treated as fixed or random. Denote the data for the $i$th observation as $(y_i, \mathbf{x}_i)$. The procedure for one iteration of the random-X bootstrap is:

R1. Form a set of $n$ random integers, uniformly distributed on $1, \ldots, n$, and label these $\{i_1^*, \ldots, i_n^*\}$. That is, sample with replacement $n$ times from the integers $1, \ldots, n$.

R2. Build a bootstrap sample of size $n$ defined as $\mathbf{y}^* = [y_{i_1^*}, \ldots, y_{i_n^*}]'$ and $\mathbf{X}^* = [\mathbf{x}_{i_1^*}, \ldots, \mathbf{x}_{i_n^*}]'$.

R3. Compute the associated regression estimates from $\mathbf{y}^*$ and $\mathbf{X}^*$. In the case of least squares these would be $\hat{\boldsymbol{\beta}}^* = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{y}^*$.

The procedure to use if $\mathbf{X}$ is treated as fixed is first to compute the residuals $\{e_1, \ldots, e_n\}$ from the original fitted model. The bootstrap for fixed-X is then:

F1. Same as R1.

F2. Compute $\mathbf{y}^* = \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{e}^*$, where $\mathbf{e}^* = (e_{i_1^*}, \ldots, e_{i_n^*})'$.

F3. Same as R3, but with $\mathbf{X}^* = \mathbf{X}$.

In either case, one repeats the calculation $B$ times, forming a set of bootstrap replications $\{\hat{\boldsymbol{\beta}}_1^*, \ldots, \hat{\boldsymbol{\beta}}_B^*\}$ of the original estimates. This collection provides the basis for estimating standard errors and confidence intervals. Thus the bootstrapping task involves drawing random samples, iterating a procedure (the robust regression), and collecting the coefficients from each of the bootstrap replications. The sequence of tasks is very similar to that in a traditional simulation, but in a simulation samples are drawn from a hypothetical distribution rather than from the observed data.
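To make this sequence of tasks concrete, here is a companion sketch of the random-X bootstrap in the same illustrative Python style as the earlier sketch, not taken from any of the book's environments; the helper name random_x_bootstrap and its arguments are hypothetical.

import numpy as np

def random_x_bootstrap(y, X, estimator, B=100, seed=12345):
    # Random-X bootstrap: resample (y_i, x_i) pairs with replacement B times.
    # estimator(y, X) must return a coefficient vector; it could be ordinary
    # least squares or the robust irls() sketched earlier.
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # step R1: n indices with replacement
        reps.append(estimator(y[idx], X[idx]))    # steps R2-R3: refit on the bootstrap sample
    return np.asarray(reps)

# Bootstrap standard errors are the column standard deviations of the replications:
# se = random_x_bootstrap(y, X, irls).std(axis=0, ddof=1)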

KERNEL DENSITY ESTIMATION

Kernel density estimation is a computationally demanding method for estimating the shape of a density function. Given a random sample $Z_1, Z_2, \ldots, Z_n$ from some population, the kernel estimator of the density of the $Z$'s is the function

$$\hat{f}(z) = \frac{1}{nhs} \sum_{i=1}^{n} K\!\left(\frac{z - Z_i}{hs}\right), \tag{4}$$

where $h$ is termed the smoothing constant, $s$ is an estimate of scale, and $K$ is a smooth density function (such as the standard normal), called the kernel. While the choice of the kernel $K$ can be important, it is $h$ that dominates the appearance of the final estimate: the larger the value of $h$, the smoother the estimate becomes. The optimal value of $h$ depends on the unknown density, as well as on the kernel function. Typically, programs choose $h$ so that the method performs well if the population is normally distributed. Stine's chapter uses Mathematica to derive the optimal value for $h$ in this context.

The calculation of the kernel density estimator offers many of the computing tasks seen in the other problems, particularly iteration and the accumulation of iterative results. In practice, $\hat{f}$ is evaluated at a grid of $m$ positions along the $z$-axis, with this sequence of evaluations plotted on a graph. At each position $z$, the kernel function is evaluated for each of the observations and summed as in (4), producing a total of $mn$ evaluations of $K$. Obviously, it is wise to choose a kernel that can be computed quickly and that yields a good estimate. A particularly popular choice is the Epanechnikov kernel,

$$K_E(z) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{z^2}{5}\right), & \text{for } |z| < \sqrt{5} \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

This kernel is optimal under certain conditions. As with the influence function in robust estimation, kernel density estimation requires an estimate of scale. Here the popular choice is to set

$$s = \min\!\left(\hat{\sigma},\ \frac{\text{interquartile range}}{1.349}\right), \tag{6}$$

where $\hat{\sigma}^2 = \sum_i (Z_i - \bar{Z})^2/(n-1)$ is the familiar estimate of variance. Division by the constant $1.349 = 2 \times 0.6745$ makes the interquartile range a consistent estimator of $\sigma$ in the Gaussian setting. Finally, one can exploit the underlying programming language to allow the smoothing kernel to be an input argument to this procedure.
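The following sketch, in the same illustrative Python style as the earlier ones (the names and defaults are our own, not from any package), evaluates the estimator of equation (4) over a grid, with the kernel of equation (5) supplied as an argument and the scale of equation (6) computed from the data.

import numpy as np

def epanechnikov(z):
    # Epanechnikov kernel of equation (5): nonzero only on |z| < sqrt(5).
    return np.where(np.abs(z) < np.sqrt(5.0),
                    0.75 * (1.0 - 0.2 * z**2) / np.sqrt(5.0), 0.0)

def kernel_density(z_data, grid, h, kernel=epanechnikov):
    # Evaluate the kernel density estimate of equation (4) at each grid point.
    z_data = np.asarray(z_data, dtype=float)
    n = len(z_data)
    sigma = z_data.std(ddof=1)
    iqr = np.subtract(*np.percentile(z_data, [75, 25]))
    s = min(sigma, iqr / 1.349)                   # scale estimate of equation (6)
    # One kernel evaluation per (grid point, observation) pair: m * n evaluations in all.
    u = (np.asarray(grid)[:, None] - z_data[None, :]) / (h * s)
    return kernel(u).sum(axis=1) / (n * h * s)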

1.5 STATISTICS PACKAGES BASED ON LISP-STAT

The several statistical computing environments described in the first part of this book vary along a number of dimensions that affect their use in everyday data analysis. In one respect, however, Lisp-Stat distinguishes itself from the others. All of these computing environments encourage users to create new statistical procedures. But by providing tools for building user interfaces, along with tools for programming statistical computations, Lisp-Stat is uniquely well suited to the development of new and innovative statistical packages. What we mean by a "package" in this context is a program, or related set of programs, that mediates the interaction between users and data. Traditional statistical programs, such as SAS, SPSS, and Minitab, are packages in this sense. Developing a package in Lisp-Stat entails several advantages, however:

• The package can take advantage of Lisp-Stat's substantial built-in statistical capabilities. You do not have to start from scratch.
• Because Lisp-Stat provides tools for the development of graphical interfaces, the cost of experimentation is low.
• A package can easily itself be made extensible, both by the developer and by users. Lisp-Stat's programming capabilities and object system encourage this type of open-ended design.

The three Lisp-Stat-based statistical packages described here—Axis, R-code, and Vista—all make extensive use of the GUI-building and object capabilities of Lisp-Stat, but in quite different ways. Of the three packages, R-code's approach is the most traditional: users interact with their data via drop-down menus, dialog boxes, and a variety of "plot controls." The various menus, dialogs, and controls are, however, very carefully crafted, and the designers of the package have taken pains to make it easy to modify and extend. The result is a data-analysis environment that is a pleasure to use, and one that incorporates state-of-the-art methods for the graphical exploration of regression data.

Axis offers many of the same statistical methods as R-code, but presents the user with a substantially different graphical interface. Here, for example, it is possible to manipulate variables visually by moving tokens that represent them. Axis also includes a very general mechanism for addressing features in the underlying Lisp-Stat code by permitting users to send arbitrary "messages" to objects. Like R-code, Axis incorporates tools to facilitate additions.

Vista uses the graphical-interface tools of Lisp-Stat in the most radical manner, not only to control specific aspects of a data analysis, but also to represent visually—and in alternative ways—the data-analysis process as a whole. Vista accomplishes these representations by developing simple, yet elegant, relationships between the functions that are employed to process data and various elements of visual displays, such as nodes and edges in a graph. Although it remains to be seen whether these visualizations of the process of data analysis prove productive, it is our opinion that the ideas incorporated in Vista are novel,

interesting, and worth pursuing. We believe that it is no accident that Vista also includes tools for extension: this style of programming is encouraged by the object orientation of Lisp-Stat.

1.6 HOW TO ACQUIRE THE STATISTICAL COMPUTING ENVIRONMENTS

APL2STAT and the freeware TryAPL2 interpreter are available by anonymous ftp from watserv1.waterloo.edu. The APL2STAT software is in the directory /languages/apl/workspaces in the files a2stry20.zip (for use with TryAPL2) and a2satf20.zip (for use with APL2/PC). The TryAPL2 software is in the directory /languages/apl/tryapl2. The commercial APL2/PC interpreter may be purchased from APL Products, IBM Santa Teresa, Dept. M46/D12T, P.O. Box 49023, San Jose, CA 95161-9023.

GAUSS may be purchased from Aptech Systems Inc., 23804 SE Kent-Kangley Road, Maple Valley, WA 98038.

Lisp-Stat is available by anonymous ftp from umnstat.stat.umn.edu.

Mathematica may be purchased from Wolfram Research Inc., 100 Trade Center Drive, Champaign, IL 61820-7237.

S-plus may be purchased from Statistical Sciences Inc., 1700 Westlake Ave. N., Seattle, WA 98109.


SAS may be purchased from the SAS Institute Inc., SAS Campus Drive, Cary, NC 27513.

Stata may be purchased from Stata Corporation, 702 University Dr. East, College Station, TX 77840.

Axis may be obtained by anonymous ftp from compstat.wharton.upenn.edu. It is located in the directory pub/software/lisp/AXIS.

and Macintosh computers, is

included with Cook and Weisberg (1994). The Unix implementation

of the package, and information about version 2, may be obtained via the World-Wide Web at http: //www.stat.umn.edu/~rcode/index.htm1.

R-code is licensed to purchasers of Cook and Weisberg’s book. Vista may be obtained by anonymous ftp from www.psych.unc.edu. REFERENCES Baxter, R., Cameron,

M., Weihs, C., Young, F. W., & Lubinsky, D. J. (1991). “Lisp-Stat:

Book Reviews.” Statistical Science 6 339-362.

Cook, R. D., & Weisberg, S. (1994). An Introduction to Regression Graphics. New York: Wiley. Duncan, O. D. (1961). “A Socioeconomic Index for All Occupations.” In A. J. Reiss, Jr., with O. D. Duncan, P. K. Hatt, & C. C. North (Eds.), Occupations and Social Status.

(pp. 109-138). New York: Free Press. Fox, J. (1991). Regression Diagnostics. Newbury Park: Sage. Fox, J., & Long, J. S. (Eds.) 1990. Modern Methods of Data Analysis. Newbury Park: Sage. Lock, R. H. (1993).

A Comparison of Five Student Versions of Statistics Packages. Amer-

ican Statistician 47 136-145.

Therneau, T. M. (1989, 1993). manuscript.

A Comparison of SAS and S. Mayo Clinic. Unpublished

@



Pas

'

u

»

_

ond

© ,

c=

See

AG

Va

NE

©

. -~

¢

es

ee

eas :

oa

Aas

QT.

ee

LINEAR_MODEL 'PRESTIGE+INCOME+EDUCATION'
LAST_LM

Figure 2.1. A Scatterplot Matrix for Duncan's Occupational Prestige Data
NOTE: The highlighted points were identified interactively with a mouse.

The LINEAR_MODEL function takes a symbolic model specification as a right argument. The syntax is similar to that in SAS or S. A more complex model could include specifications for interactions, transformations, or nesting, for example. Because both INCOME and EDUCATION are quantitative variables, a linear regression model is fit; character variables (e.g., a REGION variable with values 'East', 'Midwest', etc.) are treated as categorical, and dummy regressors are suitably generated to represent them. APL2STAT also makes provision for ordered-category data. Rather than printing a report, the LINEAR_MODEL function returns a linear-model object; because no name for the object was specified, the function used the default name LAST_LM. This object contains the following slots (for brevity, most are suppressed):

8> SLOTS 'LAST_LM'
PARENT
MRE
COEFFICIENTS
COV MATRIX
RESIDUALS
FITTED VALUES
RSTUDENT
HATVALUES
COOKS D

We could retrieve the contents of a slot directly, using the GET function, but it is more natural to process LAST_LM using functions meant for linear-model objects. PRINT, for example, will find an appropriate method to print out a report of the regression:

9> PRINT 'LAST_LM'
GENERAL LINEAR MODEL: PRESTIGE+INCOME+EDUCATION

            Coefficient   Std.Error    t         p
CONSTANT     -6.0647       4.2719     -1.4197    0.16309
INCOME        0.59873      0.11967     5.0033    0.00001061
EDUCATION     0.54583      0.098253    5.5554    1.7452E-6

Source        df      SS        F         p
Regression     2      36181     101.22    0
Residuals     42      7506.7
Total         44      43688

R-SQUARE = 0.82817     SE = 13.369     N = 45

Source        SS        df     F         p
CONSTANT                 1     2.0154    0.16309
INCOME        4474.2     1     25.033    0.00001061
EDUCATION     5516.1     1     30.863    1.7452E-6

Likewise, the INFLUENCE_PLOT function plots studentized residuals against hatvalues, using the areas of plotted circles to represent Cook's D influence statistic. The result is shown in Figure 2.2, where the labeled points were identified interactively.

10> INFLUENCE_PLOT 'LAST_LM'

Figure 2.2. Plot of Studentized Residuals by Hatvalues for Duncan's Regression
NOTE: The areas of the circles are proportional to Cook's D statistic, a measure of influence in the regression. The labeled points were identified interactively with a mouse. The horizontal lines are at studentized residuals of zero and ±2, and the vertical lines are at twice and three times the average hatvalue.


This analysis (together with some diagnostics not shown here) led us to refit the Duncan regression without ministers and railroad conductors:

11> OMIT 'minister' 'railroad_conductor'

12> 'NEW_FIT' LINEAR_MODEL 'PRESTIGE+INCOME+EDUCATION'
NEW_FIT

13> PRINT 'NEW_FIT'
GENERAL LINEAR MODEL: PRESTIGE+INCOME+EDUCATION

            Coefficient   Std.Error    t         p
CONSTANT     -6.409        3.6526     -1.7546    0.086992
INCOME        0.8674       0.12198     7.111
EDUCATION     0.33224      0.09875     3.3645    0.0017048

The OMIT function locates the observations in the observation-names vector and places corresponding zeroes in the observation-selection vector ASELECT. (APL2STAT functions and operators use ASELECT to determine which observations to include in a computation; observations corresponding to entries of one are included and those to entries of zero are not.) Notice that we explicitly name the new fitted model NEW_FIT to avoid overwriting LAST_LM. The income coefficient is substantially larger than before, and the education coefficient is substantially smaller than before.

2.3 PROGRAMMING (AND BOOTSTRAPPING) ROBUST REGRESSION IN APL2

APL2STAT includes a general and flexible function for robust regression, but to illustrate how programs are developed in APL2, we write a robust-regression function that computes an M-estimator using the bisquare weight function. To keep things as simple as possible, we bypass the general data handling and object facilities provided by APL2STAT. Similarly, we do not make provision for specifying alternative weight functions, scale estimates, tuning constants, and convergence criteria, although it would be easy to do so. A good place to start is with the bisquare weight function, which may be simply written in APL in the following manner:


∇WT←BISQUARE Z
[1]  WT←(1>|Z)×(1-Z*2)*2
∇

WLS

YX3;Y3X

| Ye=TOS i2eexXel I 2NX [3]

WT+WT*0.5

[4] Be(YxWT)BXx[1] WT Ey

¢ The function WLS takes weights as a left argument and a matrix, the first

column of which is the dependent variable, as a right argument. ¢ The first line of the function extracts the first column of the right argument, ravels this column into a vector (using ,), and assigns the result to Y; the

second line extracts all but the first column, assigning the result to X. ¢ The square roots of the weights are then computed, and the regression coefficients B are calculated by ordinary least-squares regression of the weighted Y values on the weighted Xs. The domino or quad-divide symbol (@) used dyadically produces a least-squares fit. (Used monadically, this symbol returns the inverse of a nonsingular square matrix.)

Our robust-regression function takes a matrix right argument, the first column

of which is the dependent variable, and returns regres-

sion coefficients as a result: VB+ROBUST [1]

YX;ITERATION;LAST_B;X;Y

Ye, IT liZyyx

D2

xe wx

[3]

Be YX

[4]

ITERAT IONRNDN (2,2) 1.91837 0.72441 -1.25045 -1.63685

Here and later we show computer code and output in the typewriter font. The » indicates that the command was entered by the user, with

the results generated by GAUSS. You can move around the screen with arrow keys, modify commands, and rerun them. The results of computations can be saved to matrices in memory, which can then be used for additional computations. Printed results can be spooled to an output file that can be examined and edited. As you work interactively, you can save commands to a file that can be edited and rerun. In edit mode, you enter commands

with an ASCII editor. With few

exceptions, you can enter the same commands that you might enter in command mode. The difference is that you save a series of commands to a disk file that can be executed all at once and then edited and rerun at a later time. For complex work or program development, you will usually work in edit mode. From

either command

or edit mode,

you can access

on-line help

by pressing alt-h, which largely reproduces the 2-volume, 1,600-page manual. MATRICES VERSUS DATA SETS

Matrices are two-dimensional arrays of numbers that are assigned names and kept in memory. For example, the command a=RNDN(2,2) creates a 2 x 2 matrix of random normal numbers and assigns this matrix to the variable that is stored in memory. Data sets are stored

COMPUTING

44

ENVIRONMENTS

on disk either in data file format, which allows many variables to be stored in a single file, or in matrix file format, which stores a single

matrix. Data sets must be read into matrices before they can be analyzed with matrix commands. If an entire data set will not fit into memory all at once, it can be read and processed in pieces. Thus statistical procedures are not limited by the amount of data than can be held in memory. The DATALOOP command applies GAUSS matrix commands to in disk files. For example, if you had the vectors x1, x2, and x3 in memory, you could construct a new variable such as newl = SQRT(x1./x2) + LN(x3). If you wanted to apply this transformation to the variables x1, x2, and x3 in the disk file mydata, you would create a new data set, say mydata2, with the commands!

variables

DATALOOP

mydata

MAKE newl ENDATA;

mydata2;

= SQRT(x1./x2)

+ LN(x3);

DATALOOP also includes features for selecting cases and variables. DATALOOP is an extremely powerful feature that allows you to use the full power of GAUSS to construct new variables and modify data sets.

INTRINSIC VERSUS EXTRINSIC COMMANDS

Intrinsic commands are internal to GAUSS and always available for use. Examples are the command

columns

of a matrix, and

MEANC, which takes the mean of the INV, which takes the inverse of a matrix.

Extrinsic commands are constructed from intrinsic commands and/or other extrinsic commands. GAUSS’s flexibility comes from the way that extrinsic commands can be seamlessly integrated into the system. You can add new commands

to accomplish the mundane

(e.g., a

command to change DOS directories) or the complex (e.g., a new interface to GAUSS). For example, we can create the command absmeanc to take the mean of the absolute values of a vector of numbers: PROC absmeanc(x) ; RETP(MEANC(ABS(x))); ENDP;

DATA ANALYSIS

USING GAUSS

AND MARKOV

45

You would save the code to disk as a source file and tell the GAUSS library where the file is located. The first time you use an extrinsic command, the source file is automatically read and compiled. Unless you clear memory

or leave GAUSS,

extrinsic commands

remain

in

memory after their first use. Thus you can use extrinsic commands in exactly the same way as you use intrinsic commands. GAUSS comes with a large run-time library of extrinsic commands including procedures such as ordinary least squares and numerical integration. In our examples, we often use extrinsic commands that are not part of the run-time library. To avoid confusion, all commands that are included with GAUSS are capitalized (e.g., OLS, MEANC), whereas the names of variables or procedures we create are in lowercase (e.g., x1, absmeanc).

VECTORIZING

GAUSS’s power comes from its rich set of matrix commands; its speed comes from the speed with which it manipulates matrices. Thus, when

programming,

you should

use matrix

operations rather

than looping through a series of scalar operations. For example, to compute the mean of the 100 numbers in a column vector x, you could either write a program similar to that used in most programming languages: i ee

EO uae Os DOSUNDIESTe=—

total

= total

1005

+ x[i];

mete 1s ENDO;

mean

= total/i;

or use the matrix command more than 100 times faster.

mean

= MEANC(x). The matrix approach is

GRAPHICS

GAUSS includes a graphics programming language referred to as Publication

Quality Graphics

(PQG).

PQG

produces

noninteractive

46

COMPUTING

ENVIRONMENTS

graphs that use an extremely high 4190 x 3120 pixel resolution. Programs are included for scatterplots, line plots, box and whisker plots,

contour plots, and 3D scatterplots. Using the programming language, you can customize plots nearly any way you want or construct new

types of graphics.

3.2

APPROACHES

In this section, we

TO USING

GAUSS

use multiple regression to illustrate the different

ways you might use GAUSS. Data from Duncan (1961) are used for regressing occupational prestige on income and education. We assume that the data are saved in a GAUSS file named named prestige, income, and education.

USING THE EXTRINSIC COMMAND

duncan, with variables

OLS

The simplest way to compute a regression is with GAUSS’s extrinsic command 0LS. This command takes three variables as input: the name of the data set, the name of the dependent variable, and a vector with

the names of the independent variables. For example, » CALL OLS("duncan,"

"prestige",

"income"|"educ")

where "income"|"educ" is a 2x 1 vector stacking income on top of "educ".

This command produces the following results as seen in Figure 3.1. You can have OLS return results to matrices in memory with the command » {vnam,m,b,stb,vc,stderr,sigma,cx,rsq,resid,dwstat}

= OLS("duncan","prestige","income"|"educ") The same results are returned to the screen but, in addition, the estimated coefficients are returned to the matrix b, the standardized coefficients to the matrix stb, and so on.? These matrixes are then available

for further manipulation.

10D

deg

peztpzepueys

ajeutjsg”

872SSTS0 S672v9d0

|a|< qoza

000°0 000°0 £9T 0

9T6TS8°0 TO8LE8 0 zea

€S7860 L996TT TV6TLZ

St

v9

‘Te ans81y

STQetzeA

onda AWOONT LNYLSNOO

0 0 9-

pEssrs €€L86S £99790

mei)el

TPIOL

:saseo PTIPA

*SS

:pezenbs-y

:sS Tenptsey

SCY

ajeultqs”

ONE

LB9EP

878 °0

669° 90SL

Ee”

0 0 VY

paepueqs

Juepusedeq :@TQetazeA

TPSSSSs TEEOO'S SIOLVsLs

eanTeA-

:4s0 Jo z027z9 pas :poezenbs-1zeqy :wopeerjy JO seerzbeq

*a JO AATTTQeqoid

YaTM

0028° LSadd

000°0 69 facie cv aS

47

48

COMPUTING

ENVIRONMENTS

GAUSS COMMANDS

FROM THE COMMAND

LINE

You could also compute the regression results from the command line by writing a simple program for regression. Assume that the matrices prestige,

income, and educ contain the Duncan

data. We would

compute the regression results in the following steps. First, we assign the data to the matrices y and x to make later formulas simpler. The ONES command adds a column of ones for the intercept. The ~ is the horizontal concatenation operator that combines the vectors into a matrix. > y = prestige

» nobs = ROWS(prestige) >» x = ONES(nobs,1)~income~educ » nvar = COLS(x)-1

Standard formulas for least squares are entered in an obvious way: » > » »

bhat = INVPD(x’x)*x’y resid = y - x*bhat s2 = (resid’resid) / (nobs-nvar) covb = s2*INVPD(x’x)

To compute the t values, we need to take the square roots of the diagonal of the covariance matrix: » seb = SQRT(DIAG(covb) )

> t = bhat

./ seb

This last line includes the command °. /” illustrating that when a command is preceded by a period, the operation is applied to each element of the matrix individually. Thus A./B would divide a;; by b; for all elements. The command A/B, without the period, would divide the matrix A by the matrix B—that is, AB~!. Results can then be printed,

where the ’ transposes a column into a row: » b’ -6.0647

0.5987

0.5458

DATA ANALYSIS

USING GAUSS

AND

MARKOV

49

The obvious limitation of this approach is that each time you want to run a regression, you must reenter each line of code. The solution is to construct a procedure that becomes an extrinsic command. CREATING A REGRESSION PROCEDURE

It is very simple to take the commands used previously to construct the procedure myols: 1. PROC(6) .

= myols(y,x,xnm);

LOCAL bhat,covb,xpxi,resid,nobs,s2,nvar, seb,t,prob, fmt,omat;

3. 4. See Gon

nobs = ROWS(y); x = ONES(nobs,1)~x; nvar = COLSCx)—1. Xpxa (= sINVPD(xX4x)

7 8.

Did Gea XD Xe OCay = resid = y - x*bhat;

9. s2 = (resid’resid) / (nobs-nvar); LOE COVDE——S2 XN xis 11. seb = SQRT(DIAG(covb)); i2eptte= bhat ./ sebs 13. prob = 2*CDFTC(ABS(t) ,nobs-nvar); 14. print “Variable Estimate StdErr t-value 15... HG

print. Ane

17.

omat

18.

CALL

19.

RETP(bhat,seb,s2,covb,t,prob);

20.

Prob”;

SAA”, SR Oates See TO) ocals

peeer Nie Ona eae I eam o eeSom Tato Ans = (“CONSTANT” |xnm)~bhat~seb~t~prob; PRINTFM(omat,0~1~1~1~1, fmt) ;

ENDP;

Line 1 defines the procedure myols as taking three values for input and returning six matrices to memory. y is a vector containing observations for the dependent variable. x is a matrix containing the independent variables. xnm contains names to be associated with the columns of x to be used to label the output.* Line 2 lists the variables to be used.

These are called local variables because procedure and are not available to the ishes running. Lines 3 through 13 are discussion. Lines 14 through 18 format returns the matrixes

they are only available to the user after the procedure finself-evident given our earlier and print the results. Line 19

bhat, seb, s2, covb, t, and prob so that they can

50

COMPUTING

ENVIRONMENTS

be used for further analysis. With myols, the regression could be computed with the command » {bhat,seb,s2,covb,t,prob} = = myols(prestige, income~educ, “INCOME” |“EDUC”)

The results are as follows: Variable

Estimate

StdErr

t-value

CONSTANT INCOME EDUC

-6.0647 0.5987 0.5458

4.2719 0.1197 0.0983

-1.420 5.003 Bp o0

Prob

0.1631 0.0000 0.0000

With obvious substitutions, regressions with other variables could be computed. USING MARKOV

AND GAUSSX

Our first three approaches make you feel like you are a programmer rather than a data analyst working in an interactive environment. If your analysis is simply running a few regressions, there would be few advantages to these approaches. In response to the difficulty in using GAUSS for routine data analysis, enhancements to GAUSS have been developed that provide a more accessible interface: GAUSSX and Markov. GAUSSX

(Econotron

Software

1993b)

includes

a wide

variety of

econometric procedures including ARIMA, exponential smoothing, static and dynamic forecasting, generalized method of moments estimation, nonlinear 2SLS and 3SLS, Kalman

filters, and nonparamet-

ric estimation. Although GAUSSX allows you to access any GAUSS command, it is menu driven and acts as a shell that isolates you from the GAUSS command line. You can work in GAUSSX without knowing how to use GAUSS. With GAUSSX, you would run our sample regression with the commands create (u) 1 45; open; fname=duncan; ols (d) prestige c income end;

educ;

DATA ANALYSIS

These commands

would

USING GAUSS AND MARKOV

ail

be submitted, with results returned to the

screen. Markov (Long 1993) takes a different approach to the same end.5 Markov is designed so that you can use GAUSS as if Markov were not there. You are limited only in that you cannot use the memory that Markov requires or use the names of variables and procedures used by Markov. Markov adds new commands that make life simpler. To run our sample regression, the following commands would be entered: » » » >»

set dsn set lhs set rhs go reg

duncan prestige income educ

Results are left in predefined matrices. For example, the regression estimates are placed in the matrix _b and the standardized coefficients are placed in —stdb. COMMENTS

A common feature of all of these approaches is that key results are both printed and returned to matrices in memory. These matrices can be manipulated with other GAUSS commands. Suppose you were running a logit with the coefficients returned to the vector _b. You could compute the percentage change in the odds for a unit change in the independent variables with the code 100*(exp(_b)-1). The ability to

easily manipulate the results of analyses is an extremely useful feature of GAUSS. 3.3,

ANALYZING

OCCUPATIONAL

PRESTIGE

This section begins a more realistic analysis of the occupational data using a combination of GAUSS and Markov commands.® We begin with descriptive statistics to make sure our variables are in order. This is done by specifying the data set and running the means program with Markov: » set

dsn

>» go means

duncan

De

COMPUTING

ENVIRONMENTS

This generates the output Variable

Mean

Std

INCOME EDUC PRESTIGE

41.8667 52.5556 47.6889

Dev

24.4351 29.7608 31.5103

Minimum

Maximum

Valid

7.000 7.000 3.000

81.000 100.000 97.000

45.00 45.00 45.00

Missing

0.00 0.00 0.00

While the results look reasonable, a scatterplot matrix is a useful way

to check for data problems. Because graphics procedures in GAUSS plot matrices that are in memory, we use the Markov command » read

from

duncan

to read the vectors income, educ, and prestige from disk into memory. The read command illustrates the difference between programming

in GAUSS and working in GAUSS enhanced by Markov. In GAUSS, the functions of the read command are accomplished with the code OPEN fl = duncan; x = READR(f1,100);

income = x[.,2]; educt="xiie3 prestige = x[.,4]; CLOSE(f1);

This is an awkward way to accomplish a routine task and illustrates that GAUSS is fundamentally a programming language. Without a supplemental package such as Markov or GAUSSX, you end up debugging GAUSS code to accomplish the mundane and repetitious. With the data in memory, we specify the variables to plot and execute a Markov procedure for a scatterplot matrix: » set x income

educ

prestige

>» go matrix

This produces Figure 3.2. in which each panel is a scatterplot between two variables. For example, the panel in row 1, column 2 plots INCOME

against EDUC. In this figure, several observations stand out as potential outliers that might distort the results of OLS regression. Consequently,

DATA ANALYSIS

USING

7.0

100.0

GAUSS

AND

MARKOV

53

97.0

INCOME 3.0 100.0

7.0

81.0

.

3

ace

PRESTIGE

7.0

81.0

3.0

97.0

Figure 3.2. Scatterplot Matrix of Duncan Data

when we run our regression, we save regression diagnostics to the file duncres: » >» > >»

set set opt go

lhs prestige rhs income educ resid duncres reg

The reg procedure produces a variety of descriptive statistics, measures of fit, and tests of significance. The key results are as follows:

Variable

OLS Estimate

Std Error

CONSTANT INCOME EDUC

6.064663 0.598733 0.545834

4.271941 0.119667 0.098253

t-value

-1.42 5.00 5.56

2-tailed Prob

0.163 0.000 0.000

Cor With Dep Var

Std Estimate



A 0.46429 0251553"

0.83780 0.85192

54

COMPUTING

ENVIRONMENTS

Residuals and predicted values are saved to the file duncres as specified by the command opt resid duncres. Among other things, this file includes predicted values in the variable hat and studentized residuals in the variable rstudent. A quick scatterplot of these is generated with the Markov commands » >» » »

read from duncres set x hat set y rstudent

go quick

The quick procedure that produced Figure 3.3 is designed for “quick and dirty” plots without customization. For presentation, graphics can be highly customized.

For example,

we can add specific ranges, tic marks, and labeling. In presenting these commands, we do not precede each line by a chevron and end each line with a semicolon. This reflects a move to edit mode in which we

3.41

rstudent

00 oO

0.0118

0.2813 hat

Figure 3.3. Quick Plot of Studentized Residuals

DATA ANALYSIS

USING GAUSS

AND MARKOV

50

construct a command

file that can be saved, modified, and rerun as the results become refined. Setex set

nats

y rstudent;

opt xrange

0 .3;

opt yrange

-4 4;

opt xticsize 0.10; opt yticsize 2; opt grid on; label x Hat Values; label y Studentized

Residuals;

go Xy;

This produces the plot Showing how to add trates that the plotting grammer. To reproduce

in Figure 3.4. plot options in GAUSS without Markov illusfacilities in GAUSS are designed for the proFigure 3.4. using standard GAUSS commands,

t 1

i.

i

r

Residuals Studentized

fae;

0.1

0.2 Hat

Values

Figure 3.4. Refined Plot of Studentized Residuals

0.3

56

COMPUTING

ENVIRONMENTS

you would use the commands graphset; _psymsiz = .5; MUTESO5 Sho la) 2

ytics(-4,4,2,0); splotsiz = 6A; eporidsael|0¢ xlabel (“Hat Values”); ylabel(“Studentized Residuals”); AHilewiel = oils call xy(hat,rstudent);

Although these commands are difficult to use for routine work, they should be viewed as part of a graphical programming language. As such, they are very powerful. It is clear from Figure 3.4—however it is produced—that outliers may be a problem and that robust procedures are appropriate. This provides an opportunity to illustrate how to program in GAUSS. 3.4

PROGRAMMING

IN GAUSS

When programming in GAUSS, you move among three components: the command screen, in which you interactively enter commands and see output; the editor, in which you can construct command files that either from the editor or from the command line;

can be executed

and the output file containing results. A typical programming session involves testing parts of your program from command mode, adding these lines to a program in the editor, executing the program, and examining the results in the output file. This section shows how you can program GAUSS to compute nonstandard statistical procedures. The code is designed to be as simple as possible and excludes tests for problems in the data that you might normally include.

M-ESTIMATION

Very simply, the idea of M-estimation is to assign smaller weights to observations with larger residuals, thus making the results robust to outliers. A variety of weight functions can be used. We use Huber’s

DATA ANALYSIS

USING GAUSS

AND

MARKOV

57

weight function, which is defined as aie!

Weiler Ven

for |u| < 1.345

for |u| > 1.345

(1)

Estimation proceeds by iterating through four steps: 1 . Estimate 8 by weighted least squares (WLS). 2 . Calculate the residuals r. 3. Compute a scale estimate § = median( auak|r|) 0. 4 . Obtain new weights w().

Iterations continue until convergence is reached. To implement this algorithm, the first step is to write a procedure for WLS: PROC(2)

= wls(y,x,wt);

LOCAL

wy,wx,bwls,resid;

wy = y.*SQRT(wt); wx = x.*SQRT(wt); bwls = INVPD(wx’wx) *wx’wy; resid = y - x*bwls; RETP(bwls,resid); ENDP;

The weights are contained in the vector wt, which is then used in the standard way to compute the WLS estimate bwls. Residuals resid are the difference between the observed and expected values. With this procedure,

the command

{bwls,resid}=wls(prestige,

educ~income,wt)

computes the WLS estimates of the effects of prestige on education and income with weights wt. We use wis in the four steps of the M-estimation algorithm. Comments are contained within /* */: 1. PROC(1) 2.

LOCAL

= mest(y,x); wt,bold,bnew,resid,nobs,nvar,shat,tol,

pctchng;

/* define constants tol = 0.000001; Dom Pw

nobs = ROWS(y); x = ONES(nobs,1)~x;

and set up the data */

58

COMPUTING

7.

ENVIRONMENTS

wt = ONES(nobs,1); petehng— 100;

8 9.

bnew

= 9999999;

10.

/* iterate until

11. We. sh. 14.

DO UNTIL pctchng < tol; bold = bnew; { bnew,resid } = wls(y,x,wt); shat = MEDIAN(ABS(resid))/.6745;

15.

wt = resid/shat;

16.

wt

=

%change in estimate

< tol */

(ABS(wt) .1.345) .*

(1.345. /ABS(wt))); 17.

pctchng

Ste

19. 20s

= MAXC(ABS((bold-bnew) ./bold)) ;

ENDOE

RETP(bnew); ENDPs

Weights wt are initially set to one, which results in OLS estimates for start values. Lines 11 through 18 repeatedly compute WLS estimates until the new estimates change by less than pctchng percent, which is computed in line 17 as the maximum value of the absolute value of the percentage change from the old to the new estimates. Line 16 needs elaboration because it involves a standard trick in GAUSS. Recall from Equation (1) that the weight is to be set to 1 if the absolute value is less than 1.345. This is tested for each observation with the code (ABS(wt) .<= 1.345), which evaluates to a vector of ones and zeros; multiplying each logical condition by the corresponding formula and adding the results applies the appropriate branch of the weight function to every observation at once.

To estimate the sampling variability of the M-estimates, we draw repeated samples of size nobs with replacement from the data and recompute the estimates for each sample. The core of the bootstrap program (its setup lines, not reproduced here, define ydata, xdata, nobs, nrep = 100, irep = 1, and a dummy first row for bboot) is:

     7.  DO UNTIL irep > nrep;
     8.     sel = 1 + TRUNC(RNDU(nobs,1)*nobs);
     9.     yboot = ydata[sel,.];
    10.     xboot = xdata[sel,.];
    11.     b = mest(yboot,xboot);
    12.     bboot = bboot|b';
    13.     irep = irep + 1;
    14.  ENDO;
    15.  bboot = TRIMR(bboot,1,0);
    16.  sdboot = STDC(bboot);
    17.  bmest = mest(prestige, income~educ);
    18.  "M-EST" bmest';
    19.  "SD" sdboot';
    20.  "EST/SD" bmest'./sdboot';

Most of the work is done in lines 7 through 14, where we loop nrep times. Uniform random numbers are drawn in line 8 to select which rows of our original data we want to use in the current iteration.

The selected rows are used to create a bootstrap sample in the next

60

COMPUTING

ENVIRONMENTS

two lines. The bootstrap sample is passed to mest. The resulting M-estimates are accumulated in the 100 x 3 matrix bboot. After we have completed all bootstrap iterations, the dummy values placed at the top of bboot are "trimmed" with TRIMR before the standard deviations of the estimates are computed with STDC. Line 17 computes the M-estimates for the full sample, before the results are printed:

    M-EST     -7.11072     0.70149     0.48541
    SD         2.89111     0.17161     0.13249
    EST/SD    -2.45951     4.08771     3.66380

As an indication of GAUSS's speed, these 100 iterations took 5.4 seconds on an 80486 running at 66 MHz.
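Because sdboot holds the bootstrap standard deviations, rough large-sample confidence intervals take only one more line. The following fragment is a sketch we add here, not part of the original program; it assumes bmest and sdboot are still in memory:

    /* sketch: 95% normal-approximation bootstrap intervals */
    lo = bmest - 1.96*sdboot;
    hi = bmest + 1.96*sdboot;
    "95% CI (lower~upper)" lo~hi;

Percentile intervals could be formed instead by sorting each column of bboot and reading off roughly its 3rd and 98th ordered values.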

KERNEL DENSITY ESTIMATION OF THE SAMPLING DISTRIBUTION

Our bootstrapping resulted in 100 estimates of each parameter, which can be used to estimate the sampling distribution of the estimator. If estimates were plotted, the empirical distribution would be

lumpy, reflecting the small number of estimates being plotted. A density estimator can be used to smooth the distribution (see Fox 1990, pp. 88-105). We have n observations b_i that we want to smooth. Consider a histogram with bars of width 2h. Instead of counting each observation in an interval as having equal weight in determining the height of the bar, density estimation weights each observation by how far it is from the center of the bar, with more distant values receiving smaller weights. We use the Epanechnikov kernel for weighting

$$
K(z) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{z^2}{5}\right) & \text{for } |z| < \sqrt{5} \\[4pt] 0 & \text{otherwise.} \end{cases}
$$

Oversimplifying a bit, the density estimator takes a particular interval or window of data, weights the observations relative to how far they are from the center of the interval, and uses the weighted sum to determine the density at that point. In Markov, the density estimates for the bootstrap would be computed simply with the command

    {sx,sy,winsiz} = smooth(b);

where sx and sy contain the coordinates to be plotted and winsiz is the size of the window. To implement this procedure in GAUSS, we first create a procedure to divide a range of values into npts parts that will be used to define

the windows for density estimation:’ PROC

seqas(strt,endd,npts);

LOCAL

siz;

siz = (endd-strt) /npts; RETP(SEQA(strt+0.5*siz,siz,npts)); ENDP;

Second, we construct a procedure that computes the function K(z): PROC epan(z); LOCAL

a,t;

t = (ABS(z) .< SQRT(5)); a = CODE(t,SORT(5) |1); RETP(t.*((0.75)*(1-(0.2) .*(z*2))./a)); ENDP;

These procedures are used by the procedure density to compute the coordinates for the smoothed distribution. density takes as input the values to be smoothed and the number of points to be computed for the density function. The procedure is PROC

(2) = density(y,npts);

LOCAL

smth,sy,std,xloc,yloc,i,nrows,t;

sy = SORTC(y,1); /* sort the data */ std = MINC((sy[INT(3*ROWS (sy) /4)]Fe PWD

sy [INT(ROWS (sy) /4)])/1.34|STDC(sy)) s

5.

smth = 0.9*std*(ROWS(sy)*(-0.2));

6.

xloc

fee

yL0Ge=—x1l0G;

8.

nrows

= seqas(MINC(y) ,MAXC(y),npts) ; = ROWS(y);

Cie St ce phe DOO WH LEs

11. 12. ie

joa

iA

ENDO?

15. 16.

:

68

COMPUTING

ENVIRONMENTS

ss (e& i 7)

3

The expression (+ 1 2) given to the interpreter is a standard compound Lisp expression: a function symbol, +, followed by two arguments and enclosed in parentheses. The system responds by evaluating the expression and returning the result. Arguments can themselves be expressions:

    > (+ 2 5 (* 2 3))
    13

The interpreter evaluates a function call recursively by first evaluating all argument expressions and then applying the function to the argument values. Expressions can also consist of numbers or symbols:

    > pi
    3.14159
    > (+ pi 1)
    4.14159
    > x
    error: unbound variable - X

An error is signaled if a symbol does not have a value. A few procedures, called special forms, act a bit differently. A Lisp-Stat special form used for assigning a value to a global variable is def:

    > (def x (normal-rand 10))
    X
    > x
    (-0.2396 1.03847 -0.0954114 . . .)

The special form def evaluates its second argument but not its first. Another useful special form is quote, used to prevent an expression from being evaluated:


    > (quote (+ 1 2))
    (+ 1 2)
    > '(+ 1 2)
    (+ 1 2)

A single quote ’ before an expression is a shorthand notation for using quote explicitly. Some additional special forms are introduced as they are needed. With this brief introduction, we can begin to use some of the built-

in Lisp-Stat functions to examine the Duncan occupational status data set. Suppose the file duncan.lsp contains the Lisp-Stat expressions

    (def occupation '("ACCOUNTANT" "AIRLINE-PILOT" . . .))
    (def income '(62 72 . . .))
    (def education '(86 76 . . .))
    (def prestige '(82 83 . . .))

We can read in this file using the load function. The function variables provides a list of the variables created during this session using def:

    > (load "duncan")
    ; loading "duncan.lsp"
    > (variables)
    (EDUCATION INCOME OCCUPATION PRESTIGE X)

We can start with some univariate summaries of one of the variables, say the income variable:

    > (mean income)
    41.8667
    > (standard-deviation income)
    24.4351
    > (fivnum income)
    (7 21 42 64 81)

The function fivnum returns the five number summary needed for a skeletal box plot: the minimum, first quartile, median, third quartile, and the maximum.
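Two related built-in Lisp-Stat functions can be used in exactly the same way; the exchange below is illustrative rather than taken from the chapter (the value 43 is just 64 - 21 from the five-number summary above):

    > (interquartile-range income)
    43
    > (boxplot income :title "Income")

boxplot opens a window containing a box plot of the data, which can be linked to other plots in the same way as the histograms and scatterplots discussed next.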

Figure 4.1. Histograms of the Income, Education, and Prestige Variables in the Duncan Occupational Status Data Set


The histogram function produces a window containing a histogram of the data. Histograms of the three variables produced by the expressions

    (histogram income)
    (histogram education)
    (histogram prestige)

are shown in Figure 4.1. Each histogram would appear in its own window. A menu can be used to perform various operations on the histograms such as changing the number of bins. In addition, the histogram function returns a histogram plot object. This object is a software representation of the plot that can be used, for example, to change the plot’s title, add or remove data points, or add a curve to the plot. Section 3 illustrates the use of plot objects to add simple animations to some plots. To begin an analysis of the relation between prestige, income, and

education, we can look at two scatterplots. The function plot-points is used to produce a scatterplot. To see what variations and options are available, we can ask for help for this function:

    > (help 'plot-points)
    PLOT-POINTS                                        [function-doc]
    Args: (x y &key (title "Scatter Plot")
          variable-labels point-labels symbol color)
    Opens a window with a scatter plot of Y vs X, where X and Y are
    compound number-data. VARIABLE-LABELS and POINT-LABELS, if supplied,
    should be lists of character strings. TITLE is the window title.
    The plot can be linked to other plots with the link-views command.
    Returns a plot object.

Keyword arguments listed after the &key symbol are optional arguments that can be supplied in any order following the two required arguments x and y, but they must be preceded by a corresponding keyword, the argument symbol preceded by a colon. We can construct scatterplots of education against income and prestige against income as


    > (plot-points income education
        :variable-labels '("Income" "Education")
        :point-labels occupation)
    > (plot-points income prestige
        :variable-labels '("Income" "Prestige")
        :point-labels occupation)
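The linking described below can also be requested from the listener instead of the plot menus. A minimal sketch, where the variable names p1 and p2 and the point index 5 are ours for illustration:

    (def p1 (plot-points income education :point-labels occupation))
    (def p2 (plot-points income prestige :point-labels occupation))
    (link-views p1 p2)                ; selections now propagate between the plots
    (send p1 :point-selected 5 t)     ; select a point in one plot; it highlights in the other

link-views is the function mentioned in the help text above, and :point-selected is a standard message understood by plot objects.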

The resulting plots are shown in Figure 4.2(a). Most of the points in

the education-against-income plot fall in an elliptical pattern. There are three exceptions—two points below and one point above the main body of points. Using the plot menus, we can link the plots, turn on point labels, and then select the three outlying points in the educationagainst-income plot. The corresponding points in the prestige-againstincome plot are automatically highlighted as well as shown in 4.2(b). Ministers have very high education and prestige levels for their income, whereas railroad engineers and conductors have rather low ed-

ucation for their levels of income. For the conductors, prestige is also low for their income level. Linking is a very useful interactive graphical technique. Linking can be used, as it is used here, to examine where points outlying in one plot are located in other linked plots. It can also be used to examine approximate conditional distributions by selecting points in one plot with values of a variable in a particular range and then examining how these selected points are distributed in other linked plots. Linking several two-dimensional plots is thus one way of gaining insight into the higher dimensional structure of a multivariate data set. Based on these plots, a regression analysis of prestige on income and education is likely to be well determined in the direction of the principal component of variation in income and education, but the orthogonal component of the fit might be sensitive to the three points identified in the plots. This can be confirmed with a three-dimensional plot of the data produced by spin-plot.

(spin-plot

(list income education prestige) :variable-labels ’("I" "E" "p") :point-labels occupation)

Figure 4.2. Linked Plots of Two Variable Pairs With Outlying Points in the Education Against Income Plot Highlighted


The plot can be rotated to show two views of the point cloud, one orthogonal to the education-income diagonal and one along the diagonal. These views are shown in Figure 4.3. After this preliminary graphical exploration, a regression model can

be fit using the regression-model function:

    > (def m (regression-model (list income education) prestige
              :predictor-names '("Income" "Education")))

    Least Squares Estimates:

    Constant                 -6.06466      (4.27194)
    Income                    0.598733     (0.119667)
    Education                 0.545834     (0.0982526)

    R Squared:                0.828173
    Sigma hat:                13.369
    Number of cases:          45
    Degrees of freedom:       42

The function fits the model, prints a summary, and returns a regression model object as its result. This model object has been assigned to the variable m and can be used for further examination of the fit. Objects, such as this regression model object and the plot objects returned by the plotting functions, can be examined and modified by sending them messages using the send function. The :help message provides some information on the messages that are available:

    > (send m :help)
    REGRESSION-MODEL-PROTO
    Normal Linear Regression Model
    Help is available on the following:

Figure 4.3. Two Rotated Three-Dimensional Views of the Duncan Data


:ADD-METHOD :ADD-SLOT :BASIS :CASE-LABELS : COEF-ESTIMATES :COEF-STANDARD-ERRORS :COMPUTE :COOKS-DISTANCES .

As an example, the coefficient estimates shown in the summary can be obtained using

    > (send m :coef-estimates)
    (-6.06466 0.598733 0.545834)

A first step in examining the sensitivity of this fit to the individual observations might be to compute the Cook's distances and plot them against case indexes:

    > (plot-points (iseq 45) (send m :cooks-distances)
                   :point-labels occupation)

Figure 4.4 shows the resulting plot with the four most influential points highlighted. These points include the three already identified as well as the reporter occupation. At this point, it would be useful to go back and locate the reporter point in the plots used earlier. One way to do this is to use the function name-list to construct a scrollable list of the occupations, link the name list plot, and then select the reporter entry in the name list.
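The name-list approach just described takes only two expressions. As a sketch, with the variable name nl ours for illustration:

    (def nl (name-list occupation))
    (send nl :linked t)     ; selecting an entry now highlights it in the linked plots

name-list is a built-in Lisp-Stat function, and the :linked message is the programmatic counterpart of the Link View menu item used earlier.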

4.3 WRITING LISP FUNCTIONS

Because the Duncan data set contains some possible outliers, it might be useful to compute a robust fit. Currently, there are no tools for robust regression built in to Lisp-Stat, but it is easy for a user to add such tools using the Lisp language. This section outlines a rather minimalist approach designed to reflect what might be done as part of a data analysis. A complete system for robust regression would need to be designed more carefully and would be more extensive. The systems

Figure 4.4. Plot of Cook's Distances Against Case Index for the Linear Regression of Prestige on Income and Education

for fitting linear regression models and generalized linear models in Lisp-Stat provide examples of more extensive systems. To begin, we need to select a weight function and write a Lisp function to compute the weights. Weights are usually computed as w(r/C), where r is a scaled residual, C is a tuning constant, and w is

a weight function. We use the biweight function

$$
w(u) = \begin{cases} (1 - u^2)^2 & \text{for } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}
$$

with a default tuning constant of C = 4.685.


The expression

    (defun biweight (x)
      (^ (- 1 (^ (pmin (abs x) 1) 2)) 2))

the function is called, the expressions in

the body are evaluated sequentially and the result of the final expression is returned as the function result. In this case, there is only one expression in the function body. The function robust-weights defined by (defun robust-weights (m &optional (c 4.685)) (let* ((rr (send m :raw-residuals)) (s (/ (median (abs rr)) .6745))) (biweight (/ rr (**c™s)))))

computes the weights for a regression model using the biweight function and the raw residuals for the current fit. This function accepts a value for the tuning constant as an optional argument; if this argument is not supplied, then the default value of 4.685 is used. This function definition uses the let* special form to establish two local variables, rr and s. This special form establishes its bindings sequentially so that the variable rr can be used to compute the value for the scale estimate s. To implement an iterative algorithm, we also need a convergence criterion. The maximal relative change in the coefficients is a natural choice; it is computed by a function defined as (defun max-relative-error

(x y)

(max (/ (abs (- x y)) (pmax

(sqrt machine-epsi lon)

(abs y)))))

The function robust-loop implements the iteratively reweighted leastsquares algorithm:

DATA ANALYSIS

(defun robust-loop

(m &optional

(epsilon (limit

USING LISP-STAT

79

.001)

20))

(send m :weights nil) (let ((count 0) (last-beta nil) (beta (send m :coef-estimates) ) (rel-err 0)) (loop (send m :weights (robust-weights m)) (setf count (+ count 1)) (setf last-beta beta) (setf beta (send m :coef-estimates) ) (setf rel-err (max-relative-error beta last-beta) ) (if (or (< rel-err epsilon) (< limit count)) (return (list beta rel-err count))))))

First this function uses the expression (send m :weights

nil) to remove

any weights currently in the model. Then it uses a let expression to set up some local variables. The let special form is similar to let* except that it establishes its bindings in parallel. The body of the let expression is a loop expression that executes its body repeatedly until a return is executed. The setf special form is used to change the values

of local variables;

def cannot be used because

it affects

only global variables. To avoid convergence problems, the function imposes a limit on the number of iterations. The result returned on termination is a list of the final parameter list, the last relative error value, and the iteration count.

Having defined our robust regression functions, we can now apply them to the Duncan data set: > (robust-loop ((-7.41267

m)

0.789302

0.419189)

0.000851588

11)

The algorithm reached the convergence criterion in 11 iterations. The estimated income coefficient is larger and the education coefficient is smaller than those for ordinary least squares.

COMPUTING

80

ENVIRONMENTS

RR-ENG INEER Sn |

CMe)

O90

°

°

°o

eo

0

©

A

°

°

So°0

60



29

P56

5

,

°

ig

°

oO

°

°

Ww

Oo

A

e

bs

RR-CONDUCTOR , REPORTER

Oo N Oo

MINISTER oO

®

0

10

20

30

40

50

Figure 4.5. Plot of Weights From Robust Fit Against Case Indexes

After executing the robust fitting loop, the regression model object m contains the final weights of the algorithm. We can plot these weights against case indexes to see which points have been downweighted: (plot-points (iseq 45) (send m :weights) :point-labels occupation)

The result is shown in Figure 4.5. The three lowest weights are for ministers, reporters, and railroad conductors; the weight for railroad

engineers is essentially one. To assess the variability of the robust estimators, we can use a very simple form of bootstrap in which we draw repeated samples of size 45 with replacement from our data and compute the robust estimates

DATA ANALYSIS

for these

samples.

The

expression

USING LISP-STAT

(sample x n t) draws

81

a random

sample of size n from a list or vector x. The final argument t indicates that sampling is to be done with replacement. Omitting this argument or supplying nil implies sampling without replacement. The function robust-bootstrap defined as (defun

robust-bootstrap

(m &optional (nb 100) (epsilon ((n (send m :num-cases)) (k (- (send m :num-coefs) (send m :x)) ~~ Oe (send m :y)) nee (result nil) (i-n (iseg n))

(let*

.01)) 1))

(i-k (iseq k))) (dotimes

(i nb)

(let ((s (sample i-n n t))) (send m :x (select x s i-k)) (send m :y (select y s)) (send m :weights nil) (push (first (robust-loop m epsilon)) result))) (send m :x x) (send m :y y) (send m :weights nil) (transpose result)))

performs this simple bootstrap and returns a list of lists of the sampled values for each coefficient: (def bs (robust-bootstrap

m))

We can then look at five number summaries: > (fivnum

(first

(-15.6753

-9.11934

bs))

> (fivnum

(second

-7,46313

-5.43951

1.26937)

0.875396

1.31214)

0.515034

0.850496)

bs)) 0.787972

(0.257142

0.647229

> (fivnum

(third bs))

(0.051893

0.365448

0.431528

COMPUTING

B82

ENVIRONMENTS

Taking the income coefficients as an example, we could also compute means and standard deviations: > (mean

(second

bs))

0.754916

> (standard-deviation

(second

bs))

0.183056

The mean is below the median, suggesting some downward

skewness

in the bootstrap distribution. The median of the bootstrap sample is very close to the robust estimate of the income coefficient. Both the mean and the median of the bootstrap sample are higher than the OLS estimate. The standard deviation of the bootstrap sample, which can serve as an estimate of the standard error of the robust coefficient estimate, is 50% larger than the estimated OLS standard error in the

regression summary. We can also compute a kernel density estimate of the bootstrap distribution of the income coefficients: > (def p (plot-lines

(kernel-dens

(second

bs))))

Figure 4.6. shows the result. The ordinary least-squares coefficient is at approximately the 20th percentile of the bootstrap distribution: > (mean

(if-else

(“Kernel Density of Education”, PlotStyle-> Dashing/@{{.01,.01}, {202,01}, 1.040507) )) -Graphics-

The option PlotStyle in this example determines the length of the dashes. Dashing[{x,y}] produces dashes of length x separated by spaces of length y. The operator /@ maps the function Dashing over the list of sizes. The option is thus the list {Dashing [{.01,.01}], Dashing[{.02,.01}], Dashing [{.04,.01}] so that, for example, the

roughest kernel estimate kdEduc1 appears with short dashes. Only with h = 1 do three groups appear, apparently an artifact of setting h too small. The larger values of h, including h_opt, suggest two groups. It is known, however, that in bimodal problems the optimal width h_opt determined by the Gaussian assumption is too large, and this question of the number of modes deserves further attention.

5.5 FITTING A REGRESSION MODEL

Fit is the basic line-fitting function of Mathematica. Fit has an unusual syntax that differs from most regression software and requires some time to get used to it. For example, to fit the regression of Prestige on Income and Education, the first argument of Fit is a list of observations, each having the value of Prestige last. The second argument of Fit specifies how to use the values for each observation as labeled symbolically by the third argument. The dependent variable, which is last in each observation list, is not named. Fit returns a polynomial that can be manipulated. For example, we can extract various coefficients. In[34]:=

linFit

= Fit[Transpose[{INCOME,

EDUCATION,

{inc,

PRESTIGE}],

{1,inc,edu},

edu}]

Outl34]=

-6.06466

+ 0.545834

In[35]:=

Coefficient[linFit,

Out[35]=

0.545834

edu

edu]

+ 0.598733

inc

98

COMPUTING

ENVIRONMENTS

A similar command m[36];=

Out[36] =

fits a second-order model with an interaction.

PRESTIGE}], ose[{ INCOME, EDUCATION, quadFit = Fit[Transp {l,inc,edu, inc*2, edu*2, inc*edu },{inc, edu}] 2 - 0.0243761

-0.856207 0.967453

edu

+ 0.00745087

edu

+

inc

2

-0.00577384

edu inc - 0.000869464

inc

The output of Fit lacks the standard statistics so that we cannot tell whether any coefficients are significant or whether the added quadratic terms improve the fit. We

can, however,

take a good look at the fitted model.

Plotting

functions make it easy to draw the regression surface and display it with the observed data. The following command draws quadFit using a gray-scale coloring to indicate the height of the surface. The plot is named surf. In[37]:=

surf = Plot3D[{quadFit, GrayLevel[quadFit/100]}, {inc,0,100}, {edu,0,100}, AxesLabel->{“Income”, “Prestige” }];

“Education”,

The next command uses graphics primitives to construct a plot that shows the data for Income, Education, and Prestige as oversize points in

three dimensions. Combining this plot with surf produces the images in Figures 5.2(a) and 5.2(b). In[38]:=

pts = Show[Graphics3D[{PointSize[.025], Point /@Transpos INCOME, EDUCATION, e[{

In[39]:=

partA = Show[surf,pts]; partB = Show[ partA, ViewPoint->{-2,-2,-1}] -Graphics3D-

PRESTIGE}]}

Out[39]=

J];

Figure 5.2(a) shows the surface and data from the default viewpoint; the surface is transparent. Figure 5.2(b) is from a viewpoint looking

up the surface from below the origin; this view makes it clear which

DATA ANALYSIS

USING MATHEMATICA

2]e)

100 e

Prestige

Income

Figure 5.2a. Initial View of the Estimated Quadratic Regression Surface and Data

points are above and which are below the surface. Mathematica can build an animated sequence of rotations from such three-dimensional figures, but it generally lacks the ability to dynamically rotate images (with the exception of certain workstation implementations). The plots in Figures 5.2(a) and 5.2(b) make it clear that the curvature in quadFit is weak, but we need some

test statistics to confirm

these visual impressions. To obtain these, another Mathematica package includes the function Regress that provides more comprehensive results (albeit quite a bit more slowly than Fit). The syntax of Regress is the same as that of Fit. Its output is a list of rules that include various statistics. In[41]:= In[42]:=

Out[42]=

1 inc edu

RSquared

Estimate -6.06466 0.598733 0.545834

-> 0.828173,

SE 4.27194 0.119667 0.0982526

TStat -1.41965 5.00331 5.55541

AdjustedRSquared

PValue , 0.16309 0.0000105 0

-> 0.819991,

100

COMPUTING

ENVIRONMENTS

Educatiog

Prestige

Figure 5.2b. View Up the Regression Surface From Below the Origin, Showing Points Above and Below the Fit

EstimatedVariance

->

178.731,

ANOVATable

DoF

SoS

MeanSS

Model Error

2 42

36180.9 7506.7

18090.5 WS. 783i

Total

44

43687 .6

->

FRatio 101.216

PValue} 0

The coefficients of both Income and Education are statistically significant. A similar command reveals that R? = 0.837 for the quadratic fit and that the partial F test for the added coefficients is not significant. Regress produces further output if so directed by optional arguments. The following commands generate the scatterplot of the residuals on the fitted values shown in Figure 5.3. The OutputList option adds fitted values and the residuals to the results of Regress. Replacement

rules extract these from

the list regr into a form

plotted with ListPlot. In[43]:=

regr = Regress[Transpose[{ INCOME, EDUCATION,

PRESTIGE}],

{1,

inc,edu},

edu}, OutputList->{PredictedResponse, FitResiduals}]; In[44]:=

res

= PredictedResponse

fit = FitResiduals

/. regr;

/. regr;

{inc,

more

easily

DATA ANALYSIS

USING MATHEMATICA

101

Fit

Residuals

-30

e

Figure 5.3. Residuals Plotted on Fitted Values From the Linear Regression of Prestige on Income and Education

In[46]:=

Out[46]=

ListPlot[ Transpose[{res, fit}], AxesLabel->{‘“Residuals”, “Fit”}] -Graphics-

Although this plot does not indicate a problem with the fitted model, other regression diagnostic plots reveal the distorting influence of outlying occupations (e.g., Fox 1991).

5.6

BUILDING

A ROBUST

REGRESSION

Although Regress can do the weighted least-squares calculations needed to compute robust estimates by iteratively reweighted least squares, its generality makes it too slow for our ultimate bootstrapping task and it becomes necessary to write our own weighted least-squares program. The following robust regression uses the biweight influence function that is identical to BiweightKer but for the

missing normalization factor. In[47]:=

Biweight

[x_]

:=

If[Abs[x]SameCoefQ]

Out[57]=

{-7.41267,

0.789303,

ols,

20,

0.419188}

The third argument to FixedPoint is a limit on the maximum number of iterations (20 in this case). Because the biweight need not converge, such a limit is crucial. For later use, it is convenient to collect these commands into a single function.

The definition of the subroutine

next within

RobustRegress

closely resembles that of NextEst. In[58]:=

RobustRegress[X_,Y_, maxit_:20] := Module [ {next, ols,e,s, Xt = Transpose[X]}, next b= lee == (Ser—aVeoeXeebs s = Median[ Abs[e] ]/0.6745; WTS = Biweight[ e / (4.685 s) ]; Inverse[Xt . (WIS X)] . (Xt .

(WTS Y)) )s OS = Wonwercsayes 6 2d) 5 O88 o WHR FixedPoint[next, ols, maxit, SameTest->SameCoefQ]

de

The results for this function are quite close to our previous calculations. The small differences are due to the different starting values for the iterations. In[59]:= Out[59]=

To

see

RobustRegress[X,Y] {-7.41267,

the

LabeledListPlot

weights

from

0.789302,

0.419189}

assigned

to the

Graphics‘Graphics.

observations,

Its syntax

we

can

is identical

use

to

104

COMPUTING

ENVIRONMENTS

that of ListPlot except that it expects a third element for each observation—its label. To keep the plot from getting cluttered, the next command

filters the occupation labels, selecting those for which

the robust weight is less than 0.6. This command uses a so-called pure function to filter the labels and illustrates a popular style of programming with Mathematica. The plot produced by LabeledListPlot appears in Figure 5.4.

{WIS. Out[60]

=

Tl0LE}

SE

|

CONCUCEORNCON tid CLO

Instmance=agentsmsusn >

In[61]:= Outl61]=

reporter,

ee

>

minister,

3

>

>

>

>

>

>

>

>

ese >

>

Bn

lami ainsi

ndechinus >

>

a

anne tema

}

LabeledListPlot[Transpose[{Range[1, 45], WTS, labels}],PlotLabel->“Robust Weights”] -Graphics-

Range[a,b] produces the integer list {a,a+l, . . . ,b}. A thorough analysis would consider why ministers, for example, receive such low weight in the robust regression (see, eo Fox gah): Robust lheeee



©% 20%

Weights

Sen

e

ave

eo5e

eer,

758

econtractcmmachinist einsurance_agent

ereporfEpnductor

10

20

30 40 Figure 5.4. Labeled Weights From the Robus t Regression Plotted on Case Number Indicating Downweighted Occupation s

DATA ANALYSIS

5.7

BOOTSTRAP

USING MATHEMATICA

105

RESAMPLING

Although it possesses many advantages over ordinary least squares, robust regression has its own weaknesses. Aside from the need for iterative computation, expressions for the standard errors of the robust estimators are complex and not very accurate in small samples. Bootstrap resampling offers an alternative method that, although demanding yet more computation, tends to be more reliable in practice. Bootstrap estimates rely on sampling with replacement from the observed data. The function Random makes this quite easy to do. For example, a sample of 10 integers drawn with replacement from the collection {1,2,3,4,5} is In[62]:=

Table[

Random[Integer,{1,5}],

Guiatevstas

seh, Gh

ihe TA Shee

{10}]

ile Sh, V5 SH

Resample uses Random to generate the indexes for sampling with replacement from a given list. In[63]:=

Resample[n_]

:= Table[Random[Integer, {1,n}],{n}]
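The resampling mechanics can be checked on a toy list before launching the full bootstrap; this fragment is ours, not from the chapter:

    data = {10, 20, 30, 40, 50};
    idx = Resample[Length[data]]    (* indexes drawn with replacement *)
    data[[idx]]                     (* the corresponding bootstrap sample *)

Indexing a list by a list of positions, as in data[[idx]], is the same device used below to resample the rows of X and Y.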

To bootstrap the robust regression of Prestige on Income and Education, we simply Resample the rows of the matrix X and vector Y defined at In[49]. Doing 100 iterations may take a while, but the Table function makes it quite easy to program. In[64J:=

bsCoefs = Table[index = Resample[45]; RobustRegress[X[[index]], Y[{[index]], 10],{100} ];

The second argument of Table gives the number of times (here 100) to repeat the calculations specified by the commands given as the first argument. To make the calculations faster, the robust regression was limited to 10 iterations (it still takes quite a while). The means and standard deviations of the bootstrap estimates allow us to estimate the bias and standard error of the robust estimates. The analysis is easier if we first transpose bsCoefs into three lists of 100 values. Either Table (using an iterative style) or Map (functional)

produces the needed summaries.

106

COMPUTING

In[65]:= In[66]:=

ENVIRONMENTS

bsCoefs

= Transpose[

bsCoefs

Table[ Mean[ bsCoefs[[i]] 0.737704,

];

], {1,3}

Outl66]=

{-6.93025,

In[67]:= Out[67]=

Map[StandardDeviation, bsCoefs] {3.03126, 0.192809, 0.156091}

]

0.458941}

Because the means of the bootstrap replications are similar to the observed robust estimates, the bootstrap suggests no bias. The standard deviations imply that the robust slope estimates are more variable than the least-squares estimates, whose standard errors are .12 and .098, respectively.

A kernel density estimate conveys the shape of the distribution of the robust estimates. The density estimates of the two slopes appear together in Figure 5.5, with the kernel for the slopes of Income in black and that for the slopes of Education in gray. Both densities are asymmetric due to the presence of highly influential observations. In[68]:=

kInc = KernelDensity[ bsCoefs[[2]] ]; kEduc = KernelDensity[ bsCoefs[[3]] ];

Kernel

Of24F0ts

Density

0 er Ura

Estimates

amVe

Sheena

Figure 5.5. Kernel Density Estimates From the Bootstrap Replica tions of the Robust Slopes (Education in gray, Income in black)

DATA ANALYSIS

In[69]:=

Out[69]=

5.8

USING MATHEMATICA

107

Plot[ {kInc[x], kEduc[x]},{x,0,1.5}, PlotStyle->{GrayLevel[0], GrayLevel[.5]}, PlotLabel->“Kernel Density Estimates” ] -Graphics-

DISCUSSION

Mathematica offers a very different computing environment for data analysis. Before choosing to compute in this environment, one must address a fundamental question: Do the symbolic computing capabilities offered by Mathematica make up for its slow speed and lack of interactive graphics? No other environment considered in this issue comes with its capabilities for symbolic mathematics, but none is

as slow as Mathematica either. Current versions of Mathematica also place considerable demands on the memory and CPU of the host system. One should keep in mind, though, that Mathematica runs on a

great many systems, and it is quite easy to use a workstation to per-

form the calculations while letting a PC manage the user interface. Perhaps the optimal compromise is to use Mathematica in conjunction with a statistics package. For example, Cabrera and Wilks (1992) show how to link Mathematica with S. With this combination, one can

reserve Mathematica for algebraic tasks such as finding h,,, and use S for regression and interactive graphics.

REFERENCES Belsley, D. A. 1993. “Econometrics ¢ M: A Package for Doing Econometrics in Mathematica.” Pp. 300-43 in Economic and Financial Modeling With Mathematica, edited by H. Varian. New York: Springer-Verlag. Cabrera, J. F. and A. R. Wilks. 1992. “An Interface From S to Mathematica.” Mathematica Journal 2:66-74.

Fox, J. 1991. Regression Diagnostics. Newbury Park, CA: Sage. Maeder, R. 1991. Programming in Mathematica, 2nd ed. Redwood City, CA: AddisonWesley. McNeil, D. 1973. Interactive Data Analysis. New York: Wiley. Silverman, B. W. 1986. Density Estimation. London: Chapman & Hall.

Wolfram, $. 1991. Mathematica: A System for Doing Mathematics by Computer, 2nd ed. Redwood City, CA: Addison-Wesley.

Data Analysis Using SAS CHARLES HALLAHAN

6.1

INTRODUCTION

At one time, SAS stood for Statistical Analysis System. Today, the acronym is offically uninterpretable, and SAS is promoted as the “SAS System for information delivery.” A recent count shows that the SAS System consists of at least 22 separate but integrated products ranging from the original Base SAS to SAS/ENGLISH, a natural language interface to SAS. This article considers only the statistical computing features of SAS. Whereas the SAS System has been broadening its scope over the years, the statistical component has been keeping pace, and SAS remains one of the most comprehensive statistical packages on the market. SAS’s MultiVendor Architecture design allows SAS to run on mainframes, minicomputers, UNIX workstations, and personal computers. The latest release is SAS 6.11, which I am currently using on

an IBM PS/2 under OS/2 warp. 108


SAS has a very active user community. SAS Users Group International (SUGI) annual meetings have been held for the past 21 years. Regional groups such as the Northeast SAS Users Group (NESUG) and local groups such as the DC SAS Users Group in the Washington, DC, area, hold regular meetings at sites across the United States,

Europe, and Asia. The Bitnet listserv discussion group SAS-L (crossposted to Usenet newsgroup comp.soft-sys.sas) provides a worldwide forum for SAS-related topics. For the purposes of this article, statistical programming is taken to mean the programming of algorithms for data analysis not directly available as a command or procedure in SAS. The examples to be discussed—M-estimation, bootstrapping, and kernel density estimation—are all directly unavailable in SAS; I have recently started using SAS/INSIGHT, which does include kernel density estimation, however. In my opinion, the matrix language IML, possibly coupled with the SAS macro facility, is the most appropriate way in SAS to implement such statistical algorithms. For general data manipulation, most programming tasks can be accomplished in the SAS DATA step. In general, statistical analysis in SAS would take advantage of the extensive choice of preprogrammed procedures: GLM for general linear models, GENMOD for generalized linear models,

CATMOD

for various

addition to SAS/STAT, rate products

SAS/IML

categorical models,

and so forth. In

which contains these procedures, the sepa(matrix

language),

SAS/ETS

(econometric

modeling), SAS/QC (quality improvement), SAS/OR (mathematical programming), SAS/LAB (guided data analysis), SAS/PH (clinical trials analysis), and SAS/ INSIGHT (graphical EDA and GLIMs) form the family of SAS statistical tools. SAS jobs can be executed either in batch mode or interactively under the control of the Display Manager. The primary windows in interactive mode are the PROGRAM EDITOR and the LOG and OUTPUT windows. The PROGRAM EDITOR window features a fullscreen configurable editor. SAS statements in the PROGRAM EDITOR window are executed by either entering the “submit” command on a command line, hitting a predefined function key, or clicking on a popup menu. The LOG window echos the program statements and issues appropriate messages, warnings, or errors in a noticeable red color. Procedure output appears in the OUTPUT window. Special windows

110

COMPUTING

ENVIRONMENTS

such as the OPTIONS, KEYS, and TITLES windows are also available.

An exception to the command line interface for SAS is SAS/INSIGHT, which has a menu-based, point-and-click graphical interface.

6.2

BRIEF ILLUSTRATION

This section illustrates a typical SAS program by analyzing the Duncan occupational prestige data. An SAS program consists of a sequence of DATA and PROC (procedure) steps. The DATA step is needed to import data from an external source, usually in ASCII format, and convert the data to an SAS format as a prerequisite for processing by an SAS PROC. Data transformations and assignment of variable properties, such as formats and labels, are also carried out in a DATA step. The SAS data set becomes input for SAS statistical and graphical procedures. For the Duncan data, the objective is to predict an occupation’s prestige rating based on the explanatory variables income and education. The first step of identifying the files is system dependent. On a PC, suppose that the subdirectory containing the SAS file is c:\mydir. The libname statement specifies the SAS library to contain the SAS data set. libname

saslib

A DATA

'c:\mydir';

step begins with the keyword

DATA. I capitalize certain

words for emphasis, but SAS is not case sensitive. Several statements

can appear on the same line, and a single statement can run for multiple lines.

    DATA saslib.duncan;
       infile 'c:\mydir\duncan.dat';
       label occup = 'name of occupation';
       length occup $ 35.;
       input occup income educ prestige;

DATA ANALYSIS

USING SAS

111

loop being executed until an end-of-file indicator is reached. Because the SAS data set has a two-level name,

saslib.duncan,

it becomes

a

permanent file. The SAS data set can now be used as input to PROC REG for regression analysis. PROC REG data = saslib.duncan; model prestige = income educ;

Various options extend the output for to reproduce the graph on page 38 of studentized residuals versus hatvalues proportional to Cook’s D, the necessary REG and saved.

each procedure. For example, Fox (1991), a bubble plot of with the size of each bubble statistics are requested from

PROC REG data = saslib.duncan; id occup; model prestige = income educ / r influence; output out = saslib.regdiags cookd = cookd h = hat rstudent = rstudent;

A basic line-printer plot of studentized residuals versus hatvalues can be obtained by adding the statement plot rstudent.

* h.;

The id statement labels each observation with the value of the variable occup, making listed output easier to read. Residual analysis and influence diagnostics are requested as options to the model statement. Finally, an output data set, saslib.regdiags,

is created and contains,

along with the variables in the input data set, the quantities needed for the bubble plot. As would be expected, there are many options available for graphics output. The goptions, title, and axis statements are available to

enhance the resulting plot. The SAS/GRAPH procedure GPLOT and the bubble statement produce the plot. PROC GPLOT

data

= saslib.regdiags;

bubble rstudent*hat vaxis

= axis2

= cookd / haxis = axisl

frame

bcolor

= black;

112

COMPUTING

ENVIRONMENTS

An excellent source

for producing

statistical graphs with SAS

is

Friendly (1991).

With further work using the SAS/GRAPH batch-mode annotate facility, the graph could be enhanced to identify the large bubbles (Figure 6.1; see Friendly 1991, pp. 242-5).

Graphics output can be displayed on the screen, saved in an SAS catalog for later replay, sent directly to a printer or plotter, or written to a file for import into other programs (e.g., WordPerfect). PROC REG makes it unnecessary to do any statistical programming in SAS to produce standard regression diagnostics. However, it may be instructive to introduce IML by replicating the diagnostic statistics computed by REG in the matrix language. IML is discussed in more detail in the next section. What follows is a bare-bones IML program, just a sequence of matrix calculations. Each line can be entered and executed immediately,

or the whole program can be executed at one time. The formulas in the program can be found in Belsley, Kuh, and Welsch (1980). Comment lines begin with an asterisk.

CHHD e300

0.00

0.05

0.10

0.15

0.20

0.25

Diagonal of Hat Matrix

Figure 6.1. Bubble Plot: Studentized Residuals Versus Hatvalues

NOTE: The areas of the bubbles are proportional to Cook’s D influence statistic.

0.20

DATA ANALYSIS

PROC IML; use saslib.duncan;

read all

var

n = nrow(educ); y = prestige;

x = j(n,1,1)

* read

{prestige

data

into

USING SAS)

113

matrices;

educ income};

* number of observations; up y and x for regression;

* set

||income

||educ;

k = ncol(x); * number of parameters; XN (Xe =X) xe beta = xx*y; yhat = x*beta; r= y - yhat; s2 = ssq(r)/(n-k); hdiag = vecdiag(x*xx); s2_i = ((n-k)*s2 = r#r/(1-hdiag))/(n-k-1); rstand = r/(sqrt(s2*(1-hdiag))); cookd = (rstand#rstand/k)#(hdiag/(1-hdiag)); rstudent = r/(sqrt(s2_i#(1-hdiag))); print rstand rstudent cookd hdiag;

These program statements are straightforward representations of the corresponding matrix algebra equations. Note that a*b is the usual matrix multiplication and that a#b is elementwise multiplication. The IML program produces exactly the same results as does REG with the advantage of having all calculations under user control.

6.3.

INTRODUCTION

TO PROGRAMMING

IN SAS

Programming in SAS is generally carried out via the DATA step and SAS macro language. The more specialized task of statistical programming is best achieved with the matrix language, SAS/IML. IML, a separate SAS product, replaced PROC MATRIX, once a part of Base SAS, several years ago.

The only data object in IML is an m x n matrix whose values are either numeric or character. A matrix can be directly defined, as in the 2 x 3 matrix x ={1 2 3, 4 5 6}. Matrix elements are referenced by square brackets; for example, x12 = x[1,2] selects the element in row 1 and column 2, x1 = x[1,] selects the first row of x, and x2 = x[,2]

selects the second column of x. The usual matrix arithmetic operators are +, -, *, and

follows:

‘ for transpose. Operators specific to matrices are as

114

COMPUTING

ENVIRONMENTS

1. Elementwise operators; for example, z = x/y yields z[i,j]

yyliso = XPisdty

2. Subscript reduction operators; for example, x[+,] yields column sums of x and x[,:] yields row means ofx.

3. Concatenation operators; for example, x||y is horizontal concatenation.

Of course, matrices must be conformable for the various operations. Standard

control

structures

such

as DO-WHILE,

IF-THEN/

ELSE,

and START/FINISH for module definition are supported, along with a library of Base SAS functions and specialized matrix operations (SVD, Cholesky factors, GINV, etc.). Functions for eigenvalues of nonsymmetric matrices and nonlinear optimization were added in version 6.10. The program DIAGNOSE.PGM listed earlier is not, as currently written, a general program. It works only for a specific data set and variables in that data set. An IML program could be generalized in two ways, either by use of the SAS macro language or IML modules. Whereas modules are peculiar to IML, the SAS macro language considerably extends the flexibility of SAS and is applicable in any SAS

program. This article is not the place to go into much detail on the SAS macro language. However, to give some idea of how it works, suppose we wanted to generalize DIAGNOSE.PGM to handle any SAS data set and regression modei consisting of variables in that data set and to allow as an option whether or not the model has a constant term. The skeleton of the SAS macro would be *MACRO

diagnose(depvar, indvars,dsn=_last_,

const=YES); body of macro *MEND

diagnose;

The special symbol % signals the SAS supervisor, the primary inter-

preter of SAS code, that what

follows

should

be passed

off to the

SAS macro processor. The macro is called diagnose and has four parameters; the first two are positional and the last two keyword . The

DATA ANALYSIS USING SAS

115

advantage of keyword parameters, which must appear last in the argument list, is that they can be assigned default values. For example, the default model includes a constant term, as specified by const=YES in the %MACRO statement. The special SAS variable _last_ holds the value of the last SAS data set created. The four parameter names—depvar, indvars, dsn, and const—define (local) SAS macro variables whose val-

ues within the macro are referenced by prefixing the symbol & to the macro variable name. For example, the original DIAGNOSE.PGM specified the SAS data set with the statement use

saslib.duncan;

This statement will be replaced in the macro by use &dsn;

The macro diagnose could be written as %MACRO

diagnose(depvar,

PROC

use &dsn; read all var n = nrow(y);

{&depvar}

read

{&indvars}

%IF

all

.

var

%UPCASE(&const)

BSIR(K %MEND

indvars,dsn=_last_,const=YES)

;

IML;

into y;

into x;

= YES

%THEN

= J(fsl,ty [1 xs)

. rest

of statements

exactly

as

before

diagnose;

The SAS macro processor generates regular SAS code by resolving macro variables and evaluating macro functions. It is necessary to use the %STR function when adding a constant term to the x matrix so that the semicolon following the x is interpreted as the end of the IML statement

x = j(n,1,1)

|| x; and not as the terminating

symbol for

the macro statement %IF. An invocation of the macro that would reproduce the earlier results for the Duncan data set is %di agnose(prestige, income

educ,dsn=saslib.duncan)

116

COMPUTING

ENVIRONMENTS

OSE.PGM An alternative way within IML to generalize DIAGN would be to define an IML module. ule is

The structure of an IML mod-

argument

START module_name(...optional

TRISTE oo NIE

body of module FINISH;

Modules are executed with either the RUN or the CALL statements. A module with no arguments has all its variables treated as global variables;

otherwise

its variables

are local to the module.

A GLOBAL

option is available on the START command to declare specific variables to be global. An IML module can act as a function and return a single value (which, of course, could be an m x n matrix) by including the statement RETURN(variable_name);

in the module. An IML module may

also return results through its argument list. Compiled modules, as well as matrixes, can be permanently saved in an IML storage catalog with the STORE command and retrieved with the LOAD command. IML modules, as currently implemented, have two shortcomings. First, keyword arguments, as in the macro language, are not allowed. If a module is defined with n arguments, then exactly n arguments must be passed each time the module is invoked. The assignment of default values is not as straightforward as it is in the macro language. A way around this problem is to define a separate module without arguments whose sole purpose is to define a set of global variables with the default values. This situation arises, for example, when develop-

ing a set of flexible plotting routines. It is natural to specify a large number of default values for axes characteristics, fonts, size of char-

acters, plot symbols, and so forth. A particular module then only a relatively small number of arguments, whereas a GLOBAL ment on the START command for the module provides access to predefined default values. The second shortcoming is that only matrices are allowed guments,

not other modules.

An example, discussed

needs stateall the as ar-

later, where

it

is important to pass a module as an argument to another module is kernel density estimation. Suppose kde is the name of a user-defined

DATA ANALYSIS

USING SAS

117

module to perform kernel density estimation and kfun is a character matrix holding the name of a specific kernel function to use. For example, kfun = 'gauss', where gauss is the name of a user-defined function to evaluate the Gaussian kernel. In a loop within kde, it is necessary to evaluate the kernel at a point x, effectively gauss(x). This can be accomplished with apply(kfun,x). The IML function apply has as its first

argument the name of an existing module (kfun evaluates to gauss), and the remaining arguments

of apply are the necessary arguments

for that module.

6.4

PROGRAMMING

EXAMPLES

This section discusses three specific examples of nonstandard tasks to illustrate the programming features of SAS/IML: (i) M-estimation of the Duncan regression model described in section 2, (ii) bootstrap-

ping the M-estimator in part (i) to obtain bootstrap estimates of the standard errors for the regression coefficients based on 100 bootstrap samples, and (iii) kernel density estimates, including graphs, of the bootstrapped samples obtained in part (ii).

M-ESTIMATION

The same notation as in the editors’ introduction is used here for robust regression. One form of robust regression is p(r) = |r|, least absolute value (LAV) regression. Earlier releases of SAS had a PROC LAV in their Supplemental Library of user-contributed procedures, but there is not a general PROC ROBUST. The SAS/STAT manual has an example in which PROC NLIN provides robust estimates, but it may not be as general as one would like. There is also an example

in the IML manual illustrating regression quantiles (LAV is a special case) using the IML function 1p for linear programming. Algorithms

to calculate

M-estimators

are iterative, and

there are

several choices—for example, the method of modified residuals and iteratively reweighted least squares (IRLS; see Thisted 1988, pp. 149-51, for a short discussion). An IML module

M_EST

is now

defined, which allows either the

modified residuals or IRLS algorithms. Pseudo code for M_EST is

COMPUTING

118

ENVIRONMENTS

1. Get initial estimate bO. 2. Calculate current residuals r = y — x DO. 3. Calculate scale parameter s = MAD(r)/.6745. 4. If modified residuals, then regress W(r/s) * s on x to get chg_beta. else do calculate weights w = V(r/s)/r

perform weighted regression of r on x to get chg_beta. end 5. If not converged, then calculate new beta, b0 = b0 + chg_beta and go tOr2: In IML, the MAD function is defined as MAD(x)

= abs(x-median(x))[:];

A function module is needed for each V function (e.g., HUBER) where

the global variable c is the tuning constant. start HUBER(z) global (c); pz = z#(abs(z)c) return(pz) ;

-

c#(z= $3500, title "Occupational title"

the data have been saved

1950"

to the Stata-format

file

duncan.dta, we can retrieve them at any later time by typing use duncan. To view a variable list and labels:

    describe

    Contains data from duncan.dta
      Obs:    45 (max= 19778)              Duncan (1961) SEI dataset
     Vars:     4 (max=    99)              9 Jun 1995 15:11
    Width:    37 (max=   200)
      1. title      str34   %34s           Occupational title
      2. income     byte    %9.0g          % males earning >= $3500, 1950
      3. educate    byte    %9.0g          % males who are HS graduates
      4. prestige   byte    %9.0g          % excellent or good prestige
    Sorted by:

We can modify our data in many different ways, such as replacing earlier values or generating new variables from logical or algebraic expressions, and merging or appending multiple data sets. Explicit case-number subscripts can be specified with variabl es when needed; for example, income[5] equals the value of income for the data set’s fifth case.
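Commands of exactly this kind look as follows; the new variable name and cutoff are ours for illustration (42 is the sample median of income in these data):

    generate highinc = income >= 42
    label variable highinc "income at or above its median"
    list title income highinc in 5

generate, label variable, and list are all ordinary Stata commands; the in 5 qualifier restricts list to the fifth case, matching the income[5] example above.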

DATA ANALYSIS

7.4

USING STATA

131

A FIRST LOOK AT THE DATA

Our common analytical task involves replicating Duncan's regression of prestige on income and educate. Today we have better tools than those that were available to Duncan. How does this change our approach, or shed new light on old data? One recent trend has been the increasing use of exploratory and graphical techniques. A stem-and-leaf display reveals the markedly bimodal distribution of educate:

    stem educate

    Stem-and-leaf plot for educate (% males who are HS graduates)

      0* | 7
      1* | 579
      2* | 002235566889
      3* | 02249
      4* | 457
      5* | 056
      6* |
      7* | 1246
      8* | 2446667
      9* | 012378
     10* | 0

The two modes correspond to blue- and white-collar jobs. Although less starkly bimodal, the distributions of prestige (Figure 7.2) and income also reflect this two-class structure. For comparison, Figure 7.2

superimposes on the histogram of prestige a Gaussian curve having the same mean and standard deviation. More formal normality checks available in Stata include hypothesis tests based on skewness and kurtosis, e.g., sktest varname, and also quantile-normal plots, qnorm varname. A scatterplot matrix (Figure 7.3) provides a quick look at the bivariate distributions. In the prestige-income plot at lower left, one occupation has much higher prestige than we might expect, given its income value. To obtain a simple scatterplot matrix, we type graph followed by a variable list, a comma, and the option matrix:

    graph income educate prestige, matrix

COMPUTING

ENVIRONMENTS

10 .

on

|

Frequency

04

is

20

d

iS

60

40

% excellent

or

good



_=

100

BO

prestige

Figure 7.2. Histogram of Prestige, With Superimposed Normal Curve

A general syntax of “command variable list, option(s)” is followed by most Stata commands. Most Stata graphs involve the versatile graph command, which has many possible options. A symbol( ) option controls what plotting symbols appear. symbo1(0)

calls for large circles, and symbol(d)

small dia-

monds, for instance. Multiple y variables could be plotted with different symbols, or connected in a variety of ways including line segments, step functions, smooth curves, or error bars. symbol ([varname])

causes values of the variable varname to appear as plotting symbols. Thus the command graph

prestige

income,

symbol([title])

draws a scatterplot with values of title (occupational title) as plotting symbols, similar to that shown in Figure 7.4. This graph reveals that the high prestige/low income occupation is “minister.” Toward the lower right of Figure 7.4, another occupation, “railroad conductor,” stands out for its unusual combination of low prestige and higher income.

DATA ANALYSIS

0

50

USING STATA

100

Ab eh ae ee

ora

ace®

income

>= $3500,

:

°

% males earning

°

Speman

1950

aera.

400

a

a

°

ge°

i °

ave

SE

°

o

et

oe?

e

ent

el

oct

o

ice

%

e.

ae

@o

50 +

ae

oe

a Q9

educate

eves:

% Males

°

)

*

ao

Bi)

pee

who

are

Sees

HS graduates

S

L 60

F 20

AN

;

g

5” Syme a

7 80

Leo

°

Sc

a

133

°

% 2

cease

ore °

e

a

=

oo

0 °

ot

° a” ones : ) cia ee a 2 Ses o

28°

20

¢

ay

°

oe

a

p restige g

% excellent or

ee

good prestige

-59

cat



oa og

40

ee % wae

60

,

-9

BO

4

5p

£00

Figure 7.3. Scatterplot Matrix Showing Bivariate Distributions of Income, Educate, and Prestige

To regress prestige on income and education, repeating Duncan's analysis:

    regress prestige income educate

      Source |       SS       df       MS            Number of obs =      45
    ---------+------------------------------         F(  2,    42) =  101.22
       Model |  36180.9458     2  18090.4729         Prob > F      =  0.0000
    Residual |  7506.69865    42   178.73092         R-squared     =  0.8282
    ---------+------------------------------         Adj R-squared =  0.8200
       Total |  43687.6444    44   992.90101         Root MSE      =  13.369

    ------------------------------------------------------------------------
    prestige |    Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
    ---------+--------------------------------------------------------------
      income |  .5987328   .1196673   5.003   0.000     .3572343    .8402313
     educate |  .5458339   .0982526   5.555   0.000     .3475521    .7441158
       _cons | -6.064663   4.271941  -1.420   0.163    -14.68579    2.556463
    ------------------------------------------------------------------------


Figure 7.4. Scatterplot of Prestige Versus Income, With Occupation Titles Used as Plotting Symbols

The regression finds that income and educate together explain about 83% of the variance in prestige. Next we can obtain predicted values, residuals, and diagnostic statistics through Stata's predict command:

predict yhat
predict e, resid
predict D, cooksd

predict creates variables derived from the most recent regression. Its general syntax is “predict new variable, option,” where the option specifies what kind of new variable should be created—in this example, predicted y values (no option specified), residuals (resid), and Cook's D (cooksd). predict (or its close relative fpredict) can generate many of the diagnostic statistics described in Belsley, Kuh, and Welsch (1980) or Cook and Weisberg (1982)—studentized residuals, hat diagonals, DFBETAS, DFFITS, and so forth. predict also works with ANOVA/ANOCOVA, robust regression, and other methods. Similar commands

will generate predicted values, residuals, and diagnostic statistics after virtually any Stata model estimation (e.g., lpredict after logistic regression). Cook's distance or D measures how much the ith case influences the vector of estimated regression coefficients. Rawlings (1988, p. 269) suggests that a Cook's D value larger than 4/n indicates an influential observation. Stata's system variable _N automatically stores the number of observations in a data set, and using this we discover that by Rawlings' criterion, the Duncan data contain three influential cases:

list title prestige yhat D if D > 4/_N

        title                            prestige       yhat          D
  6.    minister                              87     52.35878    .5663797
  9.    reporter on a daily newspaper         52     81.53799    .0989846
 16.    railroad conductor                    38     57.99738    .2236412

Two of these occupations, minister and railroad conductor, were noticed earlier (Figure 7.4). The third, reporter on a daily newspaper, like railroad conductor has lower prestige than one would predict based on income and educate. Figure 7.5 shows a residual versus predicted values plot, in which the data points’ areas are proportional to Cook’s D. The most influential occupations stand out as large circles. Minister, at top center, has the largest positive residual; reporter, at

bottom right, has the largest negative residual.
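For reference, Cook's D has the usual textbook definition (this is the general formula, not something specific to Stata's cooksd option): writing $h_i$ for the leverage (hat diagonal) of case $i$, $r_i$ for its internally studentized residual, and $k$ for the number of estimated coefficients including the constant,
$$
D_i = \frac{r_i^{2}}{k}\cdot\frac{h_i}{1-h_i},
$$
so a case is influential only when a sizable residual coincides with high leverage. The 4/n cutoff used above is a rough screening rule rather than a formal test.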

7.5 ROBUST REGRESSION

Visual inspection and diagnostic statistics, as just demonstrated, help in identifying influential cases. We could repeat our analysis without such cases simply by adding an if qualifier to the regression command:

regress prestige income educate if D < 4/_N


Figure 7.5. Residuals Versus Predicted Values Plot, With Symbol Areas Proportional to Cook’s D

Note that if works the same way with regress as it does with list or most other Stata commands. Outlier deletion using if requires all-or-nothing decisions, however, based on an arbitrary cutoff. Robust regression offers a more efficient alternative: smoothly downweighting the outliers, with lower weights given to cases farther from the center.

Stata's built-in robust regression method is an M-estimator employing iteratively reweighted least squares. The first iteration begins with ordinary least squares (OLS). Next, weights are calculated based on the OLS residuals, using a Huber function. After several Huber iterations, the weight function shifts to a Tukey biweight, tuned for 95% Gaussian efficiency. This combination of Huber and biweight methods was suggested by Li (1985). Stata's version estimates standard errors and tests hypotheses using the pseudovalues approach of Street, Carroll, and Ruppert (1988), which does not require Gaussian or even symmetrically distributed errors. See Hamilton (1991, 1992) for more about this method, including Monte Carlo evaluations of its relative efficiency and standard errors.
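In their standard textbook forms (which rreg follows, up to implementation details), the two weight functions are as follows. Let $u = e/s$ denote a residual scaled by a robust estimate of spread; a common choice is $s = \mathrm{MAD}/0.6745$. Then
$$
w_{\mathrm{Huber}}(u)=\begin{cases}1 & |u|\le c_H\\[2pt] c_H/|u| & |u|>c_H\end{cases}
\qquad
w_{\mathrm{biweight}}(u)=\begin{cases}\bigl[1-(u/c_B)^2\bigr]^2 & |u|\le c_B\\[2pt] 0 & |u|>c_B,\end{cases}
$$
with the conventional tuning constants $c_H \approx 1.345$ and $c_B \approx 4.685$ chosen for roughly 95% efficiency under Gaussian errors. Note that the biweight gives zero weight to any case whose scaled residual exceeds $c_B$, which is how an extreme observation can be dropped from the fit entirely.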

The program that implements these calculations is an ASCII file about 150 lines long, named rreg.ado. Programmers can study rreg.ado, or hundreds of other Stata .ado (automatic do) files, directly.

Although physically rreg.ado exists as a distinct file, this distinction is transparent to a data analyst. A single command invokes robust regression:

rreg prestige income educate

   Huber iteration 1:  maximum difference in weights = .60472035
   Huber iteration 2:  maximum difference in weights = .16902566
   Huber iteration 3:  maximum difference in weights = .04468018
Biweight iteration 4:  maximum difference in weights = .29117224
Biweight iteration 5:  maximum difference in weights = .09448567
Biweight iteration 6:  maximum difference in weights = .1485161
Biweight iteration 7:  maximum difference in weights = .05349079
Biweight iteration 8:  maximum difference in weights = .01229548
Biweight iteration 9:  maximum difference in weights = .00575876

Robust regression estimates                     Number of obs =      45
                                                F(  2,    42) =
                                                Prob > F      =  0.0000

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .8173336   .1053558     7.758   0.000      .6047171     1.02995
     educate |   .4048997   .0865021     4.681   0.000      .2303313     .579468
       _cons |   -7.49402   3.761039    -1.993   0.053      -15.0841     .0960642
------------------------------------------------------------------------------

rreg's output table resembles that for OLS regression seen earlier—including an asymptotic F test but not R² or sums-of-squares statistics, which would be misleading. Similar consistency in command syntax and output format is maintained to the extent reasonable across all Stata's model-fitting procedures. Thus, with any of them, one knows where to look for coefficients, standard errors, tests, and intervals.

Furthermore, one knows that adding if qualifiers and options, as in

rreg prestige income educate if educate > 50, level(90)

will restrict the analysis to those cases with educate greater than 50, and print a table containing 90% confidence intervals instead of the default 95%.

The robust regression iteratively downweights observations having large residuals. Three occupations previously noticed received weights below .3: railroad conductor (.18), reporter on a daily news-

paper (.17), and minister (0). With a weight of zero, minister has effectively been dropped. We now have two alternative regression equations (standard errors in parentheses):

regress:  predicted prestige = -6.1 + .60 income + .55 educate     (1)
                               (4.3)  (.12)        (.10)

rreg:     predicted prestige = -7.5 + .82 income + .40 educate     (2)
                               (3.8)  (.11)        (.09)

Given normally distributed errors, rreg should be almost as efficient

as regress (OLS). When error distributions have heavier-than-normal tails, rreg tends to be much more efficient than OLS. Residual distri-

butions in the sample at hand exhibit lighter-than-normal tails, however, so it is not surprising that rreg here has only slightly smaller estimated standard errors than regress. The most notable contrast between (1) and (2) is that the OLS equa-

tion accords roughly equal importance to income and educate as predictors of prestige, whereas robust regression finds income about twice as important. The OLS equation is, to a surprising degree, influenced by a single occupation: minister. This occupation has fairly high prestige (87), and a relatively large proportion

(84%) of ministers

are high-

school graduates; but only 21% have high incomes. We might conclude from this that education without income can still yield high prestige. However, if we set ministers aside, as done by robust regression, then income assumes dominant importance. Since the prestige of ministers derives partly from their identification with the sacred,

unlike the other more worldly occupations in these data, setting ministers aside as unique makes some substantive sense as well.

Besides rreg, Stata provides a second regression method with high resistance to y-outliers. This is quantile regression or qreg, which in its default form predicts the conditional median (or .5 quantile) of the y variable. It belongs to a class that the robustness literature has variously termed least absolute value (LAV), minimum absolute deviation (MAD), or minimum L1-norm estimators. Instead of minimizing a sum of squared residuals, qreg minimizes the sum of absolute residuals.
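In symbols (the standard definitions, not Stata-specific notation), OLS and LAV regression solve
$$
\hat b_{\mathrm{OLS}}=\arg\min_b\sum_{i=1}^n (y_i-\mathbf{x}_i'b)^2
\qquad\text{versus}\qquad
\hat b_{\mathrm{LAV}}=\arg\min_b\sum_{i=1}^n \lvert y_i-\mathbf{x}_i'b\rvert ,
$$
so a wild value of y contributes to the LAV criterion only in proportion to its distance from the fit rather than the square of that distance.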

Like robust regression, quantile regression in Stata is implemented by ado-files. The command syntax for qreg resembles that for regress, rreg, and most other Stata model-fitting procedures:

qreg prestige income educate

Iteration  1:  WLS sum of weighted deviations =  435.45438

Iteration  1:  sum of abs. weighted deviations =  448.85635
Iteration  2:  sum of abs. weighted deviations =  420.65054
Iteration  3:  sum of abs. weighted deviations =  420.1346
Iteration  4:  sum of abs. weighted deviations =  416.71435
Iteration  5:  sum of abs. weighted deviations =  415.99351
Iteration  6:  sum of abs. weighted deviations =  415.97706

Median Regression                               Number of obs =      45
  Raw sum of deviations   1249  (about 41)
  Min sum of deviations   415.9771              Pseudo R2     =  0.6670

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .7477064   .1016554     7.355   0.000      .5425576    .9528553
     educate |   .4587156   .0852699     5.380   0.000      .2866339    .6307972
       _cons |  -6.408257   4.068319    -1.575   0.123     -14.61846    1.801944
------------------------------------------------------------------------------

This gives us a third regression equation:

qreg:     predicted prestige = -6.4 + .75 income + .46 educate     (3)
                               (4.1)  (.10)        (.09)

qreg agrees generally with rreg that income predicts prestige better than educate does.

7.6 BOOTSTRAPPING

Both rreg and qreg, by default, estimate standard errors through methods that are asymptotically valid given normal or nonnormal, but independent and identically distributed, errors. Monte Carlo work even suggests that the rreg standard-error estimates remain unbiased in samples much smaller than the n = 45 considered here. The qreg standard errors, on the other hand, sometimes appear unrealistically low compared with the variation seen in small-sample Monte Carlo experiments. To deal with this problem, Stata provides a command for quantile regression with bootstrapped standard errors. By default, bsqreg performs only 20 bootstrap repetitions—enough for exploratory work, but not for “final” conclusions. More stable standard-error estimates require at least 200 bootstrap repetitions. Bootstrapping this median regression 200 times takes about one and a half minutes on a 66-MHz 486:

bsqreg prestige income educate, rep(200)

Median Regression, bootstrap(200) SEs           Number of obs =      45
  Raw sum of deviations   1249  (about 41)
  Min sum of deviations   415.9771              Pseudo R2     =  0.6670

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .7477064   .1426557     5.241   0.000      .4598156    1.035597
     educate |   .4587156   .1196385     3.834   0.000      .2172753    .7001558
       _cons |  -6.408257   3.457354    -1.854   0.071     -13.38548    .5689656
------------------------------------------------------------------------------

The median regression coefficients' bootstrap standard errors, based on a technique called data resampling (which does not impose the assumption of identical error distributions at all values of x), are larger than the theoretical estimates given earlier. Stata does not presently supply bootstrap options for regress and rreg, so these require a small amount of programming effort. Stata does, however, have a generic bootstrap command, bstrap. Unlike most Stata commands, bstrap looks for a user-defined program telling it exactly what bootstrapping task to perform. We therefore must begin by defining such a single-purpose program, which we shall name bootreg. bootreg performs robust regression with the Duncan data, and stores coefficients on income and educate as new variables named Bincome and Beduc. This program, typed using any text editor, is saved as an ASCII file named bootreg.ado:

program define bootreg
    if "`1'" == "?" {
        global S_1 "Bincome Beduc"
        exit
    }
    rreg prestige income educate, iterate(25)
    post `1' _b[income] _b[educate]
end

the bootstrap outcomes which this program will post or add into the new

data set (_b[income], _b[educate]).2 The internal mechanism

by

which bstrap calls this program, and by which the program’s results are passed back to bstrap, need not concern most users. Iterative reweighting methods such as rreg occasionally fail to converge, and instead bounce back and forth indefinitely between two or more sets of weights. In ordinary data analysis, the researcher will notice such failures and think about how to resolve them. Often, a small

change in the tuning constant suffices. In bootstrap or Monte Carlo simulations, however, failure to converge presents a nuisance that can hang

up the experiment. To short-circuit such trouble, analysts can limit the maximum

number of iterations by adding an iterate( ) option to the

rreg command, as illustrated in the bootreg program given earlier. This option limits the robust regression iterations, not the bootstrap. Two hundred bootstrap repetitions (i.e., 200 repetitions of the recently defined command bootreg) are now accomplished through the Stata command

bstrap:

bstrap bootreg,

Variable

| Obs

ete lean ae eee ee

Bincome | 200 Beduc

rep(200)

| 200

leave

Mean a

a

Std. Dev. a

Min ers, one

Max cs

ol

.8173336

.2184911

18314

1.353468

.4048997

.1659067

-.0927046

.8444723

The leave option instructs Stata to leave the new data set, with 200 bootstrap estimates of the coefficients on income and educate, in active

142

COMPUTING

ENVIRONMENTS

memory. That is, the original (Duncan) data have now been cleared

and replaced by a new data set of bootstrap coefficient estimates. These bootstrap estimates, which we analyze in the following section, could also be saved as a separate data file. 7.7,

KERNEL

DENSITY

ESTIMATION

Kernel density estimation, a method for approximating the density function from an empirical distribution, is implemented in Stata through ado-file kdensity.ado. This is flexible and easy to use. To get a plot similar to that at the top left in Figure 7.6, we need type no more than kdensity

Bincome

kdensity options include storing the kernel density estimates, controlling the number of points at which the density estimate is evaluated,

specifying the halfwidth of the kernel, and specifying the kernel function. Seven kernel functions are built in: biweight, cosine, Epanech-

nikov, Gaussian, Parzen, rectangular, and triangular.

The top left plot in Figure 7.6 used kdensity’s defaults. These include the Epanechnikov kernel function estimated at 50 points. Since we did not specify the halfwidth, Stata supplied a value for us. This value is stored as a macro named S_3, which we can ask to see using a display command: display

$S$_3

-08527289

Macros are names that can stand for strings, program-defined results, or user-defined values. $S_3 denotes the contents or definition

of global macro S_3. Stata’s default halfwidths, in this example .085, would be optimal if a Gaussian kernel were applied to Gaussian data. When the data contain influential observations, bootstrapping often yields a bi- or multimodal distribution of estimates. Such is the case with Bincome, graphed in Figure 7.6. But kdensity’s default halfwidth seems to over smooth Bincome and conceal its bimodality. The top right and lower left plots in Figure 7.6, using narrower halfwidth s

AQtsusg ajtsuag

ha

desysjoog

uorssaiZay

iH

Aysueqsyo[g yo

TeMTE“ZUR

"ADA

moj Jeulay suolpun

BJPWTIES



ainBryq ‘9°Z Jeurley

SLOIUtE

Spee

rs

TSUU8>

S}BWTAISZ

SUYOSUBdS

oz0'0

euoourg

ae

BWOIUTE

ore

‘s}yusHyJe0D Buls—y

TBUUB>

SWOOUTg

ees

YIPTAITEY = SBO'D

Taudsay

= URDTMETEY

‘ACATUYSeUNdy YIPRARTAY =

[Tauday

‘AONTUUOSUAdS UPtAeTeY = GOO

QhO'O

Aqysuan

143

Jworayiq

S3EWT3635

a3ewt3s35

SYIPIMJTRH 10

COMPUTING

144

ENVIRONMENTS

of .04 and .02, more clearly reveal the shape of this distribution. The bottom

right plot, based on a biweight kernel with halfwidth

Review

window,

.085,

appears similar to the Epanechnikov kernel with halfwidth .04. Fox (1990) suggests that data analysts adjust the halfwidth experimentally until satisfied with the balance of smoothness and detail. Stata’s from which

we

can

recall and edit previously is-

sued commands, makes such experimentation easy. For example, a plot similar to the one at top right in Figure 7.6 requires an option setting the halfwidth as .04: kdensity Bincome,

w(.04)

To get the next plot (halfwidth .02), we just recall this command and replace the 4 with a 2. Figure 7.6 illustrates Stata’s ability to combine multiple graphs into one image. 7.8

PROGRAMMING

IN STATA

Stata commands and programming are closely interconnected. Many seemingly intrinsic commands (such as rreg and kdensity) turn out to be programs written in Stata’s proprietary language. In preceding sections, we saw the program bstrap call program bootreg, which in turn called rreg, which itself called the intrinsic hardcoded Stata command

regress. These linkages are fast and transparent to a data analyst who just performs bootstrapping, but they illustrate how Stata programmers can use other programs as components in ones they wish to design. Stata programming is facilitated by the manner in which Stata procedures routinely store their results in macros. For example, we have seen that regress temporarily stores regression coefficients as -b[varname], values

and standard errors as _se[varname]. It also stores such as the number of observations, _result(1); the residual sum

of squares, _result(4); the root mean

squared

error,

_result(9), and

so forth. These are all program-defined macro results. User-written programs can likewise store output as macros, if desired. One may also create macros to store temporary information within a program

routine. Local macros create variables that exist only within the pro-

gram defining them; global macros define variables that can be used

DATA ANALYSIS

USING STATA

145

between programs and called subroutines. Stata manuals provide details on creating and using macros. Programs are designed to meet a particular need, whether for a single statistic not supplied with the usual Stata output, or for a com-

plete new statistical procedure. Stata’s matrix algebra capabilities allow building models nearly from scratch. For instance, a user could write his or her own linear regression routine in matrix form. Another important feature is the built-in maximum likelihood algorithm, which offers choices of various optimization methods and the capability to build multiple-equation models and models employing ancillary parameters. The available tools permit one to create nearly any statistical model imaginable, or to accomplish a wide variety of datamanagement tasks. Many of the programs included with Stata today originated with users who wished to expand the package to meet their individual needs. Such users typically submit their work to either the Stata Technical Bulletin, a journal dedicated to Stata programming, or to StataList, an Internet listserver used by Stata users and Stata Corporation developers alike to ask questions, submit programs, disseminate information,

and discuss

Stata-related

material.

These two vehicles

have resulted in the remarkable growth of the Stata package over the past several years. The next section shows how to write a Stata program that calculates the trimmed means of one or more variables. This example, though quite simple, should provide some feel for the ease and power of Stata programming.

7.9

A PROGRAMMING

EXAMPLE:

TRIMMED

MEANS

The “10% trimmed mean” is the mean of a variable following the elimination of observations beyond the 10th and 90th percentiles. Trimmed means provide efficient estimates of center for symmetrical but heavytailed distributions, since, unlike ordinary means, they are unaffected

by the most extreme values. Stata does not automatically supply a trimmed mean, but we can easily write a program for this purpose. The intrinsic command summarize varname, detail calculates detailed summary statistics, displaying them on screen but also storing some

COMPUTING

146

as macros

ENVIRONMENTS

named

_result().

Our

example

will use

three

of these

macros:

_result(3) = mean _result(8) = 10th percentile

_result(12) = 90th percentile The trimmed mean of varname can be found in two steps: summarize once to obtain 10th and 90th percentiles, then summarize a second time to get the mean of values within this range. The necessary commands

are summarize

varname,

summarize

varname

detail

if varname

> _result(8)

& varname

< _result(12)

After the second command, the trimmed mean will be displayed on screen and also unobtrusively be stored as the local macro _result(3).

The next step is building a more general trimmed-mean program. The following commands,

typed into a file named

tmean.ado, define a

program called tmean that performs the calculations and displays their outcome: program define tmiean version 4.0 quietly summ ‘1’,

quietly display

detail summ ‘1% if ‘1’ > _result(8) "Trimmed mean =" _result(3)

& ‘1’ < _result(12)

end

The first line of code tells Stata that called tmean. Once this program has been able as a Stata command just like any vides information regarding the version program—in

we are defining a program defined, tmean becomes availother. The second line proof Stata required to run the

this case, version 4.0 or higher.

When it executes a program, Stata automatically stores any argu-

ments (such as variable names) that we typed on the command line after the program’s name as local macros named 1, 2, and so on. Left

DATA ANALYSIS

USING STATA

147

and right single quotes around a local macro name cause the substitution of that macro’s contents for the name (as we saw earlier, a $ sign performs this function with global macros). Thus the line quietly

summ

‘1’, detail

tells this program to quietly (without screen

output) summarize the first variable listed on the command line following the tmean command. Internal Stata commands may be shortened to the fewest characters needed to uniquely identify that command. For example, summarize may be shortened to sumn. The fourth line uses the results produced in line 3. It summarizes the same variable, but this time constraining analysis to only those values between the 10th and the 90th percentiles. Stata automatically defines new macros to store the most recent command results. The fifth line simply displays the resulting mean after printing the words “Trimmed mean =”. The newly defined command tmean, applied to Duncan’s education variable, gives us: tmean

educate Trimmed mean = 52.55882

This works, but it is not very refined. We might enhance our program to allow for calculation on a list of one or more variables, as well as provide for a nicer display. We can also save the results as user-defined global macros, usable by other programs. Lines beginning with * are comments, which we could embed anywhere to explain what a program is doing. With these enhancements, the program becomes *lversion 1.0: 11-8-95, Program * one or more variables. program define tmean version 4.0 local varlist "req ex"

parse parse

to calculate

the trimmed

mean

of

WwrX*n70

“*varlist*, parse(™") iv=—L display _n in green -col(1) "Variable" _col(20) "Trimmed Mean" display in green -col({i) *=---+---- " _col(20) "------------ : locale

while

Seed

quietly quietly

l=

summ summ

wu

{

‘1’, detail ‘1’ if ‘1’ > tresult(8)

&)°17"
, the user can enter expressions that are then evaluated, with the results printed to the screen: >1+3

The user can save the result of an expression by assigning it to a named variable: > some.data

10 * some.data

40 > some.data 4

Unless specifically deleted or reassigned, objects are always available for further operations. The persistence of objects—even across computing sessions—is an unusual feature of S.

DATA ANALYSIS

USING S-PLUS

155

The output of a function can also be assigned to an object: > sqrt.out

sqrt.out

10

Arguments to functions can be expressions, objects, or the output of other functions: > Sgrt(23 + 2) 5 > sqrt(some.data) 2 > sqrt(sqrt(some.data) ) 1.414214

The S-plus interpreter processes commands line by line. In typical use, results from expressions are saved for input into subsequent expressions. Data analysis in S-plus often proceeds as a series of gradual steps, enabling the user to check the results along the way. For example, to begin an analysis of Duncan’s occupational prestige data, we could calculate some summary

statistics:!

> Income inc.sumry inc.sumry Min. 1st Qu.

7 21 42 41.87

The mean

Median

Mean

3rd Qu.

Max.

64 81

and the median are close to each other, and the first and

third quartiles are roughly evenly spaced from the median, indicating that the distribution is relatively symmetrical.

156

8.3

COMPUTING

ENVIRONMENTS

EXTENSIBILITY

For many users, the large number of preprogrammed functions renders programming unnecessary. Getting the most out of S-plus, however, requires users to program their own functions. Programming basics are introduced here, with more complicated examples later. Continuing the univariate analysis of the Duncan data, one might want to compute the standard deviations of the variables. Although S-plus has a variance function, it does not have a function to compute standard deviations directly: > variance.edu

sqrt (variance. edu) 29.76083

Even more simply, > sqrt (var(Education) ) 29.76083

These steps can be repeated each time a standard deviation is needed, or a permanent reusable standard deviation function can be created:? sdev Income[3]

# Income of the third 75 > Income[1:10] # Income for first 62 72 75 55 64 21 64 80 67 72

Occupation ten Occupations

158

ENVIRONMENTS

COMPUTING

> some.data

Income[some.data] subscript

an object

as a

55

The colon is the sequence operator, and so the expression 1:10 produces a vector subscript consisting of the integers from 1 to 10. One frequent use for subscripting is to recode data. Suppose, for example, that the value for the 43rd element of Education was miscoded as 200 rather than 20. Then, to fix the error,

> Education[43]

duncan.mat

duncan.mat[1:10,

“Income” ]

[ye 72 Tiley Fay Moyel 27 Levels t8t0) ows yi

Here “Income” is surrounded by quotes to differentiate the column labeled Income from the object named Income. The ability to assign names to the dimensions of an array (discussed further later) is a unique and convenient feature of S. Data frames and lists are more complex data objects supported by S-plus. Data frames are like matrices but allow both numeric and nonnumeric vectors as columns. Data frames were introduced in New S, but not all functions have been updated to handle them. To use data stored as a data frame in an older function requires the transforming

DATA

ANALYSIS

USING

S-PLUS

159

function as.matrix to “coerce” the data into a matrix: > duncan. frame

is.data.frame(duncan. frame) if

> 1s.matrix(as.matrix (duncan. frame) ) iL

The commands is.data.frame and is.matrix are functions to test the type of an object. Lists are the most general objects in S in that their elements can be completely heterogeneous. For example, a single list could contain a character vector, three matrices, a data frame, and another list.

Because of their generality, statistical procedures often return their output as lists. Elements of lists are referenced by name (following a $) or by position (within double brackets). Suppose, for example, that the coefficients of a regression are stored in the first position of the list regression.object under the name coef; the coefficients can be referenced as regression.object$coef or as regression.object[[1]].

8.5

INTERACTIVE

AND

DYNAMIC

GRAPHICS

The ability to produce customized, publication-quality graphs is one of the great strengths of S-plus. Although there is no magic formula for eliminating the tedious iterations it takes to get a graph to look just right, one happy result of doing that work within S-plus is that as the data change, a new graph can be generated simply by rerunning the same set of commands. Graphics commands to the interpreter produce immediately visible effects in a display window. The user can therefore continually check results interactively while building up a figure. For example, to examine the two-way relationship between Income and Prestige in the Duncan data: > plot(Income,

Prestige)

The simple scatterplot produced by this command is shown in the upper left panel of Figure 8.1.° By default, S-plus labels the axes with

160

COMPUTING

ENVIRONMENTS

the arguments to plot. Other plotting defaults are context sensitive. For example, the text and point sizes are automatically scaled down for the quarter-panel figure. The scatterplot has an elliptical pattern with a few anomalous points. Having previously created a character vector called Occupations, the identify command

allows the user to label points on

the figure using a mouse: > identify(Income,

Prestige,

Occupations)

These labeled occupations are shown in the upper right panel of Figure 8.1. Ministers have high prestige but low income, and the two railroad occupations have among the highest incomes but only midlevel prestige. The labels placed on the figure are an example of an overlay; identify does not create a new plot. In the lower left panel, a smooth curve is calculated by the lowess function and overlaid on the scatterplot by the lines function: > lines(lowess(Income,

Prestige) )

Note that the lowess curve is approximately linear. Other kinds of overlays include regression lines, text, or even other plots. The lower right panel of the figure is completed by calling the title function > title(“Scatterplot of prestige against income + with anomalous points highlighted and

lowess

line”)

S-plus also has the ability to explore higher order relationships through dynamically linked plots: > brush(duncan.mat,

hist=T)

The resulting plot, shown in Figure 8.2, contains several different sec-

tions. On the upper left are the three pairwise scatterplots. Underneath

@Wodu| OV

02

08

JOONpUCD WY

08

09

09

awoou|

Ov

8woodu|

Ov

02

e

30)8|UjW

02

snf{g-s ut ydersy e Jo uoy -ejouuy pure juaurdojaaaq aatsserZ01g ay} SuULMOYS ‘eUTODUT JsUTeBSY a8yse1q JO s}o[d1a}RdS ‘“T'g aN31y

09

°

10)S}UjW

02

02 Ov

08

10}9NpUuoD WY

Ov

oul] $seMoj e pue payybyyBjy syujod snojewoue UyJM eWOdU! JsujeBe eByseid jo JojdseneoS

09

eBbyseld

ebyseid

08

Jo}ONpuoD YH

e

Je)s;ujwW

ool

oor

eBbnseld eByseid

09 os ool oF os 08

oor

@Woou}

162 Graph

COMPUTING

ENVIRONMENTS Help =

Options

RR conductor

ony

ec

Income

Education

Figure 8.2. Brush Plot of Prestige, Education, and Income

the scatterplots are histograms of all three variables, obtained by specifying the option hist=T. (Optional arguments to a function are usually given with an equals sign.) To the right of the top scatterplot is the three-dimensional point cloud. The user can rotate and spin the point cloud to investigate the data. All of these subplots are dynamically linked; with the mouse, the user can select a point in any part of the display and it will be highlighted in all of the plots. The effect of highlighting an observation on the histograms is to remove the observation from the distributions. We adjusted the point cloud until three points clearly stood out. The large highlighted block represents ministers. As seen in the earlier scatterplot, ministers’ prestige is not in line with their income. It is clear here that ministers’ income is also low for their education, but

theiz prestige is commensurate with their education. The two railroad occupations have much higher income than expected based on their education.

DATA ANALYSIS

8.6

OBJECT-ORIENTED

FEATURES

USING S-PLUS

163

OF S-PLUS

It may at first appear that using S-plus is confusing because of the number of different kinds of objects (e.g., arrays, functions, lists, data frames)

and

because

of the several

modes

of data

(e.g., numeric,

character). Part of the difficulty stems from the ongoing development of S-plus toward a more object-oriented language combined with the relative unfamiliarity of terms associated with this style of programming (e.g., classes, methods). In practice, however, the increasing number of different kinds of objects makes S-plus easier to use. S-plus determines an object’s relevant characteristics (such as mode and class) without the direct intervention of the user. Data ob-

jects carry sets of attributes—which normally remain invisible to the user—describing their structure and providing additional information. For example, matrices have a dimension attribute, which itself is

a vector of length two, and an optional “dimnames” attribute holding row and/or column labels. The attributes of objects in S-plus make them self-describing. S-plus supports true object-oriented programming in which the “method” by which a generic function processes an object is determined by the class of the object. For example, when the generic summary function is called with an argument of class Im (a linear model

object), the specific method

function

summary.1m is automati-

cally invoked to process the argument. Many other classes of objects also have specific summary methods. Moreover, a user can create a new object class along with methods specific to that class. If no specific method is available for a particular class, then a method may be “inherited” from another class as specified in the class attribute of the object being processed.

8.7

REGRESSION

ANALYSIS

OF DUNCAN'S

DATA

New S introduced a modeling language to facilitate the development of statistical models. Use of this language is restricted to the arguments of certain modeling functions. The syntax is straightforward: The tilde symbol connects the left- and right-hand sides of an equation and means “is a function of.” A linear regression model such as g = f(u, v, w) = bu + bv + bzw is written as g ~ u + Vv + W. Variables

164

COMPUTING

ENVIRONMENTS

may be either preexisting S-plus objects or valid S-plus expressions, but they must be either factor, categorical, or numeric objects. The modeling language is quite flexible and is capable of expressing models with dummy regressors, interactions, nesting, and so on. An OLS regression for the Duncan data is produced by the 1m (linear model) function:

> regl.out summary (regl.out)

Call:

Im(formula

= Prestige

~

Residuals: Min 1Q Median -29.54 -6.417 0.6546 Coefficients:

Value Std. (Intercept) Income Education

-6.0647 0.5987 0.5458

Income

+ Education)

3Q

Max

6.605

34.64

Error

t-value

Pr(>|t|)

4.2719 0.1197 0.0983

-1.4197 5.0033 Boo o4

0.1631 0.0000 0.0000

Residual standard error: 13.37 on 42 degrees freedom Muitiple R-Squared: 0.8282 F-statistic: 101.2 on 2 and 42 degrees of freedom, the p-value is 1.1lle-016 Correlation of Coefficients:

(Intercept) Income Education

-0.2970 -0.3591

Income -0.7245

of

DATA ANALYSIS

USING S-PLUS

165

The results indicate positive partial relationships of occupational prestige to income and education, but the graphical analysis of the previous section indicated the possibility of outliers in the data. A check on the analysis is provided by the command > regl.diagnostics

Ims.out reg2.out reg2.out$coef (Intercept) Income Education -7.241423

0.8772736

0.3538098

The effect is to make the income coefficient much larger and the education coefficient much smaller. Weights produced by LMS regression are binary, and the cases with zero weights can be found by subscript-

ing: > Occupations[Ims.out$wt == 0] “minister” “reporter” “conductor”

There are three M-estimation options in S-plus. The easiest to use is the robust feature of the glm function. This function fits generalized

166

COMPUTING

ENVIRONMENTS

linear models (including linear models with Gaussian errors, linear logit models, probit models, log-linear models, and so on); glm uses

the same modeling language as does Im. The robust option of glm uses the Huber weight function with a tuning constant set for 95% efficiency when the errors are normally distributed. With a bit of work, either the weight function or the tuning constant can be changed. The command is > reg3.out reg3.out$coef (Intercept) Income Education -7.145792 0.699746

0.4870792

The income coefficient is larger and the education coefficient is smaller than those in the original OLS analysis, but the changes are not as great as those for LMS. Again, we want to see which cases are downweighted the most. Because M-estimation produces continuous weights, a plot will be most useful. Figure 8.3 shows the results of plotting the weights against observation indexes; observations having weights less than one were labeled with the identify command. The three points assigned zero weight by the LMS regression also have the lowest weights here, but the weight accorded to railroad conductors is not very different from those for a number of other occupations. Notice that railroad engineers are weighted to one.

8.8

ADVANCED

PROGRAMMING

EXAMPLES

BOOTSTRAPPING

The small sample size of the Duncan data set makes it unwise to rely on the asymptotic normality and asymptotic standard errors of the robust regression coefficients. Bootstrapping, then, is a natural way to generate coefficient standard errors and to construct confidence intervals. S-plus does not include a preprogrammed bootstrap function, but several members of the user community have filled this gap by providing general-purpose bootstrap functions through StatLib. For

DATA ANALYSIS

USING S-PLUS

167

* coal miner

* carpenter

*

owner ofa factory employing 100

-

streetcar motorman

» mall carrler

Weights clerk ina store

RR conductor

°

« + building contractor

* trained machinist

+ Insurance agent

* reporter

* minister

Index

Figure 8.3. Weights From M-Estimation of the Regression of Occupational Prestige on Income and Education

example, S-plus implementations of many of the programs in Efron and Tibshirani (1993) are located there.

Writing code for a specific bootstrap problem is often as easy as using a general function. To bootstrap the robust coefficients,

> rob.coefs rob.inc.dens

Leol

anyOnyj—d

“SISSI-1L

PES HLIM

40 SISA TYNY

jOPOL AOAAZ | =PoL aDuNnesS

sisATeuY UOIsseIsay oy} Jo Woday ‘OTTL anstq

-S3NHIYHA

Fo 289Er Od ° 90S2 S608) Gt saupnbs-jo-wns

Ta00HW

(wopaaa4, yo saaubag [Sespo jo swaquniy

IS31 =P Cp

y

"¢HdOUde SHH) POY owBis ‘paupnbs Yy paysn!Py

paupnbs

ef" z3"

‘Lid 40 AYBWWNS

UO! POSS SWOIuU |

eon

gs 'o o9°O

We]

yUpYSUoy

BpPOwlzpSsy

u04

YalaWbdbd

sjdizjny

1pad sop asuodsay - |SPOW

asuodsay

SALKWILSS

uoissaubey

:Sa|quiaby (S38;qbolupp)

Ss]qoiupy

90° 9-

LSHAT>

aBbi,seaug

¢SSYWNOS

sisfjpouy

CUOlPOOINPA Bwosu,) (36135844) 361 }S844q97-934y

Hioday 84S!

223

224

= STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

values, fit values, residuals (raw, studentized, and externally studen-

tized residuals), leverages, and Cook’s distances. While the fit is very significant, we keep in mind that it may be due to the outliers we noted in the data visualization. Thus, we turn to a visualization of the

model to see if this is the case. The Visualize Model item produces the visualization spreadplot shown in Figure 11.11. This spreadplot has six windows. Of these, four are specialized scatterplots and two are lists of variable names and observation

labels. The OLS Regression

scatterplot plots the ob-

served response values against the predicted values, along with the superimposed regression line. It shows the overall regression, as well as residuals from the regression (shown indirectly as deviations perpendicular to the line). The residuals are directly shown in the Residuals1 plot, which plots fit values against residuals. Ideally, the Linear Regression point cloud should be narrow

(showing strong re-

gression), and both the regression and residuals point clouds should be linear and without outliers. The two Influence plots are designed to reveal outliers. They will appear as points that are separated from the main point cloud. These two plots reveal four points that may be outliers. They have been selected and labeled. Since the plots are linked, the selected points are shown in all windows, with labels. We see that these four points

include the three points that looked like outliers in the raw data (Figure 11.8) plus the “Reporter” job. Looking back at the raw data, we see that reporters have a rather high education level, but only average income and prestige. Spinning the data’s spinplot also reveals that the reporter job is on the edge of the main swarm of points.

ROBUST REGRESSION

We now turn to robust multiple regression. First, we added the code presented by Tierney (in this volume) to ViSta’s regression module, as discussed in Section 11.5. Then we performed the analysis by clicking on the Iterate button located in the Regression plot. Clicking there presents a choice of two iterative regression methods, one for robust regression, the other for monotone regression. Choosing the robust option, and specifying ten iterations, produces the visualization shown

oO

7?

oo

ie

On

Re

&

= Ee

4

@F

Be $10

ee

Bb

olyedo7 = (]

Be

68

*

igi

861 88 89 payiipald aBiysaid

YY Cs sls

Le ; on

a

fs

(=

pre

to

2 iy ae

SS Jo}INPUOD

| 8@

SIUJa}

Stku-h

afilnsald

89

.

m

at J “5 g Jaysiut =] = nS

z

a gon

yoidpeaids uorssei8ay-Ieeur] “TL IL aan3ry

uo

Ep ©

2

a

mb

as

CF]

uv =

BS eo

os o i2

a4 z

ae

Be

* a

>

ayelay]

a

2B oc Ob

$10

Ge

Bo 8

Gb

98

7

re

a6

Jo};INPUGD

yaqsaayipta

98

661

Jaysiul

Weldsfiud

|ooyIs

JayIFal

ajes|3m JanJom

Jadaax7009

Jabeuey

a

8/3 yy

OUIYIELY +S

UPI} JaauiGug

Ja}uadied

JAdED juaby aqueinsu sayode@’,3104s HLalD

[BL

Jaxueg

asoys

JopINpues yy Bpig Jo,9e1};u09 fil}oIe4 Jaumo

soyanpuoy

Jaauibug

payIpasd afiingasig

49

yy?

yy

SIHH-A

9 0

*

@9

uy

YY

Jays!Ulld

paygipald a6iysaid

*

:

8iheSL a0Gh

uolyEIOT [9

$10

*

Jafimey JayxeyJapun Jaauigug pints

+$1}U30

J0$$as0Jd

uone07GQ

ON +$!/9N

|

uoissaibay

Jaystuitd

+S$1WWAYI

$lxu-A [] = Wol}BI07 CF]

| aduanjul

$10

SUOIJeMIasgg

aWogu) uolpeanpy

Sajqeueg 10)3Ipaid

229

5

a

cr

E

226

STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

in Figure 11.12. This visualization has the same plots as for the linear regression, plus a Robust Weights plot. This plot shows the iterative history of the weights assigned to each observation. First, we see that the estimated weight values have stabilized. Second, we see that three

of the outliers we have identified (Minister, Reporter, and RR Conduc-

tor) have been given low weight. In addition, the Cook‘s distances in the Influence 1 plot show that RR Conductor has the highest Cook’s distance, and the leverages in the Influence 2 plot show that RR Engineer is the only observation with high leverage. Noting, however, that the scale of the Influence 1 plot is much smaller than it was before the robust-regression iterations, we conclude that the only outlier remaining after the robust iterations is RR Engineer, which still has high leverage. There is slight, but unimportant, change in the residuals and regression plots. LINEAR REGRESSION WITH OUTLIERS REMOVED

On the basis of the robust regression, we removed the four apparent outliers from the data set. This was done by displaying the observation-labels window and removing the names of the outliers. We then used the Create Data item to create a new data set with these outliers removed. Finally, we analyzed these data by regression analysis. These steps can also be done by typing: (remove-selection (select-observations “("Minister" "Reporter" "RR Engineer" (create-data "JobSubset") (regression-analysis :response "Prestige" :predictors ’("Education" "Income"))

"RR Conductor") ))

Once this ordinary linear-regression analysis is completed, the workmap looks like Figure 11.13. The report of this analysis (which we do not present) shows that the squared correlation has increased to .90, showing that the fit in the original analysis was not due to the outliers. All fit tests remain significant. The visualization of the OLS analysis of this subset of the Duncan data is shown in Figure 11.14. First, we note that there are no unusual residuals or leverage values. Second, we see that there is one unusual Cook’s distance, for “Tram

UI Jaauibug

UOIZEI07

s}ybiam

ad

Jo}INpued

a1

yy

uolssaibay

payaipaid sald abn

al 88

ysnaoy

ab

89

SIKU-A

ae

CF]

@

oe

ab

UOIZEIO)

@9

abisaid

olye907[9

@&8

a+FJ3+1

aal

sjenpisay L

Fal

ysnqoy

saGesanay ysngoy

661 48 89 Bb Be @ a6iysasg payIipaid ¥$nqgoy

uotssar8ay-sngoy yojdpeaids

a $30UE}$I9 $4009 3$nqgoy

69 O $/ENpisay

BF Mey

82

8

Be

uolssaihay

aan31y "ZLIL

a

ysndou

Bb

ab

a

@9

89

yy

C]

aduanijuy Z

IHd-A

oo

uoseI07(|

@8

DaddIDald said Bi; |

Jaxueg Jadaayyoog Jajzuadies

Jaauibuqz

aGlysaid payIipaid ysnqoy

$1HH-A

3/9 UBMs}

yy JoyINpUED

aol

UOIZEIO] CL]

3104s

|!EL

G8

L 86 96 Fa co @B s}UuBblam uolpeniasaqa

oB

ysnqoy

[]

Jaded W131 aIUeInsu|

filo}2e4 jooygs Bpig aso

ied

aduanyjul |

}UaGy

JaydbalJoyIeJ}uoD Jaumo Jafieuey)

at ysnqoy

ey,

928

STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

vista WorkMap

REG-Jobsubset

Figure 11.13. Workmap After Analyzing the Subsetted Duncan Data

Motorman,” but in comparison to the linear-regression analysis of the entire data (Figure 11.11), the Cook’s distance value is relatively small (note the change in scale of the y-axis). Thus, we conclude that the subset of data probably does not contain any outlying observations.

MONOTONIC

REGRESSION

Having satisfied ourselves that the subsetted data no longer contains outliers, we

turn our investigation to the linearity of the rela-

tionship between the response and the predictors. This is done using the MORALS (multiple optimal regression using alternating least squares) technique proposed by Young, de Leeuw, and Takane (1976). MORALS is quite similar to the more recently proposed ACE (alternating conditional expectations) techniques proposed by Brieman and

661 68 69 Bb Be BG afiysaidg pazIipaid $10

Gb

861 Be $10

°

OUP

safesane7 $10

Jaqieg Japuaieg aous

peyesqns—jojdpeaids eyeq

69 68 661 payI!pald aflysald

IL Gb

$ldb-A

UEWIO}

uotssaiZey-1esury]

G@

661 aBiysaig

filozIe4 JaxJom YPUWIOOL-) WEIL ond Jang

aan81q “PETIT

68

$4009 $10

IREL

89

UO!}Eed07 C]

68

aduanyul Z

89

SEQ |203

payipaid a61}$add

Gb Iipasdpay

}SAUIYIELY Jed URWJIEday

GF

@

Be

Be $10

Jaulys }UNBIEysaY HOOD

suoleaiasgg UBIZEPS

$10

coe

UPWIOPOL)

16 C]

@

UO!IPEINPZ

aWOdU|

JaUlW Jang Jaquinid

ae

SI@

WEIL,

$9IUE}$!IO

slhb-A

co

sjenpisay |

al

aduanyuy |

Ps

68

°

aBiysald

69

WEIL

@

Gb

uose207—9 J s1KU-A UPWIOFOL)

UOIyed07] C]

Be

ae!

uoissaibay

10)91Ipaid Sajgeueg

Fy

@

uolse707

$10

‘Udtth

Bt Bois 2 $/ENPIsay MEY $10

229

230

= STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

Friedman (1985). As implemented in ViSta, MORALS monotonically transforms the response variable to maximize the linearity of its relationship to the fitted model. Specifically, it solves for the monotone transformation of the response variable and the linear combination of the predictor variables which maximize the value of the multiple correlation coefficient between the linear combination and the monotonic transformation. The monotonic regression is done by clicking on the linearregression plot’s Iterate button (see Figure 11.14), and specifying that we wish to perform ten iterations of the monotonic-regression

method. When these iterations are completed, the report (not shown) tells us that the squared correlation has increased (it cannot decrease) to .94, and that all fit tests remain significant. Thus, there was not much improvement in fit when we relaxed the linearity assumption. The visualization is shown in Figure 11.15. We see that the regression plot, which is of primary interest, has a new, nonlinear but monotonic line drawn on it. This is the least-squares monotonic-regression line. Comparing this to the least-squares linear-regression line (the straight line) allows us to judge whether there is any systematic curvilinearity in the optimal monotonic regression. We judge that there does not seem to be systematic curvilinearity. The visualization also contains a new

RSquare/Betas plot, which shows

of the value

of the squared

multiple

the behavior, over iterations, correlation

coefficient,

and

of

the two “beta” (regression) coefficients. This plot shows that we have

converged on stable estimates of the coefficients, and that the estimation process was well-behaved, both of which are desirable.

Finally,

we note that the two influence plots and the residuals plot have not changed notably from the linear results shown in Figure 11.14 (although there is some suggestion that “Plumber” is a bit unusual). All of these results lead us to conclude that the relationship between the linear combination of the predictors and the response is linear, as we had hoped. If this analysis revealed curvilinearity, we would proceed to apply an appropriate transformation to the response variable, and to repeat the above analyses using the transformed response. As a final step, we may wish to output and save the results of the above analysis. This is done with the Create Data item, which produces a dialog box that lets the user determine what information will be placed in new data objects. Figure 11.16 shows the workmap that

aGlysaig

Gt

papiipasd

6s

82 BG1G8 Boaree

uolssaibay

UPWWUIO}POLY

WEIL

oO J}Ela}|

auoyougy

6 Be-

6S|

sH-A

©

Be

Bl

Gjenpisay

Mey

uolssasGay

8

Gite

IL SUP

uolyeI07

UOIZEYS ‘Udd}h

fisozae4 Jaxjom

JaulLy |209 WEIL YEWIOFOL) WEL Jang Hon Jang

SBD

}SAUIYIELY UBWJIEday

UEIDI}I3819

Jajyuadies

9104S HII

UEWIOOM Jes PED

Jaquinid?

Be BF 89 B2lG8188 auo}yougLy payIpaig afilpsasd

UOI}EI07

Be-&

Bc188188

auoyouopy paydipasid aflysasd

Jaqieg

Sie}

uorssaiZay-oruojouoyy peyjesqns—jo[dpeaids uedunqd eyeq

safesana] auojyouoly

aan814q L “STE

6B

6@t

Sab

sl4d-a

auop,ouoLy payipaid aGlysaid

3UO}OUOL

8

$4009

@c-

SlI416

Be Gb 69

JaqQuinid

CF]

Be-&

°

661

WEIL

89 68 aflysaid

YPWIOVOLY

UONZEIO7

Ge

s1at-A

a)

aduanyuy |

yoise301

ind

auojouo;

uolye201

sej}ag/asenbs-y

9

Sebco

$9IUE}$IQ

cB $UOIZEIS}|

86 96 Pa sPyagsasenbs me

suoleniasgg

auoyoudp)

e-

231

232

STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

viSta WorkMap

Figure 11.16. Workmap After Creating Output Data Objects

results when the user chooses to create two data objects, one containing fitted values (scores) and the other containing regression coefficients. These data objects can serve as the focus of further analysis within ViSta, or their contents can be saved as data files for further

processing by other software.

11.5

ADDING

ROBUST

REGRESSION

TO ViSta

When we were asked to write this chapter, ViSta did not perform robust regression. It did, however, have a module

for univariate re-

viSta: A VISUAL STATISTICS SYSTEM

233

gression which would perform both linear and monotonic regression. Thus, we had to modify our code to add robust-regression capabilities to ViSta. The

code

regression

was

added

code, which

by taking Tierney’s he had

written

(this volume)

to demonstrate

how

robustrobust

regression could be added to Lisp-Stat, and modifying it so that it would serve to add robust regression to ViSta. The fundamental conversion step was to change Tierney’s robust-regression functions to become robust-regression methods for ViSta’s regression object (whose prototype is morals-proto). Thus, Tierney’s biweight and robust-weights functions become the following methods for our already-existing morals-proto object: (defmeth morals-proto

Cot

:biweight

(x)

(pmine(abs 2x) ht )82),),.2).)

(defmeth morals-proto :robust-weights (&optional (let* ((rr (send self :raw-residuals)) (s (/ (median (abs rr)) .6745)))

(send self

:biweight

(c 4.685))

(/ rr (* c s)))))

Comparing these methods to Tierney’s functions shows that we have only changed the first line of each piece of code. We made similar changes to the other functions presented by Tierney. Since ViSta is entirely based on object-oriented programming (see Young 1994; Young and Lubinsky 1995), this conversion added the new robust-regression

capability, while carrying along all of the other capabilities of the system (such as workmaps, guidemaps, menus, etc.). Of course, once Tierney’s code had been modified, we then had to

modify ViSta’s interface so that the user could use the new code. This involved modifying the action of the visualization’s Iterate button so that it would give the user the choice of robust or monotonic regression (originally, it only allowed monotonic-regression iterations). We also had to add another plot to the visualization, and modify dialog boxes, axis labels, window titles, and other details appropriately. While these modifications took the major portion of the time, the basic point remains. We could take advantage of the fact that ViSta uses Lisp-Stat’s object-oriented programming system to introduce a new analysis method by making it an option of an already-existing analysis

234

STATISTICAL COMPUTING

ENVIRONMENTS

FOR SOCIAL RESEARCH

object, and we did not have to recode major portions of our software.

Furthermore, we did not have to jump outside of the ViSta/Lisp-Stat system to do the programming. No knowledge of another programming language was required. Finally, since ViSta is an open system, statistical programmers other than the developers can also take advantage of its object-oriented nature in the same way.

11.6

CONCLUSION

ViSta is a visual statistics system designed for an audience of data analysts ranging from novices and students, through proficient data analysts, to expert data-analysts and statistical programmers. It has several seamlessly integrated data-analysis environments to meet the needs of this wide range of users, from guidemaps for novices, through workmaps and menu systems for the more competent, on through command lines and scripts for the most proficient, to a guidemap-authoring system for statistical experts and a full-blown programming language for programmers. As the capability of computers continues to increase and their price continues to decrease, the audience for complex software systems such as data-analysis systems will become wider and more naive. Thus, it is imperative that these systems be designed to guide users who need the guidance, while at the same time be able to provide full dataanalysis and statistical programming power. As we stated at the outset, our guiding principle is that data analyses performed in an environment that visually guides and structures the analysis will be more productive, accurate, accessible, and satisfying than data analyses performed in an environment without such visual aids, especially for novices. However,

we understand

that vi-

sualization techniques are not useful for everyone all of the time, regardless of their sophistication. Thus, all visualization techniques are optional, and can be dispensed with or reinstated at any time. In addition, standard

nonvisual

data-analysis methods

are available. This

combination means that ViSta provides a visual environment analysis without sacrificing the strengths of those standard system features that have proven useful over the years. We that it may be true that a single picture is worth a thousand

for datastatistical recognize numbers,

Vista:

A VISUAL STATISTICS SYSTEM

235

but that this is not true for everyone all the time. In any case, pictures and numbers give the most complete understanding of data. REFERENCES Bann, C. M. (1996a). “Statistical Visualization Techniques for Monotonic Robust Multiple Regression.” MA Thesis, Psychometrics Laboratory, University of North Carolina, Chapel Hill, NC. Bann, C. M. (1996b), “ViSta Regress: Univariate Regression with ViSta, the Visual Statis-

tics System.” L. L. Thurstone Psychometric Laboratory Research Memorandum (in preparation). Brieman, L., & Friedman, J. H. (1985). “Estimating Optimal Transformations for Multiple Regression and Correlation.” Journal of the Amerian

Statistical Association, 77,

580-619. Duncan, O. D. (1961). “A Socioeconomic Index of All Occupations.” In A. J. Reiss, O. D. Duncan, P. K. Hatt, & C. C. North (Eds.), Occupations and Social Status. (pp. 109-138). New York: Free Press.

Faldowski, R. A. (1995). “Visual Component Analysis.” Ph.D. Dissertation, Psychometrics Laboratory, University of North Carolina, Chapel Hill, NC.

Lee, B-L. (1994). “ViSta Corresp: Correspondence Analysis With ViSta, the Visual Statis-

tics System.” Research Memorandum 94-3, L. L. Thurstone Psychometric Laboratory, University of North Carolina, Chapel Hill, NC.

McFarlane, M. & Young, F. W. (1994). “Graphical Sensitivity Analysis for Multidimensional Scaling.” Journal of Computational and Graphical Statistics, 3, 23-34. Tierney, L. (1990). Lisp-Stat: An Object-Oriented Environment for Statistical Computing & Dynamic Graphics. New York, Wiley. Young, F. W. (1994). “ViSta—The Visual Statistics System: Chapter 1—Overview; Chapter 2—Tutorial.” Research Memorandum 94-1, L. L. Thurstone Psychometric Laboratory, University of North Carolina, Chapel Hill, NC.

Young, F. W. & Lubinsky, D. J. (1995). “Guiding Data Analysis With Visual Statistical Strategies.” Journal of Computational and Graphical Statistics, 4(4), 229-250. Young, F. W., Faldowski,

R. A., & McFarlane,

M. M. (1993). “Multivariate

Statistical

Visualization.” In C. R. Rao (Ed.), Handbook of Statistics, Volume 9 (pp. 959-998).

Amsterdam: North Holland. Young, F. W., & Smith, J. B. (1991). “Towards a Structured Data Analysis Environment:

A Cognition-Based Design.” In A. Buja, & P. A. Tukey, (Eds.), Computing and Graphics in Statistics, Volume 36 (pp. 253-279). New York: Springer-Verlag.

Young, F. W., de Leeuw, J., & Takane, Y. (1976). “Multiple and Canonical Regression With a Mix of Quantitative and Qualitative Variables: An Alternating Least

Squares Method With Optimal Scaling Features.” Psychometrika, 41, 505-530.

iM

eer



:

zh

aPepaeeru>,

38

=@"

plan

oy

ua” pn"

4c

deg

me

(inte!

458

4

;

:

uyAny

ed

ewehtah))

ene

:

=

eg nt

tbr

gil

es

tO

q ag

a

“rie

vd ip G

Wan

7?

oy et

at Ss dm

|

etaleven. ‘iPins

ae p

\fge

7

é ,

t=

*

Le

oe e

a Bei oa

Te

eo

é

a

of

“ot

ere

=